Research on the Classification Ability of Deep Belief Networks on Small and Medium Datasets

Abstract Recent theoretical advances in the learning of deep artificial neural networks have made it possible to overcome the vanishing gradient problem. This limitation is overcome by a pre-training step, in which deep belief networks formed by stacked Restricted Boltzmann Machines perform unsupervised learning. Once pre-training is done, the network weights are fine-tuned using regular error back propagation, treating the network as a feed-forward net. In this paper we compare the described approach with commonly used classification approaches on some well-known classification data sets from the UCI repository as well as on one mid-sized proprietary data set.


Theoretical foundations for learning deep belief networks (DBNs) were laid down by Geoffrey Hinton; see, for example, [1]. Bengio gives a thorough overview of deep architectures in general; see [2][3]. DBNs are formed by stacked Restricted Boltzmann Machines (RBMs). Recently DBNs, RBMs and other deep architectures have been successfully applied to a wide range of classification tasks, outperforming other approaches [4], [6][7][8]. In [9] there is evidence that adding more layers helps in recognition/classification tasks. However, [5] showed that DBNs were outperformed on some classification tasks. This paper compares the classification performance of RBMs and DBNs against well-known classifiers such as SVMs and Random Forests on some well-known small UCI [10] classification data sets as well as on a single mid-sized proprietary document classification data set. The paper is structured as follows: Section 2 provides the theoretical background for RBMs and DBNs and describes the pre-training procedure for feed-forward error back propagation artificial neural networks. Sections 3 and 4 describe the experimental setup and present the experimental results, while Section 5 concludes the paper.

A. Restricted Boltzmann Machine
RBMs are stochastic generative neural networks that can learn a probability distribution over their input vectors. The main consequence of this definition is that such a network learns p(data) instead of p(label | data); these models are essentially modelling data, not labels. This allows us to deal with unlabelled or partially labelled data. An RBM can be represented as a bipartite graph with two sets of neurons, visible and hidden (v, h); refer to Fig. 1. Neurons in the two layers are symmetrically connected. RBMs are Energy-based Models (EBM) [28] that associate a scalar energy with each configuration, so the overall network state can be represented as follows:

$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij} \tag{1}$$

where $v_i$, $h_j$ are the binary states of visible unit i and hidden unit j, $a_i$, $b_j$ are their biases and $w_{ij}$ is the weight between them. Because an RBM generally consists of stochastic binary units, whose states are set with probabilities determined by the weights, shaping the energy function makes it possible to obtain more plausible probability distributions for the network neurons. This means that the RBM learns the distribution of (v, h); in other words, the probability of a joint configuration over both hidden and visible units depends on the energy of that joint configuration compared to the energy of all other joint configurations:

$$p(v, h) = \frac{1}{Z} e^{-E(v, h)} \tag{2}$$

Here Z is the partition function, a sum over all possible configurations of visible and hidden units:

$$Z = \sum_{v, h} e^{-E(v, h)} \tag{3}$$

The network assigns a probability to a visible vector v by summing over all hidden vectors:

$$p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)} \tag{4}$$

Thus, to raise the probability of a visible training vector we need to adjust the weights and biases so as to lower the energy of that training vector and raise the energy of other vectors (especially those that have low energy). According to [14], the derivative of the log probability of a visible training vector with respect to a weight is as follows:

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \tag{5}$$

where the angle brackets denote expectations under the data or model distributions. We can notice that (1) can be translated into the free energy formula:

$$F(v) = -\sum_i a_i v_i - \sum_j \log \sum_{h_j} e^{h_j \left(b_j + \sum_i v_i w_{ij}\right)} \tag{6}$$

Please refer to [15] for how the translation from (1) to (6) is done. Since we deal with stochastic binary neurons, (6) can be simplified further:

$$F(v) = -\sum_i a_i v_i - \sum_j \log\left(1 + e^{b_j + \sum_i v_i w_{ij}}\right) \tag{7}$$

Free energy is omitted from the formulas that follow, but it will be reused for Conditional Restricted Boltzmann Machines. Equation (5) can be rewritten as a learning rule with learning rate $\varepsilon$:

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right) \tag{8}$$

Because there are no connections between neurons within a layer, it is relatively easy to get the expectations under the data distribution:

$$p(h_j = 1 \mid v) = \mathrm{sigm}\left(b_j + \sum_i v_i w_{ij}\right) \tag{9}$$

where sigm is the sigmoid function, $\mathrm{sigm}(x) = 1 / (1 + e^{-x})$. Similarly, for visible units:

$$p(v_i = 1 \mid h) = \mathrm{sigm}\left(a_i + \sum_j h_j w_{ij}\right) \tag{10}$$

In (9) and (10), a and b are biases, v and h are visible and hidden unit states, respectively, and $w_{ij}$ is their associated weight. Thus, we assign 1 or 0 to hidden or visible neurons with a defined probability. It is much more difficult to get the expectations under the model distribution, but in 2002 G. Hinton discovered [29] an elegant solution to this problem. Instead of:

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right) \tag{11}$$

Hinton proposed to use:

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right) \tag{12}$$

where the expectation under the reconstruction distribution can be calculated by the Contrastive Divergence (CD) algorithm [14]; Fig. 2 depicts a single step of the CD algorithm. As highlighted in (6) and (7), an RBM uses stochastic binary units (there are real-valued extensions). One step of CD contains two phases, a positive and a negative one. In the positive phase, one clamps a training vector on the visible layer and calculates new states of the hidden neurons using (9). In the negative phase, one calculates new states of the visible units using (10). This new state of the visible layer can be thought of as a "fantasy". CD with a single such step is referred to as CD1; the more steps are taken, the better the approximation to the model distribution. It was discovered that even a single step is enough, at least at early learning stages. At later stages, one can switch to CD3, CD5 and CD10. Apart from CD, another algorithm, called Persistent Contrastive Divergence, was proposed in [11]. Further nuances of the CD learning algorithm can be found in [14].

Fig. 2. A single step of the Contrastive Divergence algorithm.
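The positive and negative phases of a single CD step can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function names `sigm` and `cd1_step`, the mini-batch layout (one training vector per row), the weight matrix shape (visible by hidden) and the learning rate `lr` are all choices made here for illustration. The hidden probabilities, rather than sampled states, are used in the update statistics, which is a common practical variant.

```python
import numpy as np

def sigm(x):
    # Logistic sigmoid: 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr=0.1, rng=None):
    """One CD1 update for a binary RBM (illustrative sketch).

    v0: (n_samples, n_visible) batch of binary training vectors
    W:  (n_visible, n_hidden) weights, a: visible biases, b: hidden biases
    Returns updated copies of (W, a, b).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Positive phase: clamp the data on the visible layer, sample hidden states.
    ph0 = sigm(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase: reconstruct the visible layer (the "fantasy"),
    # then recompute the hidden probabilities from the reconstruction.
    pv1 = sigm(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigm(v1 @ W + b)

    # Update: difference between data and reconstruction statistics.
    n = v0.shape[0]
    W_new = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a_new = a + lr * (v0 - v1).mean(axis=0)
    b_new = b + lr * (ph0 - ph1).mean(axis=0)
    return W_new, a_new, b_new
```

Running several Gibbs alternations before the update, instead of one, would give the CD3, CD5 or CD10 variants mentioned above.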

B. Deep Belief Networks
Once CD had been introduced, it was proposed in [1] to stack trained RBMs in a greedy manner to form so-called Deep Belief Networks (DBNs). The idea is to train an RBM on the training vectors, then, once its training is finished, to use the hidden-layer activations of the first RBM as input to the visible layer of a second, stacked RBM, train it, and continue this procedure for all subsequent layers. When the overall training is complete, the network weights found can be fine-tuned with the regular Error Back Propagation algorithm. For a graphical representation see Fig. 3.
It is argued that such a deep network is capable of building complex hierarchical feature representations. For example, classifying the digits "3" and "8" is a problematic task because the digit "3" is somewhat entangled in the manifold of the digit "8"; thus, the necessity for hierarchical features arises, and in such tasks DBNs and deep architectures outperform many other classifiers.
It is worth noting that neural networks were abandoned in favour of SVMs because, on the one hand, there was not enough training data and computational power and, on the other hand, it was quite problematic to train really deep architectures due to the "vanishing gradient" problem, which shows itself as the number of layers grows or during Recurrent Neural Network (RNN) training (each RNN can be represented as a regular FFNN with a large number of layers, so this is a common problem for deep networks). For the vanishing gradient problem in regard to RNNs, see [12]. In [3] Bengio justifies greedy pre-training, and in [19] the author provides experimental results that show higher accuracy for a pre-trained FFNN and demonstrates that the solutions found lie in different areas of function space (see page 8 in [19]).
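The greedy layer-wise procedure can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the DeepLearnToolbox code: the helper names `train_rbm`, `pretrain_dbn` and `forward`, the full-batch CD1 training loop, the initialisation scale and the hyperparameter defaults are all hypothetical. Each RBM is trained on the hidden activations of the one below it, and the resulting weights and hidden biases are what a feed-forward network would be initialised with before fine-tuning.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1, rng=None):
    """Train one binary RBM with full-batch CD1 (illustrative sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)   # visible biases
    b = np.zeros(n_hidden)    # hidden biases
    for _ in range(epochs):
        ph0 = sigm(data @ W + b)                          # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigm(h0 @ W.T + a)                          # reconstruction (probabilities)
        ph1 = sigm(pv1 @ W + b)                           # negative phase
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / len(data)
        a += lr * (data - pv1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

def pretrain_dbn(data, layer_sizes, **rbm_kwargs):
    """Greedily stack RBMs; returns one (W, b) pair per hidden layer."""
    layers = []
    x = data
    for n_hidden in layer_sizes:
        W, _, b = train_rbm(x, n_hidden, **rbm_kwargs)
        layers.append((W, b))
        x = sigm(x @ W + b)   # hidden activations become the next RBM's input
    return layers

def forward(data, layers):
    """Feed-forward pass through the pre-trained stack (before fine-tuning)."""
    x = data
    for W, b in layers:
        x = sigm(x @ W + b)
    return x
```

For fine-tuning, an output layer would be added on top of `forward` and all weights trained with back propagation; that step is omitted here.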

C. Conditional Restricted Boltzmann Machine
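Since the introduction of RBMs, different authors have proposed various modifications, notably the Conditional Restricted Boltzmann Machine (CRBM) proposed in [8], [13]; for a graphical representation see Fig. 4. The main idea was to adapt RBMs for more successful application to discriminative problems. Apart from [8], [13], one of the first attempts to use RBMs/DBNs for a classification task was made by Hinton, Osindero and Yee Whye Teh in [16]. We will consider only the (a) type CRBM, which uses a target vector u, for which it holds two additional weight matrices $W_{uh}$ and $W_{uv}$ (for the hidden and visible layers, respectively). According to [16], a CRBM models the joint distribution of an input $v = (v_1, \ldots, v_n)$ and a target class y using a hidden layer of binary stochastic units $h = (h_1, \ldots, h_H)$. This is done by first defining an energy function:

$$E(y, v, h) = -h^{T} W v - a^{T} v - b^{T} h - d^{T} e_y - h^{T} Y e_y \tag{13}$$

with parameters $\Theta = (W, a, b, d, Y)$, where a, b and d are the visible, hidden and class biases, W connects the visible and hidden units, Y connects the class units to the hidden units, and

$$e_y = \left(1_{i = y}\right)_{i=1}^{C} \tag{14}$$

is the "one out of C" representation for y. From the energy function, we can assign probabilities to values of y, v and h as follows:

$$p(y, v, h) = \frac{e^{-E(y, v, h)}}{Z} \tag{15}$$

where Z is the normalization constant (partition function), already known from (3), which ensures that (15) is a valid probability distribution. As with the standard RBM, computing p(y, v, h) is computationally intractable, but it is possible to do Gibbs sampling using the conditional distributions. When conditioning on the visible layer we have:

$$p(h_j = 1 \mid y, v) = \mathrm{sigm}\left(b_j + Y_{jy} + \sum_i W_{ji} v_i\right) \tag{16}$$

And when conditioning on the hidden layer we have:

$$p(v_i = 1 \mid h) = \mathrm{sigm}\left(a_i + \sum_j W_{ji} h_j\right) \tag{17}$$

$$p(y \mid h) = \frac{\exp\left(d_y + \sum_j Y_{jy} h_j\right)}{\sum_{y^*} \exp\left(d_{y^*} + \sum_j Y_{jy^*} h_j\right)} \tag{18}$$

It is also possible to compute p(y | v) exactly and hence perform classification. After some transformations (please refer to [16]) it is possible to derive:

$$p(y \mid v) = \frac{e^{d_y} \prod_j \left(1 + e^{b_j + Y_{jy} + \sum_i W_{ji} v_i}\right)}{\sum_{y^*} e^{d_{y^*}} \prod_j \left(1 + e^{b_j + Y_{jy^*} + \sum_i W_{ji} v_i}\right)} = \frac{e^{-F(y, v)}}{\sum_{y^*} e^{-F(y^*, v)}} \tag{19}$$

Here F(y, v) is the free energy, already familiar from (7).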
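The exact computation of p(y | v) can be illustrated with a short NumPy sketch. It assumes the usual parameterisation of a classification RBM; the names used here (hidden biases `b`, class biases `d`, input-to-hidden weights `W` with one row per hidden unit, and class-to-hidden weights `Y` with one column per class) and the function `class_posterior` itself are chosen for illustration only. For each class the negative free energy is the class bias plus a sum of softplus terms over hidden units, and a softmax over classes then gives the posterior without any sampling.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def class_posterior(v, W, Y, b, d):
    """Exact p(y | v) for a classification RBM (illustrative sketch).

    v: (n_visible,) input vector
    W: (n_hidden, n_visible) input-to-hidden weights
    Y: (n_hidden, n_classes) class-to-hidden weights
    b: (n_hidden,) hidden biases, d: (n_classes,) class biases
    """
    # Hidden pre-activation for every (hidden unit, class) pair.
    pre = (W @ v + b)[:, None] + Y
    # Negative free energy of each class: d_y + sum_j softplus(pre_jy).
    neg_free_energy = d + softplus(pre).sum(axis=0)
    # Softmax over classes, shifted by the maximum for numerical stability.
    neg_free_energy = neg_free_energy - neg_free_energy.max()
    p = np.exp(neg_free_energy)
    return p / p.sum()
```

The visible biases do not appear because they contribute the same factor to every class and cancel in the normalisation.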
According to [16], one way of interpreting (19) is that, when assigning probabilities to a particular class y for some input v, the Classification RBM looks at how well the input v fits or aligns with the different filters associated with the rows $W_j$ of W. These filters are shared across the different classes, but different classes will make comparisons with different filters by controlling the class-dependent biases $Y_{jy}$. Notice also that two similar classes could share some filters in W, that is, both could simultaneously have large positive values of $Y_{jy}$ for some rows $W_j$. In addition, [16] describes a hybrid RBM learning approach, which combines discriminative learning with generative learning, weighted by a parameter alpha. Such a generative approach outperformed RBM+NN approach (RBM used as a pre-training step) on

III. EXPERIMENTAL SETUP
For our experiments we used a generative DBN implementation (unmodified source code taken from https://github.com/rasmusbergpalm/DeepLearnToolbox), which was afterwards used as a pre-training step for the fine-tuned FFNN. For all experiments we used 10-fold cross-validation, i.e., we divided the whole data set into ten parts, used nine parts to train the model and the remaining tenth part to run a classification test; in each subsequent run a different part was held out for testing. Thus, the same ten parts were used in all 10 runs, but the training set was always different. We report classification accuracy on the test parts, averaged over the 10 cross-validation runs.
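The evaluation protocol can be sketched as follows. This is a minimal NumPy illustration rather than the code used for the experiments: the function names `kfold_indices` and `cross_val_accuracy`, the random shuffling with a fixed `seed`, and the `fit_predict` callback interface are assumptions introduced here. Each sample lands in the test part exactly once, and the reported figure is the mean of the per-fold accuracies.

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, k)   # k parts of (almost) equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_val_accuracy(fit_predict, X, y, k=10, seed=0):
    """Average test accuracy over k folds.

    fit_predict(X_train, y_train, X_test) must return predicted labels
    for X_test; it stands in for any of the compared classifiers.
    """
    accuracies = []
    for train_idx, test_idx in kfold_indices(len(X), k, seed):
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        accuracies.append(np.mean(y_pred == y[test_idx]))
    return float(np.mean(accuracies))
```

Passing the same `seed` for every classifier keeps the ten parts identical across methods, so their accuracies are compared on the same splits.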
Apart from the mentioned Energy-based models and DBN architecture, for comparison purposes we used a Random Forests (RF) implementation (unmodified source code taken from https://code.google.com/p/randomforest-matlab/); for classifier details see [17]. Along with RF, for some data sets we provide SVM accuracy rates taken from other studies [18]. We performed our tests on two standard classification benchmark data sets: glass identification (http://archive.ics.uci.edu/ml/datasets/Glass+Identification) and ionosphere (http://archive.ics.uci.edu/ml/datasets/ionosphere). Both are multivariate real-valued data sets intended for classification. We also tested the discussed algorithms on a mid-sized proprietary data set containing 12916 binary data vectors of length 200 (initially the vectors had length 5000, but we kept only the 200 most representative features). These vectors represent bags of words extracted from financial documents. There are 11 classes in this data set, represented by 608, 1331, 1542, 995, 1009, 500, 731, 1220, 2788, 78 and 2114 data vectors, respectively. As can be seen, class 10 is quite poorly represented. There are numerous overlapping vectors belonging to different classes, and in general such a data set can be considered quite hard to classify. Figure 5 shows a visualization of this data set (using the full-length data vectors of 5000 features) by means of fast t-Distributed Stochastic Neighbour Embedding; for details of the visualization algorithm refer to [20][21][22].
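The class imbalance in the proprietary data set can be made concrete with a few lines of arithmetic. The snippet below uses only the class counts quoted above; the variable names are illustrative. The counts sum to the stated 12916 vectors, the largest class holds about 21.6% of the data and the smallest (class 10) about 0.6%, so a classifier that always predicts the largest class would already reach roughly 21.6% accuracy.

```python
import numpy as np

# Class sizes of the proprietary data set, in the order given in the text.
counts = np.array([608, 1331, 1542, 995, 1009, 500, 731, 1220, 2788, 78, 2114])

total = counts.sum()                  # total number of data vectors
shares = counts / total               # fraction of the data set per class
majority_baseline = shares.max()      # accuracy of always predicting the largest class
smallest_class = int(np.argmin(counts)) + 1   # 1-based index of the rarest class

for label, (n, share) in enumerate(zip(counts, shares), start=1):
    print(f"class {label:2d}: {n:5d} vectors ({100 * share:5.2f}%)")
```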

TABLE I CLASSIFICATION ACCURACY RATES
Random Forests were used with default settings for all data sets.
For the proprietary data set, generative RBMs were trained with 800 hidden neurons for 3000 training epochs; for the Ionosphere and Glass data sets they were trained with 100 hidden neurons for 1000 training epochs.
The Classification RBM without fine-tuning was trained with 800 hidden neurons for 3000 training epochs on the proprietary data set, and with 100 hidden neurons for 1000 training epochs on the Ionosphere and Glass data sets.
DBN and FFNN were trained using 2 hidden layers, with 200 neurons in each hidden layer for the proprietary data set and with 10 and 32 neurons in each hidden layer for the Glass and Ionosphere data sets, respectively. On the Glass and Ionosphere data sets, training ran for 100 epochs of RBM pre-training and 100 epochs of FFNN fine-tuning. In all cases the FFNN (used both in the general setup and in the fine-tuning stage) was trained using Cross Entropy as the loss function.
The SVM used for classification is a fine-tuned implementation based on the libSVM library [23].
Table I shows classification accuracy rates on the different data sets. RBMs and DBNs clearly lose in accuracy to Random Forests and SVM-based classifiers. It can also be seen that the FFNN with two hidden layers outperforms the DBN. Our results resemble those in [5]. Moreover, the DBN shows extremely low performance even compared to RBMs. The first observation is that Ionosphere has 32 features and Glass Classification only 10. In contrast, our proprietary data set holds 200 features, but all of them are binary. Thus, it seems that low-dimensional problems (or problems whose features were preselected by some other algorithm) can badly influence RBM classification rates (we should note that we conducted partial experiments on the proprietary data set with larger feature vectors (2000 features), but DBN performance was even worse than with 200 features). In contrast to real-valued … Nevertheless, SVM and RF, given the same initial information, were able to outperform RBMs, DBN and FFNN.
As to the DBN, it was trying to build higher-level hierarchical features on top of the quite poor representation given by the RBM in the first layer. However, in our case it seems that all features were uncorrelated and their combination at higher levels provided little value, if any. The same logic applies to the FFNN with two hidden layers. We conducted additional experiments, which showed that adding further hidden layers did not help the DBN to perform better. The main point of DBNs is to learn a hidden layer of filters or sparse bases (sparse codes) that can be combined in subsequent layer(s), either in an FFNN or even an SVM (for example, see [24]). For our data sets, however, it seems that learning filters that model the joint appearance of several bits in a vector, instead of a single one, is inappropriate. While such sparse coding is a good thing for high-dimensional data, it is evidently not the best choice for dense data sets.
In general, our findings somewhat contradict the results in [25], where Hinton argues that DBNs with an exponentially large number of hidden layers, each of size equal to the input vector, can model an arbitrary input vector with arbitrary accuracy; but again, we performed only partial experiments with 3 and 4 hidden layers, while Hinton talks about a much larger number.
The same discussion about the representational power of RBMs and DBNs is held in [26][27]. Such theoretical discussions are important in that they give theoretical justification for the methods, but our experiments show that for some specific data sets the referenced classification approaches do not work very well with models of acceptable size and training time. All successful DBN and RBM applications reported in the referenced papers are related to high-dimensional data sets, such as documents, images and the like. While such data sets are a very promising research area, for low-dimensional or pre-processed data these approaches with default settings are not the best choice. LeCun generalizes many classification approaches as Energy-based Models and treats them all as Energy-based Learning, so in theory it is possible to keep the architecture and inference algorithm but adjust the loss function and possibly the learning algorithm.

V. CONCLUSION
We compared RBM+FFNN, CRBM, DBN and FFNN in classification tests using two small benchmark UCI data sets and a single proprietary mid-sized data set. RBMs lost in accuracy to the RF and SVM approaches, while the DBN showed very poor performance and was of little practical use on these data sets. The FFNN performed slightly worse than the RBM with fine-tuning, which agrees with the reported positive influence of the pre-training phase. Tests on the proprietary 200-feature data set showed that even this number of features can be insufficient to learn good separating hyperplanes for classification. Building hierarchical features through a DBN did not help. Increasing the number of neurons in the RBM hidden layer had some positive effect, but it badly influenced training time and gave only a negligible increase in accuracy. In general, existing approaches allow RBMs and DBNs to deal with high-dimensional data when a large number of sample vectors is available to learn from. Moreover, RBMs allow us to perform training on unlabelled data, which is a huge gain in certain scenarios.
Future research directions include searching for the reasons why RBMs are outperformed by RFs and SVMs and looking for possible ways to increase RBM performance. The Energy-based Model framework [27] is a good candidate that can help in solving the latter problem. Another direction is searching for metrics that would allow us to tell beforehand whether a specific data set can be successfully modelled by RBMs and DBNs. Apart from that, experiments with Partially Restricted Boltzmann Machines and Deep Boltzmann Machines can be conducted to see how well they perform.

Fig. 3. (a) A DBN formed by stacked RBMs; (b) a regular Feed-Forward Neural Network (FFNN) formed using the weights acquired during DBN training, to be fine-tuned using standard error back propagation.

Fig. 5. Visualization of the proprietary data set: 12916 binary bags of words representing 11 financial document classes.