Nonnegative/Binary matrix factorization for image classification … – Nature.com
NBMF [9] extracts features by decomposing data into basis and coefficient matrices. The dataset is first converted into a nonnegative matrix to serve as the input to the NBMF method. If the dataset contains m data points of n dimensions each, the input is an \(n \times m\) matrix V. The input matrix is decomposed into a basis matrix W of size \(n \times k\), representing the dataset features, and a coefficient matrix H of size \(k \times m\), representing the combination of features selected to reconstruct the original matrix. Then,
$$\begin{aligned} V \approx WH, \end{aligned}$$
(1)
where W and H are nonnegative and binary matrices, respectively. The number of columns k of W corresponds to the number of features extracted from the data and can be set to any value. To minimize the difference between the left and right sides of Eq.(1), W and H are updated alternately as
$$ W := \mathop{\text{arg}~\text{min}}\limits_{X \in \mathbb{R}_{+}^{n \times k}} \parallel V - XH \parallel_{F} + \alpha \parallel X \parallel_{F}, $$
(2)
$$ H := \mathop{\text{arg}~\text{min}}\limits_{X \in \{0,1\}^{k \times m}} \parallel V - WX \parallel_{F}, $$
(3)
where \(\parallel \cdot \parallel_F\) denotes the Frobenius norm. The components of W and H are initialized randomly. The hyperparameter \(\alpha\) is a positive real value that prevents overfitting and is set to \(\alpha = 1.0 \times 10^{-4}\).
In previous studies, the projected gradient method (PGM) was used to perform the update in Eq.(2) [16]. The loss function for this update is defined as
$$\begin{aligned} f_{W}(\boldsymbol{x}) = \parallel \boldsymbol{v} - H^{\text{T}} \boldsymbol{x} \parallel^{2} + \alpha \parallel \boldsymbol{x} \parallel^{2}, \end{aligned}$$
(4)
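where \(\boldsymbol{x}^{\text{T}}\) and \(\boldsymbol{v}^{\text{T}}\) are corresponding row vectors of W and V, respectively. Up to a constant factor of 2, which can be absorbed into the learning rate, the gradient of Eq.(4) is expressed as
$$\begin{aligned} \nabla f_{W} = -H (\boldsymbol{v} - H^{\text{T}} \boldsymbol{x}) + \alpha \boldsymbol{x}. \end{aligned}$$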
(5)
The PGM minimizes the loss function in Eq.(4) by updating \(\boldsymbol{x}\):
$$\begin{aligned} \boldsymbol{x}^{t+1} = P\left[\boldsymbol{x}^t - \gamma_t \nabla f_W (\boldsymbol{x}^t)\right], \end{aligned}$$
(6)
where \(\gamma_t\) is the learning rate and
$$\begin{aligned} P[x_i] = {\left\{ \begin{array}{ll} 0 & (x_i \le 0), \\ x_i & (0< x_i < x_\mathrm{max}), \\ x_\mathrm{max} & (x_\mathrm{max} \le x_i), \end{array}\right. } \end{aligned}$$
(7)
where \(x_\mathrm{max}\) is the upper bound and is set to \(x_\mathrm{max}=1\). Eq.(7) is a projection that keeps each component of \(\boldsymbol{x}\) nonnegative and no larger than \(x_\mathrm{max}\).
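The update in Eqs.(5)-(7) can be sketched for a single row of W as follows. This is a minimal NumPy illustration, not the implementation used in the study: the fixed learning rate `gamma`, the step count, the toy data, and all function names are assumptions made for the example.

```python
import numpy as np

def project(x, x_max=1.0):
    """Eq. (7): clip every component to the interval [0, x_max]."""
    return np.clip(x, 0.0, x_max)

def grad_f_w(x, v, H, alpha=1e-4):
    """Eq. (5): gradient for one row x of W and the matching row v of V."""
    return -H @ (v - H.T @ x) + alpha * x

def loss(x, v, H, alpha=1e-4):
    """Eq. (4): squared reconstruction error plus the alpha penalty."""
    return np.sum((v - H.T @ x) ** 2) + alpha * np.sum(x ** 2)

def pgm_update_row(x, v, H, gamma=0.01, n_steps=200, alpha=1e-4):
    """Eq. (6): repeated projected gradient steps (fixed gamma is an assumption)."""
    for _ in range(n_steps):
        x = project(x - gamma * grad_f_w(x, v, H, alpha))
    return x

# Toy problem (hypothetical sizes): k features, m data points.
rng = np.random.default_rng(0)
k, m = 4, 30
H = rng.integers(0, 2, size=(k, m)).astype(float)  # binary coefficient matrix
x_true = rng.uniform(0.0, 1.0, size=k)             # row of W used to generate v
v = H.T @ x_true                                   # matching row of V
x0 = rng.uniform(0.0, 1.0, size=k)                 # random initial row
x_hat = pgm_update_row(x0, v, H)
print(loss(x0, v, H), loss(x_hat, v, H))
```

Applying `pgm_update_row` to every row of W with the current H gives one W update of the alternating scheme.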
However, because H is a binary matrix, Eq.(3) can be regarded as a combinatorial optimization problem that can be minimized by using an annealing method. To solve Eq.(3) using a D-Wave machine, a quantum annealing computer, we formulated the loss function as a quadratic unconstrained binary optimization model:
$$\begin{aligned} f_{H}(\boldsymbol{q}) = \sum_i \sum_r W_{ri}\left(W_{ri} - 2 v_{r}\right) q_i + 2 \sum_{i<j} \sum_r W_{ri} W_{rj} q_i q_j, \end{aligned}$$
(8)
where \(\boldsymbol{q}\) and \(\boldsymbol{v}\) are corresponding column vectors of H and V, respectively. After the alternating updates converge, we obtain the W and H that minimize the difference between the left and right sides of Eq.(1). W consists of representative features extracted from the input data, and H represents, using binary values, the combination of features in W that reconstructs V. Therefore, V can be approximated as the product of W and H.
Previous studies used NBMF to extract features from facial images [9]. When the number of annealing steps is small, the computation time is shorter than that of a classical combinatorial optimization solver. However, the D-Wave machine has the disadvantage that its computing time increases linearly with the number of annealings, whereas the computing time of the classical solver does not change significantly. The results were compared with NMF [14]. Unlike in NBMF, the matrix H in NMF is positive and not binary. Although the matrix H produced by NBMF was sparser than that of NMF, the difference between V and WH was approximately 2.17 times larger for NBMF than for NMF. Thus, although NBMF can have a shorter data processing time than the classical method, it is less accurate than NMF as a machine-learning method. Moreover, because previous studies did not demonstrate tasks beyond data reconstruction, the usefulness of NBMF as a machine-learning model is uncertain.
In this study, we propose the application of NBMF to a multiclass classification model. Inspired by the structure of a fully connected neural network (FCNN), we define an image classification model using NBMF. In an FCNN, image data are fed into the network as input, as shown in Fig.1, and the predicted classes are obtained as the output of the network through the hidden layers.
Figure 1: An overview of a fully connected neural network.
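Returning to the QUBO form of Eq.(3): expanding \(\parallel \boldsymbol{v} - W\boldsymbol{q} \parallel^2\) for one column and using \(q_i^2 = q_i\) for binary variables gives a linear coefficient \(\sum_r W_{ri}(W_{ri} - 2 v_r)\) for each \(q_i\) and a pairwise coefficient \(2\sum_r W_{ri} W_{rj}\) for each pair \(i<j\), plus the constant \(\boldsymbol{v}^{\text{T}}\boldsymbol{v}\). The sketch below builds these coefficients and minimizes them by exhaustive search. The exhaustive search is only a stand-in for the annealing solver and is feasible only for small k; the function names and toy sizes are assumptions made for the example.

```python
import itertools
import numpy as np

def qubo_coefficients(W, v):
    """Linear and pairwise QUBO coefficients for one column v of V."""
    linear = np.sum(W * (W - 2.0 * v[:, None]), axis=0)  # sum_r W_ri (W_ri - 2 v_r)
    gram = W.T @ W                                        # sum_r W_ri W_rj
    quadratic = 2.0 * np.triu(gram, k=1)                  # keep pairs with i < j only
    return linear, quadratic

def f_h(q, linear, quadratic):
    """QUBO energy of a binary vector q."""
    return linear @ q + q @ quadratic @ q

def brute_force_minimum(linear, quadratic):
    """Try all 2^k binary vectors (stand-in for an annealer, small k only)."""
    k = linear.shape[0]
    candidates = (np.array(bits, dtype=float)
                  for bits in itertools.product((0, 1), repeat=k))
    return min(candidates, key=lambda q: f_h(q, linear, quadratic))

# Toy problem (hypothetical sizes): n pixels, k features.
rng = np.random.default_rng(1)
n, k = 6, 4
W = rng.uniform(0.0, 1.0, size=(n, k))
q_true = np.array([1.0, 0.0, 1.0, 1.0])   # binary code used to generate v
v = W @ q_true
linear, quadratic = qubo_coefficients(W, v)
q_best = brute_force_minimum(linear, quadratic)
print(q_best)
```

Because the energy differs from the squared error only by the constant \(\boldsymbol{v}^{\text{T}}\boldsymbol{v}\), minimizing it is equivalent to solving Eq.(3) column by column.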
To perform fully connected network learning using NBMF, we interpret the structure shown in Fig.1 as a single-matrix decomposition. When the input and output layers of the FCNN are combined into one input layer, the network becomes a two-layer network with the same structure as NBMF. As the input for training the network by NBMF, we use a matrix consisting of image data and the corresponding class information. Class information is represented by a one-hot vector multiplied by an arbitrary real number g. The image data and class information vectors are combined row-wise and eventually transformed into an input matrix V. We use NBMF to decompose V to obtain the basis matrix W and the coefficient matrix H, as shown in Fig.2. The column vectors in H correspond to the nodes in the hidden layer of the FCNN, and the components of W correspond to the weights of the edges. The number of feature dimensions k in NBMF corresponds to the number of nodes in the hidden layer of the FCNN.
Figure 2: An overview of training by NBMF.
To obtain H, we minimize Eq.(8) by using an annealing solver, as in a previous study. However, to obtain W by minimizing Eq.(4), we propose using the projected Root Mean Square Propagation (RMSProp) method instead of the PGM used in a previous study. RMSProp is a gradient descent method that adjusts the learning and decay rates to help the solution escape local minima [17]. RMSProp updates the vector \(\boldsymbol{h}\), whose components are denoted by \(h_i\), as
$$\begin{aligned} h^{t+1}_{i} = \beta h^{t}_{i} + (1-\beta ) g^{2}_{i}, \end{aligned}$$
(9)
where \(\beta\) is the decay rate and \(\boldsymbol{g} = \nabla f_W\). The vector \(\boldsymbol{x}\) is then updated as
$$\begin{aligned} \boldsymbol{x}^{t+1} = \boldsymbol{x}^{t} - \eta \frac{1}{\sqrt{\boldsymbol{h}^{t} + \epsilon }} \nabla f_{W}, \end{aligned}$$
(10)
where \(\eta\) is the learning rate and \(\epsilon\) is a small value that prevents computational errors.
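The RMSProp update of Eqs.(9) and (10), followed by the projection of Eq.(7), can be sketched for a single row of W as follows. This is an illustrative NumPy sketch under stated assumptions: the values of `eta`, `beta`, `eps`, and the step count are not taken from the study, and the step uses the accumulator just updated by Eq.(9), which is the common RMSProp ordering and an assumption about the indexing in Eq.(10).

```python
import numpy as np

def loss(x, v, H, alpha=1e-4):
    """Eq. (4): squared reconstruction error plus the alpha penalty."""
    return np.sum((v - H.T @ x) ** 2) + alpha * np.sum(x ** 2)

def projected_rmsprop_row(x, v, H, eta=0.01, beta=0.9, eps=1e-8,
                          alpha=1e-4, x_max=1.0, n_steps=500):
    """Update one row x of W with RMSProp steps, each followed by a projection."""
    h = np.zeros_like(x)                         # running mean of squared gradients
    for _ in range(n_steps):
        g = -H @ (v - H.T @ x) + alpha * x       # Eq. (5)
        h = beta * h + (1.0 - beta) * g ** 2     # Eq. (9)
        x = x - eta * g / np.sqrt(h + eps)       # Eq. (10), with the updated h
        x = np.clip(x, 0.0, x_max)               # Eq. (7)
    return x

# Toy problem (hypothetical sizes): k features, m data points.
rng = np.random.default_rng(2)
k, m = 4, 30
H = rng.integers(0, 2, size=(k, m)).astype(float)  # binary coefficient matrix
x_true = rng.uniform(0.3, 0.9, size=k)             # row of W used to generate v
v = H.T @ x_true                                   # matching row of V
x0 = np.zeros(k)                                   # start from the lower bound
x_hat = projected_rmsprop_row(x0, v, H)
print(loss(x0, v, H), loss(x_hat, v, H))
```

Because each component is scaled by its own gradient history, the effective step size is roughly `eta` per component regardless of the gradient magnitude, which is the main practical difference from the fixed-rate PGM step.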
After updating \(\boldsymbol{x}\) using Eq.(10), we apply the projection described in Eq.(7) to ensure that the solution does not exceed the bounds. We call this method projected RMSProp.
Fig.3 illustrates the information contained in W. Because the row vectors of W correspond to those of V, W consists of \(W_1\), corresponding to the image data information, and \(W_2\), corresponding to the class information. We plotted four column vectors selected from W trained with MNIST handwritten digit images under the conditions \(m = 300\) and \(k=40\), as shown in Fig.3. The images in Fig.3 show the column vectors of \(W_1\). The blue histograms show the frequencies at which the column vectors were selected to reconstruct the training data images with each label. The orange bar graphs show the component values of the corresponding column vectors of \(W_2\).
For example, the image in Fig.3a resembles the digit 0. The histogram next to the image shows that the image is often used to reconstruct the training data labeled 0. In the bar graph on the right, the corresponding column vector of \(W_2\) has its largest component value at index 0. This indicates that the column vector corresponding to the image captures a feature of the digit 0. Similarly, the image in Fig.3b corresponds to label 9. In contrast, the image in Fig.3c appears to have curved features. The histogram and bar graph next to the image indicate that the image is often used to represent labels 2 and 3. This result is consistent with the fact that both digits have a curve, which explains why this column vector of \(W_1\) was used in the reconstruction of images with labels 2 and 3. The image in Fig.3d has the shape of a straight line, and the corresponding histogram shows that the image is mainly used to express label 1 and is also frequently used to express label 6. Because the digit 6 has a straight-line part, the result is reasonable.
Figure 3: Four sets of images, (a), (b), (c), and (d), corresponding to column vectors selected from W. Each set contains an image, a histogram, and a bar graph. The image represents a column vector of \(W_1\), and the histogram shows how often the column vector was selected to reconstruct the training data images with each label. The orange bar graph plots the component values of the corresponding column vector of \(W_2\).
In our multiclass classification model using NBMF, we use the trained matrices \(W_1\) and \(W_2\) to classify the test data in the workflow shown in Fig. 4.
Figure 4: An overview of testing by NBMF.
First, we decompose the test data matrix \(V_\text{test}\) to obtain \(H_\text{test}\) by using \(W_1\). Here, M represents the number of test data points, which corresponds to the number of column vectors of \(V_\text{test}\). We use Eq.(3) for the decomposition. Each column vector of \(H_\text{test}\) represents the features selected from the trained \(W_1\) to approximate the corresponding column vector of \(V_\text{test}\). Second, we multiply \(W_2\) by \(H_\text{test}\) to obtain \(U_\text{test}\), which expresses the prediction of the class vector corresponding to each column vector in \(V_\text{test}\). Finally, we apply the softmax function to the components of \(U_\text{test}\) and take the index with the largest component value in each column vector as the predicted class.
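The three testing steps can be sketched as follows. This is a toy NumPy illustration, not the study's pipeline: the binary decomposition of Eq.(3) is done by exhaustive search as a stand-in for the annealing solver (feasible only for small k), and the matrices `W1`, `W2`, `V_test`, the scale `g`, and all function names are hypothetical values chosen for the example.

```python
import itertools
import numpy as np

def binary_decompose(W1, V):
    """Eq. (3) by exhaustive search over binary columns (small k only)."""
    k = W1.shape[1]
    Q = np.array(list(itertools.product((0.0, 1.0), repeat=k))).T  # k x 2^k candidates
    recon = W1 @ Q                                                 # n x 2^k reconstructions
    # Squared distance between every column of V and every candidate reconstruction.
    dist = ((V[:, :, None] - recon[:, None, :]) ** 2).sum(axis=0)  # M x 2^k
    return Q[:, dist.argmin(axis=1)]                               # k x M

def softmax(U):
    """Column-wise softmax, shifted by the column maximum for numerical stability."""
    E = np.exp(U - U.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def classify(W1, W2, V_test):
    H_test = binary_decompose(W1, V_test)   # step 1: features selected per test column
    U_test = W2 @ H_test                    # step 2: predicted class vectors
    return softmax(U_test).argmax(axis=0)   # step 3: most probable class per column

# Hypothetical trained matrices: 4 pixels, 3 features, 3 classes.
g = 2.0                                     # scale of the one-hot class rows
W1 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 0.0]])
W2 = g * np.eye(3)                          # feature i votes for class i
V_test = np.array([[0.9, 0.0, 0.1],
                   [0.1, 0.2, 1.0],
                   [0.0, 0.8, 0.0],
                   [0.0, 0.1, 0.0]])        # one test sample per column
predicted = classify(W1, W2, V_test)
print(predicted)  # [0 2 1]
```

Since softmax is monotonic within each column, it does not change which index is largest; it only rescales the components of \(U_\text{test}\) so that they can be read as class probabilities.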