CN113158798A - Short video classification method based on multi-mode feature complete representation - Google Patents
Short video classification method based on multi-mode feature complete representation
- Publication number
- CN113158798A (application CN202110282974.3A)
- Authority
- CN
- China
- Prior art keywords
- representation
- visual
- modal
- label
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/71—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Library & Information Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a short video classification method based on a complete representation of multi-modal features, which comprises the following steps: for the content information of the short video itself, with the visual modality features taken as the primary ones, four subspaces are constructed from the perspective of missing modalities and their potential feature representations are obtained respectively, and the potential representations of the four subspaces are further fused by an auto-encoding and decoding network so that a more robust and effective common potential representation is learned; for the label information, inverse covariance estimation and a graph attention network are adopted to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video; a cross-modal fusion scheme based on multi-head attention is proposed for the common potential representation and the label vector representation and is used to obtain the label prediction scores of the short video; the overall loss function of the model consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding and decoding network, measures the difference between the network output and the ground truth, and guides the network to find the optimal solution of the model.
Description
Technical Field
The invention relates to the field of short video classification, in particular to a short video classification method based on a complete representation of multi-modal features.
Background
In recent years, with the popularization of intelligent terminals and the booming of social networks, more and more information is presented as multimedia content. High-definition cameras, large-capacity storage and high-speed network connections create extremely convenient shooting and sharing conditions for users, producing massive amounts of multimedia data.
As a new form of user-generated content, short videos have become enormously popular on social networks thanks to their unique advantages such as a low creation threshold, fragmented content and strong social attributes. In particular, since 2011, with the spread of mobile Internet terminals, faster networks and lower traffic charges, short videos have rapidly won the support and favor of content platforms, fans, capital and other parties. Statistics show that global mobile video traffic already accounts for more than half of all mobile data traffic and continues to grow rapidly. The sheer scale of short video data easily overwhelms the information users need and makes it difficult for them to find the desired short video content, so how to process and exploit this information efficiently becomes critical.
Artificial intelligence technology represented by deep learning is one of the most popular technologies at present and is widely applied in many fields such as computer vision.
Therefore, studying the short video classification task helps to drive innovation on related topics in the computer vision and multimedia fields, and has important application value and practical significance for improving user experience and developing the industry.
Disclosure of Invention
The invention provides a short video classification method based on a complete representation of multi-modal features, which solves the short video multi-label classification problem and evaluates the result, as described in detail below:
a method for short video classification based on a complete representation of multi-modal features, the method comprising:
for the content information of the short video itself, with the visual modality features taken as the primary ones, constructing four subspaces from the perspective of missing modalities and obtaining their potential feature representations respectively, and further fusing the potential representations of the four subspaces with an auto-encoding and decoding network so that a more robust and effective common potential representation is learned;
for the label information, adopting inverse covariance estimation and a graph attention network to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video;
proposing a cross-modal fusion scheme based on multi-head attention for the common potential representation and the label vector representation, which is used to obtain the label prediction scores of the short video;
the overall loss function of the model consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding and decoding network, measures the difference between the network output and the ground truth, and guides the network to find the optimal solution of the model.
Wherein the two types of visual modality potential representations are: the potential representation of the visual modality alone, and the potential representation of the visual modality complemented by the information of the other modalities.
Further, the exploring of the correlations between labels and updating of the label representation with inverse covariance estimation and a graph attention network to obtain the label vector representation corresponding to the short video specifically comprises:
introducing an inverse covariance estimate and finding, for a given label matrix V, the inverse covariance matrix S^{-1} that characterizes the pairwise relationships of the labels, i.e. defining a graph relationship function to initialize the graph structure S;
and converting the label matrix V input to the network into a new label matrix, inputting the new label matrix into the graph relation function G(·), and calculating the graph structure S′ under the new label matrix.
Wherein the cross-modal fusion scheme based on multi-head attention is as follows: the labels are queried with the common potential representation of the short video's visual features, the correlations are computed, and the common potential representation of the visual modality and the label matrix are aligned.
The technical scheme provided by the invention has the beneficial effects that:
1. The invention studies the multi-modal representation learning problem in short videos and proposes a deep multi-modal unified representation learning scheme in which the visual modality information is primary and the other modalities are auxiliary. Four subspaces are constructed from the perspective of missing modalities to learn the information complementarity among modalities, yielding potential representations of two types of visual modality features, and, considering the consistency of the visual modality information, these two types of potential representations are fused with an auto-encoding and decoding network to obtain the common potential representation of the visual modality features. This process simultaneously accounts for missing modalities and for the complementarity and consistency of modality information, making full use of the modality information of the short video;
2. The invention explores the label information space of short videos and provides a new approach to label correlation learning from two aspects: inverse covariance estimation and a graph attention network;
3. To address the limited duration and insufficient information of short videos, the invention learns the common potential representation of the visual modality and the label representation from two angles, the content information and the label information of the short video, and proposes a cross-modal fusion strategy based on multi-head attention for these two representations to obtain the final label prediction scores.
The method makes full use of the modality information of the short video to learn the visual modality representation and the label representation that matter most for the multi-label classification task, and helps to improve the accuracy of short video multi-label classification.
Drawings
FIG. 1 is a diagram of the overall network framework of a short video classification method based on a complete representation of multi-modal features;
FIG. 2 is a subspace learning framework diagram;
FIG. 3 shows the experimental result data.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
The embodiment of the invention provides a short video classification method based on a complete multi-modal feature representation, which makes full use of the content information and the label information of the short video. As shown in FIG. 1, the method comprises the following steps:
101: For the content information, experience shows that the semantic feature representation of the visual modality is the most important in the short video multi-label classification task, so representation learning centred on the visual modality features is proposed: with the visual modality features taken as the primary ones, four subspaces are constructed from the perspective of missing modalities, the information complementarity between modalities is learned, and potential representations of two types of visual modality features are obtained. Considering the consistency of the visual modality information, and in order to obtain a more compact visual modality representation, the two types of visual modality potential representations produced by the four subspaces are fused with an auto-encoding and decoding network to learn the common potential representation of the visual modality features;
102: For the label information, a convex formulation (inverse covariance estimation) and a graph attention network are adopted to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video;
the label vector representation is used to explore a label representation suited to the short video data set and participates, together with the common potential representation of the visual modality features from step 101, in the multi-head cross-modal fusion network of step 103;
103: For the representations of the two information spaces, namely the common potential representation of the visual modality features obtained in step 101 and the label representation obtained in step 102, a cross-modal fusion scheme based on multi-head attention is proposed and used to obtain the label prediction scores of the short video;
the output of the multi-head cross-modal fusion network can be regarded as the label prediction scores of the input short video and is used directly in the classification loss function.
104: The overall loss function consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding and decoding network; it measures the difference between the network output and the ground truth and guides the network to find the optimal solution of the model.
The performance of the scheme is evaluated with five indexes, namely Coverage, Ranking Loss, mean Average Precision, Hamming Loss and One-error, to ensure the objectivity of the experimental results.
In a specific implementation, before step 101, the method further includes:
inputting the short video and extracting the visual, audio and trajectory modality features with classical deep learning networks respectively.
In summary, the embodiment of the invention obtains the label prediction scores of the input short video by applying theories of multi-modal learning and label learning and combining them with the advantages of deep learning networks, and the classification results are accurate and effective.
Example 2
The scheme of Example 1 is further described below in combination with calculation formulas and examples:
201: The model takes a complete short video as input and extracts the visual, audio and trajectory modality features respectively.
For the visual modality, key frames are extracted, the classical image feature extraction network ResNet (residual network) is applied to all video key frames, and an averaging (AvePooling) operation is then performed to obtain the overall feature z_v of the visual modality:

z_v = AvePooling(ResNet(X_v; β_v))

where ResNet(·) is the residual network, AvePooling(·) is the averaging operation, X_v denotes the original visual features of the short video, β_v denotes the network parameters to be learned, and the visual modality feature z_v has dimension d_v.
For the audio modality, the spectrogram is drawn and the audio feature z_a is extracted from the spectrogram with "CNN + LSTM" (convolutional neural network + long short-term memory network):

z_a = LSTM(CNN(X_a; β_a))

where CNN(·) is the convolutional neural network, LSTM(·) is the long short-term memory network, X_a denotes the original audio features of the short video, β_a denotes the network parameters to be learned, and the audio modality feature z_a has dimension d_a.
For the trajectory modality, the TDD (trajectory-pooled deep-convolutional descriptor) method is used to extract the trajectory feature z_t jointly from the temporal and spatial domains:

z_t = TDD(X_t; β_t)

where TDD(·) is the trajectory-pooled deep descriptor network, X_t denotes the original trajectory information of the short video, β_t denotes the network parameters to be learned, and the trajectory modality feature z_t has dimension d_t.
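For illustration, the following is a minimal PyTorch sketch of the three feature extractors of step 201. It assumes a torchvision ResNet-50 backbone for the key frames and a toy CNN + LSTM for the spectrogram; the TDD trajectory descriptor has no standard off-the-shelf implementation, so it is represented only by a hypothetical loading helper. Layer sizes and helper names are illustrative assumptions, not part of the patent.

```python
# Minimal sketch of the step-201 feature extractors (assumptions: torchvision
# ResNet-50 for key frames, a toy CNN + LSTM for spectrograms, and a
# hypothetical loader standing in for the TDD trajectory descriptor).
import torch
import torch.nn as nn
from torchvision import models

class VisualExtractor(nn.Module):
    """z_v = AvePooling(ResNet(X_v)): frame-level ResNet features averaged over key frames."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the final fc layer

    def forward(self, frames):                  # frames: (num_keyframes, 3, 224, 224)
        feats = self.cnn(frames).flatten(1)     # (num_keyframes, 2048)
        return feats.mean(dim=0)                # average over key frames -> z_v, dimension d_v = 2048

class AudioExtractor(nn.Module):
    """z_a = LSTM(CNN(spectrogram)): CNN over the spectrogram, LSTM over its time axis."""
    def __init__(self, d_a=128):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d((32, 32)))
        self.lstm = nn.LSTM(input_size=16 * 32, hidden_size=d_a, batch_first=True)

    def forward(self, spec):                    # spec: (1, freq, time) spectrogram
        x = self.cnn(spec.unsqueeze(0))         # (1, 16, 32, 32)
        x = x.permute(0, 3, 1, 2).flatten(2)    # (1, time=32, 16*32)
        _, (h, _) = self.lstm(x)
        return h[-1].squeeze(0)                 # z_a, dimension d_a

def trajectory_features(tdd_file):              # hypothetical helper: TDD features are
    return torch.load(tdd_file)                 # assumed pre-extracted by an external tool -> z_t
```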
202: Modality subspace learning centred on the visual modality.
The model considers the visual, audio and trajectory modalities of the short video. A specific short video generally contains video pictures, i.e. the visual modality features always exist, but whether the other two modalities are missing is uncertain, giving four possible missing-modality situations in total. Experience shows that the potential representation of the visual modality is crucial in the short video multi-label classification task, so the four subspaces are constructed around learning the visual modality potential representation; that is, two main cases are discussed, the potential representation of the visual modality alone and the potential representation of the visual modality complemented by the information of the other modalities, to ensure that the visual modality potential representation is fully mined. (The visual modality feature z_v, the audio modality feature z_a and the trajectory modality feature z_t are all obtained in step 201.)
(i) Potential representation of the visual modality alone

h_v = φ_v(z_v; θ_v)

where φ_v(·) is the visual-feature-specific mapper, θ_v denotes the network parameters to be learned, and the visual modality potential representation h_v has dimension d_h.
(ii) Potential representation of the visual modality complemented by other modality information
A normalized exponential function is introduced to quantitatively analyse the complementary relationship between the other modality information and the visual modality information; the other modality features are transformed into corresponding features in the visual representation space, added to the visual modality features, and fed into a feature fusion mapper to obtain the visual modality potential representation after information complementation.
① When only the visual modality feature z_v and the audio modality feature z_a exist, the correlation matrix U_a of the two modality features is first computed:

U_a = z_v^T z_a

where z_v^T is the transpose of the visual modality feature z_v, d_v is the dimension of z_v, d_a is the dimension of z_a, and the correlation matrix U_a has dimension d_v × d_a. A normalized exponential function is then applied to obtain the correlation score matrix:

A_a = softmax(U_a)

where softmax(·) is the normalized exponential function (likewise below) and the correlation score matrix A_a has dimension d_v × d_a. The correlation score matrix A_a is used to transform the audio modality feature z_a into the visual representation space, obtaining the audio modality feature representation ẑ_a in the visual representation space:

ẑ_a = A_a z_a^T

where z_a^T is the transpose of the audio modality feature z_a and the audio modality feature representation ẑ_a in the visual representation space has dimension d_v. Finally, the original visual modality feature z_v and the audio modality feature ẑ_a in the visual representation space are added and fed into the feature fusion mapper φ_a, generating the visual modality potential representation h_a supplemented with the audio modality information:

h_a = φ_a(z_v ⊕ ẑ_a; θ_a)

where θ_a denotes the feature fusion mapper parameters to be learned, ⊕ denotes addition of the corresponding elements of the vectors, and the visual modality potential representation h_a generated by the feature fusion mapper has dimension d_h.

② When only the visual modality feature z_v and the trajectory modality feature z_t exist, the same strategy as in ① is adopted to obtain the visual modality potential representation supplemented with the trajectory modality information:

U_t = z_v^T z_t

where U_t is the correlation matrix of the visual modality feature z_v and the trajectory modality feature z_t, with dimension d_v × d_t, and z_v^T is the transpose of z_v.

A_t = softmax(U_t)

where A_t is the correlation score matrix between the visual and trajectory modalities, with dimension d_v × d_t.

ẑ_t = A_t z_t^T

where ẑ_t is the trajectory modality feature in the visual representation space, z_t^T is the transpose of the original trajectory modality feature z_t, and ẑ_t has dimension d_v.

h_t = φ_t(z_v ⊕ ẑ_t; θ_t)

where φ_t is the feature fusion mapper, θ_t denotes its parameters to be learned, and the visual modality potential representation h_t generated by the feature fusion mapper has dimension d_h.

③ When the visual modality feature z_v, the audio modality feature z_a and the trajectory modality feature z_t all exist, the audio information and the trajectory information are used jointly to supplement the visual information. The joint information representation is first formed:

z_at = concat(z_a, z_t)

where concat(·) is the concatenation function of the feature vectors and the joint information representation z_at has dimension d_a + d_t. The same strategy as in ① is then adopted to obtain the new visual modality potential representation when the information of all three modalities exists:

U_at = z_v^T z_at

where U_at is the correlation matrix among the three modalities, z_v^T is the transpose of z_v, and U_at has dimension d_v × (d_a + d_t).

A_at = softmax(U_at)

where A_at is the correlation score matrix among the three modalities, with dimension d_v × (d_a + d_t).

ẑ_at = A_at z_at^T

where ẑ_at is the joint representation of the audio and trajectory modalities in the visual representation space, z_at^T is the transpose of the original joint information representation, and ẑ_at has dimension d_v.

h_at = φ_at(z_v ⊕ ẑ_at; θ_at)

where φ_at is the feature fusion mapper, θ_at denotes its parameters to be learned, and the visual modality potential representation h_at generated by the feature fusion mapper has dimension d_h.
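The following is a minimal sketch of one of the subspaces of step 202 (audio complementing vision). The exact equations appear only as images in the source, so the outer-product correlation, the softmax axis and the tanh nonlinearity used here are assumptions chosen to be consistent with the stated dimensions (U_a is d_v × d_a, the projected audio feature has dimension d_v).

```python
# Minimal sketch of the audio-complemented visual subspace of step 202.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioComplementedVisual(nn.Module):
    def __init__(self, d_v, d_a, d_h):
        super().__init__()
        self.phi_v = nn.Linear(d_v, d_h)   # visual-feature-specific mapper phi_v
        self.phi_a = nn.Linear(d_v, d_h)   # feature fusion mapper phi_a

    def forward(self, z_v, z_a):           # z_v: (d_v,), z_a: (d_a,)
        h_v = torch.tanh(self.phi_v(z_v))  # potential representation of the visual modality alone
        U_a = torch.outer(z_v, z_a)        # correlation matrix, (d_v, d_a)
        A_a = F.softmax(U_a, dim=-1)       # correlation score matrix (softmax axis is an assumption)
        z_a_hat = A_a @ z_a                # audio feature mapped into the visual space, (d_v,)
        h_a = torch.tanh(self.phi_a(z_v + z_a_hat))  # element-wise addition, then fusion mapper
        return h_v, h_a
```

The trajectory-only and audio-plus-trajectory subspaces would follow the same pattern with z_t and concat(z_a, z_t) in place of z_a.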
203: Learning the consistency of the visual modality potential representations with the auto-encoding and decoding network.
The visual modality potential representations learned by the subspaces should be similar since, in theory, they all characterize the same visual content. An auto-encoding and decoding network is therefore used to project the two types of visual modality potential representations learned in step 202 into a common space as far as possible. This has two advantages: on the one hand it prevents over-fitting of the data to some extent and reduces its dimensionality, yielding a more compact visual modality potential representation; on the other hand it strengthens the effective connection among the four subspaces, making the subspace learning more meaningful. Step 202 produces two types of visual modality potential representations: the representation h_v of the visual modality alone, and the representation h_m of the visual modality under modality complementation, where m ∈ {a, t, at}. They are concatenated into the vector u, i.e. u = concat(h_v, h_m); u is then fed into the auto-encoding and decoding network to obtain the common visual modality potential representation h and the reconstructed representation û:

h = g_ae(u; W_ae),  û = g_dg(h; W_dg)

where g_ae(·) is the encoding network, g_dg(·) is the decoding network, W_ae and W_dg are the parameters to be learned of the encoding and decoding networks respectively, the common visual modality potential representation h has dimension d_u, and the reconstructed representation û has dimension 2d_h.
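A minimal sketch of this auto-encoding and decoding step might look as follows; the single-layer encoder/decoder, the ReLU activation and the squared-error reconstruction term are illustrative assumptions.

```python
# Sketch of step 203: fuse the two kinds of visual potential representations
# with an auto-encoding and decoding network.
import torch
import torch.nn as nn

class CommonRepresentationAE(nn.Module):
    def __init__(self, d_h, d_u):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * d_h, d_u), nn.ReLU())  # g_ae
        self.decoder = nn.Linear(d_u, 2 * d_h)                            # g_dg

    def forward(self, h_v, h_m):               # h_v, h_m: (d_h,) each
        u = torch.cat([h_v, h_m], dim=-1)      # concatenated vector u, (2*d_h,)
        h = self.encoder(u)                    # common visual representation h, (d_u,)
        u_hat = self.decoder(h)                # reconstructed representation, (2*d_h,)
        rec_loss = ((u - u_hat) ** 2).sum()    # reconstruction loss term L_rec
        return h, rec_loss
```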
204: Learning the label information space of the short video.
One of the key issues in the multi-label classification task is exploring label relationships. A graph attention network is constructed to explore the label correlations and compute the label matrix. To this end, the concept of a graph is first introduced. For the label set Y = {y_1, y_2, …, y_C}, consider the graph G(V, E), where V represents the set of label nodes and E ∈ |V| × |V| represents the adjacency matrix of label relationships. In particular, for any label node v_i, its neighbourhood is defined as ρ_i(j) = {j | v_j ∈ V} ∪ {v_i}. The original label matrix is V = [v_1, v_2, …, v_C], where v_C is the initial vector representation of label node C and the original feature dimension of a label is n.
(1) Building an initial graph structure
Since the initial relationships between the labels are unknown, an inverse covariance estimate is introduced: for the given label matrix V, the inverse covariance matrix S^{-1} is sought to characterize the pairwise relationships of the labels, i.e. the graph relationship function is defined as

G(V) = tr(V S^{-1} V^T)    (19)
s.t. S ≥ 0; tr(S) = 1

and is used to initialize the graph structure S. The solution of the model is the S that minimizes G(V), which is obtained from the analytical (closed-form) solution of this problem. Here tr(·) denotes the trace of a matrix and V^T denotes the transpose of the label matrix.
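The analytical solution for S is not reproduced in the text above. The sketch below uses the standard closed form S = (V^T V)^{1/2} / tr((V^T V)^{1/2}), which is the known minimizer of tr(V S^{-1} V^T) subject to S ≥ 0 and tr(S) = 1; treating this as the patent's analytical solution is an assumption.

```python
# Sketch of the graph initialisation of step 204 (the closed form is an assumption).
import torch

def init_graph_structure(V):
    """V: (n, C) label matrix whose columns are the label node vectors."""
    M = V.t() @ V                                    # (C, C), symmetric positive semidefinite
    eigvals, eigvecs = torch.linalg.eigh(M)
    sqrt_M = eigvecs @ torch.diag(eigvals.clamp(min=0).sqrt()) @ eigvecs.t()
    return sqrt_M / torch.trace(sqrt_M)              # S with tr(S) = 1, pairwise label relations
```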
(2) Graph attention learning
To learn the label node representations, a dedicated graph attention learning network is proposed, comprising two steps: node feature learning and node relation learning.
first, node feature learning. Consider the conversion of a tag matrix V input into the network into a new tag matrix
Wherein, M (·): feature mapping function applied on each label node, vj: j-th label node representation, sij: relationship score, v 'for tag i and tag j'i: new feature of tag i, v'C: label (R)C, the new characteristics of the alloy material,the dimensions of the new label features.
And secondly, learning the node relation. Inputting the new label matrix V 'learned in the first step into the graph relation function G (-) and calculating a graph structure S' under the new label matrix:
wherein, V'T: transpose of the new tag matrix. Note that: v ', S' are inputs to the next layer of the graph attention learning layer (equation-21). Thus, the model establishes 2 to 3 attention learning layers in total, and finally obtains the structured label matrix The dimension of the label matrix P is du×C。
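A sketch of one graph attention learning layer is given below: node feature learning aggregates each node's neighbourhood weighted by the relation scores s_ij and applies a shared mapping M, and node relation learning re-estimates the graph structure from the new label matrix. The ReLU nonlinearity and the exact aggregation form are assumptions; graph_fn stands for the graph relationship function G(·), for example the inverse-covariance initializer sketched above.

```python
# Sketch of one graph attention learning layer of step 204.
import torch
import torch.nn as nn

class GraphAttentionLayer(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.M = nn.Linear(dim_in, dim_out)    # feature mapping M(.) shared by all label nodes

    def forward(self, V_nodes, S, graph_fn):
        # V_nodes: (C, dim_in) label node features; S: (C, C) relation scores;
        # graph_fn: the graph relationship function G(.), e.g. the inverse-covariance
        # initializer sketched earlier.
        aggregated = S @ V_nodes                     # neighbourhood aggregation weighted by s_ij
        V_new = torch.relu(self.M(aggregated))       # node feature learning -> new label features v'_i
        S_new = graph_fn(V_new.t())                  # node relation learning on the new label matrix
        return V_new, S_new
```

Stacking two or three such layers, as described above, would yield the structured label matrix P.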
205: To obtain the label prediction scores of the short video, an information fusion scheme based on multi-head attention is proposed for the common visual modality potential representation h obtained in step 203 and the structured label matrix P obtained in step 204.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. First, the query matrix Q, the key matrix K and the value matrix V for this task are determined.
Analysing the characteristics of the short video multi-label classification task: a short video may carry several labels, i.e. the relationship between the visual feature representation and the label representation of the short video is multi-coupled, and modelling this coupling explicitly benefits the classification task. A multi-head cross-modal fusion layer is therefore proposed: the labels are queried with the common representation of the short video's visual features, the correlations are computed, and the common visual feature representation and the label matrix are aligned.
First, consider the correlation between the label representation and the visual feature representation. The relation score r_i between the common visual modality potential representation h and the i-th class label vector p_i is computed as

r_i = cos(h, p_i) = h^T p_i / (‖h‖_2 ‖p_i‖_2)

where cos(·) is the cosine similarity function, ‖·‖_2 computes the 2-norm of a vector, and h^T is the transpose of the common visual modality potential representation. This yields the relation vector r = [r_1, r_2, …, r_C] between the short video's visual feature representation and the label representation.
Inspired by the multi-head attention mechanism, a multi-head cross-modal fusion layer is proposed to compute the label representation corresponding to the visual feature representation. For the e-th head, the weighted projection H_e of the visual feature representation in label space is computed as

H_e = (W_h^e h)(W_r^e r)^T

where W_h^e denotes the projection parameters of the common visual modality potential representation, W_r^e denotes the projection parameters of the relation score vector, H_e has dimension d_k × d_k, and (·)^T computes the transpose of a matrix. The visual weighted projection H_e is then fused with the label matrix P to obtain the label representation F_e with semantic-aware attributes:

F_e = H_e (W_p^e P)

where W_p^e denotes the projection parameters of the label representation and the label representation F_e has dimension d_k × C. Finally, the outputs of the multiple attention heads are concatenated and linearly projected to obtain the label prediction scores ŷ of the short video:

ŷ = W_o concat(F_1; F_2; …; F_E)

where W_o is the linear projection matrix, concat(·) is the concatenation function, F_1; F_2; …; F_E are the label representations computed by the E heads respectively, and the prediction score ŷ has dimension C.
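A sketch of this multi-head cross-modal fusion layer is shown below. The cosine relation scores follow the text; the outer-product form of the per-head projection H_e, the shape of the output projection W_o and the final sigmoid are assumptions chosen to match the stated dimensions (H_e is d_k × d_k, F_e is d_k × C, the prediction score has dimension C).

```python
# Sketch of the multi-head cross-modal fusion layer of step 205.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadCrossModalFusion(nn.Module):
    def __init__(self, d_u, num_classes, d_k=64, num_heads=4):
        super().__init__()
        self.W_h = nn.ModuleList([nn.Linear(d_u, d_k, bias=False) for _ in range(num_heads)])
        self.W_r = nn.ModuleList([nn.Linear(num_classes, d_k, bias=False) for _ in range(num_heads)])
        self.W_p = nn.ModuleList([nn.Linear(d_u, d_k, bias=False) for _ in range(num_heads)])
        self.W_o = nn.Linear(num_heads * d_k * num_classes, num_classes)  # output projection (assumed shape)

    def forward(self, h, P):
        # h: (d_u,) common visual representation; P: (d_u, C) structured label matrix
        r = F.cosine_similarity(h.unsqueeze(1), P, dim=0)      # (C,) relation scores r_i = cos(h, p_i)
        heads = []
        for W_h, W_r, W_p in zip(self.W_h, self.W_r, self.W_p):
            H_e = torch.outer(W_h(h), W_r(r))                  # (d_k, d_k) weighted projection
            F_e = H_e @ W_p(P.t()).t()                         # (d_k, C) semantic-aware label representation
            heads.append(F_e.flatten())
        return torch.sigmoid(self.W_o(torch.cat(heads)))       # (C,) label prediction scores
```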
206: The conventional multi-label classification loss is adopted to measure the gap between the predicted label scores and the true label information:

L_cls = − Σ_{i=1}^{C} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]

where log(·) is the logarithmic function, y is the true label information of the short video, and ŷ denotes the label prediction scores of the short video.
The overall loss is L = L_cls + λ L_rec, where λ is the trade-off parameter balancing the classification loss L_cls and the reconstruction loss L_rec of the auto-encoding and decoding network.
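A sketch of the overall objective, treating the "traditional multi-label classification loss" as binary cross-entropy (an assumption consistent with the log terms mentioned above) and adding the reconstruction loss weighted by λ:

```python
# Sketch of the overall objective of step 206.
import torch
import torch.nn.functional as F

def total_loss(y_hat, y, rec_loss, lam=0.1):
    # y_hat: (C,) predicted scores in (0, 1); y: (C,) multi-hot ground-truth labels;
    # rec_loss: reconstruction loss of the auto-encoding and decoding network.
    cls_loss = F.binary_cross_entropy(y_hat, y.float())   # multi-label classification loss
    return cls_loss + lam * rec_loss                       # lambda balances the two terms
```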
Throughout the training and testing process, the performance of the model is evaluated with five indexes: Coverage, Ranking Loss, mean Average Precision (mAP), Hamming Loss and One-error, where: (1) Coverage computes how many labels are needed on average to cover all the correct labels of an instance; it is loosely related to the precision at the level of perfect recall, and a smaller value means better performance; (2) Ranking Loss computes the average fraction of reversely ordered label pairs for an instance; a smaller value means better performance; (3) mAP represents the mean of the per-category average precision; a larger value means better performance; (4) Hamming Loss measures how many times a label is mis-assigned; a smaller value means better performance; (5) One-error counts how often the label with the highest predicted probability is not in the true label set; a smaller value means better performance. (The results are shown in FIG. 3.)
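For reference, a sketch of two of the five evaluation indexes (Hamming Loss and One-error); the 0.5 threshold used for Hamming Loss is an assumption.

```python
# Sketch of two of the five evaluation indexes.
import torch

def hamming_loss(scores, labels, threshold=0.5):
    # scores, labels: (N, C); fraction of label positions predicted incorrectly
    pred = (scores >= threshold).float()
    return (pred != labels).float().mean().item()

def one_error(scores, labels):
    # fraction of samples whose top-scoring label is not in the true label set
    top = scores.argmax(dim=1)
    hit = labels.gather(1, top.unsqueeze(1)).squeeze(1)
    return (hit == 0).float().mean().item()
```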
In summary, addressing the drawbacks of short videos, namely their limited duration and insufficient information, the invention learns the common potential representation of the visual modality and the label representation from the two angles of content information and label information respectively, and finally fuses the representations of the two information spaces to obtain the label prediction scores; the whole process makes full use of the information of every modality of the short video. First, the multi-modal representation learning problem in short videos is studied and a deep multi-modal unified representation learning scheme is proposed in which the visual modality information is primary and the other modalities are auxiliary; specifically, four subspaces are constructed from the perspective of missing modalities to learn the information complementarity among modalities, the consistency of the visual modality information is further considered, and an auto-encoding and decoding network is used to learn the common potential representation of the visual modality. Then, the label information of the short video is explored and a new approach to label correlation learning is proposed from the two aspects of inverse covariance estimation and a graph attention network. Finally, a cross-modal information fusion scheme based on multi-head attention is proposed for the representations of the two information spaces to obtain the final label prediction scores.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (7)
1. A short video classification method based on a multi-modal complete feature representation, the method comprising:
for the content information of the short video itself, with the visual modality features taken as the primary ones, constructing four subspaces from the perspective of missing modalities and obtaining their potential feature representations respectively, and further fusing the potential representations of the four subspaces with an auto-encoding and decoding network so that a more robust and effective common potential representation is learned;
for the label information, adopting inverse covariance estimation and a graph attention network to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video;
proposing a cross-modal fusion scheme based on multi-head attention for the common potential representation and the label vector representation, which is used to obtain the label prediction scores of the short video;
the overall loss function of the model consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding and decoding network, measures the difference between the network output and the ground truth, and guides the network to find the optimal solution of the model.
2. The short video classification method based on the multi-modal complete feature representation as claimed in claim 1, wherein the two types of visual modality potential representations are: the potential representation of the visual modality alone, and the potential representation of the visual modality complemented by the information of the other modalities.
3. The short video classification method based on multi-modal complete feature representation according to claim 2, wherein the unique visual modality potential representation is:
h_v = φ_v(z_v; θ_v)
wherein φ_v(·) denotes the visual feature mapper, θ_v denotes the network parameters to be learned, the visual modality potential representation h_v has dimension d_h, and z_v denotes the original visual modality features.
4. The method for classifying short videos based on a complete multi-modal feature representation according to claim 3, wherein the visual modality potential representation under the complementation of different modality information is obtained as follows:
the original visual modality feature z_v and the audio modality feature ẑ_a in the visual representation space are added and fed into the feature fusion mapper φ_a, generating the visual modality potential representation supplemented with the audio modality information:
h_a = φ_a(z_v ⊕ ẑ_a; θ_a)
wherein θ_a denotes the feature fusion mapper parameters to be learned and ⊕ denotes addition of the corresponding elements of the vectors;
likewise, the visual modality potential representation supplemented with the trajectory modality information is
h_t = φ_t(z_v ⊕ ẑ_t; θ_t)
wherein φ_t is the feature fusion mapper and θ_t denotes its parameters to be learned;
when the original visual modality feature z_v, the audio modality feature z_a and the trajectory modality feature z_t all exist, the audio information and the trajectory information are used jointly to supplement the visual information, obtaining the new visual modality potential representation
h_at = φ_at(z_v ⊕ ẑ_at; θ_at)
wherein φ_at is the feature fusion mapper and θ_at denotes its parameters to be learned.
5. The method of claim 1, wherein the reconstruction loss function is:
L_rec = ‖ u − û ‖_2^2, with u = concat(h_v, h_m), h = g_ae(u; W_ae) and û = g_dg(h; W_dg),
wherein u is the concatenated vector, h is the common potential representation of the visual modality, û is the reconstructed representation, g_ae(·) is the encoding network, g_dg(·) is the decoding network, W_ae and W_dg are the parameters to be learned of the encoding and decoding networks respectively, the common visual modality potential representation h has dimension d_u, and the reconstructed representation û has dimension 2d_h.
6. The method for classifying short videos based on complete multi-modal feature representation according to claim 1, wherein the exploring the correlation between tags and updating the tag representation by using inverse covariance estimation and a graph attention network to obtain the tag vector representation corresponding to the short videos specifically comprises:
introducing an inverse covariance estimate and finding, for a given label matrix V, the inverse covariance matrix S^{-1} that characterizes the pairwise relationships of the labels;
converting the label matrix V input to the network into a new label matrix, inputting the new label matrix into the graph relation function G(·), and calculating the graph structure S′ under the new label matrix.
7. The method according to claim 1, wherein the cross-modal fusion scheme based on multi-head attention is as follows:
querying the labels with the common potential representation of the short video's visual features, calculating the correlations, and aligning the common potential representation of the visual modality and the label matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110282974.3A CN113158798A (en) | 2021-03-16 | 2021-03-16 | Short video classification method based on multi-mode feature complete representation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110282974.3A CN113158798A (en) | 2021-03-16 | 2021-03-16 | Short video classification method based on multi-mode feature complete representation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113158798A true CN113158798A (en) | 2021-07-23 |
Family
ID=76887371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110282974.3A Withdrawn CN113158798A (en) | 2021-03-16 | 2021-03-16 | Short video classification method based on multi-mode feature complete representation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113158798A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113657272A (en) * | 2021-08-17 | 2021-11-16 | 山东建筑大学 | Micro-video classification method and system based on missing data completion |
CN113743277A (en) * | 2021-08-30 | 2021-12-03 | 上海明略人工智能(集团)有限公司 | Method, system, equipment and storage medium for short video frequency classification |
CN113989697A (en) * | 2021-09-24 | 2022-01-28 | 天津大学 | Short video frequency classification method and device based on multi-mode self-supervision deep countermeasure network |
- 2021
- 2021-03-16 CN CN202110282974.3A patent/CN113158798A/en not_active Withdrawn
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112966127B (en) | Cross-modal retrieval method based on multilayer semantic alignment | |
CN113158798A (en) | Short video classification method based on multi-mode feature complete representation | |
CN112287170B (en) | Short video classification method and device based on multi-mode joint learning | |
US7996762B2 (en) | Correlative multi-label image annotation | |
CN106202256B (en) | Web image retrieval method based on semantic propagation and mixed multi-instance learning | |
CN112417097B (en) | Multi-modal data feature extraction and association method for public opinion analysis | |
Ma et al. | A weighted KNN-based automatic image annotation method | |
CN111783903B (en) | Text processing method, text model processing method and device and computer equipment | |
CN115131638B (en) | Training method, device, medium and equipment for visual text pre-training model | |
CN112800292A (en) | Cross-modal retrieval method based on modal specificity and shared feature learning | |
Gao et al. | A hierarchical recurrent approach to predict scene graphs from a visual‐attention‐oriented perspective | |
CN114065048A (en) | Article recommendation method based on multi-different-pattern neural network | |
CN113239159A (en) | Cross-modal retrieval method of videos and texts based on relational inference network | |
CN115964482A (en) | Multi-mode false news detection method based on user cognitive consistency reasoning | |
CN115618097A (en) | Entity alignment method for prior data insufficient multi-social media platform knowledge graph | |
Lu et al. | Cross-domain structure learning for visual data recognition | |
Li et al. | Deep InterBoost networks for small-sample image classification | |
Zhou et al. | Multi-modal multi-hop interaction network for dialogue response generation | |
CN116977701A (en) | Video classification model training method, video classification method and device | |
CN116189047A (en) | Short video classification method based on multi-mode information aggregation | |
CN115631008B (en) | Commodity recommendation method, device, equipment and medium | |
Jia et al. | An unsupervised person re‐identification approach based on cross‐view distribution alignment | |
Xu et al. | Hierarchical composition learning for composed query image retrieval | |
CN116414938A (en) | Knowledge point labeling method, device, equipment and storage medium | |
Zuo et al. | UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20210723 |