CN104102705A - Digital media object classification method based on large margin distributed learning - Google Patents

Digital media object classification method based on large margin distributed learning

Info

Publication number
CN104102705A
CN104102705A (application CN201410326282.4A); granted as CN104102705B
Authority
CN
China
Prior art keywords
digital media
media object
training
classification
DCD
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410326282.4A
Other languages
Chinese (zh)
Other versions
CN104102705B (en)
Inventor
周志华 (Zhi-Hua Zhou)
张腾 (Teng Zhang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201410326282.4A priority Critical patent/CN104102705B/en
Publication of CN104102705A publication Critical patent/CN104102705A/en
Application granted granted Critical
Publication of CN104102705B publication Critical patent/CN104102705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40: Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques


Abstract

The invention discloses a digital media object classification method based on large margin distribution learning, and aims to address the noise introduced when the classes of digital media objects are labeled. By maximizing the margin mean while minimizing the margin variance, the classification problem of digital media objects is ultimately reduced to a convex quadratic optimization problem. Two optimization algorithms, based respectively on dual coordinate descent (DCD) and averaged stochastic gradient descent (ASGD), are provided according to whether a nonlinear kernel function is used and to the characteristics of the training digital media object library, and users can choose between them according to their actual situation. If the user selects a nonlinear kernel function, DCD is selected as the optimization algorithm for training; if the user selects a linear kernel function and the training digital media object library has many samples or very sparse features, ASGD is selected as the optimization algorithm for training; otherwise DCD is still selected.

Description

A digital media object classification method based on large margin distribution learning
Technical field
The present invention relates to a digital media object classification method, and in particular to a digital media object classification method based on large margin distribution learning.
Background technology
Human society has now fully entered the digital age. The media currently used to disseminate information, such as images, text, video and audio, are all recorded and processed in binary-coded form, and these encoded images, texts, videos and audio files are collectively called digital media objects. Because digital media objects combine pictures, text, sound and video into rich forms of expression, they are widely used in all trades and professions, such as remote sensing and telemetry, internet sites, digital television and telephone communication. These industries accumulate large amounts of data every day; therefore, as data volumes continue to expand, how to organize and manage digital media objects effectively becomes more and more important, and the key problem is their classification. Scientific classification both facilitates the storage of these digital media objects and, in later services such as digital media retrieval, allows better results to be returned more quickly. In a digital media object classification task, each object carries a corresponding class label; these labels are usually obtained by manual annotation, so some noise is inevitably introduced. Traditional large margin classification methods such as the support vector machine (abbreviated SVM below) consider only the margins of individual samples and are therefore rather sensitive to noise, making them unsuitable for classifying digital media objects directly. Based on this observation, the present invention proposes a digital media object classification method based on large margin distribution learning: by exploiting the information of the whole margin distribution rather than the margins of individual samples, the method avoids sensitivity to noise and solves the digital media object classification problem well.
Summary of the invention
Aim of the invention: since the class labels of digital media objects usually contain much noise, the present invention, based on the idea of large margin distribution learning, proposes a digital media object classification method that is insensitive to noise. By making full use of the information of the whole margin distribution, maximizing the margin mean while minimizing the margin variance, the method avoids sensitivity to noise and solves the digital media object classification problem well.
Technical scheme: a digital media object classification method based on large margin distribution learning. First, the user prepares a digital media object library in which every digital media object carries a class label; this is the training data. Next, the training digital media objects are converted into feature representations; specifically, each training object is input into a feature extraction algorithm to obtain its feature vector. There are many feature extraction methods for digital media objects, each method yielding a corresponding feature; for example, for an image, its brightness can serve as one feature of the object and its contrast as another. Writing the total number of features as d, each digital media object thus corresponds to a vector in d-dimensional Euclidean space. All training feature vectors and their class labels are then input into the training algorithm of the classification model, and the classification model is obtained after training. In the prediction stage, the user inputs a digital media object to be predicted into the model, which outputs its predicted class label. When training the classification model, in order to overcome the noise in the class labels of digital media objects, the present invention, based on the idea of large margin distribution learning, proposes a noise-insensitive digital media object classification method, LDM: by maximizing the margin mean while minimizing the margin variance, the classification problem of digital media objects is ultimately formulated as a convex quadratic optimization problem. According to whether a nonlinear kernel function is used and to the characteristics of the training digital media object library itself (such as the number of samples and the sparsity of features), two optimization algorithm implementations are provided, one based on dual coordinate descent (abbreviated DCD below) and one based on averaged stochastic gradient descent (abbreviated ASGD below), between which the user may choose according to the actual situation. If the user selects a nonlinear kernel function, DCD is selected as the optimization algorithm during training; if the user selects a linear kernel function and the training library has many samples or very sparse features, ASGD is selected as the optimization algorithm, otherwise DCD is still selected.
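The algorithm-selection rule above can be sketched as a small decision function. The cutoffs below for "many samples" and "very sparse features" are illustrative assumptions; the patent does not fix concrete values.

```python
# Hypothetical sketch of the optimizer-selection rule described above.
# The cutoffs for "many samples" and "very sparse features" are
# illustrative assumptions; the patent does not give concrete values.

MANY_SAMPLES = 100_000   # assumed cutoff for "many samples"
SPARSE_RATIO = 0.01      # assumed cutoff (fraction of nonzero features)

def choose_optimizer(kernel: str, n_samples: int, nnz_ratio: float) -> str:
    """Return 'DCD' or 'ASGD' following the selection rule in the text."""
    if kernel != "linear":
        return "DCD"                  # nonlinear kernel: always DCD
    if n_samples >= MANY_SAMPLES or nnz_ratio <= SPARSE_RATIO:
        return "ASGD"                 # many samples or very sparse features
    return "DCD"                      # otherwise still DCD

print(choose_optimizer("rbf", 1000, 1.0))        # DCD
print(choose_optimizer("linear", 500_000, 1.0))  # ASGD
print(choose_optimizer("linear", 1000, 0.5))     # DCD
```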
Beneficial effects: compared with the prior art, the present invention makes full use of the margin information in the training digital media object library. By maximizing the margin mean while minimizing the margin variance, it overcomes the label-noise problem in digital media object classification while retaining the original advantages of the SVM, and finally achieves good classification performance.
Brief description of the drawings
Fig. 1 is a flow chart of the principle of the invention;
Fig. 2 is a flow chart of the invention;
Fig. 3 is a flow chart of training the classification model with the DCD optimization algorithm;
Fig. 4 is a flow chart of training the classification model with the ASGD optimization algorithm.
Embodiments
The invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading the invention, modifications of various equivalent forms by those skilled in the art all fall within the scope defined by the appended claims of this application.
As shown in Fig. 1, in the digital media object classification method based on large margin distribution learning, the user first prepares a digital media object library and, for each digital media object in it, obtains the corresponding class label by annotation or crowdsourcing, forming the training data. Next, the training digital media objects are converted into feature representations; specifically, each training object is input into a feature extraction algorithm to obtain its feature vector. All training feature vectors and their class labels are then input into the training algorithm of the classification model, and the classification model is obtained after training. In the prediction stage, the user inputs the digital media objects to be predicted from the test library into the classification model, and the model outputs the classification results.
The main flow of the invention is shown in Fig. 2. Step 1 is the start. Step 2 obtains the feature vector matrix X of all training digital media objects together with their class label vector y, where X is a d × m real matrix whose i-th column corresponds to the digital media object x_i, and y is an m-dimensional real vector. Step 3 accepts the user's input, which comprises the choice of optimization algorithm, the weight coefficients λ_1, λ_2 and C of the margin variance, the margin mean and the overall loss, and the kernel function parameters (none if a linear kernel is selected). Step 4 branches on the user's input: if DCD is selected as the optimization algorithm, go to step 5, described in detail in Fig. 3; if ASGD is selected, go to step 6, described in detail in Fig. 4. Step 7 uses the trained classification model to classify digital media objects that have no class label, step 8 outputs the classification results, and the procedure ends at step 9.
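Step 2 above can be illustrated with a toy sketch that builds the d × m feature matrix X and the label vector y from the two example features named earlier, brightness and contrast; the images, labels and feature choices below are hypothetical stand-ins, not data from the patent.

```python
import numpy as np

# Illustrative sketch of step 2: turning a library of digital media
# objects (here, tiny random grayscale "images") into the d x m feature
# matrix X and the class label vector y.  Brightness and contrast are
# the two example features named in the text; real systems would
# extract many more.

rng = np.random.default_rng(2)
images = [rng.random((8, 8)) for _ in range(6)]   # 6 toy "images"
labels = [+1, +1, -1, -1, +1, -1]                 # manual class labels

def extract_features(img: np.ndarray) -> np.ndarray:
    brightness = img.mean()       # feature 1: mean intensity
    contrast = img.std()          # feature 2: intensity spread
    return np.array([brightness, contrast])

# X is d x m: one column per training object, as in the text
X = np.column_stack([extract_features(img) for img in images])
y = np.array(labels, dtype=float)

print(X.shape)   # (2, 6): d = 2 features, m = 6 objects
```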
Fig. 3 explains how the classification model is trained with the DCD optimization algorithm; step 50 is the start. In step 51, the kernel matrix G is computed from the feature vector matrix X; the kernel function here is specified by the user, common choices being the RBF kernel, polynomial kernels, the Sigmoid kernel and the linear kernel, and each digital media object corresponds to one row and one column of G. In step 52, the solution β of the optimization problem is initialized to the all-zero vector, and the matrix H and the vector p are computed according to formula (1):
where Y is the diagonal matrix with y as its diagonal entries and e is the m-dimensional all-ones vector. The matrix H carries the margin-variance information and the vector p also relates to the margin mean; they are, respectively, the quadratic and linear terms of the objective function to be optimized. Step 53 judges whether β has converged, by checking whether some norm (usually the 2-norm) of the difference between the current β and the previous round's β is below a predefined threshold. If β has converged, go to step 56, output β, and training ends; otherwise go to step 54. Steps 54 and 55 are the core of DCD: since the objective after the LDM formulation is a convex quadratic function and the constraints are decoupled box constraints, DCD enjoys the advantage that one variable can be chosen at a time while the others are held fixed, so that optimizing this single variable is just the problem of minimizing a one-dimensional quadratic function over a given interval, which has a closed-form solution. Specifically, let the current solution be β; a dimension i is chosen at random as the variable to optimize, all other dimensions are fixed, and the update formula is
β_i^new = min(max(β_i - [Hβ + p]_i / h_ii, 0), C),    (2)
where [Hβ + p]_i is the i-th component of the vector Hβ + p (the gradient of the objective, whose quadratic term is H and linear term is p) and h_ii is the i-th diagonal element of H. Step 54 randomly chooses β_i as the variable to optimize, step 55 updates β_i according to formula (2), and the procedure then returns to step 53 and iterates until convergence.
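The coordinate update of steps 54 and 55 can be sketched as follows. Since formula (1) for H and p is not reproduced in this text, a generic symmetric positive-definite H and a random p stand in for the patent's actual matrices; only the update rule and the box constraint [0, C] follow formula (2).

```python
import numpy as np

# Sketch of the DCD loop (steps 53-55).  H and p below are random
# stand-ins for the quantities of formula (1), which is not reproduced
# in this text; the coordinate update itself follows formula (2).

rng = np.random.default_rng(0)
m, C = 20, 1.0
A = rng.standard_normal((m, m))
H = A @ A.T + np.eye(m)          # any symmetric positive-definite matrix
p = rng.standard_normal(m)       # stand-in linear term
beta = np.zeros(m)               # step 52: initialize to the all-zero vector

for t in range(20000):
    i = int(rng.integers(m))     # step 54: pick one coordinate at random
    grad_i = H[i] @ beta + p[i]  # i-th component of the gradient H beta + p
    # step 55: closed-form 1-D minimum, clipped to the box [0, C]
    beta[i] = min(max(beta[i] - grad_i / H[i, i], 0.0), C)

# sanity check: at a box-constrained optimum the projected gradient vanishes
grad = H @ beta + p
proj = np.where(beta <= 0.0, np.minimum(grad, 0.0),
       np.where(beta >= C, np.maximum(grad, 0.0), grad))
print(float(np.abs(proj).max()))
```

In practice the convergence test of step 53 (norm of the change in β between rounds) would replace the fixed iteration count used here.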
Fig. 4 explains how the classification model is trained with the ASGD optimization algorithm; step 60 is the start. Step 61 initializes the solution w of the optimization problem to the all-zero vector. Step 62 judges whether w has converged, by checking whether some norm (usually the 2-norm) of the difference between the current w and the previous round's w is below a predefined threshold. If w has converged, go to step 66, output w, and training ends; otherwise go to step 63. Steps 63, 64 and 65 are the core of ASGD. The key idea of ASGD is to use an unbiased estimate of the objective gradient in place of the gradient itself as the descent direction; this avoids the considerable cost of computing the full gradient when the data volume is very large, since an unbiased estimate is generally easy to compute. For an SVM, ASGD needs to sample only one example per round to obtain an unbiased estimate of the objective gradient. LDM additionally introduces the margin mean and the margin variance: an unbiased estimate of the margin-mean gradient can still be obtained from a single random sample, but an unbiased estimate of the margin-variance gradient requires two random samples. This is step 63. Suppose the two sampled objects are x_i and x_j; then an unbiased estimate of the objective gradient is obtained through formula (3),
where λ_1, λ_2 and C are, respectively, the weight coefficients of the margin variance, the margin mean and the overall loss, and the set of indices of samples incurring loss is defined accordingly. This is step 64. The step size η_t = 1/t is then set, and w is updated by formula (4), just as in gradient descent,
w_{t+1} = w_t - η_t ∇g(w_t, x_i, x_j)    (4)
This is step 65; the procedure then returns to step 62 and iterates until convergence.
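The ASGD loop of steps 62 to 65 can be sketched as follows under a linear kernel. Because formula (3) is not reproduced in this text, the gradient estimate below is an illustrative stand-in built from the quantities the text names: a single-sample margin-mean term, a pair-based margin-variance term from the two sampled objects, and a hinge-loss term. It is not the patent's exact formula, and the data, weights and stopping rule are hypothetical.

```python
import numpy as np

# Sketch of the ASGD loop (steps 61-65) with a linear kernel.  The
# gradient estimate is an illustrative stand-in for formula (3): a
# margin-mean term from sample i, a pair-based margin-variance term
# from samples (i, j), and a hinge-loss term.  Synthetic data only.

rng = np.random.default_rng(1)
d, m = 5, 200
X = rng.standard_normal((d, m))          # d x m feature matrix (step 2)
w_true = rng.standard_normal(d)          # hidden direction generating labels
y = np.sign(w_true @ X)                  # class labels in {-1, +1}

lam1, lam2, C = 0.01, 0.1, 1.0           # weights: variance, mean, loss
w = np.zeros(d)                          # step 61: all-zero initialization
w_avg = np.zeros(d)                      # the averaged iterate of ASGD

for t in range(1, 5001):
    i, j = (int(k) for k in rng.integers(m, size=2))   # step 63: two samples
    gi, gj = y[i] * X[:, i], y[j] * X[:, j]
    # illustrative gradient estimate: regularizer + pair-based variance
    # term - single-sample mean term (NOT the patent's exact formula (3))
    grad = w + lam1 * ((gi - gj) @ w) * (gi - gj) - lam2 * gi
    if y[i] * (w @ X[:, i]) < 1.0:       # hinge loss active on sample i
        grad -= C * gi
    eta = 1.0 / t                        # step 65: step size 1/t
    w -= eta * grad
    w_avg += (w - w_avg) / t             # running average of the iterates

acc = float(np.mean(np.sign(w_avg @ X) == y))
print(acc)
```

The running average w_avg is what makes this "averaged" SGD; in practice the convergence test of step 62 would replace the fixed iteration count.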

Claims (3)

1. A digital media object classification method based on large margin distribution learning, characterized in that:
first, a digital media object library containing digital media object information is established as training data, each digital media object in said library carrying a class label;
then, the training digital media objects are converted into feature representations; specifically, each training digital media object is input into a feature extraction algorithm to obtain its feature vector;
then, all training feature vectors and their class labels are input into the training algorithm of the classification model, and the classification model is obtained after training; in the prediction stage, the user inputs the digital media object to be predicted into the classification model, which outputs its predicted class label;
when training the classification model, by maximizing the margin mean while minimizing the margin variance, the classification problem of the digital media objects is ultimately formulated as a convex quadratic optimization problem; according to whether a nonlinear kernel function is used and to the characteristics of the training digital media object library itself, two optimization algorithm implementations are provided, one based on dual coordinate descent and one based on averaged stochastic gradient descent, between which the user may choose according to the actual situation; if the user selects a nonlinear kernel function, DCD is selected as the optimization algorithm during training; if the user selects a linear kernel function and the training library has many samples or very sparse features, ASGD is selected as the optimization algorithm, otherwise DCD is still selected.
2. The digital media object classification method based on large margin distribution learning according to claim 1, characterized in that:
the steps of training the classification model with the DCD optimization algorithm are:
Step 51, compute the kernel matrix G based on the feature vector matrix X, each digital media object corresponding to one row and one column of G;
Step 52, initialize the solution β of the optimization problem to the all-zero vector, and compute the matrix H and the vector p according to formula (1):
where Y is the diagonal matrix with y as its diagonal entries and e is the m-dimensional all-ones vector;
Step 53, judge whether β has converged, by checking whether some norm of the difference between the current β and the previous round's β is below a predefined threshold; if β has converged, go to step 56, output β, and training ends; otherwise go to step 54;
Step 54, let the current solution be β, and choose a dimension β_i at random as the variable to optimize, all other dimensions being fixed;
Step 55, update β_i according to formula (2),
the update formula being
β_i^new = min(max(β_i - [Hβ + p]_i / h_ii, 0), C),    (2)
then return to step 53 and iterate until convergence;
Step 56, output β; training ends.
3. The digital media object classification method based on large margin distribution learning according to claim 1, characterized in that:
the steps of training the classification model with the ASGD optimization algorithm are:
Step 61, initialize the solution w of the optimization problem to the all-zero vector;
Step 62, judge whether w has converged, by checking whether some norm of the difference between the current w and the previous round's w is below a predefined threshold; if w has converged, go to step 66, output w, and training ends; otherwise go to step 63;
Step 63, randomly sample from the training data the feature vectors x_i and x_j of two digital media objects;
Step 64, obtain an unbiased estimate of the objective gradient through formula (3),
where C is the weight coefficient of the overall loss set in advance by the user, and the set of indices of samples incurring loss is defined accordingly;
Step 65, set the step size η_t = 1/t and update w by formula (4),
w_{t+1} = w_t - η_t ∇g(w_t, x_i, x_j)    (4)
then return to step 62 and iterate until convergence;
Step 66, output w; training ends.
CN201410326282.4A 2014-07-09 2014-07-09 Digital media object classification method based on large margin distribution learning Active CN104102705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410326282.4A CN104102705B (en) 2014-07-09 2014-07-09 Digital media object classification method based on large margin distribution learning


Publications (2)

Publication Number Publication Date
CN104102705A true CN104102705A (en) 2014-10-15
CN104102705B CN104102705B (en) 2018-11-09


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203504A * 2016-07-08 2016-12-07 南京大学 A network sentiment classification method based on optimal margin distribution ridge regression
CN106203504B * 2016-07-08 2019-08-06 南京大学 A network sentiment classification method based on optimal margin distribution ridge regression
WO2018107906A1 * 2016-12-12 2018-06-21 腾讯科技(深圳)有限公司 Classification model training method, and data classification method and device
US11386353B2 2016-12-12 2022-07-12 Tencent Technology (Shenzhen) Company Limited Method and apparatus for training classification model, and method and apparatus for classifying data
CN109598284A * 2018-10-23 2019-04-09 广东交通职业技术学院 A hyperspectral image classification method based on large margin distribution and spatial features

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101419632A * 2008-12-09 2009-04-29 南京大学 Rapid feature extraction method for online digital media classification
CN103116762A (en) * 2013-03-20 2013-05-22 南京大学 Image classification method based on self-modulated dictionary learning
CN103370707A (en) * 2011-02-24 2013-10-23 瑞典爱立信有限公司 Method and server for media classification
US8924315B2 (en) * 2011-12-13 2014-12-30 Xerox Corporation Multi-task learning using bayesian model with enforced sparsity and leveraging of task correlations





Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant