CN105335350A - Language identification method based on ensemble learning - Google Patents

Language identification method based on ensemble learning Download PDF

Info

Publication number
CN105335350A
CN105335350A (application CN201510644536.1A)
Authority
CN
China
Prior art keywords
classifier
training
sample
data set
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510644536.1A
Other languages
Chinese (zh)
Inventor
冯冲
高小燕
黄河燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201510644536.1A priority Critical patent/CN105335350A/en
Publication of CN105335350A publication Critical patent/CN105335350A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention relates to a language identification method based on ensemble learning, belonging to the technical field of natural language processing and its applications. The method comprises: first, selecting a bootstrap sample from a training set D according to a preset extraction proportion parameter max_samples to obtain a training set D_b; then, based on D_b, selecting sample features according to a feature selection proportion parameter max_features and filtering D_b with the selected features to obtain a training set D_t; based on D_t, training four kinds of basic classifiers, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL); finally, combining the four kinds of basic classifiers into a stronger classifier by majority voting and using it to identify samples to be identified. Compared with the prior art, the method can identify short minority-language texts, and the accuracy is improved.

Description

Language identification method based on ensemble learning
Technical Field
The invention relates to a language identification technology of minority languages, in particular to a language identification method based on ensemble learning, and belongs to the technical field of natural language processing application.
Background
With increasing globalization, international communication is becoming ever closer, and people in various countries and regions travel frequently for economic, political, cultural, and tourism reasons. There is therefore an urgent need to break through language barriers and communicate freely, and language identification becomes more and more important. It has strong application value in speech recognition, information retrieval, machine translation, national defense, and daily life, and has gradually drawn wide attention in related research and application fields. For example, language identification can be regarded as a filtering technology: in information retrieval, texts in the languages of interest are provided to users directly, reducing the burden on the search engine.
Language identification automatically identifies the language type to which a document or a sentence belongs.
Among existing language identification technologies, methods based on N-gram models are widely applied. However, such methods are unsatisfactory for language identification on short texts or between similar languages.
In fact, language identification between similar languages is difficult, and this has linguistic roots. A country or region experiences historical changes over time, deriving several languages or even language variants similar to the original language. For example, Portuguese has two language variants, Brazilian Portuguese and European Portuguese. Similar languages or language variants therefore share many lexical and grammatical features, which makes distinguishing them all the harder.
Recently, some research work has been carried out, such as the graph-based n-gram method LIGA. However, LIGA has domain limitations: its accuracy drops once vocabulary from new domains appears. It has also been proposed to use bag-of-words models to distinguish language variants: that work converts the words of long texts into vectors with a vector space model and then classifies them. The drawback is that the vector space model usually suits long texts; on short texts the vector space is too sparse, easily causing the curse of dimensionality and poor results.
Disclosure of Invention
The invention aims to provide a Bagging-based similar-language identification method for the language identification problem of short texts in minority languages.
The invention integrates multinomial naive Bayes, random forests, support vector machines, and linear models into a stronger classifier, constructs training sets of different versions, and performs feature filtering on the data set to increase the diversity among the sub-learners, thereby effectively solving short-text language identification for similar languages.
The rough process of the invention is as follows: first, a bootstrap sample is selected from the training set D according to a preset extraction proportion parameter max_samples to obtain the training set D_b; second, based on D_b, sample features are selected according to the feature selection proportion parameter max_features, and D_b is filtered with the selected features to obtain the training set D_t; third, based on D_t, four basic classifiers are trained, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL); finally, the four basic classifiers are combined into a stronger classifier by majority voting.
The purpose of the invention is realized by the following technical scheme:
a language identification method based on ensemble learning mainly comprises the following similar language identification steps based on Bagging:
step 1, training four basic classifiers, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL), based on a training data set D by the following process:
(1)t=1;
(2) from the training data set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, selecting a bootstrap sample according to a preset extraction proportion parameter max_samples to obtain the training set D_b, where D contains n labeled instances (x_i, y_i); each instance x_i = [x_i1, x_i2, ..., x_id]^T is a vector of d features, y_i is the class to which x_i belongs, i ∈ [1, n], y_i ∈ Y, Y = {1, 2, ..., q}, and q represents the number of categories;
(3) based on the training set D_b, selecting sample features according to a preset feature selection proportion parameter max_features, and performing feature filtering based on the selected features to obtain the filtered training set D_t;
(4) based on D_t, training the four basic classifiers, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL), to obtain the t-th instance of each basic classifier, expressed as follows:
M_t = MNB(D_t);
R_t = RF(D_t);
S_t = SVM(D_t);
L_t = LM(D_t);
where M_t denotes the t-th MNBBL classifier, R_t the t-th RFBL classifier, S_t the t-th SVMBL classifier, and L_t the t-th LMBL classifier;
(5) t = t + 1; if t ≤ T, go to (2), where T is the preset number of training rounds;
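The step-1 loop above can be sketched as runnable Python. This is a minimal illustration, not the patent's implementation: names such as bagging_train and make_majority_trainer are hypothetical, and each of the four learners is replaced by a trivial majority-class stand-in so the loop runs end to end without external libraries.

```python
import random

# Trivial stand-in for a basic learner: learns only the majority class.
# The real method would train multinomial NB, random forest, SVM, and a
# linear model here; the loop structure is what this sketch demonstrates.
def make_majority_trainer():
    def train(samples):                       # samples: list of (x, y)
        labels = [y for _, y in samples]
        majority = max(set(labels), key=labels.count)
        return lambda x: majority             # classifier ignores x
    return train

def bagging_train(D, T, max_samples, max_features, trainers, seed=0):
    rng = random.Random(seed)
    n, d = len(D), len(D[0][0])
    m = int(n * max_samples)                  # bootstrap sample size
    k = int(d * max_features)                 # number of features kept
    ensembles = {name: [] for name in trainers}
    feature_masks = []
    for t in range(T):
        Db = [D[rng.randrange(n)] for _ in range(m)]    # draw with replacement
        mask = sorted(rng.sample(range(d), k))           # random feature subset
        feature_masks.append(mask)
        Dt = [([x[j] for j in mask], y) for x, y in Db]  # feature filtering
        for name, train in trainers.items():
            ensembles[name].append(train(Dt))            # t-th classifier
    return ensembles, feature_masks

# toy 2-feature data with two classes
D = [([1, 0], 1), ([0, 1], 2), ([1, 1], 1), ([0, 0], 1)]
trainers = {"MNBBL": make_majority_trainer(), "RFBL": make_majority_trainer(),
            "SVMBL": make_majority_trainer(), "LMBL": make_majority_trainer()}
ensembles, masks = bagging_train(D, T=3, max_samples=0.5,
                                 max_features=0.5, trainers=trainers)
print(len(ensembles["MNBBL"]), len(masks))   # -> 3 3
```

Each family ends up with T classifier instances, and each round stores its feature mask so the same filtering can later be applied to a sample to be identified.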
step 2, identifying the sample x to be identified by using the four basic classifiers trained in the step 1 through the following processes to obtain prediction classes of x corresponding to the four classifiers:
(1) according to the features selected for the t-th classifier, performing feature filtering on x to obtain the filtered sample to be identified x_t, t ∈ [1, T];
(2) using the t-th instance of each of the four basic classifiers to identify x_t, obtaining the identification results M_t(x_t), R_t(x_t), S_t(x_t), and L_t(x_t);
(3) applying a simple voting rule over the four basic classifiers to obtain the prediction classes y_m, y_r, y_s, and y_l of x, expressed mathematically as follows:
y_m = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = M_t(x_t));
y_r = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = R_t(x_t));
y_s = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = S_t(x_t));
y_l = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = L_t(x_t));
where I(a = b) = 1 if a = b and 0 otherwise;
step 3, combining the four basic classifiers into a stronger classifier by using an integration strategy.
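The per-family vote y_m = argmax_y Σ_t I(y = M_t(x_t)) can be sketched as follows. This is an illustrative implementation (the helper name family_vote is an assumption); ties are broken by taking the smallest label, as the "Preferably" clause below suggests.

```python
from collections import Counter

# Each of the T classifiers of one family votes on its filtered sample x_t;
# the label with the most votes wins, smallest label on ties.
def family_vote(classifiers, filtered_samples):
    votes = Counter(clf(x_t) for clf, x_t in zip(classifiers, filtered_samples))
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)

# three stand-in classifiers voting 1, 2, 1 -> label 1 wins
clfs = [lambda x: 1, lambda x: 2, lambda x: 1]
print(family_vote(clfs, [None, None, None]))   # -> 1
```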
Preferably, the training data set D in step 1 can be obtained by the following process:
(1) preparing a training corpus and preprocessing the training corpus to obtain an initial corpus set;
(2) normalizing the data set samples of the initial corpus set according to the identification target to obtain the samples of the data set;
(3) and selecting a feature space, and vectorizing the sample of the data set based on the feature space to obtain a training data set D.
Preferably, substep (3) of step 1 is further performed by:
(1) according to the feature selection proportion parameter max_features, performing feature selection on the data set and marking the selected features;
(2) according to the features selected in (1), performing feature filtering on the bootstrap samples in the training set to form a new training data set.
Preferably, the selection of the bootstrap sample and the selection of the sample characteristics in the step 1 are both selected randomly.
Preferably, when the simple voting rule in step 2 yields several categories with the same, maximal number of votes, the prediction category of x is determined according to the category priority order from low to high (the smallest category label wins).
Preferably, the integration strategy in step 3 is any one of a simple voting rule, a bayesian voting method, and an integration method based on D-S evidence theory.
Preferably, when the integration strategy is a simple voting rule, if several categories obtain the same, maximal number of votes, the final identification category of x is determined according to the basic classifier priority order from high to low: MNBBL > RFBL > SVMBL > LMBL.
Advantageous effects
The invention designs a language identification method based on ensemble learning. Compared with existing methods, it can identify short minority-language texts, and the accuracy is improved.
Drawings
Fig. 1 is a schematic flow chart of a language identification method based on ensemble learning according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples.
The implementation steps of the invention are explained in detail, taking the identification of three minority languages, Uyghur, Kazakh, and Kirgiz, as an example:
before describing the embodiments in detail, the following formalized symbols and definitions are given:
The training data set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} contains n labeled instances in total, where each instance x_i = [x_i1, x_i2, ..., x_id]^T is a d-dimensional vector and y_i is the class to which x_i belongs.
The similar language identification based on Bagging is divided into two steps, namely a data set preparation step and a language identification step.
The data set preparation steps are as follows:
step 100, preparing and preprocessing a training corpus: since there is no corpus of identification languages of the vernacular, the kazakh and the korckii that has been disclosed so far, this embodiment collects news data from the Tianshan web site and the minority website, and performs necessary preprocessing on the collected data, such as removing messy codes and special symbols, filtering out webpage mark symbols, extracting useful text information, and the like, to obtain an initial corpus.
Step 101, normalizing the data set samples according to the identification target: since this embodiment mainly targets language identification of short texts, the short-text length l is set to 5 to 60 characters; sentences matching the short-text length l are extracted from the initial corpus as the samples of the data set.
Step 102, selecting a feature space and vectorizing the samples of the data set based on it: in this embodiment, the vocabulary appearing in the data set samples is selected as the feature space; with d vocabulary words, the number of features and the size of the feature space are both d. Using a vector space model, each text sample x_i is mapped to a feature vector x_i = [x_i1, x_i2, ..., x_id]^T, where d is the vector dimension, i.e., the number of features. Assuming there are 80 words, the feature number d is 80. x_ik (k ∈ [1, d]) is the weight of the current feature word and may be set according to whether the k-th vocabulary word appears in the text sample: if it appears, the weight x_ik is recorded as 1, otherwise as 0.
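The binary bag-of-words vectorization of step 102 can be sketched as follows. Whitespace tokenization and the toy romanized words are assumptions; the patent does not fix a tokenizer.

```python
# Build the vocabulary feature space from the corpus, then map each text to a
# binary d-dimensional vector: 1 if the vocabulary word occurs, else 0.
def build_feature_space(samples):
    return sorted({w for text, _ in samples for w in text.split()})

def vectorize(text, vocab):
    words = set(text.split())
    return [1 if w in words else 0 for w in vocab]

samples = [("kitap oqu", 1), ("kitap jaz", 2)]      # hypothetical toy corpus
vocab = build_feature_space(samples)                # ['jaz', 'kitap', 'oqu']
print(vectorize("kitap oqu", vocab))                # -> [0, 1, 1]
```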
The work flow of the language identification step is shown in fig. 1.
Now assume that the training data set D obtained by the above steps contains 100 samples, i.e., n = 100, and each sample is an 80-dimensional vector, i.e., the feature number d = 80. The class y_i (i ∈ [1, n]) of a sample satisfies y_i ∈ Y, Y = {1, 2, 3, ..., q}, where q is the number of categories; q = 2 would be a binary classification problem, and in this embodiment q = 3.
Input: a training data set D, an initialized number of training rounds T, and a sample x to be recognized;
Output: the predicted class y* of sample x.
Step 200, adjusting the extraction proportion parameter max_samples and selecting a bootstrap sample from the training set as the training set D_b.
Specifically, given a training data set, a Bagging method is first used to obtain bootstrap samples of different versions.
A bootstrap sample is obtained by drawing training samples from the training set with replacement; the number of drawn samples is the product of the total number n of training samples and the extraction proportion parameter max_samples, i.e., n × max_samples, and the drawn samples form a new training set D_b. The bootstrap samples of different versions have the same size, so several training sets of different versions but equal size are finally obtained.
Because samples are drawn with replacement, the same sample may be drawn several times while other samples are drawn zero times. Thus only the drawn samples appear in the new training set, and the remaining samples do not.
Adjusting the extraction proportion parameter max_samples: the training set contains 100 samples, and when max_samples = 0.5, the algorithm randomly draws 50 samples with replacement from the full training set as the new training set D_b; at this point the number of samples contained in D_b is m = n × max_samples.
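The bootstrap draw of step 200 can be sketched as follows; bootstrap_sample is a hypothetical helper name, and the seed is only for reproducibility.

```python
import random

# Draw n * max_samples instances with replacement from the training set.
def bootstrap_sample(D, max_samples, rng):
    m = int(len(D) * max_samples)
    return [D[rng.randrange(len(D))] for _ in range(m)]

rng = random.Random(42)
D = list(range(100))                 # stand-in for 100 labeled samples
Db = bootstrap_sample(D, 0.5, rng)
print(len(Db))                       # -> 50, possibly with repeats
```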
step 201, based on training set DbAccording to preset characteristicsSample characteristics are selected according to the proportion parameter max _ features, characteristic filtering is carried out on the basis of the selected characteristics, and a training set D after the characteristic filtering is obtainedt
Step 201a, selecting a proportional parameter max _ features according to characteristics, performing characteristic selection on a data set, and marking;
adjusting a feature selection proportion parameter max _ features, wherein the feature number of the sample is 80, when max _ features is 0.8, taking 64 features of the randomly extracted sample as a new training set sample, and marking the extracted features;
step 201b, according to the features selected in step 201a, the training set D obtained in step 200 is subjected tobThe bootstrap sample in (1) is subjected to feature filtering, so that a new training data set D can be formedtAt this time DtThe number m of samples included in (1) is n max samples, and the dimension d' of each sample is d max features;
Step 202, training the multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL) basic classifiers; the training process is illustrated below, taking the training of the multinomial naive Bayes (MNBBL) basic classifier as an example:
step 202a, based on the training data set DtTraining a polynomial naive Bayes basic classifier, which is expressed as follows:
Mt=MNB(Dt)
wherein M istRepresentation based on DtTraining the obtained classifier;
because the training content of the polynomial naive Bayes classifier is the basic knowledge in the field, the description is omitted here;
step 202b, repeating the step 200, the step 201 and the step 202aT times to obtain T naive Bayes basic classifiers MtWherein, T ∈ [1, T]Representing the number of training rounds;
in a step 202c, the process is carried out,according to the feature marks made in step 201a, feature extraction is carried out on the sample x to be identified, and the sample x is marked as xt
Step 202d, based on Mt(t∈[1,T]) Using majority voting rule of simple voting rules to xtIdentification is carried out to obtain identification ym
Formally, the simple voting rule can be expressed as follows:
y_m = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = M_t(x_t))
where I(a = b) = 1 if a = b and 0 otherwise, Y denotes the set of class labels in the data set D, and y_m is the identified class label.
For the above simple voting rule, when more than one y attains the same maximal value, the smallest such y is preferably taken as y_m.
The meaning of the formula is:
1. Traverse the class labels; assume for now that q = 2, i.e., a binary problem:
when y = 1, the sum Σ_{t=1}^{T} I(y = M_t(x_t)) evaluates to a value S1;
when y = 2, the sum evaluates to a value S2.
2. Compare S1 and S2, find the maximum, and take the corresponding y as the class label predicted by the classifier:
for example, when S1 > S2, the prediction class label of the naive Bayes basic classifier is 1.
The random forest, support vector machine, and linear model classifiers are trained in the same way, finally yielding T random forest basic classifiers, T support vector machine basic classifiers, and T linear model basic classifiers. For an input sample x to be recognized, each classifier predicts a class label: the prediction class of MNBBL is denoted y_m, that of RFBL y_r, that of SVMBL y_s, and that of LMBL y_l.
Step 203, the basic classifiers are combined into a stronger classifier by using an integration strategy:
the integration strategy can be any one of a simple voting rule, a Bayesian voting method and an integration mode based on a D-S evidence theory.
The embodiment utilizes the majority voting rule in the simple voting rule as an integration strategy, and combines the basic classifiers into a stronger classifier by using minority obedience majority.
According to the prediction results of step 202, if y_m = 1, y_r = 2, y_s = 1, y_l = 1, then the number of classifiers predicting class 1 is greater than the number predicting class 2 (3 > 1), so the final identification result y* is 1.
For the integration strategy in this embodiment, when more than one prediction result is tied at the maximum, the category is obtained according to the priority order of MNBBL, RFBL, SVMBL, and LMBL, i.e., priority MNBBL > RFBL > SVMBL > LMBL; an example follows:
If y_m = 2, y_r = 1, y_s = 2, y_l = 1, the vote counts for recognition results 1 and 2 are the same (both 2) and both maximal; since MNBBL has the highest priority, its recognition result is taken as the final result, i.e., the final recognition result y* is 2.
The invention is further illustrated below using a specific example.
(1) Experimental suite
The invention can be used for language identification of similar languages or language variants. The experimental evaluation focuses on analysis and research in three similar languages: Uyghur, Kirgiz, and Kazakh. Uyghur, Kazakh, and Kirgiz belong to the Turkic branch of the Altaic language family, and most letters in their alphabets are identical, so the characters of the three languages fall in similar Unicode ranges, which makes the three languages difficult to distinguish.
Because no language identification corpora for Uyghur, Kazakh, and Kirgiz have been published, the method crawls data from the Tianshan website and Kirgiz news websites as experimental data sets.
(2) Evaluation method
In these experiments, the present invention was experimentally evaluated using F1 values.
F1 = 2 × P × R / (P + R)
where P represents precision and R represents recall.
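The F1 value is the harmonic mean of precision and recall; as a sketch:

```python
# F1 = 2*P*R / (P + R); returns 0.0 when both P and R are zero.
def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(round(f1(0.8, 0.5), 4))   # -> 0.6154
```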
(3) Baseline method
The present invention uses a multinomial naive Bayes classifier (MNB) and an N-gram model as baseline methods. The baseline methods are likewise evaluated with F1 values.
(4) Results of the experiment
Table 1 compares the experimental results of the baseline methods, the basic classifiers, and the Bagging method. The embodiment of the invention is compared with the baseline methods MNB and N-gram and with the other three basic learners LMBL, SVMBL, and RFBL. As the results in Table 1 show, the proposed method is significantly superior to the baseline methods and the basic classifiers. On Kazakh, the accuracy of the Bagging method is not higher than that of the other methods, but it equals that of MNB, the most accurate, namely 0.656. In ensemble learning, if the accuracy of the basic learners falls below 0.5, the ensemble method is not applicable; we therefore speculate that the reason our approach does not help on Kazakh may be the lower accuracy of the basic classifiers.
As can be seen from Table 1, the F1 value is lowest on Kazakh and highest on Uyghur, probably because the similarity between Uyghur and Kirgiz is the highest and that between Kazakh and Kirgiz is the lowest.
TABLE 1 comparison of experimental results (Whole basic learner)
Table 2 compares the experimental results among all basic learners except LMBL, the baseline methods, and the Bagging method. Table 3 compares the results among all basic learners except SVMBL, the baseline methods, and the Bagging method. Table 4 compares the results among all basic learners except RFBL, the baseline methods, and the Bagging method.
TABLE 2 comparison of experimental results (basic learner except LMBL)
TABLE 3 comparison of experimental results (basic learner except SVMBL)
TABLE 4 comparison of experimental results (basic learner except RFBL)
Combining Tables 1, 2, 3, and 4, it can be seen that when one of the basic learners is removed, the F1 value of the Bagging method decreases. This shows that the Bagging method needs a sufficient number of suitable basic classifiers to better improve the language identification effect.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined by the appended claims and their equivalents.

Claims (7)

1. A language identification method based on ensemble learning is characterized by comprising the following steps:
step 1, training four basic classifiers, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL), based on a training data set D by the following process:
(1)t=1;
(2) from the training data set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, selecting a bootstrap sample according to a preset extraction proportion parameter max_samples to obtain the training set D_b, where D contains n labeled instances (x_i, y_i); each instance x_i = [x_i1, x_i2, ..., x_id]^T is a vector of d features, y_i is the class to which x_i belongs, i ∈ [1, n], y_i ∈ Y, Y = {1, 2, ..., q}, and q represents the number of categories;
(3) based on the training set D_b, selecting sample features according to a preset feature selection proportion parameter max_features, and performing feature filtering based on the selected features to obtain the filtered training set D_t;
(4) based on D_t, training the four basic classifiers, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL), to obtain the t-th instance of each basic classifier, expressed as follows:
M_t = MNB(D_t);
R_t = RF(D_t);
S_t = SVM(D_t);
L_t = LM(D_t);
where M_t denotes the t-th MNBBL classifier, R_t the t-th RFBL classifier, S_t the t-th SVMBL classifier, and L_t the t-th LMBL classifier;
(5) t = t + 1; if t ≤ T, go to (2), where T is the preset number of training rounds;
step 2, identifying the sample x to be identified by using the four basic classifiers trained in the step 1 through the following processes to obtain prediction classes of x corresponding to the four classifiers:
(1) according to the features selected for the t-th classifier, performing feature filtering on x to obtain the filtered sample to be identified x_t, t ∈ [1, T];
(2) using the t-th instance of each of the four basic classifiers to identify x_t, obtaining the identification results M_t(x_t), R_t(x_t), S_t(x_t), and L_t(x_t);
(3) applying a simple voting rule over the four basic classifiers to obtain the prediction classes y_m, y_r, y_s, and y_l of x, expressed mathematically as follows:
y_m = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = M_t(x_t));
y_r = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = R_t(x_t));
y_s = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = S_t(x_t));
y_l = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = L_t(x_t));
where I(a = b) = 1 if a = b and 0 otherwise;
step 3, combining the four basic classifiers into a stronger classifier by using an integration strategy.
2. The language identification method based on ensemble learning as claimed in claim 1, wherein the training data set D in step 1 is obtained by:
(1) preparing a training corpus and preprocessing the training corpus to obtain an initial corpus set;
(2) normalizing the data set samples of the initial corpus set according to the identification target to obtain the samples of the data set;
(3) and selecting a feature space, and vectorizing the sample of the data set based on the feature space to obtain a training data set D.
3. The language identification method based on ensemble learning according to claim 1, wherein the sub-step (3) in step 1 is further completed by the following processes:
(1) according to the feature selection proportion parameter max_features, performing feature selection on the data set and marking the selected features;
(2) according to the features selected in (1), performing feature filtering on the bootstrap samples in the training set D_b, forming a new training data set D_t.
4. The language identification method based on ensemble learning as claimed in claim 1, wherein the selection of bootstrap samples and the selection of sample features in step 1 are both selected randomly.
5. The language identification method based on ensemble learning as claimed in claim 1, wherein when the simple voting rule in step 2 yields several categories with the same, maximal number of votes, the prediction category of x is determined according to the category priority order from low to high (the smallest category label wins).
6. The language identification method based on ensemble learning according to any one of claims 1-5, wherein the integration strategy in step 3 is any one of simple voting rules, Bayesian voting methods, and integration methods based on D-S evidence theory.
7. The language identification method based on ensemble learning as claimed in claim 6, wherein when the integration strategy is a simple voting rule, if several categories obtain the same, maximal number of votes, the final identification category of x is determined according to the basic classifier priority order from high to low: MNBBL > RFBL > SVMBL > LMBL.
CN201510644536.1A 2015-10-08 2015-10-08 Language identification method based on ensemble learning Pending CN105335350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510644536.1A CN105335350A (en) 2015-10-08 2015-10-08 Language identification method based on ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510644536.1A CN105335350A (en) 2015-10-08 2015-10-08 Language identification method based on ensemble learning

Publications (1)

Publication Number Publication Date
CN105335350A true CN105335350A (en) 2016-02-17

Family

ID=55285895


Country Status (1)

Country Link
CN (1) CN105335350A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090125463A1 (en) * 2007-11-13 2009-05-14 Shohei Hido Technique for classifying data
CN102298646A (en) * 2011-09-21 2011-12-28 Soochow University Method and device for classifying subjective text and objective text
CN102789498A (en) * 2012-07-16 2012-11-21 Qian Gang Method and system for sentiment classification of Chinese comment text based on ensemble learning
CN103034626A (en) * 2012-12-26 2013-04-10 Shanghai Jiao Tong University Sentiment analysis system and method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YAO, PEIJUN: "Research on Ensemble Algorithms Based on Naive Bayes", China Master's Theses Full-text Database, Information Science and Technology *
DENG, SHENGXIONG ET AL.: "A Classification Model Integrating Random Forests", Application Research of Computers *
CHEN, YAOLING ET AL.: "Language Identification Based on Multi-feature and Multi-classifier Fusion", Microcomputer Information *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106886799A (en) * 2017-03-17 2017-06-23 Northeastern University Online quality detection method for continuous-annealing strip steel based on hybrid ensemble learning
CN106886799B (en) * 2017-03-17 2019-08-02 Northeastern University Online quality detection method for continuous-annealing strip steel based on hybrid ensemble learning
CN108021941B (en) * 2017-11-30 2020-08-28 Sichuan University Method and device for predicting drug-induced hepatotoxicity
CN108021941A (en) * 2017-11-30 2018-05-11 Sichuan University Method and device for predicting drug-induced hepatotoxicity
CN108090788A (en) * 2017-12-22 2018-05-29 Soochow University Advertisement conversion rate estimation method based on a temporal-information ensemble model
CN108090788B (en) * 2017-12-22 2021-04-20 Soochow University Advertisement conversion rate estimation method based on a temporal-information ensemble model
CN108564094A (en) * 2018-04-24 2018-09-21 Hebei Zhilin Information Technology Co., Ltd. Material identification method based on combining convolutional neural networks and classifiers
CN108564094B (en) * 2018-04-24 2021-09-14 Hebei Zhilin Information Technology Co., Ltd. Material identification method based on combining convolutional neural networks and classifiers
CN109857862A (en) * 2019-01-04 2019-06-07 Ping An Technology (Shenzhen) Co., Ltd. Text classification method, device, server and medium based on intelligent decision
CN109857862B (en) * 2019-01-04 2024-04-19 Ping An Technology (Shenzhen) Co., Ltd. Text classification method, device, server and medium based on intelligent decision
CN111636932A (en) * 2020-04-23 2020-09-08 Tianjin University Online blade-crack measurement method based on blade-tip timing and an ensemble learning algorithm
CN114462397A (en) * 2022-01-20 2022-05-10 Lianlian (Hangzhou) Information Technology Co., Ltd. Language identification model training method, language identification method and device, and electronic equipment
CN114462397B (en) * 2022-01-20 2023-09-22 Lianlian (Hangzhou) Information Technology Co., Ltd. Language identification model training method, language identification method and device, and electronic equipment

Similar Documents

Publication Publication Date Title
Effrosynidis et al. A comparison of pre-processing techniques for twitter sentiment analysis
Chowdhury et al. Performing sentiment analysis in Bangla microblog posts
Le et al. Twitter sentiment analysis using machine learning techniques
CN105335350A (en) Language identification method based on ensemble learning
US8731904B2 (en) Apparatus and method for extracting and analyzing opinion in web document
CN101520802A (en) Question-answer pair quality evaluation method and system
CN107122349A (en) Text feature word extraction method based on word2vec-LDA models
Solakidis et al. Multilingual sentiment analysis using emoticons and keywords
Abbasi et al. Applying authorship analysis to Arabic web content
CN103995853A (en) Multi-language emotional data processing and classifying method and system based on key sentences
Hasanuzzaman et al. Demographic word embeddings for racism detection on Twitter
CN107463703A (en) English social media account classification method based on information gain
Ljubešić et al. Discriminating between closely related languages on twitter
Veena et al. An effective way of word-level language identification for code-mixed facebook comments using word-embedding via character-embedding
Espinosa et al. Bots and Gender Profiling using Character Bigrams.
CN107463715A (en) English social media account classification method based on information gain
Khan et al. Harnessing english sentiment lexicons for polarity detection in urdu tweets: A baseline approach
CN112711666B (en) Futures label extraction method and device
Hussain et al. A technique for perceiving abusive bangla comments
Chumwatana COMMENT ANALYSIS FOR PRODUCT AND SERVICE SATISFACTION FROM THAI CUSTOMERS' REVIEW IN SOCIAL NETWORK
Sigit et al. Comparison of Classification Methods on Sentiment Analysis of Political Figure Electability Based on Public Comments on Online News Media Sites
Baniata et al. Sentence representation network for Arabic sentiment analysis
Schmid et al. FoSIL-Offensive language classification of German tweets combining SVMs and deep learning techniques.
Chen et al. Learning the chinese sentence representation with LSTM autoencoder
KR102569381B1 (en) System and Method for Machine Reading Comprehension to Table-centered Web Documents

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160217
