CN105335350A - Language identification method based on ensemble learning - Google Patents

Language identification method based on ensemble learning Download PDF

Info

Publication number
CN105335350A
CN105335350A (application CN201510644536.1A)
Authority
CN
China
Prior art keywords
classifier
training
sample
data set
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510644536.1A
Other languages
Chinese (zh)
Inventor
冯冲
高小燕
黄河燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201510644536.1A priority Critical patent/CN105335350A/en
Publication of CN105335350A publication Critical patent/CN105335350A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention relates to a language identification method based on ensemble learning, belonging to the technical field of natural language processing and its applications. The method comprises: first, selecting a bootstrap sample from a training set D according to a preset extraction proportion parameter max_samples to obtain a training set D_b; then, based on D_b, selecting sample features according to a feature selection proportion parameter max_features and filtering D_b with the selected features to obtain a training set D_t; based on D_t, training four kinds of basic classifiers, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL); finally, combining the four kinds of basic classifiers into a stronger classifier by majority voting and using it to identify samples to be identified. Compared with the prior art, the method can identify short minority-language texts, and the accuracy is improved.

Description

Language identification method based on ensemble learning
Technical Field
The invention relates to a language identification technology of minority languages, in particular to a language identification method based on ensemble learning, and belongs to the technical field of natural language processing application.
Background
With increasing globalization, international communication is becoming ever closer, and people in various countries and regions travel frequently for economic, political, cultural, and tourism reasons. There is therefore an urgent need to break through language barriers and communicate freely, and language identification becomes more and more important. It has strong application value in speech recognition, information retrieval, machine translation, national defense, and daily life, and has gradually drawn wide attention in related research and application fields. For example, language identification can be regarded as a filtering technology: in information retrieval, texts in the languages of interest are provided to users directly, reducing the burden on the search engine.
Language identification automatically identifies the language type to which a document or a sentence belongs.
Among existing language identification technologies, methods based on N-gram models are widely applied. However, such methods are unsatisfactory for language identification on short texts or between similar languages.
In fact, language identification between similar languages is difficult, and this has linguistic roots. A country or region experiences historical changes over time, deriving several languages or even language variants similar to the original language. For example, Portuguese has two language variants, Brazilian Portuguese and European Portuguese. Similar languages or language variants therefore share many lexical and grammatical features, which makes distinguishing them all the harder.
Recently, some research work has been carried out, such as the graph-based n-gram method LIGA. However, LIGA has domain limitations: its accuracy drops once vocabulary from new domains appears. It has also been proposed to use bag-of-words models to distinguish language variants: that work converts the words of long texts into vectors with a vector space model and then classifies them. The drawback is that the vector space model usually suits long texts; on short texts the vector space is too sparse, easily causing the curse of dimensionality and poor results.
Disclosure of Invention
The invention aims to provide a Bagging-based similar-language identification method for the language identification problem of short texts in minority languages.
The invention integrates multinomial naive Bayes, random forests, support vector machines, and linear models into a stronger classifier, constructs training sets of different versions, and performs feature filtering on the data set to increase the diversity among the sub-learners, thereby effectively solving short-text language identification for similar languages.
The rough process of the invention is as follows: first, a bootstrap sample is selected from the training set D according to a preset extraction proportion parameter max_samples to obtain the training set D_b; second, based on D_b, sample features are selected according to the feature selection proportion parameter max_features, and D_b is filtered with the selected features to obtain the training set D_t; third, based on D_t, four basic classifiers are trained, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL); finally, the four basic classifiers are combined into a stronger classifier by majority voting.
The purpose of the invention is realized by the following technical scheme:
a language identification method based on ensemble learning mainly comprises the following similar language identification steps based on Bagging:
step 1, training four basic classifiers, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL), based on a training data set D by the following process:
(1)t=1;
(2) from the training data set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, selecting a bootstrap sample according to a preset extraction proportion parameter max_samples to obtain the training set D_b, where D contains n labeled instances (x_i, y_i); each instance x_i = [x_i1, x_i2, ..., x_id]^T is a vector of d features, y_i is the class to which x_i belongs, i ∈ [1, n], y_i ∈ Y, Y = {1, 2, ..., q}, and q represents the number of categories;
(3) based on the training set D_b, selecting sample features according to a preset feature selection proportion parameter max_features, and performing feature filtering based on the selected features to obtain the filtered training set D_t;
(4) based on D_t, training the four basic classifiers, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL), to obtain the t-th instance of each basic classifier, expressed as follows:
M_t = MNB(D_t);
R_t = RF(D_t);
S_t = SVM(D_t);
L_t = LM(D_t);
where M_t denotes the t-th MNBBL classifier, R_t the t-th RFBL classifier, S_t the t-th SVMBL classifier, and L_t the t-th LMBL classifier;
(5) t = t + 1; if t ≤ T, go to (2), where T is the preset number of training rounds;
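The step-1 loop above can be sketched as runnable Python. This is a minimal illustration, not the patent's implementation: names such as bagging_train and make_majority_trainer are hypothetical, and each of the four learners is replaced by a trivial majority-class stand-in so the loop runs end to end without external libraries.

```python
import random

# Trivial stand-in for a basic learner: learns only the majority class.
# The real method would train multinomial NB, random forest, SVM, and a
# linear model here; the loop structure is what this sketch demonstrates.
def make_majority_trainer():
    def train(samples):                       # samples: list of (x, y)
        labels = [y for _, y in samples]
        majority = max(set(labels), key=labels.count)
        return lambda x: majority             # classifier ignores x
    return train

def bagging_train(D, T, max_samples, max_features, trainers, seed=0):
    rng = random.Random(seed)
    n, d = len(D), len(D[0][0])
    m = int(n * max_samples)                  # bootstrap sample size
    k = int(d * max_features)                 # number of features kept
    ensembles = {name: [] for name in trainers}
    feature_masks = []
    for t in range(T):
        Db = [D[rng.randrange(n)] for _ in range(m)]    # draw with replacement
        mask = sorted(rng.sample(range(d), k))           # random feature subset
        feature_masks.append(mask)
        Dt = [([x[j] for j in mask], y) for x, y in Db]  # feature filtering
        for name, train in trainers.items():
            ensembles[name].append(train(Dt))            # t-th classifier
    return ensembles, feature_masks

# toy 2-feature data with two classes
D = [([1, 0], 1), ([0, 1], 2), ([1, 1], 1), ([0, 0], 1)]
trainers = {"MNBBL": make_majority_trainer(), "RFBL": make_majority_trainer(),
            "SVMBL": make_majority_trainer(), "LMBL": make_majority_trainer()}
ensembles, masks = bagging_train(D, T=3, max_samples=0.5,
                                 max_features=0.5, trainers=trainers)
print(len(ensembles["MNBBL"]), len(masks))   # -> 3 3
```

Each family ends up with T classifier instances, and each round stores its feature mask so the same filtering can later be applied to a sample to be identified.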
step 2, identifying the sample x to be identified by using the four basic classifiers trained in the step 1 through the following processes to obtain prediction classes of x corresponding to the four classifiers:
(1) according to the features selected for the t-th classifier, performing feature filtering on x to obtain the filtered sample to be identified x_t, t ∈ [1, T];
(2) using the t-th instance of each of the four basic classifiers to identify x_t, obtaining the identification results M_t(x_t), R_t(x_t), S_t(x_t), and L_t(x_t);
(3) applying a simple voting rule over the four basic classifiers to obtain the prediction classes y_m, y_r, y_s, and y_l of x, expressed mathematically as follows:
y_m = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = M_t(x_t));
y_r = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = R_t(x_t));
y_s = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = S_t(x_t));
y_l = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = L_t(x_t));
where I(a = b) = 1 if a = b and 0 otherwise;
step 3, combining the four basic classifiers into a stronger classifier by using an integration strategy.
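The per-family vote y_m = argmax_y Σ_t I(y = M_t(x_t)) can be sketched as follows. This is an illustrative implementation (the helper name family_vote is an assumption); ties are broken by taking the smallest label, as the "Preferably" clause below suggests.

```python
from collections import Counter

# Each of the T classifiers of one family votes on its filtered sample x_t;
# the label with the most votes wins, smallest label on ties.
def family_vote(classifiers, filtered_samples):
    votes = Counter(clf(x_t) for clf, x_t in zip(classifiers, filtered_samples))
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)

# three stand-in classifiers voting 1, 2, 1 -> label 1 wins
clfs = [lambda x: 1, lambda x: 2, lambda x: 1]
print(family_vote(clfs, [None, None, None]))   # -> 1
```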
Preferably, the training data set D in step 1 can be obtained by the following process:
(1) preparing a training corpus and preprocessing the training corpus to obtain an initial corpus set;
(2) normalizing the data set samples of the initial corpus set according to the identification target to obtain the samples of the data set;
(3) and selecting a feature space, and vectorizing the sample of the data set based on the feature space to obtain a training data set D.
Preferably, substep (3) of step 1 is further performed by:
(1) according to the feature selection proportion parameter max_features, performing feature selection on the data set and marking the selected features;
(2) according to the features selected in (1), performing feature filtering on the bootstrap samples in the training set to form a new training data set.
Preferably, the selection of the bootstrap sample and the selection of the sample characteristics in the step 1 are both selected randomly.
Preferably, when the simple voting rule in step 2 yields several categories with the same, maximal number of votes, the prediction category of x is determined according to the category priority order from low to high (the smallest category label wins).
Preferably, the integration strategy in step 3 is any one of a simple voting rule, a bayesian voting method, and an integration method based on D-S evidence theory.
Preferably, when the integration strategy is a simple voting rule, if several categories obtain the same, maximal number of votes, the final identification category of x is determined according to the basic classifier priority order from high to low: MNBBL > RFBL > SVMBL > LMBL.
Advantageous effects
The invention designs a language identification method based on ensemble learning. Compared with existing methods, it can identify short minority-language texts, and the accuracy is improved.
Drawings
Fig. 1 is a schematic flow chart of a language identification method based on ensemble learning according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples.
The implementation steps of the invention are explained in detail, taking the identification of three minority languages, Uyghur, Kazakh, and Kirgiz, as an example:
before describing the embodiments in detail, the following formalized symbols and definitions are given:
The training data set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} contains n labeled instances in total, where each instance x_i = [x_i1, x_i2, ..., x_id]^T is a d-dimensional vector and y_i is the class to which x_i belongs.
The similar language identification based on Bagging is divided into two steps, namely a data set preparation step and a language identification step.
The data set preparation steps are as follows:
step 100, preparing and preprocessing a training corpus: since there is no corpus of identification languages of the vernacular, the kazakh and the korckii that has been disclosed so far, this embodiment collects news data from the Tianshan web site and the minority website, and performs necessary preprocessing on the collected data, such as removing messy codes and special symbols, filtering out webpage mark symbols, extracting useful text information, and the like, to obtain an initial corpus.
Step 101, normalizing the data set samples according to the identification target: since this embodiment mainly targets language identification of short texts, the short-text length l is set to 5 to 60 characters; sentences matching the short-text length l are extracted from the initial corpus as the samples of the data set.
Step 102, selecting a feature space and vectorizing the samples of the data set based on it: in this embodiment, the vocabulary appearing in the data set samples is selected as the feature space; with d vocabulary words, the number of features and the size of the feature space are both d. Using a vector space model, each text sample x_i is mapped to a feature vector x_i = [x_i1, x_i2, ..., x_id]^T, where d is the vector dimension, i.e., the number of features. Assuming there are 80 words, the feature number d is 80. x_ik (k ∈ [1, d]) is the weight of the current feature word and may be set according to whether the k-th vocabulary word appears in the text sample: if it appears, the weight x_ik is recorded as 1, otherwise as 0.
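The binary bag-of-words vectorization of step 102 can be sketched as follows. Whitespace tokenization and the toy romanized words are assumptions; the patent does not fix a tokenizer.

```python
# Build the vocabulary feature space from the corpus, then map each text to a
# binary d-dimensional vector: 1 if the vocabulary word occurs, else 0.
def build_feature_space(samples):
    return sorted({w for text, _ in samples for w in text.split()})

def vectorize(text, vocab):
    words = set(text.split())
    return [1 if w in words else 0 for w in vocab]

samples = [("kitap oqu", 1), ("kitap jaz", 2)]      # hypothetical toy corpus
vocab = build_feature_space(samples)                # ['jaz', 'kitap', 'oqu']
print(vectorize("kitap oqu", vocab))                # -> [0, 1, 1]
```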
The work flow of the language identification step is shown in fig. 1.
Now assume that the training data set D obtained by the above steps contains 100 samples, i.e., n = 100, and each sample is an 80-dimensional vector, i.e., the feature number d = 80. The class y_i (i ∈ [1, n]) of a sample satisfies y_i ∈ Y, Y = {1, 2, 3, ..., q}, where q is the number of categories; q = 2 would be a binary classification problem, and in this embodiment q = 3.
Input: a training data set D, an initialized number of training rounds T, and a sample x to be recognized;
Output: the predicted class y* of sample x.
Step 200, adjusting the extraction proportion parameter max_samples and selecting a bootstrap sample from the training set as the training set D_b.
Specifically, given a training data set, a Bagging method is first used to obtain bootstrap samples of different versions.
A bootstrap sample is obtained by drawing training samples from the training set with replacement; the number of drawn samples is the product of the total number n of training samples and the extraction proportion parameter max_samples, i.e., n × max_samples, and the drawn samples form a new training set D_b. The bootstrap samples of different versions have the same size, so several training sets of different versions but equal size are finally obtained.
Because samples are drawn with replacement, the same sample may be drawn several times while other samples are drawn zero times. Thus only the drawn samples appear in the new training set, and the remaining samples do not.
Adjusting the extraction proportion parameter max_samples: the training set contains 100 samples, and when max_samples = 0.5, the algorithm randomly draws 50 samples with replacement from the full training set as the new training set D_b; at this point the number of samples contained in D_b is m = n × max_samples.
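The bootstrap draw of step 200 can be sketched as follows; bootstrap_sample is a hypothetical helper name, and the seed is only for reproducibility.

```python
import random

# Draw n * max_samples instances with replacement from the training set.
def bootstrap_sample(D, max_samples, rng):
    m = int(len(D) * max_samples)
    return [D[rng.randrange(len(D))] for _ in range(m)]

rng = random.Random(42)
D = list(range(100))                 # stand-in for 100 labeled samples
Db = bootstrap_sample(D, 0.5, rng)
print(len(Db))                       # -> 50, possibly with repeats
```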
step 201, based on training set DbAccording to preset characteristicsSample characteristics are selected according to the proportion parameter max _ features, characteristic filtering is carried out on the basis of the selected characteristics, and a training set D after the characteristic filtering is obtainedt
Step 201a, selecting a proportional parameter max _ features according to characteristics, performing characteristic selection on a data set, and marking;
adjusting a feature selection proportion parameter max _ features, wherein the feature number of the sample is 80, when max _ features is 0.8, taking 64 features of the randomly extracted sample as a new training set sample, and marking the extracted features;
step 201b, according to the features selected in step 201a, the training set D obtained in step 200 is subjected tobThe bootstrap sample in (1) is subjected to feature filtering, so that a new training data set D can be formedtAt this time DtThe number m of samples included in (1) is n max samples, and the dimension d' of each sample is d max features;
Step 202, training the multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL) basic classifiers; the training process is illustrated below, taking the training of the multinomial naive Bayes (MNBBL) basic classifier as an example:
step 202a, based on the training data set DtTraining a polynomial naive Bayes basic classifier, which is expressed as follows:
Mt=MNB(Dt)
wherein M istRepresentation based on DtTraining the obtained classifier;
because the training content of the polynomial naive Bayes classifier is the basic knowledge in the field, the description is omitted here;
step 202b, repeating the step 200, the step 201 and the step 202aT times to obtain T naive Bayes basic classifiers MtWherein, T ∈ [1, T]Representing the number of training rounds;
in a step 202c, the process is carried out,according to the feature marks made in step 201a, feature extraction is carried out on the sample x to be identified, and the sample x is marked as xt
Step 202d, based on Mt(t∈[1,T]) Using majority voting rule of simple voting rules to xtIdentification is carried out to obtain identification ym
Formally, the simple voting rule can be expressed as follows:
y_m = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = M_t(x_t))
where I(a = b) = 1 if a = b and 0 otherwise, Y denotes the set of class labels in the data set D, and y_m is the identified class label.
For the above simple voting rule, when more than one y attains the same maximal value, the smallest such y is preferably taken as y_m.
The meaning of the formula is:
1. Traverse the class labels; assume for now that q = 2, i.e., a binary problem:
when y = 1, the sum Σ_{t=1}^{T} I(y = M_t(x_t)) evaluates to a value S1;
when y = 2, the sum evaluates to a value S2.
2. Compare S1 and S2, find the maximum, and take the corresponding y as the class label predicted by the classifier:
for example, when S1 > S2, the prediction class label of the naive Bayes basic classifier is 1.
The random forest, support vector machine, and linear model classifiers are trained in the same way, finally yielding T random forest basic classifiers, T support vector machine basic classifiers, and T linear model basic classifiers. For an input sample x to be recognized, each classifier predicts a class label: the prediction class of MNBBL is denoted y_m, that of RFBL y_r, that of SVMBL y_s, and that of LMBL y_l.
Step 203, the basic classifiers are combined into a stronger classifier by using an integration strategy:
the integration strategy can be any one of a simple voting rule, a Bayesian voting method and an integration mode based on a D-S evidence theory.
The embodiment utilizes the majority voting rule in the simple voting rule as an integration strategy, and combines the basic classifiers into a stronger classifier by using minority obedience majority.
According to the prediction results of step 202, if y_m = 1, y_r = 2, y_s = 1, y_l = 1, then the number of classifiers predicting class 1 is greater than the number predicting class 2 (3 > 1), so the final identification result y* is 1.
For the integration strategy in this embodiment, when more than one prediction result is tied at the maximum, the category is obtained according to the priority order of MNBBL, RFBL, SVMBL, and LMBL, i.e., priority MNBBL > RFBL > SVMBL > LMBL; an example follows:
If y_m = 2, y_r = 1, y_s = 2, y_l = 1, the vote counts for recognition results 1 and 2 are the same (both 2) and both maximal; since MNBBL has the highest priority, its recognition result is taken as the final result, i.e., the final recognition result y* is 2.
The invention is further illustrated below using a specific example.
(1) Experimental suite
The invention can be used for language identification of similar languages or language variants. The experimental evaluation focuses on analysis and research in three similar languages: Uyghur, Kirgiz, and Kazakh. Uyghur, Kazakh, and Kirgiz belong to the Turkic branch of the Altaic language family, and most letters in their alphabets are identical, so the characters of the three languages fall in similar Unicode ranges, which makes the three languages difficult to distinguish.
Because no language identification corpora for Uyghur, Kazakh, and Kirgiz have been published, the method crawls data from the Tianshan website and Kirgiz news websites as experimental data sets.
(2) Evaluation method
In these experiments, the present invention was experimentally evaluated using F1 values.
F1 = 2 × P × R / (P + R)
where P represents precision and R represents recall.
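The F1 value is the harmonic mean of precision and recall; as a sketch:

```python
# F1 = 2*P*R / (P + R); returns 0.0 when both P and R are zero.
def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(round(f1(0.8, 0.5), 4))   # -> 0.6154
```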
(3) Baseline method
The present invention uses a multinomial naive Bayes classifier (MNB) and an N-gram model as baseline methods. The baseline methods are likewise evaluated with F1 values.
(4) Results of the experiment
Table 1 compares the experimental results of the baseline methods, the basic classifiers, and the Bagging method. The embodiment of the invention is compared with the baseline methods MNB and N-gram and with the other three basic learners LMBL, SVMBL, and RFBL. As the results in Table 1 show, the proposed method is significantly superior to the baseline methods and the basic classifiers. On Kazakh, the accuracy of the Bagging method is not higher than that of the other methods, but it equals that of MNB, the most accurate, namely 0.656. In ensemble learning, if the accuracy of the basic learners falls below 0.5, the ensemble method is not applicable; we therefore speculate that the reason our approach does not help on Kazakh may be the lower accuracy of the basic classifiers.
As can be seen from Table 1, the F1 value is lowest on Kazakh and highest on Uyghur, probably because the similarity between Uyghur and Kirgiz is the highest and that between Kazakh and Kirgiz is the lowest.
TABLE 1 comparison of experimental results (Whole basic learner)
Table 2 compares the experimental results among all basic learners except LMBL, the baseline methods, and the Bagging method. Table 3 compares the results among all basic learners except SVMBL, the baseline methods, and the Bagging method. Table 4 compares the results among all basic learners except RFBL, the baseline methods, and the Bagging method.
TABLE 2 comparison of experimental results (basic learner except LMBL)
TABLE 3 comparison of experimental results (basic learner except SVMBL)
TABLE 4 comparison of experimental results (basic learner except RFBL)
Combining Tables 1, 2, 3, and 4, it can be seen that when one of the basic learners is removed, the F1 value of the Bagging method decreases. This shows that the Bagging method needs a sufficient number of suitable basic classifiers to better improve the language identification effect.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined by the appended claims and their equivalents.

Claims (7)

1. A language identification method based on ensemble learning is characterized by comprising the following steps:
step 1, training four basic classifiers, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL), based on a training data set D by the following process:
(1)t=1;
(2) from the training data set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, selecting a bootstrap sample according to a preset extraction proportion parameter max_samples to obtain the training set D_b, where D contains n labeled instances (x_i, y_i); each instance x_i = [x_i1, x_i2, ..., x_id]^T is a vector of d features, y_i is the class to which x_i belongs, i ∈ [1, n], y_i ∈ Y, Y = {1, 2, ..., q}, and q represents the number of categories;
(3) based on the training set D_b, selecting sample features according to a preset feature selection proportion parameter max_features, and performing feature filtering based on the selected features to obtain the filtered training set D_t;
(4) based on D_t, training the four basic classifiers, namely multinomial naive Bayes (MNBBL), random forest (RFBL), support vector machine (SVMBL), and linear model (LMBL), to obtain the t-th instance of each basic classifier, expressed as follows:
M_t = MNB(D_t);
R_t = RF(D_t);
S_t = SVM(D_t);
L_t = LM(D_t);
where M_t denotes the t-th MNBBL classifier, R_t the t-th RFBL classifier, S_t the t-th SVMBL classifier, and L_t the t-th LMBL classifier;
(5) t = t + 1; if t ≤ T, go to (2), where T is the preset number of training rounds;
step 2, identifying the sample x to be identified by using the four basic classifiers trained in the step 1 through the following processes to obtain prediction classes of x corresponding to the four classifiers:
(1) according to the features selected for the t-th classifier, performing feature filtering on x to obtain the filtered sample to be identified x_t, t ∈ [1, T];
(2) using the t-th instance of each of the four basic classifiers to identify x_t, obtaining the identification results M_t(x_t), R_t(x_t), S_t(x_t), and L_t(x_t);
(3) applying a simple voting rule over the four basic classifiers to obtain the prediction classes y_m, y_r, y_s, and y_l of x, expressed mathematically as follows:
y_m = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = M_t(x_t));
y_r = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = R_t(x_t));
y_s = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = S_t(x_t));
y_l = argmax_{y ∈ Y} Σ_{t=1}^{T} I(y = L_t(x_t));
where I(a = b) = 1 if a = b and 0 otherwise;
step 3, combining the four basic classifiers into a stronger classifier by using an integration strategy.
2. The language identification method based on ensemble learning as claimed in claim 1, wherein the training data set D in step 1 is obtained by:
(1) preparing a training corpus and preprocessing the training corpus to obtain an initial corpus set;
(2) normalizing the data set samples of the initial corpus set according to the identification target to obtain the samples of the data set;
(3) and selecting a feature space, and vectorizing the sample of the data set based on the feature space to obtain a training data set D.
3. The language identification method based on ensemble learning according to claim 1, wherein the sub-step (3) in step 1 is further completed by the following processes:
(1) according to the feature selection proportion parameter max_features, performing feature selection on the data set and marking the selected features;
(2) according to the features selected in (1), performing feature filtering on the bootstrap samples in the training set D_b, forming a new training data set D_t.
4. The language identification method based on ensemble learning as claimed in claim 1, wherein the selection of bootstrap samples and the selection of sample features in step 1 are both selected randomly.
5. The language identification method based on ensemble learning as claimed in claim 1, wherein when the simple voting rule in step 2 yields several categories with the same, maximal number of votes, the prediction category of x is determined according to the category priority order from low to high (the smallest category label wins).
6. The language identification method based on ensemble learning according to any one of claims 1-5, wherein the integration strategy in step 3 is any one of simple voting rules, Bayesian voting methods, and integration methods based on D-S evidence theory.
7. The language identification method based on ensemble learning as claimed in claim 6, wherein when the integration strategy is a simple voting rule, if several categories obtain the same, maximal number of votes, the final identification category of x is determined according to the basic classifier priority order from high to low: MNBBL > RFBL > SVMBL > LMBL.
CN201510644536.1A 2015-10-08 2015-10-08 Language identification method based on ensemble learning Pending CN105335350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510644536.1A CN105335350A (en) 2015-10-08 2015-10-08 Language identification method based on ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510644536.1A CN105335350A (en) 2015-10-08 2015-10-08 Language identification method based on ensemble learning

Publications (1)

Publication Number Publication Date
CN105335350A true CN105335350A (en) 2016-02-17

Family

ID=55285895


Country Status (1)

Country Link
CN (1) CN105335350A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090125463A1 (en) * 2007-11-13 2009-05-14 Shohei Hido Technique for classifying data
CN102298646A (en) * 2011-09-21 2011-12-28 Soochow University Method and device for classifying subjective text and objective text
CN102789498A (en) * 2012-07-16 2012-11-21 Qian Gang Method and system for sentiment classification of Chinese comment text based on ensemble learning
CN103034626A (en) * 2012-12-26 2013-04-10 Shanghai Jiao Tong University Sentiment analysis system and method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YAO, PEIJUN: "Research on Ensemble Algorithms Based on Naive Bayes", China Master's Theses Full-text Database, Information Science and Technology *
DENG, SHENGXIONG ET AL.: "A Classification Model Integrating Random Forests", Application Research of Computers *
CHEN, YAOLING ET AL.: "Language Identification Based on Multi-feature and Multi-classifier Fusion", Microcomputer Information *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106886799A (en) * 2017-03-17 2017-06-23 Northeastern University Online quality detection method for continuous-annealing strip steel based on hybrid ensemble learning
CN106886799B (en) * 2017-03-17 2019-08-02 Northeastern University Online quality detection method for continuous-annealing strip steel based on hybrid ensemble learning
CN108021941B (en) * 2017-11-30 2020-08-28 Sichuan University Method and device for predicting drug-induced hepatotoxicity
CN108021941A (en) * 2017-11-30 2018-05-11 Sichuan University Method and device for predicting drug-induced hepatotoxicity
CN108090788A (en) * 2017-12-22 2018-05-29 Soochow University Advertisement conversion rate estimation method based on a temporal-information ensemble model
CN108090788B (en) * 2017-12-22 2021-04-20 Soochow University Advertisement conversion rate estimation method based on a temporal-information ensemble model
CN108564094A (en) * 2018-04-24 2018-09-21 Hebei Zhilin Information Technology Co., Ltd. Material identification method based on combining convolutional neural networks and classifiers
CN108564094B (en) * 2018-04-24 2021-09-14 Hebei Zhilin Information Technology Co., Ltd. Material identification method based on combining convolutional neural networks and classifiers
CN109857862A (en) * 2019-01-04 2019-06-07 Ping An Technology (Shenzhen) Co., Ltd. Text classification method, device, server and medium based on intelligent decision
CN109857862B (en) * 2019-01-04 2024-04-19 Ping An Technology (Shenzhen) Co., Ltd. Text classification method, device, server and medium based on intelligent decision
CN111636932A (en) * 2020-04-23 2020-09-08 Tianjin University Online blade-crack measurement method based on blade-tip timing and an ensemble learning algorithm
CN114462397A (en) * 2022-01-20 2022-05-10 Lianlian (Hangzhou) Information Technology Co., Ltd. Language identification model training method, language identification method and device, and electronic equipment
CN114462397B (en) * 2022-01-20 2023-09-22 Lianlian (Hangzhou) Information Technology Co., Ltd. Language identification model training method, language identification method and device, and electronic equipment

Similar Documents

Publication Publication Date Title
Effrosynidis et al. A comparison of pre-processing techniques for twitter sentiment analysis
Chowdhury et al. Performing sentiment analysis in Bangla microblog posts
Le et al. Twitter sentiment analysis using machine learning techniques
CN105335350A (en) Language identification method based on ensemble learning
US8731904B2 (en) Apparatus and method for extracting and analyzing opinion in web document
CN101520802A (en) Question-answer pair quality evaluation method and system
CN107122349A (en) Text feature word extraction method based on word2vec-LDA models
Solakidis et al. Multilingual sentiment analysis using emoticons and keywords
Abbasi et al. Applying authorship analysis to Arabic web content
CN103995853A (en) Multi-language emotional data processing and classifying method and system based on key sentences
Hasanuzzaman et al. Demographic word embeddings for racism detection on Twitter
CN107463703A (en) English social media account classification method based on information gain
Ljubešić et al. Discriminating between closely related languages on twitter
Veena et al. An effective way of word-level language identification for code-mixed facebook comments using word-embedding via character-embedding
Espinosa et al. Bots and Gender Profiling using Character Bigrams.
CN107463715A (en) English social media account classification method based on information gain
Khan et al. Harnessing english sentiment lexicons for polarity detection in urdu tweets: A baseline approach
CN112711666B (en) Futures label extraction method and device
Hussain et al. A technique for perceiving abusive bangla comments
Chumwatana COMMENT ANALYSIS FOR PRODUCT AND SERVICE SATISFACTION FROM THAI CUSTOMERS' REVIEW IN SOCIAL NETWORK
Sigit et al. Comparison of Classification Methods on Sentiment Analysis of Political Figure Electability Based on Public Comments on Online News Media Sites
Baniata et al. Sentence representation network for Arabic sentiment analysis
Schmid et al. FoSIL-Offensive language classification of German tweets combining SVMs and deep learning techniques.
Chen et al. Learning the chinese sentence representation with LSTM autoencoder
KR102569381B1 (en) System and Method for Machine Reading Comprehension to Table-centered Web Documents

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160217
