CN102567529A - Cross-language text classification method based on two-view active learning technology - Google Patents
- Publication number
- CN102567529A CN102567529A CN2011104532511A CN201110453251A CN102567529A CN 102567529 A CN102567529 A CN 102567529A CN 2011104532511 A CN2011104532511 A CN 2011104532511A CN 201110453251 A CN201110453251 A CN 201110453251A CN 102567529 A CN102567529 A CN 102567529A
- Authority
- CN
- China
- Prior art keywords
- text
- language
- view
- certainty
- piece
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
The invention relates to a cross-language text classification method based on a two-view active learning technique. The method comprises the following steps: (1) constructing two views, namely translating all source-language texts into the target language and all target-language texts into the source language with a machine translation tool, so that each text has versions in both languages; (2) training initial classifiers, namely training one classifier on the source-language versions and another on the target-language versions; (3) active learning, namely expanding the training set and retraining the classifiers on the new training set to obtain two enhanced classifiers; and (4) classification, namely classifying texts with the enhanced classifiers. The active learning technique improves cross-language text classification performance and greatly reduces the number of target-language samples that must be manually labeled.
Description
Technical field
The present invention relates to a text classification method, and in particular to a cross-language text classification method that learns a target-language text classifier from a collection of labeled source-language texts. It belongs to the field of language information processing.
Background technology
In the Internet and various information processing systems, text classification is very widely used, for example in news categorization, scientific literature classification, and public opinion classification. The main approach to text classification is machine learning: classification rules are learned from a collection of manually labeled texts, and an automatic classifier is built from them. The manually labeled text collection is commonly called the training set. With internationalization, more and more companies and organizations need to process multilingual data. If a separate classifier had to be built for every language, a training set would have to be labeled for every language, incurring substantial labor and financial cost.
Cross-language text classification aims to learn, from a training set written in an existing source language, a classifier applicable to a target language, thereby reducing the workload of labeling training sets. Currently, the main approach to cross-language text classification is based on machine translation: the training set is translated into the target language, and the target-language classifier is learned from the translation. However, owing to differences in culture and other factors, texts in different languages, even when they belong to similar domains, do not cover identical topics; for example, the news topics reported in two languages are not the same. This phenomenon is called topic drift. A classifier learned from a translated training set therefore cannot fully adapt to the target language, and its classification performance suffers.
Summary of the invention
The objective of the invention is to solve the topic-difference problem between the source language and the target language in cross-language text classification, letting the classifier adapt better to the target language through active learning and thereby improving classification performance.
The present invention is realized by the following technical solution:
A cross-language text classification method based on two-view active learning. Let the source language and target language be denoted E and C respectively, let the source-language training set be denoted TRe, and let an additional collection of unlabeled target-language texts be denoted Uc. The concrete steps of the cross-language text classification method are as follows:
(1) Constructing two views: using a machine translation tool, translate all source-language texts into the target language and all target-language texts into the source language, so that every text has versions in both languages. Treating each language version as a view, every text then has two views, the E view and the C view. The two-view version of TRe is denoted TR, and the two-view version of Uc is denoted U.
(2) Training initial classifiers: using TR as the training set, first train a classifier Ce on its source-language versions, then train a classifier Cc on its target-language versions. Each trained classifier must output the probability that a text belongs to each category.
Here, any of various machine learning algorithms can be used to train the classifiers, such as Support Vector Machines (SVM) or naive Bayes.
(3) Active learning:
a) Classify the texts in U with Ce and Cc, based on the E view and the C view respectively, and compute the classification probabilities;
b) Select the n texts with the lowest average confidence of Cc and Ce. These texts contain classification knowledge that is hard to learn from the source language, or from training texts translated from the source language into the target language. After manual labeling, add them to the training set as new training texts;
c) Select the m texts on which Cc is confident but Ce is not, label them with the classification results of Cc, and add them to the training set; likewise, select the m texts on which Ce is confident but Cc is not, label them with the classification results of Ce, and add them to the training set;
d) Finally, retrain Cc and Ce on the new training set.
Here n and m are positive integers no larger than the total number of texts in U; in general, smaller values of n and m make the learning process more effective. Steps a) to d) are iterated I times, where I is a positive integer; in general, the larger I is, the more thorough the learning, but the longer the training time and the higher the cost.
Through this active learning process, two enhanced classifiers Cc and Ce are obtained.
(4) Classification: for a text to be classified, written in the target language C, first construct its E view with the machine translation tool, then classify it with Cc and Ce based on its C view and E view respectively. Each classifier outputs the probability that the text belongs to each category, and the average of the two is taken as the text's final probability for that category. Finally, the category with the highest probability is taken as the text's class.
Beneficial effect
Compared with the prior art, the method provided by the present invention has the following beneficial effects:
1. Active learning improves cross-language text classification performance. Topic drift exists between different languages. The present invention lets the classifier trained on the source-language training set actively discover, in the unlabeled target-language texts, the topic-drift knowledge that needs to be learned, so that the classifier adapts better to the target language and classification performance improves. At the same time, the number of target-language samples requiring manual labeling is greatly reduced.
2. The two-view technique reduces manual labor in active learning. The present invention uses two classifiers based on different views; each passes its own most confident classification results to the other as new training data, so that the two learn from each other. This further reduces the number of samples that need manual labeling during active learning.
Description of drawings
Fig. 1 is a schematic diagram of constructing the two views of a text, taking Chinese and English as an example;
Fig. 2 is a schematic diagram of the learning process, taking Chinese and English as an example;
Fig. 3 is a schematic diagram of the classification process, taking Chinese and English as an example.
Embodiment
The implementation of the method is described below with reference to the drawings, taking Chinese as the target language, English as the source language, and SVM (Support Vector Machine) as the base classification algorithm. SVM can be implemented with toolkits such as LibSVM. Well-known methods and processes are not described in detail here, so as not to obscure the realization of the present invention.
Let English and Chinese be denoted E and C respectively. Given a labeled English training set TRe and an unlabeled Chinese text collection Uc, the method learns from TRe and Uc a classifier applicable to Chinese texts.
Step 1: constructing the two views
As shown in Fig. 1, using a machine translation tool, the English texts in TRe are translated into Chinese to construct the Chinese-English two-view training set TR, and the Chinese texts in Uc are translated into English to construct the Chinese-English two-view unlabeled set U. That is, every text in TR and U has both a Chinese version and an English version.
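The view construction of step 1 can be sketched as follows. The function `mt_translate` is a hypothetical stand-in for a real machine translation tool (the patent does not prescribe a particular one); it is simulated here with a toy phrase table.

```python
# Sketch of step 1: give every text both an English (E) and a Chinese (C) view.
# mt_translate is a hypothetical placeholder for a real machine translation tool.

def mt_translate(text, target):
    # Toy phrase table standing in for real machine translation.
    e2c = {"sports news": "体育新闻", "finance report": "财经报道"}
    c2e = {v: k for k, v in e2c.items()}
    table = e2c if target == "C" else c2e
    return table.get(text, text)

def build_two_views(texts, source_lang):
    """Return a list of {'E': ..., 'C': ...} two-view texts."""
    views = []
    for t in texts:
        if source_lang == "E":
            views.append({"E": t, "C": mt_translate(t, "C")})
        else:
            views.append({"C": t, "E": mt_translate(t, "E")})
    return views

TR = build_two_views(["sports news", "finance report"], source_lang="E")  # from TRe
U = build_two_views(["体育新闻"], source_lang="C")                          # from Uc
```

After this step, every text in TR and U carries both language versions, so each of the two classifiers can later be trained and applied on its own view.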
Step 2: the learning process
As shown in Fig. 2, this step consists of the following substeps:
(1) Let the training set TrainingSet be TR;
(2) Train a classifier Cc with the SVM algorithm on the Chinese versions of all texts in TrainingSet;
(3) Train a classifier Ce with the SVM algorithm on the English versions of all texts in TrainingSet;
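As a sketch of substeps (2)-(3) (not part of the patent text), the two view classifiers can be trained with scikit-learn, whose `SVC` wraps the LibSVM toolkit named above and can output per-class probabilities. The tiny bag-of-words corpus below is purely illustrative.

```python
# Sketch of substeps (2)-(3): train one probabilistic SVM per language view.
# scikit-learn's SVC wraps LibSVM; probability=True enables P(y=i|d) estimates.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_view_classifier(texts, labels):
    """Train an SVM on one language view; predict_proba then gives P(y=i|d)."""
    clf = make_pipeline(CountVectorizer(), SVC(kernel="linear", probability=True))
    clf.fit(texts, labels)
    return clf

# Toy two-view training set: Chinese and English versions of the same six texts.
chinese = ["体育 比赛 进球", "球队 比赛 胜利", "球员 进球 比赛",
           "股票 市场 上涨", "银行 利率 市场", "股票 银行 利率"]
english = ["sports match goal", "team match victory", "player goal match",
           "stock market rise", "bank rate market", "stock bank rate"]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

Cc = train_view_classifier(chinese, labels)   # classifier on the C view
Ce = train_view_classifier(english, labels)   # classifier on the E view
probs = Ce.predict_proba(["football match"])  # one probability per category
```

Any classifier with calibrated probability output could be substituted; the method only requires P(y=i|d) for each category.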
(4) Classify all texts in U with Cc (based on their Chinese versions) and compute the classification probability (confidence) Certainty(d,Cc);
(5) Classify all texts in U with Ce (based on their English versions) and compute the classification confidence Certainty(d,Ce);
(6) Select from U the n texts with the lowest average classification confidence Average_Certainty(d,Cc,Ce); denote them S;
(7) Select from U all texts with Certainty(d,Cc)>h or Certainty(d,Ce)>h; denote them L. Here h is a confidence threshold whose value is a floating-point number in (0,1);
(8) For every text in L, compute the confidence differences of the two classifiers, Certainty_Diff(d,Ce,Cc) and Certainty_Diff(d,Cc,Ce). Select the m texts with the highest Certainty_Diff(d,Ce,Cc) values; denote them Ee. Select the m texts with the highest Certainty_Diff(d,Cc,Ce) values; denote them Ec;
(9) Remove Ee, Ec, and S from U;
(10) Manually label the class of every text in S; label the texts in Ee with the classification results of Ce; label the texts in Ec with the classification results of Cc;
(11) Add S, Ee, and Ec to TrainingSet;
(12) Repeat steps (2) to (11) above I times;
In the above process, n and m are positive integers; in general, smaller values of n and m make the learning process more effective. I is a positive integer; in general, the larger I is, the more thorough the learning, but the longer the training time and the higher the cost.
After the above steps, two enhanced classifiers Cc and Ce are finally obtained.
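The sample-selection substeps (4)-(11) of one iteration can be sketched in plain Python, assuming the class probabilities from Cc and Ce have already been computed for each unlabeled text; all identifiers and numbers below are illustrative.

```python
# Sketch of substeps (4)-(11): one sample-selection round, given class
# probabilities already computed by Cc and Ce for each unlabeled text in U.

def certainty(p):
    """Certainty(d, C): gap between the two largest class probabilities."""
    top = sorted(p, reverse=True)
    return top[0] - top[1]

# U as {text_id: (probs_from_Cc, probs_from_Ce)}; values are illustrative.
U = {
    "d1": ([0.50, 0.50], [0.52, 0.48]),   # both unsure -> manual labeling (S)
    "d2": ([0.95, 0.05], [0.55, 0.45]),   # Cc sure, Ce unsure -> Ec
    "d3": ([0.55, 0.45], [0.93, 0.07]),   # Ce sure, Cc unsure -> Ee
    "d4": ([0.90, 0.10], [0.88, 0.12]),   # both sure
}
n, m, h = 1, 1, 0.6   # selection sizes and confidence threshold in (0, 1)

# Substep (6): n least certain texts on average go to S for manual labeling.
avg = {d: (certainty(pc) + certainty(pe)) / 2 for d, (pc, pe) in U.items()}
S = sorted(avg, key=avg.get)[:n]

# Substeps (7)-(8): confident texts, then the largest confidence gaps.
L = [d for d, (pc, pe) in U.items()
     if certainty(pc) > h or certainty(pe) > h]
Ec = sorted(L, key=lambda d: certainty(U[d][0]) - certainty(U[d][1]),
            reverse=True)[:m]             # Cc much surer than Ce
Ee = sorted(L, key=lambda d: certainty(U[d][1]) - certainty(U[d][0]),
            reverse=True)[:m]             # Ce much surer than Cc

# Substep (9): remove the selected texts from U.
for d in set(S) | set(Ee) | set(Ec):
    del U[d]
```

In substep (10), S would then be labeled manually, Ee labeled with Ce's predictions, and Ec with Cc's predictions, before all three are added to TrainingSet.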
The computation of the variables used in the above steps is given next.
For a text d, the SVM classifier outputs the probability P(y=i|d) that it belongs to each category, where y denotes the class of d and i ranges over the possible category labels. How probability estimates are obtained from an SVM is not detailed here.
Classifier C computes the classification confidence of a text d as follows:
Certainty(d,C)=P_C(y=i|d)-P_C(y=j|d)
Here, i and j are the two categories with the highest probabilities.
The average confidence of classifiers Cc and Ce is computed as follows:
Average_Certainty(d,Cc,Ce)=(Certainty(d,Cc)+Certainty(d,Ce))/2
The difference between the confidences of classifiers Cc and Ce is computed as follows:
Certainty_Diff(d,Cc,Ce)=Certainty(d,Cc)-Certainty(d,Ce)
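The three formulas above translate directly into code; the probability vectors below are illustrative.

```python
# The three confidence formulas above, computed from class-probability vectors.

def certainty(p):
    """Certainty(d, C) = P_C(y=i|d) - P_C(y=j|d), for the two most probable classes."""
    top = sorted(p, reverse=True)
    return top[0] - top[1]

def average_certainty(p_cc, p_ce):
    """Average_Certainty(d, Cc, Ce): mean of the two classifiers' certainties."""
    return (certainty(p_cc) + certainty(p_ce)) / 2

def certainty_diff(p_cc, p_ce):
    """Certainty_Diff(d, Cc, Ce) = Certainty(d, Cc) - Certainty(d, Ce)."""
    return certainty(p_cc) - certainty(p_ce)

p_cc = [0.7, 0.2, 0.1]    # P_Cc(y=i|d) for three categories (illustrative)
p_ce = [0.4, 0.35, 0.25]  # P_Ce(y=i|d) for the same categories
```

A certainty near 1 means the classifier clearly prefers one category; near 0 means the top two categories are almost tied.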
Step 3: the classification process
As shown in Fig. 3, a Chinese text to be classified is first translated into English by the machine translation tool to obtain its English version. Then classifier Cc classifies its Chinese version, and classifier Ce classifies its English version. Finally, the two classification results are combined by the following formula:
P(y=i|d)=(P_c(y=i|d)+P_e(y=i|d))/2
Here i ranges over the possible category labels. The category with the highest probability is taken as the classification result.
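The combination rule of this step can be sketched as follows; the probability vectors and class names are illustrative.

```python
# Sketch of step 3: average the per-class probabilities of Cc and Ce,
# then pick the class with the highest averaged probability.

def classify(p_cc, p_ce, classes):
    """P(y=i|d) = (P_Cc(y=i|d) + P_Ce(y=i|d)) / 2; return the argmax class."""
    avg = [(pc + pe) / 2 for pc, pe in zip(p_cc, p_ce)]
    best = max(range(len(classes)), key=lambda i: avg[i])
    return classes[best], avg

# Cc leans one way, Ce the other; the average decides.
label, avg = classify([0.3, 0.7], [0.6, 0.4], ["finance", "sports"])
```

Averaging the two views' probabilities lets a confident classifier outvote an uncertain one while still using both language versions of the text.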
It should be understood that this embodiment is merely a specific example of implementing the present invention and should not limit its scope of protection. Equivalent modifications or variations of the foregoing, made without departing from the spirit and scope of the present invention, shall all fall within the claimed scope of the present invention.
Claims (2)
- 1. A cross-language text classification method based on two-view active learning, wherein the source language and target language are denoted E and C respectively, the source-language training set is denoted TRe, and an additional collection of unlabeled target-language texts is denoted Uc; the method comprises the following steps: (1) constructing two views: using a machine translation tool, translate all source-language texts into the target language and all target-language texts into the source language, so that every text has versions in both languages; treating each language version as a view, every text then has two views, the E view and the C view; the two-view version of TRe is denoted TR, and the two-view version of Uc is denoted U; (2) training initial classifiers: using TR as the training set, first train a classifier Ce on its source-language versions, then train a classifier Cc on its target-language versions; each trained classifier must output the probability that a text belongs to each category; (3) active learning: a) classify the texts in U with Ce and Cc, based on the E view and the C view respectively, and compute the classification probabilities; b) select the n texts with the lowest average confidence of Cc and Ce; these texts contain classification knowledge that is hard to learn from the source language, or from training texts translated from the source language into the target language; after manual labeling, add them to the training set as new training texts; c) select the m texts on which Cc is confident but Ce is not, label them with the classification results of Cc, and add them to the training set; select the m texts on which Ce is confident but Cc is not, label them with the classification results of Ce, and add them to the training set; d) finally, retrain Cc and Ce on the new training set; steps a) to d) are iterated I times; n and m above are positive integers no larger than the total number of texts in U, and I is a positive integer; this active learning process yields two enhanced classifiers Cc and Ce; (4) classification: for a text to be classified, written in the target language C, first construct its E view with the machine translation tool, then classify it with Cc and Ce based on its C view and E view respectively; each classifier outputs the probability that the text belongs to each category, and the average of the two is taken as the text's final probability for that category; finally, the category with the highest probability is taken as the text's class.
- 2. The cross-language text classification method according to claim 1, characterized in that in step (3) c) the texts are selected and labeled as follows: select from U all texts with Certainty(d,Cc)>h or Certainty(d,Ce)>h, denoted L, where h is a confidence threshold whose value is a floating-point number in (0,1); for every text in L, compute the confidence differences of the two classifiers, Certainty_Diff(d,Ce,Cc) and Certainty_Diff(d,Cc,Ce); select the m texts with the highest Certainty_Diff(d,Ce,Cc) values, denoted Ee, and the m texts with the highest Certainty_Diff(d,Cc,Ce) values, denoted Ec; label the texts in Ee with the classification results of Ce, and the texts in Ec with the classification results of Cc; wherein classifier C computes the classification confidence of a text d as Certainty(d,C)=P_C(y=i|d)-P_C(y=j|d), where i and j are the two categories with the highest probabilities, and the difference between the confidences of classifiers Cc and Ce is computed as Certainty_Diff(d,Cc,Ce)=Certainty(d,Cc)-Certainty(d,Ce).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110453251 CN102567529B (en) | 2011-12-30 | 2011-12-30 | Cross-language text classification method based on two-view active learning technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102567529A true CN102567529A (en) | 2012-07-11 |
CN102567529B CN102567529B (en) | 2013-11-06 |
Family
ID=46412928
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201110453251 Expired - Fee Related CN102567529B (en) | 2011-12-30 | 2011-12-30 | Cross-language text classification method based on two-view active learning technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102567529B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577498A (en) * | 2012-08-09 | 2014-02-12 | 北京百度网讯科技有限公司 | Method and device for automatically establishing classification rule for cross-language |
CN104584005A (en) * | 2012-08-22 | 2015-04-29 | 株式会社东芝 | Document classification device and document classification method |
CN107168533A (en) * | 2017-05-09 | 2017-09-15 | 长春理工大学 | A kind of P300 based on integrated supporting vector machine spells the training set extended method of device |
CN107169001A (en) * | 2017-03-31 | 2017-09-15 | 华东师范大学 | A kind of textual classification model optimization method based on mass-rent feedback and Active Learning |
CN107798386A (en) * | 2016-09-01 | 2018-03-13 | 微软技术许可有限责任公司 | More process synergics training based on unlabeled data |
CN110929530A (en) * | 2018-09-17 | 2020-03-27 | 阿里巴巴集团控股有限公司 | Method and device for identifying multilingual junk text and computing equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050049852A1 (en) * | 2003-09-03 | 2005-03-03 | Chao Gerald Cheshun | Adaptive and scalable method for resolving natural language ambiguities |
CN101261623A (en) * | 2007-03-07 | 2008-09-10 | 国际商业机器公司 | Word splitting method and device for word border-free mark language based on search |
CN101770453A (en) * | 2008-12-31 | 2010-07-07 | 华建机器翻译有限公司 | Chinese text coreference resolution method based on domain ontology through being combined with machine learning model |
Also Published As
Publication number | Publication date |
---|---|
CN102567529B (en) | 2013-11-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20131106; Termination date: 20141230 |
| EXPY | Termination of patent right or utility model | |