CN102567529A - Cross-language text classification method based on two-view active learning technology - Google Patents
- Publication number
- CN102567529A CN102567529A CN2011104532511A CN201110453251A CN102567529A CN 102567529 A CN102567529 A CN 102567529A CN 2011104532511 A CN2011104532511 A CN 2011104532511A CN 201110453251 A CN201110453251 A CN 201110453251A CN 102567529 A CN102567529 A CN 102567529A
- Authority
- CN
- China
- Prior art keywords
- text
- language
- view
- certainty
- piece
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
The invention relates to a cross-language text classification method based on a two-view active learning technique. The method comprises the following steps: (1) constructing two views, namely translating all source-language texts into the target language and all target-language texts into the source language with a machine translation tool, so that each text has versions in both languages; (2) training initial classifiers, namely training one classifier on the source-language versions and another on the target-language versions; (3) active learning, namely expanding the training set and retraining the classifiers on the new training set to obtain two enhanced classifiers; and (4) classification, namely classifying texts with the enhanced classifiers. The active learning technique improves cross-language text classification performance and greatly reduces the number of target-language samples that must be manually labeled.
Description
Technical field
The present invention relates to a text classification method, and in particular to a cross-language text classification method that learns a target-language text classifier from a collection of labeled source-language texts. It belongs to the field of language information processing.
Background technology
In the Internet and various information processing systems, text classification is very widely used, for example in news categorization, scientific literature classification, and public opinion classification. The main approach to text classification is machine learning: classification rules are learned from a collection of manually labeled texts, and an automatic classifier is built from them. The manually labeled text collection is commonly called the training set. With internationalization, more and more companies and organizations need to process multilingual data. If a separate classifier had to be built for every language, a training set would have to be labeled for every language, incurring substantial labor and financial cost.
Cross-language text classification aims to learn, from a training set written in an existing source language, a classifier applicable to a target language, thereby reducing the workload of labeling training sets. Currently, the main approach to cross-language text classification is based on machine translation: the training set is translated into the target language, and the target-language classifier is learned from the translation. However, owing to differences in culture and other factors, texts in different languages, even when they belong to similar domains, do not cover identical topics; for example, the news topics reported in two languages are not the same. This phenomenon is called topic drift. A classifier learned from a translated training set therefore cannot fully adapt to the target language, and its classification performance suffers.
Summary of the invention
The objective of the invention is to solve the topic-difference problem between the source language and the target language in cross-language text classification, letting the classifier adapt better to the target language through active learning and thereby improving classification performance.
The present invention is realized by the following technical solution:
A cross-language text classification method based on two-view active learning. Let the source language and target language be denoted E and C respectively, let the source-language training set be denoted TRe, and let an additional collection of unlabeled target-language texts be denoted Uc. The concrete steps of the cross-language text classification method are as follows:
(1) Constructing two views: using a machine translation tool, translate all source-language texts into the target language and all target-language texts into the source language, so that every text has versions in both languages. Treating each language version as a view, every text then has two views, the E view and the C view. The two-view version of TRe is denoted TR, and the two-view version of Uc is denoted U.
(2) Training initial classifiers: using TR as the training set, first train a classifier Ce on its source-language versions, then train a classifier Cc on its target-language versions. Each trained classifier must output the probability that a text belongs to each category.
Here, any of various machine learning algorithms can be used to train the classifiers, such as Support Vector Machines (SVM) or naive Bayes.
(3) Active learning:
a) Classify the texts in U with Ce and Cc, based on the E view and the C view respectively, and compute the classification probabilities;
b) Select the n texts with the lowest average confidence of Cc and Ce. These texts contain classification knowledge that is hard to learn from the source language, or from training texts translated from the source language into the target language. After manual labeling, add them to the training set as new training texts;
c) Select the m texts on which Cc is confident but Ce is not, label them with the classification results of Cc, and add them to the training set; likewise, select the m texts on which Ce is confident but Cc is not, label them with the classification results of Ce, and add them to the training set;
d) Finally, retrain Cc and Ce on the new training set.
Here n and m are positive integers no larger than the total number of texts in U; in general, smaller values of n and m make the learning process more effective. Steps a) to d) are iterated I times, where I is a positive integer; in general, the larger I is, the more thorough the learning, but the longer the training time and the higher the cost.
Through this active learning process, two enhanced classifiers Cc and Ce are obtained.
(4) Classification: for a text to be classified, written in the target language C, first construct its E view with the machine translation tool, then classify it with Cc and Ce based on its C view and E view respectively. Each classifier outputs the probability that the text belongs to each category, and the average of the two is taken as the text's final probability for that category. Finally, the category with the highest probability is taken as the text's class.
Beneficial effect
Compared with the prior art, the method provided by the present invention has the following beneficial effects:
1. Active learning improves cross-language text classification performance. Topic drift exists between different languages. The present invention lets the classifier trained on the source-language training set actively discover, in the unlabeled target-language texts, the topic-drift knowledge that needs to be learned, so that the classifier adapts better to the target language and classification performance improves. At the same time, the number of target-language samples requiring manual labeling is greatly reduced.
2. The two-view technique reduces manual labor in active learning. The present invention uses two classifiers based on different views; each passes its own most confident classification results to the other as new training data, so that the two learn from each other. This further reduces the number of samples that need manual labeling during active learning.
Description of drawings
Fig. 1 is a schematic diagram of constructing the two views of a text, taking Chinese and English as an example;
Fig. 2 is a schematic diagram of the learning process, taking Chinese and English as an example;
Fig. 3 is a schematic diagram of the classification process, taking Chinese and English as an example.
Embodiment
The implementation of the method is described below with reference to the drawings, taking Chinese as the target language, English as the source language, and SVM (Support Vector Machine) as the base classification algorithm. SVM can be implemented with toolkits such as LibSVM. Well-known methods and processes are not described in detail here, so as not to obscure the realization of the present invention.
Let English and Chinese be denoted E and C respectively. Given a labeled English training set TRe and an unlabeled Chinese text collection Uc, the method learns from TRe and Uc a classifier applicable to Chinese texts.
Step 1: constructing the two views
As shown in Fig. 1, using a machine translation tool, the English texts in TRe are translated into Chinese to construct the Chinese-English two-view training set TR, and the Chinese texts in Uc are translated into English to construct the Chinese-English two-view unlabeled set U. That is, every text in TR and U has both a Chinese version and an English version.
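The view construction of step 1 can be sketched as follows. The function `mt_translate` is a hypothetical stand-in for a real machine translation tool (the patent does not prescribe a particular one); it is simulated here with a toy phrase table.

```python
# Sketch of step 1: give every text both an English (E) and a Chinese (C) view.
# mt_translate is a hypothetical placeholder for a real machine translation tool.

def mt_translate(text, target):
    # Toy phrase table standing in for real machine translation.
    e2c = {"sports news": "体育新闻", "finance report": "财经报道"}
    c2e = {v: k for k, v in e2c.items()}
    table = e2c if target == "C" else c2e
    return table.get(text, text)

def build_two_views(texts, source_lang):
    """Return a list of {'E': ..., 'C': ...} two-view texts."""
    views = []
    for t in texts:
        if source_lang == "E":
            views.append({"E": t, "C": mt_translate(t, "C")})
        else:
            views.append({"C": t, "E": mt_translate(t, "E")})
    return views

TR = build_two_views(["sports news", "finance report"], source_lang="E")  # from TRe
U = build_two_views(["体育新闻"], source_lang="C")                          # from Uc
```

After this step, every text in TR and U carries both language versions, so each of the two classifiers can later be trained and applied on its own view.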
Step 2: the learning process
As shown in Fig. 2, this step consists of the following substeps:
(1) Let the training set TrainingSet be TR;
(2) Train a classifier Cc with the SVM algorithm on the Chinese versions of all texts in TrainingSet;
(3) Train a classifier Ce with the SVM algorithm on the English versions of all texts in TrainingSet;
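As a sketch of substeps (2)-(3) (not part of the patent text), the two view classifiers can be trained with scikit-learn, whose `SVC` wraps the LibSVM toolkit named above and can output per-class probabilities. The tiny bag-of-words corpus below is purely illustrative.

```python
# Sketch of substeps (2)-(3): train one probabilistic SVM per language view.
# scikit-learn's SVC wraps LibSVM; probability=True enables P(y=i|d) estimates.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_view_classifier(texts, labels):
    """Train an SVM on one language view; predict_proba then gives P(y=i|d)."""
    clf = make_pipeline(CountVectorizer(), SVC(kernel="linear", probability=True))
    clf.fit(texts, labels)
    return clf

# Toy two-view training set: Chinese and English versions of the same six texts.
chinese = ["体育 比赛 进球", "球队 比赛 胜利", "球员 进球 比赛",
           "股票 市场 上涨", "银行 利率 市场", "股票 银行 利率"]
english = ["sports match goal", "team match victory", "player goal match",
           "stock market rise", "bank rate market", "stock bank rate"]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

Cc = train_view_classifier(chinese, labels)   # classifier on the C view
Ce = train_view_classifier(english, labels)   # classifier on the E view
probs = Ce.predict_proba(["football match"])  # one probability per category
```

Any classifier with calibrated probability output could be substituted; the method only requires P(y=i|d) for each category.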
(4) Classify all texts in U with Cc (based on their Chinese versions) and compute the classification probability (confidence) Certainty(d,Cc);
(5) Classify all texts in U with Ce (based on their English versions) and compute the classification confidence Certainty(d,Ce);
(6) Select from U the n texts with the lowest average classification confidence Average_Certainty(d,Cc,Ce); denote them S;
(7) Select from U all texts with Certainty(d,Cc)>h or Certainty(d,Ce)>h; denote them L. Here h is a confidence threshold whose value is a floating-point number in (0,1);
(8) For every text in L, compute the confidence differences of the two classifiers, Certainty_Diff(d,Ce,Cc) and Certainty_Diff(d,Cc,Ce). Select the m texts with the highest Certainty_Diff(d,Ce,Cc) values; denote them Ee. Select the m texts with the highest Certainty_Diff(d,Cc,Ce) values; denote them Ec;
(9) Remove Ee, Ec, and S from U;
(10) Manually label the class of every text in S; label the texts in Ee with the classification results of Ce; label the texts in Ec with the classification results of Cc;
(11) Add S, Ee, and Ec to TrainingSet;
(12) Repeat steps (2) to (11) above I times;
In the above process, n and m are positive integers; in general, smaller values of n and m make the learning process more effective. I is a positive integer; in general, the larger I is, the more thorough the learning, but the longer the training time and the higher the cost.
After the above steps, two enhanced classifiers Cc and Ce are finally obtained.
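The sample-selection substeps (4)-(11) of one iteration can be sketched in plain Python, assuming the class probabilities from Cc and Ce have already been computed for each unlabeled text; all identifiers and numbers below are illustrative.

```python
# Sketch of substeps (4)-(11): one sample-selection round, given class
# probabilities already computed by Cc and Ce for each unlabeled text in U.

def certainty(p):
    """Certainty(d, C): gap between the two largest class probabilities."""
    top = sorted(p, reverse=True)
    return top[0] - top[1]

# U as {text_id: (probs_from_Cc, probs_from_Ce)}; values are illustrative.
U = {
    "d1": ([0.50, 0.50], [0.52, 0.48]),   # both unsure -> manual labeling (S)
    "d2": ([0.95, 0.05], [0.55, 0.45]),   # Cc sure, Ce unsure -> Ec
    "d3": ([0.55, 0.45], [0.93, 0.07]),   # Ce sure, Cc unsure -> Ee
    "d4": ([0.90, 0.10], [0.88, 0.12]),   # both sure
}
n, m, h = 1, 1, 0.6   # selection sizes and confidence threshold in (0, 1)

# Substep (6): n least certain texts on average go to S for manual labeling.
avg = {d: (certainty(pc) + certainty(pe)) / 2 for d, (pc, pe) in U.items()}
S = sorted(avg, key=avg.get)[:n]

# Substeps (7)-(8): confident texts, then the largest confidence gaps.
L = [d for d, (pc, pe) in U.items()
     if certainty(pc) > h or certainty(pe) > h]
Ec = sorted(L, key=lambda d: certainty(U[d][0]) - certainty(U[d][1]),
            reverse=True)[:m]             # Cc much surer than Ce
Ee = sorted(L, key=lambda d: certainty(U[d][1]) - certainty(U[d][0]),
            reverse=True)[:m]             # Ce much surer than Cc

# Substep (9): remove the selected texts from U.
for d in set(S) | set(Ee) | set(Ec):
    del U[d]
```

In substep (10), S would then be labeled manually, Ee labeled with Ce's predictions, and Ec with Cc's predictions, before all three are added to TrainingSet.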
The computation of the variables used in the above steps is given next.
For a text d, the SVM classifier outputs the probability P(y=i|d) that it belongs to each category, where y denotes the class of d and i ranges over the possible category labels. How probability estimates are obtained from an SVM is not detailed here.
Classifier C computes the classification confidence of a text d as follows:
Certainty(d,C)=P_C(y=i|d)-P_C(y=j|d)
Here, i and j are the two categories with the highest probabilities.
The average confidence of classifiers Cc and Ce is computed as follows:
Average_Certainty(d,Cc,Ce)=(Certainty(d,Cc)+Certainty(d,Ce))/2
The difference between the confidences of classifiers Cc and Ce is computed as follows:
Certainty_Diff(d,Cc,Ce)=Certainty(d,Cc)-Certainty(d,Ce)
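The three formulas above translate directly into code; the probability vectors below are illustrative.

```python
# The three confidence formulas above, computed from class-probability vectors.

def certainty(p):
    """Certainty(d, C) = P_C(y=i|d) - P_C(y=j|d), for the two most probable classes."""
    top = sorted(p, reverse=True)
    return top[0] - top[1]

def average_certainty(p_cc, p_ce):
    """Average_Certainty(d, Cc, Ce): mean of the two classifiers' certainties."""
    return (certainty(p_cc) + certainty(p_ce)) / 2

def certainty_diff(p_cc, p_ce):
    """Certainty_Diff(d, Cc, Ce) = Certainty(d, Cc) - Certainty(d, Ce)."""
    return certainty(p_cc) - certainty(p_ce)

p_cc = [0.7, 0.2, 0.1]    # P_Cc(y=i|d) for three categories (illustrative)
p_ce = [0.4, 0.35, 0.25]  # P_Ce(y=i|d) for the same categories
```

A certainty near 1 means the classifier clearly prefers one category; near 0 means the top two categories are almost tied.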
Step 3: the classification process
As shown in Fig. 3, a Chinese text to be classified is first translated into English by the machine translation tool to obtain its English version. Then classifier Cc classifies its Chinese version, and classifier Ce classifies its English version. Finally, the two classification results are combined by the following formula:
P(y=i|d)=(P_c(y=i|d)+P_e(y=i|d))/2
Here i ranges over the possible category labels. The category with the highest probability is taken as the classification result.
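The combination rule of this step can be sketched as follows; the probability vectors and class names are illustrative.

```python
# Sketch of step 3: average the per-class probabilities of Cc and Ce,
# then pick the class with the highest averaged probability.

def classify(p_cc, p_ce, classes):
    """P(y=i|d) = (P_Cc(y=i|d) + P_Ce(y=i|d)) / 2; return the argmax class."""
    avg = [(pc + pe) / 2 for pc, pe in zip(p_cc, p_ce)]
    best = max(range(len(classes)), key=lambda i: avg[i])
    return classes[best], avg

# Cc leans one way, Ce the other; the average decides.
label, avg = classify([0.3, 0.7], [0.6, 0.4], ["finance", "sports"])
```

Averaging the two views' probabilities lets a confident classifier outvote an uncertain one while still using both language versions of the text.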
It should be understood that this embodiment is merely a specific example of implementing the present invention and should not limit its scope of protection. Equivalent modifications or variations of the foregoing, made without departing from the spirit and scope of the present invention, shall all fall within the claimed scope of the present invention.
Claims (2)
- 1. A cross-language text classification method based on two-view active learning, wherein the source language and target language are denoted E and C respectively, the source-language training set is denoted TRe, and an additional collection of unlabeled target-language texts is denoted Uc; the method comprises the following steps: (1) constructing two views: using a machine translation tool, translate all source-language texts into the target language and all target-language texts into the source language, so that every text has versions in both languages; treating each language version as a view, every text then has two views, the E view and the C view; the two-view version of TRe is denoted TR, and the two-view version of Uc is denoted U; (2) training initial classifiers: using TR as the training set, first train a classifier Ce on its source-language versions, then train a classifier Cc on its target-language versions; each trained classifier must output the probability that a text belongs to each category; (3) active learning: a) classify the texts in U with Ce and Cc, based on the E view and the C view respectively, and compute the classification probabilities; b) select the n texts with the lowest average confidence of Cc and Ce; these texts contain classification knowledge that is hard to learn from the source language, or from training texts translated from the source language into the target language; after manual labeling, add them to the training set as new training texts; c) select the m texts on which Cc is confident but Ce is not, label them with the classification results of Cc, and add them to the training set; select the m texts on which Ce is confident but Cc is not, label them with the classification results of Ce, and add them to the training set; d) finally, retrain Cc and Ce on the new training set; steps a) to d) are iterated I times; n and m above are positive integers no larger than the total number of texts in U, and I is a positive integer; this active learning process yields two enhanced classifiers Cc and Ce; (4) classification: for a text to be classified, written in the target language C, first construct its E view with the machine translation tool, then classify it with Cc and Ce based on its C view and E view respectively; each classifier outputs the probability that the text belongs to each category, and the average of the two is taken as the text's final probability for that category; finally, the category with the highest probability is taken as the text's class.
- 2. The cross-language text classification method according to claim 1, characterized in that in step (3) c) the texts are selected and labeled as follows: select from U all texts with Certainty(d,Cc)>h or Certainty(d,Ce)>h, denoted L, where h is a confidence threshold whose value is a floating-point number in (0,1); for every text in L, compute the confidence differences of the two classifiers, Certainty_Diff(d,Ce,Cc) and Certainty_Diff(d,Cc,Ce); select the m texts with the highest Certainty_Diff(d,Ce,Cc) values, denoted Ee, and the m texts with the highest Certainty_Diff(d,Cc,Ce) values, denoted Ec; label the texts in Ee with the classification results of Ce, and the texts in Ec with the classification results of Cc; wherein classifier C computes the classification confidence of a text d as Certainty(d,C)=P_C(y=i|d)-P_C(y=j|d), where i and j are the two categories with the highest probabilities, and the difference between the confidences of classifiers Cc and Ce is computed as Certainty_Diff(d,Cc,Ce)=Certainty(d,Cc)-Certainty(d,Ce).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110453251 CN102567529B (en) | 2011-12-30 | 2011-12-30 | Cross-language text classification method based on two-view active learning technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102567529A true CN102567529A (en) | 2012-07-11 |
CN102567529B CN102567529B (en) | 2013-11-06 |
Family
ID=46412928
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201110453251 Expired - Fee Related CN102567529B (en) | 2011-12-30 | 2011-12-30 | Cross-language text classification method based on two-view active learning technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102567529B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577498A (en) * | 2012-08-09 | 2014-02-12 | 北京百度网讯科技有限公司 | Method and device for automatically establishing classification rule for cross-language |
CN104584005A (en) * | 2012-08-22 | 2015-04-29 | 株式会社东芝 | Document classification device and document classification method |
CN107168533A (en) * | 2017-05-09 | 2017-09-15 | 长春理工大学 | A kind of P300 based on integrated supporting vector machine spells the training set extended method of device |
CN107169001A (en) * | 2017-03-31 | 2017-09-15 | 华东师范大学 | A kind of textual classification model optimization method based on mass-rent feedback and Active Learning |
CN107798386A (en) * | 2016-09-01 | 2018-03-13 | 微软技术许可有限责任公司 | More process synergics training based on unlabeled data |
CN110929530A (en) * | 2018-09-17 | 2020-03-27 | 阿里巴巴集团控股有限公司 | Method and device for identifying multilingual junk text and computing equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050049852A1 (en) * | 2003-09-03 | 2005-03-03 | Chao Gerald Cheshun | Adaptive and scalable method for resolving natural language ambiguities |
CN101261623A (en) * | 2007-03-07 | 2008-09-10 | 国际商业机器公司 | Word splitting method and device for word border-free mark language based on search |
CN101770453A (en) * | 2008-12-31 | 2010-07-07 | 华建机器翻译有限公司 | Chinese text coreference resolution method based on domain ontology through being combined with machine learning model |
Also Published As
Publication number | Publication date |
---|---|
CN102567529B (en) | 2013-11-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20131106; Termination date: 20141230 |
| EXPY | Termination of patent right or utility model | |