CN102567529A - Cross-language text classification method based on two-view active learning technology - Google Patents

Cross-language text classification method based on two-view active learning technology

Info

Publication number
CN102567529A
Authority
CN
China
Prior art keywords
text
language
view
certainty
piece
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011104532511A
Other languages
Chinese (zh)
Other versions
CN102567529B (en)
Inventor
戴林
刘越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN201110453251 (granted as CN102567529B)
Publication of CN102567529A
Application granted
Publication of CN102567529B
Expired - Fee Related (current legal status)
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a cross-language text classification method based on a two-view active learning technique. The method comprises the following steps: (1) constructing two views, namely translating all source-language texts into the target language and all target-language texts into the source language with a machine translation tool, so that each text has versions in two languages; (2) training initial classifiers, namely training one classifier on the source-language versions and another on the target-language versions; (3) performing active learning, namely expanding the training set and retraining the classifiers on the new training set to obtain two enhanced classifiers; and (4) classifying with the enhanced classifiers. The active learning technique improves cross-language text classification performance and greatly reduces the number of target-language samples that need to be manually labeled.

Description

A cross-language text classification method based on a two-view active learning technique
Technical Field
The present invention relates to a text classification method, and in particular to a cross-language text classification method that learns a target-language text classifier from a collection of labeled source-language texts, belonging to the field of language information processing.
Background Art
Text classification is widely used on the Internet and in various information processing systems, for example news categorization, scientific literature classification, and public opinion classification. The mainstream approach to text classification is machine learning: classification rules are learned from a batch of manually labeled texts, from which an automatic classifier is built. The collection of manually labeled texts is commonly called the training set. With internationalization, more and more companies and organizations need to process data in multiple languages; if a separate classifier were built for every language, a training set would have to be labeled for every language, which would incur considerable labor and expense.
Cross-language text classification aims to learn, from a training set in an existing source language, a classifier applicable to a target language, thereby reducing the workload of labeling training sets. Currently, the main approach to cross-language text classification is based on machine translation: the training set is translated into the target language, and the target-language classifier is then learned from the translation. However, owing to cultural and other differences, texts in different languages cover different topics even within the same category; for instance, the news topics covered are not identical. This phenomenon is called topic drift. A classifier learned from a translated training set cannot fully adapt to the target language, and its classification performance suffers.
Summary of the Invention
The objective of the present invention is to solve the topic-drift problem between the source and target languages in cross-language text classification, letting the classifier adapt better to the target language through active learning and thereby improving classification performance.
The present invention is realized by the following technical solution:
A cross-language text classification method based on two-view active learning. Let the source language and the target language be denoted E and C respectively, the source-language training set be denoted TRe, and an additional collection of unlabeled target-language texts be denoted Uc. The method comprises the following steps:
(1) Constructing two views: using a machine translation tool, translate all source-language texts into the target language and all target-language texts into the source language, so that every text has versions in both languages. Each language version is regarded as a view, so every text has two views, an E view and a C view. The two-view version of TRe is denoted TR, and the two-view version of Uc is denoted U.
(2) Training initial classifiers: with TR as the training set, first train a classifier Ce on its source-language versions, then train a classifier Cc on its target-language versions. The trained classifiers must output the probability that a text belongs to each category.
Here, the classifiers can be trained with various machine learning algorithms, such as Support Vector Machines (SVM) or Naïve Bayes.
(3) The active learning process:
a) Classify the texts in U with Ce and Cc, based on the E view and the C view respectively, and compute the classification probabilities;
b) Select the n texts on which the average confidence of Cc and Ce is lowest. These texts contain classification knowledge that is hard to learn from the source-language training set or from its translation into the target language. After manual labeling, add them to the training set as new training texts;
c) Select the m texts on which Cc's confidence exceeds Ce's, label them with the classification results of Cc, and add them to the training set; select the m texts on which Ce's confidence exceeds Cc's, label them with the classification results of Ce, and add them to the training set;
d) Finally, retrain Cc and Ce on the new training set.
In the above, n and m are positive integers no greater than the total number of texts in U; in general, smaller values of n and m make the learning process more effective. Steps a) to d) are iterated I times, where I is a positive integer; in general, the larger I is, the more thorough the learning, but the longer the training time and the higher the cost.
Through this active learning process, two enhanced classifiers Cc and Ce are obtained.
(4) The classification process: for a text to be classified, written in the target language C, first construct its E view with a machine translation tool, then classify it with Cc and Ce based on its C view and E view respectively. Each classifier outputs the probability that the text belongs to each category, and the mean of the two is taken as the final probability that the text belongs to that category. Finally, the category with the highest probability is taken as the category of the text.
Beneficial Effects
Compared with the prior art, the method provided by the invention has the following beneficial effects:
1. The active learning technique improves cross-language text classification performance. Topic drift exists between different languages. The present invention lets the classifier trained on the source-language training set actively discover, in the unlabeled target-language texts, the topic-drift knowledge that needs to be learned, so that it adapts better to the target language and classification performance improves. At the same time, the number of target-language samples that require manual labeling is greatly reduced.
2. The two-view technique reduces the manual work in active learning. The present invention uses two classifiers based on different views, each handing its most confident classification results to the other as new training data, so that the two learn from each other. This further reduces the number of samples that need manual labeling during the active learning process.
Description of the Drawings
Fig. 1 is a schematic diagram of constructing the two views of a text, with Chinese and English as an example;
Fig. 2 is a schematic diagram of the learning process, with Chinese and English as an example;
Fig. 3 is a schematic diagram of the classification process, with Chinese and English as an example.
Embodiment
The implementation of the method is described below with reference to the drawings, taking Chinese as the target language, English as the source language, and SVM (Support Vector Machine) as the base classification algorithm. SVM can be implemented with toolkits such as LibSVM. Well-known methods and processes are not described in detail here, so as not to obscure the implementation of the present invention.
Suppose English and Chinese are denoted E and C respectively, and that there is a labeled English training set TRe and an unlabeled Chinese text collection Uc. The method uses TRe and Uc to learn a classifier applicable to Chinese texts.
Step 1, constructing the two views:
As shown in Fig. 1, a machine translation tool translates the English texts in TRe into Chinese, constructing a Chinese-English two-view training set TR, and translates the Chinese texts in Uc into English, constructing a Chinese-English two-view unlabeled text set U. That is, every text in TR and U has both a Chinese and an English version.
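This view construction can be sketched in Python as follows; translate_to_chinese and translate_to_english are hypothetical placeholders for whatever machine translation tool is used, and the dictionary keys are invented for illustration, not taken from the patent:

    def translate_to_chinese(text):
        # Placeholder: plug in a machine translation tool here.
        raise NotImplementedError

    def translate_to_english(text):
        # Placeholder: plug in a machine translation tool here.
        raise NotImplementedError

    def two_views_of_english(texts):
        # English originals: the E view is the text itself, the C view its translation.
        return [{"e_view": t, "c_view": translate_to_chinese(t)} for t in texts]

    def two_views_of_chinese(texts):
        # Chinese originals: the C view is the text itself, the E view its translation.
        return [{"e_view": translate_to_english(t), "c_view": t} for t in texts]

    # TR = two_views_of_english(TRe_texts)   # two-view training set
    # U  = two_views_of_chinese(Uc_texts)    # two-view unlabeled set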
Step 2, the learning process:
As shown in Fig. 2, this step consists of the following substeps:
(1) Let the training set TrainingSet be TR;
(2) Train a classifier Cc with the SVM algorithm on the Chinese versions of all texts in TrainingSet;
(3) Train a classifier Ce with the SVM algorithm on the English versions of all texts in TrainingSet;
(4) Classify all texts in U with Cc (based on their Chinese versions) and compute the classification probability (confidence) Certainty(d, Cc);
(5) Classify all texts in U with Ce (based on their English versions) and compute the classification confidence Certainty(d, Ce);
(6) Select from U the n texts with the lowest average classification confidence Average_Certainty(d, Cc, Ce); denote them S;
(7) Select from U all texts with Certainty(d, Cc) > h or Certainty(d, Ce) > h; denote them L. Here h is a confidence threshold, a floating-point number in (0, 1);
(8) For every text in L, compute the confidence differences of the two classifiers, Certainty_Diff(d, Ce, Cc) and Certainty_Diff(d, Cc, Ce). Select the m texts with the highest Certainty_Diff(d, Ce, Cc) values; denote them Ee. Select the m texts with the highest Certainty_Diff(d, Cc, Ce) values; denote them Ec;
(9) Remove Ee, Ec, and S from U;
(10) Manually label the category of every text in S; label the categories of the texts in Ee with the classification results of Ce; label the categories of the texts in Ec with the classification results of Cc;
(11) Add S, Ee, and Ec to TrainingSet;
(12) Repeat substeps (2) to (11) above I times.
In the above process, n and m are positive integers; in general, smaller values of n and m make the learning process more effective. I is a positive integer; in general, the larger I is, the more thorough the learning, but the longer the training time and the higher the cost.
Through the above steps, two enhanced classifiers Cc and Ce are finally obtained.
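For concreteness, the following Python sketch walks through one iteration of substeps (2) to (11). It assumes the two views have already been turned into feature matrices (Xe and Xc for the training set, Ue and Uc for U), uses scikit-learn's SVC (which wraps LibSVM) in place of raw LibSVM, and takes a hypothetical oracle callable that returns the manual labels for S; these names are illustrative and do not come from the patent:

    import numpy as np
    from sklearn.svm import SVC

    def certainty(clf, X):
        # Certainty(d, C): gap between the two largest class probabilities.
        p = np.sort(clf.predict_proba(X), axis=1)
        return p[:, -1] - p[:, -2]

    def active_learning_round(Xe, Xc, y, Ue, Uc, oracle, n, m, h):
        cc = SVC(kernel="linear", probability=True).fit(Xc, y)   # substep (2)
        ce = SVC(kernel="linear", probability=True).fit(Xe, y)   # substep (3)

        cert_c = certainty(cc, Uc)   # substep (4)
        cert_e = certainty(ce, Ue)   # substep (5)

        # (6) S: the n texts with the lowest average certainty, to be hand-labeled.
        S = np.argsort((cert_c + cert_e) / 2.0)[:n]

        # (7) L: texts classified confidently by at least one of the two views.
        L = np.where((cert_c > h) | (cert_e > h))[0]

        # (8) Ee: Ce far more confident than Cc; Ec: the reverse.
        Ee = L[np.argsort(cert_e[L] - cert_c[L])[-m:]]
        Ec = L[np.argsort(cert_c[L] - cert_e[L])[-m:]]

        # (10) Labels: manual for S, Ce's predictions for Ee, Cc's for Ec.
        y_new = np.concatenate([oracle(S), ce.predict(Ue[Ee]), cc.predict(Uc[Ec])])

        # (11) Enlarge the training set; per (9) the caller drops `picked` from U.
        picked = np.concatenate([S, Ee, Ec])
        Xe = np.vstack([Xe, Ue[picked]])
        Xc = np.vstack([Xc, Uc[picked]])
        y = np.concatenate([y, y_new])
        return Xe, Xc, y, picked

Calling active_learning_round I times, removing the picked rows of Ue and Uc after each round, yields the two enhanced classifiers.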
Next, the computation of the quantities used in the above steps is given.
For a text d, the SVM classifier can output the probability P(y = i | d) that d belongs to each category, where y denotes the category of d and i ranges over the possible category numbers. How probability estimates are obtained from an SVM is not described in detail here.
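For reference, probability estimates of this kind are available in common SVM toolkits; a minimal sketch with scikit-learn (whose SVC wraps LibSVM), on invented toy data:

    import numpy as np
    from sklearn.svm import SVC

    # Toy two-class data, purely illustrative.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
    y = np.array([0] * 20 + [1] * 20)

    # probability=True makes the underlying LibSVM fit a sigmoid (Platt scaling),
    # so predict_proba returns estimates of P(y = i | d) for each class.
    clf = SVC(kernel="linear", probability=True).fit(X, y)
    print(clf.predict_proba(X[:1]))   # e.g. [[0.99 0.01]]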
A classifier C computes the classification confidence of a text d as follows:
Certainty(d, C) = P_C(y = i | d) - P_C(y = j | d)
where i and j are the two categories with the highest probabilities.
The average confidence of classifiers Cc and Ce is computed as:
Average_Certainty(d, Cc, Ce) = (Certainty(d, Cc) + Certainty(d, Ce)) / 2
The difference between the confidences of classifiers Cc and Ce is computed as:
Certainty_Diff(d, Cc, Ce) = Certainty(d, Cc) - Certainty(d, Ce)
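As an illustrative numerical example (values invented for exposition): if for a text d classifier Cc outputs the probabilities (0.7, 0.2, 0.1) over three categories and Ce outputs (0.5, 0.4, 0.1), then Certainty(d, Cc) = 0.7 - 0.2 = 0.5, Certainty(d, Ce) = 0.5 - 0.4 = 0.1, Average_Certainty(d, Cc, Ce) = (0.5 + 0.1) / 2 = 0.3, and Certainty_Diff(d, Cc, Ce) = 0.5 - 0.1 = 0.4; Cc is thus considerably more confident about d than Ce, making d a candidate for Ec in substep (8).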
Step 3, the classification process:
As shown in Fig. 3, a Chinese text to be classified is first translated into English with a machine translation tool, yielding its English version. Then classifier Cc classifies its Chinese version and classifier Ce classifies its English version. Finally, the final result is computed from the two classification results by the following formula:
P(y = i | d) = (P_c(y = i | d) + P_e(y = i | d)) / 2
where i ranges over the possible category numbers. The category with the maximum probability is taken as the classification result.
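Under the same illustrative naming as above, this classification step might be sketched as follows; vectorize_c and vectorize_e (per-view feature extraction) and translate_to_english are assumed helpers, not part of the patent:

    import numpy as np

    def classify(text_c, cc, ce, vectorize_c, vectorize_e, translate_to_english):
        # Classify a target-language (Chinese) text by averaging the two views.
        text_e = translate_to_english(text_c)              # construct the E view
        p_c = cc.predict_proba(vectorize_c([text_c]))[0]   # P_c(y = i | d)
        p_e = ce.predict_proba(vectorize_e([text_e]))[0]   # P_e(y = i | d)
        p = (p_c + p_e) / 2.0                              # averaged P(y = i | d)
        return int(np.argmax(p))                           # most probable category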
It should be understood that this embodiment is a specific example of implementing the present invention and should not be taken as limiting the scope of protection of the present invention. Equivalent modifications or changes made to the foregoing without departing from the spirit and scope of the present invention shall all fall within the scope of protection claimed by the present invention.

Claims (2)

  1. A cross-language text classification method based on two-view active learning, wherein the source language and the target language are denoted E and C respectively, the source-language training set is denoted TRe, and an additional collection of unlabeled target-language texts is denoted Uc; the method comprises the following steps:
    (1) Constructing two views: using a machine translation tool, translate all source-language texts into the target language and all target-language texts into the source language, so that every text has versions in both languages; each language version is regarded as a view, so every text has two views, an E view and a C view; the two-view version of TRe is denoted TR, and the two-view version of Uc is denoted U;
    (2) Training initial classifiers: with TR as the training set, first train a classifier Ce on its source-language versions, then train a classifier Cc on its target-language versions; the trained classifiers must output the probability that a text belongs to each category;
    (3) The active learning process:
    a) Classify the texts in U with Ce and Cc, based on the E view and the C view respectively, and compute the classification probabilities;
    b) Select the n texts on which the average confidence of Cc and Ce is lowest; these texts contain classification knowledge that is hard to learn from the source-language training set or from its translation into the target language; after manual labeling, add them to the training set as new training texts;
    c) Select the m texts on which Cc's confidence exceeds Ce's, label them with the classification results of Cc, and add them to the training set; select the m texts on which Ce's confidence exceeds Cc's, label them with the classification results of Ce, and add them to the training set;
    d) Finally, retrain Cc and Ce on the new training set;
    Steps a) to d) are iterated I times;
    In the above, n and m are positive integers no greater than the total number of texts in U, and I is a positive integer;
    Through this active learning process, two enhanced classifiers Cc and Ce are obtained;
    (4) The classification process: for a text to be classified, written in the target language C, first construct its E view with a machine translation tool, then classify it with Cc and Ce based on its C view and E view respectively; each classifier outputs the probability that the text belongs to each category, and the mean of the two is taken as the final probability that the text belongs to that category; finally, the category with the highest probability is taken as the category of the text.
  2. The cross-language text classification method according to claim 1, characterized in that the method of selecting and labeling texts in step c) of step (3) is: select from U all texts with Certainty(d, Cc) > h or Certainty(d, Ce) > h, denoted L; where h is a confidence threshold, a floating-point number in (0, 1);
    For every text in L, compute the confidence differences of the two classifiers, Certainty_Diff(d, Ce, Cc) and Certainty_Diff(d, Cc, Ce); select the m texts with the highest Certainty_Diff(d, Ce, Cc) values, denoted Ee; select the m texts with the highest Certainty_Diff(d, Cc, Ce) values, denoted Ec; label the categories of the texts in Ee with the classification results of Ce; label the categories of the texts in Ec with the classification results of Cc;
    wherein a classifier C computes the classification confidence of a text d as follows:
    Certainty(d, C) = P_C(y = i | d) - P_C(y = j | d)
    where i and j are the two categories with the highest probabilities;
    and the difference between the confidences of classifiers Cc and Ce is computed as:
    Certainty_Diff(d, Cc, Ce) = Certainty(d, Cc) - Certainty(d, Ce).
CN 201110453251 2011-12-30 2011-12-30 Cross-language text classification method based on two-view active learning technology Expired - Fee Related CN102567529B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110453251 CN102567529B (en) 2011-12-30 2011-12-30 Cross-language text classification method based on two-view active learning technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110453251 CN102567529B (en) 2011-12-30 2011-12-30 Cross-language text classification method based on two-view active learning technology

Publications (2)

Publication Number Publication Date
CN102567529A true CN102567529A (en) 2012-07-11
CN102567529B CN102567529B (en) 2013-11-06

Family

ID=46412928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110453251 Expired - Fee Related CN102567529B (en) 2011-12-30 2011-12-30 Cross-language text classification method based on two-view active learning technology

Country Status (1)

Country Link
CN (1) CN102567529B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577498A (en) * 2012-08-09 2014-02-12 北京百度网讯科技有限公司 Method and device for automatically establishing classification rule for cross-language
CN104584005A (en) * 2012-08-22 2015-04-29 株式会社东芝 Document classification device and document classification method
CN107168533A (en) * 2017-05-09 2017-09-15 长春理工大学 A kind of P300 based on integrated supporting vector machine spells the training set extended method of device
CN107169001A (en) * 2017-03-31 2017-09-15 华东师范大学 A kind of textual classification model optimization method based on mass-rent feedback and Active Learning
CN107798386A (en) * 2016-09-01 2018-03-13 微软技术许可有限责任公司 More process synergics training based on unlabeled data
CN110929530A (en) * 2018-09-17 2020-03-27 阿里巴巴集团控股有限公司 Method and device for identifying multilingual junk text and computing equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050049852A1 (en) * 2003-09-03 2005-03-03 Chao Gerald Cheshun Adaptive and scalable method for resolving natural language ambiguities
CN101261623A (en) * 2007-03-07 2008-09-10 国际商业机器公司 Word splitting method and device for word border-free mark language based on search
CN101770453A (en) * 2008-12-31 2010-07-07 华建机器翻译有限公司 Chinese text coreference resolution method based on domain ontology through being combined with machine learning model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050049852A1 (en) * 2003-09-03 2005-03-03 Chao Gerald Cheshun Adaptive and scalable method for resolving natural language ambiguities
CN101261623A (en) * 2007-03-07 2008-09-10 国际商业机器公司 Word splitting method and device for word border-free mark language based on search
CN101770453A (en) * 2008-12-31 2010-07-07 华建机器翻译有限公司 Chinese text coreference resolution method based on domain ontology through being combined with machine learning model

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577498A (en) * 2012-08-09 2014-02-12 北京百度网讯科技有限公司 Method and device for automatically establishing classification rule for cross-language
CN103577498B (en) * 2012-08-09 2018-09-07 北京百度网讯科技有限公司 A kind of method and apparatus building classifying rules automatically across language
CN104584005A (en) * 2012-08-22 2015-04-29 株式会社东芝 Document classification device and document classification method
CN104584005B (en) * 2012-08-22 2018-01-05 株式会社东芝 Document sorting apparatus and Document Classification Method
CN107798386A (en) * 2016-09-01 2018-03-13 微软技术许可有限责任公司 More process synergics training based on unlabeled data
CN107169001A (en) * 2017-03-31 2017-09-15 华东师范大学 A kind of textual classification model optimization method based on mass-rent feedback and Active Learning
CN107168533A (en) * 2017-05-09 2017-09-15 长春理工大学 A kind of P300 based on integrated supporting vector machine spells the training set extended method of device
CN110929530A (en) * 2018-09-17 2020-03-27 阿里巴巴集团控股有限公司 Method and device for identifying multilingual junk text and computing equipment
CN110929530B (en) * 2018-09-17 2023-04-25 阿里巴巴集团控股有限公司 Multi-language junk text recognition method and device and computing equipment

Also Published As

Publication number Publication date
CN102567529B (en) 2013-11-06

Similar Documents

Publication Publication Date Title
CN102567529B (en) Cross-language text classification method based on two-view active learning technology
Li et al. Multi-domain sentiment classification
CN107622104B (en) Character image identification and marking method and system
CN107169001A (en) A kind of textual classification model optimization method based on mass-rent feedback and Active Learning
CN101059796A (en) Two-stage combined file classification method based on probability subject
CN104199972A (en) Named entity relation extraction and construction method based on deep learning
CN106250372A (en) A kind of Chinese electric power data text mining method for power system
CN105335352A (en) Entity identification method based on Weibo emotion
CN101996241A (en) Bayesian algorithm-based content filtering method
CN101540017B (en) Feature extracting method based on byte level n-gram and twit filter
BaygIn Classification of text documents based on Naive Bayes using N-Gram features
CN103324628A (en) Industry classification method and system for text publishing
CN101950284A (en) Chinese word segmentation method and system
CN101079028A (en) On-line translation model selection method of statistic machine translation
CN105183715B (en) A kind of word-based distribution and the comment spam automatic classification method of file characteristics
Rekha et al. Solving class imbalance problem using bagging, boosting techniques, with and without using noise filtering method
CN103514170A (en) Speech-recognition text classification method and device
CN105183831A (en) Text classification method for different subject topics
CN101251896B (en) Object detecting system and method based on multiple classifiers
CN105306296A (en) Data filter processing method based on LTE (Long Term Evolution) signaling
CN102194012A (en) Microblog topic detecting method and system
CN102663435A (en) Junk image filtering method based on semi-supervision
CN1889108A (en) Method of identifying junk mail
CN104050556A (en) Feature selection method and detection method of junk mails
CN103020167A (en) Chinese text classification method for computer

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131106

Termination date: 20141230

EXPY Termination of patent right or utility model