CN106951565A - Text classification method and text classifier obtained thereby - Google Patents

Text classification method and text classifier obtained thereby

Info

Publication number
CN106951565A
CN106951565A (application CN201710216502.1A)
Authority
CN
China
Prior art keywords
concept
text
corpus
classifier
collection
Prior art date
Legal status
Granted
Application number
CN201710216502.1A
Other languages
Chinese (zh)
Other versions
CN106951565B (en)
Inventor
贾宁
夏磊
Current Assignee
Number Library (Shanghai) Technology Co Ltd
Original Assignee
Number Library (Shanghai) Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Number Library (Shanghai) Technology Co Ltd
Priority to CN201710216502.1A
Publication of CN106951565A
Application granted
Publication of CN106951565B
Status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification


Abstract

The technical solution of the present invention discloses a method for obtaining a text classifier for automatically labeling corpus texts, and the text classifier so obtained. The method includes: determining a concept set; matching unlabeled texts against the concept keywords in the concept keyword set corresponding to each concept, and labeling the texts automatically; and, for each concept, when the number of texts in the labeled corpus text collection corresponding to the concept meets a threshold condition, training a text classification model for the concept to obtain its text classifier, finally obtaining the set of text classifiers for all concepts whose text counts meet the threshold condition. The method of the present invention provides a general algorithmic framework in which the classification scheme can be changed flexibly, saving computation time and resources; furthermore, the present invention needs only a small amount of initial corpus text and labels automatically, without manual annotation, further saving time and cost.

Description

Text classification method and text classifier obtained thereby
Technical field
The present invention relates to the field of text classification in artificial intelligence, and in particular to a text classification method and the text classifier obtained thereby.
Background technology
With the rapid development of network technology, the demand for effective organization and management of electronic text information, and for finding relevant information from it quickly, accurately, and comprehensively, keeps growing. Text classification, as a key technology for processing and organizing large volumes of text data, largely solves the problem of information clutter and helps users locate the information they need accurately; it is the technical foundation of fields such as information filtering, information retrieval, search engines, and text databases. Text classification generally comprises processes such as text representation, the selection and training of a text classifier, and the evaluation of and feedback on classification results. Existing text classification techniques are generally implemented in the following steps: (1) determine the classification scheme; (2) collect texts to be labeled, forming a corpus; (3) train the classification models on the corpus; (4) classify new texts with the trained classification models.
However, consider the following application scenarios: 1) the classification scheme is unbalanced, with some categories very large and others very small, for example one category being "automobile industry" and another "lithium battery"; 2) large amounts of unlabeled text are easy to obtain, but labeled corpus texts are costly to acquire; 3) the classification scheme changes frequently, with categories often added or removed. For such scenarios, existing text classification techniques have the following defects: (1) Regarding the classification scheme, the prior art uses a fixed scheme that cannot be changed at will once determined. Changing the scheme requires collecting corpus texts anew and retraining the classification models of all categories, a process that consumes substantial time and computing resources. (2) Regarding the corpus, the prior art requires category-labeled texts of a certain scale as training corpus, and producing such labeled texts generally requires a great deal of manual annotation first, which adds input cost.
Summary of the invention
In view of the defects of the prior art, namely that the classification scheme of a text classification technique cannot be changed at will, that changing the scheme consumes considerable time and computing resources, and that the corpus must be labeled manually at high cost and over a long time, the present invention aims to provide a text classification method that can change the classification scheme flexibly and label texts automatically, saving computing resources, time, and cost.
The technical solution achieving the above purpose is as follows:

The method of the present invention for obtaining a text classifier for automatically labeled corpus texts includes:

a concept determination step I: determining a concept set, where each concept in the concept set corresponds to a concept keyword set containing at least one concept keyword;

a corpus auto-labeling step II: using the concept keywords in the keyword set corresponding to each concept in the concept set to match against an unlabeled text collection, associating texts in the unlabeled collection with the corresponding concepts and labeling them automatically, obtaining a labeled corpus text collection;

the "labeled corpus text collection" contains the following two classes of text. Class 1: texts associated with some concept in the concept set and labeled with that concept. Class 2: texts associated with no concept in the concept set; since such texts are not associated with any determined concept, they may be labeled "other", left unlabeled, or labeled with any word that cannot be confused with a concept word; any marking that clearly distinguishes them from Class 1 texts will do;

a classification model training step III: for each concept in the concept set, when the number of texts in the labeled corpus text collection associated with the concept meets a threshold condition, training a text classification model for the concept, with the texts in the labeled corpus collection associated with the concept as positive examples and the texts not in that collection as negative examples; training on these positive and negative examples yields the text classifier corresponding to the concept, and finally the text classifiers corresponding to all concepts whose text counts meet the threshold condition are obtained.
In a preferred embodiment of the present invention, in concept determination step I, the concept set X is determined to consist of concepts x_i, where i = 1, 2, 3, ..., n, and each concept x_i in the concept set X corresponds to a concept keyword set Y_i composed of at least one concept keyword.
In a preferred embodiment of the present invention, corpus auto-labeling step II includes:

Step II 1: collect a sufficient number N of unlabeled texts according to the actual application, denoting the unlabeled text collection D = {d_j}, where j = 1, 2, ..., N;

Step II 2: use the concept keywords in the keyword set Y_i corresponding to each concept x_i to match against each text in the unlabeled text collection D; when the match between some text d_j in D and the concept keywords of x_i satisfies the matching condition, label d_j as associated with x_i. Matching every text in D in this way yields the labeled corpus text collection C.
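As an illustration, a minimal Python sketch of step II (all names are hypothetical; the matching condition used here anticipates the paragraph-ratio test of the preferred embodiment described further below):

```python
def matches_concept(text, keywords, ratio=0.25):
    """Matching condition of the preferred embodiment: more than
    `ratio` of the text's paragraphs contain a concept keyword."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    if not paragraphs:
        return False
    hits = sum(1 for p in paragraphs if any(kw in p for kw in keywords))
    return hits / len(paragraphs) > ratio

def auto_label(unlabeled_texts, concept_keywords):
    """Step II: label each text d_j with every concept x_i whose keyword
    set Y_i satisfies the matching condition; texts matching no concept
    are labeled 'other' (they could equally be left unlabeled)."""
    labeled = []
    for text in unlabeled_texts:
        concepts = [c for c, kws in concept_keywords.items()
                    if matches_concept(text, kws)]
        labeled.append((text, concepts or ["other"]))
    return labeled
```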
In a preferred embodiment of the present invention, classification model training step III includes:

Step III 1: divide the concept set X into two mutually complementary concept subsets X_A and X_B. The division principle is: if the number of labeled texts in C associated with concept x_i is greater than or equal to a threshold α, assign x_i to X_A and denote the labeled corpus texts associated with x_i as A_i; if it is less than α, assign x_i to X_B;

Step III 2: use the labeled corpus texts A_i corresponding to concept x_i in X_A as positive examples for training the classification model, and randomly draw k texts from C that do not belong to A_i, denoted A_i', as negative examples;

Step III 3: train the text classifier corresponding to x_i on A_i and A_i' using a naive Bayes, support vector machine, or logistic regression text classification model, denoting it m_i; the set of text classifiers trained for the concepts x_i in X_A is denoted M_0 (a sketch follows).
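A minimal sketch of steps III 1 to III 3 with scikit-learn (an assumption: the patent names naive Bayes, support vector machines, or logistic regression but fixes no library; for Chinese text the vectorizer would additionally need a tokenizer such as jieba):

```python
import random
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_concept_classifier(positives, corpus_texts, k=1000, seed=0):
    """Train m_i for one concept: A_i (positives) against k texts drawn
    at random from C but outside A_i (the negatives, A_i')."""
    pool = [t for t in corpus_texts if t not in set(positives)]
    random.Random(seed).shuffle(pool)
    negatives = pool[:k]
    X = positives + negatives
    y = [1] * len(positives) + [0] * len(negatives)
    # Calibration wraps the linear SVM so it can output the probability
    # P(x_i | text) that the later self-training steps rely on.
    model = make_pipeline(TfidfVectorizer(),
                          CalibratedClassifierCV(LinearSVC()))
    model.fit(X, y)
    return model
```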
In a preferred embodiment of the present invention, classification model training step III further includes:

Step III 4: use the classifiers m_i in M_0 corresponding to each concept x_i to classify (that is, to associate and label) the texts in the unlabeled text collection D, obtaining text classification results; these results are stored separately and do not affect the labeled corpus text collection C;

Step III 5: for each concept x_i in X_A, use m_i to compute the probability that a text corresponds to x_i; select from the classification results the texts whose probability of corresponding to x_i exceeds a threshold β, and add them to the labeled corpus texts A_i corresponding to x_i, forming a new labeled corpus set A_i;

Step III 6: for the new labeled corpus set A_i, repeat steps III 2 to III 5 two to ten times to obtain a satisfactory classifier m_i for x_i, thereby obtaining the final satisfactory text classifier set M; or, for the new labeled corpus set A_i, perform a manual matching assessment to obtain a satisfactory classifier m_i for x_i and thus the final satisfactory set M (a sketch of the loop follows).
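Steps III 4 to III 6 form a self-training loop; a sketch continuing the previous ones (β in 0.1-0.5 and 2-10 rounds, per the preferred embodiments below):

```python
def expand_corpus(model, unlabeled_texts, positives, beta=0.3):
    """Steps III 4-III 5: score the unlabeled texts with m_i and add
    those whose probability of the concept exceeds beta to A_i."""
    probs = model.predict_proba(unlabeled_texts)[:, 1]
    seen = set(positives)
    return positives + [t for t, p in zip(unlabeled_texts, probs)
                        if p > beta and t not in seen]

def bootstrap(positives, corpus_texts, unlabeled_texts, rounds=5, beta=0.3):
    """Step III 6: repeat train-and-expand for a fixed number of rounds."""
    model = None
    for _ in range(rounds):
        model = train_concept_classifier(positives, corpus_texts)
        positives = expand_corpus(model, unlabeled_texts, positives, beta)
    return model, positives
```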
In a preferred embodiment of the present invention, in classification model training step III 3: a support vector machine text classification model is used to train on A_i and A_i' the text classifier for concept x_i.

In a preferred embodiment of the present invention, in classification model training step III 5: the threshold β takes a value in the range 0.1-0.5; and in step III 6, steps III 2 to III 5 are repeated five to ten times.
In a preferred embodiment of the present invention, in classification model training step III 6:

The manual matching assessment means: for concept x_i, randomly draw some texts from the labeled corpus set A_i and some texts from the labeled corpus collection C that are not associated with x_i, manually re-label all drawn texts K, and obtain a reference classification. Each time classifier m_i is trained in step III 3, use m_i to classify all drawn texts K separately, obtaining a provisional classification: that is, use the classifier m_i corresponding to x_i to compute, for every text in K, the probability of its association with x_i, and if the probability exceeds the threshold β, label the text as associated with x_i. Compare the provisional classification against the reference classification and compute the accuracy of the provisional classification; when the accuracy is greater than or equal to a threshold γ, classifier m_i is a satisfactory text classifier.

When the accuracy is below γ, re-run concept determination step I, i.e. redefine at least one new concept keyword for x_i, forming a new concept keyword set Y_i, and/or redefine the matching condition of step II 2. When concept determination step I is re-run, perform step II 2 with the new keyword set Y_i to obtain a new labeled corpus collection C; apply step III 1 to C to obtain new concept subsets X_A and X_B; then continue steps III 2 to III 6 on the new X_A and X_B until the accuracy of the provisional classification of m_i reaches γ, at which point m_i is a satisfactory text classifier. When only the matching condition of step II 2 is redefined, restart from step II 2 until m_i is a satisfactory text classifier.
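A sketch of this acceptance test, continuing the earlier sketches (manual_labels is the list of booleans produced by manually labeling the drawn sample K; γ is typically in 0.8-1, per the embodiment below):

```python
def manual_matching_assessment(model, sample_texts, manual_labels,
                               beta=0.3, gamma=0.9):
    """Compare the provisional labels produced by m_i on the sample K
    against the manual reference labels; accept when accuracy >= gamma."""
    probs = model.predict_proba(sample_texts)[:, 1]
    provisional = [p > beta for p in probs]
    accuracy = (sum(a == b for a, b in zip(provisional, manual_labels))
                / len(sample_texts))
    return accuracy, accuracy >= gamma
```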
In a preferred embodiment of the present invention, the method further includes a concept association step IV:

Concept association step IV: use the classifier m_i for each concept x_i in the text classifier set M to classify a text d; at the same time, use the concept keywords in the keyword sets corresponding to the concepts x_i in X_B to match against d, obtaining the final association between d and each concept x_i in the concept set X.

In a preferred embodiment of the present invention, concept association step IV specifically includes:

Step IV 1: use each text classifier m_i in M to classify text d and compute the probability that d corresponds to x_i; if this probability exceeds the set threshold β, label d as associated with concept x_i in X_A;

Step IV 2: at the same time, use the concept keywords in the keyword set Y_i corresponding to each concept x_i in X_B to match against d; when the matching condition is satisfied, label d as associated with concept x_i in X_B. This yields the final association between d and each concept x_i in X.
In a preferred embodiment of the present invention, the matching condition specifically means:

Judge whether more than 25%, and preferably more than 30%, of the paragraphs of the text contain a concept keyword from the keyword set Y_i corresponding to x_i; if so, label the text as associated with x_i; if not, do not label the text as associated with x_i.
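Concretely, reusing the matches_concept helper sketched earlier (the text and keywords are invented for illustration):

```python
doc = "\n".join(
    ["Announcement on the waste gas treatment facility upgrade."] * 3
    + ["Unrelated commentary on quarterly earnings."] * 7)
# 3 of 10 paragraphs contain a keyword: 0.3 > 0.25, so associated.
print(matches_concept(doc, {"waste gas", "waste gas treatment"}))  # True
```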
In a preferred embodiment of the present invention, the method further includes a corpus update step V:

Corpus update step V: add the final associations between texts d and the concepts x_i in the concept set X to the labeled corpus collection C, and periodically remove older labeled corpus texts from C according to a removal policy, obtaining an updated labeled corpus collection C.

In a preferred embodiment of the present invention, the removal policy in corpus update step V means: keep the number of texts corresponding to each concept x_i in C between tens and hundreds, removing older texts when a concept's count exceeds the hundreds; and keep the number of texts associated with no concept x_i between thousands and tens of thousands, removing older texts when the count exceeds the tens of thousands.
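One simple realization of this policy is a bounded first-in-first-out pool per concept (a sketch; the caps are illustrative, since the patent specifies only orders of magnitude):

```python
from collections import deque

concept_pool = deque(maxlen=500)    # texts labeled with one concept x_i
other_pool = deque(maxlen=50000)    # texts associated with no concept

# Appending beyond maxlen silently evicts the oldest text first.
concept_pool.append("newly labeled text ...")
other_pool.append("newly collected unassociated text ...")
```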
In a preferred embodiment of the present invention, the method further includes a classifier update step VI:

Repeat classification model training step III on the updated labeled corpus collection C to obtain an updated text classifier set M.

In a preferred embodiment of the present invention, the method further includes a concept addition step VII:

Step VII 1: after adding some new concept x_p, take the added concept set X_P = {x_p} and the concept keyword set Y_p corresponding to x_p;

Step VII 2: perform corpus auto-labeling step II 2 on X_P = {x_p}, then divide X_P into two mutually complementary concept subsets X_PA and X_PB according to the division principle of classification model training step III 1, and perform classification model training steps III 2 to III 6;

Judge whether the number of texts in the labeled corpus collection C_p corresponding to x_p is greater than or equal to the threshold α; if so, assign x_p to concept subset X_A and add the trained satisfactory text classifier m_p to the text classifier set M; if not, assign x_p to concept subset X_B (see the sketch after this step).
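A sketch of step VII built on the earlier helpers, showing why adding a concept leaves the existing classifiers untouched (α = 100 follows the preferred embodiment below):

```python
def add_concept(name, keywords, concept_keywords, corpus_texts,
                unlabeled_texts, classifiers, alpha=100):
    """Step VII: auto-label texts for the new concept x_p only, and train
    a classifier m_p if its labeled texts reach the threshold alpha."""
    concept_keywords[name] = set(keywords)
    positives = [t for t in unlabeled_texts if matches_concept(t, keywords)]
    if len(positives) >= alpha:
        model, _ = bootstrap(positives, corpus_texts, unlabeled_texts)
        classifiers[name] = model  # x_p joins X_A; m_p joins M
    # otherwise x_p stays in X_B and is handled by keyword matching alone
    return classifiers
```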
In a preferred embodiment of the present invention, the method further includes a concept association step VIII after concept addition: use the classifier m_p for each concept x_p in the text classifier set M obtained after concept addition step VII to classify a text d; at the same time, use the concept keywords in the keyword sets corresponding to the concepts x_p in X_B to match against d, obtaining the final association between d and each concept x_p in the concept set X.

In a preferred embodiment of the present invention, in step II 2, when the matches between some text d_j in the unlabeled text collection D and the concept keywords of multiple concepts x_i all satisfy the matching condition, d_j is labeled as associated with all of those concepts.

In a preferred embodiment of the present invention, in step II 2, when the match between some text d_j in the unlabeled text collection D and the concept keywords of every concept x_i fails the matching condition, d_j is labeled "other" or left unlabeled.

In a preferred embodiment of the present invention, in the manual labeling procedure, when the match between some text and the concept keywords of every concept fails the matching condition, that text is labeled "other" or left unlabeled.

In a preferred embodiment of the present invention, after step IV 2, when the match between text d and the concept keywords of every concept x_i fails the matching condition, d is labeled "other" or left unlabeled, yielding the final association between d and each concept x_i in X.

In a preferred embodiment, the present invention also includes the text classifier for automatically labeled corpus texts so obtained.

In a preferred embodiment, the present invention also includes the labeled corpus text collection so obtained.
In a text classification method of the present invention, the classifier m_i for each concept x_i in the text classifier set M obtained by the above method is used to classify a text d; at the same time, the concept keywords in the keyword sets corresponding to the concepts x_i in X_B are used to match against and classify d, obtaining the final classification of d against each concept x_i in the concept set X.

A text classification method of the present invention specifically includes:

a first step: use each text classifier m_i in M to classify text d and compute the probability that d corresponds to x_i; if this probability exceeds the set threshold β, classify d as associated with concept x_i in X_A;

a second step: at the same time, use the concept keywords in the keyword set Y_i corresponding to each concept x_i in X_B to match against d; when the matching condition is satisfied, associate d with concept x_i in X_B.
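Putting the two branches together, a sketch of this final classification over the helpers above:

```python
def classify_text(text, classifiers, keyword_only_concepts, beta=0.3):
    """Concepts in X_A are decided by classifier probability; concepts
    in X_B fall back to keyword matching; 'other' if nothing matches."""
    labels = [c for c, m in classifiers.items()
              if m.predict_proba([text])[0, 1] > beta]
    labels += [c for c, kws in keyword_only_concepts.items()
               if matches_concept(text, kws)]
    return labels or ["other"]
```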
The term "concept" denotes a characteristic shared by the business operations of several listed companies. Categories of concepts include: (1) policies, such as "the Belt and Road"; (2) industries, such as "food and beverage"; (3) products, such as "drone"; (4) characteristics of the financial sector, such as "sub-new stock"; and other content. Concepts include but are not limited to these categories. The term "concept keyword" refers to the several words corresponding to each concept that represent its features; if a passage mentions a concept keyword, the passage is taken to be about the corresponding concept. Texts include but are not limited to news, titles and bodies of research reports by securities institutions, announcements of listed companies, and the like. The phrase "a text is associated with a concept" means that when the content of a text is substantially related to the content of some concept, the text is said to be associated with that concept. A text may be associated with a concept or with none; it may be associated with one concept or with several. From the perspective of text classification, a concept can be regarded as a category, and a text associated with a concept can be regarded as belonging to that category. The term "labeling" means marking a text with its corresponding concept; labeling may be done manually or automatically by algorithm. The term "unlabeled text collection" refers to a collection of texts not yet labeled with their corresponding concepts; the term "labeled corpus text collection" refers to a collection of texts labeled with their corresponding concepts.
The positive effects of the present invention are:

The method of the present invention provides a general algorithmic framework in which the classification scheme can be changed flexibly: for example, the labeled corpus text collection and the text classifiers can be updated periodically, and when a concept must be added, the new concept is simply added to the concept set, corpus auto-labeling and classifier training are carried out for the new concept alone, and the text classifier corresponding to the new concept is obtained, thereby saving computation time and resources. Moreover, the present invention requires only a small amount of initial corpus text and labels automatically, without manual annotation, further saving time and cost.
Brief description of the drawings

Fig. 1 is the flow chart of training a text classifier according to the present invention;

Fig. 2 is the flow chart of the text classification method of the present invention.

Detailed description of the embodiments

Preferred embodiments are given below, with reference to Fig. 1 and Fig. 2, to illustrate more clearly and completely the implementation of the method of the present invention for obtaining a text classifier for automatically labeled corpus texts, and of the text classifier.
Step A, the concept determination process, includes:

The concept set X consists of concepts x_i, where i = 1, 2, 3, ..., n, and each concept x_i in X corresponds to a concept keyword set Y_i composed of at least one concept keyword. A text may be associated with one or more concepts x_i, or with no concept x_i at all. If a text has substantial content related to some concept x_i in X, the text is associated with x_i; if the content of a text is unrelated to every concept x_i in X, the text is associated with no concept. From the perspective of text classification, a concept can be regarded as a category: if a text is associated with a concept, the text can be regarded as belonging to the category of that concept. Determining which concepts in the concept set a text is associated with thus determines into which concept categories the text is classified.
Step B, the corpus auto-labeling process, includes:

Step B1: collect a sufficient number N of unlabeled texts according to the actual application, denoting the unlabeled text collection D = {d_j}, where j = 1, 2, ..., N.

Step B2: use the concept keywords in the keyword set Y_i corresponding to each concept x_i to match against each text in the unlabeled text collection D, and judge whether the match between some text d_j in D and the concept keywords of x_i satisfies the matching condition. In a preferred embodiment of the present invention, the matching condition is to judge whether more than 25%, preferably more than 30%, of the paragraphs of d_j contain a concept keyword from the keyword set Y_i corresponding to x_i; if so, d_j can be associated with x_i; if not, d_j cannot be associated with x_i. When the match between some text d_j in D and the concept keywords of x_i satisfies the matching condition, d_j is labeled as a text associated with x_i. Matching every text in D yields the labeled corpus text collection C.

When the matches between some text d_j in D and the concept keywords of multiple concepts x_i all satisfy the matching condition, d_j is labeled as associated with all of those concepts. When the match between some text d_j in D and the concept keywords of every concept x_i fails the matching condition, d_j is labeled "other" or left unlabeled.
The principal factors affecting the matching condition include: the number of concept keywords from the keyword set Y_i of concept x_i occurring in the text; the ratio of that number to the total number of words in the text; the number of sentences in the text in which a concept keyword from Y_i occurs; the ratio of that sentence count to the total number of sentences in the text; the number of paragraphs in the text in which a concept keyword from Y_i occurs; and the ratio of that paragraph count to the total number of paragraphs in the text. A sketch of these statistics follows.
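These statistics can be computed directly; a sketch (the sentence splitting here is naive and for illustration only; Chinese text would split on 。 instead):

```python
def match_statistics(text, keywords):
    """Candidate statistics behind the matching condition: keyword
    occurrences, and counts/ratios of words, sentences, and paragraphs
    that contain a concept keyword."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    sentences = [s for s in text.replace("\n", " ").split(".") if s.strip()]
    words = text.split()
    kw_hits = sum(text.count(kw) for kw in keywords)
    kw_sents = sum(1 for s in sentences if any(kw in s for kw in keywords))
    kw_paras = sum(1 for p in paragraphs if any(kw in p for kw in keywords))
    return {
        "keyword_occurrences": kw_hits,
        "keyword_to_word_ratio": kw_hits / max(len(words), 1),
        "matching_sentences": kw_sents,
        "matching_sentence_ratio": kw_sents / max(len(sentences), 1),
        "matching_paragraphs": kw_paras,
        "matching_paragraph_ratio": kw_paras / max(len(paragraphs), 1),
    }
```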
Step C, the classification model training process, includes:

Step C1: divide the concept set X into two mutually complementary concept subsets X_A and X_B. The division principle is: if the number of labeled texts in C associated with concept x_i is greater than or equal to a threshold α, assign x_i to X_A and denote the labeled corpus texts associated with x_i as A_i; if it is less than α, assign x_i to X_B. In a preferred embodiment, α = 100.

Step C2: use the texts in the labeled corpus A_i corresponding to concept x_i in X_A as positive examples for training the classification model, and randomly draw k texts from C that do not belong to A_i, denoted A_i', as negative examples.

Step C3: train the text classifier corresponding to x_i on A_i and A_i' using a text classification model such as naive Bayes, support vector machine, or logistic regression, denoting it m_i. The set of classifiers trained for the concepts x_i in X_A is denoted M_0. In a preferred embodiment of the present invention, a support vector machine text classification model (see: Yuan, G., Ho, C., Lin, C.: Recent Advances of Large-scale Linear Classification. Proc. IEEE 100(9), 2584-2603 (2012)) is used to train on A_i and A_i' the text classifier for concept x_i.

Step C4: use the classifiers m_i in M_0 corresponding to each concept x_i to classify the texts in the unlabeled text collection D, obtaining text classification results; these results are stored separately and do not affect C.

Step C5: for each concept x_i in X_A, use m_i to compute the probability that a text corresponds to x_i; select from the classification results the texts whose probability exceeds a threshold β, and add them to the labeled corpus A_i corresponding to x_i, forming a new labeled corpus set A_i. Here β takes a value in the range 0.1-0.5.

Step C6: for the new labeled corpus set A_i, repeat steps C2 to C5 two to ten times to obtain a satisfactory classifier m_i for x_i, thereby obtaining the final satisfactory text classifier set M. In a preferred embodiment, steps C2 to C5 are repeated five to ten times. Alternatively, for the new labeled corpus set A_i, a manual matching assessment is performed to obtain a satisfactory classifier m_i for x_i and thus the final satisfactory set M.
Here the manual matching assessment means: for concept x_i, randomly draw some texts from the labeled corpus set A_i and some texts from the labeled corpus collection C that are not associated with x_i, manually re-label all drawn texts K, and obtain a reference classification. Each time classifier m_i is trained in step C3, use m_i to classify all drawn texts K separately, obtaining a provisional classification: that is, use the classifier m_i corresponding to x_i to compute, for every text in K, the probability of its association with x_i, and if the probability exceeds the threshold β, label the text as associated with x_i. Compare the provisional classification against the reference classification and compute the accuracy of the provisional classification; when the accuracy is greater than or equal to a threshold γ (typically γ is in the range 0.8-1), classifier m_i is a satisfactory text classifier.

When the accuracy is below γ, re-run the concept determination process (step A), i.e. redefine at least one new concept keyword for x_i, forming a new concept keyword set Y_i, and/or redefine the matching condition of step B2. When concept determination is re-run, perform step B2 with the new keyword set Y_i to obtain a new labeled corpus collection C; apply step C1 to C to obtain new concept subsets X_A and X_B; then continue steps C2 to C6 on the new X_A and X_B until the accuracy of the provisional classification of m_i reaches γ, at which point m_i is a satisfactory text classifier. When only the matching condition of step B2 is redefined, restart from step B2 until m_i is a satisfactory text classifier.
Step D, the concept association process, includes:

Step D1: use the classifier m_i for each concept x_i in the text classifier set M to classify a text d, and use m_i to compute the probability that d corresponds to x_i; if this probability exceeds the set threshold β, label d as associated with concept x_i in X_A;

Step D2: at the same time, use the concept keywords in the keyword set Y_i corresponding to each concept x_i in X_B to match against d; when the matching condition is satisfied, label d as associated with concept x_i in X_B. This yields the final association between d and each concept x_i in X. When the match between d and the concept keywords of every concept x_i fails the matching condition, d is labeled "other" or left unlabeled, and the final association between d and each concept x_i in X is obtained.
Step E, updating the labeled corpus text collection, includes:

Add the final associations between texts d and the concepts x_i in X to the labeled corpus collection C, and periodically remove older labeled corpus texts from C according to a removal policy, obtaining an updated labeled corpus collection C. The removal policy means: keep the number of texts corresponding to each concept x_i in C between tens and hundreds, removing older texts when a concept's count exceeds the hundreds; and keep the number of texts associated with no concept x_i between thousands and tens of thousands, removing older texts when the count exceeds the tens of thousands.

Step F, the classifier update process, includes:

Repeating the classification model training process (step C) on the updated labeled corpus collection C to obtain an updated text classifier set M.
Step G, the concept addition process, includes:

Step G1: after adding some new concept x_p, take the added concept set X_P = {x_p} and the concept keyword set Y_p corresponding to x_p.

Step G2: perform corpus auto-labeling on X_P = {x_p}, then divide X_P into two mutually complementary concept subsets X_PA and X_PB according to the division principle of step C1, and perform classification model training steps C2 to C6. Judge whether the number of texts in the labeled corpus collection C_p corresponding to x_p is greater than or equal to the threshold α; if so, assign x_p to concept subset X_A and add the trained satisfactory text classifier m_p to the text classifier set M; if not, assign x_p to concept subset X_B.

The concept association process after concept addition includes: using the classifier m_p for each concept x_p in the text classifier set M obtained after the concept addition process to classify a text d; at the same time, using the concept keywords in the keyword sets corresponding to the concepts x_p in X_B to match against d, obtaining the final association between d and each concept x_p in X.
The implementation of the method of the present invention for obtaining a text classifier for automatically labeled corpus texts is further described below through a specific example:
Step S1, determining the concepts:

A concept set X = {x_1, x_2, x_3} containing three concepts is determined, with x_1 = waste gas treatment, x_2 = augmented reality, and x_3 = vanadium battery. The concept keyword sets Y_1, Y_2, Y_3 corresponding to x_1, x_2, x_3 are respectively:

Y_1 = {waste gas, waste gas treatment, exhaust gas treatment, organic waste gas, industrial waste gas, waste gas purification};

Y_2 = {augmented reality, ar};

Y_3 = {vanadium battery}.
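In code, the concept set of this example looks as follows (keywords are rendered in English here, as in this translation; in the original they are Chinese, and news_texts below stands for the 5000 collected news items):

```python
concept_keywords = {
    "waste gas treatment": {"waste gas", "waste gas treatment",
                            "exhaust gas treatment", "organic waste gas",
                            "industrial waste gas", "waste gas purification"},
    "augmented reality": {"augmented reality", "ar"},
    "vanadium battery": {"vanadium battery"},
}

# Step S2 then reduces to one call of the earlier sketch:
# labeled = auto_label(news_texts, concept_keywords)
```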
Step S2, automatic corpus labeling:

Step S21: collect 5000 news texts, forming the unlabeled text collection D = {d_j}, where j = 1, 2, ..., 5000;

Step S22: use the keyword sets Y_1, Y_2, Y_3 corresponding to the concepts x_1, x_2, x_3 to match against each text in D. When the match between a text d_j and the concept keywords Y_i of some concept x_i satisfies the matching condition, that is, when more than 25%, preferably more than 30%, of the paragraphs of d_j contain a concept keyword from Y_i, then d_j can be associated with x_i; otherwise d_j cannot be associated with x_i. When the match between some text d_j in D and the concept keywords of x_i satisfies the matching condition, d_j is labeled as a text associated with x_i. Labeling every text in D yields the labeled corpus text collection C.
Step S3, classification model training:

Step S31: in the labeled corpus collection C, 208 texts are labeled as associated with concept x_1, 154 as associated with x_2, and 34 as associated with x_3;

Step S32: divide the concept set X into two mutually complementary concept subsets X_A and X_B. Taking the threshold α = 100, X_A = {x_1, x_2} and X_B = {x_3}. The labeled corpus texts associated with x_1 and x_2 are denoted A_1 and A_2, respectively.

Step S33: use the texts in the labeled corpora A_1 and A_2 corresponding to concepts x_1 and x_2 in X_A as positive examples for training the classification models, and randomly draw 1000 texts from C not belonging to A_1 or A_2 as negative examples, denoted A_1' and A_2'. Train the classifiers corresponding to x_1 and x_2 on A_1, A_1' and on A_2, A_2' using a support vector machine, denoting them m_1 and m_2. The classifier set for the concepts x_1, x_2 in X_A is denoted M_0.

Step S34: use the classifiers m_1 and m_2 in M_0 corresponding to x_1 and x_2 to classify the texts in the unlabeled collection D, obtaining text classification results.

Step S35: for x_1 and x_2 in X_A, use m_1 and m_2 to compute the probability that a text corresponds to x_1 or x_2; select from the classification results the texts whose probability of corresponding to x_1 or x_2 exceeds the threshold β, and add them to the labeled corpora A_1 and A_2 corresponding to x_1 and x_2, forming new labeled corpus sets A_1 and A_2.

Step S36: for the new labeled corpus sets A_1 and A_2, repeat steps S32 to S35 five times each, obtaining satisfactory classifiers m_1 and m_2 for x_1 and x_2 and thereby the final satisfactory text classifier set M.
Step S4, concept association processing:

Step S41: use the classifiers m_1 and m_2 for x_1 and x_2 in the text classifier set M to classify a text d, and use m_1 and m_2 to compute the probability that d corresponds to x_1 or x_2; if the probability exceeds the set threshold β, label d as associated with the corresponding concept x_1 or x_2 in X_A;

Step S42: at the same time, use the concept keywords in the keyword set Y_3 corresponding to concept x_3 in X_B to match against d; when the matching condition is satisfied, label d as associated with concept x_3 in X_B;

Step S43: combining steps S41 and S42 gives the concept association result for d, that is, the final association between d and each concept x_1, x_2, x_3 in the concept set X.

Claims (24)

1. A method for obtaining a text classifier for automatically labeled corpus texts, characterized in that the method includes:

a concept determination step I: determining a concept set, each concept in the concept set corresponding to a concept keyword set containing at least one concept keyword;

a corpus auto-labeling step II: using the concept keywords in the keyword set corresponding to each concept in the concept set to match against an unlabeled text collection, associating texts in the unlabeled collection with the corresponding concepts and labeling them automatically, obtaining a labeled corpus text collection;

a classification model training step III: for each concept in the concept set, when the number of texts in the labeled corpus text collection associated with the concept meets a threshold condition, using the labeled corpus texts associated with the concept as positive examples of a text classification model and labeled corpus texts not associated with the concept as negative examples, training on the positive and negative examples of the text classification model corresponding to the concept to obtain the text classifier corresponding to the concept, and finally obtaining the text classifiers corresponding to all concepts whose text counts meet the threshold condition.
2. The method as claimed in claim 1, characterized in that in concept determination step I, the concept set X is determined to consist of concepts x_i, where i = 1, 2, 3, ..., n, and each concept x_i in X corresponds to a concept keyword set Y_i composed of at least one concept keyword.

3. The method as claimed in claim 2, characterized in that corpus auto-labeling step II includes:

step II 1: collecting a sufficient number N of unlabeled texts according to the actual application, denoting the unlabeled text collection D = {d_j}, where j = 1, 2, ..., N;

step II 2: using the concept keywords in the keyword set Y_i corresponding to each concept x_i to match against each text in the unlabeled text collection D; when the match between some text d_j in D and the concept keywords of x_i satisfies the matching condition, labeling d_j as associated with x_i; matching every text in D yields the labeled corpus text collection C.
4. The method as claimed in claim 3, characterized in that classification model training step III includes:

step III 1: dividing the concept set X into two mutually complementary concept subsets X_A and X_B, the division principle being: if the number of labeled texts in C associated with concept x_i is greater than or equal to a threshold α, assigning x_i to X_A and denoting the labeled corpus texts associated with x_i as A_i; if it is less than α, assigning x_i to X_B;

step III 2: using the labeled corpus texts A_i corresponding to concept x_i in X_A as positive examples for training the classification model, and randomly drawing k texts from C that do not belong to A_i, denoted A_i', as negative examples;

step III 3: training the text classifier corresponding to x_i on A_i and A_i' using a naive Bayes, support vector machine, or logistic regression text classification model, denoting it m_i; the set of text classifiers trained for the concepts x_i in X_A being denoted M_0.
5. The method as claimed in claim 4, characterized in that classification model training step III further includes:

step III 4: using the classifiers m_i in M_0 corresponding to each concept x_i to classify, i.e. to associate and label, the texts in the unlabeled text collection D, obtaining text classification results that are stored separately;

step III 5: for each concept x_i in X_A, using m_i to compute the probability that a text corresponds to x_i, selecting from the classification results the texts whose probability of corresponding to x_i exceeds a threshold β, and adding them to the labeled corpus texts A_i corresponding to x_i, forming a new labeled corpus set A_i;

step III 6: for the new labeled corpus set A_i, repeating steps III 2 to III 5 two to ten times to obtain a satisfactory classifier m_i for x_i, thereby obtaining the final satisfactory text classifier set M; or, for the new labeled corpus set A_i, performing a manual matching assessment to obtain a satisfactory classifier m_i for x_i and thus the final satisfactory text classifier set M.
6. The method as claimed in claim 4, characterized in that in classification model training step III 3:

a support vector machine text classification model is used to train on A_i and A_i' the text classifier for concept x_i.

7. The method as claimed in claim 5, characterized in that in classification model training step III 5: the threshold β takes a value in the range 0.1-0.5; and in step III 6, steps III 2 to III 5 are repeated five to ten times.
8. The method as claimed in claim 5, characterized in that in classification model training step III 6:

the manual matching assessment means: for concept x_i, randomly drawing some texts from the labeled corpus set A_i and some texts from the labeled corpus collection C that are not associated with x_i, manually re-labeling all drawn texts K, and obtaining a reference classification; each time classifier m_i is trained in step III 3, using m_i to classify all drawn texts K separately to obtain a provisional classification, i.e. using the classifier m_i corresponding to x_i to compute, for every text in K, the probability of its association with x_i, and labeling the text as associated with x_i if the probability exceeds the threshold β; comparing the provisional classification against the reference classification and computing the accuracy of the provisional classification; when the accuracy is greater than or equal to a threshold γ, classifier m_i being a satisfactory text classifier;

when the accuracy is below γ, re-running concept determination step I, i.e. redefining at least one new concept keyword for x_i to form a new concept keyword set Y_i, and/or redefining the matching condition of step II 2; when concept determination step I is re-run, performing step II 2 with the new keyword set Y_i to obtain a new labeled corpus collection C, applying step III 1 to C to obtain new concept subsets X_A and X_B, and continuing steps III 2 to III 6 on the new X_A and X_B until the accuracy of the provisional classification of m_i reaches γ, at which point m_i is a satisfactory text classifier; when only the matching condition of step II 2 is redefined, restarting from step II 2 until m_i is a satisfactory text classifier.
9. The method as claimed in claim 8, characterized in that it further includes a concept association step IV:

the concept association step IV: using the classifier m_i for each concept x_i in the text classifier set M to classify a text d; at the same time, using the concept keywords in the keyword sets corresponding to the concepts x_i in X_B to match against d, obtaining the final association between d and each concept x_i in the concept set X.

10. The method as claimed in claim 9, characterized in that concept association step IV specifically includes:

step IV 1: using each text classifier m_i in M to classify text d and computing the probability that d corresponds to x_i; if this probability exceeds the set threshold β, labeling d as associated with concept x_i in X_A;

step IV 2: at the same time, using the concept keywords in the keyword set Y_i corresponding to each concept x_i in X_B to match against d; when the matching condition is satisfied, labeling d as associated with concept x_i in X_B; obtaining the final association between d and each concept x_i in X.
11. The method as claimed in claim 3, 8 or 10, characterized in that the matching condition specifically means: judging whether more than 25%, and preferably more than 30%, of the paragraphs of the text contain a concept keyword from the keyword set Y_i corresponding to x_i; if so, labeling the text as associated with x_i; if not, not labeling the text as associated with x_i.

12. The method as claimed in claim 9, characterized in that it further includes a corpus update step V:

the corpus update step V: adding the final associations between texts d and the concepts x_i in the concept set X to the labeled corpus collection C, and periodically removing older labeled corpus texts from C according to a removal policy, obtaining an updated labeled corpus collection C.

13. The method as claimed in claim 12, characterized in that the removal policy in corpus update step V means: keeping the number of texts corresponding to each concept x_i in C between tens and hundreds, removing older texts when a concept's count exceeds the hundreds; and keeping the number of texts associated with no concept x_i between thousands and tens of thousands, removing older texts when the count exceeds the tens of thousands.
14. The method as claimed in claim 12, characterized in that it further includes a classifier update step VI:

repeating classification model training step III on the updated labeled corpus collection C to obtain an updated text classifier set M.

15. The method as claimed in claim 14, characterized in that it further includes a concept addition step VII:

step VII 1: after adding some new concept x_p, taking the added concept set X_P = {x_p} and the concept keyword set Y_p corresponding to x_p;

step VII 2: performing corpus auto-labeling step II 2 on X_P = {x_p}, then dividing X_P into two mutually complementary concept subsets X_PA and X_PB according to the division principle of classification model training step III 1, and performing classification model training steps III 2 to III 6;

judging whether the number of texts in the labeled corpus collection C_p corresponding to x_p is greater than or equal to the threshold α; if so, assigning x_p to concept subset X_A and adding the trained satisfactory text classifier m_p to the text classifier set M; if not, assigning x_p to concept subset X_B.
16. method as claimed in claim 15, it is characterised in that the concept related step VIII after also being increased newly including concept:Profit Increased newly with concept in the text classifier set M obtained after step VII and be directed to each concept xpText classifier mpTo text d Carry out classification processing;Concept subclass X is utilized simultaneouslyBIn each concept xpConcept in corresponding concept keyword set is closed Keyword carries out matching treatment to text d, obtains each concept x in text d and concept set XpFinal association results.
17. method as claimed in claim 3, it is characterised in that in step II 2, when certain in un-annotated data text collection D One text djWith multiple concept xiWhen the match condition of corresponding concept keyword is satisfied by matching condition, then by this text djIt is labeled as multiple concept x corresponding with thisiHomogeneous association.
18. method as claimed in claim 3, it is characterised in that in step II 2, when certain in un-annotated data text collection D One text djWith any concept xiWhen the match condition of corresponding concept keyword is unsatisfactory for matching condition, then by this text This djIt is labeled as " other " or does not make any mark.
19. The method as claimed in claim 8, characterised in that in the manual annotation method, when the matches between some text and the concept keywords corresponding to every concept fail to satisfy the matching condition, that text is labelled as "other" or left unlabelled.
20. The method as claimed in claim 10, characterised in that after step IV-2, when the matches between text d and the concept keywords corresponding to every concept x_i fail to satisfy the matching condition, text d is labelled as "other" or left unlabelled, and the final association results of text d with each concept x_i in concept set X are obtained.
21. A text classifier for automatic corpus annotation, obtained by the method as claimed in any one of claims 1 to 8, 13 and 14.
22. An annotated corpus text collection obtained by the method as claimed in any one of claims 3, 9, 12, 14 and 15.
23. A text classification method, characterised in that: classification processing is performed on text d using the text classifier m_i for each concept x_i in the text classifier set M obtained by the method as claimed in any one of claims 1 to 8, 13 and 14; at the same time, matching and classification processing is performed on text d using the concept keywords in the concept keyword set corresponding to each concept x_i in concept subset X_B; the final classification results of text d with respect to each concept x_i in concept set X are thereby obtained.
24. The method as claimed in claim 23, characterised in that it specifically comprises:
The first step: classification processing is performed on text d using each text classifier m_i in the text classifier set M, and the probability that text d corresponds to concept x_i is calculated; if the probability that text d corresponds to concept x_i is greater than a set threshold β, text d is associated with that concept x_i in concept subset X_A;
The second step: matching processing is simultaneously performed on text d using the concept keywords in the concept keyword set Y_i corresponding to each concept x_i in concept subset X_B; when the matching condition is met, text d is associated with that concept x_i in concept subset X_B.
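Claims 23 and 24 (and likewise association step VIII of claim 16) describe a two-track decision: classifier probability against threshold β for concepts in X_A, keyword matching for concepts in X_B. A minimal sketch under the same assumptions as the sketches above, with β = 0.5 used only as a placeholder value:

```python
def classify(text, classifiers, keyword_sets_b, beta=0.5, min_hits=2):
    """Two-track classification per claims 23-24: probability threshold
    beta for concepts in X_A, keyword matching for concepts in X_B."""
    results = set()
    # First step: each classifier m_i scores text d for its concept x_i.
    for concept, model in classifiers.items():
        prob = model.predict_proba([text])[0][1]  # probability of class 1
        if prob > beta:
            results.add(concept)
    # Second step: keyword matching for the concepts left in subset X_B.
    for concept, kws in keyword_sets_b.items():
        if sum(kw in text for kw in kws) >= min_hits:
            results.add(concept)
    return results
```

Note that a text can pick up concepts from both tracks, which matches the claims' requirement that the two steps run at the same time.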
CN201710216502.1A 2017-04-05 2017-04-05 File classification method and the text classifier of acquisition Active CN106951565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710216502.1A CN106951565B (en) 2017-04-05 2017-04-05 File classification method and the text classifier of acquisition

Publications (2)

Publication Number Publication Date
CN106951565A true CN106951565A (en) 2017-07-14
CN106951565B CN106951565B (en) 2018-04-27

Family

ID=59475323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710216502.1A Active CN106951565B (en) 2017-04-05 2017-04-05 File classification method and the text classifier of acquisition

Country Status (1)

Country Link
CN (1) CN106951565B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023986A (en) * 2009-09-22 2011-04-20 日电(中国)有限公司 Method and equipment for constructing text classifier by referencing external knowledge
CN104298665A (en) * 2014-10-16 2015-01-21 苏州大学 Identification method and device of evaluation objects of Chinese texts
CN106202177A (en) * 2016-06-27 2016-12-07 腾讯科技(深圳)有限公司 File classification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiaoli Li et al.: "Learning to Classify Texts Using Positive and Unlabeled Data", Proceedings of the 18th International Joint Conference on Artificial Intelligence *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549723A (en) * 2018-04-28 2018-09-18 北京神州泰岳软件股份有限公司 Text concept classification method, device and server
CN108549723B (en) * 2018-04-28 2022-04-05 北京神州泰岳软件股份有限公司 Text concept classification method and device and server
CN109189941A (en) * 2018-09-07 2019-01-11 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for updating model parameters
CN109766440A (en) * 2018-12-17 2019-05-17 航天信息股份有限公司 Method and system for determining default classification information for object text description
CN109766440B (en) * 2018-12-17 2023-09-01 航天信息股份有限公司 Method and system for determining default classification information for object text description
CN111132056A (en) * 2019-08-07 2020-05-08 上海钧正网络科技有限公司 Battery management method and device, battery and server
CN111132056B (en) * 2019-08-07 2023-05-05 上海钧正网络科技有限公司 Battery management method and device, battery and server
CN111539209A (en) * 2020-04-15 2020-08-14 北京百度网讯科技有限公司 Method and apparatus for entity classification
CN111539209B (en) * 2020-04-15 2023-09-15 北京百度网讯科技有限公司 Method and apparatus for entity classification

Also Published As

Publication number Publication date
CN106951565B (en) 2018-04-27

Similar Documents

Publication Publication Date Title
CN106951565B (en) File classification method and the text classifier of acquisition
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN108717433A (en) A kind of construction of knowledge base method and device of programming-oriented field question answering system
CN106021410A (en) Source code annotation quality evaluation method based on machine learning
CN106021389A (en) System and method for automatically generating news based on template
CN106649260A (en) Product feature structure tree construction method based on comment text mining
CN105975555A (en) Enterprise abbreviation extraction method based on bidirectional recurrent neural network
CN102270212A (en) User interest feature extraction method based on hidden semi-Markov model
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN107004141A (en) To the efficient mark of large sample group
CN112307153A (en) Automatic construction method and device of industrial knowledge base and storage medium
CN108027814A (en) Disable word recognition method and device
CN110046225B (en) Scientific and technological project material integrity assessment decision model training method
CN112800232B (en) Case automatic classification method based on big data
CN113434688B (en) Data processing method and device for public opinion classification model training
CN110705283A (en) Deep learning method and system based on matching of text laws and regulations and judicial interpretations
CN116703328B (en) Project review method and system
Shanmugalingam et al. Corporate it-support help-desk process hybrid-automation solution with machine learning approach
CN114817575A (en) Large-scale electric power affair map processing method based on extended model
CN117520754B (en) Pretreatment system for model training data
CN112785156B (en) Industrial collar and sleeve identification method based on clustering and comprehensive evaluation
CN110362828B (en) Network information risk identification method and system
CN107220238A (en) A kind of text object abstracting method based on Mixed Weibull distribution
CN114281998B (en) Event labeling system construction method for multi-level labeling person based on crowdsourcing technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant