CN106951565A - Text classification method and text classifier obtained thereby - Google Patents

Text classification method and text classifier obtained thereby

Info

Publication number
CN106951565A
CN106951565A (application CN201710216502.1A)
Authority
CN
China
Prior art keywords
concept
text
corpus
classifier
collection
Prior art date
Legal status
Granted
Application number
CN201710216502.1A
Other languages
Chinese (zh)
Other versions
CN106951565B (en)
Inventor
贾宁
夏磊
Current Assignee
Number Library (Shanghai) Technology Co Ltd
Original Assignee
Number Library (Shanghai) Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Number Library (Shanghai) Technology Co Ltd
Priority to CN201710216502.1A
Publication of CN106951565A
Application granted
Publication of CN106951565B
Status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification


Abstract

The technical solution of the present invention discloses a method for obtaining a text classifier for automatically labeling corpus texts, and the text classifier so obtained. The method includes: determining a concept set; matching unlabeled texts against the concept keywords in the concept keyword set corresponding to each concept, and labeling the texts automatically; and, for each concept, when the number of texts in the labeled corpus text collection corresponding to the concept meets a threshold condition, training a text classification model for the concept to obtain its text classifier, finally obtaining the set of text classifiers for all concepts whose text counts meet the threshold condition. The method of the present invention provides a general algorithmic framework in which the classification scheme can be changed flexibly, saving computation time and resources; furthermore, the present invention needs only a small amount of initial corpus text and labels automatically, without manual annotation, further saving time and cost.

Description

Text classification method and text classifier obtained thereby
Technical field
The present invention relates to the field of text classification in artificial intelligence, and in particular to a text classification method and the text classifier obtained thereby.
Background technology
With the rapid development of network technology, the demand for effective organization and management of electronic text information, and for finding relevant information from it quickly, accurately, and comprehensively, keeps growing. Text classification, as a key technology for processing and organizing large volumes of text data, largely solves the problem of information clutter and helps users locate the information they need accurately; it is the technical foundation of fields such as information filtering, information retrieval, search engines, and text databases. Text classification generally comprises processes such as text representation, the selection and training of a text classifier, and the evaluation of and feedback on classification results. Existing text classification techniques are generally implemented in the following steps: (1) determine the classification scheme; (2) collect texts to be labeled, forming a corpus; (3) train the classification models on the corpus; (4) classify new texts with the trained classification models.
However, consider the following application scenarios: 1) the classification scheme is unbalanced, with some categories very large and others very small, for example one category being "automobile industry" and another "lithium battery"; 2) large amounts of unlabeled text are easy to obtain, but labeled corpus texts are costly to acquire; 3) the classification scheme changes frequently, with categories often added or removed. For such scenarios, existing text classification techniques have the following defects: (1) Regarding the classification scheme, the prior art uses a fixed scheme that cannot be changed at will once determined. Changing the scheme requires collecting corpus texts anew and retraining the classification models of all categories, a process that consumes substantial time and computing resources. (2) Regarding the corpus, the prior art requires category-labeled texts of a certain scale as training corpus, and producing such labeled texts generally requires a great deal of manual annotation first, which adds input cost.
Summary of the invention
In view of the defects of the prior art, namely that the classification scheme of a text classification technique cannot be changed at will, that changing the scheme consumes considerable time and computing resources, and that the corpus must be labeled manually at high cost and over a long time, the present invention aims to provide a text classification method that can change the classification scheme flexibly and label texts automatically, saving computing resources, time, and cost.
The technical solution achieving the above purpose is as follows:

The method of the present invention for obtaining a text classifier for automatically labeled corpus texts includes:

a concept determination step I: determining a concept set, where each concept in the concept set corresponds to a concept keyword set containing at least one concept keyword;

a corpus auto-labeling step II: using the concept keywords in the keyword set corresponding to each concept in the concept set to match against an unlabeled text collection, associating texts in the unlabeled collection with the corresponding concepts and labeling them automatically, obtaining a labeled corpus text collection;

the "labeled corpus text collection" contains the following two classes of text. Class 1: texts associated with some concept in the concept set and labeled with that concept. Class 2: texts associated with no concept in the concept set; since such texts are not associated with any determined concept, they may be labeled "other", left unlabeled, or labeled with any word that cannot be confused with a concept word; any marking that clearly distinguishes them from Class 1 texts will do;

a classification model training step III: for each concept in the concept set, when the number of texts in the labeled corpus text collection associated with the concept meets a threshold condition, training a text classification model for the concept, with the texts in the labeled corpus collection associated with the concept as positive examples and the texts not in that collection as negative examples; training on these positive and negative examples yields the text classifier corresponding to the concept, and finally the text classifiers corresponding to all concepts whose text counts meet the threshold condition are obtained.
In a preferred embodiment of the present invention, in concept determination step I, the concept set X is determined to consist of concepts x_i, where i = 1, 2, 3, ..., n, and each concept x_i in the concept set X corresponds to a concept keyword set Y_i composed of at least one concept keyword.
In a preferred embodiment of the present invention, corpus auto-labeling step II includes:

Step II 1: collect a sufficient number N of unlabeled texts according to the actual application, denoting the unlabeled text collection D = {d_j}, where j = 1, 2, ..., N;

Step II 2: use the concept keywords in the keyword set Y_i corresponding to each concept x_i to match against each text in the unlabeled text collection D; when the match between some text d_j in D and the concept keywords of x_i satisfies the matching condition, label d_j as associated with x_i. Matching every text in D in this way yields the labeled corpus text collection C.
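As an illustration, a minimal Python sketch of step II (all names are hypothetical; the matching condition used here anticipates the paragraph-ratio test of the preferred embodiment described further below):

```python
def matches_concept(text, keywords, ratio=0.25):
    """Matching condition of the preferred embodiment: more than
    `ratio` of the text's paragraphs contain a concept keyword."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    if not paragraphs:
        return False
    hits = sum(1 for p in paragraphs if any(kw in p for kw in keywords))
    return hits / len(paragraphs) > ratio

def auto_label(unlabeled_texts, concept_keywords):
    """Step II: label each text d_j with every concept x_i whose keyword
    set Y_i satisfies the matching condition; texts matching no concept
    are labeled 'other' (they could equally be left unlabeled)."""
    labeled = []
    for text in unlabeled_texts:
        concepts = [c for c, kws in concept_keywords.items()
                    if matches_concept(text, kws)]
        labeled.append((text, concepts or ["other"]))
    return labeled
```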
In a preferred embodiment of the present invention, classification model training step III includes:

Step III 1: divide the concept set X into two mutually complementary concept subsets X_A and X_B. The division principle is: if the number of labeled texts in C associated with concept x_i is greater than or equal to a threshold α, assign x_i to X_A and denote the labeled corpus texts associated with x_i as A_i; if it is less than α, assign x_i to X_B;

Step III 2: use the labeled corpus texts A_i corresponding to concept x_i in X_A as positive examples for training the classification model, and randomly draw k texts from C that do not belong to A_i, denoted A_i', as negative examples;

Step III 3: train the text classifier corresponding to x_i on A_i and A_i' using a naive Bayes, support vector machine, or logistic regression text classification model, denoting it m_i; the set of text classifiers trained for the concepts x_i in X_A is denoted M_0 (a sketch follows).
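A minimal sketch of steps III 1 to III 3 with scikit-learn (an assumption: the patent names naive Bayes, support vector machines, or logistic regression but fixes no library; for Chinese text the vectorizer would additionally need a tokenizer such as jieba):

```python
import random
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_concept_classifier(positives, corpus_texts, k=1000, seed=0):
    """Train m_i for one concept: A_i (positives) against k texts drawn
    at random from C but outside A_i (the negatives, A_i')."""
    pool = [t for t in corpus_texts if t not in set(positives)]
    random.Random(seed).shuffle(pool)
    negatives = pool[:k]
    X = positives + negatives
    y = [1] * len(positives) + [0] * len(negatives)
    # Calibration wraps the linear SVM so it can output the probability
    # P(x_i | text) that the later self-training steps rely on.
    model = make_pipeline(TfidfVectorizer(),
                          CalibratedClassifierCV(LinearSVC()))
    model.fit(X, y)
    return model
```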
In a preferred embodiment of the present invention, classification model training step III further includes:

Step III 4: use the classifiers m_i in M_0 corresponding to each concept x_i to classify (that is, to associate and label) the texts in the unlabeled text collection D, obtaining text classification results; these results are stored separately and do not affect the labeled corpus text collection C;

Step III 5: for each concept x_i in X_A, use m_i to compute the probability that a text corresponds to x_i; select from the classification results the texts whose probability of corresponding to x_i exceeds a threshold β, and add them to the labeled corpus texts A_i corresponding to x_i, forming a new labeled corpus set A_i;

Step III 6: for the new labeled corpus set A_i, repeat steps III 2 to III 5 two to ten times to obtain a satisfactory classifier m_i for x_i, thereby obtaining the final satisfactory text classifier set M; or, for the new labeled corpus set A_i, perform a manual matching assessment to obtain a satisfactory classifier m_i for x_i and thus the final satisfactory set M (a sketch of the loop follows).
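Steps III 4 to III 6 form a self-training loop; a sketch continuing the previous ones (β in 0.1-0.5 and 2-10 rounds, per the preferred embodiments below):

```python
def expand_corpus(model, unlabeled_texts, positives, beta=0.3):
    """Steps III 4-III 5: score the unlabeled texts with m_i and add
    those whose probability of the concept exceeds beta to A_i."""
    probs = model.predict_proba(unlabeled_texts)[:, 1]
    seen = set(positives)
    return positives + [t for t, p in zip(unlabeled_texts, probs)
                        if p > beta and t not in seen]

def bootstrap(positives, corpus_texts, unlabeled_texts, rounds=5, beta=0.3):
    """Step III 6: repeat train-and-expand for a fixed number of rounds."""
    model = None
    for _ in range(rounds):
        model = train_concept_classifier(positives, corpus_texts)
        positives = expand_corpus(model, unlabeled_texts, positives, beta)
    return model, positives
```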
In a preferred embodiment of the present invention, in classification model training step III 3: a support vector machine text classification model is used to train on A_i and A_i' the text classifier for concept x_i.

In a preferred embodiment of the present invention, in classification model training step III 5: the threshold β takes a value in the range 0.1-0.5; and in step III 6, steps III 2 to III 5 are repeated five to ten times.
In a preferred embodiment of the present invention, in classification model training step III 6:

The manual matching assessment means: for concept x_i, randomly draw some texts from the labeled corpus set A_i and some texts from the labeled corpus collection C that are not associated with x_i, manually re-label all drawn texts K, and obtain a reference classification. Each time classifier m_i is trained in step III 3, use m_i to classify all drawn texts K separately, obtaining a provisional classification: that is, use the classifier m_i corresponding to x_i to compute, for every text in K, the probability of its association with x_i, and if the probability exceeds the threshold β, label the text as associated with x_i. Compare the provisional classification against the reference classification and compute the accuracy of the provisional classification; when the accuracy is greater than or equal to a threshold γ, classifier m_i is a satisfactory text classifier.

When the accuracy is below γ, re-run concept determination step I, i.e. redefine at least one new concept keyword for x_i, forming a new concept keyword set Y_i, and/or redefine the matching condition of step II 2. When concept determination step I is re-run, perform step II 2 with the new keyword set Y_i to obtain a new labeled corpus collection C; apply step III 1 to C to obtain new concept subsets X_A and X_B; then continue steps III 2 to III 6 on the new X_A and X_B until the accuracy of the provisional classification of m_i reaches γ, at which point m_i is a satisfactory text classifier. When only the matching condition of step II 2 is redefined, restart from step II 2 until m_i is a satisfactory text classifier.
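A sketch of this acceptance test, continuing the earlier sketches (manual_labels is the list of booleans produced by manually labeling the drawn sample K; γ is typically in 0.8-1, per the embodiment below):

```python
def manual_matching_assessment(model, sample_texts, manual_labels,
                               beta=0.3, gamma=0.9):
    """Compare the provisional labels produced by m_i on the sample K
    against the manual reference labels; accept when accuracy >= gamma."""
    probs = model.predict_proba(sample_texts)[:, 1]
    provisional = [p > beta for p in probs]
    accuracy = (sum(a == b for a, b in zip(provisional, manual_labels))
                / len(sample_texts))
    return accuracy, accuracy >= gamma
```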
In a preferred embodiment of the present invention, the method further includes a concept association step IV:

Concept association step IV: use the classifier m_i for each concept x_i in the text classifier set M to classify a text d; at the same time, use the concept keywords in the keyword sets corresponding to the concepts x_i in X_B to match against d, obtaining the final association between d and each concept x_i in the concept set X.

In a preferred embodiment of the present invention, concept association step IV specifically includes:

Step IV 1: use each text classifier m_i in M to classify text d and compute the probability that d corresponds to x_i; if this probability exceeds the set threshold β, label d as associated with concept x_i in X_A;

Step IV 2: at the same time, use the concept keywords in the keyword set Y_i corresponding to each concept x_i in X_B to match against d; when the matching condition is satisfied, label d as associated with concept x_i in X_B. This yields the final association between d and each concept x_i in X.
In a preferred embodiment of the present invention, the matching condition specifically means:

Judge whether more than 25%, and preferably more than 30%, of the paragraphs of the text contain a concept keyword from the keyword set Y_i corresponding to x_i; if so, label the text as associated with x_i; if not, do not label the text as associated with x_i.
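Concretely, reusing the matches_concept helper sketched earlier (the text and keywords are invented for illustration):

```python
doc = "\n".join(
    ["Announcement on the waste gas treatment facility upgrade."] * 3
    + ["Unrelated commentary on quarterly earnings."] * 7)
# 3 of 10 paragraphs contain a keyword: 0.3 > 0.25, so associated.
print(matches_concept(doc, {"waste gas", "waste gas treatment"}))  # True
```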
In a preferred embodiment of the present invention, the method further includes a corpus update step V:

Corpus update step V: add the final associations between texts d and the concepts x_i in the concept set X to the labeled corpus collection C, and periodically remove older labeled corpus texts from C according to a removal policy, obtaining an updated labeled corpus collection C.

In a preferred embodiment of the present invention, the removal policy in corpus update step V means: keep the number of texts corresponding to each concept x_i in C between tens and hundreds, removing older texts when a concept's count exceeds the hundreds; and keep the number of texts associated with no concept x_i between thousands and tens of thousands, removing older texts when the count exceeds the tens of thousands.
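One simple realization of this policy is a bounded first-in-first-out pool per concept (a sketch; the caps are illustrative, since the patent specifies only orders of magnitude):

```python
from collections import deque

concept_pool = deque(maxlen=500)    # texts labeled with one concept x_i
other_pool = deque(maxlen=50000)    # texts associated with no concept

# Appending beyond maxlen silently evicts the oldest text first.
concept_pool.append("newly labeled text ...")
other_pool.append("newly collected unassociated text ...")
```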
In a preferred embodiment of the present invention, the method further includes a classifier update step VI:

Repeat classification model training step III on the updated labeled corpus collection C to obtain an updated text classifier set M.

In a preferred embodiment of the present invention, the method further includes a concept addition step VII:

Step VII 1: after adding some new concept x_p, take the added concept set X_P = {x_p} and the concept keyword set Y_p corresponding to x_p;

Step VII 2: perform corpus auto-labeling step II 2 on X_P = {x_p}, then divide X_P into two mutually complementary concept subsets X_PA and X_PB according to the division principle of classification model training step III 1, and perform classification model training steps III 2 to III 6;

Judge whether the number of texts in the labeled corpus collection C_p corresponding to x_p is greater than or equal to the threshold α; if so, assign x_p to concept subset X_A and add the trained satisfactory text classifier m_p to the text classifier set M; if not, assign x_p to concept subset X_B (see the sketch after this step).
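A sketch of step VII built on the earlier helpers, showing why adding a concept leaves the existing classifiers untouched (α = 100 follows the preferred embodiment below):

```python
def add_concept(name, keywords, concept_keywords, corpus_texts,
                unlabeled_texts, classifiers, alpha=100):
    """Step VII: auto-label texts for the new concept x_p only, and train
    a classifier m_p if its labeled texts reach the threshold alpha."""
    concept_keywords[name] = set(keywords)
    positives = [t for t in unlabeled_texts if matches_concept(t, keywords)]
    if len(positives) >= alpha:
        model, _ = bootstrap(positives, corpus_texts, unlabeled_texts)
        classifiers[name] = model  # x_p joins X_A; m_p joins M
    # otherwise x_p stays in X_B and is handled by keyword matching alone
    return classifiers
```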
In a preferred embodiment of the present invention, the method further includes a concept association step VIII after concept addition: use the classifier m_p for each concept x_p in the text classifier set M obtained after concept addition step VII to classify a text d; at the same time, use the concept keywords in the keyword sets corresponding to the concepts x_p in X_B to match against d, obtaining the final association between d and each concept x_p in the concept set X.

In a preferred embodiment of the present invention, in step II 2, when the matches between some text d_j in the unlabeled text collection D and the concept keywords of multiple concepts x_i all satisfy the matching condition, d_j is labeled as associated with all of those concepts.

In a preferred embodiment of the present invention, in step II 2, when the match between some text d_j in the unlabeled text collection D and the concept keywords of every concept x_i fails the matching condition, d_j is labeled "other" or left unlabeled.

In a preferred embodiment of the present invention, in the manual labeling procedure, when the match between some text and the concept keywords of every concept fails the matching condition, that text is labeled "other" or left unlabeled.

In a preferred embodiment of the present invention, after step IV 2, when the match between text d and the concept keywords of every concept x_i fails the matching condition, d is labeled "other" or left unlabeled, yielding the final association between d and each concept x_i in X.

In a preferred embodiment, the present invention also includes the text classifier for automatically labeled corpus texts so obtained.

In a preferred embodiment, the present invention also includes the labeled corpus text collection so obtained.
In a text classification method of the present invention, the classifier m_i for each concept x_i in the text classifier set M obtained by the above method is used to classify a text d; at the same time, the concept keywords in the keyword sets corresponding to the concepts x_i in X_B are used to match against and classify d, obtaining the final classification of d against each concept x_i in the concept set X.

A text classification method of the present invention specifically includes:

a first step: use each text classifier m_i in M to classify text d and compute the probability that d corresponds to x_i; if this probability exceeds the set threshold β, classify d as associated with concept x_i in X_A;

a second step: at the same time, use the concept keywords in the keyword set Y_i corresponding to each concept x_i in X_B to match against d; when the matching condition is satisfied, associate d with concept x_i in X_B.
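Putting the two branches together, a sketch of this final classification over the helpers above:

```python
def classify_text(text, classifiers, keyword_only_concepts, beta=0.3):
    """Concepts in X_A are decided by classifier probability; concepts
    in X_B fall back to keyword matching; 'other' if nothing matches."""
    labels = [c for c, m in classifiers.items()
              if m.predict_proba([text])[0, 1] > beta]
    labels += [c for c, kws in keyword_only_concepts.items()
               if matches_concept(text, kws)]
    return labels or ["other"]
```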
The term "concept" denotes a characteristic shared by the business operations of several listed companies. Categories of concepts include: (1) policies, such as "the Belt and Road"; (2) industries, such as "food and beverage"; (3) products, such as "drone"; (4) characteristics of the financial sector, such as "sub-new stock"; and other content. Concepts include but are not limited to these categories. The term "concept keyword" refers to the several words corresponding to each concept that represent its features; if a passage mentions a concept keyword, the passage is taken to be about the corresponding concept. Texts include but are not limited to news, titles and bodies of research reports by securities institutions, announcements of listed companies, and the like. The phrase "a text is associated with a concept" means that when the content of a text is substantially related to the content of some concept, the text is said to be associated with that concept. A text may be associated with a concept or with none; it may be associated with one concept or with several. From the perspective of text classification, a concept can be regarded as a category, and a text associated with a concept can be regarded as belonging to that category. The term "labeling" means marking a text with its corresponding concept; labeling may be done manually or automatically by algorithm. The term "unlabeled text collection" refers to a collection of texts not yet labeled with their corresponding concepts; the term "labeled corpus text collection" refers to a collection of texts labeled with their corresponding concepts.
The positive effects of the present invention are:

The method of the present invention provides a general algorithmic framework in which the classification scheme can be changed flexibly: for example, the labeled corpus text collection and the text classifiers can be updated periodically, and when a concept must be added, the new concept is simply added to the concept set, corpus auto-labeling and classifier training are carried out for the new concept alone, and the text classifier corresponding to the new concept is obtained, thereby saving computation time and resources. Moreover, the present invention requires only a small amount of initial corpus text and labels automatically, without manual annotation, further saving time and cost.
Brief description of the drawings

Fig. 1 is the flow chart of training a text classifier according to the present invention;

Fig. 2 is the flow chart of the text classification method of the present invention.

Detailed description of the embodiments

Preferred embodiments are given below, with reference to Fig. 1 and Fig. 2, to illustrate more clearly and completely the implementation of the method of the present invention for obtaining a text classifier for automatically labeled corpus texts, and of the text classifier.
Step A, the concept determination process, includes:

The concept set X consists of concepts x_i, where i = 1, 2, 3, ..., n, and each concept x_i in X corresponds to a concept keyword set Y_i composed of at least one concept keyword. A text may be associated with one or more concepts x_i, or with no concept x_i at all. If a text has substantial content related to some concept x_i in X, the text is associated with x_i; if the content of a text is unrelated to every concept x_i in X, the text is associated with no concept. From the perspective of text classification, a concept can be regarded as a category: if a text is associated with a concept, the text can be regarded as belonging to the category of that concept. Determining which concepts in the concept set a text is associated with thus determines into which concept categories the text is classified.
Step B, the corpus auto-labeling process, includes:

Step B1: collect a sufficient number N of unlabeled texts according to the actual application, denoting the unlabeled text collection D = {d_j}, where j = 1, 2, ..., N.

Step B2: use the concept keywords in the keyword set Y_i corresponding to each concept x_i to match against each text in the unlabeled text collection D, and judge whether the match between some text d_j in D and the concept keywords of x_i satisfies the matching condition. In a preferred embodiment of the present invention, the matching condition is to judge whether more than 25%, preferably more than 30%, of the paragraphs of d_j contain a concept keyword from the keyword set Y_i corresponding to x_i; if so, d_j can be associated with x_i; if not, d_j cannot be associated with x_i. When the match between some text d_j in D and the concept keywords of x_i satisfies the matching condition, d_j is labeled as a text associated with x_i. Matching every text in D yields the labeled corpus text collection C.

When the matches between some text d_j in D and the concept keywords of multiple concepts x_i all satisfy the matching condition, d_j is labeled as associated with all of those concepts. When the match between some text d_j in D and the concept keywords of every concept x_i fails the matching condition, d_j is labeled "other" or left unlabeled.
The principal factors affecting the matching condition include: the number of concept keywords from the keyword set Y_i of concept x_i occurring in the text; the ratio of that number to the total number of words in the text; the number of sentences in the text in which a concept keyword from Y_i occurs; the ratio of that sentence count to the total number of sentences in the text; the number of paragraphs in the text in which a concept keyword from Y_i occurs; and the ratio of that paragraph count to the total number of paragraphs in the text. A sketch of these statistics follows.
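These statistics can be computed directly; a sketch (the sentence splitting here is naive and for illustration only; Chinese text would split on 。 instead):

```python
def match_statistics(text, keywords):
    """Candidate statistics behind the matching condition: keyword
    occurrences, and counts/ratios of words, sentences, and paragraphs
    that contain a concept keyword."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    sentences = [s for s in text.replace("\n", " ").split(".") if s.strip()]
    words = text.split()
    kw_hits = sum(text.count(kw) for kw in keywords)
    kw_sents = sum(1 for s in sentences if any(kw in s for kw in keywords))
    kw_paras = sum(1 for p in paragraphs if any(kw in p for kw in keywords))
    return {
        "keyword_occurrences": kw_hits,
        "keyword_to_word_ratio": kw_hits / max(len(words), 1),
        "matching_sentences": kw_sents,
        "matching_sentence_ratio": kw_sents / max(len(sentences), 1),
        "matching_paragraphs": kw_paras,
        "matching_paragraph_ratio": kw_paras / max(len(paragraphs), 1),
    }
```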
Step C, the classification model training process, includes:

Step C1: divide the concept set X into two mutually complementary concept subsets X_A and X_B. The division principle is: if the number of labeled texts in C associated with concept x_i is greater than or equal to a threshold α, assign x_i to X_A and denote the labeled corpus texts associated with x_i as A_i; if it is less than α, assign x_i to X_B. In a preferred embodiment, α = 100.

Step C2: use the texts in the labeled corpus A_i corresponding to concept x_i in X_A as positive examples for training the classification model, and randomly draw k texts from C that do not belong to A_i, denoted A_i', as negative examples.

Step C3: train the text classifier corresponding to x_i on A_i and A_i' using a text classification model such as naive Bayes, support vector machine, or logistic regression, denoting it m_i. The set of classifiers trained for the concepts x_i in X_A is denoted M_0. In a preferred embodiment of the present invention, a support vector machine text classification model (see: Yuan, G., Ho, C., Lin, C.: Recent Advances of Large-scale Linear Classification. Proc. IEEE 100(9), 2584-2603 (2012)) is used to train on A_i and A_i' the text classifier for concept x_i.

Step C4: use the classifiers m_i in M_0 corresponding to each concept x_i to classify the texts in the unlabeled text collection D, obtaining text classification results; these results are stored separately and do not affect C.

Step C5: for each concept x_i in X_A, use m_i to compute the probability that a text corresponds to x_i; select from the classification results the texts whose probability exceeds a threshold β, and add them to the labeled corpus A_i corresponding to x_i, forming a new labeled corpus set A_i. Here β takes a value in the range 0.1-0.5.

Step C6: for the new labeled corpus set A_i, repeat steps C2 to C5 two to ten times to obtain a satisfactory classifier m_i for x_i, thereby obtaining the final satisfactory text classifier set M. In a preferred embodiment, steps C2 to C5 are repeated five to ten times. Alternatively, for the new labeled corpus set A_i, a manual matching assessment is performed to obtain a satisfactory classifier m_i for x_i and thus the final satisfactory set M.
Here the manual matching assessment means: for concept x_i, randomly draw some texts from the labeled corpus set A_i and some texts from the labeled corpus collection C that are not associated with x_i, manually re-label all drawn texts K, and obtain a reference classification. Each time classifier m_i is trained in step C3, use m_i to classify all drawn texts K separately, obtaining a provisional classification: that is, use the classifier m_i corresponding to x_i to compute, for every text in K, the probability of its association with x_i, and if the probability exceeds the threshold β, label the text as associated with x_i. Compare the provisional classification against the reference classification and compute the accuracy of the provisional classification; when the accuracy is greater than or equal to a threshold γ (typically γ is in the range 0.8-1), classifier m_i is a satisfactory text classifier.

When the accuracy is below γ, re-run the concept determination process (step A), i.e. redefine at least one new concept keyword for x_i, forming a new concept keyword set Y_i, and/or redefine the matching condition of step B2. When concept determination is re-run, perform step B2 with the new keyword set Y_i to obtain a new labeled corpus collection C; apply step C1 to C to obtain new concept subsets X_A and X_B; then continue steps C2 to C6 on the new X_A and X_B until the accuracy of the provisional classification of m_i reaches γ, at which point m_i is a satisfactory text classifier. When only the matching condition of step B2 is redefined, restart from step B2 until m_i is a satisfactory text classifier.
Step D, the concept association process, includes:

Step D1: use the classifier m_i for each concept x_i in the text classifier set M to classify a text d, and use m_i to compute the probability that d corresponds to x_i; if this probability exceeds the set threshold β, label d as associated with concept x_i in X_A;

Step D2: at the same time, use the concept keywords in the keyword set Y_i corresponding to each concept x_i in X_B to match against d; when the matching condition is satisfied, label d as associated with concept x_i in X_B. This yields the final association between d and each concept x_i in X. When the match between d and the concept keywords of every concept x_i fails the matching condition, d is labeled "other" or left unlabeled, and the final association between d and each concept x_i in X is obtained.
Step E, updating the labeled corpus text collection, includes:

Add the final associations between texts d and the concepts x_i in X to the labeled corpus collection C, and periodically remove older labeled corpus texts from C according to a removal policy, obtaining an updated labeled corpus collection C. The removal policy means: keep the number of texts corresponding to each concept x_i in C between tens and hundreds, removing older texts when a concept's count exceeds the hundreds; and keep the number of texts associated with no concept x_i between thousands and tens of thousands, removing older texts when the count exceeds the tens of thousands.

Step F, the classifier update process, includes:

Repeating the classification model training process (step C) on the updated labeled corpus collection C to obtain an updated text classifier set M.
Step G, the concept addition process, includes:

Step G1: after adding some new concept x_p, take the added concept set X_P = {x_p} and the concept keyword set Y_p corresponding to x_p.

Step G2: perform corpus auto-labeling on X_P = {x_p}, then divide X_P into two mutually complementary concept subsets X_PA and X_PB according to the division principle of step C1, and perform classification model training steps C2 to C6. Judge whether the number of texts in the labeled corpus collection C_p corresponding to x_p is greater than or equal to the threshold α; if so, assign x_p to concept subset X_A and add the trained satisfactory text classifier m_p to the text classifier set M; if not, assign x_p to concept subset X_B.

The concept association process after concept addition includes: using the classifier m_p for each concept x_p in the text classifier set M obtained after the concept addition process to classify a text d; at the same time, using the concept keywords in the keyword sets corresponding to the concepts x_p in X_B to match against d, obtaining the final association between d and each concept x_p in X.
The implementation of the method of the present invention for obtaining a text classifier for automatically labeled corpus texts is further described below through a specific example:
Step S1, determining the concepts:

A concept set X = {x_1, x_2, x_3} containing three concepts is determined, with x_1 = waste gas treatment, x_2 = augmented reality, and x_3 = vanadium battery. The concept keyword sets Y_1, Y_2, Y_3 corresponding to x_1, x_2, x_3 are respectively:

Y_1 = {waste gas, waste gas treatment, exhaust gas treatment, organic waste gas, industrial waste gas, waste gas purification};

Y_2 = {augmented reality, ar};

Y_3 = {vanadium battery}.
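In code, the concept set of this example looks as follows (keywords are rendered in English here, as in this translation; in the original they are Chinese, and news_texts below stands for the 5000 collected news items):

```python
concept_keywords = {
    "waste gas treatment": {"waste gas", "waste gas treatment",
                            "exhaust gas treatment", "organic waste gas",
                            "industrial waste gas", "waste gas purification"},
    "augmented reality": {"augmented reality", "ar"},
    "vanadium battery": {"vanadium battery"},
}

# Step S2 then reduces to one call of the earlier sketch:
# labeled = auto_label(news_texts, concept_keywords)
```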
Step S2, automatic corpus labeling:

Step S21: collect 5000 news texts, forming the unlabeled text collection D = {d_j}, where j = 1, 2, ..., 5000;

Step S22: use the keyword sets Y_1, Y_2, Y_3 corresponding to the concepts x_1, x_2, x_3 to match against each text in D. When the match between a text d_j and the concept keywords Y_i of some concept x_i satisfies the matching condition, that is, when more than 25%, preferably more than 30%, of the paragraphs of d_j contain a concept keyword from Y_i, then d_j can be associated with x_i; otherwise d_j cannot be associated with x_i. When the match between some text d_j in D and the concept keywords of x_i satisfies the matching condition, d_j is labeled as a text associated with x_i. Labeling every text in D yields the labeled corpus text collection C.
Step S3, classification model training:

Step S31: in the labeled corpus collection C, 208 texts are labeled as associated with concept x_1, 154 as associated with x_2, and 34 as associated with x_3;

Step S32: divide the concept set X into two mutually complementary concept subsets X_A and X_B. Taking the threshold α = 100, X_A = {x_1, x_2} and X_B = {x_3}. The labeled corpus texts associated with x_1 and x_2 are denoted A_1 and A_2, respectively.

Step S33: use the texts in the labeled corpora A_1 and A_2 corresponding to concepts x_1 and x_2 in X_A as positive examples for training the classification models, and randomly draw 1000 texts from C not belonging to A_1 or A_2 as negative examples, denoted A_1' and A_2'. Train the classifiers corresponding to x_1 and x_2 on A_1, A_1' and on A_2, A_2' using a support vector machine, denoting them m_1 and m_2. The classifier set for the concepts x_1, x_2 in X_A is denoted M_0.

Step S34: use the classifiers m_1 and m_2 in M_0 corresponding to x_1 and x_2 to classify the texts in the unlabeled collection D, obtaining text classification results.

Step S35: for x_1 and x_2 in X_A, use m_1 and m_2 to compute the probability that a text corresponds to x_1 or x_2; select from the classification results the texts whose probability of corresponding to x_1 or x_2 exceeds the threshold β, and add them to the labeled corpora A_1 and A_2 corresponding to x_1 and x_2, forming new labeled corpus sets A_1 and A_2.

Step S36: for the new labeled corpus sets A_1 and A_2, repeat steps S32 to S35 five times each, obtaining satisfactory classifiers m_1 and m_2 for x_1 and x_2 and thereby the final satisfactory text classifier set M.
Step S4, concept association processing:

Step S41: use the classifiers m_1 and m_2 for x_1 and x_2 in the text classifier set M to classify a text d, and use m_1 and m_2 to compute the probability that d corresponds to x_1 or x_2; if the probability exceeds the set threshold β, label d as associated with the corresponding concept x_1 or x_2 in X_A;

Step S42: at the same time, use the concept keywords in the keyword set Y_3 corresponding to concept x_3 in X_B to match against d; when the matching condition is satisfied, label d as associated with concept x_3 in X_B;

Step S43: combining steps S41 and S42 gives the concept association result for d, that is, the final association between d and each concept x_1, x_2, x_3 in the concept set X.

Claims (24)

1. A method for obtaining a text classifier for automatically labeled corpus texts, characterized in that the method includes:

a concept determination step I: determining a concept set, each concept in the concept set corresponding to a concept keyword set containing at least one concept keyword;

a corpus auto-labeling step II: using the concept keywords in the keyword set corresponding to each concept in the concept set to match against an unlabeled text collection, associating texts in the unlabeled collection with the corresponding concepts and labeling them automatically, obtaining a labeled corpus text collection;

a classification model training step III: for each concept in the concept set, when the number of texts in the labeled corpus text collection associated with the concept meets a threshold condition, using the labeled corpus texts associated with the concept as positive examples of a text classification model and labeled corpus texts not associated with the concept as negative examples, training on the positive and negative examples of the text classification model corresponding to the concept to obtain the text classifier corresponding to the concept, and finally obtaining the text classifiers corresponding to all concepts whose text counts meet the threshold condition.
2. The method as claimed in claim 1, characterized in that in concept determination step I, the concept set X is determined to consist of concepts x_i, where i = 1, 2, 3, ..., n, and each concept x_i in X corresponds to a concept keyword set Y_i composed of at least one concept keyword.

3. The method as claimed in claim 2, characterized in that corpus auto-labeling step II includes:

step II 1: collecting a sufficient number N of unlabeled texts according to the actual application, denoting the unlabeled text collection D = {d_j}, where j = 1, 2, ..., N;

step II 2: using the concept keywords in the keyword set Y_i corresponding to each concept x_i to match against each text in the unlabeled text collection D; when the match between some text d_j in D and the concept keywords of x_i satisfies the matching condition, labeling d_j as associated with x_i; matching every text in D yields the labeled corpus text collection C.
4. The method as claimed in claim 3, characterized in that classification model training step III includes:

step III 1: dividing the concept set X into two mutually complementary concept subsets X_A and X_B, the division principle being: if the number of labeled texts in C associated with concept x_i is greater than or equal to a threshold α, assigning x_i to X_A and denoting the labeled corpus texts associated with x_i as A_i; if it is less than α, assigning x_i to X_B;

step III 2: using the labeled corpus texts A_i corresponding to concept x_i in X_A as positive examples for training the classification model, and randomly drawing k texts from C that do not belong to A_i, denoted A_i', as negative examples;

step III 3: training the text classifier corresponding to x_i on A_i and A_i' using a naive Bayes, support vector machine, or logistic regression text classification model, denoting it m_i; the set of text classifiers trained for the concepts x_i in X_A being denoted M_0.
5. The method as claimed in claim 4, characterized in that classification model training step III further includes:

step III 4: using the classifiers m_i in M_0 corresponding to each concept x_i to classify, i.e. to associate and label, the texts in the unlabeled text collection D, obtaining text classification results that are stored separately;

step III 5: for each concept x_i in X_A, using m_i to compute the probability that a text corresponds to x_i, selecting from the classification results the texts whose probability of corresponding to x_i exceeds a threshold β, and adding them to the labeled corpus texts A_i corresponding to x_i, forming a new labeled corpus set A_i;

step III 6: for the new labeled corpus set A_i, repeating steps III 2 to III 5 two to ten times to obtain a satisfactory classifier m_i for x_i, thereby obtaining the final satisfactory text classifier set M; or, for the new labeled corpus set A_i, performing a manual matching assessment to obtain a satisfactory classifier m_i for x_i and thus the final satisfactory text classifier set M.
6. The method as claimed in claim 4, characterized in that in classification model training step III 3:

a support vector machine text classification model is used to train on A_i and A_i' the text classifier for concept x_i.

7. The method as claimed in claim 5, characterized in that in classification model training step III 5: the threshold β takes a value in the range 0.1-0.5; and in step III 6, steps III 2 to III 5 are repeated five to ten times.
8. The method as claimed in claim 5, characterized in that in classification model training step III 6:

the manual matching assessment means: for concept x_i, randomly drawing some texts from the labeled corpus set A_i and some texts from the labeled corpus collection C that are not associated with x_i, manually re-labeling all drawn texts K, and obtaining a reference classification; each time classifier m_i is trained in step III 3, using m_i to classify all drawn texts K separately to obtain a provisional classification, i.e. using the classifier m_i corresponding to x_i to compute, for every text in K, the probability of its association with x_i, and labeling the text as associated with x_i if the probability exceeds the threshold β; comparing the provisional classification against the reference classification and computing the accuracy of the provisional classification; when the accuracy is greater than or equal to a threshold γ, classifier m_i being a satisfactory text classifier;

when the accuracy is below γ, re-running concept determination step I, i.e. redefining at least one new concept keyword for x_i to form a new concept keyword set Y_i, and/or redefining the matching condition of step II 2; when concept determination step I is re-run, performing step II 2 with the new keyword set Y_i to obtain a new labeled corpus collection C, applying step III 1 to C to obtain new concept subsets X_A and X_B, and continuing steps III 2 to III 6 on the new X_A and X_B until the accuracy of the provisional classification of m_i reaches γ, at which point m_i is a satisfactory text classifier; when only the matching condition of step II 2 is redefined, restarting from step II 2 until m_i is a satisfactory text classifier.
9. The method as claimed in claim 8, characterized in that it further includes a concept association step IV:

the concept association step IV: using the classifier m_i for each concept x_i in the text classifier set M to classify a text d; at the same time, using the concept keywords in the keyword sets corresponding to the concepts x_i in X_B to match against d, obtaining the final association between d and each concept x_i in the concept set X.

10. The method as claimed in claim 9, characterized in that concept association step IV specifically includes:

step IV 1: using each text classifier m_i in M to classify text d and computing the probability that d corresponds to x_i; if this probability exceeds the set threshold β, labeling d as associated with concept x_i in X_A;

step IV 2: at the same time, using the concept keywords in the keyword set Y_i corresponding to each concept x_i in X_B to match against d; when the matching condition is satisfied, labeling d as associated with concept x_i in X_B; obtaining the final association between d and each concept x_i in X.
11. The method as claimed in claim 3, 8 or 10, characterized in that the matching condition specifically means: judging whether more than 25%, and preferably more than 30%, of the paragraphs of the text contain a concept keyword from the keyword set Y_i corresponding to x_i; if so, labeling the text as associated with x_i; if not, not labeling the text as associated with x_i.

12. The method as claimed in claim 9, characterized in that it further includes a corpus update step V:

the corpus update step V: adding the final associations between texts d and the concepts x_i in the concept set X to the labeled corpus collection C, and periodically removing older labeled corpus texts from C according to a removal policy, obtaining an updated labeled corpus collection C.

13. The method as claimed in claim 12, characterized in that the removal policy in corpus update step V means: keeping the number of texts corresponding to each concept x_i in C between tens and hundreds, removing older texts when a concept's count exceeds the hundreds; and keeping the number of texts associated with no concept x_i between thousands and tens of thousands, removing older texts when the count exceeds the tens of thousands.
14. The method as claimed in claim 12, characterized in that it further includes a classifier update step VI:

repeating classification model training step III on the updated labeled corpus collection C to obtain an updated text classifier set M.

15. The method as claimed in claim 14, characterized in that it further includes a concept addition step VII:

step VII 1: after adding some new concept x_p, taking the added concept set X_P = {x_p} and the concept keyword set Y_p corresponding to x_p;

step VII 2: performing corpus auto-labeling step II 2 on X_P = {x_p}, then dividing X_P into two mutually complementary concept subsets X_PA and X_PB according to the division principle of classification model training step III 1, and performing classification model training steps III 2 to III 6;

judging whether the number of texts in the labeled corpus collection C_p corresponding to x_p is greater than or equal to the threshold α; if so, assigning x_p to concept subset X_A and adding the trained satisfactory text classifier m_p to the text classifier set M; if not, assigning x_p to concept subset X_B.
16. method as claimed in claim 15, it is characterised in that the concept related step VIII after also being increased newly including concept:Profit Increased newly with concept in the text classifier set M obtained after step VII and be directed to each concept xpText classifier mpTo text d Carry out classification processing;Concept subclass X is utilized simultaneouslyBIn each concept xpConcept in corresponding concept keyword set is closed Keyword carries out matching treatment to text d, obtains each concept x in text d and concept set XpFinal association results.
17. method as claimed in claim 3, it is characterised in that in step II 2, when certain in un-annotated data text collection D One text djWith multiple concept xiWhen the match condition of corresponding concept keyword is satisfied by matching condition, then by this text djIt is labeled as multiple concept x corresponding with thisiHomogeneous association.
18. method as claimed in claim 3, it is characterised in that in step II 2, when certain in un-annotated data text collection D One text djWith any concept xiWhen the match condition of corresponding concept keyword is unsatisfactory for matching condition, then by this text This djIt is labeled as " other " or does not make any mark.
19. The method as claimed in claim 8, characterised in that in the manual annotation method, when the matches between some text and the concept keywords corresponding to every concept fail to satisfy the matching condition, that text is labelled as "other" or left unlabelled.
20. The method as claimed in claim 10, characterised in that after step IV-2, when the matches between text d and the concept keywords corresponding to every concept x_i fail to satisfy the matching condition, text d is labelled as "other" or left unlabelled, and the final association results of text d with each concept x_i in concept set X are obtained.
21. A text classifier for automatic corpus annotation, obtained by the method as claimed in any one of claims 1 to 8, 13 and 14.
22. An annotated corpus text collection obtained by the method as claimed in any one of claims 3, 9, 12, 14 and 15.
23. A text classification method, characterised in that: classification processing is performed on text d using the text classifier m_i for each concept x_i in the text classifier set M obtained by the method as claimed in any one of claims 1 to 8, 13 and 14; at the same time, matching and classification processing is performed on text d using the concept keywords in the concept keyword set corresponding to each concept x_i in concept subset X_B; the final classification results of text d with respect to each concept x_i in concept set X are thereby obtained.
24. The method as claimed in claim 23, characterised in that it specifically comprises:
The first step: classification processing is performed on text d using each text classifier m_i in the text classifier set M, and the probability that text d corresponds to concept x_i is calculated; if the probability that text d corresponds to concept x_i is greater than a set threshold β, text d is associated with that concept x_i in concept subset X_A;
The second step: matching processing is simultaneously performed on text d using the concept keywords in the concept keyword set Y_i corresponding to each concept x_i in concept subset X_B; when the matching condition is met, text d is associated with that concept x_i in concept subset X_B.
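Claims 23 and 24 (and likewise association step VIII of claim 16) describe a two-track decision: classifier probability against threshold β for concepts in X_A, keyword matching for concepts in X_B. A minimal sketch under the same assumptions as the sketches above, with β = 0.5 used only as a placeholder value:

```python
def classify(text, classifiers, keyword_sets_b, beta=0.5, min_hits=2):
    """Two-track classification per claims 23-24: probability threshold
    beta for concepts in X_A, keyword matching for concepts in X_B."""
    results = set()
    # First step: each classifier m_i scores text d for its concept x_i.
    for concept, model in classifiers.items():
        prob = model.predict_proba([text])[0][1]  # probability of class 1
        if prob > beta:
            results.add(concept)
    # Second step: keyword matching for the concepts left in subset X_B.
    for concept, kws in keyword_sets_b.items():
        if sum(kw in text for kw in kws) >= min_hits:
            results.add(concept)
    return results
```

Note that a text can pick up concepts from both tracks, which matches the claims' requirement that the two steps run at the same time.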
CN201710216502.1A 2017-04-05 2017-04-05 File classification method and the text classifier of acquisition Active CN106951565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710216502.1A CN106951565B (en) 2017-04-05 2017-04-05 File classification method and the text classifier of acquisition

Publications (2)

Publication Number Publication Date
CN106951565A true CN106951565A (en) 2017-07-14
CN106951565B CN106951565B (en) 2018-04-27

Family

ID=59475323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710216502.1A Active CN106951565B (en) 2017-04-05 2017-04-05 File classification method and the text classifier of acquisition

Country Status (1)

Country Link
CN (1) CN106951565B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023986A (en) * 2009-09-22 2011-04-20 日电(中国)有限公司 Method and equipment for constructing text classifier by referencing external knowledge
CN104298665A (en) * 2014-10-16 2015-01-21 苏州大学 Identification method and device of evaluation objects of Chinese texts
CN106202177A (en) * 2016-06-27 2016-12-07 腾讯科技(深圳)有限公司 File classification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiaoli Li et al.: "Learning to Classify Texts Using Positive and Unlabeled Data", Proceedings of the 18th International Joint Conference on Artificial Intelligence *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549723A (en) * 2018-04-28 2018-09-18 北京神州泰岳软件股份有限公司 Text concept classification method, device and server
CN108549723B (en) * 2018-04-28 2022-04-05 北京神州泰岳软件股份有限公司 Text concept classification method and device and server
CN109189941A (en) * 2018-09-07 2019-01-11 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for updating model parameters
CN109766440A (en) * 2018-12-17 2019-05-17 航天信息股份有限公司 Method and system for determining default classification information for object text description
CN109766440B (en) * 2018-12-17 2023-09-01 航天信息股份有限公司 Method and system for determining default classification information for object text description
CN111132056A (en) * 2019-08-07 2020-05-08 上海钧正网络科技有限公司 Battery management method and device, battery and server
CN111132056B (en) * 2019-08-07 2023-05-05 上海钧正网络科技有限公司 Battery management method and device, battery and server
CN111539209A (en) * 2020-04-15 2020-08-14 北京百度网讯科技有限公司 Method and apparatus for entity classification
CN111539209B (en) * 2020-04-15 2023-09-15 北京百度网讯科技有限公司 Method and apparatus for entity classification

Also Published As

Publication number Publication date
CN106951565B (en) 2018-04-27

Similar Documents

Publication Publication Date Title
CN106951565B (en) File classification method and the text classifier of acquisition
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN108717433A (en) A kind of construction of knowledge base method and device of programming-oriented field question answering system
CN106021410A (en) Source code annotation quality evaluation method based on machine learning
CN106021389A (en) System and method for automatically generating news based on template
CN106649260A (en) Product feature structure tree construction method based on comment text mining
CN105975555A (en) Enterprise abbreviation extraction method based on bidirectional recurrent neural network
CN102270212A (en) User interest feature extraction method based on hidden semi-Markov model
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN107004141A (en) To the efficient mark of large sample group
CN112307153A (en) Automatic construction method and device of industrial knowledge base and storage medium
CN108027814A (en) Disable word recognition method and device
CN110046225B (en) Scientific and technological project material integrity assessment decision model training method
CN112800232B (en) Case automatic classification method based on big data
CN113434688B (en) Data processing method and device for public opinion classification model training
CN110705283A (en) Deep learning method and system based on matching of text laws and regulations and judicial interpretations
CN116703328B (en) Project review method and system
Shanmugalingam et al. Corporate it-support help-desk process hybrid-automation solution with machine learning approach
CN114817575A (en) Large-scale electric power affair map processing method based on extended model
CN117520754B (en) Pretreatment system for model training data
CN112785156B (en) Industrial collar and sleeve identification method based on clustering and comprehensive evaluation
CN110362828B (en) Network information risk identification method and system
CN107220238A (en) A kind of text object abstracting method based on Mixed Weibull distribution
CN114281998B (en) Event labeling system construction method for multi-level labeling person based on crowdsourcing technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant