CN106951565A - Text classification method and obtained text classifier - Google Patents
- Publication number
- CN106951565A (application CN201710216502.1A)
- Authority
- CN
- China
- Prior art keywords
- concept
- text
- language material
- classifier
- collection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The technical solution of the present invention discloses a method of obtaining text classifiers for automatically labeling corpus texts, and the obtained text classifiers. The method includes: determining a concept set; matching unlabeled corpus texts against the concept keywords in the concept keyword set corresponding to each concept, and labeling the texts automatically; and, for each concept whose labeled corpus text collection contains a number of texts meeting a threshold condition, training a corresponding text classification model to obtain the concept's text classifier, finally obtaining the set of text classifiers for all concepts whose text counts meet the threshold condition. The method of the present invention provides a general algorithmic framework in which the classification system can be changed flexibly, saving computation time and resources; furthermore, the present invention needs only a small amount of initial corpus text and labels it automatically, without manual labeling, further saving time and cost.
Description
Technical field
The present invention relates to the field of text classification techniques in artificial intelligence, and in particular to a text classification method and an obtained text classifier.
Background art
With the rapid development of network technology, the requirements for organizing and managing electronic text information effectively, and for finding relevant information quickly, accurately and comprehensively, grow ever higher. Text classification, as a key technology for handling and organizing large amounts of text data, largely solves the problem of information clutter and helps users obtain the information they need; it is the technical foundation of fields such as information filtering, information retrieval, search engines and text databases. Text classification generally comprises processes such as text representation, selection and training of the text classifier, and evaluation of and feedback on the classification results. Existing text classification techniques are generally implemented in the following steps: (1) determine the classification system; (2) collect corpus texts to be labeled, forming a corpus; (3) train a classification model with the corpus; (4) classify new texts with the trained classification model.
However, for the following application scenarios: 1) an unbalanced classification system, where some categories are very large and others very small, for example one category being "automobile industry" and another "lithium battery"; 2) scenarios where large amounts of unlabeled corpus texts are easy to obtain but labeled corpus texts are costly to obtain; 3) scenarios where the classification system changes often, with categories frequently added or deleted; existing text classification techniques have the following defects: (1) regarding the classification system, the prior art uses a fixed classification system that cannot be changed casually once determined. If the classification system must be changed, corpus texts must be collected again and the classification models of all categories retrained, a process that requires substantial time and computing resources. (2) regarding the corpus, the prior art requires a certain scale of texts with labeled categories as training corpus, and obtaining such labeled texts generally requires a large amount of manual labeling first, which adds cost.
Summary of the invention
In view of the defects of the prior art, namely that the classification system of existing text classification techniques cannot be changed arbitrarily, that changing the classification system consumes much time and computing resources, and that corpus texts must be labeled manually at high cost and over a long time, the purpose of the present invention is to provide a text classification method that can change the classification system flexibly and label texts automatically, saving computing resources, time and cost.
The technical scheme for realizing the above purpose is as follows:
The present invention provides a method of obtaining text classifiers for automatically labeling corpus texts, the method comprising:
Concept determination step I: determine a concept set, each concept in the concept set corresponding to a concept keyword set having at least one concept keyword;
Corpus automatic labeling step II: use the concept keywords in the concept keyword set corresponding to each concept in the concept set to perform matching on the unlabeled corpus text collection, associate texts in the unlabeled corpus text collection with the corresponding concepts and label them automatically, obtaining the labeled corpus text collection;
The "labeled corpus text collection" includes the following two classes of texts. The first class: texts associated with some concept in the concept set and labeled with that concept. The second class: texts not associated with any concept in the concept set; since such texts are not associated with any determined concept, they may be labeled "other", left unlabeled, or labeled with any other word that cannot be confused with a concept word, as long as the label clearly distinguishes them from texts of the first class.
Classification model training step III: for each concept in the concept set, when the number of texts in the labeled corpus text collection related to that concept meets a threshold condition, train a text classification model for the concept, using the texts in the labeled corpus text collection related to the concept as positive examples of the text classification model and texts not in that collection as negative examples, and train the concept's text classification model on these positive and negative examples to obtain the concept's text classifier; finally obtain the text classifiers corresponding to all concepts whose text counts meet the threshold condition.
In a preferred embodiment of the present invention, in concept determination step I, the concept set X is determined to consist of concepts x_i, where i = 1, 2, 3, ..., n; each concept x_i in the concept set X corresponds to a concept keyword set Y_i composed of at least one concept keyword.
In a preferred embodiment of the present invention, corpus automatic labeling step II includes:

Step II.1: collect a sufficient number N of unlabeled corpus texts according to the specific practical situation; denote the unlabeled corpus text collection D = {d_j}, where j = 1, 2, ..., N;

Step II.2: use the concept keywords in the concept keyword set Y_i corresponding to each concept x_i to perform matching on every text in the unlabeled corpus text collection D. When the matching of a text d_j in D against the concept keywords of concept x_i satisfies the matching condition, label the text d_j as associated with concept x_i. Performing matching on every text in D yields the labeled corpus text collection C.
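Steps II.1 and II.2 can be sketched as follows. This is an illustrative sketch only, not part of the claimed embodiment: the matching condition is simplified here to "any keyword occurs in the text" (substring containment), whereas the preferred embodiment described later uses a paragraph-fraction condition.

```python
def auto_label(texts, concept_keywords):
    """Label each unlabeled text d_j with every concept x_i whose keywords match.

    concept_keywords: dict mapping concept name -> set of keywords Y_i.
    A text matches a concept here if any keyword occurs in it (a
    simplified stand-in for the patent's matching condition).
    Returns dict: text index j -> list of matched concepts (empty = "other").
    """
    labels = {}
    for j, text in enumerate(texts):
        matched = [c for c, kws in concept_keywords.items()
                   if any(kw in text for kw in kws)]
        labels[j] = matched  # a text may match several concepts, or none
    return labels
```

A text matching several concepts is labeled with all of them, and a text matching none is left with an empty label, mirroring the two classes of texts in the labeled corpus text collection.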
In a preferred embodiment of the present invention, classification model training step III includes:

Step III.1: divide the concept set X into two mutually complementary concept subsets X_A and X_B. The division principle is: if the number of labeled corpus texts in C associated with concept x_i is greater than or equal to a threshold α, assign x_i to subset X_A and denote the labeled corpus text collection associated with x_i as A_i; if it is less than α, assign x_i to subset X_B;

Step III.2: use the labeled corpus text collection A_i corresponding to concept x_i in subset X_A as positive examples for training the classification model; randomly extract from C k texts not belonging to A_i, denoted A_i', as negative examples for training the classification model;

Step III.3: using a Naive Bayes, support vector machine or logistic regression text classification model, train on A_i and A_i' the text classifier corresponding to concept x_i, denoted m_i. The set of text classifiers trained for each concept x_i in subset X_A is denoted M_0.
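Steps III.2 and III.3 can be sketched with a tiny bag-of-words Naive Bayes, one of the three model families the text names. This is a minimal sketch under assumed simplifications (whitespace tokenization, Laplace smoothing), not the patent's implementation:

```python
import math
from collections import Counter

def train_nb(pos_texts, neg_texts):
    """Train a binary classifier m_i for one concept x_i.

    pos_texts: labeled corpus A_i associated with the concept (positives);
    neg_texts: k texts sampled from C outside A_i (negatives, A_i').
    Returns a function text -> estimated P(concept | text).
    """
    def counts(texts):
        c = Counter()
        for t in texts:
            c.update(t.lower().split())
        return c

    pos_c, neg_c = counts(pos_texts), counts(neg_texts)
    vocab = set(pos_c) | set(neg_c)
    pos_total, neg_total = sum(pos_c.values()), sum(neg_c.values())
    log_prior = math.log(len(pos_texts)) - math.log(len(neg_texts))

    def predict(text):
        score = log_prior
        for w in text.lower().split():
            if w in vocab:  # Laplace-smoothed log-likelihood ratio per word
                score += math.log((pos_c[w] + 1) / (pos_total + len(vocab)))
                score -= math.log((neg_c[w] + 1) / (neg_total + len(vocab)))
        return 1 / (1 + math.exp(-score))  # squash to a probability
    return predict
```

A support vector machine or logistic regression model could be substituted for the trainer without changing the surrounding steps, which only require a text-to-probability function.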
In a preferred embodiment of the present invention, classification model training step III further includes:

Step III.4: use the text classifier m_i corresponding to each concept x_i in the classifier set M_0 to classify the texts in the unlabeled corpus text collection D, obtaining the corresponding text classification results; store the classification results separately, without affecting the labeled corpus text collection C;

Step III.5: for each concept x_i in subset X_A, use the classifier m_i to compute the probability that each text corresponds to concept x_i; select from the classification results the texts whose probability of corresponding to x_i exceeds a threshold β, and add them to the labeled corpus text collection A_i corresponding to x_i, forming a new labeled corpus collection A_i;

Step III.6: for the new labeled corpus collection A_i, repeat steps III.2 to III.5 two to ten times to obtain a satisfactory classifier m_i for concept x_i, thereby obtaining the final satisfactory classifier set M; alternatively, for the new labeled corpus collection A_i, obtain a satisfactory classifier m_i through manual matching assessment, thereby finally obtaining the satisfactory classifier set M.
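The loop of steps III.2 to III.5 is a self-training scheme: train, score the unlabeled texts, absorb the confident ones, retrain. A minimal sketch, assuming a trainer and negative sampler with the interfaces shown (both hypothetical names):

```python
def self_train(pos, candidates, sample_negatives, train, beta=0.3, rounds=5):
    """Iteratively grow the labeled set A_i (steps III.2 to III.5).

    pos: initial positives A_i; candidates: unlabeled texts from D;
    sample_negatives: callable pos -> list of negative texts A_i';
    train: callable (pos, neg) -> classifier (text -> probability);
    beta: confidence threshold for adding a candidate to A_i.
    Returns the last classifier m_i and the enlarged positive set.
    """
    pos = list(pos)
    clf = None
    for _ in range(rounds):
        clf = train(pos, sample_negatives(pos))
        added = [t for t in candidates
                 if t not in pos and clf(t) > beta]
        if not added:      # converged: no new confident texts this round
            break
        pos.extend(added)  # enlarge A_i with high-confidence texts
    return clf, pos
```

The fixed repetition count of two to ten rounds in step III.6 bounds the loop even when it does not converge on its own.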
In a preferred embodiment of the present invention, in classification model training step III.3: a support vector machine text classification model is used to train on A_i and A_i' the text classifier for concept x_i.
In a preferred embodiment of the present invention, in classification model training step III.5: the threshold β ranges from 0.1 to 0.5; in step III.6, steps III.2 to III.5 are repeated five to ten times.
In a preferred embodiment of the present invention, in classification model training step III.6:

Manual matching assessment means: for concept x_i, randomly select some texts from the labeled corpus collection A_i, then randomly select from the labeled corpus text collection C some texts not associated with x_i, and manually re-label all the extracted texts K, obtaining a standard classification result. Each time the classifier m_i is trained in step III.3, use m_i to classify all extracted texts K separately, obtaining a provisional classification result: use the classifier m_i corresponding to concept x_i to compute the probability that each text in K is associated with x_i, and if the probability exceeds the threshold β, label that text as associated with x_i. Compare the provisional classification result with the standard classification result and compute the accuracy of the provisional result. When the accuracy is greater than or equal to a threshold γ, the classifier m_i is a satisfactory text classifier.

When the accuracy is less than γ, re-perform concept determination step I, that is, redefine at least one new concept keyword corresponding to x_i to form a new concept keyword set Y_i, and/or redefine the matching condition of step II.2. When concept determination step I is re-performed, perform step II.2 with the new concept keyword set Y_i to obtain a new labeled corpus text collection C; perform step III.1 on C to obtain new concept subsets X_A and X_B; continue steps III.2 to III.6 on the new X_A and X_B until the accuracy of the provisional classification result of classifier m_i is greater than or equal to γ, at which point m_i is a satisfactory text classifier. When only the matching condition of step II.2 is redefined, start again from step II.2 until m_i is a satisfactory text classifier.
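The accuracy check at the heart of the manual matching assessment can be sketched as follows, assuming the hand labels are available as booleans (one per extracted text K); the function name and interface are illustrative:

```python
def assess(clf, sample_texts, gold_labels, beta=0.3, gamma=0.8):
    """Compare classifier output on the extracted sample K against the
    manually produced standard classification result.

    gold_labels[j] is True if text j was hand-labeled as associated
    with the concept. Returns (accuracy, whether accuracy >= gamma).
    """
    predictions = [clf(t) > beta for t in sample_texts]  # provisional result
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    accuracy = correct / len(sample_texts)
    return accuracy, accuracy >= gamma
```

When the returned flag is false, the method falls back to redefining the concept keywords and/or the matching condition, as described above.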
In a preferred embodiment of the present invention, the method further includes concept association step IV:

Concept association step IV: use the classifier m_i for each concept x_i in the classifier set M to classify a text d; at the same time, use the concept keywords in the concept keyword set corresponding to each concept x_i in subset X_B to perform matching on text d, obtaining the final association result of text d with each concept x_i in the concept set X.
In a preferred embodiment of the present invention, concept association step IV specifically includes:

Step IV.1: use each classifier m_i in the classifier set M to classify text d and compute the probability that d corresponds to concept x_i; if the probability exceeds the set threshold β, label d as associated with concept x_i in subset X_A;

Step IV.2: at the same time, use the concept keywords in the concept keyword set Y_i corresponding to each concept x_i in subset X_B to perform matching on text d; when the matching condition is met, label d as associated with concept x_i in subset X_B. This yields the final association result of text d with each concept x_i in the concept set X.
In a preferred embodiment of the present invention, the matching condition specifically means: judge whether concept keywords from the keyword set Y_i corresponding to concept x_i occur in more than 25%, preferably more than 30%, of the paragraphs of the text. If so, label the text as associated with x_i; if not, do not label the text as associated with x_i.
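This paragraph-fraction condition can be sketched directly; the sketch assumes paragraphs are separated by blank lines, which is an implementation detail the patent does not specify:

```python
def matches_concept(text, keywords, min_fraction=0.25):
    """Matching condition: the text is associated with a concept x_i if
    more than min_fraction of its paragraphs contain at least one
    keyword from Y_i (paragraphs assumed separated by blank lines).
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    hits = sum(any(kw in p for kw in keywords) for p in paragraphs)
    return hits / len(paragraphs) > min_fraction
```

Raising `min_fraction` from 0.25 to 0.3 corresponds to the "preferably more than 30%" variant.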
In a preferred embodiment of the present invention, the method further includes corpus update step V:

Corpus update step V: add the final association results of text d with each concept x_i in the concept set X to the labeled corpus text collection C; periodically remove older labeled corpus texts from C using a certain removal scheme, obtaining the updated labeled corpus text collection C.
In a preferred embodiment of the present invention, the removal scheme in corpus update step V means: keep the number of texts corresponding to each concept x_i in the labeled corpus text collection C between tens and hundreds, removing older texts if the count for some concept x_i exceeds hundreds; and keep the number of texts not associated with any concept x_i between thousands and tens of thousands, removing older texts if it exceeds tens of thousands.
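The removal scheme can be sketched as a per-concept cap on the newest texts. The concrete caps below are illustrative values chosen within the patent's stated ranges ("tens to hundreds", "thousands to tens of thousands"), not values the patent fixes:

```python
def prune_corpus(texts_by_concept, max_per_concept=300, max_other=20000):
    """Periodic removal in step V: cap each concept's labeled texts and
    the 'other' pool, dropping the oldest entries first.

    texts_by_concept: dict concept -> list of texts, assumed ordered
    oldest -> newest; the key 'other' holds unassociated texts.
    """
    pruned = {}
    for concept, texts in texts_by_concept.items():
        cap = max_other if concept == "other" else max_per_concept
        pruned[concept] = texts[-cap:]  # keep only the newest `cap` texts
    return pruned
```

Keeping the per-concept pools bounded keeps retraining in classifier update step VI cheap even as new texts keep arriving.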
In a preferred embodiment of the present invention, the method further includes classifier update step VI: repeat classification model training step III on the updated labeled corpus text collection C to obtain an updated text classifier set M.
In a preferred embodiment of the present invention, the method further includes concept addition step VII:

Step VII.1: after adding some new concept x_p, take the concept addition set X_P = {x_p}, the new concept x_p corresponding to the concept keyword set Y_p;

Step VII.2: perform corpus automatic labeling step II.2 on the concept addition set X_P = {x_p}, then divide X_P into two mutually complementary concept subsets X_PA and X_PB according to the division principle of classification model training step III.1, and then perform classification model training steps III.2 to III.6.

Judge whether the number of texts in the labeled corpus text collection C_p corresponding to concept x_p is greater than or equal to the threshold α. If so, assign x_p to subset X_A and add the trained satisfactory classifier m_p to the classifier set M; if it is less than α, assign x_p to subset X_B.
In a preferred embodiment of the present invention, the method further includes concept association step VIII after concept addition: use the classifier m_p for each concept x_p in the classifier set M obtained after concept addition step VII to classify a text d; at the same time, use the concept keywords in the concept keyword set corresponding to each concept x_p in subset X_B to perform matching on text d, obtaining the final association result of text d with each concept x_p in the concept set X.
In a preferred embodiment of the present invention, in step II.2, when the matching of a text d_j in the unlabeled corpus text collection D against the concept keywords of multiple concepts x_i satisfies the matching condition for each of them, label the text d_j as associated with all of these concepts x_i equally.
In a preferred embodiment of the present invention, in step II.2, when the matching of a text d_j in the unlabeled corpus text collection D against the concept keywords of every concept x_i fails the matching condition, label the text d_j as "other" or leave it unlabeled.
In a preferred embodiment of the present invention, in the manual labeling approach, when the matching of a text against the concept keywords of every concept fails the matching condition, label the text as "other" or leave it unlabeled.
In a preferred embodiment of the present invention, after step IV.2, when the matching of text d against the concept keywords of every concept x_i fails the matching condition, label the text d as "other" or leave it unlabeled, obtaining the final association result of text d with each concept x_i in the concept set X.
In a preferred embodiment, the present invention further comprises the obtained text classifiers for automatically labeling corpus texts.
In a preferred embodiment, the present invention further comprises the obtained labeled corpus text collection.
In a text classification method of the present invention, the classifier m_i for each concept x_i in the classifier set M obtained by the above method is used to classify a text d; at the same time, the concept keywords in the concept keyword set corresponding to each concept x_i in subset X_B are used to match and classify text d, obtaining the final classification result of text d for each concept x_i in the concept set X.
A text classification method of the present invention specifically includes:

First step: use each classifier m_i in the classifier set M to classify text d and compute the probability that d corresponds to concept x_i; if the probability exceeds the set threshold β, classify d as associated with concept x_i in subset X_A;

Second step: at the same time, use the concept keywords in the concept keyword set Y_i corresponding to each concept x_i in subset X_B to perform matching on text d; when the matching condition is met, associate text d with concept x_i in subset X_B.
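The two-track classification above (trained classifiers for the data-rich concepts in X_A, keyword matching for the data-poor concepts in X_B) can be sketched as one function; the interfaces are illustrative:

```python
def classify(text, classifiers, keyword_sets, matches, beta=0.3):
    """Classify text d: concepts in X_A via their trained classifiers m_i
    (probability > beta), concepts in X_B via the matching condition.

    classifiers: dict concept -> (text -> probability), for X_A;
    keyword_sets: dict concept -> keyword set Y_i, for X_B;
    matches: callable (text, keywords) -> bool (the matching condition).
    Returns the list of concepts the text is associated with.
    """
    result = [c for c, clf in classifiers.items() if clf(text) > beta]
    result += [c for c, kws in keyword_sets.items() if matches(text, kws)]
    return result
```

This is what makes the scheme robust to an unbalanced classification system: a concept falls back to keyword matching until it accumulates enough labeled texts to cross the threshold α and earn a trained classifier.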
The term "concept" denotes a common characteristic of the business operations of several listed companies. Categories of concepts include: (1) policies, such as "the Belt and Road"; (2) industries, such as "food and beverage"; (3) products, such as "drones"; (4) characteristics of the financial sector, such as "sub-new stocks"; and other content. Concepts include but are not limited to the above categories. The term "concept keyword" refers to the several words corresponding to each concept that can represent the concept's characteristics: if a sentence mentions a concept keyword, the sentence is considered to be about the corresponding concept. Texts include but are not limited to news, the titles and bodies of research reports by securities institutions, announcements of listed companies, etc. The term "a text is associated with a concept" means that when the content of a text is closely related to the content of some concept, the text is said to be associated with that concept. A text may be associated with a concept or with no concept; it may be associated with one concept or with multiple concepts. From the perspective of text classification, a concept can be regarded as a category, and a text associated with some concept can be regarded as belonging to that category. The term "label" means marking a text with its corresponding concept; labeling can be done manually or automatically by an algorithm. The term "unlabeled corpus text collection" refers to a collection of texts not labeled with their corresponding concepts. The term "labeled corpus text collection" refers to a collection of texts labeled with their corresponding concepts.
The positive effects of the present invention are: the method provides a general algorithmic framework in which the classification system can be changed flexibly. For example, the labeled corpus text collection and the text classifiers can be updated regularly; if a new concept is needed, it is simply added to the concept set, and only automatic corpus labeling and classifier training need to be performed for the new concept to obtain its corresponding text classifier, saving computation time and resources. Moreover, the present invention needs only a small amount of initial corpus text and labels it automatically, without manual labeling, further saving time and cost.
Brief description of the drawings
Fig. 1 is the flow chart of training a text classifier according to the present invention;
Fig. 2 is the flow chart of the text classification method of the present invention.
Embodiments

Preferred embodiments are enumerated below, in combination with Fig. 1 and Fig. 2, to illustrate more clearly and completely the implementation process of the present invention's method of obtaining text classifiers for automatically labeling corpus texts and of the text classifiers.
Step A, the concept determination process, includes:

The concept set X consists of concepts x_i, where i = 1, 2, 3, ..., n; each concept x_i in X corresponds to a concept keyword set Y_i composed of at least one concept keyword. A text may be associated with one or more concepts x_i, or with no concept x_i. If a text has much content related to some concept x_i in X, the text is associated with x_i; if the content of a text is unrelated to every concept x_i in X, the text is not associated with any concept. From the perspective of text classification, a concept can be regarded as a category; if a text is associated with some concept, the text is regarded as belonging to the category of that concept. Determining which concepts in the shared concept set a text is associated with amounts to determining into which concept-corresponding categories the text is classified.
Step B, the corpus automatic labeling process, includes:

Step B1: collect a sufficient number N of unlabeled corpus texts according to the specific practical situation; denote the unlabeled corpus text collection D = {d_j}, where j = 1, 2, ..., N.

Step B2: use the concept keywords in the concept keyword set Y_i corresponding to each concept x_i to perform matching on every text in the unlabeled corpus text collection D, judging whether the matching of a text d_j in D against the concept keywords of concept x_i satisfies the matching condition. In a preferred embodiment of the present invention, the matching condition is to judge whether concept keywords from Y_i occur in more than 25%, preferably more than 30%, of the paragraphs of d_j: if so, d_j is associated with x_i; if not, d_j cannot be associated with x_i. When the matching of a text d_j in D against the concept keywords of x_i satisfies the matching condition, label d_j as a text associated with concept x_i. Performing matching on every text in D yields the labeled corpus text collection C.
When the matching of a text d_j in D against the concept keywords of multiple concepts x_i satisfies the matching condition for each of them, label d_j as associated with all of these concepts x_i equally. When the matching of a text d_j in D against the concept keywords of every concept x_i fails the matching condition, label d_j as "other" or leave it unlabeled.
The principal factors influencing the matching condition include: the number of occurrences in the text of concept keywords from the keyword set Y_i of concept x_i; the ratio of that number to the total number of words in the text; the number of sentences in the text in which concept keywords from Y_i occur; the ratio of that number to the total number of sentences in the text; the number of paragraphs in the text in which concept keywords from Y_i occur; and the ratio of that number to the total number of paragraphs in the text.
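These factors can be computed as a small feature dictionary. A minimal sketch under assumed simplifications (whitespace word splitting, '.'-delimited sentences, blank-line paragraphs), since the patent does not fix the segmentation rules:

```python
def match_features(text, keywords):
    """Compute the factors listed above for one concept's keyword set Y_i:
    keyword occurrence counts and their ratios at the word, sentence and
    paragraph level.
    """
    words = text.split()
    sentences = [s for s in text.replace("\n", " ").split(".") if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    kw_words = sum(w.strip(".,") in keywords for w in words)
    kw_sents = sum(any(kw in s for kw in keywords) for s in sentences)
    kw_paras = sum(any(kw in p for kw in keywords) for p in paragraphs)
    return {
        "kw_word_count": kw_words,
        "kw_word_ratio": kw_words / max(len(words), 1),
        "kw_sentence_count": kw_sents,
        "kw_sentence_ratio": kw_sents / max(len(sentences), 1),
        "kw_paragraph_count": kw_paras,
        "kw_paragraph_ratio": kw_paras / max(len(paragraphs), 1),
    }
```

The preferred matching condition of step B2 uses only the paragraph ratio; the other counts and ratios are alternative signals a matching condition could be built on.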
Step C, the classification model training process, includes:

Step C1: divide the concept set X into two mutually complementary concept subsets X_A and X_B. The division principle is: if the number of labeled corpus texts in C associated with concept x_i is greater than or equal to the threshold α, assign x_i to subset X_A and denote the labeled corpus text collection associated with x_i as A_i; if it is less than α, assign x_i to subset X_B. In a preferred embodiment, α is 100.

Step C2: use the texts in the labeled corpus text collection A_i corresponding to concept x_i in subset X_A as positive examples for training the classification model; randomly extract from C k texts not belonging to A_i, denoted A_i', as negative examples for training the classification model.

Step C3: using a text classification model such as Naive Bayes, a support vector machine or logistic regression, train on A_i and A_i' the text classifier corresponding to concept x_i, denoted m_i. The set of classifiers trained for each concept x_i in subset X_A is denoted M_0. In a preferred embodiment of the present invention, a support vector machine text classification model (see: Yuan, G., Ho, C., Lin, C.: Recent Advances of Large-scale Linear Classification. Proc. IEEE 100(9), 2584-2603 (2012)) is used to train on A_i and A_i' the text classifier for concept x_i.
Step C4, utilizes text classifier set M0In respectively with each concept xiCorresponding text classifier miTo not
The text marked in language material text collection D carries out classification processing, obtains corresponding text classification result, the classification results are individually deposited
Put, mark language material text collection C is not influenceed.
Step C5, for concept subclass XAIn each concept xi, use text classifier miCalculate text and correspond to concept
xiProbability, selected from text classification result and correspond to concept xiProbability be more than threshold value beta text, add it to concept
xiCorresponding mark language material text collection AiIn, form new mark language material set Ai.Wherein, β spans are 0.1~0.5.
Step C6, for new mark language material set Ai, repeat step C2~C5 2~10 times obtains satisfactory general
Read xiCorresponding text classifier mi, so as to obtain final satisfactory text classifier set M.In one preferred embodiment, weight
Multiple step C2~C5 5~10 times.Or for new mark language material set Ai, artificial matching, which is assessed, obtains satisfactory concept xi
Corresponding text classifier mi, so as to obtain final satisfactory text classifier set M.
The manual matching evaluation works as follows. For a concept x_i, a number of texts are randomly drawn from the labeled corpus set A_i, and a further number of texts not associated with x_i are randomly drawn from the labeled corpus text set C. All drawn texts K are manually re-labeled, yielding a standard classification result. Each time a text classifier m_i is trained in step C3, it classifies all drawn texts K, yielding a provisional classification result: the classifier m_i computes, for every text in K, the probability that the text is associated with concept x_i, and labels the text as associated with x_i when that probability exceeds the threshold β. The provisional classification result is compared against the standard classification result and its accuracy is computed. When the accuracy is greater than or equal to a threshold γ (γ typically in the range 0.8 to 1), the classifier m_i is a satisfactory text classifier.
When the accuracy is below the threshold γ, the concept determination step A is re-run, i.e. at least one new concept keyword is determined for concept x_i, forming a new concept keyword set Y_i, and/or the matching condition of step B2 is redefined. When the concept determination step is re-run, step B2 is carried out with the new concept keyword set Y_i, producing a new labeled corpus text set C; step C1 is applied to C, producing new concept subsets X_A and X_B; and steps C2 to C6 are carried out on the new subsets until the provisional classification accuracy of the classifier m_i reaches the threshold γ, at which point m_i is a satisfactory text classifier. When only the matching condition of step B2 is redefined, the process restarts from step B2 and continues until the classifier m_i is satisfactory.
Step D, the concept association process, includes:

Step D1: use the text classifier m_i for each concept x_i in the text classifier set M to classify a text d, computing the probability that d corresponds to concept x_i; if that probability exceeds the set threshold β, label d as associated with the concept x_i in concept subset X_A.

Step D2: at the same time, use the concept keywords in the keyword set Y_i of each concept x_i in concept subset X_B to match against text d; when the matching condition is met, label d as associated with that concept x_i in X_B. This yields the final association results between text d and each concept x_i in the concept set X. When text d satisfies the matching condition for no concept x_i, d is labeled "other" or left unlabeled, and this constitutes the final association result between d and each concept x_i in the concept set X.
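The two routes of step D (classifier probability for concepts in X_A, keyword matching for concepts in X_B) combine as follows. An illustrative sketch, not the patent's code; `matches` and `predict_proba` are caller-supplied stand-ins:

```python
def associate(text, classifiers, keyword_sets, matches, predict_proba, beta=0.3):
    """Step D: associate a text d with concepts from both subsets.
    Concepts in X_A (those with a trained classifier in `classifiers`)
    are assigned when the classifier probability exceeds beta (step D1);
    concepts in X_B (`keyword_sets`) are assigned by keyword matching
    (step D2). An empty result corresponds to the "other" label."""
    result = []
    for concept, model in classifiers.items():        # X_A: classifier route
        if predict_proba(model, text) > beta:
            result.append(concept)
    for concept, kws in keyword_sets.items():         # X_B: keyword route
        if matches(text, kws):
            result.append(concept)
    return result
```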
Step E, updating the labeled corpus text set, includes:

Adding the final association results between text d and each concept x_i in the concept set X to the labeled corpus text set C, and periodically removing older labeled texts from C according to a removal rule, yielding the updated labeled corpus text set C. The removal rule is: keep the number of texts associated with each concept x_i in C between tens and hundreds, removing the oldest texts of a concept x_i once its count exceeds several hundred; and keep the number of texts associated with no concept x_i between thousands and tens of thousands, removing the oldest such texts once the count exceeds tens of thousands.
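The removal rule of step E can be sketched as a pruning pass over the corpus. This is an illustration only; the concrete bounds (500 and 50000) are stand-ins for the "hundreds" and "tens of thousands" named above:

```python
def prune_corpus(corpus, per_concept_max=500, unassociated_max=50000):
    """Removal rule of step E: `corpus` is a list of (text, concepts)
    pairs ordered oldest-first. Keep at most `per_concept_max` of the
    newest texts per concept and at most `unassociated_max` of the
    newest texts associated with no concept; older texts are dropped."""
    keep = []
    counts = {}
    unassoc = 0
    for text, concepts in reversed(corpus):           # walk newest-first
        if concepts:
            if any(counts.get(c, 0) < per_concept_max for c in concepts):
                keep.append((text, concepts))
                for c in concepts:
                    counts[c] = counts.get(c, 0) + 1
        elif unassoc < unassociated_max:
            keep.append((text, concepts))
            unassoc += 1
    keep.reverse()                                    # restore oldest-first order
    return keep
```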
Step F, updating the classifiers, includes:

Repeating the classification model training step C on the updated labeled corpus text set C, yielding an updated text classifier set M.
Step G, adding a new concept, includes:

Step G1: after a new concept x_p is added, form the concept increment set X_P = {x_p} and the concept keyword set Y_p for x_p.

Step G2: apply the corpus auto-labeling process to X_P = {x_p}, then divide X_P according to the division principle of step C1 into two mutually complementary concept subsets X_PA and X_PB, and carry out the classification model training steps C2 to C6. That is, judge whether the number of texts in the labeled corpus text set C_p of concept x_p is greater than or equal to the threshold α: if so, assign x_p to concept subset X_A and add the trained satisfactory text classifier m_p to the text classifier set M; if not, assign x_p to concept subset X_B.
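The division principle used in steps C1 and G2 is a simple threshold split. A minimal sketch (illustrative; α = 100 matches the embodiment below):

```python
def partition_concepts(label_counts, alpha=100):
    """Division principle: concepts whose labeled-text count reaches
    alpha go to X_A (enough data to train a classifier); the rest go
    to X_B (keyword matching only)."""
    x_a = [c for c, n in label_counts.items() if n >= alpha]
    x_b = [c for c, n in label_counts.items() if n < alpha]
    return x_a, x_b
```

With the counts of the embodiment (208, 154 and 34 texts), this yields X_A = {x1, x2} and X_B = {x3}.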
The concept association process after a concept is added includes: use the text classifier m_p for each new concept x_p in the text classifier set M obtained after step G to classify a text d; at the same time, use the concept keywords in the keyword set of each concept x_p in concept subset X_B to match against text d; this yields the final association results between text d and each concept x_p in the concept set X.
The implementation of the method of obtaining a text classifier for automatic corpus labeling according to the present invention is further described below through a specific embodiment:

Step S1, determining the concepts:

A concept set X = {x1, x2, x3} containing three concepts is determined: x1 = waste gas treatment, x2 = augmented reality, x3 = vanadium battery. The concept keyword sets Y1, Y2, Y3 corresponding to x1, x2, x3 are respectively:

Y1 = {waste gas, waste gas treatment, exhaust gas treatment, organic waste gas, industrial waste gas, waste gas purification};

Y2 = {augmented reality, ar};

Y3 = {vanadium battery}.
Step S2, automatic corpus labeling:

Step S21: collect 5000 news articles, forming the unlabeled corpus text set D = {d_j}, where j = 1, 2, ..., 5000.

Step S22: match every text in the unlabeled corpus text set D against the concept keyword set Y1, Y2, Y3 of each concept x1, x2, x3. A text d_j satisfies the matching condition for a concept x_i when more than 25%, preferably more than 30%, of its paragraphs contain a concept keyword from the keyword set Y_i; in that case d_j can be associated with x_i, otherwise it cannot. When a text d_j in D satisfies the matching condition for concept x_i, d_j is labeled as associated with x_i. Labeling every text in D in this way yields the labeled corpus text set C.
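The matching condition of step S22 can be sketched as follows. This is an illustrative implementation, not the patent's own code; the 0.25 threshold corresponds to the 25% figure stated above (0.30 to the preferred 30%):

```python
def matches_concept(text, keywords, threshold=0.25):
    """Return True when more than `threshold` of the text's paragraphs
    contain at least one concept keyword (matching condition of
    step S22)."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    if not paragraphs:
        return False
    hits = sum(1 for p in paragraphs
               if any(kw in p for kw in keywords))
    return hits / len(paragraphs) > threshold

def auto_label(docs, concept_keywords, threshold=0.25):
    """Label each document with every concept whose keyword set it
    matches (step S22); a document matching no concept gets the empty
    list, i.e. the "other" label."""
    labels = {}
    for doc_id, text in docs.items():
        labels[doc_id] = [c for c, kws in concept_keywords.items()
                          if matches_concept(text, kws, threshold)]
    return labels
```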
Step S3, classification model training:

Step S31: in the labeled corpus text set C, 208 texts are labeled as associated with concept x1, 154 with concept x2, and 34 with concept x3.

Step S32: the concept set X is divided into two mutually complementary concept subsets X_A and X_B. With threshold α = 100, X_A = {x1, x2} and X_B = {x3}. The labeled corpus text sets associated with concepts x1 and x2 are denoted A1 and A2 respectively.

Step S33: the texts in the labeled corpus text sets A1 and A2 of the concepts x1 and x2 in X_A serve as positive examples for training the classification models; 1000 texts not belonging to A1 or A2 are randomly drawn from the labeled corpus text set C to serve as negative examples, denoted A1' and A2'. A support vector machine is trained on A1, A1' and on A2, A2', yielding the text classifiers m1 and m2 for concepts x1 and x2. The classifier set for the concepts in X_A is denoted M0.

Step S34: the text classifiers m1 and m2 in M0 classify the texts in the unlabeled corpus text set D, yielding the corresponding text classification results.

Step S35: for x1 and x2 in X_A, the classifiers m1 and m2 compute the probability that each text corresponds to x1 or x2; the texts whose probability exceeds the threshold β are selected from the classification results and added to the labeled corpus text sets A1 and A2, forming new labeled corpus sets A1 and A2.

Step S36: for the new labeled corpus sets A1 and A2, steps S32 to S35 are repeated 5 times, yielding satisfactory text classifiers m1 and m2 for concepts x1 and x2, and thus the final satisfactory text classifier set M.
Step S4, concept association:

Step S41: the text classifiers m1 and m2 for concepts x1 and x2 in the classifier set M classify a text d, computing the probability that d corresponds to x1 or x2; if that probability exceeds the set threshold β, d is labeled as associated with the corresponding concept x1 or x2 in X_A.

Step S42: at the same time, the concept keywords in the keyword set Y3 of concept x3 in X_B are matched against text d; when the matching condition is met, d is labeled as associated with the concept x3 in X_B.

Step S43: combining steps S41 and S42 yields the concept association result for text d, i.e. the final association results between d and each concept x1, x2, x3 in the concept set X.
Claims (24)
1. A method of obtaining a text classifier for automatic corpus labeling, characterized in that the method includes:

a concept determination step I: determining a concept set, each concept in the concept set corresponding to a concept keyword set composed of at least one concept keyword;

a corpus auto-labeling step II: matching the texts of an unlabeled corpus text set against the concept keywords in the keyword set of each concept in the concept set, associating the texts of the unlabeled corpus text set with the corresponding concepts and labeling them automatically, obtaining a labeled corpus text set;

a classification model training step III: for each concept in the concept set, when the number of texts in the labeled corpus text set associated with the concept meets a threshold condition, taking the labeled corpus texts associated with the concept as positive examples of a text classification model and labeled corpus texts not associated with the concept as negative examples, training the model for the concept on these positive and negative examples to obtain the text classifier for the concept, and finally obtaining the text classifiers for all concepts whose text counts meet the threshold condition.
2. The method of claim 1, characterized in that in the concept determination step I, a concept set X composed of concepts x_i is determined, where i = 1, 2, 3, ..., n, and each concept x_i in X corresponds to a concept keyword set Y_i composed of at least one concept keyword.

3. The method of claim 2, characterized in that the corpus auto-labeling step II includes:

step II1: collecting a sufficient number N of unlabeled texts according to the actual application, the unlabeled corpus text set being denoted D = {d_j}, where j = 1, 2, ..., N;

step II2: matching every text in the unlabeled corpus text set D against the concept keywords in the keyword set Y_i of each concept x_i; when a text d_j in D satisfies the matching condition for a concept x_i, labeling d_j as associated with x_i; matching every text in D in this way yields the labeled corpus text set C.
4. The method of claim 3, characterized in that the classification model training step III includes:

step III1: dividing the concept set X into two mutually complementary concept subsets X_A and X_B, the division principle being that when the number of labeled texts associated with a concept x_i in the labeled corpus text set C is greater than or equal to a threshold α, x_i is assigned to X_A and its associated labeled corpus text set is denoted A_i; when the number is below α, x_i is assigned to X_B;

step III2: taking the labeled corpus text set A_i of each concept x_i in X_A as positive examples for training the classification model, and randomly drawing k texts not belonging to A_i from the labeled corpus text set C, denoted A_i', as negative examples;

step III3: training, on the labeled corpus text sets A_i and A_i', a text classification model such as naive Bayes, support vector machine or logistic regression, obtaining the text classifier for concept x_i, denoted m_i; the set of classifiers for the concepts x_i in X_A is denoted M0.
5. The method of claim 4, characterized in that the classification model training step III further includes:

step III4: using each text classifier m_i in the set M0 to classify, i.e. associate and label, the texts in the unlabeled corpus text set D, obtaining the corresponding text classification results, which are stored separately;

step III5: for each concept x_i in X_A, using the classifier m_i to compute the probability that a text corresponds to x_i, selecting from the classification results the texts whose probability exceeds a threshold β, and adding them to the labeled corpus text set A_i, forming a new labeled corpus set A_i;

step III6: for each new labeled corpus set A_i, repeating steps III2 to III5 between 2 and 10 times to obtain a satisfactory classifier m_i for concept x_i, thereby obtaining the final satisfactory text classifier set M; or, for each new labeled corpus set A_i, determining a satisfactory classifier m_i for concept x_i by manual matching evaluation, thereby obtaining the final satisfactory text classifier set M.
6. The method of claim 4, characterized in that in step III3 of the classification model training step: a support vector machine text classification model is trained on the labeled corpus text sets A_i and A_i' to obtain the classifier for concept x_i.

7. The method of claim 5, characterized in that in step III5 of the classification model training step: the threshold β is in the range 0.1 to 0.5; and in step III6, steps III2 to III5 are repeated 5 to 10 times.
8. The method of claim 5, characterized in that in step III6 of the classification model training step:

the manual matching evaluation means that, for a concept x_i, a number of texts are randomly drawn from the labeled corpus set A_i and a number of texts not associated with x_i are randomly drawn from the labeled corpus text set C; all drawn texts K are manually re-labeled, yielding a standard classification result; each time a classifier m_i is trained in step III3, it classifies all drawn texts K, yielding a provisional classification result, i.e. the classifier m_i computes, for every text in K, the probability that the text is associated with x_i and labels the text as associated with x_i when that probability exceeds the threshold β; the provisional classification result is compared against the standard classification result and its accuracy computed; when the accuracy is greater than or equal to a threshold γ, the classifier m_i is a satisfactory text classifier;

when the accuracy is below γ, the concept determination step I is re-run, i.e. at least one new concept keyword is determined for x_i, forming a new concept keyword set Y_i, and/or the matching condition of step II2 is redefined; when the concept determination step I is re-run, step II2 is carried out with the new keyword set Y_i, yielding a new labeled corpus text set C, step III1 is applied to C, yielding new concept subsets X_A and X_B, and steps III2 to III6 are carried out on the new subsets until the provisional classification accuracy of the classifier m_i reaches γ, at which point m_i is a satisfactory text classifier; when only the matching condition of step II2 is redefined, the process restarts from step II2 until the classifier m_i is satisfactory.
9. The method of claim 8, characterized in that it further includes a concept association step IV:

the concept association step IV uses the text classifier m_i for each concept x_i in the classifier set M to classify a text d, and at the same time matches text d against the concept keywords in the keyword set of each concept x_i in subset X_B, obtaining the final association results between text d and each concept x_i in the concept set X.

10. The method of claim 9, characterized in that the concept association step IV specifically includes:

step IV1: each text classifier m_i in the set M classifies text d and computes the probability that d corresponds to concept x_i; if the probability exceeds the set threshold β, d is labeled as associated with the concept x_i in subset X_A;

step IV2: at the same time, the concept keywords in the keyword set Y_i of each concept x_i in subset X_B are matched against text d; when the matching condition is met, d is labeled as associated with that concept x_i in X_B; this yields the final association results between d and each concept x_i in the concept set X.
11. The method of claim 3, 8 or 10, characterized in that the matching condition specifically means: judging whether more than 25%, preferably more than 30%, of the paragraphs of the text contain a concept keyword from the keyword set Y_i of concept x_i; if so, the text is labeled as associated with x_i; if not, the text is not labeled as associated with x_i.
12. The method of claim 9, characterized in that it further includes a corpus update step V:

the corpus update step V adds the final association results between text d and each concept x_i in the concept set X to the labeled corpus text set C, and periodically removes older labeled texts from C according to a removal rule, yielding the updated labeled corpus text set C.

13. The method of claim 12, characterized in that the removal rule of the corpus update step V means: keeping the number of texts associated with each concept x_i in C between tens and hundreds, removing the oldest texts of a concept x_i once its count exceeds several hundred; and keeping the number of texts associated with no concept x_i between thousands and tens of thousands, removing the oldest such texts once the count exceeds tens of thousands.

14. The method of claim 12, characterized in that it further includes a classifier update step VI: repeating the classification model training step III on the updated labeled corpus text set C, yielding an updated text classifier set M.
15. The method of claim 14, characterized in that it further includes a new-concept step VII:

step VII1: after a new concept x_p is added, forming the concept increment set X_P = {x_p} and the concept keyword set Y_p for x_p;

step VII2: applying the corpus auto-labeling step II2 to X_P = {x_p}, then dividing X_P according to the division principle of step III1 into two mutually complementary concept subsets X_PA and X_PB, and carrying out the classification model training steps III2 to III6;

judging whether the number of texts in the labeled corpus text set C_p of concept x_p is greater than or equal to the threshold α: if so, assigning x_p to subset X_A and adding the trained satisfactory classifier m_p to the classifier set M; if not, assigning x_p to subset X_B.

16. The method of claim 15, characterized in that it further includes a concept association step VIII after a concept is added: using the classifier m_p for each new concept x_p in the classifier set M obtained after step VII to classify a text d, and at the same time matching text d against the concept keywords in the keyword set of each concept x_p in subset X_B, obtaining the final association results between text d and each concept x_p in the concept set X.
17. The method of claim 3, characterized in that in step II2, when a text d_j in the unlabeled corpus text set D satisfies the matching condition for several concepts x_i, d_j is labeled as associated with each of those concepts.

18. The method of claim 3, characterized in that in step II2, when a text d_j in the unlabeled corpus text set D satisfies the matching condition for no concept x_i, d_j is labeled "other" or left unlabeled.

19. The method of claim 8, characterized in that in the manual labeling procedure, when a text satisfies the matching condition for the keywords of no concept, the text is labeled "other" or left unlabeled.

20. The method of claim 10, characterized in that after step IV2, when text d satisfies the matching condition for no concept x_i, d is labeled "other" or left unlabeled, yielding the final association results between d and each concept x_i in the concept set X.
21. A text classifier for automatic corpus labeling obtained by the method of any one of claims 1 to 8 and 13 to 14.

22. A labeled corpus text set obtained by the method of any one of claims 3, 9, 12, 14 and 15.

23. A text classification method, characterized in that: the text classifier m_i for each concept x_i in the classifier set M obtained by the method of any one of claims 1 to 8 and 13 to 14 classifies a text d; at the same time, the concept keywords in the keyword set of each concept x_i in subset X_B are matched against text d for classification, yielding the final classification results between text d and each concept x_i in the concept set X.

24. The method of claim 23, characterized in that it specifically includes:

a first step: each text classifier m_i in the set M classifies text d and computes the probability that d corresponds to concept x_i; if the probability exceeds the set threshold β, d is classified as associated with the concept x_i in subset X_A;

a second step: at the same time, the concept keywords in the keyword set Y_i of each concept x_i in subset X_B are matched against text d; when the matching condition is met, d is associated with that concept x_i in X_B.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710216502.1A CN106951565B (en) | 2017-04-05 | 2017-04-05 | File classification method and the text classifier of acquisition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106951565A true CN106951565A (en) | 2017-07-14 |
CN106951565B CN106951565B (en) | 2018-04-27 |
Family
ID=59475323
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710216502.1A Active CN106951565B (en) | 2017-04-05 | 2017-04-05 | File classification method and the text classifier of acquisition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106951565B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102023986A (en) * | 2009-09-22 | 2011-04-20 | 日电(中国)有限公司 | Method and equipment for constructing text classifier by referencing external knowledge |
CN104298665A (en) * | 2014-10-16 | 2015-01-21 | 苏州大学 | Identification method and device of evaluation objects of Chinese texts |
CN106202177A (en) * | 2016-06-27 | 2016-12-07 | 腾讯科技(深圳)有限公司 | A kind of file classification method and device |
Non-Patent Citations (1)
Title |
---|
XIAOLI LI ET AL.: "Learning to Classify Texts Using Positive and Unlabeled Data", 《PROCEEDINGS OF THE 18TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108549723A (en) * | 2018-04-28 | 2018-09-18 | 北京神州泰岳软件股份有限公司 | A kind of text concept sorting technique, device and server |
CN108549723B (en) * | 2018-04-28 | 2022-04-05 | 北京神州泰岳软件股份有限公司 | Text concept classification method and device and server |
CN109189941A (en) * | 2018-09-07 | 2019-01-11 | 百度在线网络技术(北京)有限公司 | For updating the method, apparatus, equipment and medium of model parameter |
CN109766440A (en) * | 2018-12-17 | 2019-05-17 | 航天信息股份有限公司 | A kind of method and system for for the determining default categories information of object text description |
CN109766440B (en) * | 2018-12-17 | 2023-09-01 | 航天信息股份有限公司 | Method and system for determining default classification information for object text description |
CN111132056A (en) * | 2019-08-07 | 2020-05-08 | 上海钧正网络科技有限公司 | Battery management method and device, battery and server |
CN111132056B (en) * | 2019-08-07 | 2023-05-05 | 上海钧正网络科技有限公司 | Battery management method and device, battery and server |
CN111539209A (en) * | 2020-04-15 | 2020-08-14 | 北京百度网讯科技有限公司 | Method and apparatus for entity classification |
CN111539209B (en) * | 2020-04-15 | 2023-09-15 | 北京百度网讯科技有限公司 | Method and apparatus for entity classification |
Also Published As
Publication number | Publication date |
---|---|
CN106951565B (en) | 2018-04-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106951565B (en) | File classification method and the text classifier of acquisition | |
CN106776538A (en) | The information extracting method of enterprise's noncanonical format document | |
CN111860981B (en) | Enterprise national industry category prediction method and system based on LSTM deep learning | |
CN108717433A (en) | A kind of construction of knowledge base method and device of programming-oriented field question answering system | |
CN106021410A (en) | Source code annotation quality evaluation method based on machine learning | |
CN106021389A (en) | System and method for automatically generating news based on template | |
CN106649260A (en) | Product feature structure tree construction method based on comment text mining | |
CN105975555A (en) | Enterprise abbreviation extraction method based on bidirectional recurrent neural network | |
CN102270212A (en) | User interest feature extraction method based on hidden semi-Markov model | |
CN110310012B (en) | Data analysis method, device, equipment and computer readable storage medium | |
CN107004141A (en) | To the efficient mark of large sample group | |
CN112307153A (en) | Automatic construction method and device of industrial knowledge base and storage medium | |
CN108027814A (en) | Disable word recognition method and device | |
CN110046225B (en) | Scientific and technological project material integrity assessment decision model training method | |
CN112800232B (en) | Case automatic classification method based on big data | |
CN113434688B (en) | Data processing method and device for public opinion classification model training | |
CN110705283A (en) | Deep learning method and system based on matching of text laws and regulations and judicial interpretations | |
CN116703328B (en) | Project review method and system | |
Shanmugalingam et al. | Corporate it-support help-desk process hybrid-automation solution with machine learning approach | |
CN114817575A (en) | Large-scale electric power affair map processing method based on extended model | |
CN117520754B (en) | Pretreatment system for model training data | |
CN112785156B (en) | Industrial collar and sleeve identification method based on clustering and comprehensive evaluation | |
CN110362828B (en) | Network information risk identification method and system | |
CN107220238A (en) | A kind of text object abstracting method based on Mixed Weibull distribution | |
CN114281998B (en) | Event labeling system construction method for multi-level labeling person based on crowdsourcing technology |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||