Summary of the invention
Existing text classifiers are not suitable for texts with severe feature-word crossover. To solve this technical problem, in a first aspect, the application provides a method for constructing a text classifier, comprising the following steps:
obtaining a classification system, storing the classification system with a multi-branch tree data structure, and generating an ontology tree;
extracting keywords from the ontology nodes of the ontology tree;
obtaining ontology expressions, wherein each ontology expression is generated according to a classification rule and a semantic model, the classification rule is generated according to the keywords and logical operators, and the semantic model is generated according to the keywords;
establishing an association between each ontology node and the corresponding ontology expression to obtain a text classifier, wherein the text classifier includes the ontology tree and the ontology expressions respectively associated with the ontology nodes of the ontology tree.
With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting keywords from the ontology nodes of the ontology tree comprises:
extracting subject words from the titles of the ontology nodes;
obtaining expansion words according to the subject words, and obtaining keywords that include both the subject words and the expansion words.
With reference to the first aspect and the above possible implementations, in a second possible implementation of the first aspect, the step of obtaining expansion words according to the subject words comprises:
segmenting a preset sample text to obtain first characters;
constructing an inverted index according to the first characters to obtain an index database;
segmenting the subject words to obtain second characters;
matching the second characters against the index database;
calculating the degree of correlation between each sample text and the subject words according to the matching result;
displaying, in descending order of correlation, the sample texts whose degree of correlation is greater than zero;
highlighting, in the displayed sample texts, the first characters that match the second characters;
obtaining expansion words according to the characters in the displayed sample texts that partially match the subject words.
With reference to the first aspect and the above possible implementations, in a third possible implementation of the first aspect, the method further comprises:
determining predicted classification labels of preset test texts using the text classifier;
when the accuracy rate is less than a preset threshold, adjusting the ontology expressions in the text classifier, wherein the accuracy rate is the ratio of the number of predicted classification labels that match the original classification labels of the test texts to the total number of predicted classification labels.
With reference to the first aspect and the above possible implementations, in a fourth possible implementation of the first aspect, the step of adjusting the ontology expressions in the text classifier comprises:
extracting the ontology expression corresponding to a predicted classification label that does not match the original classification label;
when a constraint factor is missing from the corresponding ontology expression, adding a constraint factor to the ontology expression to obtain an optimized ontology expression, wherein the constraint factor includes a concept in the semantic model and/or a logical operator.
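The adjustment step above can be sketched minimally as follows. The expression syntax and the function name are hypothetical, assuming ontology expressions are plain strings over the "+" and "|" operators described later in this application.

```python
# Hypothetical sketch of the fourth possible implementation: when a
# predicted label does not match the original label and the associated
# ontology expression lacks a constraint factor, one more semantic-model
# concept is joined to the expression with logical AND "+" to narrow it.

def add_constraint_factor(expression: str, constraint_concept: str) -> str:
    """Return an optimized ontology expression that additionally
    requires `constraint_concept` (a semantic-model concept)."""
    return f"({expression}) + {constraint_concept}"

# An over-broad expression that fires on either concept alone is
# tightened so that a card-related concept must also be present.
optimized = add_constraint_factor("e_counterfeit | e_stolen_swipe", "e_use_card")
print(optimized)  # (e_counterfeit | e_stolen_swipe) + e_use_card
```

The added concept narrows the set of texts that trigger the expression, which is the intended effect when the original expression over-matched during testing.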
In a second aspect, the application provides a text classification method, comprising the following steps:
obtaining a text to be classified;
determining the ontology expression in a text classifier that matches the text to be classified, wherein the text classifier includes an ontology tree and ontology expressions respectively associated with the ontology nodes of the ontology tree;
determining the ontology node associated with that ontology expression;
determining the category to which the text to be classified belongs according to the information of that ontology node.
In conjunction with the second aspect, in a first possible implementation of the second aspect, the step of determining the ontology expression in the text classifier that matches the text to be classified includes:
when more than one ontology expression is associated with an ontology node, judging in parallel whether the text to be classified matches each ontology expression.
In a third aspect, the application provides a text classifier construction device, comprising:
a first acquisition unit, configured to obtain a classification system, store the classification system with a multi-branch tree data structure, and generate an ontology tree;
an extraction unit, configured to extract keywords from the ontology nodes of the ontology tree;
a second acquisition unit, configured to obtain ontology expressions, wherein each ontology expression is generated according to a classification rule and a semantic model, the classification rule is generated according to the keywords and logical operators, and the semantic model is generated according to the keywords;
a generation unit, configured to establish an association between each ontology node and the corresponding ontology expression to obtain a text classifier, wherein the text classifier includes the ontology tree and the ontology expressions respectively associated with the ontology nodes of the ontology tree.
In conjunction with the third aspect, in a first possible implementation of the third aspect, the extraction unit further includes:
a subject-word extraction subunit, configured to extract subject words from the titles of the ontology nodes;
an extension subunit, configured to obtain expansion words according to the subject words and obtain keywords that include both the subject words and the expansion words.
In conjunction with the third aspect and the above possible implementations, in a second possible implementation of the third aspect, the text classifier construction device further includes:
a test text classification unit, configured to determine predicted classification labels of preset test texts using the text classifier;
an optimization unit, configured to adjust the ontology expressions in the text classifier when the accuracy rate is less than a preset threshold, wherein the accuracy rate is the ratio of the number of predicted classification labels that match the original classification labels of the test texts to the total number of predicted classification labels.
In the text classifier construction method and text classification method of the above technical solution, an ontology tree is first generated; keywords are then extracted from the ontology nodes of the ontology tree; a semantic model is generated based on the keywords, and classification rules are generated based on the keywords and logical operators; ontology expressions are then generated from the semantic model and the classification rules, and each constructed ontology expression is associated with its corresponding ontology node. The ontology tree, together with the ontology expressions associated with its ontology nodes, constitutes the text classifier. When the text classifier is used for text classification, a text to be classified triggers a specific ontology expression; because each ontology expression is associated with a specific ontology node, the triggered ontology expression determines that ontology node. The information of that ontology node, such as its title, then serves as a classification label to mark the text to be classified and thereby determine its category.
Because an ontology expression includes at least one semantic-model concept that can effectively characterize the text to be classified, and because, when multiple semantic-model concepts are present, identical or different logical relations hold between them, the ontology expressions associated with different ontology nodes differ even if the keywords extracted from those nodes happen to be identical. The method is therefore suitable for classifying texts whose feature words overlap severely.
Meanwhile, because the category of a text is determined by triggering an ontology expression, there is no need to count feature coverage or compute feature weights. Even if the training corpus is unbalanced and some category has very few feature words, feature skew will not cause classification errors. Once feature words that characterize the semantics of a text have been extracted and used to construct an ontology expression, triggering that expression is sufficient to label the text to be classified, without considering how often the feature words occur or how heavily they are weighted, thereby avoiding the classification errors caused by an unbalanced training corpus.
Specific embodiment
Embodiments of the application are elaborated below with reference to the accompanying drawings.
Text classification refers to assigning a text to one or several categories under a given classification system. A text classifier is the general term for the methods used to classify texts during text mining.
A classification system includes labels at multiple levels and embodies people's specific text classification needs in different application scenarios. Taking bank credit-card customer-service work-order texts as a concrete application scenario, the classification system may be as shown in Table 1, including first-level classification labels, second-level classification labels subordinate to the first-level labels, and third-level classification labels subordinate to the corresponding second-level labels. Besides the classifications shown in Table 1, the classification system may also include other first-level classification labels and the second-level classification labels subordinate to them; other third-level classification labels may also exist under the second-level labels, and classification labels at other levels are similar.
Table 1: schematic example of a classification system
Text classification methods based on statistical methods have at least the following two defects.
First, when fine-grained classification is required, the corpus content of different categories shares identical feature words, that is, feature crossover occurs.
Taking bank credit-card customer-service work-order texts as a concrete application scenario, suppose there are two texts to be classified:
Text to be classified 1:
A counterfeit list xw was previously filed: 20150503nxxxxx181. The customer has now called again, dissatisfied with the result, claiming that our bank must also bear responsibility; strongly dissatisfied, the customer complains again. Please have your department verify and handle this as soon as possible, thanks! Telephone number: 152xxxx4718.
Text to be classified 2:
The customer calls to press for progress on form NO.: 20150916nxxxxx311, demanding prompt handling and notification of the processing result, stating that no one has contacted them so far, demanding deduction or waiver of a 4,900-yuan loss, and asking that a dispute registration be made first; the customer is willing to repay only the amounts of normal consumption and unwilling to repay the amounts from the stolen swipes. Please have your department verify and handle this as soon as possible, thanks! Telephone number: 138xxxx8628.
In the above two texts to be classified, the semantics expressed by text 1 relate to counterfeit/stolen swipe, while the semantics expressed by text 2 relate to pressing for business. Nevertheless, many feature words with the same or similar concepts appear in both texts. For example, the feature words "dissatisfied", "incoming call", and "verify" appear in both; likewise, "counterfeit" in text 1 and "stolen swipe" in text 2 are considered similar feature words from the perspective of bank credit-card customer service. In these two texts, feature words that effectively characterize the actual category are relatively rare, for example "counterfeit list" for "counterfeit/stolen swipe", and "demanding prompt handling" or "no one has contacted them so far" for "pressing for business".
When classification is performed with a statistics-based method, many identical or similar feature words are extracted from the above two texts, feature crossover is severe, and it is difficult or impossible to effectively extract feature words such as "demanding prompt handling" or "no one has contacted them so far". Facing training corpora of this kind, statistical classification methods learned automatically by a computer are prone to misjudgment, and the resulting text classifier can hardly reach the desired precision.
Second, when the training corpus is unbalanced, some categories have abundant training corpora, yielding many extracted features with wide coverage, while other categories have very little corpus, so the extracted features are limited and insufficient to cover all aspects of the category. In this case, classifying texts with a statistical method easily causes feature skew.
Still taking bank credit-card customer-service work-order texts as the concrete application scenario and continuing with text to be classified 2 above: in that text, feature words that effectively characterize the concept of pressing for business, such as "demanding prompt handling" and "no one has contacted them so far", are hard to extract; meanwhile the misleading feature word "stolen swipe" in "unwilling to repay the amounts from the stolen swipes" is easy to extract. Consequently, the feature words that effectively characterize text 2 cannot be extracted while a misleading feature word is, which easily causes misjudgment and classification errors.
In addition, suppose that when the text classifier is built, the training corpus of the category "pressing for business" is very small, so the feature words extracted from it are limited, say only the 5 feature words "press", "stolen swipe", "amount", "verify", and "handle"; meanwhile the category "counterfeit/stolen swipe" has abundant training corpus with broad feature coverage, from which 14 feature words can be extracted: "credit card", "amount", "limit increase", "quota", "stolen swipe", "incoming call", "dissatisfied", "verify", "handle", "responsibility", "repayment", "complaint", "progress", and "accept".
When faced with text to be classified 3 below, which actually belongs to the category "pressing for business", a statistics-based text classification method is prone to misjudge.
Text to be classified 3:
Previous complaint forms: 20150826j00000044, 20150902j00000248, 20150910j00000149. The customer states that they have been complaining by phone since August 26: a credit-card limit-increase application was refused by the staff, and no call-back from the person in charge of the branch has been received so far. The customer demands that the branch itself handle the complaint, repeatedly asks during the call for the telephone number of the supervisory authority, and demands that, regardless of the processing result, the branch report the processing progress. The customer says the processing time has been too long and is unwilling to continue being worn down by the process. Thanks.
A statistics-based text classification method determines the category according to the quantity and weight of the extracted features. If statistical classification were used, text to be classified 3 would be marked as the category "counterfeit/stolen swipe". This is because the "counterfeit/stolen swipe" category covers more feature words, while "pressing for business" has few feature words and limited coverage, and cannot match the content of text 3 better.
Existing text classifiers are not suitable for texts with severe feature crossover, nor for unbalanced training corpora. To solve this technical problem, referring to FIG. 1, a specific embodiment of the application provides a method for constructing a text classifier, comprising the following steps:
S100: obtaining a classification system, storing the classification system with a multi-branch tree data structure, and generating an ontology tree;
S200: extracting keywords from the ontology nodes of the ontology tree;
S300: obtaining ontology expressions, wherein each ontology expression is generated according to a classification rule and a semantic model, the classification rule is generated according to the keywords and logical operators, and the semantic model is generated according to the keywords;
S400: establishing an association between each ontology node and the corresponding ontology expression to obtain a text classifier, wherein the text classifier includes the ontology tree and the ontology expressions respectively associated with the ontology nodes of the ontology tree.
In step S100, the classification system may be constructed manually or by a computer; the application does not restrict this. The step of "storing the classification system with a multi-branch tree data structure" in S100 may specifically be performed as follows: first establish a root node; taking the root node as the parent node, add first-level ontology nodes, using the first-level classification labels of the classification system as the titles of the corresponding first-level ontology nodes; similarly, taking each first-level ontology node as a parent node, add second-level ontology nodes, using the second-level classification labels of the classification system as the titles of the corresponding second-level ontology nodes; and so on, until ontology nodes have been established for the classification labels of all levels of the obtained classification system. The ontology nodes at all levels and the parent-child relations between corresponding ontology nodes constitute the ontology tree. In the ontology tree, the first-level, second-level, third-level, and other ontology nodes may be referred to collectively as ontology nodes.
For example, continuing the example of Table 1 above, the classification system is stored with a multi-branch tree data structure, and the generated ontology tree is as shown in Table 2.
Table 2: schematic example of an ontology tree
Optionally, the step in S200 of extracting keywords from the ontology nodes of the ontology tree may include: obtaining the title of an ontology node; segmenting the title of the ontology node to obtain subject words; and using these subject words as the keywords of the ontology node.
For example, continuing the example of Table 2 and taking the third-level ontology node titled "counterfeit/stolen swipe" as an example: the title of the node is "counterfeit/stolen swipe"; segmenting it yields the subject words "counterfeit" and "stolen swipe". With "counterfeit" and "stolen swipe" as keywords, the next step of obtaining an ontology expression is performed.
Optionally, referring to FIG. 2, the step in S200 of extracting keywords from the ontology nodes of the ontology tree may include:
S210: extracting subject words from the titles of the ontology nodes;
S220: obtaining expansion words according to the subject words, and obtaining keywords that include both the subject words and the expansion words.
By obtaining expansion words, more implicit, semantically close expansion words can be mined; the subject words and the expansion words together serve as keywords for constructing the classification rules and semantic models, thereby improving the classification precision of the text classifier.
The method of extracting subject words in step S210 may refer to the method of extracting subject words in the foregoing implementation.
Referring to FIG. 3, the step of obtaining expansion words according to the subject words in step S220 may include:
S221: segmenting a preset sample text to obtain first characters;
S222: constructing an inverted index according to the first characters to obtain an index database;
S223: segmenting the subject words to obtain second characters;
S224: matching the second characters against the index database;
S225: calculating the degree of correlation between each sample text and the subject words according to the matching result;
S226: displaying, in descending order of correlation, the sample texts whose degree of correlation is greater than zero;
S227: highlighting, in the displayed sample texts, the first characters that match the second characters;
S228: obtaining first expansion words according to the characters in the displayed sample texts that partially match the subject words.
Steps S221 and S222 use specific sample texts to establish an index database based on those sample texts, so that the first expansion words can later be obtained from the index database. For example, 10,000 bank credit-card customer-service work-order texts are obtained in advance as sample texts; these 10,000 sample texts are segmented at single-character granularity to obtain the first characters. An inverted index is then constructed character by character over the first characters, forming the index database.
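Under the single-character granularity just described, the index construction of S221–S222 can be sketched as follows; English texts stand in for the Chinese work orders, and all names are illustrative.

```python
from collections import defaultdict

# S221-S222: segment each sample text at single-character granularity and
# build an inverted index mapping each first character to its postings,
# i.e. the (sample id, position) pairs at which the character occurs.
def build_index_database(sample_texts):
    index = defaultdict(list)
    for doc_id, text in enumerate(sample_texts):
        for pos, ch in enumerate(text):
            index[ch].append((doc_id, pos))
    return index

index_db = build_index_database(["counterfeit list", "stolen swipe"])
print(index_db["s"])  # postings of the character "s" across both samples
```

Storing positions (not just document ids) is what later allows a matched character to be extended leftward or rightward inside the sample text when forming expansion words.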
In step S223, the subject words extracted in step S210 are segmented with the same segmentation method as in S221 to obtain the second characters.
In step S224, the second characters are matched character by character against the inverted index in the index database. The more characters matched in a sample text, the higher the degree of correlation between that sample text and the subject words is considered to be; specifically, the degree of correlation between a sample text and the subject words can be calculated from the length of the matched characters.
In step S228, a first character in a sample text that partially matches a subject word is extended backward or forward to obtain a word-level character string with a complete meaning, and that character string is used as a first expansion word. Besides the partially matched case, a character string that exactly matches a subject word can likewise be extended forward or backward within the sample to obtain a character string with complete meaning as a first expansion word. This step may be completed manually or by a computer; the application does not restrict this.
In steps S223 to S228, the subject words are used as search terms and matched against the inverted index of a specific index database so as to calculate the degree of correlation between the sample texts and the subject words; the sample texts are displayed in descending order of correlation, and the first characters matching the second characters are highlighted in the displayed sample texts for visual presentation, which assists in quickly locating and obtaining the first expansion words. Especially when the sample texts are long or the number of sample texts whose correlation is greater than zero is large, obtaining the first expansion words manually from the highlighted characters greatly improves efficiency and reduces workload.
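Steps S223–S228 can be sketched as below. The relevance measure (count of matched subject-word characters) and the bracket marking are simplifications of the "matched character length" correlation and the highlighted display described above.

```python
# S223-S225: segment the subject word into second characters, match them
# against each sample text, and take the matched-character count as the
# degree of correlation.
def correlation(subject_word, sample):
    return sum(1 for ch in set(subject_word) if ch in sample)

# S226-S227: show samples with correlation > 0 in descending order,
# marking matched characters with brackets in place of highlighting.
def show_ranked(subject_word, samples):
    shown = []
    ranked = sorted(samples, key=lambda s: correlation(subject_word, s),
                    reverse=True)
    for sample in ranked:
        score = correlation(subject_word, sample)
        if score > 0:
            marked = "".join(f"[{c}]" if c in subject_word else c
                             for c in sample)
            shown.append((score, marked))
            print(score, marked)
    return shown

show_ranked("swipe", ["a stolen swipe report", "no match at all?"])
```

A human (or a follow-up routine, per S228) then reads the marked spans outward to recover word-level expansion words with complete meaning.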
For example, continuing the example of Table 2, the subject words "counterfeit" and "stolen swipe" are extracted. With "counterfeit" and "stolen swipe" respectively as search terms, matching content is located in the sample texts using the index database constructed from the aforementioned 10,000 sample texts. Suppose the result shows 3 sample texts whose degree of correlation is greater than 0, as follows.
Sample text 1:
A counterfeit list xw was previously filed: 20150503nxxxxx181. The customer has now called again, dissatisfied with the result, claiming that our bank must also bear responsibility; strongly dissatisfied, the customer complains again. Please have your department verify and handle this as soon as possible, thanks! Telephone number: 152xxxx4718.
Sample text 2:
The customer reports that no card was applied for, yet transactions occurred; see counterfeit list: 20150207j11000092. During this period the customer called several times to press for progress; see forms: 20150209j23240075, 20150210j23240017, 20150211j23240055. The customer has now called again, stating that although the problem was reported and pressed for repeatedly, the branch's reply received on 2/11 only asked whether the card had been applied for personally, with no reply whatsoever about the processing result; the customer is very dissatisfied with the processing progress and demands a final result as soon as possible, or a reply stating an accurate time limit for handling. Our department's online attempts to pacify the customer have failed; please reply and handle this, thanks!
Sample text 3:
The customer complains that the card has been used without authorization and has filled in forms: 20150708j00000081, 20150714j00000214. Dissatisfied with the current processing result, the customer still asks us to provide proof that the short message involved was a counterfeit message, and demands prompt handling; your department is asked to assist, thanks!
The subject word "counterfeit" is exactly matched in the texts of sample texts 1 and 2. Sample text 3, besides exactly matching "counterfeit", also partially matches the characters of "stolen swipe". Accordingly, "counterfeit" can be extended rightward into "counterfeit short message", and the partially matched character extended rightward into "unauthorized use", so that "counterfeit short message" and "unauthorized use" serve as first expansion words. The first expansion words, together with the subject words, are used as keywords for the next step of obtaining an ontology expression.
Besides obtaining the first expansion words with the index database, second expansion words can also be obtained according to the semantics of the subject words: character strings that cannot be matched but are semantically identical or similar. The first expansion words, the second expansion words, and the subject words are then used together as keywords for the next step of obtaining an ontology expression.
For example, from sample text 2 above it can be found that even in a text where the word "counterfeit" does not appear, when "no card was applied for" and "transaction" occur together, the content of the text is still related to counterfeit/stolen swipe, and the user expects the text to be classified under the category "counterfeit/stolen swipe". Accordingly, "no card was applied for" and "transaction" can be used as second expansion words.
Through the step of obtaining second expansion words according to the subject words, implicit expansion words can be further mined, thereby further improving the classification precision of the text classifier. The second expansion words may be obtained manually or by a computer; the application does not restrict this.
In step S300, the ontology expressions may be constructed manually or generated by a computer; the application does not restrict this. Obtaining an ontology expression may mean manually inputting the ontology expression into a computer, which then obtains it; it may also mean the computer receiving an ontology expression generated and sent by another computer, thereby completing the step of obtaining the ontology expression. The application likewise does not restrict this.
The generation of an ontology expression may specifically be realized by the following steps:
First, according to the keywords extracted in step S200, at least one keyword is connected with logical operators, so that logical associations exist between logical operators and keywords, and between keywords, generating a classification rule.
A logical operator, also known as a logical connective, in the embodiments of the application includes: logical AND "+", logical NOT "-", logical OR "|", and grouping "()". For example, the classification rule A+B requires that both A and B be included; the classification rule A+(B|C) requires that either B or C be included, while also requiring that A be included.
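The operator semantics just described can be illustrated with a small evaluator. Representing rules as nested tuples instead of parsing "+"/"|"/"-" strings is a simplification for this sketch; the names are illustrative.

```python
# A hypothetical evaluator for the classification-rule operators above:
# logical AND "+", logical OR "|", logical NOT "-", grouping via nesting.
# A bare string is a keyword that must appear in the text.

def evaluate(rule, text):
    op, *args = rule if isinstance(rule, tuple) else ("word", rule)
    if op == "word":
        return args[0] in text                        # keyword present?
    if op == "+":
        return all(evaluate(a, text) for a in args)   # logical AND
    if op == "|":
        return any(evaluate(a, text) for a in args)   # logical OR
    if op == "-":
        return not evaluate(args[0], text)            # logical NOT
    raise ValueError(f"unknown operator {op!r}")

# Classification rule A + (B | C): requires A, plus either B or C.
rule = ("+", "A", ("|", "B", "C"))
print(evaluate(rule, "text containing A and C"))  # True
```

Nesting tuples plays the role of the "()" grouping operator, which is why the rule A+(B|C) is written with the OR clause as an inner tuple.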
Continuing the "counterfeit/stolen swipe" example from S200: the subject words extracted from the third-level ontology node titled "counterfeit/stolen swipe" are "counterfeit" and "stolen swipe"; the first expansion words obtained from the subject words are "counterfeit short message" and "unauthorized use"; the second expansion words are "no card was applied for" and "transaction". With the subject words and the two classes of expansion words together as keywords, 3 classification rules are generated:
Classification rule 1: counterfeit | stolen swipe;
Classification rule 2: counterfeit short message + unauthorized use;
Classification rule 3: no card was applied for + transaction.
The above classification rules may be constructed manually or generated by a computer; the application does not restrict this.
Second, a semantic model is generated according to the keywords. A semantic model refers to the text presentation forms, oriented to a known concept and obtained by inductive enumeration from sample data, that are used to describe the semantics of that known concept.
Specifically, in one implementation, a semantic model may include either of a general-language concept and a business-factor concept, marked with the two symbols "c_" and "e_" respectively. The keywords are divided into general-language concepts and business-factor concepts, and for each keyword, the different presentation forms of the known concept are extracted from contextual text information.
For example, still continuing the "counterfeit/stolen swipe" example from S200: "counterfeit", "stolen swipe", "counterfeit information", "unauthorized use", "apply for card", and "use card" are each used as business-factor concepts, and the "negation concept" is used as a general-language concept; from the existing sample data, the different text presentation forms of each concept are enumerated inductively, as shown in Table 3 below:
Table 3: semantic model example 1

Concept type | Concept | Different presentation forms of the concept (feature words)
Business-factor concept | e_counterfeit | counterfeit, fake, impersonate
Business-factor concept | e_stolen swipe | stolen swipe
Business-factor concept | e_counterfeit information | counterfeit short message, counterfeit message, counterfeit incoming call, counterfeit mail
Business-factor concept | e_unauthorized use | unauthorized use
Business-factor concept | e_apply for card | apply {0,2} card
Business-factor concept | e_use card | swiped with the card, proxy transaction, using the card, with the card
General-language concept | c_negation concept | not, without, never
Besides the two classes of general-language concepts and/or business-factor concepts, a semantic model may also include instant concepts, that is, concepts set up on the spot by the user according to actual needs, which may be marked with the symbol "k_". For example, if classification requires that "initial quota" appear in the text, an instant concept can be defined directly and represented as "k_initial quota"; that concept should include only the term "initial quota".
The above semantic model may be constructed manually or generated by a computer; the present application does not limit this.
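As a minimal illustrative sketch (not the claimed implementation), such a semantic model can be thought of as a mapping from prefixed concept labels to their feature words; the concept names and feature words below are hypothetical English stand-ins for the translated examples:

```python
# Illustrative sketch only: "e_" marks business-factor concepts, "c_"
# general-language concepts, "k_" instant concepts, as in the text.
SEMANTIC_MODEL = {
    "e_counterfeit":   ["counterfeit", "pass off", "impersonate"],
    "e_fraud_swipe":   ["fraudulent swipe"],
    "c_negation":      ["not", "never", "without"],
    "k_initial_limit": ["initial limit"],  # instant concept: one exact word
}

def concepts_in(text, model):
    """Return the concept labels whose feature words occur in the text."""
    return {concept for concept, words in model.items()
            if any(w in text for w in words)}
```

Each concept thus stands for all of its enumerated surface forms, which is what allows different wordings of the same semantics to trigger the same concept.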
Finally, the ontology expression is generated from the classification rules and the semantic model. Specifically, each keyword in a classification rule is mapped to the corresponding concept of the semantic model, and the corresponding concepts are connected with the same logical operators as in the classification rule, thereby generating the ontology expression.
For example, continuing the aforementioned "counterfeit fraudulent-swipe" example, the following ontology expressions may be generated:
Ontology expression 1: e_counterfeit | e_fraudulent swipe;
Ontology expression 2: e_counterfeit information + e_misappropriation;
Ontology expression 3: c_negation + e_card application + e_card use.
It should be noted that in step S300 the classification rules and the semantic model may be generated simultaneously or one after the other; the present application places no restriction on the order of generation.
In step S400, the ontology expression generated in step S300 on the basis of an ontology node of step S200 is associated with that ontology node. One ontology node may be associated with one or more ontology expressions. After every ontology node in the ontology tree has been associated with its respective ontology expressions, the ontology tree, together with the ontology expressions associated with its nodes, constitutes the text classifier, which is used to classify unknown text.
In the text classifier construction method and text classification method of the above embodiments, an ontology tree is first generated; keywords are then extracted from the ontology nodes of the tree; a semantic model is generated from the keywords, and classification rules are generated from the keywords and logical operators; ontology expressions are then generated from the semantic model and the classification rules; and each constructed ontology expression is associated with its corresponding ontology node, so that the ontology tree and the ontology expressions associated with its nodes together constitute the text classifier. When the text classifier is used for text classification, a text to be classified triggers a specific ontology expression; since the ontology expression is associated with a specific ontology node, that node can be determined from the triggered expression. Information of the ontology node, such as its title, then serves as the classification label that marks the text to be classified and thereby determines its class.
Since an ontology expression contains at least one concept of the semantic model that effectively characterizes the text to be classified, and since, when several such concepts are present, identical or different logical relations exist between them, the ontology expressions associated with different ontology nodes differ even when the keywords extracted from those nodes happen to be identical. The classifier is therefore suitable for classifying texts whose feature words overlap heavily.
At the same time, because the above text classifier determines the class of a text by triggering ontology expressions, neither the number nor the weight of covered features needs to be computed. Hence, even if the training corpus is unbalanced and some class has very few feature words, feature skew will not lead to misclassification. Once the feature words that characterize the semantics of a text have been extracted and used to construct an ontology expression, triggering that expression is sufficient to label the text to be classified, regardless of how often the feature words occur or how they are weighted, thereby avoiding classification errors caused by an unbalanced training corpus.
For example, continuing the examples of text to be classified 1, text to be classified 2 and text to be classified 3 given in the section on the shortcomings of the statistical method: the third-level ontology node titled "counterfeit fraudulent-swipe" is associated with the ontology expressions "k_counterfeit form" and "e_counterfeit + c_demand responsibility | c_dissatisfied", so that text to be classified 1 triggers an ontology expression in the text classifier and is thereby classified into the "counterfeit fraudulent-swipe" class. Similarly, the third-level ontology node titled "urging service" is associated with the ontology expressions "e_urge", "c_inquire + e_handling progress" and "e_no reply + e_handling time + c_long", so that text to be classified 2 and text to be classified 3 are marked with the "urging service" class.
Here, the semantic model is as shown in Table 4.
Table 4. Semantic model, example 2
It should be noted that in Table 2, "undertake{0,2}responsibility" indicates that a text still matches even if 0 to 2 characters appear between "undertake" and "responsibility"; for example, a text containing "undertake the responsibility" or "undertook responsibility" is considered to match "undertake{0,2}responsibility". Other similar notations in this application have the same meaning.
In Table 2, "[^not]{0,5}dissatisfied" indicates that a text matches as long as the 0 to 5 characters preceding "dissatisfied" do not include "not"; for example, "very dissatisfied" matches, while reversed-semantics feature words such as "not dissatisfied" are excluded. Other similar notations in this application have the same meaning.
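The two notations above can be sketched in code; in the original the gaps count Chinese characters, so the English words here are purely illustrative stand-ins:

```python
import re

# "undertake{0,2}responsibility": up to two arbitrary characters may
# appear between the two words and the phrase still matches.
gap = re.compile(r"undertake.{0,2}responsibility")

def match_excluding(text, word, neg, window=5):
    """Sketch of '[^neg]{0,window}word': match `word` unless `neg`
    occurs within `window` characters immediately before it."""
    i = text.find(word)
    while i != -1:
        if neg not in text[max(0, i - window):i]:
            return True
        i = text.find(word, i + 1)
    return False
```

The exclusion is implemented as a scan rather than a plain character class, since the excluded token may be longer than one character in this illustration.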
Optionally, referring to FIG. 4, the construction method of the text classifier may further include:
S500: determining prediction classification labels of preset test texts using the text classifier;
S600: adjusting the ontology expressions in the text classifier when the accuracy rate is less than a preset threshold, the accuracy rate being the ratio of the number of prediction classification labels that match the original classification labels of the test texts to the total number of prediction classification labels.
In step S500, the preset test texts have been hand-labeled with original classification labels. Usually there is more than one test text, and the test texts belong to the same class of text as the sample texts. For example, if the sample texts are customer-service work-order texts of a bank's credit card division, the test texts are generally also customer-service work-order texts of a bank's credit card division.
In step S600, if the accuracy rate is greater than or equal to the preset threshold, the text classifier can classify unknown text effectively. If it is less than the preset threshold, the text classifier is optimized by adjusting the ontology expressions.
Steps S500 and S600 may form an iterative process: through continuous optimization, the accuracy rate of the optimized text classifier can reach the threshold desired by the user.
Referring to FIG. 5, step S600 may specifically include:
S610: extracting the ontology expression corresponding to a prediction classification label that does not match the original classification label;
S620: when a constraint factor is missing from the corresponding ontology expression, adding the constraint factor to the ontology expression to obtain an optimized ontology expression, the constraint factor including a concept of the semantic model and/or a logical operator.
Step S610 may specifically be implemented as follows: first, the prediction classification labels that do not match the original classification labels are extracted; then the ontology node whose title is identical to the prediction classification label is found; the ontology expression associated with that node is then determined, so that the ontology expression triggered by the test text is located.
In step S620, when the ontology expression extracted in step S610 lacks a constraint factor, the constraint factor may be added to it; that is, the original ontology expression is optimized by adding at least one of a business-factor concept, a general-language concept or an instant concept, together with a logical operator, so that a text to be classified that originally matched the expression no longer matches the optimized expression, or a text that originally failed to match the expression now matches the optimized one. For example, a new concept may be added to the semantic model together with a new logical operator, thereby generating the optimized ontology expression; feature words may also be added to or removed from an existing concept; logical operators may also be added or removed so that new logical relations are formed between concepts. The original ontology expression is replaced by the optimized one, which is associated with the corresponding ontology node of the ontology tree, yielding the optimized text classifier.
For example, continuing the aforementioned "counterfeit fraudulent-swipe" example, the test texts include test text 1.
Test text 1:
The client is dissatisfied with the handling result of the counterfeit complaint form and demands to escalate the complaint; see form NO.: 20150810s11000063. A new complaint form 20150906s00000076 has been filled in online on the client's behalf and routed to risk management. However, the client insists on going to the credit card center in Wuhan, Hubei, to resolve the problem in person. Please verify and handle as soon as possible, thanks. Telephone number: 138xxxxx124.
Test text 1 matches the ontology expression "e_counterfeit | e_fraudulent swipe", i.e., triggers that expression. According to the expression, the associated ontology node is determined in the ontology tree, and according to the node's title the test text is marked with the prediction classification label "counterfeit fraudulent-swipe". However, the actual semantics of test text 1 are not counterfeit fraudulent-swipe but urging service; its hand-assigned original classification label is "urging service", which does not match the prediction label. Analysis shows that the expression should be triggered only when "counterfeit" or "fraudulent swipe" appears and "handling result" does not; since test text 1 contains "handling result", it should not trigger the expression. The current ontology expression therefore lacks constraint factors: the logical operator "-" and the business-factor concept "e_handling result", where "e_handling result" contains the feature word "handling result". Adding the missing constraint factors yields the optimized ontology expression: e_counterfeit | e_fraudulent swipe - e_handling result. The original ontology expression is replaced by the optimized one, which is associated with the corresponding ontology node of the ontology tree, yielding the optimized text classifier.
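Reading the "-" operator as an exclusion, the effect of the optimization in this example can be sketched as follows (concept names are hypothetical stand-ins):

```python
# Illustrative sketch: evaluate an 'A | B - C' expression as (A or B) and not C,
# so a text that also contains the excluded concept no longer triggers it.
def matches_optimized(expression, found_concepts):
    positive, _, excluded = expression.partition("-")
    if excluded and excluded.strip() in found_concepts:
        return False  # the "-" constraint factor suppresses the match
    return any(c.strip() in found_concepts for c in positive.split("|"))
```

A text like test text 1, which contains the "handling result" feature word and hence the excluded concept, would thus stop triggering the optimized expression.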
Referring to FIG. 6, in another embodiment, a text classification method is provided, comprising the following steps:
S710: obtaining a text to be classified;
S720: determining the ontology expression in a text classifier that matches the text to be classified, wherein the text classifier includes an ontology tree and the ontology expressions associated with the respective ontology nodes of the ontology tree;
S730: determining the ontology node associated with the ontology expression;
S740: determining, according to information of the ontology node, the class to which the text to be classified belongs.
In step S720, the ontology tree is stored in a multi-branch tree data structure. Within an ontology tree, an ontology node may be associated with at least one ontology expression. When more than one ontology expression is associated with a node, the expressions form an ontology expression set. Whether the text to be classified matches an expression in the set may be judged by traversing the set one by one, or the expressions may be matched in parallel to increase the matching speed; especially when the volume of texts to be classified is large, this improves the overall classification speed.
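The parallel variant of step S720 can be sketched as follows; the `matcher` predicate is a stand-in for whatever expression evaluation is used:

```python
# Illustrative sketch: evaluate an expression set against one text in
# parallel; map() preserves the order of the expression set.
from concurrent.futures import ThreadPoolExecutor

def triggered_expressions(text, expressions, matcher):
    with ThreadPoolExecutor() as pool:
        hits = list(pool.map(lambda e: matcher(text, e), expressions))
    return [e for e, hit in zip(expressions, hits) if hit]
```

Whether threads actually help depends on the matcher's cost; for large batches of texts, parallelizing over texts rather than expressions may be the better split.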
In step S740, the information of the ontology node may specifically be its title or the like. The title of the ontology node determined in S730 as associated with the ontology expression serves as the classification label that marks the text to be classified, thereby classifying it. When the same text to be classified triggers more than one ontology expression and the several expressions correspond to different ontology nodes, the titles of the several nodes may each serve as a classification label and mark the same text, achieving a multi-label classification effect.
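Steps S730 and S740 can be sketched together: each triggered expression is mapped back to the title of its associated node, and the titles become the labels of the text (the expression-to-title mapping below is hypothetical):

```python
# Illustrative sketch: several triggered expressions may yield several
# labels for the same text; duplicate titles are emitted once.
def labels_for(triggered, node_title_of):
    labels = []
    for expr in triggered:
        title = node_title_of[expr]
        if title not in labels:
            labels.append(title)
    return labels
```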
Referring to FIG. 7, in another embodiment, a text classifier construction apparatus is provided, comprising:
a first acquisition unit 1 for obtaining a classification system, storing the classification system in a multi-branch tree data structure, and generating an ontology tree;
an extraction unit 2 for extracting keywords from the ontology nodes of the ontology tree;
a second acquisition unit 3 for obtaining ontology expressions, the ontology expressions being generated from classification rules and a semantic model, the classification rules being generated from the keywords and logical operators, and the semantic model being generated from the keywords;
a generation unit 4 for associating the ontology nodes with the corresponding ontology expressions to obtain a text classifier, the text classifier including the ontology tree and the ontology expressions associated with the respective ontology nodes of the ontology tree.
Optionally, referring to FIG. 8, the step of generating the ontology expressions may be performed by an external computer or manually. In this case, after the extraction unit extracts the keywords, it sends them out; the external computer, or a human operator, generates the classification rules and the semantic model from the keywords, and generates the ontology expressions from the semantic model and the classification rules. The second acquisition unit then receives the externally input ontology expressions, and the generation unit finally constructs the text classifier. In this way, the computational load of the text classifier construction apparatus itself can be reduced.
Optionally, referring to FIG. 9, the extraction unit 2 may include:
a subject-word extraction subunit 21 for extracting subject words from the titles of the ontology nodes;
an extension subunit 22 for obtaining expansion words from the subject words, so as to obtain keywords that include the subject words and the expansion words.
By obtaining expansion words through the extension subunit, more implicit words with similar semantics can be mined; the subject words and the expansion words together serve as keywords for constructing the classification rules and the semantic model, thereby improving the classification precision of the text classifier.
Optionally, referring to FIG. 9, the text classifier construction apparatus may further include:
a test-text classification unit 5 for determining prediction classification labels of preset test texts using the text classifier;
an optimization unit 6 for adjusting the ontology expressions in the text classifier when the accuracy rate is less than a preset threshold, the accuracy rate being the ratio of the number of prediction classification labels that match the original classification labels of the test texts to the total number of prediction classification labels.
Through continuous optimization, the accuracy rate of the optimized text classifier can reach the threshold desired by the user.
Optionally, the extension subunit 22 may include:
a first word-segmentation unit for segmenting preset sample texts to obtain first characters;
an index-database construction unit for building an inverted index from the first characters to obtain an index database;
a second word-segmentation unit for segmenting the subject words to obtain second characters;
a matching unit for matching the second characters against the index database;
a relevance calculation unit for calculating the relevance between the sample texts and the subject words according to the matching result;
a display unit for displaying, in descending order of relevance, the sample texts whose relevance is greater than zero;
a highlighting unit for highlighting, in the displayed sample texts, the first characters that match the second characters;
a first expansion-word acquisition unit for obtaining expansion words from the parts of the displayed sample texts that match the subject words.
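The indexing and ranking part of this pipeline can be sketched as follows, assuming segmentation has already produced token lists and using a simple shared-token count as the relevance measure (a stand-in for whatever relevance formula the units actually use):

```python
# Illustrative sketch of the extension-subunit pipeline.
from collections import defaultdict

def build_inverted_index(segmented_samples):
    """segmented_samples: list of token lists. Returns token -> sample ids."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(segmented_samples):
        for token in tokens:
            index[token].add(doc_id)
    return index

def rank_samples(subject_tokens, index):
    """Relevance = number of subject tokens a sample shares; samples with
    zero relevance are dropped, as by the display unit."""
    scores = defaultdict(int)
    for token in subject_tokens:
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: -item[1])
```

The highlighted matching tokens in the top-ranked samples are then the candidates from which expansion words are taken.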
Optionally, the optimization unit 6 may include:
an ontology-expression extraction unit for extracting the ontology expression corresponding to a prediction classification label that does not match the original classification label;
an adjustment unit for adding, when a constraint factor is missing from the corresponding ontology expression, the constraint factor to the ontology expression to obtain an optimized ontology expression, the constraint factor including a concept of the semantic model and/or a logical operator.
For identical or similar parts, the embodiments in this specification may refer to one another. The embodiments of the invention described above are not intended to limit the scope of the present invention.