Summary of the invention
Existing text classifiers are not suitable for texts with severe feature-word crossover. To solve this technical problem, in a first aspect, the application provides a method for constructing a text classifier, comprising the following steps:
obtaining a classification system, storing the classification system with a multi-branch tree data structure, and generating an ontology tree;
extracting keywords from the ontology nodes of the ontology tree;
obtaining ontology expressions, wherein each ontology expression is generated according to a classification rule and a semantic model, the classification rule is generated according to the keywords and logical operators, and the semantic model is generated according to the keywords;
establishing an association between each ontology node and the corresponding ontology expression to obtain a text classifier, wherein the text classifier includes the ontology tree and the ontology expressions respectively associated with the ontology nodes of the ontology tree.
With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting keywords from the ontology nodes of the ontology tree comprises:
extracting subject words from the titles of the ontology nodes;
obtaining expansion words according to the subject words, and obtaining keywords that include both the subject words and the expansion words.
With reference to the first aspect and the above possible implementations, in a second possible implementation of the first aspect, the step of obtaining expansion words according to the subject words comprises:
segmenting a preset sample text to obtain first characters;
constructing an inverted index according to the first characters to obtain an index database;
segmenting the subject words to obtain second characters;
matching the second characters against the index database;
calculating the degree of correlation between each sample text and the subject words according to the matching result;
displaying, in descending order of correlation, the sample texts whose degree of correlation is greater than zero;
highlighting, in the displayed sample texts, the first characters that match the second characters;
obtaining expansion words according to the characters in the displayed sample texts that partially match the subject words.
With reference to the first aspect and the above possible implementations, in a third possible implementation of the first aspect, the method further comprises:
determining predicted classification labels of preset test texts using the text classifier;
when the accuracy rate is less than a preset threshold, adjusting the ontology expressions in the text classifier, wherein the accuracy rate is the ratio of the number of predicted classification labels that match the original classification labels of the test texts to the total number of predicted classification labels.
With reference to the first aspect and the above possible implementations, in a fourth possible implementation of the first aspect, the step of adjusting the ontology expressions in the text classifier comprises:
extracting the ontology expression corresponding to a predicted classification label that does not match the original classification label;
when a constraint factor is missing from the corresponding ontology expression, adding a constraint factor to the ontology expression to obtain an optimized ontology expression, wherein the constraint factor includes a concept in the semantic model and/or a logical operator.
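The adjustment step above can be sketched minimally as follows. The expression syntax and the function name are hypothetical, assuming ontology expressions are plain strings over the "+" and "|" operators described later in this application.

```python
# Hypothetical sketch of the fourth possible implementation: when a
# predicted label does not match the original label and the associated
# ontology expression lacks a constraint factor, one more semantic-model
# concept is joined to the expression with logical AND "+" to narrow it.

def add_constraint_factor(expression: str, constraint_concept: str) -> str:
    """Return an optimized ontology expression that additionally
    requires `constraint_concept` (a semantic-model concept)."""
    return f"({expression}) + {constraint_concept}"

# An over-broad expression that fires on either concept alone is
# tightened so that a card-related concept must also be present.
optimized = add_constraint_factor("e_counterfeit | e_stolen_swipe", "e_use_card")
print(optimized)  # (e_counterfeit | e_stolen_swipe) + e_use_card
```

The added concept narrows the set of texts that trigger the expression, which is the intended effect when the original expression over-matched during testing.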
In a second aspect, the application provides a text classification method, comprising the following steps:
obtaining a text to be classified;
determining the ontology expression in a text classifier that matches the text to be classified, wherein the text classifier includes an ontology tree and ontology expressions respectively associated with the ontology nodes of the ontology tree;
determining the ontology node associated with that ontology expression;
determining the category to which the text to be classified belongs according to the information of that ontology node.
In conjunction with the second aspect, in a first possible implementation of the second aspect, the step of determining the ontology expression in the text classifier that matches the text to be classified includes:
when more than one ontology expression is associated with an ontology node, judging in parallel whether the text to be classified matches each ontology expression.
In a third aspect, the application provides a text classifier construction device, comprising:
a first acquisition unit, configured to obtain a classification system, store the classification system with a multi-branch tree data structure, and generate an ontology tree;
an extraction unit, configured to extract keywords from the ontology nodes of the ontology tree;
a second acquisition unit, configured to obtain ontology expressions, wherein each ontology expression is generated according to a classification rule and a semantic model, the classification rule is generated according to the keywords and logical operators, and the semantic model is generated according to the keywords;
a generation unit, configured to establish an association between each ontology node and the corresponding ontology expression to obtain a text classifier, wherein the text classifier includes the ontology tree and the ontology expressions respectively associated with the ontology nodes of the ontology tree.
In conjunction with the third aspect, in a first possible implementation of the third aspect, the extraction unit further includes:
a subject-word extraction subunit, configured to extract subject words from the titles of the ontology nodes;
an extension subunit, configured to obtain expansion words according to the subject words and obtain keywords that include both the subject words and the expansion words.
In conjunction with the third aspect and the above possible implementations, in a second possible implementation of the third aspect, the text classifier construction device further includes:
a test text classification unit, configured to determine predicted classification labels of preset test texts using the text classifier;
an optimization unit, configured to adjust the ontology expressions in the text classifier when the accuracy rate is less than a preset threshold, wherein the accuracy rate is the ratio of the number of predicted classification labels that match the original classification labels of the test texts to the total number of predicted classification labels.
In the text classifier construction method and text classification method of the above technical solution, an ontology tree is first generated; keywords are then extracted from the ontology nodes of the ontology tree; a semantic model is generated based on the keywords, and classification rules are generated based on the keywords and logical operators; ontology expressions are then generated from the semantic model and the classification rules, and each constructed ontology expression is associated with its corresponding ontology node. The ontology tree, together with the ontology expressions associated with its ontology nodes, constitutes the text classifier. When the text classifier is used for text classification, a text to be classified triggers a specific ontology expression; because each ontology expression is associated with a specific ontology node, the triggered ontology expression determines that ontology node. The information of that ontology node, such as its title, then serves as a classification label to mark the text to be classified and thereby determine its category.
Because an ontology expression includes at least one semantic-model concept that can effectively characterize the text to be classified, and because, when multiple semantic-model concepts are present, identical or different logical relations hold between them, the ontology expressions associated with different ontology nodes differ even if the keywords extracted from those nodes happen to be identical. The method is therefore suitable for classifying texts whose feature words overlap severely.
Meanwhile, because the category of a text is determined by triggering an ontology expression, there is no need to count feature coverage or compute feature weights. Even if the training corpus is unbalanced and some category has very few feature words, feature skew will not cause classification errors. Once feature words that characterize the semantics of a text have been extracted and used to construct an ontology expression, triggering that expression is sufficient to label the text to be classified, without considering how often the feature words occur or how heavily they are weighted, thereby avoiding the classification errors caused by an unbalanced training corpus.
Specific embodiment
Embodiments of the application are elaborated below with reference to the accompanying drawings.
Text classification refers to assigning a text to one or several categories under a given classification system. A text classifier is the general term for the methods used to classify texts during text mining.
A classification system includes labels at multiple levels and embodies people's specific text classification needs in different application scenarios. Taking bank credit-card customer-service work-order texts as a concrete application scenario, the classification system may be as shown in Table 1, including first-level classification labels, second-level classification labels subordinate to the first-level labels, and third-level classification labels subordinate to the corresponding second-level labels. Besides the classifications shown in Table 1, the classification system may also include other first-level classification labels and the second-level classification labels subordinate to them; other third-level classification labels may also exist under the second-level labels, and classification labels at other levels are similar.
Table 1: schematic example of a classification system
Text classification methods based on statistical methods have at least the following two defects.
First, when fine-grained classification is required, the corpus content of different categories shares identical feature words, that is, feature crossover occurs.
Taking bank credit-card customer-service work-order texts as a concrete application scenario, suppose there are two texts to be classified:
Text to be classified 1:
A counterfeit list xw was previously filed: 20150503nxxxxx181. The customer has now called again, dissatisfied with the result, claiming that our bank must also bear responsibility; strongly dissatisfied, the customer complains again. Please have your department verify and handle this as soon as possible, thanks! Telephone number: 152xxxx4718.
Text to be classified 2:
The customer calls to press for progress on form NO.: 20150916nxxxxx311, demanding prompt handling and notification of the processing result, stating that no one has contacted them so far, demanding deduction or waiver of a 4,900-yuan loss, and asking that a dispute registration be made first; the customer is willing to repay only the amounts of normal consumption and unwilling to repay the amounts from the stolen swipes. Please have your department verify and handle this as soon as possible, thanks! Telephone number: 138xxxx8628.
In the above two texts to be classified, the semantics expressed by text 1 relate to counterfeit/stolen swipe, while the semantics expressed by text 2 relate to pressing for business. Nevertheless, many feature words with the same or similar concepts appear in both texts. For example, the feature words "dissatisfied", "incoming call", and "verify" appear in both; likewise, "counterfeit" in text 1 and "stolen swipe" in text 2 are considered similar feature words from the perspective of bank credit-card customer service. In these two texts, feature words that effectively characterize the actual category are relatively rare, for example "counterfeit list" for "counterfeit/stolen swipe", and "demanding prompt handling" or "no one has contacted them so far" for "pressing for business".
When classification is performed with a statistics-based method, many identical or similar feature words are extracted from the above two texts, feature crossover is severe, and it is difficult or impossible to effectively extract feature words such as "demanding prompt handling" or "no one has contacted them so far". Facing training corpora of this kind, statistical classification methods learned automatically by a computer are prone to misjudgment, and the resulting text classifier can hardly reach the desired precision.
Second, when the training corpus is unbalanced, some categories have abundant training corpora, yielding many extracted features with wide coverage, while other categories have very little corpus, so the extracted features are limited and insufficient to cover all aspects of the category. In this case, classifying texts with a statistical method easily causes feature skew.
Still taking bank credit-card customer-service work-order texts as the concrete application scenario and continuing with text to be classified 2 above: in that text, feature words that effectively characterize the concept of pressing for business, such as "demanding prompt handling" and "no one has contacted them so far", are hard to extract; meanwhile the misleading feature word "stolen swipe" in "unwilling to repay the amounts from the stolen swipes" is easy to extract. Consequently, the feature words that effectively characterize text 2 cannot be extracted while a misleading feature word is, which easily causes misjudgment and classification errors.
In addition, suppose that when the text classifier is built, the training corpus of the category "pressing for business" is very small, so the feature words extracted from it are limited, say only the 5 feature words "press", "stolen swipe", "amount", "verify", and "handle"; meanwhile the category "counterfeit/stolen swipe" has abundant training corpus with broad feature coverage, from which 14 feature words can be extracted: "credit card", "amount", "limit increase", "quota", "stolen swipe", "incoming call", "dissatisfied", "verify", "handle", "responsibility", "repayment", "complaint", "progress", and "accept".
When faced with text to be classified 3 below, which actually belongs to the category "pressing for business", a statistics-based text classification method is prone to misjudge.
Text to be classified 3:
Previous complaint forms: 20150826j00000044, 20150902j00000248, 20150910j00000149. The customer states that they have been complaining by phone since August 26: a credit-card limit-increase application was refused by the staff, and no call-back from the person in charge of the branch has been received so far. The customer demands that the branch itself handle the complaint, repeatedly asks during the call for the telephone number of the supervisory authority, and demands that, regardless of the processing result, the branch report the processing progress. The customer says the processing time has been too long and is unwilling to continue being worn down by the process. Thanks.
A statistics-based text classification method determines the category according to the quantity and weight of the extracted features. If statistical classification were used, text to be classified 3 would be marked as the category "counterfeit/stolen swipe". This is because the "counterfeit/stolen swipe" category covers more feature words, while "pressing for business" has few feature words and limited coverage, and cannot match the content of text 3 better.
Existing text classifiers are not suitable for texts with severe feature crossover, nor for unbalanced training corpora. To solve this technical problem, referring to FIG. 1, a specific embodiment of the application provides a method for constructing a text classifier, comprising the following steps:
S100: obtaining a classification system, storing the classification system with a multi-branch tree data structure, and generating an ontology tree;
S200: extracting keywords from the ontology nodes of the ontology tree;
S300: obtaining ontology expressions, wherein each ontology expression is generated according to a classification rule and a semantic model, the classification rule is generated according to the keywords and logical operators, and the semantic model is generated according to the keywords;
S400: establishing an association between each ontology node and the corresponding ontology expression to obtain a text classifier, wherein the text classifier includes the ontology tree and the ontology expressions respectively associated with the ontology nodes of the ontology tree.
In step S100, the classification system may be constructed manually or by a computer; the application does not restrict this. The step of "storing the classification system with a multi-branch tree data structure" in S100 may specifically be performed as follows: first establish a root node; taking the root node as the parent node, add first-level ontology nodes, using the first-level classification labels of the classification system as the titles of the corresponding first-level ontology nodes; similarly, taking each first-level ontology node as a parent node, add second-level ontology nodes, using the second-level classification labels of the classification system as the titles of the corresponding second-level ontology nodes; and so on, until ontology nodes have been established for the classification labels of all levels of the obtained classification system. The ontology nodes at all levels and the parent-child relations between corresponding ontology nodes constitute the ontology tree. In the ontology tree, the first-level, second-level, third-level, and other ontology nodes may be referred to collectively as ontology nodes.
For example, continuing the example of Table 1 above, the classification system is stored with a multi-branch tree data structure, and the generated ontology tree is as shown in Table 2.
Table 2: schematic example of an ontology tree
Optionally, the step in S200 of extracting keywords from the ontology nodes of the ontology tree may include: obtaining the title of an ontology node; segmenting the title of the ontology node to obtain subject words; and using these subject words as the keywords of the ontology node.
For example, continuing the example of Table 2 and taking the third-level ontology node titled "counterfeit/stolen swipe" as an example: the title of the node is "counterfeit/stolen swipe"; segmenting it yields the subject words "counterfeit" and "stolen swipe". With "counterfeit" and "stolen swipe" as keywords, the next step of obtaining an ontology expression is performed.
Optionally, referring to FIG. 2, the step in S200 of extracting keywords from the ontology nodes of the ontology tree may include:
S210: extracting subject words from the titles of the ontology nodes;
S220: obtaining expansion words according to the subject words, and obtaining keywords that include both the subject words and the expansion words.
By obtaining expansion words, more implicit, semantically close expansion words can be mined; the subject words and the expansion words together serve as keywords for constructing the classification rules and semantic models, thereby improving the classification precision of the text classifier.
The method of extracting subject words in step S210 may refer to the method of extracting subject words in the foregoing implementation.
Referring to FIG. 3, the step of obtaining expansion words according to the subject words in step S220 may include:
S221: segmenting a preset sample text to obtain first characters;
S222: constructing an inverted index according to the first characters to obtain an index database;
S223: segmenting the subject words to obtain second characters;
S224: matching the second characters against the index database;
S225: calculating the degree of correlation between each sample text and the subject words according to the matching result;
S226: displaying, in descending order of correlation, the sample texts whose degree of correlation is greater than zero;
S227: highlighting, in the displayed sample texts, the first characters that match the second characters;
S228: obtaining first expansion words according to the characters in the displayed sample texts that partially match the subject words.
Steps S221 and S222 use specific sample texts to establish an index database based on those sample texts, so that the first expansion words can later be obtained from the index database. For example, 10,000 bank credit-card customer-service work-order texts are obtained in advance as sample texts; these 10,000 sample texts are segmented at single-character granularity to obtain the first characters. An inverted index is then constructed character by character over the first characters, forming the index database.
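Under the single-character granularity just described, the index construction of S221–S222 can be sketched as follows; English texts stand in for the Chinese work orders, and all names are illustrative.

```python
from collections import defaultdict

# S221-S222: segment each sample text at single-character granularity and
# build an inverted index mapping each first character to its postings,
# i.e. the (sample id, position) pairs at which the character occurs.
def build_index_database(sample_texts):
    index = defaultdict(list)
    for doc_id, text in enumerate(sample_texts):
        for pos, ch in enumerate(text):
            index[ch].append((doc_id, pos))
    return index

index_db = build_index_database(["counterfeit list", "stolen swipe"])
print(index_db["s"])  # postings of the character "s" across both samples
```

Storing positions (not just document ids) is what later allows a matched character to be extended leftward or rightward inside the sample text when forming expansion words.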
In step S223, the subject words extracted in step S210 are segmented with the same segmentation method as in S221 to obtain the second characters.
In step S224, the second characters are matched character by character against the inverted index in the index database. The more characters matched in a sample text, the higher the degree of correlation between that sample text and the subject words is considered to be; specifically, the degree of correlation between a sample text and the subject words can be calculated from the length of the matched characters.
In step S228, a first character in a sample text that partially matches a subject word is extended backward or forward to obtain a word-level character string with a complete meaning, and that character string is used as a first expansion word. Besides the partially matched case, a character string that exactly matches a subject word can likewise be extended forward or backward within the sample to obtain a character string with complete meaning as a first expansion word. This step may be completed manually or by a computer; the application does not restrict this.
In steps S223 to S228, the subject words are used as search terms and matched against the inverted index of a specific index database so as to calculate the degree of correlation between the sample texts and the subject words; the sample texts are displayed in descending order of correlation, and the first characters matching the second characters are highlighted in the displayed sample texts for visual presentation, which assists in quickly locating and obtaining the first expansion words. Especially when the sample texts are long or the number of sample texts whose correlation is greater than zero is large, obtaining the first expansion words manually from the highlighted characters greatly improves efficiency and reduces workload.
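Steps S223–S228 can be sketched as below. The relevance measure (count of matched subject-word characters) and the bracket marking are simplifications of the "matched character length" correlation and the highlighted display described above.

```python
# S223-S225: segment the subject word into second characters, match them
# against each sample text, and take the matched-character count as the
# degree of correlation.
def correlation(subject_word, sample):
    return sum(1 for ch in set(subject_word) if ch in sample)

# S226-S227: show samples with correlation > 0 in descending order,
# marking matched characters with brackets in place of highlighting.
def show_ranked(subject_word, samples):
    shown = []
    ranked = sorted(samples, key=lambda s: correlation(subject_word, s),
                    reverse=True)
    for sample in ranked:
        score = correlation(subject_word, sample)
        if score > 0:
            marked = "".join(f"[{c}]" if c in subject_word else c
                             for c in sample)
            shown.append((score, marked))
            print(score, marked)
    return shown

show_ranked("swipe", ["a stolen swipe report", "no match at all?"])
```

A human (or a follow-up routine, per S228) then reads the marked spans outward to recover word-level expansion words with complete meaning.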
For example, continuing the example of Table 2, the subject words "counterfeit" and "stolen swipe" are extracted. With "counterfeit" and "stolen swipe" respectively as search terms, matching content is located in the sample texts using the index database constructed from the aforementioned 10,000 sample texts. Suppose the result shows 3 sample texts whose degree of correlation is greater than 0, as follows.
Sample text 1:
A counterfeit list xw was previously filed: 20150503nxxxxx181. The customer has now called again, dissatisfied with the result, claiming that our bank must also bear responsibility; strongly dissatisfied, the customer complains again. Please have your department verify and handle this as soon as possible, thanks! Telephone number: 152xxxx4718.
Sample text 2:
The customer reports that no card was applied for, yet transactions occurred; see counterfeit list: 20150207j11000092. During this period the customer called several times to press for progress; see forms: 20150209j23240075, 20150210j23240017, 20150211j23240055. The customer has now called again, stating that although the problem was reported and pressed for repeatedly, the branch's reply received on 2/11 only asked whether the card had been applied for personally, with no reply whatsoever about the processing result; the customer is very dissatisfied with the processing progress and demands a final result as soon as possible, or a reply stating an accurate time limit for handling. Our department's online attempts to pacify the customer have failed; please reply and handle this, thanks!
Sample text 3:
The customer complains that the card has been used without authorization and has filled in forms: 20150708j00000081, 20150714j00000214. Dissatisfied with the current processing result, the customer still asks us to provide proof that the short message involved was a counterfeit message, and demands prompt handling; your department is asked to assist, thanks!
The subject word "counterfeit" is exactly matched in the texts of sample texts 1 and 2. Sample text 3, besides exactly matching "counterfeit", also partially matches the characters of "stolen swipe". Accordingly, "counterfeit" can be extended rightward into "counterfeit short message", and the partially matched character extended rightward into "unauthorized use", so that "counterfeit short message" and "unauthorized use" serve as first expansion words. The first expansion words, together with the subject words, are used as keywords for the next step of obtaining an ontology expression.
Besides obtaining the first expansion words with the index database, second expansion words can also be obtained according to the semantics of the subject words: character strings that cannot be matched but are semantically identical or similar. The first expansion words, the second expansion words, and the subject words are then used together as keywords for the next step of obtaining an ontology expression.
For example, from sample text 2 above it can be found that even in a text where the word "counterfeit" does not appear, when "no card was applied for" and "transaction" occur together, the content of the text is still related to counterfeit/stolen swipe, and the user expects the text to be classified under the category "counterfeit/stolen swipe". Accordingly, "no card was applied for" and "transaction" can be used as second expansion words.
Through the step of obtaining second expansion words according to the subject words, implicit expansion words can be further mined, thereby further improving the classification precision of the text classifier. The second expansion words may be obtained manually or by a computer; the application does not restrict this.
In step S300, the ontology expressions may be constructed manually or generated by a computer; the application does not restrict this. Obtaining an ontology expression may mean manually inputting the ontology expression into a computer, which then obtains it; it may also mean the computer receiving an ontology expression generated and sent by another computer, thereby completing the step of obtaining the ontology expression. The application likewise does not restrict this.
The generation of an ontology expression may specifically be realized by the following steps:
First, according to the keywords extracted in step S200, at least one keyword is connected with logical operators, so that logical associations exist between logical operators and keywords, and between keywords, generating a classification rule.
A logical operator, also known as a logical connective, in the embodiments of the application includes: logical AND "+", logical NOT "-", logical OR "|", and grouping "()". For example, the classification rule A+B requires that both A and B be included; the classification rule A+(B|C) requires that either B or C be included, while also requiring that A be included.
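The operator semantics just described can be illustrated with a small evaluator. Representing rules as nested tuples instead of parsing "+"/"|"/"-" strings is a simplification for this sketch; the names are illustrative.

```python
# A hypothetical evaluator for the classification-rule operators above:
# logical AND "+", logical OR "|", logical NOT "-", grouping via nesting.
# A bare string is a keyword that must appear in the text.

def evaluate(rule, text):
    op, *args = rule if isinstance(rule, tuple) else ("word", rule)
    if op == "word":
        return args[0] in text                        # keyword present?
    if op == "+":
        return all(evaluate(a, text) for a in args)   # logical AND
    if op == "|":
        return any(evaluate(a, text) for a in args)   # logical OR
    if op == "-":
        return not evaluate(args[0], text)            # logical NOT
    raise ValueError(f"unknown operator {op!r}")

# Classification rule A + (B | C): requires A, plus either B or C.
rule = ("+", "A", ("|", "B", "C"))
print(evaluate(rule, "text containing A and C"))  # True
```

Nesting tuples plays the role of the "()" grouping operator, which is why the rule A+(B|C) is written with the OR clause as an inner tuple.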
Continuing the "counterfeit/stolen swipe" example from S200: the subject words extracted from the third-level ontology node titled "counterfeit/stolen swipe" are "counterfeit" and "stolen swipe"; the first expansion words obtained from the subject words are "counterfeit short message" and "unauthorized use"; the second expansion words are "no card was applied for" and "transaction". With the subject words and the two classes of expansion words together as keywords, 3 classification rules are generated:
Classification rule 1: counterfeit | stolen swipe;
Classification rule 2: counterfeit short message + unauthorized use;
Classification rule 3: no card was applied for + transaction.
The above classification rules may be constructed manually or generated by a computer; the application does not restrict this.
Second, a semantic model is generated according to the keywords. A semantic model refers to the text presentation forms, oriented to a known concept and obtained by inductive enumeration from sample data, that are used to describe the semantics of that known concept.
Specifically, in one implementation, a semantic model may include either of a general-language concept and a business-factor concept, marked with the two symbols "c_" and "e_" respectively. The keywords are divided into general-language concepts and business-factor concepts, and for each keyword, the different presentation forms of the known concept are extracted from contextual text information.
For example, still continuing the "counterfeit/stolen swipe" example from S200: "counterfeit", "stolen swipe", "counterfeit information", "unauthorized use", "apply for card", and "use card" are each used as business-factor concepts, and the "negation concept" is used as a general-language concept; from the existing sample data, the different text presentation forms of each concept are enumerated inductively, as shown in Table 3 below:
Table 3: semantic model example 1

Concept type | Concept | Different presentation forms of the concept (feature words)
Business-factor concept | e_counterfeit | counterfeit, fake, impersonate
Business-factor concept | e_stolen swipe | stolen swipe
Business-factor concept | e_counterfeit information | counterfeit short message, counterfeit message, counterfeit incoming call, counterfeit mail
Business-factor concept | e_unauthorized use | unauthorized use
Business-factor concept | e_apply for card | apply {0,2} card
Business-factor concept | e_use card | swiped with the card, proxy transaction, using the card, with the card
General-language concept | c_negation concept | not, without, never
Besides the two classes of general-language concepts and/or business-factor concepts, a semantic model may also include instant concepts, that is, concepts set up on the spot by the user according to actual needs, which may be marked with the symbol "k_". For example, if classification requires that "initial quota" appear in the text, an instant concept can be defined directly and represented as "k_initial quota"; that concept should include only the term "initial quota".
The above semantic model may be constructed manually or generated by a computer; the present application does not limit this.
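As a minimal illustrative sketch (not the claimed implementation), such a semantic model can be thought of as a mapping from prefixed concept labels to their feature words; the concept names and feature words below are hypothetical English stand-ins for the translated examples:

```python
# Illustrative sketch only: "e_" marks business-factor concepts, "c_"
# general-language concepts, "k_" instant concepts, as in the text.
SEMANTIC_MODEL = {
    "e_counterfeit":   ["counterfeit", "pass off", "impersonate"],
    "e_fraud_swipe":   ["fraudulent swipe"],
    "c_negation":      ["not", "never", "without"],
    "k_initial_limit": ["initial limit"],  # instant concept: one exact word
}

def concepts_in(text, model):
    """Return the concept labels whose feature words occur in the text."""
    return {concept for concept, words in model.items()
            if any(w in text for w in words)}
```

Each concept thus stands for all of its enumerated surface forms, which is what allows different wordings of the same semantics to trigger the same concept.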
Finally, the ontology expression is generated from the classification rules and the semantic model. Specifically, each keyword in a classification rule is mapped to the corresponding concept of the semantic model, and the corresponding concepts are connected with the same logical operators as in the classification rule, thereby generating the ontology expression.
For example, continuing the aforementioned "counterfeit fraudulent-swipe" example, the following ontology expressions may be generated:
Ontology expression 1: e_counterfeit | e_fraudulent swipe;
Ontology expression 2: e_counterfeit information + e_misappropriation;
Ontology expression 3: c_negation + e_card application + e_card use.
It should be noted that in step S300 the classification rules and the semantic model may be generated simultaneously or one after the other; the present application places no restriction on the order of generation.
In step S400, the ontology expression generated in step S300 on the basis of an ontology node of step S200 is associated with that ontology node. One ontology node may be associated with one or more ontology expressions. After every ontology node in the ontology tree has been associated with its respective ontology expressions, the ontology tree, together with the ontology expressions associated with its nodes, constitutes the text classifier, which is used to classify unknown text.
In the text classifier construction method and text classification method of the above embodiments, an ontology tree is first generated; keywords are then extracted from the ontology nodes of the tree; a semantic model is generated from the keywords, and classification rules are generated from the keywords and logical operators; ontology expressions are then generated from the semantic model and the classification rules; and each constructed ontology expression is associated with its corresponding ontology node, so that the ontology tree and the ontology expressions associated with its nodes together constitute the text classifier. When the text classifier is used for text classification, a text to be classified triggers a specific ontology expression; since the ontology expression is associated with a specific ontology node, that node can be determined from the triggered expression. Information of the ontology node, such as its title, then serves as the classification label that marks the text to be classified and thereby determines its class.
Since an ontology expression contains at least one concept of the semantic model that effectively characterizes the text to be classified, and since, when several such concepts are present, identical or different logical relations exist between them, the ontology expressions associated with different ontology nodes differ even when the keywords extracted from those nodes happen to be identical. The classifier is therefore suitable for classifying texts whose feature words overlap heavily.
At the same time, because the above text classifier determines the class of a text by triggering ontology expressions, neither the number nor the weight of covered features needs to be computed. Hence, even if the training corpus is unbalanced and some class has very few feature words, feature skew will not lead to misclassification. Once the feature words that characterize the semantics of a text have been extracted and used to construct an ontology expression, triggering that expression is sufficient to label the text to be classified, regardless of how often the feature words occur or how they are weighted, thereby avoiding classification errors caused by an unbalanced training corpus.
For example, continuing the examples of text to be classified 1, text to be classified 2 and text to be classified 3 given in the section on the shortcomings of the statistical method: the third-level ontology node titled "counterfeit fraudulent-swipe" is associated with the ontology expressions "k_counterfeit form" and "e_counterfeit + c_demand responsibility | c_dissatisfied", so that text to be classified 1 triggers an ontology expression in the text classifier and is thereby classified into the "counterfeit fraudulent-swipe" class. Similarly, the third-level ontology node titled "urging service" is associated with the ontology expressions "e_urge", "c_inquire + e_handling progress" and "e_no reply + e_handling time + c_long", so that text to be classified 2 and text to be classified 3 are marked with the "urging service" class.
Here, the semantic model is as shown in Table 4.
Table 4. Semantic model, example 2
It should be noted that in Table 2, "undertake{0,2}responsibility" indicates that a text still matches even if 0 to 2 characters appear between "undertake" and "responsibility"; for example, a text containing "undertake the responsibility" or "undertook responsibility" is considered to match "undertake{0,2}responsibility". Other similar notations in this application have the same meaning.
In Table 2, "[^not]{0,5}dissatisfied" indicates that a text matches as long as the 0 to 5 characters preceding "dissatisfied" do not include "not"; for example, "very dissatisfied" matches, while reversed-semantics feature words such as "not dissatisfied" are excluded. Other similar notations in this application have the same meaning.
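The two notations above can be sketched in code; in the original the gaps count Chinese characters, so the English words here are purely illustrative stand-ins:

```python
import re

# "undertake{0,2}responsibility": up to two arbitrary characters may
# appear between the two words and the phrase still matches.
gap = re.compile(r"undertake.{0,2}responsibility")

def match_excluding(text, word, neg, window=5):
    """Sketch of '[^neg]{0,window}word': match `word` unless `neg`
    occurs within `window` characters immediately before it."""
    i = text.find(word)
    while i != -1:
        if neg not in text[max(0, i - window):i]:
            return True
        i = text.find(word, i + 1)
    return False
```

The exclusion is implemented as a scan rather than a plain character class, since the excluded token may be longer than one character in this illustration.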
Optionally, referring to FIG. 4, the construction method of the text classifier may further include:
S500: determining prediction classification labels of preset test texts using the text classifier;
S600: adjusting the ontology expressions in the text classifier when the accuracy rate is less than a preset threshold, the accuracy rate being the ratio of the number of prediction classification labels that match the original classification labels of the test texts to the total number of prediction classification labels.
In step S500, the preset test texts have been hand-labeled with original classification labels. Usually there is more than one test text, and the test texts belong to the same class of text as the sample texts. For example, if the sample texts are customer-service work-order texts of a bank's credit card division, the test texts are generally also customer-service work-order texts of a bank's credit card division.
In step S600, if the accuracy rate is greater than or equal to the preset threshold, the text classifier can classify unknown text effectively. If it is less than the preset threshold, the text classifier is optimized by adjusting the ontology expressions.
Steps S500 and S600 may form an iterative process: through continuous optimization, the accuracy rate of the optimized text classifier can reach the threshold desired by the user.
Referring to FIG. 5, step S600 may specifically include:
S610: extracting the ontology expression corresponding to a prediction classification label that does not match the original classification label;
S620: when a constraint factor is missing from the corresponding ontology expression, adding the constraint factor to the ontology expression to obtain an optimized ontology expression, the constraint factor including a concept of the semantic model and/or a logical operator.
Step S610 may specifically be implemented as follows: first, the prediction classification labels that do not match the original classification labels are extracted; then the ontology node whose title is identical to the prediction classification label is found; the ontology expression associated with that node is then determined, so that the ontology expression triggered by the test text is located.
In step S620, when the ontology expression extracted in step S610 lacks a constraint factor, the constraint factor may be added to it; that is, the original ontology expression is optimized by adding at least one of a business-factor concept, a general-language concept or an instant concept, together with a logical operator, so that a text to be classified that originally matched the expression no longer matches the optimized expression, or a text that originally failed to match the expression now matches the optimized one. For example, a new concept may be added to the semantic model together with a new logical operator, thereby generating the optimized ontology expression; feature words may also be added to or removed from an existing concept; logical operators may also be added or removed so that new logical relations are formed between concepts. The original ontology expression is replaced by the optimized one, which is associated with the corresponding ontology node of the ontology tree, yielding the optimized text classifier.
For example, continuing the aforementioned "counterfeit fraudulent-swipe" example, the test texts include test text 1.
Test text 1:
The client is dissatisfied with the handling result of the counterfeit complaint form and demands to escalate the complaint; see form NO.: 20150810s11000063. A new complaint form 20150906s00000076 has been filled in online on the client's behalf and routed to risk management. However, the client insists on going to the credit card center in Wuhan, Hubei, to resolve the problem in person. Please verify and handle as soon as possible, thanks. Telephone number: 138xxxxx124.
Test text 1 matches the ontology expression "e_counterfeit | e_fraudulent swipe", i.e., triggers that expression. According to the expression, the associated ontology node is determined in the ontology tree, and according to the node's title the test text is marked with the prediction classification label "counterfeit fraudulent-swipe". However, the actual semantics of test text 1 are not counterfeit fraudulent-swipe but urging service; its hand-assigned original classification label is "urging service", which does not match the prediction label. Analysis shows that the expression should be triggered only when "counterfeit" or "fraudulent swipe" appears and "handling result" does not; since test text 1 contains "handling result", it should not trigger the expression. The current ontology expression therefore lacks constraint factors: the logical operator "-" and the business-factor concept "e_handling result", where "e_handling result" contains the feature word "handling result". Adding the missing constraint factors yields the optimized ontology expression: e_counterfeit | e_fraudulent swipe - e_handling result. The original ontology expression is replaced by the optimized one, which is associated with the corresponding ontology node of the ontology tree, yielding the optimized text classifier.
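Reading the "-" operator as an exclusion, the effect of the optimization in this example can be sketched as follows (concept names are hypothetical stand-ins):

```python
# Illustrative sketch: evaluate an 'A | B - C' expression as (A or B) and not C,
# so a text that also contains the excluded concept no longer triggers it.
def matches_optimized(expression, found_concepts):
    positive, _, excluded = expression.partition("-")
    if excluded and excluded.strip() in found_concepts:
        return False  # the "-" constraint factor suppresses the match
    return any(c.strip() in found_concepts for c in positive.split("|"))
```

A text like test text 1, which contains the "handling result" feature word and hence the excluded concept, would thus stop triggering the optimized expression.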
Referring to FIG. 6, in another embodiment, a text classification method is provided, comprising the following steps:
S710: obtaining a text to be classified;
S720: determining the ontology expression in a text classifier that matches the text to be classified, wherein the text classifier includes an ontology tree and the ontology expressions associated with the respective ontology nodes of the ontology tree;
S730: determining the ontology node associated with the ontology expression;
S740: determining, according to information of the ontology node, the class to which the text to be classified belongs.
In step S720, the ontology tree is stored in a multi-branch tree data structure. Within an ontology tree, an ontology node may be associated with at least one ontology expression. When more than one ontology expression is associated with a node, the expressions form an ontology expression set. Whether the text to be classified matches an expression in the set may be judged by traversing the set one by one, or the expressions may be matched in parallel to increase the matching speed; especially when the volume of texts to be classified is large, this improves the overall classification speed.
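The parallel variant of step S720 can be sketched as follows; the `matcher` predicate is a stand-in for whatever expression evaluation is used:

```python
# Illustrative sketch: evaluate an expression set against one text in
# parallel; map() preserves the order of the expression set.
from concurrent.futures import ThreadPoolExecutor

def triggered_expressions(text, expressions, matcher):
    with ThreadPoolExecutor() as pool:
        hits = list(pool.map(lambda e: matcher(text, e), expressions))
    return [e for e, hit in zip(expressions, hits) if hit]
```

Whether threads actually help depends on the matcher's cost; for large batches of texts, parallelizing over texts rather than expressions may be the better split.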
In step S740, the information of the ontology node may specifically be its title or the like. The title of the ontology node determined in S730 as associated with the ontology expression serves as the classification label that marks the text to be classified, thereby classifying it. When the same text to be classified triggers more than one ontology expression and the several expressions correspond to different ontology nodes, the titles of the several nodes may each serve as a classification label and mark the same text, achieving a multi-label classification effect.
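Steps S730 and S740 can be sketched together: each triggered expression is mapped back to the title of its associated node, and the titles become the labels of the text (the expression-to-title mapping below is hypothetical):

```python
# Illustrative sketch: several triggered expressions may yield several
# labels for the same text; duplicate titles are emitted once.
def labels_for(triggered, node_title_of):
    labels = []
    for expr in triggered:
        title = node_title_of[expr]
        if title not in labels:
            labels.append(title)
    return labels
```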
Referring to FIG. 7, in another embodiment, a text classifier construction apparatus is provided, comprising:
a first acquisition unit 1 for obtaining a classification system, storing the classification system in a multi-branch tree data structure, and generating an ontology tree;
an extraction unit 2 for extracting keywords from the ontology nodes of the ontology tree;
a second acquisition unit 3 for obtaining ontology expressions, the ontology expressions being generated from classification rules and a semantic model, the classification rules being generated from the keywords and logical operators, and the semantic model being generated from the keywords;
a generation unit 4 for associating the ontology nodes with the corresponding ontology expressions to obtain a text classifier, the text classifier including the ontology tree and the ontology expressions associated with the respective ontology nodes of the ontology tree.
Optionally, referring to FIG. 8, the step of generating the ontology expressions may be performed by an external computer or manually. In this case, after the extraction unit extracts the keywords, it sends them out; the external computer, or a human operator, generates the classification rules and the semantic model from the keywords, and generates the ontology expressions from the semantic model and the classification rules. The second acquisition unit then receives the externally input ontology expressions, and the generation unit finally constructs the text classifier. In this way, the computational load of the text classifier construction apparatus itself can be reduced.
Optionally, referring to FIG. 9, the extraction unit 2 may include:
a subject-word extraction subunit 21 for extracting subject words from the titles of the ontology nodes;
an extension subunit 22 for obtaining expansion words from the subject words, so as to obtain keywords that include the subject words and the expansion words.
By obtaining expansion words through the extension subunit, more implicit words with similar semantics can be mined; the subject words and the expansion words together serve as keywords for constructing the classification rules and the semantic model, thereby improving the classification precision of the text classifier.
Optionally, referring to FIG. 9, the text classifier construction apparatus may further include:
a test-text classification unit 5 for determining prediction classification labels of preset test texts using the text classifier;
an optimization unit 6 for adjusting the ontology expressions in the text classifier when the accuracy rate is less than a preset threshold, the accuracy rate being the ratio of the number of prediction classification labels that match the original classification labels of the test texts to the total number of prediction classification labels.
Through continuous optimization, the accuracy rate of the optimized text classifier can reach the threshold desired by the user.
Optionally, the extension subunit 22 may include:
a first word-segmentation unit for segmenting preset sample texts to obtain first characters;
an index-database construction unit for building an inverted index from the first characters to obtain an index database;
a second word-segmentation unit for segmenting the subject words to obtain second characters;
a matching unit for matching the second characters against the index database;
a relevance calculation unit for calculating the relevance between the sample texts and the subject words according to the matching result;
a display unit for displaying, in descending order of relevance, the sample texts whose relevance is greater than zero;
a highlighting unit for highlighting, in the displayed sample texts, the first characters that match the second characters;
a first expansion-word acquisition unit for obtaining expansion words from the parts of the displayed sample texts that match the subject words.
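The indexing and ranking part of this pipeline can be sketched as follows, assuming segmentation has already produced token lists and using a simple shared-token count as the relevance measure (a stand-in for whatever relevance formula the units actually use):

```python
# Illustrative sketch of the extension-subunit pipeline.
from collections import defaultdict

def build_inverted_index(segmented_samples):
    """segmented_samples: list of token lists. Returns token -> sample ids."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(segmented_samples):
        for token in tokens:
            index[token].add(doc_id)
    return index

def rank_samples(subject_tokens, index):
    """Relevance = number of subject tokens a sample shares; samples with
    zero relevance are dropped, as by the display unit."""
    scores = defaultdict(int)
    for token in subject_tokens:
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: -item[1])
```

The highlighted matching tokens in the top-ranked samples are then the candidates from which expansion words are taken.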
Optionally, the optimization unit 6 may include:
an ontology-expression extraction unit for extracting the ontology expression corresponding to a prediction classification label that does not match the original classification label;
an adjustment unit for adding, when a constraint factor is missing from the corresponding ontology expression, the constraint factor to the ontology expression to obtain an optimized ontology expression, the constraint factor including a concept of the semantic model and/or a logical operator.
For identical or similar parts, the embodiments in this specification may refer to one another. The embodiments of the invention described above are not intended to limit the scope of the present invention.