CN107193959A

CN107193959A - A business entity classification method for plain text

Info

Publication number: CN107193959A
Application number: CN201710371464.7A
Authority: CN
Inventors: 张雷; 陈嘉伟; 谢璐遥; 王崇骏
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2017-09-22
Anticipated expiration: 2037-05-24
Also published as: CN107193959B

Abstract

The present invention discloses a business entity classification method for plain text, comprising the following steps: S1, labeling the business entities in the collected plain text data to form the training set of the business entity recognition module, and labeling the business entities in the collected plain text data with categories according to industry type to form the training sample set of the business entity classification module; S2, training with a conditional random field model to obtain a business entity recognition model; S3, constructing semantic vectors from the text data of the original training set; S4, training a business entity classification model using the semantically vectorized, category-labeled training set data as the training input; S5, classifying the business entities in the text to be predicted with the business entity classification model. The method uses the resulting semantic vectors as entity features, which reduces the dependence on hand-crafted features and external data and safeguards generality and robustness.

Description

A business entity classification method for plain text

Technical field

The invention belongs to the technical field of named entity recognition and fine-grained entity classification, and in particular relates to a business entity classification method for plain text.

Background

In recent years, with the rise of "Internet finance", more and more corporate decision makers urgently need more advanced information processing methods to extract and analyze massive Internet data in order to make better decisions. Among these data, plain text such as court documents and news and public-opinion reports has become the primary source from which enterprises obtain high-value information.

Named entity recognition is the basis for tasks such as entity semantic analysis and entity relation extraction. Current mainstream named entity recognition techniques simply divide entities into person names, place names, organization names and so on, which leaves the entity types lacking in semantics. Moreover, entity classification relies too heavily on hand-crafted features and external data, so its generality and robustness cannot be guaranteed.

Summary of the invention

The invention addresses two problems. First, current mainstream named entity recognition techniques simply divide entities into person names, place names, organization names and so on, so that the entity types lack semantics. Second, entity classification relies too heavily on hand-crafted features and external data, so its generality and robustness cannot be guaranteed. To solve these problems, the invention proposes a business entity classification method for plain text that adopts a finer-grained division of business entities and builds features from the semantics of the text itself to obtain the final classification of business entities. Here, plain text means text that contains business activity information, such as news text and court documents.

As shown in Fig. 1, the business entity classification method for plain text disclosed by the invention comprises the following steps:

S1, the business entities in the collected plain text data are labeled, and the labeled data serve as the training set of the business entity recognition module; the business entities in the collected plain text data are also labeled with categories according to industry type, and the labeled data serve as the training sample set of the business entity classification module;

S2, a business entity recognition model is trained with a conditional random field model, yielding the business entity recognition model;

S3, semantic vectors are constructed from the text data of the original training set;

S4, the semantically vectorized, category-labeled training set data are used as the training input to train the business entity classification model;

S5, the business entities in the text to be predicted are classified with the business entity classification model.

Further, in S1, the collected plain text data are split into sentences, segmented into words and part-of-speech tagged, and the business entities and industry categories in the plain text data are annotated manually.

Further, the open-source word segmentation and part-of-speech tagging software HanLP is used to split the plain text data into sentences, segment it into words and tag parts of speech.

Further, business entities in the plain text data are annotated in the "BIO" scheme, in which the first word of a business entity is labeled "B", the remaining words of the business entity are labeled "I", and words unrelated to any business entity are labeled "O".

Further, in the manual annotation, each business entity in the plain text data is assigned a category according to its industry type, judged from its context.

Further, in S2, the business entity recognition model is trained with a conditional random field model into which boundary features are introduced.

Further, introducing boundary features into the conditional random field model comprises: segmenting enterprise names with HanLP and collating the results into left and right boundary dictionaries; training left and right boundary prediction models with the open-source libSVM; taking words from the training set one at a time and using the boundary prediction models to judge whether each word is a left or right boundary word; and feeding the training set data, comprising the word itself, the part-of-speech tag, the left/right boundary label and the entity label, into an open-source conditional random field tool to train and obtain the business entity recognition model.

Further, in S3, a word-vector tool is used to obtain the word vectors of all words in the training sample set; the inverse document frequency (IDF) of all words in the training sample set is calculated; the word vectors and TF-IDF values are used to compute the business entity vector and the context vector for each sentence containing a business entity; and the business entity vector and the context vector are concatenated to obtain a business entity semantic vector that incorporates context semantics.

Further, the word vectors of all words in the training set are calculated with the open-source word2vec tool.

Further, in S4, the business entity classification model is trained on the category-labeled training set data with a softmax model.

The advantages of the invention are as follows:

1) The left and right boundaries of an entity are first determined using dictionary rules and an SVM classifier, and the boundary judgments are then fed into the conditional random field model as new features; the improved method greatly increases recall and F1 score.

2) Weighted word embeddings are used to represent an entity and its context as a semantic vector, so that the semantic relation between entities can be measured by the distance between their semantic vectors. Using the resulting semantic vectors as entity features reduces the dependence on hand-crafted features and external data.

3) Entity boundary features are introduced into the existing conditional random field model, which strengthens the model's control over entity boundaries; for example, recognition recall improves markedly, which also safeguards generality and robustness.

Brief description of the drawings

Fig. 1 is a flow diagram of the business entity classification method for plain text disclosed by the invention.

Fig. 2 is a flow chart of training set construction in the embodiment.

Fig. 3 is a flow chart of training the business entity recognition model based on the improved conditional random field in the embodiment.

Fig. 4 is a flow chart of building entity semantic vectors weighted by word vectors and TF-IDF values in the embodiment.

Fig. 5 is a flow chart of training the business entity classification model.

Fig. 6 is a flow chart of business entity classification.

Detailed description

To better explain the technical content of the invention, a specific embodiment of the business entity classification method applied to court documents is described below with reference to the accompanying drawings.

As shown in Fig. 2, a training sample set is built before the invention is implemented. In the embodiment, the training sample set is built as follows:

Step 1-0, start of training set construction.

Step 1-1, court documents are collected from the Internet with a web crawler to serve as the raw corpus.

Step 1-2, the collected document text is split into sentences, segmented into words and part-of-speech tagged using the open-source word segmentation and part-of-speech tagging software HanLP. Other general-purpose open-source segmenters, such as the Chinese Academy of Sciences segmenter, can also be used; HanLP is chosen in the embodiment because it segments comparatively well among current open-source segmenters and supports user-defined dictionaries, which makes it more convenient.

Step 1-3, because a business entity in the text (the name of an enterprise, mainly in full-name or abbreviated form) is split into several words by segmentation, the business entities in the document text must be annotated manually. The annotation uses the "BIO" scheme: the first word of a business entity is labeled "B", the remaining words of the business entity are labeled "I", and words unrelated to any business entity are labeled "O", for example "defendant (O) Jiangsu (B) Eurasian (I) film (I) Co., Ltd (I)". The annotated data serve as the training set of the business entity recognition model.
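The BIO scheme of step 1-3 can be illustrated with a minimal Python sketch. The function name `bio_tags` and the representation of an annotation as a (start, end) token span are assumptions made for illustration only; the embodiment itself relies on manual annotation.

```python
def bio_tags(tokens, entity_spans):
    """Label each token B, I or O.

    entity_spans holds hypothetical (start, end) token index pairs,
    end exclusive, one pair per annotated business entity.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"               # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I"               # remaining words of the entity
    return tags


# The example sentence from step 1-3, already segmented into words.
tokens = ["defendant", "Jiangsu", "Eurasian", "film", "Co., Ltd"]
tags = bio_tags(tokens, [(1, 5)])
print(list(zip(tokens, tags)))
```
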

At the same time, each business entity in the collected document text is assigned a category according to its industry type, judged from its context. The annotated data serve as the training set of the business entity classification model; each item consists of a sentence containing an enterprise name together with the label of that enterprise's industry, and the whole training set is a collection of such sentence-plus-label items. The labeling standard can follow the division in the accurate and authoritative Industrial Classification for National Economic Activities (GB/T 4754-2011).

Step 1-4, end of training set construction.

As shown in Fig. 3, after the training set has been built, an improved conditional random field method is used, i.e. the business entity recognition model is trained with a conditional random field model into which boundary features are introduced.

Step 2-0, start of business entity recognition model training.

Step 2-1, the training set data that have been split into sentences, segmented, part-of-speech tagged and entity-labeled (i.e. the annotated data from step 1-3) are input.

Step 2-2, a number of business directories are crawled from the Internet, and the enterprise names are segmented with HanLP and collated into left and right boundary dictionaries. A left boundary word is the first word of a segmented enterprise name, and a right boundary word is the last word of a segmented enterprise name. All left and right boundary words are collated into the left and right boundary dictionaries.
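A minimal Python sketch of the dictionary construction in step 2-2 follows. It assumes the enterprise names have already been segmented (the embodiment uses HanLP for this); the function name `build_boundary_dicts` and the sample names are hypothetical.

```python
def build_boundary_dicts(segmented_names):
    """Collect first words (left boundaries) and last words (right boundaries).

    segmented_names is a list of enterprise names, each already split
    into a list of words by a segmenter.
    """
    left_dict, right_dict = set(), set()
    for words in segmented_names:
        if not words:
            continue                    # skip empty segmentation results
        left_dict.add(words[0])         # first word after segmentation
        right_dict.add(words[-1])       # last word after segmentation
    return left_dict, right_dict


# Hypothetical segmented enterprise names, for illustration only.
names = [
    ["Jiangsu", "Eurasian", "film", "Co., Ltd"],
    ["Nanjing", "steel", "group"],
]
left_dict, right_dict = build_boundary_dicts(names)
```
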

Step 2-3, the left and right boundary prediction models are trained with the open-source libSVM. The features selected for training the left boundary prediction model are the word form and part of speech of the current word and of the two following words; the features selected for training the right boundary prediction model are the word form and part of speech of the current word and of the two preceding words. The open-source libSVM is used because it offers good robustness and good classification boundaries.
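The feature windows of step 2-3 can be sketched in Python as below. This only assembles the symbolic features; converting them to the numeric vectors libSVM expects is left out. The padding token for positions beyond the sentence and the part-of-speech tags in the example are assumptions, not details given in the text.

```python
PAD = ("<PAD>", "<PAD>")  # assumed placeholder for positions outside the sentence


def left_boundary_features(sentence, i):
    """Word and POS of the current word and the two following words."""
    window = [sentence[j] if j < len(sentence) else PAD for j in range(i, i + 3)]
    return [value for pair in window for value in pair]


def right_boundary_features(sentence, i):
    """Word and POS of the two preceding words and the current word."""
    window = [sentence[j] if j >= 0 else PAD for j in range(i - 2, i + 1)]
    return [value for pair in window for value in pair]


# Each token is a (word, part-of-speech) pair; the tags are illustrative.
sentence = [("defendant", "n"), ("Jiangsu", "ns"), ("Eurasian", "nz"),
            ("Film", "n"), ("Co.Ltd", "nis")]
print(left_boundary_features(sentence, 3))
print(right_boundary_features(sentence, 1))
```
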

Step 2-4, words are taken from the training set one at a time, and the left and right boundary prediction models are used to judge whether each word is a left or right boundary word.

Whether the current word is a left boundary word is determined as follows: if the word appears in the left boundary dictionary and some word within the two-word window to its right is judged a left boundary word by the SVM, it is accepted as a correct left boundary word; otherwise it is discarded. Each word has a result under both the dictionary method and the SVM method, but each method has shortcomings on its own; this step of the embodiment combines the two results to select a more reasonable one.

Whether the current word is a right boundary word is determined as follows: if the word appears in the right boundary dictionary and some word within the two-word window to its left is judged a right boundary word by the SVM, it is accepted as a correct right boundary word; otherwise it is discarded.
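The combined dictionary-plus-SVM rule can be sketched as follows. The SVM outputs are represented as precomputed lists of booleans, one per word, rather than calls into libSVM. Whether the two-word window includes the current word is not stated in the text; this sketch assumes it does, and that choice is an assumption.

```python
def is_left_boundary(i, words, left_dict, svm_left):
    """Dictionary hit plus SVM support in the window to the right.

    svm_left[j] is True when the SVM judges word j a left boundary.
    Assumption: the window covers the current word and the two words after it.
    """
    if words[i] not in left_dict:
        return False
    return any(svm_left[i:i + 3])


def is_right_boundary(i, words, right_dict, svm_right):
    """Mirror rule: the window covers the current word and the two words before it."""
    if words[i] not in right_dict:
        return False
    return any(svm_right[max(0, i - 2):i + 1])


words = ["defendant", "Jiangsu", "Eurasian", "Film", "Co.Ltd"]
svm_left = [False, False, True, False, False]    # hypothetical SVM outputs
svm_right = [False, False, False, False, True]
print(is_left_boundary(1, words, {"Jiangsu"}, svm_left))
print(is_right_boundary(4, words, {"Co.Ltd"}, svm_right))
```
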

Step 2-5, judge whether all words have been traversed; if so, go to step 2-7, otherwise go to step 2-6.

Step 2-6, counter i is incremented by 1 and the next word in the text is taken. In effect, the steps above judge for each word whether it is a left or right boundary word.

Step 2-7, the training set data are fed into the open-source conditional random field tool CRF++ to train the business entity recognition model, and the recognition model is output. The features selected for the training data are the word itself, the part-of-speech tag, the left/right boundary label and the entity label.
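CRF++ reads training data as one token per line with whitespace-separated columns, the last column being the label, and a blank line between sentences. The sketch below writes the features of step 2-7 in that layout. Splitting the boundary label into two columns, the marker values `L`, `R` and `N`, and the part-of-speech tags are assumptions made for illustration.

```python
def to_crfpp_rows(sentence):
    """Format one sentence as CRF++ training rows.

    Each token is (word, pos, is_left, is_right, bio_tag). Words must not
    contain whitespace, because CRF++ splits columns on it.
    """
    lines = []
    for word, pos, is_left, is_right, tag in sentence:
        left_mark = "L" if is_left else "N"     # assumed marker values
        right_mark = "R" if is_right else "N"
        lines.append("\t".join([word, pos, left_mark, right_mark, tag]))
    return "\n".join(lines) + "\n\n"            # blank line ends the sentence


sentence = [
    ("defendant", "n", False, False, "O"),
    ("Jiangsu", "ns", True, False, "B"),
    ("Eurasian", "nz", False, False, "I"),
    ("Film", "n", False, False, "I"),
    ("Co.Ltd", "nis", False, True, "I"),
]
rows = to_crfpp_rows(sentence)
print(rows, end="")
```
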

Step 2-8, end of business entity recognition model training.

It can be seen that the invention introduces entity boundary features into the existing conditional random field model: before the conditional random field model is applied, each word is judged as to whether it is a left or right boundary word, and this result is then used as a feature of the conditional random field model. Introducing entity boundary features strengthens the model's control over entity boundaries, which shows in a markedly higher recognition recall.

Fig. 4 shows the flow chart of constructing semantic vectors from the text data of the original training set.

Step 3-0, start of semantic vector construction for the training set text.

Step 3-1, the training set text that has been split into sentences, segmented, part-of-speech tagged and category-labeled is input.

Step 3-2, the word vectors of all words in the training set are calculated with the open-source word2vec tool. Note that word2vec is the word-vector tool open-sourced by Google; many such tools exist today, of which word2vec is among the best known, and alternatives such as the Java word2vec4j can also be used.

Step 3-3, the inverse document frequency (IDF) of all words in the training set is calculated by the following formula:

$$\mathrm{idf}(w) = \log\frac{N}{1 + |\{d : w \in d\}|}$$

Here, inside the logarithm, the numerator N is the total number of documents in the corpus, and the denominator is the number of documents containing the word w plus 1; the ratio of the two is taken.
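The IDF calculation described in step 3-3 can be sketched in Python as below. The use of the natural logarithm is an assumption, since the text does not state the base, and the function name is hypothetical.

```python
import math


def inverse_document_frequency(documents):
    """IDF per word: log(total documents / (documents containing the word + 1)).

    documents is a list of documents, each a list of segmented words.
    """
    total = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):           # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(total / (1 + count))
            for word, count in doc_freq.items()}


docs = [["court", "ruling"], ["court"], ["contract"], ["appeal"]]
idf = inverse_document_frequency(docs)
print(idf)
```

With the added 1 in the denominator, a word that occurs in every document receives a slightly negative value.
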

Step 3-4, starting from the first text, each text in the training set is taken out in turn.

Step 3-5, the business entity recognition model is used to judge whether the extracted text contains a business entity; if so, go to step 3-6, otherwise go to step 3-10.

Step 3-6, once step 3-5 has determined that the text contains a business entity, the semantic vector of the entity part is calculated. Let the vector of an entity be v_m and the vectors of the words composing it be w₁, w₂, ..., w_n; v_m is then computed by the following formula:

Step 3-7, after the semantic vector of the entity has been computed in step 3-6, the semantic vector of the entity's context is calculated as follows:

Here v(context) is the vector representation of the context, tfidf(w_i) is the TF-IDF value of word w_i, v(w_i) is the word vector of w_i, and k is the word window size (the k context words closest to the central entity are taken). The TF value of a word is the frequency with which it occurs in the text, and the TF-IDF value of a word is the product of its TF and IDF values.

Step 3-8, the entity and context semantic vectors obtained in steps 3-6 and 3-7 are concatenated. Specifically, the k-dimensional entity vector and the k-dimensional context vector are joined, entity vector first and context vector second, to give a 2k-dimensional vector.
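Steps 3-6 to 3-8 can be sketched in Python as follows. The formulas for v_m and v(context) are not reproduced in this text, so the sketch assumes that the entity vector is the plain average of its word vectors and that the context vector is the TF-IDF-weighted average of the word vectors in the window. Both choices are assumptions, and all function names are hypothetical.

```python
def entity_vector(word_vectors):
    """Assumed form of v_m: element-wise average of the entity's word vectors."""
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]


def context_vector(context_words, vectors, tfidf):
    """Assumed form of v(context): TF-IDF-weighted average over the window."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in context_words:
        for j in range(dim):
            total[j] += tfidf[word] * vectors[word][j]
    return [value / len(context_words) for value in total]


def semantic_vector(entity_word_vectors, context_words, vectors, tfidf):
    """Concatenate entity vector first, context vector second."""
    return (entity_vector(entity_word_vectors)
            + context_vector(context_words, vectors, tfidf))


# Toy two-dimensional word vectors and TF-IDF values, for illustration only.
vectors = {"a": [1.0, 1.0], "b": [0.0, 2.0]}
tfidf = {"a": 0.5, "b": 1.0}
result = semantic_vector([[1.0, 0.0], [3.0, 2.0]], ["a", "b"], vectors, tfidf)
print(result)
```
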

Step 3-9, judge whether all sentences in the training set text have been traversed; if so, go to step 3-11, otherwise go to step 3-10.

Step 3-10, counter i is incremented by 1 and the next sentence in the training set text is taken.

Step 3-11, the resulting entity vectors, which incorporate context semantics, are output as the training data of the business entity classification model. Note that the data annotated in step 1-3 are plain-text-plus-label items; the preceding steps convert the text into vectors, so the data here are vector-plus-label items.

Step 3-12, end of semantic vector construction for the training set text.

As shown in Fig. 5, after semantic vectors have been constructed for the original corpus (the data set obtained after step 1-3), the business entity classification model is trained with the softmax multi-class classification algorithm. Softmax multi-class classification is a commonly used method; compared with other methods it is fast to compute and uses little memory, and it yields the probability of a test sample for each category.

Step 4-0, start of business entity classification model training.

Step 4-1, the semantically vectorized, category-labeled training set data are input to the softmax classification model as the training input.

Step 4-2, the multi-class model is trained with the softmax algorithm, and the trained softmax multi-class model is output for use in subsequent category prediction.
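A minimal softmax (multinomial logistic regression) classifier in plain Python illustrates steps 4-1 and 4-2. The text does not describe the training procedure, so stochastic gradient descent on the cross-entropy loss, the learning rate, the epoch count and the toy data are all assumptions.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)                          # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """Probability of each class for feature vector x."""
    scores = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def train_softmax(samples, labels, n_classes, lr=0.5, epochs=200):
    """Fit weights by stochastic gradient descent on cross-entropy loss."""
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(samples, labels):
            probs = predict_proba(weights, biases, x)
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == label else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
                biases[c] -= lr * grad
    return weights, biases


# Toy semantic vectors with two hypothetical industry classes (0 and 1).
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [0, 0, 1, 1]
weights, biases = train_softmax(X, y, n_classes=2)
probs = predict_proba(weights, biases, [1.0, 0.0])
print(probs)
```

The returned list gives one probability per category, which matches the property of softmax noted in the description.
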

Step 4-3, end of business entity classification model training.

Fig. 6 shows the flow chart of classification with the business entity classification model once it has been obtained.

Step 5-0, start of business entity classification.

Step 5-1, the text whose entity categories are to be predicted is input to the business entity classification model.

Step 5-2, the business entity recognition model is used to judge whether the input text contains a business entity; if so, go to step 5-3, otherwise go to step 5-5.

Step 5-3, steps 3-1 to 3-12 are applied to the text containing a business entity to build the entity semantic vector, and the resulting vector is then fed into the trained business entity classification model to obtain the classification result for the entity in the text.

Step 5-4, the classification result of step 5-3 is output.

Step 5-5, end of business entity classification.

In summary, the invention uses word-vector techniques and the TF-IDF values of document words to obtain a business entity vector representation that incorporates context semantics and then classifies it. This solves the problem that current business entity classification techniques offer few types that lack semantics, and gives business entity types finer granularity and stronger semantic features.

A person of ordinary skill in the technical field of the invention may make various modifications and variations without departing from its spirit and scope. The scope of protection of the invention is therefore defined by the claims.

Claims

1. A business entity classification method for plain text, characterized by comprising the following steps:

S1, the business entities in the collected plain text data are labeled, and the labeled data serve as the training set of the business entity recognition module; the business entities in the collected plain text data are also labeled with categories according to industry type, and the labeled data serve as the training sample set of the business entity classification module;

S2, a business entity recognition model is trained with a conditional random field model, yielding the business entity recognition model;

S3, semantic vectors are constructed from the text data of the original training set;

S4, the semantically vectorized, category-labeled training set data are used as the training input to train the business entity classification model;

S5, the business entities in the text to be predicted are classified with the business entity classification model.

2. The business entity classification method of claim 1, characterized in that, in S1, the collected plain text data are split into sentences, segmented into words and part-of-speech tagged, and the business entities and industry categories in the plain text data are annotated manually.

3. The business entity classification method of claim 2, characterized in that the open-source word segmentation and part-of-speech tagging software HanLP is used to split the plain text data into sentences, segment it into words and tag parts of speech.

4. The business entity classification method of claim 2, characterized in that business entities in the plain text data are annotated in the "BIO" scheme, in which the first word of a business entity is labeled "B", the remaining words of the business entity are labeled "I", and words unrelated to any business entity are labeled "O".

5. The business entity classification method of claim 2, characterized in that, in the manual annotation, each business entity in the plain text data is assigned a category according to its industry type, judged from its context.

6. The business entity classification method of claim 1, characterized in that, in S2, the business entity recognition model is trained with a conditional random field model into which boundary features are introduced.

7. The business entity classification method of claim 6, characterized in that introducing boundary features into the conditional random field model comprises: segmenting enterprise names with HanLP and collating the results into left and right boundary dictionaries; training left and right boundary prediction models with the open-source libSVM; taking words from the training set one at a time and using the boundary prediction models to judge whether each word is a left or right boundary word; and feeding the training set data, comprising the word itself, the part-of-speech tag, the left/right boundary label and the entity label, into an open-source conditional random field tool to train and obtain the business entity recognition model.

8. The business entity classification method of claim 1, characterized in that, in S3, a word-vector tool is used to obtain the word vectors of all words in the training sample set; the inverse document frequency (IDF) of all words in the training sample set is calculated; the word vectors and TF-IDF values are used to compute the business entity vector and the context vector for each sentence containing a business entity; and the business entity vector and the context vector are concatenated to obtain a business entity semantic vector that incorporates context semantics.

9. The business entity classification method of claim 8, characterized in that the word vectors of all words in the training set are calculated with the open-source word2vec tool.

10. The business entity classification method of claim 1, characterized in that, in S4, the business entity classification model is trained on the category-labeled training set data with a softmax model.