CN107193959B - Pure text-oriented enterprise entity classification method - Google Patents

Pure text-oriented enterprise entity classification method

Info

Publication number
CN107193959B
CN107193959B (application CN201710371464.7A)
Authority
CN
China
Prior art keywords
entity
enterprise
word
text
training
Prior art date
Legal status
Active
Application number
CN201710371464.7A
Other languages
Chinese (zh)
Other versions
CN107193959A (en)
Inventor
张雷
陈嘉伟
谢璐遥
王崇骏
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Application filed by Nanjing University
Priority to CN201710371464.7A
Publication of CN107193959A
Application granted
Publication of CN107193959B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Abstract

The invention discloses a pure text-oriented enterprise entity classification method, which comprises the following steps: s1, carrying out category marking on the enterprise entities in the collected plain text data to be used as a training set of an enterprise entity identification module; carrying out category marking on the enterprise entities in the acquired plain text data according to the industry properties to be used as a training sample set of an enterprise entity classification module; s2, carrying out enterprise entity recognition model training through the conditional random field model, and obtaining an enterprise entity recognition model; s3, semantic vectorization construction is carried out on the text data of the original training set; s4, training an enterprise entity classification model by using the training set data which is subjected to semantic vectorization and has class labels as training parameters; and S5, classifying the business entities in the text to be predicted by utilizing the business entity classification model. The method uses the obtained semantic vector as the characteristic of the entity, reduces the dependence on artificial characteristics and external data, and ensures the universality and the robustness.

Description

Pure text-oriented enterprise entity classification method
Technical Field
The invention belongs to the technical field of named entity recognition and fine-grained entity classification, and particularly relates to a plain text-oriented enterprise entity classification method.
Background
In recent years, with the trend of "internet finance", more and more enterprise decision makers urgently need to extract and analyze massive internet data by using a more advanced information processing mode so as to make better decisions. Among the mass data, the plain text data such as court documents and news public opinions becomes the primary source for enterprises to acquire high-value information.
Named entity recognition technology is the basis on which enterprises carry out entity semantic analysis, entity relation extraction and similar work. At present, mainstream named entity recognition technology divides entities only into person names, place names, organization names and the like, so the entity types lack semantics. Meanwhile, entity classification relies too heavily on hand-crafted features and external data, so its universality and robustness cannot be guaranteed.
Disclosure of Invention
The current mainstream named entity recognition technology divides entities only into person names, place names, organization names and the like, so the entity types lack semantics. In addition, entity classification relies too heavily on hand-crafted features and external data, so universality and robustness are not guaranteed. To solve these problems, the invention provides a plain-text-oriented enterprise entity classification method, which adopts a finer-grained division of enterprise entities and classifies them using features constructed from the semantics of the text. Here, plain text means text containing information about enterprise activities, such as news text, court documents, etc.
As shown in fig. 1, the method for classifying business entities oriented to plain text disclosed by the present invention includes the following steps:
s1, carrying out category marking on the enterprise entities in the collected plain text data, and taking the marked data as a training set of an enterprise entity identification module; carrying out category marking on enterprise entities in the collected plain text data according to the industry properties, and using the marked data as a training sample set of an enterprise entity classification module;
s2, carrying out enterprise entity recognition model training through the conditional random field model, and obtaining an enterprise entity recognition model;
s3, semantic vectorization construction is carried out on the text data of the original training set;
s4, training an enterprise entity classification model by using the training set data which is subjected to semantic vectorization and has class labels as training parameters;
and S5, classifying the business entities in the text to be predicted by utilizing the business entity classification model.
Further, in S1, the collected plain text data is processed with sentence segmentation, word segmentation and part-of-speech tagging, and the enterprise entities and their industry categories in the plain text data are annotated by manual labeling.
Furthermore, sentence segmentation, word segmentation and part-of-speech tagging are carried out on the plain text data by using open-source word segmentation and part-of-speech tagging software HanLP.
Furthermore, the business entity in the plain text data is labeled in a "BIO" labeling mode, wherein the initial word of the business entity is labeled as "B", other words of the business entity, which are not the initial word, are labeled as "I", and words which are irrelevant to the business entity are labeled as "O".
Further, in the manual labeling method, the enterprise entities in the plain text data are labeled with industry categories according to their context.
Further, in S2, training the business entity recognition model is performed by introducing a conditional random field model of the boundary features.
Further, the conditional random field model for introducing boundary features comprises: dividing words of enterprise names by HanLP, and then sorting to obtain a left boundary dictionary and a right boundary dictionary; training by using an open-source libSVM to obtain a prediction model of a left boundary and a right boundary; sequentially taking out words from the training set and judging whether the words are left and right boundary words or not through the prediction models of the left and right boundaries; and inputting training set data comprising words, part of speech labels, left and right boundary labels and entity labels into an open-source conditional random field tool to train the enterprise entity recognition model and obtain the recognition model of the enterprise entity.
Further, in S3, a word vector calculation tool is used to obtain the word vectors of all words in the training sample set, the inverse text frequency (IDF) values of all words in the training sample set are calculated, the word vectors and TF-IDF values are used to compute, for each sentence containing an enterprise entity, the vector of the enterprise entity and the vector of its context, and the two vectors are concatenated to obtain an enterprise entity semantic vector that incorporates context semantics.
Further, word vectors for all words in the training set are calculated using the open source word2vec tool.
Further, in S4, a classification model of the business entity is trained on the training set data with class labels using the softmax model.
The invention has the following beneficial effects:
1) By first determining the left and right boundaries of entities with dictionary rules and SVM classifiers and then feeding the boundary decisions into the conditional random field model as new features, the improved method achieves substantial gains in recall and F1 value.
2) Entities and their contexts are represented as semantic vectors by weighting word embeddings, so that the semantic similarity between entities can be measured by the distance between their vectors. Using the resulting semantic vectors as entity features reduces the dependence on hand-crafted features and external data.
3) Introducing entity boundary features into the existing conditional random field model strengthens its control over entity boundaries, reflected for example in a clear improvement in recognition recall, and ensures universality and robustness.
Drawings
Fig. 1 is a flowchart of a plain text-oriented business entity classification method disclosed in the present invention.
Fig. 2 is a training set construction flowchart in the embodiment.
FIG. 3 is a flowchart of an embodiment of training an improved conditional random field-based enterprise entity recognition model.
FIG. 4 is a flow chart of entity semantic vector construction based on word vectors and TF-IDF value weighting in an embodiment.
FIG. 5 is a flowchart of an enterprise entity classification model training process.
FIG. 6 is a flowchart of business entity classification.
Detailed Description
To better understand the technical content of the present invention, a specific embodiment of the method for classifying business entities for court documents is described below with reference to the accompanying drawings.
As shown in FIG. 2, the present invention constructs a training sample set prior to implementation. The process of constructing the training sample set in the embodiment is as follows:
and 1-0, establishing an initial state of a training set.
Step 1-1, collecting court documents from the Internet by using a web crawler tool to serve as an original corpus.
And step 1-2, performing sentence segmentation, word segmentation and part-of-speech tagging on the collected document text with the open-source segmentation and part-of-speech tagging software HanLP. Other common open-source segmentation tools, such as the Chinese Academy of Sciences segmenter (ICTCLAS), could also be used; HanLP is chosen in this embodiment because it segments somewhat better than other existing open-source tools and supports manually customized dictionaries, which is more convenient.
Step 1-3, because an enterprise entity word in the text (namely the name of an enterprise, mainly including full names and abbreviations) is split into several words after segmentation, the enterprise entities in the text are marked by manual annotation in the "BIO" scheme: the first word of an enterprise entity is labeled "B", the other words of the enterprise entity that are not the first word are labeled "I", and words unrelated to any enterprise entity are labeled "O", as in "informed (O) Jiangsu (B) Eurasian (I) film (I) Limited company (I)". The labeled data is used as the training set of the enterprise entity recognition model.
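For illustration only, a minimal Python sketch of the "BIO" scheme just described; the example sentence, its segmentation and the entity span are hypothetical and not taken from the patent:

```python
# Minimal sketch of "BIO" tagging for one segmented sentence (hypothetical example).
# "B" = first word of an enterprise entity, "I" = other word of the entity,
# "O" = word unrelated to any enterprise entity.

tokens = ["被告", "江苏", "欧亚", "影视", "有限公司", "拖欠", "货款"]   # hypothetical segmentation
entity_span = (1, 5)   # tokens[1:5] form the enterprise name (hypothetical span)

tags = []
for i, _ in enumerate(tokens):
    if i == entity_span[0]:
        tags.append("B")
    elif entity_span[0] < i < entity_span[1]:
        tags.append("I")
    else:
        tags.append("O")

for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")   # 被告 O, 江苏 B, 欧亚 I, 影视 I, 有限公司 I, 拖欠 O, 货款 O
```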
Meanwhile, the enterprise entities in the collected document text are labeled with industry categories according to their context. The labeled data serves as the training set of the enterprise entity classification model; each labeled item comprises a sentence containing an enterprise name and the category label of the industry to which the enterprise belongs, and the whole training set is a collection of such sentence/label pairs. The category labels can follow the accurate and authoritative divisions of the national economic industry classification standard (GB/T 4754-2011).
And 1-4, finishing establishing the training set.
As shown in FIG. 3, after the training set is constructed, the improved conditional random field method is applied, i.e., the enterprise entity recognition model is trained with a conditional random field model into which boundary features are introduced.
And 2-0, starting the training of the enterprise entity recognition model.
And 2-1, inputting training set data (namely the labeling completion data in the step 1-3) after sentence segmentation, word segmentation, part of speech labeling and entity labeling.
And 2-2, crawling enterprise directories from the Internet and segmenting the enterprise names with HanLP to obtain a left boundary dictionary and a right boundary dictionary. The left boundary word is the first word of a segmented enterprise name, and the right boundary word is the last word; all left and right boundary words are collected into the two dictionaries.
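A rough sketch of this dictionary construction is given below; the enterprise-directory file name, its one-name-per-line format and the use of the pyhanlp wrapper around HanLP are assumptions made for illustration:

```python
# Sketch of step 2-2: build left/right boundary dictionaries from a crawled
# list of enterprise names.  The file name "enterprise_names.txt" (one name per
# line) and the pyhanlp wrapper are assumptions made for illustration.
from pyhanlp import HanLP

left_boundary, right_boundary = set(), set()

with open("enterprise_names.txt", encoding="utf-8") as f:
    for name in (line.strip() for line in f):
        if not name:
            continue
        words = [term.word for term in HanLP.segment(name)]
        if words:
            left_boundary.add(words[0])     # first word of the segmented name
            right_boundary.add(words[-1])   # last word of the segmented name
```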
And 2-3, training prediction models for the left and right boundaries with the open-source libSVM. The features for the left boundary prediction model are the word itself and the part of speech of the current word and of the two words following it; the features for the right boundary prediction model are the word itself and the part of speech of the current word and of the two words preceding it. The open-source libSVM is used because of its robustness and good classification boundaries.
And 2-4, sequentially taking out words from the training set and judging whether the words are left and right boundary words through the prediction models of the left and right boundaries.
The method for judging whether the current word is a left boundary word is: if the word appears in the left boundary dictionary and the SVM method also judges a word within the two-word window to its right to be a left boundary word, the word is accepted as a left boundary word; otherwise it is discarded. Each word in fact receives a judgment from both the dictionary method and the SVM method; since both methods have shortcomings, this step of the embodiment combines the two results and keeps the more reasonable one.
The method for judging whether the current word is a right boundary word is analogous: if the word appears in the right boundary dictionary and the SVM method also judges a word within the two-word window to its left to be a right boundary word, the word is accepted as a right boundary word; otherwise it is discarded.
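A minimal sketch of how the two judgments of steps 2-3 and 2-4 could be combined; the svm_left_predict/svm_right_predict helpers stand for the libSVM models of step 2-3, and the exact treatment of the two-word window is an assumption:

```python
# Sketch of the combined dictionary + SVM decision of step 2-4.
# svm_left_predict(words, j) / svm_right_predict(words, j) stand for the
# libSVM boundary models of step 2-3 and return True/False for position j.

def is_left_boundary(i, words, left_boundary, svm_left_predict):
    if words[i] not in left_boundary:                 # dictionary check
        return False
    window = range(i, min(i + 3, len(words)))         # current word plus the two words to its right (assumed)
    return any(svm_left_predict(words, j) for j in window)

def is_right_boundary(i, words, right_boundary, svm_right_predict):
    if words[i] not in right_boundary:
        return False
    window = range(max(i - 2, 0), i + 1)              # the two words to its left plus the current word (assumed)
    return any(svm_right_predict(words, j) for j in window)
```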
And 2-5, judging whether all the words are traversed or not, if so, going to step 2-7, and otherwise, going to step 2-6.
And 2-6, adding 1 to the counter i, and taking out the next word in the text. The above steps are actually to determine whether a word is a left or right boundary word.
And 2-7, inputting the training set data into the open-source conditional random field tool CRF++ to train the enterprise entity recognition model, and outputting the trained recognition model. The features selected from the training data are the word itself, its part-of-speech tag, its left and right boundary tags, and its entity tag.
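To make the CRF++ input concrete, the sketch below writes one row per word in the column layout named above (word, part of speech, left boundary flag, right boundary flag, entity tag); the column order, file names and the crf_learn/crf_test commands reflect common CRF++ usage rather than anything mandated by the patent:

```python
# Sketch of step 2-7: write the training data in CRF++ column format, one row
# per word: word, part of speech, left-boundary flag, right-boundary flag, BIO tag.
# Column order and file names are illustrative; sentences are separated by blank lines.

def write_crfpp_file(sentences, path="train.txt"):
    """sentences: list of sentences, each a list of (word, pos, lb, rb, tag) tuples."""
    with open(path, "w", encoding="utf-8") as out:
        for sent in sentences:
            for word, pos, lb, rb, tag in sent:
                out.write(f"{word}\t{pos}\t{lb}\t{rb}\t{tag}\n")
            out.write("\n")

# With a feature template file, training and tagging would then be run as, e.g.:
#   crf_learn template.txt train.txt entity_model
#   crf_test -m entity_model test.txt
```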
And 2-8, finishing the training of the enterprise entity recognition model.
Thus, the method introduces entity boundary features into the existing conditional random field model: before the conditional random field model is applied, each word is judged to be, or not to be, a left or right boundary word, and the judgment results are used as additional features when the conditional random field model is then applied. Introducing these entity boundary features strengthens the model's control over entity boundaries, which is reflected in a clear improvement in recognition recall.
Fig. 4 shows the flow of semantic vectorization construction for the text data of the original training set.
And 3-0, starting the construction of the text semantic vector of the training set.
And 3-1, inputting a training set text set with completed sentences, participles, part of speech labels and category labels.
And 3-2, calculating word vectors of all words in the training set with the open-source word2vec tool. word2vec is an open-source tool from Google for computing word vectors; it is widely known, many such tools now exist, and there are alternatives such as the Java implementation word2vec4j.
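As an illustration of step 3-2, word vectors could equally be trained with the gensim implementation of word2vec; the toy corpus and the hyper-parameter values below are placeholders, not values specified by the patent:

```python
# Sketch of step 3-2: train word vectors on the segmented training-set sentences.
# gensim's Word2Vec is used as a stand-in for the original word2vec tool;
# the toy corpus and hyper-parameters below are placeholders.
from gensim.models import Word2Vec

segmented_sentences = [
    ["原告", "江苏", "欧亚", "影视", "有限公司", "起诉", "被告"],   # placeholder sentences
    ["被告", "未", "按期", "支付", "货款"],
]

model = Word2Vec(sentences=segmented_sentences,
                 vector_size=100,   # word-vector dimension (assumed)
                 window=5,
                 min_count=1,
                 sg=1)              # skip-gram

vec = model.wv["江苏"]              # look up the vector of a word
```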
Step 3-3, calculating the inverse text frequency (IDF) values of all the words in the training set, wherein the calculation formula is as follows:
IDF(w) = log( N / (n_w + 1) )
In this formula, the numerator N inside the logarithm is the total number of documents in the corpus, and the denominator n_w + 1 is the number of documents containing the word w plus one; the IDF value is the logarithm of their ratio.
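A small sketch of this IDF computation, treating each document of the training corpus as a list of words (the function name and input format are illustrative assumptions):

```python
# Sketch of step 3-3: IDF(w) = log(N / (n_w + 1)), where N is the number of
# documents in the training corpus and n_w the number of documents containing w.
import math
from collections import Counter

def compute_idf(documents):
    """documents: list of documents, each given as a list of words."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))                     # count each word once per document
    return {w: math.log(n_docs / (df + 1)) for w, df in doc_freq.items()}
```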
And 3-4, sequentially taking out each sentence of text in the document from the first sentence of text in the training set.
And 3-5, judging whether the extracted text has the enterprise entity by using the enterprise entity identification model, if so, going to step 3-6, otherwise, going to step 3-10.
Step 3-6, after step 3-5 has determined that the text contains an enterprise entity, the semantic vector of the entity part is calculated. Suppose the vector of an entity is denoted v_m and the vectors of the words that form it are w_1, w_2, ..., w_n; then v_m is calculated as follows:
v_m = (1/n) * sum_{i=1}^{n} w_i
step 3-7, after calculating the semantic vector of the entity in step 3-6, calculating the semantic vector for the context part of the entity in the following way:
v(context) = sum_{i=1}^{k} [ tf·idf(w_i) · v(w_i) ] / sum_{i=1}^{k} tf·idf(w_i)
where v(context) is the vector representation of the context, tf·idf(w_i) is the TF-IDF value of the word w_i, v(w_i) is the word vector of w_i, and k is the word window size (i.e., the first k words of the context closest to the central entity are taken). The TF value of a word is the frequency with which the word appears in the text, and the TF-IDF value of a word is the product of its TF value and its IDF value.
And 3-8, concatenating the semantic vectors of the entity and of the context obtained in steps 3-6 and 3-7; specifically, the k-dimensional entity vector and the k-dimensional context vector are concatenated with the entity vector first and the context vector second, giving a 2k-dimensional vector.
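Putting steps 3-6 to 3-8 together under the weighted-average reading of the formulas above, a sketch of the vector construction; the helper inputs word_vec, tf and idf and the default window size are assumptions for illustration:

```python
# Sketch of steps 3-6 to 3-8: average the entity word vectors, take a
# TF-IDF-weighted average of the k context word vectors, and concatenate.
# word_vec maps a word to its vector, tf maps a word to its frequency in the
# current text, idf comes from step 3-3; all are assumed inputs.
import numpy as np

def entity_vector(entity_words, word_vec):
    return np.mean([word_vec[w] for w in entity_words], axis=0)          # step 3-6

def context_vector(context_words, word_vec, tf, idf, k=5):
    words = context_words[:k]                                            # k-word window
    weights = np.array([tf.get(w, 0.0) * idf.get(w, 0.0) for w in words])
    vecs = np.stack([word_vec[w] for w in words])
    if weights.sum() == 0:
        return vecs.mean(axis=0)                # fallback when all weights vanish (assumption)
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()         # step 3-7

def entity_semantic_vector(entity_words, context_words, word_vec, tf, idf, k=5):
    return np.concatenate([entity_vector(entity_words, word_vec),        # entity vector first
                           context_vector(context_words, word_vec, tf, idf, k)])  # context second (step 3-8)
```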
And 3-9, judging whether all sentences in the text of the training set are traversed or not, if so, going to step 3-11, and otherwise, going to step 3-10.
And 3-10, adding 1 to the counter i, and taking out the next sentence in the training set text.
And 3-11, outputting the obtained entity vectors, which fuse the context semantics, as the training data of the enterprise entity classification model. Note that the data labeled in step 1-3 consists of plain text plus category labels; the preceding steps convert that text into vectors, so the data here consists of vectors plus category labels.
And 3-12, finishing the semantic construction of the training set text.
As shown in FIG. 5, after semantic vectorization construction has been performed on the original corpus (i.e., the data set obtained after step 1-3), the enterprise entity classification model is trained with the softmax multi-classification algorithm. Softmax multi-classification is a common method; compared with other methods it is fast to compute, occupies little space, and yields the probability of a test sample on each class.
And 4-0, starting the training of the enterprise entity classification model.
And 4-1, inputting the training set data which is subjected to semantic vectorization and has class labels into a softmax classification model to serve as training parameters.
And 4-2, performing multi-classification model training by adopting a softmax algorithm, and outputting the trained softmax multi-classification model for subsequent classification prediction.
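As one possible realization of steps 4-1 and 4-2 (the patent only requires a softmax multi-class model; scikit-learn's logistic regression with the lbfgs solver, which fits a softmax loss for multi-class data, is used here as a stand-in, and the placeholder data is random):

```python
# Sketch of steps 4-1/4-2: train a softmax classifier on the concatenated
# entity semantic vectors and their industry labels.  Random placeholder data
# stands in for the real vectors of step 3-11 and the labels of step 1-3.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 200))     # placeholder 2k-dimensional vectors (k = 100 assumed)
y = rng.integers(0, 5, size=200)    # placeholder industry-category labels

clf = LogisticRegression(solver="lbfgs", max_iter=1000)   # multinomial (softmax) model for multi-class data
clf.fit(X, y)
print(clf.predict_proba(X[:1]))     # probability of one sample on each class
```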
And 4-3, finishing the training of the enterprise entity classification model.
As shown in FIG. 6, after the enterprise entity classification model is obtained, classification with the trained model proceeds as follows.
Step 5-0, start of business entity classification.
And 5-1, inputting texts of entity categories to be predicted into the enterprise entity classification model.
And 5-2, judging whether an enterprise entity exists in the input text by using the enterprise entity identification model, if so, turning to the step 5-3, otherwise, turning to the step 5-5.
And 5-3, constructing entity semantic vectors for the texts containing the enterprise entities by using the steps 3-1 to 3-12, and inputting the obtained vectors into the trained enterprise entity classification model to obtain the classification results of the entities in the texts.
And 5-4, outputting the classification result of step 5-3.
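A rough end-to-end sketch tying steps 5-1 to 5-4 together; recognize_entities, entity_semantic_vector and clf stand for the models produced in steps 2-7, 3-6 to 3-8 and 4-2, and their interfaces are assumptions made for illustration:

```python
# Sketch of steps 5-1 to 5-4: recognize enterprise entities in an input text,
# build their semantic vectors, and classify them with the trained model.
# recognize_entities, entity_semantic_vector and clf stand for the outputs of
# steps 2-7, 3-6..3-8 and 4-2; their interfaces are assumed for illustration.

def classify_text(text, recognize_entities, entity_semantic_vector, clf):
    results = []
    for entity_words, context_words in recognize_entities(text):     # step 5-2
        vec = entity_semantic_vector(entity_words, context_words)    # step 5-3: build the 2k-dimensional vector
        label = clf.predict([vec])[0]                                 # step 5-3: classify
        results.append(("".join(entity_words), label))
    return results                                                    # step 5-4: classification results
```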
And 5-5, finishing the enterprise entity classification.
In summary, the method provided by the invention, which uses word vector technology and the TF-IDF values of document words to build an enterprise entity vector representation that incorporates context semantics and then classifies on it, solves the problems of coarse categories and missing semantics in existing enterprise entity classification methods, so that the types of enterprise entities have finer granularity and stronger semantic features.
Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention should be determined by the appended claims.

Claims (7)

1. A plain text-oriented enterprise entity classification method is characterized by comprising the following steps:
s1, marking the enterprise entities in the collected plain text data, and taking the marked data as a training set of an enterprise entity identification module; carrying out category marking on enterprise entities in the collected plain text data according to the industry properties, and using the marked data as a training sample set of an enterprise entity classification module;
s2, carrying out enterprise entity recognition model training by introducing a conditional random field model of boundary characteristics, and obtaining an enterprise entity recognition model;
s3, performing semantic vectorization construction on the text data of the original training set, namely, using a word vector calculation tool to obtain word vectors of all words in a training sample set, calculating inverse text frequency IDF values of all words in the training sample set, calculating vectors and context vectors of enterprise entities in enterprise entity sentences by using the word vectors and TF-IDF values, and splicing the vectors and the context vectors of the enterprise entities to obtain enterprise entity semantic vectors containing context semantics;
s4, training an enterprise entity classification model by using the training set data which is subjected to semantic vectorization and has class labels as training parameters;
s5, classifying the enterprise entities in the text to be predicted by utilizing the enterprise entity classification model;
the step S2 specifically includes: dividing words of enterprise names by HanLP, and then sorting to obtain a left boundary dictionary and a right boundary dictionary; training by using an open-source libSVM to obtain a prediction model of a left boundary and a right boundary; sequentially taking out words from the training set and judging whether the words are left and right boundary words or not through the prediction models of the left and right boundaries; inputting training set data comprising words, part of speech labels, left and right boundary labels and entity labels into an open-source conditional random field tool to train an enterprise entity recognition model and obtain the recognition model of the enterprise entity;
the step S3 specifically includes:
step 3-1, inputting a training set which is already finished with sentences, participles, part of speech labels and category labels;
step 3-2, calculating word vectors of all words in the training set;
step 3-3, calculating the inverse text frequency IDF values of all the words in the training set, wherein the calculation formula is as follows:
IDF(w) = log( N / (n_w + 1) )
in the formula, N is the total number of documents in the corpus and n_w is the number of documents containing the word w;
3-4, sequentially taking out each sentence of text in the document from the first sentence of text in the training set;
3-5, judging whether the extracted text has the enterprise entity by using an enterprise entity identification model, if so, going to step 3-6, otherwise, going to step 3-10;
step 3-6, after step 3-5 determines that the text contains an enterprise entity, calculating the semantic vector of the entity part, wherein the semantic vector v_m of one entity is calculated as follows:
v_m = (1/n) * sum_{i=1}^{n} w_i
in the formula, w_i is the vector of the i-th word constituting the entity, i = 1, 2, ..., n;
step 3-7, calculating a semantic vector for the context part of the entity, wherein the calculation mode is as follows:
v(context) = sum_{i=1}^{k} [ tf·idf(w_i) · v(w_i) ] / sum_{i=1}^{k} tf·idf(w_i)
where v(context) is the semantic vector of the context, tf·idf(w_i) is the TF-IDF value of the word w_i, v(w_i) is the word vector of w_i, and k is the word window size; the TF value of a word is the frequency with which the word appears in the text, and the TF-IDF value of a word is the product of its TF value and its IDF value;
3-8, splicing the semantic vectors of the entities and the contexts obtained in the steps 3-6 and 3-7, specifically, splicing the k-dimensional entity vector and the k-dimensional context vector in a mode that the entity vector is in front and the context vector is in back to obtain a 2 k-dimensional vector;
3-9, judging whether all sentences in the text of the training set are traversed or not, if so, going to step 3-11, otherwise, going to step 3-10;
3-10, adding 1 to the counter i, and taking out the next sentence in the training set text;
3-11, outputting the obtained entity vector fused with the context semantics as training data of an enterprise entity classification model;
3-12, finishing the semantic construction of the training set text;
the step S5 specifically includes:
step 5-1, inputting a text of an entity category to be predicted to the enterprise entity classification model;
step 5-2, judging whether an enterprise entity exists in the input text by using the enterprise entity identification model, if so, turning to step 5-3, otherwise, turning to step 5-5;
step 5-3, entity semantic vector construction is carried out on the text containing the enterprise entities by utilizing the steps 3-1 to 3-12, and then the obtained vectors are input into a trained enterprise entity classification model to obtain the classification results of the entities in the text;
step 5-4, outputting the classification result of the step 5-3;
and 5-5, finishing the enterprise entity classification.
2. The business entity classification method of claim 1, wherein in S1, the collected plain text data is labeled by sentence segmentation, word segmentation and part of speech, and the business entities and industry categories in the plain text data are labeled by manual labeling.
3. The business entity classification method of claim 2, wherein the plain text data is sentence-segmented, word-segmented and part-of-speech tagged using open-source word segmentation and part-of-speech tagging software HanLP.
4. The business entity classification method of claim 2, wherein the business entities in the plain text data are labeled in a "BIO" label format, wherein the initial word of the business entity is labeled as "B", other partial words of the business entity that are not the initial word are labeled as "I", and words that are not related to the business entity are labeled as "O".
5. The method of classifying business entities according to claim 2, wherein, in the manual labeling method, the business entities in the plain text data are labeled with industry categories according to their context.
6. The business entity classification method of claim 1, wherein word vectors for all words in the training set are computed using an open-source word2vec tool.
7. The business entity classification method of claim 1, wherein in S4, a classification model of the business entity is trained using softmax model on the training set data with class labels already.
CN201710371464.7A 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method Active CN107193959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710371464.7A CN107193959B (en) 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710371464.7A CN107193959B (en) 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method

Publications (2)

Publication Number Publication Date
CN107193959A CN107193959A (en) 2017-09-22
CN107193959B 2020-11-27

Family

ID=59874712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710371464.7A Active CN107193959B (en) 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method

Country Status (1)

Country Link
CN (1) CN107193959B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423264A (en) * 2017-07-10 2017-12-01 广东华联建设投资管理股份有限公司 A kind of engineering material borrowing-word extracting method
CN107894986B (en) * 2017-09-26 2021-03-30 北京纳人网络科技有限公司 Enterprise relation division method based on vectorization, server and client
CN108255813B (en) * 2018-01-23 2021-11-16 重庆邮电大学 Text matching method based on word frequency-inverse document and CRF
CN108460014B (en) * 2018-02-07 2022-02-25 百度在线网络技术(北京)有限公司 Enterprise entity identification method and device, computer equipment and storage medium
CN108733778B (en) * 2018-05-04 2022-05-17 百度在线网络技术(北京)有限公司 Industry type identification method and device of object
CN108763201B (en) * 2018-05-17 2021-07-23 南京大学 Method for identifying text named entities in open domain based on semi-supervised learning
CN108763402B (en) * 2018-05-22 2021-08-27 广西师范大学 Class-centered vector text classification method based on dependency relationship, part of speech and semantic dictionary
CN109408827A (en) * 2018-11-07 2019-03-01 南京理工大学 A kind of software entity recognition methods based on machine learning
CN111209392B (en) * 2018-11-20 2023-06-20 百度在线网络技术(北京)有限公司 Method, device and equipment for excavating polluted enterprises
CN110083704B (en) * 2019-05-06 2020-06-09 重庆天蓬网络有限公司 Method, storage medium and device for processing company information based on main business
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method
CN110472062B (en) * 2019-07-11 2020-11-10 新华三大数据技术有限公司 Method and device for identifying named entity
CN110502638B (en) * 2019-08-30 2023-05-16 重庆誉存大数据科技有限公司 Enterprise news risk classification method based on target entity
CN110990587B (en) * 2019-12-04 2023-04-18 电子科技大学 Enterprise relation discovery method and system based on topic model
CN111539209B (en) * 2020-04-15 2023-09-15 北京百度网讯科技有限公司 Method and apparatus for entity classification
CN113743117B (en) * 2020-05-29 2024-04-09 华为技术有限公司 Method and device for entity labeling
CN111881685A (en) * 2020-07-20 2020-11-03 南京中孚信息技术有限公司 Small-granularity strategy mixed model-based Chinese named entity identification method and system
CN112418681B (en) * 2020-11-26 2021-08-03 北京上奇数字科技有限公司 Method and apparatus for analyzing industrial development, electronic device, and storage medium
CN113065343B (en) * 2021-03-25 2022-06-10 天津大学 Enterprise research and development resource information modeling method based on semantics
CN113408273B (en) * 2021-06-30 2022-08-23 北京百度网讯科技有限公司 Training method and device of text entity recognition model and text entity recognition method and device
CN114036933B (en) * 2022-01-10 2022-04-22 湖南工商大学 Information extraction method based on legal documents
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787461A (en) * 2016-03-15 2016-07-20 浙江大学 Text-classification-and-condition-random-field-based adverse reaction entity identification method in traditional Chinese medicine literature
CN106503035A (en) * 2016-09-14 2017-03-15 海信集团有限公司 A kind of data processing method of knowledge mapping and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9792549B2 (en) * 2014-11-21 2017-10-17 International Business Machines Corporation Extraction of semantic relations using distributional relation detection
CN104965992B (en) * 2015-07-13 2018-01-09 南开大学 A kind of text mining method based on online medical question and answer information
CN105630768B (en) * 2015-12-23 2018-10-12 北京理工大学 A kind of product name recognition method and device based on stacking condition random field
CN106570179B (en) * 2016-11-10 2019-11-19 中国科学院信息工程研究所 A kind of kernel entity recognition methods and device towards evaluation property text

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787461A (en) * 2016-03-15 2016-07-20 浙江大学 Text-classification-and-condition-random-field-based adverse reaction entity identification method in traditional Chinese medicine literature
CN106503035A (en) * 2016-09-14 2017-03-15 海信集团有限公司 A kind of data processing method of knowledge mapping and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Signature automation of UMLS concepts: An un-supervised named entity recognition framework for classification of DNA and RNA in biological text";Muhammad Ashraf Khan Niazi等;《 2015 Science and Information Conference (SAI)》;20150903;全文 *
"基于树核函数的中文实体语义关系抽取方法的研究";庄成龙;《中国优秀硕士学位论文全文数据库 信息科技辑》;20091015;全文 *

Also Published As

Publication number Publication date
CN107193959A (en) 2017-09-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant