CN112231449A

CN112231449A - Vertical-domain entity linking system based on multi-path recall

Info

Publication number: CN112231449A
Application number: CN202011431197.6A
Authority: CN
Inventors: 刘广峰; 鲁思帆
Original assignee: Hangzhou Zhidu Technology Co ltd
Current assignee: Hangzhou Zhidu Technology Co ltd
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2021-01-15

Abstract

The invention relates to a vertical-domain entity linking system based on multi-path recall, comprising: a text processing module, which segments and filters the input text obtained from a user and splices it into a valid text; a similarity matching module, which calculates the similarity between the valid text and entities and ranks and filters the entities by similarity; a dictionary matching module, which searches the valid text input by the user to obtain candidate entities; an entity recognition module, which uses a recognition model to perform named entity recognition on the valid text and generate candidate entities; and a text binary classification model, in which a binary classification model is built, whether an entity matches the text is judged based on the binary classification model, and rules are formulated for post-verification. The system makes full use of the word information in the user's input text and addresses the low accuracy and low recall of entity linking caused by the highly colloquial annotated named-entity-recognition corpora for Chinese consultation text.

Description

Vertical-domain entity linking system based on multi-path recall

Technical Field

The invention belongs to the field of artificial intelligence, and in particular relates to a vertical-domain entity linking system based on multi-path recall.

Background

With the rapid development and spread of the internet, more and more people consult professional advisers online in Chinese through question-and-answer services. This makes communication between advisers and users convenient and efficient, but because professionals in many fields in China are in relatively short supply, many users' online consultation questions cannot be answered professionally and promptly. Meanwhile, as artificial intelligence is applied to text processing, more and more institutions are building knowledge bases for their vertical domains. Automatically obtaining the intent of a user's query from Chinese online consultation text by means of natural language processing, and providing professional answers by combining that intent with an existing vertical-domain knowledge base, is therefore of significant practical value.

Information extraction is a key technology for obtaining user intent from Chinese online consultation text. It comprises entity extraction and relation extraction, and entity extraction in turn comprises entity recognition and entity linking. Named entity recognition is a prerequisite for information extraction, and its quality directly affects subsequent steps such as relation extraction. Named entity recognition refers to effectively recognizing and extracting entities of specified types in a text, such as person names, times, place names and organization names; in Chinese online consultation text it mainly means recognizing the named entities that are the subject of the user's consultation. Entity linking takes the set of entity mentions produced by entity recognition, judges whether two or more mentions refer to the same entity, and aggregates named entities that share the same referent.

When little annotated consultation text is available for a vertical domain, the deep neural networks commonly used for named entity recognition and entity linking on online consultation text tend to overfit and therefore fail to achieve good results. In addition, user questions in Chinese online consultation corpora are colloquial, and an entity is often split apart by invalid words such as conjunctions, so that it appears discontinuously in the text. Existing deep learning models recognize such discontinuous entities poorly, whereas word-based retrieval algorithms and word-based similarity matching algorithms recognize them better. Moreover, because the vertical-domain knowledge bases built by different institutions differ, entities cannot be aligned accurately when entity recognition is performed against different knowledge graphs.

Disclosure of Invention

To solve the above problems, the invention provides a vertical-domain entity linking system based on multi-path recall. The system makes full use of the word information in the user's input text and addresses the low accuracy and low recall of entity linking caused by the highly colloquial annotated named-entity-recognition corpora for Chinese consultation text, thereby improving the quality of AI-based consultation.

The technical scheme of the invention is as follows:

A vertical-domain entity linking system based on multi-path recall, comprising:

a text processing module, used to segment and filter the input text obtained from a user and splice it into a valid text;

a similarity matching module, used to calculate the similarity between the valid text and entities, rank and filter the entities by similarity, and put them into a first candidate entity set;

a dictionary matching module, used to search the valid text input by the user, obtain candidate entities, and put them into a second candidate entity set;

an entity recognition module, which organizes the existing corpus into a training set and a test set for a CRF model, builds and trains the model, uses the model to perform named entity recognition on the valid text, and puts the resulting entities into a third candidate entity set;

a text binary classification model: a binary classification model is built from an existing data set in the legal domain, whether an entity matches the text is judged based on the binary classification model, and rules are formulated for post-verification.

Preferably, the text processing module is implemented as follows. The user's input text is preprocessed: based on an existing legal-domain dictionary, jieba is used to segment the user's input text, with jieba loading a custom legal-domain lexicon, to produce a segmentation result set; based on this result set and an existing legal-domain invalid-word dictionary, the segmentation results are filtered and then spliced, in their original order, into the valid text.

Preferably, the similarity matching module is implemented by the following steps:

S2.1: Pre-training a BERT similarity model: existing user consultation texts are taken as the corpus to obtain an Embedding dictionary; entity mentions in the corpus to be matched are extracted by LCS string matching and taken as the correct corresponding entities; the Embedding dictionary is used as the Embedding of the similarity model, the extracted entities and the original corpus are used as the input of a BERT model, and the BERT similarity model is obtained by training;

S2.2: Similarity calculation: based on the valid text, a word-based algorithm is used to find entities matching the input text and form a candidate entity set, the description of each entity is obtained from the knowledge graph, and the BERT similarity model trained in step S2.1 is used to calculate the similarity scores between the valid text and both the entities of the candidate entity set and their descriptions;

S2.3: The similarities between the valid text and the entities obtained in step S2.2 are ranked, the entities that rank in the top five and score above a preset threshold are selected as candidate entities, and the candidate entities are put into the first candidate entity set.

Preferably, the dictionary matching module is implemented by the following steps:

S3.1: A legal-domain dictionary tree is built from a prefix dictionary tree (trie) and the existing legal-domain dictionary;

S3.2: Based on the existing legal-domain dictionary, candidate entities are obtained for the valid text using a full-word matching algorithm or a partial-word matching algorithm, and the candidate entities are put into the second candidate entity set.

Preferably, the entity recognition module is implemented by the following steps:

S4.1: Training a named entity recognition model: existing user consultation texts are taken as the corpus to obtain an Embedding dictionary; the corpus is labeled automatically with BIO tags, the Embedding dictionary is used as the Embedding of the model, the characters and words of the corpus are input into a BiLSTM, and a CRF layer is used as the final sequence labeling layer for training, yielding a BERT + BiLSTM + CRF named entity recognition model;

S4.2: For the valid text, candidate entities are generated with the BERT + BiLSTM + CRF named entity recognition model and put into the third candidate entity set.

Preferably, the Embedding dictionary is obtained as follows: existing user consultation texts are taken as the corpus and divided into a training set and a test set; the training set is preprocessed, during which it is segmented with jieba, with jieba loading a custom legal-domain lexicon; the segmented training set is then trained with a BERT pre-trained model to obtain representation vectors for all characters and words, which serve as the Embedding dictionary.

Preferably, the text binary classification model is implemented by the following steps:

S5.1: Based on the existing legal-domain data set, a training set of positive and negative samples is generated according to whether an entity matches the user input, and a text binary classification model is obtained by training a BERT model on this training set;

S5.2: The first, second and third candidate entity sets are merged into a fourth candidate entity set;

S5.3: All entities in the fourth candidate entity set are input into the text binary classification model, the likelihood that each entity corresponds to an entity in the user's input text is calculated, and a score is output;

S5.4: All entities are ranked by score, the entities that rank in the top 2 and score above a preset threshold are marked as recognized entities, and they are output after deduplication to form an entity result set, which contains each entity word and its corresponding score.

Preferably, the method further comprises step S5.5: rules are formulated for post-verification of the output entity result set, the rules comprising full-word matching and post-screening.

Preferably, the full-word matching is: for the entity words ranked consecutively in the top 2 of the entity result set, if every character of an entity word is contained in the user's input text, the entity word is taken as a final recognition result.

Preferably, the post-screening is: following the ranking of entity-word scores in the entity result set, it is judged in turn whether each character of the higher-scoring entity words is contained in the user's input text; if not, the information of the entity word is recorded, and the recorded entity words are finally taken as the final recognition results.

The beneficial effects of the invention are as follows: the system makes full use of the word information in the user's input text and addresses the low accuracy and low recall of entity linking caused by the highly colloquial annotated named-entity-recognition corpora for Chinese consultation text. It thereby improves the quality of AI-based consultation and helps address the difficulty of extending professional services to remote areas where advisers are scarce.

Drawings

Fig. 1 is a schematic flow chart of the functions implemented by the modules of the system of the invention.

Fig. 2 is an architecture diagram, taking a BERT model as an example, of the similarity matching module based on similarity calculation and of the text binary classification algorithm provided by the invention.

Fig. 3 is an architecture diagram of the named entity recognition algorithm module based on BERT + BiLSTM + CRF provided by the invention.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

The accuracy of existing entity linking techniques in vertical domains still leaves considerable room for improvement. The vertical-domain entity linking technique based on multi-path recall provided by the invention brings the strengths of multi-path recall in information retrieval into entity linking in information extraction and applies the result to vertical domains such as law, medicine and finance, so as to identify the names of the associated entities for a given Chinese short text. Specifically, for a vertical domain such as law and a given Chinese short text, multi-path recall is first used to recall, from multiple semantic dimensions, a set of candidate entities related to the entities in the text, rather than relying on a single feature such as contextual semantics or sequence state. For the recalled candidate set, a classical similarity calculation model then measures the similarity between each candidate entity and the given entity, and the final entity link is obtained from that similarity, thereby achieving entity linking by means of multi-path recall.

As shown in Fig. 1, the invention provides a vertical-domain entity linking system based on multi-path recall, comprising: a text processing module, used to segment and filter the input text obtained from a user and splice it into a valid text; a similarity matching module, used to calculate the similarity between the valid text and entities, rank and filter the entities by similarity, and put them into a first candidate entity set; a dictionary matching module, used to search the valid text input by the user, obtain candidate entities, and put them into a second candidate entity set; an entity recognition module, which organizes the existing corpus into a training set and a test set for a CRF model, builds and trains the model, uses the model to perform named entity recognition on the valid text, and puts the resulting entities into a third candidate entity set; and a text binary classification model, in which a binary classification model is built from an existing data set in the legal domain, whether an entity matches the text is judged based on the binary classification model, and rules are formulated for post-verification.

As an embodiment of the invention, the architecture of the similarity matching module based on similarity calculation, taking a BERT model as an example, is shown in Fig. 2. It is implemented by the following steps:

Step 1: Pre-train the BERT similarity model.

Existing user consultation texts are taken as the corpus and divided into a training set and a test set. The training set is preprocessed: it is segmented with jieba, with jieba loading a custom legal-domain lexicon, and the segmented training set is trained with a BERT pre-trained model to obtain representation vectors for all characters and words, which serve as the Embedding dictionary. For each text in the training set, the corresponding entities are extracted by LCS string matching, the previously generated Embedding dictionary is used as the Embedding of the similarity model, and the extracted entities together with the original text are fed to the BERT model as input. For example, for the text "How does the law define the heritage inheritance right", LCS string matching yields the entity set "heritage, inheritance right, heritage inheritance right". The samples "[1, 0] How does the law define the heritage inheritance right [SEP] heritage", "[1, 0] How does the law define the heritage inheritance right [SEP] inheritance right" and "[1, 0] How does the law define the heritage inheritance right [SEP] heritage inheritance right" are then used as positive samples for the BERT model. Training finally yields the BERT similarity model M1.
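The mention-extraction part of Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that "LCS" means longest common substring (the source does not say whether substring or subsequence is intended), that an entity counts as mentioned when the whole entity name is recovered, and that the function names and English example strings are hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def extract_mentions(text, dictionary):
    """Keep dictionary entities whose full name is recovered as the LCS with the text."""
    return [e for e in dictionary if e and longest_common_substring(text, e) == e]


def positive_samples(text, mentions, sep="[SEP]"):
    """Pair the text with each mention, labeled [1, 0] as in the example above."""
    return [([1, 0], f"{text} {sep} {m}") for m in mentions]


# Illustrative data only (assumed, not from the patent's corpus).
text = "How does the law define the heritage inheritance right"
dictionary = ["heritage", "inheritance right", "heritage inheritance right", "custody right"]
mentions = extract_mentions(text, dictionary)
samples = positive_samples(text, mentions)
```

Treating LCS as longest common subsequence instead would also match entities that are split by filler words, at the cost of more false matches; which reading the inventors intended is not stated.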

Step 2: Preprocess the user's input text.

Based on the existing legal-domain dictionary, jieba is used to segment the user's input text, with jieba loading a custom legal-domain lexicon, to produce a segmentation result set. Based on this result set and the existing legal-domain invalid-word dictionary, the segmentation results are filtered and spliced in their original order to form the valid text of the user's input.
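The filter-and-splice part of this step can be sketched as below. The sketch assumes segmentation has already been done (in the described system by jieba with a custom lexicon, for instance via `jieba.load_userdict` and `jieba.lcut`); the function name, the `sep` parameter and the sample tokens are hypothetical. For Chinese text the tokens would be joined without a separator.

```python
def build_valid_text(tokens, invalid_words, sep=""):
    """Drop invalid (filler) tokens and splice the rest in their original order."""
    kept = [t for t in tokens if t.strip() and t not in invalid_words]
    return sep.join(kept)


# Illustrative tokens; in the described system they would come from jieba.
tokens = ["um", "how", "is", "the", "inheritance right", "like", "defined"]
invalid_words = {"um", "like"}
valid_text = build_valid_text(tokens, invalid_words, sep=" ")
```

Keeping the original token order matters because the later matching steps scan the valid text for contiguous entity names.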

Step 3: Calculate similarity.

Based on the valid text of the user's input, a word-based algorithm is used to find the entity candidate set for the input text, and the description of each entity is obtained from the knowledge graph. The trained BERT similarity model M1 is then used to calculate the similarity between the valid text and both the entities of the candidate set and their descriptions. For example, the text "How does the law define the heritage inheritance right" scores 0.891 against the entity and 0.861 against the entity description ("the heritage inheritance right is the right to inherit an estate according to the provisions of law or a lawful and valid will"). The final score for an entity is the average of these two similarities, that is, of the similarity between the user's input text and the entity and the similarity between the input text and the entity's definition; in this example the final score is 0.8761.
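The scoring arithmetic of this step can be sketched as follows, assuming, as claim 3 states, that similarity is the cosine between vector representations. How the BERT model produces those vectors is outside this sketch, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def final_score(entity_score, description_score):
    """Average the text-vs-entity and text-vs-description similarities."""
    return (entity_score + description_score) / 2


score = final_score(0.891, 0.861)
```

With the two rounded scores quoted in the example, the average comes to about 0.876, in line with the reported final score.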

Step 4: Generate the candidate entity set.

Based on the similarity between each entity and the user's input text, the entities that rank in the top 5 and score above 0.5 are marked as candidate entities and put into the candidate entity set.
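A minimal sketch of this selection rule, using the top-5 and 0.5 values given above as defaults; the function name and sample scores are hypothetical.

```python
def select_candidates(scored, top_k=5, threshold=0.5):
    """Keep entities that are both within the top_k by score and strictly above threshold."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(entity, s) for entity, s in ranked[:top_k] if s > threshold]


# Illustrative scores only.
scored = [("a", 0.9), ("b", 0.4), ("c", 0.7), ("d", 0.6),
          ("e", 0.55), ("f", 0.95), ("g", 0.8)]
first_candidate_set = select_candidates(scored)
```

Both conditions apply together: an entity scoring above the threshold is still dropped if five others outrank it.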

As an embodiment of the invention, the dictionary matching module, which matches against the entity dictionary using a word matching algorithm, is implemented by the following steps:

Step 1: Build the dictionary tree.

A legal-domain dictionary tree is built from a prefix dictionary tree (trie) and the existing legal-domain dictionary.

Step 2: Preprocess the user's input text.

Step 3: Screen the candidate set with a word matching algorithm.

Based on the existing legal-domain dictionary, the user's input text is searched with a full-word matching algorithm or a partial-word matching algorithm, such as the ahocorasick search algorithm, to obtain candidate entities, which are put into the candidate entity set.
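The dictionary-tree lookup can be sketched with a plain prefix trie, as below. This is a simplified stand-in, not the Aho-Corasick automaton named in the text: it restarts the scan at every position instead of following failure links, so it finds the same full-word matches but less efficiently. Class and function names and the sample dictionary are hypothetical.

```python
class TrieNode:
    """One node of a character-level prefix tree."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Insert every dictionary entry into a trie, character by character."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def match_entities(text, root):
    """Return every dictionary entry that occurs in text, in order of first match."""
    found = []
    for start in range(len(text)):
        node = root
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break
            if node.is_word:
                word = text[start:end + 1]
                if word not in found:
                    found.append(word)
    return found


# Illustrative dictionary and text.
trie = build_trie(["heritage", "inheritance right", "heritage inheritance right", "custody"])
second_candidate_set = match_entities("how is heritage inheritance right defined", trie)
```

Overlapping entries are all reported, which suits a recall stage where later ranking decides among them.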

As an embodiment of the invention, the architecture of the named entity recognition algorithm module based on CRF is shown in Fig. 3. It is implemented by the following steps:

Step 1: Pre-train the BERT + BiLSTM + CRF named entity recognition model.

Existing user consultation texts are taken as the corpus and divided into a training set and a test set. The training set is preprocessed: it is segmented with jieba, with jieba loading a custom legal-domain lexicon, and the segmented training set is trained with a BERT pre-trained model to obtain representation vectors for all characters and words, which serve as the Embedding dictionary. The corpus is then labeled automatically with BIO tags. For the texts in the training set, the previously generated Embedding dictionary is used as the Embedding of the model, the characters and words of the corpus are input into the BiLSTM, and a CRF layer is used as the final sequence labeling layer for training, yielding the BERT + BiLSTM + CRF named entity recognition model M2.
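The automatic BIO labeling mentioned in this step could look like the sketch below. The source does not say how labels are derived, so this assumes character-level tags produced by matching known entity names against the text, longest name first; the `ENT` label and the function name are hypothetical.

```python
def bio_label(text, entities, label="ENT"):
    """Tag each character of text as B-/I-label inside a known entity, O elsewhere."""
    tags = ["O"] * len(text)
    # Longest entities first, so a nested shorter name does not split a longer one.
    for ent in sorted(entities, key=len, reverse=True):
        if not ent:
            continue
        start = text.find(ent)
        while start != -1:
            end = start + len(ent)
            if all(t == "O" for t in tags[start:end]):
                tags[start] = f"B-{label}"
                for i in range(start + 1, end):
                    tags[i] = f"I-{label}"
            start = text.find(ent, start + 1)
    return list(zip(text, tags))


labeled = bio_label("is custody fair", ["custody"])
```

Character-level tagging is the usual choice for Chinese, where word boundaries are not marked by spaces; the resulting character and tag pairs would then serve as training data for the sequence labeling layer.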

Step 2: Preprocess the user's input text.

Step 3: Generate the candidate set with the BERT + BiLSTM + CRF named entity recognition model.

For the user's input text, candidate entities are generated with the BERT + BiLSTM + CRF named entity recognition model and put into the candidate entity set.

As an embodiment of the invention, the architecture of the text binary classification algorithm, taking a BERT model as an example, is shown in Fig. 2. It is implemented by the following steps:

Step 1: Pre-train the BERT binary classification model.

Existing user consultation texts are taken as the corpus and divided into a training set and a test set. For example, for the text "What does presumption of children born in wedlock mean", the samples "[1, 0] What does presumption of children born in wedlock mean [SEP] presumption of children born in wedlock" and "[0, 1] What does presumption of children born in wedlock mean [SEP] custody right" are generated. Training finally yields the BERT binary classification model M2.
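Generation of the positive and negative training samples can be sketched as follows, using the [1, 0] and [0, 1] label encoding from the example. How negative entities are chosen is not specified in the source; this sketch simply labels every non-matching entity in a supplied pool as negative. Function and variable names are hypothetical.

```python
def build_samples(text, matching_entities, entity_pool, sep="[SEP]"):
    """Label each (text, entity) pair [1, 0] if the entity matches the text, else [0, 1]."""
    samples = []
    for entity in entity_pool:
        label = [1, 0] if entity in matching_entities else [0, 1]
        samples.append((label, f"{text} {sep} {entity}"))
    return samples


# Illustrative data only.
text = "What does presumption of children born in wedlock mean"
samples = build_samples(
    text,
    matching_entities={"presumption of children born in wedlock"},
    entity_pool=["presumption of children born in wedlock", "custody right"],
)
```

In practice the ratio of negative to positive samples would likely need balancing, which the source does not discuss.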

Step 2: Perform the classification calculation.

Based on the valid text of the user's input, the entity candidate set for the input text is found by the three matching paths; each entity in the candidate set is spliced with the user's input text and input into the BERT binary classification model M2, which calculates the likelihood that the entity corresponds to an entity in the user's input text.

Step 3: Output the recognized entities.

The preceding steps give, for each entity, the likelihood that it corresponds to an entity in the user's input text. The entities that rank in the top 2 and score above 0.8 are marked as recognized entities and output after deduplication; the output contains each entity word and its corresponding score.
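Steps 2 and 3 together can be sketched as below: the three recalled candidate sets are merged with duplicates removed, each entity is scored, and the top 2 entities above 0.8 are kept. The `scorer` argument is a hypothetical stand-in for the BERT binary classification model, and all names and scores are illustrative assumptions.

```python
def link_entities(text, candidate_sets, scorer, top_k=2, threshold=0.8):
    """Merge recalled candidates, score them against text, keep the top_k above threshold."""
    merged = []
    for candidates in candidate_sets:
        for entity in candidates:
            if entity not in merged:  # deduplicate across recall paths
                merged.append(entity)
    scored = [(entity, scorer(text, entity)) for entity in merged]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(entity, s) for entity, s in scored[:top_k] if s > threshold]


# Stand-in scorer: a lookup table in place of the trained classifier.
fake_scores = {"custody right": 0.93, "alimony": 0.85, "visitation right": 0.82}


def fake_scorer(text, entity):
    return fake_scores.get(entity, 0.0)


candidate_sets = [
    ["custody right", "alimony"],           # similarity matching path
    ["custody right", "visitation right"],  # dictionary matching path
    ["alimony"],                            # entity recognition path
]
result_set = link_entities("who gets custody right after divorce", candidate_sets, fake_scorer)
```

Deduplicating before scoring avoids running the classifier more than once on an entity that several recall paths returned.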

Step 4: Rule-based post-verification.

Because most entity words in a vertical-domain knowledge base are strongly representative of domain knowledge, most user input texts explicitly contain the entity words. Two rules are therefore formulated for post-verification of the output entity result set:

Rule 1: Full-word matching. For the entity words ranked consecutively in the top 2 of the result set, if every character of an entity word is contained in the user's input text, the entity word is taken as a final recognition result;

Rule 2: Post-screening. Following the ranking of entity-word scores in the result set, it is judged in turn whether each character of the higher-scoring entity words is contained in the user's input text; if not, the information of the entity word is recorded, and the recorded entity words are finally taken as the final recognition result.
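Rule 1 can be sketched as a character-containment check, as below. Only Rule 1 is illustrated; the sketch assumes the result set is already sorted by score and that "contained" means each character appears somewhere in the input text, regardless of order. Function names and sample data are hypothetical.

```python
def contains_all_chars(entity, text):
    """True if every character of entity occurs somewhere in text."""
    return all(ch in text for ch in entity)


def full_word_match(result_set, text, top_n=2):
    """Keep the top_n ranked entity words whose characters all appear in text."""
    return [(entity, s) for entity, s in result_set[:top_n]
            if contains_all_chars(entity, text)]


# Illustrative result set, already sorted by score.
result_set = [("custody right", 0.93), ("alimony", 0.85), ("heritage", 0.81)]
final = full_word_match(result_set, "who gets custody right after divorce")
```

Because the check ignores character order, it also accepts entities whose characters are separated by filler words in the input, which matches the discontinuous-entity problem described in the Background.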

Finally, it should be noted that the above embodiments are only specific embodiments of the invention, intended to illustrate its technical solutions rather than to limit them, and the scope of protection of the invention is not limited thereto. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the invention and shall all be covered by its scope of protection. The scope of protection of the invention shall therefore be subject to the scope of protection of the claims.

Claims

1. A vertical-domain entity linking system based on multi-path recall, comprising:

2. The vertical-domain entity linking system based on multi-path recall of claim 1, wherein the text processing module is implemented as follows: the user's input text is preprocessed: based on an existing legal-domain dictionary, jieba is used to segment the user's input text, with jieba loading a custom legal-domain lexicon, to produce a segmentation result set; based on this result set and an existing legal-domain invalid-word dictionary, the segmentation results are filtered and then spliced, in their original order, into the valid text.

3. The vertical-domain entity linking system based on multi-path recall of claim 1, wherein the similarity matching module is implemented by the following steps:

S2.2: Similarity calculation: based on the valid text, the ahocorasick automaton algorithm is used to find entities matching the input text and form a candidate entity set, the description of each entity is obtained from the knowledge graph, and the BERT similarity model trained in step S2.1 is used to calculate the similarity values between the valid text and both the entities of the candidate entity set and their descriptions, wherein the similarity value is calculated as the cosine of the vectorized representations;

4. The vertical-domain entity linking system based on multi-path recall of claim 1, wherein the dictionary matching module is implemented by the following steps:

5. The vertical-domain entity linking system based on multi-path recall of claim 1, wherein the entity recognition module is implemented by the following steps:

6. The vertical-domain entity linking system based on multi-path recall of claim 4 or 5, wherein the Embedding dictionary is obtained as follows: existing user consultation texts are taken as the corpus and divided into a training set and a test set; the training set is preprocessed, during which it is segmented with jieba, with jieba loading a custom legal-domain lexicon; the segmented training set is then trained with a BERT pre-trained model to obtain representation vectors for all characters and words, which serve as the Embedding dictionary.

7. The vertical-domain entity linking system based on multi-path recall of claim 1, wherein the text binary classification model is implemented by the following steps:

8. The vertical-domain entity linking system based on multi-path recall of claim 7, further comprising step S5.5: rules are formulated for post-verification of the output entity result set, the rules comprising full-word matching and post-screening.

9. The vertical-domain entity linking system based on multi-path recall of claim 8, wherein the full-word matching is: for the entity words ranked consecutively in the top 2 of the entity result set, if every character of an entity word is contained in the user's input text, the entity word is taken as a final recognition result.

10. The vertical-domain entity linking system based on multi-path recall of claim 8, wherein the post-screening is: following the ranking of entity-word scores in the entity result set, it is judged in turn whether each character of the higher-scoring entity words is contained in the user's input text; if not, the information of the entity word is recorded, and the recorded entity words are finally taken as the final recognition results.