CN113901824A

CN113901824A - Question-answering system construction method based on named entity recognition

Info

Publication number: CN113901824A
Application number: CN202111276164.3A
Authority: CN
Inventors: 周洁琴
Original assignee: Nanjing Inspector Intelligent Technology Co Ltd
Current assignee: Nanjing Inspector Intelligent Technology Co Ltd
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-01-07

Abstract

The invention discloses a question-answering system construction method based on named entity recognition, which comprises the following steps: step 1, constructing a question-answer database; step 2, conducting named entity recognition and non-named entity recognition on the questions in the question-answer database; step 3, storing the recognition results of step 2 into corresponding fields in the question-answer database; step 4, calculating similarity: conducting entity recognition and Chinese word segmentation on the question input by the user to obtain named entities and non-named entities, finding the questions containing the corresponding entities in the question-answer database as candidate questions, and returning the answer of the candidate question with the highest similarity. By conducting named entity recognition and Chinese word segmentation on the questions in the question-answer database, the method obtains word vectors of named entities and non-named entities and thereby the corresponding candidate questions, obtains the similarity between the user input and the candidate questions according to an improved similarity calculation method, accurately matches the question input by the user, and improves the accuracy of answers in the question-answering system.

Description

Question-answering system construction method based on named entity recognition

Technical Field

The invention relates to the field of natural language processing research, in particular to a question-answering system construction method based on named entity recognition.

Background

The rapid development of the mobile internet brings abundant and diverse information to internet users. In the face of the vast amount of information on the internet, people are increasingly relying on querying information through search engines. However, conventional search engines return a large number of related web pages, and it is difficult for a user to quickly and accurately locate the correct answer that matches the question from among the large number of web pages.

In the process of implementing the invention, the inventor found that at least the following problems exist in the prior art. Unlike a conventional search engine, a question-answering system, as a novel information retrieval technology, can directly return accurate answers to users, thereby saving users the time of searching for the required information among a large number of related web pages. Short text similarity calculation plays an important role in a question-answering system because questions and answers take the form of short texts; in particular, a question is generally no longer than 100 characters and carries little information. In addition, users' expression habits differ, and irregular expressions such as wrongly written characters, abbreviations and colloquialisms occur in short-text questions, which reduces the quality of the returned answers. Unlike long text, short text is brief and its features are sparse, so short text similarity is measured poorly. Existing short text similarity methods cannot effectively handle the interference of noise words in short text and therefore cannot improve the accuracy of short text similarity calculation. A new semantic similarity method is therefore needed to improve the precision with which answers are automatically matched and returned to users. How to mine valuable information from short texts, accurately locate the most similar question and return the most accurate answer to the user is a problem to be solved urgently.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a question-answering system construction method based on named entity identification. The technical scheme is as follows:

A question-answering system construction method based on named entity recognition comprises the following steps:

Step 1, constructing a question-answer database:

A question-answer data source is obtained: a web crawler is used to crawl question-answer platforms as the data source of the question-answer database.

After the web pages are crawled, a data cleaning operation is needed: useless data are eliminated and the question elements are obtained, namely the question, answer time, number of likes and number of comments fields. The effective score S of each answer record is calculated according to the question elements; according to the effective score S, only the answer record with the highest effective score is retained for each question and stored in the question-answer database.

Step 2, conducting named entity recognition on the questions in the question-answer database, wherein named entity recognition refers to recognizing entities with specific meanings in the text, including names of people, places and organizations. Named entity recognition is performed on the questions in the question-answer database by using a BERT-BiLSTM-CRF model, in which BERT generates the word vector semantic representation of the input content and is followed by the BiLSTM-CRF model.

The method for carrying out entity recognition by using the BERT-BiLSTM-CRF model comprises the following steps:

(1) The question is processed by the bidirectional Transformer encoder in the BERT pre-trained language model, an Embedding layer is constructed, and a vector representation of each word is obtained as the input of the downstream BiLSTM-CRF task.

(2) The word vectors obtained by BERT processing are used as the input of the BiLSTM model, which processes the input sequence in the forward and reverse directions simultaneously; the forward information vector and the reverse information vector output at the same time step t are then spliced (concatenated) to obtain the sentence representation at time t. The association between text contexts is thus learned in both the forward and reverse directions.

(3) The output of the BiLSTM layer is used as the input sequence X = (x₁, x₂, …, x_n) of the CRF, where x represents a word vector and n represents the number of input word vectors. By learning the constraints among labels, the CRF improves the accuracy of label prediction, yields the final predicted label sequence, and marks each position of the input question with label information.
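The splicing in step (2) can be illustrated with a minimal sketch. It assumes scalar inputs and a toy recurrent update (a tanh recurrence standing in for a real LSTM cell); the function names and the `weight` parameter are hypothetical and are not part of the described model. Only the bidirectional pattern is the point: one pass runs left to right, one runs right to left, and the two states for the same time step are paired.

```python
import math


def run_direction(inputs, weight=0.5):
    """Toy recurrent pass: h_t = tanh(x_t + weight * h_{t-1}).

    ASSUMPTION: this simple recurrence only stands in for an LSTM cell.
    """
    state = 0.0
    states = []
    for x in inputs:
        state = math.tanh(x + weight * state)
        states.append(state)
    return states


def bidirectional_states(inputs):
    """Pair the forward and backward state of every time step."""
    forward = run_direction(inputs)
    # Run over the reversed sequence, then flip back so index t lines up.
    backward = run_direction(inputs[::-1])[::-1]
    return [(f, b) for f, b in zip(forward, backward)]


states = bidirectional_states([1.0, 0.0, -1.0])
```

Each element of `states` holds two values for one position: the first depends only on the inputs up to that position, the second only on the inputs from that position onward.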

Chinese word segmentation is performed on the questions in the question-answer database, and non-named entities are identified: the Baidu LAC word segmentation tool is used to perform word segmentation and part-of-speech tagging on the questions in the question-answer database; pronouns, adjectives and adverbs, which have no value in calculating similarity, are skipped, and non-named-entity nouns and non-named-entity verbs are screened out.

Step 3, storing the recognition results of step 2 into corresponding fields in the question-answer database. The following field columns are added for each question in the database: organization entity, person name entity, place entity, non-named-entity noun and non-named-entity verb. The named entities and non-named entities obtained in step 2 are stored in the corresponding columns respectively, wherein each element comprises the stored entity name and the entity word vector; if a class contains multiple names, they are stored separated by commas.

Step 4, calculating similarity: entity recognition and Chinese word segmentation are performed on the question input by the user to obtain named entities and non-named entities, the questions containing the corresponding entities are found in the question-answer database as candidate questions, the similarity between the user input question and each candidate question is calculated by an improved similarity calculation method, and the answer of the candidate question with the highest similarity is returned. The method specifically comprises the following steps:

After named entity recognition is performed on the question input by the user, if a named entity exists, the questions containing the corresponding named entity are found in the question-answer database as candidate questions.

The similarity sim1_(x,y) is calculated according to the word vectors of the user input question and the word vectors of its candidate questions. The similarity values of the candidate questions are sorted, and the answer corresponding to the candidate question with the highest similarity score is selected as the answer returned for the question input by the user.

If no named entity exists, the questions containing the corresponding non-named entities are found in the question-answer database as candidate questions, and the similarity sim2_(x,y) is calculated according to the word vectors of the user input question and the word vectors of the candidate questions.

The similarity values of the candidate questions are sorted, and the answer corresponding to the candidate question with the highest similarity score is selected as the answer returned for the question input by the user.
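The ranking step can be sketched as follows. The formulas for sim1 and sim2 are not reproduced in this text, so the sketch substitutes a plain cosine similarity between the mean word vectors of the two questions; this is an assumption for illustration only, not the improved similarity of the method. The function names and the shape of `candidates` are likewise hypothetical, and every question is assumed to have at least one word vector.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def best_answer(user_vectors, candidates):
    """Return the answer of the highest-scoring candidate, or None.

    candidates: list of (question_word_vectors, answer) pairs.
    ASSUMPTION: cosine of mean vectors replaces the unreproduced sim1/sim2.
    """
    query = mean_vector(user_vectors)
    scored = [(cosine(query, mean_vector(vectors)), answer)
              for vectors, answer in candidates]
    # Sort the similarity values in descending order and take the top one.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1] if scored else None


candidates = [([[0.0, 1.0]], "answer A"), ([[1.0, 0.1]], "answer B")]
chosen = best_answer([[1.0, 0.0]], candidates)
```

Here `chosen` is "answer B", because its question vector points almost in the same direction as the user vector while the other candidate is orthogonal to it.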

Preferably, one or more of the following platforms are selected as the question-answer platform in step 1: Baidu Tieba, Baidu Zhidao, Soso Wenwen, 360 Wenda, Sohu Wenda and Zhihu.

Preferably, the effective score S of each answer record in step 1 is:

where d = (the number of days from the answer record to the latest answer) + 1, n₁ denotes the number of likes, and n₂ denotes the number of comments.

Preferably, step 1 further comprises crawling the question-answer platforms periodically to update the question-answer database: for a question that already exists in the database, the effective score of the newly added answer is calculated; if it is higher than the effective score of the answer stored for that question, the stored answer is directly replaced; if it is lower, the stored answer is left unchanged.
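The update rule can be sketched as a small function. The in-memory dictionary, the function name and the handling of two cases the text does not cover (a tie keeps the stored answer; a question not yet in the database is simply added) are assumptions made for illustration.

```python
def update_answer(database, question, new_answer, new_score):
    """Apply the periodic-update rule to one newly crawled answer.

    database maps question -> (answer, effective_score).
    Returns True if the stored answer changed.
    """
    current = database.get(question)
    # ASSUMPTION: unseen questions are inserted; ties keep the stored answer.
    if current is None or new_score > current[1]:
        database[question] = (new_answer, new_score)
        return True
    return False


database = {"q1": ("old answer", 2.0)}
update_answer(database, "q1", "better answer", 3.5)   # replaces the stored answer
update_answer(database, "q1", "weaker answer", 1.0)   # leaves it unchanged
```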

Preferably, the similarity sim1_(x,y) is calculated as follows:

wherein W₁, W₂, …, W_a represent named-entity word vectors, N₁, N₂, …, N_b represent non-named-entity noun word vectors, V₁, V₂, …, V_c represent non-named-entity verb word vectors, a represents the number of named entities, b represents the number of non-named-entity nouns, and c represents the number of non-named-entity verbs.

Preferably, the similarity sim2_(x,y) is calculated as follows:

Further, finding the questions of the corresponding named entity in the question-answer database as candidate questions in step 4 means finding questions that contain the same word with the same entity type.

Compared with the prior art, one of the technical schemes has the following beneficial effects: the method comprises the steps of conducting named entity recognition and Chinese segmentation on questions in a question-answering database to obtain word vectors of named entities and non-named entities, obtaining corresponding candidate questions according to the named entities or the non-named entities of the questions input by a user, obtaining the similarity between the user input and the candidate questions according to an improved similarity calculation method, accurately matching the user input questions, and improving the accuracy of answers in a question-answering system.

Drawings

Fig. 1 is a flowchart of a method for constructing a question-answering system based on named entity identification according to an embodiment of the present disclosure.

Fig. 2 is a flow chart of constructing a question-answer database according to an embodiment of the present disclosure.

Fig. 3 is a block diagram of entity recognition using the BERT-BiLSTM-CRF model according to an embodiment of the present disclosure.

Detailed Description

In order to clarify the technical solution and the working principle of the present invention, the embodiments of the present disclosure will be described in further detail with reference to the accompanying drawings. All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.

The terms "step 1," "step 2," "step 3," and the like in the description and claims of this application and the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that the embodiments of the application described herein may be practiced in sequences other than those described herein.

The embodiment of the disclosure provides a question-answering system construction method based on named entity recognition. Fig. 1 is a flowchart of the method; with reference to the flowchart, the method mainly comprises the following steps:

Step 1, constructing a question-answer database. Fig. 2 is a flow chart of constructing the question-answer database; with reference to the figure, the specific method is as follows:

A question-answer data source is obtained: a web crawler is used to crawl question-answer platforms as the data source of the question-answer database. Preferably, one or more of the following platforms are selected as the question-answer platform: Baidu Tieba, Baidu Zhidao, Soso Wenwen, 360 Wenda, Sohu Wenda and Zhihu.

After the web pages are crawled, a data cleaning operation is needed: useless data (pictures, links, advertisements and the like) are eliminated and the question elements are obtained, namely the question, answer time, number of likes and number of comments fields. The effective score S of each answer record is then calculated according to the question elements: because a question is often answered by several people, the same question appears in multiple records, and the effective score is calculated for each answer record.

Preferably, the effective score S of each answer record is:

where d = (the number of days from the answer record to the latest answer) + 1, n₁ denotes the number of likes, and n₂ denotes the number of comments. When the answer record is the only answer to a question or the latest answer to a question, d = 1; d is therefore never 0, which avoids a zero denominator.

And according to the effective score S, only one answer record with the highest effective score is reserved for each question and is stored in a question-answer database.
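The selection of one answer per question can be sketched as follows. The formula for S is not reproduced in this text; only d, n₁ and n₂ are defined, with d appearing in the denominator. The sketch therefore assumes S = (n₁ + n₂) / d purely as a placeholder, and the record type and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class AnswerRecord:
    question: str
    answer: str
    answered_on: date
    likes: int      # n1
    comments: int   # n2


def effective_score(record, latest_answer_date):
    # d = days from this answer to the latest answer of the same question, plus 1,
    # so d >= 1 and the division below is always defined.
    d = (latest_answer_date - record.answered_on).days + 1
    # ASSUMPTION: placeholder formula, not the one given in the patent.
    return (record.likes + record.comments) / d


def keep_best_answers(records):
    """Return {question: record with the highest effective score}."""
    by_question = {}
    for record in records:
        by_question.setdefault(record.question, []).append(record)
    best = {}
    for question, group in by_question.items():
        latest = max(r.answered_on for r in group)
        best[question] = max(group, key=lambda r: effective_score(r, latest))
    return best


records = [
    AnswerRecord("q1", "older answer", date(2021, 10, 1), likes=10, comments=0),
    AnswerRecord("q1", "newer answer", date(2021, 10, 10), likes=3, comments=1),
]
best = keep_best_answers(records)
```

Under the assumed formula the older answer scores 10 / 10 = 1.0 and the newer one 4 / 1 = 4.0, so the newer answer is retained for "q1".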

Step 2, carrying out named entity recognition and non-named entity recognition on the questions in the question-answer database, wherein named entity recognition refers to recognizing entities with specific meanings in the text, including names of people, places and organizations. Named entity recognition is performed on the questions in the question-answer database by using a BERT-BiLSTM-CRF model, in which BERT generates the word vector semantic representation of the input content and is followed by the BiLSTM-CRF model. Fig. 3 is a block diagram of entity recognition using the BERT-BiLSTM-CRF model; with reference to this figure, the method of entity recognition using the BERT-BiLSTM-CRF model is as follows:

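(1) The question is processed by the bidirectional Transformer encoder in the BERT pre-trained language model, an Embedding layer is constructed, and a vector representation of each word is obtained as the input of the downstream BiLSTM-CRF task.

(2) The word vectors obtained by BERT processing are used as the input of the BiLSTM model, which processes the input sequence in the forward and reverse directions simultaneously; the forward information vector and the reverse information vector output at the same time step t are then spliced (concatenated) to obtain the sentence representation at time t. Learning the relation between text contexts in both the forward and reverse directions effectively overcomes the limitation that a unidirectional recurrent neural network can only derive context from one direction.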

(3) The output of the BiLSTM layer is used as the input sequence X = (x₁, x₂, …, x_n) of the CRF, where x represents a word vector and n represents the number of input word vectors. By learning the constraints among labels, the CRF improves the accuracy of label prediction, yields the final predicted label sequence, and marks each position of the input question with label information.

For example, if the input question in step (1) is "what are the latest products released by Apple", the output of step (3) tags the two characters of "Apple" (苹果) as B-ORG and I-ORG, and tags every other character of the question as O.
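Turning such a per-character tag sequence into entities can be sketched as follows. The function name is hypothetical, and the Chinese sentence is an assumed back-translation of the example question, used only to exercise the code.

```python
def extract_entities(chars, tags):
    """Collect (entity_text, entity_type) pairs from a BIO tag sequence."""
    entities = []
    current_chars, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush the one in progress, if any.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_chars.append(ch)
        else:
            # "O" or an I- tag that does not continue the open entity.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
    if current_chars:
        entities.append(("".join(current_chars), current_type))
    return entities


# ASSUMPTION: reconstructed Chinese form of the example question.
chars = list("苹果发布的最新产品是什么")
tags = ["B-ORG", "I-ORG"] + ["O"] * (len(chars) - 2)
entities = extract_entities(chars, tags)
```

With these tags, `entities` contains the single pair ("苹果", "ORG"), which is the value stored in the organization entity column in step 3.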

Chinese word segmentation is performed on the questions in the question-answer database, and non-named entities are identified: the Baidu LAC word segmentation tool is used to perform word segmentation and part-of-speech tagging on the questions in the question-answer database; pronouns, adjectives and adverbs, which have no value in calculating similarity, are skipped, and non-named-entity nouns and non-named-entity verbs are screened out. For example, segmenting "the latest products released by Apple" yields the non-named entities "release" and "product".
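The screening step can be sketched on already tagged output, without calling the segmentation tool itself. The tag labels used here (n, v, r, a, d, and the entity labels) are assumed to follow the LAC tag set and should be checked against the tool's documentation; the function name and the English stand-in tokens are hypothetical.

```python
# ASSUMPTION: tag labels modelled on the LAC part-of-speech tag set.
SKIPPED_TAGS = {"r", "a", "d"}                                # pronoun, adjective, adverb
NAMED_ENTITY_TAGS = {"PER", "LOC", "ORG", "nr", "ns", "nt"}   # handled by the NER step


def select_non_named_entities(tagged_words):
    """Split (word, tag) pairs into non-named-entity nouns and verbs."""
    nouns, verbs = [], []
    for word, tag in tagged_words:
        if tag in NAMED_ENTITY_TAGS or tag in SKIPPED_TAGS:
            continue
        if tag == "n":
            nouns.append(word)
        elif tag == "v":
            verbs.append(word)
        # Every other part of speech is ignored in this sketch.
    return nouns, verbs


tagged = [("Apple", "ORG"), ("release", "v"), ("latest", "a"),
          ("product", "n"), ("what", "r")]
nouns, verbs = select_non_named_entities(tagged)
```

For this input the result is the noun "product" and the verb "release", matching the example in the text.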

Step 3, storing the recognition results of step 2 into corresponding fields in the question-answer database. The following field columns are added for each question in the database: organization entity, person name entity, place entity, non-named-entity noun and non-named-entity verb. The named entities and non-named entities obtained in step 2 are stored in the corresponding columns respectively, wherein each element comprises the stored entity name and the entity word vector; if a class contains multiple names, they are stored separated by commas, such as ((entity 1, word vector 1), (entity 2, word vector 2), …).
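One way to lay out these columns is sketched below with an in-memory SQLite table. The table name, column names and helper functions are hypothetical, and the sketch serializes each column as JSON rather than as the comma-separated pairs described above, which is a substitution made for simplicity.

```python
import json
import sqlite3

# ASSUMPTION: hypothetical column names for the five entity fields.
FIELDS = ("organizations", "persons", "places", "nouns", "verbs")


def create_table(conn):
    columns = ", ".join(f"{name} TEXT" for name in FIELDS)
    conn.execute(
        f"CREATE TABLE qa (question TEXT PRIMARY KEY, answer TEXT, {columns})"
    )


def store_question(conn, question, answer, entities):
    """entities maps a field name to a list of [entity_name, word_vector] pairs."""
    values = [json.dumps(entities.get(name, []), ensure_ascii=False)
              for name in FIELDS]
    conn.execute(
        "INSERT OR REPLACE INTO qa VALUES (?, ?, ?, ?, ?, ?, ?)",
        [question, answer, *values],
    )


def load_entities(conn, question):
    row = conn.execute(
        "SELECT " + ", ".join(FIELDS) + " FROM qa WHERE question = ?",
        (question,),
    ).fetchone()
    if row is None:
        return None
    return {name: json.loads(text) for name, text in zip(FIELDS, row)}


conn = sqlite3.connect(":memory:")
create_table(conn)
store_question(conn, "what are the latest products released by Apple", "…",
               {"organizations": [["Apple", [0.1, 0.2]]],
                "nouns": [["product", [0.3, 0.4]]],
                "verbs": [["release", [0.5, 0.6]]]})
stored = load_entities(conn, "what are the latest products released by Apple")
```

Fields with no entities, such as the person name and place columns here, are stored as empty lists.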

Step 4, calculating similarity: entity recognition and Chinese word segmentation are performed on the question input by the user to obtain named entities and non-named entities, the questions containing the corresponding entities are found in the question-answer database as candidate questions, the similarity between the user input question and each candidate question is calculated by an improved similarity calculation method, and the answer of the candidate question with the highest similarity is returned. The method specifically comprises the following steps:

After named entity recognition is performed on the user input question, if a named entity (organization, person name or place name) exists, the questions containing the corresponding named entity are found in the question-answer database as candidate questions, and the similarity sim1_(x,y) is calculated according to the word vectors of the user input question and the word vectors of the candidate questions:

wherein W₁, W₂, …, W_a represent named-entity word vectors, N₁, N₂, …, N_b represent non-named-entity noun word vectors, V₁, V₂, …, V_c represent non-named-entity verb word vectors, a represents the number of named entities, b represents the number of non-named-entity nouns, and c represents the number of non-named-entity verbs;

If no named entity exists, the questions containing the corresponding non-named entities (nouns and verbs) are found in the question-answer database as candidate questions, and the similarity sim2_(x,y) is calculated according to the word vectors of the user input question and the word vectors of the candidate questions:

Preferably, finding the questions of the corresponding named entity in the question-answer database as candidate questions in step 4 means finding questions that contain the same word with the same entity type. For example, the word "apple" in a question about apples as a fruit and the word "Apple" in "what are the latest products released by Apple" differ in type: the former is a non-named-entity noun, while the latter is a named entity of the organization type. The two entity types are clearly inconsistent, so the former question does not count as a question of the corresponding named entity.
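This matching rule can be sketched as a lookup that compares (word, entity type) pairs rather than bare words. The index layout, the type labels and the function name are assumptions for illustration.

```python
def find_candidates(user_entities, question_index):
    """Return the questions sharing at least one (word, entity_type) pair.

    user_entities: set of (word, entity_type) pairs from the user question.
    question_index: maps each stored question to its own set of such pairs.
    """
    return [question for question, entities in question_index.items()
            if user_entities & entities]


# ASSUMPTION: hypothetical type labels "ORG" and "NOUN".
question_index = {
    "what are the latest products released by Apple": {("apple", "ORG"),
                                                       ("product", "NOUN")},
    "are apples tasty": {("apple", "NOUN")},
}
candidates = find_candidates({("apple", "ORG")}, question_index)
```

Only the first question is returned: the second contains the same word, but tagged as a non-named-entity noun, so the pair does not match.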

The invention has been described above by way of example with reference to the accompanying drawings. It should be understood that the invention is not limited to the specific embodiments described above: insubstantial modifications made in accordance with the principles and technical solutions of the invention, and direct application of the conception and technical solutions of the invention to other occasions without improvement or equivalent replacement, all fall within the protection scope of the invention.

Claims

1. A question-answering system construction method based on named entity recognition is characterized by comprising the following steps:

step 1, constructing a question-answer database:

after the web pages are crawled, a data cleaning operation is performed, useless data are eliminated, and the question elements are obtained: the question, answer time, number of likes and number of comments fields; the effective score S of each answer record is calculated according to the question elements; according to the effective score S, only the answer record with the highest effective score is retained for each question and stored in the question-answer database;

step 2, conducting named entity recognition and non-named entity recognition on the questions in the question-answer database, wherein named entity recognition refers to recognizing entities with specific meanings in the text, including names of people, places and organizations; using a BERT-BiLSTM-CRF model to perform named entity recognition on the questions in the question-answer database, wherein BERT generates the word vector semantic representation of the input content and is followed by the BiLSTM-CRF model;

(1) processing the question by using the bidirectional Transformer encoder in the BERT pre-trained language model, constructing an Embedding layer, and obtaining the vector representation of each word as the input of the downstream BiLSTM-CRF task;

(2) using the word vectors obtained by BERT processing as the input of the BiLSTM model, processing the input sequence in the forward and reverse directions simultaneously, and then splicing the forward information vector and the reverse information vector output at the same time step t to obtain the sentence representation at time t, whereby the association between text contexts is learned in both the forward and reverse directions;

(3) using the output of the BiLSTM layer as the input sequence X = (x₁, x₂, …, x_n) of the CRF, wherein x represents a word vector and n represents the number of input word vectors; learning the constraints among labels to improve the accuracy of label prediction, obtaining the final predicted label sequence, and marking each position of the input question with label information;

performing Chinese word segmentation on the questions in the question-answer database, and identifying non-named entities: using a Baidu LAC word segmentation tool to perform word segmentation and part-of-speech tagging on the questions in the question-answer database, skipping pronouns, adjectives and adverbs which have no value in calculating similarity, and screening out non-named entity nouns and non-named entity verbs;

step 3, storing the recognition results of step 2 into corresponding fields in the question-answer database, and adding the following field columns for each question in the database: organization entity, person name entity, place entity, non-named-entity noun and non-named-entity verb; storing the named entities and non-named entities obtained in step 2 into the corresponding columns respectively, wherein each element comprises the stored entity name and the entity word vector, and if a class contains multiple names, they are stored separated by commas;

after named entity recognition is performed on the question input by the user, if a named entity exists, the questions containing the corresponding named entity are found in the question-answer database as candidate questions,

the similarity sim1_(x,y) is calculated according to the word vectors of the user input question and the word vectors of its candidate questions; the similarity values of the candidate questions are sorted, and the answer corresponding to the candidate question with the highest similarity score is selected as the answer returned for the user input question;

if no named entity exists, the questions containing the corresponding non-named entities are found in the question-answer database as candidate questions, and the similarity sim2_(x,y) is calculated according to the word vectors of the question input by the user and the word vectors of the candidate questions;

2. The method for constructing the question-answering system based on named entity recognition according to claim 1, wherein one or more of the following platforms are selected as the question-answer platform in step 1: Baidu Tieba, Baidu Zhidao, Soso Wenwen, 360 Wenda, Sohu Wenda and Zhihu.

3. The method for constructing a question-answering system based on named entity recognition according to claim 1, wherein the effective score of each answer record in step 1 is as follows:

4. The method for constructing the question-answering system based on named entity recognition according to any one of claims 1 to 3, wherein step 1 further comprises crawling the question-answer platforms periodically to update the question-answer database: for a question that already exists in the database, the effective score of the newly added answer is calculated; if it is higher than the effective score of the answer stored for that question, the stored answer is directly replaced; if it is lower, the stored answer is left unchanged.

5. The method for constructing the question-answering system based on named entity recognition according to claim 4, wherein the similarity sim1_(x,y) is calculated as follows:

wherein W₁, W₂, …, W_a represent named-entity word vectors, N₁, N₂, …, N_b represent non-named-entity noun word vectors, V₁, V₂, …, V_c represent non-named-entity verb word vectors, a represents the number of named entities, b represents the number of non-named-entity nouns, and c represents the number of non-named-entity verbs.

6. The method for constructing the question-answering system based on named entity recognition according to claim 4, wherein the similarity sim2_(x,y) is calculated as follows:

7. The method for constructing the question-answering system based on named entity recognition according to claim 5 or 6, wherein finding the questions of the corresponding named entity in the question-answer database as candidate questions in step 4 means finding questions that contain the same word with the same entity type.