CN108681574B - Text abstract-based non-fact question-answer selection method and system - Google Patents
- Publication number
- CN108681574B CN108681574B CN201810428163.8A CN201810428163A CN108681574B CN 108681574 B CN108681574 B CN 108681574B CN 201810428163 A CN201810428163 A CN 201810428163A CN 108681574 B CN108681574 B CN 108681574B
- Authority
- CN
- China
- Prior art keywords
- text
- answer
- sentence
- abstract
- question
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a text abstract-based non-factual question-answer selection method and system, belonging to the technical field of intelligent retrieval. The method comprises the steps of: extracting the first sentence and the last sentence of an answer text to be selected; extracting an abstract of the remaining text of the answer text to be selected, excluding the first and last sentences, with the text abstract model TextRank to obtain a preliminary text abstract; combining the first sentence, the preliminary text abstract and the last sentence in order to obtain an answer text abstract to be selected; taking the question and the answer text abstract to be selected as the input of a neural network semantic representation model to obtain the semantic correlation degree of the question and the answer text abstract to be selected; and returning the answer text abstract with the highest semantic correlation to the question as the answer. When the abstract of an answer is extracted, the first and last sentences of the answer text are retained as components of the abstract, which preserves the topical completeness of the extracted abstract and improves the accuracy of answer selection.
Description
Technical Field
The invention relates to the technical field of intelligent retrieval, and in particular to a text abstract-based non-factual question-answer selection method and system.
Background
Currently, question-answering systems have become an important research topic in the field of natural language processing, and they are applied in many areas of information acquisition, such as information retrieval, expert systems, automatic question answering, and natural language human-machine interaction. A question-answering system differs from information retrieval in that it does not require the user to sift through results for an answer; it returns an answer directly.
According to their data sources, question-answering systems are divided into three types: systems based on structured data, systems based on free text, and systems based on question-answer pairs. The workflow of a system based on question-answer pairs is that, after a user poses a question, the system performs semantic feature analysis and returns the answer that best matches the question semantically; the data mainly come from community question-answering sites.
Early research on answer selection was generally based on traditional semantic feature extraction: text features are selected manually and then a high-performance classifier is trained. Manually defined features give the semantic representation strong interpretability, and the feature selection covers the whole data set. The selected features mainly reflect the sentence quality of the answer text and the correlation between the question and the answer content. Manually selected features typically include word N-grams, syntactic structures, and grammatical dependencies. The most common early approach was to perform word segmentation, part-of-speech tagging, or syntactic analysis on the text with existing natural language processing tools, and then train an answer selection model on manually defined features.
However, answer texts in non-factual question answering take diverse forms and contain noisy information, and it is difficult to match the correct answer with general linguistic rules. Therefore, for the answer selection task of a non-factual question-answering system, the current mainstream approach is to mine the semantic information of text with supervised machine learning on labeled text, for example:
the SVM model is utilized to train matching features at a word level, such as keyword matching features, phrase-level non-semantic features, and some named entity-based features. Still other researchers have developed a series of lexical features related to answer quality including whether punctuation, hyperlinks, the number of special words, part of speech and frequency of named entity features, and N-gram language models by extracting features of text through natural language processing tools. The syntax tree can be used for better capturing the local structural information of the sentence, and the answer selection method based on the syntax tree can effectively reduce the workload of feature selection. Answer selection is performed by a combined approach of syntactic and semantic features, the syntactic aspect calculates tree edit distances between dependency syntax trees for questions and answers, and the semantic aspect uses shallow semantic features such as entity types, synonyms, and the like.
The tree edit distance is the total cost of the operations (insertion, deletion, and substitution) required to transform one tree into the other; its computation is similar to the string edit distance. One line of work labels the sequence in a question-answer pair with a Conditional Random Field (CRF), with practical features including the tree edit distance and the string edit distance; this was the first work to cast answer selection in community question answering as a sequence labeling problem. Besides syntax trees, some researchers compare the relevance of question and answer text from the perspective of language models and word vectors, for example, using translation-based models that treat the question and the candidate answer as two different languages to measure how relevant a question is to an answer.
Answer selection methods based on traditional semantic feature extraction are usually well interpretable: the basis of each decision can be traced back to the manually selected features, which makes them easy to understand. However, this approach has drawbacks. First, it relies on many toolkits from basic natural language research, so the quality of the selected features depends on the quality of that underlying research, and a simple feature extraction scheme may fail on texts with complex structure. Second, the features used by the answer selection model ultimately depend on human choices; the model has no self-learning capability, which limits its applicability.
Disclosure of Invention
The invention aims to provide a text abstract-based non-fact question-answer selection method and system to improve the answer selection accuracy of a question-answer system.
In order to achieve the above purpose, the present invention adopts a text abstract-based answer selection method for non-factual question answers, which comprises the following steps:
extracting a first sentence and a last sentence of the answer text to be selected;
extracting the abstracts of the remaining texts of the answer text to be selected except the first sentence and the last sentence by using a text abstract model TextRank to obtain a preliminary text abstract;
sequentially combining the first sentence, the preliminary text abstract and the tail sentence to obtain an answer text abstract to be selected;
taking the question and the answer text abstract to be selected as input of a neural network semantic representation model to obtain semantic correlation degree of the question and the answer text abstract to be selected;
and returning the answer text abstract with the highest semantic relevance degree with the question as an answer.
Preferably, the extracting the first sentence and the last sentence of the answer text to be selected includes:
and extracting the first sentence and the tail sentence of the answer text to be selected according to the positions of the first sentence and the tail sentence in the answer text to be selected.
Preferably, the extracting the abstract of the remaining text of the answer text to be selected except the first sentence and the last sentence by using the text abstract model TextRank to obtain a preliminary text abstract comprises:
dividing the answer text to be selected into sentences, and segmenting each sentence;
labeling the part of speech of each word, and filtering the labeled words to retain only terms with specific parts of speech;
taking the terms or sentences of the specific words as text units, forming nodes by the text units, and forming edges between the nodes by the similarity between the text units to obtain a weight graph model;
calculating the similarity of any two nodes, and taking the similarity value as a calculation parameter of a node weight calculation formula;
iterating the node weight calculation formula until convergence is achieved to obtain a score result of each node;
according to the scores among all the nodes during convergence, all the nodes are sorted to obtain the sorted nodes;
and extracting text units from the sorted nodes according to a set extraction ratio to form a preliminary text abstract.
Preferably, the method for calculating the similarity between any two nodes includes: a vocabulary overlap method, a character string method, a cosine similarity method and a maximum common subsequence method.
On the other hand, a text abstract-based non-factual question-answer selection system is provided, comprising a first extraction module, a second extraction module, a combination module, a matching module and a determining module which are connected in sequence;
the first extraction module is used for extracting a first sentence and a last sentence of the answer text to be selected;
the second extraction module is used for extracting the abstracts of the remaining texts of the answer text to be selected except the first sentence and the last sentence by using a text abstract model TextRank to obtain a primary text abstract;
the combination module is used for sequentially combining the first sentence, the preliminary text abstract and the tail sentence to obtain an answer text abstract to be selected;
the matching module is used for taking the question and the answer text abstract to be selected as the input of a neural network semantic representation model to obtain the semantic correlation degree of the question and the answer text abstract to be selected;
and the determining module is used for returning the answer text abstract with the highest semantic relevance degree with the question as an answer.
Preferably, the first extraction module is specifically configured to:
and extracting the first sentence and the tail sentence of the answer text to be selected according to the positions of the first sentence and the tail sentence in the answer text to be selected.
Preferably, the second extraction module comprises a segmentation unit, a filtering unit, a weight graph model construction unit, a similarity calculation unit, an iteration unit, a sorting unit and a composition unit which are connected in sequence;
the segmentation unit is used for segmenting the answer text to be selected into sentences and segmenting each sentence;
the filtering unit is used for labeling the part of speech of each word and filtering the labeled words to retain only terms with specific parts of speech;
the weight graph model building unit is used for taking the terms or sentences of the specific words as text units, forming the text units into nodes, and forming edges between the nodes by the similarity between the text units to obtain a weight graph model;
the similarity calculation unit is used for calculating the similarity of any two nodes and taking the similarity value as a calculation parameter of the node weight calculation formula;
the iteration unit is used for iterating the node weight calculation formula until convergence is achieved, and obtaining the score result of each node;
the sorting unit is used for sorting the nodes according to scores among the nodes during convergence to obtain the sorted nodes;
and the composition unit is used for extracting text units from the sorted nodes at the set extraction ratio to form a preliminary text abstract.
Preferably, the similarity calculation method adopted by the similarity calculation unit includes: a vocabulary overlap method, a character string method, a cosine similarity method and a maximum common subsequence method.
Compared with the prior art, the invention has the following technical effects. In practice, in a question-answer pair of a non-factual question-answering system, the answer text is much longer than the question. If a single text abstract extraction method is used, only the global information of the text is considered and the intrinsic feature information of the text units, such as sentence position and term position, is lost; when the extraction ratio of the abstract is set low, topic drift easily occurs. When extracting the abstract of the answer text, the invention retains the first and last sentences of the answer text, applies the abstract extraction method to the remaining content, and combines the first sentence, the abstract, and the last sentence in order as the final extracted abstract. Because the first sentence of an answer in question answering is generally a brief restatement of the question, and the last sentence is generally a brief summary of the answer content, retaining them as components of the abstract preserves the topical completeness of the extracted abstract and improves the accuracy of answer selection.
Drawings
The following detailed description of embodiments of the invention refers to the accompanying drawings in which:
FIG. 1 is a flow chart of a method for selecting answers to non-factual question answers based on text summaries;
FIG. 2 is a schematic diagram of text summarization of answers;
FIG. 3 is a TextRank weight diagram;
FIG. 4 is a block diagram of a neural network semantic representation model;
fig. 5 is a schematic structural diagram of a text abstract-based non-fact question-answer selection system.
Detailed Description
To further illustrate the features of the present invention, refer to the following detailed description of the invention and the accompanying drawings. The drawings are for reference and illustration purposes only and are not intended to limit the scope of the present disclosure.
The embodiment of the application provides a text abstract-based non-factual question-answer selection method, which addresses the low answer-selection accuracy of existing question-answering systems.
To solve the above problem, the main idea of this embodiment is to keep the first sentence and the last sentence of the answer text to be selected, extract an abstract from the remaining text after the first and last sentences are removed, combine the first sentence, the abstract, and the last sentence in order into the final text abstract, match the final abstract against the question, and return the best-matching answer.
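This combination idea can be sketched in Python as follows (the helper names here are illustrative, not from the patent; `summarize_middle` stands in for any extractive summarizer such as TextRank):

```python
def build_answer_digest(sentences, summarize_middle):
    """Keep the first and last sentences and summarize everything between.

    `sentences` is the answer text already split into sentences;
    `summarize_middle` returns a subset of the sentences it is given.
    """
    if len(sentences) <= 2:
        return list(sentences)          # nothing between first and last
    first, last = sentences[0], sentences[-1]
    middle = summarize_middle(sentences[1:-1])
    return [first] + middle + [last]    # combine in the original order
```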
As shown in fig. 1 to fig. 2, the detailed description of the answer selection method for non-factual question answering based on text abstract according to the present embodiment includes the following steps S1 to S5:
s1, extracting a first sentence and a last sentence of the answer text to be selected;
s2, abstracting the remaining text except the first sentence and the last sentence of the answer text to be selected by using a text abstraction model TextRank to obtain a primary text abstract;
s3, sequentially combining the first sentence, the preliminary text abstract and the tail sentence to obtain an answer text abstract to be selected;
s4, taking the question and the answer text abstract to be selected as the input of a neural network semantic representation model to obtain the semantic correlation degree of the question and the answer text abstract to be selected;
and S5, returning the answer text abstract with the highest semantic relevance degree with the question as an answer.
It should be noted that the question and the answer text abstract are input into the neural network answer selection model; the neural network encodes them and mines the text semantics to obtain vector representations, and the semantic correlation degree is finally obtained by computing the similarity of the semantic vectors of the question and the answer text.
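As a minimal sketch of this final matching step, the relevance computation over already-encoded semantic vectors might look as follows; the neural encoder itself is out of scope here, and the function names are illustrative assumptions:

```python
import math

def cosine_similarity(u, v):
    """Semantic correlation of two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_answer(question_vec, candidate_vecs):
    """Return the index of the candidate abstract most relevant to the question."""
    scores = [cosine_similarity(question_vec, c) for c in candidate_vecs]
    return max(range(len(scores)), key=scores.__getitem__)
```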
More preferably, in step S1, the first sentence and the last sentence of the answer text to be selected are extracted as follows: first, the positions of the first and last sentences of the answer text are identified, and then the sentences are extracted according to those positions. For example, the position of the first period in the answer text is identified and the sentence before it is extracted as the first sentence; the positions of the last two periods in the answer text are identified and the sentence between them is extracted as the last sentence.
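A simple sketch of this position-based extraction, assuming sentences are terminated by periods (real answer text would need more robust sentence splitting):

```python
def split_first_middle_last(text, terminator="."):
    """Split an answer into (first sentence, middle sentences, last sentence)."""
    sentences = [s.strip() for s in text.split(terminator) if s.strip()]
    if not sentences:
        return "", [], ""
    if len(sentences) == 1:
        return sentences[0], [], ""      # no distinct last sentence
    return sentences[0], sentences[1:-1], sentences[-1]
```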
More preferably, in step S2: and abstracting the rest texts of the answer text to be selected except the first sentence and the last sentence by using a text abstraction model TextRank to obtain a preliminary text abstraction. The detailed description is as follows:
When the TextRank algorithm is used to extract key sentences, sentences are taken as nodes and a graph model is built with edges weighted by sentence similarity. The similarity measure most commonly used in TextRank is word overlap: the more words two sentences share, the higher their similarity. Besides word overlap, sentence similarity can also be computed with string-based methods, cosine similarity, or the maximum common subsequence, all of which are based on statistical information. After the graph model is built, the PageRank algorithm computes node scores recursively; the higher a node's score, the more important the corresponding sentence. Once the sentences are ranked by importance, key sentences are extracted at the required ratio to form the text abstract.
The main steps are as follows:
(1) Preprocessing: the text is divided into text units (terms or sentences), and part-of-speech tagging is performed after word segmentation. The tagged words are then filtered to remove stop words and unwanted parts of speech, so that only terms with specific parts of speech remain.
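A minimal preprocessing sketch, assuming whitespace tokenization and a toy stop-word list (a real system would use a proper segmenter and part-of-speech tagger for the target language):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is"}  # toy list

def preprocess(text):
    """Split text into sentences, tokenize each, and filter stop words."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    units = []
    for sent in sentences:
        terms = [w.lower() for w in sent.split() if w.lower() not in STOP_WORDS]
        units.append(terms)
    return units
```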
(2) Constructing a weight graph model: and forming nodes by the text units, and forming edges among the nodes by the similarity among the text units to form a weight graph model.
(3) Sentence similarity calculation: the similarity of two sentences is computed with a word-overlap-based method. For sentences S_i and S_j, the similarity is computed with the following formula:

Similarity(S_i, S_j) = |{w_k : w_k ∈ S_i and w_k ∈ S_j}| / (log(|S_i|) + log(|S_j|))

where S_i and S_j denote the two sentences, sentence S_i is represented by its N_i terms (S_i = w_1^i, w_2^i, ..., w_{N_i}^i), and w_k denotes a word contained in both sentences. The weight of the edge between the two sentence nodes is W_ji = Similarity(S_i, S_j).
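The word-overlap similarity can be sketched as:

```python
import math

def overlap_similarity(sent_i, sent_j):
    """Word-overlap similarity used by TextRank:
    |S_i ∩ S_j| / (log|S_i| + log|S_j|), for tokenized sentences."""
    common = set(sent_i) & set(sent_j)
    denom = math.log(len(sent_i)) + math.log(len(sent_j))
    return len(common) / denom if denom > 0 else 0.0
```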
(4) Iterating the node score calculation formula until convergence to obtain each node's score: the TextRank algorithm model can be represented by G = (V, E), where V denotes the set of all nodes in the graph and E the set of all edges, E being a subset of V × V; together V and E constitute the whole graph. The score of node V_i is:

WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) × WS(V_j)

where w_ji is the weight of the edge between node V_j and node V_i, usually represented by the similarity of V_j and V_i; In(V_i) denotes the set of nodes pointing to node V_i; Out(V_j) denotes the set of nodes that node V_j points to; and d is the damping coefficient (0 ≤ d ≤ 1), representing the probability of jumping from a given node in the graph to any other node, generally set to d = 0.85.
In addition, two points should be noted when using the TextRank algorithm. First, initialization: the initial score of every node is generally set to 1. Second, convergence: a typical convergence threshold is 0.0001, i.e., iteration stops once the score change of every node in the graph is below 0.0001.
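The iteration of step (4), with the conventional initial score of 1, damping d = 0.85, and convergence threshold 0.0001, can be sketched as:

```python
def textrank_scores(weights, d=0.85, tol=1e-4, max_iter=100):
    """weights[j][i] is the edge weight between nodes j and i.

    Iterates WS(V_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(V_j)
    until every node's score changes by less than `tol`.
    """
    n = len(weights)
    scores = [1.0] * n                       # all nodes initialized to 1
    out_sum = [sum(row) for row in weights]  # total outgoing weight per node
    for _ in range(max_iter):
        new = []
        for i in range(n):
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if j != i and out_sum[j] > 0
            )
            new.append((1 - d) + d * rank)
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores
```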
(5) Sorting all nodes according to their scores, and extracting text units from the top-ranked nodes at the set extraction ratio to form a preliminary abstract text.
It should be noted that the extraction ratio is set according to actual needs; this removes spoken expressions and redundant information from the answer text and helps ensure accurate extraction.
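Step (5), keeping the top-scoring units at a set ratio while preserving document order, might be sketched as:

```python
def extract_top_units(units, scores, ratio=0.3):
    """Keep the highest-scoring text units, preserving document order."""
    k = max(1, int(len(units) * ratio))            # keep at least one unit
    top = sorted(range(len(units)), key=lambda i: -scores[i])[:k]
    return [units[i] for i in sorted(top)]         # restore original order
```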
It should be noted that the TextRank algorithm is a classic method for extracting keywords and abstract sentences from text; it is an unsupervised, graph-based algorithm. In this embodiment, TextRank ranks the keywords and key sentences in the text using the PageRank algorithm.
For example, to calculate the similarity of sentence S_i and sentence S_j, a weight graph as shown in FIG. 3 is built: node V_i represents sentence S_i and node V_j represents sentence S_j. The weight of the edge between node V_j and node V_k is denoted w_jk, and the similarity of node V_j and node V_{k+1} gives the weight w_{j,k+1}; both are obtained from the similarity formula. The TextRank score of node V_i can then be computed from the score formula, in which the sum w_jk + w_{j,k+1} is the total outgoing edge weight of node V_j:
it should be noted that the TextRank algorithm is an unsupervised method for extracting keywords and key sentences. The method has the advantages that a corpus does not need to be trained, the method can be well used for texts with contents in different fields, linguistic knowledge or domain knowledge does not need to be considered, and the overall structure of the texts is comprehensively considered. The disadvantage is that the TextRank algorithm only considers the global information of the text, and lacks the self characteristic information of the text unit, such as the position of a sentence, the position of a term, and the like.
In practical application, in a question-answer pair of a non-factual question-answering system, the answer text is much longer than the question, and with a single text abstract extraction method a low extraction ratio easily causes topic drift. As shown in fig. 4, when extracting the answer text abstract, the first and last sentences of the answer text are retained, and the abstract extraction method is then applied to the rest. The characteristics of answer text in question answering show that the first sentence of an answer is generally a brief restatement of the question followed by a solution method, while the end of an answer is typically a brief summary of the answer content. Therefore, retaining the first and last sentences of the answer text when extracting the answer abstract ensures the topical integrity of the abstract and further improves the accuracy of answer selection.
Meanwhile, relative to the original answer text, the extracted abstract removes spoken expressions and redundant information without practical meaning, yielding an efficient answer text representation; the neural network semantic representation model then produces semantic vectors that contain more of the key information.
As shown in fig. 5, the embodiment discloses a non-factual question-answer selection system based on a text abstract, which includes a first extraction module 10, a second extraction module 20, a combination module 30, a matching module 40 and a determination module 50, which are connected in sequence;
the first extraction module 10 is configured to extract a first sentence and a last sentence of the answer text to be selected;
a second extraction module 20, configured to extract the abstracts of the remaining texts of the answer text to be selected, except for the first sentence and the last sentence, by using a text abstraction model TextRank, so as to obtain a preliminary text abstraction;
the combination module 30 is configured to sequentially combine the first sentence, the preliminary text abstract, and the last sentence to obtain an answer text abstract to be selected;
the matching module 40 is configured to use the question and the to-be-selected answer text abstract as inputs of a neural network semantic representation model to obtain semantic correlation degrees of the question and the to-be-selected answer text abstract;
and the determining module 50 is used for returning the answer text abstract with the highest semantic relevance degree with the question as the answer.
As a further preferred scheme, the first extraction module 10 is specifically configured to:
and extracting the first sentence and the tail sentence of the answer text to be selected according to the positions of the first sentence and the tail sentence in the answer text to be selected.
As a further preferred scheme, the second extraction module 20 includes a segmentation unit, a filtering unit, a weight map model construction unit, a similarity calculation unit, an iteration unit, a sorting unit, and a composition unit, which are connected in sequence;
the segmentation unit is used for segmenting the answer text to be selected into sentences and segmenting each sentence;
the filtering unit is used for labeling the part of speech of each word and filtering the labeled words to retain only terms with specific parts of speech;
the weight graph model building unit is used for taking the terms or sentences of the specific words as text units, forming the text units into nodes, and forming edges between the nodes by the similarity between the text units to obtain a weight graph model;
the similarity calculation unit is used for calculating the similarity of any two nodes and taking the similarity value as a calculation parameter of the node weight calculation formula;
the iteration unit is used for iterating the node weight calculation formula until convergence is achieved, and obtaining the score result of each node;
the sorting unit is used for sorting the nodes according to scores among the nodes during convergence to obtain the sorted nodes;
and the composition unit is used for extracting text units from the sorted nodes at the set extraction ratio to form a preliminary text abstract.
As a further preferable aspect, the similarity calculation method adopted by the similarity calculation unit includes: a vocabulary overlap method, a character string method, a cosine similarity method and a maximum common subsequence method.
It should be understood that the text abstract-based non-factual question-answer selection system of this embodiment implements the processes of fig. 1 and has the same technical features and effects as the text abstract-based non-factual question-answer selection method of this embodiment, which are not described again here.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (6)
1. A non-factual question answer selection method based on text abstract is characterized by comprising the following steps:
extracting a first sentence and a last sentence of an answer text to be selected;
extracting the abstracts of the remaining texts of the answer text to be selected except the first sentence and the last sentence by using a text abstract model TextRank to obtain a preliminary text abstract, which comprises the following steps:
dividing the answer text to be selected into sentences, and segmenting each sentence;
labeling the part of speech of each word, and filtering the labeled words to retain only terms with specific parts of speech;
taking the terms or sentences of the specific words as text units, forming nodes by the text units, and forming edges between the nodes by the similarity between the text units to obtain a weight graph model;
calculating the similarity of any two nodes, and taking the similarity value as a calculation parameter of a node weight calculation formula;
iterating the node weight calculation formula until convergence is achieved to obtain a score result of each node;
according to the scores among all the nodes during convergence, all the nodes are sorted to obtain the sorted nodes;
extracting text units from the sorted nodes according to a set extraction ratio to form a preliminary text abstract;
sequentially combining the first sentence, the preliminary text abstract and the last sentence to obtain an answer text abstract to be selected;
taking the question and the answer text abstract to be selected as input of a neural network semantic representation model to obtain semantic correlation degree of the question and the answer text abstract to be selected;
and returning the answer text abstract with the highest semantic relevance to the question as the answer.
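The extraction pipeline of claim 1 can be illustrated with the following minimal sketch. This is a hedged approximation, not the patented implementation: it assumes whitespace tokenization and skips part-of-speech filtering, and the vocabulary-overlap similarity and damping factor d = 0.85 follow the original TextRank paper, details the claim itself does not specify.

```python
import math

def sentence_similarity(a, b):
    # Vocabulary-overlap similarity from the original TextRank paper:
    # |A ∩ B| / (log|A| + log|B|), over whitespace-tokenized word sets.
    wa, wb = set(a.split()), set(b.split())
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank_summary(sentences, ratio=0.3, d=0.85, tol=1e-6):
    """Score sentence nodes by iterating the node-weight formula to
    convergence, then keep the top fraction in original order."""
    n = len(sentences)
    if n == 0:
        return ""
    # Edges: pairwise similarity between sentence nodes (weighted graph).
    sim = [[0.0 if i == j else sentence_similarity(sentences[i], sentences[j])
            for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in sim]
    scores = [1.0] * n
    while True:
        new = [(1 - d) + d * sum(sim[j][i] / out_sums[j] * scores[j]
                                 for j in range(n) if out_sums[j] > 0)
               for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    # Sort nodes by score and extract the set extraction ratio.
    k = max(1, int(n * ratio))
    top = sorted(sorted(range(n), key=scores.__getitem__, reverse=True)[:k])
    return " ".join(sentences[i] for i in top)

def build_candidate_summary(answer_sentences, ratio=0.3):
    # First sentence + preliminary TextRank abstract of the middle + last
    # sentence, combined in order, as in claim 1.
    first, last = answer_sentences[0], answer_sentences[-1]
    middle = answer_sentences[1:-1]
    core = textrank_summary(middle, ratio) if middle else ""
    return " ".join(s for s in (first, core, last) if s)
```

The first/last sentences bypass ranking entirely, which is the claimed mechanism for preserving topic completeness of the abstract.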
2. The method for selecting a non-factual question-answer based on a text abstract as claimed in claim 1, wherein said extracting the first sentence and the last sentence of the answer text to be selected comprises:
and extracting the first sentence and the last sentence of the answer text to be selected according to their positions in the answer text to be selected.
3. The method for selecting a non-factual question-answer based on a text abstract as claimed in claim 1, wherein the method for calculating the similarity between any two nodes comprises: a vocabulary overlap method, a character string method, a cosine similarity method and a longest common subsequence method.
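Claim 3 names the similarity measures without giving formulas. Two of them, cosine similarity over term-frequency vectors and the longest common subsequence over word sequences, can be sketched as follows; this is an illustrative assumption about the intended definitions, since the patent does not specify tokenization or weighting.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Cosine over term-frequency vectors of the two text units.
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def lcs_length(a, b):
    # Longest common subsequence over word sequences (dynamic programming).
    wa, wb = a.split(), b.split()
    dp = [[0] * (len(wb) + 1) for _ in range(len(wa) + 1)]
    for i, x in enumerate(wa, 1):
        for j, y in enumerate(wb, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j],
                                                               dp[i][j - 1])
    return dp[-1][-1]
```

Either value can serve as the edge weight on the graph of claim 1, normalized as the node-weight formula requires.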
4. A non-factual question-answer selection system based on a text abstract is characterized by comprising a first extraction module, a second extraction module, a combination module, a matching module and a determination module which are connected in sequence;
the first extraction module is used for extracting a first sentence and a last sentence of the answer text to be selected;
the second extraction module is used for performing abstract extraction, with the text abstract model TextRank, on the remaining text of the answer text to be selected excluding the first sentence and the last sentence, to obtain a preliminary text abstract; it comprises a segmentation unit, a filtering unit, a weighted graph model construction unit, a similarity calculation unit, an iteration unit, a sorting unit and a composition unit which are connected in sequence;
the segmentation unit is used for segmenting the answer text to be selected into sentences and segmenting each sentence;
the filtering unit is used for labeling the part of speech of each word and filtering the labeled words to retain terms of the specified parts of speech;
the weighted graph model construction unit is used for taking the retained terms or the sentences as text units, forming nodes from the text units, and forming edges between nodes from the similarity between text units, to obtain a weighted graph model;
the similarity calculation unit is used for calculating the similarity of any two nodes and taking the similarity value as a calculation parameter of the node weight calculation formula;
the iteration unit is used for iterating the node weight calculation formula until convergence is achieved, and obtaining the score result of each node;
the sorting unit is used for sorting all the nodes according to their scores at convergence to obtain the sorted nodes;
the composition unit is used for extracting text units from the sorted nodes according to a set extraction ratio to form a preliminary text abstract;
the combination module is used for sequentially combining the first sentence, the preliminary text abstract and the last sentence to obtain an answer text abstract to be selected;
the matching module is used for taking the question and the answer text abstract to be selected as the input of a neural network semantic representation model to obtain the semantic correlation degree of the question and the answer text abstract to be selected;
and the determining module is used for returning the answer text abstract with the highest semantic relevance degree with the question as an answer.
5. The system of claim 4, wherein the first extraction module is specifically configured to:
and extracting the first sentence and the last sentence of the answer text to be selected according to their positions in the answer text to be selected.
6. The system for selecting a non-factual question-answer based on a text abstract according to claim 4, wherein the similarity calculation method employed by the similarity calculation unit comprises: a vocabulary overlap method, a character string method, a cosine similarity method and a longest common subsequence method.
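The matching step in claims 1 and 4 relies on an unspecified "neural network semantic representation model". As a stand-in, the sketch below scores question/abstract pairs with a mean-pooled embedding encoder and cosine similarity; `embed_word` is a hypothetical placeholder (deterministic pseudo-random vectors so that identical words always match), not the patent's trained model.

```python
import hashlib
import math
import random

DIM = 16
_cache = {}

def embed_word(word):
    # Hypothetical placeholder for learned word embeddings: a deterministic
    # pseudo-random vector per word (seeded from an MD5 digest so results
    # are stable across runs).
    if word not in _cache:
        seed = int(hashlib.md5(word.encode()).hexdigest(), 16) % (2 ** 32)
        rng = random.Random(seed)
        _cache[word] = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    return _cache[word]

def encode(text):
    # Mean-pool word vectors into a fixed-size semantic representation.
    vecs = [embed_word(w) for w in text.split()] or [[0.0] * DIM]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def semantic_relevance(question, abstract):
    # Cosine similarity between the two pooled representations.
    q, a = encode(question), encode(abstract)
    dot = sum(x * y for x, y in zip(q, a))
    nq = math.sqrt(sum(x * x for x in q))
    na = math.sqrt(sum(x * x for x in a))
    return dot / (nq * na) if nq and na else 0.0

def select_answer(question, candidate_abstracts):
    # Return the candidate abstract with the highest semantic relevance,
    # as the determining module of claim 4 does.
    return max(candidate_abstracts, key=lambda c: semantic_relevance(question, c))
```

In the patented system the encoder would be a trained neural network; only the scoring-and-argmax structure is taken from the claims.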
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810428163.8A CN108681574B (en) | 2018-05-07 | 2018-05-07 | Text abstract-based non-fact question-answer selection method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108681574A CN108681574A (en) | 2018-10-19 |
CN108681574B true CN108681574B (en) | 2021-11-05 |
Family
ID=63801897
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810428163.8A Active CN108681574B (en) | 2018-05-07 | 2018-05-07 | Text abstract-based non-fact question-answer selection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108681574B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543089A (en) * | 2018-11-30 | 2019-03-29 | 南方电网科学研究院有限责任公司 | Classification method, system and related device of network security information data |
CN109766418B (en) * | 2018-12-13 | 2021-08-24 | 北京百度网讯科技有限公司 | Method and apparatus for outputting information |
CN109902284A (en) * | 2018-12-30 | 2019-06-18 | 中国科学院软件研究所 | A kind of unsupervised argument extracting method excavated based on debate |
CN109829052A (en) * | 2019-02-19 | 2019-05-31 | 田中瑶 | A kind of open dialogue method and system based on human-computer interaction |
CN110674286A (en) * | 2019-09-29 | 2020-01-10 | 出门问问信息科技有限公司 | Text abstract extraction method and device and storage equipment |
CN111241288A (en) * | 2020-01-17 | 2020-06-05 | 烟台海颐软件股份有限公司 | Emergency sensing system of large centralized power customer service center and construction method |
CN111401033B (en) | 2020-03-19 | 2023-07-25 | 北京百度网讯科技有限公司 | Event extraction method, event extraction device and electronic equipment |
CN113806500B (en) * | 2021-02-09 | 2024-05-28 | 京东科技控股股份有限公司 | Information processing method, device and computer equipment |
CN113282711B (en) * | 2021-06-03 | 2023-09-22 | 中国软件评测中心(工业和信息化部软件与集成电路促进中心) | Internet of vehicles text matching method and device, electronic equipment and storage medium |
CN113688231A (en) * | 2021-08-02 | 2021-11-23 | 北京小米移动软件有限公司 | Abstract extraction method and device of answer text, electronic equipment and medium |
CN113918702B (en) * | 2021-10-25 | 2022-07-01 | 北京航空航天大学 | Semantic matching-based online law automatic question-answering method and system |
CN114997175B (en) * | 2022-05-16 | 2024-06-18 | 电子科技大学 | Emotion analysis method based on domain countermeasure training |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104679728A (en) * | 2015-02-06 | 2015-06-03 | 中国农业大学 | Text similarity detection device |
CN104699763A (en) * | 2015-02-11 | 2015-06-10 | 中国科学院新疆理化技术研究所 | Text similarity measuring system based on multi-feature fusion |
CN106126492A (en) * | 2016-06-07 | 2016-11-16 | 北京高地信息技术有限公司 | Statement recognition methods based on two-way LSTM neutral net and device |
CN106202042A (en) * | 2016-07-06 | 2016-12-07 | 中央民族大学 | A kind of keyword abstraction method based on figure |
CN106844368A (en) * | 2015-12-03 | 2017-06-13 | 华为技术有限公司 | For interactive method, nerve network system and user equipment |
CN107562792A (en) * | 2017-07-31 | 2018-01-09 | 同济大学 | A kind of question and answer matching process based on deep learning |
CN107590163A (en) * | 2016-07-06 | 2018-01-16 | 北京京东尚科信息技术有限公司 | The methods, devices and systems of text feature selection |
CN107832457A (en) * | 2017-11-24 | 2018-03-23 | 国网山东省电力公司电力科学研究院 | Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm |
CN107980130A (en) * | 2017-11-02 | 2018-05-01 | 深圳前海达闼云端智能科技有限公司 | It is automatic to answer method, apparatus, storage medium and electronic equipment |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2544324A1 (en) * | 2005-06-10 | 2006-12-10 | Unicru, Inc. | Employee selection via adaptive assessment |
US10431205B2 (en) * | 2016-04-27 | 2019-10-01 | Conduent Business Services, Llc | Dialog device with dialog support generated using a mixture of language models combined using a recurrent neural network |
2018-05-07 CN CN201810428163.8A patent/CN108681574B/en active Active
Non-Patent Citations (2)
Title |
---|
"Intent Identification for Knowledge Base Question Answering";Feifei Dai 等;《2017 Conference on Technologies and Applications of Artificial Intelligence (TAAI)》;20171203;第96-99页 * |
"基于卷积神经网络的自动问答";金丽娇 等;《华东师范大学学报(自然科学版)》;20171006;第66-79页 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108681574B (en) | Text abstract-based non-fact question-answer selection method and system | |
CN109408642B (en) | Domain entity attribute relation extraction method based on distance supervision | |
US11775760B2 (en) | Man-machine conversation method, electronic device, and computer-readable medium | |
CN109190117B (en) | Short text semantic similarity calculation method based on word vector | |
JP6309644B2 (en) | Method, system, and storage medium for realizing smart question answer | |
JP6813591B2 (en) | Modeling device, text search device, model creation method, text search method, and program | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
Brodsky et al. | Characterizing motherese: On the computational structure of child-directed language | |
CN109960786A (en) | Chinese Measurement of word similarity based on convergence strategy | |
US10496756B2 (en) | Sentence creation system | |
CN111177365A (en) | Unsupervised automatic abstract extraction method based on graph model | |
CN104391942A (en) | Short text characteristic expanding method based on semantic atlas | |
CN109783806B (en) | Text matching method utilizing semantic parsing structure | |
CN109271524B (en) | Entity linking method in knowledge base question-answering system | |
Sun et al. | Mining dependency relations for query expansion in passage retrieval | |
CN110188174B (en) | Professional field FAQ intelligent question and answer method based on professional vocabulary mining | |
CN105528437A (en) | Question-answering system construction method based on structured text knowledge extraction | |
Chen et al. | Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features | |
CN107844608B (en) | Sentence similarity comparison method based on word vectors | |
CN112417170B (en) | Relationship linking method for incomplete knowledge graph | |
CN109508460A (en) | Unsupervised composition based on Subject Clustering is digressed from the subject detection method and system | |
Ruiz-Casado et al. | Using context-window overlapping in synonym discovery and ontology extension | |
CN116227466A (en) | Sentence generation method, device and equipment with similar semantic different expressions | |
CN110059318B (en) | Discussion question automatic evaluation method based on Wikipedia and WordNet | |
CN109002540B (en) | Method for automatically generating Chinese announcement document question answer pairs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||