CN110688472A - Method for automatically screening answers to questions, terminal equipment and storage medium - Google Patents

Method for automatically screening answers to questions, terminal equipment and storage medium

Info

Publication number
CN110688472A
CN110688472A
Authority
CN
China
Prior art keywords
similarity
word
data
question data
questions
Prior art date
Legal status
Pending
Application number
CN201910954584.9A
Other languages
Chinese (zh)
Inventor
刘继明
肖肇宇
谭云丹
高力伟
Current Assignee
Xiamen Jincun Technology Co Ltd
Original Assignee
Xiamen Jincun Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Jincun Technology Co Ltd filed Critical Xiamen Jincun Technology Co Ltd
Priority to CN201910954584.9A
Publication of CN110688472A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems

Abstract

The invention relates to a method for automatically screening answers to questions, a terminal device and a storage medium, wherein the method comprises the following steps: preprocessing the question data to be answered; performing a synonym normalization operation on the preprocessed question data; determining the question data in the database that ranks first overall according to the first similarity, the second similarity and the third similarity between the question data to be answered and the questions in the database; and taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered. The invention addresses the problem that most current word-embedding-based similarity algorithms ignore the word order and semantics of sentences: it introduces a second similarity based on word-order features and a third similarity based on semantic dependency, which can represent the hidden word-order and semantic associations between texts, improving the accuracy and stability of the similarity calculation and the overall performance.

Description

Method for automatically screening answers to questions, terminal equipment and storage medium
Technical Field
The present invention relates to the field of automatically screening answers to questions, and in particular, to a method, a terminal device, and a storage medium for automatically screening answers to questions.
Background
Automatic question answering aims to answer questions posed by users automatically through a computer, meeting the user's knowledge needs and resolving the user's questions. When answering a user's question, an automatic question-answering system must correctly understand the natural-language question entered by the user, analyze and extract the key information in it, match that information to the most appropriate answer in a preset corpus, question-answer database or knowledge base, and return the answer to the user.
At present, the mainstream approach to ranking candidate questions is based on text similarity. Text similarity calculation has several mainstream branches; among them, the approach of obtaining word vectors by word embedding and composing them into sentence vectors is widely applied because it is simple and efficient. However, this approach often considers only the overlap of components in the text and ignores word order and semantics, which leads to misreading of the text. For example, the two texts "he is my ancestor" and "I am his ancestor" have completely different meanings, yet many similarity algorithms judge their similarity to be high.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method, a terminal device and a storage medium for automatically screening answers to questions.
The specific scheme is as follows:
a method of automatically screening answers to questions, comprising the steps of:
s1: preprocessing the question data to be answered, wherein the preprocessing comprises word segmentation and stop-word removal;
s2: replacing all words with the same or similar meaning in the preprocessed question data with one and the same word, the same word being a word whose meaning is identical or similar to that of the replaced words;
s3: calculating the first similarity between the question data to be answered and every question in the database, sorting the first similarity values of all the questions in descending order, and screening out the N question data with the largest first similarity values;
s4: calculating the second similarity between each of the N question data and the question data to be answered, and ranking the N question data by second similarity;
s5: calculating the third similarity between each of the N question data and the question data to be answered, and ranking the N question data by third similarity;
s6: determining the finally first-ranked question data from the values and rankings of the first similarity, second similarity and third similarity of each of the N question data;
s7: taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered.
Further, the method for calculating the first similarity comprises:
s31: calculating a text vector for each of the two question data to be compared;
s32: combining the text vectors of the two question data into a text matrix X, and performing singular value decomposition on the text matrix X into three matrices U, Σ and Vᵀ:
X = U·Σ·Vᵀ
and calculating the matrix Y obtained by removing the principal component from the text matrix X:
Y = X - X·V·Vᵀ
where V is the transpose of Vᵀ;
s33: extracting the vectors corresponding to the two question data from the matrix Y as their preferred text vectors, and calculating the similarity between the two question data from the preferred text vector of each question data as the first similarity.
Further, the text vector of each question data in step S31 is calculated as follows:
s311: calculating the word vector of each word in the question data;
s312: calculating the weight of each word in the question data;
s313: calculating the text vector of the question data from the word vector of each word and its weight.
Further, the weight W_ω of the word ω in step S312 is calculated as:
W_ω = α / (α + P(ω))
where α is a tuning constant and P(ω) is the frequency with which the word ω appears in the corpus.
Further, the text vector of the question data in step S313 is calculated as:
K_S = (1/n) · Σ_{i=1..n} W_i·V_i
where K_S is the text vector of the question data S, i denotes the ith word in the question data, n is the number of words in the question data, W_i is the weight of the ith word and V_i is the word vector of the ith word.
Further, the first similarity S_y in step S33 is calculated as:
S_y = (K_i·K_j) / (|K_i|·|K_j|)
where K_i and K_j are the preferred text vectors of the two question data and |·| denotes the modulus (norm) of a vector.
Further, the method for calculating the second similarity comprises:
s41: for the two question data to be compared, combining all the words in the two question data into one word set and assigning each word in the word set a unique number;
s42: constructing a feature vector for each of the two question data;
the feature vector of each question data is constructed as follows: for each word in the word set, search the question for a word with the same or similar meaning; if such a word exists, set the element of the feature vector at the same position as the word in the word set to the unique number of the word; if no word with the same or similar meaning exists, set that element of the feature vector to 0;
s43: calculating the second similarity of the two question data from the feature vectors.
Further, the specific calculation formula in step S43 is:
S_r = 1 - |F_1 - F_2| / |F_1 + F_2|
where S_r is the second similarity, |·| denotes the modulus of a vector, and F_1 and F_2 are the feature vectors of the two question data.
Further, the method for calculating the third similarity comprises:
s51: dividing all the words in the two question data into three categories: core words, keywords and other words;
s52: dividing each question data into three phrases according to word category: a core phrase, a key phrase and an other phrase;
s53: calculating the core-word similarity between the core phrases of the two question data, the keyword similarity between their key phrases and the other-word similarity between their other phrases;
s54: calculating the third similarity of the two question data from their core-word similarity, keyword similarity and other-word similarity.
Further, in step S53 all three similarities are calculated by the same formula:
sim = ( Σ_{i=1..m} max_{1≤p≤n} Sim(V_i, V_p) + Σ_{j=1..n} max_{1≤q≤m} Sim(V_j, V_q) ) / (m + n)
where sim denotes the phrase similarity, the function Sim gives the similarity between two vectors, i indexes the words in the phrase of the first question data, j indexes the words in the phrase of the second question data, m and n are the total numbers of words in the phrases of the first and second question data respectively, p and q are word serial numbers, and V_i, V_j, V_p and V_q are the word vectors with the corresponding serial numbers.
Further, the specific process of step S6 is:
s61: calculating the total rank of each question data:
rank = α·rank_vec + β·rank_wo + γ·rank_sdp
where α, β and γ are weight adjusting parameters with α + β + γ = 1, and rank_vec, rank_wo and rank_sdp are the rankings under the first, second and third similarity respectively;
s62: judging whether more than one question data shares the first total rank; if so, proceed to S63, otherwise take the question data with the first total rank as the finally first-ranked question data and end;
s63: calculating the total similarity sim of each of the question data tied for the first total rank, taking the question data with the largest total similarity as the finally first-ranked question data, and ending;
sim = α·S_y + β·S_r + γ·S_l
where S_y is the first similarity, S_r the second similarity and S_l the third similarity.
The invention further provides a second method for automatically screening answers to questions, comprising the following steps:
s1: preprocessing the question data to be answered, wherein the preprocessing comprises word segmentation and stop-word removal;
s2: replacing all words with the same or similar meaning in the preprocessed question data with one and the same word, the same word being a word whose meaning is identical or similar to that of the replaced words;
s3: calculating the first similarity between the question data to be answered and every question in the database, sorting the first similarity values in descending order, and screening out a number of question data with the largest first similarity values;
s4: calculating the second similarity between the question data to be answered and every question in the database, sorting the second similarity values in descending order, and screening out a number of question data with the largest second similarity values;
s5: calculating the third similarity between the question data to be answered and every question in the database, sorting the third similarity values in descending order, and screening out a number of question data with the largest third similarity values;
s6: determining the finally first-ranked question data from the values and rankings of the first, second and third similarities of all the question data screened out in S3, S4 and S5;
s7: taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered.
A terminal device for automatically screening answers to questions, comprising a processor, a memory and a computer program stored in the memory and operable on the processor, wherein the processor executes the computer program to implement the steps of the method of the embodiment of the present invention.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to an embodiment of the invention as described above.
By adopting the above technical scheme, the invention addresses the problem that most current word-embedding-based similarity algorithms ignore the word order and semantics of sentences: it introduces a second similarity based on word-order features and a third similarity based on semantic dependency, which can represent the hidden word-order and semantic associations between texts, improving the accuracy and stability of the similarity calculation and the overall performance.
Drawings
Fig. 1 is a flowchart illustrating a first embodiment of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
The first embodiment is as follows:
the embodiment of the invention provides a method for automatically screening answers to questions, which comprises the following steps as shown in figure 1:
s1: and preprocessing the data of the questions to be answered.
The preprocessing in this embodiment includes, but is not limited to, word segmentation and stop-word removal.
Word segmentation recombines the question data (a sentence or a text) into a sequence of words according to a given standard; an existing segmentation algorithm is used for this step.
The segmentation result may contain characters or words with no specific meaning, such as modal particles and punctuation (commas, periods, etc.), which would interfere with extracting the actual content of the question, so these stop words need to be removed.
In this embodiment, stop words are removed by means of a stop-word dictionary: common stop words are added to the dictionary, and in special application scenarios its contents can be added to or deleted manually. In other embodiments, stop words may be removed in other ways, which is not limited here.
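As a minimal illustration of this step (not part of the patent text), the following Python sketch uses the jieba segmenter and a small hypothetical stop-word set; a real system would load a curated stop-word dictionary:

```python
import jieba  # a widely used Chinese word-segmentation library

# Hypothetical stop-word dictionary; in practice loaded from a curated file
# that can be extended or trimmed for special application scenarios.
STOP_WORDS = {"的", "吗", "呢", "，", "。", "？"}

def preprocess(question: str) -> list:
    """Segment the question into words and remove stop words."""
    words = jieba.lcut(question)  # word segmentation
    return [w for w in words if w.strip() and w not in STOP_WORDS]
```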
S2: performing a synonym normalization operation on the preprocessed question data to be answered, i.e., replacing all words with the same or similar meaning in the preprocessed data with one and the same word, the same word being a word whose meaning is identical or similar to that of the replaced words.
In this embodiment a synonym dictionary is provided, in which all words with the same or similar meanings point to one word that shares their meaning; for example, the three expressions "television set", "television" and "TV" all point to the same word, and the pointed-to word may be any one of these synonyms.
The synonym normalization operation reduces errors in the subsequent similarity calculation.
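A sketch of the synonym normalization, assuming a hand-built synonym dictionary that maps every surface form to its canonical word (the dictionary entries here are illustrative):

```python
# Hypothetical synonym dictionary: every same- or similar-meaning word
# points to one canonical word, e.g. all forms of "television" to "电视".
SYNONYM_DICT = {"电视机": "电视", "TV": "电视"}

def normalize_synonyms(words: list) -> list:
    """Replace each word that has a dictionary entry with its canonical form."""
    return [SYNONYM_DICT.get(w, w) for w in words]
```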
S3: calculating the first similarity between the question data to be answered and every question stored in the database, sorting the first similarity values of all the questions in descending order, and screening out the N question data with the largest first similarity values.
The database stores a large number of questions and their corresponding answers; once any question in the database is selected, its answer can be retrieved. The contents of the database can be extended or reduced as needed.
In this embodiment, the method for calculating the first similarity includes the following steps:
s31: for each pair of question data to be compared, calculating the text vector of each question data; each text vector is computed as follows:
s311: a word vector for each word in the problem data is calculated.
The word vectors can be generated by an algorithm chosen according to the requirements of the specific scenario, such as Word2Vec or FastText.
S312: the weight of the word in the question data is calculated.
In this embodiment, the weight W_ω of the word ω is calculated as:
W_ω = α / (α + P(ω))
where α is an adjustment constant set empirically by those skilled in the art, and P(ω) is the frequency with which the word ω appears in the corpus, i.e., its word frequency.
S313: and calculating a text vector of the question data according to the word vector of each word in the question data and the weight of the word in the question data.
In this embodiment, the text vector K_S of the question data S is calculated as:
K_S = (1/n) · Σ_{i=1..n} W_i·V_i
where i denotes the ith word (the word serial number), W_i is the weight of the ith word in the question data S, n is the number of words in the question data, and V_i is the word vector of the ith word in the question data S.
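The weighting and averaging above can be sketched as follows, assuming pre-trained word vectors (e.g. from a gensim Word2Vec model) are available as a dict and corpus frequencies P(ω) as another dict; the 1/n averaging follows the formula reconstructed above:

```python
import numpy as np

def text_vector(words, word_vecs, word_freq, alpha=1e-3):
    """Weighted average of word vectors with W_i = alpha / (alpha + P(w_i))."""
    vecs, weights = [], []
    for w in words:
        if w in word_vecs:                      # skip out-of-vocabulary words
            vecs.append(word_vecs[w])
            weights.append(alpha / (alpha + word_freq.get(w, 0.0)))
    if not vecs:
        return None
    V = np.asarray(vecs)                        # n x d matrix of word vectors
    W = np.asarray(weights)                     # n word weights
    return (W[:, None] * V).sum(axis=0) / len(vecs)
```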
S32: combining the text vectors of the two question data into a text matrix X, and performing singular value decomposition on the text matrix X into three matrices U, Σ and Vᵀ:
X = U·Σ·Vᵀ
The matrix Vᵀ obtained in this way captures the principal component of the text matrix X, which usually corresponds to the theme the two texts share. The matrix Y is then obtained by removing this principal component from X; discarding it removes meaningless information, i.e., information that has little influence on the similarity calculation, and thereby increases the calculation accuracy.
Y = X - X·V·Vᵀ
where V is the transpose of Vᵀ;
s33: extracting the vectors corresponding to the two question data from the matrix Y as their preferred text vectors, and calculating the similarity between the two question data from the preferred text vector of each question data as the first similarity.
The first similarity S_y adopted in this embodiment is calculated as:
S_y = (K_i·K_j) / (|K_i|·|K_j|)
where S_y is the first similarity of the two text data, K_i and K_j are the preferred text vectors of the two text data, and |·| denotes the modulus of a vector.
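A numpy sketch of steps S32-S33 for one pair of text vectors. Note an assumption: Y = X - X·V·Vᵀ is read here as removing only the first principal component (keeping every right-singular vector in V would reduce X·V·Vᵀ to X itself and make Y zero):

```python
import numpy as np

def first_similarity(k1, k2):
    """Remove the shared principal component of two text vectors via SVD,
    then return the cosine similarity of the residual (preferred) vectors."""
    X = np.vstack([k1, k2])                     # 2 x d text matrix
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v = Vt[:1]                                  # first right-singular vector
    Y = X - X @ v.T @ v                         # strip the principal component
    y1, y2 = Y[0], Y[1]                         # preferred text vectors
    denom = np.linalg.norm(y1) * np.linalg.norm(y2)
    return float(y1 @ y2 / denom) if denom else 0.0
```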
S4: calculating the second similarity between each of the N question data and the question data to be answered, and ranking the N question data by second similarity.
The calculation of the second similarity comprehensively considers the influence of both the order and the semantics of the words.
In this embodiment, the method for calculating the second similarity includes the following steps:
s41: for the two question data to be compared, combining all the words in the two question data into one word set and assigning each word in the word set a unique number.
In this embodiment, let T be the word set formed from all the words of the two question data. Suppose the question data S1 contains the words {you, me, he} and the question data S2 contains the words {you, me, they}; then the word set T is {you, me, he, they}, and the unique numbers assigned to its words are: "you" is a, "me" is b, "he" is c, and "they" is d.
S42: respectively constructing a feature vector of each problem data in the two problem data;
the construction method of the feature vector of each problem datum comprises the following steps: for each word in the word set, searching whether a word with the same or similar meaning to the word exists in the problem, and if the word with the same or similar meaning exists, setting the value of an element, which is positioned at the same sequence in the word set as the word, in the feature vector as the unique number of the word; if there is no word of the same or similar meaning, the value of the element in the feature vector that is in the same sequence as the word in the set of words is set to 0.
In this embodiment, the feature vector F_1 of S1 is built as follows: the 1st word of the word set T is "you", and S1 contains the same word, so the 1st element of F_1 is set to a; the 2nd word is "me", and S1 contains the same word, so the 2nd element is set to b; the 3rd word is "he", and S1 contains the same word, so the 3rd element is set to c; the 4th word is "they", and S1 contains no identical word but does contain the similar word "he", so the 4th element is set to d. The feature vector F_1 is therefore {a, b, c, d}.
The feature vector F_2 of S2 is constructed in the same way and, in this embodiment, is also {a, b, c, d}.
In this embodiment, two words are considered similar in meaning when the similarity between them exceeds a preset threshold.
S43: a second similarity of the two problem data is calculated based on the feature vectors.
In this embodiment, the second similarity S_r of the two question data is calculated as:
S_r = 1 - |F_1 - F_2| / |F_1 + F_2|
where |·| denotes the modulus of a vector and F_1 and F_2 are the feature vectors of the two question data.
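A sketch of the whole second-similarity computation. The exact formula did not survive in the source, so this uses the word-order similarity S_r = 1 - |F_1 - F_2| / |F_1 + F_2| reconstructed above; the similar() predicate stands in for the thresholded word-vector similarity and defaults to exact equality:

```python
import numpy as np

def second_similarity(words1, words2, similar=lambda a, b: a == b):
    """Number the joint word set, build one feature vector per question
    (0 where no same/similar word exists), and compare the vectors."""
    word_set = list(dict.fromkeys(words1 + words2))      # ordered, unique
    number = {w: i + 1 for i, w in enumerate(word_set)}  # unique numbering

    def feature(words):
        return np.array(
            [number[w] if any(similar(w, x) for x in words) else 0
             for w in word_set],
            dtype=float,
        )

    f1, f2 = feature(words1), feature(words2)
    denom = np.linalg.norm(f1 + f2)
    return 1.0 - np.linalg.norm(f1 - f2) / denom if denom else 1.0
```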
S5: calculating the third similarity between each of the N question data and the question data to be answered, and ranking the N question data by third similarity.
In this embodiment, the method for calculating the third similarity comprises the following steps:
s51: all words in the two question data are classified into three categories, namely core words, keywords and other words.
Semantic dependency analysis is performed on each question data to obtain its semantic dependency tree, and the words making up the question data are divided into core words, keywords and other words according to this tree.
The core word is the word at the root node of the semantic dependency tree; the keywords are the words at child nodes at distance 1 from the root node; the other words are all words other than the core word and the keywords.
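A sketch of this word classification, assuming the semantic dependency parse is available as a list of 1-based head indices (0 marking the root), as most dependency parsers emit:

```python
def classify_words(words, heads):
    """Split words into the core word (root of the dependency tree),
    keywords (children at distance 1 from the root) and other words."""
    root = heads.index(0)                 # position of the core word
    core = [words[root]]
    keywords = [w for w, h in zip(words, heads)
                if h == root + 1]         # heads are 1-based
    others = [w for i, (w, h) in enumerate(zip(words, heads))
              if i != root and h != root + 1]
    return core, keywords, others
```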
S52: dividing each question data into three phrases according to word category: all the core words of a question data form its core phrase, all its keywords form its key phrase, and all its other words form its other phrase.
S53: calculating the core-word similarity between the core phrases of the two question data, the keyword similarity between their key phrases and the other-word similarity between their other phrases.
In this embodiment the core-word similarity, the keyword similarity and the other-word similarity are all computed by the same method:
sim = ( Σ_{i=1..m} max_{1≤p≤n} Sim(V_i, V_p) + Σ_{j=1..n} max_{1≤q≤m} Sim(V_j, V_q) ) / (m + n)
where sim denotes the phrase similarity and the function Sim gives the similarity between two vectors, here between two word vectors; i indexes the words in the phrase of the first question data and j those in the phrase of the second question data; m and n are the total numbers of words in the two phrases; p and q are word serial numbers; and V_i, V_j, V_p and V_q are the word vectors with the corresponding serial numbers.
In this embodiment, Sim(V_i, V_p) is calculated as the cosine similarity of the two word vectors:
Sim(V_i, V_p) = (V_i·V_p) / (|V_i|·|V_p|)
S54: calculating the third similarity of the two question data from their core-word similarity, keyword similarity and other-word similarity.
In this embodiment, the third similarity S_l is calculated as:
S_l = φ_1·sim_c + φ_2·sim_k + φ_3·sim_b
where φ_1, φ_2 and φ_3 are weight adjusting parameters with φ_1 + φ_2 + φ_3 = 1, and sim_c, sim_k and sim_b are the core-word similarity, the keyword similarity and the other-word similarity respectively.
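A sketch of steps S53-S54 under the bidirectional best-match formula reconstructed above, with cosine similarity for Sim and illustrative φ weights (the patent does not fix their values, only that they sum to 1):

```python
import numpy as np

def cosine(a, b):
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / d) if d else 0.0

def phrase_similarity(vecs1, vecs2):
    """Average of best cosine matches in both directions over m + n words."""
    if not vecs1 or not vecs2:
        return 0.0
    s = sum(max(cosine(v, u) for u in vecs2) for v in vecs1)
    s += sum(max(cosine(v, u) for u in vecs1) for v in vecs2)
    return s / (len(vecs1) + len(vecs2))

def third_similarity(sim_c, sim_k, sim_b, phi=(0.5, 0.3, 0.2)):
    """S_l = phi_1*sim_c + phi_2*sim_k + phi_3*sim_b, phi summing to 1."""
    return phi[0] * sim_c + phi[1] * sim_k + phi[2] * sim_b
```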
S6: determining the finally first-ranked question data from the values and rankings of the first similarity, second similarity and third similarity of each of the N question data.
The specific calculation process in this embodiment is:
s61: calculating the total rank of each question data:
rank = α·rank_vec + β·rank_wo + γ·rank_sdp
where α, β and γ are weight adjusting parameters with α + β + γ = 1, and rank_vec, rank_wo and rank_sdp are the rankings under the first, second and third similarity respectively.
S62: judging whether more than one question data shares the first total rank; if so, proceed to S63, otherwise take the question data with the first total rank as the finally first-ranked question data and end.
S63: calculating the total similarity sim of each of the question data tied for the first total rank, taking the question data with the largest total similarity as the finally first-ranked question data, and ending.
sim = α·S_y + β·S_r + γ·S_l
where S_y is the first similarity, S_r the second similarity and S_l the third similarity.
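The ranking-and-tie-break logic of S61-S63 can be sketched as follows; the candidate record layout and the α, β, γ values are illustrative:

```python
def pick_best(candidates, a=0.4, b=0.3, g=0.3):
    """candidates: dicts holding each candidate's similarities (S_y, S_r, S_l)
    and its ranks (1 = best) under the three similarities."""
    for c in candidates:
        c["rank"] = a * c["rank_vec"] + b * c["rank_wo"] + g * c["rank_sdp"]
    best = min(c["rank"] for c in candidates)
    tied = [c for c in candidates if c["rank"] == best]
    if len(tied) == 1:
        return tied[0]
    # Break ties by the weighted total similarity (larger is better).
    return max(tied, key=lambda c: a * c["S_y"] + b * c["S_r"] + g * c["S_l"])
```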
S7: taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered.
The embodiment of the invention addresses the problem that most current word-embedding-based similarity calculation methods ignore the word order and semantics of sentences: it introduces a second similarity based on word-order features and a third similarity based on semantic dependency, which can represent the hidden word-order and semantic associations between texts, improving the accuracy and stability of the similarity calculation and the overall performance.
In the method of this embodiment, the N candidate question data screened out by the first similarity are re-ranked by the second and third similarities to determine the best-matching question, which keeps the time and computation overhead low. In other embodiments, the candidate question data obtained separately by the first, second and third similarity algorithms may all be used to determine the best-matching question; this variant has a larger time and computation overhead but covers more candidates and is therefore more accurate.
It should be noted that when this second variant is adopted, the number of candidate question data obtained by each similarity algorithm may be the same or different, which is not limited here.
Example two:
the invention further provides a terminal device for automatically screening answers to questions, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the first method embodiment of the invention.
Further, as an executable scheme, the terminal device for automatically screening answers to questions may be a desktop computer, a notebook, a palmtop computer, a cloud server or other computing device. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the above structure is merely an example of the terminal device for automatically screening answers to questions and does not limit it; the device may include more or fewer components than described, combine certain components, or use different components; for example, it may further include input-output devices, network access devices, a bus, and the like, which is not limited in this embodiment of the invention.
Further, as an executable scheme, the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the terminal device for automatically screening answers to questions, and uses various interfaces and lines to connect the parts of the device.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the terminal device for automatically screening answers to questions by running or executing the computer program and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created according to use of the device. In addition, the memory may include high-speed random access memory and may also include non-volatile memory such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
The invention also provides a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method of an embodiment of the invention.
If the integrated modules/units of the terminal device for automatically screening answers to questions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method of the above embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and the like.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for automatically screening answers to questions, comprising the steps of:
s1: preprocessing the question data to be answered, wherein the preprocessing comprises word segmentation and stop-word removal;
s2: replacing all words with the same or similar meaning in the preprocessed question data with one and the same word, the same word being a word whose meaning is identical or similar to that of the replaced words;
s3: calculating the first similarity between the question data to be answered and every question in the database, sorting the first similarity values of all the questions in descending order, and screening out the N question data with the largest first similarity values;
s4: calculating the second similarity between each of the N question data and the question data to be answered, and ranking the N question data by second similarity;
s5: calculating the third similarity between each of the N question data and the question data to be answered, and ranking the N question data by third similarity;
s6: determining the finally first-ranked question data from the values and rankings of the first similarity, second similarity and third similarity of each of the N question data;
s7: taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered.
2. The method of automatically screening answers to questions as set forth in claim 1, wherein the first similarity is calculated by:
s31: calculating a text vector for each of the two question data to be compared;
s32: combining the text vectors of the two question data into a text matrix X, and performing singular value decomposition on the text matrix X into three matrices U, Σ and Vᵀ:
X = U·Σ·Vᵀ
and calculating the matrix Y obtained by removing the principal component from the text matrix X:
Y = X - X·V·Vᵀ
where V is the transpose of Vᵀ;
s33: extracting the vectors corresponding to the two question data from the matrix Y as their preferred text vectors, and calculating the similarity between the two question data from the preferred text vector of each question data as the first similarity.
3. The method of automatically screening answers to questions as set forth in claim 1, wherein the second similarity is calculated by:
s41: for the two question data to be compared, combining all the words in the two question data into one word set and assigning each word in the word set a unique number;
s42: constructing a feature vector for each of the two question data;
the feature vector of each question data is constructed as follows: for each word in the word set, search the question for a word with the same or similar meaning; if such a word exists, set the element of the feature vector at the same position as the word in the word set to the unique number of the word; if no word with the same or similar meaning exists, set that element of the feature vector to 0;
s43: calculating the second similarity of the two question data from the feature vectors.
4. The method of automatically screening answers to questions as set forth in claim 3, wherein the specific calculation formula in step S43 is:
S_r = 1 - |F_1 - F_2| / |F_1 + F_2|
where S_r is the second similarity, |·| denotes the modulus of a vector, and F_1 and F_2 are the feature vectors of the two question data.
5. The method of automatically screening answers to questions as set forth in claim 1, wherein the third similarity is calculated by:
s51: dividing all the words in the two question data into three categories: core words, keywords and other words;
s52: dividing each question data into three phrases according to word category: a core phrase, a key phrase and an other phrase;
s53: calculating the core-word similarity between the core phrases of the two question data, the keyword similarity between their key phrases and the other-word similarity between their other phrases;
s54: calculating the third similarity of the two question data from their core-word similarity, keyword similarity and other-word similarity.
6. The method of automatically screening answers to questions of claim 5, wherein in step S53 all three similarities are calculated by the same formula:
sim = ( Σ_{i=1..m} max_{1≤p≤n} Sim(V_i, V_p) + Σ_{j=1..n} max_{1≤q≤m} Sim(V_j, V_q) ) / (m + n)
where sim denotes the phrase similarity, the function Sim gives the similarity between two vectors, i indexes the words in the phrase of the first question data, j indexes the words in the phrase of the second question data, m and n are the total numbers of words in the phrases of the first and second question data respectively, p and q are word serial numbers, and V_i, V_j, V_p and V_q are the word vectors with the corresponding serial numbers.
7. The method of automatically screening answers to questions of claim 5, wherein the specific process of step S6 is:
s61: calculating the total rank of each question data:
rank = α·rank_vec + β·rank_wo + γ·rank_sdp
where α, β and γ are weight adjusting parameters with α + β + γ = 1, and rank_vec, rank_wo and rank_sdp are the rankings under the first, second and third similarity respectively;
s62: judging whether more than one question data shares the first total rank; if so, proceed to S63, otherwise take the question data with the first total rank as the finally first-ranked question data and end;
s63: calculating the total similarity sim of each of the question data tied for the first total rank, taking the question data with the largest total similarity as the finally first-ranked question data, and ending;
sim = α·S_y + β·S_r + γ·S_l
where S_y is the first similarity, S_r the second similarity and S_l the third similarity.
8. A method for automatically screening answers to questions, comprising the steps of:
s1: preprocessing the question data to be answered, wherein the preprocessing comprises word segmentation and stop-word removal;
s2: replacing all words with the same or similar meaning in the preprocessed question data with one and the same word, the same word being a word whose meaning is identical or similar to that of the replaced words;
s3: calculating the first similarity between the question data to be answered and every question in the database, sorting the first similarity values in descending order, and screening out a number of question data with the largest first similarity values;
s4: calculating the second similarity between the question data to be answered and every question in the database, sorting the second similarity values in descending order, and screening out a number of question data with the largest second similarity values;
s5: calculating the third similarity between the question data to be answered and every question in the database, sorting the third similarity values in descending order, and screening out a number of question data with the largest third similarity values;
s6: determining the finally first-ranked question data from the values and rankings of the first, second and third similarities of all the question data screened out in S3, S4 and S5;
s7: taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered.
9. A terminal device for automatically screening answers to questions, characterized in that it comprises a processor, a memory and a computer program stored in the memory and runnable on the processor, the processor implementing the steps of the method according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN201910954584.9A 2019-10-09 2019-10-09 Method for automatically screening answers to questions, terminal equipment and storage medium Pending CN110688472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910954584.9A CN110688472A (en) 2019-10-09 2019-10-09 Method for automatically screening answers to questions, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910954584.9A CN110688472A (en) 2019-10-09 2019-10-09 Method for automatically screening answers to questions, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110688472A (en) 2020-01-14

Family

ID=69111705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910954584.9A Pending CN110688472A (en) 2019-10-09 2019-10-09 Method for automatically screening answers to questions, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110688472A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633009A (en) * 2020-12-29 2021-04-09 扬州大学 Identification method for random combination uploading field

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103268346A (en) * 2013-05-27 2013-08-28 翁时锋 Semi-supervised classification method and semi-supervised classification system
CN108710613A (en) * 2018-05-22 2018-10-26 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of text similarity
CN109062892A (en) * 2018-07-10 2018-12-21 东北大学 A kind of Chinese sentence similarity calculating method based on Word2Vec
CN109948143A (en) * 2019-01-25 2019-06-28 网经科技(苏州)有限公司 The answer extracting method of community's question answering system
CN109977421A (en) * 2019-04-15 2019-07-05 南京邮电大学 A kind of Knowledge Base of Programming subjects answering system after class


Similar Documents

Publication Publication Date Title
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN109117474B (en) Statement similarity calculation method and device and storage medium
CN110597971B (en) Automatic question answering device and method based on neural network and readable storage medium
CN109840255B (en) Reply text generation method, device, equipment and storage medium
CN110705248A (en) Text similarity calculation method, terminal device and storage medium
CN115795061B (en) Knowledge graph construction method and system based on word vector and dependency syntax
CN112527958A (en) User behavior tendency identification method, device, equipment and storage medium
CN111159404A (en) Text classification method and device
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
CN111737420A (en) Class case retrieval method, system, device and medium based on dispute focus
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN110046344B (en) Method for adding separator and terminal equipment
CN113590811A (en) Text abstract generation method and device, electronic equipment and storage medium
CN113743090A (en) Keyword extraction method and device
CN110427626B (en) Keyword extraction method and device
CN110287284B (en) Semantic matching method, device and equipment
CN110688472A (en) Method for automatically screening answers to questions, terminal equipment and storage medium
CN112597287B (en) Statement processing method, statement processing device and intelligent equipment
US20220108071A1 (en) Information processing device, information processing system, and non-transitory computer readable medium
CN110413985B (en) Related text segment searching method and device
CN112650951A (en) Enterprise similarity matching method, system and computing device
Lichtblau et al. Authorship attribution using the chaos game representation
CN115437620B (en) Natural language programming method, device, equipment and storage medium
CN111666770A (en) Semantic matching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200114