CN110688472A - Method for automatically screening answers to questions, terminal equipment and storage medium - Google Patents

Method for automatically screening answers to questions, terminal equipment and storage medium

Info

Publication number
CN110688472A
CN110688472A
Authority
CN
China
Prior art keywords
similarity
word
data
question data
questions
Prior art date
Legal status
Pending
Application number
CN201910954584.9A
Other languages
Chinese (zh)
Inventor
刘继明
肖肇宇
谭云丹
高力伟
Current Assignee
Xiamen Jincun Technology Co Ltd
Original Assignee
Xiamen Jincun Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Jincun Technology Co Ltd filed Critical Xiamen Jincun Technology Co Ltd
Priority to CN201910954584.9A
Publication of CN110688472A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems

Abstract

The invention relates to a method for automatically screening answers to questions, a terminal device and a storage medium, wherein the method comprises the following steps: preprocessing the question data to be answered; performing a synonym normalization operation on the preprocessed question data; determining the question data in the database that ranks first overall according to the first similarity, the second similarity and the third similarity between the question data to be answered and the questions in the database; and taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered. The invention addresses the problem that most current word-embedding-based similarity algorithms ignore the word order and semantics of sentences: it introduces a second similarity based on word-order features and a third similarity based on semantic dependency, which can represent the hidden word-order and semantic associations between texts, improving the accuracy and stability of the similarity calculation and the overall performance.

Description

Method for automatically screening answers to questions, terminal equipment and storage medium
Technical Field
The present invention relates to the field of automatically screening answers to questions, and in particular, to a method, a terminal device, and a storage medium for automatically screening answers to questions.
Background
Automatic question answering aims to answer questions posed by users automatically through a computer, meeting the user's knowledge needs and resolving the user's questions. When answering a user's question, an automatic question-answering system must correctly understand the natural-language question entered by the user, analyze and extract the key information in it, match that information to the most appropriate answer in a preset corpus, question-answer database or knowledge base, and return the answer to the user.
At present, the mainstream approach to ranking candidate questions is based on text similarity. Text similarity calculation has several mainstream branches; among them, the approach of obtaining word vectors by word embedding and composing them into sentence vectors is widely applied because it is simple and efficient. However, this approach often considers only the overlap of components in the text and ignores word order and semantics, which leads to misreading of the text. For example, the two texts "he is my ancestor" and "I am his ancestor" have completely different meanings, yet many similarity algorithms judge their similarity to be high.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method, a terminal device and a storage medium for automatically screening answers to questions.
The specific scheme is as follows:
a method of automatically screening answers to questions, comprising the steps of:
s1: preprocessing the question data to be answered, wherein the preprocessing comprises word segmentation and stop-word removal;
s2: replacing all words with the same or similar meaning in the preprocessed question data with one and the same word, the same word being a word whose meaning is identical or similar to that of the replaced words;
s3: calculating the first similarity between the question data to be answered and every question in the database, sorting the first similarity values of all the questions in descending order, and screening out the N question data with the largest first similarity values;
s4: calculating the second similarity between each of the N question data and the question data to be answered, and ranking the N question data by second similarity;
s5: calculating the third similarity between each of the N question data and the question data to be answered, and ranking the N question data by third similarity;
s6: determining the finally first-ranked question data from the values and rankings of the first similarity, second similarity and third similarity of each of the N question data;
s7: taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered.
Further, the method for calculating the first similarity comprises:
s31: calculating a text vector for each of the two question data to be compared;
s32: combining the text vectors of the two question data into a text matrix X, and performing singular value decomposition on the text matrix X into three matrices U, Σ and Vᵀ:
X = U·Σ·Vᵀ
and calculating the matrix Y obtained by removing the principal component from the text matrix X:
Y = X - X·V·Vᵀ
where V is the transpose of Vᵀ;
s33: extracting the vectors corresponding to the two question data from the matrix Y as their preferred text vectors, and calculating the similarity between the two question data from the preferred text vector of each question data as the first similarity.
Further, the text vector of each question data in step S31 is calculated as follows:
s311: calculating the word vector of each word in the question data;
s312: calculating the weight of each word in the question data;
s313: calculating the text vector of the question data from the word vector of each word and its weight.
Further, the weight W_ω of the word ω in step S312 is calculated as:
W_ω = α / (α + P(ω))
where α is a tuning constant and P(ω) is the frequency with which the word ω appears in the corpus.
Further, the text vector of the question data in step S313 is calculated as:
K_S = (1/n) · Σ_{i=1..n} W_i·V_i
where K_S is the text vector of the question data S, i denotes the ith word in the question data, n is the number of words in the question data, W_i is the weight of the ith word and V_i is the word vector of the ith word.
Further, the first similarity S_y in step S33 is calculated as:
S_y = (K_i·K_j) / (|K_i|·|K_j|)
where K_i and K_j are the preferred text vectors of the two question data and |·| denotes the modulus (norm) of a vector.
Further, the method for calculating the second similarity comprises:
s41: for the two question data to be compared, combining all the words in the two question data into one word set and assigning each word in the word set a unique number;
s42: constructing a feature vector for each of the two question data;
the feature vector of each question data is constructed as follows: for each word in the word set, search the question for a word with the same or similar meaning; if such a word exists, set the element of the feature vector at the same position as the word in the word set to the unique number of the word; if no word with the same or similar meaning exists, set that element of the feature vector to 0;
s43: calculating the second similarity of the two question data from the feature vectors.
Further, the specific calculation formula in step S43 is:
S_r = 1 - |F_1 - F_2| / |F_1 + F_2|
where S_r is the second similarity, |·| denotes the modulus of a vector, and F_1 and F_2 are the feature vectors of the two question data.
Further, the method for calculating the third similarity comprises:
s51: dividing all the words in the two question data into three categories: core words, keywords and other words;
s52: dividing each question data into three phrases according to word category: a core phrase, a key phrase and an other phrase;
s53: calculating the core-word similarity between the core phrases of the two question data, the keyword similarity between their key phrases and the other-word similarity between their other phrases;
s54: calculating the third similarity of the two question data from their core-word similarity, keyword similarity and other-word similarity.
Further, in step S53 all three similarities are calculated by the same formula:
sim = ( Σ_{i=1..m} max_{1≤p≤n} Sim(V_i, V_p) + Σ_{j=1..n} max_{1≤q≤m} Sim(V_j, V_q) ) / (m + n)
where sim denotes the phrase similarity, the function Sim gives the similarity between two vectors, i indexes the words in the phrase of the first question data, j indexes the words in the phrase of the second question data, m and n are the total numbers of words in the phrases of the first and second question data respectively, p and q are word serial numbers, and V_i, V_j, V_p and V_q are the word vectors with the corresponding serial numbers.
Further, the specific process of step S6 is:
s61: calculating the total rank of each question data:
rank = α·rank_vec + β·rank_wo + γ·rank_sdp
where α, β and γ are weight adjusting parameters with α + β + γ = 1, and rank_vec, rank_wo and rank_sdp are the rankings under the first, second and third similarity respectively;
s62: judging whether more than one question data shares the first total rank; if so, proceed to S63, otherwise take the question data with the first total rank as the finally first-ranked question data and end;
s63: calculating the total similarity sim of each of the question data tied for the first total rank, taking the question data with the largest total similarity as the finally first-ranked question data, and ending;
sim = α·S_y + β·S_r + γ·S_l
where S_y is the first similarity, S_r the second similarity and S_l the third similarity.
The invention further provides a second method for automatically screening answers to questions, comprising the following steps:
s1: preprocessing the question data to be answered, wherein the preprocessing comprises word segmentation and stop-word removal;
s2: replacing all words with the same or similar meaning in the preprocessed question data with one and the same word, the same word being a word whose meaning is identical or similar to that of the replaced words;
s3: calculating the first similarity between the question data to be answered and every question in the database, sorting the first similarity values in descending order, and screening out a number of question data with the largest first similarity values;
s4: calculating the second similarity between the question data to be answered and every question in the database, sorting the second similarity values in descending order, and screening out a number of question data with the largest second similarity values;
s5: calculating the third similarity between the question data to be answered and every question in the database, sorting the third similarity values in descending order, and screening out a number of question data with the largest third similarity values;
s6: determining the finally first-ranked question data from the values and rankings of the first, second and third similarities of all the question data screened out in S3, S4 and S5;
s7: taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered.
A terminal device for automatically screening answers to questions, comprising a processor, a memory and a computer program stored in the memory and operable on the processor, wherein the processor executes the computer program to implement the steps of the method of the embodiment of the present invention.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to an embodiment of the invention as described above.
By adopting the above technical scheme, the invention addresses the problem that most current word-embedding-based similarity algorithms ignore the word order and semantics of sentences: it introduces a second similarity based on word-order features and a third similarity based on semantic dependency, which can represent the hidden word-order and semantic associations between texts, improving the accuracy and stability of the similarity calculation and the overall performance.
Drawings
Fig. 1 is a flowchart illustrating a first embodiment of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
The first embodiment is as follows:
the embodiment of the invention provides a method for automatically screening answers to questions, which comprises the following steps as shown in figure 1:
s1: and preprocessing the data of the questions to be answered.
The preprocessing in this embodiment includes, but is not limited to, word segmentation and stop-word removal.
Word segmentation recombines the question data (a sentence or a text) into a sequence of words according to a given standard; an existing segmentation algorithm is used for this step.
The segmentation result may contain characters or words with no specific meaning, such as modal particles and punctuation (commas, periods, etc.), which would interfere with extracting the actual content of the question, so these stop words need to be removed.
In this embodiment, stop words are removed by means of a stop-word dictionary: common stop words are added to the dictionary, and in special application scenarios its contents can be added to or deleted manually. In other embodiments, stop words may be removed in other ways, which is not limited here.
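As a minimal illustration of this step (not part of the patent text), the following Python sketch uses the jieba segmenter and a small hypothetical stop-word set; a real system would load a curated stop-word dictionary:

```python
import jieba  # a widely used Chinese word-segmentation library

# Hypothetical stop-word dictionary; in practice loaded from a curated file
# that can be extended or trimmed for special application scenarios.
STOP_WORDS = {"的", "吗", "呢", "，", "。", "？"}

def preprocess(question: str) -> list:
    """Segment the question into words and remove stop words."""
    words = jieba.lcut(question)  # word segmentation
    return [w for w in words if w.strip() and w not in STOP_WORDS]
```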
S2: performing a synonym normalization operation on the preprocessed question data to be answered, i.e., replacing all words with the same or similar meaning in the preprocessed data with one and the same word, the same word being a word whose meaning is identical or similar to that of the replaced words.
In this embodiment a synonym dictionary is provided, in which all words with the same or similar meanings point to one word that shares their meaning; for example, the three expressions "television set", "television" and "TV" all point to the same word, and the pointed-to word may be any one of these synonyms.
The synonym normalization operation reduces errors in the subsequent similarity calculation.
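A sketch of the synonym normalization, assuming a hand-built synonym dictionary that maps every surface form to its canonical word (the dictionary entries here are illustrative):

```python
# Hypothetical synonym dictionary: every same- or similar-meaning word
# points to one canonical word, e.g. all forms of "television" to "电视".
SYNONYM_DICT = {"电视机": "电视", "TV": "电视"}

def normalize_synonyms(words: list) -> list:
    """Replace each word that has a dictionary entry with its canonical form."""
    return [SYNONYM_DICT.get(w, w) for w in words]
```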
S3: calculating the first similarity between the question data to be answered and every question stored in the database, sorting the first similarity values of all the questions in descending order, and screening out the N question data with the largest first similarity values.
The database stores a large number of questions and their corresponding answers; once any question in the database is selected, its answer can be retrieved. The contents of the database can be extended or reduced as needed.
In this embodiment, the method for calculating the first similarity includes the following steps:
s31: for each pair of question data to be compared, calculating the text vector of each question data; each text vector is computed as follows:
s311: a word vector for each word in the problem data is calculated.
The word vectors can be generated by an algorithm chosen according to the requirements of the specific scenario, such as Word2Vec or FastText.
S312: the weight of the word in the question data is calculated.
In this embodiment, the weight W_ω of the word ω is calculated as:
W_ω = α / (α + P(ω))
where α is an adjustment constant set empirically by those skilled in the art, and P(ω) is the frequency with which the word ω appears in the corpus, i.e., its word frequency.
S313: and calculating a text vector of the question data according to the word vector of each word in the question data and the weight of the word in the question data.
In this embodiment, the text vector K_S of the question data S is calculated as:
K_S = (1/n) · Σ_{i=1..n} W_i·V_i
where i denotes the ith word (the word serial number), W_i is the weight of the ith word in the question data S, n is the number of words in the question data, and V_i is the word vector of the ith word in the question data S.
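The weighting and averaging above can be sketched as follows, assuming pre-trained word vectors (e.g. from a gensim Word2Vec model) are available as a dict and corpus frequencies P(ω) as another dict; the 1/n averaging follows the formula reconstructed above:

```python
import numpy as np

def text_vector(words, word_vecs, word_freq, alpha=1e-3):
    """Weighted average of word vectors with W_i = alpha / (alpha + P(w_i))."""
    vecs, weights = [], []
    for w in words:
        if w in word_vecs:                      # skip out-of-vocabulary words
            vecs.append(word_vecs[w])
            weights.append(alpha / (alpha + word_freq.get(w, 0.0)))
    if not vecs:
        return None
    V = np.asarray(vecs)                        # n x d matrix of word vectors
    W = np.asarray(weights)                     # n word weights
    return (W[:, None] * V).sum(axis=0) / len(vecs)
```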
S32: combining the text vectors of the two question data into a text matrix X, and performing singular value decomposition on the text matrix X into three matrices U, Σ and Vᵀ:
X = U·Σ·Vᵀ
The matrix Vᵀ obtained in this way captures the principal component of the text matrix X, which usually corresponds to the theme the two texts share. The matrix Y is then obtained by removing this principal component from X; discarding it removes meaningless information, i.e., information that has little influence on the similarity calculation, and thereby increases the calculation accuracy.
Y = X - X·V·Vᵀ
where V is the transpose of Vᵀ;
s33: extracting the vectors corresponding to the two question data from the matrix Y as their preferred text vectors, and calculating the similarity between the two question data from the preferred text vector of each question data as the first similarity.
The first similarity S_y adopted in this embodiment is calculated as:
S_y = (K_i·K_j) / (|K_i|·|K_j|)
where S_y is the first similarity of the two text data, K_i and K_j are the preferred text vectors of the two text data, and |·| denotes the modulus of a vector.
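A numpy sketch of steps S32-S33 for one pair of text vectors. Note an assumption: Y = X - X·V·Vᵀ is read here as removing only the first principal component (keeping every right-singular vector in V would reduce X·V·Vᵀ to X itself and make Y zero):

```python
import numpy as np

def first_similarity(k1, k2):
    """Remove the shared principal component of two text vectors via SVD,
    then return the cosine similarity of the residual (preferred) vectors."""
    X = np.vstack([k1, k2])                     # 2 x d text matrix
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v = Vt[:1]                                  # first right-singular vector
    Y = X - X @ v.T @ v                         # strip the principal component
    y1, y2 = Y[0], Y[1]                         # preferred text vectors
    denom = np.linalg.norm(y1) * np.linalg.norm(y2)
    return float(y1 @ y2 / denom) if denom else 0.0
```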
S4: calculating the second similarity between each of the N question data and the question data to be answered, and ranking the N question data by second similarity.
The calculation of the second similarity comprehensively considers the influence of both the order and the semantics of the words.
In this embodiment, the method for calculating the second similarity includes the following steps:
s41: for the two question data to be compared, combining all the words in the two question data into one word set and assigning each word in the word set a unique number.
In this embodiment, let T be the word set formed from all the words of the two question data. Suppose the question data S1 contains the words {you, me, he} and the question data S2 contains the words {you, me, they}; then the word set T is {you, me, he, they}, and the unique numbers assigned to its words are: "you" is a, "me" is b, "he" is c, and "they" is d.
S42: respectively constructing a feature vector of each problem data in the two problem data;
the construction method of the feature vector of each problem datum comprises the following steps: for each word in the word set, searching whether a word with the same or similar meaning to the word exists in the problem, and if the word with the same or similar meaning exists, setting the value of an element, which is positioned at the same sequence in the word set as the word, in the feature vector as the unique number of the word; if there is no word of the same or similar meaning, the value of the element in the feature vector that is in the same sequence as the word in the set of words is set to 0.
In this embodiment, the feature vector F_1 of S1 is built as follows: the 1st word of the word set T is "you", and S1 contains the same word, so the 1st element of F_1 is set to a; the 2nd word is "me", and S1 contains the same word, so the 2nd element is set to b; the 3rd word is "he", and S1 contains the same word, so the 3rd element is set to c; the 4th word is "they", and S1 contains no identical word but does contain the similar word "he", so the 4th element is set to d. The feature vector F_1 is therefore {a, b, c, d}.
The feature vector F_2 of S2 is constructed in the same way and, in this embodiment, is also {a, b, c, d}.
In this embodiment, two words are considered similar in meaning when the similarity between them exceeds a preset threshold.
S43: a second similarity of the two problem data is calculated based on the feature vectors.
In this embodiment, the second similarity S_r of the two question data is calculated as:
S_r = 1 - |F_1 - F_2| / |F_1 + F_2|
where |·| denotes the modulus of a vector and F_1 and F_2 are the feature vectors of the two question data.
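A sketch of the whole second-similarity computation. The exact formula did not survive in the source, so this uses the word-order similarity S_r = 1 - |F_1 - F_2| / |F_1 + F_2| reconstructed above; the similar() predicate stands in for the thresholded word-vector similarity and defaults to exact equality:

```python
import numpy as np

def second_similarity(words1, words2, similar=lambda a, b: a == b):
    """Number the joint word set, build one feature vector per question
    (0 where no same/similar word exists), and compare the vectors."""
    word_set = list(dict.fromkeys(words1 + words2))      # ordered, unique
    number = {w: i + 1 for i, w in enumerate(word_set)}  # unique numbering

    def feature(words):
        return np.array(
            [number[w] if any(similar(w, x) for x in words) else 0
             for w in word_set],
            dtype=float,
        )

    f1, f2 = feature(words1), feature(words2)
    denom = np.linalg.norm(f1 + f2)
    return 1.0 - np.linalg.norm(f1 - f2) / denom if denom else 1.0
```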
S5: calculating the third similarity between each of the N question data and the question data to be answered, and ranking the N question data by third similarity.
In this embodiment, the method for calculating the third similarity comprises the following steps:
s51: all words in the two question data are classified into three categories, namely core words, keywords and other words.
Semantic dependency analysis is performed on each question data to obtain its semantic dependency tree, and the words making up the question data are divided into core words, keywords and other words according to this tree.
The core word is the word at the root node of the semantic dependency tree; the keywords are the words at child nodes at distance 1 from the root node; the other words are all words other than the core word and the keywords.
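A sketch of this word classification, assuming the semantic dependency parse is available as a list of 1-based head indices (0 marking the root), as most dependency parsers emit:

```python
def classify_words(words, heads):
    """Split words into the core word (root of the dependency tree),
    keywords (children at distance 1 from the root) and other words."""
    root = heads.index(0)                 # position of the core word
    core = [words[root]]
    keywords = [w for w, h in zip(words, heads)
                if h == root + 1]         # heads are 1-based
    others = [w for i, (w, h) in enumerate(zip(words, heads))
              if i != root and h != root + 1]
    return core, keywords, others
```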
S52: dividing each question data into three phrases according to word category: all the core words of a question data form its core phrase, all its keywords form its key phrase, and all its other words form its other phrase.
S53: calculating the core-word similarity between the core phrases of the two question data, the keyword similarity between their key phrases and the other-word similarity between their other phrases.
In this embodiment the core-word similarity, the keyword similarity and the other-word similarity are all computed by the same method:
sim = ( Σ_{i=1..m} max_{1≤p≤n} Sim(V_i, V_p) + Σ_{j=1..n} max_{1≤q≤m} Sim(V_j, V_q) ) / (m + n)
where sim denotes the phrase similarity and the function Sim gives the similarity between two vectors, here between two word vectors; i indexes the words in the phrase of the first question data and j those in the phrase of the second question data; m and n are the total numbers of words in the two phrases; p and q are word serial numbers; and V_i, V_j, V_p and V_q are the word vectors with the corresponding serial numbers.
In this embodiment, Sim(V_i, V_p) is calculated as the cosine similarity of the two word vectors:
Sim(V_i, V_p) = (V_i·V_p) / (|V_i|·|V_p|)
S54: calculating the third similarity of the two question data from their core-word similarity, keyword similarity and other-word similarity.
In this embodiment, the third similarity S_l is calculated as:
S_l = φ_1·sim_c + φ_2·sim_k + φ_3·sim_b
where φ_1, φ_2 and φ_3 are weight adjusting parameters with φ_1 + φ_2 + φ_3 = 1, and sim_c, sim_k and sim_b are the core-word similarity, the keyword similarity and the other-word similarity respectively.
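A sketch of steps S53-S54 under the bidirectional best-match formula reconstructed above, with cosine similarity for Sim and illustrative φ weights (the patent does not fix their values, only that they sum to 1):

```python
import numpy as np

def cosine(a, b):
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / d) if d else 0.0

def phrase_similarity(vecs1, vecs2):
    """Average of best cosine matches in both directions over m + n words."""
    if not vecs1 or not vecs2:
        return 0.0
    s = sum(max(cosine(v, u) for u in vecs2) for v in vecs1)
    s += sum(max(cosine(v, u) for u in vecs1) for v in vecs2)
    return s / (len(vecs1) + len(vecs2))

def third_similarity(sim_c, sim_k, sim_b, phi=(0.5, 0.3, 0.2)):
    """S_l = phi_1*sim_c + phi_2*sim_k + phi_3*sim_b, phi summing to 1."""
    return phi[0] * sim_c + phi[1] * sim_k + phi[2] * sim_b
```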
S6: determining the finally first-ranked question data from the values and rankings of the first similarity, second similarity and third similarity of each of the N question data.
The specific calculation process in this embodiment is:
s61: calculating the total rank of each question data:
rank = α·rank_vec + β·rank_wo + γ·rank_sdp
where α, β and γ are weight adjusting parameters with α + β + γ = 1, and rank_vec, rank_wo and rank_sdp are the rankings under the first, second and third similarity respectively.
S62: judging whether more than one question data shares the first total rank; if so, proceed to S63, otherwise take the question data with the first total rank as the finally first-ranked question data and end.
S63: calculating the total similarity sim of each of the question data tied for the first total rank, taking the question data with the largest total similarity as the finally first-ranked question data, and ending.
sim = α·S_y + β·S_r + γ·S_l
where S_y is the first similarity, S_r the second similarity and S_l the third similarity.
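The ranking-and-tie-break logic of S61-S63 can be sketched as follows; the candidate record layout and the α, β, γ values are illustrative:

```python
def pick_best(candidates, a=0.4, b=0.3, g=0.3):
    """candidates: dicts holding each candidate's similarities (S_y, S_r, S_l)
    and its ranks (1 = best) under the three similarities."""
    for c in candidates:
        c["rank"] = a * c["rank_vec"] + b * c["rank_wo"] + g * c["rank_sdp"]
    best = min(c["rank"] for c in candidates)
    tied = [c for c in candidates if c["rank"] == best]
    if len(tied) == 1:
        return tied[0]
    # Break ties by the weighted total similarity (larger is better).
    return max(tied, key=lambda c: a * c["S_y"] + b * c["S_r"] + g * c["S_l"])
```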
S7: taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered.
The embodiment of the invention addresses the problem that most current word-embedding-based similarity calculation methods ignore the word order and semantics of sentences: it introduces a second similarity based on word-order features and a third similarity based on semantic dependency, which can represent the hidden word-order and semantic associations between texts, improving the accuracy and stability of the similarity calculation and the overall performance.
In the method of this embodiment, the N candidate question data screened out by the first similarity are re-ranked by the second and third similarities to determine the best-matching question, which keeps the time and computation overhead low. In other embodiments, the candidate question data obtained separately by the first, second and third similarity algorithms may all be used to determine the best-matching question; this variant has a larger time and computation overhead but covers more candidates and is therefore more accurate.
It should be noted that when this second variant is adopted, the number of candidate question data obtained by each similarity algorithm may be the same or different, which is not limited here.
Example two:
the invention further provides a terminal device for automatically screening answers to questions, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the first method embodiment of the invention.
Further, as an executable scheme, the terminal device for automatically screening answers to questions may be a desktop computer, a notebook, a palmtop computer, a cloud server or other computing device. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the above structure is merely an example of the terminal device for automatically screening answers to questions and does not limit it; the device may include more or fewer components than described, combine certain components, or use different components; for example, it may further include input-output devices, network access devices, a bus, and the like, which is not limited in this embodiment of the invention.
Further, as an executable scheme, the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the terminal device for automatically screening answers to questions, and uses various interfaces and lines to connect the parts of the device.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the terminal device for automatically screening answers to questions by running or executing the computer program and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created according to use of the device. In addition, the memory may include high-speed random access memory and may also include non-volatile memory such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
The invention also provides a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method of an embodiment of the invention.
If the integrated modules/units of the terminal device for automatically screening answers to questions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method of the above embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and the like.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for automatically screening answers to questions, comprising the steps of:
s1: preprocessing the question data to be answered, wherein the preprocessing comprises word segmentation and stop-word removal;
s2: replacing all words with the same or similar meaning in the preprocessed question data with one and the same word, the same word being a word whose meaning is identical or similar to that of the replaced words;
s3: calculating the first similarity between the question data to be answered and every question in the database, sorting the first similarity values of all the questions in descending order, and screening out the N question data with the largest first similarity values;
s4: calculating the second similarity between each of the N question data and the question data to be answered, and ranking the N question data by second similarity;
s5: calculating the third similarity between each of the N question data and the question data to be answered, and ranking the N question data by third similarity;
s6: determining the finally first-ranked question data from the values and rankings of the first similarity, second similarity and third similarity of each of the N question data;
s7: taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered.
2. The method of automatically screening answers to questions as set forth in claim 1, wherein the first similarity is calculated by:
s31: calculating a text vector for each of the two question data to be compared;
s32: combining the text vectors of the two question data into a text matrix X, and performing singular value decomposition on the text matrix X into three matrices U, Σ and Vᵀ:
X = U·Σ·Vᵀ
and calculating the matrix Y obtained by removing the principal component from the text matrix X:
Y = X - X·V·Vᵀ
where V is the transpose of Vᵀ;
s33: extracting the vectors corresponding to the two question data from the matrix Y as their preferred text vectors, and calculating the similarity between the two question data from the preferred text vector of each question data as the first similarity.
3. The method of automatically screening answers to questions as set forth in claim 1, wherein the second similarity is calculated by:
s41: for the two question data to be compared, combining all the words in the two question data into one word set and assigning each word in the word set a unique number;
s42: constructing a feature vector for each of the two question data;
the feature vector of each question data is constructed as follows: for each word in the word set, search the question for a word with the same or similar meaning; if such a word exists, set the element of the feature vector at the same position as the word in the word set to the unique number of the word; if no word with the same or similar meaning exists, set that element of the feature vector to 0;
s43: calculating the second similarity of the two question data from the feature vectors.
4. The method of automatically screening answers to questions as set forth in claim 3, wherein the specific calculation formula in step S43 is:
S_r = 1 - |F_1 - F_2| / |F_1 + F_2|
where S_r is the second similarity, |·| denotes the modulus of a vector, and F_1 and F_2 are the feature vectors of the two question data.
5. The method of automatically screening answers to questions as set forth in claim 1, wherein the third similarity is calculated by:
s51: dividing all the words in the two question data into three categories: core words, keywords and other words;
s52: dividing each question data into three phrases according to word category: a core phrase, a key phrase and an other phrase;
s53: calculating the core-word similarity between the core phrases of the two question data, the keyword similarity between their key phrases and the other-word similarity between their other phrases;
s54: calculating the third similarity of the two question data from their core-word similarity, keyword similarity and other-word similarity.
6. The method of automatically screening answers to questions of claim 5, wherein in step S53 all three similarities are calculated by the same formula:
sim = ( Σ_{i=1..m} max_{1≤p≤n} Sim(V_i, V_p) + Σ_{j=1..n} max_{1≤q≤m} Sim(V_j, V_q) ) / (m + n)
where sim denotes the phrase similarity, the function Sim gives the similarity between two vectors, i indexes the words in the phrase of the first question data, j indexes the words in the phrase of the second question data, m and n are the total numbers of words in the phrases of the first and second question data respectively, p and q are word serial numbers, and V_i, V_j, V_p and V_q are the word vectors with the corresponding serial numbers.
7. The method of automatically screening answers to questions of claim 5, wherein the specific process of step S6 is:
s61: calculating the total rank of each question data:
rank = α·rank_vec + β·rank_wo + γ·rank_sdp
where α, β and γ are weight adjusting parameters with α + β + γ = 1, and rank_vec, rank_wo and rank_sdp are the rankings under the first, second and third similarity respectively;
s62: judging whether more than one question data shares the first total rank; if so, proceed to S63, otherwise take the question data with the first total rank as the finally first-ranked question data and end;
s63: calculating the total similarity sim of each of the question data tied for the first total rank, taking the question data with the largest total similarity as the finally first-ranked question data, and ending;
sim = α·S_y + β·S_r + γ·S_l
where S_y is the first similarity, S_r the second similarity and S_l the third similarity.
8. A method for automatically screening answers to questions, comprising the steps of:
s1: preprocessing the question data to be answered, wherein the preprocessing comprises word segmentation and stop-word removal;
s2: replacing all words with the same or similar meaning in the preprocessed question data with one and the same word, the same word being a word whose meaning is identical or similar to that of the replaced words;
s3: calculating the first similarity between the question data to be answered and every question in the database, sorting the first similarity values in descending order, and screening out a number of question data with the largest first similarity values;
s4: calculating the second similarity between the question data to be answered and every question in the database, sorting the second similarity values in descending order, and screening out a number of question data with the largest second similarity values;
s5: calculating the third similarity between the question data to be answered and every question in the database, sorting the third similarity values in descending order, and screening out a number of question data with the largest third similarity values;
s6: determining the finally first-ranked question data from the values and rankings of the first, second and third similarities of all the question data screened out in S3, S4 and S5;
s7: taking the answer corresponding to the finally first-ranked question data in the database as the answer to the question data to be answered.
9. A terminal device for automatically screening answers to questions, characterized in that it comprises a processor, a memory and a computer program stored in the memory and runnable on the processor, the processor implementing the steps of the method according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN201910954584.9A 2019-10-09 2019-10-09 Method for automatically screening answers to questions, terminal equipment and storage medium Pending CN110688472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910954584.9A CN110688472A (en) 2019-10-09 2019-10-09 Method for automatically screening answers to questions, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910954584.9A CN110688472A (en) 2019-10-09 2019-10-09 Method for automatically screening answers to questions, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110688472A (en) 2020-01-14

Family

ID=69111705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910954584.9A Pending CN110688472A (en) 2019-10-09 2019-10-09 Method for automatically screening answers to questions, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110688472A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633009A (en) * 2020-12-29 2021-04-09 扬州大学 Identification method for random combination uploading field

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103268346A (en) * 2013-05-27 2013-08-28 翁时锋 Semi-supervised classification method and semi-supervised classification system
CN108710613A (en) * 2018-05-22 2018-10-26 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of text similarity
CN109062892A (en) * 2018-07-10 2018-12-21 东北大学 A kind of Chinese sentence similarity calculating method based on Word2Vec
CN109948143A (en) * 2019-01-25 2019-06-28 网经科技(苏州)有限公司 The answer extracting method of community's question answering system
CN109977421A (en) * 2019-04-15 2019-07-05 南京邮电大学 A kind of Knowledge Base of Programming subjects answering system after class


Similar Documents

Publication Publication Date Title
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN109117474B (en) Statement similarity calculation method and device and storage medium
CN110597971B (en) Automatic question answering device and method based on neural network and readable storage medium
CN109840255B (en) Reply text generation method, device, equipment and storage medium
CN110705248A (en) Text similarity calculation method, terminal device and storage medium
CN115795061B (en) Knowledge graph construction method and system based on word vector and dependency syntax
CN112527958A (en) User behavior tendency identification method, device, equipment and storage medium
CN111159404A (en) Text classification method and device
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
CN111737420A (en) Class case retrieval method, system, device and medium based on dispute focus
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN110046344B (en) Method for adding separator and terminal equipment
CN113590811A (en) Text abstract generation method and device, electronic equipment and storage medium
CN113743090A (en) Keyword extraction method and device
CN110427626B (en) Keyword extraction method and device
CN110287284B (en) Semantic matching method, device and equipment
CN110688472A (en) Method for automatically screening answers to questions, terminal equipment and storage medium
CN112597287B (en) Statement processing method, statement processing device and intelligent equipment
US20220108071A1 (en) Information processing device, information processing system, and non-transitory computer readable medium
CN110413985B (en) Related text segment searching method and device
CN112650951A (en) Enterprise similarity matching method, system and computing device
Lichtblau et al. Authorship attribution using the chaos game representation
CN115437620B (en) Natural language programming method, device, equipment and storage medium
CN111666770A (en) Semantic matching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200114