CN117609447A - Method, device, equipment and storage medium for generating question-answer background information - Google Patents

Method, device, equipment and storage medium for generating question-answer background information

Info

Publication number
CN117609447A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311533288.4A
Other languages
Chinese (zh)
Inventor
陈晓鸿
董灿佳
黄华新
魏宝辉
黎智韬
黄伟文
蔡鑫
罗朝彤
吴志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd, China Mobile Information Technology Co Ltd
Priority to CN202311533288.4A
Publication of CN117609447A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method, a device, equipment and a storage medium for generating question-answer background information, and relates to the field of artificial intelligence. The method comprises the following steps: generating a plurality of question-answer pairs based on knowledge corpus information, wherein each question-answer pair comprises a question and the answer corresponding to the question; for each question-answer pair, separately calculating the similarity between the question in the pair and the user question and the similarity between the answer in the pair and the user question, and obtaining the comprehensive similarity of the question-answer pair from these similarities; grouping question-answer pairs that have the same answer, so that the resulting groups form a plurality of question-answer pair sets; determining the set similarity of each question-answer pair set according to the comprehensive similarity of each question-answer pair in that set; and screening, from the question-answer pair sets, a target question-answer pair set whose set similarity meets a preset threshold, and determining the question-answer pairs in the target set as the question-answer background information of the user question.

Description

Method, device, equipment and storage medium for generating question-answer background information
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a method, a device, equipment and a storage medium for generating question-answering background information.
Background
With the development of artificial intelligence technology, intelligent customer service question-answering applications based on large language models have developed rapidly. Such an application feeds the user's question together with question-answer background information into a large language model, which interprets both and outputs an answer.
Current methods for generating question-answer background information mainly cut the knowledge corpus information into different question-answer pairs; after a user question is input, the question-answer pair most similar to the user question is found and used as the question-answer background information for the large language model, which answers the user question on that basis.
However, because the same question can be phrased in different ways, the similarity calculated between the user question and the questions in the question-answer pairs can be low for some phrasings. The most accurate question-answer pair is then not matched, and the large language model's answer is less accurate as a result.
Disclosure of Invention
The embodiments of the present application provide a method, a device, equipment and a storage medium for generating question-answer background information, which can match the most relevant question-answer pair set to a user question and improve the accuracy of question-answer pair matching.
In one aspect of the embodiments of the present application, a method for generating question-answer background information is provided, where the method includes:
generating a plurality of question-answer pairs based on knowledge corpus information, wherein each question-answer pair comprises a question and the answer corresponding to the question;
for each question-answer pair, separately calculating the similarity between the question in the pair and the user question and the similarity between the answer in the pair and the user question, and obtaining the comprehensive similarity of the question-answer pair from these similarities;
grouping question-answer pairs that have the same answer, so that the resulting groups form a plurality of question-answer pair sets;
determining the set similarity of each question-answer pair set according to the comprehensive similarity of each question-answer pair in that set;
and screening, from the question-answer pair sets, a target question-answer pair set whose set similarity meets a preset threshold, and determining the question-answer pairs in the target set as the question-answer background information of the user question, so that a large language model can answer the user question according to the question-answer background information.
In one aspect of the embodiments of the present application, a device for generating question-answering background information is provided, where the device includes:
the generation module is used for generating a plurality of question-answer pairs based on knowledge corpus information, wherein each question-answer pair comprises a question and the answer corresponding to the question;
the acquisition module is used for, for each question-answer pair, separately calculating the similarity between the question in the pair and the user question and the similarity between the answer in the pair and the user question, and obtaining the comprehensive similarity of the question-answer pair from these similarities;
the grouping module is used for grouping question-answer pairs that have the same answer, so that the resulting groups form a plurality of question-answer pair sets;
the computing module is used for determining the set similarity of each question-answer pair set according to the comprehensive similarity of each question-answer pair in that set;
and the screening module is used for screening, from the question-answer pair sets, a target question-answer pair set whose set similarity meets a preset threshold, and determining the question-answer pairs in the target set as the question-answer background information of the user question, so that a large language model can answer the user question according to the question-answer background information.
In one aspect of the embodiments of the present application, an electronic device is provided, which includes a processor and a memory storing a program or instructions; when the program or instructions are executed by the processor, the method for generating question-answer background information according to any aspect of the embodiments of the present application is implemented.
In one aspect of the embodiments of the present application, a readable storage medium is provided, on which a program or instructions are stored; when executed by a processor, the program or instructions implement the method for generating question-answer background information provided in any aspect of the embodiments of the present application.
In one aspect of the embodiments of the present application, a computer program product is provided; when the instructions in the computer program product are executed by a processor of an electronic device, they cause the electronic device to perform the method for generating question-answer background information provided in any aspect of the embodiments of the present application.
In the method for generating question-answer background information provided by the embodiments of the present application, a plurality of question-answer pairs are first generated based on knowledge corpus information; the possible ways of asking about each knowledge segment are taken into account, so that each segment is generalized into several questions that correspond to one answer. Then, for each question-answer pair, the similarity between its question and the user question and the similarity between its answer and the user question are calculated separately, and the comprehensive similarity of the question-answer pair is obtained from these similarities. The question-answer pairs are grouped so that pairs with identical answers form one question-answer pair set, yielding a plurality of question-answer pair sets. The set similarity of each question-answer pair set is calculated from the comprehensive similarities of the question-answer pairs it contains, and the target question-answer pair sets whose set similarity meets a preset threshold are selected as the question-answer background information of the user question. Because the set similarity reflects the comprehensive similarity of every question-answer pair in a set, and the sets are screened by set similarity, the background information is generated from the question-answer pair sets most relevant to the user question. The effect of different phrasings of the user question on the similarity is thereby taken into account, which improves the accuracy of question-answer pair matching and, in turn, the accuracy of the large language model's answers.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described, and it is possible for a person skilled in the art to obtain other drawings according to these drawings without inventive effort.
Fig. 1 is a schematic flow chart of an embodiment of a method for generating question-answer background information provided in the present application;
Fig. 2 is a schematic structural diagram of an embodiment of a question-answer background information generating device provided in the present application;
Fig. 3 is a schematic structural diagram of an embodiment of a question-answer background information generating device provided in the present application.
Detailed Description
Features and exemplary embodiments of various aspects of the present application are described in detail below to make the objects, technical solutions and advantages of the present application more apparent, and to further describe the present application in conjunction with the accompanying drawings and the detailed embodiments. It should be understood that the specific embodiments described herein are intended to be illustrative of the application and are not intended to be limiting. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present application by showing examples of the present application.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The data acquisition, storage, use, processing and the like in the technical scheme meet the relevant regulations of national laws and regulations.
In the related art, current methods for generating question-answer background information mainly cut the knowledge corpus information into different question-answer pairs; after a user question is input, the question-answer pair most similar to the user question is found and used as the question-answer background information for the large language model, which answers the user question on that basis. With this approach, the similarity calculated for some phrasings is low, so the most accurate question-answer pair is not matched and the large language model's answer is less accurate as a result.
The embodiments of the present application provide a method, a device, equipment and a storage medium for generating question-answer background information. In the method, a plurality of question-answer pairs are first generated based on knowledge corpus information; the possible ways of asking about each knowledge segment are taken into account, so that each segment is generalized into several questions that correspond to one answer. Then, for each question-answer pair, the similarity between its question and the user question and the similarity between its answer and the user question are calculated separately, and the comprehensive similarity of the question-answer pair is obtained from these similarities. The question-answer pairs are grouped so that pairs with identical answers form one question-answer pair set, yielding a plurality of question-answer pair sets. The set similarity of each question-answer pair set is calculated from the comprehensive similarities of the question-answer pairs it contains, and the target question-answer pair sets whose set similarity meets a preset threshold are selected as the question-answer background information of the user question. Because the set similarity reflects the comprehensive similarity of every question-answer pair in a set, and the sets are screened by set similarity, the background information is generated from the question-answer pair sets most relevant to the user question. The effect of different phrasings of the user question on the similarity is thereby taken into account, which improves the accuracy of question-answer pair matching and, in turn, the accuracy of the large language model's answers.
Specific embodiments of a method, an apparatus, a device, and a storage medium for generating question-answer background information provided by the embodiments of the present application are described below. The method for generating the question-answer background information is first described below.
Fig. 1 is a flowchart of a method for generating question-answer background information, which is provided in an embodiment of the present application, and the method is applied to a knowledge question-answer system, and may include steps S101 to S105.
S101, generating a plurality of question-answer pairs based on knowledge corpus information, wherein the question-answer pairs comprise questions and answers corresponding to the questions.
The knowledge corpus information is information containing specific knowledge content; for example, it may be a user manual or a system description document. A question-answer pair consists of a question and its answer, where the question asks about specific content of the knowledge corpus information and the answer is a knowledge segment from the knowledge corpus information.
For example, the knowledge question-answering system obtains knowledge corpus information and generates a plurality of question-answer pairs according to specific content in the knowledge corpus information.
S102, for each question-answer pair, calculating the similarity between the question in the question-answer pair and the user question and the similarity between the answer in the question-answer pair and the user question respectively, and obtaining the comprehensive similarity of the question-answer pairs according to the similarity.
The comprehensive similarity is the similarity between a question-answer pair and the user question. For example, it may be calculated with either the term frequency-inverse document frequency (TF-IDF) algorithm or the word vector (Word2Vector) algorithm.
For example, when the knowledge question-answering system uses the Word2Vector algorithm to calculate the comprehensive similarity between a question-answer pair and the user question, the question-answer pair is first segmented into words and the word vector of each word is obtained; all the word vectors are then summed and averaged to give the vector of the question-answer pair. The vector of the user question is determined in the same way. Finally, the cosine of the angle between the question-answer pair vector and the user question vector is calculated, and this cosine value is the comprehensive similarity between the question-answer pair and the user question.
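The averaging-and-cosine procedure described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the three-dimensional `TOY_VECTORS` table, the token lists and the function names are all hypothetical stand-ins for trained word vectors and a real word segmenter.

```python
import math

# Hypothetical 3-dimensional word vectors; a real system would load trained embeddings.
TOY_VECTORS = {
    "reset": [0.9, 0.1, 0.0],
    "password": [0.8, 0.2, 0.1],
    "change": [0.7, 0.3, 0.0],
    "invoice": [0.0, 0.1, 0.9],
}


def sentence_vector(tokens):
    """Average the vectors of the tokens that have an embedding."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Tokens are assumed to be already segmented.
pair_vec = sentence_vector(["reset", "password"])
user_vec = sentence_vector(["change", "password"])
other_vec = sentence_vector(["invoice"])

similar = cosine(pair_vec, user_vec)      # close phrasing, high cosine
dissimilar = cosine(pair_vec, other_vec)  # unrelated topic, low cosine
```

Averaging discards word order, which keeps the sketch short; whether the described system weights words differently is not stated in the text.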
S103, dividing question-answer pairs that have the same answer into one group, so that the resulting groups form a plurality of question-answer pair sets.
A question-answer pair set is a set formed by a plurality of question-answer pairs whose answers are identical. For example, suppose there are three question-answer pairs, A-A1, B-B1 and C-A1. Because the answers of A-A1 and C-A1 are identical, these two pairs are put into the same question-answer pair set, and the pair B-B1 is put into another question-answer pair set.
For example, in the case of generating a plurality of question-answer pairs, the knowledge question-answer system groups the question-answer pairs according to the answers of the question-answer pairs, puts a plurality of question-answer pairs with identical answers into the same question-answer pair set, and forms a plurality of question-answer pair sets.
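The grouping step can be sketched with a dictionary keyed by answer. The snippet mirrors the three-pair example above; representing a pair as a `(question, answer)` tuple and the function name `group_by_answer` are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_answer(qa_pairs):
    """Put question-answer pairs with identical answers into the same set."""
    groups = defaultdict(list)
    for question, answer in qa_pairs:
        groups[answer].append((question, answer))
    return dict(groups)


pairs = [("A", "A1"), ("B", "B1"), ("C", "A1")]
sets = group_by_answer(pairs)
# sets["A1"] holds the pairs for questions A and C; sets["B1"] holds the pair for B.
```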
S104, determining the set similarity of each question-answer pair set according to the comprehensive similarity of each question-answer pair in that set.
The set similarity is the similarity between a question-answer pair set and the user question. For example, the set similarity may be calculated by averaging.
For example, when the knowledge question-answering system calculates the similarity between a question-answer pair set and the user question by averaging, the comprehensive similarities of the question-answer pairs in the set are summed, and the sum is divided by the number of question-answer pairs in the set; the result is the set similarity of that question-answer pair set.
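As a small worked sketch of the averaging method, the function below takes the comprehensive similarities of the pairs in one set and returns their mean. The function name and the sample values are hypothetical.

```python
def set_similarity(comprehensive_similarities):
    """Mean of the comprehensive similarities of the pairs in one set."""
    if not comprehensive_similarities:
        raise ValueError("a question-answer pair set must contain at least one pair")
    return sum(comprehensive_similarities) / len(comprehensive_similarities)


# Two pairs in the same set with comprehensive similarities 0.9 and 0.7.
score = set_similarity([0.9, 0.7])
```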
S105, a target question-answer pair set with the set similarity meeting a preset threshold is selected from the question-answer pair set, and the question-answer pairs in the target question-answer pair set are determined to be question-answer background information of the user questions, so that the large language model answers the user questions according to the question-answer background information.
The question-answer background information consists of the screened question-answer pair sets and serves as background knowledge for the user question, so that the large language model can answer the user question according to it.
For example, after calculating the set similarity of each question-answer pair set, the knowledge question-answering system sorts the sets by set similarity in descending order and accumulates the set similarities in that order to obtain a running sum. After each addition it checks whether the sum has reached 1; while the sum is less than 1, it continues to add the next set similarity. When the sum is greater than or equal to 1 for the first time, the question-answer pair sets whose set similarities were included in the sum are selected as the question-answer background information of the user question. This background information is input into the large language model, which answers the user question according to it.
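The cumulative-sum screening described above can be sketched as follows. The dictionary keys, the function name and the default `threshold=1.0` are illustrative; returning every set when the sum never reaches the threshold is an assumption, since the text does not cover that case.

```python
def select_target_sets(set_similarities, threshold=1.0):
    """Pick sets in descending similarity until the running sum first reaches the threshold."""
    ranked = sorted(set_similarities.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for set_id, similarity in ranked:
        selected.append(set_id)
        running += similarity
        if running >= threshold:
            break
    # Assumption: if the sum never reaches the threshold, all sets are returned.
    return selected


chosen = select_target_sets({"A1": 0.6, "B1": 0.5, "C1": 0.2})
# 0.6 is below 1, 0.6 + 0.5 reaches 1, so the sets for A1 and B1 are chosen.
```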
According to the embodiment of the present application, a plurality of question-answer pairs are first generated based on knowledge corpus information; the possible ways of asking about each knowledge segment are taken into account, so that each segment is generalized into several questions that correspond to one answer. Then, for each question-answer pair, the similarity between its question and the user question and the similarity between its answer and the user question are calculated separately, and the comprehensive similarity of the question-answer pair is obtained from these similarities. The question-answer pairs are grouped so that pairs with identical answers form one question-answer pair set, yielding a plurality of question-answer pair sets. The set similarity of each question-answer pair set is calculated from the comprehensive similarities of the question-answer pairs it contains, and the target question-answer pair sets whose set similarity meets a preset threshold are selected as the question-answer background information of the user question. Because the set similarity reflects the comprehensive similarity of every question-answer pair in a set, and the sets are screened by set similarity, the background information is generated from the question-answer pair sets most relevant to the user question. The effect of different phrasings of the user question on the similarity is thereby taken into account, which improves the accuracy of question-answer pair matching and, in turn, the accuracy of the large language model's answers.
In one embodiment, the step S101 specifically includes:
cutting knowledge corpus information into a plurality of corpus fragments;
generalizing each corpus fragment into a plurality of question-answer pairs, wherein the question-answer pairs generalized from the same corpus fragment have the same answer.
A corpus fragment is a specific knowledge fragment produced by cutting the knowledge corpus information. For example, if the knowledge corpus information mainly contains three parts (name, gender and age), it can be cut into three corresponding corpus fragments: name, gender and age.
For example, the knowledge question-answering system cuts the knowledge corpus information into a plurality of corpus fragments according to the specific content it contains, and then generalizes each corpus fragment into a plurality of question-answer pairs. For example, when the content of the corpus fragment is a name, it can be generalized into question-answer pairs built from the three questions "What are you called?", "What name do you go by?" and "What is your name?". The question-answer pairs generalized from the same corpus fragment all have the same answer.
In this embodiment, the knowledge corpus information is cut into a plurality of corpus fragments and each corpus fragment is generalized into a plurality of question-answer pairs. This takes the differences in how individual users phrase their questions into account and addresses the problem that the matched answer changes with the way the user asks.
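The data shape produced by generalization can be sketched as below. The text does not say how the question variants are produced, so the variants are simply passed in; the fragment text, the question strings and the function name `generalize` are hypothetical.

```python
def generalize(fragment_text, question_variants):
    """Build one question-answer pair per question variant, all sharing the fragment as answer."""
    return [(question, fragment_text) for question in question_variants]


# Hypothetical corpus fragment and question variants for the "name" example.
name_fragment = "Name: Zhang San"
name_pairs = generalize(
    name_fragment,
    ["What are you called?", "What name do you go by?", "What is your name?"],
)
```

Because every pair built from one fragment carries the same answer text, the later grouping step can collect them into a single question-answer pair set.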
In one embodiment, prior to S102, the method further comprises:
acquiring configuration information and user problems, wherein the configuration information comprises a similarity adjustment coefficient;
vectorizing the user questions and the question-answer pairs;
S102 specifically includes:
for each question-answer pair, separately calculating the question similarity between the question in the pair and the user question and the answer similarity between the answer in the pair and the user question, according to the similarity adjustment coefficient, the vectorized user question and the vectorized question-answer pair;
and obtaining the comprehensive similarity of the question-answer pair according to the question similarity and the answer similarity.
The configuration information is information corresponding to the configuration parameters of the question-answer pairs, and specifically includes a similarity adjustment coefficient. The similarity adjustment coefficient determines how selective the algorithm is: the higher its value, the more targeted the algorithm and the fewer question-answer pair sets are included in the question-answer background information finally output to the large language model.
The question similarity is used for representing the similarity between the questions in the question-answer pair and the user questions, and the answer similarity is used for representing the similarity between the answers in the question-answer pair and the user questions.
For example, the knowledge question-answering system obtains a preset similarity adjustment coefficient and an inputted user question, and vectorizes the similarity adjustment coefficient and the user question. And determining the similarity of the questions between the user questions and the questions of the question-answer pair and the similarity of the answers between the user questions and the answers of the question-answer pair according to the similarity adjustment coefficient. The cosine similarity can be calculated by a Word2Vector algorithm, the value range of the cosine similarity is between minus 1 and 1, and the representative correlation is stronger when the value of the cosine similarity is closer to 1. And finally, determining the comprehensive similarity between the user questions and the question-answer pairs according to the similarity between the user questions and the questions of the question-answer pairs and the similarity between the user questions and the answers of the question-answer pairs.
In this way, the configuration information and the user question are obtained and the question similarity and answer similarity of the plurality of question-answer pairs are calculated, which facilitates the subsequent calculation of the comprehensive similarity between the user question and each question-answer pair. Because both the similarity between the user question and the question of the pair and the similarity between the user question and the answer of the pair are considered, the similarity between the user question and the question-answer pair is reflected more accurately.
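The two per-pair similarities described above can be sketched as follows. This is a minimal illustration, assuming the user question, the question, and the answer have already been vectorized into equal-length lists of floats. The function names and the toy vectors are hypothetical, and the text does not say how the similarity adjustment coefficient enters this calculation, so the sketch leaves it out.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as unrelated to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def pair_similarities(
    user_vec: Sequence[float],
    question_vec: Sequence[float],
    answer_vec: Sequence[float],
) -> Tuple[float, float]:
    """Return (question similarity, answer similarity) for one question-answer pair."""
    return (
        cosine_similarity(user_vec, question_vec),
        cosine_similarity(user_vec, answer_vec),
    )


# Toy 3-dimensional embeddings (made-up values, not taken from the source).
user_q = [1.0, 0.0, 0.0]
x, y = pair_similarities(user_q, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Here `x` plays the role of the question similarity and `y` the answer similarity of one pair; in practice the vectors would come from whatever embedding model the system uses.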
In one embodiment, the configuration information further includes a question-answer pair maximum limit value and a similarity minimum limit value, and for each question-answer pair, after calculating the question similarity between the question in the question-answer pair and the user question and the answer similarity between the answer in the question-answer pair and the user question according to the similarity adjustment coefficient, the vectorized user question, and the vectorized question-answer pair, the method further includes:
screening, in descending order of question similarity and according to the question-answer pair maximum limit value, the similarity minimum limit value, and the question similarities of the plurality of question-answer pairs, a plurality of question-answer pairs whose question similarity is not smaller than the similarity minimum limit value, wherein the number of screened question-answer pairs is not larger than the question-answer pair maximum limit value;
wherein obtaining the comprehensive similarity of the question-answer pairs according to the question similarity and the answer similarity comprises:
obtaining the comprehensive similarity of the question-answer pairs according to the question similarity and answer similarity of the screened question-answer pairs.
The question-answer pair maximum limit value represents the maximum number of question-answer pairs retained after screening, and the similarity minimum limit value represents the minimum value of the question similarity and the answer similarity of the question-answer pairs.
For example, after generating the question similarity and the answer similarity of the plurality of question-answer pairs, the knowledge question-answering system first screens the question-answer pairs according to the similarity minimum limit value, removing every question-answer pair whose question similarity is smaller than that value. It then sorts the remaining question-answer pairs in descending order of question similarity and removes those ranked beyond the question-answer pair maximum limit value, so that the number of screened question-answer pairs does not exceed that limit. Finally, the comprehensive similarity between each screened question-answer pair and the user question is calculated from the question similarity and answer similarity of the screened pairs.
In this embodiment, the question-answer pair maximum limit value and the similarity minimum limit value are obtained and used to screen the question-answer pairs, so that question-answer pairs with low similarity are removed and the amount of computation needed for the comprehensive similarity between the question-answer pairs and the user question is reduced.
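The screening step above can be sketched as a filter followed by a sort and a truncation. The `ScoredPair` type, the function name, and the sample values are hypothetical and serve only to illustrate the described order of operations.

```python
from typing import List, NamedTuple


class ScoredPair(NamedTuple):
    question: str
    answer: str
    question_sim: float  # similarity between this question and the user question
    answer_sim: float    # similarity between this answer and the user question


def screen_pairs(
    pairs: List[ScoredPair], min_similarity: float, max_pairs: int
) -> List[ScoredPair]:
    """Drop pairs below the minimum question similarity, then keep the top max_pairs."""
    if max_pairs < 0:
        raise ValueError("max_pairs must be non-negative")
    # Step 1: remove pairs whose question similarity is below the minimum limit.
    kept = [p for p in pairs if p.question_sim >= min_similarity]
    # Step 2: sort the remaining pairs by question similarity, largest first.
    kept.sort(key=lambda p: p.question_sim, reverse=True)
    # Step 3: keep no more than the maximum number of pairs.
    return kept[:max_pairs]


# Made-up scores for illustration only.
pairs = [
    ScoredPair("q1", "a1", 0.9, 0.5),
    ScoredPair("q2", "a1", 0.4, 0.5),
    ScoredPair("q3", "a2", 0.7, 0.2),
    ScoredPair("q4", "a2", 0.8, 0.1),
]
screened = screen_pairs(pairs, min_similarity=0.5, max_pairs=2)
```

With these sample values, `q2` is removed by the minimum limit and `q3` by the maximum limit, leaving `q1` and `q4` in that order.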
In one embodiment, the overall similarity of each question-answer pair to the user's question is obtained according to equation 1 below:
where f(x_i, y_i) is the comprehensive similarity of the i-th question-answer pair, k is the similarity adjustment coefficient, x_i is the question similarity of the i-th question-answer pair, and y_i is the answer similarity of the i-th question-answer pair.
According to the embodiment, the comprehensive similarity of each question-answer pair is calculated according to the formula 1, so that the subsequent determination of the set similarity of the question-answer pair set according to the comprehensive similarity is facilitated.
In one embodiment, the set similarity of the set of question-answer pairs is calculated according to the following equation 2:
where R_Total_i is the set similarity of the i-th question-answer pair set, R_j is the comprehensive similarity of the j-th question-answer pair in the question-answer pair set, and the i-th question-answer pair set contains j question-answer pairs in total.
By the embodiment, the set similarity of the question-answer pair set is calculated according to the formula 2, so that the follow-up determination of the question-answer background information according to the set similarity is facilitated.
As shown in fig. 2, the device for generating question-answer background information provided in the embodiment of the present application includes a generating module 210, an obtaining module 220, a grouping module 230, a calculating module 240, and a screening module 250.
A generating module 210, configured to generate a plurality of question-answer pairs based on the knowledge corpus information, where the question-answer pairs include questions and answers corresponding to the questions;
the obtaining module 220 is configured to calculate, for each question-answer pair, a similarity between a question in the question-answer pair and a user question and a similarity between an answer in the question-answer pair and the user question, respectively, and obtain a comprehensive similarity between the question-answer pairs according to the similarity;
a grouping module 230, configured to divide question-answer pairs with the same answer among the multiple question-answer pairs into a group, to obtain multiple groups, and form multiple question-answer pair sets;
the computing module 240 is configured to determine a set similarity of each question-answer pair set in the question-answer pair set according to the comprehensive similarity of each question-answer pair in the question-answer pair set;
and the screening module 250 is used for screening a target question-answer pair set with the set similarity meeting a preset threshold from the question-answer pair set, and determining the question-answer pairs in the target question-answer pair set as question-answer background information of the user questions so as to be used for the large language model to answer the user questions according to the question-answer background information.
According to this embodiment, the generating module 210 generates a plurality of question-answer pairs based on the knowledge corpus information; taking into account the different ways in which each knowledge fragment might be asked about, the generalization yields question-answer pairs in which a plurality of questions correspond to one answer. Then, for each question-answer pair, the obtaining module 220 calculates the similarity between the question in the question-answer pair and the user question and the similarity between the answer in the question-answer pair and the user question, and obtains the comprehensive similarity of the question-answer pair from these similarities. The grouping module 230 groups the question-answer pairs, merging question-answer pairs with the same answer into one question-answer pair set, thereby forming a plurality of question-answer pair sets. The calculating module 240 then calculates the set similarity of each question-answer pair set according to the comprehensive similarity of each question-answer pair in that set, and the screening module 250 screens out, from the question-answer pair sets, the target question-answer pair set whose set similarity meets a preset threshold, to obtain the question-answer background information for the user question. The embodiment of the present application thus takes the comprehensive similarity of every question-answer pair in a set into account when generating the set similarity of that set.
Finally, because the question-answer pair sets are screened according to their set similarity, the question-answer background information is generated from the question-answer pair sets most relevant to the user question. The influence of different ways of phrasing the user question on the similarity is taken into account, which improves the accuracy of question-answer pair matching and, in turn, the accuracy of the large language model's answers.
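The grouping and set-level screening performed by modules 230 to 250 can be sketched as below. The formula for the set similarity is not reproduced in this text, so the sketch takes the aggregation as a caller-supplied function; the use of `max` in the example is purely a placeholder, and reading "meets a preset threshold" as "greater than or equal to" is likewise an assumption. All names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# (question, answer, comprehensive similarity) for one question-answer pair.
Pair = Tuple[str, str, float]


def group_by_answer(pairs: Iterable[Pair]) -> Dict[str, List[Pair]]:
    """Put question-answer pairs that share the same answer into one set."""
    groups: Dict[str, List[Pair]] = defaultdict(list)
    for pair in pairs:
        groups[pair[1]].append(pair)
    return dict(groups)


def select_background(
    pairs: Iterable[Pair],
    aggregate: Callable[[List[float]], float],
    threshold: float,
) -> List[Pair]:
    """Return the pairs of every set whose aggregated similarity reaches the threshold."""
    selected: List[Pair] = []
    for members in group_by_answer(pairs).values():
        # The aggregation stands in for the set-similarity formula,
        # which is not given in the text.
        set_similarity = aggregate([m[2] for m in members])
        if set_similarity >= threshold:  # assumption: "meets" means >=
            selected.extend(members)
    return selected


# Made-up comprehensive similarities for illustration only.
scored = [("q1", "a1", 0.9), ("q2", "a1", 0.6), ("q3", "a2", 0.3)]
background = select_background(scored, aggregate=max, threshold=0.5)
```

The selected pairs would then be passed to the large language model as background information alongside the user question.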
In one embodiment, the generation module 210 specifically includes the following elements:
the cutting unit is used for cutting the knowledge corpus information into a plurality of corpus fragments;
and the generalization unit is used for generalizing each corpus fragment into a plurality of question-answer pairs according to the plurality of corpus fragments, wherein the answers of the plurality of question-answer pairs generalized by the same corpus fragment are consistent.
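The cutting and generalization units can be sketched as follows. The source does not specify how the corpus is cut or how questions are produced, so splitting on blank lines, using the fragment itself as the shared answer, and the template-based `template_questions` generator are all assumptions made for illustration; a real system would likely plug in a language model as the question generator.

```python
from typing import Callable, List, Tuple


def cut_corpus(corpus: str) -> List[str]:
    """Cut the corpus into fragments; splitting on blank lines is an assumption."""
    return [frag.strip() for frag in corpus.split("\n\n") if frag.strip()]


def generalize(
    fragments: List[str], ask: Callable[[str], List[str]]
) -> List[Tuple[str, str]]:
    """Turn each fragment into several (question, answer) pairs sharing one answer."""
    pairs: List[Tuple[str, str]] = []
    for fragment in fragments:
        for question in ask(fragment):
            # Assumption: the fragment text itself serves as the common answer.
            pairs.append((question, fragment))
    return pairs


def template_questions(fragment: str) -> List[str]:
    """Hypothetical stand-in for a question generator."""
    topic = fragment.split()[0]
    return [f"What is {topic}?", f"Can you explain {topic}?"]


corpus = "Roaming lets a phone use other networks.\n\nBilling is issued monthly."
qa_pairs = generalize(cut_corpus(corpus), template_questions)
```

Each fragment yields two question-answer pairs here, and both pairs generated from the same fragment carry an identical answer, which is what later allows them to be grouped into one set.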
In one embodiment, the device for generating question-answer background information further includes a vectorization module.
The obtaining module 220 is further configured to obtain configuration information and a user question, where the configuration information includes a similarity adjustment coefficient;
the vectorization module is used for vectorizing the user questions and the question and answer pairs;
the obtaining module 220 specifically includes:
the computing unit is used for respectively computing the question similarity of the question and the user question in the question-answer pair and the answer similarity of the answer and the user question in the question-answer pair according to the similarity adjustment coefficient, the vectorized user question and the vectorized question-answer pair;
and the generating unit is used for obtaining the comprehensive similarity of the question-answer pairs according to the similarity of the questions and the similarity of the answers.
In one embodiment,
the screening module 250 is further configured to screen, in descending order of question similarity and according to the question-answer pair maximum limit value, the similarity minimum limit value, and the question similarities of the plurality of question-answer pairs, a plurality of question-answer pairs whose question similarity is not smaller than the similarity minimum limit value, where the number of screened question-answer pairs is not greater than the question-answer pair maximum limit value;
the obtaining module 220 is further configured to obtain the comprehensive similarity of the question-answer pairs according to the question similarity and answer similarity of the screened question-answer pairs.
In one embodiment, the obtaining module 220 is further configured to obtain the comprehensive similarity between each question and answer pair and the user question according to the following formula 1:
where f(x_i, y_i) is the comprehensive similarity of the i-th question-answer pair, k is the similarity adjustment coefficient, x_i is the question similarity of the i-th question-answer pair, and y_i is the answer similarity of the i-th question-answer pair.
In one embodiment, the calculating module 240 is further configured to calculate the set similarity of the question-answer pair set according to the following formula 2:
where R_Total_i is the set similarity of the i-th question-answer pair set, and R_j is the comprehensive similarity of the j-th question-answer pair in the question-answer pair set.
Fig. 3 shows a schematic hardware structure of a device for generating question-answer background information according to an embodiment of the present application.
The question and answer background information generating device may comprise a processor 301 and a memory 302 storing computer program instructions.
In particular, the processor 301 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
Memory 302 may include mass storage for data or instructions. By way of example, and not limitation, memory 302 may comprise a Hard Disk Drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of the foregoing. Memory 302 may include removable or non-removable (or fixed) media, where appropriate. Memory 302 may be internal or external to the question-answer background information generating device, where appropriate. In a particular embodiment, the memory 302 is a non-volatile solid-state memory.
The memory may include Read Only Memory (ROM), Random Access Memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, or electrical, optical, or other physical/tangible memory storage devices. Thus, in general, the memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software comprising computer-executable instructions, and when the software is executed (e.g., by one or more processors) it is operable to perform the operations described with reference to methods in accordance with aspects of the present disclosure.
The processor 301 reads and executes the computer program instructions stored in the memory 302 to implement the method for generating question-answer background information in any of the above embodiments.
In one example, the question and answer background information generating device may further include a communication interface 303 and a bus 310. As shown in fig. 3, the processor 301, the memory 302, and the communication interface 303 are connected to each other by a bus 310 and perform communication with each other.
The communication interface 303 is mainly used to implement communication between each module, device, unit and/or apparatus in the embodiments of the present application.
Bus 310 includes hardware, software, or both that couple the components of the question-answer background information generating device to each other. By way of example, and not limitation, the buses may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable bus, or a combination of two or more of the above. Bus 310 may include one or more buses, where appropriate. Although embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
In addition, in combination with the method for generating question-answer background information in the above embodiment, the embodiment of the application may be implemented by providing a computer storage medium. The computer storage medium has stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement a method of generating question-answer background information in any of the above embodiments.
It should be clear that the present application is not limited to the particular arrangements and processes described above and illustrated in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present application are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications, and additions, or change the order between steps, after appreciating the spirit of the present application.
The functional blocks shown in the above-described structural block diagrams may be implemented in hardware, software, firmware, or a combination thereof. When implemented in hardware, a functional block may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuitry, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and the like. The code segments may be downloaded via computer networks such as the Internet or an intranet.
It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be different from the order in the embodiments, or several steps may be performed simultaneously.
Aspects of the present disclosure are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a processor may be, but is not limited to being, a general purpose processor, a special purpose processor, an application specific processor, or a field programmable logic circuit. It will also be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware which performs the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the foregoing, only the specific embodiments of the present application are described, and it will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein. It should be understood that the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, which are intended to be included in the scope of the present application.

Claims (10)

1. A method for generating question-answer background information, the method comprising:
generating a plurality of question-answer pairs based on knowledge corpus information, wherein the question-answer pairs comprise questions and answers corresponding to the questions;
for each question-answer pair, calculating the similarity between the question in the question-answer pair and the user question and the similarity between the answer in the question-answer pair and the user question respectively, and obtaining the comprehensive similarity of the question-answer pair according to the similarities;
dividing the question-answer pairs with the same answers into a group to obtain a plurality of groups, and forming a plurality of question-answer pair sets;
determining the set similarity of each question-answer pair set in the question-answer pair set according to the comprehensive similarity of each question-answer pair in the question-answer pair set;
and screening, from the plurality of question-answer pair sets, a target question-answer pair set whose set similarity meets a preset threshold, and determining the question-answer pairs in the target question-answer pair set as question-answer background information for the user question, so that a large language model can answer the user question according to the question-answer background information.
2. The method of claim 1, wherein generating a plurality of question-answer pairs based on knowledge-corpus information comprises:
cutting knowledge corpus information into a plurality of corpus fragments;
and according to the plurality of corpus fragments, generalizing each corpus fragment into a plurality of question-answer pairs, wherein the answers of the plurality of question-answer pairs generalized by the same corpus fragment are consistent.
3. The method of claim 1, wherein before calculating, for each question-answer pair, a similarity between a question in the question-answer pair and a user question and a similarity between an answer in the question-answer pair and a user question, respectively, and obtaining a comprehensive similarity between the question-answer pairs according to the similarity, the method further comprises:
acquiring configuration information and a user question, wherein the configuration information comprises a similarity adjustment coefficient;
vectorizing the user questions and the plurality of question-answer pairs;
and for each question-answer pair, calculating the similarity between the question in the question-answer pair and the user question and the similarity between the answer in the question-answer pair and the user question respectively, and obtaining the comprehensive similarity of the question-answer pairs according to the similarity, wherein the method comprises the following steps:
aiming at each question-answer pair, respectively calculating the similarity of the questions in the question-answer pair and the user questions and the similarity of the answers in the question-answer pair and the user questions according to the similarity adjustment coefficient, the vectorized user questions and the vectorized question-answer pair;
and obtaining the comprehensive similarity of the question-answer pairs according to the similarity of the questions and the similarity of the answers.
4. The method of claim 3, wherein the configuration information further includes a question-answer pair maximum limit and a similarity minimum limit,
and for each question-answer pair, respectively calculating the similarity of the questions in the question-answer pair and the user questions and the similarity of the answers in the question-answer pair and the user questions according to the similarity adjustment coefficient, the vectorized user questions and the vectorized question-answer pair, wherein the method further comprises the following steps:
screening a plurality of question-answer pairs with the question similarity not smaller than the minimum similarity limit according to the maximum question-answer pair limit, the minimum similarity limit and the question similarity of the plurality of question-answer pairs from large to small according to the sequence of the question similarity, wherein the number of the screened plurality of question-answer pairs is not larger than the maximum question-answer pair limit;
and obtaining the comprehensive similarity of the question-answer pairs according to the similarity of the questions and the similarity of the answers, wherein the method comprises the following steps:
and obtaining the comprehensive similarity of the question-answer pairs according to the screened question similarity and answer similarity.
5. The method according to any one of claims 1-4, wherein for each question-answer pair, calculating a similarity between a question in the question-answer pair and a user question and a similarity between an answer in the question-answer pair and a user question, respectively, and obtaining a comprehensive similarity between the question-answer pairs according to the similarity includes:
obtaining the comprehensive similarity between each question-answer pair and the user question according to the following formula:
where f(x_i, y_i) is the comprehensive similarity of the i-th question-answer pair, k is the similarity adjustment coefficient, x_i is the question similarity of the i-th question-answer pair, and y_i is the answer similarity of the i-th question-answer pair.
6. The method of claim 1, wherein the determining the set similarity of each set of question-answer pairs in the set of question-answer pairs based on the integrated similarity of each of the question-answer pairs in the set of question-answer pairs comprises:
calculating the set similarity of the question-answer pair sets according to the following formula:
where R_Total_i is the set similarity of the i-th question-answer pair set, and R_j is the comprehensive similarity of the j-th question-answer pair in the question-answer pair set.
7. A question-answer background information generating device, characterized in that the device comprises:
the generation module is used for generating a plurality of question-answer pairs based on knowledge corpus information, wherein the question-answer pairs comprise questions and answers corresponding to the questions;
the acquisition module is used for respectively calculating the similarity between the questions in the question-answer pair and the user questions and the similarity between the answers in the question-answer pair and the user questions aiming at each question-answer pair, and obtaining the comprehensive similarity of the question-answer pairs according to the similarity;
the grouping module is used for dividing the question-answer pairs with the same answers into a group to obtain a plurality of groups so as to form a plurality of question-answer pair sets;
the computing module is used for determining the set similarity of each question-answer pair set in the question-answer pair set according to the comprehensive similarity of each question-answer pair in the question-answer pair set;
and the screening module is used for screening a target question-answer pair set, the similarity of which meets a preset threshold, from the question-answer pair set, and determining the question-answer pairs in the target question-answer pair set as question-answer background information of the user questions so as to be used for a large language model to answer the user questions according to the question-answer background information.
8. The question-answering background information generating device according to claim 7, wherein the generating module includes:
the cutting unit is used for cutting the knowledge corpus information into a plurality of corpus fragments;
and the generalization unit is used for generalizing each corpus fragment into a plurality of question-answer pairs according to the plurality of corpus fragments, wherein the answers of the plurality of question-answer pairs generalized by the same corpus fragment are consistent.
9. An electronic device, the device comprising: a processor and a memory storing computer program instructions;
the processor, when executing the computer program instructions, implements a method for generating question-answer background information according to any one of claims 1-6.
10. A computer readable storage medium having stored thereon computer program instructions which when executed by a processor implement a method of generating question-answer background information according to any one of claims 1-6.
CN202311533288.4A 2023-11-16 2023-11-16 Method, device, equipment and storage medium for generating question-answer background information Pending CN117609447A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311533288.4A CN117609447A (en) 2023-11-16 2023-11-16 Method, device, equipment and storage medium for generating question-answer background information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311533288.4A CN117609447A (en) 2023-11-16 2023-11-16 Method, device, equipment and storage medium for generating question-answer background information

Publications (1)

Publication Number Publication Date
CN117609447A true CN117609447A (en) 2024-02-27

Family

ID=89957190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311533288.4A Pending CN117609447A (en) 2023-11-16 2023-11-16 Method, device, equipment and storage medium for generating question-answer background information

Country Status (1)

Country Link
CN (1) CN117609447A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination