CN115292459A - Information retrieval method based on question-answering library, question-answering system and computing equipment


Info

Publication number
CN115292459A
Authority
CN
China
Prior art keywords
question
answer
model
matching
training
Prior art date
Legal status
Pending
Application number
CN202210793159.8A
Other languages
Chinese (zh)
Inventor
吴锡坤
战立涛
杨雷
Current Assignee
Chezhi Interconnection Beijing Technology Co ltd
Original Assignee
Chezhi Interconnection Beijing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chezhi Interconnection Beijing Technology Co ltd filed Critical Chezhi Interconnection Beijing Technology Co ltd
Priority to CN202210793159.8A
Publication of CN115292459A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

The disclosure provides an information retrieval method based on a question-answer library, a question-answering system and a computing device, wherein the question-answer library comprises at least one question and, for each question, at least one corresponding answer forming the answer set of that question. The information retrieval method based on the question-answer library comprises the following steps: in response to a user query, determining recalled questions and candidate answer subsets corresponding to the recalled questions from the question-answer library as at least one recalled question-answer pair, wherein each candidate answer subset is drawn from the corresponding answer set; inputting the user query and the at least one question-answer pair into a preset matching model for processing, so as to obtain the matching degree between the user query and each question-answer pair; and ranking the question-answer pairs by matching degree as the retrieval result. The disclosed scheme can significantly improve the accuracy and relevance of question answering and thereby improve answer quality.

Description

Information retrieval method based on question-answering library, question-answering system and computing equipment
Technical Field
The present disclosure relates to the field of computer network technology, and more particularly, to an information retrieval scheme based on a question and answer library.
Background
Question-answering system technology was developed to overcome the shortcomings of traditional search engines. A question-answering system can accurately identify the semantic intent of a query expressed in natural language by a user, and matches the most relevant answer to that intent as the query result. Taking community question answering as an example, a growing number of community question-answering platforms provide online question-and-answer services that help Internet users quickly obtain high-quality answers to everyday or professional questions.
As community question answering has become more popular, problems with these platforms have gradually emerged. One is that answer quality varies widely, and low-quality answers greatly harm the user experience on the platform. Another is that the diversity of natural-language question phrasing makes similar questions difficult to match, so the best answer under a similar question cannot be obtained quickly.
Therefore, how to quickly and effectively match the closest, high-quality answers from the available question-answer resources is a key problem for community question-answering platforms.
Disclosure of Invention
The present disclosure provides an information retrieval method based on a question-answer library, a question-answering system and a computing device, in an attempt to solve, or at least alleviate, at least one of the problems identified above.
According to one aspect of the present disclosure, there is provided an information retrieval method based on a question-answer library, the question-answer library including at least one question and, for each question, at least one corresponding answer forming the answer set of that question, the method including: in response to a user query, determining recalled questions and candidate answer subsets corresponding to the recalled questions from the question-answer library as at least one recalled question-answer pair, wherein each candidate answer subset is drawn from the corresponding answer set; inputting the user query and the at least one question-answer pair into a preset matching model for processing, so as to obtain the matching degree between the user query and each question-answer pair; and ranking the question-answer pairs by matching degree as the retrieval result.
Optionally, the method according to the present disclosure further includes the step of generating the preset matching model. Specifically, user queries, questions and answers are used as training data, and the training data are labeled for each preset task to obtain labeled data; a training model for generating the preset matching model is constructed and its initial parameters are set; the training data are concatenated and input into the training model for processing to obtain a predicted matching degree and predicted categories; and a loss value is calculated from the predicted matching degree, the predicted categories and the labeled data, and the model parameters of the training model are adjusted based on the loss value until training finishes, yielding the preset matching model.
Optionally, in a method according to the present disclosure, the training model comprises a language representation model, a text matching component, and an entity recognition component; and when the training is finished, using the trained language representation model and the trained text matching component as preset matching models.
Optionally, the method according to the present disclosure further includes: for each question in the question-answer library, selecting at least one answer with a high matching degree to that question as the candidate answer subset of the question. Specifically, for each question in the question-answer library, the question is concatenated with each of its corresponding answers to obtain concatenated clauses; each concatenated clause is input into the preset matching model to obtain the matching degree between the corresponding question and answer; and when the matching degree between the question and an answer is higher than a threshold, the answer is taken as a candidate answer of the question.
Optionally, in a method according to the present disclosure, the text matching component includes a fully connected layer and a classifier, and the entity identification component includes a conditional random field network layer.
Optionally, the method according to the present disclosure further comprises the steps of: determining a first number of questions with similarity larger than a preset value by calculating the vector similarity between the user query and each question in a question and answer library; searching questions in the question and answer library by using user query to determine a second number of questions; and performing deduplication processing on the first number of questions and the second number of questions to determine recalled questions.
According to still another aspect of the present disclosure, there is provided a question-answering system including: a question-answer storage device adapted to store at least one question and at least one answer corresponding to each question as a question-answer library, wherein each question has a corresponding answer set; an offline device adapted to train and generate a preset matching model, and further adapted to determine, for each question in the question-answer library, a candidate answer subset of the question using the preset matching model, wherein the candidate answer subset is drawn from the answer set; and an online device adapted to, in response to a user query, determine at least one question and its corresponding candidate answer subset from the question-answer library as a retrieval result based on the preset matching model.
Optionally, in a system according to the present disclosure, the offline device includes: the training unit is suitable for constructing a training model to generate a preset matching model; and the answer selection unit is suitable for respectively selecting at least one answer with high matching degree with the question as a candidate answer subset of the question by utilizing a preset matching model.
Optionally, in a system according to the present disclosure, the online device includes: the system comprises a recall unit, a question database and a question-answer database, wherein the recall unit is suitable for responding to user query and determining a recalled question and a candidate answer subset corresponding to the recalled question from the question-answer database as at least one question-answer pair for recall; and the sorting unit is suitable for inputting the user query and the recalled question-answer pairs into a preset matching model for processing so as to obtain the matching degree of the user query and each question-answer pair, and sorting the question-answer pairs according to the matching degree to be used as a retrieval result.
According to yet another aspect of the present disclosure, there is provided a computing device including: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods above.
According to yet another aspect of the disclosure, there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described above.
In summary, the disclosed scheme constructs a training model based on joint learning and trains the text matching task and the entity recognition task jointly, so that, on one hand, information from the user query, the question and the answer is fused into the semantic vector and the model learns richer semantic information, and, on the other hand, the entity consistency of question-answer matching is improved. In addition, offline answer selection (determining the candidate answer subsets), online question recall and online question-answer ranking together greatly reduce the development effort and maintenance cost.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 illustrates a schematic diagram of a computing device 100, in accordance with some embodiments of the present disclosure;
FIG. 2 illustrates a flow diagram of a question-answer library-based information retrieval method 200 according to some embodiments of the present disclosure;
FIG. 3 illustrates a schematic diagram of a training model according to some embodiments of the present disclosure;
FIG. 4 illustrates a schematic diagram of a question-answering system 400 according to some embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The applicant has found in research that building a high-quality community question-answering system must consider the following two factors simultaneously. First, the similarity between the user query and the questions in the question-answer library: the higher the similarity, the more relevant the answers under a question are to the user query. Second, the answers under a question in the question-answer library vary in quality, and high-quality answers should be provided to the user first.
Around these two points, researchers have proposed a variety of approaches, among which "answer ranking" is an effective way to help users quickly pick high-quality answers out of a list of answers. However, related research focuses mainly on the lexical and syntactic relations between the user query and the question, or between the user query and the answer, and neglects the important role that semantic information plays in answer ranking and in the joint matching of the user query, the question and the answer.
In addition, two main approaches are currently used to address the above problems. One matches the user query directly against answers and retrieves, from the candidate answers in the question-answer library, the answer most semantically relevant to the user query. The other is a two-stage matching-and-ranking approach: the user query is first matched against the questions and the questions are ranked, and the answer list under the most relevant question is then ranked to obtain the answer. For a complex problem, it seems reasonable to decompose it into simple, independent sub-problems, solve them separately, and combine the results into a solution of the original problem. In practice, however, many problems cannot be decomposed into independent sub-problems, and even when they can, the sub-problems are interrelated and linked by shared factors or shared representations. Treating real problems as independent single tasks ignores the rich associations between them.
In view of the above, the present disclosure provides an information retrieval scheme based on a question-answer library. The question-answer library includes at least one question and, for each question, at least one corresponding answer forming the answer set of that question. Information retrieval over the question-answer library is decomposed into two subtasks that are learned jointly. First, data are labeled according to each subtask type and a joint-learning model is trained; second, answer selection over the questions and answers is performed using the model; finally, multi-path recall is performed online, and the user query is matched against and ranked with the question-answer pairs. The scheme makes full use of the knowledge representation and reasoning ability of the joint-learning model and can obtain more accurate, semantically relevant answers to questions.
The information retrieval scheme based on the question and answer library of the embodiment of the disclosure can be executed in one or more computing devices. Fig. 1 is a block diagram of an example computing device 100.
In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, application 122 may be arranged to operate with program data 124 on an operating system. Operating system 120 may be, for example, linux, windows, etc., which includes program instructions for handling basic system services and performing hardware dependent tasks. The application 122 includes program instructions for implementing various user-desired functions, and the application 122 may be, for example, but not limited to, a browser, instant messenger, a software development tool (e.g., an integrated development environment IDE, a compiler, etc.), and the like.
When the computing device 100 is started, the processor 104 reads program instructions of the operating system 120 from the memory 106 and executes them. The application 122 runs on top of the operating system 120, utilizing the operating system 120 and interfaces provided by the underlying hardware to implement various user-desired functions. When the user starts the application 122, the application 122 is loaded into the memory 106, and the processor 104 reads the program instructions of the application 122 from the memory 106 and executes the program instructions.
The computing device 100 also includes a storage device 132, the storage device 132 including removable storage 136 (e.g., CD, DVD, U-disk, removable hard disk, etc.) and non-removable storage 138 (e.g., hard disk drive, HDD, etc.), the removable storage 136 and the non-removable storage 138 each connected to the storage interface bus 134.
Computing device 100 may also include a storage interface bus 134. A storage interface bus 134 enables communication from a storage device 132 (e.g., a removable storage 136 and a non-removable storage 138) to the basic configuration 102 via the bus/interface controller 130. Operating system 120, applications 122, and at least a portion of program data 124 may be stored on removable storage 136 and/or non-removable storage 138, and loaded into system memory 106 via storage interface bus 134 and executed by one or more processors 104 when computing device 100 is powered on or applications 122 are to be executed.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices such as a display or speakers via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. The example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
The network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, and may include any information delivery media, such as carrier waves or other transport mechanisms, in a modulated data signal. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, radio Frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media. In some embodiments, one or more programs are stored in a computer readable medium, the one or more programs including instructions for performing certain methods.
Computing device 100 may be implemented as part of a small-form factor portable (or mobile) electronic device such as a cellular telephone, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application specific device, or a hybrid device that include any of the above functions. Computing device 100 may also be implemented as a personal computer including both desktop and notebook computer configurations. The computing device 100 may also be implemented as a server having the above-described configuration.
In an embodiment according to the present disclosure, the computing device 100 is configured to perform a question-answer-library-based information retrieval method. The application 122 of the computing device 100 includes a plurality of program instructions for executing the question-answer-base-based information retrieval method 200 according to the present disclosure, and the program data 124 may also store related data of a network model for executing the method 200, including but not limited to training data, hyper-parameter information, and the like.
FIG. 2 illustrates a flow diagram of a question-answer library-based information retrieval method 200 according to some embodiments of the present disclosure. As shown in fig. 2, the method 200 begins at step S210.
According to the embodiment of the present disclosure, before step S210 is executed, the method 200 further includes the step of generating the preset matching model. The preset matching model is used in subsequent steps to obtain semantic information from texts such as the user query, the questions and the answers, and to determine the matching degree among the three. Generally, a user query is entered by a user in real time to search for content and may be a word, a sentence, or even a picture or speech. The questions and answers are information pre-stored in the question-answer library; for convenience of description hereinafter, one question and one of its corresponding answers in the question-answer library are denoted as a "question-answer".
The specific steps of training to generate the preset matching model are explained first.
In the first step, user queries, questions and answers are used as training data. Specifically, one user query, one question and one answer form one training sample, and all training samples together constitute the training data.
The training data are then labeled for each preset task to obtain labeled data. According to an embodiment of the present disclosure, the preset tasks (i.e., the two subtasks described above) include a text matching task and an entity recognition task.
The text matching task computes the similarity between the user query and the question-answer, so as to classify the matching degree between them. The entity recognition task identifies entity words in the text, so that entity information is learned into the semantic vector of the text and the model becomes more sensitive to entity words at the semantic level, which in turn benefits the text matching task. Based on these preset tasks, the generated label data include a labeled matching degree and labeled categories, respectively.
Specifically, based on the text matching task, the matching degree among the user query, the question and the answer is labeled as the labeled matching degree, which takes the value 0 or 1: 0 indicates that the user query and the question differ in meaning or that the answer does not answer the user query well, and 1 indicates that the user query and the question have the same meaning and the answer answers the user query well. For the text matching task, the training data may be labeled in the format: user query | question | answer | matching degree. Three examples are shown below.
The Oddi A4 is called as the strongest control of the earth surface in 30W; what is the BMW 320 higher overall configuration, the brand is loud, the price is more and more expensive, the power of the two vehicles is almost the same, and what is the correct one of the two entangled vegetable vehicles is compared? Which is selected for | attz and bmau 3? Both the | atz and bmac 3 lines belong to the well-opened, smooth and foot series. The differences are mainly in price, power, brand and interior. First, the price is calculated, 25 is predicted for BMW 320, 20 is predicted for Atz 2.5, BMW 3 wins dynamically, brand 3 wins, and sound insulation 3 wins. The interior decoration is really dare not to feel like the ones of Atz and Atz ruffian leather, and the enamel with 3-series litchi veins is full of the whole vehicle. As for comfort, I feel that the two cars are half a kilogram and two, and the riding comfort of the back row is very common. The late-stage cost difference is large, the Atz is much cheaper and the oil is much saved, and the starting noise is one level.
||0
Asking for which of the Q5 and the XC60 is better, recommending how to select the I Audi Q5 and the Volvo XC60 respectively has advantages and disadvantages, XC60T5 is low-matched with about 33 ten thousand, the cost performance is very high, Q5L45 is low-matched with about 40 ten thousand, and firstly, three large parts are called, the power of the engine is close, and the durability is good. The gearbox XC60 uses the Ergonomic-to-Atlants, the Q5L is poor, and uses the seven-speed wet double-clutch. The texture of the combined chassis is not very different, Q5L is comfortable, XC60 is hard, but most of the chassis is made of white aluminum alloy. The space Q5L needs to be spacious, the power is strong, the configuration, the material consumption and the safety XC60 are better, and the peculiar smell in the vehicle is less. The user is advised to buy Volvo for bad money and Audi for sufficient money. 1 | l
To ask about, is 17 types of Sagitar 1.4T, what needs to be maintained when running for 3 kilometers, someone know? Is |17 a 1.4T soar to do maintenance over 3 kilometers? After the automobile travels thirty thousand kilometers, | the maintenance should be done as follows. 1. And (4) replacing. After the automobile runs for thirty thousand kilometers, the automobile is stained with very much dust and oil stains, and the performance of the automobile is greatly influenced when the automobile runs for thirty thousand kilometers, so that the automobile is preferably replaced, and the working performance is ensured. Therefore, most manual specifications clearly specify that the air cleaner must be replaced after the automobile has traveled thirty thousand kilometers. 2. And (4) replacing. The principle is very similar to that of an air filter, and when an automobile runs for thirty thousand kilometers, a great amount of impurities are generated, so that the lubricating effect and the heat dissipation function are greatly reduced, and therefore, the air filter is also very necessary to be replaced at regular time. 3. And cleaning an oil way. Typically including combustion chamber oil path purging and engine internal purging. The oil way cleaning is mainly used for cleaning carbon deposition, and the carbon deposition is the deposit generated after the blow-by gas of the lubricating oil and the gasoline is mixed. This is not necessarily done, and is analyzed based on the actual condition of the vehicle, and if the vehicle has been processed in the previous maintenance, this time can be ignored. 1 | l
Based on the entity recognition task, the entity category of each element in each text of the training data is labeled as the labeled category. The entities to be recognized are determined by the specific business requirements or the specific corpus. Taking the automotive industry as an example, entity categories include, but are not limited to: vehicle series name (SERIES), amount of money (PRICE), and parts and components (COMP). In one embodiment, the training data are labeled with the three-tag BIO scheme, which marks each element as "B-X", "I-X" or "O". Here "B-X" indicates that the segment (i.e., entity) containing the element is of type X and the element is at the beginning of the segment, "I-X" indicates that the segment containing the element is of type X and the element is at a non-beginning position (e.g., a middle or end position) of the segment, and "O" indicates that the element does not belong to any entity type.
In the training data, each sentence is treated as one text, and texts are separated by blank lines. Within each text, each element and its labeled category are separated by "\t" (or one or more spaces); the specific format is shown in the example below.
Here, text 1 is "what model is the engine of the 20-model Hanlanda" and text 2 is "recommend a car that can be bought for about 300 thousand".
2O
0O
Money O
Han B-SERIES
Lane I-SERIES
Dai-SERIES
O of (A) to (B)
Hair B-COMP
Kinetic I-COMP
Machine I-COMP
Is O
Assorted O
O
Form O
Number O
3B-PRICE
0I-PRICE
Ten thousand I-PRICE
Left O
Right O
Can be O
O from commercial O
Vehicle O
Push away O
Recommended O
Lower O
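
For illustration only (not part of the original disclosure), the following is a minimal Python sketch of how training files in the BIO format above could be parsed into (element, tag) sequences. The function name, file path and return structure are assumptions made for this example; the file layout it assumes is exactly the one described above: one element and its tag per line, separated by whitespace, with a blank line between texts.

    # Illustrative sketch: parse BIO-labeled text files in the format shown above.
    from typing import List, Tuple

    def read_bio_file(path: str) -> List[List[Tuple[str, str]]]:
        texts, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line.strip():          # a blank line ends the current text
                    if current:
                        texts.append(current)
                        current = []
                    continue
                parts = line.split()          # element and tag, separated by \t or spaces
                if len(parts) >= 2:
                    current.append((parts[0], parts[-1]))
        if current:
            texts.append(current)
        return texts

    # Example usage (hypothetical file name):
    # texts = read_bio_file("ner_train.txt")
    # texts[1] -> [("3", "B-PRICE"), ("0", "I-PRICE"), ...]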
In the second step, a training model for generating the preset matching model is constructed, and the initial parameters of the model are set.
According to one embodiment, the training model includes a language representation model, a text matching component, and an entity recognition component. FIG. 3 illustrates a schematic diagram of a training model according to some embodiments of the present disclosure. Referring to FIG. 3, the text matching component and the entity recognition component are coupled to the language representation model respectively to complete the above subtasks.
In one embodiment, the language representation model is a BERT model, but it is not limited thereto. It should be understood that the fundamental purpose of using a language representation model in the method 200 is to obtain semantic information that accurately expresses the text, thereby ensuring the accuracy of the retrieval results. Therefore, other language representation models, such as RoBERTa, ALBERT or ELECTRA, may also be used for text encoding.
In the third step, the training data are concatenated and input into the training model for processing to obtain the predicted matching degree and the predicted categories.
Specifically, the training data are input into the language representation model for processing to obtain a first semantic vector and a second semantic vector. As shown in FIG. 3, the BERT model takes sentence pairs as input, one sentence being the user query and the other being the concatenation of the question and the answer. The input to the language representation model is written as: [CLS]{user query}[SEP]{question}[SEP]{answer}[SEP]. In one embodiment, when the total input length exceeds 512, the answer portion is truncated.
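
As an illustration of this input layout (an assumption-laden sketch rather than a required implementation), the snippet below builds the concatenated [CLS]/[SEP] sequence with the Hugging Face transformers tokenizer; truncating only the second segment trims the tail of the answer first when the 512-token limit is exceeded. The checkpoint name "bert-base-chinese" and the helper name build_input are illustrative assumptions.

    # Sketch: build [CLS] query [SEP] question [SEP] answer [SEP], truncating the answer side.
    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # assumed checkpoint

    def build_input(query: str, question: str, answer: str, max_len: int = 512):
        # The first segment is the user query; the second segment is question [SEP] answer.
        # truncation="only_second" cuts only the second segment, i.e. the answer tail first.
        return tokenizer(
            query,
            question + "[SEP]" + answer,
            truncation="only_second",
            max_length=max_len,
            return_tensors="pt",
        )

    # enc = build_input(user_query, question, answer)
    # enc["input_ids"].shape -> (1, <=512)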
The language representation model encodes the input text, turning the concatenation of the user query, the question and the answer into semantic vectors of fixed dimensionality. As shown in FIG. 3, the output of the language representation model comprises a first semantic vector V1 and a second semantic vector V2. The first semantic vector V1 is the last-layer Transformer output corresponding to the classification token ([CLS]), and the second semantic vector V2 is the last-layer Transformer output corresponding to the other tokens.
In this way, the input of the language representation model fuses the three texts of user query, question and answer, the encoded vector can learn the semantic information of all three texts at the same time, and the error propagation caused by two-stage matching and ranking is avoided.
The first semantic vector V1 is then input into the text matching component for processing, so as to output the predicted matching degree. In one embodiment, the text matching component includes a fully connected layer and a classifier (e.g., softmax): V1 is fed into the fully connected layer, which is followed by the softmax layer that computes the predicted matching degree.
Meanwhile, the second semantic vector V2 is input into the entity recognition component for processing, so as to output the predicted categories. In one embodiment, the entity recognition component includes a conditional random field (CRF) network layer: V2 is fed into the CRF network layer, which decodes it and outputs the category label with the largest probability, i.e., the predicted category. According to one embodiment of the present disclosure, the form of the predicted category corresponds to the form of the labeled category used in the labeling stage (i.e., the three-tag BIO scheme).
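
The following is a minimal PyTorch sketch of one way such a training model could be assembled: a shared BERT encoder, a fully connected layer with softmax over the [CLS] vector V1 for text matching, and a CRF layer over the remaining token vectors V2 for entity recognition. The class name, the choice of the third-party pytorch-crf package and the checkpoint name are assumptions for illustration; the disclosure does not prescribe a particular implementation.

    # Illustrative joint model: shared BERT encoder + matching head (FC) + entity head (CRF).
    import torch.nn as nn
    from transformers import BertModel
    from torchcrf import CRF   # third-party "pytorch-crf" package (assumed choice)

    class JointMatchNerModel(nn.Module):
        def __init__(self, num_tags: int, pretrained: str = "bert-base-chinese"):
            super().__init__()
            self.bert = BertModel.from_pretrained(pretrained)   # language representation model
            hidden = self.bert.config.hidden_size
            self.match_fc = nn.Linear(hidden, 2)                # text matching component (FC layer)
            self.ner_fc = nn.Linear(hidden, num_tags)           # emission scores for the CRF
            self.crf = CRF(num_tags, batch_first=True)          # entity recognition component

        def forward(self, input_ids, attention_mask, token_type_ids=None):
            out = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
            v1 = out.last_hidden_state[:, 0]    # first semantic vector V1 ([CLS])
            v2 = out.last_hidden_state[:, 1:]   # second semantic vector V2 (other tokens)
            match_logits = self.match_fc(v1)    # softmax is applied in the loss / at inference
            emissions = self.ner_fc(v2)
            return match_logits, emissions, attention_mask[:, 1:]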
In the fourth step, a loss value is calculated from the predicted matching degree, the predicted categories and the labeled data, and the model parameters of the training model are adjusted based on the loss value until training finishes, yielding the preset matching model.
According to some embodiments of the present disclosure, the loss value is calculated in the following manner.
On one hand, a first loss between the predicted matching degree and the labeled matching degree is calculated with a cross-entropy loss function and denoted L_relevance. In one embodiment, the first loss is calculated using the following equation:
L_relevance = -(1/n) * Σ_{i=1..n} [ y_i * log(y'_i) + (1 - y_i) * log(1 - y'_i) ]
where n is the number of samples, y_i is the labeled matching degree (0 or 1), and y'_i is the predicted matching degree.
On the other hand, a second loss between the predicted categories and the labeled categories is calculated with a cross-entropy loss function and denoted L_ner. Since both losses use the cross-entropy loss function, the calculation of the second loss follows that of the first loss and is not repeated here.
Finally, a loss value is obtained based on the first loss and the second loss.
According to one embodiment, the two losses are summed with dynamic weighting, and the loss value L can be expressed by the following formula:
L = ω_relevance * L_relevance + ω_ner * L_ner
where ω_relevance and ω_ner are the weights of the first loss and the second loss, respectively, and are adjusted according to how difficult each task is to learn and how well it is being learned.
The model parameters of the training model are adjusted based on the loss value. In one embodiment, when the loss value is greater than a preset loss threshold, the parameters of the text matching component, the entity recognition component and the BERT model are updated with a backpropagation algorithm and a gradient descent algorithm until training finishes. When training finishes, the trained language representation model and the trained text matching component are used as the preset matching model.
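
Continuing the sketch above, one joint parameter update could look roughly as follows. The matching loss is the cross entropy of the L_relevance formula; the entity loss here is the CRF negative log-likelihood, used as a stand-in for the second cross-entropy term; and omega_relevance / omega_ner mirror the dynamic weights described in the text. All names and the batch layout are assumptions made for this example.

    # Illustrative training step for the joint model sketched earlier.
    import torch.nn.functional as F

    def training_step(model, optimizer, batch,
                      omega_relevance: float = 1.0, omega_ner: float = 1.0):
        match_logits, emissions, token_mask = model(
            batch["input_ids"], batch["attention_mask"], batch.get("token_type_ids"))

        # First loss: cross entropy between predicted and labeled matching degree (0/1).
        loss_relevance = F.cross_entropy(match_logits, batch["match_labels"])

        # Second loss: CRF negative log-likelihood over the BIO tags of the tokens after [CLS].
        loss_ner = -model.crf(emissions, batch["ner_labels"],
                              mask=token_mask.bool(), reduction="mean")

        # Dynamically weighted sum, as in L = w_relevance * L_relevance + w_ner * L_ner.
        loss = omega_relevance * loss_relevance + omega_ner * loss_ner

        optimizer.zero_grad()
        loss.backward()    # backpropagation
        optimizer.step()   # gradient descent update
        return loss.item()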
According to the embodiment of the present disclosure, the two subtasks of text matching and entity recognition are learned jointly by constructing the training model. The text matching subtask matches at the semantic level rather than the keyword level, so the encoding vectors output by the model learn richer semantic information and the model generalizes better. Meanwhile, the entity recognition subtask emphasizes the key entity components in a sentence, which helps improve the entity consistency of question-answer matching and thus significantly improves the accuracy and relevance of the answers.
In step S210, in response to the user query, a recalled question and a candidate answer subset corresponding to the recalled question are determined from the question-answer library as at least one question-answer pair for recall.
According to one embodiment of the disclosure, the recall stage combines vector recall and keyword recall.
On one hand, a first number of questions whose similarity exceeds a preset value is determined by computing the vector similarity between the user query and each question in the question-answer library. For example, a sentence-vector model encodes the user query and the questions into semantic vectors; the similarity of the two semantic vectors (the user query vector and the question vector) is then computed, and the K1 questions with the highest similarity scores are taken as the result of vector recall.
On the other hand, questions in the question-answer library are retrieved with the user query to determine a second number of questions. For example, based on an open-source search engine such as ES, questions are retrieved with the user query and the top K2 questions are taken as the result of keyword recall. Keyword recall may also be performed by extracting keywords from the user query and from the question-answer library, and then computing measures such as co-occurrence, association and similarity between the user query and the keywords in the question-answer library. The specific implementations of vector recall and keyword recall are not limited by the present disclosure.
The first number of questions and the second number of questions are then deduplicated to determine the recalled questions, i.e., the final outcome of the recall phase.
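
For illustration, a multi-path recall of this kind could be sketched as below, assuming a sentence-vector model from the sentence-transformers package and an Elasticsearch index of the questions; the model name, index name, field name and thresholds K1/K2 are placeholders rather than values fixed by the disclosure.

    # Sketch of multi-path recall: vector recall + keyword recall, then deduplication.
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from elasticsearch import Elasticsearch   # assumes an 8.x client and a running ES instance

    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model
    es = Elasticsearch("http://localhost:9200")

    def vector_recall(query, questions, k1=20, min_sim=0.5):
        # In a real system the question vectors would be pre-computed and indexed.
        q_vec = encoder.encode([query], normalize_embeddings=True)[0]
        d_vecs = encoder.encode(questions, normalize_embeddings=True)
        sims = d_vecs @ q_vec                       # cosine similarity (vectors are normalized)
        order = np.argsort(-sims)[:k1]
        return [questions[i] for i in order if sims[i] > min_sim]

    def keyword_recall(query, k2=20, index="qa_questions"):
        resp = es.search(index=index, size=k2, query={"match": {"question": query}})
        return [hit["_source"]["question"] for hit in resp["hits"]["hits"]]

    def recall(query, questions):
        # Deduplicate while preserving order: these are the recalled questions.
        return list(dict.fromkeys(vector_recall(query, questions) + keyword_recall(query)))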
In addition, the candidate answer subset corresponding to each recalled question is drawn from that question's answer set. According to still other embodiments of the present disclosure, the candidate answer subsets may be determined before the recall stage to increase the speed of online recall.
In some embodiments, for each question in the question-answer library, at least one answer with a high matching degree to the question is selected as the candidate answer subset of that question. Specifically, the candidate answer subsets can be determined with the preset matching model. For each question, the question is concatenated with each of its corresponding answers to obtain concatenated clauses. Each concatenated clause is input into the preset matching model to obtain the matching degree between the corresponding question and answer. When the matching degree between the question and an answer is higher than the threshold, the answer is taken as a candidate answer of the question.
Since the user query and the question share the same semantic space, when the candidate answer subsets are determined, the input of the preset matching model is the question-answer and the output is the matching degree between the question and the answer; the higher the matching degree, the better the answer. Answers above the threshold are retained, so subsequent question-answer ranking only needs to consider the good answers under each question.
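
A minimal sketch of this offline answer selection, reusing the illustrative model and tokenizer from the earlier sketches, is shown below. Placing the question in the first (query) segment is one possible reading of "the input of the preset matching model is the question-answer", and the 0.5 threshold is an assumed value rather than one specified by the disclosure.

    # Sketch of offline answer selection: keep answers whose matching degree exceeds a threshold.
    import torch

    @torch.no_grad()
    def select_candidate_answers(model, tokenizer, question, answers,
                                 threshold=0.5, max_len=512):
        candidates = []
        for answer in answers:
            enc = tokenizer(question, answer, truncation="only_second",
                            max_length=max_len, return_tensors="pt")
            match_logits, _, _ = model(enc["input_ids"], enc["attention_mask"],
                                       enc.get("token_type_ids"))
            degree = torch.softmax(match_logits, dim=-1)[0, 1].item()  # matching degree
            if degree > threshold:
                candidates.append((answer, degree))
        # The better-matching answers form the candidate answer subset of this question.
        return [a for a, _ in sorted(candidates, key=lambda x: -x[1])]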
The recall stage uses a multi-path recall strategy and recalls question-answer pairs with high similarity from both the semantic and the keyword perspective, which improves recall and the processing speed of the ranking stage. In addition, using candidate answer subsets determined in advance reduces the amount of data in the ranking stage and thus the online running time; and because this task is performed offline, the system is decoupled and its performance improves.
In step S220, the user query and the recalled question-answer pairs are input into the preset matching model for processing, so as to obtain the matching degree between the user query and each question-answer pair.
As described above, the question-answer pairs used in the matching stage are the questions recalled in the recall stage together with the good answers under them (the candidate answer subsets). The preset matching model performs text matching between the user query and the recalled question-answer pairs: the input data are the concatenated user query-question-answer triples, and the output is the matching degree of each user query-question-answer triple.
Subsequently, in step S230, the question-answer pairs are ranked by matching degree as the retrieval result. In one embodiment, the higher the matching degree, the higher the rank.
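
Putting steps S220 and S230 together, the online ranking could be sketched as follows: each recalled question-answer pair is concatenated with the user query, scored by the preset matching model, and the pairs are sorted by matching degree. The helper names and the model interface are the illustrative ones from the earlier sketches, not an implementation required by the disclosure.

    # Sketch of online ranking (steps S220/S230): score each user query-question-answer
    # triple with the preset matching model, then sort by matching degree.
    import torch

    @torch.no_grad()
    def rank_recalled_pairs(model, tokenizer, query, qa_pairs, max_len=512):
        scored = []
        for question, answer in qa_pairs:
            enc = tokenizer(query, question + "[SEP]" + answer,
                            truncation="only_second", max_length=max_len,
                            return_tensors="pt")
            match_logits, _, _ = model(enc["input_ids"], enc["attention_mask"],
                                       enc.get("token_type_ids"))
            degree = torch.softmax(match_logits, dim=-1)[0, 1].item()
            scored.append((question, answer, degree))
        # The higher the matching degree, the higher the rank.
        return sorted(scored, key=lambda x: -x[2])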
According to the embodiment of the disclosure, the matching process fuses the semantic information of the user query, the question and the answer at the same time, thereby improving the accuracy and controllability of the ranking.
In conclusion, the information retrieval method based on a question-answer library constructs a training model based on joint learning and trains the text matching task and the entity recognition task jointly, so that, on one hand, information from the user query, the question and the answer is fused into the semantic vector and the model learns richer semantic information, and, on the other hand, the entity consistency of question-answer matching is improved. In addition, offline answer selection (determining the candidate answer subsets), online question recall and online question-answer ranking greatly reduce the development effort and maintenance cost.
FIG. 4 illustrates a schematic diagram of a question-answering system 400 according to some embodiments of the present disclosure. As shown in fig. 4, the question-answering system 400 includes: question-answer storage device 410, offline device 420, and online device 430.
The question-answer storage device 410 stores at least one question and at least one answer corresponding to each question as a question-answer library, wherein each question has a corresponding answer set. As shown in FIG. 4, below the question-answer storage device 410, the figure exemplarily shows a question Question_1 in the question-answer library and its m corresponding answers; it should be understood that only one question and its answer set are shown here, and the question-answer library contains a plurality of such questions and answer sets.
The offline device 420 trains and generates a preset matching model. In addition, the offline device 420 may further determine a candidate answer subset of each question in the question-answer library by using a preset matching model, where the candidate answer subset is from the answer set.
According to one embodiment, the offline device 420 includes: a training unit 422 and an answer selection unit 424.
The training unit 422 constructs a training model to generate a preset matching model. For the construction and training process of the training model, reference may be made to the related description in the method 200, and details are not repeated here.
The answer selecting unit 424 uses the preset matching model to select, for each question, at least one answer with a high matching degree to that question as the candidate answer subset of the question. As shown in FIG. 4, below the offline device 420, the figure exemplarily shows offline selection of a candidate answer subset: for Question_1, the n answers with the higher matching degrees (n ≤ m) are selected from its answer set as the candidate answer subset.
It should be noted that, regarding the process of the answer selecting unit 424 selecting the candidate answer subset, reference may be made to the related description of the method 200, and the description thereof is omitted here.
The online device 430 determines at least one question and a corresponding candidate answer subset thereof from the question-answer library as a retrieval result based on a preset matching model in response to the user query.
According to one embodiment, the online device 430 includes a recall unit 432 and a ranking unit 434.
The recall unit 432, in response to the user query, determines recalled questions and the candidate answer subsets corresponding to the recalled questions from the question-answer library as the recalled question-answer pairs. As shown in FIG. 4, in response to the user query, the recalled questions are determined to include at least Question_1 and Question_s (other recalled questions, not shown, are represented by an ellipsis in the figure), and together with the corresponding candidate answer subsets they constitute the recalled question-answer pairs. For the specific implementation of the recall unit 432, reference may be made to the description of step S210 in the method 200, which is not repeated here.
The sorting unit 434 inputs the user query and the recalled question-answer pairs into the preset matching model for processing, so as to obtain the matching degree between the user query and each question-answer pair, and ranks the question-answer pairs by matching degree as the retrieval result. The final ranking shown in FIG. 4 is: Question_1 + Answer_p, ..., Question_s + Answer_1 as the retrieval result. For the specific implementation of the sorting unit 434, reference may be made to the descriptions of step S220 and step S230 in the method 200, which are not repeated here.
The present disclosure thus provides a question-answering system. Through joint learning, entity information is effectively fused into the text vector, and information from the user query and from the questions and answers in the question-answer library is fused into the ranking process at the same time, so that the most relevant answer can be matched in community question answering. The question-answering system has the following characteristics.
1) Joint learning. Joint learning is performed by constructing the two subtasks of text matching and entity recognition. The text matching task matches at the semantic level rather than the keyword level, so the encoding vectors output by the model learn richer semantic information and the model generalizes better. Meanwhile, the entity recognition task emphasizes the key entity components in a sentence and helps improve the entity consistency of question-answer matching, which significantly improves the accuracy and relevance of question answering and thus the answer quality.
2) User query, question and answer information is fused. The input of the constructed training model fuses the three texts of user query, question and answer at the same time, and the encoded vector learns the semantic information of all three texts, which avoids the error propagation caused by two-stage matching and ranking.
3) Answer selection is performed offline. Determining the candidate answer subsets in advance filters out low-quality answers, reduces the amount of data in the ranking stage and reduces the online running time; in addition, because this task is performed offline, the system is decoupled and its performance improves.
4) In the online part, the recall stage uses a multi-path recall strategy and recalls question-answer pairs with high similarity from both the semantic and the keyword perspective, which improves recall and the processing speed of the ranking stage. The ranking stage uses the preset matching model to compute a similarity score (matching degree); the model input is the concatenation of the user query, the question and the answer, and the matching process fuses all three parts of information, which improves the accuracy and controllability of the ranking.
In conclusion, the information retrieval method based on a question-answer library and the question-answering system of the present disclosure greatly improve the ranking effect through a joint-learning model and a three-stage process of offline answer selection, online recall and online ranking. The disclosed scheme answers questions accurately, reliably and efficiently, is suitable for community question-answering scenarios in various vertical industries, and can also be generalized to question answering or search in other scenarios.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to execute the question-answer library-based information retrieval method of the present disclosure according to instructions in the program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with examples of the present disclosure. The required structure for constructing such a system will be apparent from the description above. Moreover, this disclosure is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the present disclosure as described herein, and any descriptions above of specific languages are provided for disclosure of preferred embodiments of the present disclosure.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
The present disclosure also discloses:
A8. The method of any one of A2-7, wherein the step of calculating a loss value by using the prediction matching degree, the prediction category, and the labeled data comprises: calculating, through a cross-entropy loss function, a first loss between the prediction matching degree and the labeled matching degree and a second loss between the prediction category and the labeled category; and obtaining the loss value based on the first loss and the second loss.
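As a minimal sketch of how such a two-part loss might be combined (assuming PyTorch, a binary labeled matching degree, and token-level entity categories, and approximating the conditional random field layer of A9 with a per-token cross entropy; the weighting factor alpha is hypothetical):

```python
import torch.nn.functional as F

def multitask_loss(match_logits, match_labels, entity_logits, entity_labels, alpha=0.5):
    """Combine the text-matching loss (first loss) with the entity-recognition loss (second loss).

    match_logits:  (batch, 2)                prediction matching degree
    match_labels:  (batch,)                  labeled matching degree (0 = mismatch, 1 = match)
    entity_logits: (batch, seq_len, n_tags)  prediction category per token
    entity_labels: (batch, seq_len)          labeled category per token (-100 marks padding)
    alpha:         hypothetical weight balancing the two losses
    """
    first_loss = F.cross_entropy(match_logits, match_labels)
    second_loss = F.cross_entropy(
        entity_logits.reshape(-1, entity_logits.size(-1)),
        entity_labels.reshape(-1),
        ignore_index=-100,
    )
    return alpha * first_loss + (1 - alpha) * second_loss
```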
A9. The method of any one of A2-8, wherein the text matching component comprises a fully connected layer and a classifier, and the entity recognition component comprises a conditional random field network layer.
A10. The method of any one of A1-9, wherein the step of determining recalled questions from the question-answer library in response to a user query comprises: determining a first number of questions whose similarity is greater than a preset value by calculating the vector similarity between the user query and each question in the question-answer library; retrieving questions in the question-answer library by using the user query to determine a second number of questions; and performing deduplication on the first number of questions and the second number of questions to determine the recalled questions.
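A minimal sketch of this two-path recall, assuming precomputed question vectors and using simple term overlap as a stand-in for whatever retrieval engine serves the second path (the similarity threshold and cap are illustrative):

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def recall_questions(query_vec, query_terms, questions, question_vecs,
                     sim_threshold=0.8, top_k=20):
    """questions: list of question strings; question_vecs: aligned list of vectors."""
    # Path 1: questions whose vector similarity to the query exceeds a preset value.
    first = [q for q, v in zip(questions, question_vecs)
             if cosine_sim(query_vec, v) > sim_threshold]
    # Path 2: lexical retrieval (here: any shared term), capped at top_k questions.
    second = [q for q in questions
              if any(t in q for t in query_terms)][:top_k]
    # Deduplicate the two result sets while preserving order.
    seen, recalled = set(), []
    for q in first + second:
        if q not in seen:
            seen.add(q)
            recalled.append(q)
    return recalled
```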
B12. The question-answering system according to B11, wherein the offline device comprises: a training unit adapted to construct a training model so as to generate the preset matching model; and an answer selection unit adapted to select, for each question, at least one answer with a high matching degree with the question as the candidate answer subset of the question by using the preset matching model.
B13. The question-answering system according to B11 or B12, wherein the online device comprises: a recall unit adapted to determine, in response to a user query, a recalled question and a candidate answer subset corresponding to the recalled question from the question-answer library as at least one recalled question-answer pair; and a ranking unit adapted to input the user query and the recalled question-answer pairs into the preset matching model for processing to obtain the matching degree between the user query and each question-answer pair, and to rank the question-answer pairs according to the matching degree as a retrieval result.
Moreover, those of skill in the art will appreciate that although some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the disclosure and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments described herein are described as a method, or as a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the described functions. A processor provided with the necessary instructions for carrying out such a method or method element thus forms a means for carrying out the method or method element. Moreover, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the disclosure.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The description is therefore intended to be illustrative, rather than limiting, of the scope of the disclosure, which is set forth in the following claims.

Claims (10)

1. An information retrieval method based on a question-answer library, wherein the question-answer library comprises at least one question and, for each question, at least one answer serving as an answer set of the question, the method comprising the following steps:
in response to a user query, determining, from the question-answer library, a recalled question and a candidate answer subset corresponding to the recalled question as at least one recalled question-answer pair, wherein the candidate answer subset comes from the answer set;
inputting the user query and the at least one question-answer pair into a preset matching model for processing to obtain the matching degree between the user query and each question-answer pair; and
ranking the question-answer pairs according to the matching degree to serve as a retrieval result.
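Procedurally, claim 1 amounts to a recall-then-rank flow. A minimal sketch under the assumption that a recall function and a scoring wrapper around the preset matching model already exist (both names are hypothetical):

```python
def retrieve(user_query, qa_library, recall_fn, score_fn):
    """recall_fn(user_query, qa_library) returns recalled (question, answer) pairs;
    score_fn(user_query, question, answer) returns a matching degree."""
    recalled_pairs = recall_fn(user_query, qa_library)
    scored = [(q, a, score_fn(user_query, q, a)) for q, a in recalled_pairs]
    # Rank the question-answer pairs by matching degree; the sorted list is the retrieval result.
    return sorted(scored, key=lambda item: item[2], reverse=True)
```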
2. The method of claim 1, further comprising the step of generating the preset matching model:
taking user queries, questions, and answers as training data, and labeling the training data based on preset tasks to obtain labeled data;
constructing a training model for generating the preset matching model, and setting initial parameters of the model;
concatenating the training data and inputting the concatenated training data into the training model for processing to obtain a prediction matching degree and a prediction category; and
calculating a loss value by using the prediction matching degree, the prediction category, and the labeled data, and adjusting model parameters of the training model based on the loss value until training is finished, so as to obtain the preset matching model.
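A minimal training-loop sketch under the assumptions that the training model returns both heads' outputs and that the multitask_loss sketch above is reused (the dataset interface, optimizer, and hyperparameters are hypothetical):

```python
import torch

def train_preset_matching_model(model, data_loader, epochs=3, lr=2e-5):
    """model(inputs) is assumed to return (match_logits, entity_logits);
    each batch carries concatenated inputs plus the labeled matching degree and categories."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in data_loader:
            match_logits, entity_logits = model(batch["inputs"])
            loss = multitask_loss(match_logits, batch["match_labels"],
                                  entity_logits, batch["entity_labels"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Per claim 3, the trained encoder plus the text matching head form the preset matching model.
    return model
```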
3. The method of claim 2, wherein,
the training model comprises a language representation model, a text matching component, and an entity recognition component; and
when training is finished, the trained language representation model and the trained text matching component serve as the preset matching model.
4. The method of claim 3, wherein the step of inputting the concatenated training data into the training model for processing to obtain the prediction matching degree and the prediction category comprises:
inputting the training data into the language representation model for processing to obtain a first semantic vector and a second semantic vector;
inputting the first semantic vector into the text matching component for processing so as to output the prediction matching degree; and
inputting the second semantic vector into the entity recognition component for processing so as to output the prediction category.
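A minimal sketch of such a dual-head forward pass, assuming a Hugging Face BERT encoder as the language representation model, treating the pooled [CLS] vector as the first semantic vector and the token-level hidden states as the second, and using a per-token linear layer as a simplified stand-in for the conditional random field layer mentioned in A9 (all names and sizes are illustrative):

```python
import torch.nn as nn
from transformers import BertModel

class DualHeadMatcher(nn.Module):
    def __init__(self, pretrained="bert-base-chinese", n_entity_tags=9):
        super().__init__()
        self.encoder = BertModel.from_pretrained(pretrained)
        hidden = self.encoder.config.hidden_size
        # Text matching component: fully connected layer plus a binary classifier.
        self.match_head = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                        nn.Linear(hidden, 2))
        # Entity recognition component (linear stand-in for a CRF layer).
        self.entity_head = nn.Linear(hidden, n_entity_tags)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        first_vec = out.last_hidden_state[:, 0]    # [CLS] as the first semantic vector
        second_vec = out.last_hidden_state         # token vectors as the second semantic vector
        match_logits = self.match_head(first_vec)      # prediction matching degree
        entity_logits = self.entity_head(second_vec)   # prediction category per token
        return match_logits, entity_logits
```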
5. The method of any one of claims 1-4, further comprising the step of:
for each question in the question-answer library, selecting at least one answer with a high matching degree with the question as the candidate answer subset of the question.
6. The method according to claim 5, wherein the step of selecting at least one answer with a high matching degree with the question as the candidate answer subset of the question comprises:
for each question in the question-and-answer library,
concatenating the question with each answer corresponding to the question to obtain corresponding concatenated sentences;
inputting each concatenated sentence into the preset matching model to obtain the matching degree between the corresponding question and answer; and
when the matching degree between the question and the answer is higher than a threshold value, taking the answer as a candidate answer of the question.
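A minimal sketch of this offline selection step, assuming a scoring function that wraps the preset matching model and accepts a concatenated question-answer string (the separator token and threshold are illustrative):

```python
def build_candidate_subsets(qa_library, score_fn, threshold=0.5, sep="[SEP]"):
    """qa_library maps each question to its full answer set;
    score_fn takes a concatenated question-answer string and returns a matching degree."""
    candidates = {}
    for question, answers in qa_library.items():
        subset = []
        for answer in answers:
            concatenated = f"{question} {sep} {answer}"
            # Keep the answer as a candidate when its matching degree exceeds the threshold.
            if score_fn(concatenated) > threshold:
                subset.append(answer)
        candidates[question] = subset
    return candidates
```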
7. The method of any one of claims 2-6, wherein the preset tasks include a text matching task and an entity recognition task, and
the step of labeling the training data based on the preset tasks to obtain the labeled data comprises:
labeling, based on the text matching task, the matching degree among the user query, the question, and the answer as the labeled matching degree; and
labeling, based on the entity recognition task, the entity category to which each element of each text in the training data belongs as the labeled category.
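One labeled training example could look like the following; the field names, the tokenization, and the BIO-style entity tags are purely illustrative and not prescribed by the disclosure:

```python
labeled_example = {
    # Text matching task: labeled matching degree among user query, question, and answer.
    "user_query": "How much does Audi A4L maintenance cost",
    "question": "Audi A4L maintenance price",
    "answer": "A basic service is roughly 1,000 yuan",
    "match_label": 1,                                   # 1 = match, 0 = mismatch
    # Entity recognition task: labeled category of each element (token) of the question.
    "question_tokens": ["Audi", "A4L", "maintenance", "price"],
    "entity_labels": ["B-CAR", "I-CAR", "O", "O"],      # hypothetical tag set
}
```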
8. A question-answering system comprising:
a question-answer storage device adapted to store at least one question and at least one answer corresponding to each question as a question-answer library, wherein each question has a corresponding answer set;
an offline device adapted to train and generate a preset matching model, and further adapted to determine, for each question in the question-answer library, a candidate answer subset of the question by using the preset matching model, wherein the candidate answer subset comes from the answer set; and
an online device adapted to determine, in response to a user query and based on the preset matching model, at least one question and its corresponding candidate answer subset from the question-answer library as a retrieval result.
9. A computing device, comprising:
one or more processors;
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the method of any of claims 1-7.
10. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform the method of any of claims 1-7.
CN202210793159.8A 2022-07-05 2022-07-05 Information retrieval method based on question-answering library, question-answering system and computing equipment Pending CN115292459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210793159.8A CN115292459A (en) 2022-07-05 2022-07-05 Information retrieval method based on question-answering library, question-answering system and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210793159.8A CN115292459A (en) 2022-07-05 2022-07-05 Information retrieval method based on question-answering library, question-answering system and computing equipment

Publications (1)

Publication Number Publication Date
CN115292459A true CN115292459A (en) 2022-11-04

Family

ID=83823137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210793159.8A Pending CN115292459A (en) 2022-07-05 2022-07-05 Information retrieval method based on question-answering library, question-answering system and computing equipment

Country Status (1)

Country Link
CN (1) CN115292459A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523031A (en) * 2023-07-05 2023-08-01 深圳须弥云图空间科技有限公司 Training method of language generation model, language generation method and electronic equipment
CN117290694A (en) * 2023-11-24 2023-12-26 北京并行科技股份有限公司 Question-answering system evaluation method, device, computing equipment and storage medium
CN117290694B (en) * 2023-11-24 2024-03-15 北京并行科技股份有限公司 Question-answering system evaluation method, device, computing equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
CN110046240B (en) Target field question-answer pushing method combining keyword retrieval and twin neural network
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN115292459A (en) Information retrieval method based on question-answering library, question-answering system and computing equipment
CN110781306B (en) English text aspect layer emotion classification method and system
CN110298043B (en) Vehicle named entity identification method and system
CN107798140A (en) A kind of conversational system construction method, semantic controlled answer method and device
CN110633373A (en) Automobile public opinion analysis method based on knowledge graph and deep learning
CN111291188B (en) Intelligent information extraction method and system
WO2021204014A1 (en) Model training method and related apparatus
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN110415071A (en) A kind of competing product control methods of automobile based on opining mining analysis
CN116166782A (en) Intelligent question-answering method based on deep learning
CN111552773A (en) Method and system for searching key sentence of question or not in reading and understanding task
CN113626713A (en) Search method, device, equipment and storage medium
Liu et al. Open intent discovery through unsupervised semantic clustering and dependency parsing
CN112084307A (en) Data processing method and device, server and computer readable storage medium
CN111159342A (en) Park text comment emotion scoring method based on machine learning
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN112463944A (en) Retrieval type intelligent question-answering method and device based on multi-model fusion
CN115905487A (en) Document question and answer method, system, electronic equipment and storage medium
CN114416930A (en) Text matching method, system, device and storage medium under search scene
CN117592489A (en) Method and system for realizing electronic commerce commodity information interaction by using large language model
CN112446219A (en) Chinese request text intention analysis method
CN111931516A (en) Text emotion analysis method and system based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination