CN113704421A - Information retrieval method and device, electronic equipment and computer readable storage medium - Google Patents

Information retrieval method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN113704421A
Authority
CN
China
Prior art keywords
short sentence
document
question
participle
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110363234.2A
Other languages
Chinese (zh)
Inventor
王唯康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110363234.2A priority Critical patent/CN113704421A/en
Publication of CN113704421A publication Critical patent/CN113704421A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides an information retrieval method and apparatus, an electronic device and a computer-readable storage medium, relating to the field of artificial intelligence. The method comprises the following steps: dividing a question to obtain at least one question participle, and dividing a target document containing the answer to obtain at least one document participle; determining, based on the at least one question participle and the at least one document participle, at least one document participle representation corresponding to the at least one document participle; dividing the target document into at least one short sentence, and determining, based on the at least one document participle representation, at least one short sentence representation corresponding to the at least one short sentence; and determining, based on the at least one short sentence representation, at least one target short sentence from the at least one short sentence as the answer corresponding to the question. The application addresses the incomplete semantics and missed extraction that arise when long answers are extracted in the prior art, saving the user's time in retrieving information and improving the retrieval experience.

Description

Information retrieval method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an information retrieval method, an information retrieval apparatus, an electronic device, and a computer-readable storage medium.
Background
As information on the internet grows geometrically, search engines have gradually shifted from traditional web-page retrieval to precise question answering: after a user enters a question, the search engine can directly give an answer instead of simply returning multiple web pages.
However, the question-answering function of existing search engines is limited to giving short, entity-type answers. In practical applications, answering questions with entities alone is often insufficient, and many questions cannot be answered with an entity at all.
Existing solutions basically employ BERT-based reading-comprehension models. Such models are mainly oriented to question-answering tasks in the general domain, where answers take the form of short entity fragments. However, in actual application scenarios, the answer form desired by the user is not necessarily an entity. Research and analysis show that in vertical domains (e.g., the medical and legal domains), the complete answer to many questions is a long sentence. Existing BERT reading-comprehension models do not handle such questions well. Specifically, the following two problems arise when extracting a long answer segment with a BERT-based reading-comprehension model:
The extracted answer may be semantically incomplete. For example, for a question asking in which episode Naruto fights Pain, the corresponding paragraph lists the contents of episodes 350 through 353 one by one. The answer extracted by the BERT reading-comprehension model is a fragment such as "Naruto vs. Pain in episode 350" cut off mid-sentence; that is, the BERT extraction model directly truncates the semantically complete unit describing episode 350.
Important answer information is easily missed. For example, for the question "furazolidone dosage according to the instructions", the corresponding paragraph reads: "Furazolidone tablets are a western antibacterial drug, mainly used to treat Helicobacter pylori infection. Adults take 0.1 g orally per dose, 3-4 times a day; children take 5-10 mg per kg of body weight per day, divided into four doses; the course of treatment for intestinal infection is 5-7 days." The answer extracted by the BERT reading-comprehension model is "0.1 g per dose". The BERT extraction model thus captures only the adult dosage and misses the children's dosage.
The above analysis shows that when a relatively long answer segment is extracted with a BERT-based reading-comprehension model, incomplete semantics and missed extraction easily occur, and the user's retrieval experience is relatively poor.
Disclosure of Invention
The application provides an information retrieval method, an information retrieval apparatus, an electronic device and a computer-readable storage medium, which can solve the problems that, when a relatively long answer segment is extracted by a BERT-based reading-comprehension model, the semantics are easily incomplete, important content is easily missed, and the user's retrieval experience is poor. The technical scheme is as follows:
according to an aspect of the present application, there is provided a method of information retrieval, the method comprising:
dividing the question to obtain at least one question participle, and dividing the target document containing the answer to obtain at least one document participle;
determining at least one document participle representation corresponding to the at least one document participle based on the at least one question participle and the at least one document participle;
dividing the target document into at least one short sentence, and determining at least one short sentence representation corresponding to the at least one short sentence based on the at least one document word segmentation representation;
and determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one short sentence characterization.
In one or more embodiments, the dividing the question to obtain at least one question participle, and the dividing the target document including the answer to obtain at least one document participle includes:
dividing the text characters and symbol characters of the question to obtain at least one question text character participle and at least one question symbol character participle, and,
and dividing the literal characters and the symbolic characters of the target document to obtain at least one document literal character participle and at least one document symbolic character participle.
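As a toy illustration of this division into literal-character and symbol-character participles (a minimal sketch only; the patent does not specify a concrete tokenizer, and the regular expressions below are an assumption), the two kinds of segments can be separated as follows:

```python
import re

def split_by_char_type(text):
    # Runs of word characters become "literal character" participles;
    # each punctuation mark becomes a "symbol character" participle.
    literal = re.findall(r"\w+", text)
    symbol = re.findall(r"[^\w\s]", text)
    return literal, symbol

# e.g. split_by_char_type("a, b!") -> (["a", "b"], [",", "!"])
```

A production tokenizer would of course handle decimals, URLs and Chinese segmentation more carefully; this sketch only shows the literal/symbol split the claim describes.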
In one or more embodiments, the determining, based on the at least one question participle and the at least one document participle, at least one document participle representation corresponding to the at least one document participle includes:
splicing the at least one question participle, the at least one document participle and a preset identification symbol to obtain a target sequence;
and performing feature extraction on the target sequence through a pre-training language model to obtain at least one document word segmentation representation corresponding to the at least one document word segmentation.
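The splicing and feature-extraction steps mirror the standard BERT input format. The sketch below is illustrative only: it assumes the familiar [CLS]/[SEP] markers play the role of the "preset identification symbol", and it mocks the pre-trained language model with fixed per-token vectors rather than calling a real encoder:

```python
def build_target_sequence(question_tokens, doc_tokens):
    # Question and document participles joined by preset identifiers.
    return ["[CLS]"] + question_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]

def encode(sequence, dim=4):
    # Stand-in for the pre-trained language model: one vector per position.
    # A real implementation would call a BERT-style encoder here.
    return [[float(len(tok))] * dim for tok in sequence]

question = ["dosage"]
document = ["0.1g", "per", "dose"]
seq = build_target_sequence(question, document)
all_reps = encode(seq)
# Keep only the representations aligned with the document participles
# (skip [CLS], the question, the first [SEP]; drop the trailing [SEP]).
doc_reps = all_reps[len(question) + 2 : -1]
```

In a real system `encode` would be replaced by a pre-trained language model, and `doc_reps` would be the "document participle representations" passed to the later steps.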
In one or more embodiments, the determining, based on the at least one document participle token, at least one phrase token corresponding to the at least one phrase comprises:
determining at least one first target document word segmentation representation included in any short sentence aiming at any short sentence obtained by traversing from the at least one short sentence;
determining at least one first proportion corresponding to the at least one first target document word segmentation representation in any short sentence;
and determining a first short sentence representation of any short sentence based on the at least one first proportion until the traversal is completed, and obtaining at least one first short sentence representation corresponding to the at least one short sentence.
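Read literally, the "first proportions" act as weights over the document participle representations inside a short sentence. The sketch below assumes uniform proportions for illustration; the claim does not fix how the proportions are computed:

```python
def short_sentence_representation(token_reps):
    # One "first proportion" per document participle in the short sentence;
    # uniform weights summing to 1 are assumed here for illustration.
    n = len(token_reps)
    proportions = [1.0 / n] * n
    dim = len(token_reps[0])
    rep = [0.0] * dim
    for weight, vec in zip(proportions, token_reps):
        for j in range(dim):
            rep[j] += weight * vec[j]
    return rep  # the "first short sentence representation"
```

With uniform proportions this reduces to mean pooling of the participle representations; learned or attention-derived proportions would slot into the same place.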
In one or more embodiments, further comprising:
respectively determining sparse attention between any two first short sentence representations in the at least one first short sentence representation to obtain at least one sparse attention response value;
determining at least one second phrase representation corresponding to the at least one phrase based on the at least one sparse attention response value;
the determining, from the at least one phrase, at least one target phrase as an answer corresponding to the question based on the at least one phrase representation includes:
and determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one second short sentence characterization.
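The claims do not fix a sparsity pattern, so the sketch below assumes a simple top-k scheme in which each first short sentence representation attends only to its k most similar peers. This is one plausible reading of "sparse attention", not the patent's definitive construction:

```python
import math

def sparse_attention(reps, k=2):
    # Each first short sentence representation attends only to its k
    # highest-scoring peers (the "sparse attention response values").
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    out = []
    for i, query in enumerate(reps):
        scores = [(dot(query, r), j) for j, r in enumerate(reps) if j != i]
        top = sorted(scores, reverse=True)[:k]   # keep only k responses
        weights = [math.exp(s) for s, _ in top]
        z = sum(weights)
        mixed = [0.0] * len(query)
        for w, (_, j) in zip(weights, top):
            for d in range(len(query)):
                mixed[d] += (w / z) * reps[j][d]
        out.append(mixed)  # the "second short sentence representation"
    return out
```

Restricting each short sentence to a few peers keeps the cost near-linear in the number of short sentences, which matters for long documents.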
In one or more embodiments, the determining, from the at least one phrase, at least one target phrase as an answer to the question based on the at least one phrase token includes:
determining a probability value that each of the at least one short sentence belongs to an answer based on the at least one short sentence representation;
marking the short sentences with the probability values exceeding the probability threshold as positive labels to obtain at least one positive label, and marking the short sentences with the probability values not exceeding the probability threshold as negative labels to obtain at least one negative label;
sequencing the at least one positive label and the at least one negative label based on the position sequence of the at least one short sentence in the target document to obtain a label sequence corresponding to the at least one short sentence;
determining the starting position of the first positive label and the end position of the last positive label in the label sequence;
and taking a continuous short sentence between the starting position and the end position as a final answer.
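The selection rule in this embodiment can be sketched directly: threshold the per-short-sentence probabilities into a positive/negative label sequence, then return every short sentence from the first positive label through the last one, inclusive:

```python
def select_answer_span(short_sentences, probabilities, threshold=0.5):
    # Positive label where the probability exceeds the threshold.
    labels = [p > threshold for p in probabilities]
    if True not in labels:
        return []
    start = labels.index(True)                       # first positive label
    end = len(labels) - 1 - labels[::-1].index(True)  # last positive label
    # Contiguous short sentences between the two positions form the answer.
    return short_sentences[start:end + 1]
```

Note that any negative-labelled short sentence lying between two positives is still kept; this is what allows a multi-part answer (e.g. adult and child dosages) to survive as one contiguous span instead of being truncated.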
In one or more embodiments, the dividing the target document into at least one short sentence includes any of the following:
if the target document comprises a paragraph, dividing the paragraph to obtain at least one short sentence;
and if the target document comprises at least two paragraphs, splicing the at least two paragraphs into a final paragraph, and dividing the final paragraph to obtain at least one short sentence.
According to another aspect of the present application, there is provided an apparatus for information retrieval, the apparatus including:
the system comprises a dividing module, a query processing module and a query processing module, wherein the dividing module is used for dividing a question to obtain at least one question word and dividing a target document containing an answer to obtain at least one document word;
the word segmentation representation module is used for determining at least one document word segmentation representation corresponding to at least one document word segmentation based on the at least one question word segmentation and the at least one document word segmentation;
the phrase representation module is used for dividing the target document into at least one phrase and determining at least one phrase representation corresponding to the at least one phrase based on the at least one document word segmentation representation;
and the answer determining module is used for determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one short sentence representation.
In one or more embodiments, the dividing module is specifically configured to:
dividing the text characters and symbol characters of the question to obtain at least one question text character participle and at least one question symbol character participle, and,
and dividing the literal characters and the symbolic characters of the target document to obtain at least one document literal character participle and at least one document symbolic character participle.
In one or more embodiments, the word segmentation characterization module comprises:
the splicing sub-module is used for splicing the at least one question participle, the at least one document participle and a preset identification symbol to obtain a target sequence;
and the feature extraction submodule is used for extracting features of the target sequence through a pre-training language model to obtain at least one document word segmentation representation corresponding to the at least one document word segmentation.
In one or more embodiments, the phrase characterization module includes:
the first processing submodule is used for determining at least one first target document word segmentation representation included in any short sentence aiming at any short sentence obtained by traversing from the at least one short sentence;
the second processing submodule is used for determining at least one first proportion corresponding to the at least one first target document word segmentation representation in any short sentence;
and the third processing submodule is used for determining the first short sentence representation of any short sentence based on the at least one first proportion until the traversal is completed, so as to obtain at least one first short sentence representation corresponding to the at least one short sentence.
In one or more embodiments, further comprising:
the fourth processing submodule is used for respectively determining sparse attention between any two first short sentence representations in the at least one first short sentence representation to obtain at least one sparse attention response value;
a fifth processing submodule, configured to determine, based on the at least one sparse attention response value, at least one second phrase representation corresponding to the at least one phrase;
a sixth processing sub-module, configured to determine, based on the at least one phrase token, at least one target phrase from the at least one phrase as an answer corresponding to the question, including:
and the seventh processing submodule is used for determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one second short sentence representation.
In one or more embodiments, the answer determination module comprises:
the probability calculation submodule is used for determining, based on the at least one short sentence representation, a probability value that each of the at least one short sentence belongs to the answer;
the marking submodule is used for marking the short sentences of which the probability values exceed the probability threshold as positive labels to obtain at least one positive label, and marking the short sentences of which the probability values do not exceed the probability threshold as negative labels to obtain at least one negative label;
a sequence submodule, configured to rank the at least one positive tag and the at least one negative tag based on a position order of the at least one short sentence in the target document, so as to obtain a tag sequence corresponding to the at least one short sentence;
the position determining submodule is used for determining the starting position of the first positive label and the end position of the last positive label in the label sequence;
and the answer determining submodule is used for taking continuous short sentences from the starting position to the end position as final answers.
In one or more embodiments, the dividing module is specifically configured to:
if the target document comprises a paragraph, dividing the paragraph to obtain at least one short sentence;
and if the target document comprises at least two paragraphs, splicing the at least two paragraphs into a final paragraph, and dividing the final paragraph to obtain at least one short sentence.
According to another aspect of the present application, there is provided an electronic device including:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the operations corresponding to the information retrieval method shown in the first aspect of the application.
According to yet another aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the method of information retrieval shown in the first aspect of the present application.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of any of the aspects described above.
The beneficial effect that technical scheme that this application provided brought is:
in the embodiment of the present invention, a question is divided to obtain at least one question participle, a target document including an answer is divided to obtain at least one document participle, then at least one document participle representation corresponding to the at least one document participle is determined based on the at least one question participle and the at least one document participle, the target document is divided into at least one short sentence, and at least one short sentence representation corresponding to the at least one short sentence is determined based on the at least one document participle representation, that is, at least one target short sentence is determined from the at least one short sentence as the answer corresponding to the question based on the at least one short sentence representation. Thus, the question and the target document are divided respectively to obtain question participles and document participles, the document participle representation is determined based on the question participles and the document participles, short sentence representations are introduced on the basis, namely the short sentence representation of each short sentence in the target document is determined through the document participle representation, and then continuous short sentences are determined as final answers based on the short sentence representations. Therefore, the problems of incomplete semantics and missing extraction in the extraction of long answers in the prior art are solved, the time cost of information retrieval of a user is saved, and the retrieval experience of the user is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic view of an application scenario for executing an information retrieval method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an information retrieval method according to an embodiment of the present application;
fig. 3 is a schematic application diagram of an information retrieval method according to an embodiment of the present application;
fig. 4A is a first schematic flowchart illustrating a partial step of step S203 in an information retrieval method according to an embodiment of the present application;
fig. 4B is a schematic flowchart illustrating a partial step of step S203 in an information retrieval method according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a method for training an information retrieval model in a question-answering system according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an information retrieval apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device for information retrieval according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any combination of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms referred to in this application will first be introduced and explained:
BERT: bidirective Encoder responses from transformations, a pre-trained language model proposed by Google. It emphasizes that instead of pre-training by using a traditional one-way language model or a method of shallow-splicing two one-way language models as in the past, a new Masked Language Model (MLM) is used so as to generate deep two-way language representations.
Short sentence: sentence units separated by commas, periods, question marks and exclamation marks.
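Under this definition, splitting a document into short sentences is a single delimiter-based pass. The sketch below also includes the full-width Chinese forms of the four delimiters, since the source documents in this application are Chinese:

```python
import re

def split_short_sentences(text):
    # Split on commas, periods, question marks and exclamation marks,
    # in both ASCII and full-width Chinese forms; drop empty pieces.
    parts = re.split(r"[,.?!，。？！]", text)
    return [p.strip() for p in parts if p.strip()]
```

Note this naive rule also splits inside decimals such as "0.1 g"; a production splitter would guard against that case.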
Question-answering system (QA): a high-level form of information retrieval system that can answer questions posed by users in natural language with accurate and concise natural language. Research into question answering is driven mainly by people's need to acquire information quickly and accurately, and it is a widely followed research direction with broad prospects in artificial intelligence and natural language processing.
AI: artificial Intelligence, Artificial Intelligence. The method is a comprehensive subject, and relates to the field of wide application, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
NLP: nature Language processing. An important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.
ML: machine Learning. The method is a multi-field cross discipline and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
EM: the proportion of samples in a test set, in which the answers given by the answer system are identical to the standard answers, of the Exact Match is one of the evaluation indexes of the answer system.
P: precision, which refers to how many words in the output of the question-answer model belong to the standard answer.
R: recall, indicating how many words in the answers to question appear in the output of the question-answer model.
F1: the harmonic mean of the two indexes of P and R is one of the evaluation indexes of the question answering system.
As the information on the internet increases in geometric progression, search engines have also gradually transitioned from traditional web page retrieval to the form of precise question and answer, i.e., after a user enters a question, the search engine can directly give answers instead of simply returning to multiple web pages.
However, the question answering function of the search engine is limited to give a short answer of a real body type, but in practical application, answering the question with the real body only has the phenomenon of insufficient answer, and even many questions cannot be answered with the real body.
Existing solutions basically employ BERT-based reading understanding models. The model is mainly oriented to the question-answering tasks in the general field, and answers are in the form of short entity fragments. However, in an actual application scenario, the answer form desired by the user is not necessarily an entity. It has been found through research and analysis that under vertical domains (e.g., medical and legal domains), the complete answer to many questions is a long sentence. For such problems, existing BERT reading understanding models do not handle well. Specifically, the following two problems exist when extracting a long answer segment using the BERT-based reading understanding model:
the extracted answers have the phenomenon of incomplete semantics. For example, for the question "which set is also vs. fortune", the corresponding paragraph is "… men who are also respectively gods at 350 th and 351 th start! Fairy model, payne six at 352, and hero article … "at 353. The answer extracted based on the BERT reading understanding model is "from scratch also vs payne at 350", respectively. As can be seen, the BERT extraction model directly truncates the unit with complete semantics of '350 words';
Important answer information is easily missed. For example, for the question "furazolidone instructions dosage", the corresponding paragraph is "Furazolidone tablets are a western antibacterial drug, mainly used for treating Helicobacter pylori infection. Adults take 0.1 g orally at a time, 3-4 times a day; children take 5-10 mg per kg of body weight per day, divided into four doses; the course of treatment for intestinal infection is 5-7 days." The answer extracted by the BERT-based reading comprehension model is "0.1 g at a time". Thus, the BERT extraction model extracts only the adult dosage and misses the children's dosage.
It can be seen from the above analysis that when a relatively long answer segment is extracted by a BERT-based reading comprehension model, incomplete semantics and missed extraction easily occur, and the retrieval experience of the user is relatively poor.
The application provides an information retrieval method, an information retrieval device, an electronic device and a computer-readable storage medium, which aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
An embodiment of the present invention provides an application scenario for executing an information retrieval method, and referring to fig. 1, the application scenario includes: a first device 101 and a second device 102. The first device 101 and the second device 102 are connected through a network, the first device 101 is an access device, and the second device 102 is an accessed device. The first device 101 may be a terminal and the second device 102 may be a server.
The terminal may have the following features:
(1) on a hardware architecture, a device has a central processing unit, a memory, an input unit and an output unit, that is, the device is often a microcomputer device having a communication function. In addition, various input modes such as a keyboard, a mouse, a touch screen, a microphone, a camera and the like can be provided, and input can be adjusted as required. Meanwhile, the equipment often has a plurality of output modes, such as a telephone receiver, a display screen and the like, and can be adjusted according to needs;
(2) on the software system, the device must have an operating system, such as Windows Mobile, Symbian, Palm, Android, iOS, and the like. Meanwhile, these operating systems are increasingly open, and personalized computer programs developed on these open operating system platforms emerge endlessly, such as address books, schedules, notepads, calculators and various games, meeting the needs of personalized users to a great extent;
(3) in terms of communication capability, the device has flexible access modes and high-bandwidth communication performance, and can automatically adjust the selected communication mode according to the selected service and environment, which is convenient for users. The device may support mobile communication based on 3G (3GPP, 3rd Generation Partnership Project standards), 4G, 5G, LTE (Long Term Evolution) and WiMAX (Worldwide Interoperability for Microwave Access), computer network communication based on the TCP/IP (Transmission Control Protocol/Internet Protocol) and UDP (User Datagram Protocol) protocols, and short-range wireless transmission based on Bluetooth and infrared transmission standards, supporting not only voice services but also various wireless data services;
(4) in the aspect of function use, the equipment focuses more on humanization, individuation and multi-functionalization. With the development of computer technology, devices enter a human-centered mode from a device-centered mode, and the embedded computing, control technology, artificial intelligence technology, biometric authentication technology and the like are integrated, so that the human-oriented purpose is fully embodied. Due to the development of software technology, the equipment can be adjusted and set according to individual requirements, and is more personalized. Meanwhile, the device integrates a plurality of software and hardware, and the function is more and more powerful.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The plurality of servers can also be combined into a blockchain, and the servers are nodes on the blockchain.
It should be noted that, in addition to the above-mentioned remote method, an application scenario of information retrieval may also adopt a local method, for example, a question-answering system in a server is deployed in a terminal, so that a user may still perform information retrieval in the terminal under the condition that the terminal is not connected to a network.
Or, a combination of the two methods is adopted, for example, when the terminal is connected to the network, the information is retrieved through the server, and when the terminal is not connected to the network, the information is retrieved in the terminal, so that the information retrieval can be performed by the user in any application scene. Of course, in practical applications, the setting may be performed according to practical requirements, and the embodiment of the present invention is not limited thereto.
Further, an information retrieval method may be performed in the application environment, as shown in fig. 2, the method including:
step S201, dividing the question to obtain at least one question participle, and dividing the target document containing the answer to obtain at least one document participle;
the question may be a question input by the user in the question-answering system, the target document containing the answer may be a document containing the answer corresponding to the question, and the target document may also be input by the user in the question-answering system. That is, in embodiments of the present invention, a user may enter a question and a target document into a question and answer system simultaneously.
After the questions and the target document are input into the question-answering system, the question-answering system can divide the questions to obtain at least one word segmentation of the questions and record the word segmentation as the question word segmentation, and divide the target document to obtain at least one word segmentation of the document and record the word segmentation as the document word segmentation.
Step S202, determining at least one document participle representation corresponding to at least one document participle based on at least one question participle and at least one document participle;
after obtaining at least one question participle and at least one document participle, each question participle and each document participle may be processed to obtain a document participle representation corresponding to each document participle.
Step S203, dividing the target document into at least one short sentence, and determining at least one short sentence representation corresponding to the at least one short sentence based on the at least one document word segmentation representation;
after at least one document segmentation characterization is obtained, the target document may be divided into at least one short sentence, and then a characterization corresponding to each short sentence is determined based on each document segmentation characterization and is recorded as a short sentence characterization (a specific determination manner is described in detail later). That is, all the document word segmentation tokens in the target document are used for determining all the phrase tokens.
And step S204, determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one short sentence representation.
After the short sentence representation of each short sentence in the target document is determined, at least one target short sentence serving as the answer corresponding to the question is determined from all the short sentences according to the short sentence representations. If there is more than one target short sentence, the target short sentences are continuous in the target document.
Fig. 3 shows a schematic application diagram of an embodiment of the present invention. For example, a complete instruction sheet for furazolidone, denoted "furazolidone instructions", has been entered into the question-answering system. When a user searches for "furazolidone instructions dosage" in a search engine, the search engine retrieves the furazolidone instructions and inputs them into the question-answering system. The question-answering system determines a plurality of short sentences, such as "Furazolidone tablets are a western antibacterial drug, mainly used for treating Helicobacter pylori infection", "adults take 0.1 g orally at a time", "3-4 times a day", "children take 5-10 mg per kg of body weight per day, divided into four doses" and "the course of treatment for intestinal infection is 5-7 days". It then determines the target short sentences "adults take 0.1 g orally at a time, 3-4 times a day; children take 5-10 mg per kg of body weight per day, divided into four doses; the course of treatment for intestinal infection is 5-7 days" from these short sentences, and displays them to the user.
In the embodiment of the invention, a question is divided to obtain at least one question participle, a target document containing an answer is divided to obtain at least one document participle, at least one document participle representation corresponding to the at least one document participle is determined based on the at least one question participle and the at least one document participle, the target document is divided into at least one short sentence, at least one short sentence representation corresponding to the at least one short sentence is determined based on the at least one document participle representation, and then the at least one target short sentence is determined from the at least one short sentence to serve as the answer corresponding to the question based on the at least one short sentence representation. Thus, the question and the target document are divided respectively to obtain question participles and document participles, the document participle representation is determined based on the question participles and the document participles, short sentence representations are introduced on the basis, namely the short sentence representation of each short sentence in the target document is determined through the document participle representation, and then continuous short sentences are determined as final answers based on the short sentence representations. Therefore, the problems of incomplete semantics and missing extraction in the extraction of long answers in the prior art are solved, the time cost of information retrieval of a user is saved, and the retrieval experience of the user is improved.
In another embodiment, the steps of an information retrieval method as shown in fig. 2 are described in detail.
Step S201, dividing the question to obtain at least one question participle, and dividing the target document containing the answer to obtain at least one document participle;
The question may be a question input by the user in the question-answering system, for example, "furazolidone instructions dosage". The target document containing the answer may be a document containing the answer corresponding to the question. For example, if the answer to the question is "adults take 0.1 g orally at a time, 3-4 times a day; children take 5-10 mg per kg of body weight per day, divided into four doses; the course of treatment for intestinal infection is 5-7 days", the target document containing the answer may be "Furazolidone tablets are a western antibacterial drug, mainly used for treating Helicobacter pylori infection. Adults take 0.1 g orally at a time, 3-4 times a day; children take 5-10 mg per kg of body weight per day, divided into four doses; the course of treatment for intestinal infection is 5-7 days." The target document may also be input into the question-answering system by the user. That is, in embodiments of the present invention, the user may input the question and the target document into the question-answering system simultaneously.
After the questions and the target document are input into the question-answering system, the question-answering system can divide the questions to obtain at least one word segmentation of the questions and record the word segmentation as the question word segmentation, and divide the target document to obtain at least one word segmentation of the document and record the word segmentation as the document word segmentation.
In an embodiment of the present invention, the dividing the question to obtain at least one question participle, and the dividing the target document including the answer to obtain at least one document participle includes:
dividing the literal characters and symbolic characters of the question to obtain at least one question literal-character participle and at least one question symbolic-character participle; and
dividing the literal characters and symbolic characters of the target document to obtain at least one document literal-character participle and at least one document symbolic-character participle.
Specifically, the question and the target document may be divided based on NLP; in addition to the literal characters in the question and the target document, the symbolic characters are also divided. The literal characters include but are not limited to Chinese characters and English characters, and the symbolic characters include but are not limited to punctuation marks and graphic symbols. After division, at least one question literal-character participle, at least one question symbolic-character participle, at least one document literal-character participle, and at least one document symbolic-character participle are obtained. That is, a question participle may be a literal character, such as "weather" or "usage", or a symbolic character, such as "?"; the same applies to the document participles.
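A minimal sketch of such a division follows. The text does not prescribe a particular NLP tool or tokenization granularity, so the regular expression and the token classes below (English words, numbers, single Chinese characters, remaining symbols) are illustrative assumptions:

```python
import re

# Hypothetical token classes: English words, numbers, single CJK characters,
# and any remaining non-space character treated as a symbolic-character participle.
_PATTERN = re.compile(
    r"[A-Za-z]+"                      # English literal-character participle
    r"|[0-9]+(?:\.[0-9]+)?"           # numbers (e.g. dosages like 0.1)
    r"|[\u4e00-\u9fff]"               # one Chinese character per participle
    r"|[^\sA-Za-z0-9\u4e00-\u9fff]"   # symbolic-character participle
)

def divide(text: str) -> list[str]:
    # Returns both literal-character and symbolic-character participles in order.
    return _PATTERN.findall(text)

print(divide("usage of 0.1g?"))  # ['usage', 'of', '0.1', 'g', '?']
```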
For example, the question is denoted Q and the target document is denoted P. After division, Q comprises (x_1, x_2, …, x_M) and P comprises (y_1, y_2, …, y_N), where x_i and y_j respectively represent the i-th participle in the question and the j-th participle in the target document, and M and N are the numbers of participles in the question and the target document.
Step S202, determining at least one document participle representation corresponding to at least one document participle based on at least one question participle and at least one document participle;
after obtaining at least one question participle and at least one document participle, each question participle and each document participle may be processed to obtain a document participle representation corresponding to each document participle.
In one embodiment of the present invention, determining at least one document participle representation corresponding to at least one document participle based on at least one question participle and at least one document participle includes:
splicing at least one question participle, at least one document participle and a preset identification symbol to obtain a target sequence;
and performing feature extraction on the target sequence through a pre-training language model to obtain at least one document word segmentation representation corresponding to at least one document word segmentation.
Specifically, each question participle, each document participle, and the preset identifiers may be spliced to obtain the target sequence. The target sequence may thus include the question literal-character participles, the question symbolic-character participles, the document literal-character participles, the document symbolic-character participles, and the identifiers.
After the target sequence is obtained, the target sequence can be input into a pre-training language model BERT, and the BERT performs feature extraction on each participle and the identifier in the target sequence, so as to obtain a document participle representation corresponding to each document participle.
For example, as mentioned above, the target sequence obtained by splicing Q and P may be "[CLS] x_1, …, x_M [SEP] y_1, …, y_N [SEP]", where [CLS] and [SEP] are both identifiers: [CLS] marks the start position of the target sequence, and [SEP] marks the end position of the target sequence and separates the question-participle part from the document-participle part. Of course, symbols other than [CLS] and [SEP] may be used as identifiers; they may be set according to actual requirements in practical applications, and the embodiment of the present invention is not limited in this respect.
The target sequence is then input into BERT, which performs feature extraction on each participle and identifier using formula (1) to obtain a participle representation matrix, denoted H, which comprises the question participle representation corresponding to each question participle, the document participle representation corresponding to each document participle, and the representations of the identifiers.

H = BERT([CLS], x_1, …, x_M, [SEP], y_1, …, y_N, [SEP])   Formula (1)

where H ∈ R^((M+N+3)×d) represents the hidden-layer output of the BERT model, d represents the dimension of the BERT hidden layer, and 3 corresponds to the three identifiers (one [CLS] and two [SEP]).
It should be noted that the default input format of the pre-trained language model BERT is "[CLS] x_1, …, x_M [SEP] y_1, …, y_N [SEP]". Of course, if the default input format of BERT is changed, the format of the target sequence may be changed adaptively, which is not limited in the embodiment of the present invention. Moreover, besides participle representations computed by BERT, other ways of computing participle representations are also applicable to the embodiment of the present invention, which is likewise not limited.
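The target-sequence construction above can be sketched as follows. In practice this token sequence would be fed to a pre-trained BERT whose hidden-layer output yields the (M+N+3)×d representation matrix H; the model call is omitted here, and the example tokens are illustrative:

```python
def build_target_sequence(q_tokens: list[str], d_tokens: list[str]) -> list[str]:
    # "[CLS] x_1 ... x_M [SEP] y_1 ... y_N [SEP]": M + N + 3 positions,
    # the 3 being the identifiers (one [CLS] and two [SEP]).
    return ["[CLS]", *q_tokens, "[SEP]", *d_tokens, "[SEP]"]

seq = build_target_sequence(["furazolidone", "dosage"], ["adults", "take", "0.1g"])
print(len(seq))  # M=2, N=3, so 2 + 3 + 3 = 8
```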
Step S203, dividing the target document into at least one short sentence, and determining at least one short sentence representation corresponding to the at least one short sentence based on the at least one document word segmentation representation;
after at least one document segmentation characterization is obtained, the target document may be divided into at least one short sentence, and then a characterization corresponding to each short sentence is determined based on each document segmentation characterization and is recorded as a short sentence characterization (a specific determination manner is described in detail later). That is, all the document word segmentation tokens in the target document are used for determining all the phrase tokens.
It should be noted that, in determining the short sentence representations, only the document participle representations need be used; the question participle representations and the identifier representations may be omitted, because BERT has already incorporated the features of the question participles and the identifiers into the document participle representations during feature extraction, so they are not needed again when determining the representation of each short sentence in the target document. However, if BERT performed feature extraction using only the document participles, without the question participles and identifiers, the question could not be associated with the target document; simply put, BERT would not know what the question is.
In one embodiment of the present invention, dividing the target document into at least one short sentence includes any one of the following situations:
if the target document comprises a paragraph, dividing the paragraph to obtain at least one short sentence;
and if the target document comprises at least two paragraphs, splicing the at least two paragraphs into a final paragraph, and dividing the final paragraph to obtain at least one short sentence.
Specifically, when the target document is divided, if the target document only has one paragraph, the paragraph is directly divided to obtain at least one short sentence; if the target document includes at least two paragraphs, the paragraphs may be spliced into a final paragraph, and then the final paragraph may be divided into at least one short sentence. In the dividing, the dividing may be performed based on commas, periods, question marks, exclamation marks, and the like, and of course, other dividing manners are also applicable to the embodiment of the present invention.
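The paragraph splicing and punctuation-based division above can be sketched as follows; the exact delimiter set is not prescribed by the text, so the punctuation classes here (ASCII and full-width commas, periods, question and exclamation marks) are illustrative assumptions:

```python
import re

def split_into_short_sentences(paragraphs: list[str]) -> list[str]:
    # Splice all paragraphs into a final paragraph, then split on commas,
    # periods, question marks and exclamation marks (ASCII and full-width),
    # keeping each delimiter attached to its clause.
    final = "".join(paragraphs)
    parts = re.split(r"(?<=[,，.。!！?？])", final)
    return [p for p in parts if p.strip()]

print(split_into_short_sentences(["A is B, used for C.", "Take one tablet once!"]))
# ['A is B,', ' used for C.', 'Take one tablet once!']
```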
It should be noted that, dividing the target document into at least one short sentence may be performed after obtaining at least one document participle representation, or may be performed before the document participle representation, or may be performed simultaneously with the document participle representation, which may be set according to actual requirements in actual applications, and this is not limited in the embodiment of the present invention.
And step S204, determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one short sentence representation.
After the short sentence representation of each short sentence in the target document is determined, at least one target short sentence serving as the answer corresponding to the question is determined from all the short sentences according to the short sentence representations. If there is more than one target short sentence, the target short sentences are continuous. For example, if there are five target short sentences, their positions in the target document are the 3rd to 7th sentences.
In one embodiment of the present invention, determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one short sentence characterization comprises:
determining a probability value of each short sentence belonging to the answer based on the at least one short sentence representation;
marking the short sentences with the probability values exceeding the probability threshold as positive labels to obtain at least one positive label, and marking the short sentences with the probability values not exceeding the probability threshold as negative labels to obtain at least one negative label;
sequencing at least one positive label and at least one negative label based on the position sequence of at least one short sentence in the target document to obtain a label sequence corresponding to at least one short sentence;
determining the starting position of a first positive label and the end position of a last positive label in the label sequence;
and taking continuous short sentences from the starting position to the end position as final answers.
Specifically, after at least one short sentence is obtained through division, whether each short sentence belongs to part of the answer may be marked with a positive tag or a negative tag, where the positive tag "I" indicates that the short sentence belongs to the answer, and the negative tag "O" indicates that it does not. The tag of each short sentence can be estimated using formula (2):

ŷ_i = softmax(W_out · z_i + b_out)   Formula (2)

where ŷ_i indicates whether the estimation result is "I" or "O", W_out and b_out are parameters of the sequence annotation module, z_i is the short sentence representation of short sentence i, and softmax(·) is a probability normalization function.
Further, since the calculation result of the function is a probability value, a probability threshold may be set before estimation, and when the probability value of any short sentence obtained through calculation exceeds the probability threshold, the short sentence may be marked as a positive label, otherwise, the short sentence may be marked as a negative label. For example, if the probability threshold is 0.5 and the computation result of a short sentence is 0.6, the short sentence can be marked as a positive label.
After the labels corresponding to the short sentences are obtained in the above manner, the labels are sequenced according to the position sequence of the short sentences in the target document, so as to obtain the label sequence corresponding to the short sentences, namely the label sequence of the target document.
For example, the target document has 7 phrases in total, and the tags corresponding to the 1 st to 7 th phrases are "O", "I", "O", "I" and "O", respectively, so that the 7 tags are sorted according to the position order of the 7 phrases in the target document, and the tag sequence "OOIOOIO" can be obtained.
Then the start position of the first positive tag and the end position of the last positive tag in the tag sequence are determined, and the continuous short sentences from the start position to the end position are taken as the answer corresponding to the question. For example, in the above example, the start position of the first positive tag is the 3rd short sentence and the end position of the last positive tag is the 6th short sentence, so the 3rd to 6th short sentences are taken as the final answer.
Wherein the final answer may be determined based on the maximum probability of all permutation combinations of labels between the start position and the end position. For example, in the above example, the tag sequence is "OOIOOIO", then all arrangement combinations of tags between the 3rd and 6 th phrases are "IOOI", "IIOI", "IOII", and "IIII", and it is obvious that, in the above four arrangement combinations, the probability of "IIII" as an answer is the largest, so that the 3rd to 6 th phrases are used as final answers, which ensures that the final answers are continuous, and solves the problems of incomplete semantics and missing extraction in the prior art when extracting long answers. Here, the tags of the 3rd and 6 th phrases have already been determined to be "I", so the positive tag is not changed when permutation and combination are performed.
For another example, the target document includes 5 short sentences in total, and the corresponding tag sequence is "IOIOI". All permutation combinations of the tags of the 1st to 5th short sentences (keeping the already determined positive tags fixed) are "IOIOI", "IIIOI", "IOIII" and "IIIII". Obviously, among these four permutation combinations, "IIIII" has the largest probability as the answer, so all 5 short sentences are taken as the final answer.
Instead of using the maximum probability, all short sentences between the start position and the end position may be directly taken as the final answer. For example, after the start position of the first positive tag is determined to be the 3rd short sentence and the end position of the last positive tag is determined to be the 6th short sentence, the 3rd to 6th short sentences are directly taken as the final answer.
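The simpler variant just described (take every short sentence between the first and last positive tag) can be sketched as follows; the threshold value and the function name are illustrative assumptions:

```python
def extract_answer(phrases: list[str], probs: list[float], threshold: float = 0.5) -> list[str]:
    # Tag each short sentence "I" if its estimated probability exceeds the
    # threshold, else "O"; then return the continuous run of short sentences
    # from the first "I" to the last "I".
    labels = ["I" if p > threshold else "O" for p in probs]
    if "I" not in labels:
        return []
    start = labels.index("I")
    end = len(labels) - 1 - labels[::-1].index("I")
    return phrases[start:end + 1]

# Tag sequence "OOIOOIO": the 3rd to 6th short sentences form the answer.
phrases = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
probs = [0.1, 0.2, 0.8, 0.3, 0.4, 0.9, 0.2]
print(extract_answer(phrases, probs))  # ['s3', 's4', 's5', 's6']
```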
Of course, the final answer may also be determined in other manners, and may be adjusted according to actual requirements in actual applications, which is not limited in this embodiment of the present invention.
The probability that each short sentence belongs to the answer is obtained through sequence marking estimation based on each short sentence characterization, positive/negative labels are marked for each short sentence according to the probability to obtain a label sequence, then the starting position of the first positive label and the end position of the last positive label in the label sequence are determined, and continuous short sentences between the starting position and the end position are used as final answers. On the basis of solving the problems of incomplete semantics and missing extraction in the process of extracting long answers in the prior art, the continuity of the final answer is further ensured.
In the embodiment of the present invention, a detailed description is given to the step of "determining at least one phrase token corresponding to at least one phrase based on at least one document participle token" in step S203, as shown in fig. 4A, the step includes:
step S401, for any short sentence obtained by traversing from at least one short sentence, determining at least one first target document word segmentation representation included in the at least one short sentence;
Specifically, after the participle representation corresponding to each document participle is determined, any short sentence p is traversed from the at least one short sentence, and the at least one first target document participle representation included in short sentence p is determined, denoted (e_{p,1}, e_{p,2}, …, e_{p,N}), where e_{p,i} is a first target document participle representation and N is the number of document participles included in short sentence p. For example, if the target document comprises 5 short sentences, when the 3rd short sentence is traversed, the at least one first target document participle representation included in the 3rd short sentence is determined.
Step S402, determining at least one first proportion corresponding to at least one first target document word segmentation representation in any short sentence;
After the at least one first target document word segmentation representation is determined, the first proportion of each document participle in the short sentence can be obtained by calculation of formula (3):

score_{p,i} = softmax(α_{p,1}, ..., α_{p,N})_i, with α_{p,i} = MLP(e_{p,i})    Formula (3)

wherein α_{p,i} is the first weight (value) of the ith document participle in the short sentence p before normalization, MLP represents a multi-layer perceptron, softmax(·) is a probability normalization function, and score_{p,i} represents the first weight (probability value) of the ith document participle in the short sentence p after normalization.
Step S403, determining a first short sentence representation of the short sentence based on the at least one first proportion, until the traversal is completed, so as to obtain at least one first short sentence representation corresponding to the at least one short sentence.
After the first proportion corresponding to each first target document word segmentation representation is determined, the short sentence representation of any short sentence p, denoted v_p, can be obtained by calculation of formula (4):

v_p = ∑_{1≤i≤N} score_{p,i} · e_{p,i}    Formula (4)
And repeating the steps S401 to S403 to obtain the first short sentence representation corresponding to each short sentence.
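Formulas (3) and (4) together amount to attention pooling over the participle representations of one short sentence. The following is a minimal NumPy sketch, with a single linear layer standing in for the MLP (the embodiment may use a deeper perceptron; all names and shapes here are illustrative assumptions):

```python
import numpy as np

def phrase_representation(E, w):
    """E: (N, d) participle representations e_{p,1..N} of one short
    sentence; w: (d,) parameters of a one-layer stand-in for the MLP.
    Returns the pooled short sentence representation v_p."""
    alpha = E @ w                        # unnormalized weights, formula (3)
    score = np.exp(alpha - alpha.max())  # numerically stable softmax
    score = score / score.sum()
    return score @ E                     # weighted sum, formula (4)

rng = np.random.default_rng(0)
E = rng.standard_normal((4, 8))  # 4 document participles, dimension 8
w = rng.standard_normal(8)
v_p = phrase_representation(E, w)  # shape (8,)
```

Because the scores are normalized, v_p is a convex combination of the participle representations, so its dimension equals the participle dimension d regardless of the short sentence length N.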
Correspondingly, the step of determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one short sentence characterization comprises the following steps:
and determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one first short sentence characterization.
Specifically, after determining the first short sentence representations corresponding to the short sentences in the target document, step S204 may be executed according to the first short sentence representations, and for avoiding repetition, details are not described here.
Further, as shown in fig. 4B, the step may further include:
step S404, respectively determining sparse attention between any two first short sentence representations in at least one first short sentence representation to obtain at least one sparse attention response value;
step S405, determining at least one second phrase representation corresponding to the at least one phrase based on the at least one sparse attention response value.
In the process of extracting long answers, research shows that the short sentences inside a long answer have a certain similarity in sentence pattern. For example, for the question "In which episodes does Naruto fight Pain?", the complete long answer covers episodes 350 to 353, listing one episode per short sentence: episode 351 ("It begins! Sage Mode"), episode 352 ("Pain's Six Paths") and episode 353 ("The Tale of a Hero"). These three short sentences have an obvious similarity in form. However, the existing long answer extraction schemes do not make use of this feature, and therefore extraction omissions occur in long answer extraction.
Based on the above analysis, the embodiment of the present invention introduces a sparse attention mechanism for establishing an association relationship between phrases.
Specifically, the sparse attention response value β_{i,j} of the ith short sentence to the jth short sentence can be calculated by formula (5):

β_{i,j} = α-entmax( w_b^T · tanh( W_b · [v_i; v_j] ) )_j    Formula (5)

where tanh is the hyperbolic tangent function, v_i and v_j respectively represent the ith and jth short sentence representations output by the sentence representation layer, w_b and W_b represent weight parameters, and α-entmax(·) is a sparse attention function whose specific calculation is defined by formula (6):
α-entmax(g) = ReLU[ (α−1)g − τ1 ]^{1/(α−1)}    Formula (6)

where ReLU is the linear rectification function, g represents a vector of arbitrary dimension, 1 represents an all-ones vector with the same dimension as g, α is taken as 1.5 in practical use, and τ is a threshold that makes the entries on the right-hand side of formula (6) sum to 1. The attention response values calculated by formula (5) have the following property: the ith short sentence has an attention response value greater than 0 only for some of the short sentences; for example, in the above example, the attention response values among the three short sentences for episode 351 ("It begins! Sage Mode"), episode 352 ("Pain's Six Paths") and episode 353 ("The Tale of a Hero") are greater than 0.
After the sparse attention response values β_{i,j} are obtained, the final representation of the ith short sentence, i.e. the second short sentence representation z_i, is calculated by formula (7):

z_i = ∑_{1≤j≤M} β_{i,j} · v_j    Formula (7)

where M is the number of short sentences in the target document.
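The α-entmax of formula (6) with α = 1.5 can be computed numerically by solving for the threshold τ; the sketch below does this by bisection and then applies the pairwise scoring of formula (5) and the weighted sum of formula (7). The concatenation form [v_i; v_j] and all parameter shapes are assumptions of this illustration, not a definitive implementation of the embodiment.

```python
import numpy as np

def entmax15(g, n_iter=60):
    """alpha-entmax with alpha = 1.5 (formula (6)): finds tau by
    bisection so that ReLU[(alpha-1)g - tau]^(1/(alpha-1)) sums to 1."""
    s = 0.5 * g                       # (alpha - 1) * g
    lo, hi = s.max() - 1.0, s.max()   # sum >= 1 at lo, sum = 0 at hi
    for _ in range(n_iter):
        tau = (lo + hi) / 2
        if (np.maximum(s - tau, 0.0) ** 2).sum() < 1.0:
            hi = tau
        else:
            lo = tau
    p = np.maximum(s - (lo + hi) / 2, 0.0) ** 2
    return p / p.sum()                # guard residual numerical drift

def second_phrase_representations(V, Wb, wb):
    """V: (M, d) first short sentence representations; Wb: (2d, h),
    wb: (h,). Returns the second representations z_i of formula (7)."""
    M = V.shape[0]
    Z = np.zeros_like(V)
    for i in range(M):
        pairs = np.concatenate([np.repeat(V[i:i+1], M, axis=0), V], axis=1)
        scores = np.tanh(pairs @ Wb) @ wb   # w_b^T tanh(W_b [v_i; v_j])
        Z[i] = entmax15(scores) @ V         # z_i = sum_j beta_{i,j} v_j
    return Z
```

Unlike softmax, entmax15 assigns exactly zero weight to clearly dissimilar short sentences, which is the sparsity property described above: each short sentence attends only to the short sentences with similar sentence patterns.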
correspondingly, the step of determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one short sentence characterization comprises the following steps:
and determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one second short sentence characterization.
Specifically, after determining the second short sentence representations corresponding to the short sentences in the target document, step S204 may be executed according to the second short sentence representations, and for avoiding repetition, the description is omitted here.
Further, in the embodiment of the present invention, at least one target phrase may be determined as a final answer based on each first phrase representation, or at least one target phrase may be determined as a final answer based on each second phrase representation. The difference between the two is that a sparse attention mechanism is introduced in the calculation of the second phrase representation.
In order to verify the performance difference between the embodiment of the invention and the existing schemes, researchers tested each scheme using long answer data as the evaluation corpus, where the training, development and test splits of the long answer data contain 9.8k, 2.4k and 2.5k samples respectively. The specific test results are shown in table 1:
Scheme                                            F1      EM
BERT-based extraction                             52.2    19
Long answer extraction without sparse attention   76      36
Long answer extraction with sparse attention      76.2    38

TABLE 1
As can be seen from table 1, the embodiment of the present invention achieves the best effect on long answer extraction, which is significantly better than the BERT-based extraction of the existing scheme. Meanwhile, as can be seen from the last two rows of table 1, the sparse attention mechanism can effectively improve the EM index of long answer extraction.
It should be noted that, in addition to the above method, the phrase representation may also be obtained by other methods, such as a maximum pooling method, an average pooling method, and the like, and may be set according to actual requirements in practical applications, which is not limited in this embodiment of the present invention.
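For reference, the maximum pooling and average pooling alternatives mentioned above reduce the participle representations of one short sentence elementwise; a minimal illustrative sketch:

```python
import numpy as np

# E holds the participle representations of one short sentence, shape (N, d).
E = np.array([[1.0, 4.0],
              [3.0, 2.0]])

v_max = E.max(axis=0)    # max pooling: elementwise maximum → [3.0, 4.0]
v_mean = E.mean(axis=0)  # average pooling: elementwise mean → [2.0, 3.0]
```

Both alternatives are parameter-free, whereas the attention pooling of formulas (3) and (4) learns which participles matter most for the short sentence representation.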
The embodiment of the present invention provides a method for training an information retrieval model in the above question-answering system, as shown in fig. 5, the method includes:
step S501, dividing sample questions to obtain at least one sample question participle, and dividing sample documents containing marked answers to obtain at least one sample document participle;
step S502, determining at least one sample document participle representation corresponding to at least one sample document participle based on at least one sample question participle and at least one sample document participle;
step S503, dividing the sample document into at least one sample short sentence, and determining at least one sample short sentence representation corresponding to the at least one sample short sentence based on the at least one sample document word segmentation representation;
step S504, determining at least one target sample short sentence from the at least one sample short sentence as a final answer corresponding to the question based on the at least one sample short sentence representation;
step S505, calculating by using a loss function in the information retrieval model to obtain an error value based on the final answer and the marked answer;
step S506, updating the information retrieval model by adopting the error value to obtain an updated information retrieval model;
and step S507, taking the updated information retrieval model as the current information retrieval model, and repeatedly executing the step S501 to the step S507 until the minimum value of the loss function is converged to obtain the trained information retrieval model.
Step S501 to step S504 are substantially the same as step S201 to step S204, and are not described herein again to avoid repetition.
Further, since the answer is determined according to the labels of the short sentences, the key is whether the label of each short sentence is correct. After the label of each document short sentence is obtained by calculation according to formula (2), the obtained labels can be compared with the labeled true labels, and an error value between the two can be calculated by using a loss function, for example, the cross entropy shown in formula (8):
L = −∑_i ∑_j y_{i,j} log(ŷ_{i,j})    Formula (8)

wherein y_{i,j} represents the true label of the ith short sentence, j indexes the label classes (positive/negative), and ŷ_{i,j} represents the estimation result of the information retrieval model for the ith short sentence.
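Formula (8) is the standard cross entropy between the labeled and estimated short sentence labels; a NumPy sketch under the assumption that j indexes the two label classes (positive/negative) and that the labels are one-hot:

```python
import numpy as np

def cross_entropy_loss(y_true, y_prob, eps=1e-12):
    """Formula (8): y_true is (num_short_sentences, 2) one-hot true
    labels; y_prob holds the model's estimated class probabilities.
    eps guards against log(0)."""
    return -np.sum(y_true * np.log(np.clip(y_prob, eps, 1.0)))

y_true = np.array([[0.0, 1.0],   # short sentence 1: positive
                   [1.0, 0.0]])  # short sentence 2: negative
y_prob = np.array([[0.2, 0.8],
                   [0.9, 0.1]])
loss = cross_entropy_loss(y_true, y_prob)  # -(ln 0.8 + ln 0.9) ≈ 0.328
```

The error value decreases as the estimated probability of each true label approaches 1, which is what the repeated update steps S505 to S507 drive toward.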
Then the information retrieval model is updated by using the calculated error value to obtain an updated information retrieval model, the updated information retrieval model is taken as the current information retrieval model, and the above steps are repeatedly executed on the sample questions and sample documents until the loss function converges to its minimum, so as to obtain the trained information retrieval model.
Fig. 6 is a schematic structural diagram of an information retrieval apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus of this embodiment may include:
the dividing module 601 is configured to divide the question to obtain at least one question participle, and divide the target document including the answer to obtain at least one document participle;
the word segmentation representation module 602 is configured to determine at least one document word segmentation representation corresponding to the at least one document word segmentation based on the at least one question word segmentation and the at least one document word segmentation;
the phrase representation module 603 is configured to divide the target document into at least one short sentence, and determine at least one short sentence representation corresponding to the at least one short sentence based on the at least one document word segmentation representation;
an answer determining module 604, configured to determine at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one short sentence characterization.
In one or more embodiments, the partitioning module is specifically configured to:
dividing the text characters and symbol characters of the question to obtain at least one question text character participle and at least one question symbol character participle, and,
and dividing the literal characters and the symbolic characters of the target document to obtain at least one document literal character participle and at least one document symbolic character participle.
In one or more embodiments, the word segmentation characterization module includes:
the splicing submodule is used for splicing at least one question participle, at least one document participle and a preset identification symbol to obtain a target sequence;
and the characteristic extraction submodule is used for extracting the characteristics of the target sequence through the pre-training language model to obtain at least one document word segmentation representation corresponding to at least one document word segmentation.
In one or more embodiments, the phrase characterization module includes:
the first processing submodule is used for determining at least one first target document word segmentation representation included in any short sentence aiming at any short sentence obtained by traversing from at least one short sentence;
the second processing submodule is used for determining at least one first proportion corresponding to at least one first target document word segmentation representation in any short sentence;
and the third processing submodule is used for determining the first short sentence representation of any short sentence based on at least one first proportion until traversal is completed, and obtaining at least one first short sentence representation corresponding to at least one short sentence.
In one or more embodiments, further comprising:
the fourth processing submodule is used for respectively determining the sparse attention between any two first short sentence representations in the at least one first short sentence representation to obtain at least one sparse attention response value;
a fifth processing submodule, configured to determine, based on the at least one sparse attention response value, at least one second phrase representation corresponding to the at least one phrase;
a sixth processing submodule, configured to determine, from the at least one short sentence, at least one target short sentence as an answer corresponding to the question based on the at least one short sentence characterization, where the sixth processing submodule includes:
and the seventh processing submodule is used for determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one second short sentence representation.
In one or more embodiments, the answer determination module comprises:
the probability calculation submodule determines the probability value of the answer of each short sentence based on the representation of the short sentences;
the marking submodule is used for marking the short sentences of which the probability values exceed the probability threshold as positive labels to obtain at least one positive label, and marking the short sentences of which the probability values do not exceed the probability threshold as negative labels to obtain at least one negative label;
the sequence submodule is used for sequencing at least one positive label and at least one negative label based on the position sequence of at least one short sentence in the target document to obtain a label sequence corresponding to at least one short sentence;
the position determining submodule is used for determining the starting position of the first positive label and the end position of the last positive label in the label sequence;
and the answer determining submodule is used for taking continuous short sentences from the starting position to the end position as final answers.
In one or more embodiments, the partitioning module is specifically configured to:
if the target document comprises a paragraph, dividing the paragraph to obtain at least one short sentence;
and if the target document comprises at least two paragraphs, splicing the at least two paragraphs into a final paragraph, and dividing the final paragraph to obtain at least one short sentence.
The information retrieval apparatus of the present embodiment can execute the information retrieval method shown in the foregoing embodiments of the present application, and the implementation principles thereof are similar and will not be described herein again.
In the embodiment of the invention, a question is divided to obtain at least one question participle, a target document containing an answer is divided to obtain at least one document participle, at least one document participle representation corresponding to the at least one document participle is determined based on the at least one question participle and the at least one document participle, the target document is divided into at least one short sentence, at least one short sentence representation corresponding to the at least one short sentence is determined based on the at least one document participle representation, and then the at least one target short sentence is determined from the at least one short sentence to serve as the answer corresponding to the question based on the at least one short sentence representation. Thus, the question and the target document are divided respectively to obtain question participles and document participles, the document participle representation is determined based on the question participles and the document participles, short sentence representations are introduced on the basis, namely the short sentence representation of each short sentence in the target document is determined through the document participle representation, and then continuous short sentences are determined as final answers based on the short sentence representations. Therefore, the problems of incomplete semantics and missing extraction in the extraction of long answers in the prior art are solved, the time cost of information retrieval of a user is saved, and the retrieval experience of the user is improved.
The probability that each short sentence belongs to the answer is obtained through sequence marking estimation based on each short sentence characterization, positive/negative labels are marked for each short sentence according to the probability to obtain a label sequence, then the starting position of the first positive label and the end position of the last positive label in the label sequence are determined, and continuous short sentences between the starting position and the end position are used as final answers. On the basis of solving the problems of incomplete semantics and missing extraction in the process of extracting long answers in the prior art, the continuity of the final answer is further ensured.
An embodiment of the present application provides an electronic device, including: a memory and a processor; at least one program stored in the memory for execution by the processor, which when executed by the processor, implements: the method comprises the steps of dividing a question to obtain at least one question participle, dividing a target document containing an answer to obtain at least one document participle, determining at least one document participle representation corresponding to the at least one document participle based on the at least one question participle and the at least one document participle, dividing the target document into at least one short sentence, determining at least one short sentence representation corresponding to the at least one short sentence based on the at least one document participle representation, and determining the at least one target short sentence from the at least one short sentence to serve as the answer corresponding to the question based on the at least one short sentence representation. Thus, the question and the target document are divided respectively to obtain question participles and document participles, the document participle representation is determined based on the question participles and the document participles, short sentence representations are introduced on the basis, namely the short sentence representation of each short sentence in the target document is determined through the document participle representation, and then continuous short sentences are determined as final answers based on the short sentence representations. Therefore, the problems of incomplete semantics and missing extraction in the extraction of long answers in the prior art are solved, the time cost of information retrieval of a user is saved, and the retrieval experience of the user is improved.
In an alternative embodiment, there is provided an electronic device, as shown in fig. 7, where the electronic device 7000 shown in fig. 7 comprises: a processor 7001 and a memory 7003. Wherein the processor 7001 and the memory 7003 are coupled, such as via a bus 7002. Optionally, the electronic device 7000 may further include the transceiver 7004, and the transceiver 7004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. It should be noted that the transceiver 7004 is not limited to one in practical applications, and the structure of the electronic device 7000 does not constitute a limitation to the embodiments of the present application.
The Processor 7001 may be a CPU (Central Processing Unit), a general purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (field programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 7001 may also be a combination implementing computing functionality, e.g., comprising one or more microprocessors, a combination of DSPs and microprocessors, or the like.
Bus 7002 may include a path to transfer information between the above components. The bus 7002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 7002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 7, but this does not indicate that there is only one bus or one type of bus.
The Memory 7003 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 7003 is used for storing computer program codes (computer programs) for executing the present scheme, and the execution is controlled by the processor 7001. The processor 7001 is used to execute computer program code stored in the memory 7003 to implement what is shown in the foregoing method embodiments.
Among them, electronic devices include but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
Embodiments of the present application provide a computer program product containing instructions, which when run on a computer device, cause the computer device to execute the information retrieval method provided by the above-mentioned method embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least a part of the steps in the flowcharts of the figures may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. An information retrieval method, comprising:
dividing the question to obtain at least one question participle, and dividing the target document containing the answer to obtain at least one document participle;
determining at least one document participle representation corresponding to the at least one document participle based on the at least one question participle and the at least one document participle;
dividing the target document into at least one short sentence, and determining at least one short sentence representation corresponding to the at least one short sentence based on the at least one document word segmentation representation;
and determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one short sentence characterization.
2. The information retrieval method of claim 1, wherein the dividing the question to obtain at least one question participle and the dividing the target document containing the answer to obtain at least one document participle comprises:
dividing the text characters and symbol characters of the question to obtain at least one question text character participle and at least one question symbol character participle, and,
and dividing the literal characters and the symbolic characters of the target document to obtain at least one document literal character participle and at least one document symbolic character participle.
3. The information retrieval method according to claim 1 or 2, wherein the determining at least one document participle representation corresponding to the at least one document participle based on the at least one question participle and the at least one document participle comprises:
splicing the at least one question participle, the at least one document participle and a preset identification symbol to obtain a target sequence;
and performing feature extraction on the target sequence through a pre-training language model to obtain at least one document word segmentation representation corresponding to the at least one document word segmentation.
4. The information retrieval method of claim 1, wherein the determining at least one phrase representation corresponding to the at least one phrase based on the at least one document participle representation comprises:
determining at least one first target document word segmentation representation included in any short sentence aiming at any short sentence obtained by traversing from the at least one short sentence;
determining at least one first proportion corresponding to the at least one first target document word segmentation representation in any short sentence;
and determining a first short sentence representation of any short sentence based on the at least one first proportion until the traversal is completed, and obtaining at least one first short sentence representation corresponding to the at least one short sentence.
5. The information retrieval method according to claim 4, further comprising:
respectively determining sparse attention between any two first short sentence representations in the at least one first short sentence representation to obtain at least one sparse attention response value;
determining at least one second phrase representation corresponding to the at least one phrase based on the at least one sparse attention response value;
the determining, from the at least one phrase, at least one target phrase as an answer corresponding to the question based on the at least one phrase representation includes:
and determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one second short sentence characterization.
6. The information retrieval method according to claim 1, 4 or 5, wherein the determining at least one target phrase from the at least one phrase based on the at least one phrase representation as an answer corresponding to the question comprises:
determining a probability value that each of the at least one short sentence belongs to an answer based on the at least one short sentence representation;
marking the short sentences with the probability values exceeding the probability threshold as positive labels to obtain at least one positive label, and marking the short sentences with the probability values not exceeding the probability threshold as negative labels to obtain at least one negative label;
sequencing the at least one positive label and the at least one negative label based on the position sequence of the at least one short sentence in the target document to obtain a label sequence corresponding to the at least one short sentence;
determining the starting position of the first positive label and the end position of the last positive label in the label sequence;
and taking a continuous short sentence between the starting position and the end position as a final answer.
7. The information retrieval method according to claim 1, wherein the dividing the target document into at least one short sentence includes any of:
if the target document comprises a paragraph, dividing the paragraph to obtain at least one short sentence;
and if the target document comprises at least two paragraphs, splicing the at least two paragraphs into a final paragraph, and dividing the final paragraph to obtain at least one short sentence.
8. An information retrieval apparatus, characterized by comprising:
the system comprises a dividing module, a query processing module and a query processing module, wherein the dividing module is used for dividing a question to obtain at least one question word and dividing a target document containing an answer to obtain at least one document word;
the word segmentation representation module is used for determining at least one document word segmentation representation corresponding to at least one document word segmentation based on the at least one question word segmentation and the at least one document word segmentation;
the phrase representation module is used for dividing the target document into at least one phrase and determining at least one phrase representation corresponding to the at least one phrase based on the at least one document word segmentation representation;
and the answer determining module is used for determining at least one target short sentence from the at least one short sentence as an answer corresponding to the question based on the at least one short sentence representation.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: performing the information retrieval method of any one of claims 1 to 7.
10. A computer readable storage medium storing at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the information retrieval method as claimed in any one of claims 1 to 7.
CN202110363234.2A 2021-04-02 2021-04-02 Information retrieval method and device, electronic equipment and computer readable storage medium Pending CN113704421A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110363234.2A CN113704421A (en) 2021-04-02 2021-04-02 Information retrieval method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110363234.2A CN113704421A (en) 2021-04-02 2021-04-02 Information retrieval method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113704421A true CN113704421A (en) 2021-11-26

Family

ID=78647956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110363234.2A Pending CN113704421A (en) 2021-04-02 2021-04-02 Information retrieval method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113704421A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501859A (en) * 2023-06-26 2023-07-28 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field
CN116501859B (en) * 2023-06-26 2023-09-01 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field

Similar Documents

Publication Publication Date Title
CN110059160B (en) End-to-end context-based knowledge base question-answering method and device
CN112131366B (en) Method, device and storage medium for training text classification model and text classification
CN111898374B (en) Text recognition method, device, storage medium and electronic equipment
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN112131883B (en) Language model training method, device, computer equipment and storage medium
CN111930942A (en) Text classification method, language model training method, device and equipment
CN112131881B (en) Information extraction method and device, electronic equipment and storage medium
CN111444715B (en) Entity relationship identification method and device, computer equipment and storage medium
CN111695354A (en) Text question-answering method and device based on named entity and readable storage medium
CN116561538A (en) Question-answer scoring method, question-answer scoring device, electronic equipment and storage medium
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN113707307A (en) Disease analysis method and device, electronic equipment and storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN112085091B (en) Short text matching method, device, equipment and storage medium based on artificial intelligence
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN112071429A (en) Medical automatic question-answering system construction method based on knowledge graph
CN110929532B (en) Data processing method, device, equipment and storage medium
CN110674637B (en) Character relationship recognition model training method, device, equipment and medium
CN113408619B (en) Language model pre-training method and device
CN111062209A (en) Natural language processing model training method and natural language processing model
CN114360715A (en) Constitution identification method and device, electronic equipment and storage medium
CN113704421A (en) Information retrieval method and device, electronic equipment and computer readable storage medium
CN113221531A (en) Multi-model dynamic collaborative semantic matching method
CN115878847B (en) Video guiding method, system, equipment and storage medium based on natural language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination