CN115470313A - Information retrieval and model training method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115470313A
CN115470313A
Authority
CN
China
Prior art keywords
semantic vector
candidate text
text
sample
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210969926.6A
Other languages
Chinese (zh)
Inventor
陈诺
程鸣权
潘秋桐
刘欢
李雅楠
陈坤斌
张楠
何伯磊
和为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210969926.6A priority Critical patent/CN115470313A/en
Publication of CN115470313A publication Critical patent/CN115470313A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 16/3347 Query execution using vector based model
    • G06F 16/338 Presentation of query results
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides an information retrieval and model training method, apparatus, device, and storage medium, and relates to the technical field of artificial intelligence, in particular to the technical fields of intelligent search, natural language processing, deep learning, and the like. The information retrieval method includes the following steps: acquiring at least one candidate text corresponding to a search term; for any candidate text, updating a second semantic vector of a keyword in the candidate text based on a first semantic vector of the candidate text to obtain a third semantic vector; for any candidate text, determining the similarity between the candidate text and the search term based on the third semantic vector corresponding to the candidate text and a fourth semantic vector of the search term; and ranking the at least one candidate text based on the similarity between each candidate text and the search term, and acquiring a search result corresponding to the search term based on the ranked candidate texts. The information retrieval method and apparatus can improve the information retrieval effect.

Description

Information retrieval and model training method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to the technical fields of intelligent search, natural language processing, deep learning, and the like, and specifically to a method, an apparatus, a device, and a storage medium for information retrieval and model training.
Background
In the flow of knowledge inside an enterprise, one goal is to provide unified, simple, and efficient retrieval over the different kinds of knowledge within the enterprise. In such search, the search terms (queries) are usually short and broad, and how to improve the retrieval effect for such broad requirements is a problem to be solved.
Disclosure of Invention
The disclosure provides an information retrieval and model training method, apparatus, device, and storage medium. According to an aspect of the present disclosure, there is provided an information retrieval method, including: acquiring at least one candidate text corresponding to a search term; for any candidate text, updating a second semantic vector of a keyword in the candidate text based on a first semantic vector of the candidate text to obtain a third semantic vector; for any candidate text, determining the similarity between the candidate text and the search term based on the third semantic vector corresponding to the candidate text and a fourth semantic vector of the search term; and ranking the at least one candidate text based on the similarity between each candidate text and the search term, and acquiring a search result corresponding to the search term based on the ranked candidate texts.
According to another aspect of the present disclosure, there is provided a model training method, including: obtaining training data, the training data including: a sample search term, at least one corresponding sample candidate text, and the true similarity between each sample candidate text and the sample search term; for any sample candidate text, encoding the sample candidate text with a coding model to obtain a first semantic vector of the sample candidate text; for any sample candidate text, updating a second semantic vector of the keyword in the sample candidate text based on the first semantic vector of the sample candidate text to obtain a third semantic vector corresponding to the sample candidate text; encoding, with the coding model, at least one text unit in the sample search term respectively to obtain at least one coding vector, and merging the at least one coding vector to obtain a fourth semantic vector of the sample search term; for any sample candidate text, determining the predicted similarity between the sample candidate text and the sample search term based on the third semantic vector corresponding to the sample candidate text and the fourth semantic vector of the sample search term; constructing a loss function based on the predicted similarity and the true similarity between each sample candidate text and the sample search term; and adjusting model parameters of the coding model based on the loss function.
According to another aspect of the present disclosure, there is provided an information retrieval apparatus, including: a first acquisition module configured to acquire at least one candidate text corresponding to a search term; an updating module configured to, for any candidate text, update a second semantic vector of the keyword in the candidate text based on the first semantic vector of the candidate text to obtain a third semantic vector; a determining module configured to, for any candidate text, determine the similarity between the candidate text and the search term based on the third semantic vector corresponding to the candidate text and a fourth semantic vector of the search term; and a ranking module configured to rank the at least one candidate text based on the similarity between each candidate text and the search term, and obtain a search result corresponding to the search term based on the ranked candidate texts.
According to another aspect of the present disclosure, there is provided a model training apparatus, including: an acquisition module configured to acquire training data, the training data including: a sample search term, at least one corresponding sample candidate text, and the true similarity between each sample candidate text and the sample search term; a first coding module configured to, for any sample candidate text, encode the sample candidate text with a coding model to obtain a first semantic vector of the sample candidate text; an updating module configured to, for any sample candidate text, update a second semantic vector of the keyword in the sample candidate text based on the first semantic vector of the sample candidate text to obtain a third semantic vector corresponding to the sample candidate text; a second coding module configured to encode, with the coding model, at least one text unit in the sample search term respectively to obtain at least one coding vector, and merge the at least one coding vector to obtain a fourth semantic vector of the sample search term; a determining module configured to, for any sample candidate text, determine the predicted similarity between the sample candidate text and the sample search term based on the third semantic vector corresponding to the sample candidate text and the fourth semantic vector of the sample search term; and an adjusting module configured to construct a loss function based on the predicted similarity and the true similarity between each sample candidate text and the sample search term, and adjust model parameters of the coding model based on the loss function.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above aspects.
According to the technical scheme of the disclosure, the information retrieval effect can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an application scenario suitable for use in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic view of an overall framework provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of matching text stored in a database provided in accordance with an embodiment of the present disclosure;
FIG. 5 is a schematic illustration according to a second embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a ranking model provided in accordance with an embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 9 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 10 is a schematic diagram of an electronic device for implementing an information retrieval method or a model training method according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure, which provides an information retrieval method, including:
Step 101, acquiring at least one candidate text corresponding to the search term.
Step 102, for any candidate text, updating a second semantic vector of the keyword in the candidate text based on a first semantic vector of the candidate text to obtain a third semantic vector.
Step 103, for any candidate text, determining the similarity between the candidate text and the search term based on the third semantic vector corresponding to the candidate text and a fourth semantic vector of the search term.
Step 104, ranking the at least one candidate text based on the similarity between each candidate text and the search term, and acquiring a search result corresponding to the search term based on the ranked candidate texts.
The information retrieval process may include an offline stage and an online stage, and the online stage may include a recall stage and a ranking stage. The recall stage mainly matches the search term against the texts in the database; these texts may be referred to as matching texts. After the matching process, a first preset number (for example, 300) of matching texts may be recalled, and these may be referred to as candidate texts. There are usually multiple candidate texts, such as the 300 candidate texts described above.
The semantic vector of the candidate text may be referred to as a first semantic vector.
The candidate text may include a keyword, where the keyword refers to the text whose similarity with the search term (query) is to be calculated.
For example, each candidate text includes a question-answer pair, each question-answer pair including a question and an answer, wherein the question may be a keyword.
The initial semantic vector for the keyword may be referred to as a second semantic vector and the updated semantic vector for the keyword may be referred to as a third semantic vector.
The semantic vector of the term may be referred to as a fourth semantic vector.
The first semantic vector and the fourth semantic vector may be obtained by using a pre-trained encoder (encoder).
For example, for any candidate text, the candidate text is input into a first encoder for encoding, and the first encoder outputs the first semantic vector, which may be denoted [h_1, h_2, ..., h_N], where N is the length of the candidate text, i.e., the number of text units (tokens) it contains; a token may be a character or a word, selected according to actual requirements. Each h_i (i = 1, 2, ..., N) may be a row vector of dimension M. N and M are both positive integers, e.g., N = 512, M = 768.
Similarly, the search term is input into a second encoder for encoding. The fourth semantic vector may be obtained by merging the coding vectors output by the second encoder, for example by averaging them and then applying a set activation function.
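The merging step can be sketched as follows; the use of tanh as the activation is an assumption, since the passage only says a "set activation function" is applied after averaging:

```python
import math

def merge_token_vectors(token_vecs):
    """Merge per-token coding vectors into one semantic vector:
    element-wise average, then an activation function (tanh here, as a
    stand-in for the unspecified 'set activation function')."""
    dim = len(token_vecs[0])
    mean = [sum(vec[d] for vec in token_vecs) / len(token_vecs) for d in range(dim)]
    return [math.tanh(x) for x in mean]

# Two hypothetical 2-dimensional token vectors for a two-token search term.
fourth_vec = merge_token_vectors([[1.0, 0.0], [0.0, 1.0]])
```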
The first encoder and the second encoder may be two separate encoders; alternatively, they may be the same encoder, or two encoders sharing parameters.
The second semantic vector may be obtained in an offline stage, and the third semantic vector is obtained by updating the second semantic vector with the first semantic vector.
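This passage does not spell out the update rule. One plausible reading, consistent with the modified poly-encoder mentioned later, is attention pooling: the keyword's second semantic vector attends over the candidate text's token vectors, and the attention-weighted sum is taken as the third semantic vector. A minimal sketch under that assumption (dimensions reduced for illustration):

```python
import math

def attention_update(keyword_vec, token_vecs):
    """Hypothetical update rule: the keyword's second semantic vector attends
    over the candidate text's token vectors h_1..h_N (the first semantic
    vector); the attention-weighted sum is returned as the third semantic
    vector."""
    # Dot-product attention scores between the keyword vector and each h_i.
    scores = [sum(k * h for k, h in zip(keyword_vec, vec)) for vec in token_vecs]
    peak = max(scores)
    exp_scores = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    dim = len(keyword_vec)
    return [sum(w * vec[d] for w, vec in zip(weights, token_vecs)) for d in range(dim)]

third_vec = attention_update([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The result has the same dimension as the keyword vector, consistent with the 1×M dimensions stated below.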
The dimensions of the second semantic vector, the third semantic vector, and the fourth semantic vector are all 1×M.
The similarity between each candidate text and the search term is determined based on the third semantic vector and the fourth semantic vector. The similarity may be expressed as a similarity score; for example, the dot product of the third semantic vector and the fourth semantic vector may be computed and used as the similarity score. The greater the similarity score, the greater the similarity between the corresponding candidate text and the search term.
After the similarity scores are obtained, the candidate texts may be ranked in descending order of similarity score to obtain ranked candidate texts.
After the ranked candidate texts are obtained, a second preset number (for example, 2 to 5) of candidate texts may be selected in order, and the search result is given based on the selected candidate texts. For example, each candidate text may include a question and an answer; the question alone, or the question together with the answer, may be used as the search result.
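The scoring, ranking, and selection steps above can be sketched end to end; the vectors and questions are illustrative stand-ins:

```python
def rank_candidates(query_vec, candidates, top_n=2):
    """Score each candidate by the dot product of its third semantic vector
    with the query's fourth semantic vector, sort in descending order, and
    return the questions of the top-n candidates as search results."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scored = sorted(candidates, key=lambda c: dot(c["vector"], query_vec), reverse=True)
    return [c["question"] for c in scored[:top_n]]

# Hypothetical candidates with 2-dimensional third semantic vectors.
candidates = [
    {"question": "How do I file a weekly report?", "vector": [0.9, 0.2]},
    {"question": "Where is the 2021 financial report?", "vector": [0.1, 0.8]},
    {"question": "How do I book a meeting room?", "vector": [0.2, 0.1]},
]
results = rank_candidates([1.0, 0.0], candidates)
```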
This embodiment can be applied to broad-requirement search scenarios in a specific domain (such as the interior of an enterprise), in which the search term (query) is generally short and broad (for example, "weekly report"), while the candidate texts are generally longer and contain more information.
In this embodiment, for any candidate text, the second semantic vector of the keyword in the candidate text is updated based on the first semantic vector of the candidate text to obtain the third semantic vector. Because the candidate text contains more information, the resulting third semantic vector has better representation capability; determining the similarity based on the third semantic vector therefore improves the accuracy of the retrieval result. In addition, because the similarity between each candidate text and the search term is determined directly from the third semantic vector and the fourth semantic vector, the search result can be obtained relatively quickly, which improves retrieval efficiency. With both good accuracy and good efficiency, the information retrieval effect can be improved.
For a better understanding of the embodiments of the present disclosure, the application scenarios to which they are applicable are described below. The embodiments can be applied to broad-requirement retrieval scenarios in a specific domain (such as the interior of an enterprise), where a broad requirement means that the question (query) input by the user is relatively general and the need is not clearly specified.
As shown in fig. 2, a user may input a search term (query) into the client of a search engine. The client may be deployed on a user terminal 201, which may be a personal computer, a notebook computer, a mobile device (such as a mobile phone), or the like. The client sends the search term to the server side, which may be deployed on the server 202; the server may be a local server or a cloud server. The server side performs retrieval based on the search term to obtain search results, and sends the search results to the client for display.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
The whole retrieval process may include an offline stage and an online stage. The offline stage mainly generates the data to be stored in the database, and the online stage mainly obtains, from the stored data, the search results related to the search term.
As shown in FIG. 3, the offline phase mainly includes an offline module 301, and the online phase mainly includes an online recall module 302 and an online ranking module 303.
For the offline module 301, the data stored in the database is usually existing question-answer pairs. In this embodiment, the stored data includes not only the question-answer pairs described above but also trigger words; in addition, it includes not only the texts but also the related semantic vectors.
The existing question-answer pairs are question-answer pairs collected in advance, usually question-answer pairs for Frequently Asked Questions (FAQ).
Each question-answer pair includes a question and an answer, and may also include other information such as department and region; the question may be called a standard question.
In a specific domain (such as the interior of an enterprise), the number of question-answer pairs in the database is small compared with the open domain, so the recall rate is low. To mitigate this, trigger words corresponding to the standard questions are introduced.
For example, if the standard question is "financial report for each quarter of 2021", its corresponding trigger words may include: "financial report", "2021 annual financial report", and the like.
As shown in fig. 4, multiple texts may be stored in the database. The stored texts may be referred to as matching texts, and each matching text may include multiple fields, such as: a sequence number field, a standard question field, an answer field, a region field, a department field, a trigger word field, a popularity field, and the like.
The standard question field records the standard question, such as "financial report for each quarter of 2021"; the answer field records the answer corresponding to the standard question; the region field records region information, and the department field records department information; the trigger word field records the trigger words corresponding to the standard question, such as "financial report" and "2021 annual financial report"; the popularity field records the popularity score of the matching text, a value in [0, 1], such as "0.6". The popularity score may be computed from the retrieval frequency of the standard question in the matching text: the higher the retrieval frequency, the higher the popularity score. The retrieval frequency can be obtained from the user behaviors recorded in the logs, and the formula relating retrieval frequency to popularity score can be set according to actual requirements.
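A matching-text record with the fields described above might look like the following; all field values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class MatchingText:
    """One stored matching text, mirroring the fields described in the text."""
    seq_no: int
    standard_question: str
    answer: str
    region: str
    department: str
    trigger_words: list
    popularity: float  # in [0, 1], derived from retrieval frequency

record = MatchingText(
    seq_no=1,
    standard_question="Financial report for each quarter of 2021",
    answer="See the finance portal for the quarterly filings.",
    region="Beijing",
    department="Finance",
    trigger_words=["financial report", "2021 annual financial report"],
    popularity=0.6,
)
```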
The matching texts stored in the database may be obtained by cleaning the existing question-answer pairs. For example, as shown in fig. 3, the offline cleaning module cleans the existing question-answer pairs, and the question-answer pairs retained after cleaning may be called admitted question-answer pairs. In addition, the offline cleaning module can compute a popularity score based on the retrieval frequency of the standard questions in the admitted question-answer pairs.
The trigger word module may obtain trigger words based on the standard questions in the admitted question-answer pairs. The admitted question-answer pairs, the trigger words, and the popularity scores then form a matching text, which is recorded in the database.
Therefore, the matching texts recorded in the database include not only the standard question texts but also the trigger word texts.
The standard question and trigger words may be collectively referred to as keywords.
In addition, the database records not only the matching text, but also the semantic vector of the keyword, namely, the semantic vector of the standard question and the semantic vector of the trigger word.
Using offline module 301, data in the database, including text and semantic vectors, may be obtained and provided to the online recall module.
For the online recall module 302, the input is the search term (query) input by the user together with the data (texts and semantic vectors) in the database obtained by the offline module 301, and the output is the candidate texts. In the broad-requirement scenario, the search terms are short and broad, represented by the broad-requirement question in fig. 3, such as "weekly report".
As shown in fig. 3, the recall stage may be performed by a retrieval engine, specifically Elasticsearch (ES), a distributed retrieval and analysis engine.
Based on the offline processing above, the texts in the database include not only standard questions but also trigger words, so the data sources may be referred to as multiple data sources: specifically, two data sources, the standard questions and the trigger words.
A conventional recall process uses either semantic-based recall or text-based (term) recall; this embodiment uses a two-way recall combining semantic recall and text recall. As shown in fig. 3, the recall process of this embodiment may therefore be called a standard question + trigger word, multi-data-source, two-way recall.
The recall process is to select a first preset number (e.g. 300) of matching texts from the database, where the selected matching texts may be referred to as candidate texts, or the candidate texts may also be texts subjected to deduplication processing on the selected matching texts.
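A minimal sketch of merging the two recall channels and deduplicating by standard question (the matching logic inside each channel, which the retrieval engine performs, is elided):

```python
def two_way_recall(semantic_hits, term_hits, limit=300):
    """Merge the semantic and text (term) recall channels, keeping the first
    occurrence of each standard question, up to a preset number of candidates."""
    seen, candidates = set(), []
    for hit in semantic_hits + term_hits:
        if hit["standard_question"] not in seen:
            seen.add(hit["standard_question"])
            candidates.append(hit)
        if len(candidates) >= limit:
            break
    return candidates

# Hypothetical hits from each channel; Q2 is recalled by both.
semantic_hits = [{"standard_question": "Q1"}, {"standard_question": "Q2"}]
term_hits = [{"standard_question": "Q2"}, {"standard_question": "Q3"}]
cands = two_way_recall(semantic_hits, term_hits)
```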
For the online ranking module 303, the input is the candidate texts obtained by the online recall module 302, and the output is the search result for the search term, represented by "recommendation" in fig. 3.
The ranking stage mainly ranks the candidate texts. In this embodiment, the ranking score is obtained by fusing the similarity score and the popularity score, and the similarity score is obtained based on a modified poly-encoder.
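The fusion formula is not given in this passage; a weighted sum is one simple possibility, sketched here with an assumed weight alpha:

```python
def ranking_score(similarity, popularity, alpha=0.8):
    """Hypothetical fusion of the similarity score (from the modified
    poly-encoder) and the popularity score (from the offline stage) into a
    single ranking score. The weight alpha is an illustrative assumption."""
    return alpha * similarity + (1 - alpha) * popularity
```

Any monotonic fusion would serve the same role; the weight would be tuned against user behavior in practice.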
In combination with the application scenario, the present disclosure further provides an information retrieval method.
Fig. 5 is a schematic diagram of a second embodiment of the present disclosure, which provides an information retrieval method. This embodiment takes as an example the case where the keywords include standard questions and trigger words, and where there are multiple matching texts, candidate texts, and search results. The method includes the following steps:
Step 501, obtaining existing question-answer pairs.
The existing question-answer pairs are collected in advance and may be question-answer pairs corresponding to Frequently Asked Questions (FAQ). In a specific domain, the number of existing question-answer pairs is small.
Each existing question-answer pair includes at least a question and an answer, and the question may be called a standard question.
Step 502, cleaning the existing question-answer pairs to obtain admitted question-answer pairs.
The cleaning process may include: filtering out question-answer pairs whose answers are too long or too short, and filtering out question-answer pairs whose answers do not actually answer the question; both types can be filtered based on regular-expression matching sentences. Question-answer pairs whose question and answer (Q-A) do not match, and chit-chat question-answer pairs containing certain characters, can be filtered based on an AC automaton (Aho-Corasick automaton). Question-answer pairs with unclear meaning can also be filtered based on regular-expression matching sentences. Question-answer pairs with low timeliness can be filtered by extracting time entities with the Lexical Analysis of Chinese (LAC) tool and a dictionary library and judging their relation to the current date; low-timeliness information can also be updated periodically.
The question-answer pairs retained after cleaning may be called admitted question-answer pairs.
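The length and boilerplate filters can be sketched with regular expressions; the patterns below are illustrative stand-ins, not the patent's actual matching sentences:

```python
import re

# Illustrative non-answer pattern; the patent's actual regular matching
# sentences are not given in this passage.
BOILERPLATE = re.compile(r"please consult|refer to the relevant department", re.I)

def keep_pair(question, answer, min_len=5, max_len=500):
    """Return True if the pair survives the length and boilerplate filters."""
    if not (min_len <= len(answer) <= max_len):
        return False  # answer too short or too long
    return not BOILERPLATE.search(answer)
```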
Step 503, generating, based on the standard questions in the admitted question-answer pairs, the trigger words corresponding to the standard questions.
For each standard question, similar questions of the standard question can be obtained, and the trigger words are obtained based on the standard question and its similar questions.
The similar questions of a standard question can be obtained by various related techniques, such as manual construction, automatic machine generation, mining based on user behaviors, and the like.
After the similar questions are obtained, the standard question and its similar questions are spliced to obtain a spliced text; multiple word-segmentation processes are performed on the spliced text to obtain multiple groups of segmented words; the groups of segmented words are merged to obtain candidate words; and the candidate words are filtered to obtain the trigger words corresponding to the standard question.
Specifically, the generation process of the trigger word may include:
(1) After the standard question and its similar questions are spliced (to increase the diversity of trigger words), word segmentation is performed with the LAC 2.0 tool, and words such as nouns, nominal verbs, proper nouns and common verbs are extracted and put into a candidate set.
(2) After the same splicing, word segmentation and part-of-speech tagging are performed with the jieba tool, and head words (those positioned near the front) such as nouns, verbs and proper nouns are extracted and put into a candidate set.
(3) The results of (1) and (2) are merged (concat), and the merged set is taken as the trigger word candidate set corresponding to the standard question.
(4) The trigger word candidate set is filtered, and the retained trigger words serve as the final trigger words corresponding to the standard question.
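Steps (1) to (4) can be sketched as below; a toy whitespace segmenter with a hand-written part-of-speech table stands in for LAC 2.0 and jieba, which would perform the actual segmentation and tagging in practice.

```python
# Stand-in POS table (assumed); LAC 2.0 / jieba would supply tags in practice.
POS = {"reset": "v", "password": "n", "account": "n", "the": "x", "how": "x",
       "do": "x", "i": "x", "my": "x"}
KEEP_POS = {"n", "v"}          # nouns and verbs go into the candidate set
STOPWORDS = {"account"}        # assumed domain stop-word dictionary

def segment(text):
    """Toy whitespace segmenter returning (word, pos) pairs."""
    return [(w, POS.get(w, "x")) for w in text.lower().replace("?", "").split()]

def trigger_candidates(std_question, similar_questions):
    spliced = " ".join([std_question] + similar_questions)      # splice texts
    cand_a = {w for w, p in segment(spliced) if p in KEEP_POS}  # "LAC" pass
    cand_b = {w for w, p in segment(spliced) if p in KEEP_POS}  # "jieba" pass
    merged = cand_a | cand_b                                    # merge (concat)
    return sorted(merged - STOPWORDS)                           # stop-word filter

triggers = trigger_candidates("How do I reset my password?",
                              ["reset account password"])
```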
The filtering process for the trigger word candidate set may include at least one of: filtering based on word importance operators, filtering based on stop word dictionaries, and filtering based on semantic matching similarity.
(a) Filtering based on word importance operators:
The word importance operator is an existing classification model. The words in the trigger word candidate set may be called candidate words; a candidate word is input into the word importance operator, whose output is the importance score of the candidate word. The importance score may be 0 or 1, where 0 indicates that the corresponding candidate word is unimportant and 1 indicates that it is important.
Thereafter, unimportant candidate words (those scoring 0) may be removed from the candidate set.
(b) Filtering based on a domain stop word dictionary:
the candidate words can be matched with the domain stop word dictionary based on the AC automaton, and the candidate words belonging to the domain stop word dictionary are determined.
Candidate words belonging to the domain stop word dictionary may then be removed from the candidate set.
(c) Filtering based on semantic matching similarity:
For a given standard question, the semantic vector of the standard question and the semantic vectors of its similar questions can be obtained, and the mean vector of these semantic vectors is calculated; assuming the dimension of each token (which may be a character or a word, depending on the actual partitioning rule) is 768, the mean vector is a 1 × 768 vector.
Similarly, for each candidate word, a semantic vector of the candidate word may be obtained, which is also a 1 × 768 vector.
For each candidate word, the dot product of the semantic vector of the candidate word and the mean vector is calculated, that is, the semantic vector of the candidate word is multiplied by the transpose of the mean vector; the resulting dot product value may be called the similarity score of the candidate word.
Then, candidate words with similarity scores smaller than a preset value (e.g., 0.5) may be removed from the candidate set.
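Filter (c) can be sketched in plain Python; 3-dimensional toy vectors stand in for the 1 × 768 semantic vectors, and the 0.5 threshold follows the example above.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def filter_by_similarity(candidates, question_vecs, threshold=0.5):
    """Keep candidate words whose dot product with the mean question
    vector reaches the threshold (cf. filter (c))."""
    center = mean_vector(question_vecs)  # mean of standard + similar question vectors
    return [w for w, vec in candidates if dot(vec, center) >= threshold]

question_vecs = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]  # toy semantic vectors
candidates = [("password", [0.9, 0.1, 0.0]), ("weather", [0.0, 0.0, 1.0])]
kept = filter_by_similarity(candidates, question_vecs)
```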
The semantic vectors may be obtained by inputting the corresponding text into a coding model and taking the model's output, where the corresponding text refers to the standard question, a similar question, or a candidate word.
The coding model may be an existing pre-trained model, such as a BERT (Bidirectional Encoder Representations from Transformers) model or an ERNIE (Enhanced Representation through kNowledge IntEgration) model.
In this embodiment, the trigger word is obtained based on the standard question and the corresponding similar question, so that the diversity of the trigger word can be increased, and the recall rate can be further improved.
Step 504, generating a plurality of matching texts based on the standard questions and the trigger words corresponding to the standard questions, and storing the plurality of matching texts in a database.
As shown in fig. 4, each matching text may include a standard question, a trigger word, an answer, and other information, such as a region and a department. In addition, a popularity score for the corresponding matching text may also be determined based on user behavior, such as the frequency of retrieval of standard questions, and recorded in the popularity field.
In the embodiment, the matching text is generated based on the standard question and the trigger word, so that the data source not only comprises the standard question, but also comprises the trigger word, the variety of the data source is enriched, and the recall rate is improved.
And 505, acquiring an initial standard question semantic vector corresponding to the standard question and an initial trigger word semantic vector corresponding to the trigger word, and storing the initial standard question semantic vector and the initial trigger word semantic vector in a database.
Since the standard question semantic vector and the trigger term semantic vector are updated in the subsequent process, the initial values stored in the database may be referred to as an initial standard question semantic vector and an initial trigger term semantic vector, respectively.
The initial standard question semantic vector and the initial trigger term semantic vector can be obtained in a similar manner.
Taking the standard question as an example, the standard question may be input into the coding model, and the initial standard question semantic vector is obtained from the output vector of the coding model. Taking the ERNIE model as the coding model, after the standard question is input into the ERNIE model, the output of the last layer is a 512 × 768 tensor, where 512 is the number of tokens in the standard question and 768 is the dimension of each token's semantic vector; averaging over the tokens column-wise yields the 1 × 768 initial standard question semantic vector.
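The column-wise averaging can be sketched as follows, with a 4 × 3 toy matrix standing in for the 512 × 768 ERNIE output:

```python
def mean_pool(token_vectors):
    """Average an (n_tokens x dim) list of vectors column-wise
    into one 1 x dim sentence vector."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

last_layer = [[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0],
              [2.0, 2.0, 2.0],
              [2.0, 2.0, 2.0]]   # toy stand-in for the 512 x 768 output
sentence_vec = mean_pool(last_layer)
```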
Step 504 and step 505 have no timing constraint relationship.
The steps 501 to 505 described above may be performed in an offline phase.
The online phase may perform the following steps:
step 506, receiving a search term (query) input by the user.
After the search term is received, subsequent steps such as step 507 may be executed directly; alternatively, it may first be judged whether the search term expresses a general (broadly shared) demand, and step 507 and the subsequent steps are executed only when it does. For this identification, a pre-trained classification model can be adopted to judge whether the search term is a general-demand search term. The classification model can be obtained through a related model training approach.
And 507, acquiring a plurality of candidate texts corresponding to the search terms based on semantic recall and text recall from a plurality of matching texts stored in the database.
Wherein, the recalling stage adopts a two-way recalling scheme, namely semantic-based recall and text-based (term) recall.
In the two-way recall process, multiple matching texts can be obtained; for the multiple keywords included in the matching texts, the semantic similarity between each keyword and the search term and the text similarity between each keyword and the search term are determined; recall probabilities of the keywords are determined based on the text similarity and the semantic similarity, the recall probability characterizing the probability that a keyword is selected (recalled); a preset number of keywords are selected from the multiple keywords based on their recall probabilities to obtain the selected keywords; and at least one candidate text corresponding to the search term is obtained based on the matching texts in which the selected keywords are located.
In the embodiment, the semantic similarity and the text similarity can be embodied by the semantic similarity score and the text similarity score respectively, the recall probability can be embodied by the recall score, and accordingly, the recall score is obtained based on the text similarity score and the semantic similarity score, so that the semantic-based recall and the text-based recall are realized.
Is formulated as:

score_j-recall = w1 * score_j-vec + w2 * score_j-term   (1)

where j is the keyword index; a keyword may be a standard question or a trigger word, so with N standard questions and M trigger words, j = 1, 2, ..., (N + M).

score_j-recall is the recall score of the jth keyword; the greater the recall score, the greater the selection probability of the corresponding keyword, i.e., the more easily it is selected.

score_j-vec is the semantic similarity score of the jth keyword and the search term; the larger the semantic similarity score, the greater the semantic similarity between the corresponding keyword and the search term.

score_j-term is the text similarity score of the jth keyword and the search term; the larger the text similarity score, the greater the text similarity between the corresponding keyword and the search term.

w1 and w2 are preset weight coefficients, each a value between 0 and 1, with w1 + w2 = 1; * denotes a multiplication operation.
For semantic recall, the calculation formula is:

score_j-vec = Vec_j * Vec_query^T   (2)

where score_j-vec is the semantic similarity score of the jth keyword and the search term; Vec_j is the semantic vector of the jth keyword, i.e., the stored initial standard question semantic vector or initial trigger word semantic vector; Vec_query^T is the transpose of Vec_query; and Vec_query is the semantic vector of the search term (query).

The semantic vector of the search term can be obtained in a manner similar to the initial standard question semantic vector above: the search term is input into a pre-trained coding model (such as an ERNIE model), and a mean operation is performed on the model output, i.e., the 512 × 768 output tensor is averaged column-wise to obtain a 1 × 768 vector. Is formulated as:

Vec_query = avg(Ernie_last_layer)   (3)
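A minimal sketch of equations (2) and (3), with toy 3-dimensional vectors standing in for the 1 × 768 ones:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_pool(token_vectors):
    """Column-wise mean over token vectors, cf. eq. (3)."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

# Vec_query = avg(Ernie_last_layer), eq. (3); toy token vectors stand in.
query_tokens = [[0.4, 0.6, 0.0], [0.6, 0.4, 0.0]]
vec_query = mean_pool(query_tokens)

# score_j-vec = Vec_j * Vec_query^T, eq. (2)
vec_j = [1.0, 1.0, 0.0]          # stored keyword semantic vector (toy)
score_j_vec = dot(vec_j, vec_query)
```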
For text recall, the calculation formula is:

score_j-term = Σ_k w_jk * score_jk   (4)

where score_j-term is the text similarity score of the jth keyword and the search term; k is the field index within the matching text (as can be seen from the matching text shown in fig. 4, each piece of matching text may include a variety of fields); and w_jk is a weight coefficient that can be set manually.

The parameter score_jk referred to in equation (4) is calculated as:

score_jk = Σ_i (tf_ijk * idf_ijk)   (5)

tf_ijk = termCount_ijk / docLen_jk   (6)

idf_ijk = log(docCount_jk / docFreq_i)   (7)

where i is the index over the segmented words of the search term after word segmentation; termCount_ijk is the number of occurrences of the ith segmented word of the search term in the kth field of the matching text where the jth keyword is located; docLen_jk is the length of the kth field of the matching text where the jth keyword is located, i.e., the number of words after word segmentation of that field; docCount_jk is the total number of documents; docFreq_i is the number of documents containing the ith segmented word of the search term; and a field in each piece of matching text may be regarded as a document.
The recall score for each keyword can be calculated by the above equations (1) to (7).
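Equations (4) to (7) can be sketched as follows; the toy corpus and the equal field weights are illustrative assumptions.

```python
import math

def score_jk(query_terms, field_tokens, all_fields):
    """TF-IDF score of one field, eq. (5): sum over query terms of tf * idf."""
    doc_count = len(all_fields)                               # docCount
    total = 0.0
    for term in query_terms:
        tf = field_tokens.count(term) / len(field_tokens)     # eq. (6)
        df = sum(1 for f in all_fields if term in f)          # docFreq_i
        idf = math.log(doc_count / df) if df else 0.0         # eq. (7)
        total += tf * idf
    return total

def score_j_term(query_terms, fields, weights, all_fields):
    """Weighted sum of the field scores of one matching text, eq. (4)."""
    return sum(w * score_jk(query_terms, f, all_fields)
               for w, f in zip(weights, fields))

# Toy corpus: each field is treated as one document.
all_fields = [["reset", "password"], ["book", "meeting", "room"]]
score = score_j_term(["password"], [all_fields[0]], [1.0], all_fields)
```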
After the recall score of each keyword is obtained, a first preset number (e.g., 300) of keywords may be selected as the recall keywords in the sequence from high recall score to low recall score.
The matching texts in which the recalled keywords are located may be taken as the candidate texts; alternatively, processing such as deduplication may be performed on those matching texts to obtain the candidate texts.
And step 508, sequencing the candidate texts, and obtaining a retrieval result corresponding to the retrieval word based on the candidate texts after sequencing.
For each candidate text, a pre-trained coding model may be used to encode the candidate text to obtain a candidate text semantic vector, which may be referred to as the first semantic vector. In addition, the search term comprises at least one text unit; the coding model is used to encode each text unit of the search term to obtain at least one coding vector, and the coding vectors are merged to obtain the fourth semantic vector.
It is understood that the coding model for extracting the first semantic vector and the fourth semantic vector may be the same or different models, and both may be obtained by pre-training.
As shown in fig. 6, the coding model is an ERNIE model, and the case where the candidate text and the search term share the coding model is taken as an example, that is, the two ERNIE models shown in fig. 6 share parameters.
The initial semantic vector of the keyword may be referred to as the second semantic vector; since the keyword comprises the standard question and the trigger word, the second semantic vector comprises the initial standard question semantic vector corresponding to the standard question and the initial trigger word semantic vector corresponding to the trigger word.
The initial standard question semantic vector and the initial trigger term semantic vector are obtained and stored in the database at an offline stage.
The updated second semantic vector may be referred to as a third semantic vector, i.e., the updated standard question semantic vector and the updated trigger word semantic vector.
The update process may be performed using an attention (attention) network.
As shown in fig. 6, the attention networks involved in the update may include a first attention network and a second attention network, and the respective weights may be referred to as a first attention weight and a second attention weight.
The update process may include: determining a first attention weight based on the first semantic vector and the initial standard question semantic vector; determining the updated standard question semantic vector based on the first attention weight and the first semantic vector; determining a second attention weight based on the first semantic vector and the initial trigger word semantic vector; and determining the updated trigger word semantic vector based on the second attention weight and the first semantic vector.
In this embodiment, by using the attention mechanism, the initial standard question semantic vector and the initial trigger word semantic vector are updated based on the first semantic vector, so that an updated standard question semantic vector and an updated trigger word semantic vector with better information expression capability can be obtained, and when the similarity score is calculated based on the updated standard question semantic vector and the updated trigger word semantic vector, a more accurate similarity score can be obtained, the ranking accuracy is improved, and the accuracy of the retrieval result is improved.
The standard question semantic vector and the trigger word semantic vector are updated in similar ways; since the standard question and the trigger word can be collectively called keywords, the semantic vector of each keyword is updated by the formulas:

α_ji = softmax_i(Vec_j * h_i^T), i = 1, 2, ..., N   (8)

Vec'_j = Σ_i α_ji * h_i   (9)

where Vec_j is the initial semantic vector of the jth keyword, typically a 1 × M row vector; h_1, ..., h_N are the first semantic vectors corresponding to the candidate text, each h_i (i = 1, 2, ..., N) being a 1 × M row vector; the superscript T denotes a transpose operation; α_ji is the attention weight of the jth keyword over the ith token (the first attention weight when the keyword is a standard question, the second attention weight when it is a trigger word); and Vec'_j is the updated semantic vector of the jth keyword.

Therefore, based on formulas (8) to (9), the updated standard question semantic vector and the updated trigger word semantic vector can be calculated.
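The update of formulas (8) and (9) can be sketched in plain Python, with 2-dimensional toy vectors standing in for the 1 × M ones:

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def update_keyword_vector(vec_j, h):
    """Eqs. (8)-(9): attention weights of the keyword vector over the
    candidate-text token vectors h_1..h_N, then a weighted sum."""
    alpha = softmax([dot(vec_j, h_i) for h_i in h])         # eq. (8)
    dim = len(vec_j)
    return [sum(alpha[i] * h[i][d] for i in range(len(h)))  # eq. (9)
            for d in range(dim)]

h = [[1.0, 0.0], [0.0, 1.0]]        # first semantic vectors of the candidate text
vec_std = [2.0, 0.0]                # initial standard question semantic vector (toy)
updated = update_keyword_vector(vec_std, h)
```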
After the updated standard question semantic vector and the updated trigger word semantic vector are obtained, a first dot product value may be determined based on the updated standard question semantic vector and the fourth semantic vector; a second dot product value is determined based on the updated trigger word semantic vector and the fourth semantic vector; and the maximum of the first dot product value and the second dot product value is taken as the similarity score of the candidate text.
In this embodiment, the updated standard question semantic vector and the updated trigger word semantic vector are directly used to perform dot product operation with the fourth semantic vector, and the updated standard question semantic vector and the updated trigger word semantic vector are not subjected to attention processing, so that the retrieval efficiency can be improved.
A pre-trained coding model can be used to encode the candidate text to obtain the first semantic vector. The same coding model encodes the search term to obtain its coding vectors: the search term comprises multiple text units, there are correspondingly multiple coding vectors, and each coding vector corresponds to one text unit; the coding vectors are merged to obtain the fourth semantic vector.
The text unit may be called a token and may be a character or a word, as actually needed.
As shown in fig. 6, taking an ERNIE model as the coding model, Token1 to TokenN on the left constitute the candidate text, and Token1 to TokenN on the right constitute the search term (denoted in the figure as the general-demand query).
Fig. 6 shows two ERNIE models, but in practical applications, the parameters may be shared, i.e. may be considered to be the same model.
The left-side O1 to ON are the first semantic vectors corresponding to the candidate text, i.e., h_1, ..., h_N in equation (8) above.
The right-side O1 to ON are the multiple coding vectors corresponding to the search term; the fourth semantic vector is obtained by merging them (indicated by Agg in fig. 6).
The merging process refers to processing the N row vectors into one row vector; for example, the N row vectors may be averaged and the result passed through an activation function. The fourth semantic vector is likewise a 1 × M row vector.
In the embodiment, the candidate text and the search term share the coding model, so that the model parameter quantity can be reduced, the resource overhead can be reduced, and the operation efficiency can be improved.
The fourth semantic vector is represented as a search term semantic vector in fig. 6, and as shown in fig. 6, in the inference stage, a first dot product value may be calculated based on the updated standard question semantic vector and the search term semantic vector, a second dot product value may be calculated based on the updated trigger term semantic vector and the search term semantic vector, and a maximum value (represented by max (dot product)) of the first dot product value and the second dot product value may be taken as a similarity score between each candidate text and the search term.
Is formulated as follows:

score = max(Vec_stdquery * Vec_query^T, Vec_triggerwords * Vec_query^T)   (10)

where Vec_query is the search term semantic vector; Vec_stdquery is the updated standard question semantic vector; Vec_triggerwords is the updated trigger word semantic vector; the superscript T denotes a transpose operation; * denotes a multiplication operation; and max() denotes taking the maximum value.
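A minimal sketch of equation (10) with toy 2-dimensional vectors:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def similarity_score(vec_stdquery, vec_triggerwords, vec_query):
    """Eq. (10): the larger of the dot products of the two updated
    keyword vectors with the search term semantic vector."""
    return max(dot(vec_stdquery, vec_query), dot(vec_triggerwords, vec_query))

score = similarity_score([0.9, 0.1], [0.2, 0.8], [1.0, 0.0])
```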
When sorting, the method can comprise the following steps: acquiring the popularity of each candidate text; determining the ranking priority of each candidate text based on the similarity and the popularity; and performing sorting processing on the at least one candidate text based on the sorting priority.
Wherein, the popularity can be embodied by the popularity score, and the sorting priority can be embodied by the sorting score.
As shown in fig. 4, the matching text includes a popularity score, and the candidate text is obtained from the matching text, so that the popularity score is also recorded in each candidate text, and the popularity score can be obtained from the recorded data. The popularity score is used to characterize the popularity of the corresponding candidate text, for example, as mentioned above, each candidate text includes a standard question, which may be calculated based on the retrieval frequency of the standard question.
The formula for calculating the ranking score based on the similarity score and the popularity score is:

ranking score = w3 * similarity score + w4 * popularity score;

where w3 and w4 are manually set weight coefficients, each a value between 0 and 1, with w3 + w4 = 1.

The ranking score is used to characterize the ranking order of the corresponding candidate text; for example, the higher the ranking score, the higher the corresponding candidate text is ranked.
After the ranking score of each candidate text is obtained, the candidate texts can be ranked from high to low according to the ranking score, and a preset number (for example, 2 to 5) is selected in sequence from the ranked candidate texts to obtain a retrieval result.
For example, the standard question in the selected candidate text may be used as the search result, or the standard question and the answer in the candidate text may be used as the search result.
In this embodiment, the ranking score is obtained based on the similarity score and the popularity score, so that the relevance between the candidate text and the search term and the popularity of the candidate text can be fused in the ranking stage, and the information retrieval effect is improved.
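The ranking step can be sketched as follows; the weights w3 = w4 = 0.5, the texts, and the scores are illustrative assumptions.

```python
def rank(candidates, w3=0.5, w4=0.5, top_n=2):
    """Sort candidate texts by w3*similarity + w4*popularity
    and keep the top_n as the retrieval result."""
    scored = [(w3 * sim + w4 * pop, text) for text, sim, pop in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [text for _, text in scored[:top_n]]

candidates = [("How to reset a password", 0.9, 0.2),
              ("How to book a meeting room", 0.6, 0.9),
              ("Cafeteria opening hours", 0.3, 0.1)]
results = rank(candidates)
```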
The above embodiment relates to a coding model, such as the ERNIE model shown in fig. 6, and the training process of the coding model is described below.
Fig. 7 is a schematic diagram according to a third embodiment of the present disclosure, which provides a model training method, including:
step 701, obtaining training data, wherein the training data comprises: the method comprises the steps of obtaining a sample search term and at least one corresponding sample candidate text, and the real similarity between each sample candidate text and the sample search term.
Step 702, a coding model is adopted for any sample candidate text, and the sample candidate text is coded to obtain a first semantic vector of the sample candidate text.
Step 703, for any sample candidate text, based on the first semantic vector of the sample candidate text, performing update processing on the second semantic vector of the keyword in the sample candidate text to obtain a third semantic vector corresponding to the sample candidate text.
Step 704, using the coding model to respectively perform coding processing on at least one text unit in the sample search term to obtain at least one coding vector, and perform merging processing on the at least one coding vector to obtain a fourth semantic vector of the sample search term.
Step 705, for any sample candidate text, determining the prediction similarity between the sample candidate text and the sample search term based on the third semantic vector corresponding to the sample candidate text and the fourth semantic vector of the sample search term.
Step 706, constructing a loss function based on the predicted similarity of each sample candidate text and the sample search term and the real similarity of each sample candidate text and the sample search term; and adjusting model parameters of the coding model based on the loss function.
In the training process, the sample candidate texts and the sample search term may be samples collected in advance, and the true similarity may be manually labeled; for example, a sample candidate text that is similar to the sample search term may be labeled 1 and otherwise labeled 0.
In this embodiment, for any sample candidate text, the second semantic vector of the keyword in the candidate text is updated based on the first semantic vector of the sample candidate text to obtain the third semantic vector. Because the sample candidate text contains more information, the third semantic vector has better representation capability; the prediction similarity is then determined based on the third semantic vector, and training the model based on this prediction similarity can improve the accuracy of the model.
In some embodiments, the keywords include: a standard question and a trigger word, the standard question being a question in an existing question-answer pair and the trigger word being obtained based on the standard question; accordingly, the second semantic vector comprises: the initial standard question semantic vector corresponding to the standard question and the initial trigger word semantic vector corresponding to the trigger word; the third semantic vector comprises: the updated standard question semantic vector and the updated trigger word semantic vector. The updating, for any sample candidate text, of the second semantic vector of the keyword in the sample candidate text based on the first semantic vector of the sample candidate text to obtain the third semantic vector corresponding to the sample candidate text includes: determining a first attention weight based on the first semantic vector and the initial standard question semantic vector; determining the updated standard question semantic vector based on the first attention weight and the first semantic vector; determining a second attention weight based on the first semantic vector and the initial trigger word semantic vector; and determining the updated trigger word semantic vector based on the second attention weight and the first semantic vector.
Similar steps exist in the model reasoning process and the model training process, and different steps are respectively marked by different lines.
The updating process of the training process is consistent with the updating process of the reasoning process, and reference may be made to the relevant description of the above embodiments.
In this embodiment, by using an attention mechanism, the initial standard question semantic vector and the initial trigger word semantic vector are updated based on the first semantic vector, so that an updated standard question semantic vector and an updated trigger word semantic vector with better information expression capability can be obtained, and when the prediction similarity is calculated based on the updated standard question semantic vector and the updated trigger word semantic vector, more accurate prediction similarity can be obtained, and the model accuracy is improved.
In some embodiments, the determining, for any sample candidate text, of the predicted similarity of the sample candidate text and the sample search term based on the third semantic vector corresponding to the sample candidate text and the fourth semantic vector of the sample search term includes: determining a third attention weight based on the updated standard question semantic vector, the updated trigger word semantic vector, and the fourth semantic vector; determining a fifth semantic vector based on the third attention weight, the updated standard question semantic vector, and the updated trigger word semantic vector; and obtaining the predicted similarity based on the fifth semantic vector and the fourth semantic vector.
As shown in fig. 6, in the training phase, after obtaining the updated standard question semantic vector, the updated trigger word semantic vector, and the search word semantic vector, a third attention network may be used to perform processing to obtain a candidate text semantic vector, which may be referred to as a fifth semantic vector.
The calculation formulas are as follows:

[w_stdquery, w_triggerwords] = softmax([Vec_stdquery * Vec_query^T, Vec_triggerwords * Vec_query^T])   (11)

Vec_FAQ = w_stdquery * Vec_stdquery + w_triggerwords * Vec_triggerwords   (12)

where Vec_FAQ is the candidate text semantic vector; Vec_stdquery is the updated standard question semantic vector; Vec_triggerwords is the updated trigger word semantic vector; w_stdquery and w_triggerwords are the third attention weights; and Vec_query is the search term semantic vector.

The training similarity score is calculated as:

score = Vec_FAQ * Vec_query^T   (13)
in the embodiment, the candidate text semantic vectors are obtained through the attention mechanism, so that the expression effect of the candidate text semantic vectors can be improved, and further the model effect is improved.
Fig. 8 is a schematic diagram according to a fourth embodiment of the present disclosure, which provides an information retrieval apparatus 800, including: a first obtaining module 801, an updating module 802, a determining module 803, and an ordering module 804.
The first obtaining module 801 is configured to obtain at least one candidate text corresponding to a search term; the updating module 802 is configured to, for any candidate text, update a second semantic vector of a keyword in the candidate text based on a first semantic vector of the candidate text to obtain a third semantic vector; the determining module 803 is configured to determine, for any candidate text, a similarity between the candidate text and the search term based on a third semantic vector corresponding to the candidate text and a fourth semantic vector of the search term; the sorting module 804 is configured to perform sorting processing on the at least one candidate text based on the similarity between each candidate text and the search term, and obtain a search result corresponding to the search term based on the candidate text after the sorting processing.
In this embodiment, for any candidate text, the second semantic vector of the keyword in the candidate text is updated based on the first semantic vector of the candidate text to obtain the third semantic vector. Because the candidate text contains more information, the third semantic vector has better representation capability; the similarity is then determined based on the third semantic vector, and when the retrieval result is obtained based on this similarity, the accuracy of the retrieval result can be improved. In addition, since the similarity between each candidate text and the search term is determined based on the third semantic vector and the fourth semantic vector of the search term, the retrieval result can be obtained relatively quickly, improving retrieval efficiency. With both retrieval accuracy and retrieval efficiency ensured, the information retrieval effect can be improved.
In some embodiments, the candidate text is obtained from matching text already stored within the database; the first obtaining module 801 is further configured to: acquiring a plurality of matching texts; respectively determining semantic similarity of each keyword and the search word aiming at a plurality of keywords included in the matched texts, and determining text similarity of each keyword and the search word; based on the text similarity and the semantic similarity, determining recall probability of each keyword, wherein the recall probability is used for representing the probability of selecting the keyword; selecting a preset number of keywords from the plurality of keywords based on the recall probability of each keyword to obtain the selected keywords; and acquiring at least one candidate text corresponding to the search word based on the matched text where the selected keyword is located.
In this embodiment, the recall probability is obtained from both the text similarity and the semantic similarity, so that both literal (text-based) recall and semantic recall are realized.
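The recall stage above can be sketched in Python as an illustration only. The Jaccard character overlap used for text similarity, the cosine used for semantic similarity, and the weighted fusion with weight `alpha` are all assumptions; the disclosure only requires that the recall probability be determined from both similarities.

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def text_similarity(keyword, query):
    # character-overlap (Jaccard) similarity; the disclosure does not fix a formula
    kw, q = set(keyword), set(query)
    return len(kw & q) / len(kw | q) if kw | q else 0.0

def recall_probability(text_sim, semantic_sim, alpha=0.5):
    # weighted fusion of the two similarities; alpha is an assumed hyperparameter
    return alpha * text_sim + (1 - alpha) * semantic_sim

def select_keywords(keywords, query, query_vec, embed, k=2):
    # score every keyword and keep the k with the highest recall probability
    scored = []
    for kw in keywords:
        p = recall_probability(text_similarity(kw, query), cosine(embed[kw], query_vec))
        scored.append((p, kw))
    scored.sort(reverse=True)
    return [kw for _, kw in scored[:k]]
```

The candidate texts are then the matching texts containing the selected keywords.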
In some embodiments, the keywords include standard questions and trigger words, where the standard questions are questions in existing question-answer pairs and the trigger words are obtained based on the standard questions. The apparatus 800 further comprises: a second acquisition module, configured to acquire similar questions of the standard question; a third obtaining module, configured to obtain a trigger word corresponding to the standard question based on the standard question and the similar questions; and a generating module, configured to generate the matching text based on the standard question and the trigger word corresponding to the standard question.
In this embodiment, the matching text is generated based on the standard question and its trigger words, so the data source includes not only the standard questions but also the trigger words, which enriches the variety of the data source and improves the recall rate.
In some embodiments, the third obtaining module is further configured to: splice the standard question and the similar questions to obtain a spliced text; perform word segmentation processing on the spliced text multiple times to obtain multiple groups of segmented words; merge the multiple groups of segmented words to obtain candidate words; and filter the candidate words to obtain the trigger words corresponding to the standard question.

In some embodiments, the keywords include standard questions and trigger words, where the standard questions are questions in existing question-answer pairs and the trigger words are obtained based on the standard questions; accordingly, the second semantic vector includes an initial standard question semantic vector corresponding to the standard question and an initial trigger word semantic vector corresponding to the trigger word, and the third semantic vector includes an updated standard question semantic vector and an updated trigger word semantic vector. The updating module 802 is further configured to: determine a first attention weight based on the first semantic vector and the initial standard question semantic vector; determine the updated standard question semantic vector based on the first attention weight and the first semantic vector; determine a second attention weight based on the first semantic vector and the initial trigger word semantic vector; and determine the updated trigger word semantic vector based on the second attention weight and the first semantic vector.
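The trigger-word pipeline just described (splice, segment in several ways, merge, filter) can be sketched as follows. The segmenter functions and the stop-word filter are placeholder assumptions, since the disclosure does not name a particular segmenter or filtering rule.

```python
def extract_trigger_words(standard_q, similar_qs, segmenters, stopwords):
    # splice the standard question and its similar questions into one text
    spliced = " ".join([standard_q] + similar_qs)
    # segment the spliced text several ways and merge the word groups
    candidates = set()
    for segment in segmenters:
        candidates.update(segment(spliced))
    # filter the candidate words (drop stop words and one-character fragments)
    return {w for w in candidates if w not in stopwords and len(w) > 1}
```

In practice each entry in `segmenters` would be a real word segmenter configured with a different granularity; whitespace splitting stands in for one here.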
In this embodiment, the attention mechanism is used to update the initial standard question semantic vector and the initial trigger word semantic vector based on the first semantic vector, so updated standard question and trigger word semantic vectors with better information expression capability are obtained. Calculating the similarity based on these updated vectors yields a more accurate similarity, which improves the ranking accuracy and hence the accuracy of the retrieval result.
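A minimal numeric sketch of this update step: the scalar attention weight below is a sigmoid gate over the dot product, and the update is a gated interpolation. Both choices are assumptions — the disclosure only states that the weight is determined from the two vectors and that the updated vector is determined from the weight and the first semantic vector.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_update(first_vec, keyword_vec):
    # scalar attention weight from the dot product of the candidate-text
    # vector (first_vec) and the keyword vector (assumed gating form)
    w = sigmoid(sum(a * b for a, b in zip(first_vec, keyword_vec)))
    # gated interpolation: pull the keyword vector toward the text vector
    return [w * f + (1 - w) * k for f, k in zip(first_vec, keyword_vec)]
```

The same function applies to both the standard question vector (first attention weight) and the trigger word vector (second attention weight).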
In some embodiments, the determining module 803 is further configured to: determine a first dot product value based on the updated standard question semantic vector and the fourth semantic vector; determine a second dot product value based on the updated trigger word semantic vector and the fourth semantic vector; and take the maximum of the first dot product value and the second dot product value as the similarity between the candidate text and the search term.
In this embodiment, the updated standard question semantic vector and the updated trigger word semantic vector are directly used to perform the dot product operation with the fourth semantic vector, and the updated standard question semantic vector and the updated trigger word semantic vector are not further processed, so that the retrieval efficiency can be improved.
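The scoring described above — two dot products against the search-term vector, keeping the larger — can be sketched directly:

```python
def similarity(updated_sq_vec, updated_tw_vec, query_vec):
    """Similarity of one candidate text to the search term.

    updated_sq_vec: updated standard question semantic vector
    updated_tw_vec: updated trigger word semantic vector
    query_vec:      fourth semantic vector (the search term)
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    # the score is the larger of the two dot products, with no further processing
    return max(dot(updated_sq_vec, query_vec), dot(updated_tw_vec, query_vec))
```

Because no extra transformation follows the dot products, scoring each candidate is cheap, which is the efficiency point made above.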
In some embodiments, the apparatus 800 further comprises an encoding module, configured to encode each candidate text by using a pre-trained coding model to obtain the first semantic vector of each candidate text; and/or the search term includes at least one text unit, the coding model is used to respectively encode the at least one text unit in the search term to obtain at least one coding vector, and the at least one coding vector is merged to obtain the fourth semantic vector.
In the embodiment, the candidate text and the search term share the coding model, so that the model parameter quantity can be reduced, the resource overhead can be reduced, and the operation efficiency can be improved.
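The merge of the per-unit coding vectors can be sketched as mean pooling — an assumption, since the disclosure only says "merge". Here `encode` stands in for the shared coding model:

```python
def encode_query(text_units, encode):
    """Build the fourth semantic vector for a search term.

    text_units: the text units of the search term (e.g. tokens)
    encode:     the shared coding model, mapping a unit to a vector
    """
    # encode each text unit with the same coding model used for candidate texts
    vecs = [encode(u) for u in text_units]
    # merge the coding vectors by averaging (assumed merge operation)
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Sharing one encoder between candidate texts and search terms is what keeps the parameter count down, as noted above.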
In some embodiments, the ordering module 804 is further configured to: acquiring the popularity of each candidate text; determining the ranking priority of each candidate text based on the similarity and the popularity; and performing sorting processing on the at least one candidate text based on the sorting priority.
In the embodiment, the ranking priority is obtained based on the similarity and the popularity, so that the relevance between the candidate text and the search word and the popularity of the candidate text can be fused in the ranking stage, and the information retrieval effect is improved.
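The fusion of similarity and popularity into a ranking priority can be sketched as a weighted sum; the weight `beta` is an assumed hyperparameter, since the disclosure does not specify how the two signals combine.

```python
def rank(candidates, beta=0.7):
    """Sort candidate texts by a fused ranking priority.

    candidates: list of (text, similarity, popularity) tuples
    beta:       assumed fusion weight between similarity and popularity
    """
    priority = lambda c: beta * c[1] + (1 - beta) * c[2]
    return [c[0] for c in sorted(candidates, key=priority, reverse=True)]
```

Raising `beta` favors relevance to the search term; lowering it favors popular candidate texts.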
Fig. 9 is a schematic diagram of a fifth embodiment according to the present disclosure, which provides a model training apparatus 900, including: an obtaining module 901, a first encoding module 902, an updating module 903, a second encoding module 904, a determining module 905, and an adjusting module 906.
The obtaining module 901 is configured to obtain training data, where the training data includes: a sample search term, at least one corresponding sample candidate text, and the real similarity between each sample candidate text and the sample search term; the first encoding module 902 is configured to, for any sample candidate text, perform encoding processing on the sample candidate text by using a coding model to obtain a first semantic vector of the sample candidate text; the updating module 903 is configured to, for any sample candidate text, update a second semantic vector of a keyword in the sample candidate text based on the first semantic vector of the sample candidate text to obtain a third semantic vector corresponding to the sample candidate text; the second encoding module 904 is configured to separately perform encoding processing on at least one text unit in the sample search term by using the coding model to obtain at least one coding vector, and perform merging processing on the at least one coding vector to obtain a fourth semantic vector of the sample search term; the determining module 905 is configured to determine, for any sample candidate text, a prediction similarity between the sample candidate text and the sample search term based on the third semantic vector corresponding to the sample candidate text and the fourth semantic vector of the sample search term; the adjusting module 906 is configured to construct a loss function based on the prediction similarity and the real similarity between each sample candidate text and the sample search term, and to adjust model parameters of the coding model based on the loss function.
In this embodiment, for any sample candidate text, the second semantic vector of the keyword in the sample candidate text is updated based on the first semantic vector of the sample candidate text to obtain a third semantic vector. Because the sample candidate text contains more information, the third semantic vector has better representation capability; the prediction similarity is then determined based on the third semantic vector, and training the model based on this prediction similarity improves the accuracy of the model.
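The loss construction can be sketched as a mean squared error between the predicted and real similarities — an assumed loss form, since the disclosure does not fix one:

```python
def mse_loss(predicted, actual):
    """Loss over one batch of (sample candidate text, sample search term) pairs.

    predicted: prediction similarities from the model
    actual:    real similarities from the training data
    """
    # squared error per pair, averaged over the batch
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
```

Gradient descent on this loss with respect to the coding model's parameters is the adjustment step performed by the adjusting module.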
In some embodiments, the keywords include standard questions and trigger words, where the standard questions are questions in existing question-answer pairs and the trigger words are obtained based on the standard questions; accordingly, the second semantic vector includes an initial standard question semantic vector corresponding to the standard question and an initial trigger word semantic vector corresponding to the trigger word, and the third semantic vector includes an updated standard question semantic vector and an updated trigger word semantic vector. The updating module 903 is further configured to: determine a first attention weight based on the first semantic vector and the initial standard question semantic vector; determine the updated standard question semantic vector based on the first attention weight and the first semantic vector; determine a second attention weight based on the first semantic vector and the initial trigger word semantic vector; and determine the updated trigger word semantic vector based on the second attention weight and the first semantic vector.
In this embodiment, by using an attention mechanism, the initial standard question semantic vector and the initial trigger word semantic vector are updated based on the first semantic vector, so that an updated standard question semantic vector and an updated trigger word semantic vector with better information expression capability can be obtained, and when the prediction similarity is calculated based on the updated standard question semantic vector and the updated trigger word semantic vector, more accurate prediction similarity can be obtained, and the model accuracy is improved.
In some embodiments, the determining module 905 is further configured to: determine a third attention weight based on the updated standard question semantic vector, the updated trigger word semantic vector, and the fourth semantic vector; determine a fifth semantic vector based on the third attention weight, the updated standard question semantic vector, and the updated trigger word semantic vector; and obtain the prediction similarity based on the fifth semantic vector and the fourth semantic vector.
In this embodiment, the semantic vector of the candidate text (the fifth semantic vector) is obtained through the attention mechanism, which improves its expressive capability and thereby the model effect.
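The fusion step can be sketched as follows: the two updated keyword vectors are combined into a fifth semantic vector via query-conditioned attention, and the prediction similarity is their dot product with the query. The softmax over the two dot products is an assumed attention form; the disclosure only states that the third attention weight depends on all three vectors.

```python
import math

def predict_similarity(sq_vec, tw_vec, query_vec):
    """Prediction similarity between a sample candidate text and a sample search term.

    sq_vec:    updated standard question semantic vector
    tw_vec:    updated trigger word semantic vector
    query_vec: fourth semantic vector (the sample search term)
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    # third attention weight: softmax over the two query dot products (assumed form)
    s1, s2 = dot(sq_vec, query_vec), dot(tw_vec, query_vec)
    e1, e2 = math.exp(s1), math.exp(s2)
    w1, w2 = e1 / (e1 + e2), e2 / (e1 + e2)
    # fifth semantic vector: attention-weighted combination of the two vectors
    fifth = [w1 * a + w2 * b for a, b in zip(sq_vec, tw_vec)]
    # prediction similarity from the fifth and fourth semantic vectors
    return dot(fifth, query_vec)
```

Unlike the max-of-dot-products scoring used at retrieval time, this soft fusion is differentiable, which is what makes it usable inside the training loss.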
It is to be understood that in the disclosed embodiments, the same or similar contents in different embodiments may be mutually referred to.
It is to be understood that "first", "second", and the like in the embodiments of the present disclosure are used for distinction only, and do not indicate the degree of importance, the order of timing, and the like.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of related users all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. The electronic device 1000 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device 1000 may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the electronic device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1001 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 1001 performs the methods and processes described above, such as the information retrieval method or the model training method. For example, in some embodiments, the information retrieval method or the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the information retrieval method or the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the information retrieval method or the model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the defects of difficult management and weak service expansibility of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (25)

1. An information retrieval method, comprising:
acquiring at least one candidate text corresponding to the search term;
aiming at any candidate text, updating a second semantic vector of the keywords in the candidate text based on a first semantic vector of the candidate text to obtain a third semantic vector;
for any candidate text, determining the similarity between the candidate text and the search word based on a third semantic vector corresponding to the candidate text and a fourth semantic vector of the search word;
and sequencing the at least one candidate text based on the similarity between each candidate text and the search word, and acquiring a search result corresponding to the search word based on the candidate text after sequencing.
2. The method of claim 1, wherein the obtaining at least one candidate text corresponding to a search term comprises:
acquiring a plurality of matching texts;
respectively determining semantic similarity of each keyword and the search word aiming at a plurality of keywords included by the matched texts, and determining text similarity of each keyword and the search word;
based on the text similarity and the semantic similarity, determining a recall probability of each keyword, wherein the recall probability is used for representing the probability of selecting the keyword;
selecting a preset number of keywords from the plurality of keywords based on the recall probability of each keyword to obtain the selected keywords;
and acquiring at least one candidate text corresponding to the search word based on the matched text where the selected keyword is located.
3. The method of claim 2, wherein,
the keywords include standard questions and trigger words, the standard questions are questions in existing question-answer pairs, and the trigger words are obtained based on the standard questions;
the method further comprises the following steps:
acquiring a similar question of the standard question;
obtaining a trigger word corresponding to the standard question based on the standard question and the similar question;
and generating the matching text based on the standard question and the trigger word corresponding to the standard question.
4. The method according to claim 3, wherein the obtaining of the trigger word corresponding to the standard question based on the standard question and the similar question comprises:
splicing the standard question and the similar question to obtain a spliced text;
performing multiple word segmentation processing on the spliced text to obtain words after multiple groups of word segmentation;
merging the words after the multi-group word segmentation to obtain candidate words;
and filtering the candidate words to obtain the trigger words corresponding to the standard questions.
5. The method of claim 1, wherein,
the keywords include standard questions and trigger words, wherein the standard questions are questions in existing question-answer pairs, and the trigger words are obtained based on the standard questions;
accordingly, the second semantic vector comprises: the initial standard question semantic vector corresponding to the standard question and the initial trigger word semantic vector corresponding to the trigger word;
the third semantic vector comprises: the updated standard question semantic vector and the updated trigger word semantic vector;
the updating, for any candidate text, a second semantic vector of the keyword in the candidate text based on the first semantic vector of the candidate text to obtain a third semantic vector includes:
determining a first attention weight based on the first semantic vector and the initial standard question semantic vector;
determining the updated standard question semantic vector based on the first attention weight and the first semantic vector;
determining a second attention weight based on the first semantic vector and the initial trigger word semantic vector;
determining the updated trigger word semantic vector based on the second attention weight and the first semantic vector.
6. The method of claim 5, wherein the determining, for any candidate text, the similarity between the candidate text and the search term based on the third semantic vector corresponding to the candidate text and the fourth semantic vector of the search term comprises:
determining a first dot product value based on the updated standard question semantic vector and the fourth semantic vector;
determining a second dot product value based on the updated trigger word semantic vector and the fourth semantic vector;
and taking the maximum value of the first dot product value and the second dot product value as the similarity of the candidate text and the search word.
7. The method of any of claims 1-6, further comprising:
coding each candidate text by adopting a pre-trained coding model to obtain a first semantic vector of each candidate text; and/or,
the search term comprises at least one text unit, the at least one text unit in the search term is respectively subjected to coding processing by adopting the coding model so as to obtain at least one coding vector, and the at least one coding vector is subjected to merging processing so as to obtain the fourth semantic vector.
8. The method according to any one of claims 1-6, wherein said ranking said at least one candidate text based on similarity of each candidate text to said search term comprises:
acquiring the popularity of each candidate text;
determining the ranking priority of each candidate text based on the similarity and the popularity;
and performing sorting processing on the at least one candidate text based on the sorting priority.
9. A model training method, comprising:
obtaining training data, the training data comprising: the method comprises the steps of obtaining a sample search word and at least one corresponding sample candidate text, and the real similarity between each sample candidate text and the sample search word;
aiming at any sample candidate text, coding the sample candidate text by adopting a coding model to obtain a first semantic vector of the sample candidate text;
for any sample candidate text, updating a second semantic vector of the keywords in the sample candidate text based on a first semantic vector of the sample candidate text to obtain a third semantic vector corresponding to the sample candidate text;
respectively encoding at least one text unit in the sample search term by adopting the encoding model to obtain at least one encoding vector, and merging the at least one encoding vector to obtain a fourth semantic vector of the sample search term;
for any sample candidate text, determining the prediction similarity of the sample candidate text and the sample search word based on a third semantic vector corresponding to the sample candidate text and a fourth semantic vector of the sample search word;
constructing a loss function based on the prediction similarity of each sample candidate text and the sample search term and the real similarity of each sample candidate text and the sample search term; and adjusting model parameters of the coding model based on the loss function.
10. The method of claim 9, wherein,
the keywords include standard questions and trigger words, wherein the standard questions are questions in existing question-answer pairs, and the trigger words are obtained based on the standard questions;
accordingly, the second semantic vector comprises: the initial standard question semantic vector corresponding to the standard question and the initial trigger word semantic vector corresponding to the trigger word;
the third semantic vector comprises: the updated standard question semantic vector and the updated trigger word semantic vector;
the updating, for any sample candidate text, a second semantic vector of the keyword in the sample candidate text based on a first semantic vector of the sample candidate text to obtain a third semantic vector corresponding to the sample candidate text includes:
determining a first attention weight based on the first semantic vector and the initial standard question semantic vector;
determining the updated standard question semantic vector based on the first attention weight and the first semantic vector;
determining a second attention weight based on the first semantic vector and the initial trigger word semantic vector;
determining the updated trigger word semantic vector based on the second attention weight and the first semantic vector.
11. The method of claim 10, wherein the determining, for any sample candidate text, the predicted similarity of the sample candidate text and the sample search term based on the third semantic vector corresponding to the sample candidate text and the fourth semantic vector of the sample search term comprises:
determining a third attention weight based on the updated standard question semantic vector, the updated trigger word semantic vector, and the fourth semantic vector;
determining a fifth semantic vector based on the third attention weight, the updated standard question semantic vector, and the updated trigger word semantic vector;
obtaining the prediction similarity based on the fifth semantic vector and the fourth semantic vector.
12. An information retrieval apparatus comprising:
the first acquisition module is used for acquiring at least one candidate text corresponding to the search term;
the updating module is used for updating a second semantic vector of the keywords in any candidate text based on the first semantic vector of the candidate text to obtain a third semantic vector;
the determining module is used for determining the similarity between any candidate text and the search word based on a third semantic vector corresponding to the candidate text and a fourth semantic vector of the search word;
and the sequencing module is used for sequencing the at least one candidate text based on the similarity between each candidate text and the search word, and obtaining a search result corresponding to the search word based on the candidate text after sequencing.
13. The apparatus of claim 12, wherein the first acquisition module is further configured to:
acquiring a plurality of matching texts;
respectively determining semantic similarity of each keyword and the search word aiming at a plurality of keywords included by the matched texts, and determining text similarity of each keyword and the search word;
based on the text similarity and the semantic similarity, determining a recall probability of each keyword, wherein the recall probability is used for representing the probability of selecting the keyword;
selecting a preset number of keywords from the plurality of keywords based on the recall probability of each keyword to obtain the selected keywords;
and acquiring at least one candidate text corresponding to the search word based on the matched text where the selected keyword is located.
14. The apparatus of claim 13, wherein,
the keywords include standard questions and trigger words, the standard questions are questions in existing question-answer pairs, and the trigger words are obtained based on the standard questions;
the device further comprises:
the second acquisition module is used for acquiring the similar questions of the standard questions;
a third obtaining module, configured to obtain a trigger word corresponding to the standard question based on the standard question and the similar question;
and the generating module is used for generating the matching text based on the standard question and the trigger word corresponding to the standard question.
15. The apparatus of claim 14, wherein the third obtaining module is further configured to:
splicing the standard question and the similar question to obtain a spliced text;
performing word segmentation on the spliced text multiple times to obtain multiple groups of segmented words;
merging the multiple groups of segmented words to obtain candidate words;
and filtering the candidate words to obtain the trigger words corresponding to the standard questions.
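The trigger-word extraction of claim 15 (splice, segment multiple times, merge, filter) can be sketched as below. The choice of segmenters and the filtering rule (dropping stopwords and single characters) are assumptions; the claim only names the steps.

```python
def extract_trigger_words(standard_q, similar_qs, segmenters, stopwords):
    """Claim 15 sketch: splice the standard question with its similar
    questions, run several word segmenters, merge the segmented groups
    into candidate words, then filter them into trigger words.
    `segmenters` is a list of tokenizer callables (an assumption)."""
    text = " ".join([standard_q] + similar_qs)   # splicing step
    candidates = set()
    for seg in segmenters:                       # multiple segmentations
        candidates.update(seg(text))             # merge the groups
    # Filtering step: drop stopwords and single-character candidates.
    return {w for w in candidates if w not in stopwords and len(w) > 1}
```

With real Chinese text, `segmenters` would hold different word-segmentation tools whose disagreements the merge step reconciles.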
16. The apparatus of claim 12, wherein,
the keywords include a standard question and a trigger word, wherein the standard question is a question in an existing question-answer pair, and the trigger word is obtained based on the standard question;
accordingly, the second semantic vector comprises: the initial standard question semantic vector corresponding to the standard question and the initial trigger word semantic vector corresponding to the trigger word;
the third semantic vector comprises: the updated standard question semantic vector and the updated trigger word semantic vector;
the update module is further to:
determining a first attention weight based on the first semantic vector and the initial standard question semantic vector;
determining the updated standard question semantic vector based on the first attention weight and the first semantic vector;
determining a second attention weight based on the first semantic vector and the initial trigger word semantic vector;
determining the updated trigger word semantic vector based on the second attention weight and the first semantic vector.
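The attention update of claim 16 can be illustrated as follows. The sigmoid-gated convex combination is an assumption; the claim only states that a weight is derived from the two vectors and then applied with the first semantic vector.

```python
import numpy as np

def attention_update(first_vec, keyword_vec):
    """Claim 16 sketch: derive an attention weight from the candidate
    text's first semantic vector and a keyword's initial semantic
    vector, then blend the two into the updated keyword vector.
    The sigmoid gating form is assumed, not specified by the claim."""
    weight = 1.0 / (1.0 + np.exp(-np.dot(first_vec, keyword_vec)))
    return weight * first_vec + (1.0 - weight) * keyword_vec
```

The same function would be applied once with the initial standard question semantic vector (first attention weight) and once with the initial trigger word semantic vector (second attention weight).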
17. The apparatus of claim 16, wherein the determining module is further configured to:
determining a first dot product value based on the updated standard question semantic vector and the fourth semantic vector;
determining a second dot product value based on the updated trigger word semantic vector and the fourth semantic vector;
and taking the maximum value of the first dot product value and the second dot product value as the similarity between the candidate text and the search word.
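The max-of-dot-products scoring of claim 17 is direct to state in code; only the vector contents are assumed, the formula follows the claim.

```python
import numpy as np

def similarity(std_vec, trig_vec, query_vec):
    """Claim 17: score a candidate text against the search word as the
    larger of two dot products - updated standard question vector vs.
    query (fourth semantic) vector, and updated trigger word vector
    vs. query vector."""
    return max(np.dot(std_vec, query_vec), np.dot(trig_vec, query_vec))
```

Taking the maximum lets a candidate match the query through either its standard question or its trigger word, whichever is closer.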
18. The apparatus of any of claims 12-17, further comprising:
the encoding module is used for encoding each candidate text by adopting a pre-trained encoding model to obtain the first semantic vector of each candidate text; and/or, when the search word comprises at least one text unit, respectively encoding the at least one text unit in the search word by adopting the encoding model to obtain at least one encoding vector, and merging the at least one encoding vector to obtain the fourth semantic vector.
19. The apparatus of any of claims 12-17, wherein the ranking module is further to:
acquiring the popularity of each candidate text;
determining the ranking priority of each candidate text based on the similarity and the popularity;
and ranking the at least one candidate text based on the ranking priority.
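The popularity-aware ranking of claim 19 can be sketched as below. The linear mix of similarity and popularity and the weight `w` are assumptions; the claim only requires that both signals contribute to the ranking priority.

```python
def rank_candidates(candidates, similarities, popularities, w=0.7):
    """Claim 19 sketch: fold each candidate text's popularity into its
    ranking priority alongside its similarity to the search word.
    The linear combination and weight `w` are hypothetical."""
    priority = [w * s + (1 - w) * p
                for s, p in zip(similarities, popularities)]
    # Sort candidate indices by descending priority.
    order = sorted(range(len(candidates)), key=lambda i: -priority[i])
    return [candidates[i] for i in order]
```

A popular but slightly less similar candidate can thus outrank a rarely viewed one, which matches the intent of blending the two signals.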
20. A model training apparatus comprising:
an acquisition module configured to acquire training data, the training data including: a sample search word, at least one corresponding sample candidate text, and the real similarity between each sample candidate text and the sample search word;
the first coding module is used for coding any sample candidate text by adopting a coding model to obtain a first semantic vector of the sample candidate text;
the updating module is used for updating a second semantic vector of a keyword in any sample candidate text based on a first semantic vector of the sample candidate text to obtain a third semantic vector corresponding to the sample candidate text;
the second coding module is used for respectively coding at least one text unit in the sample search word by adopting the coding model to obtain at least one coding vector, and merging the at least one coding vector to obtain a fourth semantic vector of the sample search word;
the determining module is used for determining the prediction similarity of any sample candidate text and the sample search word based on a third semantic vector corresponding to the sample candidate text and a fourth semantic vector of the sample search word;
the adjusting module is used for constructing a loss function based on the prediction similarity of each sample candidate text and the sample search word and the real similarity of each sample candidate text and the sample search word, and adjusting model parameters of the coding model based on the loss function.
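The loss construction of claim 20 can be illustrated as below. Mean squared error between predicted and real similarities is one plausible choice; the claim does not name the loss function, so this is an assumption.

```python
import numpy as np

def mse_loss(pred_sims, true_sims):
    """Claim 20 sketch: a loss built from the predicted vs. real
    similarity of each sample candidate text to the sample search
    word. Mean squared error is assumed; the patent does not fix
    the functional form."""
    pred = np.asarray(pred_sims, dtype=float)
    true = np.asarray(true_sims, dtype=float)
    return float(np.mean((pred - true) ** 2))
```

The coding model's parameters would then be updated by gradient descent on this loss, per the adjusting module.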
21. The apparatus of claim 20, wherein,
the keywords include a standard question and a trigger word, wherein the standard question is a question in an existing question-answer pair, and the trigger word is obtained based on the standard question;
accordingly, the second semantic vector comprises: the initial standard question semantic vector corresponding to the standard question and the initial trigger word semantic vector corresponding to the trigger word;
the third semantic vector comprises: the updated standard question semantic vector and the updated trigger word semantic vector;
the update module is further to:
determining a first attention weight based on the first semantic vector and the initial standard question semantic vector;
determining the updated standard question semantic vector based on the first attention weight and the first semantic vector;
determining a second attention weight based on the first semantic vector and the initial trigger word semantic vector;
determining the updated trigger word semantic vector based on the second attention weight and the first semantic vector.
22. The apparatus of claim 21, wherein the determining module is further configured to:
determining a third attention weight based on the updated standard question semantic vector, the updated trigger word semantic vector, and the fourth semantic vector;
determining a fifth semantic vector based on the third attention weight, the updated standard question semantic vector, and the updated trigger word semantic vector;
obtaining the prediction similarity based on the fifth semantic vector and the fourth semantic vector.
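The attention pooling of claim 22 can be sketched as follows. The query-driven softmax over the two updated keyword vectors, and the final dot product, are assumptions; the claim only states that a third attention weight produces a fifth semantic vector from which the prediction similarity is obtained.

```python
import numpy as np

def predict_similarity(std_vec, trig_vec, query_vec):
    """Claim 22 sketch: softmax attention over the updated standard
    question and trigger word vectors (scored against the query's
    fourth semantic vector) yields a pooled fifth semantic vector;
    its dot product with the query gives the prediction similarity.
    The softmax pooling form is assumed."""
    scores = np.array([np.dot(std_vec, query_vec),
                       np.dot(trig_vec, query_vec)])
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                               # third attention weight
    fifth = attn[0] * std_vec + attn[1] * trig_vec   # fifth semantic vector
    return float(np.dot(fifth, query_vec))
```

Unlike the hard max of claim 17 used at retrieval time, this soft pooling is differentiable, which suits the training setting of claims 20-22.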
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-11.
CN202210969926.6A 2022-08-12 2022-08-12 Information retrieval and model training method, device, equipment and storage medium Pending CN115470313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210969926.6A CN115470313A (en) 2022-08-12 2022-08-12 Information retrieval and model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210969926.6A CN115470313A (en) 2022-08-12 2022-08-12 Information retrieval and model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115470313A true CN115470313A (en) 2022-12-13

Family

ID=84367922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210969926.6A Pending CN115470313A (en) 2022-08-12 2022-08-12 Information retrieval and model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115470313A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116610782A (en) * 2023-04-28 2023-08-18 北京百度网讯科技有限公司 Text retrieval method, device, electronic equipment and medium
CN116610782B (en) * 2023-04-28 2024-03-15 北京百度网讯科技有限公司 Text retrieval method, device, electronic equipment and medium
CN116628142A (en) * 2023-07-26 2023-08-22 科大讯飞股份有限公司 Knowledge retrieval method, device, equipment and readable storage medium
CN116628142B (en) * 2023-07-26 2023-12-01 科大讯飞股份有限公司 Knowledge retrieval method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US11775760B2 (en) Man-machine conversation method, electronic device, and computer-readable medium
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN110019732B (en) Intelligent question answering method and related device
WO2004075078A2 (en) Method and apparatus for fundamental operations on token sequences: computing similarity, extracting term values, and searching efficiently
CN115470313A (en) Information retrieval and model training method, device, equipment and storage medium
CN113657100A (en) Entity identification method and device, electronic equipment and storage medium
CN115495555A (en) Document retrieval method and system based on deep learning
CN115516447A (en) Hot news intention identification method, device and equipment and readable storage medium
CN114861889A (en) Deep learning model training method, target object detection method and device
CN114579104A (en) Data analysis scene generation method, device, equipment and storage medium
CN112560461A (en) News clue generation method and device, electronic equipment and storage medium
CN112883182A (en) Question-answer matching method and device based on machine reading
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN113806660A (en) Data evaluation method, training method, device, electronic device and storage medium
CN112506864A (en) File retrieval method and device, electronic equipment and readable storage medium
CN115828893A (en) Method, device, storage medium and equipment for question answering of unstructured document
CN114647739B (en) Entity chain finger method, device, electronic equipment and storage medium
CN116049370A (en) Information query method and training method and device of information generation model
CN113792230B (en) Service linking method, device, electronic equipment and storage medium
CN114118049B (en) Information acquisition method, device, electronic equipment and storage medium
CN116090438A (en) Theme processing method and device, electronic equipment and storage medium
CN114969371A (en) Heat sorting method and device of combined knowledge graph
CN114611625A (en) Language model training method, language model training device, language model data processing method, language model data processing device, language model data processing equipment, language model data processing medium and language model data processing product
CN113392218A (en) Training method of text quality evaluation model and method for determining text quality
CN115248890A (en) User interest portrait generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination