CN113157727B - Method, apparatus and storage medium for providing recall results

Info

Publication number: CN113157727B
Application number: CN202110567087.0A
Authority: CN (China)
Prior art keywords: query, training data, target, sentence, vector
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN113157727A
Inventor: 陈刚保
Current and original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd (the listed assignees may be inaccurate)
Events: application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd with priority to CN202110567087.0A; publication of CN113157727A; application granted; publication of CN113157727B

Classifications

    • G06F16/243 Natural language query formulation
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis

(All within G Physics > G06 Computing; Calculating or Counting > G06F Electric Digital Data Processing.)


Abstract

The disclosure provides a method, a device, and a storage medium for providing recall results, and belongs to the field of computer technology. The method comprises the following steps: obtaining a query sentence to be processed; when the query sentence satisfies a target condition, appending the entity words contained in it at a target position to obtain a target text corresponding to the query sentence, where the target condition is that the sentence is a semantic query and/or that its number of characters is smaller than a target value while it has no fixed meaning, and the target position is the end or the beginning of the query sentence; inputting the target text into a semantic vector generation model so that the model obtains a semantic vector of the query sentence according to the entity words in the target text, the query sentences in the model's training data set also containing entity words; and providing a recall result for the query sentence based on the semantic vector. By adopting the method and the device, the accuracy of the provided recall results is improved.

Description

Method, apparatus and storage medium for providing recall results
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, and a storage medium for providing recall results.
Background
At present, when a user inputs a query statement in an application to search for content, if the query statement has no explicit pointing, that is, if it is a semantic query (query), a recall model converts the query statement into a semantic vector, obtains the semantic vector of each recall result in a candidate library, and then calculates the similarity between the semantic vector of the query statement and the semantic vector of each recall result. Recall results whose similarity is higher than a certain value are displayed to the user.
In the related art, when the recall model converts a query statement into a semantic vector, it analyzes the terms and sentence structure of the query statement to obtain the semantic vector; likewise, when converting a recall result into a semantic vector, it analyzes the terms and sentence structure of the recall result.
With this related technology, entity words of the same type often appear in similar contexts and sentence structures even when their semantics are opposite or unrelated, so entity words of the same type may be mistakenly recognized as having the same semantics. The generated semantic vectors therefore have low accuracy, and the provided recall results are inaccurate.
Disclosure of Invention
The embodiments of the disclosure provide a method, a device, and a storage medium for providing recall results, which can solve the problem of low accuracy in semantic vector generation and thereby improve the accuracy of the provided recall results. The technical solution is as follows:
in one aspect, a method of providing recall results is provided, the method comprising:
acquiring a query sentence to be processed;
when the query sentence to be processed satisfies a target condition, adding the entity words in the query sentence to be processed at a target position of the query sentence to obtain a target text corresponding to the query sentence, where the target condition is that the sentence is a semantic query and/or that its number of characters is smaller than a target value while it has no fixed meaning, and the target position is the end or the beginning of the query sentence to be processed;
inputting the target text into a trained semantic vector generation model so that the semantic vector generation model obtains a semantic vector of the query sentence to be processed according to the entity words in the target text, wherein the query sentence in a training data set for training the semantic vector generation model comprises the entity words;
and providing a target recall result for the query statement to be processed based on the semantic vector of the query statement to be processed.
In one possible implementation, the method further includes:
cleaning an exposure log of semantic query to obtain original training data, wherein the original training data comprises a plurality of query sentences and recall results corresponding to the query sentences;
adding labels to the recall results of the query sentences in the original training data based on the feedback of the user to the recall results of the query sentences to generate first-class training data;
determining a target query statement comprising entity words in the original training data, and constructing second type training data based on the entity words in the target query statement and the recall result of the target query statement;
adding the first type of training data and the second type of training data to a training data set;
and training an initial semantic vector generation model based on the training data set and a pre-constructed loss function to obtain the trained semantic vector generation model.
In this way, the semantic vector generation model can be trained.
In one possible implementation manner, the constructing the second type of training data based on the entity words in the target query statement and the recall result of the target query statement includes:
and deleting or replacing the entity words in the target query sentence with other entity words of the same type, and modifying the label of the recall result of the target query sentence into an opposite label to obtain second type training data.
Therefore, the performance of recognizing the entity words of the same type by the semantic vector generation model can be improved.
In one possible implementation manner, the constructing of the second type of training data based on the entity words in the target query statement and the recall result of the target query statement includes:
and if the entity words of the target query statement are inconsistent with the entity words in the recall result of the target query statement and/or the entity word types are inconsistent, modifying the labels of the recall result of the target query statement into opposite labels to obtain second-class training data.
Therefore, the performance of recognizing the entity words of the same type by the semantic vector generation model can be improved.
In one possible implementation, the method further includes:
searching synonyms of the keywords in the query sentence in a preset synonym table aiming at the query sentence in the original training data;
replacing the keywords in the query sentence with synonyms of the keywords to obtain a first query sentence;
adding the first query statement, the recall result of the first query statement, and a label as training data to the training data set.
Thus, the performance of the semantic vector generation model for identifying the synonyms can be improved.
In one possible implementation, the method further includes:
searching, for the query sentences in the original training data, for antonyms of the keywords in the query sentence in a preset antonym table;
replacing the keywords in the query sentence with their antonyms to obtain a second query sentence;
adding the second query statement, the recall result of the second query statement, and an opposite label as training data to the training data set.
Thus, the ability of the semantic vector generation model to recognize antonyms can be improved.
In one possible implementation, the method further includes:
identifying query sentences in the original training data to obtain keywords in the query sentences;
deleting or replacing a word except the keyword in the query sentence with a preset word to obtain a third query sentence;
adding the third query statement, the recall result of the third query statement, and a label as training data to the training data set.
In this way, the training data set can be expanded.
In one possible implementation, the method further includes:
adding stop words to the head or the tail of the query sentence in the original training data to obtain a fourth query sentence, wherein the semantics of the query sentence added with the stop words are not changed;
adding the fourth query statement, the recall result of the fourth query statement, and a label as training data to the training data set.
In this way, the training data set can be expanded.
In one possible implementation, the method further includes:
screening out fifth query statements in the original training data that are semantic queries and whose frequency exceeds a target threshold;
displaying a fifth query statement and a recall result corresponding to the fifth query statement;
acquiring a label result of a recall result corresponding to the fifth query statement by a user;
and taking the fifth query statement and the labeling result as training data, and adding the training data to the training data set.
Thus, training data of accurate labels can be acquired.
In a possible implementation manner, the training an initial semantic vector generation model based on the training data set and a pre-constructed loss function to obtain the trained semantic vector generation model includes:
acquiring a plurality of pieces of training data in the training data set according to the batch size, wherein each piece of training data comprises a combination of a query sentence and a corresponding recall result;
inputting each piece of training data into the initial semantic vector generation model to obtain a text vector of a query sentence and a text vector of a recall result in the training data;
determining a loss corresponding to the training data based on the loss function, the text vector of the query sentence in the training data and the text vector of the recall result;
when the loss meets a first condition or the evaluation index meets a second condition, determining the initial semantic vector generation model as the trained semantic vector generation model, when the loss does not meet the first condition and the evaluation index does not meet the second condition, updating the initial semantic vector generation model by using the loss, continuing to train the updated semantic vector generation model by using the training data set until the loss meets the first condition or the evaluation index meets the second condition, and determining the semantic vector generation model when the loss meets the first condition or the evaluation index meets the second condition as the trained semantic vector generation model.
In one possible implementation, the loss function includes a hinge loss function and a vector similarity loss function;
determining the loss corresponding to the training data based on the loss function, the text vector of the query sentence in the training data, and the text vector of the recall result includes:
multiplying the text vector of the query statement and the text vector of the recall result element-wise to obtain a multiplied vector, and subtracting the text vector of the recall result from the text vector of the query statement element-wise to obtain a subtracted vector;
concatenating the text vector of the query statement, the text vector of the recall result, the multiplied vector, and the subtracted vector, and inputting the result to a fully connected layer to obtain a similarity prediction result for the training data;
substituting the similarity prediction result of the training data and the label of the recall result in the training data into the hinge loss function to determine the hinge loss of the training data;
calculating the vector similarity of the text vector of the query statement and the text vector of the recall result, and substituting the vector similarity, the similarity prediction result, and the label of the recall result into the vector similarity loss function to determine the vector similarity loss of the training data;
and calculating the loss corresponding to the training data from the vector similarity loss and the hinge loss.
In this way, two kinds of loss function are taken into account, so the performance of the semantic vector generation model can be made better.
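As a concrete reading of this implementation, a minimal PyTorch sketch follows, assuming labels of +1/-1 for similar/dissimilar pairs; the exact form of the vector similarity loss and the weighting between the two terms are not fixed by the patent, so those choices here are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    """Hinge loss on a fully-connected similarity prediction over
    [u, v, u*v, u-v], plus a loss tying the cosine similarity of the
    two text vectors to the label and to that prediction (sketch)."""

    def __init__(self, dim: int, margin: float = 1.0, alpha: float = 0.5):
        super().__init__()
        self.fc = nn.Linear(4 * dim, 1)  # concatenated features -> one score
        self.margin = margin
        self.alpha = alpha               # assumed weighting of the two terms

    def forward(self, u, v, label):
        # u: text vectors of the query sentences, shape (batch, dim)
        # v: text vectors of the recall results,  shape (batch, dim)
        # label: +1.0 for similar pairs, -1.0 for dissimilar pairs
        mul = u * v                      # element-wise multiplied vector
        sub = u - v                      # element-wise subtracted vector
        feats = torch.cat([u, v, mul, sub], dim=-1)
        pred = self.fc(feats).squeeze(-1)            # similarity prediction
        hinge = F.relu(self.margin - label * pred).mean()
        cos = F.cosine_similarity(u, v, dim=-1)      # vector similarity
        sim_loss = ((cos - label) ** 2 + (cos - torch.tanh(pred)) ** 2).mean()
        return self.alpha * hinge + (1.0 - self.alpha) * sim_loss
```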
In one possible implementation, the semantic vector generation model includes a text-to-vector layer, a first coding layer, and a second coding layer;
inputting the target text into the trained semantic vector generation model so that the semantic vector generation model obtains the semantic vector of the query statement to be processed according to the entity words in the target text includes:
sequentially inputting the target text into the text-to-vector layer, the first coding layer, and the second coding layer to obtain the semantic vector of the query statement to be processed, wherein the text-to-vector layer is used for converting the target text into a vector, and the first coding layer and the second coding layer are used for encoding the vector converted from the target text into the semantic vector according to the entity words in the target text.
In a possible implementation manner, the semantic vector generation model further includes a third coding layer and a fourth coding layer;
the providing a target recall result for the query statement to be processed based on the semantic vector of the query statement to be processed comprises:
inputting the candidate recall results into the text-to-vector layer, the third coding layer, and the fourth coding layer in sequence to obtain semantic vectors of the candidate recall results;
determining the similarity between the semantic vector of the query statement to be processed and the semantic vector of the candidate recall result;
and providing a target recall result for the query statement to be processed based on the similarity.
In this way, recall results can be provided for the query statement to be processed.
In another aspect, an apparatus for providing recall results is provided, the apparatus comprising:
the acquisition module is used for acquiring the query statement to be processed;
when the query sentence to be processed satisfies a target condition, adding the entity words in the query sentence to be processed at a target position of the query sentence to obtain a target text corresponding to the query sentence, where the target condition is that the sentence is a semantic query and/or that its number of characters is smaller than a target value while it has no fixed meaning, and the target position is the end or the beginning of the query sentence to be processed;
the input module is used for inputting the target text into a trained semantic vector generation model so that the semantic vector generation model obtains a semantic vector of the query sentence to be processed according to the entity words in the target text, wherein the query sentence in a training data set used for training the semantic vector generation model comprises the entity words;
and the matching module is used for providing a target recall result for the query statement to be processed based on the semantic vector of the query statement to be processed.
In one possible implementation, the apparatus further includes:
a training module to:
cleaning an exposure log of semantic query to obtain original training data, wherein the original training data comprises a plurality of query sentences and recall results corresponding to the query sentences;
adding labels to the recall results of the query sentences in the original training data based on the feedback of the user to the recall results of the query sentences to generate first-class training data;
determining a target query sentence comprising entity words in the original training data, and constructing second type training data based on the entity words in the target query sentence and a recall result of the target query sentence;
adding the first class of training data and the second class of training data to a training data set;
and training an initial semantic vector generation model based on the training data set and a pre-constructed loss function to obtain the trained semantic vector generation model.
In one possible implementation, the training module is configured to:
and deleting or replacing the entity words in the target query sentence with other entity words of the same type, and modifying the label of the recall result of the target query sentence into an opposite label to obtain second type training data.
In one possible implementation, the training module is configured to:
and if the entity words of the target query statement are not consistent with the entity words in the recall result of the target query statement and/or the entity word types are not consistent, modifying the labels of the recall result of the target query statement into opposite labels, and acquiring second-class training data.
In one possible implementation manner, the training module is further configured to:
searching synonyms of the keywords in the query sentence in a preset synonym table aiming at the query sentence in the original training data;
replacing the keywords in the query sentence with synonyms of the keywords to obtain a first query sentence;
adding the first query statement, the recall result of the first query statement, and a label as training data to the training data set.
In one possible implementation manner, the training module is further configured to:
searching, for the query sentences in the original training data, for antonyms of the keywords in the query sentence in a preset antonym table;
replacing the keywords in the query sentence with their antonyms to obtain a second query sentence;
adding the second query statement, the recall result of the second query statement, and an opposite label as training data to the training data set.
In one possible implementation manner, the training module is further configured to:
identifying query sentences in the original training data to obtain keywords in the query sentences;
deleting or replacing a word except the keyword in the query sentence with a preset word to obtain a third query sentence;
adding the third query statement, the recall result of the third query statement, and a label as training data to the training data set.
In one possible implementation manner, the training module is further configured to:
adding stop words to the head or the tail of the query sentence in the original training data to obtain a fourth query sentence, wherein the semantics of the query sentence added with the stop words are not changed;
adding the fourth query statement, the recall result of the fourth query statement, and a label as training data to the training data set.
In one possible implementation manner, the training module is further configured to:
screening a fifth query statement with frequency exceeding a target threshold and serving as semantic query in the original training data;
displaying a fifth query statement and a recall result corresponding to the fifth query statement;
acquiring a labeling result of the recall result corresponding to the fifth query statement by the user;
and taking the fifth query statement and the labeling result as training data, and adding the training data to the training data set.
In one possible implementation, the training module is configured to:
acquiring a plurality of pieces of training data in the training data set according to the batch size, wherein each piece of training data comprises a combination of a query statement and a corresponding recall result;
inputting each piece of training data into the initial semantic vector generation model to obtain a text vector of a query sentence and a text vector of a recall result in the training data;
determining a loss corresponding to the training data based on the loss function, the text vector of the query sentence in the training data and the text vector of the recall result;
when the loss meets a first condition or the evaluation index meets a second condition, determining the initial semantic vector generation model as the trained semantic vector generation model, when the loss does not meet the first condition and the evaluation index does not meet the second condition, updating the initial semantic vector generation model by using the loss, continuing to train the updated semantic vector generation model by using the training data set until the loss meets the first condition or the evaluation index meets the second condition, and determining the semantic vector generation model when the loss meets the first condition or the evaluation index meets the second condition as the trained semantic vector generation model.
In one possible implementation, the loss function includes a hinge loss function and a vector similarity loss function;
the training module is configured to:
multiply the text vector of the query statement and the text vector of the recall result element-wise to obtain a multiplied vector, and subtract the text vector of the recall result from the text vector of the query statement element-wise to obtain a subtracted vector;
concatenate the text vector of the query statement, the text vector of the recall result, the multiplied vector, and the subtracted vector, and input the result to a fully connected layer to obtain a similarity prediction result for the training data;
substitute the similarity prediction result of the training data and the label of the recall result in the training data into the hinge loss function to determine the hinge loss of the training data;
calculate the vector similarity of the text vector of the query statement and the text vector of the recall result, and substitute the vector similarity, the similarity prediction result, and the label of the recall result into the vector similarity loss function to determine the vector similarity loss of the training data;
and calculate the loss corresponding to the training data from the vector similarity loss and the hinge loss.
In one possible implementation, the semantic vector generation model includes a text-to-vector layer, a first coding layer, and a second coding layer;
the input module is configured to:
sequentially input the target text into the text-to-vector layer, the first coding layer, and the second coding layer to obtain the semantic vector of the query statement to be processed, wherein the text-to-vector layer is used for converting the target text into a vector, and the first coding layer and the second coding layer are used for encoding the vector converted from the target text into the semantic vector according to the entity words in the target text.
In a possible implementation manner, the semantic vector generation model further includes a third coding layer and a fourth coding layer;
the matching module is configured to:
input the candidate recall results into the text-to-vector layer, the third coding layer, and the fourth coding layer in sequence to obtain semantic vectors of the candidate recall results;
determining the similarity between the semantic vector of the query statement to be processed and the semantic vector of the candidate recall result;
and providing a target recall result for the query statement to be processed based on the similarity.
In yet another aspect, the present disclosure provides a computer apparatus comprising a processor and a memory, the memory having stored therein at least one computer instruction, the computer instruction being loaded and executed by the processor to perform operations performed by the method of providing recall results of the first aspect.
In yet another aspect, the present disclosure provides a computer-readable storage medium having at least one computer instruction stored therein, the computer instruction being loaded and executed by a processor to implement the operations performed by the method for providing recall results of the first aspect.
The beneficial effects brought by the technical scheme provided by the embodiment of the disclosure at least comprise:
in the embodiment of the disclosure, when the semantic vector generation model is trained, the query sentence used includes the entity word, so that the semantic vector generation model can accurately identify the entity word. And then, when the semantic vector is generated for the query sentence to be processed by using the semantic vector generation model, the entity words can be accurately identified, so that the accurate semantic vector can be determined for the query sentence to be processed. Therefore, when the recall result is matched for the query sentence to be processed, the accurate recall result can be matched due to the accurate semantic vector, and the accuracy of the provided recall result is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is apparent that the drawings in the description below are only some embodiments of the present disclosure, and it is obvious for those skilled in the art that other drawings may be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method of providing recall results provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of constructing a training data set provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an initial model provided by embodiments of the present disclosure;
FIG. 4 is a schematic diagram of a training semantic vector generation model provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an apparatus for providing recall results according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an apparatus for providing recall results according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a computer device provided in an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
For a better understanding of the embodiments of the present disclosure, the following noun concepts that may be involved are first introduced:
A semantic query (query) refers to a search term whose intent during a search is semantic rather than pointing to a specific item, for example, cheerful songs, sad music, and the like.
An entity word refers to a proper noun; for example, entity words include province names, musical instrument names, language names, festival names, religion names, terms specific to the music field (disco, blues, etc.), and the like. Entity words of the same type are entity words belonging to the same category; for example, all country names belong to one type of entity word, all province names belong to one type, all musical instrument names belong to one type, and so on.
In the related art, when a semantic vector is obtained, entity words of the same type appear in similar contexts and sentence structures (for example, the same part of speech, syntactic structure, and semantic role), so they may be mistakenly recognized as having the same semantics, which lowers the accuracy of the generated semantic vector. Moreover, because the antonyms of some words share the same part of speech, grammatical structure, and semantic role, they too may be mistakenly recognized as having the same semantics, further lowering the accuracy of the generated semantic vector and of the provided recall results.
In order to improve the accuracy of providing the recall result, the present disclosure provides a method for providing the recall result, where an execution subject of the method may be a device for providing the recall result, which is hereinafter referred to as a providing device for short, the providing device may be a terminal integrated with a software program, or may be a computer device such as a server, and the server may also be a cloud server. The server may include a processor, a memory, and a transceiver. The processor may be configured to perform processing of the process of providing recall results, such as generating semantic vectors and the like. The memory may be used to store data needed in providing recall results as well as data generated, such as semantic vectors that may store query statements. The transceiver may be used to receive as well as transmit data.
In the embodiment of the present disclosure, a process of providing a recall result related to a music application is described as an example. As shown in FIG. 1, the flow of the method of providing recall results may include the following steps 101-104.
Step 101, obtaining a query statement to be processed.
In this embodiment, if the query statement to be processed needs to be converted into a semantic vector, the providing device may obtain the query statement to be processed. For example, the user enters a sentence, which is a query sentence to be processed, in a search box of the music program.
Step 102: when the query sentence to be processed satisfies the target condition, add the entity words in the query sentence to be processed at the target position of the query sentence to obtain the target text corresponding to the query sentence, where the target condition is that the sentence is a semantic query and/or that its number of characters is smaller than a target value while it has no fixed meaning, and the target position is the end or the beginning of the query sentence to be processed.
The target value may be preset, for example, to 20. Having no fixed meaning is also understood as having no explicit pointing. No explicit pointing means that the sentence cannot point to a specific target, for example, it cannot point to a specific song, and the query sentence does not include a singer name, song name, movie or TV series title, and the like; explicit pointing means that the sentence can point to a specific target, for example, a specific song, and the query sentence includes a singer name, song name, movie title, and the like.
In this embodiment, the providing device first performs word segmentation on the query sentence to be processed to obtain a segmentation result, and then uses the segmentation result to analyze whether the query sentence is a semantic query. Illustratively, if the providing device determines from the segmentation result that the query sentence has an explicit pointing, it determines that the sentence is not a semantic query and therefore does not satisfy the target condition; if it determines that the sentence has no explicit pointing, it determines that the sentence is a semantic query and therefore satisfies the target condition. For example, for the query sentence "songs of singer A", since "A" and "songs" appear in the sentence, it explicitly points to a singer (A) and is not a semantic query. For another example, query sentences such as "good weather today", "soothing light music", "hit songs for afternoon tea", or "2020 viral hit songs" have no explicit pointing and are therefore semantic queries.
The providing device may determine the number of characters of the query sentence to be processed; if the number of characters is smaller than the target value, it judges whether the sentence has an explicit pointing, and if not, it determines that the sentence satisfies the target condition. If the number of characters is greater than or equal to the target value, or the sentence has an explicit pointing, it determines that the sentence does not satisfy the target condition. When judging whether the query sentence satisfies the target condition, the two sub-conditions, namely the number of characters being smaller than the target value with no fixed meaning, and the sentence being a semantic query, may also be checked simultaneously; the target condition is satisfied as long as at least one of them holds.
After determining that the query statement to be processed satisfies the target condition, the providing device may identify the entity word in the query statement to be processed. For example, the providing device may compare each segment of the query sentence to be processed with each entity word in a preset entity word bank to obtain an entity word in the query sentence to be processed, or the providing device may input the query sentence to be processed into a pre-trained neural network model to obtain an entity word in the query sentence to be processed. And then adding the entity words in the query sentence to be processed to the target position of the query sentence to be processed by the providing device to obtain a target text corresponding to the query sentence to be processed. The target position here is the end of a sentence or the beginning of a sentence.
Optionally, when the providing apparatus adds the entity word in the query statement to be processed to the target position of the query statement to be processed, the entity word and the query statement to be processed may be spaced apart by using a preset mark. For example, the preset mark is "#", the query statement to be processed is represented by a, the entity word in the query statement to be processed is M, and the query statement to be processed after the entity word is added is a # M.
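As a rough illustration of steps 101 and 102, the following Python sketch appends the recognized entity words to the end of a semantic query, separated by the preset mark "#". The entity word bank, the explicit-pointing test, and the way the target condition is combined are simplified stand-ins for the checks described above, not the patent's exact logic:

```python
ENTITY_LEXICON = {"Shanxi", "Northeast", "guzheng", "disco", "blues"}  # toy entity word bank
TARGET_VALUE = 20          # preset target value for the character count
EXPLICIT_TYPES = {"singer", "song", "movie"}  # segment types that give explicit pointing

def is_semantic_query(segments):
    """segments: list of (word, type) pairs from word segmentation."""
    return not any(seg_type in EXPLICIT_TYPES for _, seg_type in segments)

def build_target_text(query, segments):
    """Append the entity words at the sentence end, separated by '#'."""
    meets_target_condition = (is_semantic_query(segments)
                              or len(query) < TARGET_VALUE)  # simplified combination
    if not meets_target_condition:
        return query
    entities = [word for word, _ in segments if word in ENTITY_LEXICON]
    if not entities:
        return query
    return query + "#" + "#".join(entities)

# build_target_text("famous Shanxi songs",
#                   [("famous", "adj"), ("Shanxi", "place"), ("songs", "noun")])
# -> "famous Shanxi songs#Shanxi"
```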
Step 103, inputting the target text into the trained semantic vector generation model, so that the semantic vector generation model obtains a semantic vector of the query sentence to be processed according to the entity words in the target text, wherein the query sentence in the training data set for training the semantic vector generation model includes the entity words.
In this embodiment, the providing apparatus may obtain the semantic vector generation model from another device, or obtain the semantic vector generation model by training in advance, where the query sentence in the training data set for training the semantic vector generation model includes the entity word. And inputting the target text of the query sentence to be processed into a semantic vector generation model, and outputting the target text in combination with the entity words in the query sentence to be processed by the semantic vector generation model, wherein the output is the semantic vector of the query sentence to be processed. In this way, a semantic vector of the query statement to be processed is obtained.
And step 104, providing a target recall result for the query statement to be processed based on the semantic vector of the query statement to be processed.
In this embodiment, the providing apparatus selects a recall result among the candidate recall results as a target recall result of the query statement to be processed using the semantic vector of the query statement to be processed. The providing means provides the targeted recall result to the user.
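A minimal sketch of step 104, assuming the candidate recall results have been pre-encoded into a vector matrix and that cosine similarity is the similarity measure (the patent leaves the exact measure open); the names and thresholds are illustrative:

```python
import numpy as np

def recall_top_k(query_vec, candidate_vecs, candidate_ids, threshold=0.6, k=10):
    """Return up to k candidate ids whose cosine similarity with the query
    semantic vector exceeds the threshold, most similar first."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                          # cosine similarity to every candidate
    order = np.argsort(-sims)[:k]
    return [(candidate_ids[i], float(sims[i])) for i in order if sims[i] > threshold]
```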
In this way, when the semantic vector generation model is trained, the query sentences of the training samples include entity words, so that the trained semantic vector generation model can accurately identify the entity words, and the trained semantic vector generation model can determine accurate semantic vectors for the input query sentences to be processed according to the entity words in the query sentences to be processed. When the recall result is matched for the query sentence to be processed, the accurate recall result can be matched based on the accurate semantic vector, so that the accuracy of the recall result is improved.
The method flow shown in fig. 1 is described as follows:
in one possible implementation manner, the providing device trains and obtains the semantic vector generation model, and the processing is as follows:
cleaning an exposure log of semantic query to obtain original training data, wherein the original training data comprises a plurality of query sentences and recall results corresponding to the plurality of query sentences; adding labels to the recall results of the query sentences in the original training data based on the feedback of the user to the recall results of the query sentences to generate first-class training data; determining a target query statement comprising entity words in original training data, and constructing second type training data based on the entity words in the target query statement and the recall result of the target query statement; adding the first type of training data and the second type of training data to a training data set; and training the initial semantic vector generation model based on the training data set and a pre-constructed loss function to obtain a trained semantic vector generation model.
In this embodiment, the providing device may obtain exposure logs of semantic queries from the online music program, where an exposure log includes the query statement of a semantic query, the corresponding recall results, and the user's feedback on those recall results. The providing device builds an initial semantic vector generation model, which takes two inputs at a time, one being the input text of the query sentence and the other the input text of the recall result, together with a loss function. The providing device may clean the exposure logs of the semantic queries, removing duplicate data, to obtain the original training data.
For any query statement in the original training data, if the user's feedback on a certain recall result of that query statement is positive, the providing device may set the label of that recall result to 1; if the feedback is negative, the providing device may set the label to 0. The definitions of positive and negative feedback differ by scenario. For example, in a music application the recall result is a song: positive feedback means the user plays the song, expresses liking after playing it (liking, favoriting, commenting, etc.), plays it for longer than a preset duration, and so on; negative feedback means the user does not play the song through (for example, skips it shortly after it starts), skips songs repeatedly several times in a row, gives explicit negative feedback such as "dislike", and so on. For another example, in a video application the recall result is a video: positive feedback means the user plays the video to the end or for longer than a preset duration; negative feedback means the user does not finish the video or plays it for less than the preset duration, and so on. After the recall results are labeled in this way, the first type of training data is generated.
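The labeling step can be pictured with the following sketch; the log record fields and the threshold are illustrative assumptions rather than the patent's schema:

```python
PLAY_THRESHOLD_S = 30  # assumed preset play duration in seconds

def label_from_feedback(record):
    """Map a user's feedback on one recall result to a label: 1, 0, or None."""
    if record.get("disliked") or record.get("skipped_quickly"):
        return 0                                       # negative feedback
    if record.get("liked") or record.get("favorited") or record.get("commented"):
        return 1                                       # explicit positive feedback
    if record.get("play_seconds", 0) >= PLAY_THRESHOLD_S:
        return 1                                       # played long enough
    return None                                        # ambiguous, drop

def build_first_type_data(exposure_log):
    data = []
    for rec in exposure_log:                           # cleaned, de-duplicated log
        label = label_from_feedback(rec)
        if label is not None:
            data.append((rec["query"], rec["recall_result"], label))
    return data
```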
Then, the providing device determines a target query sentence including the entity words in the original training data, and constructs second type training data by using the entity words in the target query sentence and the recall result of the target query sentence. The first type of training data and the second type of training data are added to the training data set. And then training to obtain a semantic vector generation model based on the training data set, the initial semantic vector generation model and the loss function.
In a possible implementation manner, the process of constructing the second type of training data is:
and deleting or replacing the entity words in the target query sentence with other entity words of the same type, modifying the label of the recall result of the target query sentence into an opposite label, and obtaining second type training data.
In this embodiment, the providing device may delete the entity words in the target query statement and modify the label of the recall result of the target query statement into the opposite label to obtain second-type training data. Alternatively, the providing device may replace the entity words in the target query statement with other entity words of the same type and modify the label of the recall result into the opposite label, thereby obtaining second-type training data and increasing the training data for entity-word meaning learning. For example, if the entity word in the target query sentence "famous Shanxi songs" is "Shanxi", the sentence is replaced with "famous Northeast songs", and the label of the original recall result is modified to the opposite label.
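A sketch of this construction, assuming labels in {0, 1} as above and a hypothetical mapping from each entity word to other entity words of the same type:

```python
import random

SAME_TYPE_ENTITIES = {                 # hypothetical same-type entity groups
    "Shanxi": ["Northeast", "Guangdong"],
    "guzheng": ["erhu", "pipa"],
}

def make_second_type_data(query, recall_result, label, entity):
    """Delete or same-type-replace the entity word and flip the 0/1 label."""
    flipped = 1 - label
    samples = [(query.replace(entity, "").strip(), recall_result, flipped)]  # deletion
    if entity in SAME_TYPE_ENTITIES:
        other = random.choice(SAME_TYPE_ENTITIES[entity])
        samples.append((query.replace(entity, other), recall_result, flipped))
    return samples

# make_second_type_data("famous Shanxi songs", "Shanxi folk medley", 1, "Shanxi")
# -> [("famous  songs", "Shanxi folk medley", 0),
#     ("famous Northeast songs", "Shanxi folk medley", 0)]
```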
In a possible implementation manner, the process of constructing the second type of training data is:
and if the entity words of the target query statement are not consistent with the entity words in the recall result of the target query statement and/or the entity word types are not consistent, modifying the labels of the recall result of the target query statement into opposite labels, and acquiring second-class training data.
In this embodiment, the providing apparatus determines whether the entity words in the target query statement and the recall result of the target query statement are consistent. If not, modifying the label of the recall result of the target query statement into an opposite label; if so, the tag is not modified. And judging whether the entity word types in the target query statement and the recall result of the target query statement are consistent or not. If not, modifying the label of the recall result of the target query statement into an opposite label; if so, the tag is not modified. In this way, the second type of training data is also obtained, and training data for entity word meaning learning is added.
It should be noted that the inconsistency includes two cases, one is that the target query statement and the recall result both include the entity word, but the entity word is inconsistent; the other is that only one of the target query statement and the recall result includes the entity word, and the other does not include the entity word.
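A sketch of this variant, assuming a hypothetical extract_entities helper that returns (entity word, entity type) pairs; a mismatch in either the words or their types flips the label:

```python
def flip_on_entity_mismatch(query, recall_result, label, extract_entities):
    """extract_entities(text) -> set of (entity_word, entity_type) pairs."""
    q_ents = extract_entities(query)
    r_ents = extract_entities(recall_result)
    words_differ = {w for w, _ in q_ents} != {w for w, _ in r_ents}
    types_differ = {t for _, t in q_ents} != {t for _, t in r_ents}
    if words_differ or types_differ:
        return (query, recall_result, 1 - label)   # second-type sample, opposite label
    return (query, recall_result, label)           # consistent, label unchanged
```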
In one possible implementation, to consider the effect of synonyms on semantic understanding, training data may be constructed based on a synonym table, processed as:
searching synonyms of key words in the query sentences in a preset synonym table aiming at the query sentences in the original training data; replacing the keywords in the query sentence with synonyms of the keywords to obtain a first query sentence; adding the first query statement, the recall result of the first query statement, and the label as training data to a training data set.
In this embodiment, the providing apparatus may obtain a part of the query statement or the entire query statement in the original training data. The providing apparatus may screen out the keywords of each of the obtained query sentences using the parts of speech and sentence structures of each of the obtained terms in each of the query sentences. And then acquiring a preset synonym table, and searching the synonyms of the keywords in each acquired query sentence in the synonym table. And replacing the keywords in each query sentence with synonyms of the keywords to obtain a first query sentence. And adding the first query statement, the recall result of the first query statement and the corresponding label as training data to a training data set. Therefore, the training data with the same label is constructed without changing the sentence structure, so that not only can more training data be constructed, but also the learning of the model to the semantics of the query sentence in the training data is facilitated.
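A sketch of the synonym augmentation with a toy synonym table; the label is kept unchanged because the semantics are unchanged:

```python
SYNONYMS = {"cheerful": ["upbeat", "happy"], "soothing": ["relaxing"]}  # toy table

def synonym_variants(query, recall_result, label, keywords):
    out = []
    for kw in keywords:
        for syn in SYNONYMS.get(kw, []):
            out.append((query.replace(kw, syn), recall_result, label))  # same label
    return out
```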
In one possible implementation, since the content that the user wants to query generally corresponds to the recall result, antonyms tend to have a larger influence on the user experience than synonyms; for example, if a search for wedding background music recalls songs that broken-hearted people listen to, the user experience suffers considerably. To account for the effect of antonyms on semantic understanding, training data may be constructed based on an antonym table, processed as follows:
for the query sentences in the original training data, searching for antonyms of the keywords in the query sentence in a preset antonym table; replacing the keywords in the query sentence with their antonyms to obtain a second query sentence; and adding the second query statement, the recall result of the second query statement, and the opposite label as training data to the training data set.
In this embodiment, the providing device may obtain part of the query statements or all the query statements in the original training data. The providing device may screen out the keywords of each obtained query sentence using the part of speech and sentence structure of the terms in that sentence. It then obtains a preset antonym table and looks up the antonyms of the keywords of each obtained query sentence in that table. The keywords in each obtained query sentence are replaced with their antonyms to obtain a second query sentence, and the label of the recall result of the second query sentence is modified to the opposite label. The second query statement, the recall result of the second query statement, and the corresponding label are added as training data to the training data set. In this way, training data with opposite labels is constructed without changing the sentence structure, which not only yields more training data but also helps the model learn the semantics of the query sentences in the training data.
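The antonym case is symmetric to the synonym sketch except that the label is flipped; the table below is again a toy stand-in:

```python
ANTONYMS = {"cheerful": ["sad"], "wedding": ["breakup"]}  # toy antonym table

def antonym_variants(query, recall_result, label, keywords):
    out = []
    for kw in keywords:
        for ant in ANTONYMS.get(kw, []):
            out.append((query.replace(kw, ant), recall_result, 1 - label))  # opposite label
    return out
```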
In one possible implementation, in constructing the training data, in order to improve the stability of the trained semantic vector generation model, the training data may be augmented by:
identifying query sentences in original training data to obtain keywords in the query sentences; deleting or replacing a word except the keyword in the query sentence with a preset word to obtain a third query sentence; the third query statement, the recall result of the third query statement, and the label are added to the training data set as training data.
In this embodiment, the providing apparatus may acquire a part of the query sentence or the entire query sentence in the original training data. The providing apparatus may screen out the keywords of each of the obtained query sentences using the parts of speech and sentence structures of each of the obtained terms in each of the query sentences. On the basis of not changing the original sentence semantics, deleting or replacing a word except the keyword in each obtained query sentence by a preset word to obtain a third query sentence. The third query sentence, the recall result of the third query sentence, and the label are added to the training data set as training data. Thus, the label is not changed because the semanteme is not changed, and the aim of amplifying the training data can be achieved.
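A sketch of this augmentation; the keyword set is assumed to come from the part-of-speech screening described above, and the preset replacement word is illustrative:

```python
PLACEHOLDER = "some"   # assumed preset replacement word

def nonkeyword_variants(tokens, keywords, recall_result, label):
    """Delete or replace one non-keyword token; semantics and label unchanged."""
    out = []
    for i, tok in enumerate(tokens):
        if tok in keywords:
            continue
        deleted = tokens[:i] + tokens[i + 1:]
        swapped = tokens[:i] + [PLACEHOLDER] + tokens[i + 1:]
        out.append((" ".join(deleted), recall_result, label))
        out.append((" ".join(swapped), recall_result, label))
    return out
```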
In one possible implementation manner, in order to improve the stability of the trained semantic vector generation model when constructing the training data, the training data may be augmented in the following manner, and the processing is performed as follows:
adding stop words at the head or the tail of the query sentence in the original training data to obtain a fourth query sentence, wherein the semantics of the query sentence added with the stop words are not changed; and adding the fourth query statement, the recall result of the fourth query statement and the label as training data to the training data set.
After a stop word is added to the beginning or end of a query sentence, the semantics of the sentence are unchanged. For example, stop words include punctuation marks and meaning-free interjections such as "oh".
In this embodiment, the providing apparatus may acquire a part of the query sentence or the entire query sentence in the original training data. The providing apparatus may add stop words to the beginning or end of each of the obtained query sentences to obtain a fourth query sentence on the basis that the semantics of each of the obtained query sentences are not changed. And adding the fourth query statement, the recall result of the fourth query statement and the label as training data to a training data set. Thus, the training data can be amplified without changing the label because the semantic is unchanged.
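A sketch of the stop-word augmentation; the stop-word list is illustrative:

```python
STOP_WORDS = ["oh", "well"]   # illustrative meaning-free particles

def stopword_variants(query, recall_result, label):
    out = []
    for sw in STOP_WORDS:
        out.append((sw + " " + query, recall_result, label))   # sentence head
        out.append((query + " " + sw, recall_result, label))   # sentence tail
    return out
```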
In one possible implementation, to obtain highly reliable labeled data, a human may label a portion of the data by:
screening a fifth query statement with frequency exceeding a target threshold and serving as semantic query in the original training data; displaying the fifth query statement and a recall result corresponding to the fifth query statement; acquiring a label result of a recall result corresponding to the fifth query sentence by the user; and taking the fifth query statement and the labeling result as training data, and adding the training data to the training data set.
Wherein the target threshold value may be preset and stored in the providing means.
In this embodiment, the providing apparatus may screen out the fifth query statement, which is semantic query, in the original training data, where the frequency of the fifth query statement exceeds the target threshold. The providing device can display the fifth query statement and the recall result corresponding to the fifth query statement to the user, and the user can add a label to the recall result as a labeling result and submit the labeling result. The providing means may add the fifth query statement and the annotation result as training data to the training data set. Thus, the manual work can mark partial recall results to obtain partially accurate training data.
In addition, in the embodiment of the present disclosure, when the training data is constructed and a certain query statement in the original training data includes another query statement, the other query statement may be deleted and not used as the training data. In this way, duplicate training data may be deleted.
It should be noted here that the above-mentioned methods for constructing the training data set may exist simultaneously or only in some cases. As shown in fig. 2, an exemplary diagram is provided for a better understanding of the process of constructing a training data set in embodiments of the present disclosure.
In one possible implementation, as shown in fig. 3, the initial semantic vector generation model may include a text steering amount layer, a first coding layer, a second coding layer, a third coding layer, and a fourth coding layer, where, for any query statement, the query statement is processed by the text steering amount layer, the first coding layer, and the second coding layer to obtain a text vector of the query statement. And the recall result of the query statement passes through the text turning quantity layer, the third coding layer and the fourth coding layer to obtain a text vector of the recall result of the query statement. Illustratively, the text steering amount layer may be an embedding (embedding) layer for converting a numeric sequence of input text into a vector. Before entering the text steering amount layer, the input text is first preprocessed by the token layer to convert the input text into numbers (token-based). It should be noted that, in the initial semantic vector generation model, the first coding layer and the third coding layer may be the same, and share parameters, and the second coding layer and the fourth coding layer are both K layers, but do not share parameters.
The first coding layer and the second coding layer may be ALBERT encoders, but may also be other encoders such as long short-term memory (LSTM) networks, convolutional neural networks (CNN), or BERT-family encoders. Similarly, the third coding layer and the fourth coding layer may be ALBERT encoders, but may also be other encoders such as LSTM, CNN, or BERT-family encoders. The number of network layers selected for the ALBERT encoder is not limited in the embodiments of the present application.
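For a better understanding of this structure, a minimal PyTorch sketch of the dual-tower model is given below; the use of a generic Transformer encoder in place of ALBERT, the vocabulary size, the dimensions, and the mean pooling are assumptions made for the example. Reusing one module for the bottom of both towers realizes the parameter sharing of the first and third coding layers, while the two K-layer tops are separate modules without parameter sharing.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, vocab_size=30000, dim=256, k_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # text-to-vector (embedding) layer
        make_encoder = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=n)
        self.shared_bottom = make_encoder(2)      # first/third coding layers (shared)
        self.query_top = make_encoder(k_layers)   # second coding layer (K layers)
        self.recall_top = make_encoder(k_layers)  # fourth coding layer (K layers, not shared)

    def _pool(self, hidden):
        return hidden.mean(dim=1)  # average pooling over the token dimension

    def encode_query(self, token_ids):
        """Query tower: embedding -> shared bottom -> query top -> pooling."""
        return self._pool(self.query_top(self.shared_bottom(self.embed(token_ids))))

    def encode_recall(self, token_ids):
        """Recall-result tower: embedding -> shared bottom -> recall top -> pooling."""
        return self._pool(self.recall_top(self.shared_bottom(self.embed(token_ids))))
```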
The process of training to obtain the semantic vector generation model comprises the following steps:
acquiring a plurality of pieces of training data in a training data set according to batch sizes, wherein each piece of training data comprises a combination of a query statement and a corresponding recall result; inputting each piece of training data into an initial semantic vector generation model to obtain text vectors of query sentences and text vectors of recall results in the training data; determining the loss corresponding to the training data based on the loss function, the text vector of the query sentence in the training data and the text vector of the recall result; when the loss meets the first condition or the evaluation index meets the second condition, determining the initial semantic vector generation model as a trained semantic vector generation model, when the loss does not meet the first condition and the evaluation index does not meet the second condition, updating the initial semantic vector generation model by using the loss, continuing to train the updated semantic vector generation model by using the training data set until the loss meets the first condition or the evaluation index meets the second condition, and determining the semantic vector generation model when the loss meets the first condition or the evaluation index meets the second condition as the trained semantic vector generation model.
The first condition is that the loss is smaller than a target value. The evaluation index may include recall, accuracy, false positive rate, false negative rate, and the like, and the second condition is that the recall is higher than a certain value, the accuracy is higher than a certain value, the false positive rate is lower than a certain value, and the false negative rate is lower than a certain value.
In this embodiment, the providing apparatus may take the initial semantic vector generation model as the current model and obtain, from the training data set, a plurality of pieces of current training data according to the current batch size, where each piece of training data comprises a combination of a query statement and a corresponding recall result.
Then, the providing device inputs each piece of current training data into the current model, and the output is the text vector of the query sentence and the text vector of the recall result in each piece of current training data. The providing device calculates the loss of the current training data by using the loss function, the text vector of the query sentence in each piece of current training data and the text vector of the recall result.
If the loss satisfies the first condition or the evaluation index satisfies the second condition, the current model is determined as the trained semantic vector generation model.
If the loss does not satisfy the first condition and the evaluation index does not satisfy the second condition, the providing apparatus may update the current model based on the loss, return to the processing of selecting current training data from the training data set, and repeat until the loss satisfies the first condition or the evaluation index satisfies the second condition; the text-to-vector layer, the first coding layer, and the second coding layer of the model at that point are determined as the semantic vector generation model.
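The loop described above may be sketched as follows; the optimizer, the learning rate, the data format, and the target value are assumptions, and the evaluation-index check is abbreviated to a comment.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, loss_fn, batch_size=64, lr=1e-4, target_value=0.01):
    """Train until the loss satisfies the first condition (loss < target value)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    while True:
        for query_ids, recall_ids, labels in loader:
            u = model.encode_query(query_ids)    # text vectors of the query statements
            v = model.encode_recall(recall_ids)  # text vectors of the recall results
            loss = loss_fn(u, v, labels)
            if loss.item() < target_value:       # first condition met
                return model                     # (an evaluation-index check would go here)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```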
It should be noted here that the training process can be divided into two parts. The first part trains for three epochs on the combination of the rule-constructed data set and the manually labeled data set; this part uses more than one million pieces of training data and mainly aims to introduce a large number of domain terms into the model and improve its generalization performance. The second part continues training for two epochs using the manually labeled data, with a smaller batch size than the first part.
Optionally, after the first coding layer and the second coding layer, pooling is performed to obtain the text vector of the query statement, and after the third coding layer and the fourth coding layer, pooling is performed to obtain the text vector of the recall result. In this way, the dimensionality can be reduced. The pooling here may be max pooling, average pooling, or the like.
In one possible implementation, the loss function may include a hinge loss function and a vector similarity loss function, where the hinge loss function is a max-margin hinge loss function. The process of calculating the loss during training may be as follows:
multiplying the text vector of the query sentence and the text vector of the recall result at corresponding positions to obtain a multiplied vector, and subtracting the values at corresponding positions of the text vector of the query sentence and the text vector of the recall result to obtain a subtracted vector; merging the text vector of the query sentence, the text vector of the recall result, the multiplied vector, and the subtracted vector, and inputting the merged vector to a fully connected layer to obtain the similarity prediction result of the training data; substituting the similarity prediction result of the training data and the label of the recall result in the training data into the hinge loss function to determine the hinge loss of the training data; calculating the vector similarity between the text vector of the query sentence and the text vector of the recall result, and substituting the vector similarity, the similarity prediction result, and the label of the recall result into the vector similarity loss function to determine the vector similarity loss of the training data; and calculating the loss corresponding to the training data using the vector similarity loss and the hinge loss.
In this embodiment, the loss for a piece of current training data is calculated as follows:
As shown in fig. 4, for each piece of current training data, the providing apparatus multiplies the text vector of the query sentence in the piece of training data and the text vector of the recall result at corresponding positions to obtain the multiplied vector. For example, if the text vector u of the query sentence is (a1, a2, a3, …, an) and the text vector v of the recall result is (b1, b2, b3, …, bn), the multiplied vector is (a1·b1, a2·b2, a3·b3, …, an·bn). The values at corresponding positions of the text vector of the query sentence and the text vector of the recall result are subtracted to obtain the subtracted vector. For example, with u = (a1, a2, a3, …, an) and v = (b1, b2, b3, …, bn), the subtracted vector is (a1−b1, a2−b2, a3−b3, …, an−bn).
The text vector of the query sentence, the text vector of the recall result, the multiplied vector, and the subtracted vector in the piece of training data are merged to obtain a vector whose dimensionality is four times that of the original u or v vector. In this way, a higher-dimensional vector is obtained that describes both the difference and the product of the text vector of the query sentence and the text vector of the recall result, so the similarity between the two can be described better. The merged vector is input into a fully connected layer to obtain the similarity prediction result. The providing apparatus may then determine the hinge loss of the piece of training data based on the similarity prediction result and the label of the recall result, as expressed by equation (1):
L_hinge = y · max(0, m1 − p)² + (1 − y) · max(0, p − m2)²        (1)

wherein, in formula (1), L_hinge represents the hinge loss of the piece of training data; y represents the label and takes the value 1 or 0; p represents the similarity prediction result, which may also be called the prediction probability of the model, where a larger p indicates a higher predicted probability of being similar; m1 and m2 represent hyperparameters whose default value may be 0.2; and max(0, ·)² means taking the square of the maximum of 0 and the quantity in parentheses.
For each piece of current training data, calculating the vector similarity of the text vector of the query sentence and the text vector of the recall result in the piece of training data, wherein the vector similarity may be cosine similarity. Determining a vector similarity loss of the piece of training data based on the vector similarity and the label of the recall result, as expressed by equation (2):
L_sim = ℓ(y, p, cos(u, v))        (2)

wherein, in formula (2), ℓ denotes the vector similarity loss function, L_sim represents the vector similarity loss of the piece of training data, y represents the label and takes the value 1 or 0, p represents the similarity prediction result, and cos(u, v) represents the vector similarity between the text vector u of the query sentence and the text vector v of the recall result in the piece of training data, with a range of [−1, 1].
Finally, the loss for any piece of training data is expressed using equation (3):
L = L_hinge + λ · L_sim        (3)

wherein, in formula (3), λ is a hyperparameter and is an empirical value.
The losses of all pieces of current training data are then summed to obtain the loss corresponding to the current training data.
In this way, because both the vector similarity loss and the hinge loss are considered, the trained semantic vector generation model can be more accurate.
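Putting equations (1) to (3) together, the following sketch computes the loss for a batch; the margins m1 and m2 and the weight lam follow the defaults mentioned above, and because the closed form of the vector similarity loss in equation (2) is not reproduced here, a simple squared-error surrogate between the cosine similarity and the signed label is used for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairLoss(nn.Module):
    def __init__(self, dim, m1=0.2, m2=0.2, lam=1.0):
        super().__init__()
        self.fc = nn.Linear(4 * dim, 1)  # fully connected layer over [u, v, u*v, u-v]
        self.m1, self.m2, self.lam = m1, m2, lam

    def forward(self, u, v, y):
        # y: float tensor of 0/1 labels
        merged = torch.cat([u, v, u * v, u - v], dim=-1)   # 4x-dimensional merged vector
        p = torch.sigmoid(self.fc(merged)).squeeze(-1)     # similarity prediction result
        hinge = y * F.relu(self.m1 - p) ** 2 + (1 - y) * F.relu(p - self.m2) ** 2  # eq. (1)
        cos = F.cosine_similarity(u, v, dim=-1)            # vector similarity in [-1, 1]
        sim_loss = (cos - (2 * y - 1)) ** 2                # surrogate for eq. (2)
        return (hinge + self.lam * sim_loss).sum()         # eq. (3), summed over the batch
```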
In a possible implementation manner, when the semantic vector generation model includes a text-to-vector layer, a first coding layer, and a second coding layer, the processing of step 103 is:
sequentially inputting the target text into the text-to-vector layer, the first coding layer, and the second coding layer to obtain the semantic vector of the query sentence to be processed, wherein the text-to-vector layer is used for converting the target text into a vector, and the first coding layer and the second coding layer are used for coding the vector converted from the target text into the semantic vector according to the entity words in the target text.
In this embodiment, the providing apparatus inputs the target text into the text-to-vector layer, which obtains the token ID sequence of the target text and converts it into a vector. The vector converted from the target text is then input into the first coding layer and the second coding layer for encoding to obtain the semantic vector. Optionally, after the encoding, pooling is performed to obtain the semantic vector of the query statement to be processed.
It should be noted here that, in the training phase, the first coding layer and the second coding layer learn the entity words of the query sentences in the training data, so the encoding can be considered to be performed based on the entity words in the query sentence.
In a possible implementation manner, a process of providing a recall result for a query statement to be processed is further provided, where the process includes:
inputting the candidate recall result into the text-to-vector layer, the third coding layer, and the fourth coding layer in sequence to obtain the semantic vector of the candidate recall result; determining the similarity between the semantic vector of the query statement to be processed and the semantic vector of the candidate recall result; and providing a target recall result for the query statement to be processed based on the similarity.
In this embodiment, the candidate library includes preset recall results and is an offline candidate library that is refreshed offline to serve the online service. The providing apparatus acquires all or part of the recall results in the candidate library as candidate recall results for the query statement to be processed, converts each candidate recall result into a vector through the text-to-vector layer, and encodes the vector of each candidate recall result through the third coding layer and the fourth coding layer to obtain the semantic vector of each candidate recall result. Optionally, after the encoding, pooling is performed to obtain the semantic vector of each candidate recall result.
Then, the similarity between the semantic vector of the query statement to be processed and the semantic vector of each candidate recall result is calculated; optionally, the similarity may be the cosine similarity. Candidate recall results whose similarity is greater than a preset threshold are then determined as the recall results of the query statement to be processed. Because the generated semantic vectors are accurate, the obtained recall results are also accurate, and the user click-through rate is correspondingly high, where the click-through rate is the ratio of the total number of clicks to the total number of impressions.
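The retrieval step may be sketched as follows, reusing the DualEncoder sketch above; the threshold value of 0.8 is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def provide_recall_results(model, query_ids, candidate_ids_list, threshold=0.8):
    """Encode the query and each candidate recall result, and keep candidates
    whose cosine similarity exceeds the preset threshold."""
    with torch.no_grad():
        q = model.encode_query(query_ids)              # semantic vector of the query
        results = []
        for idx, cand_ids in enumerate(candidate_ids_list):
            c = model.encode_recall(cand_ids)          # semantic vector of the candidate
            sim = F.cosine_similarity(q, c, dim=-1).item()
            if sim > threshold:
                results.append((idx, sim))
    return sorted(results, key=lambda r: r[1], reverse=True)
```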
Optionally, when providing recall results for the query statement to be processed based on the similarity, the review information of each recall result may also be taken into account. For example, if a certain recall result has a high similarity but a low user evaluation score in its review information, it is not taken as a recall result of the query statement to be processed.
In the embodiment of the disclosure, when the semantic vector generation model is trained, the query sentence used includes the entity word, so that the semantic vector generation model can accurately identify the entity word. And then, when the semantic vector is generated for the query sentence to be processed by using the semantic vector generation model, the entity word can be accurately identified, so that the accurate semantic vector can be determined for the query sentence to be processed. Therefore, when the recall result is matched for the query sentence to be processed, the accurate recall result can be matched due to the accurate semantic vector, and the accuracy of the provided recall result is improved.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present disclosure, and are not described in detail herein.
Based on the same technical concept, as shown in fig. 5, an embodiment of the present disclosure provides an apparatus for providing a recall result, the apparatus including:
an obtaining module 510, configured to obtain a query statement to be processed;
under the condition that the query sentence to be processed meets a target condition, add an entity word in the query sentence to be processed at a target position of the query sentence to be processed to obtain a target text corresponding to the query sentence to be processed, wherein the target condition is that the query sentence is a semantic query and/or that the number of characters is smaller than a target value and the query sentence has no fixed meaning, and the target position is the tail or the head of the query sentence to be processed;
an input module 520, configured to input the target text into a trained semantic vector generation model, so that the semantic vector generation model obtains a semantic vector of the query sentence to be processed according to an entity word in the target text, where the query sentence in a training data set used for training the semantic vector generation model includes the entity word;
a matching module 530, configured to provide a target recall result for the query statement to be processed based on the semantic vector of the query statement to be processed.
In one possible implementation manner, as shown in fig. 6, the apparatus further includes:
a training module 540 to:
cleaning an exposure log of semantic query to obtain original training data, wherein the original training data comprises a plurality of query sentences and recall results corresponding to the query sentences;
adding labels to the recall results of the query sentences in the original training data based on the feedback of the user to the recall results of the query sentences to generate first-class training data;
determining a target query statement comprising entity words in the original training data, and constructing second type training data based on the entity words in the target query statement and the recall result of the target query statement;
adding the first type of training data and the second type of training data to a training data set;
and training an initial semantic vector generation model based on the training data set and a pre-constructed loss function to obtain the trained semantic vector generation model.
In one possible implementation manner, the training module 540 is configured to:
and deleting or replacing the entity words in the target query sentence with other entity words of the same type, modifying the label of the recall result of the target query sentence into an opposite label, and obtaining second-class training data.
In one possible implementation manner, the training module 540 is configured to:
and if the entity words of the target query statement are not consistent with the entity words in the recall result of the target query statement and/or the entity word types are not consistent, modifying the labels of the recall result of the target query statement into opposite labels, and acquiring second-class training data.
In a possible implementation manner, the training module 540 is further configured to:
searching synonyms of the keywords in the query sentences in a preset synonym table aiming at the query sentences in the original training data;
replacing the keywords in the query sentence with synonyms of the keywords to obtain a first query sentence;
adding the first query statement, the recall result of the first query statement, and a label as training data to the training data set.
In a possible implementation manner, the training module 540 is further configured to:
searching, for the query sentences in the original training data, an antonym of a keyword in the query sentence in a preset antonym table;
replacing the keywords in the query sentence with the antonyms of the keywords to obtain a second query sentence;
adding the second query statement, the recall result of the second query statement, and an opposite label as training data to the training data set.
In a possible implementation manner, the training module 540 is further configured to:
identifying query sentences in the original training data to obtain keywords in the query sentences;
deleting or replacing a word except the keyword in the query sentence with a preset word to obtain a third query sentence;
adding the third query statement, the recall result of the third query statement, and a label as training data to the training data set.
In a possible implementation manner, the training module 540 is further configured to:
adding stop words to the sentence heads or the sentence tails of the query sentences in the original training data to obtain fourth query sentences, wherein the semantics of the query sentences are not changed after the stop words are added to the query sentences;
adding the fourth query statement, the recall result of the fourth query statement, and a label as training data to the training data set.
In a possible implementation manner, the training module 540 is further configured to:
screening a fifth query statement with frequency exceeding a target threshold and serving as semantic query in the original training data;
displaying a fifth query statement and a recall result corresponding to the fifth query statement;
acquiring a label result of a recall result corresponding to the fifth query statement by a user;
and taking the fifth query sentence and the labeling result as training data and adding them to the training data set.
In one possible implementation manner, the training module 540 is configured to:
acquiring a plurality of pieces of training data in the training data set according to the batch size, wherein each piece of training data comprises a combination of a query statement and a corresponding recall result;
inputting each piece of training data into the initial semantic vector generation model to obtain a text vector of a query sentence and a text vector of a recall result in the training data;
determining a loss corresponding to the training data based on the loss function, the text vector of the query sentence in the training data and the text vector of the recall result;
when the loss meets a first condition or the evaluation index meets a second condition, determining the initial semantic vector generation model as the trained semantic vector generation model, when the loss does not meet the first condition and the evaluation index does not meet the second condition, updating the initial semantic vector generation model by using the loss, continuing to train the updated semantic vector generation model by using the training data set until the loss meets the first condition or the evaluation index meets the second condition, and determining the semantic vector generation model when the loss meets the first condition or the evaluation index meets the second condition as the trained semantic vector generation model.
In one possible implementation, the loss function includes a hinge loss function and a vector similarity loss function;
the training module 540 is configured to:
multiplying the text vector of the query statement and the text vector of the recall result at corresponding positions to obtain a multiplied vector, and subtracting the values at corresponding positions of the text vector of the query statement and the text vector of the recall result to obtain a subtracted vector;
merging the text vector of the query statement, the text vector of the recall result, the multiplied vector, and the subtracted vector, and inputting the merged vector to a fully connected layer to obtain the similarity prediction result of the training data;
substituting the similarity prediction result of the training data and the label of the recall result in the training data into the hinge loss function to determine the hinge loss of the training data;
calculating the vector similarity between the text vector of the query statement and the text vector of the recall result, and substituting the vector similarity, the similarity prediction result, and the label of the recall result into the vector similarity loss function to determine the vector similarity loss of the training data;
and calculating the loss corresponding to the training data by using the vector similarity loss and the hinge loss.
In one possible implementation manner, the semantic vector generation model comprises a text-to-vector layer, a first coding layer, and a second coding layer;
the input module 520 is configured to:
sequentially inputting the target text into the text-to-vector layer, the first coding layer, and the second coding layer to obtain the semantic vector of the query statement to be processed, wherein the text-to-vector layer is used for converting the target text into a vector, and the first coding layer and the second coding layer are used for coding the vector converted from the target text into the semantic vector according to the entity words in the target text.
In a possible implementation manner, the semantic vector generation model further includes a third coding layer and a fourth coding layer;
the matching module 530 is configured to:
inputting candidate recall results into the text-to-vector layer, the third coding layer, and the fourth coding layer in sequence to obtain semantic vectors of the candidate recall results;
determining the similarity between the semantic vector of the query statement to be processed and the semantic vector of the candidate recall result;
and providing a target recall result for the query statement to be processed based on the similarity.
The above division of the means for providing recall results is exemplary.
It should be noted that: in the apparatus for providing a recall result according to the above embodiment, when providing a recall result, only the division of the functional modules is illustrated, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for providing a recall result and the method for providing a recall result provided in the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments, and are not described herein again.
Fig. 7 is a schematic structural diagram of a computer device 700. The computer device 700 may vary greatly in configuration or performance and may include one or more processors (CPUs) 701 and one or more memories 702, where the memory 702 stores at least one instruction that is loaded and executed by the processor 701 to implement the method for providing a recall result provided by the foregoing method embodiments. Certainly, the computer device may further have a wired or wireless network interface, a keyboard, an input/output interface, and other components to facilitate input and output, and the computer device may further include other components for implementing functions of the device, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including instructions executable by a processor in a terminal, to perform the method of providing recall results in the above embodiments is also provided. The computer readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a read-only memory (ROM), a Random Access Memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above description is intended to be exemplary only and not to limit the present disclosure, and any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the present disclosure is to be considered as the same as the present disclosure.

Claims (15)

1. A method of providing recall results, the method comprising:
acquiring query sentences to be processed;
under the condition that the query sentence to be processed meets a target condition, adding an entity word in the query sentence to be processed at a target position of the query sentence to be processed to obtain a target text corresponding to the query sentence to be processed, wherein the target condition is that the query sentence is a semantic query and/or that the number of characters is smaller than a target value and the query sentence has no fixed meaning, and the target position is the tail or the head of the query sentence to be processed;
inputting the target text into a trained semantic vector generation model, so that the semantic vector generation model obtains a semantic vector of the query sentence to be processed according to entity words in the target text, wherein the query sentence in a training data set for training the semantic vector generation model comprises the entity words;
and providing a target recall result for the query statement to be processed based on the semantic vector of the query statement to be processed.
2. The method of claim 1, further comprising:
cleaning an exposure log of semantic query to obtain original training data, wherein the original training data comprises a plurality of query sentences and recall results corresponding to the query sentences;
adding labels to the recall results of the query sentences in the original training data based on the feedback of the user to the recall results of the query sentences to generate first-class training data;
determining a target query sentence comprising entity words in the original training data, and constructing second type training data based on the entity words in the target query sentence and a recall result of the target query sentence;
adding the first type of training data and the second type of training data to a training data set;
and training an initial semantic vector generation model based on the training data set and a pre-constructed loss function to obtain the trained semantic vector generation model.
3. The method of claim 2, wherein constructing a second type of training data based on the entity words in the target query statement and the recall result of the target query statement comprises:
and deleting or replacing the entity words in the target query sentence with other entity words of the same type, modifying the label of the recall result of the target query sentence into an opposite label, and obtaining second-class training data.
4. The method of claim 2, wherein constructing a second type of training data based on the entity words in the target query statement and the recall result of the target query statement comprises:
and if the entity words of the target query statement are not consistent with the entity words in the recall result of the target query statement and/or the entity word types are not consistent, modifying the labels of the recall result of the target query statement into opposite labels, and acquiring second-class training data.
5. The method according to any one of claims 2 to 4, further comprising:
searching synonyms of the keywords in the query sentences in a preset synonym table aiming at the query sentences in the original training data;
replacing the keywords in the query sentence with synonyms of the keywords to obtain a first query sentence;
adding the first query statement, the recall result of the first query statement, and a label as training data to the training data set.
6. The method according to any one of claims 2 to 4, further comprising:
searching, for the query sentences in the original training data, an antonym of a keyword in the query sentence in a preset antonym table;
replacing the keywords in the query sentence with the antonyms of the keywords to obtain a second query sentence;
adding the second query statement, the recall result of the second query statement, and an opposite label as training data to the training data set.
7. The method according to any one of claims 2 to 4, further comprising:
identifying query sentences in the original training data to obtain keywords in the query sentences;
deleting or replacing a word except the keyword in the query sentence with a preset word to obtain a third query sentence;
adding the third query statement, the recall result of the third query statement, and a label as training data to the training data set.
8. The method according to any one of claims 2 to 4, further comprising:
adding stop words to the sentence heads or the sentence tails of the query sentences in the original training data to obtain fourth query sentences, wherein the semantics of the query sentences are not changed after the stop words are added to the query sentences;
adding the fourth query statement, the recall result of the fourth query statement, and a label as training data to the training data set.
9. The method according to any one of claims 2 to 4, further comprising:
screening a fifth query statement with frequency exceeding a target threshold and serving as semantic query in the original training data;
displaying a fifth query statement and a recall result corresponding to the fifth query statement;
acquiring a labeling result of the recall result corresponding to the fifth query statement by the user;
and taking the fifth query sentence and the labeling result as training data and adding them to the training data set.
10. The method according to any one of claims 2 to 4, wherein the training an initial semantic vector generation model based on the training data set and a pre-constructed loss function to obtain the trained semantic vector generation model comprises:
acquiring a plurality of pieces of training data in the training data set according to the batch size, wherein each piece of training data comprises a combination of a query sentence and a corresponding recall result;
inputting each piece of training data into the initial semantic vector generation model to obtain text vectors of query sentences and text vectors of recall results in the training data;
determining a loss corresponding to the training data based on the loss function, the text vector of the query sentence in the training data and the text vector of the recall result;
when the loss meets a first condition or the evaluation index meets a second condition, determining the initial semantic vector generation model as the trained semantic vector generation model, when the loss does not meet the first condition and the evaluation index does not meet the second condition, updating the initial semantic vector generation model by using the loss, continuing to train the updated semantic vector generation model by using the training data set until the loss meets the first condition or the evaluation index meets the second condition, and determining the semantic vector generation model when the loss meets the first condition or the evaluation index meets the second condition as the trained semantic vector generation model.
11. The method of claim 10, wherein the loss function comprises a hinge loss function and a vector similarity loss function;
the determining the loss corresponding to the training data based on the loss function, the text vector of the query sentence in the training data, and the text vector of the recall result includes:
multiplying the text vector of the query statement and the text vector of the recall result at corresponding positions to obtain a multiplied vector, and subtracting the values at corresponding positions of the text vector of the query statement and the text vector of the recall result to obtain a subtracted vector;
merging the text vector of the query statement, the text vector of the recall result, the multiplied vector, and the subtracted vector, and inputting the merged vector to a fully connected layer to obtain the similarity prediction result of the training data;
substituting the similarity prediction result of the training data and the label of the recall result in the training data into the hinge loss function to determine the hinge loss of the training data;
calculating the vector similarity between the text vector of the query statement and the text vector of the recall result, and substituting the vector similarity, the similarity prediction result, and the label of the recall result into the vector similarity loss function to determine the vector similarity loss of the training data;
and calculating the loss corresponding to the training data by using the vector similarity loss and the hinge loss.
12. The method according to any one of claims 1 to 4, wherein the semantic vector generation model comprises a text-to-vector layer, a first coding layer, and a second coding layer;
the inputting the target text into the trained semantic vector generation model so that the semantic vector generation model obtains the semantic vector of the query sentence to be processed according to the entity words in the target text comprises:
sequentially inputting the target text into the text-to-vector layer, the first coding layer, and the second coding layer to obtain the semantic vector of the query sentence to be processed, wherein the text-to-vector layer is used for converting the target text into a vector, and the first coding layer and the second coding layer are used for coding the vector converted from the target text into the semantic vector according to the entity words in the target text.
13. The method of claim 12, wherein the semantic vector generation model further comprises a third coding layer and a fourth coding layer;
the providing a target recall result for the query statement to be processed based on the semantic vector of the query statement to be processed comprises:
inputting candidate recall results into the text-to-vector layer, the third coding layer, and the fourth coding layer in sequence to obtain semantic vectors of the candidate recall results;
determining the similarity between the semantic vector of the query statement to be processed and the semantic vector of the candidate recall result;
and providing a target recall result for the query statement to be processed based on the similarity.
14. A computer device comprising a processor and a memory, the memory having stored therein at least one computer instruction, the computer instruction being loaded and executed by the processor to perform operations performed by a method of providing recall results according to any of claims 1 to 13.
15. A computer-readable storage medium having at least one computer instruction stored therein, the computer instruction being loaded and executed by a processor to perform operations performed by a method for providing recall results according to any of claims 1 to 13.
CN202110567087.0A 2021-05-24 2021-05-24 Method, apparatus and storage medium for providing recall result Active CN113157727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110567087.0A CN113157727B (en) 2021-05-24 2021-05-24 Method, apparatus and storage medium for providing recall result


Publications (2)

Publication Number Publication Date
CN113157727A CN113157727A (en) 2021-07-23
CN113157727B (en) 2022-12-13

Family

ID=76877197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110567087.0A Active CN113157727B (en) 2021-05-24 2021-05-24 Method, apparatus and storage medium for providing recall result

Country Status (1)

Country Link
CN (1) CN113157727B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535927A (en) * 2021-07-30 2021-10-22 杭州网易智企科技有限公司 Method, medium, device and computing equipment for acquiring similar texts
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment
CN113806519A (en) * 2021-09-24 2021-12-17 金蝶软件(中国)有限公司 Search recall method, device and medium
CN115168537B (en) * 2022-06-30 2023-06-27 北京百度网讯科技有限公司 Training method and device for semantic retrieval model, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002251401A (en) * 2001-02-22 2002-09-06 Canon Inc Device, method and storage media for retrieving document
JP2006065387A (en) * 2004-08-24 2006-03-09 Fuji Xerox Co Ltd Text sentence search device, method, and program
CN110188168A (en) * 2019-05-24 2019-08-30 北京邮电大学 Semantic relation recognition methods and device
CN111368555A (en) * 2020-05-27 2020-07-03 腾讯科技(深圳)有限公司 Data identification method and device, storage medium and electronic equipment
WO2020208593A1 (en) * 2019-04-12 2020-10-15 Incyzr Pty. Ltd. Methods, systems and computer program products for implementing neural network based optimization of database search functionality
CN112214593A (en) * 2020-11-05 2021-01-12 腾讯科技(深圳)有限公司 Question and answer processing method and device, electronic equipment and storage medium
CN112507091A (en) * 2020-12-01 2021-03-16 百度健康(北京)科技有限公司 Method, device, equipment and storage medium for retrieving information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11132755B2 (en) * 2018-10-30 2021-09-28 International Business Machines Corporation Extracting, deriving, and using legal matter semantics to generate e-discovery queries in an e-discovery system
CN109871545B (en) * 2019-04-22 2022-08-05 京东方科技集团股份有限公司 Named entity identification method and device
CN110516062B (en) * 2019-08-26 2022-11-04 腾讯科技(深圳)有限公司 Method and device for searching and processing document
CN111522994B (en) * 2020-04-15 2023-08-01 北京百度网讯科技有限公司 Method and device for generating information
CN111831911B (en) * 2020-07-16 2023-07-07 北京奇艺世纪科技有限公司 Query information processing method and device, storage medium and electronic device


Also Published As

Publication number Publication date
CN113157727A (en) 2021-07-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230504

Address after: 518000 Room 201, building A, 1 front Bay Road, Shenzhen Qianhai cooperation zone, Shenzhen, Guangdong

Patentee after: TENCENT MUSIC ENTERTAINMENT (SHENZHEN) Co.,Ltd.

Address before: 518000 Room 201, building A, 1 front Bay Road, Shenzhen Qianhai cooperation zone, Shenzhen, Guangdong

Patentee before: TENCENT MUSIC ENTERTAINMENT TECHNOLOGY (SHENZHEN) Co.,Ltd.
