CN115033774A

CN115033774A - Method, device, equipment and medium for generating search text to be recommended

Info

Publication number: CN115033774A
Application number: CN202210694359.8A
Authority: CN
Inventors: 陈灿; 高菲; 田永杰; 张亮
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-06-16
Filing date: 2022-06-16
Publication date: 2022-09-09

Abstract

The disclosure provides a method and a device for generating a search text to be recommended, electronic equipment and a storage medium, and relates to the field of artificial intelligence, in particular to the technical fields of natural language processing, deep learning and intelligent search. The specific implementation scheme of the method for generating the search text to be recommended is as follows: in response to receiving the input text, generating subsequent text for the input text; splicing the input text and the subsequent text to obtain an alternative search text aiming at the input text; and screening the alternative search texts according to a preset text screening strategy to obtain the search text to be recommended aiming at the input text.

Description

Method, device, equipment and medium for generating search text to be recommended

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular, to the technical field of natural language processing, deep learning, and intelligent search, and more particularly, to a method and an apparatus for generating a search text to be recommended, an electronic device, and a storage medium.

Background

With the development of computer technology and network technology, a function of providing a recommended term may be integrated in a search engine so that a user can search for knowledge by means of the recommended term. The recommended entries are provided by a search engine based on search history, and problems of low accuracy rate of the provided recommended entries and poor user experience often exist.

Disclosure of Invention

The present disclosure is directed to a method and an apparatus for generating a search text to be recommended, which improve accuracy of a recommended search text and are suitable for a low-frequency scene, an electronic device, and a storage medium.

According to an aspect of the present disclosure, there is provided a method for generating a search text to be recommended, including: in response to receiving the input text, generating subsequent text for the input text; splicing the input text and the subsequent text to obtain an alternative search text aiming at the input text; and screening the alternative search texts according to a preset text screening strategy to obtain the search text to be recommended aiming at the input text.

According to another aspect of the present disclosure, there is provided an apparatus for generating a search text to be recommended, including: a text generation module for generating a subsequent text for the input text in response to receiving the input text; the text splicing module is used for splicing the input text and the subsequent text to obtain an alternative search text aiming at the input text; and the text screening module is used for screening the alternative search texts according to a preset text screening strategy to obtain the search texts to be recommended aiming at the input texts.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method for generating the search text to be recommended provided by the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method of generating a search text to be recommended provided by the present disclosure.

According to another aspect of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of generating a search text to be recommended provided by the present disclosure.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

fig. 1 is a schematic application scenario diagram of a method and an apparatus for generating a search text to be recommended according to an embodiment of the present disclosure;

fig. 2 is a flowchart illustrating a method for generating a search text to be recommended according to an embodiment of the present disclosure;

fig. 3 is a schematic diagram of a method of generating a search text to be recommended according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of the principle of generating follow-up text for input text according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of the principle of screening alternative search texts in accordance with an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of screening alternative search texts according to another embodiment of the present disclosure;

fig. 7 is a block diagram of a structure of a device for generating a search text to be recommended according to an embodiment of the present disclosure; and

fig. 8 is a block diagram of an electronic device for implementing a method of generating a search text to be recommended according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

First, an application scenario of the method and apparatus provided by the present disclosure will be described below with reference to fig. 1.

Fig. 1 is a schematic application scenario diagram of a method and an apparatus for generating a search text to be recommended according to an embodiment of the present disclosure.

As shown in fig. 1, the application scenario 100 of this embodiment may include an electronic device 110, and the electronic device 110 may be various electronic devices with processing functions, including but not limited to a smart phone, a tablet computer, a laptop portable computer, a desktop computer, and the like.

In one embodiment, the electronic device 110 may provide, for example, a human-machine interface for operation by the user 120. For example, the electronic device 110 may be installed with a search-class application, an instant messaging-class application, a shopping-class platform, and/or the like, integrated with a search engine (by way of example only). The user 120 may enter text information as a search term or search statement, for example, for an input box provided by a search engine. The electronic device 110 may provide the search results to the user 120 based on the textual information entered in the input box. Wherein the user 120 may enter textual information, for example, via a virtual keyboard or a peripheral input device of the electronic device 110. For example, the electronic device 110 may be provided with a touch screen, and in response to an operation of the touch screen by the user, the electronic device 110 may obtain text information input for the input box.

In an embodiment, the electronic device 110 may present the recommended search text to the user 120 according to the text input by the input box, for example, and provide the user 120 with a search result matching any selected text in response to a selection operation of any text in the recommended search text by the user 120.

In an embodiment, the application scenario 100 may further include a server 130, and the server 130 may maintain a search text library 140 according to historical search information. The electronic device 110 may be communicatively coupled to the server 130 via a network, and the server 130 may access the search corpus 140. The network may include wired or wireless communication links, among others.

Illustratively, the electronic device 110 may transmit the text information entered in the input box to the server 130, query the search text repository 140 by the server 130 according to the text information, thereby querying for search text 150 matching the entered text information, and transmit the search text 150 to the electronic device 110 for presentation by the electronic device 110.

For example, the server 130 may recall the search text 150 from the search text repository 140 in the form of a matching recall or an ambiguous recall. For example, the matching recall may process the input text information and then use the processed text information as a basis for querying the search text library 140, and use the text obtained by querying as a search text. The fuzzy recall may, for example, perform a vectorized representation of the input text information, recall text that is closer to the vectorized representation of the text information as the search text based on the distance between the vectorized representation of the text information and the vectorized representation of the text in the search text repository 140.

In one embodiment, the server 130 may also filter and sort the text, for example, after recalling the text from the search text repository 140. The filtering is to filter out text that is not suitable for presentation in the recalled text, such as text including an abnormal word. The embodiment can sort the recalled texts according to the distance between the recalled texts and the input text information from small to large, or can sort the recalled texts in any mode. Finally, the filtered and sorted text is sent to the electronic device 110 as a recommended search text.

In an embodiment, the server 130 may also predict the subsequent text of the text information by performing semantic understanding on the input text information, for example, and concatenate the predicted subsequent text with the text information to obtain the search text 150.

It will be appreciated that the electronic device 110 may, for example, process the entered textual information locally after obtaining the entered textual information to obtain the search text 150 locally, in a similar manner as the server 130 obtains the search text.

It should be noted that the method for generating the search text to be recommended provided by the present disclosure may be executed by the server 130 or the electronic device 110. Accordingly, the generation apparatus of the search text to be recommended provided by the present disclosure may be provided in the server 130 or in the electronic device 110.

It should be understood that the number and types of electronic devices 110, servers 130, and search text repositories 140 in FIG. 1 are merely illustrative. There may be any number and type of electronic devices 110, servers 130, and search text repositories 140, as the implementation requires.

A method for generating a search text to be recommended provided by the present disclosure will be described in detail below with reference to fig. 2 to 6.

Fig. 2 is a flowchart illustrating a method for generating a search text to be recommended according to an embodiment of the present disclosure.

As shown in fig. 2, the method 200 for generating a search text to be recommended according to this embodiment may include operations S210 to S230.

In operation S210, in response to receiving the input text, a subsequent text for the input text is generated.

In operation S220, the input text and the subsequent text are spliced to obtain an alternative search text for the input text.

In operation S230, the alternative search texts are screened according to a predetermined text screening policy, so as to obtain a search text to be recommended for the input text.

According to an embodiment of the present disclosure, the input text may be, for example, text entered in an input box presented by a search engine in the electronic device. The input text may be obtained, for example, in response to a touch operation of the electronic device by a user, or in response to a data signal provided by a peripheral input device of the electronic device.

According to an embodiment of the present disclosure, operation S210 may perform word segmentation on the input text to obtain a word sequence. Then, the embedded expression of the word sequence is input into the recurrent neural network model RNN, the probability that the subsequent words of the input text are each word in the predetermined dictionary is output by the RNN, and the word with the highest probability is used as the subsequent text of the input text. In an embodiment, for example, the words and word sequences with the highest probability may be iteratively input into the RNN to sequentially obtain a plurality of words with the highest probability, and the plurality of words with the highest probability are sequentially spliced according to the generation order, so that a subsequent text for the input text may be obtained. The iteration number may be set according to actual requirements, which is not limited by the present disclosure.

According to an embodiment of the present disclosure, operation S210 may, for example, perform word segmentation on the input text to obtain a word sequence. And then, according to the co-occurrence probability of the last word in the word sequence and each word in the preset dictionary, screening out a word with higher co-occurrence probability with the last word from the preset dictionary, and taking the screened-out word as the first word of the subsequent text. Then, according to the co-occurrence probability of the first word and each word in the predetermined dictionary, the second word of the subsequent text is screened out from the predetermined dictionary. By analogy, each word in the subsequent text is obtained in an iterative manner, and the words are sequentially spliced according to the screening sequence, so that the subsequent text can be obtained. The iteration number may be set according to actual requirements, which is not limited by the present disclosure.

In an embodiment, operation S210 may employ a mask-based Sequence-to-Sequence pre-training Model (MASS) to process the input text, so as to obtain a subsequent text. Specifically, the subsequent text after the input text may be replaced with a mask token, the mask token and the word token formed by the words in the input text are input into the MASS model in a sequence form, and the MASS model is used for predicting information of the mask token. Based on the prediction information, the subsequent text can be obtained.

After obtaining the subsequent text, the embodiment may concatenate the subsequent text after the last word of the input text, thereby obtaining the alternative search text. It will be appreciated that similar methods as described above may be employed to generate one or more subsequent texts for the input text, resulting in one or more alternative search texts.

According to the embodiment of the disclosure, after the alternative search text is obtained, whether the alternative search text includes the abnormal word may be determined according to a pre-constructed abnormal word table, for example. And taking the text which does not comprise the abnormal word in the alternative search text as the search text to be recommended. Or when a plurality of candidate search texts are available, the multiple candidate search texts may be subjected to deduplication operation, and a text obtained after the deduplication operation is used as a search text to be recommended. The abnormal words may include various sensitive words, for example.

According to the embodiment of the disclosure, in the offline mode, words needing to be shielded can be mined from historical search texts or any texts, and a shielding word dictionary is constructed according to the words needing to be shielded. And then, screening the alternative search texts according to the screening word dictionary, and taking the text without the screening words as the search text to be recommended. The mask word may be obtained by mining according to a set prior rule, for example, and the type of the mask word may be set according to an actual requirement, which is not limited by the present disclosure.

According to the method for generating the search text to be recommended, the subsequent text is generated according to the input text, and the search text to be recommended is generated according to the subsequent text, so that the search text can be recommended on the premise of not depending on a search text library. Therefore, the problem of long tail existing in the process of recommending the search text can be solved. Moreover, since the subsequent texts are generated according to the input texts, the technical scheme of the embodiment of the disclosure can be realized by relying on a deep learning technology, and therefore the accuracy of the recommended search texts is improved.

The long tail in the long tail problem refers to the situation of low frequency and sparse statistics. Search text in a search text corpus is usually maintained in the order of billions or more, while search text in billions or more in a search text corpus, which includes low-frequency words, is much smaller in order, for example, a search text including a certain low-frequency word may be only a single digit. For example, if the search text in the search text library is a time window of year, then the "management-type researcher needs to take english" for the input text of the low-frequency search or the first search, and if the technical solution relying on the search text library is adopted, there may be a case where the search text including the input text cannot be searched from the search text library because no user has provided the search text including the input text in the last year, and thus, the recommended search text cannot be provided to the user. Even if the search text library is queried by adopting a fuzzy recall strategy, only contents with semantics similar to input texts, such as 'how to study English by a researcher' and the like, can be queried, and the query cost is high because the search texts maintained in the search text library are in the billion level. If the method for generating the search text to be recommended, which is provided by the embodiment of the present disclosure, is used, the subsequent texts "four-stage Do", "six-stage Do", "four-six-stage Do" and the like "need to test English for the management type students" can be generated for the input text "needs to test English for the management type students", and the subsequent texts and the input text "needs to test English for the management type students" can be spliced, so that alternative search texts "needs to test English for the management type students", "needs to test English for the management type students" and the like can be obtained. Compared with the technical scheme depending on the search text base, the method provided by the embodiment of the disclosure can provide more accurate search text, and the situation that the recommended search text cannot be obtained due to the long tail problem does not exist.

Fig. 3 is a schematic diagram illustrating a principle of a method of generating a search text to be recommended according to an embodiment of the present disclosure.

As shown in fig. 3, in an embodiment 300, subsequent text for the input text may be generated by means of a text generation model 320. The text generation model may be a generation model constructed based on a recurrent neural network, a generation model constructed based on a Transformer, or the like. Therefore, when the subsequent text is generated, the context information of the input text can be fully considered, and the accuracy of the generated subsequent text is improved.

For example, in the embodiment 300, after the input text 310 is received, for example, the input text may be cut into words, and the words obtained by the segmentation are ordered from first to last according to the positions of the words in the input text, so as to obtain a word sequence. The input text may be segmented by any one of a word segmentation method based on string matching, a word segmentation method based on understanding, and a word segmentation method based on statistics.

After the sequence of words is obtained, the embodiment may process the sequence of words using the text generation model 320 to generate subsequent text for the input text from the text generation model. For example, after the Word sequence is obtained, each Word in the Word sequence may be encoded by using, for example, a Word2Vector method, so as to obtain a Word Vector of each Word. The word vectors of the plurality of words are arranged according to the arrangement sequence of the words in the word sequence to form a word vector sequence. This embodiment may input a sequence of word vectors into the text generation model 320, with subsequent text of the input text being output by the text generation model 320.

For example, the text generation model 320 may output probabilities that respective words in a predetermined dictionary are subsequent texts of the input text, and the embodiment may take the word with the highest probability as the subsequent text. Alternatively, the embodiment may obtain M subsequent texts according to a predetermined number of words with a higher probability (for example, M, where M is an integer greater than 1). The embodiment may stitch the M subsequent texts to the input text 310 to obtain M candidate search texts 330, for example, Item _1 to Item _ M may be used to represent the M candidate search texts 330.

After obtaining the M candidate search texts 330, the embodiment 300 may filter the M candidate search texts 330 by using a predetermined filtering policy 340, so as to obtain a recommended search text 350.

In one embodiment, the text Generation model may be, for example, An Enhanced Multi-Flow Pre-training and Fine-tuning frame for Natural Language Generation (ERNIE-GEN) model based on a text-center model, which is a Pre-training model for generating complete semantic fragments based on a Multi-stream mechanism. Based on this, when the input text 310 is cut, the cut may be performed in units of segments (span). Wherein, the fragment is composed of 1 to 3 words, and each fragment is a complete semantic phrase. Accordingly, the obtained word sequence may be embodied in the form of a sequence of segments, and the text generation model is used to generate subsequent segments of the input text, the subsequent text being composed of the subsequent segments. By adopting the ERNIE-GEN to generate the subsequent text, the accuracy of the generated subsequent text can be improved, and the problem of exposure deviation in text generation is avoided.

Specifically, the ERNIE-GEN model proposes a training target for generating tasks piece by piece in the pre-training. The model carries out overall prediction on each segment, and the generation of each word in the segment does not form a dependency relationship. In this way, the trained ERNIE-GEN model can be generated with phrase granularity. The ERNIE-GEN model adopts unified 'ATTN' and the position code of the nth segment to be generated as the input of the model when generating the segment to be generated during training and decoding. Wherein, the self-attention (self-attention) calculation is carried out on the 'ATTN' and the previous (n-1) segments according to the semantic information and the position information, so that the above important information can be adaptively filled in (filling) to the representation of the current segment. This approach may reduce the dependence of the generation of the nth segment on the (n-1) th segment, thereby attenuating exposure bias.

FIG. 4 is a schematic diagram of the principle of generating follow-up text for input text according to an embodiment of the present disclosure.

According to the embodiment of the disclosure, when the subsequent text is generated by adopting the text generation model, a plurality of words iteratively generated by the text generation model can be spliced to be used as the subsequent text, so that the integrity of the obtained alternative search text can be improved, and the accuracy of the recommended search text can be improved.

As shown in fig. 4, in this embodiment 400, when generating the subsequent text, a similar method to the word segmentation method described above may be used to segment the input text 410, so as to obtain a word sequence 420. Subsequently, the word sequence may be used as an initial sequence of a target sequence, and the target sequence is processed by using the text generation model 430 to obtain a probability vector. Specifically, the word vector sequence obtained by encoding the target sequence may be input into the text generation model 430, and the text generation model 430 outputs the probability vector. The probability number in the probability vector may be the same as the number of a plurality of predetermined words in a predetermined dictionary, and each probability in the probability vector corresponds to a predetermined word and represents the probability that each word in the predetermined word is a subsequent word of the target sequence. For example, the predetermined words may include a word a 441, a word b 442, a word c 443, a word d 444, and a word e 445, and the probability vector includes five probability values representing the probabilities that the word a, the word b, the word c, the word d, and the word e are subsequent words of the input text 410, respectively.

After obtaining the probability, the embodiment may update the target sequence according to the plurality of predetermined words, so as to obtain an alternative sequence. For example, each predetermined word may be added to each target sequence, the number of the predetermined words is set to be N, the number of the target sequences is set to be P, and the number of the obtained candidate sequences is N × P. For example, for the first iteration, P is 1, and if N is set to 5, the resulting candidate sequence may include the following 5 word sequences: word sequences obtained by adding word a to word sequence 420, word sequences obtained by adding word b to word sequence 420, word sequences obtained by adding word c to word sequence 420, word sequences obtained by adding word d to word sequence 420, and word sequences obtained by adding word e to word sequence 420.

The embodiment may then determine a probability for each alternative sequence based on the probability that each word of the plurality of predetermined words is a subsequent word. In which, the probability of the initial sequence is set to be 1, and the product of the probability of the target sequence and the probability that a predetermined word is a subsequent word may be used as the probability of the candidate sequence obtained after the predetermined word is added to the target sequence. After obtaining the probability of each candidate sequence, the embodiment may use a predetermined number of sequences with higher probabilities in the multiple candidate sequences as the target sequence in the next iteration. That is, the target sequence is updated to a predetermined number of sequences having a large probability.

For example, in the embodiment 400, the predetermined number is set to 2. For the first iteration, if the probability vector output by the text generation model 430 includes a probability that the word a 441 is a subsequent word and a probability that the word d 444 is a subsequent word that are both greater than the probabilities that the word b, the word c, and the word e are subsequent words, it may be determined that the target sequence in the second iteration includes a first word sequence obtained by adding the word a 441 to the word sequence 420 and a second word sequence obtained by adding the word d 444 to the word sequence 420. For a second iteration, the first sequence of words and the second sequence of words may be processed separately using the text generation model 430 to obtain a first probability vector for the first sequence of words and a second probability vector for the second sequence of words. A plurality of predetermined words (words a to e) are added to the first word sequence and the second word sequence, respectively, and 10 candidate sequences can be obtained. The probabilities of the 5 alternative sequences obtained by adding the words a to e to the first word sequence are respectively as follows: the probability that the word a is a subsequent word in the first iteration is multiplied by the probabilities that the word a, the word b, the word c, the word d and the word e are subsequent words in the first probability vector. The probabilities of the 5 alternative sequences obtained by adding the words a to e to the second word sequence are respectively as follows: in the first iteration, the probability that the word d is a subsequent word is multiplied by the probabilities that the word a, the word b, the word c, the word d and the word e are subsequent words in the second probability vector.

For example, if the probability of the candidate sequence composed of the first word sequence and the word c and the probability of the candidate sequence composed of the second word sequence and the word b are greater than the probabilities of the other 8 candidate sequences in the obtained 10 candidate sequences, it may be determined that the target sequence in the third iteration includes: a third word sequence consisting of the first word sequence and the word c, and a fourth word sequence consisting of the second word sequence and the word b. By analogy, after the iteration stop condition is reached, the other words except the word sequence 420 in the finally updated target sequence can be spliced according to the generation sequence, so that a predetermined number of subsequent texts are obtained.

Illustratively, the iteration stop condition may include, for example: the iteration times reach the preset times, namely the number of the generated words reaches the preset word number. For example, the predetermined number of times may be 3, and as shown in fig. 4, in the embodiment 400, the target sequence obtained through three iterations may include: the word sequence comprises a word sequence formed by a word a, a word c and a word b, and the word sequence comprises a word sequence formed by a word d, a word b and a word c. Accordingly, the subsequent text may include text acb 451 formed by concatenating word a, word c, and word b, and text dbc 452 formed by concatenating word d, word b, and word c. It is understood that the predetermined number of times can be set according to actual requirements. For example, the accuracy of the words generated in each round when the words are iteratively generated in the test process of the text generation model can be determined, so that the accuracy of the obtained subsequent text and the recommended search text can be improved, and the situation that the accuracy is reduced due to the fact that the generated subsequent text is too long can be avoided.

Illustratively, the iteration stop condition may include, for example: and the word ranked at the last position of each sequence in the updated target sequence is a stop word. For example, stop words may be included in the aforementioned predetermined dictionary, and may be represented by a separator [ SEP ], for example. It will be appreciated that the predetermined number of subsequent texts that are ultimately determined may each include a different number of words. For example, if, in the second iteration, the predetermined number of candidate sequences with a higher probability includes a specific sequence composed of the target sequence before update and the stop word, the embodiment may use, as the target sequence in the third iteration, a sequence other than the specific sequence from among the predetermined number of candidate sequences with a higher probability, and pick out the predetermined number of sequences with a higher probability from among the candidate sequences obtained in the third iteration and the specific sequence. Alternatively, the specific sequence obtained in the second iteration may also be used as the target sequence in the third iteration, but in the probability vector obtained by processing the specific sequence by using the text generation model, the probability for the stop word is 1, and the probability for the other words is 0. By setting the iteration stop condition that the word in the last position is the stop word, the integrity of the meaning expressed by the generated alternative search text can be ensured, the accuracy of the recommended search text is improved, and the recommendation effect and the user experience are improved.

Illustratively, the iteration stop condition may include, for example, two conditions that the number of iterations reaches a predetermined number and a word in the last position of each sequence in the updated target sequence is a stop word, and during the iteration, if one of the two conditions is satisfied, the iteration may be stopped.

It will be appreciated that embodiment 400 is essentially in the form of a Beam Search (Beam Search) to generate subsequent text.

Fig. 5 is a schematic diagram of the principle of screening alternative search texts according to an embodiment of the present disclosure.

According to the embodiment of the disclosure, the text with higher smoothness can be screened from the alternative search texts to be used as the search text to be recommended. For example, the popularity of each text in the alternative search texts may be determined first, and then, the text with the popularity being greater than or equal to the popularity threshold value in the alternative search texts is used as the search text to be recommended. Therefore, the recommended search texts have higher smoothness, the user experience is improved, and the precision of the search results provided according to any recommended search text is improved. The smoothness threshold value can be set according to actual requirements, and the method is not limited by the disclosure.

Illustratively, a semantic compliance computation method based on dependency parsing or a semantic compliance computation method based on a neural network model may be employed to determine the compliance of the text. The semantic compliance calculation method based on the neural network model may determine the compliance of the text by using a single tower model of a self-attention mechanism or a multilayer neural network model composed of full connection layers, which is not limited in this disclosure.

In an embodiment, a text classification model constructed based on a textual heart model may be employed to determine the compliance of the text. The text classification model may be, for example, a binary classification model, which is used to predict whether the text class is a smooth class or a non-smooth class. For example, the embodiment may process the text by using a text classification model constructed based on a text heart model, obtain a probability that the text belongs to a compliance category, and use the probability as the compliance of the text. The method has the advantages that the method determines the compliance of the text by adopting the text classification model constructed based on the heart model, can better understand the text semantics, and can improve the accuracy of the determined compliance.

According to the embodiment of the disclosure, before the text with higher smoothness is screened from the alternative search text, the text including the abnormal word can be filtered from the alternative search text, and the text obtained after filtering is used as the text to be selected. And then screening out the texts with the smoothness more than or equal to the smoothness threshold value from the texts to be selected. Therefore, the recommended search text can better accord with relevant regulations, and the user experience and the safety of information search are improved.

For example, as shown in fig. 5, in embodiment 500, after obtaining the alternative search text 510, the alternative search text 510 may be filtered according to the abnormal word table 520 to filter out text including abnormal words. The abnormal vocabulary 520 may include politically sensitive words, pornographic words, words expressing negative meanings, and the like, and the abnormal vocabulary 520 may be obtained by mining in advance according to actual needs, which is not limited in the present disclosure. For example, by filtering out the text including the abnormal word in the candidate search text 510, 5 candidate texts in total can be obtained, namely, the candidate text a 511, the candidate text B512,. and the candidate text E515. It is to be understood that the 5 candidate texts are only used as examples to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.

Subsequently, the embodiment may employ the popularity prediction model 530 to determine the popularity of each of the 5 candidate texts. The compliance prediction model 530 may be any one of the above-described single tower model of the self-attention mechanism, a multi-layer neural network model composed of fully connected layers to determine compliance of the text, and a text classification model constructed based on the text-heart model. For example, the degrees of currency of the text to be selected a 511, the text to be selected B512,. and the text to be selected E515 may be a degree of currency a _ s 541, a degree of currency B _ s 542,. and a degree of currency E _ s 545, respectively.

After the popularity of each text in the 5 texts to be selected is obtained, in this embodiment, a text with a popularity greater than or equal to the threshold of the popularity in the 5 texts to be selected may be used as the search text 550 to be recommended, for example, the search text 550 to be recommended may include a text a to be selected, a text C to be selected, and a text E to be selected.

Fig. 6 is a schematic diagram of a principle of screening alternative search texts according to another embodiment of the present disclosure.

In an embodiment, a text with the smoothness being greater than or equal to the smoothness threshold value can be used as a target text, and the target text is subjected to subsequent processing, so that the use value of the determined search text to be recommended is improved, and the user experience is improved.

For example, as shown in fig. 6, in embodiment 600, a text whose degree of compliance is equal to or greater than a threshold value of degree of compliance may be taken as target text 630. For example, the target text 630 may include a candidate text a, a candidate text C, and a candidate text E. The embodiment can determine the repetition rate of the target texts, eliminate any text of the two texts with higher repetition rate from the target texts, and take the rest texts of the target texts as the search texts to be recommended.

Illustratively, the repetition rate AC 641 may be obtained by calculating a repetition rate between the text a to be selected and the text C to be selected. The repetition rate AE 642 can be obtained by calculating the repetition rate between the text A to be selected and the text E to be selected. The repetition rate CE 643 can be obtained by calculating the repetition rate between the text C to be selected and the text E to be selected. For example, word segmentation may be performed on two texts, resulting in two word sequences corresponding to the two texts. This embodiment may use a cross-over ratio of two word sequences to represent the repetition rate between two texts. Wherein, the intersection ratio refers to the ratio of the number of words in the intersection to the number of words in the union. Alternatively, the Word2Vector method may be used to encode the two texts, resulting in two text-embedded representations. The similarity between the two text-embedded representations is then used to represent the repetition rate between the two texts. It is to be understood that the above method of determining the repetition rate between two texts is only used as an example to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.

In this embodiment 600, the obtained repetition rates of the multiple target texts may all be compared with a repetition rate threshold, and any one of the two target texts with repetition rates larger than the repetition rate threshold may be rejected. For example, it may be determined whether the repetition rate AC 641, the repetition rate AE 642, and the repetition rate CE 643 are greater than the repetition rate threshold value, respectively, by performing operation S610. If the repetition rate CE 643 is greater than the repetition rate threshold, any text of the text C to be selected and the text E to be selected may be removed from the target texts 630, so as to obtain the search text 650 to be recommended. It is understood that in the embodiment 600, the search text to be recommended may include the text a to be selected and the text C to be selected, and may also include the text a to be selected and the text E to be selected. For example, which text of the text C to be selected and the text E to be selected is removed may be determined according to the repetition rate between the text C to be selected and the text E to be selected and the text a to be selected. For example, if the repetition rate AC 641 is smaller than the repetition rate AE 642, it may be determined to reject the text E to be selected. Therefore, the finally determined multiple search texts to be recommended have larger semantic difference expressed between each other, and the hit rate of the search texts is favorably improved.

In one embodiment, in addition to screening the search texts to be recommended according to the repetition rate, the embodiment can also screen the target texts according to the accurate screening word dictionary mined offline, so as to further ensure the quality of the recommended search texts, and be beneficial to improving the user experience.

According to the method for generating the search text to be recommended, the subsequent text is generated according to the input text, the input text can be semantically understood and supplemented, the requirement on the long tail of the search is very friendly, the subsequent input process of the user can be replaced to a certain extent by recommending the search text, the input cost of the user is favorably reduced, and the search efficiency and the user experience are improved.

Based on the method for generating the search text to be recommended provided by the disclosure, the disclosure also provides a device for generating the search text to be recommended. The apparatus will be described in detail below with reference to fig. 7.

Fig. 7 is a block diagram of a structure of a device for generating a search text to be recommended according to an embodiment of the present disclosure.

As shown in fig. 7, the generation apparatus 700 of the search text to be recommended according to this embodiment may include a text generation module 710, a text concatenation module 720, and a text filtering module 730.

The text generation module 710 is operable to generate subsequent text for the input text in response to receiving the input text. In an embodiment, the text generating module 710 may be configured to perform the operation S210 described above, which is not described herein again.

The text stitching module 720 may be configured to stitch the input text with subsequent text to obtain alternative search text for the input text. In an embodiment, the text splicing module 720 may be configured to perform the operation S220 described above, which is not described herein again.

The text screening module 730 is configured to screen the candidate search text according to a predetermined text screening policy to obtain a search text to be recommended for the input text. In an embodiment, the text filtering module 730 may be configured to perform the operation S230 described above, which is not described herein again.

According to an embodiment of the present disclosure, the text generating module 710 may include a word cutting sub-module and a generating sub-module. The word cutting sub-module is used for cutting words of the input text to obtain a word sequence aiming at the input text. The generation submodule is used for processing the word sequence by adopting the text generation model and generating a subsequent text aiming at the input text.

According to an embodiment of the present disclosure, the generation submodule may be configured to use the word sequence as an initial sequence of the target sequence, and iteratively update the target sequence until an iteration stop condition is satisfied. For example, the generation submodule may include a sequence processing unit, a first updating unit, a probability determination unit, and a second updating unit. The sequence processing unit is used for processing the target sequence by adopting a text generation model to obtain the probability that each word in the plurality of preset words is a subsequent word of the target sequence. The first updating unit is used for updating the target sequence according to the plurality of preset words to obtain a plurality of alternative sequences. The probability determination unit is used for determining the probability of each alternative sequence in the plurality of alternative sequences according to the probability that each word is a subsequent word. The second updating unit is used for updating the target sequence into a predetermined number of sequences with higher probability in the plurality of alternative sequences. And under the condition that the iteration stop condition is met, the subsequent text is composed of other words except the word sequence in the target sequence.

According to an embodiment of the present disclosure, the iteration stop condition includes at least one of: the iteration times reach the preset times; and the word ranked at the last position of each sequence in the updated target sequence is a stop word.

According to an embodiment of the present disclosure, the text generation model includes a natural language generation model based on a text-heart model.

According to an embodiment of the present disclosure, the text screening module 730 may be specifically configured to screen out, according to the popularity of the text, a text with the popularity being greater than or equal to the popularity threshold in the alternative search text, and obtain a search text to be recommended for the input text.

According to an embodiment of the present disclosure, the text filtering module 730 may include a filtering sub-module, a compliance determining sub-module, and a first obtaining sub-module. And the filtering submodule is used for filtering out the texts, including the abnormal words, in the alternative search texts according to the preset abnormal word list to obtain a plurality of texts to be selected. And the popularity determination submodule is used for determining the popularity of each text in the plurality of texts to be selected. The first obtaining sub-module is used for obtaining a search text to be recommended aiming at the input text according to the texts with the fluency greater than or equal to the fluency threshold value in the texts to be selected.

According to an embodiment of the present disclosure, the text filtering module 730 may include a filtering sub-module and a second obtaining sub-module. And the screening submodule is used for screening out the texts with the smoothness more than or equal to the smoothness threshold value in the alternative search texts according to the smoothness of the texts to obtain a plurality of target texts. And the second obtaining submodule is used for screening out a search text to be recommended for the input text from the target texts according to the repetition rate of the target texts.

According to an embodiment of the disclosure, the compliance determination submodule may be configured to process each text by using a text classification model constructed based on a text-to-heart model, and obtain a probability of each text for a compliance category, so as to serve as a compliance of each text.

In the technical scheme of the present disclosure, the processes of collecting, storing, using, processing, transmitting, providing, disclosing and applying the personal information of the related users all conform to the regulations of related laws and regulations, and necessary security measures are taken without violating the good customs of the public order. In the technical scheme of the disclosure, before the personal information of the user is acquired or collected, the authorization or the consent of the user is acquired.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

Fig. 8 shows a schematic block diagram of an example electronic device 800 that may be used to implement the method of generating search text to be recommended of an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 801 executes the respective methods and processes described above, such as the generation method of the search text to be recommended. For example, in some embodiments, the method of generating the search text to be recommended may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communications unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the above-described generation method of a search text to be recommended may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method of generating the search text to be recommended in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service extensibility in a conventional physical host and VPS service ("Virtual Private Server", or "VPS" for short). The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method for generating a search text to be recommended comprises the following steps:

in response to receiving input text, generating subsequent text for the input text;

splicing the input text and the subsequent text to obtain an alternative search text aiming at the input text; and

and screening the alternative search texts according to a preset text screening strategy to obtain search texts to be recommended aiming at the input texts.

2. The method of claim 1, wherein the generating subsequent text for the input text in response to receiving the input text comprises:

performing word segmentation on the input text to obtain a word sequence aiming at the input text; and

and processing the word sequence by adopting a text generation model to generate a subsequent text aiming at the input text.

3. The method of claim 2, wherein the processing the sequence of words with a text generation model to generate subsequent text for the input text comprises: taking the word sequence as an initial sequence of a target sequence, and iteratively executing the following operations until an iteration stop condition is met:

processing the target sequence by adopting the text generation model to obtain the probability that each word in a plurality of preset words is a subsequent word of the target sequence;

updating the target sequence according to the plurality of preset words to obtain a plurality of alternative sequences;

determining a probability for each of the plurality of alternative sequences according to the probability that each word is the subsequent word; and

updating the target sequence to a predetermined number of sequences with a higher probability among the plurality of candidate sequences,

wherein, in case that an iteration stop condition is satisfied, the subsequent text is composed of words in the target sequence other than the word sequence.

4. The method of claim 3, wherein the iteration stop condition comprises at least one of:

the iteration times reach the preset times;

and the word ranked at the last position of each sequence in the updated target sequence is a stop word.

5. The method of any of claims 2-4, wherein the text generation model comprises a natural language generation model based on a text-heart model.

6. The method of claim 1, wherein the filtering the alternative search texts according to a predetermined text filtering policy to obtain the search text to be recommended for the input text comprises:

screening out texts with the fluency greater than or equal to a fluency threshold value in the alternative search texts according to the fluency of the texts, and obtaining the search texts to be recommended aiming at the input texts.

7. The method of claim 6, wherein screening out the texts with the fluency greater than or equal to a fluency threshold from the alternative search texts according to the fluency of the texts, and obtaining the search text to be recommended for the input text comprises:

filtering out texts including abnormal words in the alternative search texts according to a preset abnormal word list to obtain a plurality of texts to be selected;

determining the smoothness of each text in the plurality of texts to be selected; and

and obtaining a search text to be recommended aiming at the input text according to the text with the smoothness more than or equal to the smoothness threshold value in the texts to be selected.

8. The method according to claim 6 or 7, wherein the screening out the texts with the popularity greater than or equal to the popularity threshold value from the alternative search texts according to the popularity of the texts, and obtaining the search text to be recommended for the input text comprises:

screening out texts with the fluency greater than or equal to a fluency threshold value in the alternative search texts according to the fluency of the texts to obtain a plurality of target texts; and

and screening out search texts to be recommended for the input texts from the target texts according to the repetition rate of the target texts.

9. The method of claim 7, wherein the determining the popularity of each of the plurality of candidate texts comprises:

and processing each text by adopting a text classification model constructed based on a text-heart model to obtain the probability of each text for the smoothness category as the smoothness of each text.

10. An apparatus for generating a search text to be recommended, comprising:

a text generation module for generating subsequent text for an input text in response to receiving the input text;

the text splicing module is used for splicing the input text and the subsequent text to obtain an alternative search text aiming at the input text; and

and the text screening module is used for screening the alternative search texts according to a preset text screening strategy to obtain search texts to be recommended aiming at the input texts.

11. The apparatus of claim 10, wherein the text generation module comprises:

the word cutting sub-module is used for cutting words of the input text to obtain a word sequence aiming at the input text; and

and the generation submodule is used for processing the word sequence by adopting a text generation model and generating a subsequent text aiming at the input text.

12. The apparatus of claim 11, wherein the generation submodule is configured to take the sequence of words as an initial sequence of a target sequence, and iteratively update the target sequence until an iteration stop condition is satisfied; the generation submodule includes:

the sequence processing unit is used for processing the target sequence by adopting the text generation model to obtain the probability that each word in a plurality of preset words is a subsequent word of the target sequence;

the first updating unit is used for updating the target sequence according to the plurality of preset words to obtain a plurality of alternative sequences;

a probability determining unit, configured to determine, according to a probability that each word is the subsequent word, a probability for each candidate sequence in the multiple candidate sequences; and

a second updating unit, configured to update the target sequence to a predetermined number of sequences with a higher probability in the plurality of candidate sequences,

and under the condition that an iteration stop condition is met, the subsequent text is composed of other words in the target sequence except the word sequence.

13. The apparatus of claim 12, wherein the iteration stop condition comprises at least one of:

the iteration times reach the preset times;

and the word ranked at the last position of each sequence in the updated target sequence is the stop word.

14. The apparatus of any of claims 11-13, wherein the text generation model comprises a natural language generation model based on a text-heart model.

15. The apparatus of claim 10, wherein the text filtering module is to:

screening out texts with the smoothness degree larger than or equal to a smoothness degree threshold value in the alternative search texts according to the smoothness degree of the texts, and obtaining the search texts to be recommended aiming at the input texts.

16. The apparatus of claim 15, wherein the text filtering module comprises:

the filtering submodule is used for filtering out the texts, including the abnormal words, in the alternative search texts according to a preset abnormal word list to obtain a plurality of texts to be selected;

the popularity determining submodule is used for determining the popularity of each text in the plurality of texts to be selected; and

and the first obtaining submodule is used for obtaining a search text to be recommended aiming at the input text according to the text of which the popularity is greater than or equal to the popularity threshold value in the texts to be selected.

17. The method of claim 15 or 16, wherein the text filtering module comprises:

the screening submodule is used for screening out the texts with the smoothness more than or equal to the smoothness threshold value in the alternative search texts according to the smoothness of the texts to obtain a plurality of target texts; and

and the second obtaining submodule is used for screening out a search text to be recommended aiming at the input text from the target texts according to the repetition rate of the target texts.

18. The apparatus of claim 16, wherein the compliance determination submodule is to:

19. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-9.

21. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 9.