CN117093600A - Search prompt word generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117093600A
Authority
CN
China
Prior art keywords
word
candidate
vector
prompt
initial search
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311116993.4A
Other languages
Chinese (zh)
Inventor
马多昌
唐宇
朱朴
邵明星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202311116993.4A
Publication of CN117093600A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method and a device for generating search prompt words, an electronic device, and a storage medium. The method comprises: acquiring an initial search word input by a user; vectorizing the initial search word to obtain an initial search word vector corresponding to the initial search word; determining at least one target candidate prompt word vector from a pre-constructed vector database, wherein the vector database comprises a plurality of candidate prompt words and a candidate prompt word vector corresponding to each candidate prompt word; and determining the candidate prompt word corresponding to the target candidate prompt word vector as the target prompt word of the initial search word. This can improve the accuracy of the recommended search prompt words and reduce the probability that no search prompt word can be given.

Description

Search prompt word generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing, and in particular, to a method and apparatus for generating a search prompt word, an electronic device, and a storage medium.
Background
With the development of internet applications, it has become increasingly convenient for users to use the search function in various internet applications. In the prior art, when a user uses a search function, data containing the search word currently input by the user is found from data such as episode titles, hot charts, roles, and actors, and is recommended to the user as search prompt words, so that user input is reduced and user experience is improved.
However, in practical applications, because the user's input is not controllable, it often happens that the recommended search prompt words have low relevance to the search word currently input by the user, or that no search prompt word can be recommended at all.
Disclosure of Invention
The application provides a method, a device, electronic equipment and a storage medium for generating search prompt words, which are used for solving the problems that in the prior art, the correlation degree between the recommended search prompt words and the search words input by a user at this time is low or any search prompt word cannot be recommended.
In a first aspect, the present application provides a method for generating a search prompt word, where the method includes:
acquiring an initial search term input by a user;
vectorizing the initial search word to obtain an initial search word vector corresponding to the initial search word;
determining at least one target candidate prompt word vector from a pre-constructed vector database, wherein the vector database comprises a plurality of candidate prompt words and a candidate prompt word vector corresponding to each candidate prompt word, and the vector similarity between the initial search word vector and the target candidate prompt word vector is greater than a set similarity threshold;
and determining the candidate prompt word corresponding to the target candidate prompt word vector as the target prompt word of the initial search word.
Optionally, the vector database is constructed by:
acquiring a plurality of candidate prompt words from a plurality of preset data sources, wherein each data source comprises a plurality of pieces of text data, and the sources of the text data included in different data sources are different;
for each candidate prompt word, vectorizing the candidate prompt word to obtain a candidate prompt word vector corresponding to the candidate prompt word;
and constructing the vector database based on the candidate prompt words and the candidate prompt word vectors corresponding to the candidate prompt words.
Optionally, the candidate prompt word vector corresponding to the candidate prompt word is obtained by the following method:
extracting keywords in the candidate prompt words by using a preset keyword extraction technology;
and carrying out vectorization processing on the keywords to obtain candidate prompt word vectors corresponding to the candidate prompt words.
Optionally, the candidate prompt word vector corresponding to the candidate prompt word is obtained by the following method:
performing word segmentation processing on the candidate prompt word to obtain at least two candidate prompt word segments;
for each candidate prompt word segment, vectorizing the candidate prompt word segment to obtain a candidate prompt word segment vector corresponding to the candidate prompt word segment;
and determining each candidate prompt word segment vector as a candidate prompt word vector corresponding to the candidate prompt word.
Optionally, the vectorizing the initial search word to obtain an initial search word vector corresponding to the initial search word includes:
performing word segmentation processing on the initial search word to obtain at least two initial search word segments;
for each initial search word segment, vectorizing the initial search word segment to obtain an initial search word segment vector corresponding to the initial search word segment;
determining each initial search word segment vector as an initial search word vector corresponding to the initial search word;
the determining at least one target candidate prompt word vector from the pre-constructed vector database comprises:
for each initial search word segment vector included in the initial search word vector, determining, from the pre-constructed vector database, at least one candidate prompt word vector whose vector similarity to the initial search word segment vector is greater than the set similarity threshold;
and determining each candidate prompt word vector so determined as a target candidate prompt word vector matched with the initial search word vector.
Optionally, the method further comprises:
generating pinyin data of the initial search word under the condition that a target candidate prompt word vector matched with the initial search word vector is not determined from the vector database;
determining at least one target pinyin data from a pre-constructed pinyin database, wherein the pinyin database comprises a plurality of candidate prompt words and pinyin data of each candidate prompt word, and the target pinyin data are the same as the pinyin data of the initial search word;
and determining the candidate prompt word corresponding to the target pinyin data as the target prompt word of the initial search word.
Optionally, before the vectorizing the initial search word to obtain an initial search word vector corresponding to the initial search word, the method further includes:
searching a pre-constructed candidate prompt word stock according to the initial search word;
under the condition that target candidate prompt words containing the initial search words are found out from the candidate prompt word library, determining the target candidate prompt words as target prompt words of the initial search words;
and executing the step of vectorizing the initial search word under the condition that no target candidate prompt word containing the initial search word is found in the candidate prompt word library.
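The lookup-then-fallback order described above can be illustrated with a minimal Python sketch. The function name `suggest`, the sample library, and the `vector_fallback` callable are assumptions made for this illustration and do not appear in the application.

```python
from typing import Callable, List


def suggest(query: str,
            library: List[str],
            vector_fallback: Callable[[str], List[str]]) -> List[str]:
    """Return library entries that contain the query; otherwise defer to the vector step."""
    hits = [word for word in library if query in word]
    if hits:
        return hits
    # No candidate contains the query, so the vectorization path is executed instead.
    return vector_fallback(query)


# Hypothetical candidate prompt word library.
library = ["sunflower", "flower shop", "night market"]

print(suggest("flower", library, lambda q: []))
print(suggest("blossom", library, lambda q: ["<result of vector matching>"]))
```

The first call is answered from the library alone; the second finds no containing entry and hands the query to whatever vector matching routine is supplied.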
In a second aspect, the present application provides a device for generating a search prompt word, where the device includes:
the initial word acquisition module is used for acquiring initial search words input by a user;
the initial word vectorization module is used for vectorizing the initial search word to obtain an initial search word vector corresponding to the initial search word;
the first matching module is used for determining at least one target candidate prompt word vector from a pre-constructed vector database, wherein the vector database comprises a plurality of candidate prompt words and a candidate prompt word vector corresponding to each candidate prompt word, and the vector similarity between the initial search word vector and the target candidate prompt word vector is greater than a set similarity threshold;
and the first determining module is used for determining the candidate prompt word corresponding to the target candidate prompt word vector as the target prompt word of the initial search word.
In a third aspect, the present application provides an electronic device, comprising: the device comprises a processor and a memory, wherein the processor is used for executing a generation program of search prompt words stored in the memory so as to realize the generation method of the search prompt words in the first aspect.
In a fourth aspect, the present application provides a storage medium, where one or more programs are stored, where the one or more programs are executable by one or more processors to implement the method for generating search prompt words according to any one of the first aspects.
According to the technical scheme provided by the embodiments of the application, an initial search word input by a user is acquired and vectorized to obtain the corresponding initial search word vector; at least one target candidate prompt word vector whose vector similarity to the initial search word vector is greater than a set similarity threshold is determined from a pre-constructed vector database; and the candidate prompt word corresponding to the target candidate prompt word vector is determined as the target prompt word of the initial search word. The initial search word input by the user is thus rewritten and matched to target prompt words by means of vectorization, which can improve the accuracy of the recommended search prompt words and reduce the probability that no search prompt word can be given.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
FIG. 1 is a flowchart of an embodiment of a method for generating search prompt words according to an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of another method for generating search prompt words according to an embodiment of the present application;
FIG. 3 is a flowchart of an embodiment of a method for generating search prompt words according to an embodiment of the present application;
FIG. 4 is a flowchart of an embodiment of a method for generating search prompt words according to an embodiment of the present application;
FIG. 5 is a flowchart of an embodiment of a method for generating search prompt words according to an embodiment of the present application;
FIG. 6 is a block diagram of an embodiment of a device for generating search prompt words according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The following disclosure provides many different embodiments, or examples, for implementing different structures of the application. In order to simplify the present disclosure, components and arrangements of specific examples are described below. They are, of course, merely examples and are not intended to limit the application. Furthermore, the present application may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
In order to solve the technical problems that the correlation degree between the recommended search prompt words and the search words input by the user is low or any search prompt word cannot be recommended in the prior art, the application provides a method for generating the search prompt words, which can improve the accuracy of the recommended search prompt words and reduce the occurrence probability of the condition that any search prompt words cannot be given.
Fig. 1 is a flowchart of an embodiment of a method for generating search prompt words according to an embodiment of the present application. As shown in fig. 1, the method comprises the following steps:
Step 101, obtaining an initial search word input by a user.
In order to facilitate understanding of the embodiments of the present application, the following first explains application scenarios related to the embodiments of the present application:
The embodiment of the application is applied to a resource search scenario, such as a video search scenario; correspondingly, the execution subject of the embodiment may be a server capable of providing a resource service. In practical application, when a user searches for resources, the user inputs a search word in a client, the client sends a resource search request to the server based on that search word, and the server, in response to the request, returns resources related to the search word to the client.
Further, while the user is inputting the search word, the server also generates corresponding prompt words according to the input. For example, when the user inputs "flower", the server generates prompt words such as "flower", "flower appearance", and "sun-facing flower" and recommends them to the user, so that the user can select the final search word from the prompt words. This processing reduces the user's manual input and improves user experience; it also gives the user a prompt when the user does not fully remember the keywords (such as names and roles) of the resource of interest, so that the user can still find that resource.
In the application scenario, the server may apply the method for generating the search prompt word provided by the embodiment of the present application to generate the prompt word.
In this step 101, the initial search term input by the user refers to the input content in the search bar of the client, such as "flower" in the above example, when the user performs the resource search.
Step 102, vectorizing the initial search word to obtain an initial search word vector corresponding to the initial search word.
In an embodiment, word embedding may be used to vectorize the initial search word to obtain a vector representation corresponding to the initial search word, which is referred to herein as an initial search word vector for convenience of description. For example, the embedding model text-embedding-ada-002 provided by OpenAI may be invoked to perform the word embedding.
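As a minimal offline sketch of what "vectorizing" produces, the following Python function maps a string to a fixed-length unit vector. It is only a stand-in: `toy_embed` hashes character bigrams and captures no semantics, whereas the embodiment would call an embedding model. The function name, the dimension of 64, and the hashing scheme are all assumptions for this illustration.

```python
import math
import zlib
from typing import List


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in for an embedding model: hash character bigrams into a unit vector."""
    padded = f" {text.lower()} "
    vec = [0.0] * dim
    for i in range(len(padded) - 1):
        gram = padded[i:i + 2]
        # crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


query_vector = toy_embed("flower")
print(len(query_vector))
```

Any function with the same shape (text in, fixed-length numeric vector out) could take its place in the later matching steps.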
Step 103, determining at least one target candidate prompt word vector from a pre-constructed vector database, wherein the vector database comprises a plurality of candidate prompt words and a candidate prompt word vector corresponding to each candidate prompt word, and the vector similarity between the initial search word vector and the target candidate prompt word vector is greater than a set similarity threshold.
Step 104, determining the candidate prompt words corresponding to the target candidate prompt word vector as target prompt words of the initial search words.
In one embodiment, a plurality of candidate prompt words are obtained from a plurality of preset data sources, and a candidate prompt word library is constructed from them. Each data source includes a plurality of pieces of text data, and the sources of the text data included in different data sources are different. For example, in a video search scenario, the text data may be episode titles, episode descriptions, episode recommendations, episode introductions, subtitles, bullet-screen comments, historical search words, and the like. The pieces of text data in one data source may come from a plurality of resources: for example, one data source includes the episode titles of a plurality of episodes, another includes the introductions of a plurality of television shows or movies, and yet another includes the subtitles of a plurality of episodes. Optionally, acquisition rules may be set, and candidate prompt words are acquired from the data sources according to those rules. For example, an acquisition rule may specify that bullet-screen comments generated in a recent period (for example, the last 30 days) are acquired as candidate prompt words.
When the candidate prompt word library is constructed from the obtained candidate prompt words, the candidate prompt words may further undergo data cleaning (such as removing garbled and special characters), word segmentation, deduplication, and the like, to obtain the final candidate prompt word library.
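A minimal Python sketch of an acquisition rule followed by cleaning and deduplication might look as follows. The record layout (text plus creation time), the function names, and the cleaning regular expression are assumptions for this illustration; word segmentation is omitted.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def recent_texts(records: Iterable[Tuple[str, datetime]],
                 now: datetime,
                 days: int = 30) -> List[str]:
    """Acquisition rule: keep only texts created within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [text for text, created in records if created >= cutoff]


def clean_and_dedupe(texts: Iterable[str]) -> List[str]:
    """Remove non-word characters, collapse whitespace, and drop empty or repeated entries."""
    seen, result = set(), []
    for text in texts:
        cleaned = re.sub(r"[^\w\s]", "", text)          # strip punctuation and symbols
        cleaned = re.sub(r"\s+", " ", cleaned).strip()  # normalize whitespace
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


now = datetime(2024, 1, 31)
records = [
    ("Sunflower!!", datetime(2024, 1, 20)),
    ("old comment", datetime(2023, 11, 1)),
    ("  Sunflower ", datetime(2024, 1, 25)),
    ("###", datetime(2024, 1, 30)),
]
print(clean_and_dedupe(recent_texts(records, now)))
```

Because `\w` is Unicode-aware for Python strings, the same pattern keeps Chinese characters while removing punctuation.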
It is further noted that the candidate prompt word library may change dynamically. For example, when a new resource goes online, a plurality of candidate prompt words are acquired for the new resource and added to the constructed candidate prompt word library.
By acquiring candidate prompt words from a plurality of data sources, as many candidate prompt words as possible can be prepared in advance, which improves the probability that the initial search word input by the user is matched to a target prompt word.
The vector database is constructed as follows: for each candidate prompt word, the candidate prompt word is vectorized to obtain a corresponding vector representation, referred to as a candidate prompt word vector for convenience of description; the vector database is then constructed based on the plurality of candidate prompt words and the candidate prompt word vector corresponding to each candidate prompt word. That is, the vector database includes a plurality of candidate prompt words and the candidate prompt word vector corresponding to each candidate prompt word.
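The construction step can be sketched in Python with a plain dictionary standing in for the vector database. This is an illustrative assumption only: a production system would likely use a dedicated vector store, and `placeholder_embed` is a deliberately trivial, hypothetical function used so the example runs without an embedding model.

```python
from typing import Callable, Dict, Iterable, List

Vector = List[float]


def build_vector_db(candidates: Iterable[str],
                    embed: Callable[[str], Vector]) -> Dict[str, Vector]:
    """Map each candidate prompt word to its candidate prompt word vector."""
    return {word: embed(word) for word in candidates}


def add_candidates(db: Dict[str, Vector],
                   new_words: Iterable[str],
                   embed: Callable[[str], Vector]) -> None:
    """Add candidates acquired later (for example, for a newly released resource)."""
    for word in new_words:
        if word not in db:
            db[word] = embed(word)


def placeholder_embed(text: str) -> Vector:
    """Hypothetical placeholder, not a real embedding: (length, number of spaces)."""
    return [float(len(text)), float(text.count(" "))]


vector_db = build_vector_db(["sunflower", "flower shop"], placeholder_embed)
add_candidates(vector_db, ["night market"], placeholder_embed)
print(sorted(vector_db))
```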
In one embodiment, word embedding may be used to vectorize the candidate prompt word as a whole to obtain the candidate prompt word vector corresponding to the candidate prompt word.
In an embodiment, a preset keyword extraction technology may be used to extract keywords from the candidate prompt word; only the keywords are then vectorized, and the vectorization result of the keywords is determined as the candidate prompt word vector corresponding to the candidate prompt word. This processing reduces the interference of non-key content in the candidate prompt word with the vectorization result, so that the vectorization result represents the key semantics of the candidate prompt word, which improves the accuracy of the target prompt words obtained from the candidate prompt word vectors.
In one example, keywords may be extracted from candidate prompt words through a GPT (Generative Pre-Training) large language model. Keywords here may include people, places, times, and events.
In one embodiment, the candidate prompt word vector corresponding to a candidate prompt word is obtained as follows: word segmentation is performed on the candidate prompt word to obtain at least two candidate prompt word segments; each candidate prompt word segment is vectorized to obtain a corresponding candidate prompt word segment vector; and each candidate prompt word segment vector is determined as a candidate prompt word vector corresponding to the candidate prompt word.
In one example, this way of obtaining candidate prompt word vectors may be used for long-tail candidate prompt words. Of course, a long-tail candidate prompt word may also be vectorized as a whole to obtain a single candidate prompt word vector, which is not limited in the embodiments of the application.
It can be seen that a long-tail candidate prompt word may correspond to one candidate prompt word vector or to two or more. When it corresponds to two or more candidate prompt word vectors, the candidate prompt word can be determined as a target prompt word as long as the initial search word vector matches any one of them. This also reduces the probability that no search prompt word can be given, particularly for long-tail candidate prompt words.
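The "match any one vector" rule for candidates with several vectors can be sketched as follows. The sketch assumes that all vectors are already unit-normalized, so a dot product equals cosine similarity; the function names, the sample vectors, and the threshold of 0.9 are hypothetical.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_any_vector(query_vec: Sequence[float],
                     db: Dict[str, List[Sequence[float]]],
                     threshold: float) -> List[str]:
    """A candidate is a match if any one of its vectors is similar enough to the query."""
    return [
        word
        for word, vectors in db.items()
        if any(dot(query_vec, v) > threshold for v in vectors)
    ]


# Hypothetical database: the long-tail candidate carries one vector per segment.
db = {
    "long tail title": [[1.0, 0.0], [0.0, 1.0]],
    "short title": [[-1.0, 0.0]],
}
print(match_any_vector([0.0, 1.0], db, threshold=0.9))
```

Here the query matches only the second segment vector of the long-tail candidate, which is enough for that candidate to be returned.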
In an embodiment, when determining at least one target candidate prompt word vector from the pre-constructed vector database, the vector similarity between each candidate prompt word vector in the vector database and the initial search word vector is calculated, and each candidate prompt word vector whose vector similarity is greater than the set similarity threshold is determined as a target candidate prompt word vector. The vector similarity may be measured by Euclidean distance or cosine distance. The similarity threshold can be set by a person skilled in the art as needed; it can be understood that the larger the similarity threshold, the more closely the finally determined target prompt words match the initial search word.
In addition, the server side can set the display sequence of the target prompt words on the client side according to the vector similarity. For example, a plurality of target prompt words may be presented on the client side in order of the corresponding vector similarity from large to small. Through the processing, the user can preferentially see the prompt words with high matching degree, and user experience is improved.
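The threshold filter and the similarity-ordered display can be sketched together in Python. Cosine similarity is used here as one possible measure; the function names, the two-dimensional sample vectors, and the threshold of 0.5 are assumptions for this illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_matches(query_vec: Sequence[float],
                 db: Dict[str, Sequence[float]],
                 threshold: float) -> List[Tuple[str, float]]:
    """Keep candidates above the threshold, ordered from most to least similar."""
    scored = [(word, cosine_similarity(query_vec, vec)) for word, vec in db.items()]
    kept = [(word, score) for word, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidate prompt word vectors.
db = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
for word, score in rank_matches([1.0, 0.0], db, threshold=0.5):
    print(word, round(score, 3))
```

If a distance measure such as Euclidean distance were used instead, a smaller value would indicate a closer match, so the comparison and the sort order would be reversed.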
According to the technical scheme provided by the embodiments of the application, the initial search word input by the user is vectorized, at least one target candidate prompt word vector whose vector similarity to the initial search word vector is greater than the set similarity threshold is determined from the pre-constructed vector database, and the corresponding candidate prompt word is determined as the target prompt word of the initial search word. Matching by vectorization in this way can improve the accuracy of the recommended search prompt words and reduce the probability that no search prompt word can be given.
Fig. 2 is a flowchart of an embodiment of another method for generating search prompt words according to an embodiment of the present application. As shown in fig. 2, the method comprises the following steps:
Step 201, obtaining an initial search word input by a user.
Step 202, performing word segmentation processing on the initial search word to obtain at least two initial search word segments.
Step 203, for each initial search word segment, vectorizing the initial search word segment to obtain an initial search word segment vector corresponding to the initial search word segment, and determining each initial search word segment vector as an initial search word vector corresponding to the initial search word.
Step 204, for each initial search word segment vector included in the initial search word vector, determining from the vector database at least one candidate prompt word vector whose vector similarity to the initial search word segment vector is greater than the set similarity threshold, and determining each such candidate prompt word vector as a target candidate prompt word vector.
Step 205, determining the candidate prompt word corresponding to the target candidate prompt word vector as the target prompt word of the initial search word.
As can be seen from steps 202 to 205, word segmentation may further be performed on the initial search word; target candidate prompt word vectors are determined from the vector database using the initial search word segment vectors corresponding to the segments, and the candidate prompt words corresponding to the target candidate prompt word vectors are finally determined as the target prompt words of the initial search word. This increases the probability of recommending search prompt words that match the initial search word.
For example, assume that the initial search word is segmented into initial search word segments A1, A2 and A3. Assume further that, for segment A1, the vector similarity between its initial search word segment vector and candidate prompt word vector B1 in the vector database is greater than the set similarity threshold; that, for segment A2, the vector similarity between its initial search word segment vector and candidate prompt word vector B2 in the vector database is greater than the set similarity threshold; and that, for segment A3, no candidate prompt word vector in the vector database has a vector similarity to its initial search word segment vector greater than the set similarity threshold. Then the finally determined target candidate prompt word vectors are B1 and B2, and the target prompt words of the initial search word are C1 (the candidate prompt word corresponding to target candidate prompt word vector B1) and C2 (the candidate prompt word corresponding to target candidate prompt word vector B2).
In addition, it should be noted that, in an embodiment, the initial search word may first be vectorized as a whole to obtain an initial search word vector, and the pre-constructed vector database may then be searched based on that vector. If a target candidate prompt word vector whose vector similarity to the initial search word vector is greater than the set similarity threshold is found in the vector database, the candidate prompt word corresponding to the found target candidate prompt word vector may be determined as the target prompt word of the initial search word; if no such target candidate prompt word vector is found, steps 202 to 205 may be executed.
By the processing, the best matched search prompt words can be preferentially determined, and under the condition that the best matched search prompt words cannot be determined, the better matched search prompt words can be still determined through word segmentation, so that the occurrence probability of the condition that any search prompt words cannot be given is reduced.
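The whole-word-first, segments-second order can be sketched in Python with the embedding, search, and segmentation steps passed in as callables. All names here are hypothetical, and the stubs in the usage example (an identity "embedding" and a dictionary lookup) exist only so that the control flow can be run without a model or a vector store.

```python
from typing import Any, Callable, List


def suggest_with_fallback(query: str,
                          embed: Callable[[str], Any],
                          search: Callable[[Any], List[str]],
                          segment: Callable[[str], List[str]]) -> List[str]:
    """Search with the whole query first; on a miss, search per segment and merge the hits."""
    hits = search(embed(query))
    if hits:
        return hits
    merged: List[str] = []
    for piece in segment(query):
        for word in search(embed(piece)):
            if word not in merged:  # keep first-seen order without duplicates
                merged.append(word)
    return merged


# Stubs for illustration only: the "vector" is the text itself and search is a table lookup.
index = {"flower": ["sunflower"], "market": ["night market"]}
embed = lambda text: text
search = lambda vec: index.get(vec, [])
segment = str.split

print(suggest_with_fallback("flower", embed, search, segment))
print(suggest_with_fallback("flower market", embed, search, segment))
```

In the second call the whole query finds nothing, so each segment is searched separately and the results are merged, mirroring the A1, A2, A3 example above in which a segment without a match simply contributes nothing.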
According to the technical scheme provided by the embodiment of the application, the initial search word is subjected to word segmentation processing to obtain at least two initial search word segments, and then the target prompt word is determined based on the initial search word segments, so that the occurrence probability of the situation that any search prompt word cannot be given can be further reduced.
In practice, whether the user inputs the characters correctly directly influences the matching result, and incorrect input or wrongly written characters may mean that no search prompt word can be matched. For example, a popular drama is titled "floral character" (pinyin: huarong), and the user inputs a homophone of the title that contains a wrongly written character, so no search prompt word related to the drama can be given. To address this, the application proposes the flow shown in fig. 3, which includes the following steps:
Step 301, generating pinyin data of the initial search word under the condition that no target candidate prompt word vector matched with the initial search word vector is determined from the vector database.
In one embodiment, the initial search term may be input into a pinyin generator to obtain pinyin data corresponding to the initial search term.
Step 302, determining at least one target pinyin data from a pre-constructed pinyin database, wherein the pinyin database comprises a plurality of candidate prompt words and pinyin data of each candidate prompt word, and the target pinyin data is the same as pinyin data of an initial search word.
In an embodiment, for each candidate prompt word in the candidate prompt word library, the candidate prompt word may be input into a pinyin generator to obtain pinyin data corresponding to the candidate prompt word. And forming a pinyin database by the candidate prompt words and the pinyin data corresponding to the candidate prompt words.
Step 303, determining candidate prompt words corresponding to the target pinyin data as target prompt words of the initial search words.
For example, when the title of a currently popular drama is pronounced "huarong" and the user enters a homophone written with different characters, no search prompt word related to the drama can be given by vector matching, but the search prompt word related to the drama can still be determined through the shared pinyin data huarong.
According to the technical scheme provided by the embodiment of the application, the pinyin data of the initial search word is generated under the condition that the target candidate prompt word vector matched with the initial search word vector is not determined from the vector database, at least one target pinyin data identical with the pinyin data of the initial search word is determined from the pre-constructed pinyin database, and the candidate prompt word corresponding to the target pinyin data is determined as the target prompt word of the initial search word, so that the correct search prompt word can be still given under the condition that the initial search word input by a user contains wrongly written characters.
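Steps 301 to 303 can be sketched as an exact lookup keyed by toneless pinyin. This is a minimal illustration under stated assumptions: `TOY_PINYIN` is a hypothetical five-character table standing in for a real pinyin generator, and the Chinese characters are illustrative homophones chosen for this sketch rather than titles taken from the application.

```python
# Hypothetical toy table; a real system would call a pinyin generator instead.
TOY_PINYIN = {"花": "hua", "戎": "rong", "容": "rong", "平": "ping", "安": "an"}

def to_pinyin(text):
    """Toneless pinyin; characters missing from the toy table are kept as-is."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)

def build_pinyin_db(candidates):
    """Map each pinyin string to the candidate prompt words that share it."""
    db = {}
    for word in candidates:
        db.setdefault(to_pinyin(word), []).append(word)
    return db

def match_by_pinyin(query, pinyin_db):
    """Return candidates whose pinyin is identical to the pinyin of the query."""
    return pinyin_db.get(to_pinyin(query), [])

pinyin_db = build_pinyin_db(["花戎", "平安"])
```

Because the lookup is keyed on the pinyin string, a query typed with a homophone still reaches the stored candidate, while a query with a different pronunciation returns nothing.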
Fig. 4 is a flowchart of a method for generating search prompt words according to another embodiment of the present application. As shown in fig. 4, the method comprises the following steps:
Step 401, searching a pre-constructed candidate prompt word library according to the initial search word; in the case where a target candidate prompt word containing the initial search word is found in the candidate prompt word library, step 402 is performed, and in the case where no target candidate prompt word containing the initial search word is found in the candidate prompt word library, step 403 is performed.
Step 402, determining target candidate prompt words as target prompt words of initial search words; ending the flow.
Step 403, vectorizing the initial search word to obtain an initial search word vector corresponding to the initial search word.
Step 404, searching a pre-constructed vector database according to the initial search word vector; step 405 is performed in the case where a target candidate prompt word vector is found in the vector database, and step 406 is performed in the case where no target candidate prompt word vector is found in the vector database.
Step 405, determining candidate prompt words corresponding to the target candidate prompt word vector as target prompt words of the initial search word; ending the flow.
Step 406, generating pinyin data for the initial search term.
Step 407, determining at least one target pinyin data from the pre-constructed pinyin database.
Step 408, determining the candidate prompt word corresponding to the target pinyin data as the target prompt word of the initial search word.
As for the detailed description of the above steps 403 to 408, reference may be made to the related description in the above embodiment, and the detailed description is omitted here.
According to the technical scheme provided by the embodiment of the application, through a plurality of matching modes of character matching, vector matching and pinyin matching, the accuracy of recommended search prompt words can be improved, and the occurrence probability of the condition that any search prompt word cannot be given is reduced.
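The control flow of steps 401 to 408 can be sketched as a cascade that stops at the first matching mode that returns a result. This is only an outline under assumptions: the three matchers are passed in as callables, the function and variable names are hypothetical, and `contains` is a simple substring check standing in for the character matching of step 401.

```python
def cascade(query, char_match, vector_match, pinyin_match):
    """Try each matcher in order; return (stage name, hits) for the first that hits."""
    stages = (("character", char_match), ("vector", vector_match), ("pinyin", pinyin_match))
    for name, matcher in stages:
        hits = matcher(query)
        if hits:
            return name, hits
    # No matching mode produced a prompt word.
    return None, []

library = ["space odyssey", "ocean documentary"]

def contains(query):
    """Character matching: candidate prompt words that contain the query."""
    return [word for word in library if query in word]
```

Returning the stage name alongside the hits is a design choice of this sketch; it makes it easy to log which matching mode produced a given recommendation.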
In addition, in the present application, after the user triggers a search operation on a given target prompt word, the server returns search results related to the target prompt word to the client, thereby meeting the user's search requirement. To this end, the embodiment of the present application proposes: for each candidate prompt word in the vector database (that is, for each candidate prompt word in the candidate prompt word library), extracting metadata of the resource from the candidate prompt word, and then, in response to a search operation based on the target prompt word, determining the target search result from the resource library based on the extracted metadata of the resource.
Taking a video resource search scenario as an example, metadata of the resource may include, but is not limited to, one or more of the following: person, place, time, event, episode ID, etc.
In one embodiment, for candidate prompt words such as episode titles, descriptions, recommendations, scenario introductions, subtitles and bullet screens, the metadata can be extracted from the candidate prompt words through a GPT large language model. For a candidate prompt word from which metadata cannot be extracted through the GPT large language model, the embodiment of the present application proposes finding a matching candidate prompt word among the other candidate prompt words from which metadata can be extracted, and assigning the metadata of the matching candidate prompt word to that candidate prompt word.
Fig. 5 is a flowchart of a method for generating search prompt words according to another embodiment of the present application.
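The metadata hand-over described above can be sketched as follows. The application does not state how the matching candidate is found, so this sketch assumes a word-level Jaccard similarity; the function names, the `min_score` parameter and the `episode_id` field are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity (an assumed matching criterion)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def fill_missing_metadata(metadata, similarity=jaccard, min_score=0.0):
    """Give candidates without metadata the metadata of their best-matching peer."""
    with_meta = {word: meta for word, meta in metadata.items() if meta}
    filled = dict(metadata)
    for word, meta in metadata.items():
        if meta:
            continue
        best = max(with_meta, key=lambda other: similarity(word, other), default=None)
        # Only inherit when the best peer is actually similar.
        if best is not None and similarity(word, best) > min_score:
            filled[word] = with_meta[best]
    return filled

metadata = {
    "space odyssey season two": {"episode_id": "ep-1"},
    "ocean life": {"episode_id": "ep-2"},
    "odyssey in space": None,
    "unrelated": None,
}
filled = fill_missing_metadata(metadata)
```

A candidate with no sufficiently similar peer keeps its empty metadata rather than inheriting an arbitrary entry.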
As shown in fig. 5, the method provided by the embodiment of the application first performs offline data production. The offline data production process specifically includes: (1) collecting basic data, including but not limited to episode titles, episode descriptions, episode recommendations, scenario introductions, subtitles, bullet screens, search queries, etc.; (2) preprocessing the basic data, for example by cleaning and de-duplication; (3) vectorizing the preprocessed basic data and generating its pinyin, and storing the resulting vectors and pinyin data to form a vector database and a pinyin database. When the basic data is vectorized, different vectorization modes can be adopted for different types of basic data: for example, basic data such as search queries, titles and recommendations can be vectorized directly, whereas for basic data such as scenario introductions and subtitles the keywords can be extracted first and only the keywords vectorized. The vectorization process may call an embedding vector model. Keyword extraction and pinyin generation may be based on the GPT large language model, with prompts as exemplified below:
Keyword extraction prompt:
Extract keywords from the following content (###content###):
###{content}###
Requirements: 1. do not extract keywords irrelevant to the content; 2. keywords include people, places, times and events; 3. events should be expressed as short sentences that include the relationships between characters; 4. output the final result in JSON format.
Pinyin generation prompt:
Generate pinyin for the following content (###"id1":"content 1","id2":"content 2"###):
###"{id}":"{content}",###
Requirements: output in JSON format, without tones, and retain the id in the output content.
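The two prompts above can be filled in and their JSON replies checked roughly as follows. This sketch only builds the prompt text and validates a reply string; it makes no call to any language model, and the function names and the normalization of the reply (removing spaces, lowercasing) are assumptions of the sketch.

```python
import json

KEYWORD_TEMPLATE = (
    "Extract keywords from the following content:\n"
    "###{content}###\n"
    "Requirements: 1. do not extract keywords irrelevant to the content; "
    "2. keywords include people, places, times and events; "
    "3. express events as short sentences that include character relationships; "
    "4. output the final result in JSON format."
)

def build_keyword_prompt(content):
    """Fill the keyword extraction template with one piece of content."""
    return KEYWORD_TEMPLATE.format(content=content)

def build_pinyin_prompt(items):
    """Fill the pinyin template with id/content pairs so ids can be retained."""
    body = ",".join(
        f"{json.dumps(str(key))}:{json.dumps(text, ensure_ascii=False)}"
        for key, text in items.items()
    )
    return (
        "Generate pinyin for the following content:\n"
        f"###{body}###\n"
        "Requirements: output in JSON format, without tones, and retain the id."
    )

def parse_pinyin_reply(reply, expected_ids):
    """Parse a JSON reply and check that every requested id was retained."""
    data = json.loads(reply)
    missing = [key for key in expected_ids if key not in data]
    if missing:
        raise ValueError(f"ids missing from reply: {missing}")
    return {key: str(data[key]).replace(" ", "").lower() for key in expected_ids}
```

Checking the ids on the way back is useful because the pinyin of several candidates is requested in one prompt and each result has to be stored against the right candidate.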
As shown in fig. 5, in the main service flow, the search query input by the user is first matched by its original characters. If no prompt word containing the search query is matched, the search query is vectorized and the pre-constructed vector database is searched according to the vectorization result. If no prompt word meeting the requirement is found in the vector database, pinyin data of the search query is further generated, and the final recommended prompt word is determined according to the pinyin data.
Therefore, the technical scheme provided by the embodiment of the application can improve the accuracy of the recommended search prompt words and reduce the occurrence probability of failing to give any search prompt words through a plurality of matching modes of character matching, vector matching and pinyin matching.
Fig. 6 is a block diagram of an embodiment of a device for generating search prompt words according to an embodiment of the present application. As shown in fig. 6, the apparatus includes:
an initial word acquisition module 61, configured to acquire an initial search word input by a user;
the initial word vectorization module 62 is configured to perform vectorization processing on the initial search word, so as to obtain an initial search word vector corresponding to the initial search word;
a first matching module 63, configured to determine at least one target candidate prompt word vector from a pre-constructed vector database, where the vector database includes a plurality of candidate prompt words and the candidate prompt word vector corresponding to each candidate prompt word, and the vector similarity between the initial search word vector and the target candidate prompt word vector is greater than a set similarity threshold;
the first determining module 64 is configured to determine a candidate prompt word corresponding to the target candidate prompt word vector as a target prompt word of the initial search word.
Optionally, the apparatus further includes a vector database construction module for constructing the vector database, which specifically includes:
a candidate word obtaining unit, configured to obtain a plurality of candidate prompt words from a plurality of preset data sources, where each data source includes a plurality of text data, and sources of the text data included in different data sources are different;
The candidate word vectorization unit is used for vectorizing the candidate prompt words aiming at each candidate prompt word to obtain candidate prompt word vectors corresponding to the candidate prompt words;
the construction unit is used for constructing the vector database based on the candidate prompt words and the candidate prompt word vectors corresponding to the candidate prompt words.
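The three units above can be sketched as a single build function. This is an illustrative outline only: `embed` is a toy word-count stand-in for an embedding model, the whitespace clean-up and de-duplication follow the offline preprocessing described earlier, and the optional `extract_keywords` hook, the `keyword_sources` parameter and the source names are hypothetical.

```python
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts (assumption)."""
    return Counter(text.lower().split())

def build_vector_db(sources, embed_fn=embed, extract_keywords=None, keyword_sources=()):
    """Collect candidates from several data sources and store one vector per candidate."""
    db = {}
    for name, texts in sources.items():
        for raw in texts:
            word = " ".join(raw.split())   # clean: collapse stray whitespace
            if not word or word in db:     # skip empty entries and duplicates
                continue
            basis = word
            # For long-text sources, vectorize only the extracted keywords.
            if extract_keywords and name in keyword_sources:
                basis = " ".join(extract_keywords(word))
            db[word] = embed_fn(basis)
    return db

sources = {
    "titles": ["Space  Odyssey", "Space Odyssey", ""],
    "subtitles": ["we leave for the moon tonight"],
}
db = build_vector_db(
    sources,
    extract_keywords=lambda text: [w for w in text.split() if len(w) > 4],
    keyword_sources={"subtitles"},
)
```

The database keeps the full candidate prompt word as the key even when only its keywords are vectorized, so the original text can still be returned as the prompt word.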
Optionally, the apparatus further includes a candidate prompt word vectorization module for obtaining the candidate prompt word vector corresponding to each candidate prompt word, which specifically includes:
a keyword extraction unit, configured to extract keywords in the candidate prompt words by using a preset keyword extraction technology;
and the first vectorization unit is used for vectorizing the keywords to obtain candidate prompt word vectors corresponding to the candidate prompt words.
Optionally, the apparatus further includes a candidate prompt word vectorization module for obtaining the candidate prompt word vector corresponding to each candidate prompt word, which specifically includes:
the word segmentation unit is used for carrying out word segmentation processing on the candidate prompt words to obtain at least two candidate prompt word segments;
the second vectorization unit is used for vectorizing the candidate prompt word segments according to each candidate prompt word segment to obtain candidate prompt word segment vectors corresponding to the candidate prompt word segments;
And the candidate word vector determining unit is used for determining each candidate prompt word segmentation vector as a candidate prompt word vector corresponding to the candidate prompt word.
Optionally, the initial word vectorization module 62 includes:
the word segmentation unit is used for carrying out word segmentation processing on the initial search words to obtain at least two initial search word segments;
a word segmentation vectorization unit, configured to vectorize, for each initial search word segment, the initial search word segment to obtain an initial search word segmentation vector corresponding to that segment, and to determine the initial search word segmentation vectors as the initial search word vector corresponding to the initial search word;
the first matching module 63 includes:
the word segmentation vector matching unit is used for determining at least one candidate prompt word vector with vector similarity larger than a set similarity threshold value from the vector database for each initial search word segmentation vector; and determining candidate prompting word vectors with vector similarity with the initial searching word segmentation vector being larger than a set similarity threshold value as target candidate prompting word vectors.
Optionally, the apparatus further includes:
an initial word pinyin module, configured to generate pinyin data of the initial search word in the case where no target candidate prompt word vector matching the initial search word vector is determined from the vector database;
the pinyin matching module is used for determining at least one target pinyin data from a pre-constructed pinyin database, wherein the pinyin database comprises a plurality of candidate prompt words and pinyin data of each candidate prompt word, and the target pinyin data are the same as the pinyin data of the initial search word;
and the second determining module is used for determining the candidate prompt word corresponding to the target pinyin data as the target prompt word of the initial search word.
Optionally, the apparatus further includes:
the initial word matching module is used for searching a pre-constructed candidate prompt word library according to the initial search word before the initial search word is subjected to vectorization processing to obtain an initial search word vector corresponding to the initial search word; under the condition that target candidate prompt words containing the initial search words are found out from the candidate prompt word library, determining the target candidate prompt words as target prompt words of the initial search words; and executing the step of vectorizing the initial search word under the condition that the target candidate prompt word containing the initial search word is not found in the candidate prompt word library.
As shown in fig. 7, an embodiment of the present application provides an electronic device including a processor 711, a communication interface 712, a memory 713, and a communication bus 714, wherein the processor 711, the communication interface 712, and the memory 713 communicate with each other through the communication bus 714;
a memory 713 for storing a computer program;
in one embodiment of the present application, the processor 711 is configured to implement the method for generating a search prompt word provided in any one of the foregoing method embodiments when executing the program stored in the memory 713, where the method includes:
acquiring an initial search term input by a user;
vectorizing the initial search word to obtain an initial search word vector corresponding to the initial search word;
determining at least one target candidate prompting word vector from a pre-constructed vector database, wherein the vector database comprises a plurality of candidate prompting words and candidate prompting word vectors corresponding to each candidate prompting word, and the vector similarity between the initial searching word vector and the target candidate prompting word vector is larger than a set similarity threshold;
and determining the candidate prompt word corresponding to the target candidate prompt word vector as the target prompt word of the initial search word.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method for generating search prompt words provided in any one of the method embodiments described above.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless an order of performance is explicitly stated. It should also be appreciated that additional or alternative steps may be used.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for generating search prompt words, the method comprising:
acquiring an initial search term input by a user;
vectorizing the initial search word to obtain an initial search word vector corresponding to the initial search word;
determining at least one target candidate prompting word vector from a pre-constructed vector database, wherein the vector database comprises a plurality of candidate prompting words and candidate prompting word vectors corresponding to each candidate prompting word, and the vector similarity between the initial searching word vector and the target candidate prompting word vector is larger than a set similarity threshold;
and determining the candidate prompt word corresponding to the target candidate prompt word vector as the target prompt word of the initial search word.
2. The method according to claim 1, characterized in that the vector database is constructed by:
acquiring a plurality of candidate prompt words from a plurality of preset data sources, wherein each data source comprises a plurality of pieces of text data, and the sources of the text data included in different data sources are different;
carrying out vectorization processing on the candidate prompt words aiming at each candidate prompt word to obtain candidate prompt word vectors corresponding to the candidate prompt words;
And constructing the vector database based on the candidate prompt words and the candidate prompt word vectors corresponding to the candidate prompt words.
3. The method of claim 1, wherein the candidate prompt word vector corresponding to the candidate prompt word is obtained by:
extracting keywords in the candidate prompt words by using a preset keyword extraction technology;
and carrying out vectorization processing on the keywords to obtain candidate prompt word vectors corresponding to the candidate prompt words.
4. The method of claim 1, wherein the candidate prompt word vector corresponding to the candidate prompt word is obtained by:
performing word segmentation processing on the candidate prompt words to obtain at least two candidate prompt word segments;
carrying out vectorization processing on each candidate prompt word segment to obtain a candidate prompt word segmentation vector corresponding to the candidate prompt word segment;
and determining each candidate prompt word segmentation vector as a candidate prompt word vector corresponding to the candidate prompt word.
5. The method of claim 1, wherein the vectorizing the initial search term to obtain an initial search term vector corresponding to the initial search term comprises:
Performing word segmentation processing on the initial search words to obtain at least two initial search word segments;
performing vectorization processing on each initial search word segment to obtain an initial search word segmentation vector corresponding to the initial search word segment;
determining each initial search word segmentation vector as an initial search word vector corresponding to the initial search word;
the determining at least one target candidate prompt word vector from the pre-constructed vector database comprises:
for each initial search word segmentation vector included in the initial search word vector, determining at least one candidate prompt word vector with vector similarity greater than a set similarity threshold value from a pre-constructed vector database;
and determining candidate prompting word vectors with vector similarity with the initial searching word segmentation vector being larger than a set similarity threshold value as target candidate prompting word vectors.
6. The method according to claim 1, wherein the method further comprises:
generating pinyin data of the initial search word under the condition that a target candidate prompt word vector matched with the initial search word vector is not determined from the vector database;
Determining at least one target pinyin data from a pre-constructed pinyin database, wherein the pinyin database comprises a plurality of candidate prompt words and pinyin data of each candidate prompt word, and the target pinyin data are the same as the pinyin data of the initial search word;
and determining the candidate prompt word corresponding to the target pinyin data as the target prompt word of the initial search word.
7. The method of claim 1, further comprising, prior to said vectorizing said initial search term to obtain an initial search term vector corresponding to said initial search term:
searching a pre-constructed candidate prompt word library according to the initial search word;
under the condition that target candidate prompt words containing the initial search words are found out from the candidate prompt word library, determining the target candidate prompt words as target prompt words of the initial search words;
and executing the step of vectorizing the initial search word under the condition that the target candidate prompt word containing the initial search word is not found from the candidate prompt word library.
8. A search prompt word generation apparatus, the apparatus comprising:
The initial word acquisition module is used for acquiring initial search words input by a user;
the initial word vectorization module is used for vectorizing the initial search word to obtain an initial search word vector corresponding to the initial search word;
the first matching module is used for determining at least one target candidate prompting word vector from a pre-constructed vector database, wherein the vector database comprises a plurality of candidate prompting words and candidate prompting word vectors corresponding to the candidate prompting words, and the vector similarity between the initial searching word vector and the target candidate prompting word vector is larger than a set similarity threshold;
and the first determining module is used for determining the candidate prompt word corresponding to the target candidate prompt word vector as the target prompt word of the initial search word.
9. An electronic device, comprising: a processor and a memory, the processor being configured to execute a search prompt word generation program stored in the memory, to implement the search prompt word generation method of any one of claims 1 to 7.
10. A storage medium storing one or more programs executable by one or more processors to implement the search prompt word generation method of any one of claims 1 to 7.
CN202311116993.4A 2023-08-31 2023-08-31 Search prompt word generation method and device, electronic equipment and storage medium Pending CN117093600A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311116993.4A CN117093600A (en) 2023-08-31 2023-08-31 Search prompt word generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311116993.4A CN117093600A (en) 2023-08-31 2023-08-31 Search prompt word generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117093600A true CN117093600A (en) 2023-11-21

Family

ID=88775182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311116993.4A Pending CN117093600A (en) 2023-08-31 2023-08-31 Search prompt word generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117093600A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118396124A (en) * 2024-06-27 2024-07-26 北京赛彼思智能科技有限公司 Method and device for adapting prompt word for large language model

Similar Documents

Publication Publication Date Title
US11899681B2 (en) Knowledge graph building method, electronic apparatus and non-transitory computer readable storage medium
US10824874B2 (en) Method and apparatus for processing video
CN106973244B (en) Method and system for automatically generating image captions using weak supervision data
CN111831911B (en) Query information processing method and device, storage medium and electronic device
US20180336193A1 (en) Artificial Intelligence Based Method and Apparatus for Generating Article
CN108268600B (en) AI-based unstructured data management method and device
CN112800170A (en) Question matching method and device and question reply method and device
CN111400513B (en) Data processing method, device, computer equipment and storage medium
CN112214593A (en) Question and answer processing method and device, electronic equipment and storage medium
CN109348262B (en) Calculation method, device, equipment and storage medium for anchor similarity
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
CN110740389A (en) Video positioning method and device, computer readable medium and electronic equipment
CN110717038A (en) Object classification method and device
CN113392265A (en) Multimedia processing method, device and equipment
CN112860862A (en) Method and device for generating intelligent body dialogue sentences in man-machine dialogue
CN111198946A (en) Network news hotspot mining method and device
CN112347339A (en) Search result processing method and device
CN117093600A (en) Search prompt word generation method and device, electronic equipment and storage medium
CN111813923A (en) Text summarization method, electronic device and storage medium
CN115269913A (en) Video retrieval method based on attention fragment prompt
CN113987274A (en) Video semantic representation method and device, electronic equipment and storage medium
CN115098729A (en) Video processing method, sample generation method, model training method and device
CN118035489A (en) Video searching method and device, storage medium and electronic equipment
CN116702094B (en) Group application preference feature representation method
CN113343692A (en) Search intention recognition method, model training method, device, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination