CN109189990B - Search word generation method and device and electronic equipment - Google Patents


Publication number
CN109189990B
Authority
CN
China
Prior art keywords: search, word, training, words, recommended
Legal status: Active
Application number
CN201810826071.5A
Other languages
Chinese (zh)
Other versions
CN109189990A (en)
Inventor
叶澄灿
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810826071.5A
Publication of CN109189990A
Application granted
Publication of CN109189990B


Abstract

An embodiment of the invention provides a search term generation method, a search term generation device, and an electronic device, relating to search technology within the field of computer technology. The method comprises the following steps: for a specified search word, generating a recommended candidate word set with each of a plurality of preset models; merging the generated recommended candidate word sets and de-duplicating the merged set to obtain a recommended search word candidate set; and selecting recommended search words from the recommended candidate words in the recommended search word candidate set. Generating search words with the scheme provided by this embodiment of the invention can solve the technical problems that the recommended search words generated by the prior art are not comprehensive enough and cover only a single category.

Description

Search word generation method and device and electronic equipment
Technical Field
The present invention relates to the field of search technologies in the field of computer technologies, and in particular, to a method and an apparatus for generating a search term, and an electronic device.
Background
With the growing quality and quantity of online videos and the increasing use of video search engines, video search has become an important way for users to obtain information and to relax. After a user completes a search, providing high-quality recommended search words can stimulate the user's search interest, remedy a poor search result obtained with the current search word, and thereby improve the user's search experience.
At present, a known search term generation technology is a recommended search term generation method based on a collaborative filtering model, and the scheme is as follows:
A data set is constructed by extracting the click relationships between users and search words from the search log. For any two search words q_i and q_j, their collaborative filtering correlation is calculated with the following formula:

w_ij = |N(i) ∩ N(j)| / sqrt( |N(i)| × |N(j)| )

That is, the collaborative filtering correlation w_ij equals the number of users in N(i) ∩ N(j), divided by the square root of the product of the sizes of N(i) and N(j), where N(i) is the set of users who searched for q_i within a certain time period, N(j) is the set of users who searched for q_j within the same time period, and N(i) ∩ N(j) is the set of users who searched for both q_i and q_j within that period. For the current search word, the collaborative filtering correlation between it and each search word to be selected is calculated, and the search words with the highest collaborative filtering correlations are selected to form a recommended search word candidate set for the current search word.
In the video search components of current search engines, recommended search words are generated mainly with collaborative filtering model techniques. Such a technique generates a recommended search word candidate set with a collaborative filtering model, scores the features of every dimension of every candidate word in the set, sums the scores with weights, and preferentially selects the candidate words with the highest total scores as the recommended search words.
The inventor finds that the prior art at least has the following problems in the process of implementing the invention:
Compared with the recommended search words the user may actually need, the recommended search words obtained with a collaborative filtering model are not comprehensive enough and cover only a single category, so they cannot effectively satisfy the user's search requirements.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for generating search terms and electronic equipment, which are used for solving the technical problems that the generated recommended search terms are not comprehensive enough and the categories are single. The specific technical scheme is as follows:
the embodiment of the invention provides a method for generating search terms, which comprises the following steps:
aiming at a specified search word, adopting a plurality of preset models to respectively generate a recommended candidate word set, wherein the plurality of preset models are respectively obtained by training data with different dimensions in a search log;
merging the generated recommended candidate word sets, and performing duplicate removal processing on the merged sets to obtain recommended search word candidate sets;
and selecting recommended search words from the recommended candidate words in the recommended search word candidate set.
Further, the plurality of preset models includes at least two of the following models:
a click relevance model;
an LDA (Latent Dirichlet Allocation) topic model;
a collaborative filtering model.
Further, the preset multiple models include a click relevance model, and the process of generating the recommended candidate word set through the click relevance model includes:
aiming at a specified search word, inquiring a first training result obtained by using a click correlation model to obtain a click correlation expression vector of the specified search word, wherein the click correlation expression vector is a participle vector and is used for representing the weight of each participle of the specified search word, the first training result is obtained by training first sample data extracted from a search log by using the click correlation model, the first sample data comprises a plurality of search words extracted from the search log and used as training search words, and the click relation between the training search words in the search log and the search result, and the first training result comprises the participle vector of each training search word;
respectively calculating the click relevance expression vector of the specified search word and the inner product of the click relevance expression vector of each search word to be selected to obtain the click relevance between the specified search word and each search word to be selected;
and preferentially selecting the search words to be selected with high click relevance from the search words to be selected to form a recommended candidate word set of the specified search words generated by adopting the click relevance model.
Further, the click relationship between the training search words in the search log and the search results is the number of clicks between the training search words in the search log and the search results;
training the click relevance model by using the first sample data by adopting the following steps to obtain a first training result:
segmenting each training search word in the first sample data respectively, and generating an initial segmentation vector aiming at the obtained segmentation, wherein the initial segmentation vector is used for representing the initial weight of each segmentation of the training search word, and the initial weight of each segmentation of the training search word is equal;
repeatedly executing the following steps A and B until a preset iteration termination condition is met:
step A: respectively calculating current iteration expression vectors of a plurality of search results in the first sample data based on the current iteration expression vectors of the training search words, the number of the training search words and the number of clicks, wherein the current iteration expression vectors of the training search words in the first iteration are the initial word segmentation vectors;
step B: respectively calculating new iterative expression vectors of the training search words based on the current iterative expression vectors of the search results, the number of the search results and the number of clicks;
and when the preset iteration termination condition is met, respectively using the latest iteration expression vector of each training search word as the word segmentation vector of the training search word, wherein the word segmentation vector of the training search word forms the first training result.
Further, the calculating the current iterative expression vectors of the plurality of search results in the first sample data based on the current iterative expression vectors of the plurality of training search terms, the number of the plurality of training search terms, and the number of clicks respectively includes:
calculating the current iteration expression vector of the search result by adopting the following formula:
Figure BDA0001742508170000031
where D_j^(n) is the current iteration expression vector of the jth search result at the nth iteration, Q_i^(n-1) is the current iteration expression vector of the ith training search word at the (n-1)th iteration, C_i,j is the number of clicks between the ith training search word and the jth search result, and |Query| is the number of the plurality of training search words;
the calculating a plurality of new iterative expression vectors of the training search terms respectively based on the current iterative expression vectors of the plurality of search results, the number of the plurality of search results, and the number of clicks includes:
calculating a new iterative expression vector of the training search term using the following formula:
Figure BDA0001742508170000041
where Q_i^(n) is the new iteration expression vector of the ith training search word at the nth iteration, and |Doc| is the number of the search results.
Further, the preset multiple models include an LDA topic model, and the process of generating the recommended candidate word set through the LDA topic model includes:
performing word segmentation on the specified search word to obtain the word segmentation of the specified search word;
acquiring the weight of each participle of the specified search word in the specified search word;
respectively inquiring a second training result obtained by adopting an LDA topic model for each obtained participle of the specified search word to obtain the probability distribution of the participle of the specified search word on a plurality of LDA topics, wherein the second training result is obtained by adopting the LDA topic model to train second sample data extracted from a search log, the second sample data comprises the participle extracted from the title of the search result of the search log and is used as a training participle, and the second training result comprises the probability distribution of each training participle on the plurality of LDA topics;
for each LDA topic, calculating a weighted sum of the probability distributions of the participles of the specified search word on the LDA topic, using the weight of each participle of the specified search word within the specified search word, as the weight of the specified search word on the LDA topic;
adopting the weight of the specified search word on a plurality of LDA subjects to form an LDA subject vector of the specified search word as an LDA expression vector of the specified search word;
respectively calculating the LDA expression vectors of the specified search words and the inner product of the LDA expression vectors of the search words to be selected to obtain the LDA correlation between the specified search words and each search word to be selected;
and preferentially selecting the search words to be selected with high LDA correlation from the search words to be selected to form a recommendation candidate word set of the specified search words generated by adopting the LDA topic model.
Further, the selecting a recommended search word from the recommended candidate words in the recommended search word candidate set includes:
acquiring the relevance characteristics of the recommended candidate words in the recommended search word candidate set and the specified search words as first relevance characteristics;
for the first correlation characteristics, a recommended search word screening model is adopted, recommended candidate words in the recommended search word candidate set are respectively scored, and screening scores are obtained, wherein the recommended search word screening model is obtained by training third sample data through a linear regression or gradient boosting decision tree algorithm, the third sample data comprises click relations between search words in a search log and the recommended search words of the search words and second correlation characteristics between the search words in the search log and the recommended search words of the search words, and the second correlation characteristics are the same as the first correlation characteristics in type;
and preferentially selecting the recommended candidate words with high screening scores as recommended search words.
Further, the first correlation characteristic includes at least one of the following correlations:
click relevance;
LDA correlation;
collaborative filtering correlation.
An embodiment of the present invention further provides a device for generating a search term, including:
the set generation module is used for generating recommendation candidate word sets by adopting a plurality of preset models aiming at the specified search word, wherein the plurality of preset models are obtained by training data with different dimensions in the search log respectively;
the set merging module is used for merging the generated recommended candidate word sets and carrying out duplication elimination processing on the merged sets to obtain recommended search word candidate sets;
and the word selecting module is used for selecting the recommended search words from the recommended candidate words in the recommended search word candidate set.
Further, the plurality of preset models at least include two of the following models:
a click relevance model;
an LDA topic model;
a collaborative filtering model.
Further, the preset multiple models comprise a click correlation model;
the set generation module includes:
a first query submodule, configured to query, for a specified search word, a first training result obtained by using a click relevance model to obtain a click relevance expression vector of the specified search word, where the click relevance expression vector is a participle vector and is used to represent a weight of each participle of the specified search word, where the first training result is obtained by training first sample data extracted from a search log by using the click relevance model, the first sample data includes a plurality of search words extracted from the search log as training search words and a click relationship between the training search words in the search log and a search result, and the first training result includes the participle vector of each training search word;
the first inner product calculating submodule is used for calculating the inner products of the click relevance expression vectors of the specified search words and the click relevance expression vectors of the search words to be selected respectively to obtain the click relevance between the specified search words and each search word to be selected respectively;
and the first optimization sub-module is used for preferentially selecting the search words to be selected with high click correlation from the search words to be selected to form a recommended candidate word set of the specified search words generated by adopting the click correlation model.
Further, the click relationship between the training search words in the search log and the search results is the number of clicks between the training search words in the search log and the search results;
the set generating module further includes the following sub-modules, configured to train the click relevance model using the first sample data, and obtain the first training result:
the first word segmentation submodule is used for performing word segmentation on each training search word in the first sample data respectively and generating an initial word segmentation vector aiming at the obtained word segmentation, the initial word segmentation vector is used for representing the initial weight of each word segmentation of the training search word, and the initial weight of each word segmentation of the training search word is equal;
the iteration submodule is used for repeatedly executing the following steps A and B until a preset iteration termination condition is met:
step A: respectively calculating current iteration expression vectors of a plurality of search results in the first sample data based on the current iteration expression vectors of the training search words, the number of the training search words and the number of clicks, wherein the current iteration expression vectors of the training search words in the first iteration are the initial word segmentation vectors;
step B: respectively calculating new iterative expression vectors of the training search words based on the current iterative expression vectors of the search results, the number of the search results and the number of clicks;
and when the preset iteration termination condition is met, respectively using the latest iteration expression vector of each training search word as the word segmentation vector of the training search word, wherein the word segmentation vector of the training search word forms the first training result.
Further, the iteration sub-module includes:
a search result iteration unit, configured to calculate a current iteration expression vector of the search result by using the following formula:
Figure BDA0001742508170000071
where D_j^(n) is the current iteration expression vector of the jth search result at the nth iteration, Q_i^(n-1) is the current iteration expression vector of the ith training search word at the (n-1)th iteration, C_i,j is the number of clicks between the ith training search word and the jth search result, and |Query| is the number of the plurality of training search words;
the training search term iteration unit is used for calculating a new iteration expression vector of the training search term by adopting the following formula:
Figure BDA0001742508170000072
where Q_i^(n) is the new iteration expression vector of the ith training search word at the nth iteration, and |Doc| is the number of the search results.
Further, the preset multiple models comprise an LDA topic model;
the set generation module includes:
the second word segmentation submodule is used for segmenting the appointed search word to obtain the segmentation of the appointed search word;
the weight obtaining sub-module is used for obtaining the weight of each participle of the specified search word in the specified search word;
a second query submodule, configured to query, for each obtained participle of the specified search term, a second training result obtained by using an LDA topic model to obtain probability distribution of the participle of the specified search term on multiple LDA topics, where the second training result is obtained by training second sample data extracted from a search log by using the LDA topic model, the second sample data includes the participle extracted from a title of the search result of the search log as a training participle, and the second training result includes probability distribution of each training participle on multiple LDA topics;
and a value calculating operator module, configured to calculate, for each LDA topic, a weighted sum value of a probability distribution of the participles of the specified search word on the LDA topic, as a weight of the specified search word on the LDA topic, using a weight of each participle of the specified search word in the specified search word;
the vector generation submodule is used for adopting the weights of the specified search terms on the plurality of LDA topics to form an LDA topic vector of the specified search terms, and the LDA topic vector is used as an LDA expression vector of the specified search terms;
the second inner product calculation submodule is used for calculating the inner products of the LDA expression vectors of the specified search words and the LDA expression vectors of the search words to be selected respectively to obtain the LDA correlation between the specified search words and each search word to be selected respectively;
and the second optimization submodule is used for preferentially selecting the search words to be selected with high LDA correlation from the search words to be selected to form a recommendation candidate word set of the specified search words generated by adopting the LDA topic model.
Further, the term selecting module includes:
the characteristic obtaining sub-module is used for obtaining a recommended candidate word in the recommended search word candidate set and the correlation characteristic of the specified search word as a first correlation characteristic;
the scoring submodule is used for scoring the first correlation characteristics respectively by adopting a recommended search word screening model and scoring recommended candidate words in the recommended search word candidate set to obtain screening scores, wherein the recommended search word screening model is obtained by training third sample data by adopting a linear regression or gradient boosting decision tree algorithm, the third sample data comprises a click relation between a search word in a search log and the recommended search word of the search word and a second correlation characteristic between the search word in the search log and the recommended search word of the search word, and the second correlation characteristic is the same as the first correlation characteristic in type;
and the third optimization sub-module is used for preferentially selecting the recommended candidate words with high screening scores as recommended search words.
Further, the feature obtaining sub-module specifically obtains the first correlation feature, where the first correlation feature at least includes one of the following correlations:
click relevance;
LDA correlation;
collaborative filtering correlation.
The embodiment of the invention also provides an electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
and a processor for implementing any of the steps of the search term generation method described above when executing the program stored in the memory.
In yet another aspect of the present invention, the present invention further provides a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to perform any of the steps of the search term generation method described above.
In yet another aspect of the present invention, an embodiment of the present invention further provides a computer program product including instructions, which when run on a computer, cause the computer to execute any one of the above-mentioned search term generation methods.
According to the method and the device for generating the search terms, provided by the embodiment of the invention, the recommended search term candidate set is obtained by using various models obtained by training different dimensional data in the search logs, the generation mode of the recommended search terms is expanded, and the technical problems that the recommended search terms generated in the prior art are not comprehensive enough and are single in category can be solved. Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a flowchart of a method for generating search terms according to an embodiment of the present invention;
fig. 2 is another flowchart of a method for generating search terms according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for generating a recommended candidate word set by using a click relevance model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for training a click relevance model according to an embodiment of the present invention;
FIG. 5 is a flow chart of a method for generating recommended candidate words using an LDA topic model;
fig. 6 is a schematic structural diagram of a search term generation apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
The embodiment of the invention provides a method and a device for generating search terms, and concepts related to the embodiment of the invention are explained first.
The recommended search terms are search terms recommended to the user by a search engine after the user inputs complete or partial search terms, and are intended to provide search terms more meeting the search requirements of the user or stimulate the search interests of the user.
The click relevance model generates the click relevance of search words from click data. For a search word that has click data, the model can provide other search words with high click relevance to it.
The LDA topic model, i.e. the latent Dirichlet allocation topic model, can give the topic of each document in a document set in the form of a probability distribution. Training an LDA topic model requires no labeled training set; only the document set and the specified number of topics are needed.
The collaborative filtering model recommends search words of interest to the user based on the preferences of like-minded users, i.e. the group of users who searched for the same content.
The following describes in detail a search term generation method provided by an embodiment of the present invention with reference to specific embodiments.
Referring to fig. 1, fig. 1 is a flowchart of a method for generating search terms according to an embodiment of the present invention, including the following steps:
step 101, aiming at the specified search word, adopting multiple preset models to respectively generate a recommendation candidate word set.
The multiple preset models are obtained through training of data of different dimensions in the search logs respectively. The searching for the data of different dimensions in the log may include: the click relation between each search result and each search word in the search log, the title content of each search result in the search log, and the click relation between each user and each search word in the search log. Data of different dimensions in the search log reflect the search history from different aspects. The click relationship between each search result and each search term reflects the degree of association between the search term and the search result, and the click relationship between each user and each search term reflects the search preference of the user.
In this embodiment of the invention, the specified search word may be entered by the user or imported from another program. The models used to generate the recommended candidate word sets may include models that produce a set of search words highly correlated with the specified search word, as well as models that infer the field the searcher is interested in and take a set of popular search words in that field as the recommended candidate word set.
Step 102, merging the generated recommended candidate word sets, and performing deduplication processing on the merged set to obtain a recommended search word candidate set.
Some of the recommended candidate words generated by different models for the same specified search word may be identical; only one copy of each duplicate is kept, so that repeated recommended candidate words are removed. The recommended search word candidate set is composed of the remaining recommended candidate words.
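Purely as an illustration (not part of the patent text), a minimal sketch of the merge-and-deduplicate step, assuming each preset model returns its recommended candidate words as an ordered list of strings:

def merge_candidate_sets(candidate_sets):
    # Merge the recommended candidate word sets produced by the preset models
    # and remove duplicates, keeping the first occurrence of each word.
    merged, seen = [], set()
    for candidates in candidate_sets:
        for word in candidates:
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged

# Hypothetical candidate sets produced by three models for one specified search word.
click_set = ["drama A episode 2", "drama A cast"]
lda_set = ["drama A cast", "similar costume dramas"]
cf_set = ["drama B", "drama A episode 2"]
print(merge_candidate_sets([click_set, lda_set, cf_set]))
# ['drama A episode 2', 'drama A cast', 'similar costume dramas', 'drama B']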
Step 103, selecting recommended search words from the recommended candidate words in the recommended search word candidate set.
In this embodiment of the invention, the search popularity of each recommended candidate word in the recommended search word candidate set can be compared, and the recommended candidate words with high search popularity are preferentially selected as the recommended search words. Alternatively, the features of every dimension of every candidate word in the recommended search word candidate set can be scored, the scores summed with weights, and the candidate words with high total scores preferentially selected as the recommended search words.
According to the method for generating the search terms, provided by the embodiment of the invention, the recommended search term candidate set is obtained by using various models obtained by training different dimensional data in the search logs, the generation mode of the recommended search terms is expanded, and the technical problems that the recommended search terms generated in the prior art are not comprehensive enough and are single in category can be solved.
In the method shown in fig. 1, the recommended candidate words and the recommended search words selected from the recommended candidate words are different from the specified search words.
The above search term generation method provided by the embodiment of the present invention is described in detail below with reference to the accompanying drawings.
Fig. 2 is another flowchart of a method for generating search terms according to an embodiment of the present invention, which may specifically include the following steps:
step 201, obtaining the appointed search word.
Step 202, aiming at the specified search word, generating a recommendation candidate word set by adopting a click relevance model, an LDA topic model and a collaborative filtering model.
In this embodiment of the invention, the click relevance model can be used to generate recommended candidate words only for specified search words that are included in the first training result, where the first training result is obtained by training the first sample data extracted from the search log with the click relevance model.
The LDA topic model can give the weight of the specified search word on a plurality of pre-trained LDA topics, and other search words close to the search word topic can be provided according to the weight of the specified search word on each LDA topic.
The scheme for generating the recommended candidate word set with the collaborative filtering model is as follows: a data set is constructed by extracting the click relationships between users and search words from the search log. For any two search words q_i and q_j, their collaborative filtering correlation is calculated with the following formula:

w_ij = |N(i) ∩ N(j)| / sqrt( |N(i)| × |N(j)| )

That is, the collaborative filtering correlation w_ij equals the number of users in N(i) ∩ N(j), divided by the square root of the product of the sizes of N(i) and N(j), where N(i) is the set of users who searched for q_i within a certain time period, N(j) is the set of users who searched for q_j within the same time period, and N(i) ∩ N(j) is the set of users who searched for both q_i and q_j within that period. The time period may be one day or one week. For the current search word, the collaborative filtering correlation between it and each search word to be selected is calculated, and the search words to be selected with high collaborative filtering correlation are preferentially selected to form a recommended candidate word set.
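A minimal sketch of this calculation follows (illustrative only; the per-word user sets N(i) are assumed here to be available as Python sets, which the patent does not prescribe):

import math

def cf_correlation(users_by_term, qi, qj):
    # w_ij = |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|)
    ni, nj = users_by_term.get(qi, set()), users_by_term.get(qj, set())
    if not ni or not nj:
        return 0.0
    return len(ni & nj) / math.sqrt(len(ni) * len(nj))

# users_by_term maps each search word to the set of users who searched for it
# within the chosen time period (e.g. one day or one week); values are hypothetical.
users_by_term = {
    "drama A": {"u1", "u2", "u3"},
    "drama B": {"u2", "u3", "u4"},
}
print(cf_correlation(users_by_term, "drama A", "drama B"))  # 2 / sqrt(3 * 3) ≈ 0.667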
Step 203, merging the generated recommended candidate word sets, and performing deduplication processing on the merged set to obtain a recommended search word candidate set.
Step 204, acquiring the relevance characteristics of the recommended candidate words in the recommended search word candidate set and the specified search word as first relevance characteristics.
The first relevance feature may be click relevance, LDA relevance, or collaborative filtering relevance.
In the embodiment of the invention, the click relevance or LDA relevance or collaborative filtering relevance calculated in the generation process of the recommended candidate words can be directly extracted as the relevance characteristics of the recommended candidate words obtained in the step.
Step 205, for the first correlation characteristics, adopting a recommended search word screening model, and respectively scoring recommended candidate words in the recommended search word candidate set to obtain screening scores.
The recommended search word screening model is obtained by training third sample data with a linear regression or gradient boosting decision tree algorithm, where the third sample data comprises the click relationships between search words in the search log and their recommended search words, and second correlation characteristics between the search words in the search log and their recommended search words, the second correlation characteristics being of the same type as the first correlation characteristics obtained in step 204.
In the embodiment of the invention, the click relation between the search word in the search log and the recommended search word of the search word can be click times or click rate.
Step 206, preferentially selecting the recommended candidate words with high screening scores as recommended search words.
Preferentially selecting the recommended candidate words with high screening scores, wherein the selection may include selecting a first preset number of recommended candidate words according to the sequence of the screening scores from high to low, and may also include selecting all recommended candidate words with screening scores exceeding a preset screening score threshold.
In this embodiment of the invention, the trained recommended search word screening model is used to score each recommended candidate word in the recommended search word candidate set.
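As a hedged illustration of this scoring step, the sketch below uses scikit-learn's GradientBoostingRegressor as the screening model; the feature layout, training data and candidate words are hypothetical, and the patent does not mandate any particular library:

from sklearn.ensemble import GradientBoostingRegressor

# Third sample data (hypothetical values): one row of correlation features per
# (search word, recommended search word) pair, with the observed click count as label.
train_features = [[0.82, 0.67, 0.71],   # [click relevance, LDA correlation, CF correlation]
                  [0.10, 0.05, 0.20],
                  [0.55, 0.40, 0.35]]
train_clicks = [120, 3, 40]

screening_model = GradientBoostingRegressor().fit(train_features, train_clicks)

# Score each recommended candidate word by its first correlation features,
# then preferentially keep the candidates with the highest screening scores.
candidate_features = {"drama A cast": [0.75, 0.60, 0.50],
                      "drama B":      [0.20, 0.15, 0.30]}
scores = {word: screening_model.predict([feats])[0]
          for word, feats in candidate_features.items()}
recommended = sorted(scores, key=scores.get, reverse=True)[:1]
print(recommended)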
Fig. 3 is a flowchart of a method for generating a recommended candidate word set by using a click relevance model according to an embodiment of the present invention, which specifically includes the following steps:
step 301, obtaining the appointed search word.
Step 302, for the specified search term, a first training result is queried to obtain a click relevance expression vector of the specified search term.
The content of the first training result is a word segmentation vector of a plurality of search words. The first training result is obtained by training first sample data extracted from the search logs by using a click relevance model. The first sample data includes a plurality of search terms extracted from the search log as training search terms, and click relationships between the training search terms in the search log and the search results. The first training result includes a trained click relevance expression vector for each training search term.
In the embodiment of the invention, the click relation between the training search words in the search log and the search result can be the click times.
Step 303, calculating click relevance between the designated search term and each search term to be selected.
The search term to be selected may be all search terms in the first training result, or may be a plurality of search terms in the field of the first training result related to the specified search term.
The click relevance between the specified search word and each search word to be selected is the inner product of the click relevance expression vector of the specified search word and the click relevance expression vector of that search word to be selected.
The inner product of two participle vectors is the sum of the products of the weights of the same participle in the two vectors, divided by the product of the moduli of the two vectors. Since a click relevance expression vector is a participle vector, the inner product of two click relevance expression vectors is calculated in the same way as the inner product of two participle vectors, with the formula:

s = ( Σ_{i=1..n} A_i × B_i ) / ( |A| × |B| )

where A and B are the two click relevance expression vectors, s is the inner product of A and B, i is the sequence number of a participle, n is the total number of different participles, A_i is the weight of A on the participle with sequence number i, and B_i is the weight of B on the participle with sequence number i.
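A small sketch of this inner product, assuming a participle vector is represented as a dict mapping each participle to its weight (a representation chosen for illustration, not prescribed by the patent):

import math

def participle_vector_inner_product(a, b):
    # Sum of products of weights on shared participles, divided by the
    # product of the two vectors' moduli.
    dot = sum(weight * b[participle] for participle, weight in a.items() if participle in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two hypothetical click relevance expression vectors.
specified_vector = {"drama": 0.8, "A": 0.6}
candidate_vector = {"drama": 0.7, "B": 0.7}
print(participle_vector_inner_product(specified_vector, candidate_vector))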
Step 304, preferentially selecting the search words to be selected with high click relevance to form a recommended candidate word set.
Preferentially selecting the search terms to be selected with high click relevance, wherein the selection can comprise selecting a preset second number of search terms to be selected according to the order of the click relevance from high to low, and can also comprise selecting all the search terms to be selected with the click relevance exceeding a preset click relevance threshold.
According to the embodiment of the invention, the recommended candidate word set of the specified search word is generated by inquiring the training result of the click relevance model. The embodiment of the invention, as a generation mode of the recommended candidate word set, can be matched with other generation modes, and solves the technical problems that the recommended search words generated in the prior art are not comprehensive enough and have single category.
Fig. 4 is a flowchart of a method for training a click relevance model according to an embodiment of the present invention, which may specifically include the following steps:
step 401, extracting a plurality of search terms from the search log as training search terms, extracting a plurality of search results, and extracting the number of clicks between the training search terms and the search results.
Step 402, performing word segmentation on each training search word, and generating an initial word segmentation vector aiming at the obtained word segmentation.
And adopting the weight of the training search word on each participle of the training search word to form an initial participle vector of the training search word, wherein each weight in the initial participle vector is equal.
In this embodiment of the invention, for a training search word with m participles, the initial participle vector can be set as a vector with m elements whose coordinates are all equal, i.e. a unit vector with each coordinate equal to 1/√m.
Step 403, respectively calculating current iteration expression vectors of a plurality of search results based on the current iteration expression vectors of the training search words.
And respectively calculating the current iteration expression vectors of a plurality of search results in the first sample data based on the current iteration expression vectors of a plurality of training search words, the number of the training search words and the number of clicks.
Calculating the current iteration expression vector of the search result by adopting the following formula:
Figure BDA0001742508170000152
where D_j^(n) is the current iteration expression vector of the jth search result at the nth iteration, Q_i^(n-1) is the current iteration expression vector of the ith training search word at the (n-1)th iteration, C_i,j is the number of clicks between the ith training search word and the jth search result, and |Query| is the number of the training search words.
Step 404, calculating new iterative expression vectors of a plurality of training search terms respectively based on the current iterative expression vectors of a plurality of search results.
And respectively calculating new iterative expression vectors of a plurality of training search words based on the current iterative expression vectors of a plurality of search results, the number of the plurality of search results and the number of clicks.
Calculating a new iterative expression vector of the training search term by adopting the following formula:
Figure BDA0001742508170000161
where Q_i^(n) is the new iteration expression vector of the ith training search word at the nth iteration, and |Doc| is the number of the search results.
Step 405, judging whether the iteration termination condition is met: if so, proceeding to step 406; if not, returning to step 403.
In this embodiment of the invention, determining whether the iteration termination condition is met may be determining whether the number of iterations has reached a preset first threshold, or determining whether, for every training search word, the difference between its current iteration expression vector and its iteration expression vector from the previous iteration is smaller than a preset second threshold.
Step 406, a first training result is obtained.
The first training result includes a trained click relevance expression vector for each training search term.
In this embodiment of the invention, the click data are trained by an iterative method, and click relevance expression vectors of the search words are obtained as the training result; the inner product between two click relevance expression vectors then fully reflects the click relevance of the two search words they represent.
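The exact update formulas of steps A and B are given in the patent only as formula images, so the sketch below is merely an illustration of the alternating iteration: it assumes a click-weighted averaging update and represents every expression vector as a dict from participle to weight, both of which are my assumptions rather than the patent's formulas:

def weighted_average(vectors_with_clicks, denominator):
    # Combine participle-weight dicts, each scaled by its click count,
    # then divide by the given denominator (assumed normalization).
    combined = {}
    for vec, clicks in vectors_with_clicks:
        for participle, weight in vec.items():
            combined[participle] = combined.get(participle, 0.0) + clicks * weight
    return {participle: weight / denominator for participle, weight in combined.items()}

def train_click_relevance(query_vectors, clicks, max_iters=20):
    # query_vectors: initial participle vectors of the training search words, keyed by word id.
    # clicks[(i, j)]: number of clicks between training search word i and search result j.
    doc_ids = {j for _, j in clicks}
    n_query, n_doc = len(query_vectors), len(doc_ids)
    for _ in range(max_iters):  # iteration termination condition: fixed iteration count
        # Step A: update every search result vector from the current query vectors.
        doc_vectors = {
            j: weighted_average(
                [(query_vectors[i], c) for (i, jj), c in clicks.items() if jj == j], n_query)
            for j in doc_ids}
        # Step B: update every training search word vector from the new result vectors.
        query_vectors = {
            i: weighted_average(
                [(doc_vectors[j], c) for (ii, j), c in clicks.items() if ii == i], n_doc)
            for i in query_vectors}
    return query_vectors  # latest iteration expression vectors form the first training result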
Fig. 5 is a flowchart of a method for generating recommended candidate words by using an LDA topic model according to an embodiment of the present invention, which specifically includes the following steps:
Step 501, acquiring the specified search word.
Step 502, obtaining the participles of the specified search word and the weight of each participle in the specified search word.
In this embodiment of the invention, the participles of the specified search word and the weight of each participle within it can be obtained with a word segmentation method provided by the prior art: the specified search word is input into a word segmenter, which outputs the participles of the specified search word and the weight of each participle within it.
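For example (the patent does not name a particular segmenter, so the choice of tool here is an assumption), the open-source jieba tokenizer can return the participles of a Chinese search word together with TF-IDF weights:

import jieba.analyse

# Hypothetical specified search word; withWeight=True returns (participle, weight)
# pairs ranked by TF-IDF, which can serve as the weight of each participle
# within the specified search word.
specified_search_word = "古装 电视剧 第二集"
for participle, weight in jieba.analyse.extract_tags(specified_search_word,
                                                     topK=10, withWeight=True):
    print(participle, weight)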
Step 503, querying a second training result for each participle of the specified search term, respectively, to obtain probability distribution of each participle on a plurality of LDA topics.
The content of the second training result is probability distribution of the plurality of participles on the plurality of LDA topics. And the second training result is obtained by training second sample data extracted from the search log by adopting an LDA topic model. The second sample data includes a word segmentation extracted from a title of a search result of the search log as a training word segmentation. The second training result includes a probability distribution of each training participle over a plurality of LDA topics.
Step 504, calculating the weight of the specified search terms on a plurality of LDA topics.
And calculating the weighted sum value of the probability distribution of the participles of the specified search word on the LDA topic as the weight of the specified search word on the LDA topic by using the weight of each participle of the specified search word in the specified search word for each LDA topic. The formula for calculating the weight of a given search term on an LDA topic is:
P_j(z|q) = Σ_t P(t|q) × P_j(z|t)

where j is the sequence number of the LDA topic, t ranges over the participles of the specified search word q, P_j(z|q) is the weight of the specified search word on the LDA topic with sequence number j, P(t|q) is the weight of the participle t in the specified search word, and P_j(z|t) is the probability of the participle t on the LDA topic with sequence number j.
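A small sketch of this weighted sum, assuming the second training result is available as a dict mapping each training participle to its probability distribution over the LDA topics (the data layout is assumed for illustration):

def lda_expression_vector(participle_weights, topic_distributions, num_topics):
    # For each LDA topic j, sum over participles t of
    # (weight of t within the search word) * (probability of t on topic j).
    vector = [0.0] * num_topics
    for participle, weight_in_word in participle_weights.items():
        distribution = topic_distributions.get(participle)
        if distribution is None:  # participle absent from the second training result
            continue
        for j in range(num_topics):
            vector[j] += weight_in_word * distribution[j]
    return vector

# Hypothetical data: two participles of the specified search word and their
# probability distributions over three LDA topics.
participle_weights = {"drama": 0.7, "costume": 0.3}
topic_distributions = {"drama": [0.6, 0.3, 0.1], "costume": [0.2, 0.7, 0.1]}
print(lda_expression_vector(participle_weights, topic_distributions, 3))
# approximately [0.48, 0.42, 0.10]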
Step 505, generating an LDA expression vector of the specified search word.
And adopting the weights of the specified search words on the plurality of LDA topics to form an LDA topic vector of the specified search words as an LDA expression vector of the specified search words.
Step 506, calculating LDA correlation between the specified search terms and each search term to be selected.
The search word to be selected may be all search words in the search log, or may be a plurality of search words in the search log in a field related to the specified search word.
The LDA correlation between the specified search word and each search word to be selected is the inner product of the LDA expression vector of the specified search word and the LDA expression vector of that search word to be selected. An LDA expression vector is a vector of topic weights, and its inner product is calculated in the same way as described in step 303 of the flow shown in fig. 3.
Step 507, preferentially selecting the search words to be selected with high LDA correlation to form a recommended candidate word set.
Preferentially selecting the search terms to be selected with high LDA correlation, wherein the selection can comprise selecting a preset third number of search terms to be selected according to the order of the LDA correlation from high to low, and can also comprise selecting all the search terms to be selected with LDA correlation exceeding a preset LDA correlation threshold.
According to the embodiment of the invention, the recommendation candidate word set of the specified search word is generated through the training result of the LDA topic model. The embodiment of the invention, as a generation mode of the recommended candidate word set, can be matched with other generation modes, and solves the technical problems that the recommended search words generated in the prior art are not comprehensive enough and have single category.
Based on the same inventive concept, according to the method for generating a search term provided in the foregoing embodiment of the present invention, correspondingly, an embodiment of the present invention further provides a device for generating a search term, a schematic structural diagram of which is shown in fig. 6, and specifically includes:
the set generating module 601 is configured to generate a set of recommended candidate words by using multiple preset models for a specified search word, where the multiple preset models are obtained by training data of different dimensions in a search log respectively;
a set merging module 602, configured to merge the generated recommended candidate word sets, and perform deduplication processing on the merged set to obtain a recommended search word candidate set;
and a word selecting module 603, configured to select a recommended search word from the recommended candidate words in the recommended search word candidate set.
The search term generation device provided by the embodiment of the invention obtains the recommended search term candidate set by using various models obtained by training different dimensional data in the search logs, expands the generation mode of the recommended search terms, and can solve the technical problems of incomplete generation and single category of the recommended search terms in the prior art.
Further, the plurality of preset models at least include two of the following models:
a click relevance model;
an LDA topic model;
a collaborative filtering model.
Further, the preset multiple models comprise a click correlation model;
the set generating module 601 includes:
a first query submodule, configured to query, for a specified search word, a first training result obtained by using a click relevance model to obtain a click relevance expression vector of the specified search word, where the click relevance expression vector is a participle vector and is used to represent a weight of each participle of the specified search word, where the first training result is obtained by training first sample data extracted from a search log by using the click relevance model, the first sample data includes a plurality of search words extracted from the search log as training search words and a click relationship between the training search words in the search log and a search result, and the first training result includes the participle vector of each training search word;
the first inner product calculating submodule is used for calculating the inner products of the click relevance expression vectors of the specified search words and the click relevance expression vectors of the search words to be selected respectively to obtain the click relevance between the specified search words and each search word to be selected respectively;
and the first optimization sub-module is used for preferentially selecting the search words to be selected with high click correlation from the search words to be selected to form a recommended candidate word set of the specified search words generated by adopting the click correlation model.
Further, the click relationship between the training search words in the search log and the search results is the number of clicks between the training search words in the search log and the search results;
the set generating module 601 further includes the following sub-modules, configured to train the click relevance model using the first sample data, and obtain the first training result:
the first word segmentation submodule is used for performing word segmentation on each training search word in the first sample data respectively and generating an initial word segmentation vector aiming at the obtained word segmentation, the initial word segmentation vector is used for representing the initial weight of each word segmentation of the training search word, and the initial weight of each word segmentation of the training search word is equal;
the iteration submodule is used for repeatedly executing the following steps A and B until a preset iteration termination condition is met:
step A: respectively calculating current iteration expression vectors of a plurality of search results in the first sample data based on the current iteration expression vectors of the training search words, the number of the training search words and the number of clicks, wherein the current iteration expression vectors of the training search words in the first iteration are the initial word segmentation vectors;
step B: respectively calculating new iterative expression vectors of the training search words based on the current iterative expression vectors of the search results, the number of the search results and the number of clicks;
and when the preset iteration termination condition is met, respectively using the latest iteration expression vector of each training search word as the word segmentation vector of the training search word, wherein the word segmentation vector of the training search word forms the first training result.
Further, the iteration sub-module includes:
a search result iteration unit, configured to calculate a current iteration expression vector of the search result by using the following formula:
Figure BDA0001742508170000201
where D_j^(n) is the current iteration expression vector of the jth search result at the nth iteration, Q_i^(n-1) is the current iteration expression vector of the ith training search word at the (n-1)th iteration, C_i,j is the number of clicks between the ith training search word and the jth search result, and |Query| is the number of the plurality of training search words;
the training search term iteration unit is used for calculating a new iteration expression vector of the training search term by adopting the following formula:
Figure BDA0001742508170000202
where Q_i^(n) is the new iteration expression vector of the ith training search word at the nth iteration, and |Doc| is the number of the search results.
Further, the preset multiple models comprise an LDA topic model;
the set generating module 601 includes:
the second word segmentation submodule is used for segmenting the appointed search word to obtain the segmentation of the appointed search word;
the weight obtaining sub-module is used for obtaining the weight of each participle of the specified search word in the specified search word;
a second query submodule, configured to query, for each obtained participle of the specified search term, a second training result obtained by using an LDA topic model to obtain probability distribution of the participle of the specified search term on multiple LDA topics, where the second training result is obtained by training second sample data extracted from a search log by using the LDA topic model, the second sample data includes the participle extracted from a title of the search result of the search log as a training participle, and the second training result includes probability distribution of each training participle on multiple LDA topics;
and a value calculating operator module, configured to calculate, for each LDA topic, a weighted sum value of a probability distribution of the participles of the specified search word on the LDA topic, as a weight of the specified search word on the LDA topic, using a weight of each participle of the specified search word in the specified search word;
the vector generation submodule is used for adopting the weights of the specified search terms on the plurality of LDA topics to form an LDA topic vector of the specified search terms, and the LDA topic vector is used as an LDA expression vector of the specified search terms;
the second inner product calculation submodule is used for calculating the inner products of the LDA expression vectors of the specified search words and the LDA expression vectors of the search words to be selected respectively to obtain the LDA correlation between the specified search words and each search word to be selected respectively;
and the second optimization submodule is used for preferentially selecting the search words to be selected with high LDA correlation from the search words to be selected to form a recommendation candidate word set of the specified search words generated by adopting the LDA topic model.
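As an informal illustration of how the LDA expression vector of a search word can be assembled and compared against candidate words, the following Python sketch may help; it assumes the per-participle topic distributions have already been looked up from the trained LDA model (the second training result), and names such as lda_vector, rank_by_lda and topic_dist are hypothetical and chosen only for this example.

import numpy as np

def lda_vector(segments, seg_weights, topic_dist):
    """Weighted sum of the per-participle topic distributions of a search word.

    segments    : list of participles of the search word.
    seg_weights : dict mapping participle -> weight within the search word.
    topic_dist  : dict mapping participle -> array of length num_topics with the
                  probability of that participle on each LDA topic.
    """
    num_topics = len(next(iter(topic_dist.values())))
    vec = np.zeros(num_topics)
    for seg in segments:
        if seg in topic_dist:
            vec += seg_weights.get(seg, 0.0) * topic_dist[seg]
    return vec

def rank_by_lda(query_vec, candidate_vecs, top_k=10):
    """Keep the candidate words whose LDA vectors have the largest inner product."""
    scored = [(word, float(np.dot(query_vec, vec)))
              for word, vec in candidate_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

The candidates returned by rank_by_lda would form the recommendation candidate word set produced by the LDA topic model.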
Further, the term selecting module 603 includes:
the characteristic obtaining sub-module is used for obtaining a correlation characteristic between a recommended candidate word in the recommended search word candidate set and the specified search word, as a first correlation characteristic;
the scoring submodule is used for, with respect to the first correlation characteristics, adopting a recommended search word screening model to respectively score the recommended candidate words in the recommended search word candidate set to obtain screening scores, wherein the recommended search word screening model is obtained by training third sample data by adopting a linear regression or gradient boosting decision tree algorithm, the third sample data comprises a click relation between a search word in a search log and the recommended search word of the search word and a second correlation characteristic between the search word in the search log and the recommended search word of the search word, and the second correlation characteristic is the same as the first correlation characteristic in type;
and the third optimization sub-module is used for preferentially selecting the recommended candidate words with high screening scores as recommended search words.
Further, the first correlation feature specifically obtained by the feature obtaining sub-module includes at least one of the following correlations:
click relevance;
LDA correlation;
collaborative filtering correlation.
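To picture the screening step, the following Python sketch stands in for the recommended search word screening model; it uses scikit-learn's GradientBoostingClassifier as one possible realization of the gradient boosting decision tree algorithm mentioned above, with the three correlation features as inputs and the click relation from the search log as the label. The feature values, candidate names and model parameters are made up for illustration and are not taken from the original disclosure.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Third-sample-style training data: one row of correlation features per
# (search word, recommended search word) pair from the search log, labelled by
# whether the recommended word was clicked.
X_train = np.array([
    [0.82, 0.74, 0.65],   # [click relevance, LDA correlation, CF correlation]
    [0.10, 0.22, 0.05],
    [0.55, 0.61, 0.40],
    [0.05, 0.15, 0.12],
])
y_train = np.array([1, 0, 1, 0])  # clicked / not clicked

model = GradientBoostingClassifier(n_estimators=50, max_depth=3)
model.fit(X_train, y_train)

# Scoring the recommended candidate words of one specified search word: each
# candidate contributes the same three correlation features.
candidate_features = {
    "candidate_a": [0.78, 0.70, 0.60],
    "candidate_b": [0.20, 0.35, 0.10],
}
scores = {w: model.predict_proba([f])[0, 1] for w, f in candidate_features.items()}
recommended = sorted(scores, key=scores.get, reverse=True)
print(recommended)  # candidates with higher screening scores come first

If a linear regression model is preferred, the classifier above could be swapped for sklearn.linear_model.LinearRegression and the predicted value used directly as the screening score.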
Based on the same inventive concept, and corresponding to the search term generation method provided in the above embodiments of the present invention, an embodiment of the present invention further provides an electronic device, as shown in fig. 7, comprising a processor 701, a communication interface 702, a memory 703 and a communication bus 704, wherein the processor 701, the communication interface 702 and the memory 703 communicate with one another through the communication bus 704,
a memory 703 for storing a computer program;
the processor 701 is configured to implement the steps of any of the search term generation methods in the above embodiments when executing the program stored in the memory 703.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
With the electronic device for generating search words provided by the embodiment of the present invention, the recommended search word candidate set is obtained by using multiple models trained on data of different dimensions in the search logs, which broadens the way recommended search words are generated and can solve the technical problems in the prior art that the generated recommended search words are not comprehensive enough and belong to a single category.
In yet another embodiment provided by the present invention, a computer-readable storage medium is further provided, in which instructions are stored, and when the instructions are executed on a computer, the instructions cause the computer to perform the steps of any one of the search term generation methods in the above embodiments.
In yet another embodiment, a computer program product containing instructions is provided, which when run on a computer causes the computer to perform any one of the above-described search term generation methods.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, the electronic device, the computer-readable storage medium, and the computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and in relation to them, reference may be made to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (15)

1. A method for generating a search term, comprising:
aiming at a specified search word, adopting a plurality of preset models to respectively generate a recommended candidate word set, wherein the plurality of preset models are respectively obtained by training data with different dimensions in a search log, and the data with different dimensions reflect search histories in different aspects;
merging the generated recommended candidate word sets, and performing duplicate removal processing on the merged sets to obtain recommended search word candidate sets;
selecting recommended search words from the recommended candidate words in the recommended search word candidate set;
selecting a recommended search word from the recommended candidate words in the recommended search word candidate set, including:
acquiring the relevance characteristics of the recommended candidate words in the recommended search word candidate set and the specified search words as first relevance characteristics;
for the first correlation characteristics, a recommended search word screening model is adopted, recommended candidate words in the recommended search word candidate set are respectively scored, and screening scores are obtained, wherein the recommended search word screening model is obtained by training third sample data through a linear regression or gradient boosting decision tree algorithm, the third sample data comprises click relations between search words in a search log and the recommended search words of the search words and second correlation characteristics between the search words in the search log and the recommended search words of the search words, and the second correlation characteristics are the same as the first correlation characteristics in type;
and preferentially selecting the recommended candidate words with high screening scores as recommended search words.
2. The method of claim 1, wherein the plurality of predetermined models includes at least two of the following models:
a click relevance model;
an LDA topic model;
and a collaborative filtering model.
3. The method of claim 1, wherein the plurality of preset models comprises a click relevance model, and wherein generating the set of recommended candidate words through the click relevance model comprises:
aiming at a specified search word, inquiring a first training result obtained by using a click correlation model to obtain a click correlation expression vector of the specified search word, wherein the click correlation expression vector is a participle vector and is used for representing the weight of each participle of the specified search word, the first training result is obtained by training first sample data extracted from a search log by using the click correlation model, the first sample data comprises a plurality of search words extracted from the search log and used as training search words, and the click relation between the training search words in the search log and the search result, and the first training result comprises the participle vector of each training search word;
respectively calculating the inner product of the click relevance expression vector of the specified search word and the click relevance expression vector of each search word to be selected, to obtain the click relevance between the specified search word and each search word to be selected;
and preferentially selecting the search words to be selected with high click relevance from the search words to be selected to form a recommended candidate word set of the specified search words generated by adopting the click relevance model.
4. The method of claim 3, wherein the click relationship between the training search words and the search results in the search log is the number of clicks between the training search words and the search results in the search log;
wherein the click relevance model is trained with the first sample data through the following steps to obtain the first training result:
segmenting each training search word in the first sample data respectively, and generating an initial segmentation vector aiming at the obtained segmentation, wherein the initial segmentation vector is used for representing the initial weight of each segmentation of the training search word, and the initial weight of each segmentation of the training search word is equal;
repeatedly executing the following steps A and B until a preset iteration termination condition is met:
step A: respectively calculating current iteration expression vectors of a plurality of search results in the first sample data based on the current iteration expression vectors of the training search words, the number of the training search words and the number of clicks, wherein the current iteration expression vectors of the training search words in the first iteration are the initial word segmentation vectors;
step B: respectively calculating new iterative expression vectors of the training search words based on the current iterative expression vectors of the search results, the number of the search results and the number of clicks;
and, when the preset iteration termination condition is met, using the latest iteration expression vector of each training search word as the word segmentation vector of that training search word, wherein the word segmentation vectors of the training search words form the first training result.
5. The method of claim 4, wherein the calculating the current iteration expression vector of the plurality of search results in the first sample data based on the current iteration expression vector of the plurality of training search terms, the number of the plurality of training search terms, and the number of clicks respectively comprises:
calculating the current iteration expression vector of the search result by adopting the following formula:
D_j^(n) = (1 / |Query|) · Σ_{i=1}^{|Query|} C_{i,j} · Q_i^(n-1)
wherein D_j^(n) is the current iteration expression vector of the nth iteration of the jth search result, Q_i^(n-1) is the current iteration expression vector of the (n-1)th iteration of the ith training search term, C_{i,j} is the number of clicks between the ith training search term and the jth search result, and |Query| is the number of the plurality of training search terms;
the calculating a plurality of new iterative expression vectors of the training search terms respectively based on the current iterative expression vectors of the plurality of search results, the number of the plurality of search results, and the number of clicks includes:
calculating a new iterative expression vector of the training search term using the following formula:
Q_i^(n) = (1 / |Doc|) · Σ_{j=1}^{|Doc|} C_{i,j} · D_j^(n)
wherein Q_i^(n) is the new iteration expression vector of the nth iteration of the ith training search term, and |Doc| is the number of the search results.
6. The method of claim 1, wherein the plurality of preset models comprises an LDA topic model, and wherein generating the set of recommended candidate words through the LDA topic model comprises:
performing word segmentation on the specified search word to obtain the word segmentation of the specified search word;
acquiring the weight of each participle of the specified search word in the specified search word;
respectively inquiring a second training result obtained by adopting an LDA topic model for each obtained participle of the specified search word to obtain the probability distribution of the participle of the specified search word on a plurality of LDA topics, wherein the second training result is obtained by adopting the LDA topic model to train second sample data extracted from a search log, the second sample data comprises the participle extracted from the title of the search result of the search log and is used as a training participle, and the second training result comprises the probability distribution of each training participle on the plurality of LDA topics;
for each LDA topic, calculating a weighted sum value of the probability distributions of the participles of the specified search word on the LDA topic, by using the weight of each participle of the specified search word in the specified search word, as the weight of the specified search word on the LDA topic;
adopting the weights of the specified search word on the plurality of LDA topics to form an LDA topic vector of the specified search word as an LDA expression vector of the specified search word;
respectively calculating the inner product of the LDA expression vector of the specified search word and the LDA expression vector of each search word to be selected, to obtain the LDA correlation between the specified search word and each search word to be selected;
and preferentially selecting the search words to be selected with high LDA correlation from the search words to be selected to form a recommendation candidate word set of the specified search words generated by adopting the LDA topic model.
7. The method of claim 1, wherein the first correlation characteristic comprises at least one of the following correlations:
click relevance;
LDA correlation;
collaborative filtering correlation.
8. An apparatus for generating a search term, comprising:
the set generation module is used for generating recommendation candidate word sets by adopting a plurality of preset models aiming at the specified search words, wherein the plurality of preset models are obtained by training data with different dimensions in the search logs respectively, and the data with different dimensions reflect the search histories in different aspects;
the set merging module is used for merging the generated recommended candidate word sets and carrying out duplication elimination processing on the merged sets to obtain recommended search word candidate sets;
the word selecting module is used for selecting recommended search words from the recommended candidate words in the recommended search word candidate set;
the term selection module comprises:
the characteristic obtaining sub-module is used for obtaining a correlation characteristic between a recommended candidate word in the recommended search word candidate set and the specified search word, as a first correlation characteristic;
the scoring submodule is used for, with respect to the first correlation characteristics, adopting a recommended search word screening model to respectively score the recommended candidate words in the recommended search word candidate set to obtain screening scores, wherein the recommended search word screening model is obtained by training third sample data by adopting a linear regression or gradient boosting decision tree algorithm, the third sample data comprises a click relation between a search word in a search log and the recommended search word of the search word and a second correlation characteristic between the search word in the search log and the recommended search word of the search word, and the second correlation characteristic is the same as the first correlation characteristic in type;
and the third optimization sub-module is used for preferentially selecting the recommended candidate words with high screening scores as recommended search words.
9. The apparatus of claim 8, wherein the plurality of preset models comprises at least two of the following models:
a click relevance model;
an LDA topic model;
and a collaborative filtering model.
10. The apparatus of claim 8, wherein the plurality of preset models comprises a click relevance model;
the set generation module includes:
a first query submodule, configured to query, for a specified search word, a first training result obtained by using a click relevance model to obtain a click relevance expression vector of the specified search word, where the click relevance expression vector is a participle vector and is used to represent a weight of each participle of the specified search word, where the first training result is obtained by training first sample data extracted from a search log by using the click relevance model, the first sample data includes a plurality of search words extracted from the search log as training search words and a click relationship between the training search words in the search log and a search result, and the first training result includes the participle vector of each training search word;
the first inner product calculating submodule is used for calculating the inner products of the click relevance expression vectors of the specified search words and the click relevance expression vectors of the search words to be selected respectively to obtain the click relevance between the specified search words and each search word to be selected respectively;
and the first optimization sub-module is used for preferentially selecting the search words to be selected with high click correlation from the search words to be selected to form a recommended candidate word set of the specified search words generated by adopting the click correlation model.
11. The apparatus according to claim 10, wherein the click relationship between the training search word and the search result in the search log is the number of clicks between the training search word and the search result in the search log;
the set generating module further includes the following sub-modules, configured to train the click relevance model using the first sample data, and obtain the first training result:
the first word segmentation submodule is used for performing word segmentation on each training search word in the first sample data respectively and generating an initial word segmentation vector aiming at the obtained word segmentation, the initial word segmentation vector is used for representing the initial weight of each word segmentation of the training search word, and the initial weight of each word segmentation of the training search word is equal;
the iteration submodule is used for repeatedly executing the following steps A and B until a preset iteration termination condition is met:
step A: respectively calculating current iteration expression vectors of a plurality of search results in the first sample data based on the current iteration expression vectors of the training search words, the number of the training search words and the number of clicks, wherein the current iteration expression vectors of the training search words in the first iteration are the initial word segmentation vectors;
step B: respectively calculating new iterative expression vectors of the training search words based on the current iterative expression vectors of the search results, the number of the search results and the number of clicks;
and, when the preset iteration termination condition is met, using the latest iteration expression vector of each training search word as the word segmentation vector of that training search word, wherein the word segmentation vectors of the training search words form the first training result.
12. The apparatus of claim 11, wherein the iteration sub-module comprises:
a search result iteration unit, configured to calculate a current iteration expression vector of the search result by using the following formula:
D_j^(n) = (1 / |Query|) · Σ_{i=1}^{|Query|} C_{i,j} · Q_i^(n-1)
wherein D_j^(n) is the current iteration expression vector of the nth iteration of the jth search result, Q_i^(n-1) is the current iteration expression vector of the (n-1)th iteration of the ith training search term, C_{i,j} is the number of clicks between the ith training search term and the jth search result, and |Query| is the number of the plurality of training search terms;
the training search term iteration unit is used for calculating a new iteration expression vector of the training search term by adopting the following formula:
Q_i^(n) = (1 / |Doc|) · Σ_{j=1}^{|Doc|} C_{i,j} · D_j^(n)
wherein Q_i^(n) is the new iteration expression vector of the nth iteration of the ith training search term, and |Doc| is the number of the search results.
13. The apparatus of claim 8, wherein the plurality of preset models comprises an LDA topic model;
the set generation module includes:
the second word segmentation submodule is used for segmenting the specified search word to obtain the participles of the specified search word;
the weight obtaining sub-module is used for obtaining the weight of each participle of the specified search word in the specified search word;
a second query submodule, configured to query, for each obtained participle of the specified search term, a second training result obtained by using an LDA topic model to obtain probability distribution of the participle of the specified search term on multiple LDA topics, where the second training result is obtained by training second sample data extracted from a search log by using the LDA topic model, the second sample data includes the participle extracted from a title of the search result of the search log as a training participle, and the second training result includes probability distribution of each training participle on multiple LDA topics;
and a value calculation submodule, configured to calculate, for each LDA topic, a weighted sum value of the probability distributions of the participles of the specified search word on the LDA topic, using the weight of each participle of the specified search word in the specified search word, as the weight of the specified search word on the LDA topic;
the vector generation submodule is used for adopting the weights of the specified search terms on the plurality of LDA topics to form an LDA topic vector of the specified search terms, and the LDA topic vector is used as an LDA expression vector of the specified search terms;
the second inner product calculation submodule is used for calculating the inner products of the LDA expression vectors of the specified search words and the LDA expression vectors of the search words to be selected respectively to obtain the LDA correlation between the specified search words and each search word to be selected respectively;
and the second optimization submodule is used for preferentially selecting the search words to be selected with high LDA correlation from the search words to be selected to form a recommendation candidate word set of the specified search words generated by adopting the LDA topic model.
14. The apparatus according to claim 8, wherein the first correlation feature specifically obtained by the feature obtaining sub-module includes at least one of the following correlations:
click relevance;
LDA correlation;
collaborative filtering correlation.
15. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 7 when executing a program stored in the memory.
CN201810826071.5A 2018-07-25 2018-07-25 Search word generation method and device and electronic equipment Active CN109189990B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810826071.5A CN109189990B (en) 2018-07-25 2018-07-25 Search word generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN109189990A CN109189990A (en) 2019-01-11
CN109189990B true CN109189990B (en) 2021-03-26

Family

ID=64937297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810826071.5A Active CN109189990B (en) 2018-07-25 2018-07-25 Search word generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109189990B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347911A * 2019-05-28 2019-10-18 成都美美臣科技有限公司 Automatic push method for product searches on an e-commerce website
CN110276009B (en) * 2019-06-20 2021-09-24 北京百度网讯科技有限公司 Association word recommendation method and device, electronic equipment and storage medium
CN110390052B (en) * 2019-07-25 2022-10-28 腾讯科技(深圳)有限公司 Search recommendation method, training method, device and equipment of CTR (China train redundancy report) estimation model
CN110795612A (en) * 2019-10-28 2020-02-14 北京字节跳动网络技术有限公司 Search word recommendation method and device, electronic equipment and computer-readable storage medium
CN112765966B (en) * 2021-04-06 2021-07-23 腾讯科技(深圳)有限公司 Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment
CN113282832A (en) * 2021-06-10 2021-08-20 北京爱奇艺科技有限公司 Search information recommendation method and device, electronic equipment and storage medium
CN113282831A (en) * 2021-06-10 2021-08-20 北京爱奇艺科技有限公司 Search information recommendation method and device, electronic equipment and storage medium
CN113515940B (en) * 2021-07-14 2022-12-13 上海芯翌智能科技有限公司 Method and equipment for text search
CN113312523B (en) * 2021-07-30 2021-12-14 北京达佳互联信息技术有限公司 Dictionary generation and search keyword recommendation method and device and server

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9613133B2 (en) * 2014-11-07 2017-04-04 International Business Machines Corporation Context based passage retrieval and scoring in a question answering system
CN105677769B * 2015-12-29 2018-01-05 广州神马移动信息科技有限公司 Keyword recommendation method and system based on a latent Dirichlet allocation (LDA) model
CN106777217B (en) * 2016-12-23 2020-10-30 北京奇虎科技有限公司 Search term recommendation method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095210A (en) * 2014-04-22 2015-11-25 阿里巴巴集团控股有限公司 Method and apparatus for screening promotional keywords
CN104462289A (en) * 2014-11-27 2015-03-25 百度在线网络技术(北京)有限公司 Direct number keyword recommending method and device
CN105956149A (en) * 2016-05-12 2016-09-21 北京奇艺世纪科技有限公司 Default search word recommendation method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Feature word selection by iterative top-K aggregation for classifying recommended shops; Heeryon Cho et al.; IEEE; 2016-12-05; pp. 27-29 *

Also Published As

Publication number Publication date
CN109189990A (en) 2019-01-11

Similar Documents

Publication Publication Date Title
CN109189990B (en) Search word generation method and device and electronic equipment
CN108073568B (en) Keyword extraction method and device
CN109885770B (en) Information recommendation method and device, electronic equipment and storage medium
CN106802915B (en) Academic resource recommendation method based on user behaviors
JP5379239B2 (en) Providing recommendations using judgment information about the area of interest
JP5351182B2 (en) Determining relevant information about the area of interest
Li et al. Embedding-based product retrieval in taobao search
CN107862022B (en) Culture resource recommendation system
CN108829808B (en) Page personalized sorting method and device and electronic equipment
US9864803B2 (en) Method and system for multimodal clue based personalized app function recommendation
CN108280114B (en) Deep learning-based user literature reading interest analysis method
CN111831802B (en) Urban domain knowledge detection system and method based on LDA topic model
US11061980B2 (en) System and method for integrating content into webpages
CN110991187A (en) Entity linking method, device, electronic equipment and medium
CN106599047B (en) Information pushing method and device
CN112328889A (en) Method and device for determining recommended search terms, readable medium and electronic equipment
CN115687690A (en) Video recommendation method and device, electronic equipment and storage medium
CN111639696A (en) User classification method and device
CN112989118B (en) Video recall method and device
CN115794898B (en) Financial information recommendation method and device, electronic equipment and storage medium
CN112163415A (en) User intention identification method and device for feedback content and electronic equipment
CN111400516B (en) Label determining method, electronic device and storage medium
CN108810640B (en) Television program recommendation method
CN113656575B (en) Training data generation method and device, electronic equipment and readable medium
CN113688633A (en) Outline determination method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant