CN108804550B - Query term expansion method and device and electronic equipment - Google Patents

Info

Publication number
CN108804550B
Authority
CN
China
Prior art keywords
word
sentence
similarity
statement
current query
Legal status
Active
Application number
CN201810489682.5A
Other languages
Chinese (zh)
Other versions
CN108804550A (en)
Inventor
王天畅
叶澄灿
陈英傑
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810489682.5A
Publication of CN108804550A
Application granted
Publication of CN108804550B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a query term expansion method, a query term expansion device, and electronic equipment. The method comprises: obtaining a current query statement comprising a plurality of words; calculating a first similarity between the current query statement and each limited candidate sentence in a pre-constructed limited candidate set, the limited candidate sentences including sentences historically searched by the user; obtaining a plurality of limited candidate sentences whose first similarity meets a first preset condition to form a limited sentence set; obtaining the words to be matched from the limited sentence set; calculating a second similarity between the current query statement and each obtained word to be matched; and obtaining each word whose second similarity meets a second preset condition and determining it as an expansion word of the current query statement. The embodiment of the invention improves the user's search experience.

Description

Query term expansion method and device and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a query term expansion method and device and electronic equipment.
Background
As video websites have grown, users have become accustomed to using a site's search engine to find the video content they care about and want to watch. As users' search needs diversify, the traditional way of recalling results by matching the segmented query terms often fails to capture the user's real search intent. Query term expansion is therefore used to improve the recall of semantically related results: when a user enters a query term, similar terms related to it are found, and the video content corresponding to those similar terms is pushed to the user.
In the prior art, for an input query word, the vector of the query word is obtained from a word2vec data dictionary, and the vector distance between it and the vector of each word in the dictionary is calculated. The similarity between the query word and each word is determined from that distance, and a preset number of the most similar words are taken as the expansion words found in the word2vec data dictionary.
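The prior-art lookup described above can be illustrated with a minimal sketch. The toy dictionary, its three-dimensional vectors, and the function names `cosine_similarity` and `expand_by_full_scan` are assumptions made for illustration; a real word2vec dictionary would hold far more words and dimensions, which is what makes the full scan expensive.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def expand_by_full_scan(query_word, dictionary, top_n):
    """Compare the query word against every word in the dictionary."""
    query_vec = dictionary[query_word]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in dictionary.items()
        if word != query_word
    ]
    # Highest similarity first; keep a preset number of words.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in scored[:top_n]]

# Hypothetical toy dictionary standing in for a word2vec data dictionary.
toy_dictionary = {
    "movie": [1.0, 0.0, 0.1],
    "film": [0.9, 0.1, 0.1],
    "cinema": [0.8, 0.2, 0.0],
    "recipe": [0.0, 1.0, 0.2],
}
baseline_expansions = expand_by_full_scan("movie", toy_dictionary, top_n=2)
```

Every query touches every dictionary entry, so the cost grows with the dictionary size regardless of what users actually search for.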
However, in the process of implementing the invention, the inventors found that the prior art has at least the following problem:
In the existing method of generating expansion words with the word2vec technology, the expansion words are obtained by calculating the similarity between the query term and every word in the word2vec data dictionary. The amount of computation is therefore large, user needs are not taken into account, and the user's search experience is poor.
Disclosure of Invention
The embodiment of the invention aims to provide a query term expansion method, a query term expansion device and electronic equipment so as to improve the search experience of a user. The specific technical scheme is as follows:
in order to achieve the above object, an embodiment of the present invention discloses a query term expansion method, including:
obtaining a current query statement comprising a plurality of words;
calculating a first similarity between the current query statement and each limited candidate sentence in a pre-constructed limited candidate set, the limited candidate sentences including sentences historically searched by the user;
obtaining a plurality of limited candidate sentences of which the first similarity meets a first preset condition to form a limited sentence set;
obtaining each word to be matched from the limited sentence set;
calculating second similarity between the current query statement and each obtained word to be matched;
and obtaining each word with the second similarity meeting a second preset condition, and determining each word as each expansion word of the current query sentence.
Optionally, the process of constructing the limited candidate set comprises:
acquiring a preset volume of search content from the user's historical search log, the search content comprising words and/or sentences historically searched by the user;
filtering out the search content that consists of a single word;
obtaining a vector for each remaining sentence;
and storing the vector corresponding to each sentence in the limited candidate set.
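The construction steps above can be sketched as follows. This is only one possible reading: the text does not say how the "preset volume" is chosen, so the sketch assumes the most frequently searched entries are kept, and it assumes whitespace tokenization and a plain vector sum for the sentence vector. The function name `build_limited_candidate_set` and all sample data are hypothetical.

```python
from collections import Counter

def build_limited_candidate_set(search_log, word_vectors, max_entries,
                                tokenize=str.split):
    """Map each retained multi-word search to its sentence vector."""
    counts = Counter(search_log)
    candidate_set = {}
    # Assumption: "preset search volume" means the max_entries most frequent searches.
    for query, _ in counts.most_common(max_entries):
        words = tokenize(query)
        if len(words) < 2:
            continue  # filter out search content consisting of a single word
        known = [word_vectors[w] for w in words if w in word_vectors]
        if not known:
            continue  # no word of this sentence has a vector
        dim = len(known[0])
        # Assumption: sentence vector is the unweighted sum of its word vectors.
        candidate_set[query] = [sum(v[d] for v in known) for d in range(dim)]
    return candidate_set

# Hypothetical history log and word vectors.
log = ["funny cat videos", "funny cat videos", "cat", "cat", "cat",
       "dog training tips", "rare query here"]
vectors = {"funny": [1.0, 0.0], "cat": [0.0, 1.0], "videos": [1.0, 1.0],
           "dog": [0.0, 2.0], "training": [1.0, 0.0]}
limited_candidate_set = build_limited_candidate_set(log, vectors, max_entries=3)
```

Because the vectors are computed once when the set is built, only the query vector has to be computed at search time.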
Optionally, the obtaining a vector of each statement includes:
for each sentence, dividing the sentence into single words;
for each sentence, searching a vector corresponding to a single word contained in the sentence in the word2vec data dictionary;
and aiming at each statement, calculating the vector corresponding to a single word contained in the statement according to a first preset formula to obtain the vector corresponding to the statement.
Optionally, calculating the first similarity between the current query statement and each limited candidate sentence in the pre-constructed limited candidate set includes:
calculating, according to the first preset formula, the vector corresponding to the current query statement from the vectors of the single words it contains;
and calculating the first similarity between the vector corresponding to the current query statement and the vector corresponding to each limited candidate sentence in the limited candidate set.
Optionally, the first preset formula is:
$qv = \sum_{i} weight_i \cdot wv_i$
where $qv$ represents the vector of a sentence in the limited candidate set or the vector of the current query statement, $weight_i$ represents the weight of the i-th word, and $wv_i$ represents the vector of the i-th word.
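A minimal sketch of the first preset formula, the weighted sum of word vectors. The text does not specify how the weights are obtained, so the sketch takes them as an input and assumes a default weight of 1.0 for words without one; the function name `sentence_vector` and the sample values are hypothetical.

```python
def sentence_vector(words, word_vectors, weights=None):
    """Compute qv as the sum over i of weight_i * wv_i."""
    if weights is None:
        weights = {}
    dim = len(next(iter(word_vectors.values())))
    qv = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # assumption: words missing from the dictionary are skipped
        weight = weights.get(word, 1.0)  # assumption: default weight is 1.0
        for d in range(dim):
            qv[d] += weight * vec[d]
    return qv

# Hypothetical two-dimensional word vectors and weights.
toy_vectors = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
toy_weights = {"a": 0.5, "b": 2.0}
qv_example = sentence_vector(["a", "b"], toy_vectors, toy_weights)
```

With all weights equal to 1.0 this reduces to a plain vector sum.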
Optionally, obtaining a plurality of limited candidate sentences whose first similarity satisfies the first preset condition to form the limited sentence set includes:
sorting the limited candidate sentences by first similarity from high to low and taking a first preset number of them to form the limited sentence set.
Optionally, obtaining the words to be matched from the limited sentence set includes:
dividing each limited candidate sentence in the limited sentence set into single words, and taking each word as a word to be matched for the current query statement;
calculating the second similarity between the current query statement and each obtained word to be matched comprises:
counting the frequency with which each word to be matched occurs in the limited sentence set;
and, for each word to be matched, calculating the second similarity between the current query statement and that word by a second preset formula:
$Score_i = freq_i \cdot simq_i$
where $freq_i$ represents the frequency of occurrence of the i-th word, and $simq_i$ represents the first similarity of the limited candidate sentence, in the limited sentence set, to which the i-th word corresponds;
obtaining each word whose second similarity meets the second preset condition and determining it as an expansion word of the current query statement comprises:
ordering the words to be matched by second similarity and determining a second preset number of them as the expansion words of the current query statement.
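The frequency-based score can be sketched as below. The text does not say which first similarity to use when a word occurs in several limited candidate sentences, so this sketch assumes the highest one; it also assumes whitespace tokenization and descending order. The names `score_by_frequency` and `top_words` are hypothetical.

```python
from collections import Counter

def score_by_frequency(limited_sentences, tokenize=str.split):
    """limited_sentences: list of (sentence, first_similarity) pairs."""
    freq = Counter()
    best_sim = {}
    for sentence, sim in limited_sentences:
        for word in tokenize(sentence):
            freq[word] += 1
            # Assumption: simq_i is the highest first similarity among
            # the sentences that contain the word.
            best_sim[word] = max(best_sim.get(word, sim), sim)
    # Score_i = freq_i * simq_i
    return {word: freq[word] * best_sim[word] for word in freq}

def top_words(scores, n):
    """Keep the n highest-scoring words (ties keep first-seen order)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Hypothetical limited sentence set with first similarities.
limited = [("funny cat videos", 0.9), ("cat compilation", 0.8), ("funny dog", 0.5)]
freq_scores = score_by_frequency(limited)
freq_expansions = top_words(freq_scores, 2)
```

This variant needs no further vector arithmetic: it reuses the first similarities already computed for the sentences.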
Optionally, obtaining the words to be matched from the limited sentence set includes:
dividing each limited candidate sentence in the limited sentence set into single words, and taking each word as a word to be matched for the current query statement;
calculating the second similarity between the current query statement and each obtained word to be matched comprises:
for each word to be matched, calculating the second similarity between the current query statement and that word by a third preset formula:
$Score_i = 1 - Distance(wv_i, qv)$
where $wv_i$ represents the vector of the i-th word in the limited candidate sentences, $qv$ represents the vector corresponding to the current query statement, $Distance$ represents the cosine distance between the vector of the i-th word and the vector corresponding to the current query statement, and $Score_i$ represents the second similarity between the current query statement and the i-th segmented word;
obtaining each word whose second similarity meets the second preset condition and determining it as an expansion word of the current query statement comprises:
ordering the words to be matched by second similarity and determining a second preset number of them as the expansion words of the current query statement.
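A short sketch of the distance-based score, assuming the usual definition of cosine distance as one minus cosine similarity (the text names cosine distance but does not define it). Under that assumption the score equals the cosine similarity between the word vector and the query vector. Function names and vectors are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 when either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def second_similarity(word_vec, query_vec):
    """Score_i = 1 - Distance(wv_i, qv)."""
    return 1.0 - cosine_distance(word_vec, query_vec)

# Hypothetical query vector and word vectors.
query_vec = [1.0, 0.0]
aligned_score = second_similarity([2.0, 0.0], query_vec)
orthogonal_score = second_similarity([0.0, 3.0], query_vec)
```

Vector length does not affect this score, only direction, so a frequent word with a long vector is not favored over a rare one.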
Optionally, the method further comprises:
obtaining a current query term consisting of a single word;
the limited candidate set further includes: words historically searched by the user;
calculating a third similarity between the current query term and each word in the limited candidate set;
and obtaining each word whose third similarity meets a third preset condition and determining it as an expansion word of the current query term.
Optionally, obtaining each word whose third similarity meets the third preset condition and determining it as an expansion word of the current query term comprises:
ordering the words by third similarity and determining a third preset number of them as the expansion words of the current query term.
In order to achieve the above object, an embodiment of the present invention discloses a query term expansion apparatus, including:
the first query sentence acquisition module is used for acquiring a current query sentence containing a plurality of words;
a first similarity calculation module, configured to calculate a first similarity between the current query statement and each limited candidate sentence in a pre-constructed limited candidate set, the limited candidate sentences including sentences historically searched by the user;
a limited sentence set determining module, configured to obtain a plurality of limited candidate sentences whose first similarity meets a first preset condition to form a limited sentence set;
a word to be matched determining module, configured to obtain each word to be matched from the restricted sentence set;
the second similarity calculation module is used for calculating second similarities between the current query statement and the obtained words to be matched;
and the first expansion word determining module is used for obtaining each word with the second similarity meeting a second preset condition and determining each word as each expansion word of the current query sentence.
Optionally, the apparatus further comprises a limited candidate set construction module, configured to: acquire a preset volume of search content from the user's historical search log, the search content comprising words and/or sentences historically searched by the user; filter out the search content that consists of a single word; obtain a vector for each remaining sentence; and store the vector corresponding to each sentence in the limited candidate set.
Optionally, the limited candidate set construction module comprises:
a sentence dividing submodule for dividing each sentence into a single word;
the word vector searching submodule is used for searching a vector corresponding to a single word contained in the sentence in the word2vec data dictionary aiming at each sentence;
and the sentence vector determining submodule is used for calculating the vector corresponding to the single word contained in the sentence according to a first preset formula aiming at each sentence to obtain the vector corresponding to the sentence.
Optionally, the first similarity calculation module includes:
the query statement vector determining submodule is used for calculating a vector corresponding to a single word contained in the current query statement according to a first preset formula to obtain the vector corresponding to the current query statement;
and a first similarity calculation submodule, configured to calculate the first similarity between the vector corresponding to the current query statement and the vector corresponding to each limited candidate sentence in the limited candidate set.
Optionally, the first preset formula is:
$qv = \sum_{i} weight_i \cdot wv_i$
where $qv$ represents the vector of a sentence in the limited candidate set or the vector of the current query statement, $weight_i$ represents the weight of the i-th word, and $wv_i$ represents the vector of the i-th word.
Optionally, the limited sentence set determining module is specifically configured to sort the limited candidate sentences by first similarity from high to low and to take a first preset number of them to form the limited sentence set.
Optionally, the to-be-matched word determining module is specifically configured to divide each limited candidate sentence in the limited sentence set into single words, and to take each word as a word to be matched for the current query statement;
the second similarity calculation module comprises:
a frequency counting submodule, configured to count the frequency with which each word to be matched occurs in the limited sentence set;
and a second similarity calculation submodule, configured to calculate, for each word to be matched, the second similarity between the current query statement and that word by a second preset formula:
$Score_i = freq_i \cdot simq_i$
where $freq_i$ represents the frequency of occurrence of the i-th word, and $simq_i$ represents the first similarity of the limited candidate sentence, in the limited sentence set, to which the i-th word corresponds;
the first expansion word determining module is specifically configured to order the words to be matched by second similarity and to determine a second preset number of them as the expansion words of the current query statement.
Optionally, the limited sentence set determining module is specifically configured to divide each limited candidate sentence in the limited sentence set into single words, and to take each word as a word to be matched for the current query statement;
the second similarity calculation module is specifically configured to calculate, for each word to be matched, the second similarity between the current query statement and that word by a third preset formula:
$Score_i = 1 - Distance(wv_i, qv)$
where $wv_i$ represents the vector of the i-th word in the limited candidate sentences, $qv$ represents the vector corresponding to the current query statement, $Distance$ represents the cosine distance between the vector of the i-th word and the vector corresponding to the current query statement, and $Score_i$ represents the second similarity between the current query statement and the i-th segmented word;
the first expansion word determining module is specifically configured to order the words to be matched by second similarity and to determine a second preset number of them as the expansion words of the current query statement.
Optionally, the apparatus further comprises:
a second query acquisition module, configured to obtain a current query term consisting of a single word;
the limited candidate set further includes: words historically searched by the user;
a third similarity calculation module, configured to calculate a third similarity between the current query term and each word in the limited candidate set;
and a second expansion word determining module, configured to obtain each word whose third similarity meets a third preset condition and to determine it as an expansion word of the current query term.
Optionally, the second expansion word determining module is specifically configured to order the words by third similarity and to determine a third preset number of them as the expansion words of the current query term.
In order to achieve the above object, an embodiment of the present invention further discloses an electronic device, which includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method steps of any one of the above query term expansion methods when executing the program stored in the memory.
In order to achieve the above object, an embodiment of the present invention further discloses a computer-readable storage medium, in which instructions are stored, and when the instructions are run on a computer, the instructions cause the computer to execute any one of the method steps in the above query term expansion method.
The query term expansion method, device, and electronic equipment provided by the embodiments of the invention improve the user's search experience. Specifically, a limited sentence set, drawn from the user's search history and corresponding to the current query statement, is determined in the pre-constructed limited candidate set through similarity calculation; because it comes from the user's own search history, the limited sentence set reflects the user's real search needs. The similarity between each word contained in the limited sentence set and the current query statement is then calculated, and each word meeting the preset condition is determined as an expansion word of the current query statement. By combining the word2vec technology with a limited candidate set, the embodiments take the user's real search needs into account and, because the preliminary limited sentence set is determined directly from the limited candidate set, reduce the amount of computation later needed to determine the expansion words. Furthermore, analyzing the similarity between each word in the limited sentence set and the current query statement ensures that the expansion words determined for the query are effective, which ultimately improves the user's search experience.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flowchart of a query term expansion method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for constructing a limited candidate set in a query term expansion method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for calculating a first similarity in query term expansion according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a query term expansion apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram illustrating establishment of a limited candidate set in a query term expansion method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of determining an expansion word for a query statement in a query term expansion method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
As video websites have grown, users have become accustomed to using a site's search engine to find the videos they care about and want to watch, and as the number of users increases, their search needs become more and more diverse. The word2vec technology is widely used in text relevance tasks, and finding similar words with word2vec for each word of a query can serve as a basic approach to query term expansion. In the prior art, when word2vec is used to generate the expansion words of a query word, the words most similar to the query word are obtained by calculating the similarity between the query word and every word in the word2vec data dictionary. This requires a large amount of computation and does not take user needs into account, so the user's search experience is poor.
To solve these technical problems, an embodiment of the invention discloses a query term expansion method. In this method, a limited candidate set is built in advance from the user's historical search log, a limited sentence set similar to the query is determined within that limited candidate set, words similar to the query are then determined within the limited sentence set, and the words so determined are used as the expansion words of the query. This method improves the accuracy of the generated expansion words, so that the expansion words recalled for a query are more relevant to the user's needs, which in turn improves the user's search experience. The specific implementation is as follows:
In a first aspect, an embodiment of the present invention discloses a query term expansion method. FIG. 1 is a flowchart of the method, which includes:
S101, obtaining a current query statement containing a plurality of words;
In this step, the current query statement containing a plurality of words is obtained from wherever the user's search input is collected. For example, it may be obtained from the front-end search interface when the current user enters it, or from a back-end database that stores the search input entered through the search interface.
S102, calculating a first similarity between the current query statement and each limited candidate sentence in a pre-constructed limited candidate set, the limited candidate sentences including sentences historically searched by the user;
The limited candidate set of the embodiment of the invention is an information set built from a preset volume of search content obtained from the user's historical search log. The search content consists of sentences historically searched by the user; if the current query entered by the user is a single word, the search content may also include words historically searched by the user. The construction is detailed in the embodiment on constructing the limited candidate set.
In this step, for each limited candidate sentence, the vector of each word it contains is looked up in the word2vec data dictionary, and the vector of the limited candidate sentence is calculated from those word vectors by a sentence-vector formula. The formula may be a vector sum, that is, the vectors of the words contained in the limited candidate sentence are summed to obtain the vector of that sentence. The vector corresponding to the current query statement is likewise obtained from the word2vec data dictionary. The similarity between the current query statement and each limited candidate sentence in the limited candidate set is then obtained by a similarity calculation method; this similarity is defined as the first similarity of the embodiment of the present invention.
In the embodiment of the present invention, the first similarity may be calculated with a Euclidean distance formula, applied between the current query statement and each limited candidate sentence in the limited candidate set.
Alternatively, the first similarity may be calculated with a cosine distance formula, applied in the same way. The specific similarity measure is not limited.
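The two measures mentioned above can rank the same candidates differently, as the following sketch shows. The conversion from Euclidean distance to a similarity, 1 / (1 + d), is an assumption introduced here for illustration (the text names the distance but not the conversion), and the vectors are made up. `math.dist` requires Python 3.8 or later.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def euclidean_similarity(a, b):
    # Assumption: map a distance d >= 0 to a similarity in (0, 1] via 1 / (1 + d).
    return 1.0 / (1.0 + math.dist(a, b))

# Hypothetical query vector and two candidate sentence vectors.
query = [1.0, 1.0]
same_direction_far = [10.0, 10.0]  # same direction as the query, much longer
nearby = [1.0, 0.5]                # close in space, slightly different direction

cosine_prefers_far = (cosine_similarity(query, same_direction_far)
                      > cosine_similarity(query, nearby))
euclidean_prefers_near = (euclidean_similarity(query, nearby)
                          > euclidean_similarity(query, same_direction_far))
```

When sentence vectors are sums of word vectors, longer sentences tend to have longer vectors, so the choice between the two measures affects whether sentence length influences the ranking.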
S103, obtaining a plurality of limited candidate sentences whose first similarity meets a first preset condition to form a limited sentence set;
In this step, the first preset condition may be that the first similarity is greater than a first-similarity threshold. Each first similarity obtained in S102 is compared with the threshold, the limited candidate sentences whose first similarity exceeds the threshold are added to an initially empty set, and the resulting set is the limited sentence set of the embodiment of the present invention.
Alternatively, the first preset condition may be to select a preset number of limited candidate sentences with the largest first similarity. Specifically, the first similarities obtained in S102 are sorted in descending order, the limited candidate sentences corresponding to the preset number of highest first similarities are selected, and the selected sentences are placed in an initially empty set to form the limited sentence set.
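The two forms of the first preset condition can be sketched side by side. The function names, the similarity values, and the threshold are hypothetical; the same two selectors would apply unchanged to the second preset condition on words.

```python
def select_by_threshold(similarities, threshold):
    """Keep every candidate whose similarity is greater than the threshold."""
    return [item for item, sim in similarities.items() if sim > threshold]

def select_top_k(similarities, k):
    """Keep the k candidates with the largest similarity, in descending order."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:k]]

# Hypothetical first similarities for three limited candidate sentences.
first_similarities = {"sentence a": 0.9, "sentence b": 0.4, "sentence c": 0.7}
by_threshold = select_by_threshold(first_similarities, threshold=0.5)
by_top_k = select_top_k(first_similarities, k=2)
```

A threshold yields a set whose size varies with the query, while a preset number bounds the work done in the later steps.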
S104, obtaining each word to be matched from the limited sentence set;
In this step, any commonly used word segmentation technique may be used to segment each limited candidate sentence in the limited sentence set obtained in S103 into the words to be matched, for example forward maximum matching, reverse maximum matching, shortest-path segmentation, or word sense analysis. Shortest-path segmentation is used below as an example of how the limited candidate sentences are segmented.
Shortest-path word segmentation cuts a sentence into the smallest possible number of words. For example, for the limited candidate sentence "how the similarity probability of the face recognition algorithm is calculated", shortest-path segmentation yields the following words to be matched: face recognition algorithm + similarity probability + how to calculate + out.
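The idea of cutting a sentence into the fewest words can be sketched with dynamic programming over character positions. This is an illustrative sketch rather than the segmenter used in the embodiment: the vocabulary is hypothetical, unknown characters fall back to single-character tokens, and an unspaced ASCII string stands in for Chinese text.

```python
def shortest_path_segment(text, vocabulary):
    """Split text into the fewest tokens drawn from vocabulary.

    Any single character is allowed as a fallback token, so a segmentation
    always exists.
    """
    n = len(text)
    # best[i] = (fewest tokens covering text[:i], start index of the last token)
    best = [None] * (n + 1)
    best[0] = (0, 0)
    max_len = max((len(word) for word in vocabulary), default=1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocabulary or len(piece) == 1:
                candidate = (best[start][0] + 1, start)
                if best[end] is None or candidate[0] < best[end][0]:
                    best[end] = candidate
    # Walk back through the stored start indices to recover the tokens.
    tokens = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]

# Hypothetical vocabulary; the longer entry lets the path use fewer tokens.
vocab = {"face", "recognition", "algorithm", "facerecognition"}
segments = shortest_path_segment("facerecognitionalgorithm", vocab)
```

Here the two-token path through "facerecognition" wins over the three-token path through "face" and "recognition".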
Existing word segmentation techniques are mature, so the embodiments of the invention do not enumerate them one by one.
S105, calculating second similarity between the current query sentence and each obtained word to be matched;
in this step, a vector corresponding to each word to be matched is obtained from the word2vec data dictionary, a vector corresponding to the current query sentence is obtained from the word2vec data dictionary, and then the similarity between the current query sentence and each word to be matched is obtained in a similarity calculation manner, and the similarity is defined as a second similarity of the embodiment of the present invention.
In the embodiment of the invention, the second similarity between the current query statement and each word to be matched can be calculated through an Euclidean distance formula or a cosine distance formula.
S106, obtaining each word whose second similarity satisfies a second preset condition, and determining each such word as an expansion word of the current query sentence.
In this step, the second preset condition may be that the second similarity is greater than a threshold of the second similarity. Each second similarity obtained in step S105 is compared with the threshold, and each word whose second similarity is greater than the threshold is determined as an expansion word of the current query sentence.
In addition, the second preset condition may also be to select a preset number of words with the largest second similarities. Specifically, the second similarities obtained in S105 are sorted from large to small, the words corresponding to the first preset number of second similarities are selected, and each selected word is determined as an expansion word of the current query sentence.
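The two forms of the second preset condition described above (a similarity threshold, or a preset number of top-ranked words) can be sketched as follows. The function names and the dictionary-of-scores input are assumptions made for illustration.

```python
def select_by_threshold(scores, threshold):
    """Keep every word whose similarity is strictly greater than the threshold."""
    return [word for word, score in scores.items() if score > threshold]


def select_top_n(scores, n):
    """Keep the n words with the largest similarity, sorted from large to small."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]
```

For example, with `scores = {"a": 0.9, "b": 0.4, "c": 0.7}`, a threshold of 0.5 and a preset number of 2 both select `"a"` and `"c"`.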
The query term expansion method provided by the embodiment of the invention improves the user's search experience. Specifically, a limited sentence set of historically searched sentences corresponding to the current query statement is determined from the pre-constructed limited candidate set through similarity calculation, so that the limited sentence set reflects the real search requirements of users. Then, the similarity between each word contained in the limited sentence set and the current query sentence is calculated, and each word meeting the preset condition is determined as an expansion word of the current query sentence. By combining the word2vec technique to determine the limited sentence set from the limited candidate set, the embodiment of the invention on the one hand takes the real search requirements of users into account, and on the other hand narrows the candidates to a preliminary limited sentence set, which reduces the amount of computation needed later to determine the expansion words of the current query statement. Furthermore, analyzing the similarity between each word contained in the limited sentence set and the current query sentence ensures the validity of the expansion words determined for the query, and ultimately improves the user's search experience.
Optionally, in an embodiment of the query term expansion method of the present invention, the process of constructing the qualified candidate set used in step S102 may be as shown in Fig. 2. Fig. 2 is a flowchart of a method for constructing a limited candidate set in a query term expansion method according to an embodiment of the present invention, where the method includes:
S201, obtaining search content of a preset search amount from a historical search log of a user; the search content includes: words and/or sentences that the user has historically searched.
In this step, search contents of a preset search amount may be obtained in a storage area where user history search contents are stored. For example, in a back-end database storing historical search information of a user, search content with a search volume reaching a certain threshold and a click rate reaching a certain threshold is obtained; or in a record file for storing the search content of the user, obtaining the search content of which the search volume reaches a certain threshold value and the click rate reaches a certain threshold value.
S202, filtering out the search content that consists of a single word.
Specifically, a filter condition that removes search content consisting of a single word may be set, and all the search content may be filtered through this condition. The filtering ensures that each remaining piece of search content is valid historical search content containing a plurality of words, that is, a sentence that the user has actually searched.
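Steps S201 and S202 can be sketched as a simple filter over log records. The `SearchRecord` fields, the threshold parameters, and the whitespace-based default segmenter are hypothetical; the embodiment only states that search volume and click rate reach "a certain threshold" and that single-word content is filtered out.

```python
from dataclasses import dataclass


@dataclass
class SearchRecord:
    text: str            # the searched word or sentence
    search_count: int    # how often it was searched (hypothetical field)
    click_rate: float    # fraction of searches followed by a click (hypothetical field)


def collect_candidate_sentences(records, min_searches, min_click_rate, segment=str.split):
    """Return historical search texts that pass the volume/click thresholds
    and contain more than one word (S201 followed by S202)."""
    kept = []
    for record in records:
        if record.search_count < min_searches or record.click_rate < min_click_rate:
            continue  # S201: not searched or clicked often enough
        if len(segment(record.text)) < 2:
            continue  # S202: drop single-word search content
        kept.append(record.text)
    return kept
```

For text without spaces between words, a real word segmenter would be passed as `segment` in place of `str.split`.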
S203, obtaining the vector of each statement.
Specifically, the step of obtaining the vector of each statement includes:
for each sentence, dividing the sentence into individual words;
for each sentence, looking up, in a word2vec data dictionary, the vector corresponding to each individual word contained in the sentence;
and for each sentence, combining the vectors corresponding to the individual words contained in the sentence according to a first preset formula to obtain the vector corresponding to the sentence.
In the embodiment of the invention, the vector corresponding to each statement is obtained in a word2vec data dictionary mode.
First, a word2vec model is trained using query sentences, titles from click data, and content crawled from external sites as corpora to obtain a vector corresponding to each word, thereby generating the word2vec data dictionary of the embodiment of the invention.
Each sentence is divided into individual words by any of the commonly used word-segmentation techniques, such as the forward maximum matching method.
For each sentence, the vector corresponding to each word contained in the sentence is obtained from the word2vec data dictionary, and the vector corresponding to the sentence is calculated using a first preset formula for sentence vectors. The first preset formula may be a vector summation formula, that is, the vectors corresponding to the words contained in the sentence are summed to obtain the vector corresponding to the sentence.
In the embodiment of the present invention, the first preset formula may be as follows:
$$qv = \sum_i weight_i \cdot wv_i$$
wherein $qv$ represents the vector corresponding to the sentence in the limited candidate set, $weight_i$ represents the weight of the $i$-th word in the sentence, and $wv_i$ represents the vector corresponding to the $i$-th word of the sentence in the word2vec data dictionary.
And calculating to obtain the vector corresponding to each statement according to the first preset formula.
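A minimal sketch of the first preset formula follows, with vectors held as plain Python lists. Using a weight of 1 for every word when no weights are given (which reduces the formula to plain vector summation) and skipping words that are missing from the dictionary are assumptions of this sketch.

```python
def sentence_vector(words, word_vectors, weights=None):
    """Compute qv = sum_i weight_i * wv_i for the words of one sentence.

    `word_vectors` maps a word to its vector (a list of floats).
    """
    if weights is None:
        weights = [1.0] * len(words)  # assumption: unweighted summation by default
    dim = len(next(iter(word_vectors.values())))
    qv = [0.0] * dim
    for word, weight in zip(words, weights):
        wv = word_vectors.get(word)
        if wv is None:
            continue  # assumption: words absent from the dictionary are skipped
        for k in range(dim):
            qv[k] += weight * wv[k]
    return qv
```

For example, with `{"face": [1.0, 0.0], "recognition": [0.0, 2.0]}`, the sentence `["face", "recognition"]` maps to `[1.0, 2.0]` without weights and to `[2.0, 1.0]` with weights `[2.0, 0.5]`.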
S204, storing the vectors corresponding to the sentences into a limited candidate set.
Specifically, each statement and the vector corresponding to it are stored in a pre-constructed empty set, and this set forms the limited candidate set. Each sentence stored in the limited candidate set is taken as a limited candidate sentence, so that when a user enters a query statement, the corresponding limited candidate sentences are first matched from the limited candidate set.
Therefore, the method and the device can realize the construction of the limited candidate set according to the real search content of the user history, so that when the user has the requirement of query sentence expansion, each limited candidate sentence with higher similarity is obtained from the limited candidate set, and each expansion word required by the user in real time is matched for the user in the later period.
Optionally, in an embodiment of the query term expansion method of the present invention, an implementation manner of calculating, in S102, a first similarity between the current query statement and each qualified candidate statement in the pre-constructed qualified candidate set may be as shown in Fig. 3. Fig. 3 is a flowchart of a method for calculating a first similarity in query term expansion according to an embodiment of the present invention, including:
S301, combining the vectors corresponding to the individual words contained in the current query statement according to a first preset formula to obtain the vector corresponding to the current query statement.
First, the vectors corresponding to the words contained in the current query statement are obtained from the word2vec data dictionary, and then the vector corresponding to the current query statement is calculated through the first preset formula.
In the embodiment of the present invention, the first preset formula may be as follows:
$$qv = \sum_i weight_i \cdot wv_i$$
wherein $qv$ represents the vector corresponding to the current query statement, $weight_i$ represents the weight of the $i$-th word in the current query statement, and $wv_i$ represents the vector of the $i$-th word contained in the current query statement.
S302, calculating a first similarity between a vector corresponding to the current query statement and a vector corresponding to each limited candidate statement in the limited candidate set.
In the embodiment of the invention, the similarity between the current query statement and each limited candidate statement in the limited candidate set can be calculated through a Euclidean distance formula or, alternatively, through a cosine distance formula, so as to obtain the first similarity between the current query statement and each limited candidate statement in the limited candidate set.
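The following sketch shows one way step S302 could be computed with either measure. The embodiment does not say how a Euclidean distance is turned into a similarity; the mapping `1 / (1 + distance)` used here is an assumption, as are the function names and the dictionary layout of the candidate set.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_similarity(a, b):
    """Assumed mapping from Euclidean distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + math.dist(a, b))


def first_similarities(query_vector, candidate_vectors, measure=cosine_similarity):
    """Map each candidate sentence to its first similarity with the query vector."""
    return {
        sentence: measure(query_vector, vector)
        for sentence, vector in candidate_vectors.items()
    }
```

Cosine similarity ignores vector length, which may matter here because summed sentence vectors grow with the number of words; the Euclidean variant is sensitive to that length.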
Therefore, through the embodiment of the invention, the first similarity between the current query statement and each limited candidate statement in the limited candidate set can be obtained, and then the limited candidate statement with higher similarity to the current query term is screened out through the first similarity, so that each expansion word with higher similarity to the current query term can be screened out in the limited candidate statement at the later stage, and each obtained expansion word is ensured to be a meaningful word which is really searched by a user.
Optionally, in an embodiment of the query term expansion method of the present invention, an implementation manner of obtaining a plurality of qualified candidate sentences, of which the first similarity satisfies a first preset condition, in S103 to form a qualified sentence set may be as follows, including:
and obtaining a first preset number of multiple limited candidate sentences according to the sequence of the first similarity from high to low to form a limited sentence set.
In this embodiment, the first preset condition may be a threshold of the first similarity.
Specifically, the restricted candidate statements that are similar to the current query statement may be determined by checking each first similarity against the threshold in real time: each time a first similarity between the current query statement and a restricted candidate statement is obtained, it is checked whether this first similarity is greater than the threshold, and if so, the corresponding restricted candidate statement is placed into a pre-established empty set to form the restricted statement set.
Alternatively, after the first similarities between the current query statement and all the limited candidate statements have been obtained, it is checked whether each first similarity is greater than the threshold, and each limited candidate statement whose first similarity is greater than the threshold is put into a pre-established empty set to form the limited statement set.
In addition, the first preset condition may also be to select a preset number of limited candidate sentences with the largest first similarities. Specifically, the first similarities are sorted from high to low, the first preset number of limited candidate sentences are taken in that order and placed in a pre-constructed empty set, and this set forms the restricted sentence set of the embodiment of the present invention.
Therefore, the method and the device can determine the multiple limited candidate sentences similar to the current query sentence in the pre-constructed limited candidate set, further form the limited sentence set for the multiple determined limited candidate sentences, and achieve the purpose of obtaining the user historical search sentences related to the current query sentence in the historical data.
Optionally, in an embodiment of the query term expansion method of the present invention, the implementation manner of obtaining each word to be matched from the restricted sentence set in S104 may include:
and dividing each limited candidate sentence in the limited sentence set into a single word, and taking each word as each word to be matched of the current query sentence.
In this step, each limited candidate sentence in the limited sentence set is analyzed in a semantic analysis manner and then divided into individual words, and each resulting word is used as a word to be matched for the current query sentence.
The step of calculating the second similarity between the current query statement and each obtained word to be matched in S105 may include:
and respectively counting the frequency of the occurrence of each word to be matched in the limited sentence set.
Specifically, the times of occurrence of each word in all the limited candidate sentences of the limited sentence set are respectively counted through a statistical function; or setting a word storage table, and recording the words formed by each division in the storage table when the limited candidate sentences are divided, and further obtaining the corresponding times of each word.
Calculating by a second preset formula aiming at each word to be matched to obtain a second similarity between the current query sentence and the word to be matched, wherein the second preset formula is as follows:
$$Score_i = freq_i \cdot simq_i$$
wherein $freq_i$ indicates the frequency of occurrence of the $i$-th word, and $simq_i$ represents the first similarity of the limited sentence set to which the $i$-th word corresponds.
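A sketch of the second preset formula is given below. The text does not specify which first similarity applies when a word occurs in several limited candidate sentences; this sketch assumes the mean first similarity of the sentences containing the word, and that choice, like the function name and input layout, is an assumption rather than part of the embodiment.

```python
from collections import Counter, defaultdict


def frequency_weighted_scores(restricted_set):
    """Compute Score_i = freq_i * simq_i for every word in the restricted set.

    `restricted_set` is a list of (words, first_similarity) pairs, one per
    limited candidate sentence.
    """
    freq = Counter()
    similarities = defaultdict(list)
    for words, first_similarity in restricted_set:
        freq.update(words)  # freq_i: occurrences across all sentences
        for word in set(words):
            similarities[word].append(first_similarity)
    # Assumption: simq_i is the mean first similarity of sentences containing the word.
    return {
        word: freq[word] * (sum(similarities[word]) / len(similarities[word]))
        for word in freq
    }
```

Under this assumption, a word that appears in two sentences with first similarities 0.8 and 0.6 receives a score of 2 × 0.7 = 1.4, so frequent words from highly similar sentences rank first.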
An implementation manner of S106, in which each word whose second similarity satisfies a second preset condition is obtained and determined as an expansion word of the current query statement, may include:
obtaining a second preset number of words to be matched in descending order of second similarity, and determining these words as the expansion words of the current query sentence.
In the embodiment of the present invention, the second preset condition may be a threshold of the second similarity, and each word corresponding to each second similarity greater than the threshold is determined as each expansion word of the current query statement by determining a relationship between each second similarity and the threshold.
Specifically, the words to be matched that are similar to the current query sentence may be determined by checking each second similarity against the threshold in real time: each time a second similarity between the current query sentence and a word to be matched is obtained, it is checked whether this second similarity is greater than the threshold, and if so, the corresponding word is determined as an expansion word of the current query sentence.
Alternatively, after the second similarities between the current query sentence and all the words have been obtained, it is checked whether each second similarity is greater than the threshold, and each word whose second similarity is greater than the threshold is determined as an expansion word of the current query sentence.
In addition, the second preset condition may also be to select a preset number of words with the largest second similarities. Specifically, all the second similarities are sorted by value, the preset number of largest second similarities are taken, and the word corresponding to each of them is determined as an expansion word of the current query sentence.
Therefore, through the embodiment of the invention, the words with higher similarity to the query word can be calculated by counting the relationship between the occurrence frequency of each word in the limited sentence set and the first similarity, and the words are used as the expansion words of the current query sentence.
Optionally, in an embodiment of the query term expansion method of the present invention, the implementation manner of S104 for obtaining each word to be matched from each restricted sentence set may include:
for each limited candidate sentence in the limited sentence set, dividing the sentence into individual words, and taking each word as a word to be matched for the current query sentence.
Specifically, each limited candidate sentence is divided by a forward maximum matching method to obtain the words to be matched corresponding to the current query sentence.
The forward maximum matching method divides a sentence into words from left to right. For example, if a limited candidate sentence is "how the similarity probability of the face recognition algorithm is calculated", segmenting it with the forward maximum matching method yields the following words to be matched: face recognition algorithm + similarity + probability + how + calculation + out.
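A minimal sketch of forward maximum matching is shown below: scanning from left to right, it takes the longest dictionary word that starts at the current position. The function name, the `vocabulary` argument, and the single-character fallback are assumptions made for illustration.

```python
def forward_maximum_match(sentence, vocabulary, max_word_len=None):
    """Greedy left-to-right segmentation that prefers the longest dictionary word.

    Assumption: a single character is emitted when no dictionary word matches.
    """
    if max_word_len is None:
        max_word_len = max((len(word) for word in vocabulary), default=1)
    words, i = [], 0
    while i < len(sentence):
        # Try the longest window first and shrink it until something matches.
        for size in range(min(max_word_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words
```

Because the choice at each position is greedy, the result is not guaranteed to contain the fewest words: with the toy vocabulary `{"ab", "abc", "cde"}`, the string `"abcde"` becomes `["abc", "d", "e"]`.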
The step of calculating the second similarity between the current query statement and each obtained word to be matched in S105 may include:
for each word to be matched, calculating, through a third preset formula, the second similarity between the current query sentence and the word to be matched; the third preset formula is:
$$Score_i = 1 - Distance(wv_i, qv)$$
wherein $wv_i$ represents the vector of the $i$-th word in the limited candidate sentence, $qv$ represents the vector corresponding to the current query sentence, $Distance$ represents the cosine distance between the vector of the $i$-th word and the vector corresponding to the current query sentence, and $Score_i$ represents the second similarity between the current query statement and the $i$-th word.
In the embodiment of the present invention, the second similarity between the current query statement and each word to be matched is calculated according to the third preset formula.
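The third preset formula can be sketched as below, assuming the common convention that cosine distance equals one minus cosine similarity (so the score equals the cosine similarity itself). The function names and the choice to skip words without a vector are assumptions of this sketch.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0
    return 1.0 - dot / norm


def second_similarity_scores(query_vector, words, word_vectors):
    """Score_i = 1 - Distance(wv_i, qv) for each word to be matched."""
    return {
        word: 1.0 - cosine_distance(word_vectors[word], query_vector)
        for word in words
        if word in word_vectors  # assumption: words without a vector are skipped
    }
```

A word whose vector points in the same direction as the query vector scores 1, and a word orthogonal to it scores 0.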
An implementation manner of S106, in which each word whose second similarity satisfies a second preset condition is obtained and determined as an expansion word of the current query statement, may include:
obtaining a second preset number of words to be matched in descending order of second similarity, and determining these words as the expansion words of the current query sentence.
After the second similarity between the current query sentence and each word to be matched is obtained, the second similarities are sorted to obtain each word meeting a second preset condition, and each word is determined as each expansion word corresponding to the current query sentence.
Specifically, all the second similarities may be sorted according to the magnitude of the second similarity value to obtain a preset number of second similarities, and each word corresponding to each second similarity is determined as each expansion word of the query word.
Therefore, in the embodiment of the invention, the expansion words are determined by calculating the similarity between the query sentence and only the words in the limited sentence set, which reduces the amount of computation compared with scoring all the words in the word2vec data dictionary. In addition, the expansion words determined in this way are consistent with real user search requests, so that the determined expansion words take the real requirements of the user into account and the user's search experience is further improved.
Optionally, in an embodiment of the query term expansion method of the present invention, when the query statement is a single word, an expansion word of the single-word query may also be determined.
Specifically, the method may further include:
obtaining a current query term consisting of a single word.
In this step, the current query term consisting of a single word input by the current user can be obtained from the user's front-end search interface, or from a back-end database that stores the search information input through the user search interface.
The limited candidate set further comprises: the words that the user has historically searched.
In this step, words of the user history search of a preset search amount may be obtained in a storage area where the user history search content is stored. For example, in a back-end database storing historical search information of a user, words with a search volume reaching a certain threshold and a click rate reaching a certain threshold are obtained; or in a record file for storing the search content of the user, acquiring words of which the search quantity reaches a certain threshold value and the click rate reaches a certain threshold value. The acquired words are saved in a qualified candidate set.
A third similarity between the current query term and each word in the qualified candidate set is calculated.
Specifically, the third similarity between the current query term and each word in the limited candidate set may be calculated by using the second preset formula or the third preset formula. The specific calculation method has already been described in the second preset formula or the third preset formula embodiment, and is not described herein again.
And obtaining each word of which the third similarity meets a third preset condition, and determining each word as each expansion word of the current query word.
Specifically, obtaining each word whose third similarity satisfies a third preset condition, and determining each word as each expansion word of the current query term may include:
obtaining a third preset number of words to be matched in descending order of third similarity, and determining these words as the expansion words of the current query word.
In this embodiment of the present invention, the third preset condition may be that whether the third similarity is greater than a threshold of the third similarity is determined, and each word corresponding to each third similarity greater than the threshold is determined as each expansion word of the current query term according to a relationship between each third similarity obtained through the determination and the threshold of the third similarity.
Specifically, the relation between the third similarity and a third preset condition may be detected in real time, and each word similar to the current query term in each word to be matched is determined, that is, after the third similarity between the current query term and the word to be matched is obtained each time, whether the third similarity is greater than the third preset condition is detected, and when the third similarity is greater than the third preset condition, the word corresponding to the third similarity is determined as the expansion term of the current query term.
Alternatively, after the third similarities between the current query term and all the words have been obtained, it is checked whether each third similarity is greater than the threshold, and each word whose third similarity is greater than the threshold is determined as an expansion word of the current query term.
In addition, the third preset condition may also be to select a preset number of words with the largest third similarities. Specifically, all the third similarities are sorted by value, the preset number of largest third similarities are taken, and the word corresponding to each of them is determined as an expansion word of the current query term.
Therefore, through the embodiment of the invention, the single query term input by the user can be determined, the expansion terms which are actually searched by the user and have higher similarity with the query term are determined, the effectiveness of the expansion terms determined for the query term is ensured, and the aim of improving the search experience of the user is finally fulfilled.
In a second aspect, an embodiment of the present invention discloses a query term expansion apparatus, as shown in fig. 4. Fig. 4 is a schematic structural diagram of a query term expansion apparatus according to an embodiment of the present invention, including:
a first query sentence acquisition module 401, configured to acquire a current query sentence including a plurality of words;
a first similarity calculation module 402, configured to calculate a first similarity between the current query statement and each qualified candidate statement in the pre-constructed qualified candidate set; the qualified candidate statements comprise: sentences historically searched by the user;
a restricted statement set determining module 403, configured to obtain a plurality of restricted candidate statements whose first similarity satisfies a first preset condition, and form a restricted statement set;
a to-be-matched word determining module 404, configured to obtain each word to be matched from the restricted sentence set;
a second similarity calculation module 405, configured to calculate a second similarity between the current query statement and each obtained word to be matched;
the first expanded word determining module 406 is configured to obtain each word of which the second similarity satisfies a second preset condition, and determine each word as each expanded word of the current query statement.
The query term expansion device provided by the embodiment of the invention improves the user's search experience. Specifically, a limited sentence set of historically searched sentences corresponding to the current query statement is determined from the pre-constructed limited candidate set through similarity calculation, so that the limited sentence set reflects the real search requirements of users. Then, the similarity between each word contained in the limited sentence set and the current query sentence is calculated, and each word meeting the preset condition is determined as an expansion word of the current query sentence. By combining the word2vec technique to determine the limited sentence set from the limited candidate set, the embodiment of the invention on the one hand takes the real search requirements of users into account, and on the other hand narrows the candidates to a preliminary limited sentence set, which reduces the amount of computation needed later to determine the expansion words of the current query statement. Furthermore, analyzing the similarity between each word contained in the limited sentence set and the current query sentence ensures the validity of the expansion words determined for the query, and ultimately improves the user's search experience.
Optionally, in an embodiment of the query term expansion apparatus, the apparatus further includes: a limited candidate set construction module, configured to obtain search content of a preset search amount from a historical search log of a user, the search content including words and/or sentences historically searched by the user; filter out the search content that consists of a single word; obtain a vector of each statement; and store the vectors corresponding to the sentences into a qualified candidate set.
Optionally, in an embodiment of the query term expansion apparatus of the present invention, the limited candidate set construction module includes:
a sentence dividing submodule for dividing each sentence into a single word;
the word vector searching submodule is used for searching a vector corresponding to a single word contained in each sentence in the word2vec data dictionary according to each sentence;
and the sentence vector determining submodule is used for calculating the vector corresponding to the single word contained in the sentence according to a first preset formula aiming at each sentence to obtain the vector corresponding to the sentence.
Optionally, the first similarity calculating module 402 includes:
the query statement vector determination submodule is used for calculating a vector corresponding to a single word contained in the current query statement according to a first preset formula to obtain a vector corresponding to the current query statement;
and the first similarity operator module is used for calculating the first similarity of the vector corresponding to the current query statement and the vector corresponding to each limited candidate statement in the limited candidate set.
Optionally, the first preset formula is:
$$qv = \sum_i weight_i \cdot wv_i$$
where $qv$ denotes the vector of each statement in the limited candidate set or the vector of the current query statement, $weight_i$ denotes the weight of the $i$-th word, and $wv_i$ denotes the vector of the $i$-th word.
Optionally, the limited statement set determining module 403 is specifically configured to obtain a first preset number of multiple limited candidate statements according to the order from high to low of the first similarity, so as to form a limited statement set.
Optionally, the to-be-matched word determining module 404 is specifically configured to divide each limited candidate sentence in the limited sentence set into a single word, and use each word as each word to be matched in the current query sentence;
the second similarity calculation module 405 includes:
the frequency counting submodule is used for respectively counting the frequency of the occurrence of each word to be matched in the limited sentence set;
the second similarity calculation submodule is used for calculating a second similarity of the current query sentence and the word to be matched through a second preset formula aiming at the word to be matched, and the second preset formula is as follows:
$$Score_i = freq_i \cdot simq_i$$
wherein $freq_i$ indicates the frequency of occurrence of the $i$-th word, and $simq_i$ represents the first similarity of the restricted sentence set corresponding to the $i$-th word;
and the first expansion word determining module 406 is specifically configured to obtain a second preset number of words to be matched in descending order of second similarity, and determine these words as expansion words of the current query statement.
Optionally, the restricted sentence set determining module 403 is specifically configured to divide each restricted candidate sentence in the restricted sentence set into a single word, and use each word as each word to be matched in the current query sentence;
the second similarity calculation module 405 is specifically configured to calculate, by using a third preset formula, a second similarity between the current query statement and the word to be matched, for each word to be matched; the third preset formula is:
$$Score_i = 1 - Distance(wv_i, qv)$$
wherein $wv_i$ represents the vector of the $i$-th word in the limited candidate sentence, $qv$ represents the vector corresponding to the current query sentence, $Distance$ represents the cosine distance between the vector of the $i$-th word and the vector corresponding to the current query sentence, and $Score_i$ represents the second similarity between the current query statement and the $i$-th word;
the first expanded word determining module 406 is specifically configured to obtain a second preset number of words to be matched according to the second similarity order, and determine the words to be matched as the expanded words of the current query statement.
Optionally, the apparatus further comprises:
a second query sentence acquisition module, configured to acquire a current query term consisting of a single word;
the limited candidate set further comprises: words historically searched by the user;
a third similarity calculation module, configured to calculate a third similarity between the current query term and each word in the limited candidate set;
and a second expansion word determining module, configured to obtain each word whose third similarity satisfies a third preset condition and determine each such word as an expansion word of the current query term.
Optionally, the second expansion word determining module is specifically configured to obtain a third preset number of words to be matched according to a third similarity order, and determine the third preset number of words to be matched as expansion words of the current query word.
To better explain the query term expansion method of the embodiment of the invention, the method is divided into an offline part and an online part. The offline part is the process of constructing the limited candidate set from historical data; the online part is the process of determining, after the user inputs a query sentence, the expansion words corresponding to that query sentence.
The offline process of establishing the limited candidate set may be as shown in Fig. 5. Fig. 5 is a schematic structural diagram of establishing a limited candidate set in the query term expansion method according to the embodiment of the present invention. The modules used by the offline part to establish the limited candidate set include: a limited candidate set module 501, a word2vec data dictionary module 502, and a limited candidate sentence vector generation module 503. The specific process is as follows:
The limited candidate set module 501:
This module constructs a limited candidate set containing words and/or sentences from users' search history. Because the set is drawn from real searches, the K limited candidate sentences later retrieved from it are more reliable, and determining the expansion words of the current query sentence from those K limited candidate sentences improves the accuracy of the expansion words. An optional criterion is to keep only words and/or sentences that users have queried and that have a certain amount of click data.
Specifically, words and/or sentences whose search volume over a past period reaches a certain threshold and whose click rate reaches a certain threshold are extracted from the log.
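The threshold-based selection can be sketched as follows. This is a minimal illustration only: the log format (one `(query, clicked)` pair per logged search), the function name `build_candidates`, and the definition of click rate as clicks divided by searches are all assumptions, since the text does not specify them.

```python
from collections import defaultdict

def build_candidates(log_rows, min_searches, min_click_rate):
    """Keep queries whose search volume and click rate both reach a threshold.

    log_rows: iterable of (query, clicked) pairs, one per logged search, where
    clicked is True when the search led to a click (hypothetical log format).
    """
    searches = defaultdict(int)
    clicks = defaultdict(int)
    for query, clicked in log_rows:
        searches[query] += 1
        if clicked:
            clicks[query] += 1
    # Click rate is taken as clicks / searches (assumption).
    return [q for q in searches
            if searches[q] >= min_searches
            and clicks[q] / searches[q] >= min_click_rate]
```

Both thresholds are tuning parameters; the text only says they are "certain" values.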
The word2vec data dictionary module 502:
A word2vec model is trained using, as the corpus, the titles in query-click data and content crawled from external sites. A vector is generated for each word, and each word together with its vector is stored in the word2vec data dictionary module.
The limited candidate sentence vector generation module 503:
The vector of each limited candidate sentence is calculated from the vectors of its words in the word2vec data dictionary module using the first preset formula, and each limited candidate sentence and its corresponding vector are stored in the limited candidate set.
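The first preset formula, qv = Σ weight_i * wv_i, can be sketched in Python as below. The dictionary is modelled as a plain `dict` from word to vector; the default weight of 1.0 and the skipping of words missing from the dictionary are assumptions, as the text does not say how weights are chosen or how unknown words are handled.

```python
def sentence_vector(words, word2vec, weights=None):
    """qv = sum_i weight_i * wv_i over the words found in the dictionary.

    words: the single words a sentence was divided into.
    word2vec: dict mapping word -> vector (list of floats).
    weights: optional dict word -> weight; defaults to 1.0 per word (assumption).
    Returns None when no word of the sentence is in the dictionary.
    """
    total = None
    for word in words:
        vec = word2vec.get(word)
        if vec is None:
            continue  # out-of-vocabulary words are skipped (assumption)
        w = 1.0 if weights is None else weights.get(word, 1.0)
        if total is None:
            total = [0.0] * len(vec)
        total = [t + w * x for t, x in zip(total, vec)]
    return total
```

The same function would serve both the offline step (limited candidate sentences) and the online step (the current query sentence), since both use the first preset formula.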
The online part is the process of determining the expansion words for the query sentence, as shown in Fig. 6. Fig. 6 is a schematic structural diagram of determining expansion words for a query sentence in the query term expansion method according to an embodiment of the present invention. The online part determines the expansion words with the following modules: a data loading and updating module 601, a current query sentence vector generation module 602, an expansion word searching module 603, and an expansion word generation module 604. The specific process is as follows:
The data loading and updating module 601:
This module loads online the vector corresponding to each word in the word2vec data dictionary, together with each limited candidate sentence and its corresponding vector stored in the limited candidate set. It also supports updating each limited candidate sentence and its corresponding vector.
The current query sentence vector generation module 602:
When a user inputs a current query sentence comprising a plurality of words, the module checks whether the limited candidate set already holds the vector of a limited candidate sentence corresponding to the current query sentence. If so, that vector is returned directly; if not, the vector of the current query sentence is calculated according to the first preset formula, and the current query sentence and its corresponding vector are stored in the limited candidate set.
The expansion word searching module 603:
This module calculates the first similarity between the current query sentence and each limited candidate sentence in the limited candidate set, and obtains the K limited candidate sentences closest in cosine distance to the current query sentence to form the limited sentence set;
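The K-nearest search can be sketched as a brute-force scan, assuming the limited candidate set is held in memory as a `dict` from sentence to vector. The function names are hypothetical, and a production system might use an approximate nearest-neighbour index instead; the text does not say which search strategy is used.

```python
import heapq
import math

def cosine_similarity(u, v):
    """cos(u, v); returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_sentences(query_vector, candidate_vectors, k):
    """Return the k (sentence, first_similarity) pairs most similar to the query.

    candidate_vectors: dict mapping limited candidate sentence -> vector.
    """
    scored = ((s, cosine_similarity(query_vector, v))
              for s, v in candidate_vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

The returned pairs carry the first similarity along with each sentence, which is what the frequency-weighted scoring of the next module needs.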
The expansion word generation module 604:
This module calculates a second similarity between the current query sentence and each word to be matched in the limited sentence set through the second preset formula, takes N/2 words in order of the second similarity, and determines them as expansion words of the current query sentence;
it also calculates a third similarity between the current query sentence and each word to be matched in the limited sentence set through the third preset formula, takes N/2 words in order of the third similarity that do not duplicate the words already selected, and determines them as expansion words of the current query sentence as well.
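The combination of the two rankings can be sketched as follows, taking the two score tables as already computed. This is a hedged illustration: the function name `merge_expansion_words`, the assumption that N is even, and the alphabetical tie-breaking are not specified in the text.

```python
def merge_expansion_words(freq_scores, cosine_scores, n):
    """Take n // 2 words by frequency score, then n // 2 new words by cosine score.

    freq_scores, cosine_scores: dicts mapping word -> score from the second and
    third preset formulas respectively.
    Assumption: n is even; ties are broken alphabetically.
    """
    half = n // 2

    def ranked(scores):
        return sorted(scores, key=lambda w: (-scores[w], w))

    chosen = ranked(freq_scores)[:half]
    seen = set(chosen)
    for word in ranked(cosine_scores):
        if len(chosen) >= 2 * half:
            break
        if word not in seen:  # skip words already picked by the first ranking
            chosen.append(word)
            seen.add(word)
    return chosen
```

Drawing half of the expansion words from each ranking lets frequent words and semantically close words both contribute, while the duplicate check keeps the N results distinct.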
In another aspect, an embodiment of the present invention further discloses an electronic device, as shown in fig. 7. Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which includes a processor 701, a communication interface 702, a memory 703 and a communication bus 704, where the processor 701, the communication interface 702 and the memory 703 complete communication with each other through the communication bus 704;
a memory 703 for storing a computer program;
the processor 701 is configured to implement the following method steps when executing the program stored in the memory 703:
obtaining a current query statement comprising a plurality of words;
calculating a first similarity between the current query statement and each limited candidate statement in a pre-constructed limited candidate set; the limited candidate statements comprise: sentences historically searched by users;
obtaining a plurality of limited candidate sentences of which the first similarity meets a first preset condition to form a limited sentence set;
obtaining each word to be matched from the limited sentence set;
calculating second similarity between the current query statement and each obtained word to be matched;
and obtaining each word with the second similarity meeting a second preset condition, and determining each word as each expansion word of the current query sentence.
The communication bus 704 mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 704 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 702 is used for communication between the above-described electronic apparatus and other apparatuses.
The Memory 703 may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory 703 may also be at least one memory device located remotely from the processor 701.
The Processor 701 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
The electronic device provided by the embodiment of the invention improves the user's search experience. Specifically, similarity calculation is used to determine, within the pre-constructed limited candidate set, the limited sentence set of historically searched sentences corresponding to the current query sentence; because it is drawn from search history, the limited sentence set reflects users' real search needs. Then, for the limited sentence set, the similarity between each word it contains and the current query sentence is calculated, and each word meeting the preset condition is determined as an expansion word of the current query sentence. By combining the word2vec technique and determining the limited sentence set within the limited candidate set, the embodiment on the one hand takes users' real search needs into account and, on the other hand, obtains a preliminary limited sentence set directly from the limited candidate set, which reduces the amount of computation needed later to determine the expansion words of the current query sentence. Furthermore, analyzing the similarity between each word in the limited sentence set and the current query sentence ensures the effectiveness of the expansion words determined for the query, which ultimately improves the user's search experience.
In another aspect, an embodiment of the present invention further discloses a computer-readable storage medium in which instructions are stored; when the instructions are run on a computer, the computer is caused to perform the following method steps:
obtaining a current query statement comprising a plurality of words;
calculating a first similarity between the current query statement and each limited candidate statement in a pre-constructed limited candidate set; the limited candidate statements comprise: sentences historically searched by users;
obtaining a plurality of limited candidate sentences of which the first similarity meets a first preset condition to form a limited sentence set;
obtaining each word to be matched from the limited sentence set;
calculating second similarity between the current query statement and each obtained word to be matched;
and obtaining each word with the second similarity meeting a second preset condition, and determining each word as each expansion word of the current query sentence.
The computer-readable storage medium provided by the embodiment of the invention improves the user's search experience. Specifically, similarity calculation is used to determine, within the pre-constructed limited candidate set, the limited sentence set of historically searched sentences corresponding to the current query sentence; because it is drawn from search history, the limited sentence set reflects users' real search needs. Then, for the limited sentence set, the similarity between each word it contains and the current query sentence is calculated, and each word meeting the preset condition is determined as an expansion word of the current query sentence. By combining the word2vec technique and determining the limited sentence set within the limited candidate set, the embodiment on the one hand takes users' real search needs into account and, on the other hand, obtains a preliminary limited sentence set directly from the limited candidate set, which reduces the amount of computation needed later to determine the expansion words of the current query sentence. Furthermore, analyzing the similarity between each word in the limited sentence set and the current query sentence ensures the effectiveness of the expansion words determined for the query, which ultimately improves the user's search experience.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. The procedures or functions according to the embodiments of the invention are brought about in whole or in part when the computer program instructions are loaded and executed on a computer. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus and electronic device embodiments are substantially similar to the method embodiments, they are described briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (17)

1. A query term expansion method is characterized by comprising the following steps:
obtaining a current query statement comprising a plurality of words;
calculating a first similarity between the current query statement and each limited candidate statement in a pre-constructed limited candidate set; the limited candidate statements include: sentences historically searched by users;
obtaining a plurality of limited candidate sentences of which the first similarity meets a first preset condition to form a limited sentence set;
obtaining each word to be matched from the limited sentence set;
calculating second similarity between the current query statement and each obtained word to be matched;
and obtaining each word with the second similarity meeting a second preset condition, and determining each word as each expansion word of the current query sentence.
2. The method of claim 1, wherein constructing the defined candidate set comprises:
acquiring search content with preset search volume from a historical search log of a user; the searching content comprises: words and/or sentences historically searched by the user;
filtering out search content consisting of a single word from the search content;
obtaining a vector of each statement;
and storing the vector corresponding to each statement into the qualified candidate set.
3. The method of claim 2, wherein obtaining the vector for each statement comprises:
for each sentence, dividing the sentence into single words;
aiming at each sentence, searching a vector corresponding to a single word contained in the sentence in a word2vec data dictionary;
and aiming at each statement, calculating the vector corresponding to a single word contained in the statement according to a first preset formula to obtain the vector corresponding to the statement.
4. The method of claim 2, wherein calculating a first similarity between the current query statement and each qualified candidate statement in the pre-constructed qualified candidate set comprises:
calculating a vector corresponding to a single word contained in the current query statement according to a first preset formula to obtain the vector corresponding to the current query statement;
and calculating first similarity of the vector corresponding to the current query statement and the vector corresponding to each limited candidate statement in the limited candidate set.
5. The method according to any one of claims 3 and 4, wherein the first preset formula is:
$qv = \sum_i weight_i \cdot wv_i$
wherein $qv$ represents the vector of each statement in the qualified candidate set or the vector of the current query statement, $weight_i$ represents the weight of the i-th word, and $wv_i$ represents the vector of the i-th word.
6. The method according to claim 4, wherein the obtaining a plurality of qualified candidate sentences of which the first similarity satisfies a first preset condition to form a qualified sentence set comprises:
and obtaining a first preset number of multiple limited candidate sentences according to the sequence of the first similarity from high to low to form a limited sentence set.
7. The method of claim 4, wherein obtaining each word to be matched from each of the sets of qualified sentences comprises:
dividing each limited candidate sentence in the limited sentence set into a single word, and taking each word as each word to be matched of the current query sentence;
the calculating the second similarity between the current query statement and each obtained word to be matched comprises the following steps:
respectively counting the frequency of the occurrence of each word to be matched in the limited sentence set;
and calculating by a second preset formula aiming at each word to be matched to obtain a second similarity between the current query sentence and the word to be matched, wherein the second preset formula is as follows:
$Score_i = freq_i \cdot sim_{q_i}$
wherein $freq_i$ indicates the frequency of occurrence of the i-th word, and $sim_{q_i}$ represents the first similarity of the qualified candidate sentence corresponding to the i-th word;
the obtaining of each word with the second similarity meeting a second preset condition is determined as each expansion word of the current query statement, and the method comprises the following steps:
and obtaining a second preset number of words to be matched according to the second similarity sequence, and determining the words to be matched as the expansion words of the current query sentence.
8. The method of claim 2, wherein obtaining each word to be matched from each of the sets of qualified sentences comprises:
for each limited sentence set, dividing the limited sentence set into single words, and taking each word as a word to be matched, corresponding to that limited sentence set, for the current query sentence;
the calculating the second similarity between the current query statement and each obtained word to be matched comprises the following steps:
calculating by a third preset formula aiming at each word to be matched to obtain a second similarity between the current query sentence and the word to be matched; the third preset formula is as follows:
$Score_i = 1 - Distance(wv_i, qv)$
wherein $wv_i$ represents the vector of the i-th word in the qualified candidate sentences, $qv$ represents the vector corresponding to the current query sentence, $Distance$ represents the cosine distance between the vector of the i-th word and the vector corresponding to the current query sentence, and $Score_i$ represents the second similarity between the current query statement and the i-th word;
the obtaining of each word with the second similarity meeting a second preset condition is determined as each expansion word of the current query statement, and the method comprises the following steps:
and obtaining a second preset number of words to be matched according to the second similarity sequence, and determining the words to be matched as the expansion words of the current query sentence.
9. An apparatus for expanding a query term, comprising:
the first query sentence acquisition module is used for acquiring a current query sentence containing a plurality of words;
the first similarity calculation module is used for calculating first similarity between the current query statement and each limited candidate statement in a pre-constructed limited candidate set; the limited candidate statements include: sentences historically searched by users;
the limited statement set determining module is used for obtaining a plurality of limited candidate statements of which the first similarity meets a first preset condition to form a limited statement set;
a word to be matched determining module, configured to obtain each word to be matched from the restricted sentence set;
the second similarity calculation module is used for calculating second similarities between the current query statement and the obtained words to be matched;
and the first expansion word determining module is used for obtaining each word with the second similarity meeting a second preset condition and determining each word as each expansion word of the current query sentence.
10. The apparatus of claim 9, further comprising: the limited candidate set construction module is used for acquiring search contents of a preset search amount from a historical search log of a user; the searching content comprises: words and/or sentences historically searched by the user; filtering out search content consisting of a single word from the search content; obtaining a vector of each statement; and storing the vector corresponding to each statement into the qualified candidate set.
11. The apparatus of claim 10, wherein the defined candidate set construction module comprises:
a sentence dividing submodule for dividing each sentence into a single word;
the word vector searching submodule is used for searching a vector corresponding to a single word contained in the sentence in the word2vec data dictionary aiming at each sentence;
and the sentence vector determining submodule is used for calculating the vector corresponding to the single word contained in the sentence according to a first preset formula aiming at each sentence to obtain the vector corresponding to the sentence.
12. The apparatus of claim 10, wherein the first similarity calculation module comprises:
the query statement vector determining submodule is used for calculating a vector corresponding to a single word contained in the current query statement according to a first preset formula to obtain the vector corresponding to the current query statement;
and the first similarity operator module is used for calculating the first similarity between the vector corresponding to the current query statement and the vector corresponding to each limited candidate statement in the limited candidate set.
13. The apparatus according to any one of claims 11 and 12, wherein the first predetermined formula is:
$qv = \sum_i weight_i \cdot wv_i$
wherein $qv$ represents the vector of each statement in the qualified candidate set or the vector of the current query statement, $weight_i$ represents the weight of the i-th word, and $wv_i$ represents the vector of the i-th word.
14. The apparatus of claim 12, wherein the qualified sentence set determining module is specifically configured to obtain a first preset number of the plurality of qualified candidate sentences according to the order from high to low of the first similarity, so as to form the qualified sentence set.
15. The apparatus according to claim 12, wherein the to-be-matched word determining module is specifically configured to divide each of the qualified candidate sentences in the qualified sentence set into a single word, and use each word as each word to be matched in the current query sentence;
the second similarity calculation module comprises:
the frequency counting submodule is used for respectively counting the frequency of the words to be matched appearing in the limited statement set;
the second similarity calculation submodule is configured to calculate, for each word to be matched, a second preset formula to obtain a second similarity between the current query statement and the word to be matched, where the second preset formula is:
$Score_i = freq_i \cdot sim_{q_i}$
wherein $freq_i$ indicates the frequency of occurrence of the i-th word, and $sim_{q_i}$ represents the first similarity of the qualified candidate sentence corresponding to the i-th word;
the first expansion word determining module is specifically configured to obtain a second preset number of words to be matched according to the second similarity degree order, and determine the words to be matched as expansion words of the current query statement.
16. The apparatus according to claim 10, wherein the qualified sentence set determining module is specifically configured to divide each qualified candidate sentence in the qualified sentence set into a single word, and use each word as each word to be matched in the current query sentence;
the second similarity calculation module is specifically configured to calculate, by using a third preset formula, a second similarity between the current query statement and the word to be matched, for each word to be matched; the third preset formula is as follows:
$Score_i = 1 - Distance(wv_i, qv)$
wherein $wv_i$ represents the vector of the i-th word in the qualified candidate sentences, $qv$ represents the vector corresponding to the current query sentence, $Distance$ represents the cosine distance between the vector of the i-th word and the vector corresponding to the current query sentence, and $Score_i$ represents the second similarity between the current query statement and the i-th word;
the first expansion word determining module is specifically configured to obtain a second preset number of words to be matched according to the second similarity degree order, and determine the words to be matched as expansion words of the current query statement.
17. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor, when executing the program stored in the memory, implementing the method steps of any of claims 1-8.
CN201810489682.5A 2018-05-21 2018-05-21 Query term expansion method and device and electronic equipment Active CN108804550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810489682.5A CN108804550B (en) 2018-05-21 2018-05-21 Query term expansion method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810489682.5A CN108804550B (en) 2018-05-21 2018-05-21 Query term expansion method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN108804550A CN108804550A (en) 2018-11-13
CN108804550B true CN108804550B (en) 2021-04-16

Family

ID=64091248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810489682.5A Active CN108804550B (en) 2018-05-21 2018-05-21 Query term expansion method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN108804550B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815396B (en) * 2019-01-16 2021-09-21 北京搜狗科技发展有限公司 Search term weight determination method and device
CN111339335A (en) * 2020-03-06 2020-06-26 Oppo广东移动通信有限公司 Image retrieval method, image retrieval device, storage medium and electronic equipment
CN114238619B (en) * 2022-02-23 2022-04-29 成都数联云算科技有限公司 Method, system, device and medium for screening Chinese nouns based on edit distance

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933100A (en) * 2015-05-28 2015-09-23 北京奇艺世纪科技有限公司 Keyword recommendation method and device
CN105224554A (en) * 2014-06-11 2016-01-06 阿里巴巴集团控股有限公司 Search word is recommended to carry out method, system, server and the intelligent terminal searched for
CN107491547A (en) * 2017-08-28 2017-12-19 北京百度网讯科技有限公司 Searching method and device based on artificial intelligence
CN107862015A (en) * 2017-10-30 2018-03-30 北京奇艺世纪科技有限公司 A kind of crucial word association extended method and device

Also Published As

Publication number Publication date
CN108804550A (en) 2018-11-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant