CN109508414B - Synonym mining method and device - Google Patents


Info

Publication number
CN109508414B
Authority
CN
China
Prior art keywords
word
search
word vector
words
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811345950.2A
Other languages
Chinese (zh)
Other versions
CN109508414A (en)
Inventor
吴健君
倪嘉呈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201811345950.2A
Publication of CN109508414A
Application granted
Publication of CN109508414B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0242 Determining effectiveness of advertisements
    • G06Q30/0245 Surveys
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0277 Online advertisement

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Finance (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In the synonym mining method and device of the present application, a word vector model is used to vectorize the target search word whose synonyms are to be matched. The training samples of this model comprise, for each user, a plurality of search words corresponding to that user's historical search behaviors within at least one time window of preset duration. Because the search words belonging to the same time window are strongly related, the training samples supply context information for long-tail words when the word vector model is trained. Consequently, when synonyms of the target search word are mined using the word vector model and a word vector library obtained from it, a better synonym mining effect can be achieved for long-tail words on the basis of the context information embodied in the model and the library. In addition, no manual intervention is needed during synonym mining, so synonym mining efficiency can be effectively improved.

Description

Synonym mining method and device
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a synonym mining method and device.
Background
The synonym mining technology is an important technology in advertisement recall based on user search behavior targeting, and synonym expansion is carried out on user search words set by an advertiser by utilizing the technology, so that the advertisement recall efficiency can be improved.
At present, common synonym mining methods fall broadly into two types. The first is rule-based synonym mining, which requires a large amount of manual intervention: a synonym list is built from human prior knowledge. Although some synonym dictionaries can be used, dictionary information lags behind actual usage, and the spread of Internet language still has to be handled manually, so mining efficiency is low. The second is mining based on search-engine context, which usually relies on search logs (click logs and session logs) and computes synonyms from the co-occurrence of different search words. Here, co-occurrence means that the same URL is clicked within the same session: when searches are performed with different search words and the same URL is clicked from the search results, those search words are considered to co-occur.
The existing synonym mining methods thus each have shortcomings, and the field needs a better synonym mining scheme to better meet the synonym mining requirements of advertisement recall targeted on user search behavior.
Disclosure of Invention
In view of the above, the present invention provides a synonym mining method and device that overcome the problems in the prior art and better satisfy the synonym mining requirements of advertisement recall targeted on user search behavior.
To this end, the invention discloses the following technical solution:
a synonym mining method, comprising:
obtaining a target search term to be processed;
vectorizing the target search word by using a pre-trained word vector model to obtain a target word vector corresponding to the target search word; the word vector model is a model trained by utilizing search words corresponding to historical search behaviors of a plurality of users in advance, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search terms corresponding to historical search behaviors of each user in at least one time window with preset time length;
calculating the similarity between each word in the word vector library and the target search word based on the target word vector and the word vectors of all words in a preset word vector library; the word vector library comprises corresponding relation information of a plurality of words and word vectors, the words in the word vector library are search words corresponding to historical search behaviors of the users, and the word vectors in the word vector library are vectorization expressions obtained after vectorization processing is carried out on the search words corresponding to the historical search behaviors of the users by using the word vector model;
and selecting a preset number of words from the word vector library as synonyms of the target search words based on a preset rule.
Preferably, the above method further includes the following preprocessing process before the target search term to be processed is obtained:
obtaining search behavior information corresponding to historical search behaviors of a plurality of users, wherein the search behavior information comprises a corresponding relation between a search word and search time;
dividing the search behavior information of each user by using a time window with preset time length to obtain each search word corresponding to each user in at least one time window with the preset time length;
training a word vector model by utilizing each search word of each user in each corresponding time window;
and vectorizing each search word of each user in the corresponding time window by using the word vector model to obtain a word vector corresponding to each search word, and generating a word vector library based on the corresponding relation between each search word of each user and the corresponding word vector.
In the above method, preferably, obtaining the target search term to be processed includes:
and obtaining the search word corresponding to the current search behavior of the user as a target search word to be processed.
In the above method, preferably, calculating the similarity between each word in the word vector library and the target search word, based on the target word vector and the word vectors corresponding to the words in the predetermined word vector library, includes:
and calculating a word vector distance between the target search word and each word included in the word vector library based on the target word vector and a word vector corresponding to each word included in the word vector library by using a preset word vector distance calculation formula, wherein the word vector distance of each word represents the similarity between the target search word and each word included in the word vector library.
In the above method, preferably, the word vector distance between the target search word and each word included in the word vector library is the cosine distance or the Euclidean distance between them.
In the above method, preferably, selecting a predetermined number of words from the word vector library as synonyms of the target search word based on a predetermined rule includes:
selecting, from the word vector library, the top preset number of words ranked in descending order of similarity as synonyms of the target search word.
A synonym mining device, comprising:
the search word acquisition unit is used for acquiring a target search word to be processed;
the vectorization processing unit is used for carrying out vectorization processing on the target search word by using a pre-trained word vector model to obtain a target word vector corresponding to the target search word; the word vector model is a model trained by utilizing search words corresponding to historical search behaviors of a plurality of users in advance, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search terms corresponding to historical search behaviors of each user in at least one time window with preset time length;
a similarity calculation unit, configured to calculate, based on the target word vector and word vectors of respective words included in a predetermined word vector library, a similarity between each word in the word vector library and the target search word; the word vector library comprises corresponding relation information of a plurality of words and word vectors, the words in the word vector library are search words corresponding to historical search behaviors of the users, and the word vectors in the word vector library are vectorization expressions obtained after vectorization processing is carried out on the search words corresponding to the historical search behaviors of the users by using the word vector model;
and the synonym selecting unit is used for selecting a preset number of words from the word vector library as synonyms of the target search words based on a preset rule.
Preferably, the above apparatus further includes a preprocessing unit, configured to execute the following operations before the search term obtaining unit obtains the target search term to be processed:
obtaining search behavior information corresponding to historical search behaviors of a plurality of users, wherein the search behavior information comprises a corresponding relation between a search word and search time;
dividing the search behavior information of each user by using a time window with preset time length to obtain each search word corresponding to each user in at least one time window with the preset time length;
training a word vector model by utilizing each search word of each user in each corresponding time window;
and vectorizing each search word of each user in the corresponding time window by using the word vector model to obtain a word vector corresponding to each search word, and generating a word vector library based on the corresponding relation between each search word of each user and the corresponding word vector.
In the above apparatus, preferably, the search term obtaining unit is specifically configured to:
and obtaining the search word corresponding to the current search behavior of the user as a target search word to be processed.
In the above apparatus, preferably, the similarity calculation unit is specifically configured to:
and calculating a word vector distance between the target search word and each word included in the word vector library based on the target word vector and a word vector corresponding to each word included in the word vector library by using a preset word vector distance calculation formula, wherein the word vector distance of each word represents the similarity between the target search word and each word included in the word vector library.
In the above apparatus, preferably, the synonym selecting unit is specifically configured to:
select, from the word vector library, the top preset number of words ranked in descending order of similarity as synonyms of the target search word.
According to the above scheme, when the target search word whose synonyms are to be matched is vectorized, the training samples of the word vector model used comprise, for each user, a plurality of search words corresponding to that user's historical search behaviors within at least one time window of preset duration. The search words belonging to the same time window (usually several search words that a user issues for the same search purpose) are strongly related, so the training samples supply context information for long-tail words when the word vector model is trained. On this basis, when synonyms of the target search word are mined using the word vector model and the word vector library obtained from it, a target search word in long-tail form can achieve a better synonym mining effect thanks to the context information embodied in the word vector model and the word vector library. In addition, the method and device need no manual intervention during synonym mining, so synonym mining efficiency can be effectively improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a synonym mining method provided by an embodiment of the application;
FIG. 2 is a schematic diagram of a training process of a word vector model according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a logic principle for implementing synonym mining based on the method of the present application according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a synonym mining device according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of another synonym mining device provided in an embodiment of the present application.
Detailed Description
For ease of reference and clarity, the technical terms and abbreviations used hereinafter are explained as follows:
Long-tail words: that is, long-tail keywords. They are keywords on a website that are not target keywords but are combined keywords related to the target keywords, and that can also bring in search traffic. Long-tail keywords are characterized by their length: they often consist of 2-3 words, or even a phrase, and appear in the body of a content page as well as in its title. Their search volume is very small and unstable. Because long-tail keywords are more purposeful, visitors brought in by them are much more likely than visitors brought in by target keywords to convert into customers of the website's products.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to overcome the problems in the prior art, such as the need for manual intervention in the synonym mining process or the unsatisfactory mining effect of long-tail synonyms, and the like, so as to better meet the synonym mining requirements in the user search behavior oriented advertisement recall, the present application provides a synonym mining method and apparatus, and the method and apparatus of the present application will be described below through specific embodiments.
Referring to FIG. 1, which is a flowchart of a synonym mining method provided in an embodiment of the present application, the method in this embodiment includes the following steps:
Step 101, obtaining a target search term to be processed.
The target search word to be processed is the search word whose synonyms are currently to be matched. It may be a search word corresponding to the user's current search behavior, or a user search word set by an advertiser. The search word may be a single Chinese character (though this is uncommon), a word, a long-tail word, a short sentence, or the like.
A long-tail word is a combined term composed of several (e.g., 2-3) words, such as "brand A laptop" or "brand B cooling sports shoes".
Step 102, carrying out vectorization processing on the target search word by using a pre-trained word vector model to obtain a target word vector corresponding to the target search word; the word vector model is a model trained in advance by utilizing search words corresponding to historical search behaviors of a plurality of users, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search terms corresponding to the historical search behaviors of each user within at least one time window of preset duration.
After the target search term to be processed is obtained, in step 102, a pre-trained word vector model is used to perform vectorization processing on the target search term to obtain a target word vector corresponding to the target search term, so as to provide a basis for a subsequent synonym mining process.
The training process of the word vector model is described first, with reference to FIG. 2:
Step 201, obtaining search behavior information corresponding to historical search behaviors of a plurality of users, where the search behavior information includes a corresponding relationship between a search word and search time.
The search time is the time point at which the user performed the search operation.
Taking a given user u as an example, the historical search behaviors of user u at different time points are recorded as <q_{u1}, t_1>, …, <q_{um}, t_m>, where q_{ui} (i is a natural number, 1 ≤ i ≤ m) is the search word corresponding to the i-th historical search behavior of user u, and t_i is the time point at which that historical search behavior occurred. This step obtains such a series of historical search behavior information for each of a plurality of users.
Step 202, dividing the search behavior information of each user by using a time window with a preset time length to obtain each search word corresponding to each user in at least one time window with the preset time length.
Step 203, training a word vector model by using each search word of each user in each corresponding time window.
This step 203 can be realized by the following processing procedure:
1) Generating a window document corresponding to each time window based on the search terms corresponding to that time window, to obtain a window document set of each user, wherein the window document set of each user comprises at least one window document.
The following operations can be executed on the search behavior information of each user:
Using a time window T of predetermined duration, q_{u1}, …, q_{um} are divided by sliding the window (the sliding step is preferably T in this embodiment). The search terms corresponding to the historical search behaviors that fall within one time window T are merged into a window document, denoted d_{ui'}. The multiple time windows produced during sliding yield multiple window documents for the user, that is, the user's window document set d_{u1}, …, d_{up}, where 1 ≤ i' ≤ p and p is the number of window documents of user u.
2) And combining the window document sets of the users to obtain a search word document set corresponding to the historical search behavior of the users.
And merging the window document sets of the users to obtain a search word document set D corresponding to the historical search behavior of the users, wherein the search word document set D specifically comprises a series of window documents of the users.
3) And training a word vector model by utilizing the search word document set.
On the basis of the above steps, a series of window documents of each user included in the search term document set D are used as training samples to train a term vector model, specifically, the term vector model may be a word2vec term vector model.
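The window-document construction in sub-steps 1) and 2) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `build_window_documents`, the `(user_id, search_term, timestamp_seconds)` record layout, and the choice to anchor the first window at each user's earliest search are all assumptions made for the example. The 10-minute default mirrors the example duration mentioned in the text.

```python
from collections import defaultdict


def build_window_documents(records, window_seconds=600, step_seconds=None):
    """Group each user's search terms into per-window documents.

    records: iterable of (user_id, search_term, timestamp_seconds) tuples
             (hypothetical layout, assumed for this sketch).
    Returns the merged document set D as a list of lists of search terms.
    """
    # A step equal to the window length T gives non-overlapping windows.
    step = step_seconds or window_seconds

    by_user = defaultdict(list)
    for user_id, term, ts in records:
        by_user[user_id].append((ts, term))

    documents = []
    for user_id in sorted(by_user):
        events = sorted(by_user[user_id])   # chronological order
        start = events[0][0]                # assumption: first window starts at first search
        last = events[-1][0]
        while start <= last:
            doc = [term for ts, term in events
                   if start <= ts < start + window_seconds]
            if doc:                         # skip windows with no searches
                documents.append(doc)
            start += step
    return documents


records = [
    ("u1", "computer", 0),
    ("u1", "brand A computer", 120),
    ("u1", "brand A laptop", 540),
    ("u1", "running shoes", 4000),
    ("u2", "brand A tablet computer", 50),
]
docs = build_window_documents(records)
```

Each document is a list in which every whole search term is one token. A word2vec-style trainer would typically consume such lists as its "sentences", so that terms searched in the same window become each other's context; how the patent tokenizes terms for training is not stated, so that part is an assumption.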
Here, it should be noted that the search behavior of a single user in a continuous time period (e.g., within a time window of a certain predetermined duration, such as 10 minutes) is often driven by the same search purpose, so a series of strongly related search words is often generated in that period. For example, within a certain period a user may search not only for a type of product (e.g., "computer") but also for a certain brand of that product (e.g., "brand A computer") and for products under that brand with a specific function tendency or specific form (e.g., "brand A tablet computer", "brand A laptop"). These search words are strongly related because they share common information, namely the product name (e.g., computer).
Based on this characteristic, the application obtains the search behavior information corresponding to the historical search behaviors of a plurality of users, divides each user's search words/keywords according to a time window of preset duration, and uses the result as the training samples of the word vector model. The training samples therefore contain the context information of each search word (the context information can be understood as the other search words in the same time window as that search word). For long-tail words, this is equivalent to inputting the context information of each long-tail word during model training, which enables the trained model to embody the association between a long-tail word and its context information and thus provides a richer basis for mining synonyms of long-tail words. For a keyword-targeted advertisement recall system, the method and system of the application can therefore better improve advertisement display efficiency.
The set of historical search words in the search word document set D is recorded as Q. Once the word vector model has been trained, the vectorized expression of each historical search word in Q can be obtained from the model, thereby generating a word vector library. The word vector library includes the correspondence between each historical search word in Q and its word vector, which may be denoted as V(Q) = {(q_j, v(q_j)) | q_j ∈ Q}, where q_j is the j-th search term in Q, v(q_j) is the word vector corresponding to q_j, j is a natural number, 1 ≤ j ≤ N, and N is the number of search terms included in Q.
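As a rough illustration of how the library V(Q) might be materialized, the sketch below maps each distinct historical search term to a vector. The names `build_vector_library` and `toy_embed` are hypothetical, and `toy_embed` is only a placeholder so the example runs on its own; it is not a trained word vector model.

```python
def build_vector_library(documents, embed):
    """Map every distinct historical search term in D to its vector.

    documents: list of window documents (lists of search terms).
    embed: callable term -> sequence of floats; a stand-in for looking
           the term up in the trained word vector model (assumption).
    """
    library = {}
    for doc in documents:
        for term in doc:
            if term not in library:        # Q is a set: one entry per distinct term
                library[term] = list(embed(term))
    return library


def toy_embed(term):
    # Hypothetical placeholder, NOT a trained model:
    # encodes only the term length and the number of spaces.
    return [float(len(term)), float(term.count(" "))]


library = build_vector_library(
    [["computer", "brand A computer"], ["computer"]], toy_embed
)
```

A plain dictionary keeps the "search word to word vector" correspondence explicit; a production system would more likely store the vectors in a matrix or a nearest-neighbor index, which the source does not specify.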
The training process of the word vector model and the generation process of the word vector library can be used as the preprocessing process of the scheme of the application, and on the basis, for the current target search word to be processed, the pre-trained word vector model is used for conducting vectorization processing on the target search word in the step 102 to obtain the word vector corresponding to the target search word.
Step 103, calculating the similarity between each word in the word vector library and the target search word based on the target word vector and the word vectors of the words in the word vector library.
As described above, the words included in the word vector library are the search words (historical search words) in the set Q of historical search words. This step characterizes the similarity between the target search word and each search word in Q by a predetermined word vector distance between them.
The predetermined word vector distance may be, but is not limited to, the cosine distance or the Euclidean distance between the two words.
Taking cosine distance as an example, for a given target search word q_i whose synonyms are to be matched, suppose its vectorized expression (i.e., word vector) obtained through the above steps is v(q_i). This step traverses V(Q) and, using the word vector v(q_j) of each search word in V(Q) together with v(q_i), calculates the cosine distance between each search word in V(Q) and the target search word based on the following calculation formula:
[Formula image BDA0001863726700000091: cosine distance between v(q_i) and v(q_j); not recoverable from the text]
where n is the vector dimension of the word vectors in V(Q).
The smaller the cosine distance between the target search word and a certain word in V (Q), the higher the similarity between the target search word and the certain word in V (Q).
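The formula itself is not available in the text of this document. As an assumption, a common definition of cosine distance that agrees with the statement that a smaller distance means higher similarity is one minus the cosine of the angle between the two word vectors, with n the vector dimension and the subscript k indexing vector components:

```latex
d_{\cos}(q_i, q_j)
  = 1 - \frac{\sum_{k=1}^{n} v(q_i)_k \, v(q_j)_k}
             {\sqrt{\sum_{k=1}^{n} v(q_i)_k^{2}} \; \sqrt{\sum_{k=1}^{n} v(q_j)_k^{2}}}
```

Under this definition the distance is 0 for vectors pointing in the same direction and grows as the angle between them widens. The original formula may differ in form.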
Step 104, selecting a preset number of words from the word vector library as synonyms of the target search word based on a preset rule.
The predetermined rule may be, but is not limited to, the k-nearest neighbor principle.
After the similarity between the target search word and each search word in V(Q) has been calculated, the k words with the highest similarity may be selected as synonyms of the target search word based on the k-nearest neighbor principle. In this step, the top predetermined number (e.g., the first k) of words, ranked in descending order of similarity, may be selected from the word vector library as synonyms of the target search word. Where the similarity is characterized by cosine distance, the top predetermined number of words ranked in ascending order of cosine distance may be selected instead. FIG. 3 shows a schematic diagram of the logic principle of synonym mining based on the method of the present application.
According to the above scheme, when the target search word whose synonyms are to be matched is vectorized, the training samples of the word vector model used comprise, for each user, a plurality of search words corresponding to that user's historical search behaviors within at least one time window of preset duration. The search words belonging to the same time window (often several search words that a user issues for the same search purpose) are strongly related, so the training samples supply context information for long-tail words when the word vector model is trained. On this basis, when synonyms of the target search word are mined using the word vector model and the word vector library obtained from it, a target search word in long-tail form achieves a good synonym mining effect thanks to the context information embodied in the word vector model and the word vector library. In addition, the method and system need no manual intervention during synonym mining, so synonym mining efficiency can be effectively improved.
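Steps 103 and 104 together amount to a nearest-neighbor lookup, which can be sketched as below. This is a minimal illustration under stated assumptions: cosine distance is taken as one minus cosine similarity, the function names are hypothetical, and excluding the target term itself from the result is a choice made for the example rather than something the source specifies.

```python
import heapq
import math


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity (smaller means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: a zero vector carries no direction information
    return 1.0 - dot / (norm_a * norm_b)


def top_k_synonyms(target_vector, library, k, exclude=None):
    """Return the k library terms closest to the target, nearest first.

    library: dict mapping search term -> word vector (the library V(Q)).
    exclude: optional term to skip, e.g. the target itself (assumption).
    """
    candidates = (
        (cosine_distance(target_vector, vec), term)
        for term, vec in library.items()
        if term != exclude
    )
    # nsmallest sorts by ascending distance, i.e. descending similarity.
    return [term for _, term in heapq.nsmallest(k, candidates)]


library = {
    "laptop": [1.0, 0.1],
    "notebook computer": [0.9, 0.2],
    "sports shoes": [0.0, 1.0],
}
result = top_k_synonyms([1.0, 0.0], library, k=2)
```

The scan over every entry mirrors the traversal of V(Q) described in step 103. For a large library an approximate nearest-neighbor index would be the usual substitute, though the source does not discuss one.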
The present application further provides a synonym mining device corresponding to the above method. Referring to FIG. 4, which is a schematic structural diagram of the synonym mining device provided in an embodiment of the present application, the device includes:
a search word obtaining unit 401, configured to obtain a target search word to be processed;
a vectorization processing unit 402, configured to perform vectorization processing on the target search term by using a pre-trained term vector model, so as to obtain a target term vector corresponding to the target search term; the word vector model is a model trained by utilizing search words corresponding to historical search behaviors of a plurality of users in advance, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search terms corresponding to historical search behaviors of each user in at least one time window with preset time length;
a similarity calculation unit 403, configured to calculate a similarity between each word in the word vector library and the target search word based on the target word vector and word vectors of respective words included in a predetermined word vector library; the word vector library comprises corresponding relation information of a plurality of words and word vectors, the words in the word vector library are search words corresponding to historical search behaviors of the users, and the word vectors in the word vector library are vectorization expressions obtained after vectorization processing is carried out on the search words corresponding to the historical search behaviors of the users by using the word vector model;
a synonym selecting unit 404, configured to select a predetermined number of words from the word vector library as synonyms of the target search word based on a predetermined rule, where the similarity between each selected word and the target search word is not lower than the similarity between any unselected word in the word vector library and the target search word.
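The division into units 401 to 404 can be pictured as a small object that wires four responsibilities together. The following is a minimal sketch only, not the patented implementation: the class name `SynonymMiningDevice`, its field names, and the use of plain callables for the vectorization and similarity steps are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


@dataclass
class SynonymMiningDevice:
    """Hypothetical wiring of units 401-404; all names here are illustrative."""

    vectorize: Callable[[str], Vector]              # role of unit 402
    vector_library: Dict[str, Vector]               # word -> word vector
    similarity: Callable[[Vector, Vector], float]   # role of unit 403
    count: int                                      # how many synonyms unit 404 keeps

    def mine(self, target_search_word: str) -> List[str]:
        # Unit 401: the target search word arrives as the argument.
        target_vector = self.vectorize(target_search_word)
        # Unit 403: score every word in the library against the target.
        scored = [
            (self.similarity(target_vector, vector), word)
            for word, vector in self.vector_library.items()
        ]
        # Unit 404: keep the `count` highest-scoring words.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [word for _, word in scored[: self.count]]
```

Because the vectorization and similarity steps are injected, the same wiring could hold one combined software module or several separate ones.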
In an implementation manner of the embodiment of the present application, as shown in fig. 5, the apparatus further includes a preprocessing unit 405, configured to perform the following operations before the search word obtaining unit 401 obtains a target search word to be processed:
obtaining search behavior information corresponding to historical search behaviors of a plurality of users, wherein the search behavior information comprises a corresponding relation between a search word and search time; dividing the search behavior information of each user by using a time window with preset time length to obtain each search word corresponding to each user in at least one time window with the preset time length; training a word vector model by utilizing each search word of each user in each corresponding time window; and vectorizing each search word of each user in the corresponding time window by using the word vector model to obtain a word vector corresponding to each search word, and generating a word vector library based on the corresponding relation between each search word of each user and the corresponding word vector.
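The window-division step can be sketched as follows. This is an illustration under stated assumptions rather than the method as claimed: the record layout `(user_id, timestamp, search_word)`, the function name `split_into_windows`, and the choice to align windows to multiples of the window length are all hypothetical. Each returned list plays the role of one training "sentence" that a word vector trainer would consume.

```python
from collections import defaultdict


def split_into_windows(records, window_seconds):
    """Group (user_id, timestamp, search_word) records into per-user windows.

    Returns one list of search words per (user, window) pair, ordered by time.
    """
    buckets = defaultdict(list)
    for user_id, timestamp, word in records:
        # Assumption: windows are aligned to multiples of window_seconds.
        window_index = int(timestamp // window_seconds)
        buckets[(user_id, window_index)].append((timestamp, word))

    sequences = []
    for key in sorted(buckets):
        ordered = sorted(buckets[key])  # chronological order within the window
        sequences.append([word for _, word in ordered])
    return sequences


# Toy search log; timestamps are seconds since an arbitrary origin.
records = [
    ("u1", 10, "iphone x price"),
    ("u1", 200, "apple phone 2017 cost"),
    ("u1", 4000, "cat videos"),
    ("u2", 30, "tesla model 3"),
]
windows = split_into_windows(records, window_seconds=1800)
```

With a 30-minute window, the two phone-related searches by `u1` land in the same sequence, so each becomes context for the other, while the later unrelated search falls into a separate window.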
In an implementation manner of the embodiment of the present application, the search word obtaining unit 401 is specifically configured to: obtain the search word corresponding to the current search behavior of the user as the target search word to be processed.
In an implementation manner of the embodiment of the present application, the similarity calculation unit 403 is specifically configured to: calculate, by using a preset word vector distance calculation formula, a word vector distance between the target search word and each word included in the word vector library, based on the target word vector and the word vector corresponding to each such word, wherein the word vector distance for each word represents the similarity between the target search word and that word.
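The claims name cosine distance and Euclidean distance as possible word vector distances. A minimal pure-Python sketch of both is given below; the function names, the dictionary-based library, and the convention of negating the Euclidean distance so that a larger score always means "more similar" are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def score_library(target_vector, vector_library, metric="cosine"):
    """Return {word: score}, where a larger score means a closer word."""
    scores = {}
    for word, vector in vector_library.items():
        if metric == "cosine":
            scores[word] = cosine_similarity(target_vector, vector)
        elif metric == "euclidean":
            # Negate so that smaller distances rank higher.
            scores[word] = -euclidean_distance(target_vector, vector)
        else:
            raise ValueError(f"unknown metric: {metric}")
    return scores
```

Cosine similarity ignores vector length and compares direction only, whereas Euclidean distance is sensitive to magnitude; which one suits a given word vector model is an empirical choice.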
In an implementation manner of the embodiment of the present application, the synonym selecting unit 404 is specifically configured to: select, from the word vector library, the top preset number of words ranked in descending order of similarity as synonyms of the target search word.
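Selecting the top-ranked words can be sketched with the standard library's `heapq.nlargest`, which avoids sorting the whole library when only a few synonyms are needed. The function name `select_synonyms` and the optional `exclude` argument (used to skip the target search word itself if it appears in the library) are assumptions for illustration and are not stated in the text.

```python
import heapq


def select_synonyms(scores, count, exclude=None):
    """Return the `count` words with the highest similarity, best first.

    `scores` maps each library word to its similarity with the target word.
    """
    candidates = (
        (word, score) for word, score in scores.items() if word != exclude
    )
    top = heapq.nlargest(count, candidates, key=lambda item: item[1])
    return [word for word, _ in top]


scores = {"a": 0.9, "b": 0.4, "c": 0.7, "target": 1.0}
synonyms = select_synonyms(scores, count=2, exclude="target")
```

Every word returned this way has a similarity at least as high as any word left out, which matches the selection condition stated for unit 404.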
Since the synonym mining device disclosed in the embodiment of the present application corresponds to the synonym mining method disclosed in the above embodiment, its description is relatively brief; for relevant details, please refer to the description of the synonym mining method in the above embodiment, which is not repeated here.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
For convenience of description, the above system or apparatus is described as being divided by function into various modules or units, each described separately. Of course, when implementing the present application, the functionality of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application may be embodied, in whole or in part, in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (11)

1. A synonym mining method, comprising:
obtaining a target search term to be processed;
vectorizing the target search word by using a pre-trained word vector model to obtain a target word vector corresponding to the target search word; wherein the word vector model is a model trained in advance by using search words that correspond to historical search behaviors of a plurality of users and that comprise long-tail words and context information thereof, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search words, comprising long-tail words and context information thereof, that correspond to the historical search behaviors of that user in at least one time window of preset duration; the plurality of search words corresponding to historical search behaviors in the same time window are related search words generated for the same search purpose, and the same time window represents a continuous time period of the preset duration;
calculating the similarity between each word in the word vector library and the target search word based on the target word vector and the word vectors of all words in a preset word vector library; the word vector library comprises corresponding relation information of a plurality of words and word vectors, the words in the word vector library are search words which correspond to historical search behaviors of the users and comprise long-tail words and context information of the long-tail words, and the word vectors in the word vector library are vectorized expressions obtained after vectorization processing is carried out on the search words which correspond to the historical search behaviors of the users and comprise the long-tail words and the context information of the long-tail words by using the word vector model;
and selecting a preset number of words from the word vector library as synonyms of the target search words based on a preset rule.
2. The method according to claim 1, further comprising the following preprocessing procedure before the obtaining of the target search term to be processed:
obtaining search behavior information corresponding to historical search behaviors of a plurality of users, wherein the search behavior information comprises a corresponding relation between a search word and search time;
dividing the search behavior information of each user by using a time window with preset time length to obtain each search word corresponding to each user in at least one time window with the preset time length;
training a word vector model by utilizing each search word of each user in each corresponding time window;
and vectorizing each search word of each user in the corresponding time window by using the word vector model to obtain a word vector corresponding to each search word, and generating a word vector library based on the corresponding relation between each search word of each user and the corresponding word vector.
3. The method of claim 1, wherein obtaining the target search term to be processed comprises:
and obtaining the search word corresponding to the current search behavior of the user as a target search word to be processed.
4. The method of claim 1, wherein the calculating the similarity of each word in the word vector library to the target search word based on the target word vector and word vectors corresponding to respective words included in a predetermined word vector library comprises:
and calculating a word vector distance between the target search word and each word included in the word vector library based on the target word vector and a word vector corresponding to each word included in the word vector library by using a preset word vector distance calculation formula, wherein the word vector distance of each word represents the similarity between the target search word and each word included in the word vector library.
5. The method of claim 4, wherein the word vector distance of the target search word from each word included in the word vector library is a cosine distance or a Euclidean distance of the target search word from each word included in the word vector library.
6. The method of claim 1, wherein the selecting a predetermined number of words from the word vector library as synonyms of the target search word based on a predetermined rule comprises:
and selecting, from the word vector library, the top preset number of words ranked in descending order of similarity as synonyms of the target search word.
7. A synonym mining device, comprising:
the search word acquisition unit is used for acquiring a target search word to be processed;
the vectorization processing unit is used for carrying out vectorization processing on the target search word by using a pre-trained word vector model to obtain a target word vector corresponding to the target search word; wherein the word vector model is a model trained in advance by using search words that correspond to historical search behaviors of a plurality of users and that comprise long-tail words and context information thereof, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search words, comprising long-tail words and context information thereof, that correspond to the historical search behaviors of that user in at least one time window of preset duration; the plurality of search words corresponding to historical search behaviors in the same time window are related search words generated for the same search purpose, and the same time window represents a continuous time period of the preset duration;
a similarity calculation unit, configured to calculate, based on the target word vector and word vectors of respective words included in a predetermined word vector library, a similarity between each word in the word vector library and the target search word; the word vector library comprises corresponding relation information of a plurality of words and word vectors, the words in the word vector library are search words which correspond to historical search behaviors of the users and comprise long-tail words and context information of the long-tail words, and the word vectors in the word vector library are vectorized expressions obtained after vectorization processing is carried out on the search words which correspond to the historical search behaviors of the users and comprise the long-tail words and the context information of the long-tail words by using the word vector model;
and the synonym selecting unit is used for selecting a preset number of words from the word vector library as synonyms of the target search words based on a preset rule.
8. The apparatus according to claim 7, further comprising a preprocessing unit, configured to, before the search word obtaining unit obtains the target search word to be processed, perform the following operations:
obtaining search behavior information corresponding to historical search behaviors of a plurality of users, wherein the search behavior information comprises a corresponding relation between a search word and search time;
dividing the search behavior information of each user by using a time window with preset time length to obtain each search word corresponding to each user in at least one time window with the preset time length;
training a word vector model by utilizing each search word of each user in each corresponding time window;
and vectorizing each search word of each user in the corresponding time window by using the word vector model to obtain a word vector corresponding to each search word, and generating a word vector library based on the corresponding relation between each search word of each user and the corresponding word vector.
9. The apparatus according to claim 7, wherein the search term obtaining unit is specifically configured to:
and obtaining the search word corresponding to the current search behavior of the user as a target search word to be processed.
10. The apparatus according to claim 7, wherein the similarity calculation unit is specifically configured to:
and calculating a word vector distance between the target search word and each word included in the word vector library based on the target word vector and a word vector corresponding to each word included in the word vector library by using a preset word vector distance calculation formula, wherein the word vector distance of each word represents the similarity between the target search word and each word included in the word vector library.
11. The apparatus according to claim 7, wherein the synonym selecting unit is specifically configured to:
and selecting, from the word vector library, the top preset number of words ranked in descending order of similarity as synonyms of the target search word.
CN201811345950.2A 2018-11-13 2018-11-13 Synonym mining method and device Active CN109508414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811345950.2A CN109508414B (en) 2018-11-13 2018-11-13 Synonym mining method and device

Publications (2)

Publication Number Publication Date
CN109508414A CN109508414A (en) 2019-03-22
CN109508414B true CN109508414B (en) 2021-02-09

Family

ID=65748251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811345950.2A Active CN109508414B (en) 2018-11-13 2018-11-13 Synonym mining method and device

Country Status (1)

Country Link
CN (1) CN109508414B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348010B (en) * 2019-06-21 2023-06-02 北京小米智能科技有限公司 Synonymous phrase acquisition method and apparatus
CN110263347B (en) * 2019-06-26 2024-08-20 腾讯科技(深圳)有限公司 Synonym construction method and related device
CN110795612A (en) * 2019-10-28 2020-02-14 北京字节跳动网络技术有限公司 Search word recommendation method and device, electronic equipment and computer-readable storage medium
CN110889020B (en) * 2019-11-22 2022-08-23 百度在线网络技术(北京)有限公司 Site resource mining method and device and electronic equipment
CN111126048B (en) * 2019-12-25 2021-10-22 腾讯科技(深圳)有限公司 Candidate synonym determination method, device, server and storage medium
CN111460264B (en) * 2020-03-30 2023-08-01 口口相传(北京)网络技术有限公司 Training method and device for semantic similarity matching model
CN111881255B (en) * 2020-06-24 2023-10-27 百度在线网络技术(北京)有限公司 Synonymous text acquisition method and device, electronic equipment and storage medium
CN111831786A (en) * 2020-07-24 2020-10-27 刘秀萍 Full-text database accurate and efficient retrieval method for perfecting subject term
CN111950254B (en) * 2020-09-22 2023-07-25 北京百度网讯科技有限公司 Word feature extraction method, device and equipment for searching samples and storage medium
CN112115342B (en) * 2020-09-22 2024-07-16 深圳市欢太科技有限公司 Searching method, searching device, storage medium and terminal
CN113204622A (en) * 2021-05-25 2021-08-03 广州三星通信技术研究有限公司 Electronic device and information processing method thereof
CN113239183B (en) * 2021-05-28 2024-08-02 北京达佳互联信息技术有限公司 Training method and device for ranking model, electronic equipment and storage medium
CN113988056A (en) * 2021-11-08 2022-01-28 阿里巴巴(中国)有限公司 Synonym obtaining method and device
CN113821646A (en) * 2021-11-19 2021-12-21 达而观科技(北京)有限公司 Intelligent patent similarity searching method and device based on semantic retrieval
CN116340469B (en) * 2023-05-29 2023-08-11 之江实验室 Synonym mining method and device, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101981571A (en) * 2008-01-30 2011-02-23 谷歌公司 Providing content using stored query information
CN102346778A (en) * 2011-10-11 2012-02-08 北京百度网讯科技有限公司 Method and equipment for providing searching result
CN106663104A (en) * 2014-06-17 2017-05-10 微软技术许可有限责任公司 Learning and using contextual content retrieval rules for query disambiguation
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045781B (en) * 2015-08-27 2020-06-23 广州神马移动信息科技有限公司 Query term similarity calculation method and device and query term search method and device
US9984068B2 (en) * 2015-09-18 2018-05-29 Mcafee, Llc Systems and methods for multilingual document filtering
CN106547732A (en) * 2016-10-14 2017-03-29 深圳中兴网信科技有限公司 Near synonym recognition methodss and near synonym identifying system
CN106844571B (en) * 2017-01-03 2020-04-07 北京齐尔布莱特科技有限公司 Method and device for identifying synonyms and computing equipment
CN107451126B (en) * 2017-08-21 2020-07-28 广州多益网络股份有限公司 Method and system for screening similar meaning words

Also Published As

Publication number Publication date
CN109508414A (en) 2019-03-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant