CN109508414B - Synonym mining method and device

Info

Publication number: CN109508414B
Application number: CN201811345950.2A
Authority: CN
Inventors: 吴健君; 倪嘉呈
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2018-11-13
Filing date: 2018-11-13
Publication date: 2021-02-09
Anticipated expiration: 2038-11-13
Also published as: CN109508414A

Abstract

In the synonym mining method and device of the present application, the target search word whose synonyms are to be matched is vectorized with a word vector model whose training samples comprise, for each user, a plurality of search words corresponding to that user's historical search behaviors within at least one time window of preset duration. Search words belonging to the same time window are strongly related, so the training samples supply context information for long-tail words when the word vector model is trained. On this basis, when synonyms of the target search word are mined using the word vector model and a word vector library obtained from it, long-tail words can achieve a better synonym mining effect thanks to the context information embodied in the model and the library. The method and device require no manual intervention during synonym mining, so synonym mining efficiency can be effectively improved.

Description

Synonym mining method and device

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a synonym mining method and device.

Background

Synonym mining is an important technology in advertisement recall targeted on user search behavior. Using it to expand the user search words set by an advertiser with synonyms can improve advertisement recall efficiency.

At present, common synonym mining methods fall into two types. The first is rule-based synonym mining, which requires a large amount of manual intervention: a synonym list is built from human prior knowledge. Although some synonym dictionaries can be used, their information lags behind, and newly spreading internet language still has to be handled manually, so mining efficiency is low. The second is mining based on search engine context, which usually relies on click logs and session logs (i.e. search logs) and computes synonyms from the co-occurrence of different search words. Co-occurrence here means clicking the same url within the same session: when searches are performed with different search words and the same url is clicked in the search results, those search words are considered to co-occur.

The existing synonym mining methods therefore each have their own defects, and the field needs a better synonym mining scheme to better meet the synonym mining requirements of advertisement recall targeted on user search behavior.

Disclosure of Invention

In view of the above, the present invention provides a synonym mining method and device, so as to overcome the problems in the prior art and better satisfy the synonym mining requirements of advertisement recall targeted on user search behavior.

To this end, the invention discloses the following technical solution:

a synonym mining method, comprising:

obtaining a target search term to be processed;

vectorizing the target search word by using a pre-trained word vector model to obtain a target word vector corresponding to the target search word; the word vector model is a model trained by utilizing search words corresponding to historical search behaviors of a plurality of users in advance, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search terms corresponding to historical search behaviors of each user in at least one time window with preset time length;

calculating the similarity between each word in the word vector library and the target search word based on the target word vector and the word vectors of all words in a preset word vector library; the word vector library comprises corresponding relation information of a plurality of words and word vectors, the words in the word vector library are search words corresponding to historical search behaviors of the users, and the word vectors in the word vector library are vectorization expressions obtained after vectorization processing is carried out on the search words corresponding to the historical search behaviors of the users by using the word vector model;

and selecting a preset number of words from the word vector library as synonyms of the target search words based on a preset rule.

Preferably, the above method further includes the following preprocessing before obtaining the target search term to be processed:

obtaining search behavior information corresponding to historical search behaviors of a plurality of users, wherein the search behavior information comprises a corresponding relation between a search word and search time;

dividing the search behavior information of each user by using a time window with preset time length to obtain each search word corresponding to each user in at least one time window with the preset time length;

training a word vector model by utilizing each search word of each user in each corresponding time window;

and vectorizing each search word of each user in the corresponding time window by using the word vector model to obtain a word vector corresponding to each search word, and generating a word vector library based on the corresponding relation between each search word of each user and the corresponding word vector.

Preferably, in the above method, obtaining the target search term to be processed includes:

obtaining the search word corresponding to the current search behavior of the user as the target search word to be processed.

Preferably, the calculating, based on the target word vector and the word vectors corresponding to the words in the predetermined word vector library, a similarity between each word in the word vector library and the target search word includes:

calculating, by using a preset word vector distance calculation formula, a word vector distance between the target search word and each word included in the word vector library based on the target word vector and the word vector corresponding to each word included in the word vector library, wherein the word vector distance of each word represents the similarity between the target search word and that word.

Preferably, in the above method, the word vector distance between the target search word and each word included in the word vector library is the cosine distance or the Euclidean distance between the target search word and that word.

Preferably, the selecting a predetermined number of words from the word vector library as synonyms of the target search word based on a predetermined rule includes:

selecting, in descending order of similarity, the top preset number of words from the word vector library as synonyms of the target search word.

A synonym mining device, comprising:

the search word acquisition unit is used for acquiring a target search word to be processed;

the vectorization processing unit is used for carrying out vectorization processing on the target search word by using a pre-trained word vector model to obtain a target word vector corresponding to the target search word; the word vector model is a model trained by utilizing search words corresponding to historical search behaviors of a plurality of users in advance, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search terms corresponding to historical search behaviors of each user in at least one time window with preset time length;

a similarity calculation unit, configured to calculate, based on the target word vector and word vectors of respective words included in a predetermined word vector library, a similarity between each word in the word vector library and the target search word; the word vector library comprises corresponding relation information of a plurality of words and word vectors, the words in the word vector library are search words corresponding to historical search behaviors of the users, and the word vectors in the word vector library are vectorization expressions obtained after vectorization processing is carried out on the search words corresponding to the historical search behaviors of the users by using the word vector model;

and the synonym selecting unit is used for selecting a preset number of words from the word vector library as synonyms of the target search words based on a preset rule.

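Preferably, the above apparatus further includes a preprocessing unit, configured to perform the following operations before the search term obtaining unit obtains the target search term to be processed:

obtaining search behavior information corresponding to historical search behaviors of a plurality of users, wherein the search behavior information comprises a corresponding relation between a search word and search time; dividing the search behavior information of each user by using a time window with preset time length to obtain each search word corresponding to each user in at least one time window with the preset time length; training a word vector model by utilizing each search word of each user in each corresponding time window; and vectorizing each search word of each user in the corresponding time window by using the word vector model to obtain a word vector corresponding to each search word, and generating a word vector library based on the corresponding relation between each search word of each user and the corresponding word vector.

Preferably, in the above apparatus, the search term obtaining unit is specifically configured to:

obtain the search word corresponding to the current search behavior of the user as the target search word to be processed.

Preferably, in the above apparatus, the similarity calculation unit is specifically configured to:

calculate, by using a preset word vector distance calculation formula, a word vector distance between the target search word and each word included in the word vector library based on the target word vector and the word vector corresponding to each word included in the word vector library, wherein the word vector distance of each word represents the similarity between the target search word and that word.

Preferably, in the above apparatus, the synonym selecting unit is specifically configured to:

select, in descending order of similarity, the top preset number of words from the word vector library as synonyms of the target search word.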

According to the above scheme, when the target search word whose synonyms are to be matched is vectorized, the training samples of the word vector model used comprise, for each user, a plurality of search words corresponding to that user's historical search behaviors within at least one time window of preset duration. Search words belonging to the same time window (usually several search words generated by the user for the same search purpose) are strongly related, so the training samples supply context information for long-tail words when the word vector model is trained. On this basis, when synonyms of the target search word are mined using the word vector model and the word vector library obtained from it, a target search word in long-tail form can achieve a better synonym mining effect thanks to the context information embodied in the model and the library. The method and device require no manual intervention during synonym mining, so synonym mining efficiency can be effectively improved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a flow chart of a synonym mining method provided by an embodiment of the application;

FIG. 2 is a schematic diagram of a training process of a word vector model according to an embodiment of the present application;

FIG. 3 is a schematic diagram illustrating a logic principle for implementing synonym mining based on the method of the present application according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a synonym mining device according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of another synonym mining device provided in an embodiment of the present application.

Detailed Description

For ease of reference and clarity, the technical terms and abbreviations used hereinafter are summarized as follows:

Long-tail words: also called long-tail keywords, these are keywords on a website that are not target keywords but are combined keywords related to the target keywords and can also bring search traffic. Long-tail keywords are characteristically long, often consisting of 2-3 words or even a phrase, and appear in the body of a content page as well as in its title. Their search volume is very small and unstable. Because long-tail keywords are more purposeful, visitors brought by them are much more likely than visitors brought by target keywords to convert into customers of the website's products.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to overcome the problems in the prior art, such as the need for manual intervention in the synonym mining process or the unsatisfactory mining effect of long-tail synonyms, and the like, so as to better meet the synonym mining requirements in the user search behavior oriented advertisement recall, the present application provides a synonym mining method and apparatus, and the method and apparatus of the present application will be described below through specific embodiments.

FIG. 1 is a flowchart of a synonym mining method provided in an embodiment of the present application. In this embodiment, the method includes the following steps:

Step 101, obtaining a target search term to be processed.

The target search word to be processed is the target search word whose synonyms are currently to be matched. It may be a search word corresponding to the current search behavior of the user, or a user search word set by an advertiser. The search word may be a single Chinese character (although this is uncommon), a word, a long-tail word, a short sentence, or the like.

A long-tail word is a combined term composed of several (e.g. 2-3) words, such as "brand A laptop" or "brand B cooling sports shoes".

Step 102, carrying out vectorization processing on the target search word by using a pre-trained word vector model to obtain a target word vector corresponding to the target search word; the word vector model is a model trained in advance by utilizing search words corresponding to historical search behaviors of a plurality of users, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search terms corresponding to historical search behaviors of each user in at least one time window with preset time length.

After the target search term to be processed is obtained, in step 102, a pre-trained word vector model is used to perform vectorization processing on the target search term to obtain a target word vector corresponding to the target search term, so as to provide a basis for a subsequent synonym mining process.

The training process of the word vector model is described first, with reference to FIG. 2:

Step 201, obtaining search behavior information corresponding to historical search behaviors of a plurality of users, where the search behavior information includes a corresponding relationship between a search word and search time.

The search time is the point in time at which the user performed the search operation.

Taking a given user u as an example, the historical search behaviors of user u at different time points are recorded as $\langle q_{u1}, t_1\rangle, \ldots, \langle q_{um}, t_m\rangle$, where $q_{ui}$ ($i$ is a natural number, $1 \le i \le m$) is the search word corresponding to a historical search behavior of user u, and $t_i$ is the time point at which that historical search behavior occurred. This step obtains such a series of historical search behavior information for a plurality of users.

Step 202, dividing the search behavior information of each user by using a time window with a preset time length to obtain each search word corresponding to each user in at least one time window with the preset time length.

Step 203, training a word vector model by using each search word of each user in each corresponding time window.

This step 203 can be realized by the following processing procedure:

1) Generating a window document corresponding to each time window based on the search words corresponding to that time window, to obtain a window document set for each user, wherein the window document set of each user comprises at least one window document.

The following operations can be performed on the search behavior information of each user:

The search words $q_{u1}, \ldots, q_{um}$ are divided by sliding a time window T of predetermined duration (in this embodiment the sliding step is preferably also T). The search words corresponding to the historical search behaviors that occurred within one time window T are merged to generate a window document, denoted $d_{ui'}$. The multiple time windows encountered during sliding yield multiple window documents for the user, i.e. the user's window document set $d_{u1}, \ldots, d_{up}$, where $1 \le i' \le p$ and $p$ is the number of window documents of user u.

2) Combining the window document sets of the users to obtain a search word document set corresponding to the historical search behaviors of the users.

The window document sets of all users are merged into a search word document set D, which consists of the series of window documents of each user.

3) Training a word vector model by utilizing the search word document set.

On the basis of the above steps, the window documents of each user included in the search word document set D are used as training samples to train a word vector model. Specifically, the word vector model may be a word2vec word vector model.
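Sub-steps 1) and 2) above can be illustrated with a minimal Python sketch. It is not the implementation disclosed in the application: the function names `window_documents` and `build_document_set`, the use of timestamps in seconds, and the choice to anchor the first window at each user's earliest search are all assumptions made for illustration. The 600-second default mirrors the 10-minute example duration mentioned in this description.

```python
from collections import defaultdict


def window_documents(records, window_seconds=600):
    """Split one user's (search_word, timestamp) records into window documents.

    Windows are consecutive and non-overlapping (sliding step equal to the
    window length T). Anchoring the first window at the user's earliest
    search time is an assumption of this sketch.
    """
    records = sorted(records, key=lambda r: r[1])
    if not records:
        return []
    start = records[0][1]
    buckets = defaultdict(list)
    for word, ts in records:
        buckets[int((ts - start) // window_seconds)].append(word)
    # Windows without any search produce no document; keep time order.
    return [buckets[i] for i in sorted(buckets)]


def build_document_set(histories, window_seconds=600):
    """Merge the window document sets of all users into one document set D."""
    document_set = []
    for user_records in histories.values():
        document_set.extend(window_documents(user_records, window_seconds))
    return document_set


# Hypothetical search logs: user id -> list of (search word, time in seconds).
histories = {
    "u1": [("computer", 0), ("brand A computer", 120), ("brand A laptop", 540),
           ("running shoes", 4000)],
    "u2": [("brand A tablet computer", 50)],
}
D = build_document_set(histories)
```

Each element of `D` is one window document, i.e. the list of search words one user issued within one time window.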

It should be noted here that the search behaviors of a single user within a continuous period (for example, within a time window of predetermined duration such as 10 minutes) are often driven by the same search purpose, so a series of strongly related search words is often generated in that period. For example, within a certain period a user may search not only for a type of product (e.g. "computer") but also for a particular brand of that product (e.g. "brand A computer") and for products of that brand with a specific function or form (e.g. "brand A tablet computer", "brand A laptop"). These search words are strongly related because they share common information, including the product name (e.g. "computer"). Based on this characteristic, the present application obtains the search behavior information corresponding to the historical search behaviors of a plurality of users, divides each user's search words/keywords by time windows of preset duration, and uses the result as training samples for the word vector model. The training samples therefore contain the context information of each search word, where the context information of a search word can be understood as the other search words in the same time window. For long-tail words, this is equivalent to feeding the context information of each long-tail word into model training, so the trained model can embody the association between a long-tail word and its context and provide a richer basis for mining synonyms of long-tail words. For a keyword-targeted advertisement recall system, the method and device of the present application can thus better improve advertisement display efficiency.
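To make the notion of context concrete, the following sketch lists the (search word, context word) pairs that one window document would contribute to training. It assumes each whole search word is treated as a single token and that every other search word in the same window document counts as its context. The description names word2vec but does not state the variant, tokenization or context width, so these choices and the function name `context_pairs` are illustrative assumptions only.

```python
def context_pairs(window_document):
    """Return (search_word, context_word) pairs for one window document.

    Every other search word in the same window document is treated as
    context, which matches the description's notion of context information.
    """
    pairs = []
    for i, word in enumerate(window_document):
        for j, context in enumerate(window_document):
            if i != j:
                pairs.append((word, context))
    return pairs


doc = ["computer", "brand A computer", "brand A laptop"]
pairs = context_pairs(doc)
```

Under these assumptions the long-tail word "brand A laptop" is paired with both "computer" and "brand A computer", which is the context a rarely searched term would otherwise lack.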

Let Q denote the set of historical search words in the search word document set D. Once the word vector model has been trained, the vectorized expression of each historical search word in Q can be obtained from the model, thereby generating the word vector library. The word vector library contains the correspondence between each historical search word in Q and its word vector, and may be denoted $V(Q) = \{q_j, v(q_j) \mid q_j \in Q\}$, where $q_j$ is the j-th search word in Q, $v(q_j)$ is the word vector corresponding to $q_j$, $j$ is a natural number with $1 \le j \le N$, and $N$ is the number of search words included in Q.

The training of the word vector model and the generation of the word vector library described above can serve as the preprocessing stage of the scheme of the present application. On this basis, for the current target search word to be processed, step 102 uses the pre-trained word vector model to vectorize the target search word and obtain its word vector.
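The word vector library can be pictured as a mapping from each distinct historical search word to its vector. The sketch below assumes a callable `embed` that stands in for the trained word vector model; `build_vector_library`, `embed` and the two-dimensional toy vectors are hypothetical and exist only to show the data structure.

```python
def build_vector_library(document_set, embed):
    """Map every distinct search word in the document set D to its vector.

    `embed` is a stand-in for the trained word vector model: any callable
    that takes a search word and returns its vector.
    """
    library = {}
    for document in document_set:
        for word in document:
            if word not in library:
                library[word] = embed(word)
    return library


# Toy stand-in for a trained model (made-up 2-dimensional vectors).
toy_vectors = {
    "computer": [1.0, 0.0],
    "brand A laptop": [0.9, 0.1],
    "running shoes": [0.0, 1.0],
}
D = [["computer", "brand A laptop"], ["running shoes"], ["computer"]]
library = build_vector_library(D, toy_vectors.__getitem__)
```

The keys of `library` play the role of Q and the whole mapping plays the role of V(Q).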

Step 103, calculating the similarity between each word in the word vector library and the target search word based on the target word vector and the word vectors of the words in the word vector library.

As described above, each word included in the word vector library is each search word (history search word) included in the set Q of history search words, and this step specifically represents the similarity between the target search word and each search word in Q by a predetermined word vector distance between the target search word to be processed and each search word in Q.

The predetermined word vector distance may be the cosine distance or the Euclidean distance between two words, but is not limited thereto.

Taking cosine distance as an example, for a given target search word $q_i$ whose synonyms are to be matched, let its vectorized expression (i.e. word vector) obtained through the above steps be $v(q_i)$. This step then traverses $V(Q)$ and, using the word vector $v(q_j)$ of each search word in $V(Q)$ together with $v(q_i)$, calculates the cosine distance between each search word in $V(Q)$ and the target search word based on the following calculation formula:

where $n$ is the vector dimension of the word vectors in $V(Q)$.
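The formula itself does not survive in this text. Assuming the standard definition of cosine distance as one minus cosine similarity, which is consistent with a smaller distance indicating a higher similarity, it can be written as follows, where $v(q_i)_k$ denotes the k-th component of $v(q_i)$ (a notation introduced here for illustration):

```latex
d(q_i, q_j) = 1 - \frac{\sum_{k=1}^{n} v(q_i)_k \, v(q_j)_k}
                       {\sqrt{\sum_{k=1}^{n} v(q_i)_k^{2}} \; \sqrt{\sum_{k=1}^{n} v(q_j)_k^{2}}}
```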

The smaller the cosine distance between the target search word and a certain word in V (Q), the higher the similarity between the target search word and the certain word in V (Q).
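A minimal Python sketch of this distance follows. It assumes the standard definition (one minus cosine similarity), which may differ in form from the formula of the application; the function name `cosine_distance` is illustrative.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction give a distance of 0 and orthogonal vectors give 1, so ranking by ascending distance is the same as ranking by descending similarity.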

Step 104, selecting a preset number of words from the word vector library as synonyms of the target search word based on a preset rule.

The predetermined rule may be, but is not limited to, the k-nearest neighbor principle.

After the similarity between the target search word and each search word in V(Q) has been calculated, the k words with the highest similarity may be selected as synonyms of the target search word based on the k-nearest neighbor principle. In this step, the top predetermined number (e.g. the top k) of words may be selected from the word vector library in descending order of similarity as synonyms of the target search word. Where similarity is characterized by cosine distance, the top predetermined number of words may instead be selected from the word vector library in ascending order of cosine distance. FIG. 3 shows a schematic diagram of the logic for implementing synonym mining based on the method of the present application.

According to the above scheme, when the target search word whose synonyms are to be matched is vectorized, the training samples of the word vector model used comprise, for each user, a plurality of search words corresponding to that user's historical search behaviors within at least one time window of preset duration. Search words belonging to the same time window (often several search words generated by the user for the same search purpose) are strongly related, so the training samples supply context information for long-tail words when the word vector model is trained. On this basis, when synonyms of the target search word are mined using the word vector model and the word vector library obtained from it, a target search word in long-tail form can achieve a good synonym mining effect thanks to the context information embodied in the model and the library. The method and device require no manual intervention during synonym mining, so synonym mining efficiency can be effectively improved.
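Steps 103 and 104 together amount to a nearest-neighbor lookup over the word vector library. The sketch below assumes the library is a plain dictionary from search word to vector and uses the standard cosine distance; `top_k_synonyms` and the toy vectors are hypothetical. Whether the target word itself should be excluded when it also appears in the library is not stated in the description, so the sketch does not exclude it.

```python
import heapq
import math


def cosine_distance(u, v):
    """Standard cosine distance: 1 - cosine similarity (assumed definition)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top_k_synonyms(target_vector, library, k):
    """Return the k library words closest to the target, nearest first."""
    scored = (
        (cosine_distance(target_vector, vector), word)
        for word, vector in library.items()
    )
    # Ascending cosine distance is equivalent to descending similarity.
    return [word for _, word in heapq.nsmallest(k, scored)]


# Toy word vector library with made-up 2-dimensional vectors.
library = {
    "brand A laptop": [0.9, 0.1],
    "brand A notebook": [0.8, 0.2],
    "running shoes": [0.0, 1.0],
}
synonyms = top_k_synonyms([1.0, 0.0], library, k=2)
```

A linear scan like this is the simplest reading of "traverses V(Q)"; a large library might call for an approximate nearest-neighbor index, which the description does not discuss.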

The present application further provides a synonym mining device corresponding to the above method. FIG. 4 is a schematic structural diagram of the synonym mining device provided in an embodiment of the present application, and the device includes:

a search word obtaining unit 401, configured to obtain a target search word to be processed;

a vectorization processing unit 402, configured to perform vectorization processing on the target search term by using a pre-trained term vector model, so as to obtain a target term vector corresponding to the target search term; the word vector model is a model trained by utilizing search words corresponding to historical search behaviors of a plurality of users in advance, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search terms corresponding to historical search behaviors of each user in at least one time window with preset time length;

a similarity calculation unit 403, configured to calculate a similarity between each word in the word vector library and the target search word based on the target word vector and word vectors of respective words included in a predetermined word vector library; the word vector library comprises corresponding relation information of a plurality of words and word vectors, the words in the word vector library are search words corresponding to historical search behaviors of the users, and the word vectors in the word vector library are vectorization expressions obtained after vectorization processing is carried out on the search words corresponding to the historical search behaviors of the users by using the word vector model;

a synonym selecting unit 404, configured to select a predetermined number of terms from the term vector library as synonyms of the target search term based on a predetermined rule, where a similarity between each selected term and the target search term is not lower than a similarity between any term not selected in the term vector library and the target search term.

In an implementation manner of the embodiment of the present application, as shown in fig. 5, the apparatus further includes a preprocessing unit 405, configured to perform the following operations before the search word obtaining unit 401 obtains a target search word to be processed:

obtaining search behavior information corresponding to historical search behaviors of a plurality of users, wherein the search behavior information comprises a corresponding relation between a search word and search time; dividing the search behavior information of each user by using a time window with preset time length to obtain each search word corresponding to each user in at least one time window with the preset time length; training a word vector model by utilizing each search word of each user in each corresponding time window; and vectorizing each search word of each user in the corresponding time window by using the word vector model to obtain a word vector corresponding to each search word, and generating a word vector library based on the corresponding relation between each search word of each user and the corresponding word vector.

In an implementation manner of the embodiment of the present application, the search term obtaining unit 401 is specifically configured to: obtain the search word corresponding to the current search behavior of the user as the target search word to be processed.

In an implementation manner of the embodiment of the present application, the similarity calculation unit 403 is specifically configured to: calculate, by using a preset word vector distance calculation formula, a word vector distance between the target search word and each word included in the word vector library based on the target word vector and the word vector corresponding to each word included in the word vector library, wherein the word vector distance of each word represents the similarity between the target search word and that word.

In an implementation manner of the embodiment of the present application, the synonym selecting unit 404 is specifically configured to: select, in descending order of similarity, the top preset number of words from the word vector library as synonyms of the target search word.

Because the synonym mining device disclosed in the embodiment of the present application corresponds to the synonym mining method disclosed in the above embodiment, its description is relatively brief. For the related details, please refer to the description of the synonym mining method in the above embodiment; they are not repeated here.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

For convenience of description, the above system or apparatus is described as being divided into various modules or units by function. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.

Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A synonym mining method, comprising:

obtaining a target search word to be processed;

vectorizing the target search word by using a pre-trained word vector model to obtain a target word vector corresponding to the target search word; wherein the word vector model is trained in advance using search words that correspond to the historical search behaviors of a plurality of users and that include long-tail words and their context information, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search words, including long-tail words and their context information, that correspond to the historical search behaviors of that user within at least one time window of preset duration; wherein the plurality of search words corresponding to historical search behaviors within the same time window are related search words generated for the same search purpose, and each time window represents a continuous time period of the preset duration;

calculating the similarity between each word in a predetermined word vector library and the target search word, based on the target word vector and the word vectors of the words in the word vector library; wherein the word vector library comprises correspondence information between a plurality of words and word vectors, the words in the word vector library are the search words that correspond to the historical search behaviors of the users and include long-tail words and their context information, and the word vectors in the word vector library are the vectorized representations obtained by vectorizing those search words with the word vector model;
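By way of illustration only, the time-window grouping of historical search words recited above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the log format `(user_id, timestamp, search_word)`, the function name `build_window_samples`, and the choice to start each window at the first search that falls outside the previous window are all hypothetical. Each returned list of search words would serve as one training sample for a word vector model, so that a long-tail word shares a context with the other search words of the same window.

```python
from collections import defaultdict


def build_window_samples(search_log, window_seconds):
    """Group each user's search words into consecutive fixed-length time windows.

    search_log: iterable of (user_id, unix_timestamp, search_word) tuples
                (a hypothetical log format, assumed for this sketch).
    Returns a list of search-word lists; each list is one training sample.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, word in search_log:
        by_user[user_id].append((timestamp, word))

    samples = []
    for user_id in sorted(by_user):
        window_start = None
        current = []
        for timestamp, word in sorted(by_user[user_id]):
            # Assumption: a new window opens at the first search that falls
            # outside the preset duration of the current window.
            if window_start is None or timestamp - window_start >= window_seconds:
                if current:
                    samples.append(current)
                window_start = timestamp
                current = []
            current.append(word)
        if current:
            samples.append(current)
    return samples


# Toy log: user u1 searches three related words within ten minutes,
# then an unrelated word much later.
log = [
    ("u1", 0, "space drama"),
    ("u1", 40, "interstellar"),
    ("u1", 90, "interstellar 4k online"),
    ("u1", 4000, "cooking show"),
    ("u2", 10, "nba finals"),
    ("u2", 20, "nba finals game 7 replay"),
]
samples = build_window_samples(log, window_seconds=600)
```

In this toy log, the long-tail word "interstellar 4k online" lands in the same sample as "interstellar", which is the kind of context the training samples are meant to supply.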

2. The method according to claim 1, wherein, before the obtaining of the target search word to be processed, the method further comprises the following preprocessing procedure:

3. The method of claim 1, wherein the obtaining of the target search word to be processed comprises:

4. The method of claim 1, wherein the calculating the similarity of each word in the word vector library to the target search word based on the target word vector and word vectors corresponding to respective words included in a predetermined word vector library comprises:

5. The method of claim 4, wherein the word vector distance of the target search word from each word included in the word vector library is a cosine distance or a Euclidean distance of the target search word from each word included in the word vector library.
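For illustration, the two word vector distances named in claim 5 can be computed as in the following minimal sketch. The function names are hypothetical, word vectors are assumed to be plain lists of floats of equal length, and the cosine distance is taken here as one minus the cosine similarity, which is a common convention rather than a definition given in this document.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, taken here as 1 minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line (L2) distance between word vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Under either measure, a smaller distance corresponds to a higher similarity between the target search word and the word in the word vector library.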

6. The method of claim 1, wherein the selecting a predetermined number of words from the word vector library as synonyms of the target search word based on a predetermined rule comprises:

7. A synonym mining device, comprising:

a vectorization processing unit, configured to vectorize the target search word by using a pre-trained word vector model to obtain a target word vector corresponding to the target search word; wherein the word vector model is trained in advance using search words that correspond to the historical search behaviors of a plurality of users and that include long-tail words and their context information, and the search words corresponding to the historical search behaviors of each user comprise: a plurality of search words, including long-tail words and their context information, that correspond to the historical search behaviors of that user within at least one time window of preset duration; wherein the plurality of search words corresponding to historical search behaviors within the same time window are related search words generated for the same search purpose, and each time window represents a continuous time period of the preset duration;

a similarity calculation unit, configured to calculate, based on the target word vector and the word vectors of the words in a predetermined word vector library, a similarity between each word in the word vector library and the target search word; wherein the word vector library comprises correspondence information between a plurality of words and word vectors, the words in the word vector library are the search words that correspond to the historical search behaviors of the users and include long-tail words and their context information, and the word vectors in the word vector library are the vectorized representations obtained by vectorizing those search words with the word vector model;
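As a non-limiting illustration, the similarity calculation unit, followed by a selection of a predetermined number of words, might be sketched as below. Everything here is an assumption made for the sketch: the word vector library is modeled as a dictionary from word to vector, the similarity is cosine similarity, the selection rule is "highest similarity first", and the toy vectors and the names `score_library` and `select_synonyms` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_library(target_vector, vector_library):
    """Similarity between the target word vector and every word in the library."""
    return {
        word: cosine_similarity(target_vector, vector)
        for word, vector in vector_library.items()
    }


def select_synonyms(target_word, scores, count):
    """Assumed rule: keep the `count` most similar words, excluding the target."""
    candidates = [(w, s) for w, s in scores.items() if w != target_word]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:count]


# Toy two-dimensional word vector library (illustrative values only).
library = {
    "interstellar": [1.0, 0.0],
    "interstellar 4k online": [0.8, 0.6],
    "cooking show": [0.0, 1.0],
}
target_vector = [1.0, 0.05]  # hypothetical vector of the target search word
scores = score_library(target_vector, library)
synonyms = select_synonyms("interstellar movie", scores, count=2)
```

With these toy values, `synonyms` lists "interstellar" first and "interstellar 4k online" second, while "cooking show" is left out.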

8. The apparatus according to claim 7, further comprising a preprocessing unit, configured to, before the search word obtaining unit obtains the target search word to be processed, perform the following operations:

9. The apparatus according to claim 7, wherein the search word obtaining unit is specifically configured to:

10. The apparatus according to claim 7, wherein the similarity calculation unit is specifically configured to:

11. The apparatus according to claim 7, wherein the synonym selecting unit is specifically configured to: