CN107688563A

CN107688563A - Synonym recognition method and recognition device

Info

Publication number: CN107688563A
Application number: CN201610641371.7A
Authority: CN
Inventors: 郑婷婷; 毕娅娜
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Communications Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Communications Co Ltd
Priority date: 2016-08-05
Filing date: 2016-08-05
Publication date: 2018-02-13
Anticipated expiration: 2036-08-05
Also published as: CN107688563B

Abstract

The invention discloses a synonym recognition method and recognition device, intended to improve the accuracy of synonym recognition and thereby improve the user's query experience. The method is as follows: for a first word segment and a second word segment belonging to the same category, an address similarity and a literal similarity between the two word segments are calculated; a comprehensive similarity between them is then calculated from the address similarity and the literal similarity; and when the comprehensive similarity is determined to be not less than a preset threshold, the first word segment and the second word segment are judged to be synonyms of each other. Because both the address similarity and the literal similarity between the two word segments are taken into account, the calculated comprehensive similarity is more accurate, and so is the synonym recognition result. Moreover, calculating the comprehensive similarity only for two word segments that belong to the same category further improves the accuracy of synonym recognition.

Description

Synonym recognition method and recognition device

Technical field

The present invention relates to the field of computer technology, and in particular to a synonym recognition method and recognition device.

Background

Synonyms include not only words whose meanings are identical or similar, but also words whose meanings are related. For example, "potato" and "potato" (two different words in the original Chinese) are synonyms with the same meaning; "strict" and "severe" are synonyms with similar meanings; and "employment" and "recruitment" are synonyms with related meanings.

In practice, in the Internet field and particularly in query search, mining synonyms is very important work: it helps in deeply understanding the query information entered by the user, enriching the query results, and providing the user with a better query experience. At present there are two main ways to obtain synonyms. In one, language experts compile a thesaurus from their accumulated knowledge of words; in the other, semantic analysis technology is used to identify the degree of correlation between words and to mine synonyms automatically. Because obtaining synonyms manually consumes substantial human and material resources and is relatively inefficient, automatic synonym recognition based on semantic analysis is becoming increasingly common.

The prior art proposes the following two automatic synonym recognition methods:

First method: after determining that the minimum edit distance between the two Chinese words to be identified is less than or equal to an edit distance threshold, judge whether the two Chinese words are synonyms by checking whether both of them exist in a preset thesaurus.

Second method: each piece of query information in the query log is first divided into words; each word is paired with the result addresses in the query log to form matching pairs of word and result address; all matching pairs are screened according to the frequency with which users query each matching pair and the number of matching pairs corresponding to each result address; the matching pairs that pass the screening are combined into a set; and, for a given result address, the words matched with that result address are looked up in the set and taken as synonyms.

Based on the above analysis, the synonym recognition methods proposed in the prior art have the following drawbacks:

(1) For the first synonym recognition method: if two words are synonyms but are not close textually, that is, the edit distance between them is large, the two synonyms may go unrecognized; if two words are not synonyms but are very close textually, that is, the edit distance between them is small, they may be wrongly recognized as synonyms. For example, the edit distance between "Chanel" and "double C" is large, but they are synonyms; conversely, the edit distance between "milk" and "milk cow" is small, but they are not synonyms. Moreover, in the Internet era, with its explosive growth of textual information, new words emerge constantly; if the synonym recognition method relies too heavily on a thesaurus compiled in advance, newly coined synonyms may go unrecognized because the thesaurus covers only a limited vocabulary.
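The first drawback can be illustrated with a small edit-distance computation. The following is a minimal sketch, not the prior-art implementation: `edit_distance` is a hypothetical helper implementing the standard Levenshtein distance, and the Chinese strings are assumed originals of the translated examples (牛奶 for "milk", 奶牛 for "milk cow", 香奈儿 and 双C for the brand-name pair).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# Assumed original strings for the translated examples.
print(edit_distance("牛奶", "奶牛"))    # non-synonyms: prints 2
print(edit_distance("香奈儿", "双C"))  # synonyms: prints 3
```

Under these assumptions the non-synonym pair is closer (distance 2) than the synonym pair (distance 3), which is the failure mode described for a purely edit-distance-based check.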

(2) For the second synonym recognition method: although this method does not rely on a thesaurus as the basis for recognition and does not use the edit distance between two words, and thus improves accuracy relative to the first method, its recognition algorithm is rather simple and provides no quantified value for measuring how similar two candidate synonyms are. The accuracy of the recognized synonyms therefore remains low, which degrades the user's query experience.

Summary of the invention

Embodiments of the present invention provide a synonym recognition method and recognition device, to solve the problem that prior-art synonym recognition methods have low recognition accuracy, which in turn affects the user's query experience.

The specific technical solutions provided by the embodiments of the present invention are as follows:

A synonym recognition method, comprising:

for a first word segment and a second word segment belonging to the same category, calculating an address similarity between the first word segment and the second word segment, wherein the address similarity characterizes the similarity between a first user-clicked query result address set corresponding to the first word segment and a second user-clicked query result address set corresponding to the second word segment;

calculating a literal similarity between the first word segment and the second word segment, wherein the literal similarity characterizes the similarity between a first character group contained in the first word segment and a second character group contained in the second word segment;

calculating a comprehensive similarity between the first word segment and the second word segment based on the address similarity and the literal similarity; and

when the comprehensive similarity is determined to be not less than a preset threshold, judging that the first word segment and the second word segment are synonyms of each other.
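The final decision step, restricted to pairs of words within one category, can be sketched as follows. This is an illustrative outline only: `find_synonym_pairs`, the `categories` mapping, the `similarity` callback and the threshold value are all hypothetical, since the text does not fix how categories are stored or which threshold is used.

```python
from itertools import combinations


def find_synonym_pairs(categories, similarity, threshold):
    """Return the pairs, within each category, whose similarity is not less than threshold.

    categories: dict mapping a category name to a list of segmented words.
    similarity: callable taking two words and returning their comprehensive similarity.
    """
    pairs = []
    for words in categories.values():
        # Only words of the same category are ever compared with each other.
        for first, second in combinations(words, 2):
            if similarity(first, second) >= threshold:  # "not less than"
                pairs.append((first, second))
    return pairs


# Toy data: scores are invented purely to exercise the function.
toy_scores = {frozenset(("KW1", "KW2")): 0.8, frozenset(("KW1", "KW3")): 0.2,
              frozenset(("KW2", "KW3")): 0.5}
toy_categories = {"shops": ["KW1", "KW2", "KW3"], "areas": ["KW4"]}
result = find_synonym_pairs(toy_categories,
                            lambda a, b: toy_scores[frozenset((a, b))],
                            threshold=0.5)
print(result)  # [('KW1', 'KW2'), ('KW2', 'KW3')]
```

A score exactly equal to the threshold is accepted, matching the "not less than" wording; words in different categories (here `KW4`) are never paired.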

Preferably, before the address similarity between the first word segment and the second word segment belonging to the same category is calculated, the method further comprises:

collecting user query logs, wherein each user query log comprises at least: the query information entered by the user, all query result addresses shown to the user based on that query information, and all query result addresses clicked by the user;

performing word segmentation on each piece of query information within a preset time range to obtain the corresponding word segments, and counting, for each word segment, all user-clicked query result addresses corresponding to it; and

generating, for each word segment, a corresponding user-clicked query result address set based on the word segment and all user-clicked query result addresses corresponding to it.

Preferably, calculating the address similarity between the first word segment and the second word segment comprises:

calculating a first query result address total based on the domain names of all user-clicked query result addresses contained in the first user-clicked query result address set and the domain names of all user-clicked query result addresses contained in the second user-clicked query result address set, wherein the first query result address total characterizes the total number of query result addresses whose domain names are identical between the first user-clicked query result address set and the second user-clicked query result address set;

calculating a second query result address total based on the number of user-clicked query result addresses contained in the first user-clicked query result address set and the number of user-clicked query result addresses contained in the second user-clicked query result address set, wherein the second query result address total characterizes the total number of query result addresses across the first user-clicked query result address set and the second user-clicked query result address set; and

calculating the address similarity between the first word segment and the second word segment based on the first query result address total and the second query result address total.
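A minimal sketch of the two totals, assuming that clicked addresses are full URLs with a scheme and that "identical domain name" means an identical host name. `host_names` and `domain_totals` are hypothetical helpers and the `example.com` URLs are made up for illustration. The text does not say whether shared domains or shared addresses are counted, so counting distinct shared host names is an assumption.

```python
from urllib.parse import urlsplit


def host_names(urls):
    """Return the set of host names found in the given URLs (URLs without a host are skipped)."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no scheme or network location
        if host:
            hosts.add(host)
    return hosts


def domain_totals(clicked_first, clicked_second):
    """Return (first_total, second_total) for two collections of clicked URLs."""
    # First total: host names that occur in both clicked sets (assumed reading).
    first_total = len(host_names(clicked_first) & host_names(clicked_second))
    # Second total: clicked addresses in the first set plus those in the second set.
    second_total = len(set(clicked_first)) + len(set(clicked_second))
    return first_total, second_total


first = ["https://flowers.example.com/a", "https://maps.example.com/b"]
second = ["https://flowers.example.com/c", "https://news.example.com/d",
          "https://maps.example.com/e"]
print(domain_totals(first, second))  # (2, 5)
```

The address similarity would then be derived from these two totals; how exactly they are combined is left to the formula of the embodiment.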

Preferably, calculating the literal similarity between the first word segment and the second word segment comprises:

counting all identical characters between the first character group and the second character group, and determining, from the identical characters counted, the identical-character total between the first word segment and the second word segment;

determining a minimum character total, that is, the smaller of the first character total and the second character total, based on the first character total of the first character group and the second character total of the second character group; and

calculating the literal similarity between the first word segment and the second word segment based on the identical-character total and the minimum character total.

Preferably, calculating the comprehensive similarity between the first word segment and the second word segment based on the address similarity and the literal similarity comprises:

determining a first constant characterizing the weight of the address similarity and a second constant characterizing the weight of the literal similarity, wherein the sum of the first constant and the second constant is 1; and

calculating the comprehensive similarity between the first word segment and the second word segment based on the address similarity and the first constant, and on the literal similarity and the second constant.

A synonym recognition device, comprising:

a first computing unit, configured to calculate, for a first word segment and a second word segment belonging to the same category, an address similarity between the first word segment and the second word segment, wherein the address similarity characterizes the similarity between a first user-clicked query result address set corresponding to the first word segment and a second user-clicked query result address set corresponding to the second word segment;

a second computing unit, configured to calculate a literal similarity between the first word segment and the second word segment, wherein the literal similarity characterizes the similarity between a first character group contained in the first word segment and a second character group contained in the second word segment;

a third computing unit, configured to calculate a comprehensive similarity between the first word segment and the second word segment based on the address similarity and the literal similarity; and

a recognition unit, configured to judge, when the comprehensive similarity is determined to be not less than a preset threshold, that the first word segment and the second word segment are synonyms of each other.

Preferably, the recognition device further comprises a collecting unit, a preprocessing unit and a set generation unit, wherein, before the first computing unit calculates the address similarity between the first word segment and the second word segment belonging to the same category:

the collecting unit is configured to collect user query logs, wherein each user query log comprises at least: the query information entered by the user, all query result addresses shown to the user based on that query information, and all query result addresses clicked by the user;

the preprocessing unit is configured to perform word segmentation on each piece of query information within a preset time range to obtain the corresponding word segments, and to count, for each word segment, all user-clicked query result addresses corresponding to it; and

the set generation unit is configured to generate, for each word segment, a corresponding user-clicked query result address set based on the word segment and all user-clicked query result addresses corresponding to it.

Preferably, when calculating the address similarity between the first word segment and the second word segment, the first computing unit is specifically configured to:

Preferably, when calculating the literal similarity between the first word segment and the second word segment, the second computing unit is specifically configured to:

Preferably, when calculating the comprehensive similarity between the first word segment and the second word segment based on the address similarity and the literal similarity, the third computing unit is specifically configured to:

The beneficial effects of the embodiments of the present invention are as follows:

In the embodiments of the present invention, whether two word segments are synonyms can be judged simply by calculating the comprehensive similarity between them. The approach is applicable to synonym recognition between any two word segments and no longer depends on a thesaurus compiled in advance, which avoids the problem that newly coined synonyms cannot be recognized because the thesaurus covers only a limited vocabulary. Moreover, because both the address similarity and the literal similarity between the two word segments are taken into account, the calculated comprehensive similarity is more accurate, which improves the accuracy of synonym recognition. Furthermore, calculating the comprehensive similarity only for two word segments belonging to the same category improves the accuracy of synonym recognition still further.

Brief description of the drawings

Fig. 1 is an overview schematic diagram of the synonym recognition method in an embodiment of the present invention;

Fig. 2 is a detailed flow diagram of the synonym recognition method in an embodiment of the present invention;

Fig. 3 is a functional structure diagram of the synonym recognition device in an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

To solve the problem that prior-art synonym recognition methods have low recognition accuracy, which in turn affects the user's query experience, in the embodiments of the present invention, for a first word segment and a second word segment belonging to the same category, the address similarity and the literal similarity between the two word segments are calculated first; the comprehensive similarity between them is then calculated from the address similarity and the literal similarity; and finally, when the comprehensive similarity is determined to be not less than a preset threshold, the first word segment and the second word segment are judged to be synonyms of each other.

The solution of the present invention is described in detail below through specific embodiments; the present invention is, of course, not limited to the following embodiments.

As shown in Fig. 1, the synonym recognition method provided by the embodiments of the present invention can be applied to, but is not limited to, a search engine server. Specifically, the flow of the synonym recognition method used by the search engine server is as follows:

Step 100: For a first word segment and a second word segment belonging to the same category, calculate the address similarity between the first word segment and the second word segment, wherein the address similarity characterizes the similarity between the first user-clicked query result address set corresponding to the first word segment and the second user-clicked query result address set corresponding to the second word segment.

In practice, before performing step 100, the search engine server may also perform, but is not limited to, the following steps:

First, the search engine server collects user query logs in real time, wherein each user query log comprises at least: the query information entered by the user, all query result addresses shown to the user based on that query information, and all query result addresses clicked by the user.

Then, the search engine server performs word segmentation on each piece of query information within a preset time range to obtain the corresponding word segments, classifies each word segment, and, for each word segment in each category, counts all user-clicked query result addresses corresponding to it.

It is worth noting that, before performing word segmentation on the query information within the preset time range, the search engine server may also preprocess each piece of query information, for example by removing additional characters and removing stop words. For example, for the query information "fresh flower shop (Zhichun Road shop)", the search engine server can remove the brackets; for the query information "the fresh flower shop of Zhichun Road", the search engine server can remove the stop word "的" ("of"); and so on. The methods for removing additional characters and stop words are the same as in the prior art and are not repeated here.
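The preprocessing just described can be sketched as below. This is a simplified illustration, not the prior-art procedure the text refers to: `preprocess`, the bracket pattern and the one-entry stop-word list are hypothetical, and the Chinese query strings are assumed originals of the translated examples.

```python
import re

# Assumption: brackets (ASCII and full-width) are the additional characters to strip.
BRACKETS = re.compile(r"[()（）\[\]【】]")
# Assumption: a one-entry stop-word list for illustration; real lists are far larger.
STOP_WORDS = ("的",)


def preprocess(query: str) -> str:
    """Strip bracket characters and stop words from a raw query string."""
    cleaned = BRACKETS.sub("", query)
    for word in STOP_WORDS:
        # Naive substring removal; a real system would remove stop words after
        # segmentation so that words merely containing the character survive.
        cleaned = cleaned.replace(word, "")
    return cleaned.strip()


print(preprocess("鲜花店（知春路店）"))  # 鲜花店知春路店
print(preprocess("知春路的鲜花店"))      # 知春路鲜花店
```

Removing stop words by plain substring replacement is only adequate for a toy example, which is why the comment in the loop flags it.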

Finally, the inquiry that search engine server is clicked on based on all users corresponding to each participle and each participle Result address, corresponding user is generated respectively and clicks on Query Result address set.

Such as：In the user journal information 1 that search engine server collects, the Query Information 1 of user's input is：Haidian The fresh flower shop of area Zhichun Road；Search engine server is to all Query Result addresses that user shows：URL (Uniform Resource Locator, URL) 1, URL 2, URL 3, URL 4 and URL 5；The inquiry knot that all users click on Fruit address is：URL 1, URL 2 and URL 4.

In the user journal information 2 that search engine server collects, the Query Information 2 of user's input is：Haidian fresh flower Shop (Zhichun Road shop)；Search engine server is to all Query Result addresses that user shows：URL 1、URL 2、URL 3、 URL 4 and URL 5；The Query Result address that all users click on is：URL 1, URL 2, URL 3 and URL 4.

Search engine server is for all Query Informations in 1 hour (i.e. in preset time range) (assuming that having：Look into Ask information 1 and Query Information 2), remove in Query Information 1 " ", obtaining corresponding Query Information 1, " Haidian District Zhichun Road is fresh Florist's shop ", and " bracket " in Query Information 2 is removed, obtain corresponding Query Information 2 " Haidian fresh flower shop Zhichun Road shop ".

The search engine server performs word segmentation on query information 1, "Haidian District Zhichun Road fresh flower shop", and obtains the word segments: Haidian District, Zhichun Road fresh flower shop. It performs word segmentation on query information 2, "Haidian fresh flower shop Zhichun Road shop", and obtains the word segments: Haidian District, fresh flower shop Zhichun Road shop. That is, the word segments obtained by the search engine server are: Haidian District, Zhichun Road fresh flower shop, and fresh flower shop Zhichun Road shop.

The search engine server classifies the 3 word segments obtained; for example, the word segment "Haidian District" is assigned to the "area" category, and the word segments "Zhichun Road fresh flower shop" and "fresh flower shop Zhichun Road shop" are assigned to the "sweets shop" category.

The following description takes only the word segments "Zhichun Road fresh flower shop" and "fresh flower shop Zhichun Road shop" in the "sweets shop" category as an example.

For the word segment "Zhichun Road fresh flower shop" in the "sweets shop" category, the search engine server counts all user-clicked query result addresses corresponding to it: URL 1, URL 2 and URL 4. For the word segment "fresh flower shop Zhichun Road shop" in the "sweets shop" category, it counts all user-clicked query result addresses corresponding to it: URL 1, URL 2, URL 3 and URL 4.

Based on the word segment "Zhichun Road fresh flower shop" (hereinafter KW1) and all user-clicked query result addresses corresponding to it (URL 1, URL 2 and URL 4), the search engine server generates user-clicked query result address set 1: { KW1, URL 1, URL 2, URL 4 }.

Based on the word segment "fresh flower shop Zhichun Road shop" (hereinafter KW2) and all user-clicked query result addresses corresponding to it (URL 1, URL 2, URL 3 and URL 4), the search engine server generates user-clicked query result address set 2: { KW2, URL 1, URL 2, URL 3, URL 4 }.
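The construction of the two sets in this example can be sketched as follows. It is a minimal illustration under stated assumptions: `build_click_sets` is a hypothetical helper, word segmentation is taken as already done, and every clicked address of a query is attributed to every word obtained from that query (which is what the example's counts imply). The sets are stored per word, so the KW label becomes the dictionary key rather than a set member.

```python
from collections import defaultdict


def build_click_sets(records):
    """Map each segmented word to the set of result addresses users clicked for it.

    records: iterable of (words, clicked_urls) pairs, one pair per query log entry.
    """
    click_sets = defaultdict(set)
    for words, clicked_urls in records:
        for word in words:
            click_sets[word].update(clicked_urls)
    return dict(click_sets)


# The two log entries of the example, already preprocessed and segmented.
records = [
    (["Haidian District", "Zhichun Road fresh flower shop"],
     ["URL 1", "URL 2", "URL 4"]),
    (["Haidian District", "fresh flower shop Zhichun Road shop"],
     ["URL 1", "URL 2", "URL 3", "URL 4"]),
]
click_sets = build_click_sets(records)
print(sorted(click_sets["Zhichun Road fresh flower shop"]))
# ['URL 1', 'URL 2', 'URL 4']
```

Using a set per word also deduplicates addresses that were clicked in several log entries.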

Preferably, among all the query result addresses that the search engine server shows to a user, some may have a low degree of association with the query information the user entered. If the similarity between two word segments were calculated from all the query result addresses provided by the search engine server, inaccurate addresses would therefore reduce its accuracy. To avoid this, in the embodiments of the present invention the address similarity between the first word segment and the second word segment is calculated from the query result addresses that users click. After the search engine server shows all query result addresses to a user, the user initiates access requests to particular addresses according to his or her own needs and expectations, so the clicked query result addresses have a higher degree of association with the query information entered, and the address similarity calculated from them is correspondingly more accurate.

Specifically, after generating the corresponding user-clicked query result address set for each word segment in each category, the search engine server, when calculating the address similarity between a first word segment and a second word segment belonging to the same category, can use, but is not limited to, the following approach:

First, the search engine server calculates a first query result address total based on the domain names of all user-clicked query result addresses contained in the first user-clicked query result address set corresponding to the first word segment and the domain names of all user-clicked query result addresses contained in the second user-clicked query result address set corresponding to the second word segment, wherein the first query result address total characterizes the total number of query result addresses whose domain names are identical between the first user-clicked query result address set and the second user-clicked query result address set.

Then, the search engine server calculates a second query result address total based on the number of user-clicked query result addresses contained in the first user-clicked query result address set and the number of user-clicked query result addresses contained in the second user-clicked query result address set, wherein the second query result address total characterizes the total number of query result addresses across the first user-clicked query result address set and the second user-clicked query result address set.

Finally, the search engine server calculates the address similarity between the first word segment and the second word segment based on the first query result address total and the second query result address total.

Specifically, when calculating the address similarity between the first word segment and the second word segment, the search engine server can use, but is not limited to, the following calculation:

SIM_clickedurl(KWi, KWi+1) = |URL(KWi) ∩ URL(KWi+1)| / |URL(KWi) ∪ URL(KWi+1)| ... formula (1)

In formula (1), SIM_clickedurl(KWi, KWi+1) characterizes the address similarity between word segment KWi and word segment KWi+1, |URL(KWi) ∩ URL(KWi+1)| characterizes the first query result address total, and |URL(KWi) ∪ URL(KWi+1)| characterizes the second query result address total.

For example, continuing the example above, from user-clicked query result address set 1 corresponding to KW1, { KW1, URL 1, URL 2, URL 4 }, and user-clicked query result address set 2 corresponding to KW2, { KW2, URL 1, URL 2, URL 3, URL 4 }, the search engine server determines that the query result addresses with identical domain names between the two sets are URL 1, URL 2 and URL 4, and therefore that the first query result address total is 3.

From the number of user-clicked query result addresses contained in { KW1, URL 1, URL 2, URL 4 }, namely 3, and the number of user-clicked query result addresses contained in { KW2, URL 1, URL 2, URL 3, URL 4 }, namely 4, the search engine server determines that the second query result address total is 3 + 4 = 7.

From the first query result address total of 3 and the second query result address total of 7, the search engine server calculates the address similarity between KW1 and KW2 as 3/7 ≈ 0.43.
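The worked example can be reproduced with a short sketch. It rests on two assumptions that the text does not state outright: that the address similarity is the ratio of the first total to the second total, and that the second total is the sum of the two set sizes (3 + 4 = 7, as in the example) rather than the size of a mathematical set union. `address_similarity` is a hypothetical name, and the sets hold only addresses, without the KW label.

```python
from fractions import Fraction


def address_similarity(clicked_first, clicked_second):
    """Ratio of shared clicked addresses to the summed sizes of both clicked sets."""
    first_set, second_set = set(clicked_first), set(clicked_second)
    first_total = len(first_set & second_set)       # addresses present in both sets
    second_total = len(first_set) + len(second_set)  # summed, as in the worked example
    if second_total == 0:
        return Fraction(0)  # no clicks at all: treat as no similarity
    return Fraction(first_total, second_total)


kw1_clicks = {"URL 1", "URL 2", "URL 4"}
kw2_clicks = {"URL 1", "URL 2", "URL 3", "URL 4"}
print(address_similarity(kw1_clicks, kw2_clicks))  # 3/7
```

Under this reading the value never exceeds 1/2, even for identical sets, so any weights and threshold applied afterwards would have to be chosen with that range in mind.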

Step 101: Calculate the literal similarity between the first word segment and the second word segment, wherein the literal similarity characterizes the similarity between the first character group contained in the first word segment and the second character group contained in the second word segment.

Specifically, when calculating the literal similarity between the first word segment and the second word segment, the search engine server can use, but is not limited to, the following approach:

First, the search engine server counts all identical characters between the first character group contained in the first word segment and the second character group contained in the second word segment, and determines, from the identical characters counted, the identical-character total between the first word segment and the second word segment.

Then, based on the first character total of the first character group and the second character total of the second character group, the search engine server determines the minimum character total, that is, the smaller of the first character total and the second character total.

Finally, the search engine server calculates the literal similarity between the first word segment and the second word segment based on the identical-character total and the minimum character total.

It is worth noting that, when calculating the literal similarity between the first word segment and the second word segment, the search engine server can use, but is not limited to, the following calculation:

SIM_typeface(KWi, KWi+1) = (|KWi| ∩ |KWi+1|) / Min(|KWi|, |KWi+1|) ... formula (2)

In formula (2), SIM_typeface(KWi, KWi+1) characterizes the literal similarity between word segment KWi and word segment KWi+1, |KWi| ∩ |KWi+1| characterizes the identical-character total between word segment KWi and word segment KWi+1, and Min(|KWi|, |KWi+1|) characterizes the minimum character total between word segment KWi and word segment KWi+1.

For example, continuing the example above, from character group 1 contained in KW1 (Zhichun Road fresh flower shop) and character group 2 contained in KW2 (fresh flower shop Zhichun Road shop), the search engine server counts all identical characters between character group 1 and character group 2 as: Zhichun Road fresh flower shop, and from these determines that the identical-character total between KW1 and KW2 is 6 (counted in characters of the original Chinese text).

From the character total of character group 1, namely 6, and the character total of character group 2, namely 7, the search engine server determines that the minimum character total is 6.

Search engine server is 6 according to identical characters sum and minimum character sum is 6, calculates KW1 and KW2 Between literal similarity be：

Step 102: Based on the address similarity and the literal similarity, calculate the comprehensive similarity between the first participle and the second participle.

Specifically, when calculating the comprehensive similarity between the first participle and the second participle, the search engine server may use, but is not limited to, the following approach:

The search engine server determines a first constant characterizing the weight of the address similarity and a second constant characterizing the weight of the literal similarity, and then, based on the address similarity and the first constant, and the literal similarity and the second constant, calculates the comprehensive similarity between the first participle and the second participle, wherein the sum of the first constant and the second constant is 1.

Preferably, when calculating the comprehensive similarity between the first participle and the second participle, the search engine server may use, but is not limited to, the following calculation:

SIM_combined(KWi, KWi+1)=α × SIM_clickedurl(KWi, KWi+1)+β × SIM_typeface(KWi, KWi+1) ... ... formula (3)

Wherein, in formula (3), SIM_combined(KWi, KWi+1) characterizes the comprehensive similarity between participle KWi and participle KWi+1, SIM_clickedurl(KWi, KWi+1) characterizes the address similarity between participle KWi and participle KWi+1, SIM_typeface(KWi, KWi+1) characterizes the literal similarity between participle KWi and participle KWi+1, α characterizes the first constant, and β characterizes the second constant.

It is worth noting that the first constant and the second constant can be configured flexibly for different application scenarios. Specifically, to raise the weight of the address similarity, the first constant can be increased; to raise the weight of the literal similarity, the second constant can be increased. Because the two constants sum to 1, increasing one necessarily decreases the other.
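As a rough illustration of formula (3), the weighted combination might be written as follows. The function name `combined_similarity` and the default value of `alpha` are assumptions made for this sketch; the second constant is derived from the first so that the two always sum to 1, as the text requires.

```python
def combined_similarity(address_sim: float, literal_sim: float,
                        alpha: float = 0.6) -> float:
    """Weighted sum of address and literal similarity, following formula (3)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha (first constant) must lie in [0, 1]")
    beta = 1.0 - alpha  # second constant; the two constants sum to 1
    return alpha * address_sim + beta * literal_sim
```

Deriving `beta` from `alpha` rather than accepting both as parameters is a design choice of this sketch: it makes it impossible to pass weights that violate the sum-to-1 constraint.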

For example, continuing the example above, assume the first constant α = 0.6 and the second constant β = 0.4.

Based on the calculated address similarity between KW1 and KW2, SIM_clickedurl(KW1, KW2) = 3 ÷ 7 ≈ 0.429, the literal similarity SIM_typeface(KW1, KW2) = 1, the first constant α = 0.6 and the second constant β = 0.4, the search engine server calculates the comprehensive similarity between KW1 and KW2 as: SIM_combined(KW1, KW2) = 0.6 × 3/7 + 0.4 × 1 ≈ 65.7%.

Step 103: When the comprehensive similarity is determined to be not less than a predetermined threshold, determine that the first participle and the second participle are synonyms of each other.

In practical applications, when the search engine server determines that the comprehensive similarity between the first participle and the second participle is not less than the predetermined threshold, it determines that the first participle and the second participle are synonyms of each other. It is worth noting that the predetermined threshold can also be set flexibly for different application scenarios.

For example, continuing the example above, assume the predetermined threshold is 60%.

After calculating that the comprehensive similarity between KW1 and KW2 is 65.7%, the search engine server determines that 65.7% is greater than the predetermined threshold of 60%, and therefore determines that KW1 and KW2 are synonyms of each other.
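The decision in step 103 reduces to a single comparison. The sketch below uses a hypothetical function name, `is_synonym_pair`, and takes the 60% threshold of the example as its default; the point it illustrates is that "not less than" makes the comparison inclusive, so a pair scoring exactly at the threshold is accepted.

```python
def is_synonym_pair(combined_sim: float, threshold: float = 0.60) -> bool:
    """Return True when the comprehensive similarity is not less than the threshold."""
    return combined_sim >= threshold


print(is_synonym_pair(0.657))  # True
```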

The above embodiment is described in further detail below using a specific application scenario. Referring to Fig. 2, in the embodiment of the present invention, the specific flow of the synonym recognition method is as follows:

Step 200: The search engine server collects user query logs in real time.

Wherein, in collected user log information 1, query information 1 entered by the user is: Haidian District Zhichun Road's fresh flower shop; all query result addresses shown to the user by the search engine server are: URL 1, URL 2, URL 3, URL 4 and URL 5; the query result addresses clicked by all users are: URL 1, URL 2 and URL 4.

In collected user log information 2, query information 2 entered by the user is: Haidian fresh flower shop (Zhichun Road shop); all query result addresses shown to the user by the search engine server are: URL 1, URL 2, URL 3, URL 4 and URL 5; the query result addresses clicked by all users are: URL 1, URL 2, URL 3 and URL 4.

Step 201: For all query information within 1 hour (assumed here to be query information 1 and query information 2), the search engine server removes " " from query information 1 to obtain the corresponding query information 1 "Haidian District Zhichun Road fresh flower shop", and removes the brackets from query information 2 to obtain the corresponding query information 2 "Haidian fresh flower shop Zhichun Road shop".
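The bracket removal in step 201 could be approximated as below. This is a minimal sketch under stated assumptions: the helper name `strip_brackets` is hypothetical, both ASCII and full-width brackets are treated as removable, and whitespace is collapsed afterwards; the patent does not specify any of these details.

```python
import re

# Assumption: both ASCII and full-width brackets should be dropped.
_BRACKETS = re.compile(r"[()（）]")


def strip_brackets(query: str) -> str:
    """Remove bracket characters and collapse any leftover whitespace."""
    without_brackets = _BRACKETS.sub("", query)
    return " ".join(without_brackets.split())


print(strip_brackets("Haidian fresh flower shop (Zhichun Road shop)"))
# Haidian fresh flower shop Zhichun Road shop
```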

Step 202: The search engine server performs word segmentation on query information 1 "Haidian District Zhichun Road fresh flower shop" and query information 2 "Haidian fresh flower shop Zhichun Road shop" respectively, and obtains the corresponding participles: Haidian District, Zhichun Road fresh flower shop, and fresh flower shop Zhichun Road shop.

Step 203: The search engine server classifies the 3 obtained participles: participle "Haidian District" is assigned to the "district class", and participle "Zhichun Road fresh flower shop" and participle "fresh flower shop Zhichun Road shop" are assigned to the "sweets shop class".

Step 204: For participle "Zhichun Road fresh flower shop" contained in the "sweets shop class", the search engine server counts the query result addresses clicked by all users corresponding to that participle as: URL 1, URL 2 and URL 4; and for participle "fresh flower shop Zhichun Road shop" contained in the "sweets shop class", it counts the query result addresses clicked by all users corresponding to that participle as: URL 1, URL 2, URL 3 and URL 4.

Step 205: Based on participle "Zhichun Road fresh flower shop" (hereinafter referred to as KW1) and the query result addresses clicked by all users corresponding to that participle (URL 1, URL 2 and URL 4), the search engine server generates user click query result address set 1: {KW1, URL 1, URL 2, URL 4}.

Step 206: Based on participle "fresh flower shop Zhichun Road shop" (hereinafter referred to as KW2) and the query result addresses clicked by all users corresponding to that participle (URL 1, URL 2, URL 3 and URL 4), the search engine server generates user click query result address set 2: {KW2, URL 1, URL 2, URL 3, URL 4}.
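Steps 204 to 206 amount to grouping clicked addresses by participle. One possible way to do this is sketched below; the record layout (a participle paired with a list of clicked addresses) and the function name `build_click_sets` are assumptions for illustration, and the labels "URL 1" to "URL 4" are the placeholders used in the example rather than real addresses.

```python
from collections import defaultdict


def build_click_sets(records):
    """Map each participle to the set of result addresses users clicked for it."""
    click_sets = defaultdict(set)
    for participle, clicked_urls in records:
        click_sets[participle].update(clicked_urls)
    return dict(click_sets)


# Hypothetical log records mirroring the worked example.
records = [
    ("KW1", ["URL 1", "URL 2", "URL 4"]),
    ("KW2", ["URL 1", "URL 2", "URL 3", "URL 4"]),
]
click_sets = build_click_sets(records)
print(sorted(click_sets["KW1"]))  # ['URL 1', 'URL 2', 'URL 4']
```

Using a set per participle means that the same address clicked by several users is stored once, which is how the example lists the clicked addresses.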

Step 207: Based on user click query result address set 1 {KW1, URL 1, URL 2, URL 4} and user click query result address set 2 {KW2, URL 1, URL 2, URL 3, URL 4}, the search engine server determines that all query result addresses with identical query result address domain names are URL 1, URL 2 and URL 4, and therefore determines that the first query result address total is 3.

Step 208: Based on the number of query result addresses clicked by all users contained in {KW1, URL 1, URL 2, URL 4}, which is 3, and the number of query result addresses clicked by all users contained in {KW2, URL 1, URL 2, URL 3, URL 4}, which is 4, the search engine server determines that the second query result address total is 3 + 4 = 7.

Step 209: Based on the first query result address total of 3 and the second query result address total of 7, the search engine server calculates the address similarity between KW1 and KW2 as: SIM_clickedurl(KW1, KW2) = 3 ÷ 7 ≈ 42.9%.
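Steps 207 to 209 can be sketched as follows. Several points here are assumptions rather than statements of the patented method: the address similarity is taken to be the first total divided by the second total (which is consistent with the 65.7% comprehensive similarity in the example), addresses are compared by the host part returned by `urllib.parse.urlparse`, each clicked address is assumed to have a distinct domain, and the `*.example` URLs and the function name `address_similarity` are made up for illustration.

```python
from urllib.parse import urlparse


def address_similarity(urls_a, urls_b):
    """First total (addresses with matching domains) over second total (all clicks)."""
    domains_a = {urlparse(url).netloc for url in urls_a}
    domains_b = {urlparse(url).netloc for url in urls_b}
    first_total = len(domains_a & domains_b)   # addresses whose domain names match
    second_total = len(urls_a) + len(urls_b)   # all clicked addresses in both sets
    return first_total / second_total if second_total else 0.0


# Hypothetical clicked addresses standing in for URL 1 to URL 4.
kw1_clicks = ["http://site1.example/a", "http://site2.example/b",
              "http://site4.example/d"]
kw2_clicks = ["http://site1.example/a", "http://site2.example/b",
              "http://site3.example/c", "http://site4.example/d"]
print(address_similarity(kw1_clicks, kw2_clicks) == 3 / 7)  # True
```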

Step 210: Based on character group 1 contained in KW1 (Zhichun Road fresh flower shop) and character group 2 contained in KW2 (fresh flower shop Zhichun Road shop), the search engine server counts all identical characters between character group 1 and character group 2 as "Zhichun Road fresh flower shop", and from these determines that the total number of identical characters between KW1 and KW2 is 6.

Step 211: Based on the character total of character group 1, which is 6, and the character total of character group 2, which is 7, the search engine server determines that the minimum character total is 6.

Step 212: Based on the identical character total of 6 and the minimum character total of 6, the search engine server calculates the literal similarity between KW1 and KW2 as: SIM_typeface(KW1, KW2) = 6 ÷ 6 = 1.

Step 213: Based on the calculated address similarity between KW1 and KW2, SIM_clickedurl(KW1, KW2) = 3 ÷ 7 ≈ 0.429, the literal similarity SIM_typeface(KW1, KW2) = 1, the first constant α = 0.6 and the second constant β = 0.4, the search engine server calculates the comprehensive similarity between KW1 and KW2 as: SIM_combined(KW1, KW2) = 0.6 × 3/7 + 0.4 × 1 ≈ 65.7%.

Step 214: The search engine server judges whether the calculated comprehensive similarity of 65.7% between KW1 and KW2 is not less than the predetermined threshold of 60%; if so, step 215 is performed; otherwise, step 216 is performed.

Step 215: The search engine server determines that KW1 and KW2 are synonyms of each other.

Step 216: The search engine server determines that KW1 and KW2 are not synonyms.

Based on the above embodiment, referring to Fig. 3, in the embodiment of the present invention, the synonym identification device comprises at least:

A first computing unit 303, configured to calculate, for a first participle and a second participle belonging to the same category, the address similarity between the first participle and the second participle; wherein the address similarity characterizes the similarity between the first user click query result address set corresponding to the first participle and the second user click query result address set corresponding to the second participle;

A second computing unit 304, configured to calculate the literal similarity between the first participle and the second participle; wherein the literal similarity characterizes the similarity between the first character group contained in the first participle and the second character group contained in the second participle;

A third computing unit 305, configured to calculate, based on the address similarity and the literal similarity, the comprehensive similarity between the first participle and the second participle;

A recognition unit 306, configured to determine, when the comprehensive similarity is determined to be not less than a predetermined threshold, that the first participle and the second participle are synonyms of each other.

Preferably, the identification device further comprises a collecting unit 300, a preprocessing unit 301 and a set generation unit 302, wherein, before the first computing unit 303 calculates, for the first participle and the second participle belonging to the same category, the address similarity between the first participle and the second participle,

The collecting unit 300 is configured to collect user query logs, wherein a user query log comprises at least: the query information entered by the user, all query result addresses shown to the user based on the query information, and the query result addresses clicked by all users;

The preprocessing unit 301 is configured to perform word segmentation on each piece of query information within a preset time range to obtain the corresponding participles, and to count, for each participle, the query result addresses clicked by all users corresponding to that participle;

The set generation unit 302 is configured to generate, based on each participle and the query result addresses clicked by all users corresponding to that participle, the corresponding user click query result address set.

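Preferably, when calculating the address similarity between the first participle and the second participle, the first computing unit 303 is specifically configured to: calculate the first query result address total from the query result address domain names contained in the first and second user click query result address sets; calculate the second query result address total from the numbers of query result addresses contained in the two sets; and calculate the address similarity from the first query result address total and the second query result address total.

Preferably, when calculating the literal similarity between the first participle and the second participle, the second computing unit 304 is specifically configured to: count all identical characters between the first character group and the second character group to determine the total number of identical characters; determine the minimum character total from the first character total and the second character total; and calculate the literal similarity from the total number of identical characters and the minimum character total.

Preferably, when calculating the comprehensive similarity between the first participle and the second participle based on the address similarity and the literal similarity, the third computing unit 305 is specifically configured to: determine a first constant characterizing the weight of the address similarity and a second constant characterizing the weight of the literal similarity, the two constants summing to 1; and calculate the comprehensive similarity from the address similarity and the first constant, and the literal similarity and the second constant.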

In summary, in the embodiment of the present invention, for a first participle and a second participle belonging to the same category, after the address similarity and the literal similarity between the first participle and the second participle are calculated, the comprehensive similarity between them is calculated from the address similarity and the literal similarity, and when the comprehensive similarity is determined to be not less than the predetermined threshold, the first participle and the second participle are determined to be synonyms of each other. In this way, whether two participles are synonyms can be determined by calculating the comprehensive similarity between them. The approach is applicable to synonym identification between any two participles and no longer depends on a pre-compiled thesaurus, which avoids the problem that newly emerging synonyms cannot be identified because the vocabulary covered by a thesaurus is limited. Moreover, considering both the address similarity and the literal similarity between the two participles makes the calculated comprehensive similarity more accurate, which in turn improves the accuracy of synonym identification. Furthermore, calculating the comprehensive similarity only for two participles belonging to the same category further improves the accuracy of synonym identification.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include these changes and variations.

Claims

1. A recognition method for synonyms, characterized by comprising:

For a first participle and a second participle belonging to the same category, calculating the address similarity between the first participle and the second participle; wherein the address similarity characterizes the similarity between a first user click query result address set corresponding to the first participle and a second user click query result address set corresponding to the second participle;

Calculating the literal similarity between the first participle and the second participle; wherein the literal similarity characterizes the similarity between a first character group contained in the first participle and a second character group contained in the second participle;

Based on the address similarity and the literal similarity, calculating the comprehensive similarity between the first participle and the second participle;

When the comprehensive similarity is determined to be not less than a predetermined threshold, determining that the first participle and the second participle are synonyms of each other.
2. The recognition method as claimed in claim 1, characterized in that, before calculating, for the first participle and the second participle belonging to the same category, the address similarity between the first participle and the second participle, the method further comprises:

Collecting user query logs, wherein a user query log comprises at least: the query information entered by the user, all query result addresses shown to the user based on the query information, and the query result addresses clicked by all users;

Performing word segmentation on each piece of query information within a preset time range to obtain the corresponding participles, and counting, for each participle, the query result addresses clicked by all users corresponding to that participle;

Based on each participle and the query result addresses clicked by all users corresponding to that participle, generating the corresponding user click query result address set.
3. The recognition method as claimed in claim 1 or 2, characterized in that calculating the address similarity between the first participle and the second participle comprises:

Based on the query result address domain names clicked by all users contained in the first user click query result address set, and the query result address domain names clicked by all users contained in the second user click query result address set, calculating a first query result address total, wherein the first query result address total characterizes the total number of query result addresses having identical query result address domain names between the first user click query result address set and the second user click query result address set;

Based on the number of query result addresses clicked by all users contained in the first user click query result address set, and the number of query result addresses clicked by all users contained in the second user click query result address set, calculating a second query result address total, wherein the second query result address total characterizes the total number of all query result addresses in the first user click query result address set and the second user click query result address set;

Based on the first query result address total and the second query result address total, calculating the address similarity between the first participle and the second participle.
4. The recognition method as claimed in claim 1 or 2, characterized in that calculating the literal similarity between the first participle and the second participle comprises:

Counting all identical characters between the first character group and the second character group and, based on the counted identical characters, determining the total number of identical characters between the first participle and the second participle;

Based on the first character total of the first character group and the second character total of the second character group, determining the minimum character total, being the smaller of the first character total and the second character total;

Based on the total number of identical characters and the minimum character total, calculating the literal similarity between the first participle and the second participle.
5. The recognition method as claimed in any one of claims 1-4, characterized in that calculating, based on the address similarity and the literal similarity, the comprehensive similarity between the first participle and the second participle comprises:

Determining a first constant characterizing the weight of the address similarity and a second constant characterizing the weight of the literal similarity, wherein the sum of the first constant and the second constant is 1;

Based on the address similarity and the first constant, and the literal similarity and the second constant, calculating the comprehensive similarity between the first participle and the second participle.
6. An identification device for synonyms, characterized by comprising:

A first computing unit, configured to calculate, for a first participle and a second participle belonging to the same category, the address similarity between the first participle and the second participle; wherein the address similarity characterizes the similarity between a first user click query result address set corresponding to the first participle and a second user click query result address set corresponding to the second participle;

A second computing unit, configured to calculate the literal similarity between the first participle and the second participle; wherein the literal similarity characterizes the similarity between a first character group contained in the first participle and a second character group contained in the second participle;

A third computing unit, configured to calculate, based on the address similarity and the literal similarity, the comprehensive similarity between the first participle and the second participle;

A recognition unit, configured to determine, when the comprehensive similarity is determined to be not less than a predetermined threshold, that the first participle and the second participle are synonyms of each other.
7. The identification device as claimed in claim 6, characterized by further comprising a collecting unit, a preprocessing unit and a set generation unit, wherein, before the first computing unit calculates, for the first participle and the second participle belonging to the same category, the address similarity between the first participle and the second participle,

The collecting unit is configured to collect user query logs, wherein a user query log comprises at least: the query information entered by the user, all query result addresses shown to the user based on the query information, and the query result addresses clicked by all users;

The preprocessing unit is configured to perform word segmentation on each piece of query information within a preset time range to obtain the corresponding participles, and to count, for each participle, the query result addresses clicked by all users corresponding to that participle;

The set generation unit is configured to generate, based on each participle and the query result addresses clicked by all users corresponding to that participle, the corresponding user click query result address set.
8. The identification device as claimed in claim 6 or 7, characterized in that, when calculating the address similarity between the first participle and the second participle, the first computing unit is specifically configured to:

Based on the query result address domain names clicked by all users contained in the first user click query result address set, and the query result address domain names clicked by all users contained in the second user click query result address set, calculate a first query result address total, wherein the first query result address total characterizes the total number of query result addresses having identical query result address domain names between the first user click query result address set and the second user click query result address set;

Based on the number of query result addresses clicked by all users contained in the first user click query result address set, and the number of query result addresses clicked by all users contained in the second user click query result address set, calculate a second query result address total, wherein the second query result address total characterizes the total number of all query result addresses in the first user click query result address set and the second user click query result address set;

Based on the first query result address total and the second query result address total, calculate the address similarity between the first participle and the second participle.
9. The identification device as claimed in claim 6 or 7, characterized in that, when calculating the literal similarity between the first participle and the second participle, the second computing unit is specifically configured to:

Count all identical characters between the first character group and the second character group and, based on the counted identical characters, determine the total number of identical characters between the first participle and the second participle;

Based on the first character total of the first character group and the second character total of the second character group, determine the minimum character total, being the smaller of the first character total and the second character total;

Based on the total number of identical characters and the minimum character total, calculate the literal similarity between the first participle and the second participle.
10. The identification device as claimed in any one of claims 6-9, characterized in that, when calculating the comprehensive similarity between the first participle and the second participle based on the address similarity and the literal similarity, the third computing unit is specifically configured to:

Determine a first constant characterizing the weight of the address similarity and a second constant characterizing the weight of the literal similarity, wherein the sum of the first constant and the second constant is 1;

Based on the address similarity and the first constant, and the literal similarity and the second constant, calculate the comprehensive similarity between the first participle and the second participle.