CN103942339A - Synonym mining method and device - Google Patents

Synonym mining method and device

Info

Publication number
CN103942339A
Authority
CN
China
Prior art keywords
synonym
word
respect
label
probability
Prior art date
Legal status
Granted
Application number
CN201410193704.5A
Other languages
Chinese (zh)
Other versions
CN103942339B (en)
Inventor
车天文
王更生
刘捷
雷大伟
Current Assignee
Shenzhen easou world Polytron Technologies Inc
Original Assignee
Shenzhen Yisou Science & Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Yisou Science & Technology Development Co Ltd filed Critical Shenzhen Yisou Science & Technology Development Co Ltd
Priority to CN201410193704.5A priority Critical patent/CN103942339B/en
Publication of CN103942339A publication Critical patent/CN103942339A/en
Application granted granted Critical
Publication of CN103942339B publication Critical patent/CN103942339B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Abstract

The invention discloses a synonym mining method comprising the steps of: extracting a similar aligned corpus; performing word segmentation on each pair of similar aligned sentences S1 and S2 to obtain word sequences S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]); for each pair of word sequences, adaptively mining S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]) and calculating the synonym probability of the words in S1 relative to the words in S2; performing an iterative computation on the synonym probability of NT1[i] relative to NT2[j]; calculating the global synonym confidence of NT1[i] relative to NT2[j] and outputting the word pairs whose confidence exceeds a preset confidence threshold as synonyms. The invention further discloses a synonym mining device. The method and device improve the accuracy of synonym mining and are easy to operate and implement.

Description

Synonym mining method and device
Technical field
The present invention relates to the field of information retrieval, and in particular to a synonym mining method and device.
Background technology
Internet search engines have become the mainstream tool for obtaining information. Existing search is still generally term-based: the user enters search terms and queries through the search engine, and the search engine returns web page results containing those terms. In practice, not every user understands how a search engine works, and differences in education, speech habits and usage mean that users often enter terms that have similar meanings but different wordings, such as two different expressions for "diarrhoea". If the search engine cannot recognize synonyms, then when a user searches for "what to do about a child's diarrhoea" using one expression, some high-quality results phrased with the other expression can never be returned.
Synonymy is a distinctive phenomenon of natural language. Synonym mining is a very important basic task in natural language processing and an extremely valuable one: it supports query substitution and rewriting in search, enriches search results, and improves the query experience. To date, the main approaches to synonym mining are the following:
1. Manual construction, generally thesauri compiled by hand based on the accumulated knowledge of linguists, such as HowNet and WordNet. However, collecting and compiling such resources consumes a great deal of manpower and material resources, and in practice such thesauri are costly to use: because they lean toward academic usage, some entries are synonymous only in particular contexts (e.g. "Mount Taishan" and "father-in-law") and cannot be applied directly.
2. Template-based mining: in encyclopedias, documents and articles of all kinds, cue phrases such as "also known as" and "also called" are used to mine similar words. The accuracy can be fairly high, but the templates are limited, the number of synonyms mined is limited, and it is hard to assign a confidence level to the mined synonym pairs.
3. Mining based on correlation probabilities between words in a corpus: synonym mining is performed by computing the correlation probability between every pair of words in the corpus, which requires pairwise computation over the whole vocabulary and is therefore very inefficient.
4. Mining from Internet search engine results: large-scale web data is used, combining user search habits and real web page text, to mine synonym pairs.
Summary of the invention
The object of the invention is to provide a synonym mining method and device, so as to overcome the poor accuracy and low efficiency of existing synonym mining.
The invention discloses a synonym mining method, wherein the method periodically performs the following steps:
Step A: extract a similar aligned corpus from the search log; suppose the corpus contains Q pairs of similar aligned sentences;
Step B: perform word segmentation on each pair of similar aligned sentences S1, S2 to obtain Q pairs of word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]);
Step C: for each pair of word sequences, adaptively mine S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]), and calculate the synonym probability of each word in S1 relative to each word in S2, finally obtaining Q synonym probability matrices S(NT1[i], NT2[j]);
Step D: based on all synonym probability matrices S(NT1[i], NT2[j]), perform an iterative computation on the synonym probability of NT1[i] relative to NT2[j];
Step E: based on all synonym probability matrices S(NT1[i], NT2[j]), calculate the global synonym confidence of NT1[i] relative to NT2[j], and output the word pairs whose confidence exceeds a preset confidence threshold as synonyms.
Preferably, Step A specifically comprises the following steps:
extract, in turn, the search terms in the search log whose usage count exceeds a preset count;
for the current search term, extract the titles of the clicked web pages among the retrieved results;
the current search term and each such title form a pair of similar aligned sentences;
all such sentence pairs form the similar aligned corpus.
Preferably, Step B further performs the following steps for each pair of word sequences:
for each word of S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]), set a label flag[i], flag[j] with initial value 0;
traverse S1(T1[1], T1[2], ..., T1[i]);
if T1[i] is a place name, set flag[i]=ADDRESS_LABEL;
if T1[i] is English, set flag[i]=ENG_LABEL;
if T1[i] is a number, set flag[i]=NUM_LABEL;
if T1[i] does not appear in S2(T2[1], T2[2], ..., T2[j]), set flag[i]=DIFF_LABEL;
after the traversal, obtain the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]);
likewise, traverse S2(T2[1], T2[2], ..., T2[j]);
if T2[j] is a place name, set flag[j]=ADDRESS_LABEL;
if T2[j] is English, set flag[j]=ENG_LABEL;
if T2[j] is a number, set flag[j]=NUM_LABEL;
if T2[j] does not appear in S1(T1[1], T1[2], ..., T1[i]), set flag[j]=DIFF_LABEL;
after the traversal, obtain the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]).
Preferably, before Step C mines the word sequences for synonyms, the following step is also performed:
delete, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels.
Preferably, Step C specifically performs the following steps for each pair of word sequences:
C1: according to the entropy principle, initialize the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j], obtaining the synonym probability matrix S(NT1[i], NT2[j]);
C2: adjust the corresponding probability values in the synonym probability matrix S(NT1[i], NT2[j]) according to the similarity of NT1[i] to NT2[j];
C3: convert the words labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) that are not in Arabic-numeral form into Arabic-numeral form;
C4: adjust the corresponding probability values in the synonym probability matrix S(NT1[i], NT2[j]) according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]).
Preferably, the synonym probabilities of NT1[i] relative to NT2[j] satisfy the following formula:
Σ_{i=1}^{|NS1|} P(NT2[j]|NT1[i]) = 1
where |NS1| is the number of words in S1(NT1[1], NT1[2], ..., NT1[i]); j = 1, 2, ..., |NS2|, and |NS2| is the number of words in S2(NT2[1], NT2[2], ..., NT2[j]).
Preferably, the step of adjusting the corresponding probability values in the synonym probability matrix S(NT1[i], NT2[j]) according to the similarity of NT1[i] to NT2[j] is specifically:
calculate the similarity of NT1[i] to NT2[j] by the following formula:
sim(NT1[i], NT2[j]) = sub(NT1[i], NT2[j]) / max(NT1[i], NT2[j])
where sub(NT1[i], NT2[j]) is the number of identical characters in NT1[i] and NT2[j];
max(NT1[i], NT2[j]) is the larger of the character counts of NT1[i] and NT2[j];
judge whether sim(NT1[i], NT2[j]) is greater than or equal to 0.5;
if sim(NT1[i], NT2[j]) is greater than or equal to 0.5, let
P1 = rP(NT2[j]|NT1[i])
where r is a preset adjustment coefficient;
in the synonym probability matrix S(NT1[i], NT2[j]), add P1 to the synonym probability of NT1[i] relative to NT2[j];
subtract P1/(|NS2|-1) from the synonym probabilities of NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j];
subtract P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to NT2[j];
add P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j].
Preferably, the step of adjusting the corresponding probability values in the synonym probability matrix S(NT1[i], NT2[j]) according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]) is specifically:
judge whether a word NT1[k] labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word NT2[h] labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]);
if they are identical,
in the synonym probability matrix S(NT1[i], NT2[j]), add P1 to the synonym probability of NT1[k] relative to NT2[h];
subtract P1/(|NS2|-1) from the synonym probabilities of NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h];
subtract P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to NT2[h];
add P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h].
Preferably, Step D comprises the following steps:
Step D1: set the number of iterations;
Step D2: calculate, by the following formula, the sum of the synonym probabilities P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] mined from the similar aligned corpus:
Pg1(NT2[j]|NT1[i]) = Σ_{s=1}^{M} P(NT2[j]|NT1[i])
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
Step D3: according to Pg1(NT2[j]|NT1[i]), calculate the global synonym probability of NT1[i] relative to NT2[j] by the following formula:
Pg(NT2[j]|NT1[i]) = Pg1(NT2[j]|NT1[i]) / Σ_{x=1}^{y} Pg1(NT2[j]|NT1[x])
where y is the number of word pairs containing NT2[j] over all synonym probability matrices S(NT1[i], NT2[j]);
Step D4: judge whether the current iteration is the last one; if so, execute Step E; otherwise, execute Step D5;
Step D5: initialize the synonym probability of NT1[i] relative to NT2[j] to the global synonym probability of NT1[i] relative to NT2[j] obtained in this iteration, and execute Step C2.
Preferably, Step E specifically comprises the following steps:
based on all synonym probability matrices S(NT1[i], NT2[j]), calculate the global synonym confidence of NT1[i] relative to NT2[j] by the following formula:
conf(NT2[j]|NT1[i]) = Pg1(NT2[j]|NT1[i]) / M
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
extract and save the contexts of the word pairs whose confidence exceeds the preset confidence threshold;
output those word pairs as synonyms, and at the same time output their synonym replacement contexts and context grades.
The invention also discloses a synonym mining device. The device comprises a similar aligned corpus extraction module, a word segmentation module, an adaptive mining module, an iteration module and a synonym pair output module, wherein:
the similar aligned corpus extraction module is used to extract a similar aligned corpus from the search log;
the word segmentation module is used to perform word segmentation on the similar aligned sentences S1, S2 to obtain word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]);
the adaptive mining module is used to adaptively mine S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]), and to calculate the synonym probability of each word in S1 relative to each word in S2, obtaining a synonym probability matrix S(NT1[i], NT2[j]);
the iteration module is used to perform an iterative computation on the synonym probability of NT1[i] relative to NT2[j];
the synonym pair output module is used to calculate the global synonym confidence of NT1[i] relative to NT2[j] and to output the word pairs whose confidence exceeds a preset confidence threshold as synonyms.
Preferably, the word segmentation module is used to set, for each word of the word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]), a label flag[i], flag[j] with initial value 0, and to traverse S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]); to set to ADDRESS_LABEL the label flag[i] of the words of S1 that are place names; to set flag[i] to ENG_LABEL for English words; to set flag[i] to NUM_LABEL for number words; to set to DIFF_LABEL the label flag[i] of the words that do not appear in S2(T2[1], T2[2], ..., T2[j]), obtaining the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]); to set to ADDRESS_LABEL the label flag[j] of the words of S2 that are place names; to set flag[j] to ENG_LABEL for English words; to set flag[j] to NUM_LABEL for number words; and to set to DIFF_LABEL the label flag[j] of the words that do not appear in S1(T1[1], T1[2], ..., T1[i]), obtaining the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]).
The adaptive mining module is used to delete, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels; to initialize, according to the entropy principle, the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; to calculate the similarity of NT1[i] to NT2[j] and adjust the probability of NT1[i] relative to NT2[j] according to that similarity; to convert the words labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) that are not in Arabic-numeral form into Arabic-numeral form; and to adjust the corresponding probabilities according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]).
The iteration module is used to store the preset number of iterations; to calculate the sum Pg1(NT2[j]|NT1[i]) of the synonym probabilities P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] mined from the similar aligned corpus; to calculate, according to Pg1(NT2[j]|NT1[i]), the global synonym probability Pg(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; and, when the current iteration is not the last one, to initialize the synonym probability of NT1[i] relative to NT2[j] to the global synonym probability of NT1[i] relative to NT2[j] obtained in this iteration.
The synonym pair output module is used to extract and save the contexts of the word pairs whose confidence exceeds the preset confidence threshold, and, when outputting a synonym pair, to also output its synonym replacement context and context grade.
The invention automatically mines synonyms from similar aligned pairs of user search queries and web page titles. The corpus can be updated periodically, the accuracy of synonym mining can be improved continuously, and the method is easy to operate and implement.
Brief description of the drawings
The accompanying drawings described here are provided for a further understanding of the invention and form a part of the invention; the schematic embodiments of the invention and their description are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
Fig. 1 is the overall flowchart of the synonym mining method of the invention;
Fig. 2 is the detailed flowchart of step S01 in Fig. 1;
Fig. 3 is the flowchart of labeling the words in a word sequence;
Fig. 4 is the detailed flowchart of step S03 in Fig. 1;
Fig. 5 is the detailed flowchart of step S04 in Fig. 1;
Fig. 6 is the detailed flowchart of step S05 in Fig. 1;
Fig. 7 is the block diagram of the synonym mining device of the invention.
Embodiment
In order to make the technical problems to be solved, the technical solutions and the beneficial effects of the invention clearer, the invention is further described below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention and are not intended to limit it.
As shown in Fig. 1, the overall flow of the synonym mining method of the invention is as follows:
Step S01: extract a similar aligned corpus from the search log; suppose the corpus contains Q pairs of similar aligned sentences.
Every search engine keeps its own retrieval log, which records in detail the terms entered by the user, the web results returned for those terms, the content that was clicked, and other details. From such data, pairs consisting of a user search term and the title of a retrieved web page are extracted. To improve accuracy, only search terms with a certain number of searches are chosen, and terms searched only once are discarded; likewise, only clicked web pages are chosen, and pages that were never clicked are discarded. The two sentences of a pair obtained in this way do not have exactly the same meaning, and their structures may differ, but their meanings are certainly related or close, so they are called similar aligned sentences. In other words, a so-called aligned sentence pair is a group of sentences that mean the same thing but are expressed differently, for example "Liu Dehua's songs", "Liu Dehua's music" and "Hua Zai's music" (Hua Zai being a nickname of Liu Dehua). These similar aligned sentences make up the similar aligned corpus.
Therefore, as shown in Fig. 2, this step specifically comprises the following steps:
S101: extract the search terms in the search log whose usage count exceeds a preset count;
S102: for the current search term, extract the titles of the clicked web pages among the retrieved results;
S103: the current search term and each such title form a pair of similar aligned sentences;
S104: all such sentence pairs form the similar aligned corpus.
These corpus entries are pairs of sentences with nearly identical meanings.
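For illustration only, the following Python sketch walks through steps S101-S104; the (query, clicked title) record shape, the field handling and the threshold value are assumptions made for readability and are not prescribed by the patent.

```python
from collections import Counter

# Illustrative sketch of S101-S104, assuming each search-log record can be reduced
# to a (query, clicked_title_or_None) tuple; format and threshold are assumptions.
def build_aligned_corpus(log_records, min_query_count=5):
    query_counts = Counter(query for query, _ in log_records)
    corpus = []  # each (query, title) entry is one pair of similar aligned sentences
    for query, clicked_title in log_records:
        if query_counts[query] <= min_query_count:
            continue  # S101: drop search terms that are used too rarely
        if not clicked_title:
            continue  # S102: keep only results that were actually clicked
        corpus.append((query, clicked_title))  # S103: one similar aligned pair
    return corpus  # S104: the similar aligned corpus
```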
Given how easy such corpora are to obtain, the invention can update the corpus periodically and then mine synonyms periodically, thereby continually updating and supplementing the thesaurus.
Step S02: perform word segmentation on each pair of similar aligned sentences S1, S2 to obtain Q pairs of word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]), and label the words in each pair of word sequences. As shown in Fig. 3, the flow of labeling the words in a word sequence specifically comprises:
S201: for each word of S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]), set a label flag[i], flag[j] with initial value 0;
S202: traverse S1(T1[1], T1[2], ..., T1[i]);
S203: if T1[i] is a place name, set flag[i]=ADDRESS_LABEL;
S204: if T1[i] is English, set flag[i]=ENG_LABEL;
S205: if T1[i] is a number, set flag[i]=NUM_LABEL;
S206: if T1[i] does not appear in S2(T2[1], T2[2], ..., T2[j]), set flag[i]=DIFF_LABEL;
S207: after the traversal, obtain the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]); likewise,
S208: traverse S2(T2[1], T2[2], ..., T2[j]);
S209: if T2[j] is a place name, set flag[j]=ADDRESS_LABEL;
S210: if T2[j] is English, set flag[j]=ENG_LABEL;
S211: if T2[j] is a number, set flag[j]=NUM_LABEL;
S212: if T2[j] does not appear in S1(T1[1], T1[2], ..., T1[i]), set flag[j]=DIFF_LABEL;
S213: after the traversal, obtain the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]).
For example, suppose S1 and S2 are "SLR photo technique" and "SLR camera shooting introduction skills"; after word segmentation, the word sequences S1 (SLR, take pictures, technology) and S2 (SLR, camera, take, introduction, skill) are obtained.
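A minimal sketch of the labeling pass S201-S213 follows; the place-name test and the simple isascii()/isdigit() checks stand in for whatever recognisers the word segmenter actually provides, and the numeric label values are arbitrary.

```python
ADDRESS_LABEL, ENG_LABEL, NUM_LABEL, DIFF_LABEL = 1, 2, 3, 4

# Sketch of S201-S213; is_place_name and the isascii/isdigit tests are illustrative
# stand-ins, not the patent's actual recognisers.
def label_words(s1, s2, is_place_name=lambda w: False):
    def label(word, other_sentence):
        if is_place_name(word):
            return ADDRESS_LABEL
        if word.isascii() and word.isalpha():
            return ENG_LABEL          # English word
        if word.isdigit():
            return NUM_LABEL          # number word
        if word not in other_sentence:
            return DIFF_LABEL         # word absent from the other sentence
        return 0                      # word occurs in both sentences
    ns1 = [(w, label(w, s2)) for w in s1]
    ns2 = [(w, label(w, s1)) for w in s2]
    return ns1, ns2
```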
Step S03: for each pair of word sequences, adaptively mine S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]), and calculate the synonym probability of each word in S1 relative to each word in S2, finally obtaining Q synonym probability matrices S(NT1[i], NT2[j]). As shown in Fig. 4, the following steps are performed for every pair of word sequences:
S301: delete, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels;
After the identical word "SLR" is deleted from S1 (SLR, take pictures, technology) and S2 (SLR, camera, take, introduction, skill), we obtain S1 (take pictures, technology) and S2 (camera, take, introduction, skill).
S302: according to the entropy principle, initialize the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j], obtaining the synonym probability matrix S(NT1[i], NT2[j]);
that is, NT1[i] is assumed to correspond to NT2[j] with equal probability, and the probability values in each column satisfy the following formula:
Σ_{i=1}^{|NS1|} P(NT2[j]|NT1[i]) = 1    (1)
where |NS1| is the number of words in S1(NT1[1], NT1[2], ..., NT1[i]); j = 1, 2, ..., |NS2|, and |NS2| is the number of words in S2(NT2[1], NT2[2], ..., NT2[j]).
For S1 (take pictures, technology) and S2 (camera, take, introduction, skill), the synonym probability matrix obtained is as follows:
Table 1
S1 \ S2       | Camera | Take | Introduction | Skill
Take pictures | 0.5    | 0.5  | 0.5          | 0.5
Technology    | 0.5    | 0.5  | 0.5          | 0.5
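The uniform initialisation of S302 can be sketched as follows; the dictionary-of-word-pairs representation of the matrix is an assumption chosen for brevity.

```python
# Sketch of S302: "entropy principle" initialisation makes every entry 1/|NS1|,
# so each column of the matrix sums to 1 as required by formula (1).
def init_matrix(ns1, ns2):
    p = 1.0 / len(ns1)
    return {(w1, w2): p for w1 in ns1 for w2 in ns2}

matrix = init_matrix(["take pictures", "technology"],
                     ["camera", "take", "introduction", "skill"])
# every entry is 0.5, reproducing Table 1
```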
S303: calculate the similarity sim(NT1[i], NT2[j]) of NT1[i] to NT2[j] in the synonym probability matrix by the following formula:
sim(NT1[i], NT2[j]) = sub(NT1[i], NT2[j]) / max(NT1[i], NT2[j])    (2)
where sub(NT1[i], NT2[j]) is the number of identical characters in NT1[i] and NT2[j], and max(NT1[i], NT2[j]) is the larger of their character counts. In Table 1, for example, "take pictures" and "take" share one character, so sub(take pictures, take) = 1; both words consist of 2 characters, so max(take pictures, take) = 2, and their similarity is sim(take pictures, take) = 1/2.
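One reading of formula (2), with sub(·) taken as the number of characters the two words share, can be sketched as:

```python
# Sketch of formula (2): character-overlap similarity between two words.
def sim(w1, w2):
    shared = len(set(w1) & set(w2))           # sub: characters the two words share
    return shared / max(len(w1), len(w2))     # divided by the longer word's length

# Two 2-character words sharing one character give sim = 0.5,
# as in the "take pictures" / "take" example above.
```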
S304: judge whether sim(NT1[i], NT2[j]) is greater than or equal to 0.5; if so, execute S305; otherwise, execute S307;
For example, in Table 1, "take pictures" and "camera" do not meet the similarity requirement, and they are not the last word pair of the matrix, so the similarity of "take pictures" and "take" is computed next; since "take pictures" and "take" meet the similarity requirement, S305 is executed;
S305: let P1 = rP(NT2[j]|NT1[i]), where r is a preset adjustment coefficient;
In Table 1, P(take|take pictures) = 0.5; taking r = 0.5 gives P1 = 0.5 * 0.5 = 0.25.
S306: in the synonym probability matrix S(NT1[i], NT2[j]), add P1 to the synonym probability of NT1[i] relative to NT2[j];
After the adjustment, P(take|take pictures) = 0.5 + 0.25 = 0.75;
Subtract P1/(|NS2|-1) from the synonym probabilities of NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j]; that is, the column varies (j changes) while the row is fixed (i is fixed): the entries of row NT1[i] in the columns other than NT2[j] are reduced by P1/(|NS2|-1);
After the adjustment, P(camera|take pictures) = P(introduction|take pictures) = P(skill|take pictures) = 0.5 - 0.25/(4-1) ≈ 0.42;
Subtract P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to NT2[j]; that is, the row varies (i changes) while the column is fixed (j is fixed): the entries of column NT2[j] in the rows other than NT1[i] are reduced by P1/(|NS1|-1);
After the adjustment, P(take|technology) = 0.5 - 0.25/(2-1) = 0.25;
Add P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j]; that is, both the row and the column vary: the entries outside row NT1[i] and outside column NT2[j] are increased by P1/(|NS1|-1)/(|NS2|-1);
After the adjustment, P(camera|technology) = P(introduction|technology) = P(skill|technology) = 0.5 + 0.25/(2-1)/(4-1) ≈ 0.58;
The probability values of each column of the adjusted synonym probability matrix S(NT1[i], NT2[j]) still obey formula (1).
The synonym probability matrix S(NT1[i], NT2[j]) after the adjustment for "take pictures" and "take" is as follows:
Table 2
S1 \ S2       | Camera | Take | Introduction | Skill
Take pictures | 0.42   | 0.75 | 0.42         | 0.42
Technology    | 0.58   | 0.25 | 0.58         | 0.58
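The adjustment of S305/S306 can be sketched as below; it matches Table 2 after rounding and keeps every column summing to 1, with the matrix again held as a dictionary of word pairs (an assumed representation).

```python
# Sketch of S305/S306: boost the matched cell by P1 = r * P(nt2|nt1) and
# redistribute the probability mass so every column still satisfies formula (1).
def adjust(matrix, ns1, ns2, nt1, nt2, r=0.5):
    p1 = r * matrix[(nt1, nt2)]
    for w1 in ns1:
        for w2 in ns2:
            if w1 == nt1 and w2 == nt2:
                matrix[(w1, w2)] += p1                                    # matched cell
            elif w1 == nt1:
                matrix[(w1, w2)] -= p1 / (len(ns2) - 1)                   # same row
            elif w2 == nt2:
                matrix[(w1, w2)] -= p1 / (len(ns1) - 1)                   # same column
            else:
                matrix[(w1, w2)] += p1 / (len(ns1) - 1) / (len(ns2) - 1)  # remaining cells

ns1 = ["take pictures", "technology"]
ns2 = ["camera", "take", "introduction", "skill"]
matrix = {(w1, w2): 0.5 for w1 in ns1 for w2 in ns2}        # Table 1
adjust(matrix, ns1, ns2, "take pictures", "take")           # yields Table 2 (rounded)
```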
S307: judge whether the current NT1[i] and NT2[j] are the last word pair of the synonym probability matrix; if so, execute step S308; otherwise, execute step S303 for the next word pair;
Since "take pictures" and "take" are not the last word pair, the similarity of the next word pair is computed, and so on until all word pairs have been traversed; the synonym probability matrix finally obtained is shown in Table 3:
Table 3
S1 \ S2       | Camera | Take | Introduction | Skill
Take pictures | 0.52   | 0.85 | 0.52         | 0.13
Technology    | 0.48   | 0.15 | 0.48         | 0.87
S308: judge whether S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) contain words labeled NUM_LABEL; if so, execute step S309; otherwise, execute S312;
S309: convert the words labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) that are not in Arabic-numeral form into Arabic-numeral form;
S310: judge whether a word NT1[k] (k = 1~i) labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word NT2[h] (h = 1~j) labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]); if so, execute S311; otherwise, the probability values in the synonym probability matrix S(NT1[i], NT2[j]) remain unchanged and step S04 is executed;
S311: in the synonym probability matrix S(NT1[i], NT2[j]), add P1 to the synonym probability of NT1[k] relative to NT2[h];
subtract P1/(|NS2|-1) from the synonym probabilities of NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h];
subtract P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to NT2[h];
add P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h];
The adjusted probabilities likewise obey formula (1).
S312: judge whether the current pair of word sequences is the last pair; if so, execute step S04; otherwise, continue with S301 for the next pair of word sequences.
Step S04: based on all synonym probability matrices S(NT1[i], NT2[j]), perform an iterative computation on the synonym probability of NT1[i] relative to NT2[j]. As shown in Fig. 5, this step specifically comprises:
S401: set the number of iterations;
S402: calculate, by the following formula, the sum Pg1(NT2[j]|NT1[i]) of all the synonym probabilities of NT1[i] relative to NT2[j] mined from the similar aligned corpus:
Pg1(NT2[j]|NT1[i]) = Σ_{s=1}^{M} P(NT2[j]|NT1[i])    (3)
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
Suppose the similar aligned corpus contains only two sentence pairs: "SLR photo technique" / "SLR camera shooting introduction skills", whose labeled word sequences are S1 (take pictures, technology) and S2 (camera, take, introduction, skill), and a second pair, "camera photo-taking" / "camera shooting skills", whose labeled word sequences are S1 (camera, take pictures) and S2 (camera, take, skill); after step S03, the probability matrices obtained are Table 3 and Table 4.
Table 4
According to Tables 3 and 4, "take pictures" corresponds to "camera", to "take" and to "skill" twice each (once in each matrix), so M = 2 for those pairs; every other word pair occurs only once, so M = 1. Formula (3) then gives:
Pg1 (camera | take pictures)=0.52+0.094=0.614;
Pg1 (camera | technology)=0.48;
Pg1 (camera | camera)=0.096;
Pg1 (take | take pictures)=0.85+0.937=1.787;
Pg1 (take | technology)=0.15;
Pg1 (take | camera)=0.063;
Pg1 (introduction | take pictures)=0.52;
Pg1 (introduction | technology)=0.48;
Pg1 (skill | take pictures)=0.13+0.469=0.599;
Pg1 (skill | technology)=0.87;
Pg1 (skill | camera)=0.531;
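A sketch of the Pg1 accumulation of formula (3) over all Q matrices follows; the dictionary representation and the function name are illustrative, not the patent's API.

```python
from collections import defaultdict

# Sketch of formula (3): sum, over all Q matrices produced by step S03, the
# probability of every occurrence of a word pair, and count the occurrences (M).
def aggregate(matrices):
    pg1 = defaultdict(float)
    m = defaultdict(int)
    for matrix in matrices:
        for (nt1, nt2), p in matrix.items():
            pg1[(nt1, nt2)] += p      # running sum of P(nt2 | nt1)
            m[(nt1, nt2)] += 1        # M: number of times the pair was mined
    return pg1, m
```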
S403: according to Pg1(NT2[j]|NT1[i]), calculate the global synonym probability Pg(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] by the following formula (4):
Pg(NT2[j]|NT1[i]) = Pg1(NT2[j]|NT1[i]) / Σ_{x=1}^{y} Pg1(NT2[j]|NT1[x])    (4)
where y is the number of word pairs containing NT2[j] over all synonym probability matrices S(NT1[i], NT2[j]).
For example: Pg (camera | take pictures)=Pg1 (camera | take pictures)/(Pg1 (camera | take pictures)+Pg1 (camera | technology)+Pg1 (camera | camera))=0.614/ (0.614+0.48+0.096)=0.52;
Pg (camera | technology)=Pg1 (camera | technology)/(Pg1 (camera | take pictures)+Pg1 (camera | technology)+Pg1 (camera | camera))=0.48/ (0.614+0.48+0.096)=0.4;
Pg (camera | camera)=Pg1 (camera | camera)/(Pg1 (camera | take pictures)+Pg1 (camera | technology)+Pg1 (camera | camera))=0.096/ (0.614+0.48+0.096)=0.08;
Pg (take | take pictures)=Pg1 (take | take pictures)/(Pg1 (take | take pictures)+Pg1 (take | technology)+Pg1 (take | camera))=1.787/ (1.787+0.15+0.063)=0.89;
Pg (take | technology)=Pg1 (take | technology)/(Pg1 (take | take pictures)+Pg1 (take | technology)+Pg1 (take | camera))=0.15/ (1.787+0.15+0.063)=0.08;
Pg (take | camera)=Pg1 (take | camera)/(Pg1 (take | take pictures)+Pg1 (take | technology)+Pg1 (take | camera))=0.063/ (1.787+0.15+0.063)=0.03;
Pg (introduction | take pictures)=Pg1 (introduction | take pictures)/(Pg1 (introduction | take pictures)+Pg1 (introduction | technology))=0.52/ (0.52+0.48)=0.52;
Pg (introduction | technology)=Pg1 (introduction | technology)/(Pg1 (introduction | take pictures)+Pg1 (introduction | technology))=0.48/ (0.52+0.48)=0.48;
Pg (skill | take pictures)=Pg1 (skill | take pictures)/(Pg1 (skill | take pictures)+Pg1 (skill | technology)+Pg1 (skill | camera))=0.599/ (0.599+0.87+0.531)=0.3;
Pg (skill | technology)=Pg1 (skill | technology)/(Pg1 (skill | take pictures)+Pg1 (skill | technology)+Pg1 (skill | camera))=0.87/ (0.599+0.87+0.531)=0.44;
Pg (skill | camera)=Pg1 (skill | camera)/(Pg1 (skill | take pictures)+Pg1 (skill | technology)+Pg1 (skill | camera))=0.531/ (0.599+0.87+0.531)=0.27;
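The normalisation of formula (4) can be sketched as below, again over the assumed dictionary representation of Pg1.

```python
from collections import defaultdict

# Sketch of formula (4): normalise Pg1 over all words nt1 that were ever paired
# with the same nt2, giving the global synonym probability Pg(nt2 | nt1).
def global_probability(pg1):
    column_totals = defaultdict(float)
    for (nt1, nt2), value in pg1.items():
        column_totals[nt2] += value
    return {(nt1, nt2): value / column_totals[nt2]
            for (nt1, nt2), value in pg1.items()}

# e.g. Pg(camera | take pictures) = 0.614 / (0.614 + 0.48 + 0.096) ≈ 0.52
```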
S404: judge whether the current iteration is the last one; if so, execute step S05; otherwise, execute S405;
S405: in each synonym probability matrix S(NT1[i], NT2[j]), initialize the synonym probability of NT1[i] relative to NT2[j] to the corresponding global synonym probability of NT1[i] relative to NT2[j] obtained in this iteration, i.e. P(NT2[j]|NT1[i]) = Pg(NT2[j]|NT1[i]), and execute S303.
Step S05: based on all synonym probability matrices S(NT1[i], NT2[j]), calculate the global synonym confidence of NT1[i] relative to NT2[j], and output the word pairs whose confidence exceeds the preset confidence threshold as synonyms. As shown in Fig. 6, this specifically comprises:
S501: based on all synonym probability matrices S(NT1[i], NT2[j]), calculate the global synonym confidence of NT1[i] relative to NT2[j] by the following formula (5):
conf(NT2[j]|NT1[i])=Pg1(NT2[j]|NT1[i])/M (5)
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
Suppose that, after being processed by steps S03 and S04, Tables 3 and 4 become Tables 5 and 6 below:
Table 5
Table 6
Can obtain according to formula (5):
Conf (camera | take pictures)=Pg1 (camera | take pictures)/2=(0.5+0.06)/2=0.28;
Conf (camera | technology)=Pg1 (camera | technology)/1=0.5;
Conf (camera | camera)=Pg1 (camera | camera)/1=0.94;
Conf (take | take pictures)=Pg1 (take | take pictures)/2=(0.9+0.94)/2=0.92;
Conf (take | technology)=Pg1 (take | technology)/1=0.1;
Conf (take | camera)=Pg1 (take | camera)/1=0.06;
Conf (introduction | take pictures)=Pg1 (introduction | take pictures)/1=0.52;
Conf (introduction | technology)=Pg1 (introduction | technology)/1=0.48;
Conf (skill | take pictures)=Pg1 (skill | take pictures)/2=(0.08+0.46)/2=0.27;
Conf (skill | technology)=Pg1 (skill | technology)/1=0.92;
Conf (skill | camera)=Pg1 (skill | camera)/1=0.54;
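A sketch of the confidence computation and thresholding of S501/S502 follows; the threshold value 0.8 is only an example, the patent leaves it as a preset parameter.

```python
# Sketch of S501/S502: confidence is the average per-occurrence probability
# conf = Pg1 / M; pairs above the (example) threshold are emitted as synonyms.
def extract_synonyms(pg1, m, threshold=0.8):
    synonym_pairs = []
    for (nt1, nt2), total in pg1.items():
        conf = total / m[(nt1, nt2)]
        if conf > threshold:
            synonym_pairs.append((nt1, nt2, conf))
    return synonym_pairs

# With the confidences listed above, pairs such as ("take pictures", "take", 0.92)
# and ("technology", "skill", 0.92) would clear a 0.8 threshold.
```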
S502: extract and save the contexts of the word pairs whose confidence exceeds the preset confidence threshold;
S503: output those word pairs as synonyms, and at the same time output their synonym replacement contexts and context grades.
As shown in Fig. 7, the synonym mining device of the invention comprises a similar aligned corpus extraction module 10, a word segmentation module 20, an adaptive mining module 30, an iteration module 40 and a synonym pair output module 50, wherein
the similar aligned corpus extraction module 10 is used to extract a similar aligned corpus from the search log;
the word segmentation module 20 is used to perform word segmentation on the similar aligned sentences S1, S2 to obtain word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]); to set, for each word of these sequences, a label flag[i], flag[j] with initial value 0, and to traverse S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]); to set to ADDRESS_LABEL the label flag[i] of the words of S1 that are place names, to set flag[i] to NUM_LABEL for number words, and to set to DIFF_LABEL the label flag[i] of the words that do not appear in S2(T2[1], T2[2], ..., T2[j]), obtaining the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]); likewise, to set to ADDRESS_LABEL the label flag[j] of the words of S2 that are place names, to set flag[j] to NUM_LABEL for number words, and to set to DIFF_LABEL the label flag[j] of the words that do not appear in S1(T1[1], T1[2], ..., T1[i]), obtaining the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]);
the adaptive mining module 30 is used to delete, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels; to initialize, according to the entropy principle, the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; to calculate the similarity of NT1[i] to NT2[j] and adjust the synonym probability of NT1[i] relative to NT2[j] according to that similarity; to convert the words labeled NUM_LABEL that are not in Arabic-numeral form into Arabic-numeral form; and to adjust the synonym probability of NT1[i] relative to NT2[j] according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]);
the iteration module 40 is used to perform an iterative computation on the synonym probability of NT1[i] relative to NT2[j]: it stores the preset number of iterations; calculates the sum Pg1(NT2[j]|NT1[i]) of the synonym probabilities P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] mined from the similar aligned corpus; calculates, according to Pg1(NT2[j]|NT1[i]), the global synonym probability Pg(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; and, when the current iteration is not the last one, initializes the synonym probability of NT1[i] relative to NT2[j] to the global synonym probability obtained in this iteration;
the synonym pair output module 50 is used to calculate the global synonym confidence of NT1[i] relative to NT2[j] and to output the word pairs whose confidence exceeds the preset confidence threshold as synonyms; it also extracts and saves the contexts of those word pairs and, when outputting a synonym pair, outputs its synonym replacement context and context grade.
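As a rough illustration of how the five modules of Fig. 7 could be composed; the class and parameter names below are not the patent's API, just a sketch.

```python
# Illustrative wiring of the five modules of Fig. 7; each module is passed in as a
# callable so the sketch stays independent of any particular implementation.
class SynonymMiner:
    def __init__(self, extract, segment, mine, iterate, output):
        self.extract, self.segment = extract, segment
        self.mine, self.iterate, self.output = mine, iterate, output

    def run(self, search_log):
        corpus = self.extract(search_log)                         # module 10
        pairs = [self.segment(s1, s2) for s1, s2 in corpus]       # module 20
        matrices = [self.mine(ns1, ns2) for ns1, ns2 in pairs]    # module 30
        pg1, m = self.iterate(matrices)                           # module 40
        return self.output(pg1, m)                                # module 50
```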
The above description illustrates and describes preferred embodiments of the invention, but, as stated above, it should be understood that the invention is not limited to the forms disclosed herein. This should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications and environments, and can be changed within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the related art. All changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (12)

1. A synonym mining method, characterized in that the method periodically performs the following steps:
Step A: extracting a similar aligned corpus from the search log, said corpus containing Q pairs of similar aligned sentences;
Step B: performing word segmentation on each pair of similar aligned sentences S1, S2 to obtain Q pairs of word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]);
Step C: for each pair of word sequences, adaptively mining S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]), and calculating the synonym probability of each word in S1 relative to each word in S2, finally obtaining Q synonym probability matrices S(NT1[i], NT2[j]);
Step D: based on all synonym probability matrices S(NT1[i], NT2[j]), performing an iterative computation on the synonym probability of NT1[i] relative to NT2[j];
Step E: based on all synonym probability matrices S(NT1[i], NT2[j]), calculating the global synonym confidence of NT1[i] relative to NT2[j], and outputting the word pairs whose confidence exceeds a preset confidence threshold as synonyms.
2. The synonym mining method of claim 1, characterized in that Step A specifically comprises the following steps:
extracting, in turn, the search terms in the search log whose usage count exceeds a preset count;
extracting, among the web pages retrieved for the current search term, the titles of the pages that were clicked;
forming a pair of similar aligned sentences from the current search term and each such title;
all such sentence pairs forming the similar aligned corpus.
3. The synonym mining method of claim 1, characterized in that Step B further performs the following steps for each pair of word sequences:
for each word of S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]), setting a label flag[i], flag[j] with initial value 0;
traversing S1(T1[1], T1[2], ..., T1[i]);
if T1[i] is a place name, setting flag[i]=ADDRESS_LABEL;
if T1[i] is English, setting flag[i]=ENG_LABEL;
if T1[i] is a number, setting flag[i]=NUM_LABEL;
if T1[i] does not appear in S2(T2[1], T2[2], ..., T2[j]), setting flag[i]=DIFF_LABEL;
after the traversal, obtaining the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]);
likewise traversing S2(T2[1], T2[2], ..., T2[j]);
if T2[j] is a place name, setting flag[j]=ADDRESS_LABEL;
if T2[j] is English, setting flag[j]=ENG_LABEL;
if T2[j] is a number, setting flag[j]=NUM_LABEL;
if T2[j] does not appear in S1(T1[1], T1[2], ..., T1[i]), setting flag[j]=DIFF_LABEL;
after the traversal, obtaining the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]).
4. The synonym mining method of claim 3, characterized in that, before Step C mines the word sequences for synonyms, the following step is also performed:
deleting, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels.
5. The synonym mining method of claim 4, characterized in that Step C specifically performs the following steps for each pair of word sequences:
C1: according to the entropy principle, initializing the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j], obtaining the synonym probability matrix S(NT1[i], NT2[j]);
C2: adjusting the corresponding probability values in said synonym probability matrix S(NT1[i], NT2[j]) according to the similarity of NT1[i] to NT2[j];
C3: converting the words labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) that are not in Arabic-numeral form into Arabic-numeral form;
C4: adjusting the corresponding probability values in said synonym probability matrix S(NT1[i], NT2[j]) according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]).
6. The synonym mining method of claim 5, characterized in that the synonym probabilities of NT1[i] relative to NT2[j] satisfy the following formula:
Σ_{i=1}^{|NS1|} P(NT2[j]|NT1[i]) = 1
where |NS1| is the number of words in S1(NT1[1], NT1[2], ..., NT1[i]); j = 1, 2, ..., |NS2|, and |NS2| is the number of words in S2(NT2[1], NT2[2], ..., NT2[j]).
7. The synonym mining method of claim 5, characterized in that the step of adjusting the corresponding probability values in said synonym probability matrix S(NT1[i], NT2[j]) according to the similarity of NT1[i] to NT2[j] is specifically:
calculating the similarity of NT1[i] to NT2[j] by the following formula:
sim(NT1[i], NT2[j]) = sub(NT1[i], NT2[j]) / max(NT1[i], NT2[j])
where sub(NT1[i], NT2[j]) is the number of identical characters in NT1[i] and NT2[j];
max(NT1[i], NT2[j]) is the larger of the character counts of NT1[i] and NT2[j];
judging whether said sim(NT1[i], NT2[j]) is greater than or equal to 0.5;
if sim(NT1[i], NT2[j]) is greater than or equal to 0.5, letting
P1 = rP(NT2[j]|NT1[i])
where r is a preset adjustment coefficient;
in the synonym probability matrix S(NT1[i], NT2[j]), adding P1 to the synonym probability of NT1[i] relative to NT2[j];
subtracting P1/(|NS2|-1) from the synonym probabilities of NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j];
subtracting P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to NT2[j];
adding P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j].
8. The synonym mining method of claim 5, characterized in that the step of adjusting the corresponding probability values in said synonym probability matrix S(NT1[i], NT2[j]) according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]) is specifically:
judging whether a word NT1[k] labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word NT2[h] labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]);
if they are identical,
in the synonym probability matrix S(NT1[i], NT2[j]), adding P1 to the synonym probability of NT1[k] relative to NT2[h];
subtracting P1/(|NS2|-1) from the synonym probabilities of NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h];
subtracting P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to NT2[h];
adding P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h].
9. The synonym mining method of claim 5, characterized in that Step D comprises the following steps:
Step D1: setting the number of iterations;
Step D2: calculating, by the following formula, the sum of the synonym probabilities P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] mined from the similar aligned corpus:
Pg1(NT2[j]|NT1[i]) = Σ_{s=1}^{M} P(NT2[j]|NT1[i])
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
Step D3: according to Pg1(NT2[j]|NT1[i]), calculating the global synonym probability of NT1[i] relative to NT2[j] by the following formula:
Pg(NT2[j]|NT1[i]) = Pg1(NT2[j]|NT1[i]) / Σ_{x=1}^{y} Pg1(NT2[j]|NT1[x])
where y is the number of word pairs containing NT2[j] over all synonym probability matrices S(NT1[i], NT2[j]);
Step D4: judging whether the current iteration is the last one; if so, executing Step E; otherwise, executing Step D5;
Step D5: initializing the synonym probability of NT1[i] relative to NT2[j] to the global synonym probability of NT1[i] relative to NT2[j] obtained in this iteration, and executing Step C2.
10. The synonym mining method of claim 1, characterized in that Step E specifically comprises the following steps:
based on all synonym probability matrices S(NT1[i], NT2[j]), calculating the global synonym confidence of NT1[i] relative to NT2[j] by the following formula:
conf(NT2[j]|NT1[i]) = Pg1(NT2[j]|NT1[i]) / M
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
extracting and saving the contexts of the word pairs whose confidence exceeds the preset confidence threshold;
outputting said word pairs as synonyms, together with their synonym replacement contexts and context grades.
11. A synonym mining device, characterized in that the device comprises a similar aligned corpus extraction module, a word segmentation module, an adaptive mining module, an iteration module and a synonym pair output module, wherein
the similar aligned corpus extraction module is configured to extract a similar aligned corpus from the search log;
the word segmentation module is configured to perform word segmentation on the similar aligned sentences S1, S2 to obtain word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]);
the adaptive mining module is configured to adaptively mine S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]), and to calculate the synonym probability of each word in S1 relative to each word in S2, obtaining a synonym probability matrix S(NT1[i], NT2[j]);
the iteration module is configured to perform an iterative computation on the synonym probability of NT1[i] relative to NT2[j];
the synonym pair output module is configured to calculate the global synonym confidence of NT1[i] relative to NT2[j] and to output the word pairs whose confidence exceeds a preset confidence threshold as synonyms.
12. The synonym mining device of claim 11, characterized in that
the word segmentation module is configured to set, for each word of the word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]), a label flag[i], flag[j] with initial value 0, and to traverse S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]); to set to ADDRESS_LABEL the label flag[i] of the words of S1 that are place names; to set flag[i] to ENG_LABEL for English words; to set flag[i] to NUM_LABEL for number words; to set to DIFF_LABEL the label flag[i] of the words that do not appear in S2(T2[1], T2[2], ..., T2[j]), obtaining the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]); to set to ADDRESS_LABEL the label flag[j] of the words of S2 that are place names; to set flag[j] to ENG_LABEL for English words; to set flag[j] to NUM_LABEL for number words; and to set to DIFF_LABEL the label flag[j] of the words that do not appear in S1(T1[1], T1[2], ..., T1[i]), obtaining the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]);
the adaptive mining module is configured to delete, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels; to initialize, according to the entropy principle, the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; to calculate the similarity of NT1[i] to NT2[j] and adjust the probability of NT1[i] relative to NT2[j] according to said similarity; to convert the words labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) that are not in Arabic-numeral form into Arabic-numeral form; and to adjust the corresponding probabilities according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]);
the iteration module is configured to store the preset number of iterations; to calculate the sum Pg1(NT2[j]|NT1[i]) of the synonym probabilities P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] mined from the similar aligned corpus; to calculate, according to Pg1(NT2[j]|NT1[i]), the global synonym probability Pg(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; and, when the current iteration is not the last one, to initialize the synonym probability of NT1[i] relative to NT2[j] to the global synonym probability of NT1[i] relative to NT2[j] obtained in this iteration;
the synonym pair output module is configured to extract and save the contexts of the word pairs whose confidence exceeds the preset confidence threshold, and, when outputting a synonym pair, to also output its synonym replacement context and context grade.
CN201410193704.5A 2014-05-08 2014-05-08 Synonym mining method and device Active CN103942339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410193704.5A CN103942339B (en) 2014-05-08 2014-05-08 Synonym mining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410193704.5A CN103942339B (en) 2014-05-08 2014-05-08 Synonym mining method and device

Publications (2)

Publication Number Publication Date
CN103942339A true CN103942339A (en) 2014-07-23
CN103942339B CN103942339B (en) 2017-06-09

Family

ID=51190007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410193704.5A Active CN103942339B (en) Synonym mining method and device

Country Status (1)

Country Link
CN (1) CN103942339B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010013228A1 (en) * 2008-07-31 2010-02-04 Ginger Software, Inc. Automatic context sensitive language generation, correction and enhancement using an internet corpus
US20120197905A1 (en) * 2011-02-02 2012-08-02 Microsoft Corporation Information retrieval using subject-aware document ranker
CN102760134A (en) * 2011-04-28 2012-10-31 北京百度网讯科技有限公司 Method and device for mining synonyms
CN103699663A (en) * 2013-12-27 2014-04-02 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AM COHEN et al.: "Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts", BMC Bioinformatics *
PETER D. TURNEY et al.: "Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL", European Conference on Machine Learning: ECML 2001 *
宋宇轩: "Research and Implementation of Synonym Mining Based on Search Logs and Click Logs", China Master's Theses Full-text Database, Information Science and Technology *
陈建超 et al.: "A Synonym Set Mining Algorithm Based on Feature Word Relevance", Application Research of Computers (计算机应用研究) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331398A (en) * 2014-10-30 2015-02-04 百度在线网络技术(北京)有限公司 Method and device for generating synonym alignment dictionary
WO2017063538A1 (en) * 2015-10-12 2017-04-20 广州神马移动信息科技有限公司 Method for mining related words, search method, search system
CN105335351A (en) * 2015-10-27 2016-02-17 北京信息科技大学 Synonymy automatically mining method based on patent search log user behaviors
CN105335351B (en) * 2015-10-27 2018-08-28 北京信息科技大学 A kind of synonym automatic mining method based on patent search daily record user behavior
CN106844325A (en) * 2015-12-04 2017-06-13 北大医疗信息技术有限公司 Medical information processing method and medical information processing unit
CN106844325B (en) * 2015-12-04 2022-01-25 北大医疗信息技术有限公司 Medical information processing method and medical information processing apparatus
CN106202038A (en) * 2016-06-29 2016-12-07 北京智能管家科技有限公司 Synonym method for digging based on iteration and device
CN107562713A (en) * 2016-06-30 2018-01-09 北京智能管家科技有限公司 The method for digging and device of synonymous text
CN106777283A (en) * 2016-12-29 2017-05-31 北京奇虎科技有限公司 The method for digging and device of a kind of synonym
CN106777283B (en) * 2016-12-29 2021-02-26 北京奇虎科技有限公司 Synonym mining method and synonym mining device
CN106844571B (en) * 2017-01-03 2020-04-07 北京齐尔布莱特科技有限公司 Method and device for identifying synonyms and computing equipment
CN106844571A (en) * 2017-01-03 2017-06-13 北京齐尔布莱特科技有限公司 Recognize method, device and the computing device of synonym
CN107391495A (en) * 2017-06-09 2017-11-24 北京吾译超群科技有限公司 A kind of sentence alignment schemes of bilingual parallel corporas
CN107391495B (en) * 2017-06-09 2020-08-21 北京同文世纪科技有限公司 Sentence alignment method of bilingual parallel corpus
CN107451212A (en) * 2017-07-14 2017-12-08 北京京东尚科信息技术有限公司 Synonymous method for digging and device based on relevant search
CN107748755A (en) * 2017-09-19 2018-03-02 华为技术有限公司 Synonym method for digging, device, equipment and computer-readable recording medium
CN107748755B (en) * 2017-09-19 2019-11-05 华为技术有限公司 Synonym method for digging, device, equipment and computer readable storage medium
WO2019056781A1 (en) * 2017-09-19 2019-03-28 华为技术有限公司 Synonym mining method, device, equipment and computer readable storage medium
CN107958078A (en) * 2017-12-13 2018-04-24 北京百度网讯科技有限公司 Information generating method and device
CN109522547B (en) * 2018-10-23 2020-09-18 浙江大学 Chinese synonym iteration extraction method based on pattern learning
CN109522547A (en) * 2018-10-23 2019-03-26 浙江大学 Chinese synonym iteration abstracting method based on pattern learning

Also Published As

Publication number Publication date
CN103942339B (en) 2017-06-09

Similar Documents

Publication Publication Date Title
CN103942339A (en) Synonym mining method and device
Hyvönen et al. Semantic autocompletion
Badiou Can politics be thought?
Alex et al. Adapting the Edinburgh geoparser for historical georeferencing
US10678820B2 (en) System and method for computerized semantic indexing and searching
Nguyen-Hoang et al. TSGVi: a graph-based summarization system for Vietnamese documents
CN103729343A (en) Semantic ambiguity eliminating method based on encyclopedia link co-occurrence
JP5250009B2 (en) Suggestion query extraction apparatus and method, and program
CN105404677A (en) Tree structure based retrieval method
Ye et al. Part-of-speech tagging based on dictionary and statistical machine learning
Rychlý et al. Annotated Amharic corpora
Venkataraman et al. Instant search: A hands-on tutorial
Chang et al. Enhancing POI search on maps via online address extraction and associated information segmentation
CN105426490A (en) Tree structure based indexing method
Moretti et al. ALCIDE: An online platform for the Analysis of Language and Content In a Digital Environment
Stanković et al. A bilingual digital library for academic and entrepreneurial knowledge management
Vashisht et al. Enhanced lexicon E-SLIDE framework for efficient sentiment analysis
Sene-Mongaba The Making of Lingala Corpus: An Under-resourced Language and the Internet
CN104090966A (en) Semi-structured data retrieval method based on graph model
Horák et al. Slovak national corpus
Sujatha et al. Evaluation of English-Telugu and English-Tamil Cross Language Information Retrieval System using Dictionary Based Query Translation Method
bin Mohd Rosman et al. Bringing together over-and under-represented languages: Linking Wordnet to the SIL Semantic Domains
Kolthoff et al. Automated retrieval of graphical user interface prototypes from natural language requirements
Lertnattee et al. Using Multicultural Herbal Information to Create Multi-pattern Herb Name Retrieval System
Bhatia et al. Tools and infrastructure for supporting enterprise knowledge graphs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: 403-409, Block C, Building 5, Software Industry Base, Nanshan District, Shenzhen, Guangdong 518057

Patentee after: Shenzhen easou world Polytron Technologies Inc

Address before: A5501-A, Tower A, United Plaza, junction of Binhe Road and Caitian Road, Futian District, Shenzhen, Guangdong 518026

Patentee before: Shenzhen Yisou Science & Technology Development Co., Ltd.

CP03 Change of name, title or address