CN103942339A - Synonym mining method and device - Google Patents

Synonym mining method and device

Info

Publication number
CN103942339A
Authority
CN
China
Prior art keywords
synonym
word
respect
label
probability
Prior art date
Legal status
Granted
Application number
CN201410193704.5A
Other languages
Chinese (zh)
Other versions
CN103942339B (en)
Inventor
车天文
王更生
刘捷
雷大伟
Current Assignee
Shenzhen easou world Polytron Technologies Inc
Original Assignee
Shenzhen Yisou Science & Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Yisou Science & Technology Development Co Ltd filed Critical Shenzhen Yisou Science & Technology Development Co Ltd
Priority to CN201410193704.5A priority Critical patent/CN103942339B/en
Publication of CN103942339A publication Critical patent/CN103942339A/en
Application granted granted Critical
Publication of CN103942339B publication Critical patent/CN103942339B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Abstract

The invention discloses a synonym mining method comprising the steps of: extracting a similar aligned corpus; performing word segmentation on each pair of similar aligned sentences S1 and S2 to obtain word sequences S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]); for each pair of word sequences, adaptively mining S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]) and calculating the synonym probability of the words in S1 relative to the words in S2; performing an iterative computation on the synonym probability of NT1[i] relative to NT2[j]; calculating the global synonym confidence of NT1[i] relative to NT2[j] and outputting the word pairs whose confidence exceeds a preset confidence threshold as synonyms. The invention further discloses a synonym mining device. The method and device improve the accuracy of synonym mining and are easy to operate and implement.

Description

Synonym mining method and device
Technical field
The present invention relates to the field of information retrieval, and in particular to a synonym mining method and device.
Background technology
Internet search engines have become the mainstream tool for obtaining information. Existing search is still generally term-based: the user enters search terms and queries through the search engine, and the search engine returns web page results containing those terms. In practice, not every user understands how a search engine works, and differences in education, speech habits and usage mean that users often enter terms that have similar meanings but different wordings, such as two different expressions for "diarrhoea". If the search engine cannot recognize synonyms, then when a user searches for "what to do about a child's diarrhoea" using one expression, some high-quality results phrased with the other expression can never be returned.
Synonymy is a distinctive phenomenon of natural language. Synonym mining is a very important basic task in natural language processing and an extremely valuable one: it supports query substitution and rewriting in search, enriches search results, and improves the query experience. To date, the main approaches to synonym mining are the following:
1. Manual construction, generally thesauri compiled by hand based on the accumulated knowledge of linguists, such as HowNet and WordNet. However, collecting and compiling such resources consumes a great deal of manpower and material resources, and in practice such thesauri are costly to use: because they lean toward academic usage, some entries are synonymous only in particular contexts (e.g. "Mount Taishan" and "father-in-law") and cannot be applied directly.
2. Template-based mining: in encyclopedias, documents and articles of all kinds, cue phrases such as "also known as" and "also called" are used to mine similar words. The accuracy can be fairly high, but the templates are limited, the number of synonyms mined is limited, and it is hard to assign a confidence level to the mined synonym pairs.
3. Mining based on correlation probabilities between words in a corpus: synonym mining is performed by computing the correlation probability between every pair of words in the corpus, which requires pairwise computation over the whole vocabulary and is therefore very inefficient.
4. Mining from Internet search engine results: large-scale web data is used, combining user search habits and real web page text, to mine synonym pairs.
Summary of the invention
The object of the invention is to provide a synonym mining method and device, so as to overcome the poor accuracy and low efficiency of existing synonym mining.
The invention discloses a synonym mining method, wherein the method periodically performs the following steps:
Step A: extract a similar aligned corpus from the search log; suppose the corpus contains Q pairs of similar aligned sentences;
Step B: perform word segmentation on each pair of similar aligned sentences S1, S2 to obtain Q pairs of word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]);
Step C: for each pair of word sequences, adaptively mine S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]), and calculate the synonym probability of each word in S1 relative to each word in S2, finally obtaining Q synonym probability matrices S(NT1[i], NT2[j]);
Step D: based on all synonym probability matrices S(NT1[i], NT2[j]), perform an iterative computation on the synonym probability of NT1[i] relative to NT2[j];
Step E: based on all synonym probability matrices S(NT1[i], NT2[j]), calculate the global synonym confidence of NT1[i] relative to NT2[j], and output the word pairs whose confidence exceeds a preset confidence threshold as synonyms.
Preferably, Step A specifically comprises the following steps:
extract, in turn, the search terms in the search log whose usage count exceeds a preset count;
for the current search term, extract the titles of the clicked web pages among the retrieved results;
the current search term and each such title form a pair of similar aligned sentences;
all such sentence pairs form the similar aligned corpus.
Preferably, Step B further performs the following steps for each pair of word sequences:
for each word of S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]), set a label flag[i], flag[j] with initial value 0;
traverse S1(T1[1], T1[2], ..., T1[i]);
if T1[i] is a place name, set flag[i]=ADDRESS_LABEL;
if T1[i] is English, set flag[i]=ENG_LABEL;
if T1[i] is a number, set flag[i]=NUM_LABEL;
if T1[i] does not appear in S2(T2[1], T2[2], ..., T2[j]), set flag[i]=DIFF_LABEL;
after the traversal, obtain the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]);
likewise, traverse S2(T2[1], T2[2], ..., T2[j]);
if T2[j] is a place name, set flag[j]=ADDRESS_LABEL;
if T2[j] is English, set flag[j]=ENG_LABEL;
if T2[j] is a number, set flag[j]=NUM_LABEL;
if T2[j] does not appear in S1(T1[1], T1[2], ..., T1[i]), set flag[j]=DIFF_LABEL;
after the traversal, obtain the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]).
Preferably, before Step C mines the word sequences for synonyms, the following step is also performed:
delete, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels.
Preferably, Step C specifically performs the following steps for each pair of word sequences:
C1: according to the entropy principle, initialize the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j], obtaining the synonym probability matrix S(NT1[i], NT2[j]);
C2: adjust the corresponding probability values in the synonym probability matrix S(NT1[i], NT2[j]) according to the similarity of NT1[i] to NT2[j];
C3: convert the words labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) that are not in Arabic-numeral form into Arabic-numeral form;
C4: adjust the corresponding probability values in the synonym probability matrix S(NT1[i], NT2[j]) according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]).
Preferably, the synonym probabilities of NT1[i] relative to NT2[j] satisfy the following formula:
Σ_{i=1}^{|NS1|} P(NT2[j]|NT1[i]) = 1
where |NS1| is the number of words in S1(NT1[1], NT1[2], ..., NT1[i]); j = 1, 2, ..., |NS2|, and |NS2| is the number of words in S2(NT2[1], NT2[2], ..., NT2[j]).
Preferably, the step of adjusting the corresponding probability values in the synonym probability matrix S(NT1[i], NT2[j]) according to the similarity of NT1[i] to NT2[j] is specifically:
calculate the similarity of NT1[i] to NT2[j] by the following formula:
sim(NT1[i], NT2[j]) = sub(NT1[i], NT2[j]) / max(NT1[i], NT2[j])
where sub(NT1[i], NT2[j]) is the number of identical characters in NT1[i] and NT2[j];
max(NT1[i], NT2[j]) is the larger of the character counts of NT1[i] and NT2[j];
judge whether sim(NT1[i], NT2[j]) is greater than or equal to 0.5;
if sim(NT1[i], NT2[j]) is greater than or equal to 0.5, let
P1 = rP(NT2[j]|NT1[i])
where r is a preset adjustment coefficient;
in the synonym probability matrix S(NT1[i], NT2[j]), add P1 to the synonym probability of NT1[i] relative to NT2[j];
subtract P1/(|NS2|-1) from the synonym probabilities of NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j];
subtract P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to NT2[j];
add P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j].
Preferably, the step of adjusting the corresponding probability values in the synonym probability matrix S(NT1[i], NT2[j]) according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]) is specifically:
judge whether a word NT1[k] labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word NT2[h] labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]);
if they are identical,
in the synonym probability matrix S(NT1[i], NT2[j]), add P1 to the synonym probability of NT1[k] relative to NT2[h];
subtract P1/(|NS2|-1) from the synonym probabilities of NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h];
subtract P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to NT2[h];
add P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h].
Preferably, Step D comprises the following steps:
Step D1: set the number of iterations;
Step D2: calculate, by the following formula, the sum of the synonym probabilities P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] mined from the similar aligned corpus:
Pg1(NT2[j]|NT1[i]) = Σ_{s=1}^{M} P(NT2[j]|NT1[i])
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
Step D3: according to Pg1(NT2[j]|NT1[i]), calculate the global synonym probability of NT1[i] relative to NT2[j] by the following formula:
Pg(NT2[j]|NT1[i]) = Pg1(NT2[j]|NT1[i]) / Σ_{x=1}^{y} Pg1(NT2[j]|NT1[x])
where y is the number of word pairs containing NT2[j] over all synonym probability matrices S(NT1[i], NT2[j]);
Step D4: judge whether the current iteration is the last one; if so, execute Step E; otherwise, execute Step D5;
Step D5: initialize the synonym probability of NT1[i] relative to NT2[j] to the global synonym probability of NT1[i] relative to NT2[j] obtained in this iteration, and execute Step C2.
Preferably, Step E specifically comprises the following steps:
based on all synonym probability matrices S(NT1[i], NT2[j]), calculate the global synonym confidence of NT1[i] relative to NT2[j] by the following formula:
conf(NT2[j]|NT1[i]) = Pg1(NT2[j]|NT1[i]) / M
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
extract and save the contexts of the word pairs whose confidence exceeds the preset confidence threshold;
output those word pairs as synonyms, and at the same time output their synonym replacement contexts and context grades.
The invention also discloses a synonym mining device. The device comprises a similar aligned corpus extraction module, a word segmentation module, an adaptive mining module, an iteration module and a synonym pair output module, wherein:
the similar aligned corpus extraction module is used to extract a similar aligned corpus from the search log;
the word segmentation module is used to perform word segmentation on the similar aligned sentences S1, S2 to obtain word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]);
the adaptive mining module is used to adaptively mine S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]), and to calculate the synonym probability of each word in S1 relative to each word in S2, obtaining a synonym probability matrix S(NT1[i], NT2[j]);
the iteration module is used to perform an iterative computation on the synonym probability of NT1[i] relative to NT2[j];
the synonym pair output module is used to calculate the global synonym confidence of NT1[i] relative to NT2[j] and to output the word pairs whose confidence exceeds a preset confidence threshold as synonyms.
Preferably, the word segmentation module is used to set, for each word of the word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]), a label flag[i], flag[j] with initial value 0, and to traverse S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]); to set to ADDRESS_LABEL the label flag[i] of the words of S1 that are place names; to set flag[i] to ENG_LABEL for English words; to set flag[i] to NUM_LABEL for number words; to set to DIFF_LABEL the label flag[i] of the words that do not appear in S2(T2[1], T2[2], ..., T2[j]), obtaining the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]); to set to ADDRESS_LABEL the label flag[j] of the words of S2 that are place names; to set flag[j] to ENG_LABEL for English words; to set flag[j] to NUM_LABEL for number words; and to set to DIFF_LABEL the label flag[j] of the words that do not appear in S1(T1[1], T1[2], ..., T1[i]), obtaining the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]).
The adaptive mining module is used to delete, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels; to initialize, according to the entropy principle, the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; to calculate the similarity of NT1[i] to NT2[j] and adjust the probability of NT1[i] relative to NT2[j] according to that similarity; to convert the words labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) that are not in Arabic-numeral form into Arabic-numeral form; and to adjust the corresponding probabilities according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]).
The iteration module is used to store the preset number of iterations; to calculate the sum Pg1(NT2[j]|NT1[i]) of the synonym probabilities P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] mined from the similar aligned corpus; to calculate, according to Pg1(NT2[j]|NT1[i]), the global synonym probability Pg(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; and, when the current iteration is not the last one, to initialize the synonym probability of NT1[i] relative to NT2[j] to the global synonym probability of NT1[i] relative to NT2[j] obtained in this iteration.
The synonym pair output module is used to extract and save the contexts of the word pairs whose confidence exceeds the preset confidence threshold, and, when outputting a synonym pair, to also output its synonym replacement context and context grade.
The invention automatically mines synonyms from similar aligned pairs of user search queries and web page titles. The corpus can be updated periodically, the accuracy of synonym mining can be improved continuously, and the method is easy to operate and implement.
Brief description of the drawings
The accompanying drawings described here are provided for a further understanding of the invention and form a part of the invention; the schematic embodiments of the invention and their description are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
Fig. 1 is the overall flowchart of the synonym mining method of the invention;
Fig. 2 is the detailed flowchart of step S01 in Fig. 1;
Fig. 3 is the flowchart of labeling the words in a word sequence;
Fig. 4 is the detailed flowchart of step S03 in Fig. 1;
Fig. 5 is the detailed flowchart of step S04 in Fig. 1;
Fig. 6 is the detailed flowchart of step S05 in Fig. 1;
Fig. 7 is the block diagram of the synonym mining device of the invention.
Embodiment
In order to make the technical problems to be solved, the technical solutions and the beneficial effects of the invention clearer, the invention is further described below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention and are not intended to limit it.
As shown in Fig. 1, the overall flow of the synonym mining method of the invention is as follows:
Step S01: extract a similar aligned corpus from the search log; suppose the corpus contains Q pairs of similar aligned sentences.
Every search engine keeps its own retrieval log, which records in detail the terms entered by the user, the web results returned for those terms, the content that was clicked, and other details. From such data, pairs consisting of a user search term and the title of a retrieved web page are extracted. To improve accuracy, only search terms with a certain number of searches are chosen, and terms searched only once are discarded; likewise, only clicked web pages are chosen, and pages that were never clicked are discarded. The two sentences of a pair obtained in this way do not have exactly the same meaning, and their structures may differ, but their meanings are certainly related or close, so they are called similar aligned sentences. In other words, a so-called aligned sentence pair is a group of sentences that mean the same thing but are expressed differently, for example "Liu Dehua's songs", "Liu Dehua's music" and "Hua Zai's music" (Hua Zai being a nickname of Liu Dehua). These similar aligned sentences make up the similar aligned corpus.
Therefore, as shown in Fig. 2, this step specifically comprises the following steps:
S101: extract the search terms in the search log whose usage count exceeds a preset count;
S102: for the current search term, extract the titles of the clicked web pages among the retrieved results;
S103: the current search term and each such title form a pair of similar aligned sentences;
S104: all such sentence pairs form the similar aligned corpus.
These corpus entries are pairs of sentences with nearly identical meanings.
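For illustration only, the following Python sketch walks through steps S101-S104; the (query, clicked title) record shape, the field handling and the threshold value are assumptions made for readability and are not prescribed by the patent.

```python
from collections import Counter

# Illustrative sketch of S101-S104, assuming each search-log record can be reduced
# to a (query, clicked_title_or_None) tuple; format and threshold are assumptions.
def build_aligned_corpus(log_records, min_query_count=5):
    query_counts = Counter(query for query, _ in log_records)
    corpus = []  # each (query, title) entry is one pair of similar aligned sentences
    for query, clicked_title in log_records:
        if query_counts[query] <= min_query_count:
            continue  # S101: drop search terms that are used too rarely
        if not clicked_title:
            continue  # S102: keep only results that were actually clicked
        corpus.append((query, clicked_title))  # S103: one similar aligned pair
    return corpus  # S104: the similar aligned corpus
```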
Given how easy such corpora are to obtain, the invention can update the corpus periodically and then mine synonyms periodically, thereby continually updating and supplementing the thesaurus.
Step S02: perform word segmentation on each pair of similar aligned sentences S1, S2 to obtain Q pairs of word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]), and label the words in each pair of word sequences. As shown in Fig. 3, the flow of labeling the words in a word sequence specifically comprises:
S201: for each word of S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]), set a label flag[i], flag[j] with initial value 0;
S202: traverse S1(T1[1], T1[2], ..., T1[i]);
S203: if T1[i] is a place name, set flag[i]=ADDRESS_LABEL;
S204: if T1[i] is English, set flag[i]=ENG_LABEL;
S205: if T1[i] is a number, set flag[i]=NUM_LABEL;
S206: if T1[i] does not appear in S2(T2[1], T2[2], ..., T2[j]), set flag[i]=DIFF_LABEL;
S207: after the traversal, obtain the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]); likewise,
S208: traverse S2(T2[1], T2[2], ..., T2[j]);
S209: if T2[j] is a place name, set flag[j]=ADDRESS_LABEL;
S210: if T2[j] is English, set flag[j]=ENG_LABEL;
S211: if T2[j] is a number, set flag[j]=NUM_LABEL;
S212: if T2[j] does not appear in S1(T1[1], T1[2], ..., T1[i]), set flag[j]=DIFF_LABEL;
S213: after the traversal, obtain the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]).
For example, suppose S1 and S2 are "SLR photo technique" and "SLR camera shooting introduction skills"; after word segmentation, the word sequences S1 (SLR, take pictures, technology) and S2 (SLR, camera, take, introduction, skill) are obtained.
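A minimal sketch of the labeling pass S201-S213 follows; the place-name test and the simple isascii()/isdigit() checks stand in for whatever recognisers the word segmenter actually provides, and the numeric label values are arbitrary.

```python
ADDRESS_LABEL, ENG_LABEL, NUM_LABEL, DIFF_LABEL = 1, 2, 3, 4

# Sketch of S201-S213; is_place_name and the isascii/isdigit tests are illustrative
# stand-ins, not the patent's actual recognisers.
def label_words(s1, s2, is_place_name=lambda w: False):
    def label(word, other_sentence):
        if is_place_name(word):
            return ADDRESS_LABEL
        if word.isascii() and word.isalpha():
            return ENG_LABEL          # English word
        if word.isdigit():
            return NUM_LABEL          # number word
        if word not in other_sentence:
            return DIFF_LABEL         # word absent from the other sentence
        return 0                      # word occurs in both sentences
    ns1 = [(w, label(w, s2)) for w in s1]
    ns2 = [(w, label(w, s1)) for w in s2]
    return ns1, ns2
```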
Step S03: for each pair of word sequences, adaptively mine S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]), and calculate the synonym probability of each word in S1 relative to each word in S2, finally obtaining Q synonym probability matrices S(NT1[i], NT2[j]). As shown in Fig. 4, the following steps are performed for every pair of word sequences:
S301: delete, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels;
After the identical word "SLR" is deleted from S1 (SLR, take pictures, technology) and S2 (SLR, camera, take, introduction, skill), we obtain S1 (take pictures, technology) and S2 (camera, take, introduction, skill).
S302: according to the entropy principle, initialize the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j], obtaining the synonym probability matrix S(NT1[i], NT2[j]);
that is, NT1[i] is assumed to correspond to NT2[j] with equal probability, and the probability values in each column satisfy the following formula:
Σ_{i=1}^{|NS1|} P(NT2[j]|NT1[i]) = 1    (1)
where |NS1| is the number of words in S1(NT1[1], NT1[2], ..., NT1[i]); j = 1, 2, ..., |NS2|, and |NS2| is the number of words in S2(NT2[1], NT2[2], ..., NT2[j]).
For S1 (take pictures, technology) and S2 (camera, take, introduction, skill), the synonym probability matrix obtained is as follows:
Table 1
S1 \ S2       | Camera | Take | Introduction | Skill
Take pictures | 0.5    | 0.5  | 0.5          | 0.5
Technology    | 0.5    | 0.5  | 0.5          | 0.5
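The uniform initialisation of S302 can be sketched as follows; the dictionary-of-word-pairs representation of the matrix is an assumption chosen for brevity.

```python
# Sketch of S302: "entropy principle" initialisation makes every entry 1/|NS1|,
# so each column of the matrix sums to 1 as required by formula (1).
def init_matrix(ns1, ns2):
    p = 1.0 / len(ns1)
    return {(w1, w2): p for w1 in ns1 for w2 in ns2}

matrix = init_matrix(["take pictures", "technology"],
                     ["camera", "take", "introduction", "skill"])
# every entry is 0.5, reproducing Table 1
```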
S303: calculate the similarity sim(NT1[i], NT2[j]) of NT1[i] to NT2[j] in the synonym probability matrix by the following formula:
sim(NT1[i], NT2[j]) = sub(NT1[i], NT2[j]) / max(NT1[i], NT2[j])    (2)
where sub(NT1[i], NT2[j]) is the number of identical characters in NT1[i] and NT2[j], and max(NT1[i], NT2[j]) is the larger of their character counts. In Table 1, for example, "take pictures" and "take" share one character, so sub(take pictures, take) = 1; both words consist of 2 characters, so max(take pictures, take) = 2, and their similarity is sim(take pictures, take) = 1/2.
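One reading of formula (2), with sub(·) taken as the number of characters the two words share, can be sketched as:

```python
# Sketch of formula (2): character-overlap similarity between two words.
def sim(w1, w2):
    shared = len(set(w1) & set(w2))           # sub: characters the two words share
    return shared / max(len(w1), len(w2))     # divided by the longer word's length

# Two 2-character words sharing one character give sim = 0.5,
# as in the "take pictures" / "take" example above.
```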
S304: judge whether sim(NT1[i], NT2[j]) is greater than or equal to 0.5; if so, execute S305; otherwise, execute S307;
For example, in Table 1, "take pictures" and "camera" do not meet the similarity requirement, and they are not the last word pair of the matrix, so the similarity of "take pictures" and "take" is computed next; since "take pictures" and "take" meet the similarity requirement, S305 is executed;
S305: let P1 = rP(NT2[j]|NT1[i]), where r is a preset adjustment coefficient;
In Table 1, P(take|take pictures) = 0.5; taking r = 0.5 gives P1 = 0.5 * 0.5 = 0.25.
S306: in the synonym probability matrix S(NT1[i], NT2[j]), add P1 to the synonym probability of NT1[i] relative to NT2[j];
After the adjustment, P(take|take pictures) = 0.5 + 0.25 = 0.75;
Subtract P1/(|NS2|-1) from the synonym probabilities of NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j]; that is, the column varies (j changes) while the row is fixed (i is fixed): the entries of row NT1[i] in the columns other than NT2[j] are reduced by P1/(|NS2|-1);
After the adjustment, P(camera|take pictures) = P(introduction|take pictures) = P(skill|take pictures) = 0.5 - 0.25/(4-1) ≈ 0.42;
Subtract P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to NT2[j]; that is, the row varies (i changes) while the column is fixed (j is fixed): the entries of column NT2[j] in the rows other than NT1[i] are reduced by P1/(|NS1|-1);
After the adjustment, P(take|technology) = 0.5 - 0.25/(2-1) = 0.25;
Add P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j]; that is, both the row and the column vary: the entries outside row NT1[i] and outside column NT2[j] are increased by P1/(|NS1|-1)/(|NS2|-1);
After the adjustment, P(camera|technology) = P(introduction|technology) = P(skill|technology) = 0.5 + 0.25/(2-1)/(4-1) ≈ 0.58;
The probability values of each column of the adjusted synonym probability matrix S(NT1[i], NT2[j]) still obey formula (1).
The synonym probability matrix S(NT1[i], NT2[j]) after the adjustment for "take pictures" and "take" is as follows:
Table 2
S1 \ S2       | Camera | Take | Introduction | Skill
Take pictures | 0.42   | 0.75 | 0.42         | 0.42
Technology    | 0.58   | 0.25 | 0.58         | 0.58
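The adjustment of S305/S306 can be sketched as below; it matches Table 2 after rounding and keeps every column summing to 1, with the matrix again held as a dictionary of word pairs (an assumed representation).

```python
# Sketch of S305/S306: boost the matched cell by P1 = r * P(nt2|nt1) and
# redistribute the probability mass so every column still satisfies formula (1).
def adjust(matrix, ns1, ns2, nt1, nt2, r=0.5):
    p1 = r * matrix[(nt1, nt2)]
    for w1 in ns1:
        for w2 in ns2:
            if w1 == nt1 and w2 == nt2:
                matrix[(w1, w2)] += p1                                    # matched cell
            elif w1 == nt1:
                matrix[(w1, w2)] -= p1 / (len(ns2) - 1)                   # same row
            elif w2 == nt2:
                matrix[(w1, w2)] -= p1 / (len(ns1) - 1)                   # same column
            else:
                matrix[(w1, w2)] += p1 / (len(ns1) - 1) / (len(ns2) - 1)  # remaining cells

ns1 = ["take pictures", "technology"]
ns2 = ["camera", "take", "introduction", "skill"]
matrix = {(w1, w2): 0.5 for w1 in ns1 for w2 in ns2}        # Table 1
adjust(matrix, ns1, ns2, "take pictures", "take")           # yields Table 2 (rounded)
```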
S307: judge whether the current NT1[i] and NT2[j] are the last word pair of the synonym probability matrix; if so, execute step S308; otherwise, execute step S303 for the next word pair;
Since "take pictures" and "take" are not the last word pair, the similarity of the next word pair is computed, and so on until all word pairs have been traversed; the synonym probability matrix finally obtained is shown in Table 3:
Table 3
S1 \ S2       | Camera | Take | Introduction | Skill
Take pictures | 0.52   | 0.85 | 0.52         | 0.13
Technology    | 0.48   | 0.15 | 0.48         | 0.87
S308: judge whether S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) contain words labeled NUM_LABEL; if so, execute step S309; otherwise, execute S312;
S309: convert the words labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) that are not in Arabic-numeral form into Arabic-numeral form;
S310: judge whether a word NT1[k] (k = 1~i) labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word NT2[h] (h = 1~j) labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]); if so, execute S311; otherwise, the probability values in the synonym probability matrix S(NT1[i], NT2[j]) remain unchanged and step S04 is executed;
S311: in the synonym probability matrix S(NT1[i], NT2[j]), add P1 to the synonym probability of NT1[k] relative to NT2[h];
subtract P1/(|NS2|-1) from the synonym probabilities of NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h];
subtract P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to NT2[h];
add P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h];
The adjusted probabilities likewise obey formula (1).
S312: judge whether the current pair of word sequences is the last pair; if so, execute step S04; otherwise, continue with S301 for the next pair of word sequences.
Step S04: based on all synonym probability matrices S(NT1[i], NT2[j]), perform an iterative computation on the synonym probability of NT1[i] relative to NT2[j]. As shown in Fig. 5, this step specifically comprises:
S401: set the number of iterations;
S402: calculate, by the following formula, the sum Pg1(NT2[j]|NT1[i]) of all the synonym probabilities of NT1[i] relative to NT2[j] mined from the similar aligned corpus:
Pg1(NT2[j]|NT1[i]) = Σ_{s=1}^{M} P(NT2[j]|NT1[i])    (3)
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
Suppose the similar aligned corpus contains only two sentence pairs: "SLR photo technique" / "SLR camera shooting introduction skills", whose labeled word sequences are S1 (take pictures, technology) and S2 (camera, take, introduction, skill), and a second pair, "camera photo-taking" / "camera shooting skills", whose labeled word sequences are S1 (camera, take pictures) and S2 (camera, take, skill); after step S03, the probability matrices obtained are Table 3 and Table 4.
Table 4
According to Tables 3 and 4, "take pictures" corresponds to "camera", to "take" and to "skill" twice each (once in each matrix), so M = 2 for those pairs; every other word pair occurs only once, so M = 1. Formula (3) then gives:
Pg1 (camera | take pictures)=0.52+0.094=0.614;
Pg1 (camera | technology)=0.48;
Pg1 (camera | camera)=0.096;
Pg1 (take | take pictures)=0.85+0.937=1.787;
Pg1 (take | technology)=0.15;
Pg1 (take | camera)=0.063;
Pg1 (introduction | take pictures)=0.52;
Pg1 (introduction | technology)=0.48;
Pg1 (skill | take pictures)=0.13+0.469=0.599;
Pg1 (skill | technology)=0.87;
Pg1 (skill | camera)=0.531;
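A sketch of the Pg1 accumulation of formula (3) over all Q matrices follows; the dictionary representation and the function name are illustrative, not the patent's API.

```python
from collections import defaultdict

# Sketch of formula (3): sum, over all Q matrices produced by step S03, the
# probability of every occurrence of a word pair, and count the occurrences (M).
def aggregate(matrices):
    pg1 = defaultdict(float)
    m = defaultdict(int)
    for matrix in matrices:
        for (nt1, nt2), p in matrix.items():
            pg1[(nt1, nt2)] += p      # running sum of P(nt2 | nt1)
            m[(nt1, nt2)] += 1        # M: number of times the pair was mined
    return pg1, m
```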
S403: according to Pg1(NT2[j]|NT1[i]), calculate the global synonym probability Pg(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] by the following formula (4):
Pg(NT2[j]|NT1[i]) = Pg1(NT2[j]|NT1[i]) / Σ_{x=1}^{y} Pg1(NT2[j]|NT1[x])    (4)
where y is the number of word pairs containing NT2[j] over all synonym probability matrices S(NT1[i], NT2[j]).
For example: Pg (camera | take pictures)=Pg1 (camera | take pictures)/(Pg1 (camera | take pictures)+Pg1 (camera | technology)+Pg1 (camera | camera))=0.614/ (0.614+0.48+0.096)=0.52;
Pg (camera | technology)=Pg1 (camera | technology)/(Pg1 (camera | take pictures)+Pg1 (camera | technology)+Pg1 (camera | camera))=0.48/ (0.614+0.48+0.096)=0.4;
Pg (camera | camera)=Pg1 (camera | camera)/(Pg1 (camera | take pictures)+Pg1 (camera | technology)+Pg1 (camera | camera))=0.096/ (0.614+0.48+0.096)=0.08;
Pg (take | take pictures)=Pg1 (take | take pictures)/(Pg1 (take | take pictures)+Pg1 (take | technology)+Pg1 (take | camera))=1.787/ (1.787+0.15+0.063)=0.89;
Pg (take | technology)=Pg1 (take | technology)/(Pg1 (take | take pictures)+Pg1 (take | technology)+Pg1 (take | camera))=0.15/ (1.787+0.15+0.063)=0.08;
Pg (take | camera)=Pg1 (take | camera)/(Pg1 (take | take pictures)+Pg1 (take | technology)+Pg1 (take | camera))=0.063/ (1.787+0.15+0.063)=0.03;
Pg (introduction | take pictures)=Pg1 (introduction | take pictures)/(Pg1 (introduction | take pictures)+Pg1 (introduction | technology))=0.52/ (0.52+0.48)=0.52;
Pg (introduction | technology)=Pg1 (introduction | technology)/(Pg1 (introduction | take pictures)+Pg1 (introduction | technology))=0.48/ (0.52+0.48)=0.48;
Pg (skill | take pictures)=Pg1 (skill | take pictures)/(Pg1 (skill | take pictures)+Pg1 (skill | technology)+Pg1 (skill | camera))=0.599/ (0.599+0.87+0.531)=0.3;
Pg (skill | technology)=Pg1 (skill | technology)/(Pg1 (skill | take pictures)+Pg1 (skill | technology)+Pg1 (skill | camera))=0.87/ (0.599+0.87+0.531)=0.44;
Pg (skill | camera)=Pg1 (skill | camera)/(Pg1 (skill | take pictures)+Pg1 (skill | technology)+Pg1 (skill | camera))=0.531/ (0.599+0.87+0.531)=0.27;
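The normalisation of formula (4) can be sketched as below, again over the assumed dictionary representation of Pg1.

```python
from collections import defaultdict

# Sketch of formula (4): normalise Pg1 over all words nt1 that were ever paired
# with the same nt2, giving the global synonym probability Pg(nt2 | nt1).
def global_probability(pg1):
    column_totals = defaultdict(float)
    for (nt1, nt2), value in pg1.items():
        column_totals[nt2] += value
    return {(nt1, nt2): value / column_totals[nt2]
            for (nt1, nt2), value in pg1.items()}

# e.g. Pg(camera | take pictures) = 0.614 / (0.614 + 0.48 + 0.096) ≈ 0.52
```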
S404: judge whether the current iteration is the last one; if so, execute step S05; otherwise, execute S405;
S405: in each synonym probability matrix S(NT1[i], NT2[j]), initialize the synonym probability of NT1[i] relative to NT2[j] to the corresponding global synonym probability of NT1[i] relative to NT2[j] obtained in this iteration, i.e. P(NT2[j]|NT1[i]) = Pg(NT2[j]|NT1[i]), and execute S303.
Step S05: based on all synonym probability matrices S(NT1[i], NT2[j]), calculate the global synonym confidence of NT1[i] relative to NT2[j], and output the word pairs whose confidence exceeds the preset confidence threshold as synonyms. As shown in Fig. 6, this specifically comprises:
S501: based on all synonym probability matrices S(NT1[i], NT2[j]), calculate the global synonym confidence of NT1[i] relative to NT2[j] by the following formula (5):
conf(NT2[j]|NT1[i])=Pg1(NT2[j]|NT1[i])/M (5)
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
Suppose that, after being processed by steps S03 and S04, Tables 3 and 4 become Tables 5 and 6 below:
Table 5
Table 6
Can obtain according to formula (5):
Conf (camera | take pictures)=Pg1 (camera | take pictures)/2=(0.5+0.06)/2=0.28;
Conf (camera | technology)=Pg1 (camera | technology)/1=0.5;
Conf (camera | camera)=Pg1 (camera | camera)/1=0.94;
Conf (take | take pictures)=Pg1 (take | take pictures)/2=(0.9+0.94)/2=0.92;
Conf (take | technology)=Pg1 (take | technology)/1=0.1;
Conf (take | camera)=Pg1 (take | camera)/1=0.06;
Conf (introduction | take pictures)=Pg1 (introduction | take pictures)/1=0.52;
Conf (introduction | technology)=Pg1 (introduction | technology)/1=0.48;
Conf (skill | take pictures)=Pg1 (skill | take pictures)/2=(0.08+0.46)/2=0.27;
Conf (skill | technology)=Pg1 (skill | technology)/1=0.92;
Conf (skill | camera)=Pg1 (skill | camera)/1=0.54;
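A sketch of the confidence computation and thresholding of S501/S502 follows; the threshold value 0.8 is only an example, the patent leaves it as a preset parameter.

```python
# Sketch of S501/S502: confidence is the average per-occurrence probability
# conf = Pg1 / M; pairs above the (example) threshold are emitted as synonyms.
def extract_synonyms(pg1, m, threshold=0.8):
    synonym_pairs = []
    for (nt1, nt2), total in pg1.items():
        conf = total / m[(nt1, nt2)]
        if conf > threshold:
            synonym_pairs.append((nt1, nt2, conf))
    return synonym_pairs

# With the confidences listed above, pairs such as ("take pictures", "take", 0.92)
# and ("technology", "skill", 0.92) would clear a 0.8 threshold.
```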
S502: extract and save the contexts of the word pairs whose confidence exceeds the preset confidence threshold;
S503: output those word pairs as synonyms, and at the same time output their synonym replacement contexts and context grades.
As shown in Fig. 7, the synonym mining device of the invention comprises a similar aligned corpus extraction module 10, a word segmentation module 20, an adaptive mining module 30, an iteration module 40 and a synonym pair output module 50, wherein
the similar aligned corpus extraction module 10 is used to extract a similar aligned corpus from the search log;
the word segmentation module 20 is used to perform word segmentation on the similar aligned sentences S1, S2 to obtain word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]); to set, for each word of these sequences, a label flag[i], flag[j] with initial value 0, and to traverse S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]); to set to ADDRESS_LABEL the label flag[i] of the words of S1 that are place names, to set flag[i] to NUM_LABEL for number words, and to set to DIFF_LABEL the label flag[i] of the words that do not appear in S2(T2[1], T2[2], ..., T2[j]), obtaining the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]); likewise, to set to ADDRESS_LABEL the label flag[j] of the words of S2 that are place names, to set flag[j] to NUM_LABEL for number words, and to set to DIFF_LABEL the label flag[j] of the words that do not appear in S1(T1[1], T1[2], ..., T1[i]), obtaining the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]);
the adaptive mining module 30 is used to delete, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels; to initialize, according to the entropy principle, the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; to calculate the similarity of NT1[i] to NT2[j] and adjust the synonym probability of NT1[i] relative to NT2[j] according to that similarity; to convert the words labeled NUM_LABEL that are not in Arabic-numeral form into Arabic-numeral form; and to adjust the synonym probability of NT1[i] relative to NT2[j] according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]);
the iteration module 40 is used to perform an iterative computation on the synonym probability of NT1[i] relative to NT2[j]: it stores the preset number of iterations; calculates the sum Pg1(NT2[j]|NT1[i]) of the synonym probabilities P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] mined from the similar aligned corpus; calculates, according to Pg1(NT2[j]|NT1[i]), the global synonym probability Pg(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; and, when the current iteration is not the last one, initializes the synonym probability of NT1[i] relative to NT2[j] to the global synonym probability obtained in this iteration;
the synonym pair output module 50 is used to calculate the global synonym confidence of NT1[i] relative to NT2[j] and to output the word pairs whose confidence exceeds the preset confidence threshold as synonyms; it also extracts and saves the contexts of those word pairs and, when outputting a synonym pair, outputs its synonym replacement context and context grade.
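As a rough illustration of how the five modules of Fig. 7 could be composed; the class and parameter names below are not the patent's API, just a sketch.

```python
# Illustrative wiring of the five modules of Fig. 7; each module is passed in as a
# callable so the sketch stays independent of any particular implementation.
class SynonymMiner:
    def __init__(self, extract, segment, mine, iterate, output):
        self.extract, self.segment = extract, segment
        self.mine, self.iterate, self.output = mine, iterate, output

    def run(self, search_log):
        corpus = self.extract(search_log)                         # module 10
        pairs = [self.segment(s1, s2) for s1, s2 in corpus]       # module 20
        matrices = [self.mine(ns1, ns2) for ns1, ns2 in pairs]    # module 30
        pg1, m = self.iterate(matrices)                           # module 40
        return self.output(pg1, m)                                # module 50
```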
The above description illustrates and describes preferred embodiments of the invention, but, as stated above, it should be understood that the invention is not limited to the forms disclosed herein. This should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications and environments, and can be changed within the scope of the inventive concept described herein through the above teachings or the skill or knowledge of the related art. All changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (12)

1. A synonym mining method, characterized in that the method periodically performs the following steps:
Step A: extracting a similar aligned corpus from the search log, said corpus containing Q pairs of similar aligned sentences;
Step B: performing word segmentation on each pair of similar aligned sentences S1, S2 to obtain Q pairs of word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]);
Step C: for each pair of word sequences, adaptively mining S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]), and calculating the synonym probability of each word in S1 relative to each word in S2, finally obtaining Q synonym probability matrices S(NT1[i], NT2[j]);
Step D: based on all synonym probability matrices S(NT1[i], NT2[j]), performing an iterative computation on the synonym probability of NT1[i] relative to NT2[j];
Step E: based on all synonym probability matrices S(NT1[i], NT2[j]), calculating the global synonym confidence of NT1[i] relative to NT2[j], and outputting the word pairs whose confidence exceeds a preset confidence threshold as synonyms.
2. The synonym mining method of claim 1, characterized in that Step A specifically comprises the following steps:
extracting, in turn, the search terms in the search log whose usage count exceeds a preset count;
extracting, among the web pages retrieved for the current search term, the titles of the pages that were clicked;
forming a pair of similar aligned sentences from the current search term and each such title;
all such sentence pairs forming the similar aligned corpus.
3. The synonym mining method of claim 1, characterized in that Step B further performs the following steps for each pair of word sequences:
for each word of S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]), setting a label flag[i], flag[j] with initial value 0;
traversing S1(T1[1], T1[2], ..., T1[i]);
if T1[i] is a place name, setting flag[i]=ADDRESS_LABEL;
if T1[i] is English, setting flag[i]=ENG_LABEL;
if T1[i] is a number, setting flag[i]=NUM_LABEL;
if T1[i] does not appear in S2(T2[1], T2[2], ..., T2[j]), setting flag[i]=DIFF_LABEL;
after the traversal, obtaining the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]);
likewise traversing S2(T2[1], T2[2], ..., T2[j]);
if T2[j] is a place name, setting flag[j]=ADDRESS_LABEL;
if T2[j] is English, setting flag[j]=ENG_LABEL;
if T2[j] is a number, setting flag[j]=NUM_LABEL;
if T2[j] does not appear in S1(T1[1], T1[2], ..., T1[i]), setting flag[j]=DIFF_LABEL;
after the traversal, obtaining the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]).
4. The synonym mining method of claim 3, characterized in that, before Step C mines the word sequences for synonyms, the following step is also performed:
deleting, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels.
5. The synonym mining method of claim 4, characterized in that Step C specifically performs the following steps for each pair of word sequences:
C1: according to the entropy principle, initializing the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j], obtaining the synonym probability matrix S(NT1[i], NT2[j]);
C2: adjusting the corresponding probability values in said synonym probability matrix S(NT1[i], NT2[j]) according to the similarity of NT1[i] to NT2[j];
C3: converting the words labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) that are not in Arabic-numeral form into Arabic-numeral form;
C4: adjusting the corresponding probability values in said synonym probability matrix S(NT1[i], NT2[j]) according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]).
6. The synonym mining method of claim 5, characterized in that the synonym probabilities of NT1[i] relative to NT2[j] satisfy the following formula:
Σ_{i=1}^{|NS1|} P(NT2[j]|NT1[i]) = 1
where |NS1| is the number of words in S1(NT1[1], NT1[2], ..., NT1[i]); j = 1, 2, ..., |NS2|, and |NS2| is the number of words in S2(NT2[1], NT2[2], ..., NT2[j]).
7. The synonym mining method of claim 5, characterized in that the step of adjusting the corresponding probability values in said synonym probability matrix S(NT1[i], NT2[j]) according to the similarity of NT1[i] to NT2[j] is specifically:
calculating the similarity of NT1[i] to NT2[j] by the following formula:
sim(NT1[i], NT2[j]) = sub(NT1[i], NT2[j]) / max(NT1[i], NT2[j])
where sub(NT1[i], NT2[j]) is the number of identical characters in NT1[i] and NT2[j];
max(NT1[i], NT2[j]) is the larger of the character counts of NT1[i] and NT2[j];
judging whether said sim(NT1[i], NT2[j]) is greater than or equal to 0.5;
if sim(NT1[i], NT2[j]) is greater than or equal to 0.5, letting
P1 = rP(NT2[j]|NT1[i])
where r is a preset adjustment coefficient;
in the synonym probability matrix S(NT1[i], NT2[j]), adding P1 to the synonym probability of NT1[i] relative to NT2[j];
subtracting P1/(|NS2|-1) from the synonym probabilities of NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j];
subtracting P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to NT2[j];
adding P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[i] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[j].
8. The synonym mining method of claim 5, characterized in that the step of adjusting the corresponding probability values in said synonym probability matrix S(NT1[i], NT2[j]) according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]) is specifically:
judging whether a word NT1[k] labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word NT2[h] labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]);
if they are identical,
in the synonym probability matrix S(NT1[i], NT2[j]), adding P1 to the synonym probability of NT1[k] relative to NT2[h];
subtracting P1/(|NS2|-1) from the synonym probabilities of NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h];
subtracting P1/(|NS1|-1) from the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to NT2[h];
adding P1/(|NS1|-1)/(|NS2|-1) to the synonym probabilities of the words of S1(NT1[1], NT1[2], ..., NT1[i]) other than NT1[k] relative to the words of S2(NT2[1], NT2[2], ..., NT2[j]) other than NT2[h].
9. The synonym mining method of claim 5, characterized in that Step D comprises the following steps:
Step D1: setting the number of iterations;
Step D2: calculating, by the following formula, the sum of the synonym probabilities P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] mined from the similar aligned corpus:
Pg1(NT2[j]|NT1[i]) = Σ_{s=1}^{M} P(NT2[j]|NT1[i])
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
Step D3: according to Pg1(NT2[j]|NT1[i]), calculating the global synonym probability of NT1[i] relative to NT2[j] by the following formula:
Pg(NT2[j]|NT1[i]) = Pg1(NT2[j]|NT1[i]) / Σ_{x=1}^{y} Pg1(NT2[j]|NT1[x])
where y is the number of word pairs containing NT2[j] over all synonym probability matrices S(NT1[i], NT2[j]);
Step D4: judging whether the current iteration is the last one; if so, executing Step E; otherwise, executing Step D5;
Step D5: initializing the synonym probability of NT1[i] relative to NT2[j] to the global synonym probability of NT1[i] relative to NT2[j] obtained in this iteration, and executing Step C2.
10. The synonym mining method of claim 1, characterized in that Step E specifically comprises the following steps:
based on all synonym probability matrices S(NT1[i], NT2[j]), calculating the global synonym confidence of NT1[i] relative to NT2[j] by the following formula:
conf(NT2[j]|NT1[i]) = Pg1(NT2[j]|NT1[i]) / M
where M is the number of times NT1[i] corresponds to NT2[j] in the similar aligned corpus;
extracting and saving the contexts of the word pairs whose confidence exceeds the preset confidence threshold;
outputting said word pairs as synonyms, together with their synonym replacement contexts and context grades.
11. A synonym mining device, characterized in that the device comprises a similar aligned corpus extraction module, a word segmentation module, an adaptive mining module, an iteration module and a synonym pair output module, wherein
the similar aligned corpus extraction module is configured to extract a similar aligned corpus from the search log;
the word segmentation module is configured to perform word segmentation on the similar aligned sentences S1, S2 to obtain word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]);
the adaptive mining module is configured to adaptively mine S2(T2[1], T2[2], ..., T2[j]) for synonyms of the words in S1(T1[1], T1[2], ..., T1[i]), and to calculate the synonym probability of each word in S1 relative to each word in S2, obtaining a synonym probability matrix S(NT1[i], NT2[j]);
the iteration module is configured to perform an iterative computation on the synonym probability of NT1[i] relative to NT2[j];
the synonym pair output module is configured to calculate the global synonym confidence of NT1[i] relative to NT2[j] and to output the word pairs whose confidence exceeds a preset confidence threshold as synonyms.
12. The synonym mining device of claim 11, characterized in that
the word segmentation module is configured to set, for each word of the word sequences S1(T1[1], T1[2], ..., T1[i]), S2(T2[1], T2[2], ..., T2[j]), a label flag[i], flag[j] with initial value 0, and to traverse S1(T1[1], T1[2], ..., T1[i]) and S2(T2[1], T2[2], ..., T2[j]); to set to ADDRESS_LABEL the label flag[i] of the words of S1 that are place names; to set flag[i] to ENG_LABEL for English words; to set flag[i] to NUM_LABEL for number words; to set to DIFF_LABEL the label flag[i] of the words that do not appear in S2(T2[1], T2[2], ..., T2[j]), obtaining the labeled word sequence S1(NT1[1], NT1[2], ..., NT1[i]); to set to ADDRESS_LABEL the label flag[j] of the words of S2 that are place names; to set flag[j] to ENG_LABEL for English words; to set flag[j] to NUM_LABEL for number words; and to set to DIFF_LABEL the label flag[j] of the words that do not appear in S1(T1[1], T1[2], ..., T1[i]), obtaining the labeled word sequence S2(NT2[1], NT2[2], ..., NT2[j]);
the adaptive mining module is configured to delete, from S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]), the words labeled 0 together with their labels; to initialize, according to the entropy principle, the synonym probability P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; to calculate the similarity of NT1[i] to NT2[j] and adjust the probability of NT1[i] relative to NT2[j] according to said similarity; to convert the words labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) and S2(NT2[1], NT2[2], ..., NT2[j]) that are not in Arabic-numeral form into Arabic-numeral form; and to adjust the corresponding probabilities according to whether a word labeled NUM_LABEL in S1(NT1[1], NT1[2], ..., NT1[i]) is identical to a word labeled NUM_LABEL in S2(NT2[1], NT2[2], ..., NT2[j]);
the iteration module is configured to store the preset number of iterations; to calculate the sum Pg1(NT2[j]|NT1[i]) of the synonym probabilities P(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j] mined from the similar aligned corpus; to calculate, according to Pg1(NT2[j]|NT1[i]), the global synonym probability Pg(NT2[j]|NT1[i]) of NT1[i] relative to NT2[j]; and, when the current iteration is not the last one, to initialize the synonym probability of NT1[i] relative to NT2[j] to the global synonym probability of NT1[i] relative to NT2[j] obtained in this iteration;
the synonym pair output module is configured to extract and save the contexts of the word pairs whose confidence exceeds the preset confidence threshold, and, when outputting a synonym pair, to also output its synonym replacement context and context grade.
CN201410193704.5A 2014-05-08 2014-05-08 Synonym mining method and device Active CN103942339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410193704.5A CN103942339B (en) 2014-05-08 2014-05-08 Synonym mining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410193704.5A CN103942339B (en) 2014-05-08 2014-05-08 Synonym mining method and device

Publications (2)

Publication Number Publication Date
CN103942339A true CN103942339A (en) 2014-07-23
CN103942339B CN103942339B (en) 2017-06-09

Family

ID=51190007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410193704.5A Active CN103942339B (en) Synonym mining method and device

Country Status (1)

Country Link
CN (1) CN103942339B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010013228A1 (en) * 2008-07-31 2010-02-04 Ginger Software, Inc. Automatic context sensitive language generation, correction and enhancement using an internet corpus
US20120197905A1 (en) * 2011-02-02 2012-08-02 Microsoft Corporation Information retrieval using subject-aware document ranker
CN102760134A (en) * 2011-04-28 2012-10-31 北京百度网讯科技有限公司 Method and device for mining synonyms
CN103699663A (en) * 2013-12-27 2014-04-02 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AM COHEN et al.: "Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts", BMC Bioinformatics *
PETER D. TURNEY et al.: "Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL", European Conference on Machine Learning: ECML 2001 *
宋宇轩: "Research and Implementation of Synonym Mining Based on Search Logs and Click Logs", China Master's Theses Full-text Database, Information Science and Technology *
陈建超 et al.: "A Synonym Set Mining Algorithm Based on Feature Word Relevance", Application Research of Computers (计算机应用研究) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331398A (en) * 2014-10-30 2015-02-04 百度在线网络技术(北京)有限公司 Method and device for generating synonym alignment dictionary
WO2017063538A1 (en) * 2015-10-12 2017-04-20 广州神马移动信息科技有限公司 Method for mining related words, search method, search system
CN105335351A (en) * 2015-10-27 2016-02-17 北京信息科技大学 Synonymy automatically mining method based on patent search log user behaviors
CN105335351B (en) * 2015-10-27 2018-08-28 北京信息科技大学 A kind of synonym automatic mining method based on patent search daily record user behavior
CN106844325A (en) * 2015-12-04 2017-06-13 北大医疗信息技术有限公司 Medical information processing method and medical information processing unit
CN106844325B (en) * 2015-12-04 2022-01-25 北大医疗信息技术有限公司 Medical information processing method and medical information processing apparatus
CN106202038A (en) * 2016-06-29 2016-12-07 北京智能管家科技有限公司 Synonym method for digging based on iteration and device
CN107562713A (en) * 2016-06-30 2018-01-09 北京智能管家科技有限公司 The method for digging and device of synonymous text
CN106777283A (en) * 2016-12-29 2017-05-31 北京奇虎科技有限公司 The method for digging and device of a kind of synonym
CN106777283B (en) * 2016-12-29 2021-02-26 北京奇虎科技有限公司 Synonym mining method and synonym mining device
CN106844571B (en) * 2017-01-03 2020-04-07 北京齐尔布莱特科技有限公司 Method and device for identifying synonyms and computing equipment
CN106844571A (en) * 2017-01-03 2017-06-13 北京齐尔布莱特科技有限公司 Recognize method, device and the computing device of synonym
CN107391495A (en) * 2017-06-09 2017-11-24 北京吾译超群科技有限公司 A kind of sentence alignment schemes of bilingual parallel corporas
CN107391495B (en) * 2017-06-09 2020-08-21 北京同文世纪科技有限公司 Sentence alignment method of bilingual parallel corpus
CN107451212A (en) * 2017-07-14 2017-12-08 北京京东尚科信息技术有限公司 Synonymous method for digging and device based on relevant search
CN107748755A (en) * 2017-09-19 2018-03-02 华为技术有限公司 Synonym method for digging, device, equipment and computer-readable recording medium
CN107748755B (en) * 2017-09-19 2019-11-05 华为技术有限公司 Synonym method for digging, device, equipment and computer readable storage medium
WO2019056781A1 (en) * 2017-09-19 2019-03-28 华为技术有限公司 Synonym mining method, device, equipment and computer readable storage medium
CN107958078A (en) * 2017-12-13 2018-04-24 北京百度网讯科技有限公司 Information generating method and device
CN109522547B (en) * 2018-10-23 2020-09-18 浙江大学 Chinese synonym iteration extraction method based on pattern learning
CN109522547A (en) * 2018-10-23 2019-03-26 浙江大学 Chinese synonym iteration abstracting method based on pattern learning

Also Published As

Publication number Publication date
CN103942339B (en) 2017-06-09

Similar Documents

Publication Publication Date Title
CN103942339A (en) Synonym mining method and device
Hyvönen et al. Semantic autocompletion
Badiou Can politics be thought?
Alex et al. Adapting the Edinburgh geoparser for historical georeferencing
US10678820B2 (en) System and method for computerized semantic indexing and searching
Nguyen-Hoang et al. TSGVi: a graph-based summarization system for Vietnamese documents
CN103729343A (en) Semantic ambiguity eliminating method based on encyclopedia link co-occurrence
JP5250009B2 (en) Suggestion query extraction apparatus and method, and program
CN105404677A (en) Tree structure based retrieval method
Ye et al. Part-of-speech tagging based on dictionary and statistical machine learning
Rychlý et al. Annotated Amharic corpora
Venkataraman et al. Instant search: A hands-on tutorial
Chang et al. Enhancing POI search on maps via online address extraction and associated information segmentation
CN105426490A (en) Tree structure based indexing method
Moretti et al. ALCIDE: An online platform for the Analysis of Language and Content In a Digital Environment
Stanković et al. A bilingual digital library for academic and entrepreneurial knowledge management
Vashisht et al. Enhanced lexicon E-SLIDE framework for efficient sentiment analysis
Sene-Mongaba The Making of Lingala Corpus: An Under-resourced Language and the Internet
CN104090966A (en) Semi-structured data retrieval method based on graph model
Horák et al. Slovak national corpus
Sujatha et al. Evaluation of English-Telugu and English-Tamil Cross Language Information Retrieval System using Dictionary Based Query Translation Method
bin Mohd Rosman et al. Bringing together over-and under-represented languages: Linking Wordnet to the SIL Semantic Domains
Kolthoff et al. Automated retrieval of graphical user interface prototypes from natural language requirements
Lertnattee et al. Using Multicultural Herbal Information to Create Multi-pattern Herb Name Retrieval System
Bhatia et al. Tools and infrastructure for supporting enterprise knowledge graphs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: 403-409, Block C, Building 5, Software Industry Base, Nanshan District, Shenzhen, Guangdong 518057

Patentee after: Shenzhen easou world Polytron Technologies Inc

Address before: A5501-A, Tower A, United Plaza, junction of Binhe Road and Caitian Road, Futian District, Shenzhen, Guangdong 518026

Patentee before: Shenzhen Yisou Science & Technology Development Co., Ltd.

CP03 Change of name, title or address