CN101226532A - Method and system for extracting homoionym in network - Google Patents


Publication number
CN101226532A
Authority
CN
China
Prior art keywords
word
webpage
father
near synonym
anchor text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007103045644A
Other languages
Chinese (zh)
Other versions
CN101226532B (en)
Inventor
禹荣凌
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd filed Critical Tencent Technology Beijing Co Ltd
Priority to CN200710304564A priority Critical patent/CN101226532B/en
Publication of CN101226532A publication Critical patent/CN101226532A/en
Application granted granted Critical
Publication of CN101226532B publication Critical patent/CN101226532B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a method for extracting near-synonyms on a network. The method obtains the anchor texts of the backlinks of a web page, compares the anchor texts pairwise, removes the overlapping words from each, and forms a near-synonym set from the remaining words, from which near-synonyms are extracted. To extract them, the method obtains the frequency of occurrence of the remaining words in the anchor texts, the Rank values of the parent pages of the backlinks, and the Rank values of the main domains of those parent pages; computes a near-synonymy probability for each pair of remaining words from these values; and selects the word pairs whose probability exceeds a preset threshold as near-synonyms. The invention also provides a system for extracting near-synonyms on a network. The invention addresses the low precision and narrow coverage of near-synonym extraction in the prior art, and the near-synonyms it extracts have broader coverage and higher precision.

Description

A method and system for extracting near-synonyms on a network
Technical field
The present invention relates to the field of near-synonym extraction, and in particular to a method and system for extracting near-synonyms on a network.
Background art
Extracting near-synonyms from Internet resources can support web search relevance, natural language processing, text mining, and similar applications. However, it is difficult to find an appropriate way to define near-synonyms on a network, so the prior art still relies on dictionary lookup.
Referring to Fig. 1, the existing method for extracting near-synonyms on a network comprises the following steps.
Step S101: preset a near-synonym dictionary. Near-synonyms are compiled manually from relevant dictionaries and experience, and the compiled near-synonyms form a near-synonym library. The library includes words that are close in meaning in the ordinary sense, such as "having a meal" and "having dinner", or "hesitation" and "hesitating". It also includes near-synonyms in the broad sense, that is, words that denote the same thing, such as "Beijing", "Peking University", and so on.
Step S102: extract the web page text and segment it into words. The required web pages are crawled from the network, the body text is extracted from each page and segmented, the words are separated by spaces, and words with no substantive meaning, such as function words, modal particles, and prepositions, are filtered out.
Step S103: compare the web page text with the near-synonym dictionary and extract the words that also appear in the dictionary.
Step S104: analyze the web page according to the extracted near-synonyms.
The above method only extracts, for analysis, near-synonyms that already exist in the near-synonym library. Because the library is compiled by hand, it generally holds little data and has limited coverage, so the breadth of any extraction based on it is severely limited.
Many near-synonyms on the network arise from online language habits. A preset near-synonym library can hardly include these network-specific near-synonyms, so they cannot be found from the library, and the applicability of the extraction is low.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and system for extracting near-synonyms on a network, so as to address the limited breadth and low applicability of near-synonym extraction in the prior art. The near-synonyms extracted by the present invention have greater breadth and higher precision.
The present invention discloses a method for extracting near-synonyms on a network, comprising: obtaining the anchor text of each backlink of a web page; comparing the anchor texts pairwise and removing the overlapping words from each; and forming a near-synonym set from the remaining words and extracting near-synonyms based on the near-synonym set.
Preferably, before obtaining the anchor text of each backlink of the web page, the method further comprises: obtaining the Rank value of the parent page of each backlink and the Rank value of the main domain to which that parent page belongs; and, if the Rank value of the parent page and/or the Rank value of its main domain is lower than a preset value, removing the backlink corresponding to that parent page.
Preferably, after obtaining the anchor text of each backlink of the web page, the method further comprises: calculating the weight of each anchor text and removing the anchor texts whose weight is lower than a preset value.
Preferably, extracting near-synonyms based on the near-synonym set comprises: obtaining the frequency of occurrence of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page; calculating the near-synonymy probability of each word pair among the remaining words from these values; and selecting the word pairs whose near-synonymy probability exceeds a preset threshold as near-synonyms.
Preferably, extracting near-synonyms based on the near-synonym set comprises: obtaining the frequency of occurrence of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page; calculating the near-synonymy probability of each word pair among the remaining words from these values, and repeating these steps to obtain the near-synonymy probability of each word pair on each web page; for each word pair, multiplying its near-synonymy probability on each web page by the Rank value of the corresponding child page (the page the backlinks point to) and summing the products to obtain the total near-synonymy probability of the word pair; and extracting the word pairs whose total near-synonymy probability exceeds a set threshold as near-synonyms.
Preferably, calculating the near-synonymy probability of each word pair among the remaining words from these values comprises: for each word pair, multiplying the frequency of occurrence of one word of the pair by the Rank value of the parent page of its backlink, taking the logarithm of the product, and multiplying the result by a set coefficient; multiplying the frequency of occurrence of the other word of the pair by the Rank value of the main domain of the parent page of its backlink and taking the logarithm of the product; and adding the two results to obtain the near-synonymy probability of the word pair.
Preferably, calculating the near-synonymy probability of each word pair among the remaining words from these values comprises: for each word pair, multiplying the frequency of occurrence of one word of the pair by the Rank value of the parent page of its backlink, and multiplying the frequency of occurrence of the other word by the Rank value of the main domain of the parent page of its backlink; and adding the two products to obtain the near-synonymy probability of the word pair.
The present invention also discloses a system for extracting near-synonyms on a network, comprising an anchor text acquisition module, a comparison module, a removal module, and a composition module. The anchor text acquisition module obtains the anchor text of each backlink of a web page; the comparison module compares the anchor texts pairwise; the removal module removes the overlapping words from each anchor text; and the composition module forms a near-synonym set from the remaining words.
Preferably, the system further comprises a data acquisition module, a near-synonymy probability calculation module, and a near-synonym module. The data acquisition module obtains the frequency of occurrence of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page; the near-synonymy probability calculation module calculates the near-synonymy probability of each word pair among the remaining words from these values; and the near-synonym module selects the word pairs whose near-synonymy probability exceeds a preset threshold as near-synonyms.
Preferably, the system further comprises a total near-synonymy probability module, which receives the near-synonymy probability of each word pair from the near-synonymy probability calculation module. For each word pair, it multiplies the near-synonymy probability on each web page by the Rank value of the corresponding child page (the page the backlinks point to), sums the products to obtain the total near-synonymy probability of the word pair, and sends the result to the near-synonym module. The near-synonym module extracts the word pairs whose total near-synonymy probability exceeds a set threshold as near-synonyms.
Compared with the prior art, the present invention has the following advantages:
The present invention defines near-synonyms on the network: it uses anchor texts to extract potential near-synonyms on the network, forms a near-synonym set from them, and extracts near-synonyms based on that set. The extracted near-synonyms are large in volume and wide in coverage and reflect the characteristics of network usage, so both the breadth and the precision of the extraction are higher.
The present invention calculates the near-synonymy probability of a word pair from the frequency of occurrence of the words, the Rank value of the parent page of the backlink, and the Rank value of the domain to which that parent page belongs. This probability reflects both how often the word pair is used and how credible the links carrying it are, so it indicates well how similar the two words are in underlying meaning. The near-synonyms selected according to this probability have high precision and practical value.
Description of drawings
Fig. 1 is a flowchart of the existing method for extracting near-synonyms on a network;
Fig. 2 is a flowchart of a first embodiment of the method of the present invention for extracting near-synonyms on a network;
Fig. 3 is a flowchart of a second embodiment of the method of the present invention for extracting near-synonyms on a network;
Fig. 4 is a flowchart of a third embodiment of the method of the present invention for extracting near-synonyms on a network;
Fig. 5 is a flowchart of a fourth embodiment of the method of the present invention for extracting near-synonyms on a network;
Fig. 6 is a schematic diagram of a first embodiment of the system of the present invention for extracting near-synonyms on a network;
Fig. 7 is a schematic diagram of a second embodiment of the system of the present invention for extracting near-synonyms on a network;
Fig. 8 is a schematic diagram of a third embodiment of the system of the present invention for extracting near-synonyms on a network.
Embodiments
To make the above objects, features, and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the drawings and specific embodiments.
When a web page contains a hyperlink (URL) pointing to another web page, the two pages are considered to have a link relationship. The text of the hyperlink is the anchor text. If page A links to page B with anchor text S, page A is called the parent page and page B the child page; the link is a forward link for page A and a backlink for page B. Each page may have multiple forward links and backlinks. The present invention calculates the near-synonymy probability of a word pair from the frequency with which the words occur in the anchor texts of a page's backlinks, the credibility of the parent pages of those backlinks, and the credibility of the main domains of those parent pages, and uses this probability to decide whether the word pair is a pair of near-synonyms. The massive amount of data on the Internet supports, in a probabilistic sense, the feasibility of the method and the accuracy of its results.
The present invention uses anchor texts to extract near-synonyms. Anchor texts that point to the same web page share the same underlying meaning. The words that overlap between anchor texts are generally the page's customary or fixed terms, and the words left after the overlapping words are removed are very likely to be near-synonyms.
Referring to Fig. 2, which shows a first embodiment of the method of the present invention for extracting near-synonyms on a network, the specific steps are as follows.
Step S201: obtain the anchor text of each backlink of the web page. The web server extracts the anchor texts of all forward links in every page on the Internet and then inverts them to obtain the anchor texts of each page's backlinks.
For example, page A points to page B with anchor text S, so S is a forward-link anchor text of page A: page A (S) --> page B. After inversion, this becomes page B (S) <-- page A, and for page B the anchor text S is a backlink anchor text.
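The inversion in step S201 can be sketched in a few lines of Python. This is only an illustration: the triple layout `(parent_url, anchor_text, child_url)` and the function name `invert_links` are assumptions, not part of the patent.

```python
from collections import defaultdict

def invert_links(forward_links):
    """Group anchor texts by the page they point to.

    forward_links: iterable of (parent_url, anchor_text, child_url) triples,
    one per hyperlink found while crawling (a hypothetical input layout).
    Returns {child_url: [(parent_url, anchor_text), ...]}.
    """
    backlinks = defaultdict(list)
    for parent, anchor, child in forward_links:
        backlinks[child].append((parent, anchor))
    return dict(backlinks)

# Page A links to B with anchor text S; after inversion S is a backlink anchor of B.
links = [("A", "S", "B"), ("C", "T", "B"), ("A", "U", "C")]
backlinks = invert_links(links)
print(backlinks["B"])
```

Keeping the parent URL next to each anchor text makes it possible to look up the parent page's Rank value in later steps.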
As another example, pages ID1, ID2, ID3, and ID4 link to the China Merchants Bank homepage ID0 with the anchor texts "China Merchants Bank", "China Merchants Bank homepage", "CMB homepage", and "CMB" respectively, where "CMB" is the abbreviated name of the bank. Homepage ID0 therefore has 4 backlinks, whose anchor texts are, in order, "China Merchants Bank", "China Merchants Bank homepage", "CMB homepage", and "CMB".
Step S202: compare the anchor texts pairwise and determine the maximum common substring. The anchor texts are first segmented into words, the segmented anchor texts are compared pairwise, and their overlapping words are taken as the maximum common substring. For example:
Pages ID1 and ID2: the maximum common substring of "China Merchants Bank" and "China Merchants Bank homepage" is "China Merchants Bank";
Pages ID1 and ID3: "China Merchants Bank" and "CMB homepage" have no common substring;
Pages ID1 and ID4: "China Merchants Bank" and "CMB" have no common substring;
Pages ID2 and ID3: the maximum common substring of "China Merchants Bank homepage" and "CMB homepage" is "homepage";
Pages ID2 and ID4: "China Merchants Bank homepage" and "CMB" have no common substring;
Pages ID3 and ID4: the maximum common substring of "CMB homepage" and "CMB" is "CMB".
Step S203: remove the overlapping words from each anchor text. The maximum common substring is removed from each of the anchor texts above. For example:
Pages ID1 and ID2: empty string, "homepage";
Pages ID1 and ID3: "China Merchants Bank", "CMB homepage";
Pages ID1 and ID4: "China Merchants Bank", "CMB";
Pages ID2 and ID3: "China Merchants Bank", "CMB";
Pages ID2 and ID4: "China Merchants Bank homepage", "CMB";
Pages ID3 and ID4: "CMB", empty string.
Step S204: form a near-synonym set from the remaining words and extract near-synonyms based on that set. Empty strings are ignored, and the words left after the maximum common substring is removed form the near-synonym set. For example: "homepage", "China Merchants Bank", "CMB", "China Merchants Bank homepage", "CMB homepage".
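Steps S202 to S204 can be sketched as follows. The sketch assumes the anchor texts are already segmented into word lists, and it simplifies "maximum common substring" to the set of words both anchor texts share; the function names and the sample strings (a full name and an abbreviation, "CMB", that both precede "homepage") are illustrative assumptions.

```python
from itertools import combinations

def remove_overlap(words_a, words_b):
    """Drop the words two segmented anchor texts share; return what is left of each."""
    shared = set(words_a) & set(words_b)
    rest_a = " ".join(w for w in words_a if w not in shared)
    rest_b = " ".join(w for w in words_b if w not in shared)
    return rest_a, rest_b

def candidate_words(anchors):
    """Compare anchor texts pairwise and collect the non-empty remainders."""
    candidates = []
    for a, b in combinations(anchors, 2):
        # Empty remainders are ignored, as in step S204.
        candidates.extend(r for r in remove_overlap(a, b) if r)
    return candidates

# Two segmented anchor texts that share only the word "homepage".
anchors = [["China Merchants Bank", "homepage"], ["CMB", "homepage"]]
candidates = candidate_words(anchors)
print(candidates)  # ['China Merchants Bank', 'CMB']
```

The returned list keeps duplicates on purpose, so that the frequency of occurrence of each remaining word can be counted afterwards.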
The present invention can extract near-synonyms from the near-synonym set in several ways, for example by manual review or by calculating the near-synonymy probability of each word pair in the set.
The present invention defines near-synonyms on the network: it uses anchor texts to extract potential near-synonyms on the network, forms a near-synonym set from them, and extracts near-synonyms based on that set. The extracted near-synonyms are large in volume and wide in coverage and reflect the characteristics of network usage, so both the breadth and the precision of the extraction are higher.
Before extracting near-synonyms from the anchor texts, the present invention can first check the backlinks and anchor texts and remove those that have no reference value, which further improves the precision of the extraction.
Referring to Fig. 3, which shows a second embodiment of the method of the present invention for extracting near-synonyms on a network, the specific steps are as follows.
Step S301: check each backlink of the web page and remove the backlinks that have no reference value. The backlinks are checked against the Rank value of the parent page of each backlink and the Rank value of the main domain of that parent page. The Rank value of a page reflects its credibility and also indicates its reference value.
The Rank value of the parent page of each backlink and the Rank value of the main domain of that parent page are obtained. If both are lower than the preset value, the backlink is considered to have no reference value and is removed; if both are higher than the preset value, the backlink is considered to have reference value and is kept.
Depending on the application, the present invention may also remove the backlink when either the Rank value of the parent page or the Rank value of its main domain is lower than the preset value.
The preset value is chosen according to the field and nature of the parent page; its range is 100 to 10000.
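The check in step S301 can be sketched as a simple filter. The `Backlink` record, the parameter names, and the default of 100 (the low end of the stated range) are assumptions for illustration. The text does not say what happens in the default variant when only one of the two Rank values is low; this sketch keeps such a backlink.

```python
from dataclasses import dataclass

@dataclass
class Backlink:
    parent_url: str
    anchor_text: str
    parent_rank: float   # Rank value of the parent page
    domain_rank: float   # Rank value of the parent page's main domain

def filter_backlinks(backlinks, min_rank=100, require_both_low=True):
    """Remove backlinks whose parent page and/or main domain rank too low.

    require_both_low=True drops a backlink only when both Rank values are
    below min_rank; False drops it when either one is below min_rank.
    """
    kept = []
    for link in backlinks:
        page_low = link.parent_rank < min_rank
        domain_low = link.domain_rank < min_rank
        drop = (page_low and domain_low) if require_both_low else (page_low or domain_low)
        if not drop:
            kept.append(link)
    return kept

links = [
    Backlink("p1", "s", 500, 500),
    Backlink("p2", "s", 50, 500),
    Backlink("p3", "s", 50, 50),
]
print([l.parent_url for l in filter_backlinks(links)])
print([l.parent_url for l in filter_backlinks(links, require_both_low=False)])
```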
Step S302: obtain the anchor text of each backlink of the web page.
Step S303: calculate the weight of each anchor text and remove the anchor texts whose weight is lower than the preset value. The weight of an anchor text reflects the total number of times the anchor text occurs among the backlinks of the web page. If an anchor text has a high weight, the backlinks carrying it are likely to come from several different websites, and the anchor text has relatively high reference value.
For a backlink anchor text of a child page, let the number of parent pages of the child page be N1, and let these parent pages belong to N2 main domains (some parent pages may belong to the same main domain, so N1 >= N2). Let M1 of the parent pages belong to the same main domain as the child page; the other N1-M1 parent pages then belong to the remaining N2-1 main domains. Let u1 be the weight coefficient for parent pages in the same main domain as the child page and u2 the weight coefficient for parent pages in a different main domain. The weight of the anchor text is:
Anchor text weight = M1*u1 + (N1-M1)*u2.
The value of u1 ranges from 0.05 to 0.15 and is preferably 0.1; the value of u2 ranges from 0.15 to 0.25 and is preferably 0.2. The preset value is chosen according to the field and nature of the web page; its range is 1 to 10.
For example, pages A, A1, A2, A3, B, B1, B2, B3, C, C1, C2, and C3 all link to page K with anchor text S. For the backlink anchor text S of page K, there are 12 parent pages in total. A, A1, A2, and A3 belong to the same main domain, which is also the main domain of K; B, B1, B2, and B3 belong to one main domain; and C, C1, C2, and C3 belong to another. Thus M1 is 4 and N1-M1 is 8. With u1 = 0.1 and u2 = 0.2, the weight of anchor text S is 4 × 0.1 + 8 × 0.2 = 2.
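The weight formula above is easy to check with a short sketch. The function name and the domain strings are hypothetical; the defaults for u1 and u2 are the preferred values stated above.

```python
def anchor_weight(parent_domains, child_domain, u1=0.1, u2=0.2):
    """Weight of one anchor text: M1*u1 + (N1-M1)*u2.

    parent_domains: main domain of each parent page that uses this anchor text.
    child_domain: main domain of the page the backlinks point to.
    """
    n1 = len(parent_domains)
    m1 = sum(1 for d in parent_domains if d == child_domain)
    return m1 * u1 + (n1 - m1) * u2

# 12 parent pages: 4 share the child's main domain, 8 come from two other domains.
parents = ["a.example"] * 4 + ["b.example"] * 4 + ["c.example"] * 4
weight = anchor_weight(parents, "a.example")
print(round(weight, 6))  # 2.0
```

Because u2 is larger than u1, links from other main domains count for more than links from the page's own domain.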
Step S304: compare the anchor texts pairwise and determine the maximum common substring.
Step S305: form a near-synonym set from the remaining words and extract near-synonyms based on that set.
The present invention judges whether a backlink has reference value from the Rank value of its parent page and the Rank value of the main domain of that parent page. These two Rank values together reflect the credibility of the parent page: a highly credible page is unlikely to contain cheating links or spam links, while a page with low credibility is more likely to. The method can therefore effectively remove cheating links and spam links from a page's backlinks and keep the backlinks usable as references. The present invention also removes invalid anchor texts according to their weight, so that the retained anchor texts are better references and the near-synonyms extracted from them are more precise.
The present invention can also calculate the near-synonymy probability of each word pair among the remaining words from the frequency of occurrence of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page, and then extract near-synonyms from the near-synonym set according to this probability.
Referring to Fig. 4, which shows a third embodiment of the method of the present invention for extracting near-synonyms on a network, the specific steps are as follows.
Step S401: check each backlink of the web page and remove the backlinks that have no reference value.
Step S402: obtain the anchor text of each backlink of the web page.
Step S403: calculate the weight of each anchor text and remove the anchor texts whose weight is lower than the preset value.
Step S404: compare the anchor texts pairwise and determine the maximum common substring.
Step S405: extract the remaining words and form a near-synonym set from them.
Step S406: obtain the frequency of occurrence of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page.
Step S407: calculate the near-synonymy probability of each word pair among the remaining words from these values.
A word pair is simply a pair of words. For the word pairs formed by pairing up the remaining words, the near-synonymy probability is defined as F(v1, v2) = Fun(Freq(v1), Freq(v2)), where v1 and v2 are two different words, such as "China Merchants Bank" and its abbreviation "CMB", and Freq(v) is the frequency of occurrence of word v. For example, in the embodiment shown in Fig. 2, "homepage" occurs 1 time, "China Merchants Bank" 3 times, "CMB" 4 times, "China Merchants Bank homepage" 1 time, and "CMB homepage" 1 time.
For each word pair, the near-synonymy probability formula can be:
F(v1, v2) = u*Log(a*d1) + Log(b*d2) + t, where u and t are constant coefficients, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the parent page of the backlink, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonymy probability formula can also be:
F(v1, v2) = u*(a*d1 + b*d2) + t, where u and t are constant coefficients, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the parent page of the backlink, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonymy probability formula can also be:
F(v1, v2) = x*(Log(a*d1) + Log(b*d2)) + y*(a*d1 + b*d2) + z, where x, y, and z are constant coefficients, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the parent page of the backlink, and d2 is the Rank value of the domain to which that parent page belongs.
Of course, the near-synonymy probability formula can also be formed from other combinations of the above formulas, which are not enumerated here.
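The three formula variants can be written out as a minimal Python sketch. Several things here are assumptions: the function names, the default coefficient values, the use of the natural logarithm (the text does not state a base), and, in the combined variant, taking the second logarithm over b*d2 so that it matches the first formula.

```python
import math

def score_log(a, b, d1, d2, u=1.0, t=0.0):
    """F = u*Log(a*d1) + Log(b*d2) + t (natural log assumed)."""
    return u * math.log(a * d1) + math.log(b * d2) + t

def score_linear(a, b, d1, d2, u=1.0, t=0.0):
    """F = u*(a*d1 + b*d2) + t."""
    return u * (a * d1 + b * d2) + t

def score_combined(a, b, d1, d2, x=1.0, y=1.0, z=0.0):
    """F = x*(Log(a*d1) + Log(b*d2)) + y*(a*d1 + b*d2) + z."""
    return x * (math.log(a * d1) + math.log(b * d2)) + y * (a * d1 + b * d2) + z

# a, b: frequencies of the two words; d1: parent page Rank; d2: domain Rank.
print(score_linear(3, 4, 100, 200))
print(score_log(3, 4, 100, 200))
```

None of these raw values is bounded to the range 0 to 1, so "probability" here is a score that only becomes comparable after a threshold is calibrated or the values are normalized.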
Step S408: select the word pairs whose near-synonymy probability exceeds the preset threshold as near-synonyms.
The preset threshold can be set differently for different word fields. Specifically, a reasonable threshold can be chosen by calculating the near-synonymy probabilities of a large number of known near-synonyms.
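One way to choose such a threshold is sketched below. The quantile rule and the `keep_fraction` parameter are illustrative assumptions; the text only says that the threshold is derived from the scores of known near-synonyms.

```python
def calibrate_threshold(known_scores, keep_fraction=0.9):
    """Return the score found at rank keep_fraction in the descending list of
    scores of known near-synonym pairs (a hypothetical calibration rule)."""
    ordered = sorted(known_scores, reverse=True)
    index = int(len(ordered) * keep_fraction) - 1
    index = max(0, min(len(ordered) - 1, index))
    return ordered[index]

def select_pairs(pair_scores, threshold):
    """Keep the word pairs whose score exceeds the threshold (step S408)."""
    return [pair for pair, score in pair_scores.items() if score > threshold]

known = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
threshold = calibrate_threshold(known)
selected = select_pairs({("a", "b"): 5.0, ("c", "d"): 1.0}, threshold)
print(threshold, selected)
```

A separate threshold can be calibrated per word field by passing only the known pairs from that field.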
The present invention calculates the near-synonymy probability of a word pair from the frequency of occurrence of the words, the Rank value of the parent page of the backlink, and the Rank value of the domain to which that parent page belongs. This probability reflects both how often the word pair is used and how credible the links carrying it are, so it indicates well how similar the two words are in underlying meaning. The near-synonyms selected according to this probability have high precision and practical value.
The present invention can also calculate the total near-synonymy probability of a word pair over the web pages of the Internet and use it to judge comprehensively whether the word pair is a pair of near-synonyms, which further improves the precision of the extraction.
Referring to Fig. 5, which shows a fourth embodiment of the method of the present invention for extracting near-synonyms on a network, the specific steps are as follows.
Step S501: check each backlink of the web page and remove the backlinks that have no reference value.
Step S502: obtain the anchor text of each backlink of the web page.
Step S503: calculate the weight of each anchor text and remove the anchor texts whose weight is lower than the preset value.
Step S504: compare the anchor texts pairwise and determine the maximum common substring.
Step S505: extract the remaining words and form a near-synonym set from them.
Step S506: obtain the frequency of occurrence of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page.
Step S507: calculate the near-synonymy probability of each word pair among the remaining words from these values.
A word pair is simply a pair of words. For the word pairs formed by pairing up the remaining words, the near-synonymy probability is defined as F(v1, v2) = Fun(Freq(v1), Freq(v2)), where v1 and v2 are two different words, such as "China Merchants Bank" and its abbreviation "CMB", and Freq(v) is the frequency of occurrence of word v. For example, in the embodiment shown in Fig. 2, "homepage" occurs 1 time, "China Merchants Bank" 3 times, "CMB" 4 times, "China Merchants Bank homepage" 1 time, and "CMB homepage" 1 time.
For each word pair, the near-synonymy probability formula can be:
F(v1, v2) = u*Log(a*d1) + Log(b*d2) + t, where u and t are constant coefficients, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the parent page of the backlink, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonymy probability formula can also be:
F(v1, v2) = u*(a*d1 + b*d2) + t, where u and t are constant coefficients, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the parent page of the backlink, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonymy probability formula can also be:
F(v1, v2) = x*(Log(a*d1) + Log(b*d2)) + y*(a*d1 + b*d2) + z, where x, y, and z are constant coefficients, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the parent page of the backlink, and d2 is the Rank value of the domain to which that parent page belongs.
Of course, the near-synonymy probability formula can also be formed from other combinations of the above formulas, which are not enumerated here.
Step S508: repeat steps S501 to S507 to obtain the near-synonymy probability of each word pair on each web page.
Step S509: for each word pair, multiply its near-synonymy probability on each web page by the Rank value of the corresponding child page (the page the backlinks point to), and sum the products to obtain the total near-synonymy probability of the word pair.
The total near-synonymy probability of a word pair is calculated as:
Similar(v1, v2) = F1(v1, v2)*x1 + F2(v1, v2)*x2 + F3(v1, v2)*x3 + ..., where x1, x2, x3 are the Rank values of the child pages and represent the credibility of those pages.
The total near-synonymy probability is normalized so that its value lies in the range 0 to 1.
Step S510, the total nearly adopted probability of extraction surpass the word of setting threshold, as near synonym.
The span of setting threshold is 0.3-0.8.The field difference that setting threshold is right according to word, value are also different, need to choose a rational setting threshold by to a large amount of near synonym and nearly adopted probability calculation.
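Steps S509 and S510 can be sketched in Python as follows. This is a rough illustration only: the text does not say how normalization is performed, so dividing by the largest total is an assumption, as are the function names, the page identifiers, the sample probabilities (taken to be non-negative) and the threshold of 0.5.

```python
def total_probabilities(per_page_probs, child_rank):
    """Sum each pair's per-page probability weighted by the child page's Rank."""
    totals = {}
    for page, pair_probs in per_page_probs.items():
        for pair, prob in pair_probs.items():
            totals[pair] = totals.get(pair, 0.0) + prob * child_rank[page]
    return totals


def normalize(totals):
    """Scale totals into [0, 1] by dividing by the maximum (assumed scheme)."""
    peak = max(totals.values(), default=0.0)
    if peak <= 0:
        return {pair: 0.0 for pair in totals}
    return {pair: value / peak for pair, value in totals.items()}


def extract_near_synonyms(per_page_probs, child_rank, threshold=0.5):
    """Keep the word pairs whose normalized total exceeds the threshold."""
    scores = normalize(total_probabilities(per_page_probs, child_rank))
    return {pair: score for pair, score in scores.items() if score > threshold}


# Hypothetical input: per-page probabilities for two word pairs.
per_page_probs = {
    "pageA": {("v1", "v2"): 2.0, ("v1", "v3"): 0.5},
    "pageB": {("v1", "v2"): 1.0},
}
child_rank = {"pageA": 0.5, "pageB": 1.0}  # made-up Rank values
print(extract_near_synonyms(per_page_probs, child_rank))
```

With these sample values only the pair (v1, v2) passes the threshold, because its weighted total dominates after normalization.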
The present invention combines a word pair's near-synonym probabilities across the web pages of the Internet to judge whether the pair are near synonyms. This approach takes into account the meaning the words express on each web page, which further improves the precision of near-synonym selection.
The present invention may also calculate the total near-synonym probability of a word pair from the frequency with which the words occur in the anchor texts pointing to a child page, the Rank value of the parent page containing each anchor text and of its main domain, and the Rank value of the child page itself and of its main domain.
For example, suppose the words V1 and V2 occur t1 and t2 times, respectively, in the backlink anchor texts of web page A, and t3 and t4 times in the backlink anchor texts of web page B. The Rank values of web pages A and B are RA and RB, and the Rank values of their main domains are DA and DB. The near-synonym probability of the pair V1, V2 on web page A is Fa = u1*t1*(A1+A2+A3+...) + u1*t2*(A1+A2+A3+...), where u1 is the anchor text weighting coefficient and A1, A2, A3, ... are the weights of the backlink anchor texts of web page A. Likewise, the near-synonym probability of the pair V1, V2 on web page B is Fb = u1*t3*(B1+B2+B3+...) + u1*t4*(B1+B2+B3+...), where u1 is the anchor text weighting coefficient and B1, B2, B3, ... are the weights of the backlink anchor texts of web page B.
Combining Fa and Fb gives the total near-synonym probability of the pair V1, V2: Similar(v1, v2) = u2*(RA*Fa + RB*Fb) + u3*(DA*Fa + DB*Fb), where u2 is the coefficient for the Rank of web pages A and B, u3 is the coefficient for the Rank of the main domains of A and B, RA and RB are the Ranks of web pages A and B, and DA and DB are the Ranks of the main domains in which A and B are located.
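The two-page example can be worked through numerically with a short Python sketch. All names and numbers below are hypothetical and chosen only to make the arithmetic easy to follow; the coefficients u1, u2 and u3 are set to 1.

```python
def page_probability(freq_v1, freq_v2, anchor_weights, u1=1.0):
    """Per-page probability: u1*t1*(sum of weights) + u1*t2*(sum of weights)."""
    weight_sum = sum(anchor_weights)
    return u1 * freq_v1 * weight_sum + u1 * freq_v2 * weight_sum


def total_probability(pages, u2=1.0, u3=1.0):
    """Total: u2*sum(page Rank * F) + u3*sum(main-domain Rank * F)."""
    by_page_rank = sum(p["rank"] * p["prob"] for p in pages)
    by_domain_rank = sum(p["domain_rank"] * p["prob"] for p in pages)
    return u2 * by_page_rank + u3 * by_domain_rank


# Page A: t1=2, t2=1, anchor text weights 0.5 and 0.5 (made-up values).
fa = page_probability(2, 1, [0.5, 0.5])
# Page B: t3=1, t4=1, anchor text weights 0.25 and 0.25 (made-up values).
fb = page_probability(1, 1, [0.25, 0.25])

pages = [
    {"rank": 0.5, "domain_rank": 0.25, "prob": fa},  # web page A
    {"rank": 1.0, "domain_rank": 0.5, "prob": fb},   # web page B
]
total = total_probability(pages)
print(fa, fb, total)
```

Representing each page as a small record keeps the sketch open to more than two pages, which matches the way the total is described as a sum over pages.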
Based on the above method of extracting near synonyms on a network, the present invention also provides a system for extracting near synonyms on a network; the near synonyms extracted by this system have greater breadth and precision.
Referring to Fig. 6, a first embodiment of the system for extracting near synonyms on a network according to the present invention comprises an anchor text acquisition module 61, a comparison module 62, a removal module 63, and a composition module 64.
The anchor text acquisition module 61 obtains the anchor text of each backlink to a web page. It extracts the anchor texts of all forward links in each web page on the Internet, inverts them to obtain the backlink anchor texts of each web page, and sends the anchor texts obtained to the comparison module 62.
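The inversion of forward links into backlink anchor texts can be sketched in Python as follows. This is a minimal illustration that assumes the forward links have already been crawled into a dictionary; the function name and the example URLs are hypothetical.

```python
from collections import defaultdict


def invert_links(forward_links):
    """Map each target URL to the (source page, anchor text) pairs linking to it."""
    backlinks = defaultdict(list)
    for source_url, links in forward_links.items():
        for target_url, anchor_text in links:
            backlinks[target_url].append((source_url, anchor_text))
    return dict(backlinks)


# Hypothetical crawl result: source page -> list of (target URL, anchor text).
forward_links = {
    "https://a.example/": [
        ("https://bank.example/", "Bank home"),
        ("https://b.example/", "B site"),
    ],
    "https://b.example/": [
        ("https://bank.example/", "The bank"),
    ],
}
backlinks = invert_links(forward_links)
print(backlinks["https://bank.example/"])
```

Keeping the source page next to each anchor text makes it possible to look up the parent page's Rank value later in the pipeline.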
The comparison module 62 compares the anchor texts pairwise and determines the maximum common substring. It first performs word segmentation on the anchor texts, compares the segmented anchor texts pairwise, takes their overlapping words as the maximum common substring, and sends the comparison result to the removal module 63.
The removal module 63 removes the overlapping words from each anchor text and sends the remaining words to the composition module 64.
The composition module 64 forms a near-synonym set from the remaining words, and near synonyms are extracted based on this set.
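The comparison, removal and composition steps can be sketched in Python as follows. This is a simplified illustration under several assumptions: the anchor texts are taken to be already segmented into word lists, the maximum common substring is approximated by the set of words the two texts share, and the function name and sample tokens are hypothetical.

```python
from itertools import combinations


def remaining_words(anchor_texts):
    """Compare anchor texts pairwise and collect what is left after removing shared words."""
    candidates = set()
    for first, second in combinations(anchor_texts, 2):
        shared = set(first) & set(second)  # overlapping words of this pair
        for text in (first, second):
            leftover = [word for word in text if word not in shared]
            if leftover:
                candidates.add(" ".join(leftover))
    return candidates


# Hypothetical segmented anchor texts pointing to the same page.
anchor_texts = [
    ["alpha", "bank", "home"],
    ["abank", "home"],
]
print(remaining_words(anchor_texts))
```

Here the shared word "home" is removed from both texts, leaving "alpha bank" and "abank" as candidates for the near-synonym set.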
The near-synonym extraction system of the present invention may also extract near synonyms from the near-synonym set by calculating the near-synonym probability of each word pair.
Referring to Fig. 7, a second embodiment of the system for extracting near synonyms on a network according to the present invention comprises the anchor text acquisition module 61, the comparison module 62, the removal module 63, the composition module 64, a data acquisition module 65, a near-synonym probability calculation module 66, and a near-synonym extraction module 67.
The data acquisition module 65 obtains the frequency of occurrence in the anchor texts of the remaining words from the composition module 64, the Rank value of the parent page of each backlink, and the Rank value of the main domain to which that parent page belongs, and sends them to the near-synonym probability calculation module 66.
The near-synonym probability calculation module 66 calculates the near-synonym probability of each word pair among the remaining words from these values, and sends the calculated probabilities to the near-synonym extraction module 67.
The near-synonym extraction module 67 selects the word pairs whose near-synonym probability exceeds a preset threshold as near synonyms.
Referring to Fig. 8, a third embodiment of the system for extracting near synonyms on a network according to the present invention comprises the anchor text acquisition module 61, the comparison module 62, the removal module 63, the composition module 64, the data acquisition module 65, the near-synonym probability calculation module 66, the near-synonym extraction module 67, and a total near-synonym probability module 68.
The total near-synonym probability module 68 receives the near-synonym probability of each word pair sent by the near-synonym probability calculation module 66. For each word pair, it multiplies the pair's near-synonym probability on each web page by the Rank value of the back-linked child page corresponding to that probability, adds the products to obtain the total near-synonym probability of the word pair, and sends the result to the near-synonym extraction module 67.
The near-synonym extraction module 67 extracts the word pairs whose total near-synonym probability exceeds the set threshold as near synonyms.
The method and system for extracting near synonyms on a network provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principle and embodiments of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (10)

1. A method of extracting near synonyms on a network, characterized in that it comprises:
Obtaining the anchor text of each backlink to a web page;
Comparing the anchor texts pairwise and removing the overlapping words from each;
Forming a near-synonym set from the remaining words, and extracting near synonyms based on the near-synonym set.
2. The method of claim 1, characterized in that, before obtaining the anchor text of each backlink to the web page, the method further comprises:
Obtaining the Rank value of the parent page of each backlink to the web page, and the Rank value of the main domain to which that parent page belongs;
If the Rank value of the parent page and/or the Rank value of the main domain of that page is lower than a preset value, removing the backlink corresponding to that parent page.
3. The method of claim 1, characterized in that, after obtaining the anchor text of each backlink to the web page, the method further comprises:
Calculating the anchor text weights, and removing the anchor texts whose weight is lower than a preset value.
4. The method of claim 1, characterized in that extracting near synonyms based on the near-synonym set specifically comprises:
Obtaining the frequency of occurrence of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain to which that parent page belongs;
Calculating the near-synonym probability of each word pair among the remaining words from the above values;
Selecting the word pairs whose near-synonym probability exceeds a preset threshold as near synonyms.
5. The method of claim 1, characterized in that extracting near synonyms based on the near-synonym set specifically comprises:
Obtaining the frequency of occurrence of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain to which that parent page belongs;
Calculating the near-synonym probability of each word pair among the remaining words from the above values, and repeating the above steps to obtain the near-synonym probability of each word pair for each web page;
For each word pair, multiplying its near-synonym probability on each web page by the Rank value of the back-linked child page corresponding to that probability, and adding the products to obtain the total near-synonym probability of the word pair;
Extracting the word pairs whose total near-synonym probability exceeds a set threshold as near synonyms.
6. The method of claim 4 or 5, characterized in that calculating the near-synonym probability of each word pair among the remaining words from the above values specifically comprises:
For each word pair, multiplying the frequency of occurrence of one word of the pair by the Rank value of its corresponding backlink parent page, taking the logarithm of the product, and then multiplying by a set coefficient;
Multiplying the frequency of occurrence of the other word of the pair by the Rank value of the main domain to which its corresponding backlink parent page belongs, and taking the logarithm of the product;
Adding the resulting values to give the near-synonym probability of the word pair.
7. The method of claim 4 or 5, characterized in that calculating the near-synonym probability of each word pair among the remaining words from the above values specifically comprises:
For each word pair, multiplying the frequency of occurrence of one word of the pair by the Rank value of its corresponding backlink parent page, and multiplying the frequency of occurrence of the other word by the Rank value of the main domain to which its corresponding backlink parent page belongs; adding the resulting values to give the near-synonym probability of the word pair.
8. A system for extracting near synonyms on a network, characterized in that it comprises an anchor text acquisition module, a comparison module, a removal module, and a composition module:
The anchor text acquisition module is used to obtain the anchor text of each backlink to a web page;
The comparison module is used to compare the anchor texts pairwise;
The removal module is used to remove the overlapping words from each anchor text;
The composition module is used to form a near-synonym set from the remaining words.
9. The system of claim 8, characterized in that it further comprises a data acquisition module, a near-synonym probability calculation module, and a near-synonym module:
The data acquisition module is used to obtain the frequency of occurrence of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain to which that parent page belongs;
The near-synonym probability calculation module is used to calculate the near-synonym probability of each word pair among the remaining words from the above values;
The near-synonym module is used to select the word pairs whose near-synonym probability exceeds a preset threshold as near synonyms.
10. The system of claim 8 or 9, characterized in that it further comprises a total near-synonym probability module, which receives the near-synonym probability of each word pair sent by the near-synonym probability calculation module and, for each word pair, multiplies the pair's near-synonym probability on each web page by the Rank value of the back-linked child page corresponding to that probability, adds the products to obtain the total near-synonym probability of the word pair, and sends the result to the near-synonym module;
The near-synonym module extracts the word pairs whose total near-synonym probability exceeds a set threshold as near synonyms.
CN200710304564A 2007-12-28 2007-12-28 Method and system for extracting homoionym in network Active CN101226532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200710304564A CN101226532B (en) 2007-12-28 2007-12-28 Method and system for extracting homoionym in network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200710304564A CN101226532B (en) 2007-12-28 2007-12-28 Method and system for extracting homoionym in network

Publications (2)

Publication Number Publication Date
CN101226532A true CN101226532A (en) 2008-07-23
CN101226532B CN101226532B (en) 2012-10-03

Family

ID=39858533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200710304564A Active CN101226532B (en) 2007-12-28 2007-12-28 Method and system for extracting homoionym in network

Country Status (1)

Country Link
CN (1) CN101226532B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880647A (en) * 2012-08-24 2013-01-16 北京百度网讯科技有限公司 Method and device for acquiring another name of organization
CN110413737A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 A kind of determination method, apparatus, server and the readable storage medium storing program for executing of synonym
CN110413737B (en) * 2019-07-29 2022-10-14 腾讯科技(深圳)有限公司 Synonym determination method, synonym determination device, server and readable storage medium

Also Published As

Publication number Publication date
CN101226532B (en) 2012-10-03

Similar Documents

Publication Publication Date Title
Meusel et al. Graph structure in the web---revisited: a trick of the heavy tail
CN103491205B (en) The method for pushing of a kind of correlated resources address based on video search and device
CN103365924B (en) A kind of method of internet information search, device and terminal
CN103268348B (en) A kind of user's query intention recognition methods
CN103870461B (en) Subject recommending method, device and server
CN102955857B (en) Class center compression transformation-based text clustering method in search engine
CN105843795A (en) Topic model based document keyword extraction method and system
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN102722709A (en) Method and device for identifying garbage pictures
CN102760142A (en) Method and device for extracting subject label in search result aiming at searching query
CN103873601A (en) Addressing class query word mining method and system
CN106156041A (en) Hot information finds method and system
CN104462268B (en) A kind of method and system of html document information extraction expression formula
CN104834736A (en) Method and device for establishing index database and retrieval method, device and system
CN105095430A (en) Method and device for setting up word network and extracting keywords
CN103646029A (en) Similarity calculation method for blog articles
CN104268230A (en) Method for detecting objective points of Chinese micro-blogs based on heterogeneous graph random walk
CN105302807A (en) Method and apparatus for obtaining information category
CN104951478A (en) Information processing method and information processing device
CN105528357A (en) Webpage content extraction method based on similarity of URLs and similarity of webpage document structures
CN102063497B (en) Open type knowledge sharing platform and entry processing method thereof
CN103218452A (en) Method and device for recognizing valid interlinkage in Hub webpage
CN103020083B (en) The automatic mining method of demand recognition template, demand recognition methods and corresponding device
CN102654861A (en) Method and system for calculating webpage extraction accuracy
CN105677906A (en) Automatic collecting and analyzing system and method for network events

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENCENT TECHNOLOGY (BEIJING) CO., LTD.

Effective date: 20131016

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100089 HAIDIAN, BEIJING TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20131016

Address after: A Tencent Building in Shenzhen Nanshan District City, Guangdong streets in Guangdong province science and technology 518057 16

Patentee after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: Beijing 100089 Haidian District 38 Haidian Avenue branch bank building 16 layer

Patentee before: Tencent Technology (Beijing) Co., Ltd