Detailed Description of the Embodiments
To make the above objects, features, and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the accompanying drawings and embodiments.
When a webpage contains a hyperlink (URL) pointing to another webpage, the two webpages are considered to have a link relationship. The text on the hyperlink is the anchor text. If webpage A links to webpage B using anchor text S, webpage A is called the parent page and webpage B the child page; the link is a forward link for webpage A and a backward link (backlink) for webpage B. Any webpage may have multiple forward links and backlinks. The present invention calculates the near-synonym probability of a word pair from the frequency with which the words occur in the anchor texts of a webpage's backlinks, the credibility of each backlink's parent page, and the credibility of the main domain to which that parent page belongs. Whether the word pair is a pair of near-synonyms is then judged from this probability. The large volume of data on the Internet supports, in a probabilistic sense, the feasibility of the method and the accuracy of its results.
The present invention uses anchor text to extract near-synonyms. Anchor texts that point to the same webpage share the same underlying meaning. The words that overlap between such anchor texts are generally common or fixed terms for that webpage, and the words that remain after the overlapping words are removed are very likely to be near-synonyms of one another.
Referring to Fig. 2, a first embodiment of the method of the present invention for extracting near-synonyms on a network is shown. The specific steps are as follows.
Step S201: obtain the anchor text of each backlink of a webpage. The web server extracts the anchor texts of all forward links in every webpage on the Internet, then inverts them to obtain the anchor texts of each webpage's backlinks.
For example, webpage A uses anchor text S to point to webpage B, so S is the anchor text of a forward link of webpage A, written webpage A (S) --> webpage B. After inversion this becomes webpage B (S) <-- webpage A; for webpage B, S is the anchor text of a backlink.
As another example, webpages ID1, ID2, ID3, and ID4 link forward to the China Merchants Bank homepage ID0 using the anchor texts "China Merchants Bank", "China Merchants Bank homepage", "CMB homepage", and "CMB" respectively, where "CMB" is the short form of the bank's name. The homepage ID0 therefore has 4 backlinks, whose anchor texts are, in order, "China Merchants Bank", "China Merchants Bank homepage", "CMB homepage", and "CMB".
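The inversion in step S201 can be sketched in Python as follows. This is a minimal illustration, not the web server's implementation: the function name invert_links and the (source, anchor, target) triple layout are assumptions, and "CMB" stands in for the short form of the bank's name.

```python
from collections import defaultdict


def invert_links(forward_links):
    """Group forward links by their target page.

    forward_links: iterable of (source_page, anchor_text, target_page).
    Returns {target_page: [(source_page, anchor_text), ...]}, i.e. the
    backlinks of each page together with their anchor texts.
    """
    backlinks = defaultdict(list)
    for source, anchor, target in forward_links:
        backlinks[target].append((source, anchor))
    return dict(backlinks)


# Hypothetical forward links mirroring the ID0 example.
forward_links = [
    ("ID1", "China Merchants Bank", "ID0"),
    ("ID2", "China Merchants Bank homepage", "ID0"),
    ("ID3", "CMB homepage", "ID0"),
    ("ID4", "CMB", "ID0"),
]
backlinks = invert_links(forward_links)
```

Here backlinks["ID0"] holds the four (parent page, anchor text) pairs that the later steps compare.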
Step S202: compare the anchor texts pairwise and determine the longest common substring of each pair. The anchor texts are first segmented into words; the segmented anchor texts are then compared pairwise, and the words that overlap are taken as the longest common substring. For example:
Webpage ID1 and webpage ID2: the longest common substring of "China Merchants Bank" and "China Merchants Bank homepage" is "China Merchants Bank";
Webpage ID1 and webpage ID3: "China Merchants Bank" and "CMB homepage" have no common substring;
Webpage ID1 and webpage ID4: "China Merchants Bank" and "CMB" have no common substring;
Webpage ID2 and webpage ID3: the longest common substring of "China Merchants Bank homepage" and "CMB homepage" is "homepage";
Webpage ID2 and webpage ID4: "China Merchants Bank homepage" and "CMB" have no common substring;
Webpage ID3 and webpage ID4: the longest common substring of "CMB homepage" and "CMB" is "CMB".
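The pairwise comparison of step S202 can be sketched as below, assuming the anchor texts have already been segmented into word lists (the segmenter itself is outside this sketch). The function name longest_common_run is hypothetical, and "longest common substring" is interpreted here as the longest contiguous run of shared words.

```python
def longest_common_run(a, b):
    """Return the longest contiguous run of words shared by lists a and b.

    Classic dynamic programming: curr[j] is the length of the common run
    that ends at a[i - 1] and b[j - 1].
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Segmented anchor texts; each bank name is treated as a single word.
id2 = ["China Merchants Bank", "homepage"]
id3 = ["CMB", "homepage"]
common = longest_common_run(id2, id3)
```

An empty list is returned when the two anchor texts share no word.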
Step S203: remove the overlapping words. The longest common substring is removed from each anchor text of the pair. For example:
Webpage ID1 and webpage ID2: empty string, "homepage";
Webpage ID1 and webpage ID3: "China Merchants Bank", "CMB homepage";
Webpage ID1 and webpage ID4: "China Merchants Bank", "CMB";
Webpage ID2 and webpage ID3: "China Merchants Bank", "CMB";
Webpage ID2 and webpage ID4: "China Merchants Bank homepage", "CMB";
Webpage ID3 and webpage ID4: "CMB", empty string.
Step S204: form a near-synonym set from the remaining words and extract near-synonyms from that set. Empty strings are ignored; the words left after the longest common substrings are removed make up the near-synonym set. For example: "homepage", "China Merchants Bank", "CMB", "China Merchants Bank homepage", "CMB homepage".
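Steps S203 and S204 can be sketched together as follows. This is an illustrative sketch under stated assumptions: anchor texts are pre-segmented word lists, the helper names strip_common and candidate_set are hypothetical, and the standard library's difflib.SequenceMatcher is used to locate the longest shared run of words.

```python
from difflib import SequenceMatcher
from itertools import combinations


def strip_common(a, b):
    """Remove the longest shared run of words from both word lists."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    # m.size is 0 when nothing is shared, so both lists come back unchanged.
    rest_a = a[:m.a] + a[m.a + m.size:]
    rest_b = b[:m.b] + b[m.b + m.size:]
    return rest_a, rest_b


def candidate_set(anchors):
    """Collect the non-empty remainders over all pairs of anchor texts."""
    candidates = set()
    for a, b in combinations(anchors, 2):
        for rest in strip_common(a, b):
            if rest:  # empty remainders are ignored
                candidates.add(" ".join(rest))
    return candidates


anchors = [
    ["China Merchants Bank"],
    ["China Merchants Bank", "homepage"],
    ["CMB", "homepage"],
    ["CMB"],
]
candidates = candidate_set(anchors)
```

The resulting set holds the candidate words; deciding which of them are near-synonyms is left to the later steps.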
The present invention may extract near-synonyms from the near-synonym set by means such as manual sorting or calculating the near-synonym probability of each word pair in the set.
The present invention defines near-synonyms on a network, uses anchor text to extract potential near-synonyms on the network, forms them into a near-synonym set, and extracts near-synonyms from that set. The extracted near-synonym data is large in volume and broad in coverage and reflects the characteristics of network usage, so the extracted near-synonyms have greater breadth and precision.
Before extracting near-synonyms from the anchor texts, the present invention may first check the backlinks and anchor texts and remove those that have no reference value, further improving the precision of the extracted near-synonyms.
Referring to Fig. 3, a second embodiment of the method of the present invention for extracting near-synonyms on a network is shown. The specific steps are as follows.
Step S301: check each backlink of the webpage and remove backlinks that have no reference value. Each backlink is checked against the Rank value of its parent page and the Rank value of the main domain to which that parent page belongs, and backlinks with no reference value are removed. A webpage's Rank value reflects the credibility of the webpage and therefore also its reference value.
The Rank value of the parent page of each backlink and the Rank value of that parent page's main domain are obtained. If the Rank values of the parent page and of its main domain are both below the preset value, the backlink is considered to have no reference value and is removed; if both are above the preset value, the backlink is considered to have reference value and is kept.
Depending on the application, the present invention may instead remove a backlink when either the Rank value of the parent page or the Rank value of its main domain is below the preset value.
The preset value is chosen according to the field and nature of the parent page and ranges from 100 to 10000.
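The check in step S301 can be sketched as follows. The Backlink record, its field names, and the strict flag are hypothetical. The text does not say what happens when exactly one of the two Rank values falls below the preset value in the default variant; this sketch assumes such a backlink is kept.

```python
from dataclasses import dataclass


@dataclass
class Backlink:
    parent: str         # identifier of the parent page
    anchor: str         # anchor text of the link
    page_rank: float    # Rank value of the parent page
    domain_rank: float  # Rank value of the parent page's main domain


def keep_backlink(link, preset, strict=False):
    """Decide whether a backlink has reference value.

    Default variant: drop only when both Rank values are below preset.
    Strict variant: drop when either Rank value is below preset.
    """
    page_low = link.page_rank < preset
    domain_low = link.domain_rank < preset
    if strict:
        return not (page_low or domain_low)
    return not (page_low and domain_low)


links = [
    Backlink("A", "S", page_rank=50, domain_rank=60),
    Backlink("B", "S", page_rank=500, domain_rank=50),
    Backlink("C", "S", page_rank=500, domain_rank=900),
]
kept_default = [l.parent for l in links if keep_backlink(l, preset=100)]
kept_strict = [l.parent for l in links if keep_backlink(l, preset=100, strict=True)]
```

The strict variant corresponds to the alternative in which a single low Rank value is enough to discard the backlink.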
Step S302: obtain the anchor text of each backlink of the webpage.
Step S303: calculate the anchor text weights and remove anchor texts whose weight is below the preset value. The weight of an anchor text reflects the total number of times that anchor text occurs among the webpage's backlinks. If an anchor text has a high weight, the backlinks carrying it probably come from several different websites, and the anchor text has relatively high reference value.
For a backlink anchor text of a child page, let the child page have N1 parent pages, belonging to N2 main domains (several parent pages may belong to the same main domain, so N1 >= N2). Suppose M1 of the parent pages belong to the same main domain as the child page; the other N1 - M1 parent pages then belong to the remaining N2 - 1 main domains. Let u1 be the weight coefficient for parent pages in the same main domain as the child page and u2 the weight coefficient for parent pages in a different main domain. The weight of the anchor text is:
anchor text weight = M1 * u1 + (N1 - M1) * u2.
u1 ranges from 0.05 to 0.15 and is preferably 0.1; u2 ranges from 0.15 to 0.25 and is preferably 0.2. The preset value is chosen according to the field and nature of the webpage and ranges from 1 to 10.
For example, 12 webpages A, A1, A2, A3, B, B1, B2, B3, C, C1, C2, C3 all link forward to webpage K using anchor text S. For the backlink anchor text S of webpage K there are therefore 12 parent pages. A, A1, A2, and A3 belong to one main domain, which is also the main domain of K; B, B1, B2, and B3 belong to a second main domain; and C, C1, C2, and C3 belong to a third. Hence M1 = 4 and N1 - M1 = 8. With u1 = 0.1 and u2 = 0.2, the weight of anchor text S = 4 × 0.1 + 8 × 0.2 = 2.
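The weight formula of step S303 and the webpage K example can be checked with a short sketch. The function name anchor_weight and the domain names are hypothetical; the defaults for u1 and u2 are the preferred values given above.

```python
def anchor_weight(parent_domains, child_domain, u1=0.1, u2=0.2):
    """Weight of one anchor text: M1 * u1 + (N1 - M1) * u2.

    parent_domains: main domain of each parent page using the anchor text.
    child_domain:   main domain of the child page.
    """
    n1 = len(parent_domains)
    m1 = sum(1 for d in parent_domains if d == child_domain)
    return m1 * u1 + (n1 - m1) * u2


# 12 parent pages: 4 share K's main domain, 8 belong to two other domains.
parents = ["a.example"] * 4 + ["b.example"] * 4 + ["c.example"] * 4
weight = anchor_weight(parents, child_domain="a.example")
# weight is 4 * 0.1 + 8 * 0.2, i.e. approximately 2.0
```

An anchor text whose weight falls below the preset value (between 1 and 10) would then be discarded.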
Step S304: compare the anchor texts pairwise and determine the longest common substring of each pair.
Step S305: form a near-synonym set from the remaining words and extract near-synonyms from that set.
The present invention judges whether a backlink has reference value from the Rank value of the backlink's parent page and the Rank value of the parent page's main domain. These two Rank values reflect the credibility of the parent page: a highly credible webpage is very unlikely to contain cheating links or spam links, whereas a webpage of low credibility is more likely to. The method can therefore effectively remove cheating links and spam links from a webpage's backlinks and ensure that the retained backlinks are worth referencing. The present invention also removes invalid anchor texts according to anchor text weight, so that the retained anchor texts are more worth referencing and the near-synonyms extracted from them are more precise.
The present invention may also calculate the near-synonym probability of each word pair among the remaining words from the frequency of occurrence of the remaining words in the anchor texts, the Rank value of each backlink's parent page, and the Rank value of the parent page's main domain, and extract near-synonyms from the near-synonym set according to that probability.
Referring to Fig. 4, a third embodiment of the method of the present invention for extracting near-synonyms on a network is shown. The specific steps are as follows.
Step S401: check each backlink of the webpage and remove backlinks that have no reference value.
Step S402: obtain the anchor text of each backlink of the webpage.
Step S403: calculate the anchor text weights and remove anchor texts whose weight is below the preset value.
Step S404: compare the anchor texts pairwise and determine the longest common substring of each pair.
Step S405: extract the remaining words and form a near-synonym set from them.
Step S406: obtain the frequency of occurrence of the remaining words in the anchor texts, the Rank value of each backlink's parent page, and the Rank value of the parent page's main domain.
Step S407: calculate the near-synonym probability of each word pair among the remaining words from these values.
A word pair is simply a pair of words. For the word pairs formed by taking the remaining words two at a time, the near-synonym probability of a word pair is defined as f(v1, v2) = Fun(Freq(v1), Freq(v2)), where v1 and v2 denote two different words, such as "China Merchants Bank" and "CMB" (the short form of the bank's name), and Freq(v) is the frequency of occurrence of word v. For example, in the embodiment shown in Fig. 2, the frequency of occurrence of "homepage" is 1, of "China Merchants Bank" is 3, of "CMB" is 4, of "China Merchants Bank homepage" is 1, and of "CMB homepage" is 1.
For each word pair, the near-synonym probability may be calculated as:
F(v1, v2) = u * Log(a * d1) + Log(b * d2) + t, where u and t are constant factors, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the backlink's parent page, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonym probability may also be calculated as:
F(v1, v2) = u * (a * d1 + b * d2) + t, where u and t are constant factors, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the backlink's parent page, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonym probability may also be calculated as:
F(v1, v2) = x * (Log(a * d1) + Log(b * d2)) + y * (a * d1 + b * d2) + z, where x, y, and z are constant factors, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the backlink's parent page, and d2 is the Rank value of the domain to which that parent page belongs.
Other combinations of the above formulas may of course also be used and are not enumerated here.
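The three candidate formulas can be written out as a short Python sketch. The function names are hypothetical, Log is assumed to be the natural logarithm (the base is not specified), the default constants are placeholders, and the combined form is assumed to use Log(b * d2) in its second logarithm, in line with the other two formulas.

```python
import math


def f_log(a, b, d1, d2, u=1.0, t=0.0):
    """F = u * Log(a * d1) + Log(b * d2) + t."""
    return u * math.log(a * d1) + math.log(b * d2) + t


def f_linear(a, b, d1, d2, u=1.0, t=0.0):
    """F = u * (a * d1 + b * d2) + t."""
    return u * (a * d1 + b * d2) + t


def f_combined(a, b, d1, d2, x=1.0, y=1.0, z=0.0):
    """F = x * (Log(a * d1) + Log(b * d2)) + y * (a * d1 + b * d2) + z."""
    return (x * (math.log(a * d1) + math.log(b * d2))
            + y * (a * d1 + b * d2) + z)


# a, b: frequencies of v1 and v2; d1, d2: parent page and domain Rank values.
score = f_linear(a=3, b=4, d1=2, d2=5, u=0.5, t=1.0)
```

These raw values are scores rather than probabilities bounded by 1; the constants would need to be tuned together with the threshold used to select word pairs.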
Step S408: select the word pairs whose near-synonym probability exceeds a preset threshold as near-synonyms.
The preset threshold may differ according to the field of the words. Specifically, a reasonable threshold can be chosen by calculating the near-synonym probabilities of a large number of known near-synonyms.
The present invention calculates the near-synonym probability of a word pair from the frequency of occurrence of the words, the Rank value of the backlink's parent page, and the Rank value of the domain to which that parent page belongs. This probability reflects both how often the word pair is used and the credibility of the links in which it occurs, so it is a good indicator of how closely the underlying meanings of the two words agree. Near-synonyms selected according to it have high precision and practical value.
The present invention may also calculate the total near-synonym probability of a word pair across the webpages of the Internet and use it to judge comprehensively whether the word pair is a pair of near-synonyms, further improving the precision of extraction.
Referring to Fig. 5, a fourth embodiment of the method of the present invention for extracting near-synonyms on a network is shown. The specific steps are as follows.
Step S501: check each backlink of the webpage and remove backlinks that have no reference value.
Step S502: obtain the anchor text of each backlink of the webpage.
Step S503: calculate the anchor text weights and remove anchor texts whose weight is below the preset value.
Step S504: compare the anchor texts pairwise and determine the longest common substring of each pair.
Step S505: extract the remaining words and form a near-synonym set from them.
Step S506: obtain the frequency of occurrence of the remaining words in the anchor texts, the Rank value of each backlink's parent page, and the Rank value of the parent page's main domain.
Step S507: calculate the near-synonym probability of each word pair among the remaining words from these values.
A word pair is simply a pair of words. For the word pairs formed by taking the remaining words two at a time, the near-synonym probability of a word pair is defined as f(v1, v2) = Fun(Freq(v1), Freq(v2)), where v1 and v2 denote two different words, such as "China Merchants Bank" and "CMB" (the short form of the bank's name), and Freq(v) is the frequency of occurrence of word v. For example, in the embodiment shown in Fig. 2, the frequency of occurrence of "homepage" is 1, of "China Merchants Bank" is 3, of "CMB" is 4, of "China Merchants Bank homepage" is 1, and of "CMB homepage" is 1.
For each word pair, the near-synonym probability may be calculated as:
F(v1, v2) = u * Log(a * d1) + Log(b * d2) + t, where u and t are constant factors, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the backlink's parent page, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonym probability may also be calculated as:
F(v1, v2) = u * (a * d1 + b * d2) + t, where u and t are constant factors, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the backlink's parent page, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonym probability may also be calculated as:
F(v1, v2) = x * (Log(a * d1) + Log(b * d2)) + y * (a * d1 + b * d2) + z, where x, y, and z are constant factors, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the backlink's parent page, and d2 is the Rank value of the domain to which that parent page belongs.
Other combinations of the above formulas may of course also be used and are not enumerated here.
Step S508: repeat steps S501 to S507 to obtain the near-synonym probability of each word pair for each webpage.
Step S509: for each word pair, multiply its near-synonym probability for each webpage by the Rank value of the backlinked child page to which that probability corresponds, and add the products to obtain the total near-synonym probability of the word pair.
The total near-synonym probability of a word pair is calculated as:
Similar(v1, v2) = F1(v1, v2) * x1 + F2(v1, v2) * x2 + F3(v1, v2) * x3 + ..., where x1, x2, x3, ... are the Rank values of the backlinked child pages and represent the credibility of those pages.
The total near-synonym probability is normalized so that its value lies in the range 0 to 1.
Step S510: extract the word pairs whose total near-synonym probability exceeds a set threshold as near-synonyms.
The set threshold ranges from 0.3 to 0.8. It differs with the field of the word pair and needs to be chosen through calculation over a large number of near-synonyms and their near-synonym probabilities.
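Steps S509 and S510 can be sketched as follows. The normalization method is not specified in the text, so this sketch assumes non-negative per-page scores and divides every total by the largest total; the function names, the input layout, and the threshold of 0.5 (chosen from the 0.3 to 0.8 range) are likewise assumptions.

```python
def total_scores(per_page):
    """Sum F_i * x_i over the pages on which each word pair was scored.

    per_page: {word_pair: [(F_i, x_i), ...]}, where F_i is the pair's
    near-synonym probability for page i and x_i that page's Rank value.
    """
    return {pair: sum(f * x for f, x in obs) for pair, obs in per_page.items()}


def normalize(totals):
    """Scale totals into the range 0 to 1 (assumed: divide by the maximum)."""
    top = max(totals.values(), default=0.0)
    if top <= 0:
        return {pair: 0.0 for pair in totals}
    return {pair: s / top for pair, s in totals.items()}


def select(normalized, threshold=0.5):
    """Keep the word pairs whose normalized total exceeds the threshold."""
    return sorted(pair for pair, s in normalized.items() if s > threshold)


# Hypothetical per-page scores and child page Rank values.
per_page = {
    ("China Merchants Bank", "CMB"): [(2.0, 3.0), (1.0, 4.0)],
    ("CMB", "homepage"): [(0.5, 2.0)],
}
totals = total_scores(per_page)
synonyms = select(normalize(totals))
```

With these illustrative numbers only the pair ("China Merchants Bank", "CMB") passes the threshold.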
The present invention combines the near-synonym probabilities of a word pair across the webpages of the Internet to judge whether the word pair is a pair of near-synonyms. This takes into account the meaning the word pair expresses on each webpage and further improves the precision with which near-synonyms are selected.
The present invention may also calculate the total near-synonym probability of a word pair from the frequency with which the words occur in the anchor texts of a child page, the Rank value of the parent pages to which the anchor texts belong and the Rank value of their main domains, and the Rank value of the child page itself and the Rank value of its main domain.
For example, let the frequencies of occurrence of the word pair V1, V2 in the backlink anchor texts of webpage A be t1 and t2, and in the backlink anchor texts of webpage B be t3 and t4; let the Rank values of webpage A and webpage B be RA and RB, and the Rank values of their main domains be DA and DB. The near-synonym probability of V1, V2 on webpage A is Fa = u1 * t1 * (A1 + A2 + A3 + ...) + u1 * t2 * (A1 + A2 + A3 + ...), where u1 is the anchor text weighting coefficient and A1, A2, A3, ... are the weights of the backlink anchor texts of webpage A. Similarly, the near-synonym probability of V1, V2 on webpage B is Fb = u1 * t3 * (B1 + B2 + B3 + ...) + u1 * t4 * (B1 + B2 + B3 + ...), where B1, B2, B3, ... are the weights of the backlink anchor texts of webpage B.
Combining Fa and Fb gives the total near-synonym probability of the word pair V1, V2: Similar(v1, v2) = u2 * (RA * Fa + RB * Fb) + u3 * (DA * Fa + DB * Fb), where u2 is the coefficient for the webpage Rank of A and B, u3 is the coefficient for the Rank of the main domains to which A and B belong, RA and RB are the webpage Rank values of A and B, and DA and DB are the Rank values of the main domains in which A and B are located.
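The two-page example can be traced numerically with the sketch below. The function names and all input numbers are illustrative assumptions; only the shape of the two formulas is taken from the text.

```python
def page_score(t_first, t_second, anchor_weights, u1=1.0):
    """Per-page score, e.g. Fa = u1 * t1 * (A1 + A2 + ...) + u1 * t2 * (A1 + A2 + ...)."""
    weight_sum = sum(anchor_weights)
    return u1 * t_first * weight_sum + u1 * t_second * weight_sum


def combined_similarity(pages, u2=1.0, u3=1.0):
    """Similar = u2 * sum(R_i * F_i) + u3 * sum(D_i * F_i).

    pages: list of (F, page_rank, domain_rank), one entry per webpage.
    """
    by_page_rank = sum(rank * f for f, rank, _ in pages)
    by_domain_rank = sum(domain * f for f, _, domain in pages)
    return u2 * by_page_rank + u3 * by_domain_rank


fa = page_score(2, 1, [1.0, 2.0])  # webpage A: t1 = 2, t2 = 1
fb = page_score(1, 1, [0.5, 0.5])  # webpage B: t3 = 1, t4 = 1
similar = combined_similarity(
    [(fa, 2.0, 3.0), (fb, 1.0, 4.0)],  # (F, Rank of page, Rank of main domain)
    u2=0.5,
    u3=0.25,
)
```

Extending the list passed to combined_similarity covers more than two webpages in the same way.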
Based on the above method of extracting near-synonyms on a network, the present invention also provides a system for extracting near-synonyms on a network. The near-synonyms extracted by this system have greater breadth and precision.
Referring to Fig. 6, a first embodiment of the system of the present invention for extracting near-synonyms on a network is shown, comprising an anchor text acquisition module 61, a comparison module 62, a removal module 63, and a composition module 64.
The anchor text acquisition module 61 obtains the anchor text of each backlink of a webpage. It extracts the anchor texts of all forward links in every webpage on the Internet, inverts them to obtain the anchor texts of each webpage's backlinks, and sends the anchor texts obtained to the comparison module 62.
The comparison module 62 compares the anchor texts pairwise and determines the longest common substring of each pair. It first segments the anchor texts into words, compares the segmented anchor texts pairwise, takes the overlapping words as the longest common substring, and sends the comparison result to the removal module 63.
The removal module 63 removes the overlapping words and sends the remaining words to the composition module 64.
The composition module 64 forms a near-synonym set from the remaining words, and near-synonyms are extracted from that set.
The near-synonym extraction system of the present invention may also extract near-synonyms from the near-synonym set by calculating the near-synonym probabilities of word pairs.
Referring to Fig. 7, a second embodiment of the system of the present invention for extracting near-synonyms on a network is shown, comprising the anchor text acquisition module 61, the comparison module 62, the removal module 63, the composition module 64, a data acquisition module 65, a near-synonym probability calculation module 66, and a near-synonym extraction module 67.
The data acquisition module 65 obtains the frequency of occurrence in the anchor texts of the remaining words held by the composition module 64, the Rank value of each backlink's parent page, and the Rank value of that parent page's main domain, and sends them to the near-synonym probability calculation module 66.
The near-synonym probability calculation module 66 calculates the near-synonym probability of each word pair among the remaining words from these values and sends the calculated probabilities to the near-synonym extraction module 67.
The near-synonym extraction module 67 selects the word pairs whose near-synonym probability exceeds the preset threshold as near-synonyms.
Referring to Fig. 8, a third embodiment of the system of the present invention for extracting near-synonyms on a network is shown, comprising the anchor text acquisition module 61, the comparison module 62, the removal module 63, the composition module 64, the data acquisition module 65, the near-synonym probability calculation module 66, the near-synonym extraction module 67, and a total near-synonym probability module 68.
The total near-synonym probability module 68 receives the near-synonym probability of each word pair from the near-synonym probability calculation module 66. For each word pair, it multiplies the pair's near-synonym probability for each webpage by the Rank value of the backlinked child page to which that probability corresponds, adds the products to obtain the total near-synonym probability of the word pair, and sends the result to the near-synonym extraction module 67.
The near-synonym extraction module 67 extracts the word pairs whose total near-synonym probability exceeds the set threshold as near-synonyms.
The method and system for extracting near-synonyms on a network provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the invention, and the description of the embodiments is intended only to aid understanding of the method and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this description should not be construed as limiting the present invention.