CN101226532B - Method and system for extracting homoionym in network - Google Patents


Info

Publication number
CN101226532B
Authority
CN
China
Prior art keywords
webpage
word
father
anchor text
near synonym
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200710304564A
Other languages
Chinese (zh)
Other versions
CN101226532A (en
Inventor
禹荣凌
刘云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Beijing Co Ltd
Priority to CN200710304564A
Publication of CN101226532A
Application granted
Publication of CN101226532B
Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a method for extracting near-synonyms on a network. The method comprises: obtaining the anchor texts of the backlinks of a web page; comparing the anchor texts pairwise and removing the overlapping words from each; forming a near-synonym set from the remaining words; and extracting near-synonyms from that set. For the extraction, the method obtains the occurrence frequency of the remaining words in the anchor texts, the Rank values of the parent pages of the backlinks, and the Rank values of the main domains of those parent pages; computes from these values a near-synonym probability for each pair of remaining words; and selects the word pairs whose probability exceeds a preset threshold as near-synonyms. The invention also provides a system for extracting near-synonyms on a network. The invention addresses the low coverage and low accuracy of near-synonym extraction in the prior art, and the near-synonyms it extracts have higher coverage and accuracy.

Description

Method and system for extracting near-synonyms on a network
Technical field
The present invention relates to the field of near-synonym extraction, and in particular to a method and system for extracting near-synonyms on a network.
Background art
Extracting near-synonyms from Internet resources can support web search relevance, natural language processing, text mining, and similar applications. However, it is difficult to find an appropriate way to define near-synonyms on a network, so the prior art still relies on lookup and comparison against a dictionary.
Referring to Fig. 1, the existing method for extracting near-synonyms on a network comprises the following steps.
Step S101: preset a near-synonym dictionary. Near-synonyms are compiled manually from relevant dictionaries and from experience, and the compiled near-synonyms form a near-synonym library. The library contains words whose meanings are close in the ordinary sense, such as "having a meal" and "having dinner", or "hesitation" and "hesitating"; it also contains near-synonyms in the broad sense, that is, words that denote the same thing, such as "Beijing" and "Peking University".
Step S102: extract the web page text and segment it into words. The required web pages are crawled from the network, the body text is extracted from each page and segmented into words separated by spaces, and words without substantive meaning, such as function words, modal particles, and prepositions, are filtered out.
Step S103: compare the web page text with the near-synonym dictionary and extract the words that also appear in the dictionary.
Step S104: analyze the web page according to the extracted near-synonyms.
The above method merely extracts from the network the near-synonyms that already appear in an existing near-synonym library and then analyzes them. Because the library is compiled manually, it generally holds little data and has narrow coverage, so extraction based on it is severely limited in coverage.
Many near-synonyms on the network arise from online language habits. A library set up in advance can hardly include these network-specific near-synonyms, so they cannot be found with a preset library, and the applicability of the extraction is low.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and system for extracting near-synonyms on a network, so as to overcome the low coverage and low applicability of near-synonym extraction in the prior art. The near-synonyms extracted by the present invention have higher coverage and precision.
The present invention discloses a method for extracting near-synonyms on a network, comprising: obtaining the anchor text of each backlink of a web page; calculating the weight of each anchor text and removing the anchor texts whose weight is below a preset value, wherein, for a backlink anchor text of a child page, the weight is the number of parent pages that belong to the same main domain as the child page multiplied by a first weight coefficient, plus the number of parent pages that do not belong to the same main domain as the child page multiplied by a second weight coefficient; comparing the anchor texts pairwise and removing the overlapping words from each; and forming a near-synonym set from the remaining words and extracting near-synonyms from that set.
Preferably, before the anchor text of each backlink of the web page is obtained, the method further comprises: obtaining the Rank value of the parent page of each backlink and the Rank value of the main domain to which that parent page belongs; and, if the Rank value of the parent page and/or the Rank value of its main domain is below a preset value, removing the backlink corresponding to that parent page.
Preferably, extracting near-synonyms from the near-synonym set comprises: obtaining the occurrence frequency of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page; calculating from these values the near-synonym probability of each word pair among the remaining words; and selecting the word pairs whose near-synonym probability exceeds a preset threshold as near-synonyms.
Preferably, extracting near-synonyms from the near-synonym set comprises: obtaining the occurrence frequency of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page; calculating from these values the near-synonym probability of each word pair among the remaining words; repeating the above steps to obtain the near-synonym probability of each word pair for each web page; for each word pair, multiplying its near-synonym probability for each web page by the Rank value of the corresponding backlinked child page and adding the products to obtain the total near-synonym probability of the word pair; and extracting the word pairs whose total near-synonym probability exceeds a set threshold as near-synonyms.
Preferably, calculating the near-synonym probability of each word pair among the remaining words comprises, for each word pair: multiplying the occurrence frequency of one word of the pair by the Rank value of its corresponding backlink parent page, taking the logarithm of the product, and multiplying the result by a set coefficient; multiplying the occurrence frequency of the other word of the pair by the Rank value of the main domain of its corresponding backlink parent page, and taking the logarithm of the product; and adding the two results to obtain the near-synonym probability of the word pair.
Preferably, calculating the near-synonym probability of each word pair among the remaining words comprises, for each word pair: multiplying the occurrence frequency of one word of the pair by the Rank value of its corresponding backlink parent page; multiplying the occurrence frequency of the other word by the Rank value of the main domain of its corresponding backlink parent page; and adding the two results to obtain the near-synonym probability of the word pair.
The present invention also discloses a system for extracting near-synonyms on a network, comprising an anchor text acquisition module, a comparison module, a removal module, and a composition module, and further comprising a module for calculating the anchor text weights and removing the anchor texts whose weight is below a preset value. The anchor text acquisition module obtains the anchor text of each backlink of a web page. In the weight module, for a backlink anchor text of a child page, the weight is the number of parent pages that belong to the same main domain as the child page multiplied by a first weight coefficient, plus the number of parent pages that do not belong to the same main domain as the child page multiplied by a second weight coefficient. The comparison module compares the anchor texts pairwise, takes the overlapping words as the maximal common substring, and sends the comparison result to the removal module. The removal module removes the overlapping words from each anchor text and sends the remaining words to the composition module. The composition module forms the near-synonym set from the remaining words, and near-synonyms are extracted from that set.
Preferably, the system further comprises a data acquisition module, a near-synonym probability calculation module, and a near-synonym module. The data acquisition module obtains the occurrence frequency of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page. The near-synonym probability calculation module calculates from these values the near-synonym probability of each word pair among the remaining words. The near-synonym module selects the word pairs whose near-synonym probability exceeds a preset threshold as near-synonyms.
Preferably, the system further comprises a total near-synonym probability module, which receives the near-synonym probability of each word pair from the near-synonym probability calculation module. For each word pair, it multiplies the pair's near-synonym probability for each web page by the Rank value of the corresponding backlinked child page, adds the products to obtain the total near-synonym probability of the pair, and sends the total to the near-synonym module. The near-synonym module extracts the word pairs whose total near-synonym probability exceeds a set threshold as near-synonyms.
Compared with prior art, the present invention has the following advantages:
The present invention defines near-synonyms on the network: it uses anchor texts to extract potential near-synonyms from the network, forms them into a near-synonym set, and extracts near-synonyms from that set. The extracted near-synonyms are large in volume and broad in coverage and reflect how words are actually used on the network, so both the coverage and the precision of the extraction are higher.
The present invention calculates the near-synonym probability of a word pair from the occurrence frequency of the words, the Rank value of the backlink parent page, and the Rank value of the domain to which that parent page belongs. This probability reflects both how often the word pair is used and how credible the links containing it are, so it indicates well how similar the two words are in underlying meaning. Near-synonyms selected according to this probability have high precision and practical value.
Brief description of the drawings
Fig. 1 is a flowchart of the existing method for extracting near-synonyms on a network;
Fig. 2 is a flowchart of a first embodiment of the method of the present invention for extracting near-synonyms on a network;
Fig. 3 is a flowchart of a second embodiment of the method of the present invention for extracting near-synonyms on a network;
Fig. 4 is a flowchart of a third embodiment of the method of the present invention for extracting near-synonyms on a network;
Fig. 5 is a flowchart of a fourth embodiment of the method of the present invention for extracting near-synonyms on a network;
Fig. 6 is a schematic diagram of a first embodiment of the system of the present invention for extracting near-synonyms on a network;
Fig. 7 is a schematic diagram of a second embodiment of the system of the present invention for extracting near-synonyms on a network;
Fig. 8 is a schematic diagram of a third embodiment of the system of the present invention for extracting near-synonyms on a network.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and embodiments.
When a web page contains a hyperlink (URL) pointing to another web page, a link relationship exists between the two pages. The text of the hyperlink is the anchor text. If page A links to page B with anchor text S, page A is called the parent page and page B the child page; the link is a forward link for page A and a backlink for page B. Each page may have multiple forward links and multiple backlinks. The present invention calculates the near-synonym probability of a word pair from the frequency with which the words occur in the backlink anchor texts of a page, the credibility of the backlink parent pages, and the credibility of the main domains of those parent pages, and uses this probability to decide whether the word pair is a pair of near-synonyms. The large volume of data on the Internet supports, in a probabilistic sense, the feasibility of the method and the accuracy of its results.
The present invention extracts near-synonyms from anchor texts. Anchor texts that point to the same page share the same underlying meaning. The words that overlap between such anchor texts are generally the customary or fixed terms for the page, and the words that remain after the overlapping words are removed are very likely to be near-synonyms.
Referring to Fig. 2, a first embodiment of the method of the present invention for extracting near-synonyms on a network is shown; the steps are as follows.
Step S201: obtain the anchor text of each backlink of a web page. A web server extracts the anchor texts of all forward links in every web page on the Internet and then inverts them to obtain the backlink anchor texts of each page.
For example, page A points to page B with anchor text S, so S is a forward-link anchor text of page A: page A (S) --> page B. After inversion this becomes page B (S) <-- page A, so for page B the anchor text S is a backlink anchor text.
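The inversion in step S201 can be sketched as follows. This is a minimal illustration only: the function name invert_links and the representation of each link as a (parent page, anchor text, child page) triple are assumptions made for the sketch, not part of the disclosure.

```python
from collections import defaultdict


def invert_links(forward_links):
    """Group forward links by their target page to obtain each page's backlinks.

    forward_links is an iterable of (parent_page, anchor_text, child_page) triples.
    The result maps each child page to a list of (parent_page, anchor_text) pairs.
    """
    backlinks = defaultdict(list)
    for parent, anchor, child in forward_links:
        backlinks[child].append((parent, anchor))
    return dict(backlinks)


# Page A points to page B with anchor text S.
backlinks = invert_links([("A", "S", "B")])
print(backlinks)
```

For page B, the index then holds anchor text S together with its parent page A, which is the information the later steps need.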
As a further example, pages ID1, ID2, ID3, and ID4 link forward to the China Merchants Bank home page ID0 using the anchor texts "China Merchants Bank", "China Merchants Bank homepage", "CMB homepage", and "CMB" respectively, where "CMB" here stands for the abbreviated form of the bank's name. The home page ID0 therefore has 4 backlinks, whose anchor texts are, in order, "China Merchants Bank", "China Merchants Bank homepage", "CMB homepage", and "CMB".
Step S202: compare the anchor texts pairwise and determine the maximal common substring. The anchor texts are first segmented into words, the segmented anchor texts are compared pairwise, and the overlapping words are taken as the maximal common substring. For example:
Pages ID1 and ID2: the maximal common substring of "China Merchants Bank" and "China Merchants Bank homepage" is "China Merchants Bank";
Pages ID1 and ID3: "China Merchants Bank" and "CMB homepage" have no common substring;
Pages ID1 and ID4: "China Merchants Bank" and "CMB" have no common substring;
Pages ID2 and ID3: the maximal common substring of "China Merchants Bank homepage" and "CMB homepage" is "homepage";
Pages ID2 and ID4: "China Merchants Bank homepage" and "CMB" have no common substring;
Pages ID3 and ID4: the maximal common substring of "CMB homepage" and "CMB" is "CMB".
Step S203: remove the overlapping words from each anchor text. The maximal common substring is removed from each of the anchor texts above. For example:
Pages ID1 and ID2: empty string, "homepage";
Pages ID1 and ID3: "China Merchants Bank", "CMB homepage";
Pages ID1 and ID4: "China Merchants Bank", "CMB";
Pages ID2 and ID3: "China Merchants Bank", "CMB";
Pages ID2 and ID4: "China Merchants Bank homepage", "CMB";
Pages ID3 and ID4: "CMB", empty string.
Step S204: form the near-synonym set from the remaining words and extract near-synonyms from that set. Empty strings are ignored, and the words left after the maximal common substrings are removed form the near-synonym set, for example: "homepage", "China Merchants Bank", "CMB", "China Merchants Bank homepage", "CMB homepage".
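Steps S202 and S203 can be sketched as follows for one pair of anchor texts. The sketch assumes the anchor texts have already been segmented into word lists and takes the maximal common substring to be the longest contiguous run of shared words; the function names and the use of "CMB" for the abbreviated bank name are assumptions made for illustration.

```python
def longest_common_run(a, b):
    """Return (start_a, start_b, length) of the longest contiguous run of words shared by a and b."""
    best = (0, 0, 0)
    # prev[j] is the length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def remainders(a, b):
    """Remove the longest common run from both word lists and return what is left of each."""
    start_a, start_b, length = longest_common_run(a, b)
    return a[:start_a] + a[start_a + length:], b[:start_b] + b[start_b + length:]


id1 = ["China Merchants Bank"]
id2 = ["China Merchants Bank", "homepage"]
id3 = ["CMB", "homepage"]

print(remainders(id1, id2))  # pages ID1 and ID2
print(remainders(id2, id3))  # pages ID2 and ID3
print(remainders(id1, id3))  # pages ID1 and ID3: nothing in common, nothing removed
```

Non-empty remainders from all pairs would then be pooled to form the near-synonym set of step S204.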
The present invention may extract near-synonyms from the near-synonym set by manual review, by calculating the near-synonym probability of each word pair in the set, or by other means.
The present invention defines near-synonyms on the network: it uses anchor texts to extract potential near-synonyms from the network, forms them into a near-synonym set, and extracts near-synonyms from that set. The extracted near-synonyms are large in volume and broad in coverage and reflect how words are actually used on the network, so both the coverage and the precision of the extraction are higher.
Before extracting near-synonyms from the anchor texts, the present invention may first check the backlinks and anchor texts and remove those that have no reference value, further improving the precision of the extraction.
Referring to Fig. 3, a second embodiment of the method of the present invention for extracting near-synonyms on a network is shown; the steps are as follows.
Step S301: check each backlink of the web page and remove the backlinks that have no reference value. Each backlink is checked against the Rank value of its parent page and the Rank value of the main domain of that parent page. The Rank value of a page reflects the page's credibility and therefore also its reference value.
The Rank value of the parent page of each backlink and the Rank value of the main domain of that parent page are obtained. If the Rank values of the parent page and of its main domain are below a preset value, the backlink is considered to have no reference value and is removed; if the Rank values of the parent page and of its main domain are above the preset value, the backlink is considered to have reference value and is kept.
Depending on the application, the present invention may also remove the backlink when either the Rank value of the parent page or the Rank value of its main domain is below the preset value.
The preset value is chosen according to the field and nature of the parent page, within the range 100-10000.
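The check in step S301 can be sketched as follows. The function name keep_backlink, the strict flag, and the sample Rank values are assumptions made for illustration; with strict=False a backlink is removed only when both Rank values are below the preset value, and with strict=True it is removed when either one is.

```python
def keep_backlink(page_rank, domain_rank, preset=100, strict=False):
    """Decide whether a backlink has reference value.

    page_rank is the Rank value of the parent page and domain_rank the Rank
    value of the parent page's main domain.
    """
    page_low = page_rank < preset
    domain_low = domain_rank < preset
    if strict:
        # Remove the backlink when either Rank value is below the preset value.
        return not (page_low or domain_low)
    # Remove the backlink only when both Rank values are below the preset value.
    return not (page_low and domain_low)


# Hypothetical backlinks with made-up Rank values.
backlinks = [
    {"parent": "A", "page_rank": 5000, "domain_rank": 8000},
    {"parent": "B", "page_rank": 40, "domain_rank": 900},
    {"parent": "C", "page_rank": 30, "domain_rank": 20},
]
kept = [b["parent"] for b in backlinks
        if keep_backlink(b["page_rank"], b["domain_rank"])]
kept_strict = [b["parent"] for b in backlinks
               if keep_backlink(b["page_rank"], b["domain_rank"], strict=True)]
print(kept, kept_strict)
```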
Step S302: obtain the anchor text of each backlink of the web page.
Step S303: calculate the weight of each anchor text and remove the anchor texts whose weight is below a preset value. The weight of an anchor text reflects the total number of times the anchor text occurs in the backlinks of the page. If the weight of an anchor text is high, the backlinks carrying it probably come from several different websites, and the anchor text has relatively high reference value.
For a backlink anchor text of a child page, suppose the child page has N1 parent pages and these parent pages belong to N2 main domains (some parent pages may belong to the same main domain, so N1 >= N2). Suppose M1 of the parent pages belong to the same main domain as the child page; the other N1-M1 parent pages then belong to the remaining N2-1 main domains. Let u1 be the weight coefficient for parent pages in the same main domain as the child page and u2 the weight coefficient for parent pages in a different main domain. The weight of the anchor text is:
Anchor text weight = M1 * u1 + (N1 - M1) * u2.
The value of u1 ranges from 0.05 to 0.15 and is preferably 0.1; the value of u2 ranges from 0.15 to 0.25 and is preferably 0.2. The preset value is chosen according to the field and nature of the web page, within the range 1-10.
For example, pages A, A1, A2, A3, B, B1, B2, B3, C, C1, C2, and C3 all link forward to page K with anchor text S. The backlink anchor text S of page K therefore has 12 parent pages. Pages A, A1, A2, and A3 share one main domain, which is also the main domain of K; pages B, B1, B2, and B3 share a second main domain; and pages C, C1, C2, and C3 share a third. Hence M1 = 4 and N1 - M1 = 8. With u1 = 0.1 and u2 = 0.2, the weight of anchor text S = 4 * 0.1 + 8 * 0.2 = 2.
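The weight calculation of step S303 can be sketched as follows, reproducing the worked example for anchor text S. The function name anchor_weight and the placeholder domain names are assumptions made for illustration; the coefficients u1 = 0.1 and u2 = 0.2 are the preferred values stated above.

```python
def anchor_weight(parent_domains, child_domain, u1=0.1, u2=0.2):
    """Weight of one anchor text: M1 * u1 + (N1 - M1) * u2.

    parent_domains lists the main domain of every parent page that links to
    the child page with this anchor text (N1 entries in total).
    """
    n1 = len(parent_domains)
    m1 = sum(1 for domain in parent_domains if domain == child_domain)
    return m1 * u1 + (n1 - m1) * u2


# Pages A..A3 share the main domain of K; B..B3 and C..C3 use two other domains.
parents = ["a.example"] * 4 + ["b.example"] * 4 + ["c.example"] * 4
weight = anchor_weight(parents, child_domain="a.example")
print(round(weight, 6))
```

Because u2 is larger than u1, links from other main domains count for more than links from the child page's own domain, which matches the remark that a high weight indicates backlinks from several different websites.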
Step S304: compare the anchor texts pairwise and determine the maximal common substring.
Step S305: form the near-synonym set from the remaining words and extract near-synonyms from that set.
The present invention judges whether a backlink has reference value from the Rank value of its parent page and the Rank value of the main domain of that parent page. These two Rank values reflect the credibility of the parent page: a highly credible page is very unlikely to carry cheating links or spam links, and a page of low credibility is more likely to. The method can therefore effectively remove cheating links and spam links from the backlinks of a page and ensure that the remaining backlinks are worth referring to. The present invention also removes invalid anchor texts according to their weight, so that the retained anchor texts are more reliable and the near-synonyms extracted from them are more precise.
The present invention may also calculate the near-synonym probability of each word pair among the remaining words from the occurrence frequency of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page, and extract near-synonyms from the near-synonym set according to this probability.
Referring to Fig. 4, a third embodiment of the method of the present invention for extracting near-synonyms on a network is shown; the steps are as follows.
Step S401: check each backlink of the web page and remove the backlinks that have no reference value.
Step S402: obtain the anchor text of each backlink of the web page.
Step S403: calculate the weight of each anchor text and remove the anchor texts whose weight is below a preset value.
Step S404: compare the anchor texts pairwise and determine the maximal common substring.
Step S405: extract the remaining words and form the near-synonym set from them.
Step S406: obtain the occurrence frequency of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page.
Step S407: calculate from these values the near-synonym probability of each word pair among the remaining words.
A word pair is simply a pair of words. For the word pairs formed by taking the remaining words two at a time, the near-synonym probability of a word pair is defined as f(v1, v2) = Fun(Freq(v1), Freq(v2)), where v1 and v2 denote two different words, such as "China Merchants Bank" and "CMB" (the abbreviated form of the bank's name), and Freq(v) is the occurrence frequency of word v. For example, in the embodiment shown in Fig. 2, the occurrence frequency of "homepage" is 1, that of "China Merchants Bank" is 3, that of "CMB" is 4, that of "China Merchants Bank homepage" is 1, and that of "CMB homepage" is 1.
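The occurrence frequencies Freq(v) can be obtained by counting the non-empty remainders of every anchor text pair. The sketch below takes the six remainder pairs as they are listed for step S203 of the Fig. 2 embodiment; writing the abbreviated bank name as "CMB" and the variable names are assumptions made for illustration.

```python
from collections import Counter

# Remainders of each anchor text pair, as listed for step S203 ("" is the empty string).
remainder_pairs = [
    ("", "homepage"),                          # pages ID1 and ID2
    ("China Merchants Bank", "CMB homepage"),  # pages ID1 and ID3
    ("China Merchants Bank", "CMB"),           # pages ID1 and ID4
    ("China Merchants Bank", "CMB"),           # pages ID2 and ID3
    ("China Merchants Bank homepage", "CMB"),  # pages ID2 and ID4
    ("CMB", ""),                               # pages ID3 and ID4
]

# Empty strings are ignored, as in step S204.
freq = Counter(word for pair in remainder_pairs for word in pair if word)
print(freq)
```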
For each word pair, the near-synonym probability may be calculated as:
F(v1, v2) = u * Log(a * d1) + Log(b * d2) + t, where u and t are constant coefficients, a is the occurrence frequency of word v1, b is the occurrence frequency of word v2, d1 is the Rank value of the backlink parent page, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonym probability may also be calculated as:
F(v1, v2) = u * (a * d1 + b * d2) + t, where u and t are constant coefficients, a is the occurrence frequency of word v1, b is the occurrence frequency of word v2, d1 is the Rank value of the backlink parent page, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonym probability may also be calculated as:
F(v1, v2) = x * (Log(a * d1) + Log(b * d2)) + y * (a * d1 + b * d2) + z, where x, y, and z are constant coefficients, a is the occurrence frequency of word v1, b is the occurrence frequency of word v2, d1 is the Rank value of the backlink parent page, and d2 is the Rank value of the domain to which that parent page belongs.
Of course, the above formulas may also be combined in other ways, which are not enumerated here.
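The first two formulas can be sketched as follows. The function names, the default coefficient values, the use of the natural logarithm for Log, and the Rank values 200 and 500 are all assumptions made for illustration; only the frequencies 3 and 4 come from the example above.

```python
import math


def score_log(a, b, d1, d2, u=1.0, t=0.0):
    """F(v1, v2) = u * Log(a * d1) + Log(b * d2) + t, using the natural logarithm."""
    return u * math.log(a * d1) + math.log(b * d2) + t


def score_linear(a, b, d1, d2, u=1.0, t=0.0):
    """F(v1, v2) = u * (a * d1 + b * d2) + t."""
    return u * (a * d1 + b * d2) + t


# a = 3 and b = 4 are the frequencies of the two words in the example;
# d1 = 200 and d2 = 500 are made-up Rank values for the parent page and its domain.
log_score = score_log(3, 4, 200, 500)
linear_score = score_linear(3, 4, 200, 500)
print(log_score, linear_score)
```

Neither form is bounded to the range 0-1 by itself, so the constants u and t and the threshold have to be chosen together.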
Step S408: select the word pairs whose near-synonym probability exceeds a preset threshold as near-synonyms.
The preset threshold may differ according to the field of the words. A reasonable threshold can be chosen by calculating the near-synonym probability of a large number of known near-synonyms.
The present invention calculates the near-synonym probability of a word pair from the occurrence frequency of the words, the Rank value of the backlink parent page, and the Rank value of the domain to which that parent page belongs. This probability reflects both how often the word pair is used and how credible the links containing it are, so it indicates well how similar the two words are in underlying meaning. Near-synonyms selected according to this probability have high precision and practical value.
The present invention may also calculate the total near-synonym probability of a word pair over the web pages of the Internet and use it to judge comprehensively whether the word pair is a pair of near-synonyms, further improving the precision of the extraction.
Referring to Fig. 5, a fourth embodiment of the method of the present invention for extracting near-synonyms on a network is shown; the steps are as follows.
Step S501: check each backlink of the web page and remove the backlinks that have no reference value.
Step S502: obtain the anchor text of each backlink of the web page.
Step S503: calculate the weight of each anchor text and remove the anchor texts whose weight is below a preset value.
Step S504: compare the anchor texts pairwise and determine the maximal common substring.
Step S505: extract the remaining words and form the near-synonym set from them.
Step S506: obtain the occurrence frequency of the remaining words in the anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain of that parent page.
Step S507: calculate from these values the near-synonym probability of each word pair among the remaining words.
A word pair is simply a pair of words. For the word pairs formed by taking the remaining words two at a time, the near-synonym probability of a word pair is defined as f(v1, v2) = Fun(Freq(v1), Freq(v2)), where v1 and v2 denote two different words, such as "China Merchants Bank" and "CMB" (the abbreviated form of the bank's name), and Freq(v) is the occurrence frequency of word v. For example, in the embodiment shown in Fig. 2, the occurrence frequency of "homepage" is 1, that of "China Merchants Bank" is 3, that of "CMB" is 4, that of "China Merchants Bank homepage" is 1, and that of "CMB homepage" is 1.
For each word pair, the near-synonym probability may be computed as:
F(v1, v2) = u * Log(a * d1) + Log(b * d2) + t, where u and t are constant factors, a is the frequency of occurrence of word v1, b is the frequency of occurrence of word v2, d1 is the Rank value of the backlink's parent page, and d2 is the Rank value of the domain to which that parent page belongs.
The near-synonym probability may also be computed as:
F(v1, v2) = u * (a * d1 + b * d2) + t, where u and t are constant factors and a, b, d1, and d2 are as defined above.
The near-synonym probability may also be computed as:
F(v1, v2) = x * (Log(a * d1) + Log(b * d2)) + y * (a * d1 + b * d2) + z, where x, y, and z are constant factors and a, b, d1, and d2 are as defined above.
Of course, the near-synonym probability formula may also combine the above formulas in other ways, which are not enumerated here.
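The three per-page formulas can be sketched in Python as follows. This is only an illustrative sketch: the function names and default constants are hypothetical, the natural logarithm is assumed because the text does not state a base, and the combined form takes its second logarithm over b * d2, in line with the first formula.

```python
import math

# Hypothetical helper names; a and b are word frequencies, d1 is the Rank of
# the backlink's parent page, d2 is the Rank of that parent page's domain.
# Natural log is an assumption: the text does not give the base of Log.

def probability_log(a, b, d1, d2, u=1.0, t=0.0):
    """F(v1, v2) = u * Log(a * d1) + Log(b * d2) + t."""
    return u * math.log(a * d1) + math.log(b * d2) + t

def probability_linear(a, b, d1, d2, u=1.0, t=0.0):
    """F(v1, v2) = u * (a * d1 + b * d2) + t."""
    return u * (a * d1 + b * d2) + t

def probability_combined(a, b, d1, d2, x=1.0, y=1.0, z=0.0):
    """F(v1, v2) = x * (Log(a*d1) + Log(b*d2)) + y * (a*d1 + b*d2) + z."""
    log_part = math.log(a * d1) + math.log(b * d2)
    linear_part = a * d1 + b * d2
    return x * log_part + y * linear_part + z
```

The logarithmic form dampens the effect of very frequent words or very highly ranked pages, while the linear form lets them dominate; the constants would have to be tuned on real data.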
Step S508: repeat steps S501 through S507 to obtain the near-synonym probability of each word pair on each web page.
Step S509: for each word pair, multiply its near-synonym probability on each web page by the Rank value of the corresponding backlinked child page, and sum the products to obtain the total near-synonym probability of the word pair.
The total near-synonym probability of a word pair is computed as:
Similar(v1, v2) = F1(v1, v2) * x1 + F2(v1, v2) * x2 + F3(v1, v2) * x3 + ..., where x1, x2, x3, ... are the Rank values of the backlinked child pages and represent the credibility of those pages.
The total near-synonym probability is then normalized so that its value lies in the range 0 to 1.
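The weighted sum and the normalization step can be sketched as below. The function names are hypothetical, and min-max scaling is an assumption: the text only says the totals are normalized into the range 0 to 1 without naming a method.

```python
def total_probability(per_page):
    """Similar(v1, v2) = sum of F_i * x_i.

    per_page is a list of (F_i, x_i) tuples: the pair's probability on
    page i and the Rank value of that backlinked child page.
    """
    return sum(f * x for f, x in per_page)

def normalize(totals):
    """Min-max scale a {word_pair: total} mapping into [0, 1].

    Min-max scaling is an assumed choice, not taken from the source.
    """
    if not totals:
        return {}
    low = min(totals.values())
    high = max(totals.values())
    if high == low:
        # All totals equal: no spread to scale, so map everything to 0.0.
        return {pair: 0.0 for pair in totals}
    return {pair: (value - low) / (high - low) for pair, value in totals.items()}
```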
Step S510: extract the word pairs whose total near-synonym probability exceeds the set threshold as near synonyms.
The set threshold ranges from 0.3 to 0.8. Its value differs with the domain to which the word pairs belong, so a reasonable threshold must be chosen by computing near-synonym probabilities over a large number of near synonyms.
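Threshold selection in step S510 amounts to a simple filter. A minimal sketch, assuming the totals have already been normalized into the range 0 to 1; the function name and the default of 0.5 are hypothetical, and only the 0.3 to 0.8 range comes from the text.

```python
def select_near_synonyms(normalized, threshold=0.5):
    """Return the word pairs whose normalized total exceeds the threshold.

    normalized maps (word1, word2) tuples to values in [0, 1].
    The default threshold is illustrative only.
    """
    if not 0.3 <= threshold <= 0.8:
        raise ValueError("threshold is expected to lie between 0.3 and 0.8")
    return sorted(pair for pair, p in normalized.items() if p > threshold)
```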
The present invention combines a word pair's near-synonym probabilities across the web pages of the Internet to judge whether the pair are near synonyms. Because this judgment takes into account the meaning the words express on each web page, the precision of near-synonym selection is further improved.
The present invention can also compute the total near-synonym probability of a word pair from the frequency with which the words occur in the anchor texts of child pages, the Rank value and main-domain Rank value of the parent pages to which the anchor texts belong, and the Rank value and main-domain Rank value of the child pages themselves.
For example, suppose the words V1 and V2 of a pair occur t1 and t2 times, respectively, in the backlink anchor texts of web page A, and t3 and t4 times in the backlink anchor texts of web page B, and that the Rank values of web pages A and B and of their main domains are RA, RB, DA, and DB, respectively. The near-synonym probability of the pair V1, V2 in web page A is Fa = u1 * t1 * (A1 + A2 + A3 + ...) + u1 * t2 * (A1 + A2 + ...), where u1 is the anchor text weighting coefficient and A1, A2, A3, ... are the weights of the backlink anchor texts of web page A. Likewise, the near-synonym probability of the pair in web page B is Fb = u1 * t3 * (B1 + B2 + B3 + ...) + u1 * t4 * (B1 + B2 + ...), where u1 is the anchor text weighting coefficient and B1, B2, B3, ... are the weights of the backlink anchor texts of web page B.
Combining Fa and Fb gives the total near-synonym probability of the pair V1, V2: Similar(v1, v2) = u2 * (RA * Fa + RB * Fb) + u3 * (DA * Fa + DB * Fb), where u2 is the coefficient for the page Rank of web pages A and B, u3 is the coefficient for the Rank of the main domains to which web pages A and B belong, RA and RB are the page Rank values of A and B, and DA and DB are the Rank values of the main domains in which A and B are located.
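This alternative calculation can be sketched as follows. The function and parameter names are hypothetical, and the sketch generalizes the two-page example to any number of pages on the assumption that each page contributes one term to each sum.

```python
def page_probability(t1, t2, anchor_weights, u1=1.0):
    """Per-page probability: u1*t1*(W1+W2+...) + u1*t2*(W1+W2+...).

    t1, t2 are the frequencies of the two words in the page's backlink
    anchor texts; anchor_weights are the weights of those anchor texts.
    """
    weight_sum = sum(anchor_weights)
    return u1 * t1 * weight_sum + u1 * t2 * weight_sum

def combined_total(pages, u2=1.0, u3=1.0):
    """Total: u2 * sum(page_rank * F) + u3 * sum(domain_rank * F).

    pages is a list of (F, page_rank, domain_rank) tuples, one per page.
    """
    by_page_rank = sum(rank * f for f, rank, _ in pages)
    by_domain_rank = sum(domain * f for f, _, domain in pages)
    return u2 * by_page_rank + u3 * by_domain_rank
```

For two pages A and B this reduces to u2 * (rank_A * Fa + rank_B * Fb) + u3 * (domain_A * Fa + domain_B * Fb).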
Based on the above method for extracting near synonyms on a network, the present invention also provides a system for extracting near synonyms on a network; the near synonyms extracted by this system have broader coverage and higher precision.
Referring to Fig. 6, a first embodiment of the system for extracting near synonyms on a network comprises an anchor text acquisition module 61, a comparison module 62, a removal module 63, and a composition module 64.
The anchor text acquisition module 61 obtains the anchor text of each backlink of a web page. It extracts the anchor texts of all forward links in every web page on the Internet, inverts them to obtain the backlink anchor texts of each web page, and sends the anchor texts obtained to the comparison module 62.
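The inversion performed by the anchor text acquisition module can be sketched as below. This is a minimal illustration that assumes the forward links have already been crawled into (parent URL, anchor text, child URL) triples; the function name and the example URLs are hypothetical.

```python
from collections import defaultdict

def invert_links(forward_links):
    """Group forward links by target page to obtain backlink anchor texts.

    forward_links is an iterable of (parent_url, anchor_text, child_url).
    Returns {child_url: [(parent_url, anchor_text), ...]}.
    """
    backlinks = defaultdict(list)
    for parent, anchor, child in forward_links:
        backlinks[child].append((parent, anchor))
    return dict(backlinks)
```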
The comparison module 62 compares the anchor texts pairwise and determines the longest common substring. It first performs word segmentation on the anchor texts, then compares the segmented anchor texts pairwise, takes their overlapping words as the longest common substring, and sends the comparison result to the removal module 63.
The removal module 63 removes the overlapping words from each anchor text and sends the remaining words to the composition module 64.
The composition module 64 forms a near-synonym set from the remaining words and extracts near synonyms based on the near-synonym set.
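The comparison, removal, and composition modules can be sketched together as follows. The sketch assumes the anchor texts are already segmented into word lists, and it simplifies the overlap to the set of words shared by a pair of anchor texts rather than a contiguous longest common substring; the function name and sample words are hypothetical.

```python
from itertools import combinations

def remaining_words(segmented_anchors):
    """Compare segmented anchor texts pairwise and collect non-shared words.

    segmented_anchors is a list of word lists. For each pair, the shared
    words are treated as the overlap and dropped from both sides; whatever
    is left over joins the near-synonym candidate set.
    """
    candidates = set()
    for left, right in combinations(segmented_anchors, 2):
        overlap = set(left) & set(right)
        candidates.update(word for word in left if word not in overlap)
        candidates.update(word for word in right if word not in overlap)
    return candidates
```

Because the comparison is pairwise, a word removed as overlap in one pair can still survive from another pair in which it is not shared.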
The near-synonym extraction system of the present invention can also extract near synonyms based on the near-synonym set by computing the near-synonym probability of word pairs.
Referring to Fig. 7, a second embodiment of the system for extracting near synonyms on a network comprises the anchor text acquisition module 61, the comparison module 62, the removal module 63, the composition module 64, a data acquisition module 65, a near-synonym probability calculation module 66, and a near-synonym extraction module 67.
The data acquisition module 65 obtains the frequency of occurrence in the anchor texts of the remaining words held by the composition module 64, the Rank value of each backlink's parent page, and the Rank value of the main domain to which that parent page belongs, and sends them to the near-synonym probability calculation module 66.
The near-synonym probability calculation module 66 computes the near-synonym probability of each word pair among the remaining words from these values and sends the computed probabilities to the near-synonym extraction module 67.
The near-synonym extraction module 67 selects the word pairs whose near-synonym probability exceeds a preset threshold as near synonyms.
Referring to Fig. 8, a third embodiment of the system for extracting near synonyms on a network comprises the anchor text acquisition module 61, the comparison module 62, the removal module 63, the composition module 64, the data acquisition module 65, the near-synonym probability calculation module 66, the near-synonym extraction module 67, and a total near-synonym probability module 68.
The total near-synonym probability module 68 receives the near-synonym probability of each word pair from the near-synonym probability calculation module 66. For each word pair, it multiplies the pair's near-synonym probability on each web page by the Rank value of the corresponding backlinked child page, sums the products to obtain the total near-synonym probability of the pair, and sends the result to the near-synonym extraction module 67.
The near-synonym extraction module 67 extracts the word pairs whose total near-synonym probability exceeds the set threshold as near synonyms.
The method and system for extracting near synonyms on a network provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principle and embodiments of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims (9)

1. A method for extracting near synonyms on a network, characterized in that it comprises:
obtaining the anchor text of each backlink of a web page;
calculating the weight of said anchor texts, and removing the anchor texts whose weight is below a preset value; wherein, for the backlink anchor text of a child page, said anchor text weight is the sum of the number of parent pages belonging to the same main domain as the child page and the number of parent pages not belonging to the same main domain as the child page, each multiplied by its respective weight coefficient;
comparing the anchor texts pairwise and removing the overlapping words from each;
forming a near-synonym set from the remaining words, and extracting near synonyms based on said near-synonym set;
wherein, if web page A links to web page B using anchor text S, then web page A is the parent page and web page B is the child page, and the link is a forward link for web page A and a backlink for web page B.
2. The method of claim 1, characterized in that, before obtaining the anchor text of each backlink of the web page, it further comprises:
obtaining the Rank value of the parent page of each backlink of the web page, and the Rank value of the main domain to which that parent page belongs;
if the Rank value of a parent page and/or the Rank value of the main domain to which that parent page belongs is below a preset value, removing the backlink corresponding to that parent page;
wherein the Rank value is a numerical value reflecting the credibility of a web page.
3. The method of claim 1, characterized in that extracting near synonyms based on said near-synonym set specifically comprises:
obtaining the frequency of occurrence of the remaining words in said anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain to which that parent page belongs;
calculating the near-synonym probability of each word pair among the remaining words from the above values;
selecting the word pairs whose near-synonym probability exceeds a preset threshold as near synonyms;
wherein the Rank value is a numerical value reflecting the credibility of a web page.
4. The method of claim 1, characterized in that extracting near synonyms based on said near-synonym set specifically comprises:
obtaining the frequency of occurrence of the remaining words in said anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain to which that parent page belongs;
calculating the near-synonym probability of each word pair among the remaining words from the above values, and repeating the above steps to obtain the near-synonym probability of each word pair on each web page;
for each word pair, multiplying its near-synonym probability on each web page by the Rank value of the corresponding backlinked child page, and summing the products to obtain the total near-synonym probability of the word pair;
extracting the word pairs whose total near-synonym probability exceeds a set threshold as near synonyms;
wherein the Rank value is a numerical value reflecting the credibility of a web page.
5. The method of claim 3 or 4, characterized in that calculating the near-synonym probability of each word pair among the remaining words from the above values specifically comprises:
for each word pair, multiplying the frequency of occurrence of one word of the pair by the Rank value of its corresponding backlink parent page, taking the logarithm of the product, and multiplying the result by a set coefficient;
multiplying the frequency of occurrence of the other word of the pair by the Rank value of the main domain to which its corresponding backlink parent page belongs, and taking the logarithm of the product;
adding the values obtained to give the near-synonym probability of the word pair.
6. The method of claim 3 or 4, characterized in that calculating the near-synonym probability of each word pair among the remaining words from the above values specifically comprises:
for each word pair, multiplying the frequency of occurrence of one word of the pair by the Rank value of its corresponding backlink parent page, and multiplying the frequency of occurrence of the other word by the Rank value of the main domain to which its corresponding backlink parent page belongs; adding the values obtained to give the near-synonym probability of the word pair.
7. A system for extracting near synonyms on a network, characterized in that it comprises an anchor text acquisition module, a comparison module, a removal module, and a composition module, and further comprises a processing module;
said anchor text acquisition module is used to obtain the anchor text of each backlink of a web page;
said processing module is used to calculate the weight of said anchor texts and remove the anchor texts whose weight is below a preset value; wherein, for the backlink anchor text of a child page, said anchor text weight is the sum of the number of parent pages belonging to the same main domain as the child page and the number of parent pages not belonging to the same main domain as the child page, each multiplied by its respective weight coefficient;
said comparison module is used to compare the anchor texts pairwise and send the comparison result to the removal module;
said removal module is used to remove the overlapping words from each anchor text and send the remaining words to the composition module;
said composition module is used to form a near-synonym set from the remaining words and extract near synonyms based on the near-synonym set;
wherein, if web page A links to web page B using anchor text S, then web page A is the parent page and web page B is the child page, and the link is a forward link for web page A and a backlink for web page B.
8. The system of claim 7, characterized in that it further comprises a data acquisition module, a near-synonym probability calculation module, and a near-synonym module:
said data acquisition module is used to obtain the frequency of occurrence of the remaining words in said anchor texts, the Rank value of the parent page of each backlink, and the Rank value of the main domain to which that parent page belongs;
said near-synonym probability calculation module is used to calculate the near-synonym probability of each word pair among the remaining words from the above values;
said near-synonym module is used to select the word pairs whose near-synonym probability exceeds a preset threshold as near synonyms;
wherein the Rank value is a numerical value reflecting the credibility of a web page.
9. The system of claim 7 or 8, characterized in that it further comprises a total near-synonym probability module, which receives the near-synonym probability of each word pair sent by said near-synonym probability calculation module and is used, for each word pair, to multiply the pair's near-synonym probability on each web page by the Rank value of the corresponding backlinked child page, to sum the products as the total near-synonym probability of the word pair, and to send the result to the near-synonym module; said near-synonym module extracts the word pairs whose total near-synonym probability exceeds a set threshold as near synonyms;
wherein the Rank value is a numerical value reflecting the credibility of a web page.
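The anchor text weighting recited in claims 1 and 7 can be sketched as follows. This is an illustrative sketch only: the function names and coefficient defaults are hypothetical, and main domains are assumed to be supplied as plain strings rather than derived from URLs.

```python
def anchor_text_weight(child_domain, parent_domains, same_coeff=0.2, other_coeff=1.0):
    """Weight = same-domain parent count * same_coeff
              + other-domain parent count * other_coeff.

    parent_domains lists the main domain of every parent page that links
    to the child page with this anchor text. Coefficients are illustrative.
    """
    same = sum(1 for domain in parent_domains if domain == child_domain)
    other = len(parent_domains) - same
    return same * same_coeff + other * other_coeff

def drop_low_weight(weights, minimum):
    """Keep only the anchor texts whose weight is not below the preset value."""
    return {anchor: w for anchor, w in weights.items() if w >= minimum}
```

Giving same-domain parents a smaller coefficient than other-domain parents is one plausible choice, since links from within a site say less about how others name the page; the claims leave the coefficients open.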
CN200710304564A 2007-12-28 2007-12-28 Method and system for extracting homoionym in network Active CN101226532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200710304564A CN101226532B (en) 2007-12-28 2007-12-28 Method and system for extracting homoionym in network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200710304564A CN101226532B (en) 2007-12-28 2007-12-28 Method and system for extracting homoionym in network

Publications (2)

Publication Number Publication Date
CN101226532A CN101226532A (en) 2008-07-23
CN101226532B true CN101226532B (en) 2012-10-03

Family

ID=39858533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200710304564A Active CN101226532B (en) 2007-12-28 2007-12-28 Method and system for extracting homoionym in network

Country Status (1)

Country Link
CN (1) CN101226532B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880647A (en) * 2012-08-24 2013-01-16 北京百度网讯科技有限公司 Method and device for acquiring another name of organization
CN110413737B (en) * 2019-07-29 2022-10-14 腾讯科技(深圳)有限公司 Synonym determination method, synonym determination device, server and readable storage medium

Also Published As

Publication number Publication date
CN101226532A (en) 2008-07-23

Similar Documents

Publication Publication Date Title
Meusel et al. Graph structure in the web---revisited: a trick of the heavy tail
CN103491205B (en) The method for pushing of a kind of correlated resources address based on video search and device
CN103365924B (en) A kind of method of internet information search, device and terminal
CN103268348B (en) A kind of user's query intention recognition methods
CN102955857B (en) Class center compression transformation-based text clustering method in search engine
CN105843795A (en) Topic model based document keyword extraction method and system
CN102760142A (en) Method and device for extracting subject label in search result aiming at searching query
CN102722709A (en) Method and device for identifying garbage pictures
CN102663139A (en) Method and system for constructing emotional dictionary
CN104484343A (en) Topic detection and tracking method for microblog
CN103873601A (en) Addressing class query word mining method and system
CN101226533A (en) Method and system for arranging web page again
CN106156041A (en) Hot information finds method and system
CN104462268B (en) A kind of method and system of html document information extraction expression formula
CN105786951A (en) Method and device for extracting content blocks in webpage and server
CN101751425A (en) Method for acquiring document set abstracts and device
CN103823792A (en) Method and equipment for detecting hotspot events from text document
CN103646029A (en) Similarity calculation method for blog articles
CN104268230A (en) Method for detecting objective points of Chinese micro-blogs based on heterogeneous graph random walk
CN104679768A (en) Method and device for extracting keywords from documents
CN103020083B (en) The automatic mining method of demand recognition template, demand recognition methods and corresponding device
CN105528357A (en) Webpage content extraction method based on similarity of URLs and similarity of webpage document structures
CN101226532B (en) Method and system for extracting homoionym in network
CN103218452A (en) Method and device for recognizing valid interlinkage in Hub webpage
CN102063497B (en) Open type knowledge sharing platform and entry processing method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENCENT TECHNOLOGY (BEIJING) CO., LTD.

Effective date: 20131016

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100089 HAIDIAN, BEIJING TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20131016

Address after: A Tencent Building in Shenzhen Nanshan District City, Guangdong streets in Guangdong province science and technology 518057 16

Patentee after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: Beijing 100089 Haidian District 38 Haidian Avenue branch bank building 16 layer

Patentee before: Tencent Technology (Beijing) Co., Ltd