CN102169496A - Anchor text analysis-based automatic domain term generating method - Google Patents

Info

Publication number
CN102169496A
Authority
CN
China
Prior art keywords
multiword
anchor text
candidate
algorithm
information entropy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201110091312
Other languages
Chinese (zh)
Inventor
闫兴龙
刘奕群
马少平
张敏
金奕江
张阔
茹立云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Original Assignee
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Beijing Sogou Technology Development Co Ltd
Priority to CN 201110091312
Publication of CN102169496A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an anchor text analysis-based method for automatically generating domain terms, comprising the following steps: collecting a user's browsing log; processing the browsing log to obtain the anchor texts clicked by the user and the corresponding click result addresses; processing the anchor texts according to the click result addresses to obtain a candidate multi-character set; screening the multi-character strings in the candidate set with a new word discovery algorithm to remove strings that cannot stand alone as words; and further screening the resulting candidate set with a relative frequency algorithm to output the domain term generation result. The method automatically discovers and extracts domain terms from anchor text; the model structure and parameters are simple, the algorithms have low complexity, and good performance and domain term discovery results were obtained on experimental test data.

Description

Anchor text analysis-based automatic domain term generating method
Technical field
The present invention relates to the field of network technology, and in particular to a method for automatically generating domain terms based on anchor text analysis.
Background art
A domain term is a word used within a subject field to denote a concept or relation of that field. A term may be a single word or a phrase; it is the designation used to express a concept in a specific subject field, in other words a conventional linguistic symbol that expresses or delimits a scientific concept through speech or writing. In China, terms are customarily called "nouns". Examples of terms are everywhere when reading scientific literature or studying specialized courses: "router" is a term of the computer network field, and "DNA" is a term of the life sciences. In the field of term extraction, a term denotes a linguistic unit with an exact meaning that consists of two or more words in a certain grammatical relation, such as "NMD".
Domain term extraction has important uses in many fields. When building a domain ontology, the domain terms must be updated in a timely manner, so term extraction methods play a crucial role in constructing and maintaining domain ontologies. In information retrieval, a domain term set is needed when building the index, and improvements in term extraction can substantially raise retrieval precision and coverage; in vertical search in particular, having the terms of a domain yields more accurate results for searches in that domain. In browsing recommendation, domain terms obtained from web resources help capture the user's browsing intent more accurately, so that relevant information can be recommended according to the user's actual browsing behavior. Domain term extraction is also valuable in advertising: a domain dictionary helps greatly with web page classification and thus helps companies deliver finer and more accurate advertising to different customer groups.
Current domain term extraction methods fall into three main categories:
1. Rule-based methods. These methods define rule templates in advance and extract terms by matching the templates. Establishing the rules depends mainly on linguistic knowledge, and linguistic rules are hard to discover. It is very difficult to formulate a complete rule set, and the compatibility of multiple rules must also be considered.
2. Statistics-based methods. Statistical methods were applied to term extraction very early and have achieved good results. Some researchers extract terms using the relative frequency of documents and apply the result to the automatic construction of ontologies. Frantzi proposed the C-value/NC-value evaluation function for domain term extraction and obtained good results. Pantel used mutual information and the log-likelihood ratio to obtain domain terms. Liu used left and right information entropy and the log-likelihood ratio to determine word boundaries and thereby extract candidate terms; that approach is also used in part in the present method. Statistics-based algorithms can be applied to any corpus, but they do not give good results on corpora of particular types.
3. Methods combining rules and statistics. Many practical applications combine the two. Thuy Vu first extracts a candidate set according to rules, then scores it with C-value/NC-value and the t-test, and finally obtains the real terms. This approach combines the strengths and weaknesses of the two methods above and gives comparatively good results.
The drawback of the prior art is that current domain term extraction methods are very complex and have low accuracy, and therefore need improvement.
Summary of the invention
The purpose of the present invention is to address the above technical deficiencies.
To this end, one aspect of the present invention proposes a method for automatically generating domain terms based on anchor text analysis, comprising the following steps: collecting a user's browsing log; processing the browsing log to obtain the anchor texts clicked by the user and the corresponding click result addresses; processing the anchor texts according to the click result addresses to obtain a candidate multi-character set; screening the multi-character strings in the candidate multi-character set with a new word discovery algorithm to remove strings that cannot stand alone as words; and further screening the candidate multi-character set that has passed the new word discovery algorithm with a relative frequency algorithm to output the domain term generation result.
In one embodiment of the invention, processing the browsing log to obtain the anchor texts clicked by the user and the corresponding click result addresses further comprises: converting the encoding of the user log, arranging the browsing log into character string form, and removing digits, letters and punctuation marks.
In one embodiment of the invention, processing the anchor texts according to the click result addresses to obtain the candidate multi-character set further comprises: judging whether a click result address belongs to a preset URL list; and adding the anchor text corresponding to a click result address that belongs to the preset URL list to the candidate multi-character set.
In one embodiment of the invention, screening the multi-character strings in the candidate multi-character set with the new word discovery algorithm to remove strings that cannot stand alone as words further comprises: filtering the candidate multi-character set with a left/right entropy algorithm; and filtering the resulting candidate multi-character set with a coupling degree algorithm.
In one embodiment of the invention, filtering the candidate multi-character set with the left/right entropy algorithm further comprises: calculating the left information entropy and the right information entropy of each multi-character string in the candidate set; judging whether the left information entropy or the right information entropy of each string is greater than a threshold; and removing a string if its left information entropy and right information entropy are both less than the threshold.
In one embodiment of the invention, the left information entropy is:
$$LE(w) = -\frac{1}{n} \sum_{a_i \in A} C(w, a_i) \log \frac{C(w, a_i)}{n};$$
and the right information entropy is:
$$RE(w) = -\frac{1}{n} \sum_{b_i \in B} C(w, b_i) \log \frac{C(w, b_i)}{n};$$
where [formula image not reproduced in this text], and $C(w, a_i)$ and $C(w, b_i)$ are the numbers of times the left character $a_i$ and the right character $b_i$ occur next to the word $w$, respectively.
In one embodiment of the invention, filtering the screened candidate multi-character set with the coupling degree algorithm further comprises: calculating the word length of each multi-character string in the screened candidate set; judging from the word length and coupling degree of each string whether it can stand alone as a word; and removing the string if it is judged that it cannot.
In one embodiment of the invention, the method further comprises: submitting each multi-character string in the candidate set that has passed the left/right entropy and coupling degree screening to a search engine; and removing the strings whose search results do not meet the requirement.
The present invention can automatically discover and extract domain terms from anchor text. The model structure and parameters are simple, the algorithm complexity is low, and good performance and domain term discovery results were obtained on experimental test data. This indicates that the invention generalizes and adapts well, that its generated results are objective, reliable and comprehensive, and that it has good application prospects.
Additional aspects and advantages of the invention will be given in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.
Description of drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the anchor text analysis-based automatic domain term generating method according to an embodiment of the invention;
Figs. 2 and 3 are flowcharts of the new word discovery algorithm according to an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.
The method analyzes the user's browsing log and extracts the anchor text clicked while the user browses web pages. If the web page that an anchor text points to belongs to a certain domain, the anchor text is considered to contain domain terms of that domain. Domain terms are then screened out of this network resource automatically using information entropy, coupling degree and relative frequency. Anchor text is simply the text of a link. On the one hand, anchor text can serve as an assessment of the content of the page on which it appears, because links added to a page normally bear some relation to the page itself; for example, a clothing industry website may add links to peer websites or to well-known clothing manufacturers. On the other hand, anchor text can serve as an assessment of the page it points to, since it can describe that page's content accurately; an example is a link added on a personal website whose anchor text is "search engine". In general, the links added to a page are directly related to the page, and a search engine can judge the content of a web page from the anchor text of the links that point to it. Anchor text also lets a search engine collect files that it cannot otherwise index.
The embodiment of the invention proposes a method for automatically generating domain terms based on the analysis of multiple network resources. The method analyzes and processes multiple network resources to obtain a corpus related to the finance and economics domain, extracts the words in the corpus with a new word discovery algorithm, and finally filters by relative frequency to obtain the set of terms related to the domain, thereby generating domain terms automatically. Compared with traditional domain term extraction methods, the data on which the invention is based are anchor text and network resources, which are more structured and more timely than traditional text data. The method can generate domain terms efficiently and accurately, and thus supports various Internet-based natural language applications.
As shown in Fig. 1, the anchor text analysis-based automatic domain term generating method of the embodiment comprises the following steps:
Step S101: collect the user's browsing log. When a user browses the web and reaches a page by clicking an anchor text, if the page is related to a certain domain, the anchor text is also strongly related to that domain and is very likely to contain domain terms of that domain. The embodiment uses the browsing log as an example, but other network resources may also be used.
As an example of the invention, one week of user browsing behavior logs (October 11, 2010 to October 17, 2010) can be used. The entries and scale of the corpora are as follows:
Table 1: Entries and scale of each corpus
[Table image not reproduced in this text]
The information contained in the user's browsing log:
Table 2: Information items contained in the user's browsing log
[Table image not reproduced in this text]
The above log contains sufficient information about the user's browsing, so it can be used for domain term extraction.
Step S102: process the browsing log to obtain the anchor texts clicked by the user and the corresponding click result addresses. Preprocessing of the user browsing behavior log comprises: converting the encoding of the text, that is, converting the encoding recorded by the server (usually the Uniform Resource Identifier, URI, format) into the GBK national standard Chinese character encoding; arranging the user log according to the content items listed in Table 3, keeping the required information and organizing the log into strings of those content items; and filtering out noise in the anchor text, such as digits, letters and punctuation marks.
The data for automatic domain term generation come from the user's browsing log, which must contain at least the following content to be usable:
Table 3: Content that the user's browsing log must contain for automatic domain term generation
[Table image not reproduced in this text]
Because network resources have complex formats, the useful information must be picked out, mainly through the following steps.
Step 1.1: convert the encoding of the user log, converting the encoding recorded by the server (usually the URI format) into the GBK national standard Chinese character encoding.
Step 1.2: arrange the user log according to the content items listed in Table 3, remove information outside those content items, and organize the log into strings of those content items.
Step 1.3: filter out noise in the corpus, such as digits, letters and punctuation marks, to obtain the candidate multi-character set.
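Steps 1.1 to 1.3 can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the source: each log record is taken to be a tab-separated pair of click URL and percent-encoded anchor text (the real fields are in Table 3, which is not available here), and the function names are hypothetical.

```python
import re
from urllib.parse import quote, unquote

# Digits, ASCII letters, underscores, and any character that is neither a
# word character nor whitespace (punctuation, including full-width marks).
_NOISE = re.compile(r"[\dA-Za-z_]|[^\w\s]")


def decode_log_field(raw: str, encoding: str = "gbk") -> str:
    """Step 1.1: turn a percent-encoded (URI-style) field into readable text."""
    return unquote(raw, encoding=encoding, errors="replace")


def strip_noise(text: str) -> list:
    """Step 1.3: drop digits, letters and punctuation; return the remaining runs."""
    return _NOISE.sub(" ", text).split()


def preprocess(record: str) -> tuple:
    """Step 1.2 (assumed record layout): keep the click URL and the anchor text."""
    url, anchor = record.split("\t", 1)
    return url, strip_noise(decode_log_field(anchor))


if __name__ == "__main__":
    # Build a GBK percent-encoded anchor text the way a server log might store it.
    raw_anchor = quote("2010年股票、基金行情A股", encoding="gbk")
    print(preprocess("http://quote.eastmoney.com/\t" + raw_anchor))
```

Splitting on the removed characters, rather than simply deleting them, keeps text on either side of a punctuation mark from being glued together into a spurious string; that is a design choice of this sketch, not something the source specifies.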
Step S103: process the anchor texts according to the click result addresses to obtain the candidate multi-character set, that is, screen the web pages. In the embodiment, pages are screened by URL to find the text corpus of a given domain within the above network resources; the corpus is then segmented to obtain the candidate multi-character set.
Web page screening is based on manual induction and summarization. For example, Eastmoney is a professional finance and economics website, so any web page whose URL contains "eastmoney.com" is taken to be a finance and economics page. For large portals such as "sina", "sohu" and "qq", the finance and economics pages of the portal are identified by induction and summarization, for example the following subdomains of the "sohu" portal:
Table 4: Some finance and economics web pages under the sohu website
[Table image not reproduced in this text]
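The URL screening described for step S103 could look roughly like the following Python sketch. Only "eastmoney.com" is taken from the text; the sohu host is a hypothetical stand-in for the Table 4 entries, which are not available here, and the function names are illustrative.

```python
from urllib.parse import urlsplit

# "eastmoney.com" comes from the text. The sohu host below is a hypothetical
# placeholder for the subdomains listed in Table 4.
FINANCE_DOMAINS = {"eastmoney.com"}
FINANCE_HOSTS = {"business.sohu.com"}


def is_finance_url(url: str) -> bool:
    """Return True if the click result address is on the preset finance list."""
    if "://" not in url:
        url = "//" + url  # let urlsplit recognise the host part
    host = (urlsplit(url).hostname or "").lower()
    if host in FINANCE_HOSTS:
        return True
    return any(host == d or host.endswith("." + d) for d in FINANCE_DOMAINS)


def select_anchor_texts(clicks):
    """Keep the anchor text of every (url, anchor_text) click on a finance page."""
    return [anchor for url, anchor in clicks if is_finance_url(url)]
```

The text says a URL that "contains" the domain qualifies; this sketch matches on the host name instead, so that an unrelated host which merely ends in the same letters is not accepted. That stricter rule is an assumption.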
All web pages are used as the background corpus. The screening was checked by repeated random sampling: from 660,000 web pages, 100 pages were sampled each time, and over repeated experiments the accuracy reached 96%.
After the above processing, the number and scale of the resulting corpus are as follows:
Table 5: Entries and scale of the finance and economics domain corpus
[Table image not reproduced in this text]
Step S104: screen the multi-character strings in the candidate multi-character set with the new word discovery algorithm to remove strings that cannot stand alone as words. For the set obtained in the previous step, count the frequency of each string and calculate its information entropy, then screen the set by frequency, left and right information entropy, and coupling degree, so that strings that cannot stand alone as words are screened out. The screened results are submitted to a search engine to obtain the number of web pages returned for each string; if the number is too small, the string is filtered out, which finally yields the candidate term set. Specifically, the new word discovery algorithm may comprise one or more of the following steps, as shown in the flowcharts of Figs. 2 and 3:
Step 201: filter by frequency. Count how often each multi-character string occurs in the domain's set, and keep the strings whose frequency exceeds a certain threshold as the candidate multi-character set for the next step.
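A minimal sketch of the frequency filter of step 201 follows. The source does not say how candidate strings are cut from the anchor text, so enumerating every substring of 2 to 4 characters is an assumption made here for illustration, as are the function names.

```python
from collections import Counter


def count_candidates(segments, min_len=2, max_len=4):
    """Count every substring of min_len..max_len characters (assumed cutting rule)."""
    counts = Counter()
    for seg in segments:
        for size in range(min_len, max_len + 1):
            for start in range(len(seg) - size + 1):
                counts[seg[start:start + size]] += 1
    return counts


def frequency_filter(counts, min_count):
    """Step 201: keep strings whose frequency is greater than the threshold."""
    return {w: c for w, c in counts.items() if c > min_count}
```

For instance, `frequency_filter(count_candidates(["股票行情", "股票"]), 1)` keeps only "股票", the one string that occurs more than once.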
Step 202: calculate the information entropy and filter the candidate multi-character set with the left/right entropy algorithm. Specifically: calculate the left information entropy and the right information entropy of each multi-character string in the candidate set; judge whether the left information entropy or the right information entropy of each string is greater than a threshold; and remove a string if its left information entropy and right information entropy are both less than the threshold.
The information entropy is calculated as follows:
First, build the left and right neighboring-character statistics for each word: traverse all documents and count how often each single character occurs immediately to the left and to the right of the word.
Then calculate the corresponding entropy.
Definition 1: suppose the word $w$ belongs to the candidate set, and let $A = \{a_1, a_2, a_3, \ldots, a_m\}$ and $B = \{b_1, b_2, b_3, \ldots, b_n\}$ be the sets of single characters that occur to the left and to the right of $w$, respectively. The left and right entropy are then defined as follows.
Left information entropy:
$$LE(w) = -\frac{1}{n} \sum_{a_i \in A} C(w, a_i) \log \frac{C(w, a_i)}{n} \qquad (3\text{-}1)$$
Right information entropy:
$$RE(w) = -\frac{1}{n} \sum_{b_i \in B} C(w, b_i) \log \frac{C(w, b_i)}{n} \qquad (3\text{-}2)$$
where [formula image not reproduced in this text], and $C(w, a_i)$ and $C(w, b_i)$ are the numbers of times the left character $a_i$ and the right character $b_i$ occur next to the word $w$, respectively.
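The entropy of Definition 1 can be computed along the following lines. This is a sketch under one stated assumption: the normalizer n, whose definition is in a formula image that is not available here, is taken to be the total number of neighboring-character occurrences on the side in question. The function names are hypothetical.

```python
import math
from collections import Counter


def neighbor_counts(corpus, w):
    """Count the single characters found immediately left and right of w."""
    left, right = Counter(), Counter()
    for text in corpus:
        start = text.find(w)
        while start != -1:
            if start > 0:
                left[text[start - 1]] += 1
            end = start + len(w)
            if end < len(text):
                right[text[end]] += 1
            start = text.find(w, start + 1)
    return left, right


def side_entropy(counts):
    """-(1/n) * sum(C * log(C / n)), with n assumed to be the sum of the counts C.

    Returns None when the word has no neighbor on this side.
    """
    n = sum(counts.values())
    if n == 0:
        return None
    return -sum(c * math.log(c / n) for c in counts.values()) / n
```

With the corpus `["买股票吧", "卖股票吗", "股票"]`, the word "股票" has two distinct left neighbors seen once each, so its left entropy under this reading is log 2; the third text contributes no neighbor on either side.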
Because the corpus used here has its own characteristics (a query is often not a sentence), a given word is quite likely to have no character on its left (or right) when it stands alone as a word. For example, in the processed query corpus "BOE" occurs 532 times in total, but its left and right neighboring characters occur only 22 times altogether, so information entropy cannot reflect its probability of being a word. The following strategy is therefore adopted (L and R are flags and α is a threshold):
If [condition given as a formula image, not reproduced in this text], set L=1; otherwise set L=0, where N is the total frequency of the word and n is the frequency with which a character occurs on its left. Likewise, if [corresponding condition for the right side, missing from this text], set R=1; otherwise set R=0, where N is the total frequency of the word and n is the frequency with which a character occurs on its right.
If L=R=1, the word is put into the candidate set for the next filtering step. Otherwise, if L=0 or R=0, the word is filtered by judging its left or right information entropy.
The information entropy filtering strategy is: after the candidate set has been extracted as above, for the case L=0 or R=0, if the left information entropy of the word is greater than a certain value (denoted β) or the right information entropy of the word is greater than a certain value (denoted β), the word is put into the candidate set for the next filtering step; otherwise the word is removed. Note also that if the entropy on a side does not exist, it is defined as infinitesimal. Only when w satisfies the word-forming threshold on both sides can it be put into the candidate set.
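One possible reading of the flag and entropy strategy is sketched below in Python. Two points are assumptions rather than facts from the source: the flag condition, which sits in a formula image that is not available, is guessed to be n/N < α (the word rarely has a neighbor on that side, as in the "BOE" example); and a side is taken to pass when its flag is 1 or its entropy exceeds β, with both sides required to pass. The function names are hypothetical.

```python
import math


def side_flag(total, neighbors, alpha):
    """Assumed flag rule: 1 when the word rarely has a neighbor on this side."""
    return 1 if neighbors / total < alpha else 0


def side_passes(flag, entropy, beta):
    """A side passes if its flag is 1 or its entropy is greater than beta."""
    if flag == 1:
        return True
    # A missing entropy is treated as arbitrarily small, so it never passes.
    value = -math.inf if entropy is None else entropy
    return value > beta


def keep_word(total, n_left, n_right, left_entropy, right_entropy, alpha, beta):
    """Keep the word only when both its left and right side pass."""
    left_ok = side_passes(side_flag(total, n_left, alpha), left_entropy, beta)
    right_ok = side_passes(side_flag(total, n_right, alpha), right_entropy, beta)
    return left_ok and right_ok
```

Under these assumptions a word that occurs 532 times with only about ten neighbors on each side gets L=R=1 for α=0.1 and is kept without consulting entropy, which matches the motivation given for the flags.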
Step 203: filter with the recursive coupling degree filtering algorithm. Although the previous step finds the word set well, much noise remains after filtering. It should be noted that such a noisy candidate has no right information entropy because it satisfies the frequency-based filtering rule of the previous subsection, whereas semantically the candidate could in fact be cut on its right side; the main problem is that its left information entropy is too large, so it cannot be filtered out by the rule of the previous step. Therefore: calculate the word length of each multi-character string in the screened candidate set; judge from the word length and coupling degree of each string whether it can stand alone as a word; and remove the string if it is judged that it cannot.
The recursive coupling degree filtering algorithm is as follows:
For example, for a multi-character string w of length 3, suppose there exists w_1 ∈ T_2 (T_2 being the set of candidate words of length 2) such that w = w_1 p, where p is a single character and w_1 is the string that remains after removing p. Calculate the coupling degree of p and w_1. If the following conditions hold simultaneously: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the right information entropy of w is less than the right information entropy of w_1; (3) the right information entropy of w is less than a certain threshold; then w is filtered out as unable to stand alone as a word. Likewise, suppose there exists w_1 ∈ T_2 such that w = p w_1, where p is a single character and w_1 is the string that remains after removing p. Calculate the coupling degree of p and w_1. If the following conditions hold simultaneously: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the left information entropy of w is less than the left information entropy of w_1; (3) the left information entropy of w is less than a certain threshold; then w is filtered out as unable to stand alone as a word.
Similarly, for a multi-character string w of length 4, suppose there exists w_1 ∈ T_3 (T_3 being the set of candidate words of length 3) such that w = w_1 p, where p is a single character and w_1 is the string that remains after removing p. Calculate the coupling degree of p and w_1. If the following conditions hold simultaneously: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the right information entropy of w is less than the right information entropy of w_1; (3) the right information entropy of w is less than a certain threshold; then w is filtered out as unable to stand alone as a word. Likewise, suppose there exists w_1 ∈ T_3 such that w = p w_1, where p is a single character and w_1 is the string that remains after removing p. Calculate the coupling degree of p and w_1. If the following conditions hold simultaneously: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the left information entropy of w is less than the left information entropy of w_1; (3) the left information entropy of w is less than a certain threshold; then w is filtered out as unable to stand alone as a word. Words of greater length are handled in the same way.
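The three conditions and the length-by-length recursion described above can be sketched as follows. The sketch assumes that frequency and left/right entropy are already available as dictionaries holding a number for every candidate, and that a single ratio threshold and a single entropy threshold are shared by both sides; the source only says "a certain threshold". All names are hypothetical.

```python
def is_fragment(w, w1, freq, entropy, ratio_min, entropy_max):
    """Check the three conditions for one side.

    entropy maps each word to its entropy on the side where the extra
    character p was attached (right entropy for w = w1 + p, left for p + w1).
    """
    return (
        freq[w1] / freq[w] > ratio_min
        and entropy[w] < entropy[w1]
        and entropy[w] < entropy_max
    )


def coupling_filter(freq, left_entropy, right_entropy, ratio_min, entropy_max):
    """Process candidates from short to long; return the set that survives."""
    kept = set()
    for w in sorted(freq, key=len):
        if len(w) >= 3:
            prefix, suffix = w[:-1], w[1:]
            # w = prefix + p: compare right entropies.
            if prefix in kept and is_fragment(
                w, prefix, freq, right_entropy, ratio_min, entropy_max
            ):
                continue
            # w = p + suffix: compare left entropies.
            if suffix in kept and is_fragment(
                w, suffix, freq, left_entropy, ratio_min, entropy_max
            ):
                continue
        kept.add(w)
    return kept
```

Because candidates are visited in order of increasing length, the set `kept` plays the role of T_2 when length-3 strings are examined, of T_3 when length-4 strings are examined, and so on.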
Step 204: filter with a search engine. Given how search engines are built, a multi-character string is submitted to a search engine, and if very few results are returned the string cannot stand alone as a word. On this principle the results can be filtered further. The invention uses the number of web pages returned by a commercial search engine to filter the final results and remove strings that cannot stand alone as words; experiments show that this method can filter out non-words.
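The search engine check of step 204 reduces to a threshold on a result count. In the sketch below, `hit_count` is a caller-supplied function standing in for whatever search engine is queried; no real search API is implied, and the minimum number of hits is an assumed parameter that the source does not specify.

```python
def search_filter(words, hit_count, min_hits):
    """Keep the words for which the search engine reports enough result pages.

    hit_count is any callable mapping a word to a number of result pages.
    """
    return [w for w in words if hit_count(w) >= min_hits]
```

Passing the lookup as a function keeps the filter independent of any particular engine and makes it easy to exercise with recorded counts.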
After the new word discovery algorithm, the results for the anchor text corpus, sorted by frequency, are as follows:
Table 6: Information entropy and frequency of words from the anchor text corpus
[Table image not reproduced in this text]
In terms of generation quality, the domain terms generated by this method are highly reliable. Table 7 lists, for the three corpora before relative frequency filtering, the number of candidate terms generated and the probability that a candidate is a word:
Table 7: Number of words obtained by the new word discovery algorithm and their word-forming probability
[Table image not reproduced in this text]
Step S105: further screen the candidate multi-character set that has passed the new word discovery algorithm with the relative frequency algorithm to output the domain term generation result. Relative frequency is an established and very effective method, widely used in systems such as information retrieval and text classification. An obvious characteristic of a term is that it occurs many times in texts of its own domain and less often in other domains, and relative frequency reflects this characteristic to some extent. The method is simple to compute and gives good extraction results.
In the embodiment of the invention, the relative frequency is calculated as the frequency in the domain-specific corpus divided by the frequency in the background corpus. After relative frequency screening, the words are sorted by number of occurrences in the corpus to obtain an ordered result, which is then annotated to check whether each word is a finance and economics word. The top 10 (P10), top 100 (P100), top 1000 (P1000, where that many exist) and all results are annotated, and the precision is calculated as follows (the relative frequency threshold indicates the proportion filtered out):
Table 8: Precision of finance and economics domain words for the different corpora
[Table image not reproduced in this text]
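The relative frequency screening and the precision figures of step S105 can be sketched as below. The sketch uses raw occurrence counts as the "frequency" and skips words that never occur in the background corpus; both are assumptions, as are the function names and the threshold value.

```python
def relative_frequency(domain_counts, background_counts):
    """Domain-corpus frequency divided by background-corpus frequency."""
    return {
        w: c / background_counts[w]
        for w, c in domain_counts.items()
        if background_counts.get(w, 0) > 0
    }


def select_terms(domain_counts, background_counts, min_ratio):
    """Keep words at or above the ratio threshold, most frequent first."""
    ratios = relative_frequency(domain_counts, background_counts)
    kept = [w for w, r in ratios.items() if r >= min_ratio]
    return sorted(kept, key=lambda w: domain_counts[w], reverse=True)


def precision_at_k(ranked, labelled_terms, k):
    """Share of the top k ranked words that were labelled as domain terms."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for w in top if w in labelled_terms) / len(top)
```

Since the background corpus is described as containing all web pages, a word's domain count cannot exceed its background count, so the ratio lies between 0 and 1 and a common word such as "我们" scores low even when it is frequent in the domain corpus.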
According to above step, obtained the set of field of finance and economics term.This has just finished objective, the reliable overall process that generates field term automatically of the behavior that utilizes the network user.
After above each step, generated the field term of field of finance and economics.The field term that comprises a lot of parts of speech is as noun, verb, adjective etc.In order to verify validity of the present invention and reliability, we have carried out the related experiment that field term generates.
This paper adopts the inquiry log in certain one week of commercial search engine, and should week user browsing behavior daily record.
On the effect of field term generation, the field term that this field term generation method generates has the higher degree of reliability, simultaneously since this method based on data resource be Internet resources, therefore the field term that generates can comprise emerging word in the language environment.Table 9 has been listed the certain fields term and has been generated the result:
Table 9: the certain fields term generates the result
Figure BDA0000054975650000102
The present invention can automatically discover and extract domain terms from anchor text. The model structure and parameters are simple, the algorithmic complexity is low, and good performance and domain term discovery results were obtained on the experimental test data. This shows that the present invention generalizes and adapts well; its generation results are objective, reliable and comprehensive, and it has good application prospects.
By analyzing user browsing logs, the present invention extracts the anchor texts that users click when visiting web pages of a given domain; these web resources contain many terms of that domain. Based on these web resources, information entropy, coupling degree and relative frequency are used to screen automatically and obtain the terms of the domain. The method requires no manual participation, is accurate and objective, and can promptly discover terms that are popular in a given domain on the Internet.
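The log-processing stage summarized above (cleaning the clicked anchor text and keeping only clicks that land on pages of the target domain) might look roughly like the following sketch. The record layout of (anchor text, clicked URL) pairs, the function names, and the choice to match the preset URL list by host name are assumptions made for illustration; the text does not specify the log format or the matching rule.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# ASCII letters, digits and underscore, plus any non-word character
# (punctuation and whitespace, including full-width punctuation).
_NOISE = re.compile(r"[A-Za-z0-9_]|\W")


def clean_anchor_text(text):
    """Remove digits, letters and punctuation, keeping the remaining characters."""
    return _NOISE.sub("", text)


def collect_candidates(click_records, domain_hosts):
    """Count cleaned anchor texts whose click result address is on a preset host.

    click_records: iterable of (anchor_text, clicked_url) pairs (hypothetical layout).
    domain_hosts: set of host names standing in for the preset URL list.
    """
    candidates = Counter()
    for anchor_text, url in click_records:
        host = urlparse(url).hostname
        if host is None:
            continue
        on_list = any(host == h or host.endswith("." + h) for h in domain_hosts)
        if not on_list:
            continue
        cleaned = clean_anchor_text(anchor_text)
        if cleaned:  # skip anchors that were only digits, letters or punctuation
            candidates[cleaned] += 1
    return candidates


# Hypothetical records, for illustration only.
records = [
    ("股票行情 2011!", "http://finance.example.com/a"),
    ("天气预报", "http://weather.example.org/b"),
    ("基金ABC", "http://finance.example.com/c"),
    ("123", "http://finance.example.com/d"),
]
candidates = collect_candidates(records, {"finance.example.com"})
```

Matching by host name is one simple way to realize a preset URL list; a prefix match on the full address would work equally well if the list holds page-level entries.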
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principles and spirit of the present invention, the scope of which is defined by the appended claims and their equivalents.

Claims (8)

1. A method for automatically generating domain terms based on anchor text analysis, characterized by comprising the following steps:
collecting browsing logs of users;
processing the browsing logs to obtain the anchor texts clicked by users and the corresponding click result addresses;
processing the anchor texts according to the click result addresses to obtain a candidate multi-character set;
screening the multi-character strings in the candidate multi-character set based on a new word discovery algorithm to remove the multi-character strings that cannot independently form words; and
further screening, according to a relative frequency algorithm, the candidate multi-character set screened by the new word discovery algorithm, to output a domain term generation result.
2. The method for automatically generating domain terms based on anchor text analysis according to claim 1, characterized in that processing the browsing logs to obtain the anchor texts clicked by users and the corresponding click result addresses further comprises:
performing encoding conversion on the user logs, organizing the browsing logs into character string form, and removing digits, letters and punctuation marks.
3. The method for automatically generating domain terms based on anchor text analysis according to claim 1, characterized in that processing the anchor texts according to the click result addresses to obtain the candidate multi-character set further comprises:
determining whether the click result address belongs to a preset URL list;
adding the anchor text corresponding to a click result address that belongs to the preset URL list to the candidate multi-character set.
4. The method for automatically generating domain terms based on anchor text analysis according to claim 1, characterized in that screening the multi-character strings in the candidate multi-character set based on the new word discovery algorithm to remove the multi-character strings that cannot independently form words further comprises:
filtering the candidate multi-character set based on a left-right entropy algorithm; and
filtering the screened candidate multi-character set based on a coupling degree algorithm.
5. The method for automatically generating domain terms based on anchor text analysis according to claim 4, characterized in that filtering the candidate multi-character set based on the left-right entropy algorithm further comprises:
calculating the left information entropy and the right information entropy of each multi-character string in the candidate multi-character set;
determining whether the left information entropy or the right information entropy of each multi-character string is greater than a threshold;
if the left information entropy and the right information entropy of a multi-character string are both less than the threshold, removing the multi-character string.
6. The method for automatically generating domain terms based on anchor text analysis according to claim 5, characterized in that
the left information entropy is:
$$LE(w) = -\frac{1}{n} \sum_{a_i \in A} C(w, a_i) \log \frac{C(w, a_i)}{n}$$
and the right information entropy is:
$$RE(w) = -\frac{1}{n} \sum_{b_i \in B} C(w, b_i) \log \frac{C(w, b_i)}{n}$$
where
Figure FDA0000054975640000023
and $C(w, a_i)$ and $C(w, b_i)$ are the numbers of times that the left neighboring character $a_i$ and the right neighboring character $b_i$, respectively, occur next to the word $w$.
7. The method for automatically generating domain terms based on anchor text analysis according to claim 4, characterized in that filtering the screened candidate multi-character set based on the coupling degree algorithm further comprises:
calculating the word length of each multi-character string in the screened candidate multi-character set;
determining, according to the word length and the coupling degree of each multi-character string, whether the multi-character string can independently form a word;
if it is determined that the multi-character string cannot independently form a word, removing it.
8. The method for automatically generating domain terms based on anchor text analysis according to claim 4, characterized by further comprising:
entering each multi-character string in the candidate multi-character set screened by the left-right entropy algorithm and the coupling degree algorithm into a search engine to perform a search;
removing, according to the search results, the multi-character strings whose search results do not meet the requirements.
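As an illustration of the left-right entropy filtering recited in claims 5 and 6, a minimal sketch follows. It is not the claimed implementation. It assumes that n is the total number of neighboring-character occurrences on the given side (the definition is contained in a formula image not reproduced in this text), that the natural logarithm is used, and that a candidate is removed only when both entropies fall below the threshold. All function names are hypothetical.

```python
import math
from collections import Counter


def neighbor_entropy(neighbor_counts):
    """Entropy -(1/n) * sum(C * log(C / n)) over the neighboring characters.

    Assumption: n is the total count of neighbors on this side.
    """
    n = sum(neighbor_counts.values())
    if n == 0:
        return 0.0
    return -sum(c * math.log(c / n) for c in neighbor_counts.values()) / n


def left_right_entropy(word, corpus):
    """Return (left entropy, right entropy) of `word` over a list of strings."""
    left, right = Counter(), Counter()
    for text in corpus:
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                left[text[start - 1]] += 1   # C(w, a_i)
            if end < len(text):
                right[text[end]] += 1        # C(w, b_i)
            start = text.find(word, start + 1)
    return neighbor_entropy(left), neighbor_entropy(right)


def entropy_filter(candidates, corpus, threshold):
    """Drop candidates whose left and right entropies are both below the threshold."""
    kept = []
    for word in candidates:
        left_entropy, right_entropy = left_right_entropy(word, corpus)
        if left_entropy < threshold and right_entropy < threshold:
            continue
        kept.append(word)
    return kept


# Hypothetical toy corpus: "XY" appears with varied neighbors, "bX" does not.
corpus = ["abXYcd", "efXYgh", "ijXYkl"]
kept = entropy_filter(["XY", "bX"], corpus, threshold=0.5)
```

A string that is always preceded and followed by the same characters has entropy near zero on both sides, which suggests it is a fragment of a longer word; a string with varied neighbors has higher entropy and is more likely to stand alone as a word.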
CN 201110091312 2011-04-12 2011-04-12 Anchor text analysis-based automatic domain term generating method Pending CN102169496A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110091312 CN102169496A (en) 2011-04-12 2011-04-12 Anchor text analysis-based automatic domain term generating method

Publications (1)

Publication Number Publication Date
CN102169496A true CN102169496A (en) 2011-08-31

Family

ID=44490658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110091312 Pending CN102169496A (en) 2011-04-12 2011-04-12 Anchor text analysis-based automatic domain term generating method

Country Status (1)

Country Link
CN (1) CN102169496A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20110831