CN102169496A - Anchor text analysis-based automatic domain term generating method

Info

Publication number: CN102169496A
Application number: CN 201110091312
Authority: CN
Inventors: 闫兴龙; 刘奕群; 马少平; 张敏; 金奕江; 张阔; 茹立云
Original assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Current assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Priority date: 2011-04-12
Filing date: 2011-04-12
Publication date: 2011-08-31

Abstract

The invention provides an anchor text analysis-based automatic domain term generation method, comprising the following steps: collecting users' browsing logs; processing the browsing logs to obtain the anchor texts clicked by users and the corresponding click result addresses; processing the anchor texts according to the click result addresses to obtain a candidate multi-character set; screening the multi-character strings in the candidate multi-character set with a new word discovery algorithm to remove those that cannot independently form words; and further screening, with a relative frequency algorithm, the candidate multi-character set screened by the new word discovery algorithm to output the domain term generation result. With this method, domain terms can be automatically discovered and extracted from anchor text; the model structure and parameters are simple, the algorithms have low complexity, and good performance and domain term discovery results are achieved on experimental test data.

Description

Anchor text analysis-based automatic domain term generation method

Technical field

The present invention relates to the field of network technology, and in particular to an anchor text analysis-based automatic domain term generation method.

Background art

A domain term is a word used within a subject domain to denote a concept or relation of that domain. A term may be a single word or a phrase; it is the designation used to express a concept in a specific subject domain, in other words, a conventional linguistic symbol that expresses or delimits a scientific concept through speech or writing. In China, terms are customarily called "nouns". Concrete examples of terms are everywhere when reading scientific literature or studying specialized courses: "router" is a term in the computer network domain, and "DNA" is a term in the life sciences. In the field of term extraction, a term denotes a linguistic unit with a precise meaning that is composed of two or more words in a certain grammatical relation, such as "NMD".

Domain term extraction has important uses in many fields. When a domain ontology is built, its domain terms must be updated promptly, so domain term extraction plays a crucial role in constructing and maintaining the ontology. In information retrieval, a domain term set must be introduced when the index is built, and improvements in domain term extraction can greatly improve both the accuracy and the coverage of retrieval. This is especially true of vertical search, where obtaining the terms of a domain yields more accurate information for searches in that domain. In browsing recommendation, domain terms obtained from web resources help capture a user's browsing intent more accurately, so that relevant information can be recommended according to the user's specific browsing behavior. Domain term extraction is also very useful in advertising: a domain dictionary greatly helps web page classification and helps commercial companies deliver finer-grained and more accurate advertising to different customer groups.

Current domain term extraction methods fall into three main categories:

1. Rule-based methods. These methods define rule templates in advance and then extract terms by template matching. However, establishing the rules depends mainly on linguistic knowledge, and linguistic rules are hard to discover. A complete rule set is very difficult to formulate, and the compatibility of multiple rules must also be considered.

2. Statistics-based methods. Statistical methods were applied to term extraction early on and have achieved good results. Some researchers use the relative frequency of documents for term extraction and apply it to automatic ontology construction. Frantzi proposed the C-value/NC-value evaluation function for domain term extraction and obtained good results. Pantel used mutual information and the log-likelihood ratio to obtain domain terms. Liu used left and right information entropy and the log-likelihood ratio to determine word boundaries and thereby extract candidate terms; this approach is also used in part in this document. Statistical algorithms can be applied to any corpus, but they do not yield good results on corpora of particular types.

3. Methods combining rules and statistics. In practice, many methods combine statistics with rules. ThuyVU first extracts a candidate set according to rules, then scores it with C-value/NC-value and the t-test, and finally obtains the real terms. This approach draws on the strengths and weaknesses of the two methods above, and its results are comparatively good.

The shortcoming of the prior art is that existing domain term extraction methods are very complex and have low accuracy, and therefore urgently need improvement.

Summary of the invention

The purpose of the present invention is to address the technical deficiencies described above.

To achieve this purpose, one aspect of the present invention proposes an anchor text analysis-based automatic domain term generation method, comprising the following steps: collecting users' browsing logs; processing the browsing logs to obtain the anchor texts clicked by users and the corresponding click result addresses; processing the anchor texts according to the click result addresses to obtain a candidate multi-character set; screening the multi-character strings in the candidate multi-character set with a new word discovery algorithm to remove those that cannot independently form words; and further screening, with a relative frequency algorithm, the candidate multi-character set screened by the new word discovery algorithm to output the domain term generation result.

In one embodiment of the invention, processing the browsing logs to obtain the anchor texts clicked by users and the corresponding click result addresses further comprises: converting the encoding of the user logs, arranging the browsing logs into character string form, and removing digits, letters, and punctuation marks.

In one embodiment of the invention, processing the anchor texts according to the click result addresses to obtain the candidate multi-character set further comprises: determining whether a click result address belongs to a preset URL list; and adding the anchor text corresponding to each click result address that belongs to the preset URL list to the candidate multi-character set.

In one embodiment of the invention, screening the multi-character strings in the candidate multi-character set with the new word discovery algorithm to remove those that cannot independently form words further comprises: filtering the candidate multi-character set with a left-right entropy algorithm; and filtering the resulting candidate multi-character set with a coupling degree algorithm.

In one embodiment of the invention, filtering the candidate multi-character set with the left-right entropy algorithm further comprises: computing the left information entropy and the right information entropy of each multi-character string in the candidate multi-character set; determining whether the left information entropy or the right information entropy of each multi-character string is greater than a threshold; and removing a multi-character string if its left or right information entropy is less than the threshold.

In one embodiment of the invention, the left information entropy is:

LE(w) = -\frac{1}{n} \sum_{a_i \in A} C(w, a_i) \log \frac{C(w, a_i)}{n};

and the right information entropy is:

RE(w) = -\frac{1}{n} \sum_{b_i \in B} C(w, b_i) \log \frac{C(w, b_i)}{n};

where C(w, a_i) and C(w, b_i) are the numbers of occurrences of the left single character a_i and the right single character b_i of the word w, respectively.

In one embodiment of the invention, filtering the screened candidate multi-character set with the coupling degree algorithm further comprises: computing the word length of each multi-character string in the screened candidate multi-character set; determining, from the word length and the coupling degree of each multi-character string, whether it can independently form a word; and removing it if it is determined that it cannot independently form a word.

In one embodiment of the invention, the method further comprises: submitting each multi-character string in the candidate multi-character set screened by the left-right entropy algorithm and the coupling degree algorithm to a search engine; and removing, according to the search results, the multi-character strings whose search results do not meet the requirement.

The present invention can automatically discover and extract domain terms from anchor text. The model structure and parameters are simple, the algorithmic complexity is low, and good performance and domain term discovery results were obtained on experimental test data. This shows that the invention generalizes and adapts well; its generation results are objective, reliable, and comprehensive, and it has good application prospects.

Additional aspects and advantages of the invention are set forth in part in the description below, and in part will become apparent from that description or be learned through practice of the invention.

Description of drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart of the anchor text analysis-based automatic domain term generation method according to an embodiment of the invention;

Figs. 2 and 3 are flowcharts of the new word discovery algorithm according to an embodiment of the invention.

Embodiments

Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which identical or similar reference numerals denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.

The method analyzes users' browsing logs and extracts the anchor text that users click while browsing web pages. If the web page that an anchor text leads to belongs to a certain domain, the anchor text is assumed to contain terms of that domain. Domain terms are then screened automatically from these network resources using information entropy, coupling degree, and relative frequency. Anchor text is the text of a link. Anchor text can serve as an assessment of the content of the page on which it appears: the links added to a page usually bear some relation to the page itself. For example, a clothing industry website may add links to peer websites or to well-known clothing companies. Anchor text can also serve as an assessment of the page it points to, because it can accurately describe that page's content; for example, a link added on a personal website may carry the anchor text "search engine". In general, the links added to a page should be directly related to the page, and a search engine can judge the content attributes of a page from the anchor text of the links that point to it. Anchor text also enables search engines to collect files that they cannot otherwise index.

The embodiment of the invention proposes an automatic domain term generation method based on the analysis of multiple network resources. The method analyzes and processes multiple network resources to obtain a corpus related to the finance domain, extracts the words in the corpus with a new word discovery algorithm, and finally filters by relative frequency to obtain the set of terms related to the domain, thereby generating domain terms automatically. Compared with traditional domain term extraction methods, the invention is based on anchor text resources and network resources, which are more structured and more timely than traditional text data. The method can generate domain terms efficiently and accurately, and thus supports various Internet-based natural language application systems.

As shown in Fig. 1, the anchor text analysis-based automatic domain term generation method of the embodiment of the invention comprises the following steps:

Step S101: collect users' browsing logs. When browsing the web, a user reaches pages by clicking anchor text. If the page reached is related to a certain domain, the anchor text is also strongly correlated with that domain and is very likely to contain terms of the domain. The embodiment uses browsing logs as an example, but other network resources may also be used.

As an example of the invention, one week of user browsing behavior logs (October 11, 2010 to October 17, 2010) may be used. The entries and scale of the corpus are as follows:

Table 1: Entries and scale of each corpus

The information contained in the user browsing logs:

Table 2: Information items contained in the user browsing logs

The log information above contains sufficient information about user browsing, so the logs can be used for domain term extraction.

Step S102: process the browsing logs to obtain the anchor texts clicked by users and the corresponding click result addresses. Preprocessing of the user browsing behavior logs comprises: converting the encoding of the domain text corpus, that is, converting the encoding recorded by the server (usually the Uniform Resource Identifier, URI, format) into the GBK national standard Chinese character encoding; organizing the user logs using the content items listed in Table 3, locating the required information, and arranging the logs into character strings of those content items; and filtering the various kinds of noise in the anchor text, such as digits, letters, and punctuation marks.

The data on which automatic domain term generation relies comes from users' browsing logs. To be usable for automatic domain term generation, a browsing log should contain at least the following content:

Table 3: Content that a user browsing log must contain for automatic domain term generation

Because network resources have complex formats, the useful information must be extracted from them, mainly through the following steps.

Step 1.1: convert the encoding of the user logs, that is, convert the encoding recorded by the server (usually the Uniform Resource Identifier, URI, format) into the GBK national standard Chinese character encoding.

Step 1.2: organize the user logs using the content items listed in Table 3, remove the information outside the Table 3 content items, and arrange the logs into character strings of those content items.

Step 1.3: filter the various kinds of noise in the corpus, such as digits, letters, and punctuation marks, to obtain the candidate multi-character set.
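
Steps 1.1 and 1.3 can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not state: that the logged field is percent-encoded GBK bytes, that "noise" means every character outside the basic CJK ideograph range, and that removing noise splits the anchor text into fragments rather than joining the remainder. The function names are hypothetical.

```python
import re
from urllib.parse import quote, unquote

# Assumption: only characters in the basic CJK ideograph block are kept;
# digits, Latin letters, and punctuation are all treated as noise.
_NOISE = re.compile(r"[^\u4e00-\u9fff]+")

def decode_log_field(raw, encoding="gbk"):
    """Step 1.1 (sketch): turn a percent-encoded log field into readable text."""
    return unquote(raw, encoding=encoding, errors="replace")

def clean_anchor_text(text):
    """Step 1.3 (sketch): split on noise runs and keep the Chinese fragments."""
    return [frag for frag in _NOISE.split(text) if frag]

# Usage: simulate a logged field, then decode and clean it.
encoded = quote("财经新闻", encoding="gbk")
fragments = clean_anchor_text(decode_log_field(encoded) + "2010,ABC股票")
```

Splitting on noise, instead of deleting it, avoids gluing together characters that were never adjacent in the original anchor text; that choice is this sketch's, not the patent's.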

Step S103: process the anchor texts according to the click result addresses to obtain the candidate multi-character set, that is, perform web page screening. In the embodiment of the invention, the text corpus of a certain domain is selected from the above network resources by screening web page URLs, and the corpus is then segmented to obtain the candidate multi-character set.

Web page screening is based on induction and summarization. For example, East Money is a professional finance website, so any web page whose URL contains "eastmoney.com" is considered a finance web page. For large portal websites such as "sina", "sohu", and "qq", the finance web pages of the portal are obtained by the same inductive method, for example through the following subdomains of the "sohu" portal:

Table 4: Some finance web pages under the sohu website

All web pages are used as the background corpus. The screening was checked by repeated random sampling: from the 660,000 web pages, 100 pages were sampled each time, and over repeated experiments the accuracy reached 96%.
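
The URL screening of step S103 can be sketched as follows. This is an illustration under assumptions: the preset list below is hypothetical (only "eastmoney.com" is named in the text, and `money.example.com` is a placeholder for a portal subdomain), and the sketch matches on the host name suffix, which is slightly stricter than testing whether the URL merely contains the string.

```python
from urllib.parse import urlparse

# Hypothetical preset list. "eastmoney.com" comes from the text;
# "money.example.com" is a placeholder for a portal's finance subdomain.
FINANCE_HOST_SUFFIXES = ("eastmoney.com", "money.example.com")

def is_finance_url(url, suffixes=FINANCE_HOST_SUFFIXES):
    """Return True if the URL's host equals or is a subdomain of a preset entry."""
    host = urlparse(url).hostname or ""
    return any(host == s or host.endswith("." + s) for s in suffixes)

def collect_candidates(records, suffixes=FINANCE_HOST_SUFFIXES):
    """records: iterable of (anchor_text, clicked_url) pairs from the browsing log."""
    return [anchor for anchor, url in records if is_finance_url(url, suffixes)]

# Usage with made-up log records.
records = [
    ("基金净值查询", "http://fund.eastmoney.com/123.html"),
    ("热门视频", "http://tv.example.org/hot"),
]
domain_anchors = collect_candidates(records)
```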

After the above processing, the number of entries and the scale of the resulting corpus are as follows:

Table 5: Entries and scale of the finance domain corpus

Step S104: screen the multi-character strings in the candidate multi-character set with the new word discovery algorithm to remove those that cannot independently form words. For the multi-character set from the previous step, the frequency of each multi-character string is counted and its information entropy is computed; the set is then screened according to the frequency, the left and right information entropy, and the coupling degree, and the strings that cannot independently form words are screened out. The screened results are then submitted to a search engine to obtain the number of web pages returned for each string; if that number is too small, the string is filtered out, finally yielding the candidate term set. Specifically, the new word discovery algorithm may comprise one or more of the following steps, as shown in the flowcharts of Figs. 2 and 3:

Step 201: filter by frequency. Count the frequency with which each multi-character string occurs in the domain's multi-character set, and take the strings whose frequency exceeds a certain threshold as the candidate multi-character set for the next step.
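
A minimal sketch of Step 201 follows. The text does not say how the corpus is cut into multi-character strings, so the sketch assumes that every character n-gram of length 2 to `max_len` is a candidate; the function names and the default lengths are hypothetical.

```python
from collections import Counter

def ngram_counts(fragments, min_len=2, max_len=4):
    """Count every character n-gram (assumed candidate form) in the fragments."""
    counts = Counter()
    for frag in fragments:
        for n in range(min_len, max_len + 1):
            for i in range(len(frag) - n + 1):
                counts[frag[i:i + n]] += 1
    return counts

def frequency_filter(counts, min_count):
    """Keep the strings whose frequency is greater than the threshold."""
    return {w: c for w, c in counts.items() if c > min_count}

# Usage on three made-up anchor text fragments.
counts = ngram_counts(["股票行情", "股票基金", "基金"])
frequent = frequency_filter(counts, min_count=1)
```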

Step 202: compute the information entropy and filter the candidate multi-character set with the left-right entropy algorithm. Specifically: compute the left information entropy and the right information entropy of each multi-character string in the candidate multi-character set; determine whether the left information entropy or the right information entropy of each multi-character string is greater than a threshold; and remove a multi-character string if its left or right information entropy is less than the threshold.

The information entropy is computed as follows:

First, build the left and right single-character statistics for each word. This is done by traversing all documents and counting the frequency of each single character that appears to the left and to the right of each word.

Then, compute the corresponding entropy.

Definition 1: Suppose the word w belongs to the candidate set, and A = {a_1, a_2, a_3, ..., a_m} and B = {b_1, b_2, b_3, ..., b_n} are the sets of single characters that appear to the left and to the right of the word, respectively. The left and right entropy are then defined as follows.

The left information entropy is:

LE(w) = -\frac{1}{n} \sum_{a_i \in A} C(w, a_i) \log \frac{C(w, a_i)}{n}    (3-1)

The right information entropy is:

RE(w) = -\frac{1}{n} \sum_{b_i \in B} C(w, b_i) \log \frac{C(w, b_i)}{n}    (3-2)

where C(w, a_i) and C(w, b_i) are the numbers of occurrences of the left single character a_i and the right single character b_i of the word w, respectively.
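
Formulas (3-1) and (3-2) can be computed with a short Python sketch. One assumption is needed because the definition of n is not given in the text: here n is taken to be the total number of single-character occurrences on the side in question, so that C(w, a_i)/n is a probability. The helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def neighbor_counts(fragments, words):
    """Count the single characters seen immediately left and right of each word."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for frag in fragments:
        for w in words:
            start = frag.find(w)
            while start != -1:
                end = start + len(w)
                if start > 0:
                    left[w][frag[start - 1]] += 1
                if end < len(frag):
                    right[w][frag[end]] += 1
                start = frag.find(w, start + 1)
    return left, right

def side_entropy(counter):
    """Entropy per (3-1)/(3-2); n is assumed to be the sum of the counts.

    Returns None when the word has no neighbor on this side.
    """
    n = sum(counter.values())
    if n == 0:
        return None
    return -sum(c * math.log(c / n) for c in counter.values()) / n

# Usage: "股票" has two distinct left and two distinct right neighbors.
left, right = neighbor_counts(["买股票了", "卖股票吗", "股票"], ["股票"])
le = side_entropy(left["股票"])
re_ = side_entropy(right["股票"])
```

With two equally frequent neighbors on a side, the entropy equals log 2 (natural logarithm), which is the value both `le` and `re_` take here.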

Because the corpus used here has its own characteristics (a query is often not a sentence), a string that can independently form a word will very often have no single character on its left (or right). For example, in the processed query corpus, "BOE" occurs 532 times in total, but its left and right single characters number only 22 in total, so information entropy cannot reflect its probability of forming a word. The following strategy is therefore adopted (L and R are flags, and α is a threshold):

If the left-side condition relating n, N, and α is satisfied, set L = 1; otherwise set L = 0, where N is the total number of occurrences of the word and n is the number of occurrences of single characters to the left of the word. Likewise, if the right-side condition is satisfied, set R = 1; otherwise set R = 0, where N is the total number of occurrences of the word and n is the number of occurrences of single characters to the right of the word.

If L = R = 1, the word is placed in the candidate set for the next filtering step. Otherwise, if L = 0 or R = 0, the word is filtered by examining its left or right information entropy.

The entropy-based filtering strategy is as follows. After the candidate set has been extracted as described above, for a word with L = 0 or R = 0, if the left information entropy of the word is greater than a certain value (denoted β) or the right information entropy of the word is greater than a certain value (denoted β), the word is placed in the candidate set for the next filtering step; otherwise the word is removed. Note also that if the entropy on a side does not exist, it is defined as infinitesimal. Only if w satisfies the word-formation threshold on both sides is it placed in the candidate set.
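
The flag-and-entropy strategy can be sketched as below, under two loudly stated assumptions. First, the flag condition itself is not given in the text; the sketch assumes it is n / N < α, meaning that the word rarely has a neighbor on that side. Second, the sketch reads the final sentence ("both sides") as the governing rule: each side passes either through its flag or through its entropy exceeding β, and a word is kept only if both sides pass. All names are hypothetical.

```python
def side_passes(total_count, neighbor_total, entropy, alpha, beta):
    """Decide whether one side (left or right) supports word formation.

    Assumed flag condition: neighbor_total / total_count < alpha.
    """
    flag = 1 if neighbor_total / total_count < alpha else 0
    if flag == 1:
        return True
    # A missing entropy is treated as vanishingly small (0.0 in this sketch).
    value = 0.0 if entropy is None else entropy
    return value > beta

def keep_candidate(total_count, left_total, right_total,
                   left_entropy, right_entropy, alpha, beta):
    """Keep the word only if both sides pass (interpretation of the last rule)."""
    return (side_passes(total_count, left_total, left_entropy, alpha, beta)
            and side_passes(total_count, right_total, right_entropy, alpha, beta))

# Usage: a word seen 532 times with very few neighbors passes on its flags alone.
standalone = keep_candidate(532, 10, 12, None, None, alpha=0.1, beta=1.0)
# A word with frequent but monotonous left neighbors is rejected.
fragment = keep_candidate(100, 80, 5, 0.2, None, alpha=0.1, beta=1.0)
```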

Step 203: filter with the recursive coupling degree filtering algorithm. Although the previous step filters the word set well, much noise remains. Note that a candidate may have no right information entropy because it satisfied the frequency-based rule of the previous subsection, even though, semantically, it can in fact be split on its right side. The main problem is that its left information entropy is large, so it cannot be filtered out by the rules of the previous step. The step therefore proceeds as follows: compute the word length of each multi-character string in the screened candidate multi-character set; determine, from the word length and the coupling degree of each multi-character string, whether it can independently form a word; and remove it if it is determined that it cannot independently form a word.

The recursive coupling degree filtering algorithm is as follows:

For a multi-character string w of length 3, suppose there exists w_1 ∈ T_2 (T_2 is the set of candidate words of length 2) such that w = w_1 p, where p is a single character and w_1 is the string that remains after p is removed. Compute the coupling degree of p and w_1. If the following conditions hold simultaneously: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the right information entropy of w is less than the right information entropy of w_1; and (3) the right information entropy of w is less than a certain threshold, then w is filtered out as unable to independently form a word. Likewise, suppose there exists w_1 ∈ T_2 such that w = p w_1, where p is a single character and w_1 is the string that remains after p is removed. Compute the coupling degree of p and w_1. If the following conditions hold simultaneously: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the left information entropy of w is less than the left information entropy of w_1; and (3) the left information entropy of w is less than a certain threshold, then w is filtered out as unable to independently form a word.

For a multi-character string w of length 4, suppose there exists w_1 ∈ T_3 (T_3 is the set of candidate words of length 3) such that w = w_1 p, where p is a single character and w_1 is the string that remains after p is removed. Compute the coupling degree of p and w_1. If the following conditions hold simultaneously: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the right information entropy of w is less than the right information entropy of w_1; and (3) the right information entropy of w is less than a certain threshold, then w is filtered out as unable to independently form a word. Likewise, suppose there exists w_1 ∈ T_3 such that w = p w_1, where p is a single character and w_1 is the string that remains after p is removed. Compute the coupling degree of p and w_1. If the following conditions hold simultaneously: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the left information entropy of w is less than the left information entropy of w_1; and (3) the left information entropy of w is less than a certain threshold, then w is filtered out as unable to independently form a word. Words of greater length are obtained by analogy.
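
The recursion over word lengths can be sketched in Python as follows. The three conditions are taken directly from the text; what the sketch assumes is that T_{k-1} is the set of length k-1 words that survived this same filter (the text does not say whether T is the filtered or the unfiltered set), and that counts and entropies are supplied as dictionaries. All names are hypothetical.

```python
def _absorbed(w, w1, shorter, count, entropy, ratio_min, entropy_max):
    """Check the three conditions for w against its shorter part w1 on one side."""
    return (w1 in shorter
            and count[w1] / count[w] > ratio_min   # condition (1)
            and entropy[w] < entropy[w1]           # condition (2)
            and entropy[w] < entropy_max)          # condition (3)

def coupling_filter(words_by_len, count, left_entropy, right_entropy,
                    ratio_min, entropy_max, max_len):
    """Filter words length by length, starting from the length-2 set T_2."""
    kept = {2: set(words_by_len.get(2, ()))}
    for length in range(3, max_len + 1):
        shorter = kept[length - 1]  # assumed to play the role of T_{length-1}
        kept[length] = set()
        for w in words_by_len.get(length, ()):
            # Case w = w1 + p uses right entropy; case w = p + w1 uses left entropy.
            drop = (_absorbed(w, w[:-1], shorter, count, right_entropy,
                              ratio_min, entropy_max)
                    or _absorbed(w, w[1:], shorter, count, left_entropy,
                                 ratio_min, entropy_max))
            if not drop:
                kept[length].add(w)
    return kept

# Usage with made-up statistics: "股票行" is a fragment of a longer word.
words_by_len = {2: ["股票"], 3: ["股票行", "股票型"]}
count = {"股票": 100, "股票行": 20, "股票型": 30}
right_entropy = {"股票": 2.0, "股票行": 0.0, "股票型": 1.8}
left_entropy = {"股票": 2.0, "股票行": 1.5, "股票型": 1.5}
kept = coupling_filter(words_by_len, count, left_entropy, right_entropy,
                       ratio_min=2.0, entropy_max=0.5, max_len=3)
```

In this example "股票行" is dropped because all three right-side conditions hold, while "股票型" survives because its right information entropy is above the threshold.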

Step 204: filter with a search engine. Given how search engines build their indexes, if a multi-character string is submitted to a search engine and very few results are returned, the string cannot independently form a word. On this principle, the results can be filtered further. The invention uses the number of web pages returned by a commercial search engine to filter the final results and remove the multi-character strings that cannot independently form words; experiments show that this method can filter out non-words.
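
Step 204 reduces to a threshold on a result count. The sketch below deliberately leaves the search engine abstract: `hit_count` is a hypothetical caller-supplied function, and no real search API is assumed. The threshold value is likewise illustrative.

```python
def search_filter(words, hit_count, min_hits):
    """Keep words whose result count reaches min_hits.

    hit_count is a hypothetical callable mapping a query string to the
    number of web pages returned; it must be supplied by the caller.
    """
    return [w for w in words if hit_count(w) >= min_hits]

# Usage with a stub standing in for a real search engine.
stub_hits = {"市盈率": 1200000, "盈率的": 35}
survivors = search_filter(["市盈率", "盈率的"],
                          lambda w: stub_hits.get(w, 0),
                          min_hits=1000)
```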

After the new word discovery algorithm has been applied, the results for the anchor text corpus, sorted by frequency, are as follows:

Table 6: Information entropy and frequency of words in the anchor text corpus

In terms of generation quality, the domain terms generated by this method are highly reliable. Table 7 lists, for the three corpora before relative frequency filtering, the number of candidate terms generated and the word-formation probability:

Table 7: Number of words obtained by the new word discovery algorithm and word-formation probability

Step S105: further screen, with the relative frequency algorithm, the candidate multi-character set screened by the new word discovery algorithm to output the domain term generation result. Relative frequency is an existing and very effective method that is widely used in systems such as information retrieval and text classification. A distinctive feature of a term is that it occurs many times in texts of its own domain but rarely in other domains, and relative frequency reflects this feature to some extent. The method is simple to compute and also yields good extraction results.

In the embodiment of the invention, the relative frequency is computed as the frequency in the domain-specific corpus divided by the frequency in the background corpus. After relative frequency screening, the words are sorted by their number of occurrences in the corpus to obtain an ordered result, which is then annotated to check whether each word is a finance word. The top 10 (P10), top 100 (P100), top 1000 (P1000) (where available), and all results are annotated, and the precision is computed as follows (the relative frequency threshold denotes the filtering ratio):

Table 8: Precision of finance domain words for different corpora

Through the above steps, the finance domain term set is obtained. This completes the whole process of objectively and reliably generating domain terms automatically from the behavior of network users.
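
The relative frequency screening and the P10/P100/P1000 style evaluation can be sketched as follows. Several points are assumptions rather than facts from the text: raw occurrence counts are used as "frequency", the threshold is applied directly to the ratio, words absent from the background corpus are skipped, and the labelled set of true finance words is supplied by a human annotator. Function names are hypothetical.

```python
def relative_frequency(domain_counts, background_counts):
    """Domain frequency divided by background frequency, per word."""
    rf = {}
    for w, c in domain_counts.items():
        bg = background_counts.get(w, 0)
        if bg > 0:  # assumption: words missing from the background are skipped
            rf[w] = c / bg
    return rf

def select_terms(domain_counts, background_counts, threshold):
    """Keep words whose ratio reaches the threshold, sorted by domain frequency."""
    rf = relative_frequency(domain_counts, background_counts)
    kept = [w for w, r in rf.items() if r >= threshold]
    return sorted(kept, key=lambda w: domain_counts[w], reverse=True)

def precision_at_k(ranked, labelled_terms, k):
    """Share of the top k ranked words that annotators marked as domain terms."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for w in top if w in labelled_terms) / len(top)

# Usage with made-up counts.
domain = {"股票": 50, "基金": 30, "视频": 5}
background = {"股票": 60, "基金": 40, "视频": 500}
ranked = select_terms(domain, background, threshold=0.5)
p2 = precision_at_k(ranked, {"股票"}, k=2)
```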

After the above steps, the domain terms of the finance domain have been generated. They include terms of many parts of speech, such as nouns, verbs, and adjectives. To verify the validity and reliability of the invention, we carried out domain term generation experiments.

The experiments use one week of query logs from a commercial search engine, together with the user browsing behavior logs for the same week.

In terms of generation quality, the domain terms generated by this method are highly reliable; moreover, because the method is based on network resources, the generated terms can include words newly emerging in the language environment. Table 9 lists some of the domain term generation results:

Table 9: Some domain term generation results

By analyzing users' browsing logs, the invention extracts the anchor text clicked when users visit web pages of a domain. These network resources contain many terms of the domain, which are screened automatically using information entropy, coupling degree, and relative frequency. The method requires no manual participation, is accurate and objective, and can promptly discover the popular terms of a domain on the Internet.

Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. An anchor text analysis-based automatic domain term generation method, characterized in that it comprises the following steps:

collecting users' browsing logs;

processing the browsing logs to obtain the anchor texts clicked by users and the corresponding click result addresses;

processing the anchor texts according to the click result addresses to obtain a candidate multi-character set;

screening the multi-character strings in the candidate multi-character set with a new word discovery algorithm to remove those that cannot independently form words; and

further screening, with a relative frequency algorithm, the candidate multi-character set screened by the new word discovery algorithm to output the domain term generation result.

2. The anchor text analysis-based automatic domain term generation method of claim 1, characterized in that processing the browsing logs to obtain the anchor texts clicked by users and the corresponding click result addresses further comprises:

converting the encoding of the user logs, arranging the browsing logs into character string form, and removing digits, letters, and punctuation marks.

3. The anchor text analysis-based automatic domain term generation method of claim 1, characterized in that processing the anchor texts according to the click result addresses to obtain the candidate multi-character set further comprises:

determining whether a click result address belongs to a preset URL list;

adding the anchor text corresponding to each click result address that belongs to the preset URL list to the candidate multi-character set.

4. The anchor text analysis-based automatic domain term generation method of claim 1, characterized in that screening the multi-character strings in the candidate multi-character set with the new word discovery algorithm to remove those that cannot independently form words further comprises:

filtering the candidate multi-character set with a left-right entropy algorithm; and

filtering the screened candidate multi-character set with a coupling degree algorithm.

5. The anchor text analysis-based automatic domain term generation method of claim 4, characterized in that filtering the candidate multi-character set with the left-right entropy algorithm further comprises:

computing the left information entropy and the right information entropy of each multi-character string in the candidate multi-character set;

determining whether the left information entropy or the right information entropy of each multi-character string is greater than a threshold;

removing a multi-character string if its left or right information entropy is less than the threshold.

6. The anchor text analysis-based automatic domain term generation method of claim 5, characterized in that

the left information entropy is:

LE(w) = -\frac{1}{n} \sum_{a_i \in A} C(w, a_i) \log \frac{C(w, a_i)}{n};

and the right information entropy is:

RE(w) = -\frac{1}{n} \sum_{b_i \in B} C(w, b_i) \log \frac{C(w, b_i)}{n};

where C(w, a_i) and C(w, b_i) are the numbers of occurrences of the left single character a_i and the right single character b_i of the word w, respectively.

7. The anchor text analysis-based automatic domain term generation method of claim 4, characterized in that filtering the screened candidate multi-character set with the coupling degree algorithm further comprises:

computing the word length of each multi-character string in the screened candidate multi-character set;

determining, from the word length and the coupling degree of each multi-character string, whether the multi-character string can independently form a word;

removing the multi-character string if it is determined that it cannot independently form a word.

8. The anchor text analysis-based automatic domain term generation method of claim 4, characterized in that it further comprises:

submitting each multi-character string in the candidate multi-character set screened by the left-right entropy algorithm and the coupling degree algorithm to a search engine;

removing, according to the search results, the multi-character strings whose search results do not meet the requirement.