CN102169496A

CN102169496A - Anchor text analysis-based automatic domain term generating method

Info

Publication number: CN102169496A
Application number: CN 201110091312
Authority: CN
Inventors: 闫兴龙; 刘奕群; 马少平; 张敏; 金奕江; 张阔; 茹立云
Original assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Current assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Priority date: 2011-04-12
Filing date: 2011-04-12
Publication date: 2011-08-31

Abstract

The invention provides a method for automatically generating domain terms based on anchor text analysis, comprising the following steps: acquiring a browsing log of a user; processing the browsing log to obtain the anchor text clicked by the user and the corresponding click result address; processing the anchor text according to the click result address to obtain a candidate multi-character set; screening the multi-characters in the candidate multi-character set with a new word discovery algorithm to remove those that cannot form words independently; and further screening the resulting candidate multi-character set with a relative frequency algorithm to output a domain term generation result. With this method, domain terms can be automatically discovered and extracted from anchor text. The model structure and parameters are simple, the algorithms have low complexity, and good performance and domain term discovery results are achieved on experimental test data.

Description

Automatic domain term generation method based on anchor text analysis

Technical Field

The invention relates to the technical field of networks, and in particular to a method for automatically generating domain terms based on anchor text analysis.

Background

Domain terms are words used in a subject area to represent concepts or relationships within that area. A term may be a word or a phrase; put another way, terms are conventional linguistic symbols that express or define scientific concepts in speech or writing. In China they are customarily called "nouns". Examples are found everywhere in scientific and technical literature and in professional courses: "router" is a term in the field of computer networks, and "DNA" is a term in the field of life sciences. In the field of term extraction, a term refers to a fixed unit of language consisting of two or more words with a certain grammatical relationship, such as "national missile defense system".

The extraction of domain terms has important applications in many fields. When a domain ontology is built, its domain terms must be updated in a timely manner, so the extraction method plays a crucial role in constructing and maintaining the ontology. In information retrieval, a domain term set must be introduced when the index is built; domain term extraction can greatly improve retrieval accuracy and coverage, and in vertical search in particular, knowing the terms of a field allows more accurate information to be retrieved for searches in that field. In browsing recommendation, domain terms obtained from web resources help capture the user's browsing intention more accurately, so that relevant information can be recommended on the basis of the user's specific browsing behavior. Domain term extraction also plays a large role in advertising: a domain dictionary greatly assists web page classification and helps businesses target advertisements more precisely at different user groups.

Current methods for extracting domain terms fall into three main categories:

1. Rule-based approaches. A rule-based method extracts terms by establishing rule templates in advance and then matching against them. However, formulating rules relies primarily on linguistic knowledge, and linguistic rules are difficult to find. It is hard to formulate a complete rule set and to keep multiple rules compatible with one another.

2. Statistics-based approaches. Statistical methods have long been used in term extraction and have achieved good results. Some researchers use relative document frequency for term extraction and apply it to the automatic construction of ontologies. Frantzi proposed the C-value/NC-value evaluation function for domain term extraction and achieved good results. Pattel uses mutual information and the log-likelihood ratio to obtain domain terms. Liu uses left and right information entropy and the log-likelihood ratio to determine word boundaries and thereby extract candidate terms; the present method also uses this approach. Statistics-based algorithms can be applied to various corpora, but do not yield good results for certain types of corpora.

3. Approaches combining rules and statistics. In practice, many methods combine statistics and rules. Thuy Vu first extracts a candidate set according to rules, then scores it with the C-value/NC-value and T-test methods, and finally obtains the real terms. This approach draws on the characteristics of both methods described above, and its results are relatively good.

The drawback of the prior art is that current domain term extraction methods are very complicated and have low accuracy, so improvement is urgently needed.

Disclosure of Invention

The object of the present invention is to solve the above technical drawbacks.

In order to achieve the above object, in one aspect, the present invention provides an automatic domain term generation method based on anchor text analysis, including the following steps: collecting a browsing log of a user; processing the browsing log to obtain an anchor text clicked by a user and a corresponding click result address; processing the anchor text according to the click result address to obtain a candidate multi-character set; screening the multiple characters in the candidate multiple character set based on a new word discovery algorithm to remove the multiple characters which cannot form words independently; and further screening the candidate multi-character set screened by the new word discovery algorithm according to a relative frequency algorithm to output a domain term generation result.

In an embodiment of the present invention, the processing the browsing log to obtain an anchor text clicked by the user and a corresponding click result address further includes: and carrying out user log coding conversion, arranging the browsing log into a character string form, and removing numbers, letters and punctuation marks.

In an embodiment of the present invention, the processing the anchor text according to the click result address to obtain the candidate multiword set further includes: judging whether the click result address belongs to a preset URL list or not; and adding the anchor text corresponding to the click result address belonging to a preset URL list into a candidate multi-character set.

In an embodiment of the present invention, the filtering the multiple words in the candidate multiple word set based on the new word discovery algorithm to remove the multiple words that cannot be independently formed into words further includes: filtering the candidate multi-word set based on a left-right entropy algorithm; and filtering the screened candidate multi-word set based on a coupling degree algorithm.

In one embodiment of the present invention, the filtering the candidate multiword set based on a left-right entropy algorithm further comprises: calculating the left information entropy and the right information entropy of each multi-word in the candidate multi-word set; judging whether the left information entropy or the right information entropy of each multi-word is larger than a threshold value; and if the left information entropy or the right information entropy of the multi-word is smaller than the threshold value, removing the multi-word.

In one embodiment of the present invention, wherein the left information entropy is:

<math><mrow><mi>LE</mi><mrow><mo>(</mo><mi>w</mi><mo>)</mo></mrow><mo>=</mo><mo>-</mo><mfrac><mn>1</mn><mi>n</mi></mfrac><munder><mi>Σ</mi><mrow><msub><mi>a</mi><mi>i</mi></msub><mo>&Element;</mo><mi>A</mi></mrow></munder><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>a</mi><mi>i</mi></msub><mo>)</mo></mrow><mi>log</mi><mfrac><mrow><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>a</mi><mi>i</mi></msub><mo>)</mo></mrow></mrow><mi>n</mi></mfrac><mo>;</mo></mrow></math>

the right information entropy is:

<math><mrow><mi>RE</mi><mrow><mo>(</mo><mi>w</mi><mo>)</mo></mrow><mo>=</mo><mo>-</mo><mfrac><mn>1</mn><mi>n</mi></mfrac><munder><mi>Σ</mi><mrow><msub><mi>b</mi><mi>i</mi></msub><mo>&Element;</mo><mi>B</mi></mrow></munder><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>b</mi><mi>i</mi></msub><mo>)</mo></mrow><mi>log</mi><mfrac><mrow><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>b</mi><mi>i</mi></msub><mo>)</mo></mrow></mrow><mi>n</mi></mfrac><mo>;</mo></mrow></math>

wherein C(w, a_i) and C(w, b_i) are, respectively, the number of occurrences of the left single character a_i and of the right single character b_i of the word w.

In an embodiment of the present invention, the filtering the filtered candidate multi-word set based on the coupling degree algorithm further includes: calculating the word length of each multi-word in the screened candidate multi-word set; judging whether the multiple words can be independently formed according to the word length and the coupling degree of each multiple word; if it is judged that the word cannot be formed independently, it is removed.

In one embodiment of the present invention, further comprising: inputting each multi-word in the candidate multi-word set screened based on the left-right entropy algorithm and the coupling degree algorithm into a search engine for searching; and removing the multiwords which do not meet the requirement of the search result according to the search result.

The invention can automatically find and extract the domain terms from the anchor text. The model structure and parameters are simple, the algorithm complexity is low, and good performance and field term discovery effect are obtained on experimental test data. The method has good popularization and adaptability, and the effect of generating synonyms has the characteristics of objectivity, reliability and comprehensiveness, thereby having good application prospect.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flowchart of a method for automatically generating domain terms based on anchor text analysis according to an embodiment of the present invention;

FIGS. 2 and 3 are flowcharts of a new word discovery algorithm according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.

The method analyzes user browsing logs to extract the anchor text that a user clicks while browsing web pages. If the web page that the anchor text links to belongs to a certain field, the anchor text is considered to contain a domain term of that field. Domain terms are then obtained from these network resources by automatic screening based on information entropy, coupling degree and relative frequency. Anchor text is the text of a link. On the one hand, anchor text can serve as an evaluation of the content of the page on which it appears, because the links added to a page normally have some relationship to the content of the page itself; for example, a clothing-industry website may add links to peer websites or to well-known clothing manufacturers. On the other hand, anchor text can serve as an evaluation of the page it points to, because it can accurately describe the content of that page; for example, a link added on a personal website may carry the anchor text "search engine". Links added to a page should generally have a direct relationship with the page, so a search engine can determine the content attributes of a web page from the anchor text of the links pointing to it. Anchor text also helps a search engine collect documents that it cannot otherwise index.

The embodiment of the invention provides a method for automatically generating domain terms based on the analysis of multiple network resources. The method obtains a corpus relevant to the financial field by analyzing and processing various network resources, extracts words from the corpus with a new word discovery algorithm, and finally obtains a term set relevant to the field through relative frequency filtering, thereby generating domain terms automatically. Unlike traditional domain term extraction methods, this method is based on anchor text resources and network resources, which are more structured and more timely than traditional text data. The method can generate domain terms efficiently and accurately, and thus supports a variety of Internet-based natural language application systems.

As shown in fig. 1, a flowchart of a method for automatically generating domain terms based on anchor text analysis according to an embodiment of the present invention includes the following steps:

Step S101: collect the browsing log of a user. When browsing the web, a user reaches a web page by clicking anchor text; if that page relates to a certain field, the anchor text is strongly related to the field and contains a domain term of that field with high probability. The embodiment of the invention takes the browsing log as an example, but other network resources can also be used.

As an example of the present invention, the user browsing behavior log for one week (11 October 2010 to 17 October 2010) may be used. The entries and scale of the corpus are as follows:

table 1: entry and size of each corpus

The user browsing log comprises the following information:

table 2: information items contained in user browsing log

The log contains sufficient information about the user's browsing, so it can be used for domain term extraction.

Step S102: process the browsing log to obtain the anchor text clicked by the user and the corresponding click result address. Preprocessing of the user browsing behavior log comprises: converting the text encoding from the format recorded by the server (usually Universal Resource Identifier (URI) format) to GBK, the national standard Chinese character encoding; sorting the user logs by the content items listed in Table 3 to find the required information, and arranging the logs into character strings of those content items; and filtering out noise in the anchor text such as numbers, letters and punctuation marks.

The data set on which the domain term is automatically generated is from a user browsing log, and for the user browsing log, at least the following should be included for the domain term to be automatically generated:

table 3: user browsing logs for automatic generation of domain terms

Due to the complex format of network resources, useful information needs to be found out from the network resources, and the method mainly comprises the following steps.

Step 1.1, user log coding conversion is carried out, and the coding format (generally, Universal Resource Identifier (URI) format) recorded by the server is converted into the GBK format of national standard Chinese character coding.

Step 1.2 uses the content items listed in table 3 to sort the user log, removes the information except the content items in table 3, and sorts the log into the form of the character string of the content items.

Step 1.3 filters out various noises in the corpus, such as numbers, letters and punctuation marks, to obtain a candidate multi-word set.
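Steps 1.1 to 1.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation described by the inventors: it assumes the logged anchor text is percent-encoded from GBK bytes, and it treats every character outside the basic CJK range as noise, which is broader than the numbers, letters and punctuation named above. It also splits the text at noise characters rather than deleting them, so that characters on either side of the noise are not joined. The function name `clean_anchor_text` and the sample string are hypothetical.

```python
import re
from urllib.parse import quote, unquote

# Assumption: anything outside the basic CJK block counts as noise.
_NOISE = re.compile(r"[^\u4e00-\u9fff]+")


def clean_anchor_text(raw: str, encoding: str = "gbk") -> list[str]:
    """Decode a percent-encoded anchor text and split it at noise characters."""
    decoded = unquote(raw, encoding=encoding, errors="replace")
    return [segment for segment in _NOISE.split(decoded) if segment]


# Hypothetical log value: the anchor text as a server might record it.
sample = quote("股票2010, abc基金", encoding="gbk")
segments = clean_anchor_text(sample)
```

Each returned segment is a run of Chinese characters that can be passed to the later screening steps.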

Step S103: process the anchor text according to the click result address to obtain a candidate multi-character set, that is, screen the web pages. In the embodiment of the invention, a text corpus for a given field is extracted from the network resources by screening on the web page URL. The resulting anchor text is then segmented to obtain the candidate multi-character set.

The web page screening is based on inductive summarization. Because "east wealth web" is a professional finance website, a web page URL containing "eastmoney.com" is considered here to belong to a finance web page. For large portals such as "sina", "sohu" and "qq", the financial pages of the portal are obtained by summarizing its sub-domain names, for example the following sub-domain names in the "sohu" portal:

table 4: some financial web pages under the sohu website
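The URL screening in step S103 can be illustrated with a short Python sketch. It is only an illustration under assumptions: the preset list contains "eastmoney.com" from the text plus a hypothetical placeholder host, and it matches on the host name suffix, which is stricter than the substring test ("a URL containing eastmoney.com") described above. The names `FINANCE_HOSTS`, `is_domain_page` and `candidate_anchor_texts` are hypothetical.

```python
from urllib.parse import urlsplit

# "eastmoney.com" comes from the description; the second entry is a placeholder.
FINANCE_HOSTS = ("eastmoney.com", "finance.example.com")


def is_domain_page(url: str, hosts: tuple[str, ...] = FINANCE_HOSTS) -> bool:
    """Return True if the URL's host is one of the preset hosts or a sub-domain."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in hosts)


def candidate_anchor_texts(records, hosts: tuple[str, ...] = FINANCE_HOSTS) -> list[str]:
    """Keep the anchor text of every (anchor_text, click_url) pair that passes."""
    return [anchor for anchor, url in records if is_domain_page(url, hosts)]
```

URLs logged without a scheme have no parsed host and are rejected by this sketch, so a real log may need normalizing first.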

All web pages are used as the background corpus. The screening was checked by repeated random sampling, drawing 100 web pages each time from the 66 thousand web pages; over repeated experiments the accuracy reached 96%.

Through the above processing, the number and scale of the obtained corpora are as follows:

table 5: financial domain corpus entry and scale

Step S104: screen the multi-characters in the candidate multi-character set with a new word discovery algorithm to remove those that cannot form words independently. For the multi-character set from the previous step, the occurrence frequency of each multi-character is counted and its information entropy is calculated; the set is then screened by occurrence frequency, left and right information entropy, and coupling degree to remove the multi-characters that cannot form words independently. The screened results are submitted to a search engine to obtain the number of web pages returned for each multi-character query; a multi-character is filtered out if too few pages are returned, which finally yields the candidate term set. Specifically, the new word discovery algorithm may include one or more of the following steps, as shown in FIGS. 2 and 3, which are flowcharts of the new word discovery algorithm according to the embodiment of the present invention:

Step 201: filter based on frequency. The occurrence frequency of each multi-character in the domain's multi-character set is counted, and those whose frequency is greater than a certain threshold are taken as the candidate multi-character set for the next calculation.
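The frequency filter of step 201 might look like the sketch below. The description does not say how candidates are enumerated, so this sketch assumes that every substring of 2 to 4 characters in a cleaned anchor-text segment is a candidate; the length bounds, the function names and the threshold value are all hypothetical.

```python
from collections import Counter


def count_multiwords(segments, min_len: int = 2, max_len: int = 4) -> Counter:
    """Count every substring of min_len..max_len characters in each segment."""
    counts = Counter()
    for segment in segments:
        for n in range(min_len, max_len + 1):
            for i in range(len(segment) - n + 1):
                counts[segment[i:i + n]] += 1
    return counts


def frequent_multiwords(counts: Counter, min_count: int) -> dict[str, int]:
    """Keep candidates whose frequency is strictly greater than the threshold."""
    return {word: c for word, c in counts.items() if c > min_count}
```

For example, `frequent_multiwords(count_multiwords(["股票基金", "股票"], max_len=3), 1)` keeps only "股票", which occurs twice.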

Step 202, calculating information entropy, and filtering the candidate multi-word set based on a left-right entropy algorithm. The method specifically comprises the following steps: calculating the left information entropy and the right information entropy of each multi-word in the candidate multi-word set; judging whether the left information entropy or the right information entropy of each multi-word is larger than a threshold value; and if the left information entropy or the right information entropy of the multi-word is smaller than the threshold value, removing the multi-word.

The information entropy calculation method is as follows:

First, left and right neighbor statistics are established for each word: all documents are traversed and, for each word, the frequency of each single character appearing immediately to its left and to its right is counted.

Then the corresponding entropy is calculated.

Definition 1: Suppose the word w belongs to the candidate set, and let A = {a₁, a₂, a₃, ..., a_m} and B = {b₁, b₂, b₃, ..., b_n} be the sets of single characters appearing to the left and to the right of w, respectively. The left and right entropy are defined as follows:

the left information entropy is:

<math><mrow><mi>LE</mi><mrow><mo>(</mo><mi>w</mi><mo>)</mo></mrow><mo>=</mo><mo>-</mo><mfrac><mn>1</mn><mi>n</mi></mfrac><munder><mi>Σ</mi><mrow><msub><mi>a</mi><mi>i</mi></msub><mo>&Element;</mo><mi>A</mi></mrow></munder><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>a</mi><mi>i</mi></msub><mo>)</mo></mrow><mi>log</mi><mfrac><mrow><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>a</mi><mi>i</mi></msub><mo>)</mo></mrow></mrow><mi>n</mi></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>3</mn><mo>-</mo><mn>1</mn><mo>)</mo></mrow></mrow></math>

the right information entropy is:

<math><mrow><mi>RE</mi><mrow><mo>(</mo><mi>w</mi><mo>)</mo></mrow><mo>=</mo><mo>-</mo><mfrac><mn>1</mn><mi>n</mi></mfrac><munder><mi>Σ</mi><mrow><msub><mi>b</mi><mi>i</mi></msub><mo>&Element;</mo><mi>B</mi></mrow></munder><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>b</mi><mi>i</mi></msub><mo>)</mo></mrow><mi>log</mi><mfrac><mrow><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>b</mi><mi>i</mi></msub><mo>)</mo></mrow></mrow><mi>n</mi></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>3</mn><mo>-</mo><mn>2</mn><mo>)</mo></mrow></mrow></math>

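where C(w, a_i) and C(w, b_i) are, respectively, the number of occurrences of the left single character a_i and of the right single character b_i of the word w.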
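Formulas (3-1) and (3-2) can be computed as in the following sketch. It rests on two assumptions that the text does not state: n is taken to be the total number of left (or right) neighbor occurrences of w, which makes the formula the usual Shannon entropy of the neighbor distribution, and the logarithm is taken to be natural. A side with no neighbors at all returns negative infinity, which is one reading of the later remark that a missing entropy is treated as infinitely small. The function names are hypothetical.

```python
import math
from collections import Counter


def neighbor_counts(segments, w: str) -> tuple[Counter, Counter]:
    """Count the single characters immediately left and right of each w."""
    left, right = Counter(), Counter()
    for segment in segments:
        start = segment.find(w)
        while start != -1:
            end = start + len(w)
            if start > 0:
                left[segment[start - 1]] += 1   # C(w, a_i)
            if end < len(segment):
                right[segment[end]] += 1        # C(w, b_i)
            start = segment.find(w, start + 1)
    return left, right


def side_entropy(counts: Counter) -> float:
    """-(1/n) * sum(C * log(C / n)), with n assumed to be the total count."""
    n = sum(counts.values())
    if n == 0:
        return float("-inf")  # assumption: no neighbors on this side
    return -sum(c * math.log(c / n) for c in counts.values()) / n
```

With two equally frequent neighbors on one side, `side_entropy` returns log 2; with a single repeated neighbor it returns 0, which signals that the word is probably part of a longer unit.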

Because the corpus consists of query words rather than sentences, a word that can stand on its own often has no left (or right) single character at all. For example, in the processed query corpus "jingfang" occurs 532 times in total, yet it has only 22 left and right single characters in total, so information entropy alone cannot reflect its probability of forming a word. The following strategy is therefore adopted (where L and R are flag bits and α is a threshold):

If [formula missing], L is set to 1; otherwise L is set to 0. The quantities compared are the total frequency of the word and the frequency of occurrence of the left single characters of the word. In the same way, if [formula missing], R is set to 1; otherwise R is set to 0, where the quantities compared are the total frequency of the word and the frequency of occurrence of the right single characters of the word.

If L = R = 1, the word is put into the candidate set and passed to the next filtering step. Otherwise, if L or R is 0, the word is filtered by its left or right information entropy.

The strategy for filtering by information entropy is as follows. After the candidate set is extracted, it is determined whether L is 0 or R is 0. If L is 0, the left entropy of the word must be greater than a certain value (denoted β); if R is 0, the right entropy of the word must be greater than β. A word that passes is placed in the candidate set for the next filtering step; otherwise it is removed. If the entropy on one side does not exist, it is defined as infinitely small. Only a word w that satisfies the word-formation threshold on both sides can be put into the candidate set.
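The two-sided decision described above can be written as a small predicate. This sketch takes the flag bits L and R as inputs, because the formula that sets them is not available in this text; it implements only the final rule that each side must either carry a flag of 1 or have an entropy greater than β. The function names are hypothetical.

```python
def side_passes(flag: int, entropy: float, beta: float) -> bool:
    """A side passes if its flag bit is 1 or its entropy exceeds beta."""
    return flag == 1 or entropy > beta


def keep_candidate(L: int, R: int, left_entropy: float,
                   right_entropy: float, beta: float) -> bool:
    """Keep the word only if both its left and right sides pass."""
    return (side_passes(L, left_entropy, beta)
            and side_passes(R, right_entropy, beta))
```

A word with L = R = 1 is kept regardless of entropy, while a word with L = 0 and a left entropy below β is removed even if its right side passes.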

Step 203: filter based on a recursive coupling degree filtering algorithm. Although the previous step finds a well-filtered word set, much noise remains. Some candidates have no right entropy but still pass because they satisfy the frequency rule of the previous step, even though the right side of the candidate can in fact be split off semantically. For others, the main problem is that the left information entropy is too large for the rule of the previous step to filter them out. This step therefore calculates the word length of each multi-word in the screened candidate multi-word set, judges from the word length and the coupling degree of each multi-word whether it can form a word independently, and removes it if it cannot.

The recursive coupling degree filtering algorithm is as follows:

For example, for a multi-word w with a word length of 3, suppose there exists w₁ ∈ T₂ (T₂ is the set of candidate words of length 2) such that w = w₁p, where p is a single character and w₁ is the multi-word that remains after removing p. The coupling of p and w₁ is calculated. If the following conditions are satisfied, w is filtered out and considered unable to form a word independently: 1. the number of occurrences of w₁ divided by the number of occurrences of w is greater than a certain threshold; 2. the right information entropy of w is less than the right information entropy of w₁; 3. the right information entropy of w is less than a certain threshold. Likewise, suppose there exists w₁ ∈ T₂ such that w = pw₁, where p is a single character and w₁ is the multi-word that remains after removing p. If the following conditions are satisfied, w is filtered out and considered unable to form a word independently: 1. the number of occurrences of w₁ divided by the number of occurrences of w is greater than a certain threshold; 2. the left information entropy of w is less than the left information entropy of w₁; 3. the left information entropy of w is less than a certain threshold.

Similarly, for a multi-word w with a word length of 4, suppose there exists w₁ ∈ T₃ (T₃ is the set of candidate words of length 3) such that w = w₁p, where p is a single character and w₁ is the multi-word that remains after removing p. If the following conditions are satisfied, w is filtered out and considered unable to form a word independently: 1. the number of occurrences of w₁ divided by the number of occurrences of w is greater than a certain threshold; 2. the right information entropy of w is less than the right information entropy of w₁; 3. the right information entropy of w is less than a certain threshold. Likewise, suppose there exists w₁ ∈ T₃ such that w = pw₁, where p is a single character and w₁ is the multi-word that remains after removing p. If the following conditions are satisfied, w is filtered out and considered unable to form a word independently: 1. the number of occurrences of w₁ divided by the number of occurrences of w is greater than a certain threshold; 2. the left information entropy of w is less than the left information entropy of w₁; 3. the left information entropy of w is less than a certain threshold. Words of greater length are handled by analogy.
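The coupling degree filter can be sketched as below. The sketch assumes that candidates are processed from shortest to longest and that the set of shorter candidates (T₂, T₃, and so on) consists of the words that survived the filter at the previous length; the description does not say whether survivors or all candidates are meant. The parameter names `ratio_min` and `ent_max` stand for the two unspecified thresholds, and all function names are hypothetical.

```python
def is_dependent(w, shorter, freq, left_ent, right_ent, ratio_min, ent_max):
    """True if w looks like a shorter candidate w1 plus one stray character p."""
    # (w1, entropy table): w = w1 + p uses right entropy, w = p + w1 uses left.
    for w1, ent in ((w[:-1], right_ent), (w[1:], left_ent)):
        if (w1 in shorter
                and freq[w1] / freq[w] > ratio_min   # condition 1
                and ent[w] < ent[w1]                 # condition 2
                and ent[w] < ent_max):               # condition 3
            return True
    return False


def coupling_filter(candidates, freq, left_ent, right_ent, ratio_min, ent_max):
    """Process candidates by increasing length, dropping dependent words."""
    kept = set()
    for w in sorted(candidates, key=len):
        shorter = {c for c in kept if len(c) == len(w) - 1}
        if not is_dependent(w, shorter, freq, left_ent, right_ent,
                            ratio_min, ent_max):
            kept.add(w)
    return kept
```

The shortest candidates have no shorter set to compare against and are always kept, which matches the description starting its examples at word length 3.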

Step 204: filter using a search engine. Given how a search engine is built, if a multi-character is submitted as a query and few results are returned, the multi-character cannot form a word independently. On this principle the result can be filtered further. The invention uses the number of web pages returned by a commercial search engine to filter the final result and remove the multi-characters that cannot form words independently.
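Step 204 reduces to a threshold on a result count, as in the sketch below. No particular search engine API is implied: `hit_count` is a hypothetical callable supplied by the caller, and the lookup table and threshold in the usage lines are made-up placeholder values used only to exercise the function.

```python
from typing import Callable, Iterable


def filter_by_hits(words: Iterable[str],
                   hit_count: Callable[[str], int],
                   min_hits: int) -> list[str]:
    """Keep the words whose search result count reaches min_hits."""
    return [w for w in words if hit_count(w) >= min_hits]


# Placeholder counts standing in for a real search engine response.
fake_index = {"股票": 120000, "票的": 35}
kept = filter_by_hits(["股票", "票的"], lambda w: fake_index.get(w, 0),
                      min_hits=1000)
```

Passing the lookup as a callable keeps the filter independent of whichever search service is actually queried.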

After the new word discovery algorithm is adopted, the anchor text corpus has the following results after being sorted based on frequency:

table 6: information entropy and frequency of words based on anchor text corpus

From the effect of domain term generation, the domain terms generated by the domain term generation method have high reliability, and table 7 lists the number of candidate terms and the word formation probability generated by three corpora without relative frequency filtering:

table 7: word number and word forming probability obtained by new word discovery algorithm

Step S105: further screen the candidate multi-character set produced by the new word discovery algorithm with a relative frequency algorithm to output the domain term generation result. The relative frequency method is a very effective prior-art method and is widely used in systems such as information retrieval and text classification. An obvious feature of a term is that it appears many times in texts of its own field and less often in other fields, and relative frequency reflects this feature to some extent. The method is simple to compute and gives good extraction results.

In the embodiment of the present invention, the relative frequency is calculated as the frequency in the domain-specific corpus divided by the frequency in the background corpus. After screening by relative frequency, the words are sorted by number of occurrences to obtain an ordered result for each corpus, and the result is labeled by checking whether each word is a financial word. The top 10 (P10), top 100 (P100), top 1000 (P1000, where present) and all results are labeled separately and their accuracies are calculated as follows (where the relative frequency threshold represents the proportion to be filtered out):

table 8: word accuracy rate of different corpora in financial field
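The relative frequency screening of step S105 can be sketched as follows. The sketch assumes that "frequency" means the count normalized by the corpus total, that the threshold is the fraction of lowest-scoring words to drop (following the note that the threshold represents the proportion to be filtered), and that a word absent from the background counts gets an infinite score. The function names and `drop_ratio` are hypothetical.

```python
def relative_frequencies(domain_counts, background_counts):
    """Domain frequency divided by background frequency for each domain word."""
    d_total = sum(domain_counts.values())
    b_total = sum(background_counts.values())
    scores = {}
    for word, count in domain_counts.items():
        b = background_counts.get(word, 0)
        scores[word] = (float("inf") if b == 0
                        else (count / d_total) / (b / b_total))
    return scores


def select_terms(domain_counts, background_counts, drop_ratio):
    """Drop the lowest-scoring fraction, then sort survivors by domain count."""
    scores = relative_frequencies(domain_counts, background_counts)
    ranked = sorted(scores, key=scores.get)            # ascending score
    survivors = ranked[int(len(ranked) * drop_ratio):]
    return sorted(survivors, key=lambda w: domain_counts[w], reverse=True)
```

A common word that is frequent everywhere receives a low score and is dropped first, while words concentrated in the domain corpus remain and are then ordered by how often they occur.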

According to the steps, a financial domain term set is obtained. This completes the whole process of automatically generating domain terms objectively and reliably by using the behavior of the network user.

Through the above steps, domain terms in the financial field are generated, including domain terms of many parts of speech such as nouns, verbs and adjectives. To verify the validity and reliability of the present invention, experiments on domain term generation were performed.

The experiments use one week of query logs from a commercial search engine and the user browsing behavior log for the same week.

The domain term generated by the domain term generation method has a high degree of reliability in terms of the effect of domain term generation, and meanwhile, because the data resource on which the method is based is a network resource, the generated domain term can contain new words in a language environment. Table 9 lists the partial domain term generation results:

table 9: partial Domain terms generate results

The method analyzes the user browsing log to extract the anchor text clicked when users visit web pages in a given field. These network resources contain many terms of that field, and the domain terms are obtained from them by automatic screening based on information entropy, coupling degree and relative frequency. The method requires no manual participation, is accurate and objective, and can promptly discover terms that are popular in a given field on the Internet.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A domain term automatic generation method based on anchor text analysis is characterized by comprising the following steps:

collecting a browsing log of a user;

processing the browsing log to obtain an anchor text clicked by a user and a corresponding click result address;

processing the anchor text according to the click result address to obtain a candidate multi-character set;

screening the multiple characters in the candidate multiple character set based on a new word discovery algorithm to remove the multiple characters which cannot form words independently; and

and further screening the candidate multi-character set screened by the new word discovery algorithm according to a relative frequency algorithm to output a domain term generation result.

2. The method for automatic generation of domain terms based on anchor text analysis according to claim 1, wherein the processing of the browsing log to obtain the anchor text clicked by the user and the corresponding click result address further comprises:

and carrying out user log coding conversion, arranging the browsing log into a character string form, and removing numbers, letters and punctuation marks.

3. The method of claim 1, wherein the processing the anchor text according to the click result address to obtain the candidate multiword set further comprises:

judging whether the click result address belongs to a preset URL list or not;

and adding the anchor text corresponding to the click result address belonging to a preset URL list into a candidate multi-character set.

4. The method of claim 1, wherein the filtering the multi-words in the candidate multi-word set based on the new word discovery algorithm to remove multi-words that cannot independently form words further comprises:

filtering the candidate multi-word set based on a left-right entropy algorithm; and

and filtering the screened candidate multi-word set based on a coupling degree algorithm.

5. The method of claim 4, wherein the filtering the set of candidate multiwords based on left-right entropy algorithm further comprises:

calculating the left information entropy and the right information entropy of each multi-word in the candidate multi-word set;

judging whether the left information entropy or the right information entropy of each multi-word is larger than a threshold value;

and if the left information entropy or the right information entropy of the multi-word is smaller than the threshold value, removing the multi-word.

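6. The method for automatic generation of domain terms based on anchor text analysis according to claim 5,

wherein the left information entropy is:

<math><mrow><mi>LE</mi><mrow><mo>(</mo><mi>w</mi><mo>)</mo></mrow><mo>=</mo><mo>-</mo><mfrac><mn>1</mn><mi>n</mi></mfrac><munder><mi>Σ</mi><mrow><msub><mi>a</mi><mi>i</mi></msub><mo>&Element;</mo><mi>A</mi></mrow></munder><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>a</mi><mi>i</mi></msub><mo>)</mo></mrow><mi>log</mi><mfrac><mrow><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>a</mi><mi>i</mi></msub><mo>)</mo></mrow></mrow><mi>n</mi></mfrac><mo>;</mo></mrow></math>

the right information entropy is:

<math><mrow><mi>RE</mi><mrow><mo>(</mo><mi>w</mi><mo>)</mo></mrow><mo>=</mo><mo>-</mo><mfrac><mn>1</mn><mi>n</mi></mfrac><munder><mi>Σ</mi><mrow><msub><mi>b</mi><mi>i</mi></msub><mo>&Element;</mo><mi>B</mi></mrow></munder><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>b</mi><mi>i</mi></msub><mo>)</mo></mrow><mi>log</mi><mfrac><mrow><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>b</mi><mi>i</mi></msub><mo>)</mo></mrow></mrow><mi>n</mi></mfrac><mo>;</mo></mrow></math>

wherein C(w, a_i) and C(w, b_i) are, respectively, the number of occurrences of the left single character a_i and of the right single character b_i of the word w.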

7. The method of claim 4, wherein the filtering the filtered candidate multi-word set based on a degree of coupling algorithm further comprises:

calculating the word length of each multi-word in the screened candidate multi-word set;

judging whether the multiple words can be independently formed according to the word length and the coupling degree of each multiple word;

if it is judged that the word cannot be formed independently, it is removed.

8. The method of claim 4 for automatic generation of domain terms based on anchor text analysis, further comprising:

inputting each multi-word in the candidate multi-word set screened based on the left-right entropy algorithm and the coupling degree algorithm into a search engine for searching;

and removing the multiwords which do not meet the requirement of the search result according to the search result.