CN102169496A - Anchor text analysis-based automatic domain term generating method - Google Patents

Info

Publication number
CN102169496A
Authority
CN
China
Prior art keywords
word
anchor text
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201110091312
Other languages
Chinese (zh)
Inventor
闫兴龙
刘奕群
马少平
张敏
金奕江
张阔
茹立云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Original Assignee
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Beijing Sogou Technology Development Co Ltd filed Critical Tsinghua University
Priority to CN 201110091312 priority Critical patent/CN102169496A/en
Publication of CN102169496A publication Critical patent/CN102169496A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for automatically generating domain terms based on anchor text analysis, comprising the following steps: collecting a user's browsing log; processing the browsing log to obtain the anchor texts clicked by the user and the corresponding click result addresses; processing the anchor texts according to the click result addresses to obtain a candidate multi-character set; screening the multi-character strings in the candidate set with a new word discovery algorithm to remove those that cannot independently form words; and further screening the resulting candidate set with a relative frequency algorithm to output the domain term generation result. The method automatically discovers and extracts domain terms from anchor text; its model structure and parameters are simple, its algorithmic complexity is low, and it achieves good performance and domain term discovery on experimental test data.

Description

Automatic domain term generation method based on anchor text analysis
Technical Field
The invention relates to the technical field of networks, and in particular to a method for automatically generating domain terms based on anchor text analysis.
Background
Domain terms are words used in a subject area to represent concepts or relationships within that area. A term may be a word or a phrase; stated otherwise, terms are conventional linguistic symbols that express or define scientific concepts through speech or text. In China, terms are customarily called "nouns". Examples are found everywhere when reading scientific and technical literature or studying professional courses: "router" is a term in the field of computer networks, and "DNA" is a term in the field of life sciences. In the field of term extraction, a term refers to a definite linguistic unit consisting of two or more words with a certain grammatical relationship, such as "national missile defense system".
The extraction of domain terms has important applications in many fields. When a domain ontology is constructed, its domain terms must be updated in a timely manner, so the term extraction method plays a crucial role in building and maintaining the ontology. In information retrieval, a domain term set must be introduced when an index is constructed, and domain term extraction can greatly improve retrieval accuracy and coverage; in vertical search in particular, knowing the terms of a field yields more accurate results for searches in that field. In browsing recommendation, domain terms obtained from web resources help to grasp a user's browsing intent more accurately, so that relevant information can be recommended on the basis of the user's specific browsing behavior. Domain term extraction also plays an important role in advertising: a domain dictionary greatly assists web page classification, which helps companies target advertisements at different user groups more precisely.
Current methods for extracting domain terms fall into three main categories:
1. Rule-based methods. These extract terms by establishing rule templates in advance and then matching text against them. However, formulating the rules relies primarily on linguistic knowledge, and linguistic rules are difficult to discover; it is hard to formulate a complete rule set and to keep multiple rules compatible with one another.
2. Statistics-based methods. Statistical methods have long been used in term extraction and have achieved good results. Some researchers use relative document frequency for term extraction and apply it to the automatic construction of ontologies. Frantzi proposed the C-value/NC-value evaluation function for domain term extraction and achieved good results. Pattel uses mutual information and the log-likelihood ratio to obtain domain terms. Liu uses left and right information entropy together with the log-likelihood ratio to determine word boundaries and thereby extract candidate terms; this approach is also used in the present method. Statistics-based algorithms can be applied to many kinds of corpora, but do not yield good results on certain types of corpus.
3. Methods combining rules and statistics. In practice, many methods combine statistics and rules. ThuyVU first extracts a candidate set according to rules, then scores it using the C-value/NC-value and T-test methods, and finally obtains the real terms. This approach combines the characteristics of the two approaches described above, and its results are relatively good.
The drawback of the prior art is that current domain term extraction methods are very complicated and have low accuracy, so improvement is urgently needed.
Disclosure of Invention
The object of the present invention is to solve the above technical drawbacks.
In order to achieve the above object, in one aspect, the present invention provides an automatic domain term generation method based on anchor text analysis, including the following steps: collecting a browsing log of a user; processing the browsing log to obtain the anchor text clicked by the user and the corresponding click result address; processing the anchor text according to the click result address to obtain a candidate multi-character set; screening the multi-character strings in the candidate multi-character set with a new word discovery algorithm to remove those that cannot independently form words; and further screening the candidate multi-character set screened by the new word discovery algorithm with a relative frequency algorithm to output a domain term generation result.
In an embodiment of the present invention, processing the browsing log to obtain the anchor text clicked by the user and the corresponding click result address further includes: performing encoding conversion on the user log, arranging the browsing log into character string form, and removing numbers, letters and punctuation marks.
In an embodiment of the present invention, processing the anchor text according to the click result address to obtain the candidate multi-character set further includes: judging whether the click result address belongs to a preset URL list; and adding the anchor text corresponding to a click result address that belongs to the preset URL list to the candidate multi-character set.
In an embodiment of the present invention, screening the multi-character strings in the candidate multi-character set with the new word discovery algorithm to remove those that cannot independently form words further includes: filtering the candidate multi-character set with a left-right entropy algorithm; and filtering the resulting candidate multi-character set with a coupling degree algorithm.
In one embodiment of the present invention, filtering the candidate multi-character set with the left-right entropy algorithm further comprises: calculating the left information entropy and the right information entropy of each multi-character string in the candidate multi-character set; judging whether the left information entropy or the right information entropy of each multi-character string is larger than a threshold value; and, if the left information entropy or the right information entropy of a multi-character string is smaller than the threshold value, removing that multi-character string.
In one embodiment of the present invention, the left information entropy is:
$$LE(w) = -\frac{1}{n}\sum_{a_i \in A} C(w, a_i)\,\log\frac{C(w, a_i)}{n};$$
the right information entropy is:
$$RE(w) = -\frac{1}{n}\sum_{b_i \in B} C(w, b_i)\,\log\frac{C(w, b_i)}{n};$$
wherein,
Figure BDA0000054975650000033
C(w, a_i) and C(w, b_i) are the numbers of times that the left single character a_i and the right single character b_i, respectively, occur next to the word w.
In an embodiment of the present invention, filtering the screened candidate multi-character set with the coupling degree algorithm further includes: calculating the word length of each multi-character string in the screened candidate multi-character set; judging, according to the word length and the coupling degree of each multi-character string, whether it can independently form a word; and removing it if it is judged that it cannot.
In one embodiment of the present invention, the method further comprises: submitting each multi-character string in the candidate multi-character set screened by the left-right entropy algorithm and the coupling degree algorithm to a search engine as a query; and removing, according to the search results, the multi-character strings whose search results do not meet the requirement.
The invention can automatically discover and extract domain terms from anchor text. The model structure and parameters are simple, the algorithmic complexity is low, and good performance and domain term discovery are obtained on experimental test data. The method generalizes and adapts well, and the generated results are objective, reliable and comprehensive, so the method has good application prospects.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a method for automatically generating domain terms based on anchor text analysis according to an embodiment of the present invention;
FIGS. 2 and 3 are flowcharts of a new word discovery algorithm according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
The method analyzes user browsing logs to extract the anchor text that users click when browsing web pages; if the web page that an anchor text points to belongs to a certain field, the anchor text is considered to contain domain terms of that field. Domain terms are then obtained from these network resources by automatic screening based on information entropy, coupling degree, and relative frequency. Anchor text is the text of a hyperlink. On the one hand, anchor text can serve as an evaluation of the content of the page on which it appears, because the links added to a page normally bear some relationship to the content of the page itself; for example, a clothing industry website may add links to peer websites or to well-known clothing manufacturers. On the other hand, anchor text can serve as an evaluation of the page it points to, because it can describe the content of the target page accurately; for example, a link added on a personal website may carry the anchor text "search engine". Since the links added to a page generally have a direct relationship with that page, a search engine can determine the content attributes of a web page from the anchor text of the links pointing to it. Anchor text also helps a search engine collect documents that it cannot index directly.
The embodiment of the invention provides a method for automatically generating domain terms based on the analysis of multiple network resources. The method obtains a corpus relevant to the financial field by analyzing and processing various network resources, extracts words from the corpus with a new word discovery algorithm, and finally obtains a term set relevant to the field by relative frequency filtering, thereby generating domain terms automatically. Unlike traditional domain term extraction methods, this method is based on anchor text resources and network resources, which are more structured and more timely than traditional text data. The method can generate domain terms efficiently and accurately, thereby providing support for various Internet-based natural language application systems.
As shown in fig. 1, a flowchart of a method for automatically generating domain terms based on anchor text analysis according to an embodiment of the present invention includes the following steps:
Step S101, collecting a browsing log of a user. When browsing the web, a user reaches a web page by clicking an anchor text; if the page relates to a certain field, the anchor text is strongly related to that field and contains domain terms of the field with high probability. The embodiment of the invention takes the browsing log as an example, but other network resources can also be used.
As an example of the present invention, a user browsing behavior log covering one week (11 October 2010 to 17 October 2010) may be used. The entries and scale of the corpus are as follows:
table 1: entry and size of each corpus
Figure BDA0000054975650000051
The user browsing log comprises the following information:
table 2: information items contained in user browsing log
Figure BDA0000054975650000052
The log contains sufficient information about the user's browsing, so it can be used for domain term extraction.
Step S102, the browsing log is processed to obtain the anchor text clicked by the user and the corresponding click result address. Preprocessing of the user browsing behavior log comprises the following: the encoding of the domain text corpus is converted from the format recorded by the server (usually a Uniform Resource Identifier (URI) format) into GBK, the national standard Chinese character encoding; the user logs are sorted by the content items listed in Table 3 to find the required information, and the logs are arranged into character strings of those content items; and noise such as numbers, letters and punctuation marks in the anchor text is filtered out.
The data set from which domain terms are automatically generated comes from the user browsing log. For this purpose, the user browsing log should include at least the following:
table 3: user browsing logs for automatic generation of domain terms
Figure BDA0000054975650000053
Because the format of network resources is complex, the useful information must be extracted from them, mainly by the following steps.
Step 1.1: encoding conversion is performed on the user log, converting the format recorded by the server (generally a Uniform Resource Identifier (URI) format) into GBK, the national standard Chinese character encoding.
Step 1.2: the user log is sorted using the content items listed in Table 3, information other than those content items is removed, and the log is arranged into character strings of the content items.
Step 1.3: noise in the corpus, such as numbers, letters and punctuation marks, is filtered out to obtain a candidate multi-character set.
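Steps 1.1 and 1.3 can be illustrated with a minimal Python sketch. It assumes that the anchor text field arrives percent-encoded with GBK bytes and that "noise" means anything outside the basic CJK Unified Ideographs range, which is broader than the numbers, letters and punctuation named above; the function name `clean_anchor_text` is hypothetical and not part of the patent.

```python
import re
from urllib.parse import quote, unquote

# Assumption: everything outside U+4E00..U+9FFF is treated as noise.
_NON_CJK = re.compile(r"[^\u4e00-\u9fff]+")


def clean_anchor_text(raw, encoding="gbk"):
    """Decode a percent-encoded anchor text and split it at noise characters.

    Returns the runs of Chinese characters that remain after digits,
    Latin letters, punctuation and other non-CJK symbols are removed.
    """
    decoded = unquote(raw, encoding=encoding, errors="replace")
    return [segment for segment in _NON_CJK.split(decoded) if segment]


if __name__ == "__main__":
    # Build a sample log field the way a server might have recorded it.
    raw = quote("基金净值", encoding="gbk") + "2010,abc" + quote("股票", encoding="gbk")
    print(clean_anchor_text(raw))
```

Splitting at the noise characters, rather than deleting them, keeps characters that were separated by a digit or punctuation mark from being joined into a spurious string.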
Step S103, processing the anchor text according to the click result address to obtain a candidate multi-character set, that is, screening the web pages. In the embodiment of the invention, the text corpus of a given field is selected from the network resources by screening on web page URLs. The selected corpus is then segmented to obtain the candidate multi-character set.
Web page screening is based on inductive summarization. For example, because "Eastmoney" is a professional finance website, a web page whose URL contains "eastmoney.com" is considered to be a finance web page. For large portals such as "sina", "sohu", and "qq", the finance pages of the portal are identified by summarizing its sub-domain names, for example the following sub-domain names of the "sohu" portal:
table 4: some financial web pages under the sohu website
Figure BDA0000054975650000061
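The URL screening described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the domain list is a placeholder ("eastmoney.com" comes from the text, while "finance.example.com" is hypothetical and stands in for the portal sub-domains of Table 4, which are not reproduced here), and the host name is matched by suffix, which is stricter than testing whether the URL merely contains the string.

```python
from urllib.parse import urlsplit

# "eastmoney.com" is named in the text; the second entry is a hypothetical
# placeholder for the portal sub-domains listed in Table 4.
FINANCE_DOMAINS = ("eastmoney.com", "finance.example.com")


def is_finance_url(url, domains=FINANCE_DOMAINS):
    """Return True if the URL's host is one of the domains or a sub-domain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def select_anchor_texts(records, domains=FINANCE_DOMAINS):
    """Keep the anchor texts whose click result address is a finance page.

    `records` is an iterable of (anchor_text, click_result_url) pairs.
    """
    return [anchor for anchor, url in records if is_finance_url(url, domains)]
```

Matching on the parsed host name avoids accepting unrelated URLs that merely mention a finance domain in their path or query string.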
All web pages are used as the background corpus. The screening was checked by repeated random sampling: each time, 100 web pages were drawn from the 66 thousand web pages, and over repeated experiments the accuracy reached 96%.
Through the above processing, the number and scale of the obtained corpora are as follows:
table 5: financial domain corpus entry and scale
Figure BDA0000054975650000062
Step S104, screening the multi-character strings in the candidate multi-character set with a new word discovery algorithm to remove those that cannot independently form words. For the multi-character set from the previous step, the occurrence frequency of each multi-character string is counted and its information entropy is calculated; the set is then screened by occurrence frequency, left and right information entropy, and coupling degree to remove the strings that cannot independently form words. The surviving strings are submitted to a search engine to obtain the number of web pages returned for each; if that number is too small, the string is filtered out, which finally yields the candidate term set. Specifically, the new word discovery algorithm may include one or more of the following steps, as shown in FIGS. 2 and 3, which are flowcharts of the new word discovery algorithm according to the embodiment of the present invention:
Step 201, filtering based on frequency. The occurrence frequency of each multi-character string in the domain multi-character set is counted, and the strings whose frequency is greater than a certain threshold form the candidate multi-character set for the next calculation.
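A minimal Python sketch of frequency-based filtering follows. The text does not say how candidate strings are enumerated, so the sketch assumes that every substring of 2 to 4 characters of each cleaned anchor text segment is a candidate; the function names and the default lengths are illustrative only.

```python
from collections import Counter


def count_candidates(segments, min_len=2, max_len=4):
    """Count every substring of min_len..max_len characters in the segments."""
    counts = Counter()
    for segment in segments:
        for n in range(min_len, max_len + 1):
            for i in range(len(segment) - n + 1):
                counts[segment[i:i + n]] += 1
    return counts


def filter_by_frequency(counts, threshold):
    """Keep the strings whose frequency is strictly greater than the threshold."""
    return {word: c for word, c in counts.items() if c > threshold}
```

For example, with the segments "基金净值" and "基金公司" and a threshold of 1, only "基金" survives, because it is the only substring that occurs more than once.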
Step 202, calculating information entropy and filtering the candidate multi-character set with a left-right entropy algorithm. Specifically: the left information entropy and the right information entropy of each multi-character string in the candidate multi-character set are calculated; it is judged whether the left information entropy or the right information entropy of each multi-character string is larger than a threshold value; and, if the left information entropy or the right information entropy of a multi-character string is smaller than the threshold value, that multi-character string is removed.
The information entropy is calculated as follows:
First, left and right neighbor statistics are established for each word: all documents are traversed, and the frequency with which each single character appears to the left and to the right of each word is counted.
The corresponding entropy is then calculated.
Definition 1: let it be assumed that the word w belongs to the candidate set, and in addition, a ═ a1,a2,a3,...,amB ═ b1,b2,b3,...,bnThe left and right entropy is defined as follows:
the left information entropy is:
<math><mrow><mi>LE</mi><mrow><mo>(</mo><mi>w</mi><mo>)</mo></mrow><mo>=</mo><mo>-</mo><mfrac><mn>1</mn><mi>n</mi></mfrac><munder><mi>&Sigma;</mi><mrow><msub><mi>a</mi><mi>i</mi></msub><mo>&Element;</mo><mi>A</mi></mrow></munder><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>a</mi><mi>i</mi></msub><mo>)</mo></mrow><mi>log</mi><mfrac><mrow><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>a</mi><mi>i</mi></msub><mo>)</mo></mrow></mrow><mi>n</mi></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>3</mn><mo>-</mo><mn>1</mn><mo>)</mo></mrow></mrow></math>
the right information entropy is:
<math><mrow><mi>RE</mi><mrow><mo>(</mo><mi>w</mi><mo>)</mo></mrow><mo>=</mo><mo>-</mo><mfrac><mn>1</mn><mi>n</mi></mfrac><munder><mi>&Sigma;</mi><mrow><msub><mi>b</mi><mi>i</mi></msub><mo>&Element;</mo><mi>B</mi></mrow></munder><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>b</mi><mi>i</mi></msub><mo>)</mo></mrow><mi>log</mi><mfrac><mrow><mi>C</mi><mrow><mo>(</mo><mi>w</mi><mo>,</mo><msub><mi>b</mi><mi>i</mi></msub><mo>)</mo></mrow></mrow><mi>n</mi></mfrac><mo>-</mo><mo>-</mo><mo>-</mo><mrow><mo>(</mo><mn>3</mn><mo>-</mo><mn>2</mn><mo>)</mo></mrow></mrow></math>
wherein,
Figure BDA0000054975650000073
C(w,ai) And C (w, b)i) Respectively, the left single character a for the word wiAnd the right single character biThe number of occurrences.
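Formulas (3-1) and (3-2) can be sketched in Python as follows. The definition of n is given in a figure that is not reproduced here, so the sketch assumes that n is the total number of neighbor occurrences on the side being measured, which makes each formula the Shannon entropy of the neighbor distribution. Returning `None` when a word has no neighbor on a side is also a choice made for this illustration.

```python
import math
from collections import Counter


def side_entropy(neighbor_counts):
    """Entropy of a neighbor-character distribution, following (3-1)/(3-2).

    Assumption: n is the sum of the neighbor counts on this side.
    Returns None when no neighbor was observed.
    """
    n = sum(neighbor_counts.values())
    if n == 0:
        return None
    return -sum(c * math.log(c / n) for c in neighbor_counts.values()) / n


def left_right_entropy(word, segments):
    """Count the characters adjacent to each occurrence of `word`, then score them."""
    left, right = Counter(), Counter()
    for segment in segments:
        start = segment.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                left[segment[start - 1]] += 1   # C(w, a_i)
            if end < len(segment):
                right[segment[end]] += 1        # C(w, b_i)
            start = segment.find(word, start + 1)
    return side_entropy(left), side_entropy(right)
```

A word that is always preceded by the same character has left entropy 0, while a word preceded equally often by two different characters has left entropy log 2, so a higher value indicates a freer boundary on that side.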
Because the corpus consists of short strings rather than full sentences, a word that can stand alone often has few or no single characters to its left (right). For example, in the processed query corpus, "jingfang" occurs 532 times in total, but single characters occur to its left and right only 22 times in total, so the information entropy cannot reflect its probability of forming a word. The following strategy is therefore adopted (where L and R are flag bits and α is a threshold):
If
Figure BDA0000054975650000074
then L is set to 1; otherwise L is set to 0, where N is the total frequency of occurrence of the word and n is the frequency with which a single character occurs to the left of the word. Similarly, if the corresponding condition holds for the right side, R is set to 1; otherwise R is set to 0, where N is the total frequency of occurrence of the word and n is the frequency with which a single character occurs to the right of the word.
If L = R = 1, the word is put into the candidate set and passed to the next filtering step. Otherwise, if L or R is 0, filtering is performed by examining the left information entropy or the right information entropy, respectively.
The strategy for filtering by information entropy is as follows: after the candidate set is extracted, it is determined whether L is 0 or R is 0. For a side whose flag is 0, the entropy of that side must be greater than a certain value (set to β); if so, the word is placed in the candidate set and passed to the next filtering step, and otherwise it is removed. Note that if the entropy of a side does not exist, it is defined as infinitely small. Only if w satisfies the word formation threshold on both sides can it be put into the candidate set.
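One possible reading of this decision rule is sketched below in Python. The condition that sets the flags L and R is given in a figure that is not reproduced here, so the sketch takes the flags as inputs and does not compute them. It assumes that a side passes either when its flag is 1 or when its entropy exceeds β, and that a missing entropy (`None`) never passes, following the statement that a nonexistent entropy is treated as infinitely small. The function names are hypothetical.

```python
from typing import Optional


def side_passes(flag: int, entropy: Optional[float], beta: float) -> bool:
    """A side passes if its flag is 1, or else if its entropy exceeds beta."""
    if flag == 1:
        return True
    # A missing entropy is treated as infinitely small, so it never passes.
    return entropy is not None and entropy > beta


def keep_word(l_flag: int, r_flag: int,
              left_entropy: Optional[float],
              right_entropy: Optional[float],
              beta: float) -> bool:
    """Keep the word only if both its left and its right side pass."""
    return (side_passes(l_flag, left_entropy, beta)
            and side_passes(r_flag, right_entropy, beta))
```

Under this reading, a word with L = R = 1 is always kept, and a word with L = 0 and R = 1 is kept only when its left entropy is above β.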
Step 203, filtering with a recursive coupling degree filtering algorithm. Although the previous step filters the word set well, considerable noise remains. Some noisy candidates have no right entropy yet survive because they satisfy the frequency rule of the previous step, even though semantically the right side of the candidate could be split off. The main problem is that their left information entropy is too large for them to be filtered out by the rule of the previous step. Therefore, the word length of each multi-character string in the screened candidate multi-character set is calculated; it is judged, according to the word length and the coupling degree of each multi-character string, whether it can independently form a word; and it is removed if it is judged that it cannot.
The coupling degree filtering algorithm based on recursion is as follows:
For example, for a multi-character string w with a word length of 3, suppose there exists w_1 ∈ T_2 (T_2 being the set of candidate words of length 2) such that w = w_1 p, where p is a single character and w_1 is the string that remains after p is removed. The coupling of p and w_1 is then examined. If the following conditions are all satisfied: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the right information entropy of w is less than the right information entropy of w_1; and (3) the right information entropy of w is less than a certain threshold; then w is filtered out and considered unable to form a word independently. Likewise, suppose there exists w_1 ∈ T_2 such that w = p w_1, where p is a single character and w_1 is the string that remains after p is removed. If the following conditions are all satisfied: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the left information entropy of w is less than the left information entropy of w_1; and (3) the left information entropy of w is less than a certain threshold; then w is filtered out and considered unable to form a word independently.
Similarly, for a multi-character string w with a word length of 4, suppose there exists w_1 ∈ T_3 (T_3 being the set of candidate words of length 3) such that w = w_1 p, where p is a single character and w_1 is the string that remains after p is removed. If the following conditions are all satisfied: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the right information entropy of w is less than the right information entropy of w_1; and (3) the right information entropy of w is less than a certain threshold; then w is filtered out and considered unable to form a word independently. Likewise, suppose there exists w_1 ∈ T_3 such that w = p w_1, where p is a single character and w_1 is the string that remains after p is removed. If the following conditions are all satisfied: (1) the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; (2) the left information entropy of w is less than the left information entropy of w_1; and (3) the left information entropy of w is less than a certain threshold; then w is filtered out and considered unable to form a word independently. Longer words are handled by analogy.
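The three conditions above can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: candidates are processed from shortest to longest, T_{k-1} is taken to be the set of shorter strings that have survived so far, every candidate is assumed to have a positive count and a numeric entropy on both sides, and the names `Stats` and `coupling_filter` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    count: int            # number of occurrences of the string (assumed > 0)
    left_entropy: float   # left information entropy LE(w)
    right_entropy: float  # right information entropy RE(w)


def coupling_filter(stats, ratio_threshold, entropy_threshold):
    """Drop strings that look like a shorter candidate plus one stray character."""
    kept = {}
    for w in sorted(stats, key=len):          # shorter strings are decided first
        s = stats[w]
        drop = False
        # w = w1 + p is judged on the right side; w = p + w1 on the left side.
        for w1, side in ((w[:-1], "right_entropy"), (w[1:], "left_entropy")):
            s1 = kept.get(w1)
            if s1 is None:
                continue                       # w1 is not in T_{k-1}
            if (s1.count / s.count > ratio_threshold           # condition 1
                    and getattr(s, side) < getattr(s1, side)   # condition 2
                    and getattr(s, side) < entropy_threshold): # condition 3
                drop = True
                break
        if not drop:
            kept[w] = s
    return kept
```

Deciding shorter strings first means that each longer string is compared only against shorter candidates whose fate is already known, which is one way to realize the recursion over word length.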
Step 204, filtering with a search engine. Given how search engines are constructed, if a multi-character string submitted as a query returns very few results, it cannot independently form a word. On this principle the result can be filtered further. The invention uses the number of web pages returned by a commercial search engine to filter the final result and remove the multi-character strings that cannot independently form words.
After the new word discovery algorithm is adopted, the anchor text corpus has the following results after being sorted based on frequency:
table 6: information entropy and frequency of words based on anchor text corpus
Figure BDA0000054975650000091
In terms of the effect of domain term generation, the domain terms generated by this method are highly reliable. Table 7 lists the number of candidate terms and the word formation probability obtained from the three corpora without relative frequency filtering:
table 7: word number and word forming probability obtained by new word discovery algorithm
Figure BDA0000054975650000092
Step S105, further screening the candidate multi-character set screened by the new word discovery algorithm with a relative frequency algorithm to output a domain term generation result. The relative frequency method is a very effective method in the prior art and is widely used in systems such as information retrieval and text classification. An obvious feature of a term is that it appears many times in texts of its own field and less frequently in other fields; relative frequency reflects this feature to some extent. The method is simple to compute and gives good extraction results.
In the embodiment of the present invention, the relative frequency is calculated as the frequency in the domain-specific corpus divided by the frequency in the background corpus. After screening based on relative frequency, the words are sorted by number of occurrences to obtain an ordered result for each corpus; the result is then labeled to check whether each word is a financial word. The top 10 (P10), top 100 (P100), top 1000 (P1000) (where present) and all results are labeled separately, and their accuracies are calculated as follows (where the relative frequency threshold denotes the proportion to be filtered out):
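The relative frequency screening can be sketched in Python as follows. The sketch assumes that "frequency" means a raw occurrence count, that words absent from the background corpus are skipped, and that the threshold is applied by discarding the given fraction of words with the lowest relative frequency; the function names are hypothetical.

```python
def relative_frequency(domain_counts, background_counts):
    """Domain-corpus frequency divided by background-corpus frequency."""
    scores = {}
    for word, domain_count in domain_counts.items():
        background_count = background_counts.get(word, 0)
        if background_count > 0:      # assumption: skip words with no background count
            scores[word] = domain_count / background_count
    return scores


def filter_by_relative_frequency(domain_counts, background_counts, drop_fraction):
    """Drop the lowest-scoring fraction, then sort survivors by domain frequency."""
    scores = relative_frequency(domain_counts, background_counts)
    ranked = sorted(scores, key=scores.get)            # ascending relative frequency
    survivors = ranked[int(len(ranked) * drop_fraction):]
    return sorted(survivors, key=lambda w: domain_counts[w], reverse=True)
```

With this scheme, a common word that is frequent in every field receives a low score and is removed, even if it occurs often in the domain corpus.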
table 8: word accuracy rate of different corpora in financial field
Figure BDA0000054975650000101
Through the above steps, a financial domain term set is obtained. This completes the whole process of automatically generating domain terms objectively and reliably from the behavior of network users.
The domain terms generated for the financial field include terms of many parts of speech, such as nouns, verbs and adjectives. To verify the validity and reliability of the present invention, experiments on domain term generation were performed.
The experiments used one week of query logs from a commercial search engine and the user browsing behavior logs of the same week.
In terms of the effect of domain term generation, the domain terms generated by this method are highly reliable; in addition, because the data on which the method is based are network resources, the generated domain terms can include new words in the language environment. Table 9 lists part of the domain term generation results:
table 9: partial Domain terms generate results
Figure BDA0000054975650000102
The invention can automatically discover and extract domain terms from anchor text. The model structure and parameters are simple, the algorithmic complexity is low, and good performance and domain term discovery results were obtained on experimental test data. The method generalizes and adapts well, and its results are objective, reliable, and comprehensive, so it has good application prospects.
By analyzing user browsing logs, the method extracts the anchor texts that users click when visiting web pages in a given field. This network resource contains many terms of the field, and domain terms are obtained from it by automatic screening based on information entropy, coupling degree, and relative frequency. The method requires no manual participation, is accurate and objective, and can promptly discover terms that are popular on the Internet in a given field.
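The first stage summarized above, collecting clicked anchor texts whose click result address belongs to the field, can be sketched in Python as follows. This is only an illustration: the record layout `(anchor_text, click_url)`, the function names, the host-level match against a preset list, and the `example.com` hosts are assumptions for the example, and the regular expression is one possible way to remove numbers, letters, and punctuation marks.

```python
import re
from urllib.parse import urlparse

# Matches ASCII letters, digits, underscores, and any non-word character
# (punctuation, whitespace). Chinese characters are word characters and survive.
_STRIP_PATTERN = re.compile(r"[A-Za-z0-9_\W]+")


def clean_anchor_text(text):
    """Remove numbers, letters, and punctuation marks from an anchor text."""
    return _STRIP_PATTERN.sub("", text)


def collect_candidates(click_records, domain_hosts):
    """Keep cleaned anchor texts whose click result address is on the preset list.

    click_records: iterable of (anchor_text, click_url) pairs (hypothetical layout).
    domain_hosts: set of host names standing in for the preset URL list.
    """
    candidates = []
    for anchor_text, click_url in click_records:
        host = urlparse(click_url).netloc.lower()
        if host in domain_hosts:
            cleaned = clean_anchor_text(anchor_text)
            if cleaned:  # skip anchors that contained no remaining characters
                candidates.append(cleaned)
    return candidates
```

The resulting list is the raw material for the later entropy, coupling degree, and relative frequency screens; how the candidate multi-character strings are cut from each anchor text is not covered by this sketch.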
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A domain term automatic generation method based on anchor text analysis is characterized by comprising the following steps:
collecting a browsing log of a user;
processing the browsing log to obtain an anchor text clicked by a user and a corresponding click result address;
processing the anchor text according to the click result address to obtain a candidate multi-character set;
screening the multi-words in the candidate multi-character set based on a new word discovery algorithm to remove the multi-words that cannot form words independently; and
further screening the candidate multi-character set screened by the new word discovery algorithm according to a relative frequency algorithm to output a domain term generation result.
2. The method for automatic generation of domain terms based on anchor text analysis according to claim 1, wherein the processing of the browsing log to obtain the anchor text clicked by the user and the corresponding click result address further comprises:
performing encoding conversion on the user log, arranging the browsing log into character string form, and removing numbers, letters and punctuation marks.
3. The method of claim 1, wherein the processing the anchor text according to the click result address to obtain the candidate multiword set further comprises:
judging whether the click result address belongs to a preset URL list or not;
and adding the anchor text corresponding to the click result address belonging to a preset URL list into a candidate multi-character set.
4. The method of claim 1, wherein the screening of the multi-words in the candidate multi-word set based on the new word discovery algorithm to remove multi-words that cannot form words independently further comprises:
filtering the candidate multi-word set based on a left-right entropy algorithm; and
filtering the screened candidate multi-word set based on a coupling degree algorithm.
5. The method of claim 4, wherein the filtering the set of candidate multiwords based on left-right entropy algorithm further comprises:
calculating the left information entropy and the right information entropy of each multi-word in the candidate multi-word set;
judging whether the left information entropy or the right information entropy of each multi-word is larger than a threshold value;
and if the left information entropy or the right information entropy of the multi-word is smaller than the threshold value, removing the multi-word.
6. The method for automatic generation of domain terms based on anchor text analysis according to claim 5,
wherein,
the left information entropy is:
$$LE(w) = -\frac{1}{n} \sum_{a_i \in A} C(w, a_i) \log \frac{C(w, a_i)}{n};$$
the right information entropy is:
$$RE(w) = -\frac{1}{n} \sum_{b_i \in B} C(w, b_i) \log \frac{C(w, b_i)}{n};$$
wherein,
Figure FDA0000054975640000023
$C(w, a_i)$ and $C(w, b_i)$ are the numbers of occurrences of the left single character $a_i$ and the right single character $b_i$ of the word $w$, respectively.
7. The method of claim 4, wherein the filtering the filtered candidate multi-word set based on a degree of coupling algorithm further comprises:
calculating the word length of each multi-word in the screened candidate multi-word set;
judging whether the multiple words can be independently formed according to the word length and the coupling degree of each multiple word;
if it is judged that the word cannot be formed independently, it is removed.
8. The method of claim 4 for automatic generation of domain terms based on anchor text analysis, further comprising:
inputting each multi-word in the candidate multi-word set screened based on the left-right entropy algorithm and the coupling degree algorithm into a search engine for searching;
and removing the multiwords which do not meet the requirement of the search result according to the search result.
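The left-right entropy screen of claims 5 and 6 can be sketched in Python as follows. This is a hedged illustration rather than the claimed implementation: the function names are hypothetical, the natural logarithm is assumed, and n is assumed to be the total number of neighbor occurrences on the given side, since the figure that defines it did not survive in this text.

```python
import math
from collections import Counter


def neighbor_counts(word, corpus_lines):
    """Count the single characters directly left and right of each occurrence of word."""
    left, right = Counter(), Counter()
    for line in corpus_lines:
        start = line.find(word)
        while start != -1:
            if start > 0:
                left[line[start - 1]] += 1
            end = start + len(word)
            if end < len(line):
                right[line[end]] += 1
            start = line.find(word, start + 1)
    return left, right


def information_entropy(counts):
    """Compute -(1/n) * sum(C * log(C / n)) over the neighbor counts C."""
    n = sum(counts.values())  # assumed definition of n (total neighbor occurrences)
    if n == 0:
        return 0.0
    return -sum(c * math.log(c / n) for c in counts.values()) / n


def passes_entropy_filter(word, corpus_lines, threshold):
    """Keep a word only if neither its left nor its right entropy is below the threshold."""
    left, right = neighbor_counts(word, corpus_lines)
    return (
        information_entropy(left) >= threshold
        and information_entropy(right) >= threshold
    )
```

A candidate that always appears with the same neighbor on one side has entropy 0 on that side and is removed, which matches the intuition that such a string is probably a fragment of a longer word.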

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110091312 CN102169496A (en) 2011-04-12 2011-04-12 Anchor text analysis-based automatic domain term generating method

Publications (1)

Publication Number Publication Date
CN102169496A (en) 2011-08-31

Family

ID=44490658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110091312 Pending CN102169496A (en) 2011-04-12 2011-04-12 Anchor text analysis-based automatic domain term generating method

Country Status (1)

Country Link
CN (1) CN102169496A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050033978A1 (en) * 2003-08-08 2005-02-10 Hyser Chris D. Method and system for securing a computer system
CN101178728A (en) * 2007-11-21 2008-05-14 北京搜狗科技发展有限公司 Web side navigation method and system
CN101178714A (en) * 2006-12-20 2008-05-14 腾讯科技(深圳)有限公司 Web page classification method and device

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102658B (en) * 2013-04-09 2018-09-07 腾讯科技(深圳)有限公司 Content of text method for digging and device
CN104102658A (en) * 2013-04-09 2014-10-15 腾讯科技(深圳)有限公司 Method and device for mining text contents
CN103593427A (en) * 2013-11-07 2014-02-19 清华大学 New word searching method and system
CN103631963A (en) * 2013-12-18 2014-03-12 北京博雅立方科技有限公司 Keyword optimization processing method and device based on big data
CN103631963B (en) * 2013-12-18 2017-10-17 北京博雅立方科技有限公司 A kind of keyword optimized treatment method and device based on big data
CN103778243B (en) * 2014-02-11 2017-02-08 北京信息科技大学 Domain term extraction method
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method
CN108875040A (en) * 2015-10-27 2018-11-23 上海智臻智能网络科技股份有限公司 Dictionary update method and computer readable storage medium
CN108875040B (en) * 2015-10-27 2020-08-18 上海智臻智能网络科技股份有限公司 Dictionary updating method and computer-readable storage medium
CN105224682B (en) * 2015-10-27 2018-06-05 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105224682A (en) * 2015-10-27 2016-01-06 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105183923A (en) * 2015-10-27 2015-12-23 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN106815190B (en) * 2015-11-27 2020-06-23 阿里巴巴集团控股有限公司 Word recognition method and device and server
CN106815190A (en) * 2015-11-27 2017-06-09 阿里巴巴集团控股有限公司 A kind of words recognition method, device and server
CN108268440A (en) * 2017-01-04 2018-07-10 普天信息技术有限公司 A kind of unknown word identification method
CN107967299A (en) * 2017-11-03 2018-04-27 中国农业大学 The hot word extraction method and system of a kind of facing agricultural public sentiment
CN107967299B (en) * 2017-11-03 2020-05-12 中国农业大学 Agricultural public opinion-oriented automatic hot word extraction method and system
CN108768764A (en) * 2018-05-08 2018-11-06 四川斐讯信息技术有限公司 A kind of router test method and device
CN108959259A (en) * 2018-07-05 2018-12-07 第四范式(北京)技术有限公司 New word discovery method and system
CN111666417A (en) * 2020-04-13 2020-09-15 百度在线网络技术(北京)有限公司 Method and device for generating synonyms, electronic equipment and readable storage medium
CN111666417B (en) * 2020-04-13 2023-06-23 百度在线网络技术(北京)有限公司 Method, device, electronic equipment and readable storage medium for generating synonyms
CN112597760A (en) * 2020-12-04 2021-04-02 光大科技有限公司 Method and device for extracting domain words in document
CN112395395A (en) * 2021-01-19 2021-02-23 平安国际智慧城市科技股份有限公司 Text keyword extraction method, device, equipment and storage medium
CN112395395B (en) * 2021-01-19 2021-05-28 平安国际智慧城市科技股份有限公司 Text keyword extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20110831
