CN102169496A - Anchor text analysis-based automatic domain term generating method - Google Patents
Anchor text analysis-based automatic domain term generating method Download PDFInfo
- Publication number
- CN102169496A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for automatically generating domain terms based on anchor text analysis, comprising the following steps: collecting a browsing log of a user; processing the browsing log to obtain the anchor text clicked by the user and the corresponding click result address; processing the anchor text according to the click result address to obtain a candidate multi-character set; screening the multi-character strings in the candidate set with a new word discovery algorithm to remove those that cannot form words independently; and further screening the remaining candidate set with a relative frequency algorithm to output the domain term generation result. With this method, domain terms can be discovered and extracted automatically from anchor text; the model structure and parameters are simple, the algorithms have low complexity, and good performance and domain term discovery results are achieved on experimental test data.
Description
Technical Field
The invention relates to the field of network technology, and in particular to a method for automatically generating domain terms based on anchor text analysis.
Background
Domain terms are words used in a subject area to represent concepts or relationships within that area. A term may be a word or a phrase; put another way, terms are conventional linguistic symbols used to express or define scientific concepts in speech or writing. In China, such terms are customarily referred to as nouns. Examples are found everywhere when reading scientific and technical literature or studying professional courses: "router" is a term in the field of computer networks, and "DNA" is a term in the field of life sciences. In the field of term extraction, a term refers to a definite linguistic unit consisting of two or more words with a certain grammatical relationship, such as "national missile defense system".
Domain term extraction has important applications in many fields. When a domain ontology is constructed, its domain terms must be kept up to date, so a term extraction method plays a crucial role in building and maintaining the ontology. In information retrieval, a domain term set is needed when an index is built; domain term extraction can greatly improve retrieval accuracy and coverage, particularly in vertical search, where knowing the terms of a field yields more accurate results for searches in that field. In browsing recommendation, domain terms obtained from web resources help capture a user's browsing intention more accurately, so that relevant information can be recommended on the basis of the user's specific browsing behavior. Domain term extraction also plays a large role in advertising: a domain dictionary greatly assists web page classification and thereby helps companies target advertisements at different user groups more precisely.
Current domain term extraction methods fall into three main types:
1. Rule-based methods. These extract terms by establishing rule templates in advance and then matching text against the templates. However, formulating the rules relies mainly on linguistic knowledge, and linguistic rules are difficult to discover; it is hard to formulate a complete rule set and to keep multiple rules compatible with one another.
2. Statistics-based methods. Statistical methods have long been used in term extraction and have achieved good results. Some researchers use relative document frequency for term extraction and apply it to the automatic construction of ontologies. Frantzi proposed the C-value/NC-value evaluation function for domain term extraction and achieved good results. Pattel uses mutual information and the log-likelihood ratio to obtain domain terms. Liu uses left and right information entropy and the log-likelihood ratio to determine word boundaries and thereby extract candidate terms; the present method also uses this approach. Statistics-based algorithms can be applied to many kinds of corpora, but they do not yield good results for certain types of corpora.
3. Methods combining rules and statistics. In practice, many methods combine statistics and rules. ThuyVU first extracts a candidate set according to rules, then scores it with the C-value/NC-value and T-test methods to obtain the final terms. This approach draws on both of the methods above, and its results are relatively good.
The drawback of the prior art is that current domain term extraction methods are complicated and have low accuracy, so improvement is urgently needed.
Disclosure of Invention
The object of the present invention is to solve the above technical drawbacks.
In order to achieve the above object, in one aspect, the present invention provides an automatic domain term generation method based on anchor text analysis, including the following steps: collecting a browsing log of a user; processing the browsing log to obtain an anchor text clicked by a user and a corresponding click result address; processing the anchor text according to the click result address to obtain a candidate multi-character set; screening the multiple characters in the candidate multiple character set based on a new word discovery algorithm to remove the multiple characters which cannot form words independently; and further screening the candidate multi-character set screened by the new word discovery algorithm according to a relative frequency algorithm to output a domain term generation result.
In an embodiment of the present invention, processing the browsing log to obtain the anchor text clicked by the user and the corresponding click result address further includes: performing encoding conversion on the user log, arranging the browsing log into character string form, and removing numbers, letters, and punctuation marks.
In an embodiment of the present invention, the processing the anchor text according to the click result address to obtain the candidate multiword set further includes: judging whether the click result address belongs to a preset URL list or not; and adding the anchor text corresponding to the click result address belonging to a preset URL list into a candidate multi-character set.
In an embodiment of the present invention, the filtering the multiple words in the candidate multiple word set based on the new word discovery algorithm to remove the multiple words that cannot be independently formed into words further includes: filtering the candidate multi-word set based on a left-right entropy algorithm; and filtering the screened candidate multi-word set based on a coupling degree algorithm.
In one embodiment of the present invention, the filtering the candidate multiword set based on a left-right entropy algorithm further comprises: calculating the left information entropy and the right information entropy of each multi-word in the candidate multi-word set; judging whether the left information entropy or the right information entropy of each multi-word is larger than a threshold value; and if the left information entropy or the right information entropy of the multi-word is smaller than the threshold value, removing the multi-word.
In one embodiment of the present invention, the left information entropy and the right information entropy of a word w are computed from the counts C(w, a_i) and C(w, b_i), where C(w, a_i) and C(w, b_i) are the numbers of occurrences of the left single character a_i and the right single character b_i of the word w, respectively.
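The two entropy formulas themselves are not reproduced in this text. As a hedged reconstruction only, the usual Shannon form of left and right entropy over the counts defined above would read as follows. The normalization by the total neighbor count, the unspecified logarithm base, and the symbols H_L, H_R, m and n (m and n being the numbers of distinct left and right single characters) are assumptions, not taken from the source.

```latex
H_L(w) = -\sum_{i=1}^{m}
  \frac{C(w,a_i)}{\sum_{j=1}^{m} C(w,a_j)}
  \log \frac{C(w,a_i)}{\sum_{j=1}^{m} C(w,a_j)},
\qquad
H_R(w) = -\sum_{i=1}^{n}
  \frac{C(w,b_i)}{\sum_{j=1}^{n} C(w,b_j)}
  \log \frac{C(w,b_i)}{\sum_{j=1}^{n} C(w,b_j)}
```

Under this form, a word whose neighbors are spread evenly over many different characters has high entropy on that side, which is the property the threshold test relies on.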
In an embodiment of the present invention, the filtering the filtered candidate multi-word set based on the coupling degree algorithm further includes: calculating the word length of each multi-word in the screened candidate multi-word set; judging whether the multiple words can be independently formed according to the word length and the coupling degree of each multiple word; if it is judged that the word cannot be formed independently, it is removed.
In one embodiment of the present invention, further comprising: inputting each multi-word in the candidate multi-word set screened based on the left-right entropy algorithm and the coupling degree algorithm into a search engine for searching; and removing the multiwords which do not meet the requirement of the search result according to the search result.
The invention can automatically find and extract the domain terms from the anchor text. The model structure and parameters are simple, the algorithm complexity is low, and good performance and field term discovery effect are obtained on experimental test data. The method has good popularization and adaptability, and the effect of generating synonyms has the characteristics of objectivity, reliability and comprehensiveness, thereby having good application prospect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a method for automatically generating domain terms based on anchor text analysis according to an embodiment of the present invention;
FIG. 2 and FIG. 3 are flowcharts of a new word discovery algorithm according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
In this method, the anchor text a user clicks while browsing web pages is extracted by analyzing the user browsing log; if the web page that an anchor text links to belongs to a certain field, the anchor text is considered to contain a domain term of that field. The terms of the field are then obtained by automatic screening of these network resources using information entropy, coupling degree, and relative frequency. Anchor text is the text of a link. On the one hand, anchor text can serve as an evaluation of the content of the page on which it appears: normally, the links added to a page have some relationship with the content of the page itself. For example, an industry website for clothing may add links to peer websites or to well-known clothing manufacturers. On the other hand, anchor text can serve as an evaluation of the page it points to, since it can accurately describe that page's content. Because links added to a page generally have a direct relationship with the page, a search engine can determine the content attributes of a web page from the anchor text of the links pointing to it. Anchor text also helps a search engine cover documents that it cannot index directly.
The embodiment of the invention provides a method for automatically generating domain terms based on the analysis of multiple network resources. A corpus related to the financial field is obtained by analyzing and processing various network resources, words in the corpus are extracted by a new word discovery algorithm, and a term set related to the field is finally obtained by relative frequency filtering, thereby generating domain terms automatically. Unlike traditional domain term extraction methods, this method draws its data from anchor text and other network resources, which are more strongly structured and more timely than traditional text data. The method can generate domain terms efficiently and accurately and thus support various Internet-based natural language application systems.
As shown in fig. 1, a flowchart of a method for automatically generating domain terms based on anchor text analysis according to an embodiment of the present invention includes the following steps:
Step S101: collect a browsing log of a user. When browsing the web, a user reaches a page by clicking an anchor text; if the page relates to a certain field, the anchor text is strongly related to that field and very probably contains a domain term of the field. The embodiment of the invention takes the browsing log as an example, but other network resources may also be used.
As an example of the present invention, the data may be a one-week user browsing behavior log (11/10/2010 to 17/10/2010). The entries and scale of the corpus are as follows:
Table 1: Entries and size of each corpus
The user browsing log contains the following information:
Table 2: Information items contained in the user browsing log
The log records enough information about the user's browsing, so domain term extraction can be performed using the log.
Step S102: the browsing log is processed to obtain the anchor text clicked by the user and the corresponding click result address. Preprocessing of the user browsing behavior log comprises: converting the encoding of the text, from the format recorded by the server (usually Uniform Resource Identifier (URI) format) to GBK, the national standard Chinese character encoding; sorting the user log by the content items listed in Table 3 to find the required information, and arranging the log as strings of those content items; and filtering out noise such as numbers, letters, and punctuation marks in the anchor text.
The data set from which domain terms are automatically generated comes from the user browsing log. For this purpose the log should contain at least the following items:
Table 3: User browsing log items required for automatic generation of domain terms
Because network resources have a complex format, the useful information must be picked out of them, mainly by the following steps.
Step 1.1: convert the encoding of the user log, from the format recorded by the server (usually Uniform Resource Identifier (URI) format) to GBK, the national standard Chinese character encoding.
Step 1.2: sort the user log using the content items listed in Table 3, remove all information other than those content items, and arrange the log as strings of the content items.
Step 1.3: filter out noise in the corpus, such as numbers, letters, and punctuation marks, to obtain a candidate multi-word set.
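Steps 1.1 and 1.3 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes the logged text is percent-escaped GBK bytes, and the function names `decode_logged_text` and `clean_anchor_text` are hypothetical. Step 1.2 depends on the log layout of Table 3, which is not reproduced here, so it is omitted.

```python
import re
from urllib.parse import quote, unquote

# A digit, an ASCII letter, or any non-word character (punctuation,
# whitespace, symbols) is treated as noise and as a fragment boundary.
_NOISE = re.compile(r"[\dA-Za-z\W_]+")


def decode_logged_text(raw: str) -> str:
    """Step 1.1: decode percent-escaped text, assuming the bytes are GBK."""
    return unquote(raw, encoding="gbk", errors="replace")


def clean_anchor_text(anchor: str) -> list[str]:
    """Step 1.3: drop digits, letters and punctuation; keep the remaining runs."""
    return [piece for piece in _NOISE.split(anchor) if piece]


# Usage: a round trip through percent-encoding, then noise removal.
logged = quote("基金净值2010查询,A股行情!", encoding="gbk")
fragments = clean_anchor_text(decode_logged_text(logged))
```

Splitting on the noise characters, rather than deleting them, keeps characters that were separated by a digit or punctuation mark from being glued into a spurious string.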
Step S103: the anchor text is processed according to the click result address to obtain a candidate multi-character set; that is, the web pages are screened. In the embodiment of the invention, the text corpus of a given field is selected from the network resources by screening on web page URL. The resulting text is then segmented to obtain the candidate multi-character set.
Web page screening is based on inductive summarization. For example, "East Wealth Web" (eastmoney.com) is a professional finance website, so a web page URL containing "eastmoney.com" is considered to belong to a finance web page. For large portals such as "sina", "sohu", and "qq", the finance pages of the portal are identified by summarizing its sub-domain names, such as the following sub-domain names of the "sohu" portal:
Table 4: Some finance web pages under the sohu website
All web pages are used as the background corpus. In repeated random sampling, 100 web pages were sampled each time from the 66 thousands of web pages, and over the repeated experiments the accuracy reached 96%.
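The URL screening of step S103 might look like the sketch below. The host list is illustrative: `eastmoney.com` comes from the text, while `finance.example.com` is a placeholder standing in for a portal's finance sub-domain, since Table 4 is not reproduced here. The function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical preset list: eastmoney.com appears in the text; the second
# entry is a placeholder for a portal's finance sub-domain.
FINANCE_HOSTS = ("eastmoney.com", "finance.example.com")


def is_domain_page(url: str, hosts=FINANCE_HOSTS) -> bool:
    """True if the URL's host equals, or is a sub-domain of, a preset host."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in hosts)


def select_anchor_texts(clicks, hosts=FINANCE_HOSTS):
    """Keep the anchor text of each (anchor_text, result_url) click on a domain page."""
    return [anchor for anchor, url in clicks if is_domain_page(url, hosts)]
```

The text describes the test as the URL "containing" the site name. Matching on the parsed host is a slightly stricter variant chosen here so that an unrelated host such as `noteastmoney.com` is not accepted; URLs logged without a scheme would need normalizing first.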
Through the above processing, the number and scale of the obtained corpora are as follows:
Table 5: Entries and size of the financial domain corpus
Step S104: the multi-character strings in the candidate multi-character set are screened by a new word discovery algorithm to remove those that cannot form words independently. For the multi-character set from the previous step, the occurrence frequency of each multi-character string is counted and its information entropy is calculated; the set is then screened by occurrence frequency, left and right information entropy, and coupling degree to remove the strings that cannot independently form words. The screened results are submitted to a search engine to obtain the number of web pages returned for each query, and a string is filtered out if too few pages are returned, which finally yields the candidate term set. Specifically, the new word discovery algorithm may include one or more of the following steps, as shown in FIG. 2 and FIG. 3, which are flowcharts of the new word discovery algorithm according to the embodiment of the present invention:
Step 201: filter by frequency. The occurrence frequency of each multi-character string in the domain multi-character set is counted, and strings whose frequency is greater than a certain threshold form the candidate multi-character set for the next calculation.
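A minimal sketch of the frequency filter of step 201 follows. The text does not say how candidate strings are enumerated, so this example assumes they are all character n-grams of length 2 to 4 taken from the cleaned anchor text fragments; the function names and the length bounds are hypothetical.

```python
from collections import Counter


def count_candidates(fragments, min_len=2, max_len=4):
    """Count every character n-gram of length min_len..max_len in the fragments."""
    counts = Counter()
    for text in fragments:
        for n in range(min_len, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def filter_by_frequency(counts, min_count):
    """Step 201: keep strings whose frequency is greater than min_count."""
    return {w: c for w, c in counts.items() if c > min_count}
```

For instance, with the fragments "基金净值" and "基金公司" and a threshold of 1, only "基金" survives, because it is the only n-gram that occurs in both fragments.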
Step 202, calculating information entropy, and filtering the candidate multi-word set based on a left-right entropy algorithm. The method specifically comprises the following steps: calculating the left information entropy and the right information entropy of each multi-word in the candidate multi-word set; judging whether the left information entropy or the right information entropy of each multi-word is larger than a threshold value; and if the left information entropy or the right information entropy of the multi-word is smaller than the threshold value, removing the multi-word.
The information entropy is calculated as follows:
First, left and right neighbor statistics are built for each word: all documents are traversed and, for each word, the frequency of every single character appearing immediately to its left and to its right is counted.
Then the corresponding entropy is calculated.
Definition 1: Assume that the word w belongs to the candidate set, and let A = {a_1, a_2, a_3, ..., a_m} and B = {b_1, b_2, b_3, ..., b_n} be the sets of single characters that appear immediately to the left and to the right of w. The left information entropy and the right information entropy of w are defined over the counts C(w, a_i) and C(w, b_i), where C(w, a_i) and C(w, b_i) are the numbers of occurrences of the left single character a_i and the right single character b_i of the word w, respectively.
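The neighbor statistics and the entropy computation can be sketched as below. The entropy formula is not reproduced in this text, so the example assumes the usual Shannon form over the neighbor counts with a natural logarithm; both choices, and the function names, are assumptions.

```python
import math
from collections import Counter, defaultdict


def neighbor_counts(fragments, candidates):
    """Count the single characters found directly left and right of each candidate."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for text in fragments:
        for w in candidates:
            start = text.find(w)
            while start != -1:
                end = start + len(w)
                if start > 0:
                    left[w][text[start - 1]] += 1   # C(w, a_i)
                if end < len(text):
                    right[w][text[end]] += 1        # C(w, b_i)
                start = text.find(w, start + 1)
    return left, right


def entropy(counter):
    """Shannon entropy (natural log) of a Counter; None when it is empty."""
    total = sum(counter.values())
    if total == 0:
        return None
    return -sum(c / total * math.log(c / total) for c in counter.values())
```

Returning `None` for an empty counter keeps "the entropy does not exist" distinct from an entropy of zero, which matters for the filtering rule that follows.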
Because the corpus consists of query words rather than sentences, a string that forms a word on its own often has no single character on its left (or right). For example, in the processed query corpus, "jingfang" occurs 532 times in total, while its left and right single characters total only 22, so the information entropy cannot reflect its probability of forming a word. The following strategy is therefore adopted (L and R are flag bits, and α is a threshold):
If the condition comparing n and N against α is met (the condition itself is not reproduced in the source text), set L = 1; otherwise set L = 0, where N is the total frequency of the word and n is the frequency with which a single character appears on the left of the word. In the same way, if the corresponding condition is met for the right side, set R = 1; otherwise set R = 0, where N is the total frequency of the word and n is the frequency with which a single character appears on the right of the word.
If L = R = 1, the word is put into the candidate set and passed to the next filtering step. Otherwise, if L or R is 0, the word is filtered by its left or right information entropy.
The strategy for filtering by information entropy is as follows: after the candidate set is extracted, it is determined whether L is 0 or R is 0. For a side whose flag is 0, if the entropy on that side is greater than a certain value (set to β), the word is kept in the candidate set and passed to the next filtering step; otherwise the word is removed. Note that if the entropy on a side does not exist, it is defined as infinitesimal. Only if w satisfies the word formation threshold on both sides can it be put into the candidate set.
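The flag-then-entropy strategy might be sketched as follows. The flag condition is not legible in this text; the sketch assumes a side is flagged 1 when the share of occurrences that have a single character on that side is below α, which matches the "jingfang" example but remains an assumption. The function name and parameters are hypothetical, and `total` is assumed to be positive.

```python
def passes_entropy_filter(total, n_left, n_right, h_left, h_right, alpha, beta):
    """Apply the flag-then-entropy strategy to one candidate word.

    total    -- N, occurrences of the word (assumed > 0)
    n_left   -- occurrences with a single character on the left
    n_right  -- occurrences with a single character on the right
    h_left / h_right -- left / right entropy, or None if it does not exist
    """
    def side_ok(n_side, h_side):
        # Assumed flag rule: the side rarely has a neighbour, so flag = 1.
        if n_side / total < alpha:
            return True
        # Flag = 0: fall back to the entropy test; a missing entropy fails it.
        return h_side is not None and h_side > beta

    # The word is kept only if both sides pass.
    return side_ok(n_left, h_left) and side_ok(n_right, h_right)
```

With α = 0.1, a word seen 532 times with only 22 neighbored occurrences in total passes on both flags without any entropy test, which is the case the strategy was introduced to handle.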
Step 203: filter using a recursive coupling-degree filtering algorithm. Although the previous step finds a good filtered word set, much noise remains. For some noisy candidates the right entropy does not exist because the candidate satisfied the frequency rule of the previous step, even though its right side can in fact be split off semantically; for others the left information entropy is too large for the rule of the previous step to remove them. Therefore, the word length of each multi-character string in the screened candidate set is calculated; whether the string can independently form a word is judged from its word length and coupling degree; and strings judged unable to form a word independently are removed.
The recursive coupling-degree filtering algorithm is as follows:
For a multi-character string w of word length 3, suppose there is w_1 ∈ T_2 (T_2 being the set of candidate words of length 2) such that w = w_1 p, where p is a single character and w_1 is the string left after removing p. The coupling of p and w_1 is evaluated, and w is filtered out as unable to form a word independently if all of the following hold: 1. the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; 2. the right entropy of w is less than the right entropy of w_1; 3. the right information entropy of w is less than a certain threshold. Likewise, suppose there is w_1 ∈ T_2 such that w = p w_1, where p is a single character and w_1 is the string left after removing p. Then w is filtered out as unable to form a word independently if all of the following hold: 1. the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; 2. the left entropy of w is less than the left entropy of w_1; 3. the left information entropy of w is less than a certain threshold.
For a multi-character string w of word length 4, the same test is applied with w_1 ∈ T_3 (T_3 being the set of candidate words of length 3). If w = w_1 p, w is filtered out when: 1. the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; 2. the right entropy of w is less than the right entropy of w_1; 3. the right information entropy of w is less than a certain threshold. If w = p w_1, w is filtered out when: 1. the number of occurrences of w_1 divided by the number of occurrences of w is greater than a certain threshold; 2. the left entropy of w is less than the left entropy of w_1; 3. the left information entropy of w is less than a certain threshold. Longer words are handled by analogy.
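The three conditions above can be sketched as a single pass over the candidates in order of length. Several points are assumptions rather than statements from the text: T_{k-1} is taken to be the set of length k-1 words that survived this filter, a missing entropy is treated as 0.0, and one ratio threshold and one entropy threshold are shared by all lengths. The function name is hypothetical.

```python
def coupling_filter(candidates, freq, h_left, h_right, ratio_min, h_max):
    """Remove words that look like a shorter candidate plus one stray character.

    candidates -- set of words that passed the entropy filter
    freq       -- word -> occurrence count
    h_left / h_right -- word -> left / right entropy (missing means 0.0 here)
    """
    kept = set()
    # Process short words first so T_{k-1} is final before length k is tested.
    for w in sorted(candidates, key=len):
        if len(w) < 3:
            kept.add(w)  # length-2 words form the base set T_2
            continue
        # (shorter word w1, entropy table for the side where p was attached)
        cases = ((w[:-1], h_right),   # w = w1 + p
                 (w[1:], h_left))     # w = p + w1
        dependent = any(
            w1 in kept
            and freq[w1] / freq[w] > ratio_min           # condition 1
            and side.get(w, 0.0) < side.get(w1, 0.0)     # condition 2
            and side.get(w, 0.0) < h_max                 # condition 3
            for w1, side in cases
        )
        if not dependent:
            kept.add(w)
    return kept
```

In this sketch a string such as "基金的" is dropped when "基金" is ten times as frequent and has a much higher right entropy, because the trailing character then looks like context around "基金" and not part of a word.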
Step 204: filter using a search engine. Given how search engines are constructed, if a multi-character string submitted to a search engine returns few results, the string cannot independently form a word. On this principle the result can be filtered further. The invention uses the number of web pages returned by a commercial search engine to filter the final result and remove strings that cannot form words independently.
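Step 204 reduces to a threshold on a result count, sketched below. The text names no search engine or API, so the lookup is passed in as a caller-supplied function; `hit_count`, `min_hits` and the function name are hypothetical.

```python
def filter_by_hits(words, hit_count, min_hits):
    """Keep words for which the search back end reports at least min_hits pages.

    hit_count is a caller-supplied function word -> number of result pages;
    no particular search engine API is assumed.
    """
    return [w for w in words if hit_count(w) >= min_hits]
```

Injecting the lookup keeps the filter testable offline, for example with a dictionary of previously recorded counts.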
After the new word discovery algorithm is applied and the results are sorted by frequency, the anchor text corpus gives the following results:
Table 6: Information entropy and frequency of words from the anchor text corpus
In terms of generation quality, the domain terms generated by this method are highly reliable. Table 7 lists the number of candidate terms and the word formation probability obtained from three corpora without relative frequency filtering:
Table 7: Number of words and word formation probability obtained by the new word discovery algorithm
Step S105: the candidate multi-character set screened by the new word discovery algorithm is further screened by a relative frequency algorithm to output the domain term generation result. The relative frequency method is a very effective existing method, widely used in systems such as information retrieval and text classification. An obvious feature of a term is that it appears many times in texts of its own field and less often in other fields, and relative frequency reflects this feature to some extent. The method is simple to compute and yields good extraction results.
In the embodiment of the present invention, the relative frequency is calculated as the frequency in the domain-specific corpus divided by the frequency in the background corpus. After screening by relative frequency, the words are sorted by number of occurrences to obtain an ordered result for the corpus, and the result is labeled to check whether each word is a financial word. The top 10 (P10), top 100 (P100), and top 1000 (P1000, where present) results, as well as all results, are labeled separately and their accuracy is calculated as follows (the relative frequency thresholds represent the proportion of words filtered out):
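The relative frequency screen and the P10/P100/P1000 measure can be sketched as below. The sketch simplifies in two ways that are assumptions: it uses raw counts as "frequency", and it applies a fixed ratio cutoff, whereas the text expresses the threshold as a proportion of words to filter out. All function names are hypothetical.

```python
def relative_frequency(word, domain_counts, background_counts):
    """Frequency in the domain corpus divided by frequency in the background corpus."""
    background = background_counts.get(word, 0)
    if background == 0:
        # Assumption: a word unseen in the background is maximally specific.
        return float("inf")
    return domain_counts.get(word, 0) / background


def select_terms(domain_counts, background_counts, threshold):
    """Keep words above the threshold, most frequent in the domain corpus first."""
    kept = [w for w in domain_counts
            if relative_frequency(w, domain_counts, background_counts) > threshold]
    return sorted(kept, key=lambda w: domain_counts[w], reverse=True)


def precision_at_k(ranked_terms, is_term, k):
    """Share of the top-k ranked words that were labelled as domain terms."""
    top = ranked_terms[:k]
    return sum(1 for w in top if is_term[w]) / len(top) if top else 0.0
```

A common word such as "我们" may be frequent in the domain corpus yet far more frequent in the background corpus, so its ratio is small and it is dropped, while a word concentrated in finance pages keeps a ratio near 1.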
table 8: word accuracy rate of different corpora in financial field
According to the steps, a financial domain term set is obtained. This completes the whole process of automatically generating domain terms objectively and reliably by using the behavior of the network user.
Through the steps, the domain terms in the financial field are generated. Including domain terms of many parts of speech such as nouns, verbs, adjectives, etc. To verify the validity and reliability of the present invention, we performed experiments related to the generation of domain terms.
The query log of a week of a certain commercial search engine and the user browsing behavior log of the week are adopted.
The domain term generated by the domain term generation method has a high degree of reliability in terms of the effect of domain term generation, and meanwhile, because the data resource on which the method is based is a network resource, the generated domain term can contain new words in a language environment. Table 9 lists the partial domain term generation results:
table 9: partial Domain terms generate results
According to the method, the anchor text clicked when the user accesses the webpage in the field is extracted through analyzing the user browsing log, the network resource comprises more terms in the field, and the terms in the field are obtained through automatic screening based on the information entropy, the coupling degree and the relative frequency of the network resource utilization. The method has the advantages of no need of manual participation, accuracy, objectivity and capability of timely finding out popular terms in a certain field on the Internet.
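The information-entropy screening mentioned above can be illustrated with a short sketch. It assumes the usual definition of left and right entropy, namely the Shannon entropy of the distribution of characters that appear immediately to the left or right of a candidate string. The boundary markers `<s>` and `</s>`, the function names and the threshold value are hypothetical choices for this example. A candidate is removed when either entropy falls below the threshold.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a count distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counter.values())


def neighbor_entropies(texts, candidates):
    """Return {candidate: (left_entropy, right_entropy)} over all occurrences."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for text in texts:
        for cand in candidates:
            start = text.find(cand)
            while start != -1:
                end = start + len(cand)
                # Assumption: a text boundary counts as one special neighbor.
                left[cand][text[start - 1] if start > 0 else "<s>"] += 1
                right[cand][text[end] if end < len(text) else "</s>"] += 1
                start = text.find(cand, start + 1)
    return {c: (entropy(left[c]), entropy(right[c])) for c in candidates}


def filter_by_entropy(texts, candidates, threshold):
    """Keep candidates whose left and right entropies both reach the threshold."""
    scores = neighbor_entropies(texts, candidates)
    return [c for c in candidates if min(scores[c]) >= threshold]


# Toy anchor texts (hypothetical). "股" is always followed by "票", so its
# right entropy is zero and it is rejected as an incomplete word.
anchors = ["买股票吧", "卖股票了", "看股票吗"]
kept = filter_by_entropy(anchors, ["股票", "股"], threshold=0.5)
```

A string with varied neighbors on both sides behaves like a free-standing word, while a string that is almost always followed (or preceded) by the same character is more likely a fragment of a longer word.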
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (8)
1. A domain term automatic generation method based on anchor text analysis is characterized by comprising the following steps:
collecting a browsing log of a user;
processing the browsing log to obtain an anchor text clicked by a user and a corresponding click result address;
processing the anchor text according to the click result address to obtain a candidate multi-character set;
screening the multi-character strings in the candidate multi-character set based on a new word discovery algorithm to remove those that cannot form words independently; and
further screening the candidate multi-character set screened by the new word discovery algorithm according to a relative frequency algorithm to output a domain term generation result.
2. The method for automatic generation of domain terms based on anchor text analysis according to claim 1, wherein the processing of the browsing log to obtain the anchor text clicked by the user and the corresponding click result address further comprises:
performing encoding conversion on the user log, arranging the browsing log into character string form, and removing numbers, letters and punctuation marks.
3. The method of claim 1, wherein the processing of the anchor text according to the click result address to obtain the candidate multi-character set further comprises:
judging whether the click result address belongs to a preset URL list; and
adding the anchor text corresponding to a click result address that belongs to the preset URL list into the candidate multi-character set.
4. The method of claim 1, wherein the screening of the multi-words in the candidate multi-word set based on the new word discovery algorithm to remove multi-words that cannot form words independently further comprises:
filtering the candidate multi-word set based on a left-right entropy algorithm; and
filtering the screened candidate multi-word set based on a coupling degree algorithm.
5. The method of claim 4, wherein the filtering the set of candidate multiwords based on left-right entropy algorithm further comprises:
calculating the left information entropy and the right information entropy of each multi-word in the candidate multi-word set;
judging whether the left information entropy or the right information entropy of each multi-word is larger than a threshold value;
and if the left information entropy or the right information entropy of the multi-word is smaller than the threshold value, removing the multi-word.
6. The method for automatic generation of domain terms based on anchor text analysis according to claim 5,
wherein,
the left information entropy is:
the right information entropy is:
7. The method of claim 4, wherein the filtering of the screened candidate multi-word set based on a coupling degree algorithm further comprises:
calculating the word length of each multi-word in the screened candidate multi-word set;
judging, according to the word length and the coupling degree of each multi-word, whether the multi-word can form a word independently; and
removing the multi-word if it is judged that it cannot form a word independently.
8. The method of claim 4 for automatic generation of domain terms based on anchor text analysis, further comprising:
inputting each multi-word in the candidate multi-word set screened based on the left-right entropy algorithm and the coupling degree algorithm into a search engine for searching;
and removing the multiwords which do not meet the requirement of the search result according to the search result.
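As an illustration of the log-processing steps recited in claims 2 and 3, the following sketch decodes raw log records, keeps only clicks whose result address matches a preset list, and strips numbers, letters and punctuation from the anchor text. The record layout (anchor text and URL separated by a tab), the GBK source encoding, the matching of the preset URL list by host name, and all function names are assumptions made for the example; the claims do not specify them.

```python
import unicodedata
from urllib.parse import urlsplit


def clean_anchor(text):
    """Drop punctuation, numbers, symbols, whitespace, control characters
    and cased (e.g. Latin) letters, keeping caseless letters such as CJK."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in "PNSZC" or category in ("Lu", "Ll"):
            continue
        kept.append(ch)
    return "".join(kept)


def in_preset_list(url, preset_hosts):
    """Assumption: the preset URL list is a set of host names; subdomains match."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    return any(host == h or host.endswith("." + h) for h in preset_hosts)


def collect_candidates(raw_records, preset_hosts, encoding="gbk"):
    """Return cleaned anchor texts whose click result address is in the list."""
    candidates = []
    for raw in raw_records:
        line = raw.decode(encoding, errors="replace").rstrip("\r\n")
        parts = line.split("\t")
        if len(parts) != 2:  # skip malformed records
            continue
        anchor, url = parts
        if in_preset_list(url, preset_hosts):
            cleaned = clean_anchor(anchor)
            if cleaned:
                candidates.append(cleaned)
    return candidates


# Toy records (hypothetical hosts and anchors).
records = [
    "股票行情2011!\thttp://finance.example.com/a".encode("gbk"),
    "天气预报\thttp://weather.example.org/".encode("gbk"),
]
candidates = collect_candidates(records, {"finance.example.com"})
```

The resulting anchor texts would then feed the new word discovery and relative frequency screening steps of claim 1.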
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110091312 CN102169496A (en) | 2011-04-12 | 2011-04-12 | Anchor text analysis-based automatic domain term generating method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102169496A (en) | 2011-08-31 |
Family
ID=44490658
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201110091312 (Pending), CN102169496A (en) | Anchor text analysis-based automatic domain term generating method | 2011-04-12 | 2011-04-12 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102169496A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050033978A1 (en) * | 2003-08-08 | 2005-02-10 | Hyser Chris D. | Method and system for securing a computer system |
CN101178714A (en) * | 2006-12-20 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Web page classification method and device |
CN101178728A (en) * | 2007-11-21 | 2008-05-14 | 北京搜狗科技发展有限公司 | Web side navigation method and system |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104102658B (en) * | 2013-04-09 | 2018-09-07 | 腾讯科技(深圳)有限公司 | Content of text method for digging and device |
CN104102658A (en) * | 2013-04-09 | 2014-10-15 | 腾讯科技(深圳)有限公司 | Method and device for mining text contents |
CN103593427A (en) * | 2013-11-07 | 2014-02-19 | 清华大学 | New word searching method and system |
CN103631963A (en) * | 2013-12-18 | 2014-03-12 | 北京博雅立方科技有限公司 | Keyword optimization processing method and device based on big data |
CN103631963B (en) * | 2013-12-18 | 2017-10-17 | 北京博雅立方科技有限公司 | A kind of keyword optimized treatment method and device based on big data |
CN103778243B (en) * | 2014-02-11 | 2017-02-08 | 北京信息科技大学 | Domain term extraction method |
CN103778243A (en) * | 2014-02-11 | 2014-05-07 | 北京信息科技大学 | Domain term extraction method |
CN108875040A (en) * | 2015-10-27 | 2018-11-23 | 上海智臻智能网络科技股份有限公司 | Dictionary update method and computer readable storage medium |
CN108875040B (en) * | 2015-10-27 | 2020-08-18 | 上海智臻智能网络科技股份有限公司 | Dictionary updating method and computer-readable storage medium |
CN105224682B (en) * | 2015-10-27 | 2018-06-05 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN105224682A (en) * | 2015-10-27 | 2016-01-06 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN105183923A (en) * | 2015-10-27 | 2015-12-23 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN106815190B (en) * | 2015-11-27 | 2020-06-23 | 阿里巴巴集团控股有限公司 | Word recognition method and device and server |
CN106815190A (en) * | 2015-11-27 | 2017-06-09 | 阿里巴巴集团控股有限公司 | A kind of words recognition method, device and server |
CN108268440A (en) * | 2017-01-04 | 2018-07-10 | 普天信息技术有限公司 | A kind of unknown word identification method |
CN107967299A (en) * | 2017-11-03 | 2018-04-27 | 中国农业大学 | The hot word extraction method and system of a kind of facing agricultural public sentiment |
CN107967299B (en) * | 2017-11-03 | 2020-05-12 | 中国农业大学 | Agricultural public opinion-oriented automatic hot word extraction method and system |
CN108768764A (en) * | 2018-05-08 | 2018-11-06 | 四川斐讯信息技术有限公司 | A kind of router test method and device |
CN108959259A (en) * | 2018-07-05 | 2018-12-07 | 第四范式(北京)技术有限公司 | New word discovery method and system |
CN111666417A (en) * | 2020-04-13 | 2020-09-15 | 百度在线网络技术(北京)有限公司 | Method and device for generating synonyms, electronic equipment and readable storage medium |
CN111666417B (en) * | 2020-04-13 | 2023-06-23 | 百度在线网络技术(北京)有限公司 | Method, device, electronic equipment and readable storage medium for generating synonyms |
CN112597760A (en) * | 2020-12-04 | 2021-04-02 | 光大科技有限公司 | Method and device for extracting domain words in document |
CN112395395A (en) * | 2021-01-19 | 2021-02-23 | 平安国际智慧城市科技股份有限公司 | Text keyword extraction method, device, equipment and storage medium |
CN112395395B (en) * | 2021-01-19 | 2021-05-28 | 平安国际智慧城市科技股份有限公司 | Text keyword extraction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20110831 |