WO2013143362A1 - Method, device, and computer storage media for adding hyperlink to text

Info

Publication number: WO2013143362A1
Application number: PCT/CN2013/071573
Authority: WO
Inventors: 贺翔; 卞琪; 焦峰
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2012-03-29
Filing date: 2013-02-08
Publication date: 2013-10-03
Also published as: US20140250356A1; SG11201400690PA; CN103365831B; US9483447B2; CN103365831A

Abstract

Methods and devices for adding hyperlinks to text are disclosed. A hyperlink word list and a feature word list are generated in advance, and for each feature word, its co-occurrence frequency with each hyperlink word is determined. For each text X to which hyperlinks are to be added, word segmentation is performed on the text; the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list are extracted from the segmentation result; a weight is determined for each extracted hyperlink word and each extracted feature word; and a final weight is obtained for each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and the determined weights. The extracted hyperlink words are sorted in descending order of final weight, and hyperlinks are added to the first K hyperlink words after sorting, where K is a positive integer. The solution improves the relevance of the added hyperlinks to the text and is easy to implement.

Description

A method, a device, and a computer storage medium for adding a hyperlink to text

This application claims priority to Chinese Patent Application No. 201210087642.0, filed with the Chinese Patent Office on March 29, 2012 and entitled "A method for adding a hyperlink to a text and [...]", the entire disclosure of which is incorporated herein by reference.

Technical field

The present invention relates to text processing techniques, and more particularly to a method and a device for adding hyperlinks to text.

Background of the invention

A hyperlink is a connection to a specific target on the Internet. When a hyperlink is clicked, the browser automatically jumps to the specified target.

Different texts can be linked to one another by adding hyperlinks to text. Figure 1 is a schematic diagram of an existing text with hyperlinks added, in which hyperlinks have been added to the words "Taihe Hall", "Zhonghe Temple" and "Baohe Temple". In practical applications, a word that carries a hyperlink is usually called a hyperlink word, and it is displayed in an underlined blue font.

In the prior art, hyperlinks are generally added to text in one of the following ways:

1) Manually determining which words in the text should be given hyperlinks;

2) Generating a hyperlink word list in advance, matching the text to which hyperlinks are to be added against this list, and adding a hyperlink to every word of the text that appears in the list.
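To make approach 2) concrete, the following is a minimal sketch of plain list matching. The function name, the sample tokens, and the sample hyperlink word list are all hypothetical, and the text is assumed to be segmented into words already.

```python
def naive_link_candidates(tokens, hyperlink_words):
    """Return every distinct token that appears in the hyperlink word list,
    in order of first appearance. No relevance check is applied."""
    seen = set()
    matches = []
    for tok in tokens:
        if tok in hyperlink_words and tok not in seen:
            seen.add(tok)
            matches.append(tok)
    return matches


# Hypothetical segmented text about a palace that mentions "apple" in passing.
tokens = ["the", "palace", "museum", "sells", "apple", "juice", "near", "taihe", "hall"]
hyperlink_words = {"apple", "taihe", "palace"}
print(naive_link_candidates(tokens, hyperlink_words))  # ['palace', 'apple', 'taihe']
```

Every listed word that occurs in the text is selected, including "apple", which is incidental to the topic of this sample text.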

However, both approaches have problems in practical applications:

Approach 1) is inconvenient to implement because it requires manual operation, especially when hyperlinks must be added to large volumes of text. Approach 2) is more convenient to implement, but adding a hyperlink to every matched word may result in added hyperlinks that have little relevance to the text.

Summary of the invention

In view of this, the present invention provides a method for adding a hyperlink to text and a device for adding a hyperlink to text, which can improve the relevance of the added hyperlink to the text, and is convenient to implement.

In order to achieve the above object, the technical solution of the present invention is achieved as follows:

A method of adding hyperlinks to text, including:

Generating a hyperlink word list in advance; collecting various texts and performing word segmentation on each text to generate a feature word list; and, for each feature word, determining its co-occurrence frequency with each hyperlink word;

For each text X to which hyperlinks are to be added, performing the following processing: performing word segmentation on the text X; extracting, from the word segmentation result, the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list, and determining the weight of each extracted hyperlink word and each extracted feature word; and obtaining the final weight of each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and the determined weights;

Sorting the extracted hyperlink words in descending order of final weight, and adding hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.
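The pre-processing stage summarized above can be sketched roughly as follows. This is only an illustration under stated assumptions: the texts are assumed to be segmented already, every distinct word is kept as a feature word (the unfiltered option), the co-occurrence frequency is the text-count ratio that the detailed description gives as formula (1), and the IDF uses the natural logarithm because the source does not specify a base. The function name `preprocess` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def preprocess(corpus, hyperlink_words):
    """corpus: list of token lists (texts already segmented).
    hyperlink_words: set of hyperlink words.
    Returns (feature_words, idf, cooc), where cooc[(x, y)] is the
    co-occurrence frequency of hyperlink word x given feature word y."""
    n_docs = len(corpus)
    doc_freq = Counter()     # word -> number of texts containing it
    co_doc_freq = Counter()  # (x, y) -> number of texts containing both
    for tokens in corpus:
        words = set(tokens)  # count each word once per text
        doc_freq.update(words)
        for x in words & hyperlink_words:
            for y in words:
                co_doc_freq[(x, y)] += 1
    feature_words = sorted(doc_freq)
    # IDF: log of (number of texts / number of texts containing the word).
    idf = {w: math.log(n_docs / doc_freq[w]) for w in feature_words}
    # Co-occurrence frequency: texts with both x and y / texts with y.
    cooc = {(x, y): c / doc_freq[y] for (x, y), c in co_doc_freq.items()}
    return feature_words, idf, cooc


# Hypothetical toy corpus of four segmented texts.
corpus = [
    ["palace", "emperor", "taihe"],
    ["palace", "emperor", "museum"],
    ["apple", "juice", "fruit"],
    ["apple", "emperor"],
]
feature_words, idf, cooc = preprocess(corpus, {"taihe", "apple"})
print(cooc[("taihe", "palace")])  # 0.5
```

Pairs that never co-occur are simply absent from `cooc`, so a caller has to decide how to treat a missing pair; treating it as zero is one option.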

A device for adding hyperlinks to text, including:

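a pre-processing module, configured to generate a hyperlink word list in advance, collect various texts, generate a feature word list by performing word segmentation on each text, and determine, for each feature word, its co-occurrence frequency with each hyperlink word;

an adding module, configured to perform the following processing on each text X to which hyperlinks are to be added:

Performing word segmentation on the text X; extracting, from the word segmentation result, the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list, and determining the weight of each extracted hyperlink word and each extracted feature word; obtaining the final weight of each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and the determined weights; and sorting the extracted hyperlink words in descending order of final weight and adding hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.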

It can be seen that, with the solution of the present invention, the correlation between words is obtained by counting the co-occurrence of words in the collected texts. The final weight of each hyperlink word extracted from the text to which hyperlinks are to be added is then determined according to this correlation, and hyperlinks are added to the hyperlink words with larger final weights, thereby improving the relevance of the added hyperlinks to the text. Moreover, the solution determines automatically which words receive hyperlinks, so no manual operation is needed and it is easy to implement.

Brief description of the drawings

Figure 1 is a schematic diagram of an existing text with hyperlinks added.

Figure 2 is a flow chart of an embodiment of a method for adding a hyperlink to text according to the present invention.

Figure 3 is a schematic structural diagram of an embodiment of a device for adding a hyperlink to text according to the present invention.

Mode for carrying out the invention

To address the problems in the prior art, the present invention proposes a scheme for adding hyperlinks to text that improves the relevance of the added hyperlinks to the text and is convenient to implement.

To make the technical solutions of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Figure 2 is a flow chart of an embodiment of a method for adding a hyperlink to text according to the present invention. As shown in Figure 2, the method includes the following steps:

Step 21: Generate a hyperlink word list in advance, collect various texts, perform word segmentation on each text to generate a feature word list, and determine, for each feature word, its co-occurrence frequency with each hyperlink word.

In this step, a hyperlink word list is generated first. Which hyperlink words it contains can be decided according to actual needs. The list can be edited manually or generated automatically by a machine in some way; how it is generated is not limited.

Various texts can then be collected from the Internet in order to generate the feature word list and count the co-occurrence relationships between words. How to collect text is prior art. In theory, the more texts collected the better; the specific number can be determined according to actual needs.

Generating the feature word list and counting the co-occurrence relationships specifically includes:

1) Performing word segmentation on each collected text separately;

2) Using all the distinct words obtained by word segmentation as feature words to form the feature word list. Alternatively, to reduce the subsequent processing workload, high-frequency words, stop words, and low-frequency words may be removed from the distinct words obtained by word segmentation, and the remaining words used as feature words to form the feature word list. How to perform word segmentation and how to identify high-frequency words, stop words, and low-frequency words are prior art. In addition, after the feature word list is obtained, an inverse document frequency (IDF) value must be determined for each feature word: the number of all collected texts is divided by the number of texts in which the feature word appears, and the logarithm of the quotient is taken;

3) For each feature word, determining its co-occurrence frequency with each hyperlink word. For each feature word y and each hyperlink word x, the co-occurrence frequency P(x|y) of the two is calculated as:

$$P(x \mid y) = \frac{\text{number of co-occurrences of } x \text{ and } y}{\text{number of occurrences of } y} \qquad (1)$$

where the number of co-occurrences of x and y is the number of collected texts in which the feature word y and the hyperlink word x both appear, and the number of occurrences of y is the number of collected texts in which the feature word y appears. Alternatively, for each feature word y and each hyperlink word x, the co-occurrence frequency P(x|y) of the two is calculated as:

$$P(x \mid y) = \frac{H(x, y)}{I(x, y)} = \frac{H(x, y)}{H(x) + H(y) - H(x, y)} \qquad (2)$$

where H denotes information entropy and I denotes mutual information; the specific methods of calculating H and I are well known in the art. In practical applications, either of the two methods may be selected according to actual needs.

Step 22: For each text X to which hyperlinks are to be added, perform the processing shown in steps 23 to 26. For ease of presentation, the text X denotes any text to which hyperlinks are to be added.

Step 23: Perform word segmentation on the text X.

Step 24: Extract, from the word segmentation result, the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list, and determine the weight of each extracted hyperlink word and each extracted feature word.

The word segmentation result is compared with the hyperlink word list and the feature word list generated in step 21, and the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list are extracted.

For each extracted hyperlink word H, its weight W_H is calculated:

$$W_H = \mathrm{TF}_H \times \mathrm{IDF}_H \qquad (3)$$

where TF_H is the term frequency (TF) value of the hyperlink word H, that is, the number of occurrences of H in the text X, and IDF_H is the IDF value of H.

For each extracted feature word F, its weight W_F is calculated:

$$W_F = \mathrm{TF}_F \times \mathrm{IDF}_F \qquad (4)$$

where TF_F is the TF value of the feature word F and IDF_F is the IDF value of F. Each IDF value has already been calculated in step 21.

Step 25: Obtain the final weight of each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and the determined weights.

In this step, for each extracted hyperlink word H, its final weight W_H' is calculated:

$$W_H' = W_H \times \sum_{i=1}^{n} P(H \mid F_i) \times W_{F_i} \qquad (5)$$

where n is the number of extracted feature words and F_i is the i-th extracted feature word. The P(H|F_i) values have already been calculated in step 21.

Step 26: Sort the extracted hyperlink words in descending order of final weight, and add hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.
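Steps 23 to 26 can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function name `rank_hyperlink_words` and all sample values are hypothetical, the text is assumed to be segmented already, and two gaps the description leaves open are filled by assumption, namely that a (hyperlink word, feature word) pair with no recorded co-occurrence frequency contributes zero, and that a word without an IDF value gets weight zero.

```python
from collections import Counter


def rank_hyperlink_words(tokens, hyperlink_words, feature_words, idf, cooc, k):
    """Return (top_k, final) for one segmented text.
    cooc[(h, f)] is the precomputed co-occurrence frequency P(h | f)."""
    tf = Counter(tokens)  # term frequency of each word in this text
    links = [w for w in tf if w in hyperlink_words]  # extracted hyperlink words
    feats = [w for w in tf if w in feature_words]    # extracted feature words
    # Formulas (3) and (4): weight = TF * IDF (missing IDF -> 0, an assumption).
    weight = {w: tf[w] * idf.get(w, 0.0) for w in set(links) | set(feats)}
    final = {}
    for h in links:
        # Formula (5): W_H' = W_H * sum_i P(H | F_i) * W_Fi
        # (missing co-occurrence pair -> 0, an assumption).
        support = sum(cooc.get((h, f), 0.0) * weight[f] for f in feats)
        final[h] = weight[h] * support
    # Step 26: descending order of final weight, keep the top K.
    ranked = sorted(final, key=final.get, reverse=True)
    return ranked[:k], final


# Hypothetical inputs; in practice idf and cooc come from step 21.
tokens = ["palace", "emperor", "taihe", "emperor", "apple"]
hyperlink_words = {"taihe", "apple"}
feature_words = {"palace", "emperor", "taihe", "apple"}
idf = {"palace": 1.0, "emperor": 0.5, "taihe": 2.0, "apple": 1.0}
cooc = {("taihe", "palace"): 0.5, ("taihe", "emperor"): 0.25, ("apple", "emperor"): 0.25}

top, final = rank_hyperlink_words(tokens, hyperlink_words, feature_words, idf, cooc, k=1)
print(top)  # ['taihe']
```

With these sample values "taihe" receives a final weight of 1.5 and "apple" 0.25, so only "taihe" would be linked for K = 1, whereas plain list matching would link both.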

The specific value of K can be determined according to actual needs. How to add hyperlinks to hyperlink words is prior art.

This completes the description of the method embodiment of the present invention.

Based on the above description, Figure 3 is a schematic structural diagram of an embodiment of a device for adding a hyperlink to text according to the present invention. As shown in Figure 3, the device includes:

a pre-processing module, configured to generate a hyperlink word list in advance, collect various texts, perform word segmentation on each text to generate a feature word list, and determine, for each feature word, its co-occurrence frequency with each hyperlink word;

an adding module, configured to perform the following processing on each text X to which hyperlinks are to be added: performing word segmentation on the text X; extracting, from the word segmentation result, the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list, and determining the weight of each extracted hyperlink word and each extracted feature word; obtaining the final weight of each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and the determined weights; and sorting the extracted hyperlink words in descending order of final weight and adding hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.

The pre-processing module may specifically include:

a first processing unit, configured to generate the hyperlink word list;

a second processing unit, configured to collect various texts, generate the feature word list by performing word segmentation on each text, and determine, for each feature word, its co-occurrence frequency with each hyperlink word.

The second processing unit may in turn include (not shown, to simplify the drawing):

a first processing sub-unit, configured to collect various texts;

a second processing sub-unit, configured to perform word segmentation on each text and either use all the distinct words obtained by word segmentation as feature words to form the feature word list, or remove high-frequency words, stop words, and low-frequency words from the distinct words and use the remaining words as feature words to form the feature word list; and, for each feature word y and each hyperlink word x, to calculate the co-occurrence frequency P(x|y) of the two:

$$P(x \mid y) = \frac{\text{number of co-occurrences of } x \text{ and } y}{\text{number of occurrences of } y} \qquad (1)$$

where the number of co-occurrences of x and y is the number of collected texts in which the feature word y and the hyperlink word x both appear, and the number of occurrences of y is the number of collected texts in which the feature word y appears; or, for each feature word y and each hyperlink word x, to calculate the co-occurrence frequency P(x|y) of the two:

$$P(x \mid y) = \frac{H(x, y)}{I(x, y)} \qquad (2)$$

where H denotes information entropy and I denotes mutual information.

The adding module may specifically include:

a third processing unit, configured to perform word segmentation on the text X;

a fourth processing unit, configured to extract, from the word segmentation result, the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list, determine the weight of each extracted hyperlink word and each extracted feature word, and obtain the final weight of each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and the determined weights;

a fifth processing unit, configured to sort the extracted hyperlink words in descending order of final weight and add hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.

In addition, the second processing sub-unit may be further configured to determine an IDF value for each feature word, the IDF value being obtained by dividing the number of all collected texts by the number of texts in which the feature word appears and taking the logarithm of the quotient.

The fourth processing unit may in turn include (not shown, to simplify the drawing):

a third processing sub-unit, configured to extract, from the word segmentation result, the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list, and, for each extracted hyperlink word H, to calculate its weight W_H:

$$W_H = \mathrm{TF}_H \times \mathrm{IDF}_H \qquad (3)$$

where TF_H is the TF value of the hyperlink word H, that is, the number of occurrences of H in the text X, and IDF_H is the IDF value of H; and, for each extracted feature word F, to calculate its weight W_F:

$$W_F = \mathrm{TF}_F \times \mathrm{IDF}_F \qquad (4)$$

where TF_F is the TF value of the feature word F and IDF_F is the IDF value of F;

a fourth processing sub-unit, configured to calculate, for each extracted hyperlink word H, its final weight W_H':

$$W_H' = W_H \times \sum_{i=1}^{n} P(H \mid F_i) \times W_{F_i} \qquad (5)$$

where n is the number of extracted feature words and F_i is the i-th extracted feature word.

For the specific working process of the device embodiment shown in Figure 3, refer to the corresponding description of the method embodiment shown in Figure 2; details are not repeated here.
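Formula (2), the entropy-based alternative for the co-occurrence frequency, is given without a calculation procedure beyond the remark that H and I are well known. The sketch below shows one possible reading and rests entirely on assumptions not found in the source: x and y are modeled as binary indicators of whether a word appears in a collected text, probabilities are estimated from text counts, entropies are measured in bits, and the function returns None when the mutual information is zero, a case in which the ratio is undefined. The names `entropy` and `entropy_cooccurrence` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_cooccurrence(corpus, x, y):
    """Ratio H(x, y) / I(x, y) of formula (2), with x and y treated as
    presence/absence indicators over the collected texts (an assumption)."""
    n = len(corpus)
    joint = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for tokens in corpus:
        words = set(tokens)
        joint[(int(x in words), int(y in words))] += 1
    p_xy = [c / n for c in joint.values()]
    p_x = [(joint[(a, 0)] + joint[(a, 1)]) / n for a in (0, 1)]
    p_y = [(joint[(0, b)] + joint[(1, b)]) / n for b in (0, 1)]
    h_xy = entropy(p_xy)
    mutual = entropy(p_x) + entropy(p_y) - h_xy  # I = H(x) + H(y) - H(x, y)
    if mutual <= 1e-12:
        return None  # independent (or constant) indicators: ratio undefined
    return h_xy / mutual


# Hypothetical corpus in which "taihe" and "palace" always appear together.
corpus = [["taihe", "palace"], ["taihe", "palace"], ["tea"], ["tea"]]
print(entropy_cooccurrence(corpus, "taihe", "palace"))  # 1.0
```

Under this reading the ratio is not bounded by 1 and grows as the mutual information shrinks, so it is not directly comparable with the values produced by formula (1).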

Embodiments of the present invention also provide a machine-readable storage medium storing instructions for causing a machine to perform the method of adding a hyperlink to text described herein. Specifically, a system or apparatus may be provided with a storage medium on which software program code implementing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus may read and execute the program code stored in the storage medium. In this case, the program code read from the storage medium itself implements the functions of any of the above embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.

Examples of storage media for providing the program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tape, non-volatile memory cards, and ROM. Alternatively, the program code can be downloaded from a server computer over a communication network.

In addition, it should be clear that the functions of any of the above embodiments can be implemented not only by executing the program code read by the computer, but also by having an operating system or the like running on the computer perform some or all of the actual operations based on the instructions of the program code.

In addition, it can be understood that the program code read from the storage medium may be written into a memory provided in an expansion board inserted into the computer or in an expansion unit connected to the computer, and a CPU or the like mounted on the expansion board or the expansion unit may then perform some or all of the actual operations based on the instructions of the program code, thereby implementing the functions of any of the above embodiments.

The above is only the preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

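1. A method for adding a hyperlink to text, comprising: generating a hyperlink word list in advance, collecting various texts, performing word segmentation on each text to generate a feature word list, and determining, for each feature word, its co-occurrence frequency with each hyperlink word;

For each text X to which hyperlinks are to be added, the following processing is performed:

Performing word segmentation on the text X; extracting, from the word segmentation result, the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list, and determining the weight of each extracted hyperlink word and each extracted feature word; obtaining the final weight of each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and the determined weights;

Sorting the extracted hyperlink words in descending order of final weight, and adding hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.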

2. The method according to claim 1, wherein the generating a feature word list by performing word segmentation on each text comprises:

Using all the distinct words obtained by word segmentation as feature words;

Alternatively, removing high-frequency words, stop words, and low-frequency words from all the distinct words obtained by word segmentation, and using the remaining words as feature words.

3. The method according to claim 1, wherein the determining, for each feature word, its co-occurrence frequency with each hyperlink word comprises:

For each feature word y and each hyperlink word x, calculating the co-occurrence frequency P(x|y) of the two:

$$P(x \mid y) = \frac{\text{number of co-occurrences of } x \text{ and } y}{\text{number of occurrences of } y}$$

Where the number of co-occurrences of x and y is the number of collected texts in which the feature word y and the hyperlink word x both appear, and the number of occurrences of y is the number of collected texts in which the feature word y appears;

Or,

For each feature word y and each hyperlink word x, calculating the co-occurrence frequency P(x|y) of the two:

$$P(x \mid y) = \frac{H(x, y)}{I(x, y)}$$

Where H represents information entropy and I represents mutual information.

4. The method of claim 3, wherein

After the generating a feature word list, the method further includes: determining, for each feature word, an inverse document frequency (IDF) value, the IDF value being obtained by dividing the number of all collected texts by the number of texts in which the feature word appears and taking the logarithm of the quotient;

The determining the weight of each extracted hyperlink word and each extracted feature word comprises:

For each extracted hyperlink word H, calculating its weight W_H:

$$W_H = \mathrm{TF}_H \times \mathrm{IDF}_H$$

Where TF_H represents the term frequency (TF) value of the hyperlink word H, that is, the number of occurrences of H in the text X, and IDF_H represents the IDF value of H;

For each extracted feature word F, calculating its weight W_F:

$$W_F = \mathrm{TF}_F \times \mathrm{IDF}_F$$

Where TF_F represents the TF value of the feature word F, and IDF_F represents the IDF value of the feature word F.

5. The method according to claim 4, wherein the obtaining the final weight of each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and the determined weights comprises:

For each extracted hyperlink word H, calculating its final weight W_H':

$$W_H' = W_H \times \sum_{i=1}^{n} P(H \mid F_i) \times W_{F_i}$$

Where n represents the number of extracted feature words and F_i is the i-th extracted feature word.

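6. A device for adding a hyperlink to text, comprising: a pre-processing module, configured to generate a hyperlink word list in advance, collect various texts, generate a feature word list by performing word segmentation on each text, and determine, for each feature word, its co-occurrence frequency with each hyperlink word;

an adding module, configured to perform the following processing on each text X to which hyperlinks are to be added: performing word segmentation on the text X; extracting, from the word segmentation result, the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list, and determining the weight of each extracted hyperlink word and each extracted feature word; obtaining the final weight of each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and the determined weights; and sorting the extracted hyperlink words in descending order of final weight and adding hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.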

7. The device according to claim 6, wherein the pre-processing module comprises:

a first processing unit, configured to generate the hyperlink word list;

a second processing unit, configured to collect various texts, generate the feature word list by performing word segmentation on each text, and determine, for each feature word, its co-occurrence frequency with each hyperlink word.

8. The device according to claim 7, wherein the second processing unit comprises:

a first processing sub-unit, configured to collect various texts;

a second processing sub-unit, configured to perform word segmentation on each text and either use all the distinct words obtained by word segmentation as feature words to form the feature word list, or remove high-frequency words, stop words, and low-frequency words from all the distinct words obtained by word segmentation and use the remaining words as feature words to form the feature word list;

And, for each feature word y and each hyperlink word x, to calculate the co-occurrence frequency P(x|y) of the two:

$$P(x \mid y) = \frac{\text{number of co-occurrences of } x \text{ and } y}{\text{number of occurrences of } y}$$

where the number of co-occurrences of x and y is the number of collected texts in which the feature word y and the hyperlink word x both appear, and the number of occurrences of y is the number of collected texts in which the feature word y appears; or, for each feature word y and each hyperlink word x, to calculate the co-occurrence frequency P(x|y) of the two:

$$P(x \mid y) = \frac{H(x, y)}{I(x, y)}$$

where H represents information entropy and I represents mutual information.

9. The device according to claim 8, wherein the adding module comprises: a third processing unit, configured to perform word segmentation on the text X;

a fourth processing unit, configured to extract, from the word segmentation result, the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list, determine the weight of each extracted hyperlink word and each extracted feature word, and obtain the final weight of each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and the determined weights;

a fifth processing unit, configured to sort the extracted hyperlink words in descending order of final weight and add hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.

10. The device according to claim 9, wherein:

The second processing sub-unit is further configured to determine an inverse document frequency (IDF) value for each feature word, the IDF value being obtained by dividing the number of all collected texts by the number of texts in which the feature word appears and taking the logarithm of the quotient;

The fourth processing unit includes: a third processing sub-unit, configured to extract, from the word segmentation result, the hyperlink words appearing in the hyperlink word list and the feature words appearing in the feature word list; and, for each extracted hyperlink word H, to calculate its weight W_H:

$$W_H = \mathrm{TF}_H \times \mathrm{IDF}_H$$

where TF_H represents the term frequency (TF) value of the hyperlink word H, that is, the number of occurrences of H in the text X, and IDF_H represents the IDF value of H; and, for each extracted feature word F, to calculate its weight W_F:

$$W_F = \mathrm{TF}_F \times \mathrm{IDF}_F$$

where TF_F represents the TF value of the feature word F, and IDF_F represents the IDF value of the feature word F;

a fourth processing sub-unit, configured to calculate, for each extracted hyperlink word H, its final weight W_H':

$$W_H' = W_H \times \sum_{i=1}^{n} P(H \mid F_i) \times W_{F_i}$$

where n represents the number of extracted feature words and F_i is the i-th extracted feature word.

11. A computer storage medium, characterized in that a computer program is stored therein for performing the method of any one of claims 1 to 5.