WO2013143362A1 - Method, device, and computer storage media for adding hyperlink to text - Google Patents

Method, device, and computer storage media for adding hyperlink to text

Info

Publication number
WO2013143362A1
WO2013143362A1 PCT/CN2013/071573 CN2013071573W WO2013143362A1 WO 2013143362 A1 WO2013143362 A1 WO 2013143362A1 CN 2013071573 W CN2013071573 W CN 2013071573W WO 2013143362 A1 WO2013143362 A1 WO 2013143362A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
words
super
feature
chain
Prior art date
Application number
PCT/CN2013/071573
Other languages
French (fr)
Chinese (zh)
Inventor
贺翔
卞琪
焦峰
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to US14/349,016 (patent US9483447B2)
Priority to SG11201400690PA
Publication of WO2013143362A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/134 Hyperlinking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Definitions

  • The present invention relates to text processing technology, and more particularly to a method and apparatus for adding hyperlinks to text.
  • A hyperlink is a connection on the Internet that points to a specific target; clicking a hyperlink jumps automatically to that target.
  • FIG. 1 is a schematic diagram of existing text with hyperlinks added.
  • The words “Taihe Hall”, “Zhonghe Hall” and “Baohe Hall” carry hyperlinks.
  • A word to which a hyperlink has been added is usually called a super-chain word (hyperlink word) and is displayed in an underlined blue font.
  • The present invention provides a method and a device for adding hyperlinks to text, which improve the relevance of the added hyperlinks to the text and are simple and convenient to implement.
  • A method for adding hyperlinks to text, including:
  • for each text X to which hyperlinks are to be added, the following processing is performed: word segmentation is performed on text X; the super-chain words appearing in the super-chain word list and the feature words appearing in the feature word list are extracted from the segmentation result, and the weight of each extracted super-chain word and each extracted feature word is determined; the final weight of each extracted super-chain word is obtained from the co-occurrence frequency of each extracted feature word with each extracted super-chain word and from each determined weight;
  • the extracted super-chain words are sorted in descending order of final weight, and hyperlinks are added to the super-chain words in the top K positions after sorting, where K is a positive integer.
  • A device for adding hyperlinks to text, including:
  • a pre-processing module, configured to pre-generate a list of super-chain words, collect various texts, generate a list of feature words by performing word segmentation on each text, and, for each feature word, determine its co-occurrence frequency with each super-chain word;
  • an adding module, configured to process each text X to which hyperlinks are to be added as follows:
  • the extracted super-chain words are sorted in descending order of final weight, and hyperlinks are added to the super-chain words in the top K positions after sorting, where K is a positive integer.
  • The correlation between words is obtained by counting the co-occurrence relationships between words in the collected texts, and this correlation is then used to determine the final weight of each super-chain word extracted from the text to which hyperlinks are to be added.
  • The words to which hyperlinks are added are determined automatically, without manual operation, so the scheme is simple and convenient to implement.
  • FIG. 1 is a schematic diagram of existing text with hyperlinks added.
  • FIG. 2 is a flow chart of an embodiment of the method for adding hyperlinks to text according to the present invention.
  • FIG. 3 is a schematic structural diagram of an embodiment of the device for adding hyperlinks to text according to the present invention.
  • In view of the problems existing in the prior art, the present invention proposes a scheme for adding hyperlinks to text, which improves the relevance of the added hyperlinks to the text and is simple and convenient to implement.
  • The scheme of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
  • As shown in FIG. 2, the method includes the following steps:
  • Step 21: Pre-generate a list of super-chain words, collect various texts, generate a list of feature words by performing word segmentation on each text, and, for each feature word, determine its co-occurrence frequency with each super-chain word.
  • In this step, a list of super-chain words is generated first. Which super-chain words the list contains can be decided according to actual needs; the list can be edited manually or generated automatically by a machine, and the generation method is not limited.
  • Various texts can then be collected from the Internet and used to generate the list of feature words and to count the co-occurrence relationships between words. How to collect texts is prior art; in theory, the more texts collected the better, and the specific number can be determined according to actual needs.
  • All distinct words obtained by word segmentation are used as feature words to form the list of feature words.
  • Alternatively, high-frequency words, stop words, and low-frequency words may first be removed from the distinct words obtained by word segmentation, and the remaining words used as feature words to form the list of feature words. How to perform word segmentation and how to identify high-frequency words, stop words, and low-frequency words are prior art.
  • The inverse document frequency (IDF) value of a feature word is obtained by dividing the number of all collected texts by the number of texts in which the feature word appears and then taking the logarithm of the quotient;
  • Step 22: Process each text X to which hyperlinks are to be added according to steps 23 to 26. For ease of presentation, text X denotes any text that requires hyperlinks. Step 23: Perform word segmentation on text X.
  • Step 24: Extract from the segmentation result the super-chain words appearing in the super-chain word list and the feature words appearing in the feature word list, and determine the weight of each extracted super-chain word and each extracted feature word.
  • The segmentation result is matched against the super-chain word list and the feature word list generated in step 21 to extract the super-chain words appearing in the super-chain word list and the feature words appearing in the feature word list. Then, for each extracted super-chain word H, its weight $W_H$ is calculated:
  • $W_H = \mathrm{TF}_H \times \mathrm{IDF}_H$ (3), where $\mathrm{TF}_H$ is the term frequency (TF) value of super-chain word H, that is, the number of occurrences of H in text X, and $\mathrm{IDF}_H$ is the IDF value of H. For each extracted feature word F, its weight $W_F$ is calculated: $W_F = \mathrm{TF}_F \times \mathrm{IDF}_F$ (4).
  • Step 25: Obtain the final weight of each extracted super-chain word from the co-occurrence frequency of each extracted feature word with each extracted super-chain word and from each determined weight.
  • The $P(H \mid F_i)$ values have already been calculated in step 21.
  • Step 26: Sort the extracted super-chain words in descending order of final weight, and add hyperlinks to the super-chain words in the top K positions after sorting, where K is a positive integer.
  • As shown in FIG. 3, the device includes:
  • a pre-processing module, configured to pre-generate a list of super-chain words, collect various texts, generate a list of feature words by performing word segmentation on each text, and, for each feature word, determine its co-occurrence frequency with each super-chain word; and an adding module, configured to perform the following processing for each text X to which hyperlinks are to be added: performing word segmentation on text X; extracting from the segmentation result the super-chain words appearing in the super-chain word list and the feature words appearing in the feature word list, and determining the weight of each extracted super-chain word and each extracted feature word;
  • obtaining the final weight of each extracted super-chain word from the co-occurrence frequency of each extracted feature word with each extracted super-chain word and from each determined weight; and sorting the extracted super-chain words in descending order of final weight and adding hyperlinks to the super-chain words in the top K positions after sorting, where K is a positive integer.
  • The pre-processing module may specifically include: a first processing unit, configured to generate a list of super-chain words; and a second processing unit, configured to collect various texts, generate a list of feature words by performing word segmentation on each text, and, for each feature word, determine its co-occurrence frequency with each super-chain word.
  • The second processing unit may in turn include (not shown, to simplify the drawing): a first processing sub-unit, configured to collect various texts; and a second processing sub-unit, configured to perform word segmentation on each text and either use all distinct words obtained by word segmentation as feature words to form a list of feature words, or remove high-frequency words, stop words, and low-frequency words from the distinct words and use the remaining words as feature words to form the list of feature words; and, for each feature word y and each super-chain word x, to calculate their co-occurrence frequency $P(x \mid y)$:
  • $P(x \mid y) = \frac{\text{number of } xy \text{ co-occurrences}}{\text{number of occurrences of } y}$ (1)
  • where the number of xy co-occurrences is the number of collected texts in which feature word y and super-chain word x both appear, and the number of occurrences of y is the number of collected texts in which feature word y appears.
  • The adding module may specifically include: a third processing unit, configured to perform word segmentation on text X; a fourth processing unit, configured to extract from the segmentation result the super-chain words appearing in the super-chain word list and the feature words appearing in the feature word list, determine the weight of each extracted super-chain word and each extracted feature word, and obtain the final weight of each extracted super-chain word from the co-occurrence frequency of each extracted feature word with each extracted super-chain word and from each determined weight; and a fifth processing unit, configured to sort the extracted super-chain words in descending order of final weight and add hyperlinks to the super-chain words in the top K positions after sorting, where K is a positive integer.
  • The second processing sub-unit may be further configured to determine an IDF value for each feature word; the IDF value is obtained by dividing the number of all collected texts by the number of texts in which the feature word appears and then taking the logarithm of the quotient;
  • The fourth processing unit may further include (not shown, to simplify the drawing): a third processing sub-unit, configured to extract from the segmentation result the super-chain words appearing in the super-chain word list and the feature words appearing in the feature word list; and, for each extracted super-chain word H, to calculate its weight $W_H$:
  • $W_H = \mathrm{TF}_H \times \mathrm{IDF}_H$ (3), where $\mathrm{TF}_H$ is the TF value of super-chain word H, that is, the number of occurrences of H in text X, and $\mathrm{IDF}_H$ is the IDF value of H;
  • and, for each extracted feature word F, to calculate its weight $W_F$:
  • $W_F = \mathrm{TF}_F \times \mathrm{IDF}_F$ (4), where $\mathrm{TF}_F$ is the TF value of feature word F and $\mathrm{IDF}_F$ is the IDF value of F; and a fourth processing sub-unit, configured to calculate, for each extracted super-chain word H, its final weight $W_H'$:
  • $W_H' = W_H \times \sum_{i=1}^{n} P(H \mid F_i) \times W_{F_i}$ (5), where n is the number of extracted feature words.
  • Embodiments of the present invention also provide a machine-readable storage medium storing instructions for causing a machine to perform the method of adding hyperlinks to text described herein.
  • A system or apparatus may be provided with a storage medium on which software program code implementing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus reads and executes the program code stored in the storage medium.
  • The program code read from the storage medium itself implements the functions of any of the above embodiments, and thus the program code and the storage medium storing it constitute part of the present invention.
  • Examples of storage media for providing program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tape, non-volatile memory cards, and ROM.
  • The program code can also be downloaded from a server computer over a communication network.
  • The program code read from the storage medium may be written into a memory provided on an expansion board inserted into the computer or in an expansion unit connected to the computer, after which instructions based on the program code cause a CPU or the like mounted on the expansion board or expansion unit to perform part or all of the actual operations, thereby realizing the functions of any of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Methods and devices for adding hyperlinks to text are disclosed. A hyperlink word list and a characteristic word list are generated in advance and, for each characteristic word, its co-occurrence frequency with each hyperlink word is determined. For each text X to which hyperlinks are to be added, word segmentation is performed; the hyperlink words that occur in the hyperlink word list and the characteristic words that occur in the characteristic word list are extracted from the segmentation result; the weights of each extracted hyperlink word and each extracted characteristic word are determined; and the final weight of each extracted hyperlink word is obtained from the co-occurrence frequency of each extracted characteristic word with each extracted hyperlink word and from the weights. The extracted hyperlink words are sorted in descending order of final weight, and hyperlinks are added to the first K hyperlink words after sorting, where K is a positive integer. The solution improves the relevance of the added hyperlinks to the text and is easy to implement.

Description

A method, device, and computer storage medium for adding hyperlinks to text
This application claims priority to Chinese patent application No. 201210087642.0, entitled "A method and device for adding hyperlinks to text", filed with the Chinese Patent Office on March 29, 2012, the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to text processing technology, and in particular to a method and device for adding hyperlinks to text.
Background of the invention
A hyperlink is a connection on the Internet that points to a specific target; clicking a hyperlink jumps automatically to that target.
Different texts can be linked together by adding hyperlinks to them. FIG. 1 is a schematic diagram of existing text with hyperlinks added, in which words such as "Taihe Hall" (太和殿), "Zhonghe Hall" (中和殿) and "Baohe Hall" (保和殿) carry hyperlinks. In practice, a word to which a hyperlink has been added is usually called a super-chain word (hyperlink word) and is displayed in an underlined blue font.
In the prior art, hyperlinks are usually added to text in one of the following ways:
1) For a text to which hyperlinks are to be added, manually decide which of its words receive hyperlinks;
2) Pre-generate a list of super-chain words and, for a text to which hyperlinks are to be added, match the text against the list and add a hyperlink to every word that appears in the list.
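For contrast with the scheme proposed later, prior-art approach 2) amounts to plain list matching. The following is a minimal sketch under stated assumptions: the function name `link_all_matches`, the dictionary from word to target URL, and the HTML anchor output are all illustrative and do not come from the source.

```python
def link_all_matches(tokens, link_targets):
    """Wrap every token that appears in link_targets in an HTML anchor.

    tokens       -- word segmentation result of the text
    link_targets -- dict mapping each listed word to its target URL
                    (a hypothetical structure, not specified in the source)
    """
    linked = []
    for token in tokens:
        if token in link_targets:
            # Every match is linked, regardless of how relevant the word
            # is to the rest of the text; no HTML escaping is attempted.
            linked.append(f'<a href="{link_targets[token]}">{token}</a>')
        else:
            linked.append(token)
    return linked
```

Because every listed word is linked wherever it occurs, nothing in this approach measures how closely a word relates to the surrounding text.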
However, both approaches have problems in practice:
Approach 1) requires manual operation and is therefore inconvenient, especially when hyperlinks must be added to a large volume of text. Approach 2) is easier to implement, but it simply adds a hyperlink to every matched word, so the added hyperlinks may have low relevance to the text.
Summary of the invention
In view of this, the present invention provides a method and a device for adding hyperlinks to text, which improve the relevance of the added hyperlinks to the text and are simple and convenient to implement.
To achieve the above object, the technical solution of the present invention is implemented as follows:
A method for adding hyperlinks to text, comprising:
pre-generating a list of super-chain words, collecting various texts, generating a list of feature words by performing word segmentation on each text, and, for each feature word, determining its co-occurrence frequency with each super-chain word;
for each text X to which hyperlinks are to be added, performing the following processing: performing word segmentation on text X; extracting from the segmentation result the super-chain words appearing in the super-chain word list and the feature words appearing in the feature word list, and determining the weight of each extracted super-chain word and each extracted feature word; obtaining the final weight of each extracted super-chain word from the co-occurrence frequency of each extracted feature word with each extracted super-chain word and from each determined weight;
sorting the extracted super-chain words in descending order of final weight and adding hyperlinks to the super-chain words in the top K positions after sorting, K being a positive integer.
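The pre-processing part of this method might be sketched as follows. This is an illustration under several assumptions, not the patented implementation: the whitespace `tokenize` function stands in for a real word segmenter, every distinct word is kept as a feature word (no stop-word or frequency filtering), and the co-occurrence frequency is computed from text-level counts, which is the first of the two options the detailed description gives as formula (1). The names `preprocess`, `idf`, and `cooc` are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def tokenize(text):
    """Placeholder word segmenter: lowercases and splits on whitespace."""
    return text.lower().split()


def preprocess(corpus, hyperlink_words):
    """Build feature-word statistics from a collection of texts.

    Returns (idf, cooc) where
      idf[w]       = log(N / number of texts containing w)
      cooc[(x, y)] = texts containing both x and y / texts containing y
    for hyperlink (super-chain) word x and feature word y.
    """
    hyperlink_words = set(hyperlink_words)
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets)

    # Number of texts in which each distinct word appears.
    df = Counter(word for words in doc_sets for word in words)
    idf = {word: math.log(n_docs / count) for word, count in df.items()}

    # Number of texts in which hyperlink word x and feature word y co-occur.
    both = Counter()
    for words in doc_sets:
        for x, y in product(words & hyperlink_words, words):
            both[(x, y)] += 1

    cooc = {(x, y): count / df[y] for (x, y), count in both.items()}
    return idf, cooc
```

Pairs that never co-occur are simply absent from `cooc`, so a caller can treat a missing key as a frequency of zero.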
A device for adding hyperlinks to text, comprising:
a pre-processing module, configured to pre-generate a list of super-chain words, collect various texts, generate a list of feature words by performing word segmentation on each text, and, for each feature word, determine its co-occurrence frequency with each super-chain word;
an adding module, configured to perform the following processing for each text X to which hyperlinks are to be added:
performing word segmentation on text X; extracting from the segmentation result the super-chain words appearing in the super-chain word list and the feature words appearing in the feature word list, and determining the weight of each extracted super-chain word and each extracted feature word; obtaining the final weight of each extracted super-chain word from the co-occurrence frequency of each extracted feature word with each extracted super-chain word and from each determined weight;
sorting the extracted super-chain words in descending order of final weight and adding hyperlinks to the super-chain words in the top K positions after sorting, K being a positive integer.
It can be seen that the scheme of the present invention obtains the correlation between words by counting the co-occurrence relationships between words in the collected texts, uses that correlation to determine the final weight of each super-chain word extracted from the text to which hyperlinks are to be added, and adds hyperlinks to the super-chain words with larger final weights, thereby improving the relevance of the added hyperlinks to the text. Moreover, the scheme determines automatically which words receive hyperlinks, without manual operation, and is simple and convenient to implement.
Brief description of the drawings
FIG. 1 is a schematic diagram of existing text with hyperlinks added.
FIG. 2 is a flow chart of an embodiment of the method for adding hyperlinks to text according to the present invention.
FIG. 3 is a schematic structural diagram of an embodiment of the device for adding hyperlinks to text according to the present invention.
Mode for carrying out the invention
In view of the problems existing in the prior art, the present invention proposes a scheme for adding hyperlinks to text, which improves the relevance of the added hyperlinks to the text and is simple and convenient to implement.
To make the technical solution of the present invention clearer, the scheme is described in further detail below with reference to the accompanying drawings and embodiments.
FIG. 2 is a flow chart of an embodiment of the method for adding hyperlinks to text according to the present invention. As shown in FIG. 2, the method includes the following steps:
Step 21: Pre-generate a list of super-chain words, collect various texts, generate a list of feature words by performing word segmentation on each text, and, for each feature word, determine its co-occurrence frequency with each super-chain word.
In this step, a list of super-chain words is generated first. Which super-chain words the list contains can be decided according to actual needs; the list can be edited manually or generated automatically by a machine, and the generation method is not limited.
Various texts can then be collected from the Internet and used to generate the list of feature words and to count the co-occurrence relationships between words. How to collect texts is prior art; in theory, the more texts collected the better, and the specific number can be determined according to actual needs.
This processing includes:
1) Performing word segmentation on each collected text;
2) Using all distinct words obtained by word segmentation as feature words to form a list of feature words; or, to reduce the subsequent processing workload, first removing high-frequency words, stop words, low-frequency words, and the like from the distinct words and using the remaining words as feature words to form the list of feature words.
How to perform word segmentation and how to identify high-frequency words, stop words, and low-frequency words are prior art.
In addition, after the list of feature words is obtained, an inverse document frequency (IDF) value must be determined for each feature word. The IDF value is obtained by dividing the number of all collected texts by the number of texts in which the feature word appears and then taking the logarithm of the quotient;
3) For each feature word, determining its co-occurrence frequency with each super-chain word:
for each feature word y and each super-chain word x, the co-occurrence frequency $P(x \mid y)$ is calculated as
$P(x \mid y) = \frac{\text{number of } xy \text{ co-occurrences}}{\text{number of occurrences of } y}$ (1)
where the number of xy co-occurrences is the number of collected texts in which feature word y and super-chain word x both appear, and the number of occurrences of y is the number of collected texts in which feature word y appears;
or, for each feature word y and each super-chain word x, the co-occurrence frequency $P(x \mid y)$ is calculated as
$P(x \mid y) = \frac{H(x,y)}{I(x,y)} = \frac{H(x,y)}{H(x) + H(y) - H(x,y)}$ (2)
where H denotes information entropy and I denotes mutual information; the specific ways of calculating H and I are well known in the art.
In practice, either of the two ways may be chosen according to actual needs.
Step 22: Process each text X to which hyperlinks are to be added according to steps 23 to 26.
For ease of presentation, text X denotes any text that requires hyperlinks.
Step 23: Perform word segmentation on text X.
Step 24: Extract from the segmentation result the super-chain words appearing in the super-chain word list and the feature words appearing in the feature word list, and determine the weight of each extracted super-chain word and each extracted feature word.
The segmentation result is matched against the super-chain word list and the feature word list generated in step 21 to extract the super-chain words appearing in the super-chain word list and the feature words appearing in the feature word list.
For each extracted super-chain word H, its weight $W_H$ is calculated:
$W_H = \mathrm{TF}_H \times \mathrm{IDF}_H$ (3)
where $\mathrm{TF}_H$ is the term frequency (TF) value of super-chain word H, that is, the number of occurrences of H in text X, and $\mathrm{IDF}_H$ is the IDF value of H.
For each extracted feature word F, its weight $W_F$ is calculated:
$W_F = \mathrm{TF}_F \times \mathrm{IDF}_F$ (4)
where $\mathrm{TF}_F$ is the TF value of feature word F and $\mathrm{IDF}_F$ is the IDF value of F (each IDF value has been calculated in step 21).
Step 25: Obtain the final weight of each extracted super-chain word from the co-occurrence frequency of each extracted feature word with each extracted super-chain word and from each determined weight.
In this step, for each extracted super-chain word H, its final weight $W_H'$ is calculated:
$W_H' = W_H \times \sum_{i=1}^{n} P(H \mid F_i) \times W_{F_i}$ (5)
where n is the number of extracted feature words, and the $P(H \mid F_i)$ values have been calculated in step 21.
Step 26: Sort the extracted super-chain words in descending order of final weight, and add hyperlinks to the super-chain words in the top K positions after sorting, where K is a positive integer.
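Steps 23 to 26 can be illustrated with a short sketch. It is a hedged example rather than the patented implementation: the function name `rank_hyperlink_words` and its parameters are hypothetical, the text is assumed to be already segmented into `tokens`, a super-chain (hyperlink) word with no stored IDF value is given IDF 0.0, a missing co-occurrence pair counts as 0.0, and ties are broken alphabetically. None of these choices is specified in the source.

```python
from collections import Counter


def rank_hyperlink_words(tokens, hyperlink_words, idf, cooc, k):
    """Return the top-k hyperlink words of one segmented text.

    tokens          -- word segmentation result of text X (step 23)
    hyperlink_words -- set holding the pre-generated hyperlink word list
    idf             -- dict: word -> IDF value computed in step 21
    cooc            -- dict: (hyperlink word, feature word) -> P(H|F)
    """
    tf = Counter(tokens)

    # Step 24: keep only listed words and weight each by TF * IDF.
    feature_weights = {w: tf[w] * idf[w] for w in tf if w in idf}
    link_weights = {
        h: tf[h] * idf.get(h, 0.0) for h in tf if h in hyperlink_words
    }

    # Step 25: W'_H = W_H * sum over extracted features of P(H|F_i) * W_Fi.
    final = {}
    for h, w_h in link_weights.items():
        support = sum(
            cooc.get((h, f), 0.0) * w_f for f, w_f in feature_weights.items()
        )
        final[h] = w_h * support

    # Step 26: sort by final weight, largest first, and keep the first k.
    ranked = sorted(final, key=lambda h: (-final[h], h))
    return ranked[:k]
```

A hyperlink word that appears in the text but co-occurs with none of the extracted feature words receives a final weight of zero and sinks to the bottom of the ranking, which is how the scheme favors words related to the rest of the text.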
K的具体取值可根据实际需要而定。 另外, 如何为超链词添加超级 链接为现有技术。 至此, 即完成了关于本发明方法实施例的介绍。 基于上述介绍, 图 3为本发明为文本添加超级链接的装置实施例的 组成结构示意图。 如图 3所示, 包括: 预处理模块, 用于预先生成一个超链词列表, 并收集各种文本, 通 过对各文本进行切词处理, 生成一个特征词列表, 针对每个特征词, 分 别确定其与每个超链词的共现频率; 添加模块, 用于针对每个待添加超级链接的文本 X, 分别进行如下 处理: 对文本 X进行切词处理; 表中出现的特征词, 并确定每个提取出的超链词和每个提取出的特征词 的权值; The specific value of K can be determined according to actual needs. In addition, how to add hyperlinks to hyperlinks is prior art. So far, an introduction to the embodiment of the method of the present invention has been completed. Based on the above description, FIG. 3 is a schematic structural diagram of an embodiment of an apparatus for adding a hyperlink to text according to the present invention. As shown in FIG. 3, the method includes: a pre-processing module, configured to pre-generate a list of super-chain words, collect various texts, and perform a word-cutting process on each text to generate a list of feature words, for each feature word, respectively Determining the co-occurrence frequency of each super-chain word; adding a module for respectively processing the text X to be added with each hyperlink, and performing the following processing on the text X; a feature word appearing in the table, and determining the weight of each extracted superchain word and each extracted feature word;
个确定出的权值, 分别得到每个提取出的超链词的最终权值; 按照最终权值由大到小的顺序对提取出的各超链词进行排序, 为排 序后处于前 K位的超链词添加超级链接, K为正整数。 其中, 预处理模块中可具体包括: 第一处理单元, 用于生成一个超链词列表; 第二处理单元, 用于收集各种文本, 通过对各文本进行切词处理, 生成一个特征词列表, 针对每个特征词, 分别确定其与每个超链词的共 现频率。 第二处理单元中又可具体包括(为筒化附图, 未图示): 第一处理子单元, 用于收集各种文本; 第二处理子单元, 用于对各文本进行切词处理, 将切词得到的所有 不重复词均作为特征词, 组成一个特征词列表, 或者, 从切词得到的所 有不重复词中去除高频词、 停用词和低频词, 将剩余词作为特征词, 组 成一个特征词列表; 并且, 针对每个特征词 y和每个超链词 X , 分别计算两者的共现频 率 P(xly): The determined weights are respectively obtained as the final weights of each extracted super-chain word; the extracted super-chain words are sorted according to the final weights in descending order, and are ranked in the top K position after sorting. Hyperchain words add hyperlinks, K is a positive integer. The pre-processing module may specifically include: a first processing unit, configured to generate a super-chain list; a second processing unit, configured to collect various texts, and generate a feature word list by performing word-cutting processing on each text For each feature word, determine its co-occurrence frequency with each super-chain word. The second processing unit may specifically include (for the drawing, not shown): a first processing sub-unit for collecting various texts; and a second processing sub-unit for performing word-cutting processing on each text, All the non-repeating words obtained by the cut word are used as feature words to form a list of feature words, or high frequency words, stop words and low frequency words are removed from all the non-repeated words obtained by the cut words, and the remaining words are used as feature words. , composing a list of feature words; and, for each feature word y and each hyperlink word X, respectively calculate the co-occurrence frequency P(xly) of the two:
$P(x \mid y) = \dfrac{\text{number of co-occurrences of } x \text{ and } y}{\text{number of occurrences of } y}$; (1)
where the number of co-occurrences of x and y is the number of collected texts in which both the feature word y and the hyperlink word x appear, and the number of occurrences of y is the number of collected texts in which the feature word y appears;
or, for each feature word y and each hyperlink word x, calculate the co-occurrence frequency P(x|y) of the two as:
$P(x \mid y) = H(x, y) / I(x, y)$; (2)
where H denotes information entropy and I denotes mutual information.
The adding module may specifically include:
a third processing unit, configured to perform word segmentation on text X;
a fourth processing unit, configured to extract from the segmentation result the hyperlink words that appear in the hyperlink word list and the feature words that appear in the feature word list, and determine a weight for each extracted hyperlink word and each extracted feature word; and to obtain a final weight for each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and each determined weight;
a fifth processing unit, configured to sort the extracted hyperlink words in descending order of final weight and add hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.
In addition, the second processing subunit may be further configured to determine, for each feature word, its inverse document frequency (IDF) value, which is obtained by dividing the total number of collected texts by the number of texts in which the feature word appears and taking the logarithm of the quotient.
The fourth processing unit may in turn further include (not shown, to simplify the drawing):
a third processing subunit, configured to extract from the segmentation result the hyperlink words that appear in the hyperlink word list and the feature words that appear in the feature word list; and, for each extracted hyperlink word H, calculate its weight $W_H$:
$W_H = \mathrm{TF}_H \cdot \mathrm{IDF}_H$; (3)
where $\mathrm{TF}_H$ denotes the term frequency (TF) value of the hyperlink word H, that is, the number of times H occurs in text X, and $\mathrm{IDF}_H$ denotes the IDF value of H; and, for each extracted feature word F, calculate its weight $W_F$:
$W_F = \mathrm{TF}_F \cdot \mathrm{IDF}_F$; (4)
where $\mathrm{TF}_F$ denotes the TF value of the feature word F and $\mathrm{IDF}_F$ denotes the IDF value of F;
a fourth processing subunit, configured to calculate, for each extracted hyperlink word H, its final weight $W_H'$:
$W_H' = W_H \cdot \sum_{i=1}^{n} P(H \mid F_i) \cdot W_{F_i}$; (5)
where n denotes the number of extracted feature words and $F_i$ denotes the i-th extracted feature word.
For the specific workflow of the apparatus embodiment shown in FIG. 3, refer to the corresponding description of the method embodiment shown in FIG. 2; it is not repeated here.
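The computation performed by the preprocessing and adding modules can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation of the embodiment: texts are assumed to be already segmented into lists of words, the co-occurrence frequency follows formula (1), the IDF uses the natural logarithm, and a word that never appears in the collected texts is given an IDF of 0. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def doc_frequencies(corpus):
    """For every word, count the number of collected texts containing it."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return df


def cooccurrence(corpus, link_words, feature_words):
    """Formula (1): P(x|y) = texts containing both x and y / texts containing y.

    link_words and feature_words are sets (assumption of this sketch).
    """
    both = Counter()
    has_y = Counter()
    for tokens in corpus:
        present = set(tokens)
        for y in feature_words & present:
            has_y[y] += 1
            for x in link_words & present:
                both[(x, y)] += 1
    return {(x, y): c / has_y[y] for (x, y), c in both.items()}


def idf(word, df, n_texts):
    """IDF: logarithm of (number of texts / number of texts containing word).

    Returning 0.0 for unseen words is an assumption of this sketch.
    """
    return math.log(n_texts / df[word]) if df[word] else 0.0


def rank_link_words(text, corpus, link_words, feature_words, k):
    """Return the top-k hyperlink words of a segmented text by final weight."""
    df = doc_frequencies(corpus)
    p = cooccurrence(corpus, link_words, feature_words)
    tf = Counter(text)  # term frequency within the text being processed
    n_texts = len(corpus)
    # Formulas (3) and (4): weight = TF * IDF
    weight = {w: tf[w] * idf(w, df, n_texts) for w in tf}
    found_links = [w for w in tf if w in link_words]
    found_feats = [w for w in tf if w in feature_words]
    # Formula (5): W_H' = W_H * sum_i P(H|F_i) * W_{F_i}
    final = {
        h: weight[h] * sum(p.get((h, f), 0.0) * weight[f] for f in found_feats)
        for h in found_links
    }
    return sorted(final, key=final.get, reverse=True)[:k]
```

In a real system the document frequencies and co-occurrence table would be computed once during preprocessing and reused for every text; the sketch recomputes them on each call only to stay short.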
An embodiment of the present invention further provides a machine-readable storage medium storing instructions for causing a machine to perform the method of adding hyperlinks to text as described herein. Specifically, a system or apparatus equipped with a storage medium may be provided, the storage medium storing software program code that implements the functions of any of the above embodiments, and the computer (or CPU or MPU) of the system or apparatus is caused to read out and execute the program code stored in the storage medium. In this case, the program code read from the storage medium itself implements the functions of any of the above embodiments, so the program code and the storage medium storing it form part of the present invention.
Examples of storage media for providing the program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, and DVD+RW), magnetic tape, non-volatile memory cards, and ROM. Alternatively, the program code may be downloaded from a server computer over a communication network.
Furthermore, it should be clear that the functions of any of the above embodiments may be implemented not only by executing the program code read out by the computer, but also by having an operating system or the like running on the computer perform part or all of the actual operations based on the instructions of the program code.
It can also be understood that the program code read from the storage medium may be written to a memory provided on an expansion board inserted into the computer, or to a memory provided in an expansion unit connected to the computer, after which a CPU or the like mounted on the expansion board or expansion unit performs part or all of the actual operations based on the instructions of the program code, thereby implementing the functions of any of the above embodiments.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for adding hyperlinks to text, characterized by comprising:
generating a hyperlink word list in advance, collecting various texts, performing word segmentation on each text to generate a feature word list, and determining, for each feature word, its co-occurrence frequency with each hyperlink word;
for each text X to which hyperlinks are to be added, performing the following processing:
performing word segmentation on text X;
extracting from the segmentation result the hyperlink words that appear in the hyperlink word list and the feature words that appear in the feature word list, and determining a weight for each extracted hyperlink word and each extracted feature word;
obtaining a final weight for each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and each determined weight;
sorting the extracted hyperlink words in descending order of final weight, and adding hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.
2. The method according to claim 1, characterized in that performing word segmentation on each text to generate a feature word list comprises:
taking all distinct words obtained by segmentation as feature words;
or removing high-frequency words, stop words, and low-frequency words from all distinct words obtained by segmentation, and taking the remaining words as feature words.
3. The method according to claim 1, characterized in that determining, for each feature word, its co-occurrence frequency with each hyperlink word comprises:
for each feature word y and each hyperlink word x, calculating the co-occurrence frequency P(x|y) of the two as:
$P(x \mid y) = \dfrac{\text{number of co-occurrences of } x \text{ and } y}{\text{number of occurrences of } y}$;
where the number of co-occurrences of x and y is the number of collected texts in which both the feature word y and the hyperlink word x appear, and the number of occurrences of y is the number of collected texts in which the feature word y appears;
or,
for each feature word y and each hyperlink word x, calculating the co-occurrence frequency P(x|y) of the two as:
$P(x \mid y) = H(x, y) / I(x, y)$;
where H denotes information entropy and I denotes mutual information.
4. The method according to claim 3, characterized in that:
after generating the feature word list, the method further comprises: determining, for each feature word, its inverse document frequency (IDF) value, the IDF value being obtained by dividing the total number of collected texts by the number of texts in which the feature word appears and taking the logarithm of the quotient;
determining a weight for each extracted hyperlink word and each extracted feature word comprises:
for each extracted hyperlink word H, calculating its weight $W_H$:
$W_H = \mathrm{TF}_H \cdot \mathrm{IDF}_H$;
where $\mathrm{TF}_H$ denotes the term frequency (TF) value of the hyperlink word H, that is, the number of times H occurs in text X, and $\mathrm{IDF}_H$ denotes the IDF value of H;
for each extracted feature word F, calculating its weight $W_F$:
$W_F = \mathrm{TF}_F \cdot \mathrm{IDF}_F$;
where $\mathrm{TF}_F$ denotes the TF value of the feature word F and $\mathrm{IDF}_F$ denotes the IDF value of F.
5. The method according to claim 4, characterized in that obtaining a final weight for each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and each determined weight comprises:
for each extracted hyperlink word H, calculating its final weight $W_H'$:
$W_H' = W_H \cdot \sum_{i=1}^{n} P(H \mid F_i) \cdot W_{F_i}$;
where n denotes the number of extracted feature words.
6. An apparatus for adding hyperlinks to text, characterized by comprising:
a preprocessing module, configured to generate a hyperlink word list in advance, collect various texts, perform word segmentation on each text to generate a feature word list, and determine, for each feature word, its co-occurrence frequency with each hyperlink word;
an adding module, configured to perform the following processing for each text X to which hyperlinks are to be added:
performing word segmentation on text X;
extracting from the segmentation result the hyperlink words that appear in the hyperlink word list and the feature words that appear in the feature word list, and determining a weight for each extracted hyperlink word and each extracted feature word;
obtaining a final weight for each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and each determined weight;
sorting the extracted hyperlink words in descending order of final weight, and adding hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.
7. The apparatus according to claim 6, characterized in that the preprocessing module comprises:
a first processing unit, configured to generate a hyperlink word list;
a second processing unit, configured to collect various texts, perform word segmentation on each text to generate a feature word list, and determine, for each feature word, its co-occurrence frequency with each hyperlink word.
8. The apparatus according to claim 7, characterized in that the second processing unit comprises:
a first processing subunit, configured to collect various texts;
a second processing subunit, configured to perform word segmentation on each text and either take all distinct words obtained by segmentation as feature words to form a feature word list, or remove high-frequency words, stop words, and low-frequency words from all distinct words obtained by segmentation and take the remaining words as feature words to form a feature word list;
and, for each feature word y and each hyperlink word x, calculate the co-occurrence frequency P(x|y) of the two as: $P(x \mid y) = \dfrac{\text{number of co-occurrences of } x \text{ and } y}{\text{number of occurrences of } y}$, where the number of co-occurrences of x and y is the number of collected texts in which both the feature word y and the hyperlink word x appear, and the number of occurrences of y is the number of collected texts in which the feature word y appears; or, for each feature word y and each hyperlink word x, calculate the co-occurrence frequency P(x|y) of the two as: $P(x \mid y) = H(x, y) / I(x, y)$, where H denotes information entropy and I denotes mutual information.
9. The apparatus according to claim 8, characterized in that the adding module comprises:
a third processing unit, configured to perform word segmentation on text X;
a fourth processing unit, configured to extract from the segmentation result the hyperlink words that appear in the hyperlink word list and the feature words that appear in the feature word list, and determine a weight for each extracted hyperlink word and each extracted feature word; and to obtain a final weight for each extracted hyperlink word according to the co-occurrence frequency of each extracted feature word with each extracted hyperlink word and each determined weight;
a fifth processing unit, configured to sort the extracted hyperlink words in descending order of final weight and add hyperlinks to the top K hyperlink words after sorting, where K is a positive integer.
10. The apparatus according to claim 9, characterized in that:
the second processing subunit is further configured to determine, for each feature word, its inverse document frequency (IDF) value, the IDF value being obtained by dividing the total number of collected texts by the number of texts in which the feature word appears and taking the logarithm of the quotient;
the fourth processing unit comprises:
a third processing subunit, configured to extract from the segmentation result the hyperlink words that appear in the hyperlink word list and the feature words that appear in the feature word list; for each extracted hyperlink word H, calculate its weight $W_H$: $W_H = \mathrm{TF}_H \cdot \mathrm{IDF}_H$, where $\mathrm{TF}_H$ denotes the term frequency (TF) value of the hyperlink word H, that is, the number of times H occurs in text X, and $\mathrm{IDF}_H$ denotes the IDF value of H; and, for each extracted feature word F, calculate its weight $W_F$: $W_F = \mathrm{TF}_F \cdot \mathrm{IDF}_F$, where $\mathrm{TF}_F$ denotes the TF value of the feature word F and $\mathrm{IDF}_F$ denotes the IDF value of F;
a fourth processing subunit, configured to calculate, for each extracted hyperlink word H, its final weight $W_H'$: $W_H' = W_H \cdot \sum_{i=1}^{n} P(H \mid F_i) \cdot W_{F_i}$, where n denotes the number of extracted feature words.
11. A computer storage medium, characterized in that a computer program is stored therein, the computer program being used to perform the method according to any one of claims 1 to 5.
PCT/CN2013/071573 2012-03-29 2013-02-08 Method, device, and computer storage media for adding hyperlink to text WO2013143362A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US14/349,016 US9483447B2 (en) 2012-03-29 2013-02-08 Method, device, and computer storage media for adding hyperlink to text
SG11201400690PA SG11201400690PA (en) 2012-03-29 2013-02-08 Method, device, and computer storage media for adding hyperlink to text

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210087642.0 2012-03-29
CN201210087642.0A CN103365831B (en) 2012-03-29 2012-03-29 A kind of method and apparatus that hyperlink is added for text

Publications (1)

Publication Number Publication Date
WO2013143362A1 true WO2013143362A1 (en) 2013-10-03

Family

ID=49258175

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/071573 WO2013143362A1 (en) 2012-03-29 2013-02-08 Method, device, and computer storage media for adding hyperlink to text

Country Status (4)

Country Link
US (1) US9483447B2 (en)
CN (1) CN103365831B (en)
SG (1) SG11201400690PA (en)
WO (1) WO2013143362A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110096695A (en) * 2018-01-30 2019-08-06 腾讯科技(深圳)有限公司 Hyperlink label method and apparatus, file classification method and device

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302807B (en) * 2014-06-06 2020-01-10 腾讯科技(深圳)有限公司 Method and device for acquiring information category
CN110019834A (en) * 2017-10-16 2019-07-16 百度在线网络技术(北京)有限公司 A kind of method, apparatus, server and medium marking text chain
CN112468761A (en) * 2020-10-31 2021-03-09 浙江云优家智能科技有限公司 Intelligent conference recording system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1801142A (en) * 2004-12-31 2006-07-12 北京联动在线通讯科技有限公司 Method for implementing quick link of specific matching words displayed on screen
CN101000628A (en) * 2006-01-13 2007-07-18 国际商业机器公司 Wrong hyperlink detection equipment and method

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US5694594A (en) * 1994-11-14 1997-12-02 Chang; Daniel System for linking hypermedia data objects in accordance with associations of source and destination data objects and similarity threshold without using keywords or link-difining terms
US5920859A (en) * 1997-02-05 1999-07-06 Idd Enterprises, L.P. Hypertext document retrieval system and method
US7333966B2 (en) * 2001-12-21 2008-02-19 Thomson Global Resources Systems, methods, and software for hyperlinking names
EP1817693A1 (en) * 2004-09-29 2007-08-15 Panscient Pty Ltd. Machine learning system
CN100483408C (en) * 2005-07-29 2009-04-29 国际商业机器公司 Method and apparatus for establishing link structure between multiple documents
US7590626B2 (en) * 2006-10-30 2009-09-15 Microsoft Corporation Distributional similarity-based models for query correction
US20080306918A1 (en) * 2007-03-30 2008-12-11 Albert Mons System and method for wikifying content for knowledge navigation and discovery
US9690786B2 (en) * 2008-03-17 2017-06-27 Tivo Solutions Inc. Systems and methods for dynamically creating hyperlinks associated with relevant multimedia content
US8150829B2 (en) * 2008-04-11 2012-04-03 Fujitsu Limited Facilitating display of an interactive and dynamic cloud of terms related to one or more input terms
US20110047166A1 (en) * 2009-08-20 2011-02-24 Innography, Inc. System and methods of relating trademarks and patent documents
US8255786B1 (en) * 2010-04-09 2012-08-28 Wal-Mart Stores, Inc. Including hyperlinks in a document
US8850301B1 (en) * 2012-03-05 2014-09-30 Google Inc. Linking to relevant content from an ereader
US9378455B2 (en) * 2012-05-10 2016-06-28 Yan M. Yufik Systems and methods for a computer understanding multi modal data streams

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110096695A (en) * 2018-01-30 2019-08-06 腾讯科技(深圳)有限公司 Hyperlink label method and apparatus, file classification method and device
CN110096695B (en) * 2018-01-30 2023-01-03 腾讯科技(深圳)有限公司 Hyperlink marking method and device and text classification method and device

Also Published As

Publication number Publication date
US20140250356A1 (en) 2014-09-04
SG11201400690PA (en) 2014-09-26
CN103365831B (en) 2017-07-21
US9483447B2 (en) 2016-11-01
CN103365831A (en) 2013-10-23

Similar Documents

Publication Publication Date Title
CN103336766B (en) Short text garbage identification and modeling method and device
CN101950284B (en) Chinese word segmentation method and system
CN102693279B (en) Method, device and system for fast calculating comment similarity
US9043356B2 (en) Document processing method and system
US20140149102A1 (en) Personalized machine translation via online adaptation
CN103123624B (en) Determine method and device, searching method and the device of centre word
CN107357777B (en) Method and device for extracting label information
CN104951469B (en) Optimize the method and apparatus of corpus
CN108845982A (en) A kind of Chinese word cutting method of word-based linked character
CN102270244B (en) Method for quickly extracting webpage content key words based on core sentence
CN103631938A (en) Method and device for automatically expanding segmentation dictionary
CN101308512B (en) Mutual translation pair extraction method and device based on web page
CN105320734A (en) Web page core content extraction method
WO2013143362A1 (en) Method, device, and computer storage media for adding hyperlink to text
CN102999533A (en) Textspeak identification method and system
CN105956119A (en) Patent write auxiliary system and method
WO2015024429A1 (en) Method and device for acquiring movie and television subject from webpage
US8108391B1 (en) Identifying non-compositional compounds
CN112528640A (en) Automatic domain term extraction method based on abnormal subgraph detection
CN105718441B (en) The method and apparatus of the similar UI components of function between a kind of lookup different platform
JP5757551B2 (en) Semantic classification assignment device, semantic classification provision method, semantic classification provision program
CN104298752B (en) The automatic extending method of program code initialism based on web page resource
CN109325159A (en) A kind of microblog hot event method for digging
CN113934910A (en) Automatic optimization and updating theme library construction method and hot event real-time updating method
JP2009140411A (en) Text summarization device and text summarization method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13767444

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14349016

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27/02/2015)

122 Ep: pct application non-entry in european phase

Ref document number: 13767444

Country of ref document: EP

Kind code of ref document: A1