WO2013143362A1 - Method, device, and computer storage medium for adding hyperlinks to text - Google Patents
- Publication number
- WO2013143362A1 (application PCT/CN2013/071573)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- words
- super
- feature
- chain
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/134—Hyperlinking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9558—Details of hyperlinks; Management of linked annotations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Definitions
- The present invention relates to text processing techniques, and more particularly to a method and apparatus for adding hyperlinks to text.
- Background of the invention
- A hyperlink is a connection to a specific target on the Internet. Clicking a hyperlink automatically jumps to the specified target.
- FIG. 1 is a schematic diagram of an existing text with hyperlinks added.
- In FIG. 1, hyperlinks have been added to the words “Taihe Hall”, “Zhonghe Temple” and “Baohe Temple”.
- A word with a hyperlink is usually called a super-chain word, and it is typically displayed in an underlined blue font.
- The present invention provides a method and a device for adding hyperlinks to text, which improve the relevance of the added hyperlinks to the text and are convenient to implement.
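- A method of adding hyperlinks to text, including:
- pre-generating a list of super-chain words, collecting various texts, generating a list of feature words by performing word segmentation on each text, and determining, for each feature word, its co-occurrence frequency with each super-chain word;
- for each text X to which hyperlinks are to be added, performing the following processing: performing word segmentation on the text X; extracting, from the word segmentation result, the super-chain words appearing in the list of super-chain words and the feature words appearing in the list of feature words, and determining the weight of each extracted super-chain word and each extracted feature word; obtaining the final weight of each extracted super-chain word according to the co-occurrence frequency of each extracted feature word with each extracted super-chain word and each determined weight;
- sorting the extracted super-chain words in descending order of final weight, and adding hyperlinks to the top K super-chain words after sorting, where K is a positive integer.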
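- A device for adding hyperlinks to text, including:
- a pre-processing module configured to pre-generate a list of super-chain words, collect various texts, generate a list of feature words by performing word segmentation on each text, and determine, for each feature word, its co-occurrence frequency with each super-chain word;
- an adding module configured to process each text X to which hyperlinks are to be added as follows: perform word segmentation on the text X; extract, from the word segmentation result, the super-chain words appearing in the list of super-chain words and the feature words appearing in the list of feature words, and determine the weight of each extracted super-chain word and each extracted feature word; obtain the final weight of each extracted super-chain word according to the co-occurrence frequency of each extracted feature word with each extracted super-chain word and each determined weight; sort the extracted super-chain words in descending order of final weight, and add hyperlinks to the top K super-chain words after sorting, where K is a positive integer.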
- With this scheme, the correlation between words is obtained by counting the co-occurrence relationships between words in the collected texts. Then, based on the super-chain words and feature words extracted from the text to which hyperlinks are to be added, the words that should receive hyperlinks are determined automatically. No manual operation is required, and the scheme is easy to implement.
- FIG. 1 is a schematic diagram of an existing text with hyperlinks added.
- FIG. 2 is a flow chart of an embodiment of the method for adding hyperlinks to text according to the present invention.
- FIG. 3 is a schematic structural diagram of an embodiment of the apparatus for adding hyperlinks to text according to the present invention.
- To address the problems in the prior art, the present invention proposes a scheme for adding hyperlinks to text that improves the relevance of the added hyperlinks to the text and is convenient to implement.
- The embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
- FIG. 2 is a flow chart of an embodiment of the method for adding hyperlinks to text according to the present invention. As shown in FIG. 2, the method includes the following steps.
- Step 21: Pre-generate a list of super-chain words, collect various texts, perform word segmentation on each text to generate a list of feature words, and, for each feature word, determine its co-occurrence frequency with each super-chain word.
- The list of super-chain words is generated first. Which super-chain words it contains can be determined according to actual needs; the list can be produced by manual editing or generated automatically by machine, and the generation method is not limited.
- Various texts can then be collected from the Internet in order to generate the list of feature words and to count the co-occurrence relationships between words. How to collect texts is prior art. In theory, the more texts collected the better; the specific number can be determined according to actual needs.
- Word segmentation is performed on each collected text, and all the distinct words obtained by segmentation are used as feature words to form the list of feature words.
- Alternatively, high-frequency words, stop words, and low-frequency words may be removed from the distinct words obtained by segmentation, and the remaining words are used as feature words to form the list of feature words. How to segment words and how to identify high-frequency words, stop words, and low-frequency words are prior art.
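The feature word list described above could be built roughly as follows. This is only a minimal sketch: it assumes the texts have already been segmented into token lists, and the function name `build_feature_words` and the thresholds `min_doc_count` and `max_doc_ratio` are hypothetical choices, since the source leaves the definition of high-frequency and low-frequency words to prior art.

```python
from collections import Counter


def build_feature_words(tokenized_texts, stop_words=frozenset(),
                        min_doc_count=2, max_doc_ratio=0.5):
    """Return a sorted list of feature words from already-segmented texts.

    min_doc_count and max_doc_ratio are illustrative thresholds (assumptions),
    not values taken from the source.
    """
    doc_count = Counter()
    for tokens in tokenized_texts:
        doc_count.update(set(tokens))  # count each word once per text
    n_texts = len(tokenized_texts)
    features = [
        word for word, count in doc_count.items()
        if word not in stop_words
        and count >= min_doc_count             # drop low-frequency words
        and count / n_texts <= max_doc_ratio   # drop high-frequency words
    ]
    return sorted(features)


# Toy corpus, already split into words (stand-in for real word segmentation).
texts = [
    ["the", "palace", "has", "a", "hall"],
    ["the", "hall", "faces", "the", "gate"],
    ["the", "gate", "of", "the", "palace"],
    ["the", "museum", "ticket"],
]
features = build_feature_words(texts, stop_words={"a", "of"})
print(features)  # ['gate', 'hall', 'palace']
```

Passing `min_doc_count=1`, `max_doc_ratio=1.0`, and an empty stop-word set reproduces the simpler variant in which every distinct word becomes a feature word.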
- In addition, an inverse document frequency (IDF) value may be determined for each feature word: the IDF value is obtained by dividing the number of all collected texts by the number of texts in which the feature word appears and then taking the logarithm of the quotient.
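As an illustration of the IDF definition, the following sketch computes the value for a handful of words. The source does not state the logarithm base, so the natural logarithm is assumed here; the function name `idf_values` and the toy corpus are hypothetical.

```python
import math


def idf_values(tokenized_texts, feature_words):
    """IDF = log(number of texts / number of texts containing the word).

    Natural logarithm is an assumption; the source does not give the base.
    """
    n_texts = len(tokenized_texts)
    text_sets = [set(tokens) for tokens in tokenized_texts]
    idf = {}
    for word in feature_words:
        containing = sum(1 for words in text_sets if word in words)
        if containing:  # skip words that never appear (avoids division by zero)
            idf[word] = math.log(n_texts / containing)
    return idf


corpus = [
    ["hall", "palace"],
    ["hall", "gate"],
    ["palace", "gate"],
    ["palace", "hall", "tour"],
]
idf = idf_values(corpus, ["hall", "gate", "absent"])
print(idf)
```

A word that appears in fewer texts receives a larger IDF value: here "gate" (2 of 4 texts) scores higher than "hall" (3 of 4 texts).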
- Step 22: For each text X to which hyperlinks are to be added, perform the processing shown in steps 23 to 26. For ease of presentation, the text X denotes any text to which hyperlinks are to be added.
- Step 23: Perform word segmentation on the text X.
- Step 24: From the word segmentation result, extract the super-chain words appearing in the list of super-chain words and the feature words appearing in the list of feature words, and determine the weight of each extracted super-chain word and each extracted feature word.
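- That is, the word segmentation result is compared with the list of super-chain words and the list of feature words generated in step 21 to extract the super-chain words appearing in the list of super-chain words and the feature words appearing in the list of feature words. Then, for each extracted super-chain word H, its weight $W_H$ is calculated:
- $W_H = \mathrm{TF}_H \times \mathrm{IDF}_H$ (3)
- where $\mathrm{TF}_H$ represents the term frequency (TF) value of the super-chain word H, that is, the number of occurrences of the super-chain word H in the text X, and $\mathrm{IDF}_H$ represents the IDF value of the super-chain word H.
- For each extracted feature word F, its weight $W_F$ is calculated:
- $W_F = \mathrm{TF}_F \times \mathrm{IDF}_F$ (4)
- where $\mathrm{TF}_F$ represents the TF value of the feature word F, and $\mathrm{IDF}_F$ represents the IDF value of the feature word F.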
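The weight calculation of step 24 can be sketched as below. The sketch assumes the text X is already segmented and that IDF values are available as a dictionary; the helper name `term_weights` and the IDF numbers are made up for illustration.

```python
from collections import Counter


def term_weights(tokens, vocabulary, idf):
    """Weight = TF * IDF for every vocabulary word that occurs in the text."""
    tf = Counter(tokens)  # TF: number of occurrences in text X
    return {
        word: tf[word] * idf[word]
        for word in vocabulary
        if tf[word] > 0 and word in idf
    }


tokens_x = ["hall", "palace", "hall", "gate", "tour"]
idf_table = {"hall": 0.5, "palace": 1.5, "gate": 0.5}  # illustrative values only

# The same helper serves both lists: super-chain words and feature words.
w_super = term_weights(tokens_x, ["hall", "bridge"], idf_table)
w_feature = term_weights(tokens_x, ["palace", "gate"], idf_table)
print(w_super)    # {'hall': 1.0}
print(w_feature)  # {'palace': 1.5, 'gate': 0.5}
```

Words from either list that do not occur in the text X (such as "bridge" above) are simply not extracted.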
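- Step 25: According to the co-occurrence frequency of each extracted feature word with each extracted super-chain word and each determined weight, obtain the final weight of each extracted super-chain word.
- For each extracted super-chain word H, its final weight $W_H'$ is calculated:
- $W_H' = W_H \times \sum_{i=1}^{n} P(H \mid F_i) \times W_{F_i}$ (5)
- where n represents the number of extracted feature words and $F_i$ denotes the i-th extracted feature word. The $P(H \mid F_i)$ values have been calculated in step 21.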
- Step 26: Sort the extracted super-chain words in descending order of final weight, and add hyperlinks to the top K super-chain words after sorting, where K is a positive integer.
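Steps 25 and 26 can be sketched together: each super-chain word's weight is multiplied by the co-occurrence-weighted sum of the feature word weights, and the top K words are kept. The function name `rank_super_chain_words`, the dictionary layout keyed by `(super_chain_word, feature_word)`, the alphabetical tie-break, and all numbers are assumptions made for this example.

```python
def rank_super_chain_words(w_super, w_feature, cooccurrence, k):
    """Return the top-k super-chain words and all final weights.

    cooccurrence maps (super_chain_word, feature_word) -> P(H | F);
    pairs missing from the mapping are treated as 0 (an assumption).
    """
    final = {}
    for h, w_h in w_super.items():
        context = sum(
            cooccurrence.get((h, f), 0.0) * w_f
            for f, w_f in w_feature.items()
        )
        final[h] = w_h * context
    # Descending by final weight; ties broken alphabetically (assumption).
    ranked = sorted(final, key=lambda h: (-final[h], h))
    return ranked[:k], final


w_super = {"Taihe Hall": 2.0, "ticket": 1.0}
w_feature = {"palace": 1.5, "emperor": 0.5}
cooccurrence = {
    ("Taihe Hall", "palace"): 0.5,
    ("Taihe Hall", "emperor"): 0.25,
    ("ticket", "palace"): 0.25,
}
top_words, final_weights = rank_super_chain_words(w_super, w_feature, cooccurrence, k=1)
print(top_words)      # ['Taihe Hall']
print(final_weights)  # {'Taihe Hall': 1.75, 'ticket': 0.375}
```

Only the words in `top_words` would then be wrapped in hyperlinks; how the link target is chosen is outside what the source describes.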
- FIG. 3 is a schematic structural diagram of an embodiment of the apparatus for adding hyperlinks to text according to the present invention. As shown in FIG. 3, the apparatus includes:
- a pre-processing module, configured to pre-generate a list of super-chain words, collect various texts, perform word segmentation on each text to generate a list of feature words, and, for each feature word, determine its co-occurrence frequency with each super-chain word;
- an adding module, configured to perform the following processing on each text X to which hyperlinks are to be added: perform word segmentation on the text X; extract, from the word segmentation result, the super-chain words appearing in the list of super-chain words and the feature words appearing in the list of feature words, and determine the weight of each extracted super-chain word and each extracted feature word; obtain the final weight of each extracted super-chain word according to the co-occurrence frequency of each extracted feature word with each extracted super-chain word and each determined weight; sort the extracted super-chain words in descending order of final weight, and add hyperlinks to the top K super-chain words after sorting, where K is a positive integer.
- The pre-processing module may specifically include: a first processing unit, configured to generate the list of super-chain words; and a second processing unit, configured to collect various texts, generate the list of feature words by performing word segmentation on each text, and, for each feature word, determine its co-occurrence frequency with each super-chain word.
- The second processing unit may specifically include (not shown, to simplify the drawing): a first processing sub-unit for collecting various texts; and a second processing sub-unit for performing word segmentation on each text and either using all the distinct words obtained by segmentation as feature words to form the list of feature words, or removing high-frequency words, stop words, and low-frequency words from the distinct words and using the remaining words as feature words to form the list of feature words; and, for each feature word y and each super-chain word x, calculating the co-occurrence frequency $P(x \mid y)$ of the two:
- $P(x \mid y) = \dfrac{\text{number of co-occurrences of } x \text{ and } y}{\text{number of occurrences of } y}$ (1)
- where the number of co-occurrences of x and y is the number of texts, among all collected texts, in which the feature word y and the super-chain word x both appear, and the number of occurrences of y is the number of texts, among all collected texts, in which the feature word y appears.
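The co-occurrence frequency of equation (1) counts texts rather than individual word occurrences. A minimal sketch under that reading follows; the function name `cooccurrence_frequency` and the toy corpus are hypothetical, and the texts are assumed to be segmented already.

```python
def cooccurrence_frequency(tokenized_texts, super_chain_words, feature_words):
    """P(x | y) = (# texts containing both x and y) / (# texts containing y)."""
    text_sets = [set(tokens) for tokens in tokenized_texts]
    p = {}
    for y in feature_words:
        texts_with_y = [words for words in text_sets if y in words]
        if not texts_with_y:
            continue  # y never appears, so P(x | y) is undefined
        for x in super_chain_words:
            both = sum(1 for words in texts_with_y if x in words)
            p[(x, y)] = both / len(texts_with_y)
    return p


collected = [
    ["hall", "palace"],
    ["hall", "gate"],
    ["palace", "gate"],
    ["palace", "hall", "tour"],
]
p = cooccurrence_frequency(collected, ["hall"], ["palace", "gate"])
print(p)
```

In this toy corpus "palace" appears in three texts and "hall" accompanies it in two of them, so `p[("hall", "palace")]` is 2/3, while `p[("hall", "gate")]` is 1/2.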
- The adding module may specifically include: a third processing unit, configured to perform word segmentation on the text X; a fourth processing unit, configured to extract, from the word segmentation result, the super-chain words appearing in the list of super-chain words and the feature words appearing in the list of feature words, to determine the weight of each extracted super-chain word and each extracted feature word, and to obtain the final weight of each extracted super-chain word according to the co-occurrence frequency of each extracted feature word with each extracted super-chain word and each determined weight; and a fifth processing unit, configured to sort the extracted super-chain words in descending order of final weight and add hyperlinks to the top K super-chain words after sorting, where K is a positive integer.
- The second processing sub-unit may be further configured to determine an IDF value for each feature word, where the IDF value is obtained by dividing the number of all collected texts by the number of texts in which the feature word appears and then taking the logarithm of the quotient.
- The fourth processing unit may further include (not shown, to simplify the drawing): a third processing sub-unit, configured to extract, from the word segmentation result, the super-chain words appearing in the list of super-chain words and the feature words appearing in the list of feature words, and, for each extracted super-chain word H, to calculate its weight $W_H$:
- $W_H = \mathrm{TF}_H \times \mathrm{IDF}_H$ (3)
- where $\mathrm{TF}_H$ represents the TF value of the super-chain word H, that is, the number of occurrences of the super-chain word H in the text X, and $\mathrm{IDF}_H$ represents the IDF value of the super-chain word H;
- and, for each extracted feature word F, to calculate its weight $W_F$:
- $W_F = \mathrm{TF}_F \times \mathrm{IDF}_F$ (4)
- where $\mathrm{TF}_F$ represents the TF value of the feature word F, and $\mathrm{IDF}_F$ represents the IDF value of the feature word F; and a fourth processing sub-unit, configured to calculate, for each extracted super-chain word H, its final weight $W_H'$:
- $W_H' = W_H \times \sum_{i=1}^{n} P(H \mid F_i) \times W_{F_i}$ (5)
- where n represents the number of extracted feature words and $F_i$ denotes the i-th extracted feature word.
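To make equation (5) concrete, here is a small worked example with made-up numbers (none of them come from the source): a super-chain word H with weight 2.0 and n = 2 extracted feature words, with co-occurrence frequencies 0.5 and 0.25 and feature word weights 1.5 and 0.5.

```latex
\begin{align*}
W_H' &= W_H \times \sum_{i=1}^{2} P(H \mid F_i) \times W_{F_i} \\
     &= 2.0 \times (0.5 \times 1.5 + 0.25 \times 0.5) \\
     &= 2.0 \times 0.875 = 1.75
\end{align*}
```

A super-chain word that rarely co-occurs with the feature words present in the text X ends up with a small sum and therefore a low final weight, even if its own TF-IDF weight is high.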
- Embodiments of the present invention also provide a machine-readable storage medium storing instructions for causing a machine to perform the method of adding hyperlinks to text described herein.
- Specifically, a system or apparatus equipped with a storage medium may be provided, on which software program code implementing the functions of any of the above-described embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus reads and executes the program code stored in the storage medium.
- In this case, the program code read from the storage medium itself implements the functions of any of the above embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
- Storage medium embodiments for providing program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tape, non-volatile memory cards, and ROM.
- Alternatively, the program code can be downloaded from a server computer over a communication network.
- Further, the program code read out from the storage medium may be written into a memory provided on an expansion board inserted into the computer or in an expansion unit connected to the computer, and instructions based on the program code then cause a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations, thereby realizing the functions of any of the above embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SG11201400690PA SG11201400690PA (en) | 2012-03-29 | 2013-02-08 | Method, device, and computer storage media for adding hyperlink to text |
US14/349,016 US9483447B2 (en) | 2012-03-29 | 2013-02-08 | Method, device, and computer storage media for adding hyperlink to text |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210087642.0 | 2012-03-29 | ||
CN201210087642.0A CN103365831B (zh) | 2012-03-29 | 2012-03-29 | Method and apparatus for adding hyperlinks to text |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013143362A1 (zh) | 2013-10-03 |
Family
ID=49258175
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2013/071573 WO2013143362A1 (zh) | Method, device, and computer storage medium for adding hyperlinks to text | 2012-03-29 | 2013-02-08 |
Country Status (4)
Country | Link |
---|---|
US (1) | US9483447B2 (zh) |
CN (1) | CN103365831B (zh) |
SG (1) | SG11201400690PA (zh) |
WO (1) | WO2013143362A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110096695A (zh) * | 2018-01-30 | 2019-08-06 | 腾讯科技(深圳)有限公司 | Hyperlink marking method and device, and text classification method and device |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105302807B (zh) * | 2014-06-06 | 2020-01-10 | 腾讯科技(深圳)有限公司 | Method and device for obtaining an information category |
CN110019834A (zh) * | 2017-10-16 | 2019-07-16 | 百度在线网络技术(北京)有限公司 | Method, device, server, and medium for annotating text links |
CN112468761A (zh) * | 2020-10-31 | 2021-03-09 | 浙江云优家智能科技有限公司 | Intelligent meeting recording system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1801142A (zh) * | 2004-12-31 | 2006-07-12 | 北京联动在线通讯科技有限公司 | Method for quickly linking specific matching words displayed on screen |
CN101000628A (zh) * | 2006-01-13 | 2007-07-18 | 国际商业机器公司 | Erroneous hyperlink detection device and method |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5675819A (en) * | 1994-06-16 | 1997-10-07 | Xerox Corporation | Document information retrieval using global word co-occurrence patterns |
US5694594A (en) * | 1994-11-14 | 1997-12-02 | Chang; Daniel | System for linking hypermedia data objects in accordance with associations of source and destination data objects and similarity threshold without using keywords or link-defining terms |
US5920859A (en) * | 1997-02-05 | 1999-07-06 | Idd Enterprises, L.P. | Hypertext document retrieval system and method |
US7333966B2 (en) * | 2001-12-21 | 2008-02-19 | Thomson Global Resources | Systems, methods, and software for hyperlinking names |
EP1817693A1 (en) * | 2004-09-29 | 2007-08-15 | Panscient Pty Ltd. | Machine learning system |
CN100483408C (zh) * | 2005-07-29 | 2009-04-29 | 国际商业机器公司 | 在多个文档之间建立链接结构的方法和装置 |
US7590626B2 (en) * | 2006-10-30 | 2009-09-15 | Microsoft Corporation | Distributional similarity-based models for query correction |
BRPI0811424A2 (pt) * | 2007-03-30 | 2019-09-24 | Knewco Inc | estrutura de dados, sistema e método de conhecimento da navegação e descoberta |
US9690786B2 (en) * | 2008-03-17 | 2017-06-27 | Tivo Solutions Inc. | Systems and methods for dynamically creating hyperlinks associated with relevant multimedia content |
US8150829B2 (en) * | 2008-04-11 | 2012-04-03 | Fujitsu Limited | Facilitating display of an interactive and dynamic cloud of terms related to one or more input terms |
US20110047166A1 (en) * | 2009-08-20 | 2011-02-24 | Innography, Inc. | System and methods of relating trademarks and patent documents |
US8255786B1 (en) * | 2010-04-09 | 2012-08-28 | Wal-Mart Stores, Inc. | Including hyperlinks in a document |
US8850301B1 (en) * | 2012-03-05 | 2014-09-30 | Google Inc. | Linking to relevant content from an ereader |
US9378455B2 (en) * | 2012-05-10 | 2016-06-28 | Yan M. Yufik | Systems and methods for a computer understanding multi modal data streams |
- 2012-03-29: CN application CN201210087642.0A, patent CN103365831B (zh), status: Active
- 2013-02-08: WO application PCT/CN2013/071573, patent WO2013143362A1 (zh), status: Application Filing
- 2013-02-08: SG application SG11201400690PA, patent SG11201400690PA (en), status: unknown
- 2013-02-08: US application US14/349,016, patent US9483447B2 (en), status: Active
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
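CN110096695A (zh) * | 2018-01-30 | 2019-08-06 | 腾讯科技(深圳)有限公司 | Hyperlink marking method and device, and text classification method and device |
CN110096695B (zh) * | 2018-01-30 | 2023-01-03 | 腾讯科技(深圳)有限公司 | Hyperlink marking method and device, and text classification method and device |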
Also Published As
Publication number | Publication date |
---|---|
CN103365831B (zh) | 2017-07-21 |
US9483447B2 (en) | 2016-11-01 |
US20140250356A1 (en) | 2014-09-04 |
CN103365831A (zh) | 2013-10-23 |
SG11201400690PA (en) | 2014-09-26 |
Similar Documents
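Publication | Title |
---|---|
CN101950284B (zh) | Chinese word segmentation method and system |
CN102693279B (zh) | Method, device, and system for quickly calculating comment similarity |
US9043356B2 (en) | Document processing method and system |
US20140149102A1 (en) | Personalized machine translation via online adaptation |
CN107357777B (zh) | Method and device for extracting tag information |
CN103336766A (zh) | Short-text spam identification and modeling method and device |
CN104951469B (zh) | Method and device for optimizing a corpus |
CN103123624A (zh) | Method and device for determining a central word, and search method and device |
CN102270244B (zh) | Method for rapidly extracting keywords from web page content based on core sentences |
CN103631938A (zh) | Method and device for automatically expanding a word segmentation dictionary |
CN101308512B (zh) | Web page-based method and device for extracting mutual translation pairs |
WO2013143362A1 (zh) | Method, device, and computer storage medium for adding hyperlinks to text |
CN105956119A (zh) | Patent drafting assistance system and method |
US8572081B1 (en) | Identifying non-compositional compounds |
CN103984731B (zh) | Adaptive topic tracking method and device in a microblog environment |
CN112528640A (zh) | Method for automatically extracting domain terms based on anomalous subgraph detection |
CN105718441B (zh) | Method and device for finding functionally similar UI components across different platforms |
CN104298752B (zh) | Method for automatically expanding abbreviations in program code based on web page resources |
CN107451215B (zh) | Feature text extraction method and device |
CN106202033B (zh) | Adverb word sense disambiguation method and device based on dependency constraints and knowledge |
CN109325159A (zh) | Method for mining microblog hot events |
JP2011129006A (ja) | Semantic classification assignment device, semantic classification assignment method, and semantic classification assignment program |
CN113934910A (zh) | Method for constructing an automatically optimized and updated topic library, and method for real-time updating of hot events |
CN112765979B (zh) | Paper keyword extraction system and method |
JP2018206189A (ja) | Information collection device and information collection method |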
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 13767444; Country of ref document: EP; Kind code of ref document: A1 |
WWE | Wipo information: entry into national phase | Ref document number: 14349016; Country of ref document: US |
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27/02/2015) |
122 | Ep: pct application non-entry in european phase | Ref document number: 13767444; Country of ref document: EP; Kind code of ref document: A1 |