WO2017107696A1 - Weighted article identification method and device

Info

Publication number
WO2017107696A1
WO2017107696A1 PCT/CN2016/105354 CN2016105354W WO2017107696A1 WO 2017107696 A1 WO2017107696 A1 WO 2017107696A1 CN 2016105354 W CN2016105354 W CN 2016105354W WO 2017107696 A1 WO2017107696 A1 WO 2017107696A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
article
weight
value
title
Prior art date
Application number
PCT/CN2016/105354
Other languages
English (en)
Chinese (zh)
Inventor
张伸正
魏少俊
陈培军
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Publication of WO2017107696A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Definitions

  • The present invention relates to the field of computers, and in particular to a weighted article identification method and apparatus.
  • The present invention has been made to provide a weighted article identification method and apparatus that overcome, or at least partially solve, the above problems.
  • A weighted article identification method, comprising: segmenting the title corresponding to an article to obtain a plurality of words; calculating weight values of the plurality of words, where the weight values reflect the degree of importance of the plurality of words in the article; expanding, according to the weight values of the plurality of words, the number of occurrences of at least one of the plurality of words in the title corresponding to the article, so that the number of occurrences of each word corresponds to its weight value; and identifying the article with the expanded title.
  • In the foregoing method, calculating the weight values of the plurality of words specifically comprises: counting the word frequency of the plurality of words in the article, and calculating the weight values of the plurality of words according to those word frequencies.
  • The foregoing method, before expanding the number of occurrences of at least one of the plurality of words in the title corresponding to the article according to the weight values of the plurality of words, further comprises: adjusting the weight values of the plurality of words so that each weight value is an integer multiple of a preset value.
  • The foregoing method, after adjusting the weight values of the plurality of words so that each weight value is an integer multiple of the preset value, further comprises: setting the preset value according to the minimum value among the weight values of the plurality of words.
  • In the foregoing method, identifying the article with the expanded title specifically comprises: identifying the article by taking the minimum hash value of the expanded title.
  • A weighted article identification device, comprising: a word segmentation module adapted to perform word segmentation on the title corresponding to an article to obtain a plurality of words; a weight value calculation module adapted to calculate weight values of the plurality of words, where the weight values reflect the degree of importance of the plurality of words in the article; an expansion module adapted to expand, according to the weight values of the plurality of words, the number of occurrences of at least one of the plurality of words in the title corresponding to the article, so that the number of occurrences of each word corresponds to its weight value; and an identification module adapted to identify the article with the expanded title.
  • The weight value calculation module counts the word frequency of the plurality of words in the article and calculates the weight values of the plurality of words according to those word frequencies.
  • The foregoing apparatus further includes: a weight adjustment module, configured to adjust the weight values of the plurality of words so that each weight value is an integer multiple of a preset value.
  • The foregoing apparatus further includes: a setting module, configured to set the preset value according to the minimum value among the weight values of the plurality of words.
  • The identification module identifies the article by taking the minimum hash value of the expanded title.
  • A computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the weighted article identification method.
  • A computer readable medium in which the computer program is stored.
  • A weight value is calculated for each word in the article title according to the importance of that word, and the corresponding words in the title are expanded according to their weight values, so that words with larger weight values make up a larger proportion of the expanded title. The expanded title therefore also reflects the importance of the words in the article, so when a problem needs to be analyzed according to the importance of the words in an article, the expanded title can be used in place of the article.
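The proportion argument can be made explicit with a short worked equation. This is a sketch rather than a formula taken from the source: the symbols $w_i$ (weight value of word $i$), $p$ (the preset value) and $n_i$ (number of occurrences of word $i$ in the expanded title) are introduced here for illustration, and the derivation assumes every weight value is already an integer multiple of $p$.

```latex
n_i = \frac{w_i}{p},
\qquad
\frac{n_i}{\sum_j n_j}
  = \frac{w_i / p}{\sum_j w_j / p}
  = \frac{w_i}{\sum_j w_j}
```

Under these assumptions, the share of the expanded title taken up by a word equals its share of the total weight. For instance, with assumed weights 0.4, 0.2, 0.1, 0.1 and $p = 0.1$, the counts are 4, 2, 1, 1, and the heaviest word fills $4/8 = 0.5$ of the expanded title, which matches $0.4/0.8$.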
  • FIG. 1 is a flow chart schematically showing a weighted article identification method according to an embodiment of the present invention.
  • FIG. 2 is a block diagram schematically showing a weighted article identification device in accordance with one embodiment of the present invention.
  • FIG. 3 is a block diagram schematically showing a weighted article identification device in accordance with one embodiment of the present invention.
  • FIG. 4 is a block diagram schematically showing a computing device for performing a weighted article identification method in accordance with the present invention.
  • FIG. 5 schematically shows a storage unit for holding or carrying program code implementing the weighted article identification method according to the present invention.
  • A weighted article identification method in an embodiment of the present invention includes:
  • Step 110: Word segmentation is performed on the title corresponding to the article to obtain a plurality of words. For example, the headline of a certain news item, "The Star New Film Scale", is segmented into several words: star, new film, scale, and big.
  • Step 120: Calculate the weight values of the plurality of words; the weight values reflect the importance of the words in the article.
  • The weight value is calculated; for example, if a word matches a current trending event, the word is given a higher weight value.
  • Step 130: Expand the number of occurrences of at least one of the plurality of words in the title corresponding to the article according to the weight values of the plurality of words, so that the number of occurrences of each word corresponds to its weight value.
  • The word "star" has a weight value of 0.2.
  • The word "new film" has a weight value of 0.1.
  • The expanded title may be "Star Star New Film Scale". In the expanded title, the important words take up a relatively large proportion, so the expanded title can reflect which words in the news item are more important.
  • Step 140: The article is identified by the expanded title.
  • Words with high weights are repeated more often in the expanded title and words with low weights are repeated less often, so the expanded title reflects the importance of the words in the article. When a problem needs to be analyzed according to the importance of the words in the article, the expanded title can therefore be used in place of the article.
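Steps 110 to 140 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `expand_title` and the `unit` parameter are hypothetical, the title is assumed to be segmented already, and the weights for "scale" and "big" are assumed (the text only gives 0.2 for "star" and 0.1 for "new film").

```python
def expand_title(words, weights, unit):
    """Repeat each title word in proportion to its weight.

    words   -- the segmented title, in order
    weights -- mapping from word to weight value
    unit    -- the weight that corresponds to one occurrence (an assumption
               of this sketch; the text later calls a similar quantity the
               "preset value")
    """
    expanded = []
    for word in words:
        # Every word keeps at least one occurrence so no title word is lost.
        repeats = max(1, round(weights[word] / unit))
        expanded.extend([word] * repeats)
    return expanded


# Step 110: the title is assumed to be segmented already.
words = ["star", "new film", "scale", "big"]
# Step 120: "star" and "new film" follow the example in the text;
# the weights for "scale" and "big" are assumed for illustration.
weights = {"star": 0.2, "new film": 0.1, "scale": 0.1, "big": 0.1}
# Step 130: expand the title according to the weights.
expanded = expand_title(words, weights, unit=0.1)
# Step 140: use the expanded title as the article's identifier.
identifier = " ".join(expanded)
print(identifier)  # star star new film scale big
```

Keeping the words in their original order is a choice of this sketch; the text only requires that the number of occurrences of each word correspond to its weight value.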
  • In another embodiment of the present invention, a weighted article identification method is provided.
  • In the method of this embodiment, step 120 includes:
  • The word frequency of the plurality of words in the article is counted, and the weight values of the words are calculated according to their word frequencies in the article.
  • The more important a word is, the more frequently it appears in the article, so the weights of the plurality of words can be judged according to word frequency.
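A word-frequency weight can be sketched as follows. The text does not specify how frequency is turned into a weight, so dividing each count by the total number of words in the article is an assumption of this sketch, as are the function name `term_frequency_weights` and the sample article.

```python
from collections import Counter


def term_frequency_weights(title_words, article_words):
    """Weight each title word by its relative frequency in the article.

    Normalizing by the article length is an assumption of this sketch;
    the source only says the weight is calculated from the word frequency.
    """
    counts = Counter(article_words)
    total = len(article_words)
    if total == 0:
        # An empty article gives no evidence about any word.
        return {word: 0.0 for word in title_words}
    return {word: counts[word] / total for word in title_words}


# A made-up, already segmented article of ten words.
article = ["star"] * 4 + ["new film"] * 2 + ["scale", "big", "other", "other"]
weights = term_frequency_weights(["star", "new film", "scale", "big"], article)
print(weights)  # {'star': 0.4, 'new film': 0.2, 'scale': 0.1, 'big': 0.1}
```

`Counter` returns 0 for words that never appear in the article, so a title word missing from the body simply receives a weight of 0.0.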
  • In another embodiment of the present invention, a weighted article identification method is provided.
  • The weighted article identification method of this embodiment, before step 130, further includes:
  • The weight values of the plurality of words are adjusted so that each weight value is an integer multiple of the preset value.
  • Because the number of occurrences of a word in the title can only be increased by whole numbers, the weight values of the words need to be adjusted so that the ratios between them are not too complicated; otherwise a large number of words would be added to the title during expansion, which would affect the brevity of the title.
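The adjustment can be illustrated by snapping each weight to the nearest integer multiple of the preset value. Rounding to the nearest multiple is an assumption of this sketch (the text does not say how the adjustment is made), and the function name `to_integer_multiples` and the raw weights below are hypothetical.

```python
def to_integer_multiples(weights, preset):
    """Return, per word, how many preset-sized units its weight rounds to.

    Each word keeps at least one unit. Note that Python's round() rounds
    exact halves to the nearest even integer.
    """
    if preset <= 0:
        raise ValueError("preset must be positive")
    return {word: max(1, round(weight / preset))
            for word, weight in weights.items()}


# Made-up raw weights whose ratios are awkward (4.3 : 1.8 : 1).
raw = {"star": 0.43, "new film": 0.18, "scale": 0.1}
multiples = to_integer_multiples(raw, preset=0.1)
print(multiples)  # {'star': 4, 'new film': 2, 'scale': 1}

# The adjusted weight of a word is its multiple times the preset value.
adjusted = {word: m * 0.1 for word, m in multiples.items()}
```

Without this step, reproducing the ratio 4.3 : 1.8 : 1 exactly with whole-number repeats would need 43, 18 and 10 occurrences, which is the title growth the text warns about.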
  • In another embodiment of the present invention, a weighted article identification method is provided.
  • The weighted article identification method of this embodiment, before step 130, further includes:
  • The preset value is set according to the minimum value among the weight values of the plurality of words.
  • The minimum value among the weight values of the plurality of words is set as the preset value, so that at least one word in the title appears only once, which keeps the title from becoming too long.
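Choosing the minimum weight as the preset value can be sketched as below. The function names `preset_from_minimum` and `repeat_counts` are hypothetical, and rounding each ratio to the nearest integer is an assumption of this sketch.

```python
def preset_from_minimum(weights):
    """Use the smallest weight value as the preset value."""
    return min(weights.values())


def repeat_counts(weights):
    """Number of occurrences per word when the preset is the minimum weight."""
    preset = preset_from_minimum(weights)
    if preset <= 0:
        raise ValueError("all weight values must be positive")
    return {word: round(weight / preset) for word, weight in weights.items()}


weights = {"star": 0.4, "new film": 0.2, "scale": 0.1, "big": 0.1}
counts = repeat_counts(weights)
print(counts)                # {'star': 4, 'new film': 2, 'scale': 1, 'big': 1}
print(sum(counts.values()))  # 8 words in the expanded title
```

Because the lightest word always maps to exactly one occurrence, the expanded title is as short as it can be while still preserving the weight ratios.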
  • In another embodiment of the present invention, a weighted article identification method is provided.
  • In the method of this embodiment, step 140 includes:
  • The article is identified by taking the minimum hash value of the expanded title.
  • The value may be related to "European style clothing".
  • The weight of "star" can be calculated from a weighting measure such as TF-IDF or word frequency. For example, the weight of "star" in this article is 0.4, the weight of "new film" is 0.2, and the weight of the other words is 0.1.
  • The title is then expanded to "star star star star new film new film scale big workplace sorcerer fan has to wear this", and the minimum hash value is calculated; that value then reflects the differing importance of the words.
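One way to take minimum hash values of an expanded title is a MinHash signature, sketched below. Several points are assumptions of this sketch and are not stated in the source: the k-th repeat of a word is tagged as `word#k` so that repeats count as distinct elements (a plain set-based MinHash would otherwise ignore repetition), SHA-1 with a seed prefix stands in for a family of hash functions, and all function names are hypothetical.

```python
import hashlib


def occurrence_tokens(expanded_words):
    """Tag the k-th repeat of a word as 'word#k' so repeats stay distinct."""
    seen = {}
    tokens = []
    for word in expanded_words:
        seen[word] = seen.get(word, 0) + 1
        tokens.append(f"{word}#{seen[word]}")
    return tokens


def _hash(token, seed):
    """64-bit hash of a token under one of several seeded hash functions."""
    data = f"{seed}|{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def min_hash_signature(expanded_words, num_hashes=16):
    """For each seeded hash function, keep the minimum hash over all tokens."""
    tokens = occurrence_tokens(expanded_words)
    if not tokens:
        raise ValueError("the expanded title must contain at least one word")
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two titles agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


title = ["star"] * 4 + ["new film"] * 2 + ["scale", "big"]
signature = min_hash_signature(title)
```

With occurrence tagging, a word repeated four times contributes four elements to the set being hashed, so heavily weighted words have more influence on which hash value ends up as the minimum.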
  • A weighted article identification device in an embodiment of the present invention includes:
  • The word segmentation module 210 is adapted to segment the title corresponding to the article to obtain a plurality of words. For example, the headline of a certain news item, "The Star New Film Scale", is segmented into several words: star, new film, scale, and big.
  • The weight value calculation module 220 is adapted to calculate the weight values of the plurality of words; the weight values reflect the importance of the words in the article.
  • The weight value is calculated; for example, if a word matches a current trending event, the word is given a higher weight value.
  • The expansion module 230 is adapted to expand the number of occurrences of at least one of the plurality of words in the title corresponding to the article according to the weight values of the plurality of words, so that the number of occurrences of each word corresponds to its weight value.
  • The word "star" has a weight value of 0.2.
  • The word "new film" has a weight value of 0.1.
  • The expanded title may be "Star Star New Film Scale". In the expanded title, the important words take up a relatively large proportion, so the expanded title can reflect which words in the news item are more important.
  • The identification module 240 is adapted to identify the article with the expanded title.
  • Words with high weights are repeated more often in the expanded title and words with low weights are repeated less often, so the expanded title reflects the importance of the words in the article. When a problem needs to be analyzed according to the importance of the words in the article, the expanded title can therefore be used in place of the article.
  • In another embodiment of the present invention, a weighted article identification device is provided.
  • The weight value calculation module 220 is adapted to count the word frequency of the plurality of words in the article and to calculate the weight values of the words according to their word frequencies in the article.
  • The more important a word is, the more frequently it appears in the article, so the weights of the plurality of words can be judged according to word frequency.
  • Another embodiment of the present invention provides a weighted article identification device.
  • The weighted article identification device of this embodiment further includes:
  • The weight adjustment module 310 is adapted to adjust the weight values of the plurality of words so that each weight value is an integer multiple of the preset value.
  • The weight values of the plurality of words are integer multiples of the preset value.
  • In another embodiment of the present invention, a weighted article identification device is provided.
  • The weighted article identification device of this embodiment further includes:
  • The setting module 320 is adapted to set the preset value according to the minimum value among the weight values of the plurality of words.
  • The minimum value among the weight values of the plurality of words is set as the preset value, so that at least one word in the title appears only once, which keeps the title from becoming too long.
  • In another embodiment of the present invention, a weighted article identification device is provided.
  • The identification module 240 is adapted to identify the article by taking the minimum hash value of the expanded title.
  • The weight of "star" can be calculated from a weighting measure such as TF-IDF or word frequency.
  • For example, the weight of "star" in this article is 0.4, the weight of "new film" is 0.2, and the weight of the other words is 0.1. The title is then expanded to "star star star star new film new film scale big workplace sorcerer fan has to wear this", and the minimum hash value is calculated; that value then reflects the differing importance of the words.
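The module structure of the device can be sketched as a small class whose parts mirror modules 210 to 240, 310 and 320. This is an illustrative arrangement only: the class name `WeightedArticleIdentifier`, its method names, and the idea of passing the segmenter, weigher and identifier in as callables are all assumptions of this sketch rather than details from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class WeightedArticleIdentifier:
    segment: Callable[[str], List[str]]                  # word segmentation module 210
    weigh: Callable[[List[str], str], Dict[str, float]]  # weight value calculation module 220
    identify: Callable[[List[str]], str]                 # identification module 240

    def set_preset(self, weights):
        """Setting module 320: the preset value is the minimum weight."""
        preset = min(weights.values())
        if preset <= 0:
            raise ValueError("all weight values must be positive")
        return preset

    def adjust(self, weights, preset):
        """Weight adjustment module 310: whole-number multiples of the preset."""
        return {w: max(1, round(v / preset)) for w, v in weights.items()}

    def expand(self, words, counts):
        """Expansion module 230: repeat each word according to its count."""
        return [w for word in words for w in [word] * counts[word]]

    def run(self, title, article):
        words = self.segment(title)
        weights = self.weigh(words, article)
        counts = self.adjust(weights, self.set_preset(weights))
        return self.identify(self.expand(words, counts))


# Stand-ins for illustration: whitespace splitting, fixed weights, and a
# plain join in place of a hash-based identifier.
device = WeightedArticleIdentifier(
    segment=str.split,
    weigh=lambda words, article: {"star": 0.4, "film": 0.2, "scale": 0.1},
    identify=" ".join,
)
print(device.run("star film scale", ""))  # star star star star film film scale
```

Passing the modules in as callables reflects the later statement that modules may be combined, split, or replaced; a real segmenter or a minimum-hash identifier could be substituted without changing `run`.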
  • The modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from the embodiment.
  • The modules or units or components of the embodiments may be combined into one module or unit or component, and they may further be divided into a plurality of sub-modules or sub-units or sub-components.
  • All features disclosed in this specification (including the accompanying claims, abstract and drawings), as well as all processes or units of any method or device so disclosed, may be combined in any combination.
  • The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof.
  • A microprocessor or digital signal processor may be used in practice to implement some or all of the functionality of some or all of the components of the weighted article identification device in accordance with embodiments of the present invention.
  • The invention can also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • A program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • FIG. 4 illustrates a computing device that can implement a weighted article identification method in accordance with the present invention.
  • The computing device conventionally includes a processor 410 and a computer program product or computer readable medium in the form of a memory 420.
  • The memory 420 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • The memory 420 has a storage space 430 for program code 431 for performing any of the method steps described above.
  • The storage space 430 for program code may include various pieces of program code 431, each implementing one of the steps in the above methods.
  • The program code can be read from or written to one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such computer program products are typically portable or fixed storage units as described with reference to FIG
  • The storage unit may have storage segments, storage spaces, and the like that are arranged similarly to the memory 420 in the computing device of FIG
  • The program code can be compressed, for example, in an appropriate form.
  • The storage unit includes computer readable code 431', i.e., code readable by a processor such as 410, that when executed by a computing device causes the computing device to perform each step of the methods described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a weighted article identification method and device. The method comprises: segmenting titles corresponding to articles to obtain a plurality of words (110); calculating weight values of the words, the weight values reflecting the degree of importance of the words in the articles (120); expanding, on the basis of the weight values of the words, the number of at least one of the words in the titles corresponding to the articles, so that the numbers of the words correspond to the weight values of the words (130); and identifying the articles with the expanded titles (140). According to the method, the weight values of the words are calculated on the basis of the degree of importance of the words in the article titles, the corresponding words in the article titles are expanded on the basis of the magnitude of the weight values, and the proportions taken up by words with higher weight values are increased in the expanded titles, so that the expanded titles can reflect the degrees of importance of the words in the articles. Thus, when a problem needs to be analyzed on the basis of the degrees of importance of the words in the articles, the expanded titles can be used in place of the articles.
PCT/CN2016/105354 2015-12-22 2016-11-10 Weighted article identification method and device WO2017107696A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510976010.3 2015-12-22
CN201510976010.3A CN105589847B (zh) 2015-12-22 2015-12-22 带权重的文章标识方法和装置

Publications (1)

Publication Number Publication Date
WO2017107696A1 (fr) 2017-06-29

Family

ID=55929437

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/105354 WO2017107696A1 (fr) 2015-12-22 2016-11-10 Weighted article identification method and device

Country Status (2)

Country Link
CN (1) CN105589847B (fr)
WO (1) WO2017107696A1 (fr)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
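CN105589847B (zh) * 2015-12-22 2019-02-15 北京奇虎科技有限公司 Weighted article identification method and device
KR101797234B1 (ko) 2016-12-07 2017-11-13 서강대학교 산학협력단 Apparatus and method for extracting the nickname list of the same user in an online community
CN108509545B (zh) * 2018-03-20 2021-11-23 北京云站科技有限公司 Method and system for processing comments on an article
CN108959263B (zh) * 2018-07-11 2022-06-03 北京奇艺世纪科技有限公司 Method and device for training a term weight calculation model
CN110287280B (zh) * 2019-06-24 2023-09-29 腾讯科技(深圳)有限公司 Method and device for analyzing words in an article, storage medium, and electronic device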

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020099730A1 (en) * 2000-05-12 2002-07-25 Applied Psychology Research Limited Automatic text classification system
CN101079031A (zh) * 2006-06-15 2007-11-28 腾讯科技(深圳)有限公司 Web page topic extraction system and method
CN104978320A (zh) * 2014-04-02 2015-10-14 东华软件股份公司 Similarity-based knowledge recommendation method and device
CN105103153A (zh) * 2013-03-06 2015-11-25 汤姆逊许可公司 Pictorial summary of a video
CN105589847A (zh) * 2015-12-22 2016-05-18 北京奇虎科技有限公司 Weighted article identification method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
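JP2004348222A (ja) * 2003-05-20 2004-12-09 Matsushita Electric Ind Co Ltd Product storage device for a vending machine
CN100520782C (zh) * 2007-11-09 2009-07-29 清华大学 News keyword extraction method based on word frequency and n-gram grammar
CN102193936B (zh) * 2010-03-09 2013-09-18 阿里巴巴集团控股有限公司 Data classification method and device
CN102831198A (zh) * 2012-08-07 2012-12-19 人民搜索网络股份公司 Similar document identification device and method based on document signature technology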

Also Published As

Publication number Publication date
CN105589847B (zh) 2019-02-15
CN105589847A (zh) 2016-05-18


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16877507

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16877507

Country of ref document: EP

Kind code of ref document: A1