WO2019041524A1 - Method, electronic apparatus, and computer readable storage medium for generating cluster tag - Google Patents

Publication number
WO2019041524A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
keywords
keyword
calculation formula
extracting
Application number
PCT/CN2017/108807
Other languages
French (fr)
Chinese (zh)
Inventor
罗傲雪
汪伟
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2019041524A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for generating a cluster tag, the method comprising steps of: constructing, for text clustering results, a semantic network relationship for words in each cluster (S31); extracting, from the semantic network relationship constructed from each of the clusters, representative keywords, and marking the same as cluster keywords (S32); and extracting, from the keywords of each cluster, the most discriminating keyword, and marking the same as a tag of said cluster (S33). In this way, the present invention improves the discrimination and identification ability of a tag of a cluster.

Description

Cluster label generation method, electronic device, and computer-readable storage medium
This patent application is based on, and claims priority to, Chinese invention patent application No. 201710776351.5, filed on August 31, 2017 and entitled "Cluster label generation method, electronic device, and computer-readable storage medium".
Technical Field
The present application relates to the field of computer information technology, and in particular to a cluster label generation method, an electronic device, and a computer-readable storage medium.
Background
In the prior art, when an unsupervised corpus is clustered, the resulting clusters usually lack labels, which makes the clustering results difficult to present in user interaction. The clustering methods of the prior art are therefore not well designed and need improvement.
Summary of the Invention
In view of this, the present application proposes a cluster label generation method, an electronic device, and a computer-readable storage medium that use a preset naive Bayes formula to optimize the extraction of cluster keywords at the semantic level and to optimize the extraction of labels for the clustered text.
First, to achieve the above object, the present application proposes an electronic device comprising a memory and a processor. The memory stores a cluster label generation system executable on the processor, and the cluster label generation system, when executed by the processor, implements the following steps:
constructing, for the text clustering results, a semantic network relationship between the words in each cluster;
extracting representative keywords from the semantic network relationship constructed for each cluster, and recording them as cluster keywords; and
extracting the most discriminative keyword from the keywords of each cluster, and recording it as the label of that cluster.
Preferably, extracting the representative keywords comprises: extracting the keywords of each cluster according to the magnitude of the conditional probability values of the words.
Preferably, extracting the representative keywords comprises:
calculating the conditional probability value of each word in the semantic network relationship constructed for each cluster, wherein the conditional probability value is obtained according to a preset naive Bayes formula; and
sorting the conditional probability values of the words calculated for each cluster in descending order, extracting a preset number of keywords, and recording them as cluster keywords.
Preferably, extracting the most discriminative keyword comprises: extracting the most discriminative keyword from the keywords of each cluster according to the transition probability values between words and the preset naive Bayes formula.
Preferably, extracting the most discriminative keyword comprises:
calculating, according to a preset transition probability formula, the transition probability values between keywords in the total document formed by aggregating all documents of each cluster;
substituting the transition probability values between the keywords of each cluster into the preset naive Bayes formula, and recalculating the conditional probability value of each keyword; and
sorting the recalculated conditional probability values of the keywords of each cluster in descending order, extracting the keyword with the highest conditional probability value, and recording it as the cluster label.
Preferably, the preset naive Bayes formula is set to Formula 1:
$$P(S \mid W_i) = P(W_1, W_2, \ldots, W_n \mid W_i) = \prod_{k=1}^{n} P(W_k \mid W_i)$$
In Formula 1, S denotes a piece of text composed of n words W1, W2, …, Wn, and Wi denotes a word in the semantic network relationship constructed for that text.
The preset transition probability formula is set to Formula 2:
Pt(Wj|Wi)=Pt(Wj|Wi)/(P1(Wj|Wi)+P2(Wj|Wi)+…+Pm(Wj|Wi))
In Formula 2, m denotes the number of clusters obtained by text clustering, t denotes one of those clusters, Wi and Wj denote keywords extracted from the clusters, and Pt(Wj|Wi) denotes the transition probability from keyword Wi to keyword Wj in the total document formed by aggregating all documents of the t-th cluster.
In addition, to achieve the above object, the present application further provides a cluster label generation method applied to an electronic device, the method comprising:
constructing, for the text clustering results, a semantic network relationship between the words in each cluster;
extracting representative keywords from the semantic network relationship constructed for each cluster, and recording them as cluster keywords; and
extracting the most discriminative keyword from the keywords of each cluster, and recording it as the label of that cluster.
Preferably, extracting the representative keywords comprises: extracting the keywords of each cluster according to the magnitude of the conditional probability values of the words, which specifically comprises:
calculating the conditional probability value of each word in the semantic network relationship constructed for each cluster, wherein the conditional probability value is obtained according to a preset naive Bayes formula; and
sorting the conditional probability values of the words calculated for each cluster in descending order, extracting a preset number of keywords, and recording them as cluster keywords.
Preferably, extracting the most discriminative keyword comprises: extracting the most discriminative keyword from the keywords of each cluster according to the transition probability values between words and the preset naive Bayes formula, which specifically comprises:
calculating, according to a preset transition probability formula, the transition probability values between keywords in the total document formed by aggregating all documents of each cluster;
substituting the transition probability values between the keywords of each cluster into the preset naive Bayes formula, and recalculating the conditional probability value of each keyword; and
sorting the recalculated conditional probability values of the keywords of each cluster in descending order, extracting the keyword with the highest conditional probability value, and recording it as the cluster label.
Further, to achieve the above object, the present application also provides a computer-readable storage medium storing a cluster label generation system. The cluster label generation system is executable by at least one processor to cause the at least one processor to perform the steps of the cluster label generation method described above.
Compared with the prior art, the electronic device, cluster label generation method, and computer-readable storage medium proposed by the present application use a preset naive Bayes formula to optimize the extraction of cluster keywords at the semantic level. The extraction of labels for the clustered text is also optimized, so that the extracted cluster labels are highly discriminative and recognizable.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional hardware architecture of the electronic device of the present application;
FIG. 2 is a schematic diagram of the program modules of an embodiment of the cluster label generation system in the electronic device of the present application;
FIG. 3 is a schematic flowchart of an embodiment of the cluster label generation method of the present application.
Reference numerals:
Electronic device: 2
Memory: 21
Processor: 22
Network interface: 23
Cluster label generation system: 20
Construction module: 201
Extraction module: 202
Generation module: 203
Process steps: S31-S33
The implementation, functional features, and advantages of the present application are further described below in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and do not limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present application, without creative effort, fall within the scope of protection of the present application.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, provided that the combination can be implemented by a person of ordinary skill in the art; where a combination of technical solutions is contradictory or cannot be implemented, the combination should be regarded as nonexistent and as falling outside the scope of protection claimed by the present application.
It should further be noted that, as used herein, the terms "comprise", "include", and any variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
First, the present application proposes an electronic device 2.
FIG. 1 is a schematic diagram of an optional hardware architecture of the electronic device 2 of the present application. In this embodiment, the electronic device 2 may include, but is not limited to, a memory 21, a processor 22, and a network interface 23 that are communicatively connected to one another through a system bus. Note that FIG. 1 shows only the electronic device 2 with components 21-23; it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead.
The electronic device 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server. The electronic device 2 may be a standalone server or a server cluster composed of multiple servers.
The memory 21 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as a hard disk or internal memory of the electronic device 2. In other embodiments, the memory 21 may be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device 2. The memory 21 may also include both an internal storage unit of the electronic device 2 and an external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the electronic device 2, such as the program code of the cluster label generation system 20. The memory 21 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the electronic device 2, for example to perform control and processing related to data interaction or communication with the electronic device 2. In this embodiment, the processor 22 is configured to run the program code stored in the memory 21 or to process data, for example to run the cluster label generation system 20.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is used to connect the electronic device 2 to an external data platform through a network, establishing a data transmission channel and a communication connection between the electronic device 2 and the external data platform. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
The application environment of the embodiments of the present application and the hardware structure and functions of the related devices have now been described in detail. The embodiments of the present application are presented below on the basis of this application environment and these devices.
FIG. 2 is a program module diagram of an embodiment of the cluster label generation system 20 in the electronic device 2 of the present application. In this embodiment, the cluster label generation system 20 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to carry out the present application. For example, in FIG. 2, the cluster label generation system 20 may be divided into a construction module 201, an extraction module 202, and a generation module 203. A program module, as the term is used in the present application, is a series of computer program instruction segments capable of performing a specific function, and is better suited than a program for describing the execution of the cluster label generation system 20 in the electronic device 2. The functions of the program modules 201-203 are described in detail below.
The construction module 201 is configured to construct, for the text clustering results, a semantic network relationship between the words in each cluster. In this embodiment, text clustering is performed on an unsupervised corpus; the clustering method may use the Text-rank clustering algorithm, and the text clustering result may be text summary information or the like. The semantic network relationship describes the concepts and states of objects and the relationships among them. It consists of nodes and arcs between nodes, where a node represents a concept (an event, a thing, etc.) and an arc represents a relationship between concepts.
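The text does not say how the semantic network is actually built. Purely as an illustration, the sketch below assumes the simplest common choice: words are nodes, and an arc links two words that co-occur within a small sliding window, weighted by the co-occurrence count. The function name `build_cooccurrence_network`, the `window` parameter, and the sample tokens are all hypothetical and do not come from the application.

```python
from collections import defaultdict


def build_cooccurrence_network(docs, window=2):
    """Build an undirected, weighted word graph for one cluster.

    docs:   list of token lists (word segmentation is assumed done upstream).
    window: assumed co-occurrence window size (hypothetical parameter).
    Returns {word: {neighbor: co-occurrence count}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in docs:
        for i, w in enumerate(tokens):
            # Link w to each of the next `window` tokens.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                v = tokens[j]
                if v == w:
                    continue  # no self-loops
                graph[w][v] += 1
                graph[v][w] += 1
    return {w: dict(nbrs) for w, nbrs in graph.items()}


# Hypothetical two-document cluster.
cluster_docs = [["loan", "rate", "bank"], ["bank", "loan", "credit"]]
network = build_cooccurrence_network(cluster_docs)
```

A richer network (typed relations, dependency arcs) would fit the description equally well; co-occurrence is used here only because it needs no external resources.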
The extraction module 202 is configured to extract representative keywords from the semantic network relationship constructed for each cluster and to record them as cluster keywords.
Preferably, in this embodiment, extracting the representative keywords comprises: extracting the keywords of each cluster according to the magnitude of the conditional probability values of the words. Specifically, let S denote a piece of text and Wi denote a word in the semantic network relationship constructed for that text; the conditional probability value P(S|Wi) is calculated for each word in the semantic network relationship constructed for each cluster. In theory, if a word Wi is a keyword of the text, it should maximize this conditional probability value. The conditional probability values of the words calculated for each cluster are therefore sorted in descending order, and a preset number (for example, 3) of keywords are extracted and recorded as cluster keywords. In this embodiment, the cluster keywords are the words that best represent the semantics of the text.
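The descending sort and top-k selection described above can be sketched as follows. This is a minimal illustration that assumes the conditional probability values are already available in a dictionary; the name `top_keywords` and the sample scores are hypothetical.

```python
def top_keywords(scores, k=3):
    """Return the k words with the highest conditional probability values.

    scores: {word: P(S | word)}, assumed to be computed elsewhere.
    k:      the preset number of keywords (3 in the example from the text).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Hypothetical scores for one cluster.
example_scores = {"loan": 0.10, "bank": 0.50, "rate": 0.30, "credit": 0.20}
cluster_keywords = top_keywords(example_scores, k=3)
```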
Preferably, in this embodiment, the conditional probability value is obtained according to a preset naive Bayes formula. For example, assuming that the text S is composed of n words W1, W2, …, Wn, the preset naive Bayes formula may be set to Formula 1 below (LaTeX version).
P(S|Wi)=P(W1,W2,...,Wn|Wi)=\prod_{k=1}^n P(Wk|Wi) (Formula 1)
Note that in other embodiments Formula 1 may also be expressed in the following form:
$$P(S \mid W_i) = P(W_1, W_2, \ldots, W_n \mid W_i) = \prod_{k=1}^{n} P(W_k \mid W_i)$$
In Formula 1, P(S|Wi) denotes the probability that the text S occurs given that the word Wi occurs; the right-hand side of the equation is a product, and n denotes the number of words in the text S.
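Formula 1 can be evaluated directly once each factor P(Wk|Wi) is available. The text does not say how those factors are estimated, so the sketch below assumes they come from co-occurrence counts in the cluster's word graph with add-one smoothing. That estimator, the `alpha` parameter, the function names, and the sample graph are all assumptions made for illustration. The product is accumulated in log space to avoid numerical underflow on long texts.

```python
import math


def conditional_prob(graph, vocab, wk, wi, alpha=1.0):
    """Smoothed estimate of P(wk | wi) from co-occurrence counts (assumed estimator)."""
    nbrs = graph.get(wi, {})
    total = sum(nbrs.values())
    return (nbrs.get(wk, 0) + alpha) / (total + alpha * len(vocab))


def log_p_text_given_word(graph, text, wi, alpha=1.0):
    """log P(S | wi) = sum over k of log P(wk | wi), i.e. Formula 1 in log space."""
    vocab = set(graph) | set(text)
    return sum(math.log(conditional_prob(graph, vocab, wk, wi, alpha)) for wk in text)


# Hypothetical word graph for one cluster: {word: {neighbor: count}}.
graph = {
    "loan": {"bank": 2, "rate": 1},
    "bank": {"loan": 2, "rate": 1},
    "rate": {"loan": 1, "bank": 1},
}
text = ["loan", "bank", "rate"]
scores = {w: log_p_text_given_word(graph, text, w) for w in graph}
```

Because the logarithm is monotonic, ranking words by `scores` gives the same order as ranking them by the raw product in Formula 1.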
The generation module 203 is configured to extract the most discriminative keyword from the keywords of each cluster and to record it as the label of that cluster.
Preferably, in this embodiment, extracting the most discriminative keyword comprises: extracting the most discriminative keyword from the keywords of each cluster according to the transition probability values between words and the preset naive Bayes formula. Specifically, the transition probability values between keywords in the total document formed by aggregating all documents of each cluster are first calculated according to a preset transition probability formula. In this embodiment, the preset transition probability formula may be set to Formula 2 below.
Pt(Wj|Wi)=Pt(Wj|Wi)/(P1(Wj|Wi)+P2(Wj|Wi)+…+Pm(Wj|Wi)) (Formula 2)
Here, m denotes the number of clusters obtained by text clustering, t denotes one of those clusters (for example, the first cluster), and Wi and Wj denote keywords extracted from the clusters; Pt(Wj|Wi) denotes the transition probability from keyword Wi to keyword Wj in the total document formed by aggregating all documents of the t-th cluster.
For example, if the number of clusters after text clustering is m=3, the transition probability between keywords in the first cluster is calculated as:
P1(Wj|Wi)=P1(Wj|Wi)/(P1(Wj|Wi)+P2(Wj|Wi)+P3(Wj|Wi))
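Formula 2 uses the same symbol on both sides. One plausible reading, offered here as an interpretation rather than as something the text states, is that the right-hand side takes the raw per-cluster transition probabilities and the left-hand side is the value normalized across all m clusters. Under that reading, a word pair that is equally common in every cluster receives a low normalized value, while a pair concentrated in one cluster receives a high one. The function name and the sample values below are hypothetical.

```python
def normalize_across_clusters(raw):
    """Apply Formula 2 under the normalization reading.

    raw: list of length m; raw[t][(wi, wj)] is the raw transition probability
         from wi to wj in the aggregated document of cluster t.
    Returns the same structure with each value divided by its sum over clusters.
    """
    totals = {}
    for probs in raw:
        for pair, p in probs.items():
            totals[pair] = totals.get(pair, 0.0) + p
    return [
        {pair: (p / totals[pair] if totals[pair] > 0 else 0.0)
         for pair, p in probs.items()}
        for probs in raw
    ]


# Hypothetical raw values for m = 3 clusters.
raw = [
    {("loan", "bank"): 0.4},
    {("loan", "bank"): 0.2},
    {("loan", "bank"): 0.2},
]
normalized = normalize_across_clusters(raw)
```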
Further, the transition probability values between the keywords of each cluster are substituted into the preset naive Bayes formula (Formula 1 above), and the conditional probability value of each keyword is recalculated (the final result is a cumulative product of a transition matrix). The recalculated conditional probability values of the keywords of each cluster are sorted in descending order, and the keyword with the highest conditional probability value is extracted and recorded as the cluster label. In this embodiment, the recalculated conditional probability value represents how discriminative each keyword is: the higher a keyword's recalculated conditional probability value, the more discriminative it is and the better suited it is to serve as the cluster label.
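The recalculation and label selection can be sketched as below, assuming the normalized transition probabilities for one cluster are already available. Two details are assumptions of this sketch and are not stated in the text: the product runs only over the other keywords (the self-transition is skipped), and a small `eps` is added so that a missing pair does not produce log(0). The function name and sample values are hypothetical.

```python
import math


def pick_label(keywords, trans, eps=1e-9):
    """Pick the keyword with the highest recalculated conditional probability.

    keywords: cluster keywords for one cluster.
    trans:    {(wi, wj): normalized transition probability from wi to wj}.
    eps:      assumed floor to avoid log(0) for missing pairs.
    """
    def score(wi):
        # Formula 1 in log space, using transition probabilities as factors.
        return sum(
            math.log(trans.get((wi, wk), 0.0) + eps)
            for wk in keywords
            if wk != wi
        )

    return max(keywords, key=score)


# Hypothetical normalized transition probabilities for one cluster.
keywords = ["loan", "bank", "rate"]
trans = {
    ("loan", "bank"): 0.5, ("loan", "rate"): 0.4,
    ("bank", "loan"): 0.2, ("bank", "rate"): 0.3,
    ("rate", "loan"): 0.1, ("rate", "bank"): 0.2,
}
label = pick_label(keywords, trans)
```

Returning the top two or more keywords instead of a single one only requires sorting by `score` and slicing.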
Note that in other embodiments, several of the more discriminative keywords (for example, the two most discriminative keywords) may be selected from the keywords of each cluster to serve as the labels of that cluster.
Through the above program modules 201-203, the cluster label generation system 20 proposed by the present application uses a preset naive Bayes formula to optimize the extraction of cluster keywords at the semantic level. The extraction of labels for the clustered text is also optimized, so that the extracted cluster labels are highly discriminative and recognizable.
In addition, the present application also proposes a cluster label generation method.
FIG. 3 is a schematic flowchart of an embodiment of the cluster label generation method of the present application. In this embodiment, depending on requirements, the execution order of the steps in the flowchart shown in FIG. 3 may be changed, and some steps may be omitted.
Step S31: construct, for the text clustering results, a semantic network relationship between the words in each cluster. In this embodiment, text clustering is performed on an unsupervised corpus; the clustering method may use the Text-rank clustering algorithm, and the text clustering result may be text summary information or the like. The semantic network relationship describes the concepts and states of objects and the relationships among them. It consists of nodes and arcs between nodes, where a node represents a concept (an event, a thing, etc.) and an arc represents a relationship between concepts.
Step S32: extract representative keywords from the semantic network relationship constructed for each cluster, and record them as cluster keywords.
Preferably, in this embodiment, extracting the representative keywords comprises: extracting the keywords of each cluster according to the magnitude of the conditional probability values of the words. Specifically, let S denote a piece of text and Wi denote a word in the semantic network relationship constructed for that text; the conditional probability value P(S|Wi) is calculated for each word in the semantic network relationship constructed for each cluster. In theory, if a word Wi is a keyword of the text, it should maximize this conditional probability value. The conditional probability values of the words calculated for each cluster are therefore sorted in descending order, and a preset number (for example, 3) of keywords are extracted and recorded as cluster keywords. In this embodiment, the cluster keywords are the words that best represent the semantics of the text.
Preferably, in this embodiment, the conditional probability value is obtained from a preset naive Bayes calculation formula. For example, assuming that the text S consists of n words W1, W2, …, Wn, the preset naive Bayes calculation formula may be set as Formula 1 below (LaTeX notation).
P(S|Wi) = P(W1, W2, ..., Wn|Wi) = \prod_{k=1}^n P(Wk|Wi) -- Formula 1
It should be noted that, in other embodiments, Formula 1 may also be expressed in the following form:
Figure PCTCN2017108807-appb-000003
In Formula 1, P(S|Wi) denotes the probability that the text S occurs given that the word Wi occurs; the right-hand side of the equation is a product, and n denotes the number of words in the text S.
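As an illustration of Formula 1, the sketch below scores every word of a cluster and keeps the top-ranked ones. The source does not say how P(Wk|Wi) is estimated, so several points here are assumptions: P(Wk|Wi) is estimated from document-level co-occurrence counts with add-one smoothing, S is taken to be the cluster's vocabulary, and the product is evaluated as a sum of logarithms to avoid underflow. All function names and the sample data are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def cooccurrence_counts(documents):
    """Count documents containing each word and each ordered word pair."""
    pair = Counter()    # pair[(wi, wk)]: documents containing both wi and wk
    single = Counter()  # single[wi]: documents containing wi
    for tokens in documents:
        vocab = set(tokens)
        single.update(vocab)
        pair.update(permutations(vocab, 2))
    return pair, single


def log_score(wi, text_words, pair, single, vocab_size):
    """log P(S|Wi) = sum over k of log P(Wk|Wi), per Formula 1.

    Assumption: P(Wk|Wi) is a smoothed co-occurrence ratio.
    """
    total = 0.0
    for wk in text_words:
        joint = single[wi] if wk == wi else pair[(wi, wk)]
        total += math.log((joint + 1) / (single[wi] + vocab_size))
    return total


def cluster_keywords(documents, top_k=3):
    """Rank the cluster's words by log P(S|Wi) and return the top_k."""
    pair, single = cooccurrence_counts(documents)
    words = sorted(single)
    scores = {w: log_score(w, words, pair, single, len(words)) for w in words}
    return sorted(words, key=lambda w: scores[w], reverse=True)[:top_k]


# Hypothetical tokenized documents of one cluster.
docs = [["loan", "rate", "bank"], ["loan", "rate"], ["loan", "credit"]]
top_two = cluster_keywords(docs, top_k=2)
```

In this toy cluster, "loan" co-occurs with every other word and therefore ranks first, followed by "rate". The preset number of keywords corresponds to `top_k`.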
Step S33: extract the most discriminative keyword from the keywords of each cluster, and record it as the label of that cluster.
Preferably, in this embodiment, extracting the most discriminative keyword includes extracting it from the keywords of each cluster according to the transition probability values between words and the preset naive Bayes calculation formula. Specifically, all documents of each cluster are first combined into a single overall document, and the transition probability values between the keywords in that overall document are calculated according to a preset transition probability calculation formula. In this embodiment, the preset transition probability calculation formula may be set as Formula 2 below.
Pt(Wj|Wi) = Pt(Wj|Wi) / (P1(Wj|Wi) + P2(Wj|Wi) + … + Pm(Wj|Wi)) -- Formula 2
Here, m denotes the number of clusters produced by the text clustering, t denotes one of those clusters (for example, the first cluster), and Wi and Wj denote keywords extracted from each cluster. Pt(Wj|Wi) denotes the transition probability from keyword Wi to keyword Wj in the overall document formed by combining all documents of the t-th cluster.
For example, if the text clustering produces m = 3 clusters, the transition probability between keywords in the first cluster is calculated as:
P1(Wj|Wi) = P1(Wj|Wi) / (P1(Wj|Wi) + P2(Wj|Wi) + P3(Wj|Wi)).
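Formula 2 uses the same symbol on both sides; one reading, assumed here, is that the left-hand side is the cluster's own transition probability after normalization by the sum over all m clusters. The sketch below applies that reading to the m = 3 case. The function name `normalize_across_clusters` and the raw probabilities are hypothetical.

```python
def normalize_across_clusters(raw):
    """Apply Formula 2 to per-cluster transition probabilities.

    raw[t][(wi, wj)] is the transition probability Wi -> Wj estimated inside
    cluster t. Each value is divided by the sum of that transition's
    probability over all clusters (assumed reading of Formula 2).
    """
    totals = {}
    for table in raw:
        for key, p in table.items():
            totals[key] = totals.get(key, 0.0) + p
    return [
        {key: (p / totals[key] if totals[key] > 0 else 0.0)
         for key, p in table.items()}
        for table in raw
    ]


# Hypothetical raw transition probabilities for "loan" -> "rate" in 3 clusters.
raw = [
    {("loan", "rate"): 0.75},
    {("loan", "rate"): 0.25},
    {("loan", "rate"): 0.25},
]
normalized = normalize_across_clusters(raw)
```

With these inputs the first cluster's value becomes 0.75 / 1.25 = 0.6, so a transition that is strong in one cluster but weak in the others keeps a high normalized value, which is what makes it discriminative.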
Further, the transition probability values between the keywords in each cluster are substituted into the preset naive Bayes calculation formula (Formula 1 above), and the conditional probability value of each keyword is recalculated (the final result is a chained product of transition matrices). The recalculated conditional probability values of the keywords of each cluster are sorted in descending order, and the keyword with the highest value is extracted and recorded as the cluster label. In this embodiment, the recalculated conditional probability value indicates how discriminative a keyword is: the higher the value, the more discriminative the keyword and the better suited it is to serve as the cluster label.
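The re-scoring step can be sketched as below, assuming the normalized transition probabilities of one cluster are already available as a dictionary. Two choices are assumptions of this sketch rather than statements of the source: the self term P(Wi|Wi) is skipped, and missing or zero transitions are floored at a small constant so the logarithm is defined. The name `pick_label` and the sample values are hypothetical.

```python
import math


def pick_label(keywords, transition, floor=1e-9):
    """Rank cluster keywords by recomputed log P(S|Wi) and return the best.

    transition[(wi, wj)] is the normalized transition probability Wi -> Wj
    for this cluster; the product of Formula 1 is evaluated in log space.
    """
    def log_score(wi):
        return sum(
            math.log(max(transition.get((wi, wj), 0.0), floor))
            for wj in keywords
            if wj != wi  # assumption: the self term is omitted
        )

    ranked = sorted(keywords, key=log_score, reverse=True)
    return ranked[0], ranked


# Hypothetical cluster keywords and normalized transition probabilities.
keywords = ["loan", "rate", "bank"]
transition = {
    ("loan", "rate"): 0.6, ("loan", "bank"): 0.5,
    ("rate", "loan"): 0.3, ("rate", "bank"): 0.2,
    ("bank", "loan"): 0.2, ("bank", "rate"): 0.1,
}
label, ranked = pick_label(keywords, transition)
```

The first element of `ranked` is the cluster label; taking a longer prefix such as `ranked[:2]` yields several labels per cluster.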
It should be noted that, in other embodiments, several of the more discriminative keywords (for example, the two most discriminative keywords) may be selected from the keywords of each cluster as the labels of that cluster.
Through steps S31 to S33 above, the cluster label generation method proposed in this application uses the preset naive Bayes calculation formula to optimize the extraction of cluster keywords at the semantic level. It further optimizes the extraction of labels for the clustered text, so that the extracted cluster labels are highly discriminative and recognizable.
Further, to achieve the above object, this application also provides a computer-readable storage medium (such as a ROM/RAM, magnetic disk, or optical disc). The computer-readable storage medium stores a cluster label generation system 20, which can be executed by at least one processor 22 to cause the at least one processor 22 to perform the steps of the cluster label generation method described above.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of this application.
The preferred embodiments of this application have been described above with reference to the accompanying drawings; they do not limit the scope of the claims of this application. The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Those skilled in the art can implement this application in many variations without departing from its scope and essence; for example, a feature of one embodiment may be used in another embodiment to obtain a further embodiment. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory stores a cluster label generation system executable on the processor, and the cluster label generation system, when executed by the processor, implements the following steps:
    constructing, from a text clustering result, a semantic network of the words in each cluster;
    extracting representative keywords from the semantic network constructed for each cluster, and recording them as cluster keywords; and
    extracting the most discriminative keyword from the keywords of each cluster, and recording it as the label of that cluster.
  2. The electronic device according to claim 1, characterized in that extracting the representative keywords comprises: extracting the keywords of each cluster according to the conditional probability values of the words.
  3. The electronic device according to claim 2, characterized in that extracting the representative keywords comprises:
    calculating a conditional probability value for each word in the semantic network constructed for each cluster, wherein the conditional probability value is obtained from a preset naive Bayes calculation formula;
    sorting the conditional probability values calculated for the words of each cluster in descending order, extracting a preset number of keywords, and recording them as cluster keywords.
  4. The electronic device according to claim 3, characterized in that extracting the most discriminative keyword comprises: extracting the most discriminative keyword from the keywords of each cluster according to the transition probability values between words and the preset naive Bayes calculation formula.
  5. The electronic device according to claim 4, characterized in that extracting the most discriminative keyword comprises:
    calculating, according to a preset transition probability calculation formula, the transition probability values between keywords in the overall document formed by combining all documents of each cluster;
    substituting the transition probability values between the keywords in each cluster into the preset naive Bayes calculation formula, and recalculating the conditional probability value of each keyword;
    sorting the recalculated conditional probability values of the keywords of each cluster in descending order, extracting the keyword with the highest conditional probability value, and recording it as the cluster label.
  6. The electronic device according to claim 5, characterized in that the preset naive Bayes calculation formula is set as Formula 1:
    Figure PCTCN2017108807-appb-100001
    in Formula 1, S denotes a piece of text consisting of n words W1, W2, …, Wn, and Wi denotes a word in the semantic network constructed from that text;
    the preset transition probability calculation formula is set as Formula 2:
    Pt(Wj|Wi) = Pt(Wj|Wi) / (P1(Wj|Wi) + P2(Wj|Wi) + … + Pm(Wj|Wi));
    in Formula 2, m denotes the number of clusters produced by the text clustering, t denotes one of those clusters, Wi and Wj denote keywords extracted from each cluster, and Pt(Wj|Wi) denotes the transition probability from keyword Wi to keyword Wj in the overall document formed by combining all documents of the t-th cluster.
  7. A cluster label generation method applied to an electronic device, characterized in that the method comprises:
    constructing, from a text clustering result, a semantic network of the words in each cluster;
    extracting representative keywords from the semantic network constructed for each cluster, and recording them as cluster keywords; and
    extracting the most discriminative keyword from the keywords of each cluster, and recording it as the label of that cluster.
  8. The cluster label generation method according to claim 7, characterized in that extracting the representative keywords comprises: extracting the keywords of each cluster according to the conditional probability values of the words.
  9. The cluster label generation method according to claim 8, characterized in that extracting the representative keywords specifically comprises:
    calculating a conditional probability value for each word in the semantic network constructed for each cluster, wherein the conditional probability value is obtained from a preset naive Bayes calculation formula;
    sorting the conditional probability values calculated for the words of each cluster in descending order, extracting a preset number of keywords, and recording them as cluster keywords.
  10. The cluster label generation method according to claim 9, characterized in that extracting the most discriminative keyword comprises: extracting the most discriminative keyword from the keywords of each cluster according to the transition probability values between words and the preset naive Bayes calculation formula.
  11. The cluster label generation method according to claim 10, characterized in that extracting the most discriminative keyword specifically comprises:
    calculating, according to a preset transition probability calculation formula, the transition probability values between keywords in the overall document formed by combining all documents of each cluster;
    substituting the transition probability values between the keywords in each cluster into the preset naive Bayes calculation formula, and recalculating the conditional probability value of each keyword;
    sorting the recalculated conditional probability values of the keywords of each cluster in descending order, extracting the keyword with the highest conditional probability value, and recording it as the cluster label.
  12. The cluster label generation method according to claim 11, characterized in that the preset naive Bayes calculation formula is set as Formula 1:
    Figure PCTCN2017108807-appb-100002
    in Formula 1, S denotes a piece of text consisting of n words W1, W2, …, Wn, and Wi denotes a word in the semantic network constructed from that text;
    the preset transition probability calculation formula is set as Formula 2:
    Pt(Wj|Wi) = Pt(Wj|Wi) / (P1(Wj|Wi) + P2(Wj|Wi) + … + Pm(Wj|Wi));
    in Formula 2, m denotes the number of clusters produced by the text clustering, t denotes one of those clusters, Wi and Wj denote keywords extracted from each cluster, and Pt(Wj|Wi) denotes the transition probability from keyword Wi to keyword Wj in the overall document formed by combining all documents of the t-th cluster.
  13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a cluster label generation system, and the cluster label generation system can be executed by at least one processor to cause the at least one processor to perform the following steps:
    constructing, from a text clustering result, a semantic network of the words in each cluster;
    extracting representative keywords from the semantic network constructed for each cluster, and recording them as cluster keywords; and
    extracting the most discriminative keyword from the keywords of each cluster, and recording it as the label of that cluster.
  14. The computer-readable storage medium according to claim 13, characterized in that extracting the representative keywords comprises: extracting the keywords of each cluster according to the conditional probability values of the words.
  15. The computer-readable storage medium according to claim 14, characterized in that extracting the representative keywords comprises:
    calculating a conditional probability value for each word in the semantic network constructed for each cluster, wherein the conditional probability value is obtained from a preset naive Bayes calculation formula;
    sorting the conditional probability values calculated for the words of each cluster in descending order, extracting a preset number of keywords, and recording them as cluster keywords.
  16. The computer-readable storage medium according to claim 15, characterized in that extracting the most discriminative keyword comprises: extracting the most discriminative keyword from the keywords of each cluster according to the transition probability values between words and the preset naive Bayes calculation formula.
  17. The computer-readable storage medium according to claim 16, characterized in that extracting the most discriminative keyword comprises:
    calculating, according to a preset transition probability calculation formula, the transition probability values between keywords in the overall document formed by combining all documents of each cluster;
    substituting the transition probability values between the keywords in each cluster into the preset naive Bayes calculation formula, and recalculating the conditional probability value of each keyword;
    sorting the recalculated conditional probability values of the keywords of each cluster in descending order, extracting the keyword with the highest conditional probability value, and recording it as the cluster label.
  18. The computer-readable storage medium according to claim 17, characterized in that the preset naive Bayes calculation formula is set as Formula 1:
    Figure PCTCN2017108807-appb-100003
    in Formula 1, S denotes a piece of text consisting of n words W1, W2, …, Wn, and Wi denotes a word in the semantic network constructed from that text;
    the preset transition probability calculation formula is set as Formula 2:
    Pt(Wj|Wi) = Pt(Wj|Wi) / (P1(Wj|Wi) + P2(Wj|Wi) + … + Pm(Wj|Wi));
    in Formula 2, m denotes the number of clusters produced by the text clustering, t denotes one of those clusters, Wi and Wj denote keywords extracted from each cluster, and Pt(Wj|Wi) denotes the transition probability from keyword Wi to keyword Wj in the overall document formed by combining all documents of the t-th cluster.
  19. A cluster label generation system, characterized in that the cluster label generation system can be executed by at least one processor to cause the at least one processor to perform the following steps:
    constructing, from a text clustering result, a semantic network of the words in each cluster;
    extracting representative keywords from the semantic network constructed for each cluster, and recording them as cluster keywords; and
    extracting the most discriminative keyword from the keywords of each cluster, and recording it as the label of that cluster.
  20. The cluster label generation system according to claim 19, characterized in that extracting the representative keywords comprises: extracting the keywords of each cluster according to the conditional probability values of the words.
PCT/CN2017/108807 2017-08-31 2017-10-31 Method, electronic apparatus, and computer readable storage medium for generating cluster tag WO2019041524A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710776351.5 2017-08-31
CN201710776351.5A CN107679084B (en) 2017-08-31 2017-08-31 Clustering label generation method, electronic device and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2019041524A1 true WO2019041524A1 (en) 2019-03-07

Family

ID=61134852

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/108807 WO2019041524A1 (en) 2017-08-31 2017-10-31 Method, electronic apparatus, and computer readable storage medium for generating cluster tag

Country Status (2)

Country Link
CN (1) CN107679084B (en)
WO (1) WO2019041524A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647914A (en) * 2019-08-14 2020-01-03 深圳壹账通智能科技有限公司 Intelligent service level training method and device and computer readable storage medium
CN111814475A (en) * 2019-04-09 2020-10-23 Oppo广东移动通信有限公司 User portrait construction method and device, storage medium and electronic equipment
CN112487132A (en) * 2019-09-12 2021-03-12 北京国双科技有限公司 Keyword determination method and related equipment
CN112560455A (en) * 2019-09-26 2021-03-26 北京国双科技有限公司 Data fusion method and related system
CN112579769A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Keyword clustering method and device, storage medium and electronic equipment
CN112784063A (en) * 2019-03-15 2021-05-11 北京金山数字娱乐科技有限公司 Idiom knowledge graph construction method and device
CN113204653A (en) * 2021-06-04 2021-08-03 中国银行股份有限公司 Demand value labeling method and device, computer equipment and readable storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688209B (en) * 2021-09-01 2023-08-25 江苏省城市规划设计研究院有限公司 Text semantic network construction method by adjusting keyword dependency relationship
CN115618085B (en) * 2022-10-21 2024-04-05 华信咨询设计研究院有限公司 Interface data exposure detection method based on dynamic tag

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070078832A1 (en) * 2005-09-30 2007-04-05 Yahoo! Inc. Method and system for using smart tags and a recommendation engine using smart tags
CN103164463A (en) * 2011-12-16 2013-06-19 国际商业机器公司 Method and device for recommending labels
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommending method and system based on latent Dirichlet allocation (LDA) model
CN106844748A (en) * 2017-02-16 2017-06-13 湖北文理学院 Text Clustering Method, device and electronic equipment

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0934861A (en) * 1995-07-17 1997-02-07 Olympus Optical Co Ltd Cluster classification device
CN102207945B (en) * 2010-05-11 2013-10-23 天津海量信息技术有限公司 Knowledge network-based text indexing system and method
CN102207946B (en) * 2010-06-29 2013-10-23 天津海量信息技术有限公司 Knowledge network semi-automatic generation method
CN102708096B (en) * 2012-05-29 2014-10-15 代松 Network intelligence public sentiment monitoring system based on semantics and work method thereof
CN104239300B (en) * 2013-06-06 2017-10-20 富士通株式会社 The method and apparatus that semantic key words are excavated from text
CN103544255B (en) * 2013-10-15 2017-01-11 常州大学 Text semantic relativity based network public opinion information analysis method
US10191978B2 (en) * 2014-01-03 2019-01-29 Verint Systems Ltd. Labeling/naming of themes
CN104794500A (en) * 2015-05-11 2015-07-22 苏州大学 Tri-training semi-supervised learning method and device
CN104951432B (en) * 2015-05-21 2019-01-11 腾讯科技(深圳)有限公司 The method and device that a kind of pair of information is handled
CN105468713B (en) * 2015-11-19 2018-07-17 西安交通大学 A kind of short text classification method of multi-model fusion
CN106940726B (en) * 2017-03-22 2020-09-01 山东大学 Creative automatic generation method and terminal based on knowledge network

Also Published As

Publication number Publication date
CN107679084A (en) 2018-02-09
CN107679084B (en) 2021-09-28

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17923073

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.09.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17923073

Country of ref document: EP

Kind code of ref document: A1