WO2019041524A1

WO2019041524A1 - Method, electronic apparatus, and computer readable storage medium for generating cluster tag

Info

Publication number: WO2019041524A1
Application number: PCT/CN2017/108807
Authority: WO
Inventors: 罗傲雪; 汪伟; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2017-08-31
Filing date: 2017-10-31
Publication date: 2019-03-07
Also published as: CN107679084A; CN107679084B

Abstract

A method for generating a cluster tag, the method comprising steps of: constructing, for text clustering results, a semantic network relationship for words in each cluster (S31); extracting, from the semantic network relationship constructed from each of the clusters, representative keywords, and marking the same as cluster keywords (S32); and extracting, from the keywords of each cluster, the most discriminating keyword, and marking the same as a tag of said cluster (S33). In this way, the present invention improves the discrimination and identification ability of a tag of a cluster.

Description

Cluster label generation method, electronic device and computer readable storage medium

This patent application is based on, and claims priority to, Chinese Patent Application No. 201710776351.5, filed on Aug. 31, 2017 and entitled "Cluster Label Generation Method, Electronic Device, and Computer Readable Storage Medium".

Technical field

The present application relates to the field of computer information technology, and in particular, to a cluster label generation method, an electronic device, and a computer readable storage medium.

Background technique

In the prior art, when an unsupervised corpus is clustered, the clustering results often lack labels, which makes the results difficult to present in user interaction. The clustering methods of the prior art are therefore not sufficiently well designed and need to be improved.

Summary of the invention

In view of this, the present application proposes a cluster label generation method, an electronic device, and a computer readable storage medium, which optimize the clustering keyword extraction process at the semantic level by means of a preset naive Bayesian calculation formula, and which also optimize the label extraction of clustered text.

First, in order to achieve the above object, the present application provides an electronic device including a memory and a processor, the memory storing a cluster label generation system executable on the processor, the cluster label generation system implementing the following steps when executed by the processor:

Constructing a semantic network relationship between words in each cluster for text clustering results;

Extract representative keywords from the semantic network relationships constructed by each cluster, and record them as cluster keywords;

The most discriminating keywords are extracted from the keywords of each cluster, and are recorded as labels of each cluster.

Preferably, the extracting the representative keywords comprises: extracting keywords of each cluster according to the conditional probability value size of the words.

Preferably, the extracting representative keywords includes:

Calculating a conditional probability value of each word in the semantic network relationship constructed by each cluster, wherein the conditional probability value is obtained according to a preset naive Bayesian calculation formula;

The conditional probability values of each word calculated by each cluster are sorted in descending order, and a preset number of keywords are extracted and recorded as cluster keywords.

Preferably, the extracting the most distinguishing keyword comprises: extracting the most discriminating keyword from the keywords of each cluster according to the transition probability value between the words and the preset naive Bayes calculation formula.

Preferably, the extracting the most distinguishing keywords comprises:

Calculating a transition probability value between keywords in the total document of all the documents aggregated by each cluster according to a preset transition probability calculation formula;

Substituting the transition probability values between the keywords in each cluster into the preset naive Bayesian calculation formula, and recalculating the conditional probability values of each keyword;

The conditional probability values of each keyword recalculated for each cluster are sorted in descending order, and the keyword with the highest conditional probability value is extracted and recorded as a clustering label.

Preferably, the preset naive Bayesian calculation formula is set to Formula 1:

P(S|Wi)=P(W1,W2,...,Wn|Wi)=\prod_{k=1}^n P(Wk|Wi);

In Formula 1, S represents a piece of text consisting of n words W1, W2, ..., Wn, and Wi represents a word in the semantic network relationship constructed from that piece of text;

The preset transition probability calculation formula is set to Formula 2:

Pt(Wj|Wi)=Pt(Wj|Wi)/(P1(Wj|Wi)+P2(Wj|Wi)+...+Pm(Wj|Wi));

In Formula 2, m represents the number of clusters after text clustering, t represents one of the clusters, Wi and Wj represent keywords extracted from each cluster, and Pt(Wj|Wi) represents the transition probability from keyword Wi to keyword Wj in the total document formed by aggregating all documents of the t-th cluster.

In addition, to achieve the above object, the present application further provides a cluster label generation method, which is applied to an electronic device, and the method includes:

Preferably, the extracting the representative keywords includes: extracting keywords of each cluster according to the conditional probability value size of the words, specifically including:

Preferably, the extracting the most distinguishing keyword comprises: extracting the most discriminating keywords from the keywords of each cluster according to a transition probability value between words and a preset naive Bayesian calculation formula, specifically including:

Further, in order to achieve the above object, the present application further provides a computer readable storage medium storing a cluster label generation system, the cluster label generation system being executable by at least one processor to cause the at least one processor to perform the steps of the cluster label generation method described above.

Compared with the prior art, the electronic device, the cluster label generation method and the computer readable storage medium proposed by the present application optimize the extraction process of clustering keywords at the semantic level by using a preset naive Bayesian calculation formula. Further, the label extraction of the clustered text is also optimized, so that the extracted cluster labels have high discrimination and recognition.

DRAWINGS

FIG. 1 is a schematic diagram of an optional hardware architecture of an electronic device of the present application;

FIG. 2 is a schematic diagram of the program modules of an embodiment of a cluster label generation system in an electronic device of the present application;

FIG. 3 is a schematic diagram of an implementation process of an embodiment of a method for generating a cluster label according to the present application.

Reference numerals:

Electronic device	2
Memory	21
Processor	22
Network interface	23
Cluster label generation system	20
Construction module	201
Extraction module	202
Generation module	203
Process steps	S31-S33

The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings.

Detailed description

In order to make the objects, technical solutions, and advantages of the present application more comprehensible, the present application will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the application and are not intended to be limiting. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without departing from the inventive scope fall within the scope of the present application.

It should be noted that the descriptions of "first", "second" and the like in the present application are for the purpose of description only, and are not to be construed as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, but only on the basis that the combination can be realized by those skilled in the art; when a combination of technical solutions is contradictory or impossible to implement, the combination should be considered not to exist and not to fall within the scope of protection claimed by this application.

It is further to be understood that the terms "comprises", "comprising", or any other variations thereof are intended to encompass a non-exclusive inclusion, such that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements that are inherent to such a process, method, article, or device. An element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

First of all, the present application proposes an electronic device 2.

Referring to FIG. 1, it is a schematic diagram of an optional hardware architecture of the electronic device 2 of the present application. In this embodiment, the electronic device 2 may include, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with each other through a system bus. It is pointed out that FIG. 1 only shows the electronic device 2 with the components 21-23, but it should be understood that not all illustrated components are required to be implemented, and more or fewer components may be implemented instead.

The electronic device 2 may be a computing device such as a rack server, a blade server, or a tower server. The electronic device 2 may be an independent server or a server cluster composed of multiple servers.

The memory 21 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read only memory (ROM), an electrically erasable programmable read only memory (EEPROM), a programmable read only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as a hard disk or memory of the electronic device 2. In other embodiments, the memory 21 may also be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 2. Of course, the memory 21 may also include both an internal storage unit of the electronic device 2 and an external storage device thereof. In this embodiment, the memory 21 is generally used to store an operating system installed in the electronic device 2 and various types of application software, such as the program code of the cluster tag generation system 20. Further, the memory 21 can also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is typically used to control the overall operation of the electronic device 2, such as performing control and processing related to data interaction or communication with the electronic device 2. In this embodiment, the processor 22 is configured to run program code or process data stored in the memory 21, for example, to run the cluster tag generation system 20.

The network interface 23 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is configured to connect the electronic device 2 to an external data platform through a network, and to establish a data transmission channel and a communication connection between the electronic device 2 and the external data platform. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.

So far, the application environment of the various embodiments of the present application and the hardware structure and functions of related devices have been described in detail. Hereinafter, various embodiments of the present application will be proposed based on the above-described application environment and related devices.

Referring to FIG. 2, it is a program module diagram of an embodiment of the cluster label generation system 20 in the electronic device 2 of the present application. In this embodiment, the cluster tag generation system 20 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to implement the present application. For example, in FIG. 2, the cluster tag generation system 20 can be divided into a construction module 201, an extraction module 202, and a generation module 203. A program module, as referred to in the present application, is a series of computer program instruction segments capable of performing a specific function, and is better suited than a whole program to describing the execution process of the cluster tag generation system 20 in the electronic device 2. The functions of the program modules 201-203 are described in detail below.

The construction module 201 is configured to construct, for the text clustering result, a semantic network relationship between the words in each cluster. In this embodiment, the text clustering is performed on an unsupervised corpus, the clustering method may adopt the Text-rank clustering algorithm, and the text clustering result may be text summary information or the like. The semantic network relationship describes the concepts and states of objects and the relationships among them. It consists of nodes and arcs between nodes: a node represents a concept (an event, a thing, etc.), and an arc represents a relationship between concepts.
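The application does not state how arcs between word nodes are derived. As a minimal sketch, assuming that two words are linked whenever they occur in the same tokenized sentence and that the arc weight is the number of such sentences, a per-cluster network could be built as follows. The function name `build_semantic_network` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_semantic_network(sentences):
    """Return {word: {neighbor: weight}} for the sentences of one cluster.

    Assumption (not from the application): an arc joins two words that
    share a sentence, weighted by the number of sentences they share.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        # Use each distinct word once per sentence, in a stable order.
        for a, b in combinations(sorted(set(tokens)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return {word: dict(neighbors) for word, neighbors in graph.items()}


# Hypothetical tokenized sentences belonging to a single cluster.
cluster_sentences = [
    ["loan", "interest", "rate"],
    ["loan", "repayment"],
    ["interest", "rate", "loan"],
]
network = build_semantic_network(cluster_sentences)
```

Here each key of `network` is a node and each inner dictionary lists its arcs, so the structure matches the node-and-arc description while leaving the choice of relation open.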

The extracting module 202 is configured to extract a representative keyword from the semantic network relationship constructed by each cluster, and record it as a clustering keyword.

Preferably, in this embodiment, the extracting the representative keywords comprises: extracting the keywords of each cluster according to the magnitude of the conditional probability values of the words. Specifically, suppose S represents a piece of text and Wi represents a word in the semantic network relationship constructed from that text; the conditional probability value P(S|Wi) of each word in the semantic network relationship constructed for each cluster is calculated. In theory, if a word Wi is a keyword of the text, this conditional probability value should be maximal. Therefore, the conditional probability values of the words calculated for each cluster are sorted in descending order, and a preset number (for example, three) of keywords are extracted and recorded as clustering keywords. In this embodiment, a clustering keyword is a word that best represents the semantics of the piece of text.

Preferably, in this embodiment, the conditional probability value is calculated according to a preset naive Bayesian calculation formula. For example, assuming that the text S is composed of n words W1, W2, ..., Wn, the preset naive Bayesian calculation formula can be set as the following Formula 1 (in LaTeX notation).

P(S|Wi)=P(W1,W2,...,Wn|Wi)=\prod_{k=1}^n P(Wk|Wi)--Formula 1

It should be noted that, in other embodiments, Formula 1 can also be expressed in expanded form as follows:

P(S|Wi)=P(W1|Wi)×P(W2|Wi)×...×P(Wn|Wi)

Here, P(S|Wi) in Formula 1 represents the probability that the text S appears given that the word Wi appears, the right-hand side of the formula is a product, and n represents the number of words in the text S.
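The application gives the product form of P(S|Wi) but not how each factor P(Wk|Wi) is estimated. The sketch below assumes a smoothed sentence-level co-occurrence estimate and sums logarithms to avoid underflow; `conditional_scores`, `top_keywords`, the smoothing parameter `alpha`, and the sample data are all hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def conditional_scores(sentences, alpha=1.0):
    """Return {Wi: log P(S|Wi)} following the product form of Formula 1.

    Assumption (not from the application): P(Wk|Wi) is estimated as
    (sentences containing both + alpha) / (sentences containing Wi + alpha * V),
    where V is the vocabulary size.
    """
    occur = Counter()     # sentences containing each word
    co_occur = Counter()  # sentences containing both words of an ordered pair
    for tokens in sentences:
        unique = set(tokens)
        occur.update(unique)
        co_occur.update(permutations(unique, 2))
    text = [word for tokens in sentences for word in tokens]  # S = W1, ..., Wn
    vocab_size = len(occur)
    scores = {}
    for wi in occur:
        denom = occur[wi] + alpha * vocab_size
        log_p = 0.0
        for wk in text:
            joint = occur[wi] if wk == wi else co_occur[(wi, wk)]
            log_p += math.log((joint + alpha) / denom)
        scores[wi] = log_p
    return scores


def top_keywords(scores, count=3):
    """Sort by score in descending order (ties alphabetical) and keep `count` words."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:count]]


# Hypothetical tokenized sentences belonging to a single cluster.
sentences = [
    ["loan", "interest", "rate"],
    ["loan", "repayment"],
    ["interest", "rate", "loan"],
]
scores = conditional_scores(sentences)
keywords = top_keywords(scores, count=3)
```

Sorting log-probabilities gives the same ranking as sorting the probabilities themselves, so the descending sort and the preset keyword count carry over unchanged.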

The generation module 203 is configured to extract the most discriminating keywords from the keywords of each cluster and record them as the labels of each cluster.

Preferably, in this embodiment, the extracting the most distinguishing keyword comprises: extracting the most discriminating keywords from the keywords of each cluster according to the transition probability values between words and the preset naive Bayesian calculation formula. Specifically, first, according to a preset transition probability calculation formula, the transition probability values between the keywords are calculated in the total document formed by aggregating all documents of each cluster. In this embodiment, the preset transition probability calculation formula may be set to the following Formula 2.

Pt(Wj|Wi)=Pt(Wj|Wi)/(P1(Wj|Wi)+P2(Wj|Wi)+...+Pm(Wj|Wi))--Formula 2

Here, m represents the number of clusters after text clustering, t represents one of the clusters (e.g., the first cluster), and Wi and Wj represent keywords extracted from each cluster; Pt(Wj|Wi) then represents the transition probability from keyword Wi to keyword Wj in the total document formed by aggregating all documents of the t-th cluster.

For example, if the number of clusters after text clustering is m=3, the formula for calculating the transition probability between keywords in the first cluster is:

P1(Wj|Wi)=P1(Wj|Wi)/(P1(Wj|Wi)+P2(Wj|Wi)+P3(Wj|Wi))
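As a worked instance of the m=3 case, the block below substitutes purely illustrative values; they are assumptions, not figures from the application. The normalized value is written P'_1 only to distinguish it from the raw value, whereas the application uses the same symbol on both sides.

```latex
\begin{aligned}
P'_1(W_j \mid W_i)
  &= \frac{P_1(W_j \mid W_i)}{P_1(W_j \mid W_i) + P_2(W_j \mid W_i) + P_3(W_j \mid W_i)} \\
  &= \frac{0.6}{0.6 + 0.1 + 0.1} = 0.75
\end{aligned}
```

If the raw transition probability were equal in all three clusters, the normalized value would be 1/3, so a value well above 1/3 indicates a transition that is characteristic of the first cluster.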

Further, the transition probability values between the keywords in each cluster are substituted into the preset naive Bayesian calculation formula (Formula 1 above), and the conditional probability value of each keyword is recalculated (the final result is a transition matrix multiplication). The conditional probability values of the keywords recalculated for each cluster are sorted in descending order, and the keyword with the highest conditional probability value is extracted and recorded as the clustering label. In this embodiment, the recalculated conditional probability value represents the degree of discrimination of each keyword. The higher the recalculated conditional probability value of a keyword, the higher its discrimination, and the more suitable it is as a clustering label.
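The label selection described above can be sketched as follows. This is one possible reading, assuming the raw per-cluster transition probabilities are already available as dictionaries keyed by (Wi, Wj) and that a keyword is re-scored by multiplying its normalized transition probabilities to the other keywords of the same cluster. The names `normalize_transitions` and `pick_label` and all numbers are hypothetical.

```python
import math


def normalize_transitions(raw):
    """Divide each cluster's P_t(Wj|Wi) by the sum of that pair over all clusters.

    raw[t][(wi, wj)] is the unnormalized transition probability in cluster t.
    """
    totals = {}
    for table in raw:
        for pair, p in table.items():
            totals[pair] = totals.get(pair, 0.0) + p
    return [
        {pair: (p / totals[pair] if totals[pair] > 0 else 0.0)
         for pair, p in table.items()}
        for table in raw
    ]


def pick_label(keywords, normalized_table):
    """Re-score each keyword as a product over the other keywords; return the best."""
    def score(wi):
        return math.prod(
            normalized_table.get((wi, wj), 0.0) for wj in keywords if wj != wi
        )
    # sorted() makes tie-breaking deterministic (alphabetical).
    return max(sorted(keywords), key=score)


# Hypothetical raw transition probabilities for m = 2 clusters.
raw = [
    {("loan", "rate"): 0.5, ("loan", "bank"): 0.4, ("rate", "loan"): 0.5,
     ("rate", "bank"): 0.2, ("bank", "loan"): 0.4, ("bank", "rate"): 0.2},
    {("loan", "rate"): 0.1, ("loan", "bank"): 0.1, ("rate", "loan"): 0.5,
     ("rate", "bank"): 0.2, ("bank", "loan"): 0.1, ("bank", "rate"): 0.3},
]
normalized = normalize_transitions(raw)
label = pick_label(["loan", "rate", "bank"], normalized[0])
```

In this toy data "rate" behaves the same way in both clusters, so its normalized transitions fall to 0.5 and it loses to "loan", whose transitions are concentrated in the first cluster.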

It should be noted that, in other embodiments, multiple keywords with higher discrimination (for example, the two most discriminating keywords) may be selected from the keywords of each cluster as the labels of that cluster.

Through the above program modules 201-203, the cluster label generation system 20 proposed by the present application optimizes the extraction process of cluster keywords on the semantic level by using a preset naive Bayesian calculation formula. Further, the label extraction of the clustered text is also optimized, so that the extracted cluster labels have high discrimination and recognition.
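The division into modules 201-203 can be pictured as three callables run in sequence. The sketch below is only an illustration of that control flow: the function `generate_cluster_labels` and the three `toy_*` stand-ins are hypothetical, and the stand-ins use simple word counts rather than the formulas of the application.

```python
from collections import Counter


def generate_cluster_labels(clusters, build_network, extract_keywords, select_labels):
    """Run the three modules in order; every argument after `clusters` is a callable."""
    networks = [build_network(documents) for documents in clusters]  # module 201
    keywords = [extract_keywords(network) for network in networks]   # module 202
    return select_labels(keywords)                                   # module 203


# Deliberately simple placeholders, not the formulas of the application.
def toy_network(documents):
    """Stand-in for the semantic network: plain word counts."""
    return Counter(word for doc in documents for word in doc.split())


def toy_keywords(network, count=2):
    """Stand-in for keyword extraction: the `count` most frequent words."""
    ranked = sorted(network.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:count]]


def toy_labels(keywords_per_cluster):
    """Stand-in for label selection: prefer keywords shared by the fewest clusters."""
    usage = Counter(word for words in keywords_per_cluster for word in set(words))
    return [
        min(words, key=lambda w: (usage[w], words.index(w)))
        for words in keywords_per_cluster
    ]


clusters = [
    ["loan rate loan", "loan bank rate"],
    ["claim rate claim", "claim policy rate"],
]
labels = generate_cluster_labels(clusters, toy_network, toy_keywords, toy_labels)
```

Passing the keyword lists of all clusters to the last stage reflects that label selection compares clusters with one another, whereas the first two stages work on one cluster at a time.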

In addition, the present application also proposes a cluster label generation method.

Referring to FIG. 3, it is a schematic flowchart of an implementation process of an embodiment of a method for generating a cluster label of the present application. In this embodiment, the order of execution of the steps in the flowchart shown in FIG. 3 may be changed according to different requirements, and some steps may be omitted.

Step S31: constructing, for the text clustering result, a semantic network relationship between the words in each cluster. In this embodiment, the text clustering is performed on an unsupervised corpus, the clustering method may adopt the Text-rank clustering algorithm, and the text clustering result may be text summary information or the like. The semantic network relationship describes the concepts and states of objects and the relationships among them. It consists of nodes and arcs between nodes: a node represents a concept (an event, a thing, etc.), and an arc represents a relationship between concepts.

In step S32, a representative keyword is extracted from the semantic network relationship constructed by each cluster, and is recorded as a clustering keyword.

Preferably, in this embodiment, the conditional probability value is obtained according to a preset naive Bayesian calculation formula. For example, assuming that the text S is composed of n words W1, W2, ..., Wn, the preset naive Bayesian calculation formula can be set as the following Formula 1 (in LaTeX notation).

P(S|Wi)=P(W1,W2,...,Wn|Wi)=\prod_{k=1}^n P(Wk|Wi)--Formula 1

In step S33, the most discriminating keywords are extracted from the keywords of each cluster, and are recorded as labels of each cluster.

In the preset transition probability calculation formula used in this step, m represents the number of clusters after text clustering, t represents one of the clusters (e.g., the first cluster), and Wi and Wj represent keywords extracted from each cluster; Pt(Wj|Wi) then represents the transition probability from keyword Wi to keyword Wj in the total document formed by aggregating all documents of the t-th cluster.

Through the above steps S31-S33, the cluster label generation method proposed by the present application optimizes the extraction process of the clustering keywords on the semantic level by the preset naive Bayesian calculation formula. Further, the label extraction of the clustered text is also optimized, so that the extracted cluster labels have high discrimination and recognition.

Further, in order to achieve the above object, the present application further provides a computer readable storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk), where the computer readable storage medium stores a cluster label generation system 20, and the cluster label generation system 20 can be executed by at least one processor 22 to cause the at least one processor 22 to perform the steps of the cluster label generation method described above.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and can also be implemented by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product that is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the various embodiments of the present application.

The preferred embodiments of the present application have been described above with reference to the drawings, and are not intended to limit the scope of the application. The serial numbers of the embodiments of the present application are merely for the description, and do not represent the advantages and disadvantages of the embodiments. Additionally, although logical sequences are shown in the flowcharts, in some cases the steps shown or described may be performed in a different order than the ones described herein.

A person skilled in the art can implement the present application in various variants without departing from the scope and spirit of the present application. For example, the features of one embodiment can be used in another embodiment to obtain yet another embodiment. Equivalent structures or equivalent process transformations made using the contents of the present specification and drawings, whether applied directly or indirectly in other related technical fields, are all included in the scope of patent protection of the present application.

Claims

An electronic device, comprising a memory and a processor, wherein the memory stores a cluster label generation system operable on the processor, the cluster label generation system implementing the following steps when executed by the processor:

Constructing a semantic network relationship between words in each cluster for text clustering results;

Extract representative keywords from the semantic network relationships constructed by each cluster, and record them as cluster keywords;

The most discriminating keywords are extracted from the keywords of each cluster, and are recorded as labels of each cluster.
The electronic device according to claim 1, wherein said extracting the representative keywords comprises: extracting keywords of each cluster according to a conditional probability value size of the words.
The electronic device according to claim 2, wherein said extracting representative keywords comprises:

Calculating a conditional probability value of each word in the semantic network relationship constructed by each cluster, wherein the conditional probability value is obtained according to a preset naive Bayesian calculation formula;

The conditional probability values of each word calculated by each cluster are sorted in descending order, and a preset number of keywords are extracted and recorded as cluster keywords.
The electronic device according to claim 3, wherein said extracting the most distinguishing keyword comprises: extracting the most discriminating keywords from the keywords of each cluster based on a transition probability value between words and a preset naive Bayesian calculation formula.
The electronic device according to claim 4, wherein said extracting the most distinguishing keywords comprises:

Calculating a transition probability value between keywords in the total document of all the documents aggregated by each cluster according to a preset transition probability calculation formula;

Substituting the transition probability values between the keywords in each cluster into the preset naive Bayesian calculation formula, and recalculating the conditional probability values of each keyword;

The conditional probability values of each keyword recalculated for each cluster are sorted in descending order, and the keyword with the highest conditional probability value is extracted and recorded as a clustering label.
The electronic device according to claim 5, wherein said preset naive Bayesian calculation formula is set to Formula 1:

P(S|Wi)=P(W1,W2,...,Wn|Wi)=\prod_{k=1}^n P(Wk|Wi);

In Formula 1, S represents a piece of text consisting of n words W1, W2, ..., Wn, and Wi represents a word in the semantic network relationship constructed from that piece of text;

The preset transition probability calculation formula is set to Formula 2:

Pt(Wj|Wi)=Pt(Wj|Wi)/(P1(Wj|Wi)+P2(Wj|Wi)+...+Pm(Wj|Wi));

In Formula 2, m represents the number of clusters after text clustering, t represents one of the clusters, Wi and Wj represent keywords extracted from each cluster, and Pt(Wj|Wi) represents the transition probability from keyword Wi to keyword Wj in the total document formed by aggregating all documents of the t-th cluster.
A cluster label generation method, applied to an electronic device, the method comprising:

Constructing a semantic network relationship between words in each cluster for text clustering results;

Extract representative keywords from the semantic network relationships constructed by each cluster, and record them as cluster keywords;

The most discriminating keywords are extracted from the keywords of each cluster, and are recorded as labels of each cluster.
The clustering label generating method according to claim 7, wherein the extracting the representative keywords comprises: extracting keywords of each cluster according to a conditional probability value size of the words.
The clustering label generating method according to claim 8, wherein the extracting the representative keywords comprises:

Calculating a conditional probability value of each word in the semantic network relationship constructed by each cluster, wherein the conditional probability value is obtained according to a preset naive Bayesian calculation formula;

The conditional probability values of each word calculated by each cluster are sorted in descending order, and a preset number of keywords are extracted and recorded as cluster keywords.
The clustering label generating method according to claim 9, wherein the extracting the most discriminating keyword comprises: extracting the most discriminating keywords from the keywords of each cluster according to a transition probability value between words and a preset naive Bayesian calculation formula.
The method of generating a clustering label according to claim 10, wherein the extracting the most distinctive keyword specifically comprises:

Calculating a transition probability value between keywords in the total document of all the documents aggregated by each cluster according to a preset transition probability calculation formula;

Substituting the transition probability values between the keywords in each cluster into the preset naive Bayesian calculation formula, and recalculating the conditional probability values of each keyword;

The conditional probability values of each keyword recalculated for each cluster are sorted in descending order, and the keyword with the highest conditional probability value is extracted and recorded as a clustering label.
The cluster label generation method according to claim 11, wherein the preset naive Bayesian calculation formula is set to Formula 1:

P(S|Wi)=P(W1,W2,...,Wn|Wi)=\prod_{k=1}^n P(Wk|Wi);

In Formula 1, S represents a piece of text consisting of n words W1, W2, ..., Wn, and Wi represents a word in the semantic network relationship constructed from that piece of text;

The preset transition probability calculation formula is set to Formula 2:

Pt(Wj|Wi)=Pt(Wj|Wi)/(P1(Wj|Wi)+P2(Wj|Wi)+...+Pm(Wj|Wi));

In Formula 2, m represents the number of clusters after text clustering, t represents one of the clusters, Wi and Wj represent keywords extracted from each cluster, and Pt(Wj|Wi) represents the transition probability from keyword Wi to keyword Wj in the total document formed by aggregating all documents of the t-th cluster.
A computer readable storage medium, characterized in that the computer readable storage medium stores a cluster label generation system, the cluster label generation system being executable by at least one processor to cause the at least one processor to perform the following steps:

Constructing a semantic network relationship between words in each cluster for text clustering results;

Extract representative keywords from the semantic network relationships constructed by each cluster, and record them as cluster keywords;

The most discriminating keywords are extracted from the keywords of each cluster, and are recorded as labels of each cluster.
The computer readable storage medium of claim 13, wherein the extracting the representative keywords comprises: extracting keywords of each cluster according to a conditional probability value size of the words.
The computer readable storage medium of claim 14 wherein said extracting representative keywords comprises:

Calculating a conditional probability value of each word in the semantic network relationship constructed by each cluster, wherein the conditional probability value is obtained according to a preset naive Bayesian calculation formula;

The conditional probability values of each word calculated by each cluster are sorted in descending order, and a preset number of keywords are extracted and recorded as cluster keywords.
The computer readable storage medium according to claim 15, wherein said extracting the most discriminating keyword comprises: extracting the most discriminating keywords from the keywords of each cluster according to a transition probability value between words and a preset naive Bayesian calculation formula.
The computer readable storage medium of claim 16 wherein said extracting the most discriminating keywords comprises:

Calculating a transition probability value between keywords in the total document of all the documents aggregated by each cluster according to a preset transition probability calculation formula;

Substituting the transition probability values between the keywords in each cluster into the preset naive Bayesian calculation formula, and recalculating the conditional probability values of each keyword;

The conditional probability values of each keyword recalculated for each cluster are sorted in descending order, and the keyword with the highest conditional probability value is extracted and recorded as a clustering label.
The computer readable storage medium of claim 17, wherein the preset naive Bayesian calculation formula is set to Formula 1:

P(S|Wi)=P(W1,W2,...,Wn|Wi)=\prod_{k=1}^n P(Wk|Wi);

In Formula 1, S represents a piece of text consisting of n words W1, W2, ..., Wn, and Wi represents a word in the semantic network relationship constructed from that piece of text;

The preset transition probability calculation formula is set to Formula 2:

Pt(Wj|Wi)=Pt(Wj|Wi)/(P1(Wj|Wi)+P2(Wj|Wi)+...+Pm(Wj|Wi));

In Formula 2, m represents the number of clusters after text clustering, t represents one of the clusters, Wi and Wj represent keywords extracted from each cluster, and Pt(Wj|Wi) represents the transition probability from keyword Wi to keyword Wj in the total document formed by aggregating all documents of the t-th cluster.
A cluster label generation system, wherein the cluster label generation system is executable by at least one processor to cause the at least one processor to perform the following steps:

Constructing a semantic network relationship between words in each cluster for text clustering results;

Extract representative keywords from the semantic network relationships constructed by each cluster, and record them as cluster keywords;

The most discriminating keywords are extracted from the keywords of each cluster, and are recorded as labels of each cluster.
The cluster tag generation system according to claim 19, wherein said extracting the representative keywords comprises: extracting keywords of each cluster according to a conditional probability value size of the words.