CN111061837A - Topic identification method, device, equipment and medium - Google Patents

Topic identification method, device, equipment and medium

Info

Publication number
CN111061837A
Authority
CN
China
Prior art keywords
topic
text data
model
topic identification
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911311140.XA
Other languages
Chinese (zh)
Inventor
罗欣
张爽
林少娃
朱蕊倩
陈博
魏骁雄
陈奕汝
叶红豆
丁嘉涵
杨建军
钟震远
李元
张琪
雍旭龙
陈小红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Electric Power Research Institute of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of State Grid Zhejiang Electric Power Co Ltd
Priority to CN201911311140.XA
Publication of CN111061837A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/35 Clustering; Classification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic identification method in the field of natural language processing, which addresses the lack of existing methods for statistically analyzing customer-service content. The method comprises the following steps: obtaining unlabeled text data and labeled text data; training a theme recognition model with the labeled text data as training samples; identifying the unlabeled text data with the theme recognition model to obtain a theme recognition result; and inputting the theme recognition result into a topic generation model to obtain the topics corresponding to each theme. The invention also discloses a topic identification device, an electronic device, and a computer storage medium. By recognizing the theme of each text and deriving the topics under each theme through the topic generation model, the statistical analysis of customer-service content is completed.

Description

Topic identification method, device, equipment and medium
Technical Field
The present invention relates to the field of natural language processing, and in particular, to a topic identification method, apparatus, device, and medium.
Background
Customer service is the main bridge for communication between an enterprise and its users, and includes both text-based and voice-based customer service. Customer-service agents solve problems according to customers' needs, and service is generally provided one-to-one.
As an enterprise's business volume grows, customer demands increase and the customer-service workload rises day by day.
However, after customer service resolves one customer's problem, other customers often consult about the same problem; this is especially true for voice customer service, and repeatedly resolving the same problem reduces working efficiency. To improve service quality and reduce repeated requests about the same problem, customer needs and service issues must be mined and summarized from customer-service records so as to obtain a comprehensive picture of customer demands. This provides data support for subsequent analysis and improvement, thereby reducing customer-service risk and improving the customer experience.
Disclosure of Invention
In order to overcome the defects of the prior art, one objective of the present invention is to provide a topic identification method that inputs the theme recognition result into a topic generation model and thereby identifies the most common topics among customer requests.
This objective of the invention is achieved by the following technical solution:
a topic identification method comprises the following steps:
obtaining unlabeled text data and labeled text data;
training a theme recognition model with the labeled text data as training samples;
identifying the unlabeled text data with the theme recognition model to obtain a theme recognition result;
and inputting the theme recognition result into a topic generation model to obtain the topics corresponding to each theme.
Further, the labeled text data comprises a theme and text data corresponding to the theme.
Further, training the theme recognition model with the labeled text data as training samples comprises the following step:
taking the themes and their corresponding text data as training samples and training the theme recognition model with a deep-learning text classification algorithm.
Further, the theme recognition model is a binary classification model, and the deep-learning text classification algorithm is the TextCNN algorithm.
Further, identifying the unlabeled text data with the theme recognition model to obtain a theme recognition result comprises the following steps:
converting the unlabeled text data into multi-dimensional vectors with the vector-computation tool Word2vec;
and inputting the multi-dimensional vectors into the theme recognition model to obtain the theme recognition result for the unlabeled text data.
Further, before the theme recognition result is input into the topic generation model, the method further comprises the following step:
preprocessing the theme recognition result, where the preprocessing comprises word segmentation and stop-word removal.
Further, the topic generation model is LDA.
Another object of the present invention is to provide a topic identification device that inputs the theme recognition result into a topic generation model and thereby identifies the most common topics among customer requests.
This second objective of the invention is achieved by the following technical solution:
a topic identification device, comprising:
the obtaining module, used for obtaining unlabeled text data and labeled text data;
the model construction module, used for training a theme recognition model with the labeled text data as training samples;
the theme recognition module, used for identifying the unlabeled text data with the theme recognition model to obtain a theme recognition result;
and the topic identification module, used for inputting the theme recognition result into a topic generation model to obtain the topics corresponding to each theme.
It is a third object of the present invention to provide an electronic device comprising a processor, a storage medium, and a computer program stored in the storage medium; when executed by the processor, the computer program implements the above topic identification method.
It is a fourth object of the present invention to provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the topic identification method described above.
Compared with the prior art, the invention has the following beneficial effects:
Through theme recognition, the texts corresponding to each theme are obtained, so the popularity of each theme can be known. The topic generation model then performs topic identification within each theme, and the different topics under each theme and their popularity can be seen intuitively from the identified topics, which facilitates subsequent targeted service improvement.
Drawings
Fig. 1 is a flowchart of the topic identification method of the first embodiment;
Fig. 2 is a flowchart of the theme recognition steps of the topic identification method of the first embodiment;
Fig. 3 is a block diagram of the topic identification device of the second embodiment;
Fig. 4 is a block diagram of the electronic device of the third embodiment.
Detailed Description
The present invention will now be described in more detail with reference to the accompanying drawings; the description is given by way of illustration and not of limitation. The various embodiments may be combined with each other to form further embodiments not described below.
Example one
Embodiment one provides a topic identification method that analyzes the themes and topics of texts so as to obtain an analysis of the topics in customer requests.
Referring to Fig. 1, the topic identification method comprises the following steps:
s110, obtaining unmarked text data and marked text data;
the marked text data comprises a theme and text data corresponding to the theme.
The topics are summarized according to actual customer service problems, such as intelligent payment, inquiry of operator environment and house number, and the like. The embodiment does not limit the text labeling method, and only involves receiving sample data.
S120, training a theme recognition model with the labeled text data as training samples.
S120 specifically comprises the following step:
taking the themes and their corresponding text data as training samples and training the theme recognition model with a deep-learning text classification algorithm.
The theme recognition model is a binary classification model, and the deep-learning text classification algorithm is the TextCNN algorithm.
Because the problem faced by the theme recognition model is theme classification, i.e. a binary classification problem, a binary classification model is adopted. The corpus is divided into a training set and a test set: 70% of the data is randomly extracted as the training set used to build the model, and the remaining 30% is used as the test set, which makes it convenient to verify the accuracy and stability of the model.
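A minimal sketch of the 70/30 split described above, using scikit-learn's train_test_split; the sample records, variable names, and the use of scikit-learn are illustrative assumptions rather than part of the original disclosure.

```python
# Sketch of the 70%/30% training/test split of the labeled corpus.
from sklearn.model_selection import train_test_split

# Hypothetical labeled corpus: each entry is (customer-service text, label),
# where label == 1 means the text belongs to the theme (e.g. "intelligent
# payment") and label == 0 means it does not.
corpus = [
    ("How do I sign up for intelligent payment?", 1),
    ("Please check the house number on my account", 0),
    # ... more labeled records ...
]

texts = [t for t, _ in corpus]
labels = [y for _, y in corpus]

# Randomly hold out 30% of the corpus as the test set; the remaining 70%
# is used to build the model.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)
```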
The binary classification model (the theme recognition model) is evaluated with accuracy, precision, and recall, which are described below.
Accuracy: the proportion of correct predictions among all predictions; in this embodiment, this is the prediction accuracy for the texts corresponding to one theme.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision: the proportion of samples predicted positive that are predicted correctly, among all samples predicted positive.
Precision = TP / (TP + FP)
Recall: the proportion of samples predicted positive and predicted correctly, among all samples that are actually positive; also called sensitivity.
Recall = TP / (TP + FN)
TP: the number of samples that are actually positive and classified as positive (true positives).
FP: the number of samples that are actually negative but classified as positive (false positives).
TN: the number of samples that are actually negative and classified as negative (true negatives).
FN: the number of samples that are actually positive but classified as negative (false negatives).
The thresholds for accuracy, precision, and recall are set according to actual requirements and are not limited in this embodiment; the recognition result is output only when the thresholds are reached.
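The three metrics above can be computed directly from the four counts TP, FP, TN, and FN; the helper below is an illustrative sketch (the function name and the example counts are assumptions, not values from the patent).

```python
# Compute accuracy, precision, and recall from raw prediction counts.
def binary_metrics(tp: int, fp: int, tn: int, fn: int):
    """Return (accuracy, precision, recall) for a binary classifier."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Example: 80 true positives, 10 false positives, 95 true negatives, 15 false negatives.
acc, prec, rec = binary_metrics(tp=80, fp=10, tn=95, fn=15)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
# accuracy = 175/200 = 0.875, precision = 80/90 = 0.889, recall = 80/95 = 0.842
```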
S130, identifying the unlabeled text data with the theme recognition model to obtain a theme recognition result.
It should be noted that the theme recognition result includes each theme, the number of texts corresponding to the theme, and the number of texts not corresponding to the theme; the specific content of the texts is not analyzed at this stage.
Specifically, referring to Fig. 2, identifying the unlabeled text data with the theme recognition model to obtain a theme recognition result comprises the following steps:
S1301, converting the unlabeled text data into multi-dimensional vectors with the vector-computation tool Word2vec;
S1302, inputting the multi-dimensional vectors into the theme recognition model to obtain the theme recognition result for the unlabeled text data.
The Word2vec algorithm implements a distributed vector representation of words: it maps each word to a numerical vector of a fixed dimension N, where N is 100 in this embodiment, that is, each word is mapped to a 100-dimensional vector. The algorithm is based on a shallow neural network comprising an input layer, a hidden layer, and an output layer, and is trained with either the CBOW model or the Skip-gram model. This embodiment takes the CBOW model as an example:
Suppose the corpus contains V different words, i.e. the vocabulary size is V, and the number of context words is C.
The input layer is the one-hot encoding of the C context words. After a weighted average through the hidden layer (weight matrix W of dimension [V, N]), the output layer gives the predicted V-dimensional probability distribution Y1 of the target word. The cross-entropy loss L between the actual distribution Y and the predicted distribution Y1 is computed, and the hidden-layer weights are updated through back-propagation to reduce L until L no longer decreases, at which point training is complete. The updated hidden-layer matrix W is then the Word2vec vector representation of the corpus vocabulary.
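As a concrete illustration, 100-dimensional CBOW vectors of this kind can be trained with gensim's Word2Vec; gensim is not named in the patent, and the sample sentences and all hyperparameters other than the 100-dimensional vector size are assumptions.

```python
# Train CBOW word vectors (N = 100) with gensim (version 4.x API).
from gensim.models import Word2Vec

# Each document is already segmented into a list of words
# (see the preprocessing step described later).
segmented_texts = [
    ["intelligent", "payment", "sign", "up"],
    ["query", "house", "number"],
    # ... more segmented customer-service texts ...
]

model = Word2Vec(
    sentences=segmented_texts,
    vector_size=100,   # N = 100, as in this embodiment
    sg=0,              # sg=0 selects the CBOW model; sg=1 would select Skip-gram
    window=5,
    min_count=1,
)

vector = model.wv["payment"]   # a 100-dimensional numpy vector for one word
```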
The TextCNN algorithm is a text classification algorithm that applies a convolutional neural network to text classification and extracts key information from the text by using several convolution kernels of different sizes. TextCNN comprises a convolutional layer, a pooling layer, and a fully connected layer.
The TextCNN training process is as follows (a minimal code sketch follows these steps):
1. Input the corpus (texts x and their themes y) and compute the predicted values.
2. Compute the cross-entropy loss L between the predicted values and the actual values.
3. Adjust the weights through back-propagation to reduce L.
4. Repeat the above process until L no longer decreases.
Through this training, the theme classification of texts is realized: based on the word vectors of the segmented text, TextCNN matches each text against the themes, yielding the texts that successfully match each theme.
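A minimal TextCNN sketch for the binary theme classifier, written with tf.keras; the vocabulary size, sequence length, kernel sizes, and filter count are illustrative assumptions, and the embedding layer could instead be initialized with the 100-dimensional Word2vec vectors obtained above.

```python
# Minimal TextCNN for binary theme classification (theme / not theme).
import tensorflow as tf

VOCAB_SIZE = 20000   # assumed vocabulary size
SEQ_LEN = 100        # assumed (padded) number of words per text
EMBED_DIM = 100      # matches the 100-dimensional Word2vec vectors

inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# Convolution kernels of several different sizes extract key n-gram features.
pooled = []
for kernel_size in (2, 3, 4):
    conv = tf.keras.layers.Conv1D(128, kernel_size, activation="relu")(x)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))

merged = tf.keras.layers.Concatenate()(pooled)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(merged)  # binary output

model = tf.keras.Model(inputs, outputs)
# Cross-entropy loss minimized by back-propagation, as in steps 2-4 above.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train_ids, y_train, validation_data=(x_test_ids, y_test), epochs=5)
```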
After training, the prediction results of the binary classification model are evaluated for accuracy to ensure the reliability of prediction.
It should be noted that, after the recognition results are obtained, they can also be used as new training samples, so that the prediction accuracy of the model improves continuously.
From the number of texts corresponding to each theme, the popularity of each theme can be seen intuitively.
S140, inputting the theme recognition result into a topic generation model to obtain the topics corresponding to each theme.
Before the theme recognition result is input into the topic generation model, the method further comprises the following step:
preprocessing the theme recognition result, where the preprocessing comprises word segmentation and stop-word removal.
Stop words interfere with topic judgment, for example filler words such as "yes" and "oh", and need to be filtered out. The purpose of word segmentation is to recognize proper nouns that appear in customer-service dialogues; for example, "intelligent payment" needs to be recognized as one word rather than as the two separate words "intelligent" and "payment". Filtering stop words and Chinese word segmentation are conventional steps in model training and can be implemented with various open-source tools, so this embodiment does not describe or limit them specifically.
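One possible sketch of this preprocessing step uses the open-source jieba segmenter; the patent does not name a specific tool, so jieba, the user-dictionary entry, and the stop-word list below are all assumptions.

```python
# Chinese word segmentation plus stop-word filtering.
import jieba

# Register domain terms so that, e.g., "智能缴费" ("intelligent payment")
# is kept as one word instead of being split into two.
jieba.add_word("智能缴费")

STOP_WORDS = {"的", "了", "嗯", "啊", "是"}   # hypothetical stop-word list

def preprocess(text: str) -> list[str]:
    """Segment a customer-service sentence and drop stop words."""
    return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]

print(preprocess("嗯 我想问一下智能缴费的问题"))
# e.g. ['我', '想', '问', '一下', '智能缴费', '问题']
```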
The topic generation model in S140 is LDA.
The LDA algorithm (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, with a three-layer structure of words, topics, and documents. It splits an input document into words, identifies them, and outputs the topics corresponding to the document. In this embodiment, the input documents are the text data after theme recognition, and the output is the topics.
LDA is an unsupervised machine-learning technique that can be used to identify latent topic information in large-scale document collections or corpora. It adopts the bag-of-words approach, which treats each document as a word-frequency vector, thereby converting text information into numerical information that is easy to model. The bag-of-words approach does not consider word order, which simplifies the problem and also leaves room for model improvement. Each document is represented as a probability distribution over topics, and each topic is represented as a probability distribution over words.
LDA rests on two basic principles: each document is composed of several topics (Topic), and each topic can be described by several important words (Word); the same word may appear in different topics at the same time.
The specific training process of LDA is as follows (a code sketch follows these steps):
1. Each word ω in each document is randomly assigned a topic number z. Here the documents are the text data corresponding to each theme, segmented into words ω; the segmentation method is not limited in this embodiment and can be implemented with tools such as the Mahout Chinese word segmenter or ANSJ. The "topic" here is the topic to be generated.
2. Rescan the text data and, for each word ω, resample its topic using the Gibbs sampling formula.
3. Repeat step 2 until Gibbs sampling converges.
4. Compute the trained topic-word co-occurrence frequency matrix (topics and the keywords associated with them); this matrix is the LDA model.
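A compact sketch of this topic-generation step using gensim's LdaModel; note that gensim trains LDA with variational Bayes rather than the Gibbs sampling described above, and the sample texts, number of topics, and other parameters are assumptions.

```python
# LDA topic generation over the texts recognized as belonging to one theme.
from gensim import corpora
from gensim.models import LdaModel

# Texts under one theme, already segmented with stop words removed
# (see the preprocessing step above).
theme_texts = [
    ["智能缴费", "开通", "失败"],
    ["智能缴费", "签约", "流程"],
    # ... more segmented texts under the same theme ...
]

dictionary = corpora.Dictionary(theme_texts)
bow_corpus = [dictionary.doc2bow(doc) for doc in theme_texts]  # bag-of-words vectors

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=5, passes=10)

# Each topic is a probability distribution over words; print its top keywords.
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)

# Map one text to its most probable topics.
doc_topics = lda.get_document_topics(bow_corpus[0])
```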
The theme recognition model identifies the theme corresponding to each text but does not analyze the text content in detail, whereas the LDA topic generation model analyzes the input text data in detail and, by combining the segmented words of a text with similar texts, converts the texts into the topics corresponding to the theme.
With the LDA topic generation model, the topic corresponding to each text can be sorted out, and the frequency of similar texts within the same topic can be counted: the higher the frequency, the more customers raise that request, and later improvements can then target the high-frequency customer requests.
Example two
The second embodiment discloses a device corresponding to the topic identification method of the first embodiment; it is the virtual device structure of the first embodiment and, as shown in Fig. 3, comprises the following modules (an illustrative code sketch follows the module list):
the obtaining module 210, configured to obtain unlabeled text data and labeled text data;
the model construction module 220, configured to train a theme recognition model with the labeled text data as training samples;
the theme recognition module 230, configured to identify the unlabeled text data with the theme recognition model to obtain a theme recognition result;
and the topic identification module 240, configured to input the theme recognition result into a topic generation model to obtain the topics corresponding to each theme.
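Purely as an illustration of how the four modules could be composed in software, the sketch below defines a hypothetical Python class; all class, method, and attribute names are assumptions and not the patent's reference implementation.

```python
# Hypothetical composition of the device's four modules.
class TopicIdentificationDevice:
    def __init__(self, theme_model, topic_model, preprocess):
        self.theme_model = theme_model   # trained theme recognition model (e.g. TextCNN)
        self.topic_model = topic_model   # topic generation model (e.g. LDA)
        self.preprocess = preprocess     # segmentation + stop-word removal

    def acquire(self, source):
        """Obtaining module: return (unlabeled_texts, labeled_texts)."""
        return source.unlabeled(), source.labeled()

    def recognize_themes(self, texts):
        """Theme recognition module: keep only texts predicted to match the theme."""
        return [t for t in texts if self.theme_model.predict(t) == 1]

    def identify_topics(self, theme_texts):
        """Topic identification module: feed theme texts to the topic generation model."""
        docs = [self.preprocess(t) for t in theme_texts]
        return self.topic_model.topics_for(docs)
```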
Example three
Fig. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention, as shown in fig. 4, the electronic device includes a processor 310, a memory 320, an input device 330, and an output device 340; the number of the processors 310 in the computer device may be one or more, and one processor 310 is taken as an example in fig. 4; the processor 310, the memory 320, the input device 330 and the output device 340 in the electronic apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 4.
The memory 320, as a computer-readable storage medium, is used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the topic identification method in the embodiment of the present invention (for example, the obtaining module 210, the model construction module 220, the theme recognition module 230, and the topic identification module 240 of the topic identification device). The processor 310 executes the software programs, instructions, and modules stored in the memory 320 to run the various functional applications and data processing of the electronic device, thereby implementing the topic identification method of the first embodiment.
The memory 320 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 320 may further include memory located remotely from the processor 310, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input user identification information, text data, and the like. The output device 340 may include a display device such as a display screen.
Example four
The fourth embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer, perform a topic identification method comprising:
obtaining unlabeled text data and labeled text data;
training a theme recognition model with the labeled text data as training samples;
identifying the unlabeled text data with the theme recognition model to obtain a theme recognition result;
and inputting the theme recognition result into a topic generation model to obtain the topics corresponding to each theme.
Of course, the computer-executable instructions contained in the storage medium provided by the embodiment of the present invention are not limited to the method operations described above and may also perform related operations of the topic identification method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the above embodiment of the topic identification device, the included units and modules are divided only according to functional logic, but the division is not limited thereto as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not intended to limit the protection scope of the present invention.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (10)

1. A topic identification method, characterized by comprising the following steps:
obtaining unlabeled text data and labeled text data;
training a theme recognition model with the labeled text data as training samples;
identifying the unlabeled text data with the theme recognition model to obtain a theme recognition result;
and inputting the theme recognition result into a topic generation model to obtain the topics corresponding to each theme.
2. The topic identification method of claim 1, wherein the labeled text data comprises themes and the text data corresponding to each theme.
3. The topic identification method of claim 2, wherein training the theme recognition model with the labeled text data as training samples comprises the following step:
taking the themes and their corresponding text data as training samples and training the theme recognition model with a deep-learning text classification algorithm.
4. The topic identification method of claim 3, wherein the theme recognition model is a binary classification model and the deep-learning text classification algorithm is the TextCNN algorithm.
5. The topic identification method according to claim 1 or 4, wherein identifying the unlabeled text data with the theme recognition model to obtain a theme recognition result comprises the following steps:
converting the unlabeled text data into multi-dimensional vectors with the vector-computation tool Word2vec;
and inputting the multi-dimensional vectors into the theme recognition model to obtain the theme recognition result for the unlabeled text data.
6. The topic identification method of claim 5, wherein before the theme recognition result is input into the topic generation model, the method further comprises the following step:
preprocessing the theme recognition result, where the preprocessing comprises word segmentation and stop-word removal.
7. The topic identification method of claim 6 wherein the topic generation model is LDA.
8. A topic identification device, characterized by comprising:
the obtaining module, configured to obtain unlabeled text data and labeled text data;
the model construction module, configured to train a theme recognition model with the labeled text data as training samples;
the theme recognition module, configured to identify the unlabeled text data with the theme recognition model to obtain a theme recognition result;
and the topic identification module, configured to input the theme recognition result into a topic generation model to obtain the topics corresponding to each theme.
9. An electronic device comprising a processor, a storage medium, and a computer program, the computer program being stored in the storage medium, wherein the computer program, when executed by the processor, implements the topic identification method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the topic identification method according to any one of claims 1 to 7.
CN201911311140.XA 2019-12-18 2019-12-18 Topic identification method, device, equipment and medium Pending CN111061837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911311140.XA CN111061837A (en) 2019-12-18 2019-12-18 Topic identification method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911311140.XA CN111061837A (en) 2019-12-18 2019-12-18 Topic identification method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN111061837A (en) 2020-04-24

Family

ID=70302312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911311140.XA Pending CN111061837A (en) 2019-12-18 2019-12-18 Topic identification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN111061837A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460137A (en) * 2020-05-20 2020-07-28 南京大学 Micro-service focus identification method, device and medium based on topic model
CN111460137B (en) * 2020-05-20 2023-10-17 南京大学 Method, equipment and medium for identifying micro-service focus based on topic model
CN112765970A (en) * 2021-01-14 2021-05-07 深圳前海微众银行股份有限公司 Text theme determination method and device and readable storage medium
CN112926297A (en) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN112926297B (en) * 2021-02-26 2023-06-30 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN113139055A (en) * 2021-04-22 2021-07-20 康键信息技术(深圳)有限公司 Behavior tendency recognition method, behavior tendency recognition device, behavior tendency recognition equipment and storage medium of dialog text
CN114638222A (en) * 2022-05-17 2022-06-17 天津卓朗科技发展有限公司 Natural disaster data classification method and model training method and device thereof

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20200424