CN112100986B - Voice text clustering method and device - Google Patents

Info

Publication number
CN112100986B
CN112100986B (application number CN202011247724.8A)
Authority
CN
China
Prior art keywords
text
vectors
vector
text vector
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011247724.8A
Other languages
Chinese (zh)
Other versions
CN112100986A (en)
Inventor
胡洪兵
李健
武卫东
陈明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN202011247724.8A priority Critical patent/CN112100986B/en
Publication of CN112100986A publication Critical patent/CN112100986A/en
Application granted granted Critical
Publication of CN112100986B publication Critical patent/CN112100986B/en
Priority to PCT/CN2021/097427 priority patent/WO2022100071A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application relate to a method and a device for clustering voice texts. The method comprises the following steps: preprocessing a plurality of voice texts to obtain a plurality of corresponding voice texts to be processed; converting the voice texts to be processed into text vectors using a word representation model; clustering the text vectors into a plurality of intermediate-layer categories; and calculating a center vector for each intermediate-layer category and re-dividing the text vectors into a plurality of categories based on the center vectors. Embodiments of the present application enable more accurate cluster analysis of voice text.

Description

Voice text clustering method and device
Technical Field
The embodiment of the application relates to the technical field of text clustering, in particular to a method and a device for clustering voice texts.
Background
In the big-data era, the importance of data is self-evident. Data has become important virtual property for companies, and each company builds hard-to-match technical advantages in its field from the data it controls.
Using data presupposes cluster analysis for subsequent use. However, the large volumes of voice text generated in daily business are more discrete and span more categories than typical internet text; for example, a customer-service call transcript of a mobile operator may cover many different classes such as fee inquiry, broadband handling, emergency suspension and resumption, and regional roaming. These properties of voice text put great pressure on common clustering methods, so clustering voice text is a major research difficulty in the industry.
Disclosure of Invention
Based on the above problems, embodiments of the present application provide a method and an apparatus for clustering speech texts, which aim to implement more accurate cluster analysis on the speech texts.
A first aspect of an embodiment of the present application provides a method for clustering speech texts, where the method includes:
preprocessing the plurality of voice texts to obtain a plurality of voice texts to be processed;
converting the voice text to be processed into a text vector by using a word representation model;
clustering all text vectors obtained by the voice texts to be processed to obtain a plurality of intermediate categories;
and calculating the central vector of each intermediate category, and dividing all the text vectors into a plurality of categories again based on a plurality of central vectors obtained by calculation.
Optionally, the method further comprises:
calculating the number of the text vectors of which the categories are changed after all the text vectors are divided into the plurality of categories;
judging whether the number of the text vectors of the changed category is larger than a preset threshold value, and when the number of the text vectors of the changed category is larger than the preset threshold value, continuing to execute the following steps:
iteratively recalculating the center vector of each of the plurality of categories and re-classifying all the text vectors based on the recalculated center vectors, until the number of text vectors whose category has changed is smaller than the preset threshold.
Optionally, the preprocessing the plurality of speech texts comprises:
performing word segmentation and part-of-speech tagging on the voice text;
and/or stop word filtering of the speech text.
Optionally, the word representation model comprises any one of:
Word2vec, CBOW, Skip-gram, GloVe, BERT, GPT2.0.
Optionally, converting the to-be-processed speech text into a text vector by using a word representation model includes:
converting real words in the voice text to be processed into a plurality of word vectors, wherein the real words comprise at least one of nouns, verbs and verbal nouns;
averagely pooling word vectors contained in a plurality of sentences in the voice text to be processed to obtain a plurality of corresponding sentence vectors;
and combining the sentence vectors to obtain the text vector.
Optionally, clustering all text vectors obtained from the plurality of to-be-processed speech texts to obtain a plurality of intermediate categories, including:
S1, numbering all text vectors obtained from the plurality of to-be-processed voice texts from 1 to n;
S2, dividing the first text vector into a first text vector class;
S3, calculating a first-round cosine similarity between the second text vector and the first text vector;
S4, if the first-round cosine similarity is greater than a preset threshold, dividing the second text vector into the first text vector class;
S5, if the first-round cosine similarity is smaller than the preset threshold, dividing the second text vector into a second text vector class;
S6, when processing each new text vector in sequence, reading the text vector classes already divided;
S7, calculating a new round of cosine similarities between the new text vector and the already-divided text vector classes in sequence, and when the cosine similarity between the new text vector and any already-divided text vector class is greater than the preset threshold, dividing the new text vector into that class;
or, when the cosine similarity between the new text vector and every already-divided text vector class is not greater than the preset threshold, dividing the new text vector into a pth text vector class, wherein p is the number of already-divided text vector classes plus one;
S8, repeating steps S6 and S7 until all text vectors have been processed.
A second aspect of the embodiments of the present application provides a speech text clustering device, where the device includes:
the preprocessing module is used for preprocessing the voice texts to obtain a plurality of voice texts to be processed;
the conversion module is used for converting the voice text to be processed into a text vector by utilizing a word representation model;
the first clustering module is used for clustering all text vectors obtained by the voice texts to be processed to obtain a plurality of intermediate categories;
and the second classification module is used for calculating the central vector of each intermediate class and dividing all the text vectors into a plurality of classes again on the basis of a plurality of central vectors obtained by calculation.
Optionally, the apparatus further comprises:
the judging module is used for calculating the number of the text vectors of which the categories are changed after all the text vectors are divided into the plurality of categories;
the iteration module is used for judging whether the number of the text vectors of the changed category is larger than a preset threshold value or not, and when the number of the text vectors of the changed category is larger than the preset threshold value, the following steps are continuously executed:
iteratively recalculating the center vector of each of the plurality of categories and re-classifying all the text vectors based on the recalculated center vectors, until the number of text vectors whose category has changed is smaller than the preset threshold.
Optionally, the conversion module includes:
the real word processing submodule is used for converting real words in the voice text to be processed into a plurality of word vectors, and the real words comprise at least one of nouns, verbs and verbal nouns;
the average pooling submodule is used for averagely pooling word vectors contained in a plurality of sentences in the voice text to be processed to obtain a plurality of corresponding sentence vectors;
and the sentence vector combination submodule is used for combining the sentence vectors to obtain the text vector.
Optionally, the step of clustering, by the first clustering module, all text vectors obtained from the plurality of to-be-processed speech texts to obtain a plurality of intermediate categories includes:
S1, numbering all text vectors obtained from the plurality of to-be-processed voice texts from 1 to n;
S2, dividing the first text vector into a first text vector class;
S3, calculating a first-round cosine similarity between the second text vector and the first text vector;
S4, if the first-round cosine similarity is greater than a preset threshold, dividing the second text vector into the first text vector class;
S5, if the first-round cosine similarity is smaller than the preset threshold, dividing the second text vector into a second text vector class;
S6, when processing each new text vector in sequence, reading the text vector classes already divided;
S7, calculating a new round of cosine similarities between the new text vector and the already-divided text vector classes in sequence, and when the cosine similarity between the new text vector and any already-divided text vector class is greater than the preset threshold, dividing the new text vector into that class;
or, when the cosine similarity between the new text vector and every already-divided text vector class is not greater than the preset threshold, dividing the new text vector into a pth text vector class, wherein p is the number of already-divided text vector classes plus one;
S8, repeating steps S6 and S7 until all text vectors have been processed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a flowchart of a method for clustering speech texts according to an embodiment of the present application;
FIG. 2 is a flow chart of obtaining intermediate layer categories for clustering according to an embodiment of the present application;
fig. 3 is a schematic diagram of a speech text clustering apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, existing speech text clustering methods are introduced. When conventional clustering algorithms are applied to voice texts, the clustering effect often falls short of expectations, for two main reasons. First, common clustering algorithms such as KMeans classify massive voice texts by specifying a specific number of categories in advance; but in real scenarios the number of voice-text categories is unknown and cannot be estimated, so algorithms that require a preset category count cannot really meet the requirement. Second, the clustering results within a single category have no similarity constraint, and some samples lie far from the center. For example, the SinglePass clustering algorithm needs no preset cluster count, only a similarity threshold; but because its class center is a fixed sample point chosen directly, and mean-shift iteration is lacking, its overall effect cannot meet the requirement either.
The inventive point of the present application is to first cluster voice text vectors by the cosine distance between them, obtaining a preliminary number of categories, and then to re-divide the vectors based on the center vector of each category. The voice texts are thereby classified.
Referring to fig. 1, fig. 1 is a flowchart of a speech text clustering method according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
step S101, preprocessing a plurality of voice texts to obtain a plurality of voice texts to be processed. The voice call text cannot be directly processed, and preprocessing steps such as segmenting the voice text, performing part-of-speech tagging on segmentation results, removing punctuation marks in the voice text and the like are required.
Step S102, converting the voice text to be processed into a text vector using a word representation model. Language is a product of human social life and a tool for expressing abstract thought; to a machine, carriers of language such as text are merely symbolic data that it cannot understand, so the voice text must be converted into vectors that are convenient for the machine to process.
The word representation model is a pre-trained model that converts natural language into vectors; examples include the VSM vector space model, locality-sensitive hashing, LSA/LDA topic models, word2vec/doc2vec distributed representations, deep pre-trained models such as BERT/ELMo/GPT, and task-specific text representations (AVG/DNN/RNN/CNN/AE).
Step S103, clustering all the text vectors obtained by the voice texts to be processed to obtain a plurality of intermediate categories.
In an optional embodiment of the present application, after the text vectors are obtained in step S102, the cosine distances between all text vectors are calculated, and grouping together text vectors whose mutual cosine distance is smaller than a certain threshold yields a simple preliminary classification.
And step S104, calculating the center vector of each intermediate category, and dividing all the text vectors into a plurality of categories again based on a plurality of center vectors obtained by calculation.
After clustering the text vectors in step S103, a plurality of categories can be obtained. However, the classification obtained in step S103 is not an accurate classification of each text vector, and some vectors that are actually of the same class but have a far cosine distance may be classified into another class.
Therefore, in step S104, the center vector of each category obtained in step S103 is calculated, yielding a plurality of center vectors.
In an embodiment of the present application, the average of the text vectors of each type in step S103 is calculated and used as the center vector of each type.
After the center vector of each class is obtained, the cosine distances between all text vectors and all center vectors are calculated, and the text vectors are re-classified according to those distances. As a simplified one-dimensional illustration (absolute distance standing in for cosine distance): suppose class A contains the text vectors 6 and 11 and class B contains 9 and 12. The class centers, taken as the class means, are 8.5 and 10.5. The distances from the four vectors to the class-A center are 2.5, 2.5, 0.5 and 3.5, and to the class-B center 4.5, 0.5, 1.5 and 1.5, so different vectors lie at different distances from different centers. Each text vector is then assigned to the class of the center nearest to it: 6 and 9 are closest to the class-A center and form one class, while 11 and 12 are closest to the class-B center and form the other. The classes of the text vectors are thus redrawn.
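The one-dimensional illustration above can be reproduced in a few lines of Python. Absolute distance stands in for cosine distance, and the centers are recomputed here as the class means, so intermediate numbers may differ slightly from the worked figures in the text; the final regrouping (6 with 9, and 11 with 12) is the same.

```python
def reassign_1d(points, centers):
    """Assign each scalar point to the index of its nearest center
    (absolute distance stands in for cosine distance in 1-D)."""
    return [min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            for p in points]

# Class A initially holds 6 and 11; class B holds 9 and 12.
points = [6, 11, 9, 12]
centers = [(6 + 11) / 2, (9 + 12) / 2]  # class means: 8.5 and 10.5
labels = reassign_1d(points, centers)
print(labels)  # 6 and 9 end up together; 11 and 12 end up together
```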
With the voice text clustering method of the present application, a plurality of voice texts are preprocessed to obtain a plurality of voice texts to be processed; the voice texts to be processed are converted into text vectors with a word representation model; the text vectors are clustered by cosine distance into a plurality of intermediate-layer categories; and a center vector is calculated for each intermediate-layer category, after which the text vectors are re-divided into a plurality of categories based on the center vectors. Because each class center is computed after the cosine-distance clustering and the vectors are then re-classified by their distance to those centers, a more accurate cluster analysis of the voice texts is achieved.
In an optional embodiment of the present application, as shown in fig. 2, the clustering all text vectors obtained from the plurality of to-be-processed speech texts in step S103 to obtain a plurality of intermediate categories includes:
and S1, numbering all the text vectors obtained by the plurality of to-be-processed voice texts, wherein the numbering is from 1 to n. Here, the text vectors are numbered in order to distinguish the text vectors from each other, and therefore, other numbering schemes may be selected and the text vectors may be numbered in the present application.
S2, dividing the first text vector into a first text vector class. The text vector numbered 1 is divided into the 1st text vector class, and the representative vector of this class is the text vector numbered 1.
S3, calculating the first-round cosine similarity between the second text vector and the first text vector. Cosine similarity here refers to the cosine distance between text vectors; the cosine distance between the 2nd text vector and the 1st text vector is calculated.
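For reference, the cosine similarity used throughout steps S3 to S7 can be computed as follows. This is a plain-Python sketch; a production system would use a vectorized library.

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|): 1.0 for identical directions, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```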
S4, if the first-round cosine similarity is greater than the preset threshold, dividing the second text vector into the first text vector class. In this case the 1st text vector class, whose representative vector is the text vector numbered 1, actually contains two text vectors, the 1st and the 2nd; its representative vector remains the 1st text vector, as at the start.
S5, if the first-round cosine similarity is smaller than the preset threshold, dividing the second text vector into a second text vector class. In this case a 2nd text vector class is created, with the 2nd text vector as its representative vector.
S6, when processing each new text vector in sequence, reading the text vector classes already divided. By the time the 3rd or a later text vector is processed, more than one text vector class may already exist (for example, the 2nd text vector is placed in a 2nd class when its cosine similarity to the 1st text vector is below the preset threshold); the previously divided classes are therefore read for the subsequent calculation.
S7, calculating a new round of cosine similarity between the new text vector and the classified text vector class in sequence, and when the calculated new round of cosine similarity between the new text vector and any one of the classified text vector classes is larger than a preset threshold value, dividing the new text vector into the classified text vector class;
or when the similarity of the new cosine of any one category of the new text vector and the divided text vector is not greater than a preset threshold value, dividing the new text vector into a pth text vector category, wherein p is the number of the divided text vector categories plus one.
The previously divided text vector classes are read, and the cosine similarities between the 3rd text vector and the representative vectors of all previously divided classes are calculated in sequence, which may yield several second-round cosine similarities. As soon as a calculated cosine similarity exceeds the preset threshold, the 3rd text vector is divided into the class of that representative vector, and the remaining classes are not evaluated.
If the cosine similarity between the 3rd text vector and the representative vector of every previous class is not greater than the preset threshold, the 3rd text vector founds a new, pth text vector class. The index of the new class follows from how the first two text vectors were classified: if they fell into two classes, the 3rd text vector founds the 3rd text vector class (two plus one); if they fell into one class, it founds the 2nd text vector class (one plus one). In either case the representative vector of the newly founded class is the 3rd text vector.
S8, repeating the steps S6 and S7 until all the text vectors are calculated.
Each text vector after the 3rd is handled the same way: the divided text vector classes are read, the cosine similarities between the text vector and the representative vectors of all classes are calculated in sequence, and when a cosine similarity exceeds the preset threshold the text vector is placed in the class of that representative vector. When every similarity is below the threshold, the text vector founds a new class and becomes its representative vector. This continues until all text vectors have been processed.
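Steps S1 to S8 amount to a single-pass clustering over the text vectors. A minimal sketch, assuming each class is represented by its first member and using a hypothetical similarity threshold of 0.9:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def single_pass_cluster(vectors, threshold=0.9):
    """Per steps S1-S8: a vector joins the first already-divided class whose
    representative vector is similar enough; otherwise it founds a new class
    and becomes that class's representative vector."""
    reps, labels = [], []  # representative vector of each class
    for v in vectors:
        for idx, rep in enumerate(reps):
            if cosine_similarity(v, rep) > threshold:
                labels.append(idx)
                break
        else:  # no existing class was similar enough: found class p
            reps.append(v)
            labels.append(len(reps) - 1)
    return labels

vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
print(single_pass_cluster(vectors, threshold=0.9))  # [0, 0, 1, 1]
```

Note that, as the description observes, the result depends on input order and on fixed representative points, which is why the later center-vector re-division is needed.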
In an optional embodiment of the present application, after step S104, the method further includes:
step S105, calculating the number of the text vectors of which the categories are changed after all the text vectors are divided into the plurality of categories.
Based on the classification of the text vectors in step S103, S104 re-classifies the categories based on the center vector of each category, in which a part of the text vectors are changed from the intermediate-level category in step S103 to a new category, calculates the number of changed categories, determines whether the number is greater than a preset threshold, and performs the subsequent steps when the number is greater than the preset threshold.
Step S106, judging whether the number of the text vectors of the changed category is larger than a preset threshold value, and when the number of the text vectors of the changed category is larger than the preset threshold value, continuously executing the following steps:
iteratively recalculating the center vector of each of the plurality of categories and re-classifying all the text vectors based on the recalculated center vectors, until the number of text vectors whose category has changed is smaller than the preset threshold.
On the basis of the classification of S104, a new round of center vectors is computed for each class and the vectors are re-clustered accordingly. In each round it is checked whether the number of text vectors whose class changed is smaller than the preset threshold; while it remains above the threshold, the steps of recomputing the center vectors and re-clustering are iterated, until the number of changed text vectors falls below the threshold. The classification at that point is output as the final result.
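The iteration of steps S105 and S106 can be sketched as follows. The toy vectors and a change threshold of zero are illustrative; for brevity the sketch assumes no class ever becomes empty, and it reassigns each vector to its most cosine-similar center.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def mean_vector(vecs):
    """Component-wise mean: the center vector of one class."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def refine(vectors, labels, change_threshold=0):
    """Recompute class centers and reassign each vector to its most similar
    center, repeating while the number of vectors that changed class stays
    above change_threshold (steps S105-S106). Assumes no class empties."""
    while True:
        k = max(labels) + 1
        centers = [mean_vector([v for v, l in zip(vectors, labels) if l == i])
                   for i in range(k)]
        new_labels = [max(range(k), key=lambda i: cosine_similarity(v, centers[i]))
                      for v in vectors]
        changed = sum(old != new for old, new in zip(labels, new_labels))
        labels = new_labels
        if changed <= change_threshold:
            return labels

# A deliberately poor initial split is corrected by the iteration.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(refine(vectors, [0, 1, 1, 1]))  # [0, 0, 1, 1]
```

This is the mean-shift-style iteration whose absence in plain SinglePass the background section criticizes.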
In an optional embodiment of the present application, the preprocessing the plurality of phonetic texts in step S101 includes:
performing word segmentation and part-of-speech tagging on the voice text;
and/or stop word filtering of the speech text.
In an optional embodiment of the application, the word representation model comprises any one of:
Word2vec, CBOW, Skip-gram, GloVe, BERT, GPT2.0.
Text is unstructured data and cannot be computed on directly, so text representation is required in NLP (natural language processing): it converts unstructured information into structured information, after which everyday tasks such as text classification and sentiment judgment can be completed by computing on the represented text.
Word embedding is one method of text representation: a text is expressed by a low-dimensional vector, semantically similar words lie close together in the vector space, and the representation is general enough to be used across different tasks. Likewise, CBOW, Skip-gram, GloVe, BERT and GPT2.0 are neural network language models that can be used for text representation; GloVe, for example, extends the Word2vec method by combining global co-occurrence statistics with Word2vec's context-based learning.
In an optional embodiment of the present application, the converting the to-be-processed phonetic text into a text vector by using the word representation model in step S102 includes:
converting real words in the voice text to be processed into a plurality of word vectors, wherein the real words comprise at least one of nouns, verbs and verbal nouns;
averagely pooling word vectors contained in a plurality of sentences in the voice text to be processed to obtain a plurality of corresponding sentence vectors;
and combining the sentence vectors to obtain the text vector.
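A sketch of this optional conversion, assuming the word vectors have already been produced by the word representation model. The description says the sentence vectors are "combined" without fixing the operation, so averaging them into one text vector is an assumption of this sketch (concatenation would be another reading).

```python
def average_pool(word_vectors):
    """Average a sentence's word vectors into one sentence vector."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors)
            for i in range(dim)]

def text_vector(sentences):
    """Pool each sentence's word vectors, then combine the sentence
    vectors (here: by averaging again) into one text vector."""
    return average_pool([average_pool(s) for s in sentences])

# Two sentences, each a list of toy 2-d word vectors.
sents = [[[1.0, 0.0], [0.0, 1.0]],  # sentence 1 pools to [0.5, 0.5]
         [[1.0, 1.0]]]              # sentence 2 pools to [1.0, 1.0]
print(text_vector(sents))           # [0.75, 0.75]
```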
Based on the same inventive concept, an embodiment of the present application provides a speech text clustering device. Referring to fig. 3, fig. 3 is a schematic diagram of a speech text clustering apparatus according to an embodiment of the present application. As shown in fig. 3, the apparatus includes:
the preprocessing module 301 is used for preprocessing a plurality of voice texts to obtain a plurality of voice texts to be processed;
a conversion module 302, configured to convert the to-be-processed speech text into a text vector by using a word representation model;
a first clustering module 303, configured to cluster all text vectors obtained from the multiple to-be-processed speech texts to obtain multiple intermediate categories;
and a second classification module 304, configured to calculate a center vector of each intermediate class, and re-divide all text vectors into multiple classes based on multiple center vectors obtained by calculation.
In an optional embodiment of the present application, the apparatus further comprises:
a judging module, configured to count the number of text vectors whose category changed after all the text vectors were divided into the plurality of categories;
and an iteration module, configured to judge whether the number of changed text vectors is greater than a preset threshold and, when it is, to continue executing the following step:
iteratively recalculating the center vector of each of the plurality of categories and reclassifying all the text vectors based on the recalculated center vectors, until the number of changed text vectors is less than the preset threshold.
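The iteration performed by the judging and iteration modules can be sketched as follows. This is a sketch under stated assumptions: `iterate_until_stable` and `change_threshold` are illustrative names, cosine similarity on unit-normalized vectors is used for the reassignment (matching the similarity measure used elsewhere in this document), and the loop assumes no category ever becomes empty.

```python
import numpy as np

def iterate_until_stable(vectors, labels, k, change_threshold=1):
    """Repeat: recompute each category's center vector, reassign every
    text vector to its most similar center, and stop once the number of
    vectors that changed category in a round falls below the threshold."""
    # Unit-normalize once so that a dot product equals cosine similarity.
    norm = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    while True:
        centers = np.stack([norm[labels == c].mean(axis=0) for c in range(k)])
        centers /= np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12
        new_labels = np.argmax(norm @ centers.T, axis=1)  # cosine assignment
        changed = int((new_labels != labels).sum())
        labels = new_labels
        if changed < change_threshold:
            return labels

# A deliberately imperfect initial division: vector 1 starts in the wrong
# category and is moved during iteration, after which the labels are stable.
vectors = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.2, 0.9]])
labels = np.array([0, 1, 1, 1])
final = iterate_until_stable(vectors, labels, k=2)
print(final)  # -> [0 0 1 1]
```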
In an optional embodiment of the present application, the conversion module includes:
a real-word processing submodule, configured to convert real words (content words) in the to-be-processed speech text into a plurality of word vectors, the real words comprising at least one of nouns, verbs, and gerunds;
an average-pooling submodule, configured to average-pool the word vectors contained in each of the plurality of sentences of the to-be-processed speech text to obtain a plurality of corresponding sentence vectors;
and a sentence vector combination submodule, configured to combine the sentence vectors to obtain the text vector.
In an optional embodiment of the present application, the clustering, by the first clustering module, of all text vectors obtained from the plurality of to-be-processed speech texts into a plurality of intermediate categories includes:
S1, numbering all text vectors obtained from the plurality of to-be-processed speech texts from 1 to n;
S2, assigning the first text vector to a first text vector class;
S3, calculating a first-round cosine similarity between the second text vector and the first text vector;
S4, if the first-round cosine similarity is greater than a preset threshold, assigning the second text vector to the first text vector class;
S5, if the first-round cosine similarity is not greater than the preset threshold, assigning the second text vector to a second text vector class;
S6, when processing each new text vector in sequence, reading the text vector classes already formed;
S7, calculating in turn a new round of cosine similarity between the new text vector and each of the already-formed text vector classes; when the cosine similarity between the new text vector and any one of the already-formed classes is greater than the preset threshold, assigning the new text vector to that class;
or, when the cosine similarity between the new text vector and every already-formed class is not greater than the preset threshold, assigning the new text vector to a pth text vector class, where p is the number of already-formed classes plus one;
S8, repeating steps S6 and S7 until all text vectors have been processed.
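The single-pass procedure of steps S1–S8 can be sketched as follows. The embodiment compares a new vector with a "class" without fixing the class representative; the running mean of the class members used here is one plausible choice, and the function name and threshold value are illustrative.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def single_pass_cluster(vectors, threshold=0.8):
    """Single-pass clustering per steps S1-S8: each vector in turn joins
    the first existing class it is similar enough to; otherwise it founds
    a new class. Returns the classes as lists of vector indices."""
    classes = []   # list of member-index lists, one per class
    sums = []      # running sum of member vectors, to form each class mean
    for i, v in enumerate(vectors):
        placed = False
        for c, members in enumerate(classes):
            rep = sums[c] / len(members)          # class representative (mean)
            if cosine(v, rep) > threshold:        # S7: similarity exceeds threshold
                members.append(i)
                sums[c] += v
                placed = True
                break
        if not placed:                            # similarity with every class
            classes.append([i])                   # not greater than threshold:
            sums.append(v.copy())                 # found a new pth class
    return classes

vectors = np.array([[1.0, 0.0], [0.95, 0.1], [0.0, 1.0], [0.05, 0.98]])
clusters = single_pass_cluster(vectors, threshold=0.8)
print(clusters)  # -> [[0, 1], [2, 3]]
```

Note that the result depends on the order in which the vectors are processed, which is why the embodiment numbers them 1 to n first.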
Since the device embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The speech text clustering method and device provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the embodiments is intended only to aid understanding of the method and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present application, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (8)

1. A method for clustering speech texts, the method comprising:
preprocessing a plurality of speech texts to obtain a plurality of to-be-processed speech texts;
converting the to-be-processed speech texts into text vectors by using a word representation model;
clustering all text vectors obtained from the plurality of to-be-processed speech texts to obtain a plurality of intermediate categories;
calculating a center vector of each intermediate category, and re-dividing all the text vectors into a plurality of categories based on the calculated center vectors;
counting the number of text vectors whose category changed after all the text vectors were divided into the plurality of categories;
judging whether the number of changed text vectors is greater than a preset threshold and, when it is, continuing to execute the following step:
iteratively recalculating the center vector of each of the plurality of categories and reclassifying all the text vectors based on the recalculated center vectors, until the number of changed text vectors is less than the preset threshold.
2. The method of claim 1, wherein preprocessing the plurality of speech texts comprises:
performing word segmentation and part-of-speech tagging on the speech texts;
and/or filtering stop words from the speech texts.
3. The method of claim 1, wherein the word representation model comprises any one of:
Word2vec, CBOW, Skip-gram, GloVe, BERT, GPT2.0.
4. The method of claim 1, wherein converting the to-be-processed speech text into a text vector by using a word representation model comprises:
converting real words (content words) in the to-be-processed speech text into a plurality of word vectors, wherein the real words comprise at least one of nouns, verbs, and gerunds;
average-pooling the word vectors contained in each of the plurality of sentences of the to-be-processed speech text to obtain a plurality of corresponding sentence vectors;
and combining the sentence vectors to obtain the text vector.
5. The method of claim 1, wherein clustering all text vectors obtained from the plurality of to-be-processed speech texts to obtain a plurality of intermediate categories comprises:
S1, numbering all text vectors obtained from the plurality of to-be-processed speech texts from 1 to n;
S2, assigning the first text vector to a first text vector class;
S3, calculating a first-round cosine similarity between the second text vector and the first text vector;
S4, if the first-round cosine similarity is greater than a preset threshold, assigning the second text vector to the first text vector class;
S5, if the first-round cosine similarity is not greater than the preset threshold, assigning the second text vector to a second text vector class;
S6, when processing each new text vector in sequence, reading the text vector classes already formed;
S7, calculating in turn a new round of cosine similarity between the new text vector and each of the already-formed text vector classes; when the cosine similarity between the new text vector and any one of the already-formed classes is greater than the preset threshold, assigning the new text vector to that class;
or, when the cosine similarity between the new text vector and every already-formed class is not greater than the preset threshold, assigning the new text vector to a pth text vector class, where p is the number of already-formed classes plus one;
S8, repeating steps S6 and S7 until all text vectors have been processed.
6. An apparatus for clustering speech texts, the apparatus comprising:
a preprocessing module, configured to preprocess a plurality of speech texts to obtain a plurality of to-be-processed speech texts;
a conversion module, configured to convert the to-be-processed speech texts into text vectors by using a word representation model;
a first clustering module, configured to cluster all text vectors obtained from the plurality of to-be-processed speech texts to obtain a plurality of intermediate categories;
a second classification module, configured to calculate a center vector of each intermediate category and re-divide all the text vectors into a plurality of categories based on the calculated center vectors;
a judging module, configured to count the number of text vectors whose category changed after all the text vectors were divided into the plurality of categories;
and an iteration module, configured to judge whether the number of changed text vectors is greater than a preset threshold and, when it is, to continue executing the following step:
iteratively recalculating the center vector of each of the plurality of categories and reclassifying all the text vectors based on the recalculated center vectors, until the number of changed text vectors is less than the preset threshold.
7. The apparatus of claim 6, wherein the conversion module comprises:
a real-word processing submodule, configured to convert real words (content words) in the to-be-processed speech text into a plurality of word vectors, the real words comprising at least one of nouns, verbs, and gerunds;
an average-pooling submodule, configured to average-pool the word vectors contained in each of the plurality of sentences of the to-be-processed speech text to obtain a plurality of corresponding sentence vectors;
and a sentence vector combination submodule, configured to combine the sentence vectors to obtain the text vector.
8. The apparatus of claim 6, wherein the clustering, by the first clustering module, of all text vectors obtained from the plurality of to-be-processed speech texts into a plurality of intermediate categories comprises:
S1, numbering all text vectors obtained from the plurality of to-be-processed speech texts from 1 to n;
S2, assigning the first text vector to a first text vector class;
S3, calculating a first-round cosine similarity between the second text vector and the first text vector;
S4, if the first-round cosine similarity is greater than a preset threshold, assigning the second text vector to the first text vector class;
S5, if the first-round cosine similarity is not greater than the preset threshold, assigning the second text vector to a second text vector class;
S6, when processing each new text vector in sequence, reading the text vector classes already formed;
S7, calculating in turn a new round of cosine similarity between the new text vector and each of the already-formed text vector classes; when the cosine similarity between the new text vector and any one of the already-formed classes is greater than the preset threshold, assigning the new text vector to that class;
or, when the cosine similarity between the new text vector and every already-formed class is not greater than the preset threshold, assigning the new text vector to a pth text vector class, where p is the number of already-formed classes plus one;
S8, repeating steps S6 and S7 until all text vectors have been processed.
CN202011247724.8A 2020-11-10 2020-11-10 Voice text clustering method and device Active CN112100986B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011247724.8A CN112100986B (en) 2020-11-10 2020-11-10 Voice text clustering method and device
PCT/CN2021/097427 WO2022100071A1 (en) 2020-11-10 2021-05-31 Voice text clustering method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011247724.8A CN112100986B (en) 2020-11-10 2020-11-10 Voice text clustering method and device

Publications (2)

Publication Number Publication Date
CN112100986A CN112100986A (en) 2020-12-18
CN112100986B true CN112100986B (en) 2021-02-12

Family

ID=73785046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011247724.8A Active CN112100986B (en) 2020-11-10 2020-11-10 Voice text clustering method and device

Country Status (2)

Country Link
CN (1) CN112100986B (en)
WO (1) WO2022100071A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100986B (en) * 2020-11-10 2021-02-12 北京捷通华声科技股份有限公司 Voice text clustering method and device
CN115457940A (en) * 2022-08-31 2022-12-09 云知声智能科技股份有限公司 Voiceprint clustering method, device, equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7676465B2 (en) * 2006-07-05 2010-03-09 Yahoo! Inc. Techniques for clustering structurally similar web pages based on page features
CN101853272B (en) * 2010-04-30 2012-07-04 华北电力大学(保定) Search engine technology based on relevance feedback and clustering
US9015160B2 (en) * 2011-12-14 2015-04-21 Brainspace Corporation Multi-concept latent semantic analysis queries
CN103049581B (en) * 2013-01-21 2015-10-07 北京航空航天大学 A kind of web text classification method based on consistance cluster
CN107609102A (en) * 2017-09-12 2018-01-19 电子科技大学 A kind of short text on-line talking method
CN110888978A (en) * 2018-09-06 2020-03-17 北京京东金融科技控股有限公司 Article clustering method and device, electronic equipment and storage medium
CN111191767B (en) * 2019-12-17 2023-06-06 博雅信安科技(北京)有限公司 Vectorization-based malicious traffic attack type judging method
CN111414479B (en) * 2020-03-16 2023-03-21 北京智齿博创科技有限公司 Label extraction method based on short text clustering technology
CN112100986B (en) * 2020-11-10 2021-02-12 北京捷通华声科技股份有限公司 Voice text clustering method and device

Also Published As

Publication number Publication date
CN112100986A (en) 2020-12-18
WO2022100071A1 (en) 2022-05-19

Similar Documents

Publication Publication Date Title
US11948066B2 (en) Processing sequences using convolutional neural networks
CN108304468B (en) Text classification method and text classification device
CN108416032B (en) Text classification method, device and storage medium
CN109543190A (en) A kind of intension recognizing method, device, equipment and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN114298121B (en) Multi-mode-based text generation method, model training method and device
CN113627447A (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN112100986B (en) Voice text clustering method and device
CN113254620B (en) Response method, device and equipment based on graph neural network and storage medium
CN113254637A (en) Grammar-fused aspect-level text emotion classification method and system
CN113094481A (en) Intention recognition method and device, electronic equipment and computer readable storage medium
CN112562640B (en) Multilingual speech recognition method, device, system, and computer-readable storage medium
CN113361258A (en) Aspect-level emotion analysis method and system based on graph convolution network and attention selection
CN112989822A (en) Method, device, electronic equipment and storage medium for recognizing sentence categories in conversation
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN113449084A (en) Relationship extraction method based on graph convolution
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN111898363A (en) Method and device for compressing long and difficult sentences of text, computer equipment and storage medium
CN116070642A (en) Text emotion analysis method and related device based on expression embedding
CN116011429A (en) Emotion triplet extraction method and system based on graph neural network
CN113434630B (en) Customer service evaluation method, customer service evaluation device, terminal equipment and medium
CN114970666A (en) Spoken language processing method and device, electronic equipment and storage medium
CN114662496A (en) Information identification method, device, equipment, storage medium and product
CN111581335A (en) Text representation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant