CN110674293B - Text classification method based on semantic migration - Google Patents

Text classification method based on semantic migration

Info

Publication number
CN110674293B
CN110674293B
Authority
CN
China
Prior art keywords
text
word
matrix
group
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910796512.6A
Other languages
Chinese (zh)
Other versions
CN110674293A (en)
Inventor
王雄
任朝俊
吴环宇
任婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910796512.6A priority Critical patent/CN110674293B/en
Publication of CN110674293A publication Critical patent/CN110674293A/en
Application granted granted Critical
Publication of CN110674293B publication Critical patent/CN110674293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention discloses a text classification method based on semantic migration. In view of the huge data volume, an election mechanism is adopted: the original texts are first grouped, and the text data set to be classified in each group is preprocessed to construct a text-word matrix. An inter-word similarity matrix is then calculated with the neural network model BERT to obtain distributed representations of the words in the task set. Truncated singular value decomposition is applied to the similarity matrix to obtain a transfer matrix, and the semantic information contained in the transfer matrix is migrated into the text vectorization process to obtain low-dimensional representations of the task-set texts. Each group is then clustered with the K-Means algorithm, several representatives are selected from all subclasses for a second round of clustering, and a majority voting principle yields the final text classification.

Description

Text classification method based on semantic migration
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a text classification method based on semantic migration.
Background
Text classification refers to a computer automatically assigning texts to several categories according to the subject matter they express. In the era of information explosion, automatic text classification helps users quickly obtain the information they need from massive amounts of text and greatly improves efficiency compared with manual information processing. Today, as the wave of artificial intelligence sweeps the globe, text classification is widely applied in fields such as text auditing, advertisement filtering and network public-opinion analysis, and has become a very important research direction in natural language processing.
Text classification methods can be broadly divided into those based on supervised learning and those based on unsupervised learning. Supervised methods require a large amount of text with classification labels for model training, and such labeled data is difficult to obtain in practical applications. The present invention is therefore primarily concerned with unsupervised text classification.
Unsupervised text classification can discover latent knowledge and patterns in large amounts of text data: it not only extracts knowledge but also organizes the text data itself. It has therefore become an important means of effectively organizing, summarizing and navigating text information, and is attracting more and more researchers.
To classify text, the text is first vectorized, and the resulting text vectors are then classified. Text vectorization is the core of the whole classification process and the main subject of the present invention.
Traditional text vectorization methods have the following defects. First, methods represented by the bag-of-words model and TFIDF (Term Frequency-Inverse Document Frequency) construct the text vector entirely from the statistical characteristics of words in the text (such as word frequency and word weight), so they cannot express text semantics well, particularly for Chinese polysemous words. Second, methods represented by LDA (Latent Dirichlet Allocation) and Doc2vec train the model only on the text set to be classified, cannot acquire prior information about natural-language semantics, and are difficult to optimize further. The resulting text vectors are of poor quality, and the classification results suffer accordingly.
Therefore, vectorizing text accurately so that it carries precise semantics, and thereby supporting the subsequent classification algorithm, is a crucial link in the text classification task. The present invention is based on this observation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a text classification method based on semantic migration, which migrates natural-language semantic information into text vectorization and on that basis realizes text classification.
In order to achieve the above object, the present invention provides a text classification method based on semantic migration, which is characterized by comprising the following steps:
(1) Dividing the text data set to be classified into G groups using an election mechanism, denoted g_1, g_2, …, g_G; each group of text data after grouping is expressed as X = {x_1, x_2, …, x_|X|}, where x_i denotes the i-th text to be classified, 1 ≤ i ≤ |X|, and |X| denotes the number of texts in the text data set X; the |X| texts constitute a word bank set W = {w_1, w_2, …, w_|W|}, where w_j denotes the j-th word and |W| denotes the number of words in the word bank set W;
(2) Performing word segmentation on the text data X of group g_1 and removing stop words;
(3) Constructing the text-word matrix A_|X|×|W|, where each row of the matrix represents a text and each column represents a word; each element a_ij records the occurrence of the j-th word in the i-th text: 1 if the word appears, 0 otherwise;
(4) Constructing the inter-word similarity matrix S_|W|×|W| using the neural network model BERT:
inputting each word into BERT to obtain its vector representation, the vector representation of the j-th word w_j being denoted v_j;
calculating the similarity between words with the cosine function:
sim(w_j', w_j*) = (v_j' · v_j*) / (||v_j'|| ||v_j*||)
where j' ≠ j* index the j'-th and j*-th words;
finally, constructing the inter-word similarity matrix S_|W|×|W| from all the pairwise inter-word similarities;
(5) Applying truncated singular value decomposition to the inter-word similarity matrix S_|W|×|W|:
S_|W|×|W| ≈ U_k Σ_k U_k^T
where Σ_k is the low-rank diagonal matrix of the k largest singular values and U_k ∈ R^(|W|×k) is the transfer matrix, with k < |W|;
(6) Calculating the matrix Ã_|X|×k = A_|X|×|W| U_k, whose rows are the low-dimensional representations of the texts;
(7) Clustering the low-dimensional text vectors, i.e. the rows of Ã_|X|×k, with the K-Means clustering algorithm;
calculating the distance between texts with the self-defined cosine distance:
d(x_i', x_i*) = 1 − (ã_i' · ã_i*) / (||ã_i'|| ||ã_i*|| + c)
where i' ≠ i* index the i'-th and i*-th texts, ã_i' is the low-dimensional vector of the i'-th text, and c is a constant;
according to the distances between texts, the text data of group g_1 is clustered into l_1 subclasses;
(8) Processing the remaining groups of text data by the method of steps (2)-(7); after the within-group clustering of all G groups of text data, with group g_k clustered into l_k subclasses, there are in total
L = Σ_{k=1}^{G} l_k
subclasses;
(9) From each of the L subclasses, randomly selecting several texts as representatives to form a new text data set, repeating the method of steps (2)-(7) to perform a second round of clustering, and after this round of clustering is finished applying the majority voting principle: if a class is chosen by the majority of a subclass's representatives, all texts of that subclass are considered to belong to that class, thereby realizing the text classification.
The invention aims to realize the following steps:
the invention relates to a text classification method based on semantic migration, which takes the huge data volume into consideration, adopts a election mechanism, firstly groups original texts, preprocesses a text data set to be classified in each group to construct a text-word matrix, then calculates an interword similarity matrix by using a neural network model BERT to obtain distributed representation of words in a task set, then carries out truncation singular value decomposition on the similarity matrix to obtain a transfer matrix, migrates semantic information contained in the transfer matrix to a text vectorization process to obtain low-dimensional representation of a task set text, then uses a K-Means algorithm to cluster each group, finally selects a plurality of representations from all subclasses to carry out second-round clustering, and adopts a majority voting principle to realize final text classification.
Meanwhile, the text classification method based on semantic migration further has the following beneficial effects:
(1) Constructing the inter-word similarity matrix with BERT mines the relations between words well and introduces rich external knowledge for the subsequent low-dimensional representation of the texts;
(2) the method is insensitive to the scale of the task set and still achieves a good clustering effect even when clustering a small amount of data;
(3) compared with the traditional TFIDF and LDA methods, the method greatly improves precision, recall and the F1 value;
(4) for large task sets, the designed election mechanism makes full use of the parallel computing capability of the computer.
Drawings
FIG. 1 is a flow chart of a text classification method based on semantic migration according to the present invention;
FIG. 2 is a text vectorization diagram;
fig. 3 is a schematic diagram of text clustering.
Detailed Description
The following description of the embodiments of the invention, with reference to the accompanying drawings, is provided so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.
Examples
FIG. 1 is a flow chart of a text classification method based on semantic migration according to the present invention.
The invention is divided into two main stages. The purpose of the first stage is to obtain text vectors with accurate semantics: a large corpus is learned to construct a word similarity matrix that expresses semantics, the semantic information contained in this matrix is then migrated into the vectorization of the texts to be classified, and finally a low-dimensional vector representation of each text is produced. In the second stage, the text vectors are classified with a classification method. Experimental comparison shows that, relative to existing typical classification methods, the classification performance of the method on a public text data set is greatly improved.
In the following, we will describe in detail a text classification method based on semantic migration with reference to fig. 1, which specifically includes the following steps:
S1. Select the text data set to be classified
Select the text data set to be classified from the public question-answer data set encyclopedia questions and answers, and divide it into G groups with the election mechanism, denoted g_1, g_2, …, g_G; each group of text data after grouping is expressed as X = {x_1, x_2, …, x_|X|}, where x_i denotes the i-th text to be classified, 1 ≤ i ≤ |X|, and |X| denotes the number of texts in the text data set X; the |X| texts constitute a word bank set W = {w_1, w_2, …, w_|W|}, where w_j denotes the j-th word and |W| denotes the number of words in the word bank set W;
S2. Preprocess the text data
Perform word segmentation on the text data X of group g_1 and remove stop words. In this embodiment, the open-source Python Chinese word-segmentation tool jieba is used; removing stop words mainly removes meaningless connective words and the like from the text, yielding the word bank of the task set.
S3. Construct the text-word matrix A_|X|×|W|, where each row of the matrix represents a text and each column represents a word; each element a_ij records the occurrence of the j-th word in the i-th text: 1 if the word appears, 0 otherwise;
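Steps S2-S3 can be sketched as follows. This is a minimal illustration only: it assumes the texts have already been segmented into tokens (e.g. by jieba), and the token lists and stop-word set below are placeholders, not data from the embodiment.

```python
# Sketch of steps S2-S3: stop-word removal and the binary text-word matrix.
# Assumes each text is already segmented into tokens (e.g. by jieba).
import numpy as np

def build_text_word_matrix(tokenized_texts, stopwords):
    # Remove stop words from every text.
    cleaned = [[t for t in doc if t not in stopwords] for doc in tokenized_texts]
    # Word bank W: all distinct words, in first-seen order.
    vocab, seen = [], set()
    for doc in cleaned:
        for t in doc:
            if t not in seen:
                seen.add(t)
                vocab.append(t)
    index = {w: j for j, w in enumerate(vocab)}
    # A[i, j] = 1 if word j appears in text i, else 0.
    A = np.zeros((len(cleaned), len(vocab)), dtype=np.int8)
    for i, doc in enumerate(cleaned):
        for t in doc:
            A[i, index[t]] = 1
    return A, vocab

texts = [["deep", "learning", "for", "text"], ["text", "clustering", "for", "search"]]
A, vocab = build_text_word_matrix(texts, stopwords={"for"})
```

Each row of the returned matrix A is the binary occurrence vector of one text over the word bank, matching the definition of a_ij above.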
S4. Construct the inter-word similarity matrix S_|W|×|W| using the neural network model BERT:
input each word into BERT to obtain its vector representation, the vector representation of the j-th word w_j being denoted v_j;
calculate the similarity between words with the cosine function:
sim(w_j', w_j*) = (v_j' · v_j*) / (||v_j'|| ||v_j*||)
where j' ≠ j* index the j'-th and j*-th words;
finally, construct the inter-word similarity matrix S_|W|×|W| from all the pairwise inter-word similarities.
In this embodiment, the BERT (Bidirectional Encoder Representations from Transformers) model, a neural network model proposed by Google, is used; it has a strong semantic mining capability.
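Given per-word vectors, the inter-word similarity matrix of step S4 can be sketched as below. Extracting the vectors from BERT itself (e.g. with the open-source transformers library) is outside this sketch; the embedding matrix here is a random stand-in with one row per word in the word bank.

```python
# Sketch of step S4: cosine-similarity matrix over word embeddings.
# The embeddings would come from BERT in the actual method; random placeholder here.
import numpy as np

def cosine_similarity_matrix(embeddings):
    # Normalize each word vector to unit length; then S = V V^T yields
    # all pairwise cosine similarities in a single matrix product.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 8))     # |W| = 6 words, 8-dim embeddings (placeholder)
S = cosine_similarity_matrix(V)
```

By construction S is symmetric with unit diagonal, as required of an inter-word similarity matrix.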
S5. Obtain the transfer matrix
Apply truncated singular value decomposition to the inter-word similarity matrix S_|W|×|W|:
S_|W|×|W| ≈ U_k Σ_k U_k^T
where Σ_k is the low-rank diagonal matrix of the k largest singular values and U_k ∈ R^(|W|×k) is the transfer matrix, with k < |W|;
in this embodiment, a Python open source library scipy is used to complete truncated singular value decomposition to obtain a transfer matrix, and a specific process is shown in fig. 2, so that semantic information contained in the transfer matrix can be migrated to a text vectorization process, and low-dimensional representation of a task set text is obtained.
S6. Compute the matrix Ã_|X|×k = A_|X|×|W| U_k, whose rows are the low-dimensional representations of the texts.
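Steps S5-S6 can be sketched together: truncate the SVD of the similarity matrix and project the text-word matrix onto the transfer matrix. numpy's full SVD is used here for brevity; the embodiment mentions scipy, whose truncated routine (svds) serves the same purpose on large matrices. The sizes and matrices below are illustrative placeholders.

```python
# Sketch of steps S5-S6: truncated SVD of the symmetric similarity matrix S,
# keeping the transfer matrix U_k, then projecting A to get low-dim text vectors.
import numpy as np

def low_dim_texts(A, S, k):
    # Full SVD, then keep the k leading singular directions.
    U, sigma, _ = np.linalg.svd(S)
    U_k = U[:, :k]          # transfer matrix, shape |W| x k, with k < |W|
    return A @ U_k          # |X| x k low-dimensional text representations

rng = np.random.default_rng(1)
M = rng.normal(size=(10, 10))
S = (M + M.T) / 2                               # symmetric placeholder for the similarity matrix
A = (rng.random(size=(4, 10)) > 0.5).astype(float)  # binary text-word matrix, |X|=4, |W|=10
A_tilde = low_dim_texts(A, S, k=3)
```

Each row of A_tilde is one text's k-dimensional vector, carrying the semantics migrated from the similarity matrix.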
S7, K-Means clustering
Matrix pair using K-Means clustering algorithm
Figure BDA0002181107690000057
Clustering the low-dimensional vectors of the represented texts; in order to fully utilize the parallel processing capability of the computer, an election mechanism is adopted for text clustering when the size of a task set is large, as shown in fig. 3.
In this embodiment, K-Means in the Python open source library nltk is used for clustering, where the distance between data points does not use euclidean distance, but uses a self-defined cosine distance for calculation, specifically:
Figure BDA0002181107690000061
wherein i' ≠ i*Respectively represent the ith' text and the ith text*The number of the text is one,
Figure BDA0002181107690000062
for the low-dimensional vector of the i' th text, i.e. as determined in step 7
Figure BDA0002181107690000063
In the ith' row, in order to avoid overflow problems possibly caused by floating point number operation, a smoothing mechanism is adopted for the denominator, and a smaller constant c is added;
according to the distance between texts, g1Group text data is grouped as l1A subclass;
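The self-defined cosine distance of step S7 can be sketched as below. A function of this shape can be supplied as the distance argument of nltk's KMeansClusterer, in line with the embodiment; the value of c is an illustrative choice, not one fixed by the source.

```python
# Sketch of step S7's smoothed cosine distance between low-dimensional text vectors.
import numpy as np

def smoothed_cosine_distance(u, v, c=1e-8):
    # 1 - cos(u, v), with the small constant c added to the denominator so
    # that all-zero vectors cannot trigger a division-by-zero during clustering.
    return 1.0 - float(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v) + c)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
d_orth = smoothed_cosine_distance(a, b)   # orthogonal vectors: distance near 1
d_same = smoothed_cosine_distance(a, a)   # identical vectors: distance near 0
```

The smoothing term only perturbs the denominator by c, so for well-scaled vectors the value stays essentially the ordinary cosine distance while remaining finite in degenerate cases.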
S8. Similarly, process the remaining groups of text data by the method of steps S2-S7; after the within-group clustering of all G groups of text data, with group g_k clustered into l_k subclasses, there are in total
L = Σ_{k=1}^{G} l_k
subclasses.
S9. From each of the L subclasses, randomly select several texts as representatives to form a new text data set, repeat the method of steps S2-S7 to perform a second round of clustering, and after this round of clustering is finished apply the majority voting principle: if a class is chosen by the majority of a subclass's representatives, all texts of that subclass are considered to belong to that class, thereby realizing the text classification.
In this embodiment, the scale of the task set for the second round of clustering is small, and the method can still achieve a good clustering effect on a small amount of data clustering.
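The majority-voting rule of step S9 can be sketched as below: each subclass's representatives have received a second-round class, and the whole subclass inherits the class chosen by most of its representatives. The class labels here are illustrative placeholders.

```python
# Sketch of step S9's majority voting over a subclass's representatives.
from collections import Counter

def vote_subclass_label(representative_labels):
    # The class picked by the most representatives wins the whole subclass.
    return Counter(representative_labels).most_common(1)[0][0]

label = vote_subclass_label(["travel", "food", "travel", "travel", "software"])
```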
Verification
Classification experiments are conducted on the method with a public data set, the classification results are computed, and the performance is evaluated. The experimental data set is the public question-answer data set encyclopedia questions and answers (baike2018qa). Performance is evaluated with precision, recall and the F1-Score.
The precision P is the ratio of correctly classified texts to all texts:
P = N_correct / N_total
The recall R is the ratio of the number of correctly classified texts to the actual number of texts:
R = N_correct / N_actual
The F1 value trades off precision against recall:
F1 = 2PR / (P + R)
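The three evaluation measures can be computed directly from counts; the counts below are illustrative, not the paper's experimental figures.

```python
# Sketch of the evaluation measures: precision P, recall R, and the F1 value.
def precision_recall_f1(n_correct, n_total, n_actual):
    p = n_correct / n_total        # correctly classified / all texts
    r = n_correct / n_actual       # correctly classified / actual texts
    f1 = 2 * p * r / (p + r)       # harmonic trade-off of P and R
    return p, r, f1

p, r, f1 = precision_recall_f1(n_correct=90, n_total=100, n_actual=120)
```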
then, two text classification methods of TFIDF + K-Means and LDA + K-Means are respectively adopted to carry out comparison experiments with the method, and the classification performance is evaluated.
The experimental data set is the public question-answer data set encyclopedia questions and answers (baike2018qa), from which 4000 items are selected for each of the three categories "Entertainment - Vacation Travel", "Life - Food/Cooking" and "Computer/Network - Software", giving 12000 items as the task set. The data set contains the true label of every category; after clustering is completed, the results are compared against the true category labels to obtain the Recall, Precision and F1 values that evaluate the clustering result.
Table 1 is the experimental results of the method of the invention; table 2 shows the results of the LDA + K-Means experiments; table 3 shows the results of the experiments with TFIDF + K-Means.
[Table 1: experimental results of the method of the invention - rendered as an image in the original]
[Table 2: experimental results of LDA + K-Means - rendered as an image in the original]
[Table 3: experimental results of TFIDF + K-Means - rendered as an image in the original]
As can be seen from the experimental results in Tables 1, 2 and 3, compared with traditional text classification methods, the proposed method greatly improves precision, recall and the F1 value. The semantic understanding ability of TFIDF and LDA depends heavily on the scale of the task set: TFIDF computes the term frequency TF and inverse document frequency IDF from the task set, and LDA performs Bayesian estimation from the task set, so the larger the task set, the more accurate the text understanding. The invention instead migrates natural-language semantics through BERT to acquire prior information, and is insensitive to the scale of the task set.
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand the invention, the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art; as long as they fall within the spirit and scope of the invention as defined by the appended claims, all matters utilizing the inventive concept are protected.

Claims (1)

1. A text classification method based on semantic migration is characterized by comprising the following steps:
(1) Dividing the text data set to be classified into G groups using an election mechanism, denoted g_1, g_2, …, g_G; each group of text data after grouping is expressed as X = {x_1, x_2, …, x_|X|}, where x_i denotes the i-th text to be classified, 1 ≤ i ≤ |X|, and |X| denotes the number of texts in the text data set X; the |X| texts constitute a word bank set W = {w_1, w_2, …, w_|W|}, where w_j denotes the j-th word and |W| denotes the number of words in the word bank set W;
(2) Performing word segmentation on the text data X of group g_1 and removing stop words;
(3) Constructing the text-word matrix A_|X|×|W|, where each row of the matrix represents a text and each column represents a word; each element a_ij records the occurrence of the j-th word in the i-th text: 1 if the word appears, 0 otherwise;
(4) Constructing the inter-word similarity matrix S_|W|×|W| using the neural network model BERT:
inputting each word into BERT to obtain its vector representation, the vector representation of the j-th word w_j being denoted v_j;
calculating the similarity between words with the cosine function:
sim(w_j', w_j*) = (v_j' · v_j*) / (||v_j'|| ||v_j*||)
where j' ≠ j* index the j'-th and j*-th words;
finally, constructing the inter-word similarity matrix S_|W|×|W| from all the pairwise inter-word similarities;
(5) Applying truncated singular value decomposition to the inter-word similarity matrix S_|W|×|W|:
S_|W|×|W| ≈ U_k Σ_k U_k^T
where Σ_k is the low-rank diagonal matrix of the k largest singular values and U_k ∈ R^(|W|×k) is the transfer matrix, with k < |W|;
(6) Calculating the matrix Ã_|X|×k = A_|X|×|W| U_k, whose rows are the low-dimensional representations of the texts;
(7) Clustering the low-dimensional text vectors, i.e. the rows of Ã_|X|×k, with the K-Means clustering algorithm;
calculating the distance between texts with the self-defined cosine distance:
d(x_i', x_i*) = 1 − (ã_i' · ã_i*) / (||ã_i'|| ||ã_i*|| + c)
where i' ≠ i* index the i'-th and i*-th texts, ã_i' is the low-dimensional vector of the i'-th text, and c is a constant;
according to the distances between texts, the text data of group g_1 is clustered into l_1 subclasses;
(8) Processing the remaining groups of text data by the method of steps (2)-(7); after the within-group clustering of all G groups of text data, with group g_k clustered into l_k subclasses, there are in total
L = Σ_{k=1}^{G} l_k
subclasses;
(9) From each of the L subclasses, randomly selecting several texts as representatives to form a new text data set, repeating the method of steps (2)-(7) to perform a second round of clustering, and after this round of clustering is finished applying the majority voting principle: if a class is chosen by the majority of a subclass's representatives, all texts of that subclass are considered to belong to that class, thereby realizing the text classification.
CN201910796512.6A 2019-08-27 2019-08-27 Text classification method based on semantic migration Active CN110674293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910796512.6A CN110674293B (en) 2019-08-27 2019-08-27 Text classification method based on semantic migration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910796512.6A CN110674293B (en) 2019-08-27 2019-08-27 Text classification method based on semantic migration

Publications (2)

Publication Number Publication Date
CN110674293A CN110674293A (en) 2020-01-10
CN110674293B true CN110674293B (en) 2022-03-25

Family

ID=69075596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910796512.6A Active CN110674293B (en) 2019-08-27 2019-08-27 Text classification method based on semantic migration

Country Status (1)

Country Link
CN (1) CN110674293B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506836A (en) * 2020-04-16 2020-08-07 广东南方新媒体科技有限公司 Content similarity sorting algorithm
CN111694961A (en) * 2020-06-23 2020-09-22 上海观安信息技术股份有限公司 Keyword semantic classification method and system for sensitive data leakage detection

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947858B (en) * 2017-07-26 2022-10-21 腾讯科技(深圳)有限公司 Data processing method and device
US10726061B2 (en) * 2017-11-17 2020-07-28 International Business Machines Corporation Identifying text for labeling utilizing topic modeling-based text clustering
CN109697221B (en) * 2018-11-22 2021-07-09 东软集团股份有限公司 Track law mining method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN110674293A (en) 2020-01-10

Similar Documents

Publication Publication Date Title
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN109977413B (en) Emotion analysis method based on improved CNN-LDA
US20230016365A1 (en) Method and apparatus for training text classification model
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN110598005B (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN110765260A (en) Information recommendation method based on convolutional neural network and joint attention mechanism
CN111930942B (en) Text classification method, language model training method, device and equipment
Feng et al. Enhanced sentiment labeling and implicit aspect identification by integration of deep convolution neural network and sequential algorithm
CN113254599A (en) Multi-label microblog text classification method based on semi-supervised learning
CN108733647B (en) Word vector generation method based on Gaussian distribution
Chang et al. Research on detection methods based on Doc2vec abnormal comments
Wang et al. A short text classification method based on convolutional neural network and semantic extension
Xing et al. A convolutional neural network for aspect-level sentiment classification
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN110674293B (en) Text classification method based on semantic migration
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN111353032B (en) Community question and answer oriented question classification method and system
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
Lin et al. Text classification feature extraction method based on deep learning for unbalanced data sets
CN115359486A (en) Method and system for determining custom information in document image
Aalaa Abdulwahab et al. Documents classification based on deep learning
Ashwini et al. Impact of Text Representation Techniques on Clustering Models
Rath Word and relation embedding for sentence representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant