CN110674293B - Text classification method based on semantic migration - Google Patents
- Publication number
- CN110674293B CN201910796512.6A
- Authority
- CN
- China
- Prior art keywords
- text
- word
- matrix
- group
- text data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The invention discloses a text classification method based on semantic migration. Considering the huge data volume involved, the method adopts an election mechanism: the original texts are first divided into groups, and the text data set to be classified in each group is preprocessed to construct a text-word matrix; an inter-word similarity matrix is then computed with the neural network model BERT to obtain a distributed representation of the words in the task set; truncated singular value decomposition is applied to the similarity matrix to obtain a transfer matrix, and the semantic information contained in the transfer matrix is migrated into the text vectorization process to obtain a low-dimensional representation of the task-set texts; each group is then clustered with the K-Means algorithm; finally, several representatives are selected from all subclasses for a second round of clustering, and a majority voting principle is adopted to realize the final text classification.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a text classification method based on semantic migration.
Background
Text classification refers to a computer automatically assigning texts to several categories according to the subject matter they express. In the era of information explosion, automatic text classification helps users quickly acquire the information they need from massive amounts of text and greatly improves efficiency compared with manual information processing. Today, as the wave of artificial intelligence sweeps the globe, text classification has been widely applied in fields such as text auditing, advertisement filtering and network public-opinion analysis, and has become a very important research direction in natural language processing.
Methods of text classification can be broadly divided into those based on supervised learning and those based on unsupervised learning. Since supervised text classification requires a large amount of labeled text for model training, and such labeled data is difficult to obtain in practical applications, the present invention is primarily concerned with unsupervised text classification.
Unsupervised text classification can discover potential knowledge and rules from large amounts of text data; it not only extracts knowledge but also organizes the text data itself. It has therefore become an important means of effectively organizing, summarizing and navigating text information, and is attracting the attention of more and more researchers.
To classify texts, text vectorization is performed first, and the resulting text vectors are then classified. Text vectorization is the core of the whole classification process and the main subject of the research in this invention.
Traditional text vectorization methods have the following defects. First, traditional methods represented by the bag-of-words model and TFIDF (Term Frequency-Inverse Document Frequency) construct text vectors entirely from the statistical characteristics of words in the text (such as word frequency and word weight), so they cannot express text semantics well, especially in Chinese polysemous-word scenarios. Second, text vectorization methods represented by LDA (Latent Dirichlet Allocation) and Doc2vec train the model only on the text set to be classified, cannot acquire prior information about natural-language semantics, and are difficult to optimize continuously. The resulting text vectors are of poor quality, and the classification results are correspondingly poor.
Therefore, accurately vectorizing texts so that they carry more precise semantics, and thereby supporting the subsequent classification algorithm, is a very important link in the text classification task. The present invention is based on this observation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a text classification method based on semantic migration, which migrates natural-language semantic information into text vectorization and then realizes text classification.
In order to achieve the above object, the present invention provides a text classification method based on semantic migration, which is characterized by comprising the following steps:
(1) dividing the text data set to be classified into G groups by an election mechanism, recorded as g_1, g_2, …, g_G; each group of text data after grouping is expressed as X = {x_1, x_2, …, x_|X|}, where x_i represents the ith text to be classified, 1 ≤ i ≤ |X|, and |X| represents the number of texts in the text data set X; the |X| texts constitute a lexicon set W = {w_1, w_2, …, w_|W|}, where w_j represents the jth word and |W| represents the number of words in the lexicon set W;
(2) performing word segmentation on the g_1 group of text data X and removing stop words;
(3) constructing a text-word matrix A_{|X|×|W|}, where each row of the matrix represents a text and each column represents a word; each element a_ij indicates whether the jth word appears in the ith text: a_ij = 1 if it appears and a_ij = 0 otherwise;
(4) constructing an inter-word similarity matrix S_{|W|×|W|} using the neural network model BERT;
inputting each word into BERT to obtain a vector representation of each word, the vector representation of the jth word being denoted v_j;
calculating the similarity between words with a cosine function:
s(w_j', w_j*) = (v_j' · v_j*) / (‖v_j'‖ ‖v_j*‖)
where j' ≠ j* index the j'th and j*th words respectively;
finally, constructing the inter-word similarity matrix S_{|W|×|W|} from all the pairwise similarities;
(5) performing truncated singular value decomposition on the inter-word similarity matrix S_{|W|×|W|}: S ≈ U_k Σ_k V_k^T, where Σ_k is the low-rank diagonal matrix of the k largest singular values and U_k is the transfer matrix, with k ≪ |W|;
(6) migrating the semantic information contained in the transfer matrix into the text vectorization process: the low-dimensional representation of the task-set texts is obtained as the product A·U_k of the text-word matrix and the transfer matrix;
(7) clustering the low-dimensional text representations of the task-set texts, obtained by multiplying the text-word matrix by the transfer matrix, with the K-Means clustering algorithm;
calculating the distance between texts with a self-defined cosine distance:
d(x_i', x_i*) = 1 − (y_i' · y_i*) / (‖y_i'‖ ‖y_i*‖ + c)
where i' ≠ i* index the i'th and i*th texts, y_i is the low-dimensional vector of the ith text, and c is a small constant;
according to the distances between texts, the g_1 group of text data is clustered into l_1 subclasses;
(8) processing the remaining groups of text data by the method of steps (2)-(7); after the in-group clustering of the G groups of text data, each group g_k is clustered into l_k subclasses, giving L = l_1 + l_2 + … + l_G subclasses in total;
(9) in each of the L subclasses, randomly selecting several texts as representatives to form a new text data set and repeating the method of steps (2)-(7) for a second round of clustering; after this round of clustering, a majority voting principle is adopted: if a class is chosen by the majority of a subclass's representatives, all texts of that subclass are considered to belong to that class, thereby realizing the final text classification.
The objects of the invention are achieved as follows:
the invention relates to a text classification method based on semantic migration, which takes the huge data volume into consideration, adopts a election mechanism, firstly groups original texts, preprocesses a text data set to be classified in each group to construct a text-word matrix, then calculates an interword similarity matrix by using a neural network model BERT to obtain distributed representation of words in a task set, then carries out truncation singular value decomposition on the similarity matrix to obtain a transfer matrix, migrates semantic information contained in the transfer matrix to a text vectorization process to obtain low-dimensional representation of a task set text, then uses a K-Means algorithm to cluster each group, finally selects a plurality of representations from all subclasses to carry out second-round clustering, and adopts a majority voting principle to realize final text classification.
Meanwhile, the text classification method based on semantic migration has the following beneficial effects:
(1) constructing the inter-word similarity matrix with BERT mines the relations between words well and introduces rich external knowledge for the subsequent low-dimensional representation of texts;
(2) the method is insensitive to the scale of the task set and can still achieve a good clustering effect when clustering small amounts of data;
(3) compared with the traditional TFIDF and LDA methods, the accuracy, recall rate and F1 value are greatly improved;
(4) when the task set is large, the designed election mechanism makes full use of the parallel computing capability of the computer.
Drawings
FIG. 1 is a flow chart of a text classification method based on semantic migration according to the present invention;
FIG. 2 is a text vectorization diagram;
fig. 3 is a schematic diagram of text clustering.
Detailed Description
The following description of the embodiments of the present invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It is to be expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.
Examples
FIG. 1 is a flow chart of a text classification method based on semantic migration according to the present invention.
The invention is mainly divided into two stages. The first stage aims to obtain text vectors with accurate semantics: a large corpus is first learned to construct an inter-word similarity matrix that expresses semantics; the semantic information contained in the learned similarity matrix is then migrated into the vectorization of the texts to be classified; finally, the low-dimensional vector representation of the texts is completed. In the second stage, the text vectors are classified by clustering. Experimental comparison shows that, relative to existing typical classification methods, the classification performance of this method on public text data sets is greatly improved.
In the following, we will describe in detail a text classification method based on semantic migration with reference to fig. 1, which specifically includes the following steps:
S1, selecting a text data set to be classified
A text data set to be classified is selected from the public question-answering data set Baike Questions and Answers (baike2018qa) and divided into G groups by an election mechanism, recorded as g_1, g_2, …, g_G; each group of text data after grouping is expressed as X = {x_1, x_2, …, x_|X|}, where x_i represents the ith text to be classified, 1 ≤ i ≤ |X|, and |X| represents the number of texts in the text data set X; the |X| texts constitute a lexicon set W = {w_1, w_2, …, w_|W|}, where w_j represents the jth word and |W| represents the number of words in the lexicon set W;
S2, preprocessing the text data
The g_1 group of text data X is segmented into words and stop words are removed. In this embodiment, the open-source Python Chinese word-segmentation tool jieba is used; removing stop words mainly removes meaningless connective words and the like from the text, yielding the lexicon of the task set.
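As a minimal sketch of this preprocessing step — with a plain whitespace split standing in for jieba's `lcut` (which would be used for Chinese in practice) and a purely illustrative stop-word list — the segmentation and filtering could look like:

```python
# Sketch of step S2: tokenize each text and drop stop words.
# The whitespace split stands in for jieba.lcut; the stop-word
# list below is a hypothetical example, not the embodiment's list.
STOP_WORDS = {"the", "a", "and", "of"}

def preprocess(texts):
    """Return one token list per text, with stop words removed."""
    return [[tok for tok in text.split() if tok not in STOP_WORDS]
            for text in texts]

texts = ["the cat and the dog", "a dog chased the cat"]
tokens = preprocess(texts)
# tokens == [['cat', 'dog'], ['dog', 'chased', 'cat']]
```

The union of the surviving tokens over all texts then forms the lexicon W of the task set.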
S3, constructing a text-word matrix A_{|X|×|W|}, where each row of the matrix represents a text and each column represents a word; each element a_ij indicates whether the jth word appears in the ith text: a_ij = 1 if it appears and a_ij = 0 otherwise;
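A sketch of this construction, taking the lexicon W as the sorted union of all tokens (an illustrative choice — the patent does not fix an ordering):

```python
# Sketch of step S3: binary text-word matrix A (|X| rows, |W| columns),
# with a_ij = 1 if word j appears in text i, else 0.
def text_word_matrix(token_lists):
    lexicon = sorted(set(w for toks in token_lists for w in toks))
    index = {w: j for j, w in enumerate(lexicon)}
    A = [[0] * len(lexicon) for _ in token_lists]
    for i, toks in enumerate(token_lists):
        for w in toks:
            A[i][index[w]] = 1          # presence, not frequency
    return A, lexicon

A, W = text_word_matrix([["cat", "dog"], ["dog", "fish"]])
# W == ['cat', 'dog', 'fish'];  A == [[1, 1, 0], [0, 1, 1]]
```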
S4, constructing an inter-word similarity matrix S_{|W|×|W|} using the neural network model BERT;
each word is input into BERT to obtain its vector representation, the vector representation of the jth word being denoted v_j;
the similarity between words is calculated with a cosine function:
s(w_j', w_j*) = (v_j' · v_j*) / (‖v_j'‖ ‖v_j*‖)
where j' ≠ j* index the j'th and j*th words respectively;
finally, the inter-word similarity matrix S_{|W|×|W|} is constructed from all the pairwise similarities;
In this embodiment, the BERT (Bidirectional Encoder Representations from Transformers) model, a neural network model proposed by Google, is used; it has a strong semantic mining capability.
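The pairwise cosine-similarity construction can be sketched as follows; the tiny 2-D vectors stand in for the BERT word embeddings, which in the embodiment would come from the pretrained model:

```python
import math

# Sketch of step S4: pairwise cosine-similarity matrix S over word
# vectors. The toy 2-D vectors below are stand-ins for BERT embeddings.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def similarity_matrix(vectors):
    n = len(vectors)
    return [[cosine(vectors[j1], vectors[j2]) for j2 in range(n)]
            for j1 in range(n)]

vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = similarity_matrix(vecs)   # symmetric, with 1.0 on the diagonal
```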
S5, obtaining a transition matrix
For the similarity matrix S between words|W|×|W|Performing truncated singular value decomposition intoWhere Σ is a low rank matrix,k < | W | for the transition matrix;
In this embodiment, truncated singular value decomposition is completed with the open-source Python library scipy to obtain the transfer matrix; the specific process is shown in fig. 2. The semantic information contained in the transfer matrix can thus be migrated into the text vectorization process, yielding the low-dimensional representation of the task-set texts.
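A small-scale sketch of the truncation and the subsequent migration, using numpy's full SVD as a stand-in for the scipy sparse routine named in the embodiment (the choice of U_k as the transfer matrix and the toy matrices are assumptions for illustration):

```python
import numpy as np

# Sketch of steps S5/(6): truncate the SVD of the inter-word similarity
# matrix, then project the text-word matrix through the transfer matrix.
def truncated_svd(S, k):
    U, s, Vt = np.linalg.svd(S)          # s is sorted in descending order
    return U[:, :k], s[:k], Vt[:k, :]

S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
U_k, s_k, Vt_k = truncated_svd(S, 2)
S_approx = U_k @ np.diag(s_k) @ Vt_k     # rank-2 approximation of S

# Migration: low-dimensional text vectors are the product of the
# binary text-word matrix A with the transfer matrix U_k.
A = np.array([[1, 1, 0], [0, 1, 1]], dtype=float)
Y = A @ U_k                              # one k-dimensional row per text
```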
S7, K-Means clustering
The low-dimensional text vectors of the task-set texts, obtained by multiplying the text-word matrix by the transfer matrix, are clustered with the K-Means clustering algorithm. To make full use of the parallel processing capability of the computer, the election mechanism is used for text clustering when the task set is large, as shown in fig. 3.
In this embodiment, the K-Means implementation in the open-source Python library nltk is used for clustering; the distance between data points is not the Euclidean distance but a self-defined cosine distance:
d(x_i', x_i*) = 1 − (y_i' · y_i*) / (‖y_i'‖ ‖y_i*‖ + c)
where i' ≠ i* index the i'th and i*th texts, and y_i' is the low-dimensional vector of the i'th text, i.e. the i'th row of the matrix obtained in step S5; to avoid overflow problems possibly caused by floating-point arithmetic, a smoothing mechanism is applied to the denominator by adding a small constant c.
According to the distances between texts, the g_1 group of text data is clustered into l_1 subclasses.
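The self-defined distance can be sketched directly from the formula above; the default value of the smoothing constant c here is a hypothetical choice, not one stated in the patent:

```python
import math

# Sketch of step S7's distance: cosine distance with a small constant c
# added to the denominator as the smoothing term described in the text.
def smoothed_cosine_distance(u, v, c=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u)) *
            math.sqrt(sum(b * b for b in v)))
    return 1.0 - dot / (norm + c)

d_same = smoothed_cosine_distance([1.0, 2.0], [1.0, 2.0])  # ~0: same direction
d_orth = smoothed_cosine_distance([1.0, 0.0], [0.0, 1.0])  # 1: orthogonal
d_zero = smoothed_cosine_distance([0.0, 0.0], [1.0, 0.0])  # no division by zero
```

Because of the +c term, the all-zero vector yields a finite distance of 1.0 instead of a division-by-zero error, which is the point of the smoothing mechanism.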
S8, similarly, the remaining groups of text data are processed by the method of steps S2-S7; after the in-group clustering of the G groups of text data, each group g_k is clustered into l_k subclasses, giving L = l_1 + l_2 + … + l_G subclasses in total;
S9, in each of the L subclasses, several texts are randomly selected as representatives to form a new text data set, and the method of steps S2-S7 is repeated for a second round of clustering; after this round, a majority voting principle is adopted: if a class is chosen by the majority of a subclass's representatives, all texts of that subclass are considered to belong to that class, thereby realizing the text classification.
In this embodiment, the task set for the second round of clustering is small; the method can still achieve a good clustering effect when clustering small amounts of data.
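The majority-voting step of S9 can be sketched as follows: each first-round subclass sends several representatives into the second-round clustering, and the whole subclass adopts the class chosen by most of its representatives (the labels below are illustrative):

```python
from collections import Counter

# Sketch of step S9's majority vote: given the second-round cluster
# labels of each subclass's representatives, the entire subclass is
# assigned the label that the majority of its representatives received.
def majority_vote(rep_labels_per_subclass):
    """rep_labels_per_subclass: one list of representative labels per subclass."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in rep_labels_per_subclass]

# Three subclasses, three representatives each.
votes = [[0, 0, 1], [2, 2, 2], [1, 0, 1]]
final = majority_vote(votes)   # [0, 2, 1]
```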
Verification
Classification experiments were carried out on the method using a public data set, and the classification results were computed to evaluate performance. The experimental data set is the public question-answering data set Baike Questions and Answers (baike2018qa). Performance is evaluated with the precision, recall and F1-Score metrics.
The precision P is the ratio of correctly classified texts to all classified texts:
P = N_correct / N_all
The recall R is the ratio of the number of correctly classified texts to the actual number of texts of the class:
R = N_correct / N_actual
The F1 value trades off precision against recall:
F1 = 2 × P × R / (P + R)
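A per-class sketch of these metrics, matching the verbal definitions above (the toy labels are illustrative, not data from the experiments):

```python
# Sketch of the evaluation metrics: per-class precision, recall and F1
# computed from true and predicted labels.
def prf1(true, pred, cls):
    tp = sum(1 for t, p in zip(true, pred) if p == cls and t == cls)
    n_pred = sum(1 for p in pred if p == cls)   # texts classified as cls
    n_true = sum(1 for t in true if t == cls)   # texts actually in cls
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

true = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
p, r, f = prf1(true, pred, "b")   # p = 2/3, r = 1.0, f1 = 0.8
```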
then, two text classification methods of TFIDF + K-Means and LDA + K-Means are respectively adopted to carry out comparison experiments with the method, and the classification performance is evaluated.
The experimental data set is the public question-answering data set Baike Questions and Answers (baike2018qa); 4000 items were selected from each of the three categories "entertainment - vacation travel", "life - food/cooking" and "computer/network - software", giving 12000 items as the task set. The data set contains the true label of every category; after clustering, the results are compared with the true category labels to obtain the Recall, Precision and F1 values evaluating the clustering result.
Table 1 gives the experimental results of the method of the invention; Table 2 gives the results of LDA + K-Means; Table 3 gives the results of TFIDF + K-Means.
TABLE 1
TABLE 2
TABLE 3
As can be seen from the experimental results in Tables 1, 2 and 3, compared with traditional text classification methods, the present method greatly improves precision, recall and the F1 value. The semantic understanding ability of TFIDF and LDA depends heavily on the scale of the task set: TFIDF computes the term frequency TF and the inverse document frequency IDF from the task set, and LDA performs Bayesian estimation on the task set, so the larger the task set, the more accurate the text understanding. The present invention migrates natural-language semantics through BERT to acquire prior information and is therefore insensitive to the scale of the task set.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are permissible as long as they remain within the spirit and scope of the invention as defined by the appended claims; everything that utilizes the inventive concept is protected.
Claims (1)
1. A text classification method based on semantic migration is characterized by comprising the following steps:
(1) dividing the text data set to be classified into G groups by an election mechanism, recorded as g_1, g_2, …, g_G; each group of text data after grouping is expressed as X = {x_1, x_2, …, x_|X|}, where x_i represents the ith text to be classified, 1 ≤ i ≤ |X|, and |X| represents the number of texts in the text data set X; the |X| texts constitute a lexicon set W = {w_1, w_2, …, w_|W|}, where w_j represents the jth word and |W| represents the number of words in the lexicon set W;
(2) performing word segmentation on the g_1 group of text data X and removing stop words;
(3) constructing a text-word matrix A_{|X|×|W|}, where each row of the matrix represents a text and each column represents a word; each element a_ij indicates whether the jth word appears in the ith text: a_ij = 1 if it appears and a_ij = 0 otherwise;
(4) constructing an inter-word similarity matrix S_{|W|×|W|} using the neural network model BERT;
inputting each word into BERT to obtain a vector representation of each word, the vector representation of the jth word being denoted v_j;
calculating the similarity between words with a cosine function:
s(w_j', w_j*) = (v_j' · v_j*) / (‖v_j'‖ ‖v_j*‖)
where j' ≠ j* index the j'th and j*th words respectively;
finally, constructing the inter-word similarity matrix S_{|W|×|W|} from all the pairwise similarities;
(5) performing truncated singular value decomposition on the inter-word similarity matrix S_{|W|×|W|}: S ≈ U_k Σ_k V_k^T, where Σ_k is the low-rank diagonal matrix of the k largest singular values and U_k is the transfer matrix, with k ≪ |W|;
(6) migrating the semantic information contained in the transfer matrix into the text vectorization process: the low-dimensional representation of the task-set texts is obtained as the product A·U_k of the text-word matrix and the transfer matrix;
(7) clustering the low-dimensional text representations of the task-set texts, obtained by multiplying the text-word matrix by the transfer matrix, with the K-Means clustering algorithm;
calculating the distance between texts with a self-defined cosine distance:
d(x_i', x_i*) = 1 − (y_i' · y_i*) / (‖y_i'‖ ‖y_i*‖ + c)
where i' ≠ i* index the i'th and i*th texts, y_i is the low-dimensional vector of the ith text, and c is a small constant;
according to the distances between texts, the g_1 group of text data is clustered into l_1 subclasses;
(8) processing the remaining groups of text data by the method of steps (2)-(7); after the in-group clustering of the G groups of text data, each group g_k is clustered into l_k subclasses, giving L = l_1 + l_2 + … + l_G subclasses in total;
(9) in each of the L subclasses, randomly selecting several texts as representatives to form a new text data set and repeating the method of steps (2)-(7) for a second round of clustering; after this round of clustering, a majority voting principle is adopted: if a class is chosen by the majority of a subclass's representatives, all texts of that subclass are considered to belong to that class, thereby realizing the final text classification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910796512.6A CN110674293B (en) | 2019-08-27 | 2019-08-27 | Text classification method based on semantic migration |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910796512.6A CN110674293B (en) | 2019-08-27 | 2019-08-27 | Text classification method based on semantic migration |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110674293A CN110674293A (en) | 2020-01-10 |
CN110674293B true CN110674293B (en) | 2022-03-25 |
Family
ID=69075596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910796512.6A Active CN110674293B (en) | 2019-08-27 | 2019-08-27 | Text classification method based on semantic migration |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110674293B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111506836A (en) * | 2020-04-16 | 2020-08-07 | 广东南方新媒体科技有限公司 | Content similarity sorting algorithm |
CN111694961A (en) * | 2020-06-23 | 2020-09-22 | 上海观安信息技术股份有限公司 | Keyword semantic classification method and system for sensitive data leakage detection |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109947858B (en) * | 2017-07-26 | 2022-10-21 | 腾讯科技(深圳)有限公司 | Data processing method and device |
US10726061B2 (en) * | 2017-11-17 | 2020-07-28 | International Business Machines Corporation | Identifying text for labeling utilizing topic modeling-based text clustering |
CN109697221B (en) * | 2018-11-22 | 2021-07-09 | 东软集团股份有限公司 | Track law mining method and device, storage medium and electronic equipment |
- 2019-08-27 CN CN201910796512.6A patent/CN110674293B/en active Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN109977413B (en) | Emotion analysis method based on improved CNN-LDA | |
US20230016365A1 (en) | Method and apparatus for training text classification model | |
CN111966917B (en) | Event detection and summarization method based on pre-training language model | |
CN110598005B (en) | Public safety event-oriented multi-source heterogeneous data knowledge graph construction method | |
CN110765260A (en) | Information recommendation method based on convolutional neural network and joint attention mechanism | |
CN111930942B (en) | Text classification method, language model training method, device and equipment | |
Feng et al. | Enhanced sentiment labeling and implicit aspect identification by integration of deep convolution neural network and sequential algorithm | |
CN113254599A (en) | Multi-label microblog text classification method based on semi-supervised learning | |
CN108733647B (en) | Word vector generation method based on Gaussian distribution | |
Chang et al. | Research on detection methods based on Doc2vec abnormal comments | |
Wang et al. | A short text classification method based on convolutional neural network and semantic extension | |
Xing et al. | A convolutional neural network for aspect-level sentiment classification | |
CN111581364B (en) | Chinese intelligent question-answer short text similarity calculation method oriented to medical field | |
CN110674293B (en) | Text classification method based on semantic migration | |
CN114048354B (en) | Test question retrieval method, device and medium based on multi-element characterization and metric learning | |
CN114722176A (en) | Intelligent question answering method, device, medium and electronic equipment | |
CN111353032B (en) | Community question and answer oriented question classification method and system | |
CN116108840A (en) | Text fine granularity emotion analysis method, system, medium and computing device | |
CN115906824A (en) | Text fine-grained emotion analysis method, system, medium and computing equipment | |
Lin et al. | Text classification feature extraction method based on deep learning for unbalanced data sets | |
CN115359486A (en) | Method and system for determining custom information in document image | |
Aalaa Abdulwahab et al. | Documents classification based on deep learning | |
Ashwini et al. | Impact of Text Representation Techniques on Clustering Models | |
Rath | Word and relation embedding for sentence representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||