CN110674293B - Text classification method based on semantic migration - Google Patents

Text classification method based on semantic migration

Info

Publication number
CN110674293B
CN110674293B
Authority
CN
China
Prior art keywords
text
word
matrix
group
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910796512.6A
Other languages
Chinese (zh)
Other versions
CN110674293A (en)
Inventor
王雄
任朝俊
吴环宇
任婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910796512.6A priority Critical patent/CN110674293B/en
Publication of CN110674293A publication Critical patent/CN110674293A/en
Application granted granted Critical
Publication of CN110674293B publication Critical patent/CN110674293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention discloses a text classification method based on semantic migration. In view of the huge data volume, an election mechanism is adopted: the original texts are first grouped, and the text data set to be classified in each group is preprocessed to construct a text-word matrix. An inter-word similarity matrix is then calculated with the neural network model BERT to obtain distributed representations of the words in the task set. Truncated singular value decomposition is applied to the similarity matrix to obtain a transfer matrix, and the semantic information contained in the transfer matrix is migrated into the text vectorization process to obtain low-dimensional representations of the task-set texts. Each group is then clustered with the K-Means algorithm, several representatives are selected from all subclasses for a second round of clustering, and a majority voting principle yields the final text classification.

Description

Text classification method based on semantic migration
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a text classification method based on semantic migration.
Background
Text classification refers to a computer automatically assigning texts to several categories according to the subject matter they express. In the era of information explosion, automatic text classification helps users quickly obtain the information they need from massive amounts of text and greatly improves efficiency compared with manual information processing. Today, as the wave of artificial intelligence sweeps the globe, text classification is widely applied in fields such as text auditing, advertisement filtering and network public-opinion analysis, and has become a very important research direction in natural language processing.
Text classification methods can be broadly divided into those based on supervised learning and those based on unsupervised learning. Supervised methods require a large amount of text with classification labels for model training, and such labeled data is difficult to obtain in practical applications. The present invention is therefore primarily concerned with unsupervised text classification.
Unsupervised text classification can discover latent knowledge and patterns in large amounts of text data: it not only extracts knowledge but also organizes the text data itself. It has therefore become an important means of effectively organizing, summarizing and navigating text information, and is attracting more and more researchers.
To classify text, the text is first vectorized, and the resulting text vectors are then classified. Text vectorization is the core of the whole classification process and the main subject of the present invention.
Traditional text vectorization methods have the following defects. First, methods represented by the bag-of-words model and TFIDF (Term Frequency-Inverse Document Frequency) construct the text vector entirely from the statistical characteristics of words in the text (such as word frequency and word weight), so they cannot express text semantics well, particularly for Chinese polysemous words. Second, methods represented by LDA (Latent Dirichlet Allocation) and Doc2vec train the model only on the text set to be classified, cannot acquire prior information about natural-language semantics, and are difficult to optimize further. The resulting text vectors are of poor quality, and the classification results suffer accordingly.
Therefore, vectorizing text accurately so that it carries precise semantics, and thereby supporting the subsequent classification algorithm, is a crucial link in the text classification task. The present invention is based on this observation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a text classification method based on semantic migration, which migrates natural-language semantic information into text vectorization and on that basis realizes text classification.
In order to achieve the above object, the present invention provides a text classification method based on semantic migration, which is characterized by comprising the following steps:
(1) Dividing the text data set to be classified into G groups using an election mechanism, denoted g_1, g_2, …, g_G; each group of text data after grouping is expressed as X = {x_1, x_2, …, x_|X|}, where x_i denotes the i-th text to be classified, 1 ≤ i ≤ |X|, and |X| denotes the number of texts in the text data set X; the |X| texts constitute a word bank set W = {w_1, w_2, …, w_|W|}, where w_j denotes the j-th word and |W| denotes the number of words in the word bank set W;
(2) Performing word segmentation on the text data X of group g_1 and removing stop words;
(3) Constructing the text-word matrix A_|X|×|W|, where each row of the matrix represents a text and each column represents a word; each element a_ij records the occurrence of the j-th word in the i-th text: 1 if the word appears, 0 otherwise;
(4) Constructing the inter-word similarity matrix S_|W|×|W| using the neural network model BERT:
inputting each word into BERT to obtain its vector representation, the vector representation of the j-th word w_j being denoted v_j;
calculating the similarity between words with the cosine function:
sim(w_j', w_j*) = (v_j' · v_j*) / (||v_j'|| ||v_j*||)
where j' ≠ j* index the j'-th and j*-th words;
finally, constructing the inter-word similarity matrix S_|W|×|W| from all the pairwise inter-word similarities;
(5) Applying truncated singular value decomposition to the inter-word similarity matrix S_|W|×|W|:
S_|W|×|W| ≈ U_k Σ_k U_k^T
where Σ_k is the low-rank diagonal matrix of the k largest singular values and U_k ∈ R^(|W|×k) is the transfer matrix, with k < |W|;
(6) Calculating the matrix Ã_|X|×k = A_|X|×|W| U_k, whose rows are the low-dimensional representations of the texts;
(7) Clustering the low-dimensional text vectors, i.e. the rows of Ã_|X|×k, with the K-Means clustering algorithm;
calculating the distance between texts with the self-defined cosine distance:
d(x_i', x_i*) = 1 − (ã_i' · ã_i*) / (||ã_i'|| ||ã_i*|| + c)
where i' ≠ i* index the i'-th and i*-th texts, ã_i' is the low-dimensional vector of the i'-th text, and c is a constant;
according to the distances between texts, the text data of group g_1 is clustered into l_1 subclasses;
(8) Processing the remaining groups of text data by the method of steps (2)-(7); after the within-group clustering of all G groups of text data, with group g_k clustered into l_k subclasses, there are in total
L = Σ_{k=1}^{G} l_k
subclasses;
(9) From each of the L subclasses, randomly selecting several texts as representatives to form a new text data set, repeating the method of steps (2)-(7) to perform a second round of clustering, and after this round of clustering is finished applying the majority voting principle: if a class is chosen by the majority of a subclass's representatives, all texts of that subclass are considered to belong to that class, thereby realizing the text classification.
The invention aims to realize the following steps:
the invention relates to a text classification method based on semantic migration, which takes the huge data volume into consideration, adopts a election mechanism, firstly groups original texts, preprocesses a text data set to be classified in each group to construct a text-word matrix, then calculates an interword similarity matrix by using a neural network model BERT to obtain distributed representation of words in a task set, then carries out truncation singular value decomposition on the similarity matrix to obtain a transfer matrix, migrates semantic information contained in the transfer matrix to a text vectorization process to obtain low-dimensional representation of a task set text, then uses a K-Means algorithm to cluster each group, finally selects a plurality of representations from all subclasses to carry out second-round clustering, and adopts a majority voting principle to realize final text classification.
Meanwhile, the text classification method based on semantic migration further has the following beneficial effects:
(1) Constructing the inter-word similarity matrix with BERT mines the relations between words well and introduces rich external knowledge for the subsequent low-dimensional representation of the texts;
(2) the method is insensitive to the scale of the task set and still achieves a good clustering effect even when clustering a small amount of data;
(3) compared with the traditional TFIDF and LDA methods, the method greatly improves precision, recall and the F1 value;
(4) for large task sets, the designed election mechanism makes full use of the parallel computing capability of the computer.
Drawings
FIG. 1 is a flow chart of a text classification method based on semantic migration according to the present invention;
FIG. 2 is a text vectorization diagram;
fig. 3 is a schematic diagram of text clustering.
Detailed Description
The following description of the embodiments of the invention, with reference to the accompanying drawings, is provided so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.
Examples
FIG. 1 is a flow chart of a text classification method based on semantic migration according to the present invention.
The invention is divided into two main stages. The purpose of the first stage is to obtain text vectors with accurate semantics: a large corpus is learned to construct a word similarity matrix that expresses semantics, the semantic information contained in this matrix is then migrated into the vectorization of the texts to be classified, and finally a low-dimensional vector representation of each text is produced. In the second stage, the text vectors are classified with a classification method. Experimental comparison shows that, relative to existing typical classification methods, the classification performance of the method on a public text data set is greatly improved.
In the following, we will describe in detail a text classification method based on semantic migration with reference to fig. 1, which specifically includes the following steps:
S1. Select the text data set to be classified
Select the text data set to be classified from the public question-answer data set encyclopedia questions and answers, and divide it into G groups with the election mechanism, denoted g_1, g_2, …, g_G; each group of text data after grouping is expressed as X = {x_1, x_2, …, x_|X|}, where x_i denotes the i-th text to be classified, 1 ≤ i ≤ |X|, and |X| denotes the number of texts in the text data set X; the |X| texts constitute a word bank set W = {w_1, w_2, …, w_|W|}, where w_j denotes the j-th word and |W| denotes the number of words in the word bank set W;
S2. Preprocess the text data
Perform word segmentation on the text data X of group g_1 and remove stop words. In this embodiment, the open-source Python Chinese word-segmentation tool jieba is used; removing stop words mainly removes meaningless connective words and the like from the text, yielding the word bank of the task set.
S3. Construct the text-word matrix A_|X|×|W|, where each row of the matrix represents a text and each column represents a word; each element a_ij records the occurrence of the j-th word in the i-th text: 1 if the word appears, 0 otherwise;
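Steps S2-S3 can be sketched as follows. This is a minimal illustration only: it assumes the texts have already been segmented into tokens (e.g. by jieba), and the token lists and stop-word set below are placeholders, not data from the embodiment.

```python
# Sketch of steps S2-S3: stop-word removal and the binary text-word matrix.
# Assumes each text is already segmented into tokens (e.g. by jieba).
import numpy as np

def build_text_word_matrix(tokenized_texts, stopwords):
    # Remove stop words from every text.
    cleaned = [[t for t in doc if t not in stopwords] for doc in tokenized_texts]
    # Word bank W: all distinct words, in first-seen order.
    vocab, seen = [], set()
    for doc in cleaned:
        for t in doc:
            if t not in seen:
                seen.add(t)
                vocab.append(t)
    index = {w: j for j, w in enumerate(vocab)}
    # A[i, j] = 1 if word j appears in text i, else 0.
    A = np.zeros((len(cleaned), len(vocab)), dtype=np.int8)
    for i, doc in enumerate(cleaned):
        for t in doc:
            A[i, index[t]] = 1
    return A, vocab

texts = [["deep", "learning", "for", "text"], ["text", "clustering", "for", "search"]]
A, vocab = build_text_word_matrix(texts, stopwords={"for"})
```

Each row of the returned matrix A is the binary occurrence vector of one text over the word bank, matching the definition of a_ij above.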
S4. Construct the inter-word similarity matrix S_|W|×|W| using the neural network model BERT:
input each word into BERT to obtain its vector representation, the vector representation of the j-th word w_j being denoted v_j;
calculate the similarity between words with the cosine function:
sim(w_j', w_j*) = (v_j' · v_j*) / (||v_j'|| ||v_j*||)
where j' ≠ j* index the j'-th and j*-th words;
finally, construct the inter-word similarity matrix S_|W|×|W| from all the pairwise inter-word similarities.
In this embodiment, the BERT (Bidirectional Encoder Representations from Transformers) model, a neural network model proposed by Google, is used; it has a strong semantic mining capability.
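Given per-word vectors, the inter-word similarity matrix of step S4 can be sketched as below. Extracting the vectors from BERT itself (e.g. with the open-source transformers library) is outside this sketch; the embedding matrix here is a random stand-in with one row per word in the word bank.

```python
# Sketch of step S4: cosine-similarity matrix over word embeddings.
# The embeddings would come from BERT in the actual method; random placeholder here.
import numpy as np

def cosine_similarity_matrix(embeddings):
    # Normalize each word vector to unit length; then S = V V^T yields
    # all pairwise cosine similarities in a single matrix product.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 8))     # |W| = 6 words, 8-dim embeddings (placeholder)
S = cosine_similarity_matrix(V)
```

By construction S is symmetric with unit diagonal, as required of an inter-word similarity matrix.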
S5. Obtain the transfer matrix
Apply truncated singular value decomposition to the inter-word similarity matrix S_|W|×|W|:
S_|W|×|W| ≈ U_k Σ_k U_k^T
where Σ_k is the low-rank diagonal matrix of the k largest singular values and U_k ∈ R^(|W|×k) is the transfer matrix, with k < |W|;
in this embodiment, a Python open source library scipy is used to complete truncated singular value decomposition to obtain a transfer matrix, and a specific process is shown in fig. 2, so that semantic information contained in the transfer matrix can be migrated to a text vectorization process, and low-dimensional representation of a task set text is obtained.
S6. Compute the matrix Ã_|X|×k = A_|X|×|W| U_k, whose rows are the low-dimensional representations of the texts.
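Steps S5-S6 can be sketched together: truncate the SVD of the similarity matrix and project the text-word matrix onto the transfer matrix. numpy's full SVD is used here for brevity; the embodiment mentions scipy, whose truncated routine (svds) serves the same purpose on large matrices. The sizes and matrices below are illustrative placeholders.

```python
# Sketch of steps S5-S6: truncated SVD of the symmetric similarity matrix S,
# keeping the transfer matrix U_k, then projecting A to get low-dim text vectors.
import numpy as np

def low_dim_texts(A, S, k):
    # Full SVD, then keep the k leading singular directions.
    U, sigma, _ = np.linalg.svd(S)
    U_k = U[:, :k]          # transfer matrix, shape |W| x k, with k < |W|
    return A @ U_k          # |X| x k low-dimensional text representations

rng = np.random.default_rng(1)
M = rng.normal(size=(10, 10))
S = (M + M.T) / 2                               # symmetric placeholder for the similarity matrix
A = (rng.random(size=(4, 10)) > 0.5).astype(float)  # binary text-word matrix, |X|=4, |W|=10
A_tilde = low_dim_texts(A, S, k=3)
```

Each row of A_tilde is one text's k-dimensional vector, carrying the semantics migrated from the similarity matrix.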
S7, K-Means clustering
Matrix pair using K-Means clustering algorithm
Figure BDA0002181107690000057
Clustering the low-dimensional vectors of the represented texts; in order to fully utilize the parallel processing capability of the computer, an election mechanism is adopted for text clustering when the size of a task set is large, as shown in fig. 3.
In this embodiment, K-Means in the Python open source library nltk is used for clustering, where the distance between data points does not use euclidean distance, but uses a self-defined cosine distance for calculation, specifically:
Figure BDA0002181107690000061
wherein i' ≠ i*Respectively represent the ith' text and the ith text*The number of the text is one,
Figure BDA0002181107690000062
for the low-dimensional vector of the i' th text, i.e. as determined in step 7
Figure BDA0002181107690000063
In the ith' row, in order to avoid overflow problems possibly caused by floating point number operation, a smoothing mechanism is adopted for the denominator, and a smaller constant c is added;
according to the distance between texts, g1Group text data is grouped as l1A subclass;
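The self-defined cosine distance of step S7 can be sketched as below. A function of this shape can be supplied as the distance argument of nltk's KMeansClusterer, in line with the embodiment; the value of c is an illustrative choice, not one fixed by the source.

```python
# Sketch of step S7's smoothed cosine distance between low-dimensional text vectors.
import numpy as np

def smoothed_cosine_distance(u, v, c=1e-8):
    # 1 - cos(u, v), with the small constant c added to the denominator so
    # that all-zero vectors cannot trigger a division-by-zero during clustering.
    return 1.0 - float(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v) + c)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
d_orth = smoothed_cosine_distance(a, b)   # orthogonal vectors: distance near 1
d_same = smoothed_cosine_distance(a, a)   # identical vectors: distance near 0
```

The smoothing term only perturbs the denominator by c, so for well-scaled vectors the value stays essentially the ordinary cosine distance while remaining finite in degenerate cases.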
S8. Similarly, process the remaining groups of text data by the method of steps S2-S7; after the within-group clustering of all G groups of text data, with group g_k clustered into l_k subclasses, there are in total
L = Σ_{k=1}^{G} l_k
subclasses.
S9. From each of the L subclasses, randomly select several texts as representatives to form a new text data set, repeat the method of steps S2-S7 to perform a second round of clustering, and after this round of clustering is finished apply the majority voting principle: if a class is chosen by the majority of a subclass's representatives, all texts of that subclass are considered to belong to that class, thereby realizing the text classification.
In this embodiment, the scale of the task set for the second round of clustering is small, and the method can still achieve a good clustering effect on a small amount of data clustering.
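The majority-voting rule of step S9 can be sketched as below: each subclass's representatives have received a second-round class, and the whole subclass inherits the class chosen by most of its representatives. The class labels here are illustrative placeholders.

```python
# Sketch of step S9's majority voting over a subclass's representatives.
from collections import Counter

def vote_subclass_label(representative_labels):
    # The class picked by the most representatives wins the whole subclass.
    return Counter(representative_labels).most_common(1)[0][0]

label = vote_subclass_label(["travel", "food", "travel", "travel", "software"])
```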
Verification
Classification experiments are conducted on the method with a public data set, the classification results are computed, and the performance is evaluated. The experimental data set is the public question-answer data set encyclopedia questions and answers (baike2018qa). Performance is evaluated with precision, recall and the F1-Score.
The precision P is the ratio of correctly classified texts to all texts:
P = N_correct / N_total
The recall R is the ratio of the number of correctly classified texts to the actual number of texts:
R = N_correct / N_actual
The F1 value trades off precision against recall:
F1 = 2PR / (P + R)
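The three evaluation measures can be computed directly from counts; the counts below are illustrative, not the paper's experimental figures.

```python
# Sketch of the evaluation measures: precision P, recall R, and the F1 value.
def precision_recall_f1(n_correct, n_total, n_actual):
    p = n_correct / n_total        # correctly classified / all texts
    r = n_correct / n_actual       # correctly classified / actual texts
    f1 = 2 * p * r / (p + r)       # harmonic trade-off of P and R
    return p, r, f1

p, r, f1 = precision_recall_f1(n_correct=90, n_total=100, n_actual=120)
```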
then, two text classification methods of TFIDF + K-Means and LDA + K-Means are respectively adopted to carry out comparison experiments with the method, and the classification performance is evaluated.
The experimental data set is the public question-answer data set encyclopedia questions and answers (baike2018qa), from which 4000 items are selected for each of the three categories "Entertainment - Vacation Travel", "Life - Food/Cooking" and "Computer/Network - Software", giving 12000 items as the task set. The data set contains the true label of every category; after clustering is completed, the results are compared against the true category labels to obtain the Recall, Precision and F1 values that evaluate the clustering result.
Table 1 is the experimental results of the method of the invention; table 2 shows the results of the LDA + K-Means experiments; table 3 shows the results of the experiments with TFIDF + K-Means.
[Table 1: experimental results of the method of the invention - rendered as an image in the original]
[Table 2: experimental results of LDA + K-Means - rendered as an image in the original]
[Table 3: experimental results of TFIDF + K-Means - rendered as an image in the original]
As can be seen from the experimental results in Tables 1, 2 and 3, compared with traditional text classification methods, the proposed method greatly improves precision, recall and the F1 value. The semantic understanding ability of TFIDF and LDA depends heavily on the scale of the task set: TFIDF computes the term frequency TF and inverse document frequency IDF from the task set, and LDA performs Bayesian estimation from the task set, so the larger the task set, the more accurate the text understanding. The invention instead migrates natural-language semantics through BERT to acquire prior information, and is insensitive to the scale of the task set.
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand the invention, the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art; as long as they fall within the spirit and scope of the invention as defined by the appended claims, all matters utilizing the inventive concept are protected.

Claims (1)

1. A text classification method based on semantic migration is characterized by comprising the following steps:
(1) Dividing the text data set to be classified into G groups using an election mechanism, denoted g_1, g_2, …, g_G; each group of text data after grouping is expressed as X = {x_1, x_2, …, x_|X|}, where x_i denotes the i-th text to be classified, 1 ≤ i ≤ |X|, and |X| denotes the number of texts in the text data set X; the |X| texts constitute a word bank set W = {w_1, w_2, …, w_|W|}, where w_j denotes the j-th word and |W| denotes the number of words in the word bank set W;
(2) Performing word segmentation on the text data X of group g_1 and removing stop words;
(3) Constructing the text-word matrix A_|X|×|W|, where each row of the matrix represents a text and each column represents a word; each element a_ij records the occurrence of the j-th word in the i-th text: 1 if the word appears, 0 otherwise;
(4) Constructing the inter-word similarity matrix S_|W|×|W| using the neural network model BERT:
inputting each word into BERT to obtain its vector representation, the vector representation of the j-th word w_j being denoted v_j;
calculating the similarity between words with the cosine function:
sim(w_j', w_j*) = (v_j' · v_j*) / (||v_j'|| ||v_j*||)
where j' ≠ j* index the j'-th and j*-th words;
finally, constructing the inter-word similarity matrix S_|W|×|W| from all the pairwise inter-word similarities;
(5) Applying truncated singular value decomposition to the inter-word similarity matrix S_|W|×|W|:
S_|W|×|W| ≈ U_k Σ_k U_k^T
where Σ_k is the low-rank diagonal matrix of the k largest singular values and U_k ∈ R^(|W|×k) is the transfer matrix, with k < |W|;
(6) Calculating the matrix Ã_|X|×k = A_|X|×|W| U_k, whose rows are the low-dimensional representations of the texts;
(7) Clustering the low-dimensional text vectors, i.e. the rows of Ã_|X|×k, with the K-Means clustering algorithm;
calculating the distance between texts with the self-defined cosine distance:
d(x_i', x_i*) = 1 − (ã_i' · ã_i*) / (||ã_i'|| ||ã_i*|| + c)
where i' ≠ i* index the i'-th and i*-th texts, ã_i' is the low-dimensional vector of the i'-th text, and c is a constant;
according to the distances between texts, the text data of group g_1 is clustered into l_1 subclasses;
(8) Processing the remaining groups of text data by the method of steps (2)-(7); after the within-group clustering of all G groups of text data, with group g_k clustered into l_k subclasses, there are in total
L = Σ_{k=1}^{G} l_k
subclasses;
(9) From each of the L subclasses, randomly selecting several texts as representatives to form a new text data set, repeating the method of steps (2)-(7) to perform a second round of clustering, and after this round of clustering is finished applying the majority voting principle: if a class is chosen by the majority of a subclass's representatives, all texts of that subclass are considered to belong to that class, thereby realizing the text classification.
CN201910796512.6A 2019-08-27 2019-08-27 Text classification method based on semantic migration Active CN110674293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910796512.6A CN110674293B (en) 2019-08-27 2019-08-27 Text classification method based on semantic migration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910796512.6A CN110674293B (en) 2019-08-27 2019-08-27 Text classification method based on semantic migration

Publications (2)

Publication Number Publication Date
CN110674293A CN110674293A (en) 2020-01-10
CN110674293B true CN110674293B (en) 2022-03-25

Family

ID=69075596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910796512.6A Active CN110674293B (en) 2019-08-27 2019-08-27 Text classification method based on semantic migration

Country Status (1)

Country Link
CN (1) CN110674293B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506836A (en) * 2020-04-16 2020-08-07 广东南方新媒体科技有限公司 Content similarity sorting algorithm
CN111694961A (en) * 2020-06-23 2020-09-22 上海观安信息技术股份有限公司 Keyword semantic classification method and system for sensitive data leakage detection

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947858B (en) * 2017-07-26 2022-10-21 腾讯科技(深圳)有限公司 Data processing method and device
US10726061B2 (en) * 2017-11-17 2020-07-28 International Business Machines Corporation Identifying text for labeling utilizing topic modeling-based text clustering
CN109697221B (en) * 2018-11-22 2021-07-09 东软集团股份有限公司 Track law mining method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN110674293A (en) 2020-01-10

Similar Documents

Publication Publication Date Title
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN109977413B (en) Emotion analysis method based on improved CNN-LDA
US20230016365A1 (en) Method and apparatus for training text classification model
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN110598005B (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN110765260A (en) Information recommendation method based on convolutional neural network and joint attention mechanism
CN111930942B (en) Text classification method, language model training method, device and equipment
Feng et al. Enhanced sentiment labeling and implicit aspect identification by integration of deep convolution neural network and sequential algorithm
CN113254599A (en) Multi-label microblog text classification method based on semi-supervised learning
CN108733647B (en) Word vector generation method based on Gaussian distribution
Chang et al. Research on detection methods based on Doc2vec abnormal comments
Wang et al. A short text classification method based on convolutional neural network and semantic extension
Xing et al. A convolutional neural network for aspect-level sentiment classification
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN110674293B (en) Text classification method based on semantic migration
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN111353032B (en) Community question and answer oriented question classification method and system
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
Lin et al. Text classification feature extraction method based on deep learning for unbalanced data sets
CN115359486A (en) Method and system for determining custom information in document image
Aalaa Abdulwahab et al. Documents classification based on deep learning
Ashwini et al. Impact of Text Representation Techniques on Clustering Models
Rath Word and relation embedding for sentence representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant