CN114547303A - Text multi-feature classification method and device based on Bert-LSTM - Google Patents

Text multi-feature classification method and device based on Bert-LSTM

Info

Publication number
CN114547303A
CN114547303A (application CN202210165299.0A)
Authority
CN
China
Prior art keywords
text
lstm
bert
feature classification
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210165299.0A
Other languages
Chinese (zh)
Inventor
韩启龙
高艺涵
宋洪涛
张海涛
马志强
李丽洁
王宇华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202210165299.0A
Publication of CN114547303A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F 16/35 Information retrieval of unstructured textual data; Clustering; Classification
    • G06N 3/00 Computing arrangements based on biological models; Neural networks
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a text multi-feature classification method and device based on Bert-LSTM, belonging to the technical field of text classification, wherein the method comprises the following steps: determining a text data set to be classified, and dividing the text data set into a training set and a test set; constructing a text multi-feature classification model based on Bert-LSTM; training the text multi-feature classification model by using the training set to obtain an optimal text multi-feature classification model; and inputting the text data to be classified into the optimal text multi-feature classification model, calculating the score of the text data to be classified, and classifying the text data to be classified into preset corresponding categories according to the score. The method constructs a Bert-LSTM-based text multi-feature classification model from BERT, a bidirectional long short-term memory network and other components; word-level feature information and latent word-sense semantics mined from multiple aspects of the text are used as feature representations and integrated into the text vector, so that the model makes full use of the multi-feature information during training and the text classification performance is improved.

Description

Text multi-feature classification method and device based on Bert-LSTM
Technical Field
The invention relates to the technical field of text classification, in particular to a text multi-feature classification method and device based on Bert-LSTM.
Background
With the development of modern network technologies and the rise of big data, network information has become vast and heterogeneous. Because texts and vocabularies update quickly and diversify, the internet carries a huge amount of information of every kind; text data consumes fewer resources than other data (such as image data), so most information on the network is presented in text form. Text classification is a basic task in the field of natural language processing whose goal is to sort and categorize text information so that valuable information can be found in the mass of data, for example the classification of news topics. Beyond this there is sentiment analysis, which covers two-class and three-class sentiment classification whose processing methods may differ; it is widely applied to film and television reviews, product reviews in online shopping, the service industry and similar fields. There is also public opinion analysis, a similar polarity analysis of language expression used more often by news organizations. Another area is mail filtering: personal information leakage is now serious, and users constantly receive mail that contains large numbers of junk advertisements and malicious harassing messages; text classification can filter and intercept such messages and thereby greatly reduce the spread of spam. Finally, there are related applications in the question-answering field, such as analyzing the topic or answer type of a question sentence.
Traditional text classification was done mainly by hand, which wastes both time and labor; how to reduce the cost of text classification and improve its efficiency has therefore become a research hotspot in the natural language processing (NLP) direction. Text classification methods based on machine learning can, given a classifier model, automatically classify texts according to their content through a related network model, helping people explore text information more effectively, and they have therefore attracted increasing attention from researchers.
Human linguistic expression of emotion is complex, so words cannot simply be extracted as classification features; linguistic knowledge must be combined, and the contextual, semantic and part-of-speech information in the text, together with linguistic features of the related field, must be used to analyze the text before it can be assigned to a category. For feature extraction of a text, a bag-of-words method is generally used to represent the feature information, but because the expression of human emotion semantics in natural language is complex (for example metaphor or irony), such latent information is hard to discover. For more complex language forms, extracting features with a bag-of-words method for vector representation and then classifying the text yields an extremely limited effect because of the missing features, and these problems pose great challenges to text classification.
Therefore, a text multi-feature classification method is urgently needed to solve the following problems: the overall information a text depends on, including the word-sense expression, latent meaning and language polarity of words, must be expressed according to the characteristics of the context; word vectors such as Word2Vec cannot capture the overall information of a text; and complete feature dependence is lacking.
Disclosure of Invention
The invention provides a text multi-feature classification method based on Bert-LSTM, which addresses the problems that most deep learning models neglect contextual semantic information and the handling of knowledge entities, that the word vector representations of traditional language models cannot represent polysemous words, and that existing models cannot fully capture long-distance semantic information.
The embodiment of the invention provides a text multi-feature classification method based on Bert-LSTM, which comprises the following steps:
step S1, acquiring text data information, determining a text data set to be classified, and dividing the text data set into a training set and a test set;
step S2, constructing a text multi-feature classification model based on the Bert-LSTM;
step S3, training the text multi-feature classification model based on the Bert-LSTM by using the training set to obtain an optimal text multi-feature classification model;
step S4, inputting the text data to be classified into the optimal text multi-feature classification model, calculating the score of the text data to be classified, and classifying the text data to be classified into preset corresponding categories according to the score.
Another embodiment of the present invention provides a text multi-feature classification device based on Bert-LSTM, including:
the acquisition module is used for acquiring text data information, determining a text data set to be classified, and dividing the text data set into a training set and a test set;
the construction module is used for constructing a text multi-feature classification model based on the Bert-LSTM;
the training module is used for training the text multi-feature classification model based on the Bert-LSTM by using the training set to obtain an optimal text multi-feature classification model;
and the classification module is used for inputting the text data to be classified into the optimal text multi-feature classification model, calculating the score of the text data to be classified, and classifying the text data to be classified into preset corresponding categories according to the score.
In another embodiment of the present invention, a text multi-feature classification device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the Bert-LSTM-based text multi-feature classification method as described in the above embodiments.
Yet another aspect of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the Bert-LSTM-based text multi-feature classification method according to the foregoing embodiment.
The technical scheme of the invention achieves at least the following beneficial technical effects: in the text classification process, the text can be input into a Bert model for better preprocessing; different text features are captured along the multiple dimensions of Tree-LSTM and Bi-LSTM and merged into the text vectors, so that multiple kinds of feature information of the text are captured and carried by the vectors; accurate classification is then performed, yielding a more accurate classification result.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a text multi-feature classification method based on Bert-LSTM according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an implementation of the text multi-feature classification method based on Bert-LSTM according to the embodiment of the present invention;
FIG. 3 is a diagram of a text multi-feature classification model based on Bert-LSTM constructed in the embodiment of the present invention;
FIG. 4 is a Bi-LSTM extraction context dependent feature model diagram constructed by the embodiment of the invention;
FIG. 5 is a diagram of a multi-feature vector extraction architecture constructed in accordance with an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a text multi-feature classification device based on Bert-LSTM according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The text multi-feature classification method and device based on Bert-LSTM proposed in the embodiments of the present invention will be described below with reference to the accompanying drawings, and first, the text multi-feature classification method based on Bert-LSTM proposed in the embodiments of the present invention will be described with reference to the accompanying drawings.
Fig. 1 is a flowchart of a text multi-feature classification method based on Bert-LSTM according to an embodiment of the present invention.
As shown in FIG. 1, the text multi-feature classification method based on Bert-LSTM comprises the following steps:
in step S1, the text data information is obtained to determine the text data set to be classified, and the text data set is divided into a training set and a test set.
Further, in an embodiment of the present invention, the step S1 specifically includes:
step S101, acquiring text data information, and extracting a text data set to be classified;
and S102, preprocessing the text data set to be classified, and dividing the preprocessed text data set to be classified into a training set and a test set.
For example, text data information is acquired, a text data set to be classified is extracted from the text data information and is preprocessed, and the preprocessed data set is divided into an 80% training set and a 20% testing set.
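A minimal sketch of this preprocessing and split (the file name, column names and cleaning rules are assumptions for illustration; this embodiment does not prescribe them):

import re

import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess(text: str) -> str:
    """Strip URLs and collapse whitespace in a raw text sample."""
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("corpus.csv")            # assumed columns: "text", "label"
df["text"] = df["text"].map(preprocess)

# 80% training set, 20% test set, stratified to preserve class ratios.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)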
In step S2, a Bert-LSTM based text multi-feature classification model is constructed.
Further, in an embodiment of the present invention, the specific construction process of step S2 is as follows:
step S201, the whole-word masking (WWM-EXT) model of BERT is used, an Attention mechanism is introduced, partial features of the text data information are extracted from different dimensions to generate text sentence vectors, and partial context information is fused to obtain a high-dimensional vector matrix;
step S202, reducing the dimension of the high-dimensional vector matrix through a Principal Component Analysis (PCA) technology to obtain a low-dimensional vector matrix;
step S203, constructing a Bi-LSTM network, capturing text context dependence characteristic information, and integrating text vectors to obtain text sentence vectors with context dependence;
step S204, a Tree-LSTM network is built, the low-dimensional vector matrix is used as input to capture potential part-of-speech information of the text, and the potential part-of-speech information of the text is merged into a text vector to obtain a text sentence vector with the potential part-of-speech information of the text;
step S205, performing Concat splicing on the text sentence vector with context dependence and the text sentence vector with text potential part-of-speech information to obtain multi-feature text vector representation;
and S206, inputting the multi-feature text vector representation into an RCNN convolutional neural network to obtain a final text vector representation, thereby completing the construction of the text multi-feature classification model based on the Bert-LSTM.
Specifically, as shown in fig. 2 and 3, the whole-word masking (WWM-EXT) model of BERT is used and an Attention mechanism is introduced to enrich the text features.
The vectors obtained from BERT processing and from the Attention-based partial feature extraction are then Concat-spliced to obtain a high-dimensional vector matrix. However, high-dimensional vectors incur excessive overhead, the solution space is unstable, the generalization ability of the model is poor, and the data are sparse so that data features are difficult to locate accurately; PCA is therefore adopted to reduce the dimensionality of the vector matrix.
Feature extraction is then performed on the sentence vectors of the input layer, which contain both overall and local features of the text. Two LSTMs with opposite time orders are combined; the LSTM mechanism captures the internal sequence information and logical relations of the text well. The first layer of the bidirectional LSTM integrates features internally, and the second layer further fuses context dependence, so every text sentence vector passing through the Bi-LSTM carries context dependence.
The low-dimensional vectors produced by PCA processing are also input into a Tree-LSTM layer for deeper feature extraction; the Tree-LSTM extracts latent part-of-speech information features with rich semantics. Its structure is similar to a standard LSTM: each cell contains an input gate i_t, an output gate o_t, a cell state c_t and a hidden-layer output h_t. It differs from the standard LSTM in that its updates depend on the states of child units, and this multiple-child mechanism lets the Tree-LSTM capture richer and more complete feature information when extracting sentence semantics.
The multi-feature text vectors obtained in the first two steps are input into the RCNN, scored through Softmax, and the texts are classified according to the scores. Concretely, the text vector obtained through the above steps serves as the input of the next-layer RCNN; the RCNN further processes the vector to obtain the final feature vector F, which is input into a Softmax layer for scoring, and the category of the text is determined from the score.
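A minimal sketch of this stage, assuming a convolution-plus-max-pooling head as the RCNN internals (the internal structure is not fixed here, and all sizes are illustrative assumptions):

import torch
import torch.nn as nn

class RCNNHead(nn.Module):
    """Simplified stand-in for the RCNN + Softmax scoring stage."""
    def __init__(self, in_dim: int = 512, hidden: int = 256, num_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim), the spliced multi-feature text vectors
        h = torch.tanh(self.conv(x.transpose(1, 2)))  # (batch, hidden, seq_len)
        f = torch.max(h, dim=2).values                # max-pool over positions: final feature F
        return torch.softmax(self.fc(f), dim=-1)      # class scores for classification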
In step S3, the text multi-feature classification model based on Bert-LSTM is trained by using the training set, so as to obtain an optimal text multi-feature classification model.
Further, in an embodiment of the present invention, the training process of step S3 specifically includes:
step S301, inputting the training set into the text multi-feature classification model based on the Bert-LSTM to obtain final text vector representation;
step S302, inputting the final text vector representation into a preset prediction module to obtain a prediction score;
step S303, continuously updating the input of the text multi-feature classification model based on the Bert-LSTM to obtain a plurality of prediction scores;
and S304, optimizing the text multi-feature classification model based on the Bert-LSTM according to the error between the prediction score and the real score, and training to obtain the optimal text multi-feature classification model.
Specifically, firstly, the data in the training set is input into the Bert-LSTM-based text multi-feature classification model constructed in step S2, and the final vector representation of the text is obtained:
(1) Text data in the training set are input into the Bert-LSTM-based text multi-feature classification model, in which BERT-WWM-EXT and the attention mechanism receive the text input simultaneously and extract certain features from the text along different dimensions. The input text is Q = (W_1, W_2, ..., W_n); after word segmentation Q becomes (E_1, E_2, ..., E_n); the segmented sentence is vectorized by BERT, and by analyzing the semantic structure and the relations between words with the Transformer, the vector representation [T_1, T_2, ..., T_n] is obtained;
(2) The sentence X is input to an Attention mechanism to increase the coverage of the information extracted by the model; two identical Attention modules are connected unidirectionally, so that the model can attend to features with larger weights when capturing information;
(3) The representation obtained after word segmentation is X = (x_1, x_2, ..., x_N)^T, with corresponding embedding matrix A = (a_1, a_2, ..., a_N)^T; the two modules respectively extract features f_1 and f_2 describing different dimensions;
(4) Concat splices the two features into a high-dimensional feature vector matrix F = [f_1, f_2], combining the semantic scene vector s_i and the resulting feature vector f_{i-1}.
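As a sketch of steps (1) and (4), the segmented sentence can be encoded with the publicly released whole-word-masking checkpoint and two features Concat-spliced (the pooling used below for f_1 and f_2 is an assumed stand-in for the two Attention modules):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")

sentence = "文本分类旨在整理归类文本信息。"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    T = bert(**inputs).last_hidden_state   # [T_1, T_2, ..., T_n], one vector per token

f1 = T.mean(dim=1)                # stand-in for the first extracted feature f_1
f2 = T.max(dim=1).values          # stand-in for the second extracted feature f_2
F = torch.cat([f1, f2], dim=-1)   # Concat splice: high-dimensional matrix F = [f_1, f_2]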
A PCA dimensionality reduction is then performed on the obtained high-dimensional vector matrix, with the following specific steps:
(1) Input the n-dimensional dataset D = (x^(1), x^(2), ..., x^(m));
(2) Perform the centering operation:
x^(i) ← x^(i) − (1/m) Σ_{j=1}^{m} x^(j)
(3) Perform eigenvalue decomposition of the matrix XX^T;
(4) Extract the eigenvectors (w_1, w_2, ..., w_{n'}) corresponding to the largest n' eigenvalues and normalize them to form the eigenvector matrix W;
(5) Convert each sample x^(i) in the sample set into the new sample z^(i) = W^T x^(i);
(6) Obtain the output D' = (z^(1), z^(2), ..., z^(m)).
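A from-scratch sketch of steps (1) to (6), with rows as samples (under that convention the decomposed matrix is X^T X rather than the X X^T written above; this transposition convention is an assumption):

import numpy as np

def pca_reduce(D: np.ndarray, n_prime: int) -> np.ndarray:
    """D: (m, n) sample matrix; returns the (m, n') reduced matrix D'."""
    X = D - D.mean(axis=0)                      # (2) centering
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # (3) eigenvalue decomposition
    idx = np.argsort(eigvals)[::-1][:n_prime]   # (4) largest n' eigenvalues
    W = eigvecs[:, idx]                         # normalized eigenvector matrix W
    return X @ W                                # (5)-(6) z^(i) = W^T x^(i)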
As shown in fig. 4, the first layer of the bidirectional LSTM then performs internal feature integration:
(1) H_t is the output text feature vector of the Bi-LSTM; the forward and backward passes produce their respective hidden layers h_t^f and h_t^b, and the Concat of these two different hidden-layer vectors yields h_t = [h_t^f ; h_t^b], i.e. [h_1, h_2, ..., h_m];
(2) The Attention mechanism assigns weights over the text vector. The sentence output by the LSTM serves as the input of Attention; n denotes the position in the input sequence, U and V are weight matrices, F is the sum of the hidden-layer values, and the computation combines the input w_n at a given moment with the previous hidden state h_{n-1}. The attention-weighted hidden state is:
h'_n = h_n^T U F
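A minimal sketch of the Bi-LSTM pass of fig. 4 (dimensions are illustrative assumptions); PyTorch already returns the per-position Concat of the forward and backward hidden layers:

import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=256, hidden_size=128,
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 20, 256)   # one sentence of 20 token vectors from the previous layer
H, _ = bilstm(x)              # (1, 20, 256): each H[:, t] is [h_t^f ; h_t^b]
# H is the context-dependent representation [h_1, h_2, ..., h_m] described above.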
As shown in fig. 5, the latent part-of-speech information features are extracted by the Tree-LSTM, whose calculation formulas are as follows (C(j) denotes the set of children of node j):
h̃_j = Σ_{k∈C(j)} h_k
i_j = σ(W^(i) x_j + U^(i) h̃_j + b^(i))
f_{jk} = σ(W^(f) x_j + U^(f) h_k + b^(f))
o_j = σ(W^(o) x_j + U^(o) h̃_j + b^(o))
u_j = tanh(W^(u) x_j + U^(u) h̃_j + b^(u))
c_j = i_j ⊙ u_j + Σ_{k∈C(j)} f_{jk} ⊙ c_k
h_j = o_j ⊙ tanh(c_j)
Calculated according to the above formulas: h_2 and h_3 are summed to give h̃_1, which together with x_1 yields u_1 through
u_1 = tanh(W^(u) x_1 + U^(u) h̃_1 + b^(u))
Then c_2 ⊙ f_{12} and c_3 ⊙ f_{13} are formed element-wise and summed, i_1 ⊙ u_1 is added to this sum to obtain c_1, and finally, by the formula
h_j = o_j ⊙ tanh(c_j)
the hidden layer h_1 is obtained.
The hidden layer h_j is processed by a linear layer, the result is mapped to the categories, and the loss is calculated:
p̂(y | h_j) = softmax(W^(s) h_j + b^(s))
ŷ_j = arg max_y p̂(y | h_j)
The final loss function obtained from the above equations is:
J(θ) = −(1/m) Σ_{k=1}^{m} log p̂(y^(k) | x^(k)) + (λ/2) ‖θ‖²
The final vector representation of the text is solved according to these formulas.
The final text vector representation is input into the prediction module to obtain the score of the text; the probability is computed as
p_i = exp(F'_i) / Σ_{t=1}^{T} exp(F'_t)
F' = V · F
where T is the total number of classes, F'_i is the i-th component of the vector and V is a weight matrix. The result is input into the Softmax layer for classification scoring, producing a distribution y that sums to 1; together with the true distribution Y this gives the cross-entropy loss
E(Y, y) = −Σ Y log(y)
For the m input samples [x_1, x_2, ..., x_m], each sample has n attributes, i.e. x_i = [a_1, a_2, ..., a_n], and belongs to one of k classes, encoded with a 1 at the i-th position and 0 elsewhere; the probability of each class is p_k = P(y(k) = 1), with Σ p_k = 1.
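The scoring step can be sketched directly from these formulas (numpy only; the shapes are illustrative):

import numpy as np

def score_and_loss(F: np.ndarray, V: np.ndarray, Y: np.ndarray):
    """F: final feature vector; V: (T, len(F)) weight matrix; Y: one-hot true label."""
    F_prime = V @ F                              # F' = V · F
    y = np.exp(F_prime) / np.exp(F_prime).sum()  # Softmax: distribution summing to 1
    loss = -np.sum(Y * np.log(y))                # cross entropy E(Y, y) = -Σ Y log y
    return int(y.argmax()), loss                 # predicted class and loss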
The input of the Bert-LSTM-based text multi-feature classification model obtained in step S2 is then updated continuously to obtain a plurality of prediction scores;
Finally, the model is optimized through the error between the prediction scores and the real scores, and the optimal Bert-LSTM-based text multi-feature classification model is obtained through training.
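A sketch of this optimization loop (the optimizer, learning rate and the `evaluate` helper are assumptions; this embodiment only specifies that the model is optimized from the error between predicted and real scores):

import torch
import torch.nn as nn

def train(model, train_loader, test_loader, evaluate, num_epochs: int = 10, lr: float = 2e-5):
    """Train the classifier and keep the checkpoint with the best test accuracy."""
    criterion = nn.CrossEntropyLoss()           # error between predicted and real scores
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_acc = 0.0
    for _ in range(num_epochs):
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
        acc = evaluate(model, test_loader)      # hypothetical evaluation helper
        if acc > best_acc:                      # retain the optimal model
            best_acc = acc
            torch.save(model.state_dict(), "best_model.pt")
    return best_acc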
In step S4, the text data to be classified is input into the optimal text multi-feature classification model, the score of the text data to be classified is calculated, and the text data to be classified is classified into preset corresponding categories according to the score.
According to the text multi-feature classification method based on Bert-LSTM provided by the embodiment of the invention, word-feature and part-of-speech problems are handled by combining dictionary matching of latent words with the semantic part-of-speech algorithm of the tree-structured long short-term memory network Tree-LSTM, and the bidirectional long short-term memory network Bi-LSTM is introduced to fully acquire the context-dependent feature information of the text. This realizes accurate expression of text features, improves the ability to extract multi-feature keyword information, enriches the text features, improves the performance of the model and improves the classification accuracy.
In order to realize the embodiment, the invention also provides a text multi-feature classification device based on the Bert-LSTM.
FIG. 6 is a schematic structural diagram of a text multi-feature classification apparatus based on Bert-LSTM according to an embodiment of the present invention.
As shown in fig. 6, the Bert-LSTM-based text multi-feature classification apparatus 10 includes: an acquisition module 100, a construction module 200, a training module 300, and a classification module 400.
The obtaining module 100 is configured to obtain text data information, determine a text data set to be classified, and divide the text data set into a training set and a test set.
And the building module 200 is used for building a text multi-feature classification model based on the Bert-LSTM.
And the training module 300 is configured to train the Bert-LSTM-based text multi-feature classification model by using the training set to obtain an optimal text multi-feature classification model.
The classification module 400 is configured to input text data to be classified into the optimal text multi-feature classification model, calculate a score of the text data to be classified, and classify the text data to be classified into preset corresponding categories according to the score.
Further, in an embodiment of the present invention, the obtaining module 100 specifically includes:
the extraction unit is used for acquiring text data information and extracting the text data set to be classified;
and the preprocessing and dividing unit is used for preprocessing the text data set to be classified and dividing the preprocessed text data set to be classified into a training set and a test set.
Further, in an embodiment of the present invention, the building module 200 specifically includes:
the extraction and fusion unit is used for using the whole-word masking (WWM-EXT) model of BERT, introducing an Attention mechanism, extracting partial features of the text data information from different dimensions to generate text sentence vectors, and fusing partial context information to obtain a high-dimensional vector matrix;
the dimensionality reduction unit is used for reducing the dimensionality of the high-dimensional vector matrix through a Principal Component Analysis (PCA) technology to obtain a low-dimensional vector matrix;
the first capturing and fusing unit is used for constructing a Bi-LSTM network, capturing text context dependence characteristic information and fusing text vectors to obtain text sentence vectors with context dependence;
the second capturing and fusing unit is used for constructing a Tree-LSTM network, capturing potential part-of-speech information of the text by taking the low-dimensional vector matrix as input, and fusing the potential part-of-speech information into the text vector to obtain a text sentence vector with the potential part-of-speech information of the text;
the splicing unit is used for performing Concat splicing on the text sentence vector with context dependence and the text sentence vector with text potential part-of-speech information to obtain multi-feature text vector representation;
a construction unit, configured to input the multi-feature text vector representation into an RCNN convolutional neural network to obtain a final text vector representation, thereby completing construction of the Bert-LSTM-based text multi-feature classification model.
Further, in an embodiment of the present invention, the training module 300 specifically includes:
the processing unit is used for inputting the training set into the text multi-feature classification model based on the Bert-LSTM to obtain final text vector representation;
the prediction unit is used for inputting the final text vector representation into a preset prediction module to obtain a prediction score;
the updating unit is used for continuously updating the input of the text multi-feature classification model based on the Bert-LSTM to obtain a plurality of prediction scores;
and the optimization unit is used for optimizing the text multi-feature classification model based on the Bert-LSTM according to the error between the prediction score and the real score, and training to obtain the optimal text multi-feature classification model.
It should be noted that the foregoing explanation of the text multi-feature classification method embodiment based on Bert-LSTM is also applicable to the text multi-feature classification device based on Bert-LSTM in this embodiment, and is not repeated here.
To sum up, the text multi-feature classification device based on Bert-LSTM of the embodiment of the present invention handles word-feature and part-of-speech problems by combining dictionary matching of latent words with the semantic part-of-speech algorithm of the tree-structured long short-term memory network Tree-LSTM, and introduces the bidirectional long short-term memory network Bi-LSTM to fully obtain the context-dependent feature information of the text, thereby realizing precise expression of text features, improving the ability to extract multi-feature keyword information, enriching the text features, improving the performance of the model and improving the classification accuracy.
In order to implement the foregoing embodiments, the present invention further provides a text multi-feature classification device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the text multi-feature classification device implements the text multi-feature classification method based on Bert-LSTM as described in the foregoing embodiments.
In order to implement the foregoing embodiments, the present invention further proposes a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the Bert-LSTM-based text multi-feature classification method according to the foregoing embodiments.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "N" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more N executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of implementing the embodiments of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are exemplary and not to be construed as limiting the present invention, and that changes, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A text multi-feature classification method based on Bert-LSTM is characterized by comprising the following steps:
step S1, acquiring text data information, determining a text data set to be classified, and dividing the text data set into a training set and a test set;
step S2, constructing a text multi-feature classification model based on the Bert-LSTM;
step S3, training the text multi-feature classification model based on the Bert-LSTM by using the training set to obtain an optimal text multi-feature classification model;
step S4, inputting the text data to be classified into the optimal text multi-feature classification model, calculating the score of the text data to be classified, and classifying the text data to be classified into preset corresponding categories according to the score.
2. The Bert-LSTM-based text multi-feature classification method according to claim 1, wherein the step S1 specifically includes:
step S101, acquiring text data information, and extracting a text data set to be classified;
and S102, preprocessing the text data set to be classified, and dividing the preprocessed text data set to be classified into a training set and a test set.
3. The Bert-LSTM-based text multi-feature classification method according to claim 1, wherein the specific construction process of the step S2 is as follows:
step S201, the whole-word masking (WWM-EXT) model of BERT is used, an Attention mechanism is introduced, partial features of the text data information are extracted from different dimensions to generate text sentence vectors, and partial context information is fused to obtain a high-dimensional vector matrix;
step S202, reducing the dimension of the high-dimensional vector matrix through a Principal Component Analysis (PCA) technology to obtain a low-dimensional vector matrix;
step S203, constructing a Bi-LSTM network, capturing text context dependence characteristic information, and integrating text vectors to obtain text sentence vectors with context dependence;
step S204, a Tree-LSTM network is built, the low-dimensional vector matrix is used as input to capture potential part-of-speech information of the text, and the potential part-of-speech information of the text is merged into a text vector to obtain a text sentence vector with the potential part-of-speech information of the text;
step S205, performing Concat splicing on the text sentence vector with context dependence and the text sentence vector with text potential part-of-speech information to obtain multi-feature text vector representation;
and S206, inputting the multi-feature text vector representation into an RCNN convolutional neural network to obtain a final text vector representation, thereby completing the construction of the text multi-feature classification model based on the Bert-LSTM.
4. The Bert-LSTM-based text multi-feature classification method according to claim 1, wherein the training process of step S3 specifically comprises:
step S301, inputting the training set into the text multi-feature classification model based on the Bert-LSTM to obtain final text vector representation;
step S302, inputting the final text vector representation into a preset prediction module to obtain a prediction score;
step S303, continuously updating the input of the text multi-feature classification model based on the Bert-LSTM to obtain a plurality of prediction scores;
and S304, optimizing the text multi-feature classification model based on the Bert-LSTM according to the error between the prediction score and the real score, and training to obtain the optimal text multi-feature classification model.
5. A text multi-feature classification device based on Bert-LSTM is characterized by comprising the following components:
the acquisition module is used for acquiring text data information, determining a text data set to be classified, and dividing the text data set into a training set and a test set;
the construction module is used for constructing a text multi-feature classification model based on the Bert-LSTM;
the training module is used for training the text multi-feature classification model based on the Bert-LSTM by using the training set to obtain an optimal text multi-feature classification model;
and the classification module is used for inputting the text data to be classified into the optimal text multi-feature classification model, calculating the score of the text data to be classified, and classifying the text data to be classified into preset corresponding categories according to the score.
6. The Bert-LSTM-based text multi-feature classification device of claim 5, wherein the obtaining module specifically comprises:
the extraction unit is used for acquiring text data information and extracting the text data set to be classified;
and the preprocessing and dividing unit is used for preprocessing the text data set to be classified and dividing the preprocessed text data set to be classified into a training set and a test set.
7. The Bert-LSTM-based text multi-feature classification device of claim 5, wherein the building module specifically comprises:
the extraction and fusion unit is used for using the whole-word masking (WWM-EXT) model of BERT, introducing an Attention mechanism, extracting partial features of the text data information from different dimensions to generate text sentence vectors, and fusing partial context information to obtain a high-dimensional vector matrix;
the dimensionality reduction unit is used for reducing the dimensionality of the high-dimensional vector matrix through a Principal Component Analysis (PCA) technology to obtain a low-dimensional vector matrix;
the first capturing and fusing unit is used for constructing a Bi-LSTM network, capturing text context dependence characteristic information and fusing text vectors to obtain text sentence vectors with context dependence;
the second capturing and fusing unit is used for constructing a Tree-LSTM network, capturing potential part-of-speech information of the text by taking the low-dimensional vector matrix as input, and fusing the potential part-of-speech information into the text vector to obtain a text sentence vector with the potential part-of-speech information of the text;
the splicing unit is used for performing Concat splicing on the text sentence vector with context dependence and the text sentence vector with text potential part-of-speech information to obtain multi-feature text vector representation;
and the construction unit is used for inputting the multi-feature text vector representation into the RCNN convolutional neural network to obtain a final text vector representation, so that the construction of the text multi-feature classification model based on the Bert-LSTM is completed.
8. The Bert-LSTM-based text multi-feature classification device of claim 5, wherein the training module is specifically configured to:
the processing unit is used for inputting the training set into the text multi-feature classification model based on the Bert-LSTM to obtain final text vector representation;
the prediction unit is used for inputting the final text vector representation into a preset prediction module to obtain a prediction score;
the updating unit is used for continuously updating the input of the text multi-feature classification model based on the Bert-LSTM to obtain a plurality of prediction scores;
and the optimization unit is used for optimizing the text multi-feature classification model based on the Bert-LSTM according to the error between the prediction score and the real score, and training to obtain the optimal text multi-feature classification model.
9. A text multi-feature classification device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the Bert-LSTM-based text multi-feature classification method as claimed in any one of claims 1 to 4 when executing the computer program.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the Bert-LSTM based text multi-feature classification method as recited in any one of claims 1-4.
CN202210165299.0A 2022-02-18 2022-02-18 Text multi-feature classification method and device based on Bert-LSTM Pending CN114547303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210165299.0A CN114547303A (en) 2022-02-18 2022-02-18 Text multi-feature classification method and device based on Bert-LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210165299.0A CN114547303A (en) 2022-02-18 2022-02-18 Text multi-feature classification method and device based on Bert-LSTM

Publications (1)

Publication Number Publication Date
CN114547303A true CN114547303A (en) 2022-05-27

Family

ID=81678220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210165299.0A Pending CN114547303A (en) 2022-02-18 2022-02-18 Text multi-feature classification method and device based on Bert-LSTM

Country Status (1)

Country Link
CN (1) CN114547303A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115658886A (en) * 2022-09-20 2023-01-31 广东技术师范大学 Intelligent liver cancer staging method, system and medium based on semantic text
CN115730237A (en) * 2022-11-28 2023-03-03 智慧眼科技股份有限公司 Junk mail detection method and device, computer equipment and storage medium
CN115730237B (en) * 2022-11-28 2024-04-23 智慧眼科技股份有限公司 Junk mail detection method, device, computer equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination