CN115640399A - Text classification method, device, equipment and storage medium

Info

Publication number: CN115640399A
Authority: CN (China)
Prior art keywords: text, features, training, graph, classified
Legal status: Pending
Application number: CN202211358384.5A
Other languages: Chinese (zh)
Inventors: 纪鑫, 武同心, 王宏刚, 陈屹婷, 李君婷, 何禹德, 董林啸, 张乐, 安然, 赵加奎
Current Assignee: Big Data Center Of State Grid Corp Of China
Original Assignee: Big Data Center Of State Grid Corp Of China
Application filed by Big Data Center Of State Grid Corp Of China
Priority to CN202211358384.5A
Publication of CN115640399A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method, apparatus, device and storage medium. The method comprises the following steps: acquiring a text set to be classified, and inputting the text set to be classified into a preset language model to obtain an embedded vector; inputting the embedded vector and the text set to be classified into a preset text graph network to obtain the text graph features of each text to be classified; processing the text graph features and the embedded vector with a preset bidirectional long short-term memory network to obtain text features; and determining the text classification of each text to be classified according to the text features and the text graph features. With this technical scheme, the preset text graph network model is used to obtain the graph structure information contained in power texts, deeper semantic structure information in the power texts is mined, and the power text classification effect is improved.

Description

Text classification method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for text classification.
Background
Text classification is a technology for automatically classifying and labeling texts according to certain classification rules. It is also a basic task of natural language processing and is used in many applications such as sentiment analysis, data mining, and news filtering. Current text classification achieves automatic classification by having a computer train on and learn from text data, which removes the huge workload of manual classification and improves working efficiency.
At present, text classification methods are dominated by deep learning. Most models first represent unstructured text data as data a computer can understand, then train on a large number of text data sets with classification labels, and use a deep learning network to extract the important features in the texts and obtain the final classification labels.
However, conventional deep learning models have difficulty capturing the graph structure information contained in text, which limits the effect of text classification to some extent.
Disclosure of Invention
The invention provides a text classification method, a text classification device, text classification equipment and a storage medium, which are used for improving the text classification effect.
In a first aspect, an embodiment of the present invention provides a text classification method, including:
acquiring a text set to be classified, and inputting the text set to be classified into a preset language model to obtain an embedded vector, wherein the text set to be classified comprises a plurality of texts to be classified, and the texts to be classified comprise power texts;
inputting the embedded vector and the text set to be classified into a preset text graph network to obtain text graph features of each text to be classified, wherein the text graphs corresponding to the text graph features are not connected to one another by edges, and the text graph features comprise a plurality of word node features;
processing the text graph features and the embedded vector by using a preset bidirectional long short-term memory network to obtain text features;
and determining the text classification of the text to be classified according to the text features and the text graph features.
In a second aspect, an embodiment of the present invention provides an apparatus for text classification, including:
the embedded vector determining module is used for acquiring a text set to be classified and inputting the text set to be classified into a preset language model to obtain an embedded vector, wherein the text set to be classified comprises a plurality of texts to be classified, and the texts to be classified comprise power texts;
the text graph feature determination module is used for inputting the embedded vector and the text set to be classified into a preset text graph network to obtain text graph features of each text to be classified, wherein the text graphs corresponding to the text graph features are not connected to one another by edges, and the text graph features comprise a plurality of word node features;
the text feature determining module is used for processing the text graph features and the embedded vector by using a preset bidirectional long short-term memory network to obtain text features;
and the text classification determining module is used for determining the text classification of the text to be classified according to the text characteristics and the text graph characteristics.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method of text classification of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium storing computer instructions for causing a processor to implement the method for text classification of the first aspect when executed.
According to the text classification scheme provided by the embodiment of the invention, a text set to be classified is obtained and input into a preset language model to obtain an embedded vector, wherein the text set to be classified comprises a plurality of texts to be classified and the texts to be classified comprise power texts; the embedded vector and the text set to be classified are input into a preset text graph network to obtain the text graph features of each text to be classified, wherein the text graphs corresponding to the text graph features are not connected to one another by edges and the text graph features comprise a plurality of word node features; the text graph features and the embedded vector are processed with a preset bidirectional long short-term memory network to obtain text features; and the text classification of each text to be classified is determined according to the text features and the text graph features. With this technical scheme, the text set to be classified is first input into a preset language model to obtain an embedded vector; the embedded vector and the text set to be classified are then input into a preset text graph network to obtain text graph features; the text graph features and the embedded vector are processed with a preset bidirectional long short-term memory network to obtain text features; and the text classification of each text to be classified is finally determined according to the text features and the text graph features. The graph structure information contained in the power texts (the texts to be classified) is obtained with the preset text graph network model, the advantages of the pre-trained language model (i.e., the preset language model) and of the text graph network model are combined, deeper semantic structure information in the power texts is mined, and the power text classification effect is improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a text classification method according to Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a text classification method according to Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of a text classification apparatus according to Embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. In the description of the present invention, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to facilitate understanding of the technical solution of the present invention, the following describes related art:
Current text classification methods mainly comprise two types: traditional machine learning and deep learning.
1) Compared with early rule-based methods, text classification methods based on machine learning greatly improve classification precision and stability. A text classification method based on traditional machine learning mainly comprises two parts: feature extraction and a classifier. Common feature extraction tools mainly include the bag-of-words model and the TF-IDF (term frequency-inverse document frequency) algorithm; after the key features of a document are extracted, the features are input directly into a classifier to obtain a classification result. Common classifiers include naive Bayes, hidden Markov models, and random forests.
2) Text classification tools commonly used in deep-learning-based text classification mainly include the FastText model, the CNN (Convolutional Neural Network) model, the RNN (Recurrent Neural Network) model, the RCNN (Recurrent Convolutional Neural Network) model, and related variants. FastText is a highly efficient deep neural network classification model; its main idea is to segment the text into words, weight and superimpose the word vectors after vectorization, and classify with a softmax function after a hidden-layer computation. The idea of the RNN is to process the features of each step of time-series data and mine new information in conjunction with the previous processing state. The main idea of the convolutional neural network is to perform convolution on the input data with convolution kernels, pool the intermediate result, and then apply a softmax function for classification. The main idea of the RCNN is that the RNN can efficiently model time-series data while the CNN can effectively extract features; after the RNN performs time-series modeling on the text data, the CNN further extracts features, and finally a classification function classifies the text.
A graph neural network is a deep-learning-based method for processing graph information. A graph convolutional deep neural network can extract features from a graph data structure; that is, before a graph convolutional neural network is applied to extract features from a text, the text needs to be converted into a graph. Text GCN (Text Graph Convolutional Network) can learn vectors for both words and documents and has superior robustness. Building on the GCN, the Graph Attention Network further weights the aggregation function with an attention mechanism, so graph node features can be extracted more accurately. A heterogeneous graph neural network models a short text as a heterogeneous graph and then performs feature extraction and classification; because heterogeneous graphs carry complex information and rich semantics, they are widely applied in data mining.
Power texts are generally unstructured service texts. They lack uniform label management, cannot form knowledge accumulation in a fixed service field, and cannot provide efficient and convenient support for the application layer, and traditional text classification methods cannot mine the abundant graph structure information contained in power texts, so the classification effect is poor. The embodiment of the invention mines the semantic features and graph structure information in power texts by using a preset pre-trained language model, a preset text graph network, a preset bidirectional long short-term memory network, and the like, thereby solving the poor performance of conventional classification methods on power text classification. A power text can be understood as a text related to the power field, such as a power failure description text or a work log.
Embodiment 1
Fig. 1 is a flowchart of a text classification method according to Embodiment 1 of the present invention. This embodiment is applicable to text classification scenarios. The method may be executed by a text classification apparatus, which may be implemented in the form of hardware and/or software and configured in an electronic device; the electronic device may consist of one physical entity or of two or more physical entities.
As shown in Fig. 1, the text classification method provided in this embodiment of the present invention specifically includes the following steps:
s101, a text set to be classified is obtained and input into a preset language model to obtain an embedded vector, wherein the text set to be classified comprises a plurality of texts to be classified, and the texts to be classified comprise electric power texts.
In this embodiment, after the set of texts to be classified, i.e., the text set to be classified, is obtained, it may be input into a pre-trained language model, i.e., the preset language model, which may output an embedded vector for each text to be classified in the text set. The preset language model may be a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model whose training has been completed; the embedded vector usually contains the semantic feature information of the text to be classified, and the text to be classified may be, for example, a power text.
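For illustration only, the sketch below (Python, using the open-source Hugging Face Transformers library) shows how such per-text embeddings could be obtained; the public bert-base-chinese checkpoint stands in for the patent's fine-tuned preset language model, and the sample texts, variable names and shapes are assumptions of this sketch rather than details from the patent.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

# hypothetical power texts: a fault description and a work-log entry
texts = ["主变压器油温异常升高", "计划检修110kV线路"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# one embedding per token: these would feed the preset text graph network
token_embeddings = out.last_hidden_state        # (batch, seq_len, 768)
# the [CLS] vector summarises the whole text
cls_embedding = out.last_hidden_state[:, 0, :]  # (batch, 768)
```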
S102, inputting the embedded vector and the text set to be classified into a preset text graph network to obtain the text graph features of each text to be classified, wherein the text graphs corresponding to the text graph features are not connected to one another by edges, and the text graph features comprise a plurality of word node features.
In this embodiment, the embedded vector and the text set to be classified are input into a preset text graph network, which may construct a text graph for each text to be classified and obtain the corresponding text graph features of each text to be classified. The text graph features generally consist of a plurality of word node features; the word node features generally contain semantic feature information such as the co-occurrence relationships between words, and the number of word nodes in the text graph and in the text graph features is consistent with the number of words in the text to be classified. The preset text graph network may be a network model adjusted and trained on the basis of a conventional Text GCN. When a conventional Text GCN constructs its text graph, the texts of the training set and the test set must all be included in the graph, so new texts cannot be classified, and the constructed text graph grows huge as the data set keeps growing, which not only affects the training speed of the model but also causes larger memory overhead.
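The patent does not spell out the graph construction itself; the following sketch shows one plausible per-text scheme in which each text gets its own small graph whose word nodes are linked by sliding-window co-occurrence edges. The window size, the self-loops and the GCN-style normalization are assumptions of this sketch.

```python
import torch

def build_text_graph(num_words: int, window: int = 3) -> torch.Tensor:
    """Normalised adjacency over one document's word nodes."""
    adj = torch.eye(num_words)  # self-loops keep each node's own feature
    for i in range(num_words):
        # connect words that co-occur inside the sliding window
        for j in range(max(0, i - window + 1), min(num_words, i + window)):
            adj[i, j] = 1.0
    # symmetric normalisation D^{-1/2} A D^{-1/2}, as in a standard GCN
    d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)

adj_hat = build_text_graph(num_words=9)  # 9 word nodes for a 9-word text
```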
S103, processing the text graph features and the embedded vector with a preset bidirectional long short-term memory network to obtain text features.
In this embodiment, after the text graph features are obtained, the text graph features and the embedded vector may be input into a preset bidirectional long short-term memory network, which may fuse the text graph features with the feature information in the embedded vector and output the text features. The preset bidirectional long short-term memory network may be a network model adjusted and trained on the basis of a conventional LSTM (Long Short-Term Memory) network.
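A minimal sketch of this fusion step, assuming the text graph features have already been projected to the same per-token dimension as the embedded vector; the hidden size and the mean pooling are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class BiLstmFusion(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2 * dim, hidden_size=hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, graph_feats: torch.Tensor, embeds: torch.Tensor) -> torch.Tensor:
        # splice per-token graph features with the embedded vector, then
        # read the sequence in both directions for temporal context
        x = torch.cat([graph_feats, embeds], dim=-1)  # (B, T, 2*dim)
        out, _ = self.lstm(x)                         # (B, T, 2*hidden)
        return out.mean(dim=1)                        # pooled text feature

fusion = BiLstmFusion()
text_feat = fusion(torch.randn(2, 9, 768), torch.randn(2, 9, 768))  # (2, 512)
```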
S104, determining the text classification of the text to be classified according to the text features and the text graph features.
For example, after the text features and the text graph features are processed, e.g., spliced together, the splicing result is input into a chosen classifier, such as a softmax classifier, and the text classification of the text to be classified is determined from the result output by the classifier.
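A hedged sketch of such a classification head over the spliced features; the feature dimensions and the five fault classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

text_feat = torch.randn(2, 512)    # from the bidirectional LSTM
graph_feat = torch.randn(2, 256)   # pooled text graph feature

classifier = nn.Linear(512 + 256, 5)  # 5 hypothetical fault classes
logits = classifier(torch.cat([text_feat, graph_feat], dim=-1))
probs = torch.softmax(logits, dim=-1)
pred = probs.argmax(dim=-1)        # predicted class per text
```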
The text classification method provided by the embodiment of the invention first inputs a text set to be classified into a preset language model to obtain an embedded vector, then inputs the embedded vector and the text set to be classified into a preset text graph network to obtain text graph features, processes the text graph features and the embedded vector with a preset bidirectional long short-term memory network to obtain text features, and finally determines the text classification of each text to be classified according to the text features and the text graph features. In this technical scheme, the graph structure information contained in power texts is acquired with the preset text graph network model, the advantages of the pre-trained language model (i.e., the preset language model) are combined with those of the text graph network model, deeper semantic structure information in the power texts is mined, and the power text classification effect is improved.
Embodiment 2
Fig. 2 is a flowchart of a text classification method provided in Embodiment 2 of the present invention. The technical scheme of this embodiment is further optimized on the basis of the optional technical schemes above, and a specific manner of text classification is given.
Optionally, inputting the text set to be classified into a preset language model to obtain an embedded vector includes: extracting word features, sentence features and position features of each text to be classified from the text set to be classified by using the preset language model; and splicing the word features, the sentence features and the position features by using the preset language model to obtain the embedded vector. The advantage of this is that the word features, sentence features and position features of the texts to be classified are extracted with the preset language model, so the embedded vector contains the semantic feature information of the texts to be classified.
Optionally, processing the text graph features and the embedded vector with a preset bidirectional long short-term memory network to obtain text features includes: performing graph convolution on the text graph features to obtain processed text graph features; and splicing the processed text graph features with the embedded vector and inputting the splicing result into a preset bidirectional long short-term memory network to obtain text features. The advantage of this is that the graph convolution over the text graph features captures more temporal context features of the text to be classified and improves the accuracy of text classification.
Optionally, determining the text classification of the text to be classified according to the text features and the text graph features includes: inputting the text graph features into a fully connected network to obtain dimension-reduced text graph features; and processing the text features and the dimension-reduced text graph features with a preset nonlinear classifier to obtain the text classification of the text to be classified, wherein the text classification includes fault type classification. Inputting the text features and the dimension-reduced text graph features into the nonlinear classifier allows the text classification of the text to be classified to be determined accurately from the classifier's output.
As shown in Fig. 2, the text classification method provided in Embodiment 2 of the present invention specifically includes the following steps:
s201, acquiring a text set to be classified, and extracting word features, sentence features and position features of each text to be classified from the text set to be classified by using a preset language model.
Specifically, after each sentence of a text to be classified is input into the preset language model, the model may segment the text into words and extract the word features, sentence features and position features of the text. A word feature can be understood as the embedded feature representation of each word, a sentence feature as the embedded feature representation of the sentence in which the word occurs, and a position feature as the embedded feature representation of the word's position in the sentence. Each sentence may begin with [CLS] and end with [SEP].
Illustratively, if the word feature is e_w, the sentence feature is e_s, and the position feature is e_p, the embedded vector E_CLS output by the preset language model can be expressed as: E_CLS = e_w + e_s + e_p. The embedded vector output by the preset language model usually does not contain the features of certain words, such as function words, conjunctions, and other words without concrete meaning.
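In code, the summation of the three feature tables could look roughly as follows; the vocabulary size, dimensions and sample ids are illustrative, and in standard BERT terms e_s corresponds to the segment embedding.

```python
import torch
import torch.nn as nn

dim = 768
word_emb = nn.Embedding(21128, dim)  # e_w: one vector per vocabulary word
sent_emb = nn.Embedding(2, dim)      # e_s: sentence/segment embedding
pos_emb = nn.Embedding(512, dim)     # e_p: position-in-sentence embedding

token_ids = torch.tensor([[101, 1920, 3696, 102]])  # [CLS] .. [SEP], illustrative
segment_ids = torch.zeros_like(token_ids)
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

# E_CLS = e_w + e_s + e_p
e_cls = word_emb(token_ids) + sent_emb(segment_ids) + pos_emb(position_ids)
```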
Optionally, the training mode of the preset language model includes:
1) A plurality of first training texts configured with first text classification labels are obtained.
For example, if the first training text is a power failure description text, the first text classification tag may be a failure type tag, such as a main transformer failure and an oil temperature tank failure.
2) Inputting the first training text into the language model to be trained to obtain a first embedded training vector, and determining a first prediction text classification based on the first embedded training vector and a first preset classifier.
Specifically, after a first training text is input into the language model to be trained, the model outputs the embedded vector corresponding to the first training text, i.e., the first embedded training vector. In this embodiment, the first embedded training vector and the first training text may further be input into a set text graph network; the text graph features output by the text graph network and the first embedded training vector are processed with a set bidirectional long short-term memory network; finally the text features output by the bidirectional long short-term memory network are fused with the text graph features output by the text graph network, and the fused result is input into the first preset classifier, such as a softmax classifier, so as to obtain the predicted classification of each first training text, i.e., the first prediction text classification. The set text graph network and the set bidirectional long short-term memory network may be untrained or already trained network models.
3) Determining a loss function according to the first text classification label and the first prediction text classification, and determining the layers of an encoder to be adjusted according to the similarity between the first training text and historical training texts, wherein the encoder belongs to the language model to be trained, the number of layers to be adjusted is smaller than the total number of encoder layers, and the historical training texts are determined from the historical training of the language model to be trained.
Specifically, the loss function may be established from the predicted classification and the true classification, i.e., the first prediction text classification and the classification corresponding to the first text classification label. After the similarity between the first training text and the historical training texts is determined, the number of encoder layers whose parameters need to be adjusted, e.g., four layers, and the positions of the layers to be adjusted, e.g., the 9th to the 12th layer, can be determined. The number of layers to be adjusted can be inversely related to the similarity between the first training text and the historical training texts, i.e., the higher the similarity, the smaller the number of layers to be adjusted. A large number of training texts are generally needed to train the model many times, and the training texts used before the first training text is used to train the language model to be trained are the historical training texts.
4) Adjusting the parameters of the layers to be adjusted, on the basis of the encoder's historical parameters and according to the magnitude of the loss function's calculation result, so as to finish the training of the language model to be trained and obtain the preset language model.
Illustratively, the language model to be trained uses a self-attention mechanism to link the words in the text during training. The parameters of the encoder's layers to be adjusted are adjusted according to the calculation result of the loss function until the result falls below a first set threshold, thereby completing the training of the language model to be trained and obtaining the preset language model. After each training of the language model to be trained, the relevant parameters of the model are stored, so that when the parameters of the encoder's layers to be adjusted are adjusted during training, the historical training text with the highest similarity to the first training text can be selected from the plurality of historical training texts, and the adjustment is performed on the basis of the model parameters corresponding to that historical training text.
The advantage of training the preset language model in this way is that, by fine-tuning part of BERT's parameters, the fine-tuned BERT, i.e., the preset language model, can contain richer semantic feature information.
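A minimal sketch of such selective fine-tuning, assuming a 12-layer Hugging Face BERT; the choice of layers 9 to 12 follows the example above, while the similarity-based selection itself is only stubbed out as a fixed set.

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-chinese")

# suppose the similarity check decided to adjust encoder layers 9-12
layers_to_tune = {8, 9, 10, 11}  # 0-indexed

for param in model.parameters():
    param.requires_grad = False           # freeze everything first
for idx, layer in enumerate(model.encoder.layer):
    if idx in layers_to_tune:
        for param in layer.parameters():
            param.requires_grad = True    # only these layers get updated
```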
S202, splicing the word features, the sentence features and the position features by utilizing the preset language model to obtain an embedded vector.
Optionally, the process of determining the text graph feature by the preset text graph network includes:
1) Determining the word features in the embedded vector as text graph word features, and performing a first weighted calculation on the word features in the embedded vector to obtain text graph sentence features.
Illustratively, as described above, the word feature e_w may be determined as the text graph word feature E_word, and a weighted calculation, i.e., the first weighted calculation, may be performed on the word features e_w corresponding to each sentence to obtain the feature of the sentence in which each word is located, i.e., the text graph sentence feature E_sentence. The weight coefficients of the word features in the first weighted calculation can be equal, and the number of words in each sentence is consistent with the number of text graph word features and with the number of text graph sentence features.
2) Determining the co-occurrence edges of the word nodes according to the co-occurrence relationships among the words in the text to be classified, and performing a second weighted calculation on the word features in the embedded vector according to the number of co-occurrence edges to obtain text co-occurrence features, wherein the weighting coefficients of the second weighted calculation are positively correlated with the number of co-occurrence edges, and the first weighted calculation is different from the second weighted calculation.
For example, the co-occurrence relationships between the words in each sentence may be determined, then the word node corresponding to each word is determined, and the co-occurrence edges between the word nodes are determined. If the words a, b and c appear in the sentence in the order bac, then for word nodes a, b and c: b and a co-occur, a and c co-occur, and b and c have no co-occurrence relationship, so a co-occurrence edge exists between word node a and word node b, and a co-occurrence edge exists between word node a and word node c. A second weighted calculation may then be performed on the word features in each sentence, yielding the text co-occurrence feature E_ngram corresponding to each word node, where the weight coefficient of each word feature in the second weighted calculation is positively correlated with its number of co-occurrence edges. Since a sentence can contain a plurality of words, the amount of word feature information included in E_ngram may be preset, i.e., the number of word features entering one second weighted calculation may be preset. For example, if the preset number is 3 and the sentence contains 9 words, E_ngram is E_3 (ngram denotes the preset number), and performing one second weighted calculation for every three word features yields the 9 E_3 features corresponding to the 9 word nodes.
3) Splicing the text graph word features, the text graph sentence features and the text co-occurrence features to obtain the word node features, and determining the text graph features from the word node features.
For example, as described above, each word in a sentence corresponds to one word node and one word feature, so the word node feature of each word's node can be obtained by concatenating the text graph word feature, text graph sentence feature and text co-occurrence feature corresponding to that word; the word node feature W_i can be expressed as:

W_i = (E_sentence, E_ngram, E_word)
Finally, the word node features of each text to be classified are integrated to obtain the text graph features of that text.
Determining the text graph features in this way makes the obtained text graph features contain the text graph word features, text graph sentence features and text co-occurrence features, so sufficient graph structure information is extracted from the text to be classified and the classification accuracy is ensured.
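The following sketch assembles W_i under stated assumptions: equal weights for E_sentence, and a single weighted sum whose weights grow with each node's co-occurrence edge count for E_ngram. The patent does not fix the exact weighting scheme, so this is illustrative only.

```python
import torch

def word_node_features(word_feats: torch.Tensor,
                       edge_counts: torch.Tensor) -> torch.Tensor:
    """W_i = (E_sentence, E_ngram, E_word) for every word node.

    word_feats:  (n, d) word features e_w from the language model
    edge_counts: (n,)   number of co-occurrence edges per word node
    """
    n, d = word_feats.shape
    e_word = word_feats                                      # E_word
    # first weighted calculation: equal weights over the sentence
    e_sentence = word_feats.mean(dim=0, keepdim=True).expand(n, d)
    # second weighted calculation: weights grow with the edge count
    w = edge_counts.float() + 1.0
    w = w / w.sum()
    e_ngram = (w.unsqueeze(1) * word_feats).sum(dim=0, keepdim=True).expand(n, d)
    return torch.cat([e_sentence, e_ngram, e_word], dim=-1)  # (n, 3d)

feats = word_node_features(torch.randn(9, 768),
                           torch.tensor([2, 3, 1, 2, 4, 2, 1, 3, 2]))
```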
S203, performing graph convolution on the text graph features to obtain the processed text graph features.
For example, the text graph features may be convolved using a set network model, such as a graph convolutional deep neural network, so that the processed text graph features match the dimensions of the embedded vector.
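For instance, one standard GCN layer could map the concatenated word node features back to the embedding dimension before splicing; this is a stand-in for the patent's unspecified convolution, with assumed dimensions.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One standard GCN layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj_hat: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(adj_hat @ h))

gcn = GraphConv(3 * 768, 768)  # match the 768-d embedded vector for splicing
h = gcn(torch.eye(9), torch.randn(9, 3 * 768))  # (9, 768)
```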
S204, splicing the processed text graph features with the embedded vector, and inputting the splicing result into a preset bidirectional long short-term memory network to obtain text features.
Optionally, the preset text graph network training mode includes:
1) Acquiring a plurality of second training texts configured with second text classification labels, and second embedded training vectors, wherein the second embedded training vectors are determined based on the trained preset language model.
Specifically, the second embedded training vector may be output by a trained preset language model, or may be output by an untrained preset language model.
2) Inputting the second training text into the text graph network to be trained to obtain training text graph features, wherein the training text graph features are determined based on the classification label nodes and the training word node features.
Specifically, the second training text may be input into the text graph network to be trained, which may output training text graph features according to the second text classification label configured for the second training text. The training text graph features may include the training word node features corresponding to the words in the second training text. Training text classification labels may be configured for the second training text and nodes corresponding to these labels generated; the co-occurrence relationships between the label nodes and the word nodes in the second training text are determined; and the co-occurrence features contained in the training word node features are determined from the co-occurrence relationships between the word nodes of the second training text and from the co-occurrence relationships between the label nodes and the word nodes.
3) Fusing the training word node features with the second embedded training vector, and inputting the fusion result and the training text graph features into a second preset classifier to determine a second prediction text classification.
Illustratively, the training word node features and the second embedded training vector may be spliced, and the splicing result is fused with the training text graph features to obtain feature information that fuses the text features and the text graph features; finally this feature information is input into a classifier to obtain the predicted classification of the second training text, i.e., the second prediction text classification.
4) Determining a loss relation according to the second text classification label and the second prediction text classification, and training the text graph network to be trained according to the loss relation to complete the training and obtain the preset text graph network.
Specifically, based on the predicted classification and the true classification, i.e., the second prediction text classification and the classification corresponding to the second text classification label, a loss function can be established, such as a cross-entropy loss function τ, which can be expressed as:

τ = CrossEntropy(y, y′)

where y represents the classification corresponding to the second text classification label and y′ represents the second prediction text classification. The text graph network to be trained is then trained according to the calculation result of the loss function until the result falls below a second set threshold, yielding the preset text graph network. If the second embedded training vector is output by the untrained preset language model, the preset language model and the text graph network to be trained can be trained simultaneously.
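The loss computation itself is standard; a sketch with illustrative shapes (four texts, five classes):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5, requires_grad=True)  # classifier output per text
labels = torch.tensor([0, 2, 1, 4])             # classes y from the second labels
loss = F.cross_entropy(logits, labels)          # τ = CrossEntropy(y, y')
loss.backward()  # drives the update of the network to be trained
```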
The process by which the text graph network to be trained determines the training text graph features comprises the following steps:
1) Determining training word features and training sentence features from the second embedded training vector; determining the training co-occurrence edges of the training word nodes according to the co-occurrence relationships between the words in the second training text and with the second text classification label; and performing a third weighted calculation on the word features in the second embedded training vector according to the number of training co-occurrence edges to determine the training co-occurrence features, wherein the training co-occurrence edges comprise the co-occurrence edges between training word nodes and the co-occurrence edges between training word nodes and the classification label nodes corresponding to the second text classification label, and the weighting coefficients of the third weighted calculation are positively correlated with the number of training co-occurrence edges.
Illustratively, as described above and similar to the process of determining the text graph word features and text graph sentence features, the word features in the second embedded training vector are determined as the training word features, and a fourth weighted calculation is performed on the word features corresponding to each sentence of the second training text to obtain the feature of the sentence in which each word is located, i.e., the training sentence features. The co-occurrence relationships between the words in each sentence of the second training text, and between each word and the second text classification label, are then determined; the word nodes corresponding to the words and the classification label nodes corresponding to the second text classification label are determined, so the co-occurrence edges between the word nodes and the classification label nodes can be determined. Finally a third weighted calculation is performed on the word features corresponding to the word nodes according to each word node's number of co-occurrence edges to obtain the training co-occurrence features. The weight coefficients of the word features in the fourth weighted calculation may be equal; the number of words in each sentence of the second training text is consistent with the number of training word features and training sentence features; the weight coefficient of each word feature in the third weighted calculation is positively correlated with its number of co-occurrence edges; the number of word features entering the third weighted calculation may also be preset; and each word node's number of co-occurrence edges includes the co-occurrence edges between word nodes as well as those with the classification label nodes.
2) Splicing the training word features, the training sentence features and the training co-occurrence features to determine the training word node features, and determining the training text graph features from the training word node features.
Specifically, the feature information obtained above is spliced to obtain the training word node features, and the training text graph features consist of the plurality of training word node features. Because the classification label nodes are added, the numbers of co-occurrence edges of the word nodes increase in unequal proportions, which affects the weighting coefficients of the third weighted calculation and the training co-occurrence features; for example, the weighting coefficients of some word features in the third weighted calculation may increase or decrease disproportionately.
The advantage of training the preset text graph network in this way is that, by introducing the classification label nodes, each word node is connected to the classification label nodes by edges, which improves the accuracy with which the preset text graph network represents the graph structure information in a text.
S205, inputting the text graph features into a fully connected network to obtain the dimension-reduced text graph features.
Specifically, the fully connected network may be used to reduce the dimensionality of the text graph features, yielding the dimension-reduced text graph features.
S206, processing the text features and the dimension-reduced text graph features with a preset nonlinear classifier to obtain the text classification of the text to be classified.
Wherein the text classification comprises a fault type classification.
For example, if the nonlinear classifier is a softmax function, the text features and the dimension-reduced text graph features may first be merged, and the merged result is then input into the softmax function to obtain the candidate text classifications and the confidence of each; the text classification whose confidence exceeds a set confidence threshold is determined as the text classification of the text to be classified.
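A sketch of S205 and S206 under assumed dimensions: a fully connected layer reduces the text graph features, the spliced result passes through softmax, and only predictions above an assumed confidence threshold are kept.

```python
import torch
import torch.nn as nn

graph_feat = torch.randn(2, 3 * 768)   # pooled text graph features
text_feat = torch.randn(2, 512)        # text features from the BiLSTM

reduce = nn.Linear(3 * 768, 256)       # fully connected dimension reduction
graph_small = reduce(graph_feat)

classifier = nn.Linear(512 + 256, 5)   # 5 hypothetical fault classes
probs = torch.softmax(classifier(torch.cat([text_feat, graph_small], dim=-1)), dim=-1)

confidence, predicted = probs.max(dim=-1)
threshold = 0.5                        # assumed confidence threshold
final = torch.where(confidence > threshold, predicted,
                    torch.full_like(predicted, -1))  # -1 marks "rejected"
```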
The text classification method provided by this embodiment of the invention first inputs a text set to be classified into a preset language model, which can output an embedded vector containing word features, sentence features and position features; then inputs the embedded vector and the text set to be classified into a preset text graph network to obtain text graph features; then performs graph convolution on the text graph features, splices the result with the embedded vector, and processes the splicing result with a preset bidirectional long short-term memory network to obtain text features; and finally inputs the dimension-reduced text graph features and the text features into a preset nonlinear classifier to determine the text classification of the text to be classified. In the technical scheme of this embodiment, the improved pre-trained language model, i.e., the preset language model, solves the problem of inaccurate power text classification caused by the limited representation capacity of word vectors, and the preset bidirectional long short-term memory network solves the problem that the pre-trained language model and the text graph network cannot capture temporal context features, so the power text classification effect is improved.
Embodiment 3
Fig. 3 is a schematic structural diagram of a text classification apparatus according to Embodiment 3 of the present invention. As shown in Fig. 3, the apparatus includes: an embedded vector determination module 301, a text graph feature determination module 302, a text feature determination module 303, and a text classification determination module 304, wherein:
the embedded vector determining module is used for acquiring a text set to be classified and inputting the text set to be classified into a preset language model to obtain an embedded vector, wherein the text set to be classified comprises a plurality of texts to be classified, and the texts to be classified comprise electric power texts;
the text graph feature determination module is used for inputting the embedded vector and the text set to be classified into a preset text graph network to obtain text graph features of each text to be classified, wherein the text graphs corresponding to the text graph features are not connected to one another by edges, and the text graph features comprise a plurality of word node features;
the text feature determination module is used for processing the text graph features and the embedded vector by using a preset bidirectional long short-term memory network to obtain text features;
and the text classification determining module is used for determining the text classification of the text to be classified according to the text characteristics and the text graph characteristics.
The text classification apparatus provided by the embodiment of the invention first inputs a text set to be classified into a preset language model to obtain an embedded vector, then inputs the embedded vector and the text set to be classified into a preset text graph network to obtain text graph features, processes the text graph features and the embedded vector with a preset bidirectional long short-term memory network to obtain text features, and finally determines the text classification of the text to be classified according to the text features and the text graph features. The graph structure information contained in power texts is acquired with the preset text graph network model, the advantages of the pre-trained language model (i.e., the preset language model) and of the text graph network model are combined, deeper semantic structure information in the power texts is mined, and the power text classification effect is improved.
Optionally, the embedded vector determining module includes:
the characteristic determining unit is used for extracting word characteristics, sentence characteristics and position characteristics of each text to be classified from the text set to be classified by utilizing a preset language model;
and the vector determining unit is used for splicing the word features, the sentence features and the position features by utilizing the preset language model to obtain an embedded vector.
Optionally, the training of the preset language model includes: acquiring a plurality of first training texts configured with first text classification labels; inputting the first training texts into a language model to be trained to obtain first embedded training vectors, and determining a first prediction text classification based on the first embedded training vectors and a first preset classifier; determining a loss function according to the first text classification labels and the first prediction text classification, and determining the layers of an encoder to be adjusted according to the similarity between the first training texts and historical training texts, wherein the encoder belongs to the language model to be trained, the number of layers to be adjusted is less than the total number of encoder layers, and the historical training texts are determined from the historical training of the language model to be trained; and adjusting the parameters of the layers to be adjusted, on the basis of the encoder's historical parameters and according to the magnitude of the loss function's calculation result, so as to finish the training of the language model to be trained and obtain the preset language model.
Optionally, the process of determining the text graph features by the preset text graph network includes: determining the word features in the embedded vector as text graph word features, and performing a first weighted calculation on the word features in the embedded vector to obtain text graph sentence features; determining the co-occurrence edges of the word nodes according to the co-occurrence relationships among the words in the text to be classified, and performing a second weighted calculation on the word features in the embedded vector according to the number of co-occurrence edges to obtain text co-occurrence features, wherein the weighting coefficients of the second weighted calculation are positively correlated with the number of co-occurrence edges, and the first weighted calculation is different from the second weighted calculation; and splicing the text graph word features, the text graph sentence features and the text co-occurrence features to obtain word node features, and determining the text graph features from the word node features.
Optionally, the text feature determination module includes:
the text graph feature processing unit is used for performing graph convolution on the text graph features to obtain the processed text graph features;
and the text feature determining unit is used for splicing the processed text graph features with the embedded vector and inputting the splicing result into a preset bidirectional long short-term memory network to obtain the text features.
Optionally, the training of the preset text graph network includes: acquiring a plurality of second training texts configured with second text classification labels, and second embedded training vectors, wherein the second embedded training vectors are determined based on the trained preset language model; inputting the second training texts into a text graph network to be trained to obtain training text graph features, wherein the training text graph features are determined based on classification label nodes and training word node features; fusing the training word node features with the second embedded training vectors, and inputting the fusion result and the training text graph features into a second preset classifier to determine a second prediction text classification; and determining a loss relation according to the second text classification labels and the second prediction text classification, and training the text graph network to be trained according to the loss relation to complete the training and obtain the preset text graph network. The process by which the text graph network to be trained determines the training text graph features includes: determining training word features and training sentence features from the second embedded training vectors; determining the training co-occurrence edges of the training word nodes according to the co-occurrence relationships between the words in the second training texts and with the second text classification labels; performing a third weighted calculation on the word features in the second embedded training vectors according to the number of training co-occurrence edges to determine the training co-occurrence features, wherein the training co-occurrence edges comprise the co-occurrence edges between training word nodes and the co-occurrence edges between training word nodes and the classification label nodes corresponding to the second text classification labels, and the weighting coefficients of the third weighted calculation are positively correlated with the number of training co-occurrence edges; and splicing the training word features, the training sentence features and the training co-occurrence features to determine the training word node features, and determining the training text graph features from the training word node features.
Optionally, the text classification determining module includes:
the dimension reduction processing unit is used for inputting the text graph features into a fully connected network to obtain the dimension-reduced text graph features;
and the text classification determining unit is used for processing the text features and the dimension-reduced text graph features with a preset nonlinear classifier to obtain the text classification of the text to be classified, wherein the text classification includes fault type classification.
The text classification device provided by the embodiment of the invention can execute the text classification method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Embodiment 4
Fig. 4 shows a schematic block diagram of an electronic device 40 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 4, the electronic device 40 includes at least one processor 41, and a memory communicatively connected to the at least one processor 41, such as a Read Only Memory (ROM) 42, a Random Access Memory (RAM) 43, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 41 may perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 42 or the computer program loaded from a storage unit 48 into the Random Access Memory (RAM) 43. In the RAM 43, various programs and data necessary for the operation of the electronic apparatus 40 can also be stored. The processor 41, the ROM 42, and the RAM 43 are connected to each other via a bus 44. An input/output (I/O) interface 45 is also connected to bus 44.
A plurality of components in the electronic device 40 are connected to the I/O interface 45, including: an input unit 46 such as a keyboard, a mouse, or the like; an output unit 47 such as various types of displays, speakers, and the like; a storage unit 48 such as a magnetic disk, an optical disk, or the like; and a communication unit 49 such as a network card, modem, wireless communication transceiver, etc. The communication unit 49 allows the electronic device 40 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processor 41 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 41 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. Processor 41 performs the various methods and processes described above, such as the text classification method.
In some embodiments, the text classification method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 48. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 40 via the ROM 42 and/or the communication unit 49. When the computer program is loaded into the RAM 43 and executed by the processor 41, one or more steps of the text classification method described above may be performed. Alternatively, in other embodiments, the processor 41 may be configured by any other suitable means (e.g., by way of firmware) to perform the text classification method.
Various implementations of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Computer programs for implementing the methods of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
The electronic device provided above can be used to execute the text classification method provided by any of the above embodiments, and has corresponding functions and advantages.
Example five
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a computer processor, perform a text classification method, the method comprising:
acquiring a text set to be classified, and inputting the text set to be classified into a preset language model to obtain an embedded vector, wherein the text set to be classified comprises a plurality of texts to be classified, and the texts to be classified comprise electric power texts;
inputting the embedded vector and the text set to be classified into a preset text graph network to obtain text graph features of each text to be classified, wherein there are no edge connections between the text graphs corresponding to the text graph features, and the text graph features include a plurality of word node features;
processing the text graph features and the embedded vector by using a preset bidirectional long-short term memory network to obtain text features;
and determining the text classification of the text to be classified according to the text features and the text graph features.
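Chained together, the four steps above form the pipeline sketched below; each module is a lightweight placeholder (an embedding table and linear layers) standing in for the preset language model and the preset text graph network, whose actual architectures are not fixed by this summary.

```python
import torch
import torch.nn as nn

VOCAB, EMBED_DIM, HIDDEN, NUM_CLASSES = 1000, 64, 32, 5  # assumed sizes

language_model = nn.Embedding(VOCAB, EMBED_DIM)      # placeholder preset language model
graph_net = nn.Linear(EMBED_DIM, EMBED_DIM)          # placeholder preset text graph network
bilstm = nn.LSTM(2 * EMBED_DIM, HIDDEN, bidirectional=True, batch_first=True)
classifier = nn.Linear(2 * HIDDEN + EMBED_DIM, NUM_CLASSES)

token_ids = torch.randint(0, VOCAB, (1, 12))          # one tokenized power text
embedded = language_model(token_ids)                  # step 1: embedded vector
graph_feats = graph_net(embedded)                     # step 2: text graph features
spliced = torch.cat([graph_feats, embedded], dim=-1)
text_feats, _ = bilstm(spliced)                       # step 3: BiLSTM text features
pooled = torch.cat([text_feats.mean(1), graph_feats.mean(1)], dim=-1)
logits = classifier(pooled)                           # step 4: text classification
```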
In the context of the present invention, a computer readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer-readable storage medium provided above can be used to execute the text classification method provided by any of the above embodiments, and has corresponding functions and advantages.
It should be noted that, in the embodiment of the text classification apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It should be noted that the foregoing is merely a description of preferred embodiments of the present invention and of the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit, the scope of the present invention being determined by the scope of the appended claims.

Claims (10)

1. A method of text classification, comprising:
acquiring a text set to be classified, and inputting the text set to be classified into a preset language model to obtain an embedded vector, wherein the text set to be classified comprises a plurality of texts to be classified, and the texts to be classified comprise electric power texts;
inputting the embedded vector and the text set to be classified into a preset text graph network to obtain text graph features of each text to be classified, wherein there are no edge connections between the text graphs corresponding to the text graph features, and the text graph features comprise a plurality of word node features;
processing the text graph features and the embedded vector by using a preset bidirectional long-short term memory network to obtain text features;
and determining the text classification of the text to be classified according to the text features and the text graph features.
2. The method according to claim 1, wherein the inputting the text set to be classified into a preset language model to obtain an embedded vector comprises:
extracting word features, sentence features and position features of each text to be classified from the text set to be classified by using a preset language model;
and splicing the word features, the sentence features and the position features by using the preset language model to obtain an embedded vector.
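As an illustration of the splicing step in claim 2, the sketch below concatenates hypothetical word, sentence (segment) and position embeddings; all vocabulary sizes and dimensions are assumptions, and concatenation follows the claim wording even though BERT-style models conventionally sum these features.

```python
import torch
import torch.nn as nn

VOCAB, MAX_LEN, NUM_SEGMENTS, DIM = 1000, 128, 2, 64  # assumed sizes

word_emb = nn.Embedding(VOCAB, DIM)                   # word features
sent_emb = nn.Embedding(NUM_SEGMENTS, DIM)            # sentence (segment) features
pos_emb = nn.Embedding(MAX_LEN, DIM)                  # position features

tokens = torch.randint(0, VOCAB, (1, 10))
segments = torch.zeros(1, 10, dtype=torch.long)
positions = torch.arange(10).unsqueeze(0)

# Splice (concatenate) the three feature groups into the embedded vector.
embedded = torch.cat(
    [word_emb(tokens), sent_emb(segments), pos_emb(positions)], dim=-1
)                                                     # shape (1, 10, 3 * DIM)
```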
3. The method according to claim 1, wherein the preset language model is trained in a manner comprising:
acquiring a plurality of first training texts configured with first text classification labels;
inputting the first training text into a language model to be trained to obtain a first embedded training vector, and determining a first prediction text classification based on the first embedded training vector and a first preset classifier;
determining a loss function according to the first text classification label and the first prediction text classification, and determining a layer to be adjusted of an encoder according to the similarity between the first training text and a historical training text, wherein the encoder belongs to the language model to be trained, the number of layers to be adjusted is smaller than the total number of layers of the encoder, and the historical training text is determined based on historical training of the language model to be trained;
and adjusting the parameters of the layer to be adjusted on the basis of the historical parameters of the encoder according to the magnitude of the calculated loss function, so as to complete the training of the language model to be trained and obtain the preset language model.
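A sketch of the selective fine-tuning in claim 3, under stated assumptions: the encoder is modeled as a stack of linear layers, the similarity is cosine similarity over pooled text vectors, and the rule "higher similarity, fewer adjusted layers" is one plausible reading; none of these specifics are given in the claim.

```python
import torch
import torch.nn as nn

NUM_LAYERS, DIM = 4, 32                               # assumed encoder shape
encoder = nn.ModuleList(nn.Linear(DIM, DIM) for _ in range(NUM_LAYERS))

def num_layers_to_adjust(similarity: float, total: int) -> int:
    # One plausible rule: higher similarity to historical training texts
    # means fewer layers are adjusted; always fewer than the total.
    return max(1, min(total - 1, round((1.0 - similarity) * total)))

first_text = torch.randn(DIM)                         # pooled first training text
history_text = torch.randn(DIM)                       # pooled historical training text
sim = torch.cosine_similarity(first_text, history_text, dim=0).clamp(0, 1).item()

n = num_layers_to_adjust(sim, NUM_LAYERS)
for i, layer in enumerate(encoder):
    adjustable = i >= NUM_LAYERS - n                  # only the top-n layers train
    for p in layer.parameters():
        p.requires_grad = adjustable                  # others keep historical parameters
```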
4. The method of claim 2, wherein the step of determining the text graph features by the preset text graph network comprises:
determining the word features in the embedded vector as text graph word features, and performing a first weighted calculation on the word features in the embedded vector to obtain text graph sentence features;
determining co-occurrence edges of word nodes according to the co-occurrence relationships between the words in the text to be classified, and performing a second weighted calculation on the word features in the embedded vector according to the number of co-occurrence edges to obtain text co-occurrence features, wherein the weighting coefficient of the second weighted calculation is positively correlated with the number of co-occurrence edges, and the first weighted calculation is different from the second weighted calculation;
and splicing the text graph word features, the text graph sentence features and the text co-occurrence features to obtain word node features, and determining the text graph features according to the word node features.
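One way the feature construction of claim 4 could look, with softmax attention pooling as the first weighted calculation and a window-based edge count feeding a logarithmic coefficient as the second; both weighting choices and all sizes are assumptions.

```python
import torch

DIM, N_WORDS, WINDOW = 32, 8, 2                       # assumed sizes

word_feats = torch.randn(N_WORDS, DIM)                # text graph word features
attn = torch.softmax(torch.randn(N_WORDS), dim=0)     # first weighted calculation
sent_feat = (attn.unsqueeze(-1) * word_feats).sum(0)  # text graph sentence feature
sent_feats = sent_feat.expand(N_WORDS, DIM)

# Count co-occurrence edges within a sliding window (one way to define them).
edge_counts = torch.zeros(N_WORDS)
for i in range(N_WORDS):
    for j in range(max(0, i - WINDOW), min(N_WORDS, i + WINDOW + 1)):
        if i != j:
            edge_counts[i] += 1

# Second weighted calculation: coefficient positively correlated with edge count,
# and deliberately different from the first (softmax) weighting.
cooc_feats = torch.log1p(edge_counts).unsqueeze(-1) * word_feats

# Splice the three groups into word node features, then pool per text.
word_node_feats = torch.cat([word_feats, sent_feats, cooc_feats], dim=-1)
text_graph_feat = word_node_feats.mean(0)
```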
5. The method of claim 1, wherein the processing the text graph features and the embedded vector by using a preset bidirectional long-short term memory network to obtain text features comprises:
performing graph convolution processing on the text graph characteristics to obtain processed text graph characteristics;
and splicing the processed text graph features and the embedded vectors, and inputting a splicing result into a preset bidirectional long-short term memory network to obtain text features.
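A sketch of claim 5's two steps, assuming a single normalized graph-convolution layer and arbitrary dimensions; the actual convolution form is not specified.

```python
import torch
import torch.nn as nn

N, DIM, HIDDEN = 8, 32, 16                            # assumed sizes

graph_feats = torch.randn(N, DIM)                     # text graph features (word nodes)
embedded = torch.randn(N, DIM)                        # embedded vector, aligned per word

adj = (torch.rand(N, N) > 0.5).float() + torch.eye(N) # toy adjacency with self-loops
deg_inv = torch.diag(1.0 / adj.sum(dim=1))
gcn = nn.Linear(DIM, DIM, bias=False)

# Graph convolution processing: normalized neighborhood aggregation.
processed = torch.relu(gcn(deg_inv @ adj @ graph_feats))

# Splice with the embedded vector, then feed the BiLSTM.
bilstm = nn.LSTM(2 * DIM, HIDDEN, bidirectional=True, batch_first=True)
spliced = torch.cat([processed, embedded], dim=-1).unsqueeze(0)
text_feats, _ = bilstm(spliced)                       # (1, N, 2 * HIDDEN)
```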
6. The method of claim 1, wherein the training mode of the preset text graph network comprises:
acquiring a plurality of second training texts configured with second text classification labels and second embedded training vectors, wherein the second embedded training vectors are determined based on a preset language model after training;
inputting the second training text into a text graph network to be trained to obtain training text graph characteristics, wherein the training text graph characteristics are determined based on classification label nodes and training word node characteristics;
fusing the training word node characteristics and the second embedded training vector, and inputting a fusion result and the training text graph characteristics into a second preset classifier to determine a second prediction text classification;
determining a loss relation according to the second text classification label and the second prediction text classification, and training the text graph network to be trained according to the loss relation to complete the training and obtain the preset text graph network;
wherein the process of determining the training text graph features by the text graph network to be trained comprises:
determining training word features and training sentence features according to the second embedded training vector, determining training co-occurrence edges of training word nodes according to the co-occurrence relationships among the words in the second training text and between those words and the second text classification label, and performing a third weighted calculation on the word features in the second embedded training vector according to the number of training co-occurrence edges to determine training co-occurrence features, wherein the training co-occurrence edges comprise co-occurrence edges between the training word nodes as well as co-occurrence edges between the training word nodes and the classification label nodes corresponding to the second text classification label, and the weighting coefficient of the third weighted calculation is positively correlated with the number of training co-occurrence edges;
and splicing the training word features, the training sentence features and the training co-occurrence features to determine training word node features, and determining the training text graph features according to the training word node features.
7. The method of claim 1, wherein determining the text classification of the text to be classified according to the text feature and the text graph feature comprises:
inputting the text graph features into a fully-connected network to obtain text graph features after dimension reduction processing;
and processing the text features and the text graph features after the dimension reduction processing by using a preset nonlinear classifier to obtain the text classification of the text to be classified, wherein the text classification comprises fault type classification.
8. An apparatus for text classification, comprising:
the embedded vector determining module is used for acquiring a text set to be classified and inputting the text set to be classified into a preset language model to obtain an embedded vector, wherein the text set to be classified comprises a plurality of texts to be classified, and the texts to be classified comprise electric power texts;
the text graph feature determining module is used for inputting the embedded vector and the text set to be classified into a preset text graph network to obtain the text graph features of each text to be classified, wherein there are no edge connections between the text graphs corresponding to the text graph features, and the text graph features include a plurality of word node features;
the text feature determining module is used for processing the text graph features and the embedded vector by using a preset bidirectional long-short term memory network to obtain text features;
and the text classification determining module is used for determining the text classification of the text to be classified according to the text features and the text graph features.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method of text classification of any of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a processor to perform the method of text classification of any one of claims 1-7 when executed.
CN202211358384.5A 2022-11-01 2022-11-01 Text classification method, device, equipment and storage medium Pending CN115640399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211358384.5A CN115640399A (en) 2022-11-01 2022-11-01 Text classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211358384.5A CN115640399A (en) 2022-11-01 2022-11-01 Text classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115640399A true CN115640399A (en) 2023-01-24

Family

ID=84947498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211358384.5A Pending CN115640399A (en) 2022-11-01 2022-11-01 Text classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115640399A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116662543A (en) * 2023-05-25 2023-08-29 上海蜜度信息技术有限公司 Text classification method, device, equipment and medium based on graph enhancement


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination