CN116127075A - Text classification method, device, equipment and storage medium - Google Patents

Text classification method, device, equipment and storage medium

Info

Publication number
CN116127075A
CN116127075A
Authority
CN
China
Prior art keywords
target
text
graph
word
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310180559.6A
Other languages
Chinese (zh)
Inventor
王雅晴
窦德景
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310180559.6A
Publication of CN116127075A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a text classification method, device, equipment and storage medium, relating to the technical field of artificial intelligence and in particular to fields such as natural language processing and deep learning. The specific implementation scheme is as follows: obtain the target text to be classified, and encode the representation of the target text according to the graph embedded features, in a word graph, of the target words contained in the target text. Determine the connection relationship between the target text and at least one sample text in a text graph according to the characterization similarity between them, and extract the graph embedded feature of the target text according to that connection relationship. Classify the target text according to the graph embedded feature of the target text in the text graph. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding of the target text, which enriches the representation of the target text and correspondingly improves the accuracy of classification based on that representation.

Description

Text classification method, device, equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to fields such as natural language processing and deep learning; it can be applied to application scenarios such as short text classification, semantic analysis and intention recognition, and particularly relates to a text classification method, device, equipment and storage medium.
Background
Text classification (Text Classification) is a basic task in many application scenarios such as semantic analysis and intent recognition. In a semantic analysis scenario, the classification categories may be different semantics; similarly, in an intent recognition scenario, the categories may be different intents.
For short text classification tasks, the limited text length means short texts lack context information and strict grammatical structure, so they are difficult to understand and classification accuracy cannot be guaranteed.
Disclosure of Invention
The disclosure provides a text classification method, a text classification device, text classification equipment and a storage medium.
According to a first aspect of the present disclosure, there is provided a text classification method, comprising:
obtaining a target text to be classified, wherein the target text contains at least one target word in a word graph;
encoding the representation of the target text according to the graph embedded features of the target words in the word graph;
determining a connection relationship between the target text and at least one sample text in a text graph according to the characterization similarity between the target text and the at least one sample text;
extracting the graph embedded feature of the target text according to the connection relationship between the target text and the at least one sample text in the text graph;
classifying the target text according to the graph embedded feature of the target text in the text graph.
According to a second aspect of the present disclosure, there is provided a model training method comprising:
obtaining any target sample text from a sample set, wherein the target sample text comprises at least one target word in a word graph;
according to the embedded features of the target words in the word graph, coding to obtain the representation of the target sample text;
determining the connection relationship between the target sample text and the remaining sample texts in a text graph according to the characterization similarity between the target sample text and the remaining sample texts in the sample set;
extracting and obtaining graph embedding characteristics of the target sample text according to the connection relation in the text graph;
classifying the target sample text by using a classifier according to the graph embedded characteristics of the target sample text to obtain a prediction category;
and adjusting model parameters of the classifier according to the difference between the predicted category and the expected category of the target sample text.
According to a third aspect of the present disclosure, there is provided a text classification apparatus comprising:
The first acquisition module is used for acquiring target texts to be classified, wherein the target texts comprise at least one target word in a word graph;
the first coding module is used for obtaining the representation of the target text by coding according to the graph embedded feature of the target word in the word graph;
the first determining module is used for determining the connection relation between the target text and at least one sample in the text graph according to the characterization similarity between the target text and the at least one sample;
the first extraction module is used for extracting graph embedding characteristics of the target text according to the connection relation between the target text and the at least one sample in the text graph;
and the first classification module is used for classifying the target text according to the graph embedded features of the target text in the text graph.
According to a fourth aspect of the present disclosure, there is provided a model training apparatus comprising:
the second processing module is used for acquiring any target sample text from the sample set, wherein the target sample text comprises at least one target word in a word graph;
the second coding module is used for obtaining the representation of the target sample text by coding according to the graph embedded feature of the target word in the word graph;
The second determining module is used for determining the connection relation between the target sample text and the rest sample texts in the text graph according to the characterization similarity between the target sample text and the rest sample texts in the sample set;
the second extraction module is used for extracting and obtaining the graph embedding characteristics of the target sample text according to the connection relation in the text graph;
the second classification module is used for classifying the target sample text by using a classifier according to the graph embedded characteristics of the target sample text to obtain a prediction class;
and the first training module is used for adjusting model parameters of the classifier according to the difference between the predicted category and the expected category of the target sample text.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect or the method of the second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the first aspect, or the method of the second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect, or the method according to the second aspect.
According to the text classification method, device, equipment and storage medium of the disclosure, the target text to be classified is obtained, wherein the target text contains at least one target word in a word graph, and the representation of the target text is encoded according to the graph embedded features of the target words in the word graph. The connection relationship between the target text and at least one sample text in the text graph is then determined according to the characterization similarity between them, and the graph embedded feature of the target text is extracted according to that connection relationship. The target text is classified according to its graph embedded feature in the text graph. In the embodiment of the disclosure, the word graph serves as an external corpus, and the graph embedded feature extracted for a target word from the word graph indicates the semantic information obtained by interpreting the target word in the external corpus. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding of the target text, which enriches the representation of the target text and correspondingly improves the accuracy of classification based on that representation.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a text classification method according to an embodiment of the disclosure;
FIG. 2 is a flow chart of another text classification method according to an embodiment of the disclosure;
FIG. 3 is a flow chart of another text classification method according to an embodiment of the disclosure;
FIG. 4 is a schematic flow chart of a model training method according to an embodiment of the disclosure;
FIG. 5 is a flow chart of another model training method provided in an embodiment of the present disclosure;
FIG. 6 is a flow chart of another model training method provided in an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a text classification device 70 according to an embodiment of the disclosure;
fig. 8 is a schematic structural diagram of a model training device 80 according to an embodiment of the disclosure;
fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Short text classification (Short Text Classification) is a basic task in many application scenarios such as semantic analysis and intent recognition. Limited by text length, short texts lack context information and strict grammatical structure, making them difficult to understand. It is therefore necessary to enrich short texts by incorporating various auxiliary information.
In the embodiment of the disclosure, the word graph is used as an external corpus, and the graph embedded features extracted from the target word in the word graph indicate semantic information obtained by understanding the target word in the external corpus. Based on the graph embedded feature of the target word, the target text is encoded, so that the word graph serving as the external corpus participates in the encoding of the target text, the representation of the target text is enriched, and the classification accuracy according to the representation is correspondingly improved.
Fig. 1 is a flow chart of a text classification method according to an embodiment of the disclosure, as shown in fig. 1, including:
step 101, obtaining target text to be classified, wherein the target text comprises at least one target word in a word graph.
The word graph is built from high-frequency words appearing in a large corpus. Nodes in the word graph indicate words, and edges between nodes indicate co-occurrence data of the words, which can also be read as their semantic relatedness.
Optionally, the target text to be classified may be a short text. Word segmentation is performed on the target text to obtain the words it contains, some of which may be words in the word graph. The embodiment of the disclosure discusses the target words that are contained in the target text and belong to the word graph. Words in the target text that are not in the word graph may or may not participate in the subsequent encoding of the target text; this embodiment does not limit this.
And 102, embedding features according to the graph of the target word in the word graph, and encoding to obtain the representation of the target text.
The graph embedded feature is obtained by extracting graph features from the target word and the edges connected to it in the word graph. As one possible implementation, each word in the word graph is one-hot encoded as its representation in the word graph; the representations of the target word's neighbor words are passed along the edges connected to the target word, and the passed representations are aggregated with the target word's own representation to obtain the graph embedded feature of the target word.
It should be noted that the graph embedded feature in this embodiment may be extracted with a graph neural network of two or more layers; the number of layers is not limited in this embodiment.
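As an illustration of this extraction, a minimal sketch follows, assuming a simple two-layer propagation rule of the form ReLU(A X W) with A = C + I; the dimensions, the random initialization, and the helper name word_graph_embeddings are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def word_graph_embeddings(C, hidden_dim=64, out_dim=32, seed=0):
    """Two-layer propagation over the word graph: each word starts from a
    one-hot representation, receives its neighbors' representations along
    the co-occurrence edges, and aggregates them with its own."""
    num_words = C.shape[0]
    rng = np.random.default_rng(seed)
    X = np.eye(num_words)            # one-hot representation of every word
    A = C + np.eye(num_words)        # self-loop keeps the word's own representation
    W1 = rng.normal(scale=0.1, size=(num_words, hidden_dim))  # model parameters,
    W2 = rng.normal(scale=0.1, size=(hidden_dim, out_dim))    # learned in training
    H = relu(A @ X @ W1)             # layer 1: pass and aggregate neighbor features
    E = relu(A @ H @ W2)             # layer 2: second round of propagation
    return E                         # row i is the graph-embedded feature of word i
```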
Alternatively, when there are multiple target words, the graph embedded features of the multiple target words may be fused to serve as the representation of the target text.
And step 103, determining the connection relation between the target text and the at least one sample in the text graph according to the characterization similarity between the target text and the at least one sample.
The characterization similarity may be measured with cosine similarity, or with other ways of determining similarity, which this embodiment does not limit.
The text graph includes a plurality of nodes and the edges connecting them. The nodes in the text graph indicate texts, specifically the target text and the sample texts: in this embodiment, the text graph may include a node indicating the target text as well as at least one node indicating a sample text. An edge indicates the characterization similarity between the texts indicated by the nodes it connects.
As one possible implementation, the characterization similarity between the target text and any sample text is used to determine whether a connected edge exists between them: an edge exists between the target text and the sample text only if the similarity is greater than a threshold.
As another possible implementation, the characterization similarity between the target text and any sample text is used to determine the weight of the edge between them. The weight affects representation passing: the larger the weight, the greater the contribution of the sample text's representation passed to the target text.
It should be noted that the foregoing two implementations may be executed independently or may be executed in combination.
In the case of combined execution, an edge exists between the target text and a sample text only if their characterization similarity is greater than the threshold, and that similarity also determines the weight of the edge.
And 104, extracting the graph embedding characteristics of the target text according to the connection relation between the target text and at least one sample text in the text graph.
Optionally, a graph neural network of two or more layers performs feature extraction based on the connection relationship between the target text and the at least one sample text in the text graph, obtaining the graph embedded feature of the target text in the text graph, so that the target text can be classified accordingly.
And 105, classifying the target text according to the graph embedded features of the target text in the text graph.
Optionally, since the text graph is built according to the characterization similarity between texts, the graph embedded feature of the target text in the text graph expresses its similarity relationships with other texts. That is, the graph embedded feature of the target text carries the features of other, similar texts, so the target text can be classified more accurately.
In the embodiment of the disclosure, the target text to be classified is obtained, wherein the target text contains at least one target word in a word graph, and the representation of the target text is encoded according to the graph embedded features of the target words in the word graph. The connection relationship between the target text and at least one sample text in the text graph is then determined according to the characterization similarity between them, and the graph embedded feature of the target text is extracted according to that connection relationship. The target text is classified according to its graph embedded feature in the text graph. In the embodiment of the disclosure, the word graph serves as an external corpus, and the graph embedded feature extracted for a target word from the word graph indicates the semantic information obtained by interpreting the target word in the external corpus. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding of the target text, which enriches the representation of the target text and correspondingly improves the accuracy of classification based on that representation.
Fig. 2 is a flow chart of another text classification method according to an embodiment of the disclosure, as shown in fig. 2, including:
step 201, obtaining target text to be classified, wherein the target text contains at least one target word in a word graph.
The word graph comprises a plurality of nodes and edges for connecting the nodes, wherein the nodes in the word graph indicate words, and the edges indicate co-occurrence data among the words.
Alternatively, words in the word graph may be one-hot encoded to obtain a vector x_m for each word v_m; the one-hot vectors are stacked into a matrix X. Co-occurrence data between different words is determined with pointwise mutual information (PMI). For example, the co-occurrence data between different words in the word graph is:

[C]_mn = max(PMI(v_m, v_n), 0)

where m and n are the indices of any two words v_m and v_n, and PMI(v_m, v_n) is the pointwise mutual information between v_m and v_n.
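A minimal sketch of computing this co-occurrence matrix from sliding windows over a tokenized corpus follows; the window size, the windowing scheme, and the function name pmi_cooccurrence are assumptions, since the disclosure only fixes the max(PMI, 0) form.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def pmi_cooccurrence(tokenized_docs, vocab, window=5):
    """[C]_mn = max(PMI(v_m, v_n), 0), with probabilities estimated from
    sliding-window co-occurrence counts."""
    idx = {w: i for i, w in enumerate(vocab)}
    word_cnt, pair_cnt, n_win = Counter(), Counter(), 0
    for doc in tokenized_docs:
        toks = [t for t in doc if t in idx]
        if not toks:
            continue
        for s in range(max(1, len(toks) - window + 1)):
            win = sorted(set(toks[s:s + window]))
            n_win += 1
            word_cnt.update(win)                  # each word counted once per window
            pair_cnt.update(combinations(win, 2)) # each unordered pair once per window
    C = np.zeros((len(vocab), len(vocab)))
    for (a, b), c in pair_cnt.items():
        # PMI = log p(a, b) / (p(a) p(b)), with window-based estimates
        pmi = np.log(c * n_win / (word_cnt[a] * word_cnt[b]))
        C[idx[a], idx[b]] = C[idx[b], idx[a]] = max(pmi, 0.0)
    return C
```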
The target text to be classified may be a short text. Word segmentation is performed on the target text to obtain the words it contains, some of which may be words in the word graph. The embodiment of the disclosure discusses the target words that are contained in the target text and belong to the word graph; words of the target text that are not in the word graph may not participate in the subsequent encoding. The graph embedded features of the target words in the word graph may be denoted E and determined, for example, with a two-layer propagation of the form:

E = ReLU((C + I) ReLU((C + I) X W_1) W_2)

where C is the co-occurrence matrix [C]_mn, I is the identity matrix, X is the matrix of one-hot word vectors, W_1 and W_2 are model parameters, and ReLU is the activation function (Rectified Linear Unit).
Step 202, feature fusion is carried out on the graph embedded features of the target word in the word graph and word embedded features obtained by carrying out semantic feature extraction on the target word by adopting a pre-training model, so as to obtain fusion features of the target word.
Optionally, the graph embedded feature E of the target word in the word graph is fused with the word embedded feature obtained by extracting semantic features of the target word with a pre-training model, giving the fusion feature of the target word. The pre-training model has already learned semantic features through semantic training; fusing the graph embedded feature with the word embedded feature that carries these semantics further enriches the information in the fusion feature and improves the quality of the subsequent classification task.
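The disclosure does not fix the fusion operator; as one hedged sketch, the two features could simply be concatenated (the function name and the choice of concatenation are assumptions, and element-wise addition or a learned projection would also fit).

```python
import numpy as np

def fuse_word_features(graph_emb, word_emb):
    """Fuse the word's graph-embedded feature E with the word-embedded
    feature from a pre-training model; concatenation is an assumption."""
    return np.concatenate([np.asarray(graph_emb), np.asarray(word_emb)], axis=-1)
```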
And 203, coding to obtain the representation of the target text according to the fusion characteristics of the target words in the target text.
Optionally, when there are multiple target words, the fusion weight of each target word is determined according to its term frequency-inverse document frequency, and the fusion features of the target words are weighted and fused according to these weights to obtain the representation of the target text.

For example, the representation z_l of text l can be obtained with the following formula:

z_l = sum_i S_(l,i) e_i

where e_i is the fusion feature of target word i, and S_(l,i) is the term frequency-inverse document frequency (TF-IDF) weight of word i in text l, which helps highlight the weight of semantically rich words. The subscript i is the index of the target word and the subscript l is the index of the text.
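A small sketch of this TF-IDF-weighted fusion follows; the normalization by the total weight is an added assumption that keeps the scale independent of text length.

```python
import numpy as np

def text_representation(fused_feats, tfidf_weights):
    """z_l = sum_i S_(l,i) e_i: weight each target word's fusion feature
    by its TF-IDF score and sum."""
    S = np.asarray(tfidf_weights, dtype=float)[:, None]   # (num_words, 1)
    E = np.asarray(fused_feats, dtype=float)              # (num_words, dim)
    return (S * E).sum(axis=0) / (S.sum() + 1e-12)        # normalization is an assumption
```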
Step 204, determining the connection relation between the target text and at least one sample in the text graph according to the characterization similarity between the target text and the at least one sample.
The following formula is used to determine the connection feature matrix [A]_ij:

[A]_ij = ReLU(cos(z_i, z_j) − δ)

where [A]_ij is the connection-relationship feature matrix between any two texts i and j, describing the connection relationship between two texts in the text graph, and δ ≥ 0 is a threshold used to prune edges between unrelated texts: only when cos(z_i, z_j) is greater than δ does a connection exist between the two texts in the text graph, which reduces unnecessary connections in the text graph and the amount of computation. ReLU is the activation function (Rectified Linear Unit).
Optionally, the target text is given the subscript l = i and a sample text the subscript l = j. The representation of the target text is denoted z_i and the representation of the sample text z_j, and the characterization similarity between the two is expressed with the cosine similarity cos(z_i, z_j). The connection relationship between the target text and the sample texts can then be written as a_i, i.e., the row vector of the matrix [A]_ij corresponding to the row of the target text i.
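Under the reconstruction above, the whole matrix [A]_ij can be sketched as follows; the threshold value δ = 0.5 is an illustrative assumption, and row i of the result is the vector a_i describing the target text's connections.

```python
import numpy as np

def connection_matrix(Z, delta=0.5):
    """[A]_ij = ReLU(cos(z_i, z_j) - delta): an edge exists in the text
    graph only where the cosine similarity exceeds the pruning threshold."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    cos = (Z @ Z.T) / (norms @ norms.T + 1e-12)  # pairwise cosine similarities
    return np.maximum(cos - delta, 0.0)          # prune edges between unrelated texts
```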
And step 205, extracting the graph embedded feature of the target text according to the connection relation between the target text and the at least one sample text in the text graph.
Alternatively, the graph embedded feature h_i of the target text may be determined with a formula of the form:

h_i = ReLU(a_i ReLU(A Z W_3) W_4)

where Z, called the feature matrix of the target text, stacks the representation z_i of the target text with X, the representations of all sample texts other than the target text; A is [A]_ij as defined above; and W_3 and W_4 are model parameters that can be determined by training.
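A sketch matching this reconstructed two-layer propagation follows; W_3 and W_4 would be trained parameters, and the function name is illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def text_graph_embedding(A, Z, W3, W4):
    """Two rounds of propagation over the text graph; row i of the result
    is the graph-embedded feature h_i of the target text."""
    H = relu(A @ Z @ W3)     # pass representations along similarity edges
    return relu(A @ H @ W4)  # aggregate once more before classification
```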
And step 206, classifying the target text according to the graph embedded characteristics of the target text in the text graph.
Alternatively, a classification model may be used to classify the target text, and the class to which the target text belongs is determined based on the output of the classification model.
In the embodiment of the disclosure, the target text to be classified is obtained, wherein the target text contains at least one target word in a word graph, and the representation of the target text is encoded according to the graph embedded features of the target words in the word graph. The connection relationship between the target text and at least one sample text in the text graph is then determined according to the characterization similarity between them, and the graph embedded feature of the target text is extracted according to that connection relationship. The target text is classified according to its graph embedded feature in the text graph. In the embodiment of the disclosure, the word graph serves as an external corpus, and the graph embedded feature extracted for a target word from the word graph indicates the semantic information obtained by interpreting the target word in the external corpus. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding of the target text, which enriches the representation of the target text and correspondingly improves the accuracy of classification based on that representation.
Fig. 3 is a flow chart of another text classification method according to an embodiment of the disclosure, as shown in fig. 3, including:
step 301, word segmentation is performed on the corpus in the corpus set to obtain a plurality of candidate words, the candidate words with word frequency less than a set value are deleted from the plurality of candidate words, the candidate words belonging to the deactivated word set are deleted, and the reserved candidate words are added into the global pool.
In short text classification scenarios, a short text carries little semantic information, so classifying it using only the semantics it carries itself is error-prone. To improve the accuracy of short text classification, this embodiment builds a word graph from a large corpus, and the graph embedded feature extracted for a target word from the word graph indicates the semantic information obtained by interpreting the target word in this external corpus. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding of the target text, which enriches the representation of the target text and correspondingly improves the accuracy of classification based on that representation.
Optionally, some candidate words occur with low frequency and others are meaningless. To reduce the data size of the word graph, candidate words whose word frequency is less than the set value and candidate words belonging to the stop-word set are deleted, which improves the quality of the candidate words retained in the global pool.
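A minimal sketch of this filtering step, assuming the corpus is already tokenized and a stop-word set is given (min_freq stands in for the set value):

```python
from collections import Counter

def build_global_pool(tokenized_corpus, stop_words, min_freq=5):
    """Keep candidate words that are frequent enough and are not stop
    words (min_freq = 5 is an illustrative setting)."""
    freq = Counter(tok for doc in tokenized_corpus for tok in doc)
    return sorted(w for w, c in freq.items()
                  if c >= min_freq and w not in stop_words)
```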
Step 302, a word graph is built according to co-occurrence data between any two words in the global pool.
Alternatively, words in the word graph may be one-hot encoded to obtain a vector for each word. A pointwise mutual information (PMI) algorithm is used to determine co-occurrence data between different words. For example, the co-occurrence data between different words in the word graph is:

[C]_mn = max(PMI(v_m, v_n), 0)

where m and n are the indices of any two words v_m and v_n, and PMI(v_m, v_n) is the pointwise mutual information between v_m and v_n.
Each word in the global pool is taken as a node of the word graph, and edges between nodes are established according to the co-occurrence data between any two words in the global pool, thereby obtaining the word graph. A word graph established in this way shows the semantic relatedness between different words, which facilitates semantic understanding of the words in the target text to be classified.
Step 303, obtaining target text to be classified, wherein the target text comprises at least one target word in a word graph, and encoding to obtain the representation of the target text according to the graph embedded feature of the target word in the word graph.
The processing procedure of determining the representation of the target text based on the word graph can be referred to the related description in the foregoing embodiment, which is not repeated in this embodiment.
Step 304, determining the connection relation between the target text and at least one sample in the text graph according to the characterization similarity between the target text and the at least one sample.
As one possible implementation, the characterization similarity between the target text and the at least one sample text is determined based on the cosine similarity between the representation of the target text and the representation of the at least one sample text. From the at least one sample text, the associated sample texts whose characterization similarity is greater than a threshold are determined. In the text graph, a connection is determined to exist between the target text and the associated sample texts, and no connection between the target text and the other sample texts.
And 305, extracting graph embedding characteristics of the target text according to the connection relation between the target text and at least one sample text in the text graph, and classifying the target text according to the graph embedding characteristics of the target text in the text graph.
As one possible implementation, at least one associated sample text that has a connection with the target text is determined in the text graph. The representation of the at least one associated sample text is passed according to the characterization similarity between the target text and the at least one associated sample text, and characterization aggregation is performed based on the representations passed from the at least one associated sample text and/or the representation of the target text, obtaining the graph embedded feature of the target text. Representation passing and aggregation among related texts in the text graph further enrich the graph embedded feature of the target text, and the representations of similar texts it carries are also beneficial to classifying the target text.
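As a sketch of this passing-and-aggregation step under stated assumptions (a similarity-weighted mean is one possible aggregator; the disclosure leaves the exact aggregation open):

```python
import numpy as np

def aggregate_with_neighbors(z_target, Z_assoc, sims):
    """Pass each associated sample text's representation scaled by its
    similarity to the target, then aggregate with the target's own
    representation (similarity-weighted mean is an assumption)."""
    sims = np.asarray(sims, dtype=float)[:, None]
    passed = (sims * np.asarray(Z_assoc, dtype=float)).sum(axis=0)
    return (z_target + passed) / (1.0 + sims.sum())
```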
Further, the graph embedded feature of the target text in the text graph is input into a classifier, and the classification of the target text is determined based on the output of the classifier. Optionally, the classifier is trained with supervision or without supervision.
In the embodiment of the disclosure, the target text to be classified is obtained, wherein the target text contains at least one target word in a word graph, and the representation of the target text is encoded according to the graph embedded features of the target words in the word graph. The connection relationship between the target text and at least one sample text in the text graph is then determined according to the characterization similarity between them, and the graph embedded feature of the target text is extracted according to that connection relationship. The target text is classified according to its graph embedded feature in the text graph. In the embodiment of the disclosure, the word graph serves as an external corpus, and the graph embedded feature extracted for a target word from the word graph indicates the semantic information obtained by interpreting the target word in the external corpus. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding of the target text, which enriches the representation of the target text and correspondingly improves the accuracy of classification based on that representation.
Fig. 4 is a flow chart of a model training method according to an embodiment of the disclosure, as shown in fig. 4, including:
step 401, any target sample text is obtained from the sample set, wherein the target sample text contains at least one target word in the word graph.
The word graph is built from high-frequency words appearing in a large corpus. Nodes in the word graph indicate words, and edges between nodes indicate co-occurrence data of the words, which can also be read as their semantic relatedness.
And step 402, according to the graph embedded feature of the target word in the word graph, coding to obtain the representation of the target sample text.
Alternatively, in the case where there are a plurality of target words, the graph embedded features of the plurality of target words may be fused to serve as a representation of the target sample text.
Step 403, determining the connection relation between the target sample text and the rest sample text in the text graph according to the characterization similarity between the target sample text and the rest sample text in the sample set.
The characterization similarity may be measured with cosine similarity, or with other ways of determining similarity, which this embodiment does not limit.
The text graph includes a plurality of nodes and the edges connecting them. The nodes in the text graph indicate texts, specifically the target sample text and the remaining sample texts: in this embodiment, the text graph may include a node indicating the target sample text as well as nodes indicating the other sample texts. An edge indicates the characterization similarity between the texts indicated by the nodes it connects.
And step 404, extracting graph embedded features of the target sample text according to the connection relation in the text graph.
And step 405, classifying the target sample text by using a classifier according to the graph embedded features of the target sample text to obtain a prediction category.
And step 406, adjusting model parameters of the classifier according to the difference between the predicted category and the expected category of the target sample text.
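A minimal sketch of steps 405 and 406 follows, assuming PyTorch, a linear classifier, a 32-dimensional graph embedded feature and 4 classes; all of these specifics are illustrative.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(32, 4)          # illustrative classifier: 32-dim h_i -> 4 classes
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(h_i, expected_class):
    """h_i: graph-embedded feature of one sample text, tensor of shape (32,)."""
    logits = classifier(h_i.unsqueeze(0))                   # prediction category scores
    loss = loss_fn(logits, torch.tensor([expected_class]))  # difference from expected category
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                        # adjust the classifier's parameters
    return loss.item()
```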
In the embodiment of the disclosure, any target sample text is obtained from a sample set, wherein the target sample text contains at least one target word in a word graph. The representation of the target sample text is encoded according to the graph embedded features of the target words in the word graph. The connection relationship between the target sample text and the remaining sample texts in the text graph is determined according to the characterization similarity between them, and the graph embedded feature of the target sample text in the text graph is then extracted. The target sample text is classified with a classifier according to its graph embedded feature to obtain a prediction category, and the model parameters of the classifier are adjusted according to the difference between the prediction category and the expected category of the target sample text. In the embodiment of the disclosure, the word graph serves as an external corpus, and the graph embedded feature extracted for a target word from the word graph indicates the semantic information obtained by interpreting the target word in the external corpus. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding of the target text, which enriches the representation of the target text and correspondingly improves the accuracy of classification by the classifier.
Fig. 5 is a flow chart of another model training method according to an embodiment of the disclosure, as shown in fig. 5, including:
step 501, any target sample text is obtained from the sample set, wherein the target sample text contains at least one target word in the word graph.
The word graph is built from high-frequency words appearing in a large corpus. Nodes in the word graph indicate words, and edges between nodes indicate co-occurrence data of the words, which can also be read as their semantic relatedness.
Step 502, obtaining graph embedding characteristics of a target word in a word graph by adopting a first graph neural network.
In this embodiment, the number of layers of the first graph neural network is not limited, and may be two or more.
And step 503, feature fusion is carried out on the graph embedded features of the target words in the word graph and the word embedded features obtained by carrying out semantic feature extraction on the target words by adopting a pre-training model, so as to obtain fusion features of the target words.
And step 504, coding to obtain the representation of the target sample text according to the fusion characteristics of the target words in the target sample text.
Optionally, when there are multiple target words, the fusion weight of each target word is determined according to its term frequency-inverse document frequency (TF-IDF), and the fusion features of the target words are weighted and fused according to these fusion weights to obtain the representation of the target sample text.
And 505, determining the connection relation between the target sample text and the rest sample texts in the text graph according to the characterization similarity between the target sample text and the rest sample texts in the sample set.
And step 506, extracting graph embedding characteristics of the target sample text by using a second graph neural network according to the connection relation in the text graph.
Optionally, at least one associated sample text that has a connection with the target sample text is determined in the text graph. The representation of the at least one associated sample text is passed with the second graph neural network according to the characterization similarity between the target sample text and the at least one associated sample text, and characterization aggregation is performed based on the representations passed from the at least one associated sample text and/or the representation of the target sample text, obtaining the graph embedded feature of the target sample text.
And step 507, classifying the target sample text by using a classifier according to the graph embedded features of the target sample text to obtain a prediction category.
And step 508, adjusting model parameters of the classifier, the first graph neural network and the second graph neural network according to the difference between the predicted category and the expected category of the target sample text.
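Step 508 updates the classifier and both graph neural networks from one loss; a hedged sketch of wiring a single optimizer over all three follows (the module layout, layer sizes and the GCNLayer helper are assumptions).

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One propagation layer of the form ReLU(A @ X @ W)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out, bias=False)

    def forward(self, A, X):
        return torch.relu(A @ self.lin(X))

# illustrative dimensions and module layout
word_gnn = nn.ModuleList([GCNLayer(100, 64), GCNLayer(64, 32)])  # first graph neural network
text_gnn = nn.ModuleList([GCNLayer(32, 32), GCNLayer(32, 32)])   # second graph neural network
classifier = nn.Linear(32, 4)

# one optimizer over all three modules, so the gradient of the
# prediction/expected-category loss adjusts them jointly
optimizer = torch.optim.Adam(
    [*word_gnn.parameters(), *text_gnn.parameters(), *classifier.parameters()],
    lr=1e-3)
```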
In the embodiment of the disclosure, any target sample text is obtained from a sample set, wherein the target sample text contains at least one target word in a word graph. The representation of the target sample text is encoded according to the graph embedded features of the target words in the word graph. The connection relationship between the target sample text and the remaining sample texts in the text graph is determined according to the characterization similarity between them, and the graph embedded feature of the target sample text in the text graph is then extracted. The target sample text is classified with a classifier according to its graph embedded feature to obtain a prediction category, and the model parameters of the classifier are adjusted according to the difference between the prediction category and the expected category of the target sample text. In the embodiment of the disclosure, the word graph serves as an external corpus, and the graph embedded feature extracted for a target word from the word graph indicates the semantic information obtained by interpreting the target word in the external corpus. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding of the target text, which enriches the representation of the target text and correspondingly improves the accuracy of classification by the classifier.
Fig. 6 is a flow chart of another model training method according to an embodiment of the disclosure, as shown in fig. 6, including:
step 601, word segmentation is performed on the corpus in the corpus set to obtain a plurality of candidate words.
In short text classification scenarios, a short text carries little semantic information, so classifying it using only the semantics it carries itself is error-prone. To improve the accuracy of short text classification, this embodiment builds a word graph from a large corpus, and the graph embedded feature extracted for a target word from the word graph indicates the semantic information obtained by interpreting the target word in this external corpus. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding, which enriches the representation and correspondingly improves the accuracy of classification based on that representation.
Step 602, deleting candidate words whose word frequency is less than a set value from the plurality of candidate words, and deleting candidate words belonging to the stop-word set.
Optionally, some candidate words occur with low frequency and others are meaningless. To reduce the data size of the word graph, candidate words whose word frequency is less than the set value and candidate words belonging to the stop-word set are deleted, which improves the quality of the candidate words retained in the global pool.
Step 603, adding the reserved candidate words to the global pool.
Step 604, a word graph is built according to co-occurrence data between any two words in the global pool.
Each word in the global pool is taken as a node of the word graph, and edges between nodes are established according to the co-occurrence data between any two words in the global pool, thereby obtaining the word graph. A word graph established in this way shows the semantic relatedness between different words.
Step 605, any target sample text is obtained from the sample set, wherein the target sample text contains at least one target word in the word graph.
And step 606, according to the graph embedded feature of the target word in the word graph, obtaining the representation of the target sample text by adopting the first graph neural network coding.
Step 607, determining the connection relationship between the target sample text and the remaining sample texts in the text graph according to the characterization similarity between the target sample text and the remaining sample texts in the sample set.
And 608, extracting graph embedding characteristics of the target sample text by using a second graph neural network according to the connection relation in the text graph.
And step 609, classifying the target sample text by using a classifier according to the graph embedded characteristics of the target sample text, and obtaining a prediction category.
And step 610, adjusting model parameters of a classifier, the first graph neural network and the second graph neural network according to the difference between the predicted category and the expected category of the target sample text.
In the embodiment of the disclosure, any target sample text is obtained from a sample set, wherein the target sample text contains at least one target word in a word graph. The representation of the target sample text is encoded according to the graph embedded features of the target words in the word graph. The connection relationship between the target sample text and the remaining sample texts in the text graph is determined according to the characterization similarity between them, and the graph embedded feature of the target sample text in the text graph is then extracted. The target sample text is classified with a classifier according to its graph embedded feature to obtain a prediction category, and the model parameters of the classifier are adjusted according to the difference between the prediction category and the expected category of the target sample text. In the embodiment of the disclosure, the word graph serves as an external corpus, and the graph embedded feature extracted for a target word from the word graph indicates the semantic information obtained by interpreting the target word in the external corpus. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding of the target text, which enriches the representation of the target text and correspondingly improves the accuracy of classification by the classifier.
Fig. 7 is a schematic structural diagram of a text classification device 70 according to an embodiment of the disclosure, as shown in fig. 7, including: a first acquisition module 71, a first encoding module 72, a first determination module 73, a first extraction module 74 and a first classification module 75.
The first obtaining module 71 is configured to obtain a target text to be classified, where the target text includes at least one target word in a word graph.
A first encoding module 72, configured to encode and obtain a representation of the target text according to the graph embedded feature of the target word in the word graph.
A first determining module 73, configured to determine, according to the characterization similarity between the target text and the at least one sample text, a connection relationship between the target text and the at least one sample text in the text map.
A first extraction module 74, configured to extract a graph embedded feature of the target text according to a connection relationship between the target text and the at least one sample text in the text graph.
A first classification module 75, configured to classify the target text according to the graph embedded feature of the target text in the text graph.
In one possible implementation of the embodiment of the present disclosure, the first encoding module 72 is configured to: perform feature fusion on the graph embedded features of the target words in the word graph and the word embedded features obtained by extracting semantic features of the target words with a pre-training model, to obtain the fusion features of the target words; and encode the representation of the target text according to the fusion features of the target words in the target text.
In one possible implementation of the embodiment of the present disclosure, the first encoding module 72 is configured to: when there are multiple target words, determine the fusion weight of each target word according to its term frequency-inverse document frequency (TF-IDF); and weight and fuse the fusion features of the target words according to these fusion weights, to obtain the representation of the target text.
In one possible implementation of the embodiment of the disclosure, the first determining module 73 is configured to: determine the characterization similarity between the target text and the at least one sample text based on the cosine similarity between the representation of the target text and the representation of the at least one sample text; determine, from the at least one sample text, the associated sample texts whose characterization similarity is greater than a threshold; and determine, in the text graph, that a connection exists between the target text and the associated sample texts and that no connection exists between the target text and the other sample texts.
In one possible implementation of the embodiment of the disclosure, the first extraction module 74 is configured to: determine, in the text graph, at least one associated sample text that has a connection with the target text; pass the representation of the at least one associated sample text according to the characterization similarity between the target text and the at least one associated sample text; and perform characterization aggregation based on the representations passed from the at least one associated sample text and/or the representation of the target text, to obtain the graph embedded feature of the target text.
In one possible implementation of the embodiment of the disclosure, the first classification module 75 is configured to: input the graph embedded feature of the target text in the text graph into a classifier, to determine the classification of the target text based on the output of the classifier.
In one possible implementation of the embodiment of the disclosure, the text classification device 70 further includes a first generating module configured to: perform word segmentation on the corpus in the corpus set to obtain a plurality of candidate words; delete candidate words whose word frequency is less than a set value from the plurality of candidate words, and delete candidate words belonging to the stop-word set; add the retained candidate words to a global pool; and establish the word graph according to the co-occurrence data between any two words in the global pool.
In the embodiment of the disclosure, the target text to be classified is obtained, wherein the target text contains at least one target word in a word graph, and the representation of the target text is encoded according to the graph embedded features of the target words in the word graph. The connection relationship between the target text and at least one sample text in the text graph is then determined according to the characterization similarity between them, and the graph embedded feature of the target text is extracted according to that connection relationship. The target text is classified according to its graph embedded feature in the text graph. In the embodiment of the disclosure, the word graph serves as an external corpus, and the graph embedded feature extracted for a target word from the word graph indicates the semantic information obtained by interpreting the target word in the external corpus. Because the target text is encoded based on the graph embedded features of the target words, the word graph participates as an external corpus in the encoding of the target text, which enriches the representation of the target text and correspondingly improves the accuracy of classification based on that representation.
Fig. 8 is a schematic structural diagram of a model training device 80 according to an embodiment of the disclosure, as shown in fig. 8, including: a second processing module 81, a second encoding module 82, a second determination module 83, a second extraction module 84, a second classification module 85 and a first training module 86.
A second processing module 81, configured to obtain any target sample text from the sample set, where the target sample text includes at least one target word in the word graph.
A second encoding module 82, configured to encode and obtain a representation of the target sample text according to the graph embedded feature of the target word in the word graph.
A second determining module 83, configured to determine a connection relationship between the target sample text and the remaining sample texts in the text graph according to the representation similarity between the target sample text and the remaining sample texts in the sample set.
A second extraction module 84, configured to extract the graph embedded feature of the target sample text according to the connection relationship in the text graph.
A second classification module 85, configured to classify the target sample text by using a classifier according to the graph embedded feature of the target sample text, so as to obtain a predicted category.
A first training module 86, configured to adjust model parameters of the classifier according to the difference between the predicted category and the expected category of the target sample text.
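A sketch of one training update consistent with the modules above, assuming cross-entropy as the measure of the difference between the predicted and expected categories (the loss choice and all names are assumptions of this sketch):

    import torch
    import torch.nn.functional as F

    def train_step(classifier, optimizer, graph_feature, expected_category):
        # Predict a category from the graph embedded feature of the target
        # sample text, then adjust the classifier's parameters from the
        # difference between the predicted and expected categories.
        logits = classifier(graph_feature.unsqueeze(0))
        loss = F.cross_entropy(logits, expected_category.view(1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()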
In one possible implementation of the embodiment of the disclosure, the second encoding module 82 is configured to: acquire the graph embedded feature of the target word in the word graph by using a first graph neural network; perform feature fusion on the graph embedded feature of the target word in the word graph and the semantic feature extracted from the target word by a pre-training model, to obtain a fusion feature of the target word; and encode and obtain the representation of the target sample text according to the fusion features of the target words in the target sample text.
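As a sketch of the fusion step, assuming concatenation followed by a linear projection as the fusion operator (the disclosure does not fix the operator; the class name WordFeatureFusion and the dimensions are hypothetical):

    import torch
    import torch.nn as nn

    class WordFeatureFusion(nn.Module):
        # Fuses a word's graph embedded feature (from the first graph neural
        # network over the word graph) with the semantic feature extracted by
        # the pre-training model.
        def __init__(self, graph_dim: int, semantic_dim: int, out_dim: int):
            super().__init__()
            self.proj = nn.Linear(graph_dim + semantic_dim, out_dim)

        def forward(self, graph_feat: torch.Tensor, semantic_feat: torch.Tensor):
            return self.proj(torch.cat([graph_feat, semantic_feat], dim=-1))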
In one possible implementation of the embodiment of the disclosure, the model training device 80 further includes a second training module configured to adjust model parameters of the first graph neural network according to the difference between the predicted category and the expected category of the target sample text.
In one possible implementation of the embodiment of the disclosure, the second encoding module 82 is configured to: in the case that there are a plurality of target words, determine the fusion weight of each target word according to its term frequency-inverse document frequency (TF-IDF); and perform weighted fusion on the fusion features of the plurality of target words according to their fusion weights, so as to obtain the representation of the target sample text.
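A sketch of the TF-IDF weighted fusion, assuming the common log-form IDF and sum-normalized weights (both are assumptions; the disclosure only requires TF-IDF-derived fusion weights):

    import math
    import torch

    def text_representation(fused_feats, term_freqs, doc_freqs, num_docs):
        # One TF-IDF score per target word in the target sample text.
        tfidf = torch.tensor([
            tf * math.log(num_docs / (1 + df))
            for tf, df in zip(term_freqs, doc_freqs)
        ])
        weights = tfidf / tfidf.sum()  # fusion weights
        # Weighted fusion of the words' fusion features.
        return (torch.stack(fused_feats) * weights.unsqueeze(-1)).sum(dim=0)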
In one possible implementation of the embodiment of the disclosure, the second extraction module 84 is configured to: determine, in the text graph, at least one associated sample text that has a connection with the target sample text; and, by using a second graph neural network, pass the representation of the at least one associated sample text according to the representation similarity between the target sample text and the at least one associated sample text, and perform representation aggregation on the passed representations and/or the representation of the target sample text, so as to obtain the graph embedded feature of the target sample text.
In one possible implementation of the embodiment of the disclosure, the model training device 80 further includes a third training module configured to adjust model parameters of the second graph neural network according to the difference between the predicted category and the expected category of the target sample text.
In one possible implementation of the embodiment of the disclosure, the model training device 80 further includes a second generating module configured to: perform word segmentation on the corpora in the corpus set to obtain a plurality of candidate words; delete, from the plurality of candidate words, candidate words whose word frequency is less than a set value and candidate words belonging to a stop word set; add the retained candidate words to a global pool; and establish the word graph according to co-occurrence data between any two words in the global pool.
In the embodiment of the disclosure, any target sample text is obtained from the sample set, where the target sample text contains at least one target word in the word graph, and the representation of the target sample text is obtained by encoding according to the graph embedded feature of the target word in the word graph. According to the representation similarity between the target sample text and the remaining sample texts in the sample set, the connection relationship between the target sample text and the remaining sample texts in the text graph is determined, and the graph embedded feature of the target sample text is then extracted from the text graph. The target sample text is classified by the classifier according to its graph embedded feature to obtain the predicted category, and the model parameters of the classifier are adjusted according to the difference between the predicted category and the expected category of the target sample text. In the embodiment of the disclosure, the word graph serves as an external corpus, and the graph embedded feature extracted for the target word from the word graph indicates the semantic information obtained by understanding the target word in that external corpus. Encoding the sample text based on the graph embedded feature of the target word therefore lets the word graph, as the external corpus, participate in the encoding, enriches the representation of the sample text, and correspondingly improves the classification accuracy of the trained classifier.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 902 or a computer program loaded from a storage unit 908 into a RAM (Random Access Memory) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An I/O (Input/Output) interface 905 is also connected to the bus 904.
Various components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as the text classification method and/or the model training method. For example, in some embodiments, the text classification method and/or the model training method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the text classification method and/or the model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, an FPGA (Field Programmable Gate Array), an ASIC (Application-Specific Integrated Circuit), an ASSP (Application Specific Standard Product), an SOC (System On Chip), a CPLD (Complex Programmable Logic Device), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, RAM, ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that remedies the defects of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be noted that artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware and software levels. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
It should be appreciated that steps may be reordered, added, or deleted in the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (31)

1. A text classification method, comprising:
obtaining a target text to be classified, wherein the target text contains at least one target word in a word graph;
according to the graph embedded feature of the target word in the word graph, encoding to obtain a representation of the target text;
determining a connection relationship between the target text and at least one sample text in a text graph according to a representation similarity between the target text and the at least one sample text;
extracting a graph embedded feature of the target text according to the connection relationship between the target text and the at least one sample text in the text graph;
classifying the target text according to the graph embedded feature of the target text in the text graph.
2. The method of claim 1, wherein the encoding to obtain the representation of the target text according to the graph embedded feature of the target word in the word graph comprises:
performing feature fusion on the graph embedded feature of the target word in the word graph and a semantic feature extracted from the target word by a pre-training model, to obtain a fusion feature of the target word;
and encoding to obtain the representation of the target text according to the fusion feature of the target word in the target text.
3. The method of claim 2, wherein the encoding to obtain the representation of the target text according to the fusion feature of the target word in the target text comprises:
in a case that there are a plurality of target words, determining a fusion weight of each target word according to a term frequency-inverse document frequency of the target word;
and performing weighted fusion on the fusion features of the plurality of target words according to their fusion weights, so as to obtain the representation of the target text.
4. The method according to any one of claims 1-3, wherein the determining the connection relationship between the target text and the at least one sample text in the text graph according to the representation similarity between the target text and the at least one sample text comprises:
determining the representation similarity between the target text and the at least one sample text according to a cosine similarity between the representation of the target text and the representation of the at least one sample text;
determining, from among the at least one sample text, an associated sample text whose representation similarity is greater than a threshold;
and determining, in the text graph, that a connection exists between the target text and the associated sample text, and that no connection exists between the target text and sample texts, other than the associated sample text, of the at least one sample text.
5. The method according to any one of claims 1-3, wherein the extracting the graph embedded feature of the target text according to the connection relationship between the target text and the at least one sample text in the text graph comprises:
determining, in the text graph, at least one associated sample text that has a connection with the target text;
passing the representation of the at least one associated sample text according to the representation similarity between the target text and the at least one associated sample text;
and performing representation aggregation on the representations passed from the at least one associated sample text and/or the representation of the target text, to obtain the graph embedded feature of the target text.
6. The method according to any one of claims 1-3, wherein the classifying the target text according to the graph embedded feature of the target text in the text graph comprises:
inputting the graph embedded feature of the target text in the text graph into a classifier, so as to determine the category of the target text based on an output of the classifier.
7. The method according to any one of claims 1-3, wherein the method further comprises:
performing word segmentation on corpora in a corpus set to obtain a plurality of candidate words;
deleting, from the plurality of candidate words, candidate words whose word frequency is less than a set value and candidate words belonging to a stop word set;
adding the retained candidate words to a global pool;
and establishing the word graph according to co-occurrence data between any two words in the global pool.
8. A model training method, comprising:
obtaining any target sample text from a sample set, wherein the target sample text comprises at least one target word in a word graph;
according to the graph embedded feature of the target word in the word graph, encoding to obtain a representation of the target sample text;
determining a connection relationship between the target sample text and the remaining sample texts in a text graph according to a representation similarity between the target sample text and the remaining sample texts in the sample set;
extracting a graph embedded feature of the target sample text according to the connection relationship in the text graph;
classifying the target sample text by using a classifier according to the graph embedded feature of the target sample text, to obtain a predicted category;
and adjusting model parameters of the classifier according to the difference between the predicted category and the expected category of the target sample text.
9. The method of claim 8, wherein the encoding to obtain the representation of the target sample text according to the graph embedded feature of the target word in the word graph comprises:
acquiring the graph embedded feature of the target word in the word graph by using a first graph neural network;
performing feature fusion on the graph embedded feature of the target word in the word graph and a semantic feature extracted from the target word by a pre-training model, to obtain a fusion feature of the target word;
and encoding to obtain the representation of the target sample text according to the fusion feature of the target word in the target sample text.
10. The method of claim 9, wherein the method further comprises:
adjusting model parameters of the first graph neural network according to the difference between the predicted category and the expected category of the target sample text.
11. The method of claim 9, wherein the encoding to obtain the representation of the target sample text according to the fusion feature of the target word in the target sample text comprises:
in a case that there are a plurality of target words, determining a fusion weight of each target word according to a term frequency-inverse document frequency of the target word;
and performing weighted fusion on the fusion features of the plurality of target words according to their fusion weights, so as to obtain the representation of the target sample text.
12. The method according to any one of claims 8-11, wherein the extracting the graph embedded feature of the target sample text according to the connection relationship in the text graph comprises:
determining, in the text graph, at least one associated sample text that has a connection with the target sample text;
and, by using a second graph neural network, passing the representation of the at least one associated sample text according to the representation similarity between the target sample text and the at least one associated sample text, and performing representation aggregation on the passed representations and/or the representation of the target sample text, so as to obtain the graph embedded feature of the target sample text.
13. The method of claim 12, wherein the method further comprises:
adjusting model parameters of the second graph neural network according to the difference between the predicted category and the expected category of the target sample text.
14. The method according to any one of claims 8-11, wherein the method further comprises:
performing word segmentation on corpora in a corpus set to obtain a plurality of candidate words;
deleting, from the plurality of candidate words, candidate words whose word frequency is less than a set value and candidate words belonging to a stop word set;
adding the retained candidate words to a global pool;
and establishing the word graph according to co-occurrence data between any two words in the global pool.
15. A text classification device, comprising:
a first acquisition module, configured to acquire a target text to be classified, wherein the target text contains at least one target word in a word graph;
a first encoding module, configured to encode and obtain a representation of the target text according to the graph embedded feature of the target word in the word graph;
a first determining module, configured to determine a connection relationship between the target text and at least one sample text in a text graph according to a representation similarity between the target text and the at least one sample text;
a first extraction module, configured to extract a graph embedded feature of the target text according to the connection relationship between the target text and the at least one sample text in the text graph;
and a first classification module, configured to classify the target text according to the graph embedded feature of the target text in the text graph.
16. The apparatus of claim 15, wherein the first encoding module is configured to:
perform feature fusion on the graph embedded feature of the target word in the word graph and a semantic feature extracted from the target word by a pre-training model, to obtain a fusion feature of the target word;
and encode and obtain the representation of the target text according to the fusion feature of the target word in the target text.
17. The apparatus of claim 16, wherein the first encoding module is configured to:
in a case that there are a plurality of target words, determine a fusion weight of each target word according to a term frequency-inverse document frequency of the target word;
and perform weighted fusion on the fusion features of the plurality of target words according to their fusion weights, so as to obtain the representation of the target text.
18. The apparatus of any of claims 15-17, wherein the first determining module is configured to:
determine the representation similarity between the target text and the at least one sample text according to a cosine similarity between the representation of the target text and the representation of the at least one sample text;
determine, from among the at least one sample text, an associated sample text whose representation similarity is greater than a threshold;
and determine, in the text graph, that a connection exists between the target text and the associated sample text, and that no connection exists between the target text and sample texts, other than the associated sample text, of the at least one sample text.
19. The apparatus of any of claims 15-17, wherein the first extraction module is to:
determine, in the text graph, at least one associated sample text that has a connection with the target text;
pass the representation of the at least one associated sample text according to the representation similarity between the target text and the at least one associated sample text;
and perform representation aggregation on the representations passed from the at least one associated sample text and/or the representation of the target text, to obtain the graph embedded feature of the target text.
20. The apparatus of any of claims 15-17, wherein the first classification module is to:
input the graph embedded feature of the target text in the text graph into a classifier, so as to determine the category of the target text based on an output of the classifier.
21. The apparatus of any of claims 15-17, wherein the apparatus further comprises a first generation module to:
perform word segmentation on corpora in a corpus set to obtain a plurality of candidate words;
delete, from the plurality of candidate words, candidate words whose word frequency is less than a set value and candidate words belonging to a stop word set;
add the retained candidate words to a global pool;
and establish the word graph according to co-occurrence data between any two words in the global pool.
22. A model training apparatus comprising:
a second processing module, configured to acquire any target sample text from a sample set, wherein the target sample text contains at least one target word in a word graph;
a second encoding module, configured to encode and obtain a representation of the target sample text according to the graph embedded feature of the target word in the word graph;
a second determining module, configured to determine a connection relationship between the target sample text and the remaining sample texts in a text graph according to a representation similarity between the target sample text and the remaining sample texts in the sample set;
a second extraction module, configured to extract a graph embedded feature of the target sample text according to the connection relationship in the text graph;
a second classification module, configured to classify the target sample text by using a classifier according to the graph embedded feature of the target sample text, to obtain a predicted category;
and a first training module, configured to adjust model parameters of the classifier according to the difference between the predicted category and the expected category of the target sample text.
23. The apparatus of claim 22, wherein the second encoding module is configured to:
acquire the graph embedded feature of the target word in the word graph by using a first graph neural network;
perform feature fusion on the graph embedded feature of the target word in the word graph and a semantic feature extracted from the target word by a pre-training model, to obtain a fusion feature of the target word;
and encode and obtain the representation of the target sample text according to the fusion feature of the target word in the target sample text.
24. The apparatus of claim 22, wherein the apparatus further comprises a second training module to:
adjust model parameters of the first graph neural network according to the difference between the predicted category and the expected category of the target sample text.
25. The apparatus of claim 22, wherein the second encoding module is configured to:
in a case that there are a plurality of target words, determine a fusion weight of each target word according to a term frequency-inverse document frequency of the target word;
and perform weighted fusion on the fusion features of the plurality of target words according to their fusion weights, so as to obtain the representation of the target sample text.
26. The apparatus of any of claims 22-25, wherein the second extraction module is configured to:
determine, in the text graph, at least one associated sample text that has a connection with the target sample text;
and, by using a second graph neural network, pass the representation of the at least one associated sample text according to the representation similarity between the target sample text and the at least one associated sample text, and perform representation aggregation on the passed representations and/or the representation of the target sample text, so as to obtain the graph embedded feature of the target sample text.
27. The apparatus of claim 26, wherein the apparatus further comprises a third training module to:
adjust model parameters of the second graph neural network according to the difference between the predicted category and the expected category of the target sample text.
28. The apparatus of any of claims 22-25, wherein the apparatus further comprises a second generation module to:
perform word segmentation on corpora in a corpus set to obtain a plurality of candidate words;
delete, from the plurality of candidate words, candidate words whose word frequency is less than a set value and candidate words belonging to a stop word set;
add the retained candidate words to a global pool;
and establish the word graph according to co-occurrence data between any two words in the global pool.
29. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7 or the method of any one of claims 8-14.
30. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-7 or the method of any one of claims 8-14.
31. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7 or the method according to any one of claims 8-14.
CN202310180559.6A 2023-02-15 2023-02-15 Text classification method, device, equipment and storage medium Pending CN116127075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310180559.6A CN116127075A (en) 2023-02-15 2023-02-15 Text classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116127075A true CN116127075A (en) 2023-05-16

Family

ID=86302882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310180559.6A Pending CN116127075A (en) 2023-02-15 2023-02-15 Text classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116127075A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination