
Text classification method, device, equipment and storage medium

Info

Publication number: CN113886577A
Application number: CN202111063727.0A
Authority: CN (China)
Prior art keywords: matrix, label, text, information, probability distribution
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 于翠翠, 王伟, 黄勇其, 张黔
Current Assignee / Original Assignee: Runlian Software System Shenzhen Co Ltd
Application filed by Runlian Software System Shenzhen Co Ltd
Priority to CN202111063727.0A
Publication of CN113886577A


Classifications

    • G06F16/35: Information retrieval of unstructured textual data; clustering or classification
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/216: Natural language analysis; parsing using statistical methods
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/08: Neural network learning methods


Abstract

The application relates to the technical field of artificial intelligence and discloses a text classification method, device, equipment and storage medium, wherein the method comprises the following steps: acquiring text data to be classified; extracting keywords from the text data to be classified to obtain word information, and encoding according to the word information to obtain corresponding position information; embedding the word information and the position information respectively, and merging the vectors obtained after embedding to obtain a text matrix; processing the text matrix with a pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, wherein the pre-trained classification model comprises a mask multi-head attention structure; and determining the category of the text data to be classified based on the first label probability distribution and the second label probability distribution. The application also relates to blockchain technology: the category data corresponding to the text data to be classified is stored in a blockchain. The method improves classification accuracy and can also generate new labels.

Description

Text classification method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a text classification method, apparatus, device, and storage medium.
Background
At present, with the development of the internet industry and the progress of science and technology, information in all industries is growing explosively. To meet users' diversified demands against this background of massive information, effective management of text information is urgently needed. Text classification technologies have therefore developed rapidly and become research hotspots and core technologies in fields such as data mining and data retrieval, playing an important role in the development of enterprise and social information technology. In the prior art, text classification often adopts machine learning methods such as logistic regression, naive Bayes or random forests, or traditional text classification methods such as TF-IDF and vector space models.
Disclosure of Invention
The application provides a text classification method, device, equipment and storage medium to solve the problem in the prior art that texts can only be classified into existing labels.
In order to solve the above problem, the present application provides a text classification method, including:
acquiring text data to be classified;
extracting keywords from the text data to be classified to obtain word information, and coding according to the word information to obtain corresponding position information;
embedding the word information and the position information respectively, and combining vectors obtained after embedding to obtain a text matrix;
the text matrix is processed by a pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, wherein the pre-trained classification model comprises a mask multi-head attention structure;
and determining the category of the text data to be classified based on the first label probability distribution and the second label probability distribution.
Further, the processing of the text matrix by the pre-trained classification model comprises:
normalizing the text matrix through a normalization layer in the pre-trained classification model to obtain a first matrix;
performing information extraction on the first matrix through a mask multi-head attention structure in the pre-trained classification model to obtain a second matrix containing the context information of the text data to be classified;
performing residual connection on the text matrix and the second matrix to obtain a third matrix, and performing normalization processing on the third matrix through a normalization layer in the pre-trained classification model to obtain a fourth matrix;
mapping the fourth matrix through a feedforward network layer in the pre-trained classification model to obtain a fifth matrix;
processing the fifth matrix through an activation function layer in the pre-trained classification model to obtain a sixth matrix, performing residual connection on the sixth matrix and the fourth matrix to obtain a seventh matrix, and performing normalization processing on the seventh matrix through a normalization layer in the pre-trained classification model to obtain an eighth matrix;
respectively carrying out linear transformation twice on the eighth matrix to obtain a first label matrix and a second label matrix;
and mapping the first label matrix and the second label matrix through a Softmax layer in the classification model to obtain the first label probability distribution and the second label probability distribution.
Further, the performing information extraction on the first matrix through the mask multi-head attention structure in the pre-trained classification model to obtain the second matrix containing the context information of the text data to be classified includes:
multiplying the first matrix by the parameter matrixes of multiple batches obtained after pre-training respectively to obtain Q matrixes, K matrixes and V matrixes of multiple batches;
performing dot multiplication on the Q matrix and the K matrix of each batch, dividing the first result obtained by the dot multiplication by the square root of the corresponding dimension of the Q matrix to obtain a second result, adding the second result to a mask matrix and then performing Softmax calculation to obtain a weight matrix, wherein the mask matrix is constructed based on the word information;
multiplying the weight matrix with the V matrixes of the corresponding batches to obtain a ninth matrix of each batch;
and splicing the ninth matrixes of all batches, and performing linear transformation on the spliced matrixes to obtain the second matrix.
Further, the determining the category to which the text data to be classified belongs based on the first label probability distribution and the second label probability distribution includes:
acquiring a label corresponding to the probability maximum value in the first label probability distribution, and judging whether the probability maximum value is greater than or equal to a preset value;
if the probability maximum value is smaller than the preset numerical value, taking the label corresponding to the probability maximum value in the second label probability distribution as the category to which the text data to be classified belongs, and storing the label corresponding to the probability maximum value in the second label probability distribution into a label word list;
and if the probability maximum value is larger than or equal to the preset numerical value, taking a label corresponding to the probability maximum value in the first label probability distribution as the category to which the text data to be classified belongs.
Further, the extracting the keywords from the text data to be classified includes:
performing word segmentation processing on the text data to be classified by using jieba word segmentation to obtain a plurality of corresponding words;
and extracting the keywords of the words by using a keyword extraction algorithm, extracting the keywords with a preset proportion, and replacing the keywords in the words by using a mask.
Further, the extracting keywords from the text data to be classified to obtain word information and encoding according to the word information to obtain corresponding position information includes:
sorting the keywords based on the weights obtained by the keyword extraction algorithm, and setting an identifier in front of each keyword to obtain keyword information;
merging the information consisting of the plurality of words containing the mask with the keyword information to obtain the word information;
and performing position coding according to the word information to obtain the corresponding position information.
Further, the construction of the mask matrix based on the word information includes:
for the information consisting of the plurality of words containing the mask, associating the contents with each other, and constructing first data according to the association relationship;
for the content in the keyword information, receiving only information transmitted from the preceding text in the word information, and constructing second data;
and splicing the first data and the second data, and filling the spliced vacant positions with filling contents to obtain the mask matrix.
In order to solve the above problem, the present application also provides a text classification apparatus, including:
the acquisition module is used for acquiring text data to be classified;
the information extraction module is used for extracting keywords from the text data to be classified to obtain word information, and coding the word information to obtain corresponding position information;
the vectorization module is used for respectively embedding the word information and the position information, and merging vectors obtained after embedding processing to obtain a text matrix;
the classification module is used for processing the text matrix through a pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, wherein the pre-trained classification model comprises a mask multi-head attention structure;
and the output module is used for determining the category of the text data to be classified based on the first label probability distribution and the second label probability distribution.
In order to solve the above problem, the present application also provides a computer device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a text classification method as described above.
In order to solve the above problem, the present application also provides a non-volatile computer-readable storage medium having computer-readable instructions stored thereon, which when executed by a processor implement the text classification method as described above.
Compared with the prior art, the text classification method, the text classification device, the text classification equipment and the text classification storage medium provided by the embodiment of the application have the following beneficial effects:
the method comprises the steps of extracting keywords from text data to be classified to obtain word information by obtaining the text data to be classified, carrying out position coding according to the word information to obtain corresponding position information, and facilitating a subsequent classification model to better understand the text to be classified by obtaining the word information and the position information; respectively embedding word information and position information, combining vectors obtained after embedding to obtain a text matrix, inputting the text matrix into a pre-trained classification model for processing to obtain a first label probability distribution and a second label probability distribution, wherein the first label probability distribution is the probability of the existing label, and the second label probability is the probability of a label newly generated aiming at the text; the classification of the text data to be classified is determined based on the first label probability distribution and the second label probability distribution, so that the text data to be classified not only has the capability of classifying in the designated label but also has the capability of generating the label under the condition of improving the classification accuracy, and the fault tolerance of labeling is improved.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic flowchart of a text classification method according to an embodiment of the present application;
fig. 2 is a schematic diagram of an overall network structure according to an embodiment of the present application;
fig. 3 is a schematic diagram of a mask matrix according to an embodiment of the present application;
fig. 4 is a schematic block diagram of a text classification apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. One skilled in the art will explicitly or implicitly appreciate that the embodiments described herein can be combined with other embodiments.
The application provides a text classification method. Referring to fig. 1 and fig. 2, fig. 1 is a schematic flowchart of a text classification method according to an embodiment of the present application, and fig. 2 is a schematic diagram of an overall network structure according to an embodiment of the present application.
In this embodiment, the text classification method includes:
s1, acquiring text data to be classified;
specifically, the text data to be classified is acquired from the database, or the text data to be classified input by the user is directly received.
When the text data to be classified is obtained from the database, a call request needs to be sent to the database, and the call request carries a signature verification token; a signature verification result returned by the database is received, and when the verification passes, the text data to be classified is retrieved from the database.
Existing labels are also acquired.
The database is encrypted, and signature verification is required when extracting the text data to be classified from it, which ensures the security of the data.
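As a rough illustration only, the following sketch shows such a token-verified fetch; the endpoint paths, payload fields and HTTP client are assumptions, not an interface disclosed by the application.

```python
# Hypothetical token-verified fetch for S1; endpoints and fields are assumed.
import requests

def fetch_texts_to_classify(db_url: str, sign_token: str):
    # the call request carries the signature verification token
    resp = requests.post(f"{db_url}/verify", json={"token": sign_token})
    # the database returns a signature verification result
    if resp.json().get("result") == "pass":
        return requests.get(f"{db_url}/texts/unclassified").json()
    raise PermissionError("signature verification failed")
```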
S2, extracting keywords from the text data to be classified to obtain word information, and coding according to the word information to obtain corresponding position information;
Specifically, keywords are extracted using a keyword extraction algorithm, the keywords in the text to be classified are replaced with masks, the extracted keywords are combined into keyword information, and the masked text to be classified and the keyword information are merged and spliced to obtain the word information; the word information is then encoded to obtain the corresponding position information.
Further, the extracting the keywords from the text data to be classified includes:
performing word segmentation processing on the text data to be classified by using jieba word segmentation to obtain a plurality of corresponding words;
and extracting the keywords of the words by using a keyword extraction algorithm, extracting the keywords with a preset proportion, and replacing the keywords in the words by using a mask.
Specifically, word segmentation processing is first performed on the text data to be processed through jieba word segmentation, dividing it into separate words, so that a plurality of words are obtained; then a keyword extraction algorithm is used to extract keywords from the words. In this application, the keyword extraction algorithm is the TF-IDF algorithm: core keywords in the text are extracted through TF-IDF, with the preset proportion being 10%-15%, i.e. the words are sorted according to keyword weight and the top 10%-15% by weight are extracted. After extraction, the original positions of the keywords are replaced with masks, giving a plurality of words containing masks and the extracted keywords.
For example, given the words w1, w2, w3, w4, w5, w6, w7, ..., wT, suppose w2,w3 and w6,w7 are extracted as two keyword groups. This yields the masked word sequence w1, [MASK], w4, w5, [MASK], ..., wT together with the extracted keywords w2,w3 and w6,w7.
Preprocessing the text data to be classified by extracting keywords makes it convenient to obtain the word information and position information subsequently.
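A minimal sketch of this preprocessing, assuming the jieba tokenizer and its built-in TF-IDF extractor; the 10%-15% ratio follows the description above. For brevity each keyword occurrence is masked individually, whereas the application groups adjacent keywords under a single mask.

```python
# Keyword extraction and mask replacement for step S2 (illustrative sketch).
import jieba
import jieba.analyse

def extract_and_mask(text: str, ratio: float = 0.15):
    words = list(jieba.cut(text))                        # segment into words
    top_k = max(1, int(len(words) * ratio))              # preset proportion (10%-15%)
    # TF-IDF keywords with their weights, highest weight first
    keywords = jieba.analyse.extract_tags(text, topK=top_k, withWeight=True)
    keyword_set = {w for w, _ in keywords}
    masked = ["[MASK]" if w in keyword_set else w for w in words]
    return masked, keywords
```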
Further, the extracting keywords from the text data to be classified to obtain word information and encoding according to the word information to obtain corresponding position information includes:
sorting the keywords based on the weights obtained by the keyword extraction algorithm, and setting an identifier in front of each keyword to obtain keyword information;
merging the information consisting of the plurality of words containing the mask with the keyword information to obtain the word information;
and performing position coding according to the word information to obtain the corresponding position information.
Specifically, after the keywords are extracted, they are sorted according to their weights, and an identifier [START] is set before each keyword. For example, the two keyword groups w2,w3 and w6,w7 are sorted by weight, ranking 2 and 1 respectively; adding the identifier [START] gives the keyword information [START], w6, w7, [START], w2, w3.
The information consisting of the plurality of words containing the mask is combined with the keyword information to obtain the word information w1, [MASK], w4, w5, [MASK], ..., wT, [START], w6, w7, [START], w2, w3, and position coding is performed based on the word information to obtain two groups of codes. The first group encodes according to the real positions, giving 1, 2, 3, 4, 5, ..., T, 5, 5, 5, 2, 2, 2: each keyword's code agrees with the code of the mask that replaced it, i.e. w6,w7 correspond to the encoding of the second mask, 5, while w2,w3 correspond to the encoding of the first mask, 2. The second group starts encoding from each identifier, giving 0, 0, 0, 0, 0, ..., 0, 1, 2, 3, 1, 2, 3.
The pieces of information are spliced to obtain the word information, and the position information is obtained from the word information. Introducing the position information facilitates the subsequent classification model's recognition of position, and the mask matrix is later obtained from the word information.
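Under the same assumptions as the w1...wT example, a small sketch of the two position-code groups; the function name and data layout are illustrative:

```python
# Two position-code groups for the word information (illustrative sketch).
def position_codes(masked_words, keyword_groups):
    """masked_words: e.g. ['w1', '[MASK]', 'w4', 'w5', '[MASK]']
    keyword_groups: weight-sorted (keywords, mask_position) pairs,
    e.g. [(['w6', 'w7'], 5), (['w2', 'w3'], 2)]."""
    first = list(range(1, len(masked_words) + 1))   # real positions of the word part
    second = [0] * len(masked_words)                # word part is all zeros
    for group, mask_pos in keyword_groups:
        first += [mask_pos] * (len(group) + 1)      # [START] + keywords share the mask code
        second += list(range(1, len(group) + 2))    # 1,2,3 restarting at each [START]
    return first, second

first, second = position_codes(
    ['w1', '[MASK]', 'w4', 'w5', '[MASK]'],
    [(['w6', 'w7'], 5), (['w2', 'w3'], 2)])
# first  -> [1, 2, 3, 4, 5, 5, 5, 5, 2, 2, 2]
# second -> [0, 0, 0, 0, 0, 1, 2, 3, 1, 2, 3]
```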
S3, respectively embedding the word information and the position information, and merging vectors obtained after embedding to obtain a text matrix;
specifically, the word information and the position information are respectively processed by a vector transformation model, for example, a pre-trained language model Bert is adopted to embed the word information to obtain a vector x1Dimension of [ L, D]Wherein L is the length of the input data, and D is the dimension of the vector after Embedding. The position information is also subjected to Embedding to respectively obtain x2,x3Dimension and x1Are in agreement of [ L, D]. X is to be1、x2And x3And merging to obtain a text matrix.
S4, processing the text matrix through a pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, wherein the pre-trained classification model comprises a mask multi-head attention structure;
Specifically, a mask multi-head attention structure is introduced into the classification model to extract the context information of the text; the text matrix is processed by the pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, where the first label probability distribution is the probability obtained over the existing labels, and the second label probability distribution is the probability of a newly generated label derived from the text matrix.
Further, the processing of the text matrix by the pre-trained classification model comprises:
normalizing the text matrix through a normalization layer in the pre-trained classification model to obtain a first matrix;
performing information extraction on the first matrix through the mask multi-head attention structure in the pre-trained classification model to obtain a second matrix containing the context information of the text data to be classified;
performing residual connection on the text matrix and the second matrix to obtain a third matrix, and performing normalization processing on the third matrix through a normalization layer in the pre-trained classification model to obtain a fourth matrix;
mapping the fourth matrix through a feedforward network layer in the pre-trained classification model to obtain a fifth matrix;
processing the fifth matrix through an activation function layer in the pre-trained classification model to obtain a sixth matrix, performing residual connection on the sixth matrix and the fourth matrix to obtain a seventh matrix, and performing normalization processing on the seventh matrix through a normalization layer in the pre-trained classification model to obtain an eighth matrix;
respectively carrying out linear transformation twice on the eighth matrix to obtain a first label matrix and a second label matrix;
and mapping the first label matrix and the second label matrix through a Softmax layer in the classification model to obtain the first label probability distribution and the second label probability distribution.
Specifically, the text matrix is first normalized through a normalization layer, i.e. a Norm (normalization) layer, which can improve the convergence speed and precision of the model to a certain extent, giving the first matrix x4. Specifically:

x4 = (x − μ)/σ

where x is the text matrix, μ represents the mean of the text matrix and σ represents the standard deviation of the text matrix.
The first matrix undergoes information extraction through the mask multi-head attention structure in the pre-trained classification model, giving a second matrix x5 that contains the context information of the text data to be classified. As shown in fig. 3, the masked multi-head attention structure is basically the same as the encoder structure of Bert, the difference being that a mask matrix is added to mask out future information, i.e. each position can only receive information transmitted from the preceding text, which converts a single encoder/decoder into a generative model.
In the present application, after the second matrix x5 is obtained, Dropout processing may also be performed to effectively suppress overfitting.
The text matrix and the Dropout-processed second matrix x5 are residual-connected to obtain a third matrix, and the third matrix is normalized through a Norm layer in the pre-trained classification model, giving the fourth matrix x6. Specifically:

x6 = (x′ − μ′)/σ′

where x′ is the third matrix, μ′ represents the mean after the first residual connection and σ′ represents the standard deviation after the first residual connection. The Add & Norm residual connection ameliorates the gradient vanishing problem.
The normalized fourth matrix x6 is input into the feed-forward network in the classification model, i.e. the FNN (feed-forward neural network), giving the fifth matrix x7. The FNN is a neural network with two layers of linear mapping; the specific calculation formula is: x7 = f(W2·f(W1·x6 + b1) + b2), where W1, W2, b1 and b2 are training parameters that have converged after the classification model is pre-trained.
The fifth matrix x7 is processed through the activation function layer in the pre-trained classification model to obtain a sixth matrix; the sixth matrix and the fourth matrix x6 are residual-connected to obtain a seventh matrix, and the seventh matrix is normalized through a Norm layer in the pre-trained classification model, giving the eighth matrix x8. The specific algorithm is:

x8 = (x″ − μ″)/σ″

where x″ is the seventh matrix, μ″ represents the mean after the second residual connection and σ″ represents the standard deviation after the second residual connection. The eighth matrix x8 has dimension [L1, D], where L1 represents the number of words to be generated.
The eighth matrix x8 is linearly transformed twice, giving a first label matrix x9 and a second label matrix x10. The first label matrix x9 has dimension [L2, D1], where L2 denotes the number of generated words and D1 denotes the number of labels. The second label matrix x10 has dimension [L3, D2], where L3 denotes the number of generated words and D2 denotes the size of the word list.
The first label matrix and the second label matrix are mapped through a Softmax layer in the classification model to obtain the first label probability distribution and the second label probability distribution, whose dimensions agree with those of the first label matrix x9 and the second label matrix x10, respectively.
The Norm layer, the mask multi-head attention structure, the first Add & Norm layer, the FNN layer, the GeLU layer and the second Add & Norm layer in the pre-trained classification model may be regarded as one group; in other embodiments of the present application, multiple groups may be provided in the classification model to obtain better results, i.e., as shown in fig. 2, multiple Model layers may be provided.
Processing the data through each layer of the classification model yields the first label matrix x9 and the second label matrix x10, realizing accurate classification of the data to be classified; the classification is not limited to the existing labels, and new labels can also be generated.
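The layer stack described above can be summarized in the following hedged torch sketch. Class and variable names, the head count and all dimensions are assumptions; only the layer order (Norm, masked multi-head attention, Add & Norm, FNN, GeLU, Add & Norm, two linear heads, Softmax) follows the description, and the GeLU is folded into the FFN for brevity:

```python
# One Model layer of the classification network (hedged sketch; dims assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModelLayer(nn.Module):
    def __init__(self, d=768, heads=12, n_labels=100, vocab=21128):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.drop = nn.Dropout(0.1)
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm3 = nn.LayerNorm(d)
        self.head1 = nn.Linear(d, n_labels)  # -> first label matrix x9
        self.head2 = nn.Linear(d, vocab)     # -> second label matrix x10

    def forward(self, x, attn_mask):
        x4 = self.norm1(x)                                  # first matrix
        x5, _ = self.attn(x4, x4, x4, attn_mask=attn_mask)  # second matrix
        x6 = self.norm2(x + self.drop(x5))                  # Add & Norm -> fourth matrix
        x7 = self.ffn(x6)                                   # FNN + GeLU -> fifth/sixth
        x8 = self.norm3(x6 + x7)                            # Add & Norm -> eighth matrix
        # simplified: the application averages the characters after each [START]
        p1 = F.softmax(self.head1(x8.mean(dim=1)), dim=-1)  # first label distribution
        p2 = F.softmax(self.head2(x8), dim=-1)              # second label distribution
        return p1, p2
```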
Still further, the performing information extraction on the first matrix through the mask multi-head attention structure in the pre-trained classification model to obtain the second matrix containing the context information of the text data to be classified includes:
multiplying the first matrix by the parameter matrixes of multiple batches obtained after pre-training respectively to obtain Q matrixes, K matrixes and V matrixes of multiple batches;
performing dot multiplication on the Q matrix and the K matrix of each batch, dividing the first result obtained by the dot multiplication by the square root of the corresponding dimension of the Q matrix to obtain a second result, adding the second result to a mask matrix and then performing Softmax calculation to obtain a weight matrix, wherein the mask matrix is constructed based on the word information;
multiplying the weight matrix with the V matrixes of the corresponding batches to obtain a ninth matrix of each batch;
and splicing the ninth matrixes of all batches, and performing linear transformation on the spliced matrixes to obtain the second matrix.
Specifically, before the first training of the classification model, the parameter matrices of the multiple batches are randomly generated, e.g. according to a normal distribution or a uniform distribution; during the continuous training of the classification model they keep changing and converging until training is completed and the parameter matrices of the multiple batches become stable. When the trained classification model is subsequently used, the stable parameter matrices of the multiple batches are used directly.
The first matrix x4 is multiplied by each batch of parameter matrices, giving the Q, K and V matrices of each batch, with the formulas Q = x4·WQ, K = x4·WK, V = x4·WV, where WQ, WK and WV are the parameter matrices. The three matrices Q, K and V are then used to calculate the weight of each word in the input text (for a word that only receives the preceding information, the weights on future positions are 0). The specific calculation formula is:

A = Softmax((Q·Kᵀ)/√dk + M)

where A represents the weight matrix, dk represents the Q, K or V matrix dimension, and M represents the mask matrix. The weight matrix is multiplied by the V matrices of the corresponding batches to obtain the ninth matrix of each batch; the ninth matrices of all batches are spliced, and the spliced matrix is linearly transformed to obtain the second matrix x5.
By introducing the masked Multi-Head Attention structure, the context information of the text is extracted, and the added mask covers future information, so the text can be classified better and the text classification accuracy is improved.
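A small numeric sketch of the attention computation above, assuming a single batch/head and, for brevity, a purely causal mask; the application's mask is block-structured, as described for fig. 3:

```python
# Masked attention: A = Softmax(Q·Kᵀ/√dk + M), ninth matrix = A·V (sketch).
import torch
import torch.nn.functional as F

L_seq, d_k = 6, 8
x4 = torch.randn(L_seq, d_k)                     # first matrix
W_Q, W_K, W_V = (torch.randn(d_k, d_k) for _ in range(3))
Q, K, V = x4 @ W_Q, x4 @ W_K, x4 @ W_V

M = torch.triu(torch.full((L_seq, L_seq), float("-inf")), diagonal=1)
A = F.softmax(Q @ K.T / d_k ** 0.5 + M, dim=-1)  # masked weights become 0
ninth = A @ V                                    # ninth matrix of this batch
```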
Further, the performing linear transformation on the eighth matrix twice respectively to obtain a first label matrix and a second label matrix includes:
after averaging each word vector in the eighth matrix, performing linear transformation to obtain the first label matrix;
and performing linear transformation on each word vector in the eighth matrix to obtain the second label matrix.
Specifically, since each label word is composed of a plurality of characters, in one branch the vectors of the characters following each identifier [START] are averaged and then linearly transformed to obtain the first label matrix x9; the other branch uses only a linear transformation, directly transforming each word vector in the eighth matrix to obtain the second label matrix x10.
Performing two different linear transformations yields two different label matrices, realizing classification over the existing labels and the generation of new labels.
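A short sketch of the two transforms, assuming start_positions marks the [START] tokens and each label word has group_len characters; all names and sizes are illustrative:

```python
# Two linear transforms over the eighth matrix (illustrative sketch).
import torch
import torch.nn as nn

d, n_labels, vocab = 768, 100, 21128
lin1, lin2 = nn.Linear(d, n_labels), nn.Linear(d, vocab)

x8 = torch.randn(11, d)                  # eighth matrix, one row per token
start_positions, group_len = [5, 8], 2   # hypothetical [START] indices

# first label matrix: average the character vectors after each [START], then map
groups = torch.stack(
    [x8[p + 1 : p + 1 + group_len].mean(dim=0) for p in start_positions])
x9 = lin1(groups)   # over the existing labels
x10 = lin2(x8)      # second label matrix: per-token transform over the word list
```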
Still further, before the residual connection of the text matrix and the second matrix, the method further includes:
processing the second matrix through a Dropout layer in the classification model to suppress overfitting of the classification model.
Specifically, overfitting of the classification model is effectively suppressed by inserting a Dropout layer before residual concatenation.
Still further, the construction of the mask matrix based on the word information includes:
for the information consisting of the plurality of words containing the mask, associating the contents with each other, and constructing first data according to the association relationship;
for the content in the keyword information, receiving only information transmitted from the preceding text in the word information, and constructing second data;
and splicing the first data and the second data, and filling the spliced vacant positions with filling contents to obtain the mask matrix.
Specifically, as shown in fig. 3, the word information is divided into two parts: one part consists of the plurality of words containing masks, and the other is the keyword information part. The masked word part can be associated with itself in both directions, i.e. context information can be extracted, and the first data are constructed according to this association. For the keyword information part, future information needs to be masked, i.e. its information is transmitted in a single direction and each position only receives information transmitted from the preceding text; the second data are constructed accordingly. The first data and the second data are spliced, and the vacant positions after splicing are filled with filler content to obtain the mask matrix. Further, the first data and second data positions are directly represented by 0, and the filler content is set to negative infinity (−∞), so that the corresponding attention weights become 0 after Softmax.
Future information is masked out by constructing the mask matrix, which converts a single encoder/decoder into a generative model.
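A hedged sketch of the mask-matrix construction: mutually visible positions get 0 and blocked (filler) positions get negative infinity, so that their attention weights vanish after Softmax; the sequence layout is assumed:

```python
# Mask matrix: word part bidirectional, keyword part unidirectional (sketch).
import torch

def build_mask(n_words: int, n_keyword_tokens: int) -> torch.Tensor:
    n = n_words + n_keyword_tokens
    mask = torch.full((n, n), float("-inf"))   # filler content: -inf
    mask[:n_words, :n_words] = 0.0             # first data: word part attends to itself
    for i in range(n_keyword_tokens):          # second data: keyword part receives the
        row = n_words + i                      # word part and earlier keyword tokens only
        mask[row, : n_words + i + 1] = 0.0
    return mask

print(build_mask(4, 3))                        # 7x7 mask matrix
```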
S5, determining the category of the text data to be classified based on the first label probability distribution and the second label probability distribution.
Specifically, the first label probability distribution is the probability distribution obtained over the existing labels, and the second label probability distribution is the probability distribution of newly generated labels. The two distributions are considered together when selecting the label: when the first label probability distribution does not meet the requirement, the label is selected according to the second label probability distribution.
Further, the determining the category to which the text data to be classified belongs based on the first label probability distribution and the second label probability distribution includes:
acquiring a label corresponding to the probability maximum value in the first label probability distribution, and judging whether the probability maximum value is greater than or equal to a preset value;
if the probability maximum value is smaller than the preset numerical value, taking the label corresponding to the probability maximum value in the second label probability distribution as the category to which the text data to be classified belongs, and storing the label corresponding to the probability maximum value in the second label probability distribution into a label word list;
and if the probability maximum value is larger than or equal to the preset numerical value, taking a label corresponding to the probability maximum value in the first label probability distribution as the category to which the text data to be classified belongs.
Specifically, the first label probability distribution is obtained based on the existing labels. Whether the probability maximum in the first label probability distribution is greater than or equal to a preset value is judged; if so, the label corresponding to the probability maximum in the first label probability distribution is directly used as the category to which the text data to be classified belongs. If it is smaller than the preset value, the label corresponding to the probability maximum in the second label probability distribution is extracted as the category of the text data to be classified and stored into the label word list. In this application, the preset value is 0.5.
By comparing the first label probability distribution with the preset value and obtaining different labels according to the comparison result, the label closest to the text data to be classified is guaranteed to be selected, improving the effectiveness of text classification.
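A minimal sketch of this decision rule with the 0.5 threshold from the description; tensor and list names are assumptions:

```python
# Category decision for step S5 (illustrative sketch).
import torch

def choose_label(p1, p2, existing_labels, word_list, label_vocab, threshold=0.5):
    prob, idx = p1.max(dim=-1)                 # maximum of the first distribution
    if prob.item() >= threshold:
        return existing_labels[idx.item()]     # use the existing label
    new_idx = p2.argmax(dim=-1).item()         # otherwise take the new-label maximum
    new_label = word_list[new_idx]
    label_vocab.append(new_label)              # store into the label word list
    return new_label
```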
It is emphasized that, in order to further ensure the privacy and security of the data, all category data of the text data to be classified may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated using cryptographic methods, each of which contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In the text classification method of this embodiment, text data to be classified is acquired; keywords are extracted from it to obtain word information, and position coding is performed according to the word information to obtain corresponding position information, which helps the subsequent classification model better understand the text to be classified. The word information and position information are embedded respectively, and the vectors obtained after embedding are merged into a text matrix, which is input into a pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, where the first is the probability over the existing labels and the second is the probability of a label newly generated for the text. The category of the text data to be classified is determined based on the two distributions; while ensuring classification accuracy, the method not only can classify into the specified labels but also can generate labels, improving the fault tolerance of labeling.
The embodiment also provides a text classification device, as shown in fig. 4, which is a functional block diagram of the text classification device of the present application.
The text classification apparatus 100 may be installed in an electronic device. According to the implemented functions, the text classification apparatus 100 may include an acquisition module 101, an information extraction module 102, a vectorization module 103, a classification module 104 and an output module 105. A module, which may also be referred to as a unit in this application, is a series of computer program segments that can be executed by a processor of the electronic device, perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
an obtaining module 101, configured to obtain text data to be classified;
specifically, the obtaining module 101 obtains the text data to be classified from the database, or directly receives the text data to be classified input by the user.
The information extraction module 102 is configured to extract keywords from the text data to be classified to obtain word information, and encode the word information to obtain corresponding position information;
specifically, the information extraction module 102 extracts keywords by using a keyword extraction algorithm, replaces the keywords in the text to be classified with masks, combines the extracted keywords into keyword information, and combines and splices the text to be classified containing masks and the keyword information to obtain word information; and coding the word information to obtain corresponding position information.
Further, the information extraction module 102 includes a word segmentation sub-module and an extraction replacement sub-module;
the word segmentation sub-module is used for performing word segmentation processing on the text data to be classified by using jieba word segmentation to obtain a plurality of corresponding words;
and the extraction and replacement submodule is used for extracting the keywords from the words by using a keyword extraction algorithm, extracting the keywords in a preset proportion and replacing the keywords in the words by using a mask.
Specifically, the word segmentation submodule first performs word segmentation processing on the text data to be processed through jieba word segmentation, dividing it into separate words, so that a plurality of words are obtained; then the extraction and replacement submodule extracts keywords from the words using a keyword extraction algorithm, which in this application is the TF-IDF algorithm. The extraction and replacement submodule extracts the core keywords in the text through the TF-IDF algorithm with a preset proportion of 10%-15%, i.e. the words are ranked according to keyword weight and the top 10%-15% by weight are extracted; after extraction, the original positions of the keywords are replaced with masks, giving a plurality of words containing masks and the extracted keywords.
The word segmentation sub-module and the extraction replacement sub-module are matched to perform keyword extraction preprocessing on the text data to be classified, so that word information and position information can be obtained conveniently in the follow-up process.
Further, the information extraction module 102 includes a key information sub-module, a word information sub-module, and an encoding sub-module;
the key information submodule is used for sequencing the keywords based on the weight of the keywords obtained by the keyword extraction algorithm, and setting an identifier in front of each keyword to obtain keyword information;
the word information submodule is used for merging the information consisting of a plurality of words and words containing the mask with the keyword information to obtain word information;
and the coding submodule is used for carrying out position coding according to the word information to obtain corresponding position information.
Specifically, after the keywords are extracted, the key information submodule sorts them according to their weights and sets an identifier [START] in front of each keyword; the word information submodule merges the information consisting of the plurality of words containing the mask with the keyword information to obtain the word information; and the coding submodule performs position coding based on the word information to obtain two groups of codes, where the first group is coded according to the real positions and the second group is coded according to the identifiers.
Through the cooperation of the key information submodule, the word information submodule and the coding submodule, the information is spliced to obtain the word information, and the position information is obtained from the word information. Introducing the position information facilitates the subsequent classification model's recognition of position, and the mask matrix is later obtained from the word information.
The vectorization module 103 is configured to perform embedding processing on the word information and the position information respectively, and merge vectors obtained after the embedding processing to obtain a text matrix;
Specifically, the vectorization module 103 processes the word information and the position information through a vector transformation model. For example, the pre-trained language model Bert is used to embed the word information, giving a vector x1 of dimension [L, D], where L is the length of the input data and D is the vector dimension after Embedding. The position information is likewise embedded, giving x2 and x3, whose dimensions agree with x1, i.e. [L, D]. x1, x2 and x3 are merged to obtain the text matrix.
The classification module 104 is configured to process the text matrix through a pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, where the pre-trained classification model includes a mask multi-head attention structure;
Specifically, the classification module 104 extracts the context information of the text by introducing a mask multi-head attention structure into the classification model, and the text matrix is processed by the pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, where the first label probability distribution is the probability obtained over the existing labels, and the second label probability distribution is the probability of a newly generated label derived from the text matrix.
Further, the classification module 104 includes a normalization sub-module, an information extraction sub-module, a first connection sub-module, a mapping sub-module, a second connection sub-module, a linear transformation sub-module, and an activation sub-module;
the normalization submodule is used for performing normalization processing on the text matrix through a normalization layer in the pre-trained classification model to obtain a first matrix;
the information extraction submodule is used for performing information extraction processing on the first matrix through a mask multi-head attention structure in the pre-trained classification model to obtain a second matrix containing text data context information to be classified;
the first connection submodule is used for performing residual connection on the text matrix and the second matrix to obtain a third matrix, and performing normalization processing on the third matrix through a normalization layer in the classification model to obtain a fourth matrix;
the mapping submodule is used for mapping the fourth matrix through a feedforward network layer in the pre-trained classification model to obtain a fifth matrix;
the second connection submodule is used for processing the fifth matrix through an activation function layer in the pre-trained classification model to obtain a sixth matrix, performing residual connection on the sixth matrix and the fourth matrix to obtain a seventh matrix, and performing normalization processing on the seventh matrix through a normalization layer in the pre-trained classification model to obtain an eighth matrix;
the linear transformation submodule is used for respectively carrying out linear transformation twice on the eighth matrix to obtain a first label matrix and a second label matrix;
the activation submodule is used for mapping the first label matrix and the second label matrix through a Softmax layer in the classification model to obtain the first label probability distribution and the second label probability distribution.
Specifically, the normalization submodule first normalizes the text matrix through a normalization layer, i.e. a Norm (normalization) layer, which can improve the convergence speed and precision of the model to a certain extent, giving the first matrix x4. Specifically:

x4 = (x − μ)/σ

where x is the text matrix, μ represents the mean of the text matrix and σ represents the standard deviation of the text matrix.
The information extraction submodule performs information extraction on the first matrix through the mask multi-head attention structure in the pre-trained classification model, giving a second matrix x5 that contains the context information of the text data to be classified. The masked multi-head attention structure is basically the same as the encoder structure of Bert, the difference being that a mask matrix is added to mask out future information, i.e. each position can only receive information transmitted from the preceding text, so that a single encoder/decoder is converted into a generative model.
In the present application, after the second matrix x5 is obtained, Dropout processing may also be performed to effectively suppress overfitting.
The first connection submodule residual-connects the text matrix and the Dropout-processed second matrix x5 to obtain a third matrix, and normalizes the third matrix through a Norm layer in the pre-trained classification model, giving the fourth matrix x6. Specifically:

x6 = (x′ − μ′)/σ′

where x′ is the third matrix, μ′ represents the mean after the first residual connection and σ′ represents the standard deviation after the first residual connection; the Add & Norm residual connection ameliorates the gradient vanishing problem.
The mapping submodule inputs the normalized fourth matrix x6 into the feed-forward network in the classification model, i.e. the FNN (feed-forward neural network), giving the fifth matrix x7. The FNN is a neural network with two layers of linear mapping; the specific calculation formula is:

x7 = f(W2·f(W1·x6 + b1) + b2)

where W1, W2, b1 and b2 are training parameters that have converged after the classification model is pre-trained.
The second connection submodule processes the fifth matrix x7 through the activation function layer in the pre-trained classification model to obtain a sixth matrix, residual-connects the sixth matrix and the fourth matrix x6 to obtain a seventh matrix, and normalizes the seventh matrix through a Norm layer in the pre-trained classification model, giving the eighth matrix x8. The specific algorithm is:

x8 = (x″ − μ″)/σ″

where x″ is the seventh matrix, μ″ represents the mean after the second residual connection and σ″ represents the standard deviation after the second residual connection. The eighth matrix x8 has dimension [L1, D], where L1 represents the number of words to be generated.
The linear transformation submodule linearly transforms the eighth matrix x8 twice, giving a first label matrix x9 and a second label matrix x10. The first label matrix x9 has dimension [L2, D1], where L2 denotes the number of generated words and D1 denotes the number of labels. The second label matrix x10 has dimension [L3, D2], where L3 denotes the number of generated words and D2 denotes the size of the word list.
The activation submodule maps the first label matrix and the second label matrix through a Softmax layer in the classification model to obtain the first label probability distribution and the second label probability distribution, whose dimensions are consistent with those of the first label matrix x9 and the second label matrix x10, respectively.
Through the cooperation of the normalization submodule, the information extraction submodule, the first connection submodule, the mapping submodule, the second connection submodule, the linear transformation submodule and the activation submodule, the first label matrix x9 and the second label matrix x10 are obtained, enabling accurate classification of the data to be classified; the classification is not limited to the existing labels, and new labels can also be generated.
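To make the data flow between these submodules concrete, the following PyTorch sketch composes them into one block. All hyperparameters (model width, head count, tag and word-list sizes), the ReLU activation, the dropout rate, and the sequence-averaging used for the first label head are illustrative assumptions rather than values taken from this application:

```python
import torch
import torch.nn as nn

class ClassifierBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, n_tags=50, vocab=30000):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)                  # normalization submodule
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dropout = nn.Dropout(0.1)                      # Dropout submodule
        self.norm2 = nn.LayerNorm(d_model)                  # first connection submodule
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model), nn.ReLU())
        self.norm3 = nn.LayerNorm(d_model)                  # second connection submodule
        self.tag_head = nn.Linear(d_model, n_tags)          # toward first label matrix x9
        self.vocab_head = nn.Linear(d_model, vocab)         # toward second label matrix x10

    def forward(self, text_matrix, attn_mask):
        x4 = self.norm1(text_matrix)                        # first matrix
        x5, _ = self.attn(x4, x4, x4, attn_mask=attn_mask)  # second matrix
        x6 = self.norm2(text_matrix + self.dropout(x5))     # fourth matrix
        x8 = self.norm3(self.ffn(x6) + x6)                  # eighth matrix
        p_tags = self.tag_head(x8.mean(dim=1)).softmax(-1)  # first label distribution
        p_vocab = self.vocab_head(x8).softmax(-1)           # second label distribution
        return p_tags, p_vocab

block = ClassifierBlock()
x = torch.randn(1, 16, 512)                                 # [batch, length, dim]
mask = torch.triu(torch.full((16, 16), float('-inf')), diagonal=1)
p_tags, p_vocab = block(x, mask)
```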
Still further, the information extraction submodule comprises a first matrix multiplication unit, a calculation unit, a second matrix multiplication unit and a splicing transformation unit;
the first matrix multiplication unit is used for multiplying the first matrix by the parameter matrixes of multiple batches obtained after pre-training respectively to obtain Q matrixes, K matrixes and V matrixes of multiple batches;
the calculating unit is configured to perform dot multiplication on the Q matrix and the K matrix of each batch, divide the first result obtained by the dot multiplication by the square root of the dimension corresponding to the Q matrix to obtain a second result, add the second result to a mask matrix, and then perform Softmax calculation to obtain a weight matrix, wherein the mask matrix is constructed based on the word information;
the second matrix multiplication unit is used for multiplying the weight matrix with the V matrixes of the corresponding batches to obtain a ninth matrix of each batch;
and the splicing transformation unit is used for splicing the ninth matrixes of all batches and linearly transforming the spliced matrix to obtain the second matrix.
Specifically, the first matrix multiplication unit multiplies the first matrix x4 by each of the pre-trained parameter matrices of the multiple batches to obtain the Q, K and V matrices of the multiple batches, according to Q = x4·WQ, K = x4·WK, V = x4·WV, where WQ, WK and WV are the parameter matrices. The calculation unit then uses the Q, K and V matrices to calculate the weight of each word in the input text, with the weights on future words set to 0 so that each word only receives the preceding information. The specific calculation formula is

A = Softmax(Q·K^T / √dk + M)

where A represents the weight matrix, dk represents the dimensionality of the Q, K or V matrices, and M represents the mask matrix. The second matrix multiplication unit multiplies the weight matrix by the V matrices of the corresponding batches to obtain a ninth matrix for each batch, and the splicing transformation unit splices the ninth matrices of all batches and linearly transforms the spliced matrix to obtain the second matrix x5.
Through the cooperation of the first matrix multiplication unit, the calculation unit, the second matrix multiplication unit and the splicing transformation unit, a masked multi-head attention structure (Masked Multi-Head Attention) is introduced: the context information of the text is extracted while the mask covers future information, so that the text can be classified better and the text classification accuracy is improved.
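A self-contained NumPy sketch of this masked attention computation follows; the head count, dimensions and random parameter matrices are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_multi_head_attention(x4, heads, W_O, mask):
    """heads: one (W_Q, W_K, W_V) triple per batch; mask: [L, L] of 0 / -inf."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = x4 @ W_Q, x4 @ W_K, x4 @ W_V       # Q = x4·WQ, etc.
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k) + mask)   # weight matrix A
        outputs.append(A @ V)                        # ninth matrix of each batch
    spliced = np.concatenate(outputs, axis=-1)       # splice all batches
    return spliced @ W_O                             # linear transform -> x5

L, D, h = 8, 512, 8
d = D // h
rng = np.random.default_rng(0)
heads = [(rng.normal(size=(D, d)), rng.normal(size=(D, d)), rng.normal(size=(D, d)))
         for _ in range(h)]
W_O = rng.normal(size=(D, D))
mask = np.triu(np.full((L, L), -np.inf), k=1)        # cover future information
x5 = masked_multi_head_attention(rng.normal(size=(L, D)), heads, W_O, mask)
```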
Still further, the linear transformation submodule includes a first linear transformation unit and a second linear transformation unit;
the first linear transformation unit is configured to average each word vector in the eighth matrix and then perform linear transformation to obtain the first label matrix;
and the second linear transformation unit is used for performing linear transformation on each word vector in the eighth matrix to obtain the second label matrix.
Specifically, since each tag word is composed of a plurality of characters, the first linear transformation unit averages the vectors of the characters following the generation identifier [START] and then performs a linear transformation to obtain the first label matrix x9; the second linear transformation unit performs only a linear transformation, applying it directly to each word vector in the eighth matrix to obtain the second label matrix x10.
By performing two different linear transformations in the first linear transformation unit and the second linear transformation unit, two different label matrices are obtained, supporting both classification over the existing labels and generation of new labels.
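For illustration, the two units can be sketched as follows; averaging over the whole sequence is a simplification of the per-tag-word character averaging described above, and all sizes are assumed:

```python
import numpy as np

def label_heads(x8: np.ndarray, W_tag: np.ndarray, W_vocab: np.ndarray):
    # First unit: average the word vectors, then project to the tag set.
    x9 = x8.mean(axis=0) @ W_tag        # toward the first label matrix
    # Second unit: project every word vector directly to the word list.
    x10 = x8 @ W_vocab                  # second label matrix, one row per word
    return x9, x10

L1, D, n_tags, vocab = 8, 512, 50, 30000
rng = np.random.default_rng(1)
x9, x10 = label_heads(rng.normal(size=(L1, D)),
                      rng.normal(size=(D, n_tags)),
                      rng.normal(size=(D, vocab)))
```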
Still further, the classification module 104 further includes a Dropout sub-module;
and the Dropout submodule is used for processing the second matrix through a Dropout layer in the classification model so as to inhibit overfitting of the classification model.
The Dropout submodule effectively suppresses overfitting of the classification model.
Still further, the computing unit comprises a first data constructing subunit, a second data constructing subunit and a splicing subunit;
the first data construction subunit is configured to associate the contents of the information composed of the plurality of mask-containing words with one another and construct first data according to the association relation;
the second data construction subunit is configured to construct second data based on the content of the keyword information, wherein each position receives only the information transmitted by the preceding text in the word information;
and the splicing subunit is configured to splice the first data and the second data, and fill the vacant positions after splicing with filling content to obtain the mask matrix.
Specifically, the word information is divided into two parts: one part consists of a plurality of words containing masks, and the other is the keyword information part. The first data construction subunit associates the plurality of mask-containing words with one another, i.e., it can extract context information, and constructs the first data according to this association. The second data construction subunit must mask future information for the keyword information part, i.e., information in this part is transmitted in a single direction, with each position receiving only information transmitted from earlier positions, thereby constructing the second data. In the mask matrix, the first data and second data positions are represented directly by 0, while the filler content positions are set to negative infinity (−∞).
Through the cooperation of the first data construction subunit, the second data construction subunit and the splicing subunit, the mask matrix is constructed to cover future information, converting the single encoder/decoder into a generative model.
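A sketch of this construction, assuming (as the description suggests) that the mask-containing word part attends bidirectionally, the keyword part attends only to earlier positions, and filler positions take negative infinity; the segment lengths are illustrative:

```python
import numpy as np

def build_mask(n_masked_words: int, n_keywords: int) -> np.ndarray:
    """First data: mask-containing words, mutually associated (entries 0).
    Second data: keyword part, receiving only earlier positions (entries 0).
    Remaining filler positions: negative infinity."""
    n = n_masked_words + n_keywords
    m = np.full((n, n), -np.inf)                  # filling content: -inf
    m[:n_masked_words, :n_masked_words] = 0.0     # first data
    for i in range(n_keywords):                   # second data: unidirectional
        m[n_masked_words + i, : n_masked_words + i + 1] = 0.0
    return m

mask_matrix = build_mask(6, 4)   # 10x10 mask, added to Q·K^T before the Softmax
```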
And the output module 105 is configured to determine the category of the text data to be classified based on the first label probability distribution and the second label probability distribution.
Specifically, since the first label probability distribution is obtained over the existing labels and the second label probability distribution is over newly generated labels, the output module 105 selects the label by considering both distributions: when the first label probability distribution does not meet the requirement, the label is selected according to the second label probability distribution.
Further, the output module 105 includes a label judgment submodule, an extraction and storage submodule, and a label extraction submodule;
the label judgment submodule is used for acquiring a label corresponding to the probability maximum value in the first label probability distribution and judging whether the probability maximum value is greater than or equal to a preset value;
the extracting and storing submodule is used for taking the label corresponding to the probability maximum value in the second label probability distribution as the category to which the text data to be classified belongs if the probability maximum value is smaller than the preset value, and storing the label corresponding to the probability maximum value in the second label probability distribution into a label word list;
and the label extraction submodule is used for taking the label corresponding to the probability maximum value in the first label probability distribution as the category to which the text data to be classified belongs if the probability maximum value is greater than or equal to the preset numerical value.
Specifically, since the first label probability distribution is obtained based on the existing labels, the label judgment submodule judges whether the maximum probability value in the first label probability distribution is greater than or equal to a preset value. When it is, the label extraction submodule directly takes the label corresponding to that maximum probability value as the category to which the text data to be classified belongs. When the maximum probability value in the first label probability distribution is smaller than the preset value, the extraction and storage submodule takes the label corresponding to the maximum probability value in the second label probability distribution as the category of the text data to be classified and stores that label into the label word list.
Through the cooperation of the label judgment submodule, the extraction and storage submodule and the label extraction submodule, the first label probability distribution is compared against a preset value and a label is obtained according to the comparison result, ensuring that the label closest to the text data to be classified is selected and improving the effectiveness of text classification.
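This selection logic can be sketched as follows; the preset value of 0.5 and the container types are illustrative assumptions:

```python
import numpy as np

def select_label(p_first, p_second, existing_labels, word_list,
                 preset_value=0.5, label_word_list=None):
    """Use the best existing label if its probability clears the preset value;
    otherwise take the newly generated label and store it."""
    i = int(np.argmax(p_first))
    if p_first[i] >= preset_value:
        return existing_labels[i]            # label extraction submodule path
    j = int(np.argmax(p_second))
    new_label = word_list[j]
    if label_word_list is not None:
        label_word_list.append(new_label)    # store into the label word list
    return new_label
```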
With this arrangement, through the cooperation of the acquisition module 101, the information extraction module 102, the vectorization module 103, the classification module 104 and the output module 105, the text classification device 100 not only has the capability of classifying among designated labels but also has the capability of generating labels, improving the fault tolerance of labeling.
The embodiment of the application also provides computer equipment. Referring to fig. 5, fig. 5 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 communicatively connected to each other via a system bus. It is noted that only a computer device 4 having components 41-43 is shown, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as computer readable instructions of a text classification method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, such as computer readable instructions for executing the text classification method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
In this embodiment, when the processor executes the computer readable instructions stored in the memory, the steps of the text classification method in the above embodiments are implemented. Text data to be classified is obtained; keywords are extracted from the text data to obtain word information, and position coding is performed according to the word information to obtain corresponding position information, so that the subsequent classification model can better understand the text to be classified. The word information and the position information are embedded respectively, and the resulting vectors are combined to obtain a text matrix. The text matrix is input into a pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, where the first label probability distribution is over the existing labels and the second label probability distribution is over labels newly generated for the text. The category of the text data to be classified is then determined based on the two distributions, so that the device not only classifies among designated labels with improved accuracy but also has the capability of generating labels, improving the fault tolerance of labeling.
The embodiment of the present application further provides a computer-readable storage medium storing computer readable instructions executable by at least one processor, so that the at least one processor executes the steps of the text classification method described above: obtaining text data to be classified; extracting keywords from the text data to obtain word information, and performing position coding according to the word information to obtain corresponding position information, so that the subsequent classification model can better understand the text to be classified; embedding the word information and the position information respectively and combining the resulting vectors into a text matrix; inputting the text matrix into a pre-trained classification model to obtain a first label probability distribution (over the existing labels) and a second label probability distribution (over labels newly generated for the text); and determining the category of the text data to be classified based on the two distributions, so that the method not only classifies among designated labels with improved accuracy but also has the capability of generating labels, improving the fault tolerance of labeling.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
The text classification device, the computer device, and the computer-readable storage medium according to the foregoing embodiments of the present application have the same technical effects as the text classification method according to the foregoing embodiments, and are not repeated here.
It should be understood that the above-described embodiments are merely illustrative and not restrictive, and that the appended drawings illustrate preferred embodiments without limiting the scope of this application. This application may be embodied in many different forms; the embodiments are provided so that the disclosure will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A method of text classification, the method comprising:
acquiring text data to be classified;
extracting keywords from the text data to be classified to obtain word information, and performing position coding according to the word information to obtain corresponding position information;
embedding the word information and the position information respectively, and combining vectors obtained after embedding to obtain a text matrix;
processing the text matrix through a pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, wherein the pre-trained classification model comprises a masked multi-head attention structure;
and determining the category of the text data to be classified based on the first label probability distribution and the second label probability distribution.
2. The text classification method according to claim 1, wherein the processing of the text matrix by the pre-trained classification model comprises:
normalizing the text matrix through a normalization layer in the pre-trained classification model to obtain a first matrix;
performing information extraction on the first matrix through the masked multi-head attention structure in the pre-trained classification model to obtain a second matrix containing the context information of the text data to be classified;
performing residual connection on the text matrix and the second matrix to obtain a third matrix, and normalizing the third matrix through a normalization layer in the pre-trained classification model to obtain a fourth matrix;
mapping the fourth matrix through a feedforward network layer in the pre-trained classification model to obtain a fifth matrix;
processing the fifth matrix through an activation function layer in the pre-trained classification model to obtain a sixth matrix, performing residual connection on the sixth matrix and the fourth matrix to obtain a seventh matrix, and performing normalization processing on the seventh matrix through a normalization layer in the pre-trained classification model to obtain an eighth matrix;
respectively carrying out linear transformation twice on the eighth matrix to obtain a first label matrix and a second label matrix;
and mapping the first label matrix and the second label matrix through a Softmax layer in the classification model to obtain the first label probability distribution and the second label probability distribution.
3. The text classification method according to claim 2, wherein performing information extraction on the first matrix through the masked multi-head attention structure in the pre-trained classification model to obtain the second matrix containing the context information of the text data to be classified comprises:
multiplying the first matrix by the parameter matrixes of multiple batches obtained after pre-training respectively to obtain Q matrixes, K matrixes and V matrixes of multiple batches;
performing dot multiplication on the Q matrix and the K matrix of each batch, dividing the first result obtained by the dot multiplication by the square root of the dimension corresponding to the Q matrix to obtain a second result, adding the second result to a mask matrix, and then performing Softmax calculation to obtain a weight matrix, wherein the mask matrix is constructed on the basis of the word information;
multiplying the weight matrix with the V matrixes of the corresponding batches to obtain a ninth matrix of each batch;
and splicing the ninth matrixes of all batches, and performing linear transformation on the spliced matrix to obtain the second matrix.
4. The text classification method according to claim 1, wherein the determining the category to which the text data to be classified belongs based on the first label probability distribution and the second label probability distribution comprises:
acquiring a label corresponding to the probability maximum value in the first label probability distribution, and judging whether the probability maximum value is greater than or equal to a preset value;
if the probability maximum value is smaller than the preset numerical value, taking the label corresponding to the probability maximum value in the second label probability distribution as the category to which the text data to be classified belongs, and storing the label corresponding to the probability maximum value in the second label probability distribution into a label word list;
and if the probability maximum value is larger than or equal to the preset numerical value, taking a label corresponding to the probability maximum value in the first label probability distribution as the category to which the text data to be classified belongs.
5. The text classification method according to claim 1, wherein the extracting the keywords from the text data to be classified comprises:
performing word segmentation on the text data to be classified by using the jieba word segmentation tool to obtain a plurality of corresponding words;
and performing keyword extraction on the words by using a keyword extraction algorithm, extracting keywords at a preset proportion, and replacing the extracted keywords in the words with a mask.
6. The text classification method according to claim 5, wherein the extracting keywords from the text data to be classified to obtain word information, and performing position coding according to the word information to obtain corresponding position information comprises:
sorting the keywords based on the weights obtained by the keyword extraction algorithm, and setting an identifier in front of each keyword to obtain keyword information;
combining the information consisting of the plurality of mask-containing words with the keyword information to obtain the word information;
and carrying out position coding according to the word information to obtain corresponding position information.
7. The text classification method according to claim 6, wherein constructing the mask matrix based on the word information comprises:
associating the contents of the information consisting of the plurality of mask-containing words with one another, and constructing first data according to the association relation;
constructing second data based on the content of the keyword information, wherein each position receives only the information transmitted by the preceding text in the word information;
and splicing the first data and the second data, and filling the spliced vacant positions with filling contents to obtain the mask matrix.
8. An apparatus for classifying text, the apparatus comprising:
the acquisition module is used for acquiring text data to be classified;
the information extraction module is used for extracting keywords from the text data to be classified to obtain word information, and performing position coding according to the word information to obtain corresponding position information;
the vectorization module is used for respectively embedding the word information and the position information, and merging vectors obtained after embedding processing to obtain a text matrix;
the classification module is used for processing the text matrix through a pre-trained classification model to obtain a first label probability distribution and a second label probability distribution, wherein the pre-trained classification model comprises a masked multi-head attention structure;
and the output module is used for determining the category of the text data to be classified based on the first label probability distribution and the second label probability distribution.
9. A computer device, characterized in that the computer device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores computer readable instructions which, when executed by the processor, implement the text classification method of any of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the text classification method of any one of claims 1 to 7.