CN113011533A - Text classification method and device, computer equipment and storage medium - Google Patents
Text classification method and device, computer equipment and storage medium
- Publication number
- CN113011533A (application CN202110482695.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- classification
- target
- training
- model
- Prior art date
- Legal status: Granted
Classifications
- G06F18/24: Pattern recognition; classification techniques
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F40/211: Natural language analysis; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/284: Natural language analysis; lexical analysis, e.g. tokenisation or collocates
- G06N3/04: Neural networks; architecture, e.g. interconnection topology
- G06N3/08: Neural networks; learning methods
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a text classification method, a text classification device, computer equipment and a storage medium, wherein the method comprises the following steps: extracting target text data to be analyzed from an original text; preprocessing the target text data to obtain a word segmentation result of the target text data; obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result; and inputting the target character vector, the target word vector and the target position vector into a pre-trained text classification model to obtain a target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model. The method processes text data with the albert model and effectively improves the efficiency and accuracy of text classification.
Description
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text classification method, apparatus, computer device, and storage medium.
Background
With the rapid development of network technology, massive information resources exist in the form of text. How to classify these texts effectively and mine useful information from them quickly, accurately and comprehensively has become one of the hot topics in natural language processing research. Text classification assigns each document in a document set to a category according to predefined subject categories. The technique is widely used in daily applications, for example in assigning patent texts to technical fields.
Compared with general text, patent text has a specialized structure, highly technical language and abundant domain-specific vocabulary, so a more targeted classification method is required. Patent text classification belongs to the field of natural language processing and generally comprises data preprocessing, text feature representation, classifier selection and effect evaluation, of which text feature representation and classifier selection are the most important, as they directly affect the accuracy of the classification result.
In the prior art, text classification methods based on traditional machine learning, such as the TF-IDF method, measure the importance of words only by word frequency and then form a feature-value sequence for the document; the words are treated as independent of each other, so sequence information cannot be captured. The method is also susceptible to skewed data sets: if one class contains too many documents, the IDF is underestimated, and the usual remedy is to increase the class weight. Intra-class and inter-class distribution bias is not considered when the method is used for feature selection. Text classification methods based on deep learning include Facebook's open-source FastText, Text-CNN and Text-RNN. TextCNN performs well on many tasks, but CNNs have the fundamental limitation of a fixed filter_size view: longer sequence information cannot be modeled, and tuning the filter_size hyperparameter is cumbersome. In essence, a CNN performs feature extraction on the text, whereas a recurrent neural network (RNN), which is more commonly used in natural language processing, expresses context information better. Although CNNs and RNNs achieve clear results on text classification tasks, they are not intuitive and are poorly interpretable, which becomes especially apparent when analyzing bad cases.
Disclosure of Invention
The application provides a text classification method, a text classification device, computer equipment and a storage medium.
A first aspect provides a text classification method, the method comprising:
extracting target text data to be analyzed from an original text;
preprocessing the target text data to obtain a word segmentation result of the target text data;
inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
In some embodiments, before extracting text data to be classified from the original text, the method further includes:
extracting keywords in the original text to be processed, and forming a keyword set;
determining the word frequency-inverse document frequency of the keyword set in the corpus of each category based on a TF-IDF model;
determining confidence coefficients of the original text belonging to all categories based on word frequency-inverse document frequency of the keyword set of the original text in the corpus of all categories;
determining a primary classification label of the original text according to the confidence coefficient of the original text belonging to each category;
and matching the primary classification label with preset primary classification label information, and determining whether to adopt the text classification model to perform text classification on the original text according to a matching result.
In some embodiments, the preprocessing the text data to obtain a word segmentation result includes:
performing at least one of stop-word removal and deduplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain a word segmentation result.
In some embodiments, the method further comprises pre-training the text classification model, which includes:
acquiring a first training sample set, wherein the first training sample set comprises a first training text, and the first training text comprises a corresponding first classification label;
pre-training an albert model by taking the first classification label as a classification target based on the first training sample set to obtain an initial text classification model;
judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold;
if the accuracy is greater than the preset threshold, taking the initial text classification model as the final text classification model;
and if the accuracy is not greater than the preset threshold, correcting errors in the classification labels corresponding to the first training texts, and iterating the initial text classification model based on the error-corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
In some embodiments, the determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold includes:
acquiring a second training sample set, wherein the second training sample set comprises a second training text;
obtaining a prediction classification label corresponding to a second training text in the second training sample set based on the initial text classification model;
and judging whether the accuracy of the classification result of the initial classification model is greater than a preset threshold value or not according to the prediction classification label and a second classification label corresponding to the second training text, wherein the second classification label is manually labeled by a user.
In some embodiments, the pre-training the albert model with the first classification label as a classification target based on the first training sample set to obtain an initial text classification model includes:
dividing the first training sample set into training data and verification data according to a preset proportion;
inputting the training data into an initial text classification model to be trained for model training;
and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to a verification result.
In some embodiments, the correcting the classification label corresponding to the first training text includes:
reviewing the prediction results to separate correctly predicted first training texts from incorrectly predicted ones;
and manually relabeling the incorrectly predicted first training texts so that their labels are correct.
A second aspect provides a text classification apparatus, including:
the target text acquisition module is used for extracting target text data to be analyzed from the original text;
the word segmentation module is used for preprocessing the target text data to obtain a word segmentation result of the target text data;
the classification module is used for inputting the word segmentation result into a trained text classification model, and the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
A third aspect provides a computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the text classification method described above.
A fourth aspect provides a storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text classification method described above.
The text classification method first extracts the target text data to be analyzed from an original text; it then preprocesses the target text data to obtain a word segmentation result; finally, it inputs the word segmentation result into a trained text classification model, which derives a target character vector, a target word vector and a target position vector corresponding to the target text data from the word segmentation result and obtains a target classification label from those vectors. Because the albert model is used to process the text data, the resulting vector sequence contains both the textual content and the context of the text data, so full-text semantic information is fused in; the encoded information is therefore more comprehensive, which facilitates subsequent classification, improves its accuracy, and improves the overall classification effect.
Drawings
FIG. 1 is a diagram of an implementation environment for a text classification method provided in one embodiment;
FIG. 2 is a block diagram showing an internal configuration of a computer device according to an embodiment;
FIG. 3 is a flow diagram of a method of text classification in one embodiment;
FIG. 4 is a block diagram showing the structure of a text classification apparatus in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
FIG. 1 is a diagram of an implementation environment of the text classification method provided in an embodiment. As shown in fig. 1, the implementation environment includes a computer device 110 and a terminal 120.
The computer device 110 is a text classification server. The terminal 120 acquires the text to be classified and has an output interface for classification results; when text classification is required, the text to be classified is acquired through the terminal 120 and classified by the computer device 110.
It should be noted that the terminal 120 and the computer device 110 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The computer device 110 and the terminal 120 may be connected through Bluetooth, USB (Universal Serial Bus), or other communication connection methods, which are not limited herein.
FIG. 2 is a diagram showing an internal configuration of a computer device according to an embodiment. As shown in fig. 2, the computer device includes a processor, a storage medium, a memory, and a network API interface connected by a system bus. The storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a text classification method. The processor provides the computation and control capability that supports the operation of the whole computer device. The memory may also store computer-readable instructions that, when executed by the processor, cause the processor to perform the text classification method. The network API interface is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 2 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or arrange components differently.
For convenience of understanding, terms referred to in the embodiments of the present application will be first described below.
albert model: a language model published by google in 2018 that trains a deep bi-directional representation by joining bi-directional transducers in all layers. The albert model integrates the advantages of a plurality of natural language processing models and achieves better effect in a plurality of natural language processing tasks. In the related art, the model input vector of the albert model is the sum of the vectors of a word vector (TokenEmbedding), a position vector (PositionEmbedding), and a sentence vector (SegmentEmbedding). The word vector is vectorized representation of characters, the position vector is used for representing positions of the characters in the text, and the sentence vector is used for representing the sequence of sentences in the text.
Pre-training (pre-training): a process in which a neural network model learns the common features of a data set by being trained on a large data set. Pre-training is intended to provide good initial model parameters for subsequent training of the neural network model on a specific data set. In the embodiments of the application, pre-training refers to training the albert model with unlabeled training text.
Fine-tuning (fine-tuning): a process for further training a pre-trained neural network model using a particular data set. In general, the data amount of the data set used in the fine tuning stage is smaller than that of the data set used in the pre-training stage, and the fine tuning stage adopts a supervised learning manner, that is, the training samples in the data set used in the fine tuning stage include labeled information. The fine tuning stage in the embodiment of the present application refers to training the albert model using a training text containing classification labels.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
As shown in fig. 3, in an embodiment, a text classification method is provided, which may be applied to the computer device 110 described above, and specifically includes the following steps:
step 101, extracting target text data to be analyzed from an original text;
the original text can be a patent text, and the patent text has the characteristics of special structure, strong professional, more field vocabularies and the like, and a more targeted classification method needs to be adopted. The patent text classification belongs to the field of natural language processing, and generally comprises the steps of data preprocessing, text feature representation, classifier selection, effect evaluation and the like, wherein the text feature representation and the classifier selection are the most important, and the accuracy of a classification result is directly influenced.
In the present embodiment, text data of the specification abstract, the claims, and the title part of the specification in the patent text is extracted as target text data.
Step 102, preprocessing target text data to obtain word segmentation results of the target text data;
In this embodiment, the target text data is preprocessed to extract useful data and delete noise data from the original text, so that text data irrelevant to the purpose of the extraction can be removed.
In some embodiments, the step 102 may include: performing at least one of stop-word removal and deduplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain a word segmentation result.
When deleting noise data, duplicate data in the original text data is removed by deduplication, and noise data in the original text data is removed by deletion.
Stop words are characters or words that are automatically filtered out before or after processing natural language text in information retrieval, in order to save storage space and improve search efficiency; such characters or words are called stop words (Stop Words).
In this embodiment, stop-word removal can strip words that contribute nothing to the text features, such as punctuation marks, modal particles, names, meaningless garbled characters, spaces, and the like. The method selected here is stop-word-list filtering: the words in the text data are matched one-to-one against a constructed stop-word list, and if a match succeeds, the word is a stop word and is deleted.
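As a concrete illustration, a minimal sketch of such stop-word-list filtering in Python follows; the stop-word entries are placeholder assumptions, not a prescribed list:

```python
# A minimal sketch of stop-word-list filtering, assuming the list has already
# been constructed; the entries below are illustrative placeholders only.
stopwords = {"的", "了", "是", "和", ",", "。", " "}

def filter_stopwords(tokens):
    """One-to-one matching against the stop-word list: a token that matches
    an entry is treated as a stop word and deleted."""
    return [tok for tok in tokens if tok not in stopwords]

print(filter_stopwords(["本", "发明", "涉及", "的", "文本", "分类", "方法", "。"]))
# -> ['本', '发明', '涉及', '文本', '分类', '方法']
```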
To obtain the target text data in vector form, the second text data first needs to be segmented into words. Word segmentation is a basic task in lexical analysis, and segmentation algorithms fall into two broad categories according to their core idea. One is dictionary-based segmentation, which first splits the text into words according to a dictionary and then searches for the optimal combination of those words. The other is character-based segmentation, in which words are built from characters: the sentence is first split into individual characters, the characters are then combined into words, and an optimal segmentation strategy is sought; this can also be cast as a sequence labeling problem. The segmentation algorithm adopted in this embodiment may be a rule-based, understanding-based or statistics-based method.
A rule-based segmentation method (e.g., one based on string matching) matches the Chinese character string to be analyzed against the entries of a "sufficiently large" dictionary according to a certain policy; if a string is found in the dictionary, the match succeeds and a word is recognized. Common rule-based methods include forward maximum matching (scanning left to right), reverse maximum matching (scanning right to left), and least segmentation (minimizing the number of words cut from each sentence). Forward maximum matching takes a substring of bounded length from the text and matches it against the dictionary; if the match succeeds, the next round of matching proceeds until the whole string is processed; otherwise one character is removed from the end of the substring and matching is retried, and so on. Reverse maximum matching works analogously.
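A minimal sketch of forward maximum matching follows; the toy dictionary and the maximum word length are illustrative assumptions:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching: scan left to right, trying the longest
    dictionary entry first and shrinking the window on each failed match;
    unmatched single characters pass through as-is."""
    tokens, i = [], 0
    while i < len(text):
        matched = False
        for size in range(min(max_len, len(text) - i), 1, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                tokens.append(piece)
                i += size
                matched = True
                break
        if not matched:  # no multi-character entry matched at this position
            tokens.append(text[i])
            i += 1
    return tokens

# Toy dictionary; the entries are assumptions for illustration only.
dictionary = {"文本", "分类", "文本分类", "方法"}
print(forward_max_match("文本分类方法", dictionary))  # ['文本分类', '方法']
```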
An understanding-based segmentation method has the computer simulate a person's understanding of a sentence in order to recognize words: syntactic and semantic analysis is performed during segmentation, and syntactic and semantic information is used to resolve ambiguity. A statistics-based segmentation method rests on the observation that a word is formally a stable combination of characters, so the more often adjacent characters co-occur, the more likely they are to constitute a word; the frequency or probability of characters co-occurring with their neighbors therefore reflects the credibility of a word. The mutual information of adjacent co-occurring characters in the text data is computed by counting the frequency of their combinations; it reflects how tightly the characters are bound, and when it exceeds a threshold the character group may be considered to form a word. In practical applications, a statistical segmentation system usually uses a basic segmentation dictionary for string-matching segmentation while using statistical methods to identify new words. Combining string frequency statistics with string matching retains the speed and efficiency of dictionary matching while gaining the advantages of dictionary-free segmentation: recognizing words from context and resolving ambiguity automatically.
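For reference, the open-source jieba segmenter implements exactly this combination of dictionary matching and statistics (a prefix dictionary plus an HMM for out-of-vocabulary words); using it here is an illustrative choice, since this embodiment does not name a specific tool:

```python
import jieba  # open-source Chinese word segmenter

text = "本发明涉及一种文本分类方法"
print(jieba.lcut(text))             # dictionary matching + HMM for new words
print(jieba.lcut(text, HMM=False))  # pure dictionary matching, for comparison
```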
After word segmentation, the original text data is represented by a series of keywords, but data in text form cannot be processed directly by the subsequent classification algorithm and must be converted into numerical form; therefore, the keywords are converted into word vectors to obtain the text data to be classified in text-vector form.
Step 103, inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
In the embodiment of the application, three kinds of vectors are used when training the albert model: character vectors, word vectors and position vectors.
Optionally, the word vectors are obtained by converting the words with a word2vec model.
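A minimal sketch of such a word2vec conversion with the gensim library follows; the toy corpus and the hyperparameters are assumptions:

```python
from gensim.models import Word2Vec

# Each training document is a list of segmented words (toy corpus).
corpus = [
    ["文本", "分类", "方法"],
    ["文本", "数据", "预处理"],
    ["分类", "模型", "训练"],
]
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=10)
vector = model.wv["文本"]  # 100-dimensional word vector for the word "文本"
```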
In this embodiment, step 103 may include the following steps:
Step 1031, obtaining word vectors corresponding to the text data according to the part-of-speech and position information of the text data. In this embodiment, position information is added to the text data using positional coding, the text data with added position information is represented by an initial word vector, the part of speech of the text data is acquired and converted into a part-of-speech vector, and the initial word vector and the part-of-speech vector are added together to obtain the word vector corresponding to the text data.
Step 1032, inputting the word vectors into the albert model for data processing to obtain a word matrix of the text data.
Step 1033, acquiring a word vector sequence of the text data according to the word matrix. In this embodiment, the word matrix is used to predict whether two sentences in the text data are consecutive, as well as the masked words in the two sentences and their part-of-speech features, and the part-of-speech features are normalized to obtain the word vector sequence of the text data.
It should be understood that the albert model used in this embodiment is a model obtained through pre-training, so when processing text data, only the text data needs to be input into the pre-trained albert model to obtain a word vector sequence corresponding to the text data.
In order for the albert model to perform text classification, a classifier needs to be arranged in the albert model. Optionally, the classification categories and their number depend on the classification task that the text classification model needs to implement, and the classifier may be a multi-class classifier (such as a softmax classifier). The specific type of classifier is not limited in the embodiments of the present application.
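A hedged sketch of an albert model with a softmax classification head, using the Hugging Face transformers library, follows. The checkpoint name and the number of labels are assumptions, and the BERT-style tokenizer reflects the fact that publicly available Chinese ALBERT checkpoints typically ship a BERT vocabulary:

```python
import torch
from transformers import BertTokenizer, AlbertForSequenceClassification

# Checkpoint name and number of classes are illustrative assumptions.
name = "voidful/albert_chinese_tiny"
tokenizer = BertTokenizer.from_pretrained(name)
model = AlbertForSequenceClassification.from_pretrained(name, num_labels=8)

inputs = tokenizer("一种文本分类方法", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)    # softmax classifier output
target_label = int(probs.argmax(dim=-1)) # predicted classification label
```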
In some embodiments, before extracting text data to be classified from an original text, the text classification method further includes:
step 100a, extracting keywords in an original text and forming a keyword set;
step 100b, determining word frequency-inverse document frequency of the keyword set in the corpus of each category based on the TF-IDF model;
specifically, text features matched with the keywords in the text features of the corpus of the category are determined, and the word frequency-inverse document frequency of the matched text features is used as the word frequency-inverse document frequency of the keywords. The text in a corpus of a certain category is divided into a plurality of sentences according to punctuation marks such as periods, question marks, exclamation marks, semicolons and the like, and text features in each sentence are extracted. And respectively establishing a text feature library for each category according to the extracted text features. And respectively counting the frequency of each text feature under each category. And counting the inverse document frequency of each text feature, namely the natural logarithm value of the quotient of the total category number and the category number containing the text feature, and respectively calculating the word frequency-inverse document frequency of each text feature under each category.
Step 100c, determining confidence coefficients of the original text belonging to each category based on the word frequency-inverse document frequency of the keyword set of the original text in the corpus of each category;
Specifically, the following operations are performed for each category: determining the number of occurrences of the keywords in the corpus of the category; determining the class-conditional probability of the original text relative to the category according to the word frequency-inverse document frequency of the keywords in the corpus of the category and their number of occurrences; and determining the confidence degree that the text to be classified belongs to the category according to the class-conditional probability of the original text relative to the category.
Step 100d, determining a primary classification label of the original text according to the confidence coefficient of the original text belonging to each category;
Specifically, among the confidence degrees of the original text belonging to each category, the category with the highest confidence is used as the primary classification label of the text to be classified.
And step 100e, matching the primary classification labels with preset primary classification label information, and determining whether to adopt a text classification model to perform text classification on the original text according to a matching result.
It can be understood that the target classification label obtained in steps 101 to 103 above is the bottom-level classification label of the patent text. For example, a patent text may carry three levels of classification labels: there is only one first-level label, while there are at least two second-level labels and at least two third-level labels. In this step, the first-level (primary) classification label is therefore determined from the keywords of the original text via the TF-IDF model; if it does not match the preset primary classification label information of the patent text, the original text does not need to be classified further by the model. The primary classification label sits above the bottom-level label and can be set manually.
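A simplified sketch of this primary classification follows; the per-category corpora and the use of summed TF-IDF as the confidence score are illustrative assumptions, since the embodiment leaves the exact confidence formula open:

```python
import math
from collections import Counter

# Toy per-category corpora of segmented words (placeholders).
corpora = {
    "云计算": ["虚拟机", "云", "计算", "资源", "调度", "云"],
    "图像处理": ["图像", "分割", "像素", "卷积", "图像", "滤波"],
}

def tfidf(term, category):
    """Term frequency within the category corpus, times the category-level
    inverse document frequency ln(total categories / categories containing term)."""
    words = corpora[category]
    tf = Counter(words)[term] / len(words)
    df = sum(term in words for words in corpora.values())
    return tf * math.log(len(corpora) / df) if df else 0.0

def confidences(keywords):
    # Confidence of each category, approximated by the summed TF-IDF of the keyword set.
    return {cat: sum(tfidf(k, cat) for k in keywords) for cat in corpora}

scores = confidences(["图像", "卷积"])        # keyword set of the original text
primary_label = max(scores, key=scores.get)  # -> "图像处理"
```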
In some embodiments, the text classification method further includes pre-training the text classification model, which comprises the following steps:
1001, acquiring a first training sample set, wherein the first training sample set comprises a first training text, and the first training text comprises a corresponding first classification label;
optionally, the first training sample set is a specific data set related to text classification, where the training text includes corresponding classification labels, the classification labels may be labeled manually, and the classification labels belong to a classification result of a text classification model. In one illustrative example, when a text classification model is used to classify patent text, the classification labels include specific different technical fields, such as cloud computing, image processing, and the like. The embodiment of the present application does not limit the specific content of the classification label.
Step 1002, pre-training an albert model by taking a first classification label as a classification target based on a first training sample set to obtain an initial text classification model;
the step 1002 may include:
dividing the first training sample set into training data and verification data according to a preset proportion;
inputting training data into an initial text classification model to be trained for model training;
and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to a verification result.
In this step, the first training sample set is divided in a 9:1 ratio: 90% of it serves as the training set and 10% as the verification set. After the model has been trained on the 90% of the data, a prediction model is generated and used to predict the remaining 10% of the samples, and the model parameters are tuned according to the results to obtain the initial text classification model.
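A minimal sketch of the 9:1 split with scikit-learn follows; the texts and labels are placeholders:

```python
from sklearn.model_selection import train_test_split

texts = [f"训练文本{i}" for i in range(100)]  # placeholder first training texts
labels = [i % 4 for i in range(100)]          # placeholder first classification labels

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)  # 90% training data, 10% verification data
```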
Step 1003, judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold;
step 1004, if the accuracy is greater than the preset threshold, taking the initial text classification model as the final text classification model;
step 1005, if the accuracy is not greater than the preset threshold, correcting errors in the classification labels corresponding to the first training texts, and iterating the initial text classification model based on the error-corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
It can be understood that, in step 1005, iterating the initial text classification model based on the error-corrected first training sample set means that the initial model may be optimized on all or part of the corrected sample set. The number of iterations is determined by checking whether the accuracy of the fine-tuned initial text classification model exceeds the preset threshold: if it does, iteration stops; if not, optimization training of the initial text classification model continues.
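Schematically, this iterate-until-accurate logic can be expressed as follows; train, evaluate and relabel_errors are hypothetical helpers standing in for model training, the accuracy check on verification data, and manual label correction:

```python
PRESET_THRESHOLD = 0.95  # preset accuracy threshold (the value is an assumption)

def iterate_until_accurate(samples, train, evaluate, relabel_errors):
    """Retrain on the error-corrected sample set until accuracy exceeds the threshold."""
    model = train(samples)
    while evaluate(model) <= PRESET_THRESHOLD:
        samples = relabel_errors(model, samples)  # manual error correction of labels
        model = train(samples)                    # iterate the initial model
    return model                                  # final text classification model
```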
In the step 1003, determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold may include:
1003a, obtaining a second training sample set, wherein the second training sample set comprises a second training text;
1003b, obtaining a prediction classification label corresponding to the second training text in the second training sample set based on the initial text classification model;
1003c, judging whether the accuracy of the classification result of the initial classification model is greater than the preset threshold according to the prediction classification labels and the second classification labels corresponding to the second training texts, wherein the second classification labels are manually annotated by a user.
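A minimal sketch of this accuracy check with scikit-learn follows; the label values are placeholders:

```python
from sklearn.metrics import accuracy_score

predicted_labels = [0, 1, 2, 1, 0]  # prediction classification labels from the initial model
manual_labels = [0, 1, 2, 0, 0]     # user-annotated second classification labels

accuracy = accuracy_score(manual_labels, predicted_labels)  # 0.8 for this toy data
exceeds_threshold = accuracy > 0.95
```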
In this embodiment, a second training sample set different from the first training sample set is used as verification data for checking the accuracy of the classification result of the initial text classification model. This extends the data available to the initial classification model and avoids the low accuracy that errors in the original classification labels of the first training sample set would otherwise cause.
In the step 1005, the correcting the classification label corresponding to the first training text may include:
1005a, auditing the prediction result to obtain a first training text with correct prediction and a first training text with wrong prediction;
and 1005b, manually labeling the first training text with the prediction error so as to correctly label the label of the first training text with the prediction error.
In this embodiment, when the initial predictions of the initial text classification model are inaccurate, the model is iterated as described above so that its predictions become more accurate.
In some embodiments, the computer device adjusts the network parameters of the albert model with gradient descent and back propagation according to the error between the prediction result and the classification label, until the error satisfies a convergence condition.
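A hedged sketch of one such parameter update step in PyTorch follows; the checkpoint, optimizer and learning rate are assumptions, and AlbertForSequenceClassification computes the cross-entropy error internally when labels are supplied:

```python
import torch
from transformers import BertTokenizer, AlbertForSequenceClassification

name = "voidful/albert_chinese_tiny"  # illustrative checkpoint, as before
tokenizer = BertTokenizer.from_pretrained(name)
model = AlbertForSequenceClassification.from_pretrained(name, num_labels=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["一种文本分类方法", "一种图像处理装置"]  # placeholder training texts
labels = torch.tensor([0, 1])                    # placeholder classification labels

model.train()
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
loss = model(**batch, labels=labels).loss  # error between prediction and labels
loss.backward()                            # back propagation
optimizer.step()                           # gradient-descent parameter update
optimizer.zero_grad()
```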
In one possible implementation, since the pre-trained albert model has learned the context of the text, the data size of the second training sample set used for fine tuning is much smaller than the data size of the first training sample set.
Similar to the pre-training process, in order for the text classification model to learn the mapping between texts and classification labels, the albert model is fine-tuned with the character vectors, position vectors and sentence vectors of the characters in the second training text as input.
In a possible implementation of the fine-tuning process, the computer device takes the second target character vector, the second target word vector and the second target position vector of the second training sample set as input vectors of the albert model to obtain the text classification prediction result output by the model; the albert model is then fine-tuned with the classification labels corresponding to the second training texts as supervision, and the text classification model is finally obtained through training.
As shown in fig. 4, in one embodiment, a text classification apparatus is provided, which may be integrated in the computer device 110, and may specifically include
A target text obtaining module 411, configured to extract target text data to be analyzed from an original text;
the word segmentation module 412 is configured to pre-process the target text data to obtain a word segmentation result of the target text data;
the vector obtaining module 413 is configured to obtain a target character vector, a target word vector and a target position vector corresponding to the words in the target text to be classified;
the classification module 414 is configured to input the target character vector, the target word vector and the target position vector into the text classification model to obtain the target classification label output by the text classification model, where the text classification model is trained as described in the method above.
In one embodiment, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: extracting target text data to be analyzed from an original text; preprocessing the target text data to obtain a word segmentation result of the target text data; obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result; and inputting the target character vector, the target word vector and the target position vector into a pre-trained text classification model to obtain a target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model.
In one embodiment, before extracting text data to be classified from the original text, the method further includes: extracting key words from the original text based on a TF-IDF model, and forming a key word set; determining a primary classification label of the original text according to the keyword set; and matching the primary classification label with preset primary classification label information, and determining whether to adopt a text classification model to perform text classification on the original text according to a matching result.
In one embodiment, the original text is patent text data, and the extracting text data to be classified from the original text includes: text data of a specification abstract, a claim and a specification title part in a patent text are extracted as text data to be classified.
In one embodiment, inputting the word segmentation result into a pre-trained albert model to obtain a word vector sequence corresponding to the text data, including: acquiring word vectors corresponding to the text data according to the part of speech and the position information of the text data; inputting the word vector into an albert model for data processing to obtain a word matrix of the text data; and acquiring a word vector sequence of the text data according to the word matrix.
In one embodiment, the determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold includes: acquiring a second training sample set, wherein the second training sample set comprises a second training text; obtaining a prediction classification label corresponding to a second training text in a second training sample set based on the initial text classification model; and judging whether the accuracy of the classification result of the initial classification model is greater than a preset threshold value or not according to the predicted classification label and a second classification label corresponding to the second training text, wherein the second classification label is manually labeled by a user.
In one embodiment, pre-training an albert model with a first classification label as a classification target based on a first training sample set to obtain an initial text classification model, includes: dividing the first training sample set into training data and verification data according to a preset proportion; inputting training data into an initial text classification model to be trained for model training; and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to a verification result.
In one embodiment, the correcting the classification label corresponding to the first training text includes:
auditing the prediction result to obtain a first training text with correct prediction and a first training text with wrong prediction;
and manually labeling the first training text with the prediction error so as to correctly label the label of the first training text with the prediction error.
In one embodiment, a storage medium is provided that stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: extracting target text data to be analyzed from an original text; preprocessing the target text data to obtain a word segmentation result of the target text data; obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result; and inputting the target character vector, the target word vector and the target position vector into a pre-trained text classification model to obtain a target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A method of text classification, the method comprising:
extracting target text data to be analyzed from an original text;
preprocessing the target text data to obtain a word segmentation result of the target text data;
inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
2. The text classification method according to claim 1, further comprising, before extracting text data to be classified from an original text:
extracting keywords in the original text to be processed, and forming a keyword set;
determining the word frequency-inverse document frequency of the keyword set in the corpus of each category based on a TF-IDF model;
determining confidence coefficients of the original text belonging to all categories based on word frequency-inverse document frequency of the keyword set of the original text in the corpus of all categories;
determining a primary classification label of the original text according to the confidence coefficient of the original text belonging to each category;
and matching the primary classification label with preset primary classification label information, and determining whether to adopt the text classification model to perform text classification on the original text according to a matching result.
3. The method according to claim 1, wherein the preprocessing the text data to obtain a word segmentation result comprises:
performing at least one of stop-word removal and deduplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain a word segmentation result.
4. The method of text classification according to claim 1, characterized in that the method further comprises training the text classification model, which comprises:
acquiring a first training sample set, wherein the first training sample set comprises a first training text, and the first training text comprises a corresponding first classification label;
pre-training an albert model by taking the first classification label as a classification target based on the first training sample set to obtain an initial text classification model;
judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold;
if the accuracy is greater than the preset threshold, taking the initial text classification model as a final text classification model;
and if the accuracy is not greater than the preset threshold, correcting errors in the classification labels corresponding to the first training texts, and iterating the initial text classification model based on the error-corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
5. The method for training the text classification model according to claim 4, wherein the determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold comprises:
acquiring a second training sample set, wherein the second training sample set comprises a second training text;
obtaining a prediction classification label corresponding to a second training text in the second training sample set based on the initial text classification model;
and judging whether the accuracy of the classification result of the initial classification model is greater than a preset threshold value or not according to the prediction classification label and a second classification label corresponding to the second training text, wherein the second classification label is manually labeled by a user.
6. The method according to claim 4, wherein the pre-training the ALBERT model with the first classification label as the classification target based on the first training sample set to obtain the initial text classification model comprises:
dividing the first training sample set into training data and verification data according to a preset proportion;
inputting the training data into the initial text classification model to be trained for model training;
and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to the verification result.
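A common realization of this split, assuming scikit-learn and an 8:2 preset proportion (the claim does not fix the ratio), is sketched below with placeholder data:

```python
# Sketch of claim 6's division into training and verification data.
from sklearn.model_selection import train_test_split

# Hypothetical first training sample set.
texts  = ["text %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
```

Stratifying on the labels keeps the class distribution of the verification data consistent with the training data, which makes the verification result a fairer basis for optimizing the model.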
7. The method according to claim 4, wherein the performing error correction on the classification label corresponding to the first training text comprises:
auditing the prediction results to obtain the first training texts with correct predictions and the first training texts with incorrect predictions;
and manually re-labeling the incorrectly predicted first training texts so that their classification labels are correct.
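A minimal sketch of the audit step, with illustrative names, simply partitions the first training texts by whether the model's prediction matched the current label; the incorrect partition is what gets handed to the human annotator:

```python
# Sketch of claim 7's audit: split first training texts by prediction correctness.
def split_by_correctness(texts, predicted, current_labels):
    correct   = [t for t, p, g in zip(texts, predicted, current_labels) if p == g]
    incorrect = [t for t, p, g in zip(texts, predicted, current_labels) if p != g]
    return correct, incorrect
```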
8. A text classification apparatus, comprising:
a target text acquisition module, used for extracting target text data to be analyzed from an original text;
a word segmentation module, used for preprocessing the target text data to obtain a word segmentation result of the target text data;
and a classification module, used for inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result, and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained ALBERT model.
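The three vectors named in the classification module can be combined in the usual BERT/ALBERT way, by summing the embeddings. The PyTorch sketch below illustrates that convention only; the vocabulary sizes, dimension and character/word split are assumptions, not the patented model:

```python
# Sketch of summing character, word and position embeddings (assumes PyTorch).
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, char_vocab=21128, word_vocab=30000, max_len=512, dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, dim)  # target character vectors
        self.word_emb = nn.Embedding(word_vocab, dim)  # target word vectors
        self.pos_emb  = nn.Embedding(max_len, dim)     # target position vectors

    def forward(self, char_ids, word_ids):
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        return self.char_emb(char_ids) + self.word_emb(word_ids) + self.pos_emb(positions)

# Usage with dummy ids: batch of 2 sequences of length 8.
emb = InputEmbedding()(torch.zeros(2, 8, dtype=torch.long), torch.zeros(2, 8, dtype=torch.long))
print(emb.shape)  # torch.Size([2, 8, 128])
```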
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to carry out the steps of the text classification method according to any one of claims 1 to 7.
10. A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text classification method of any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110482695.1A CN113011533B (en) | 2021-04-30 | 2021-04-30 | Text classification method, apparatus, computer device and storage medium |
PCT/CN2021/097195 WO2022227207A1 (en) | 2021-04-30 | 2021-05-31 | Text classification method, apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110482695.1A CN113011533B (en) | 2021-04-30 | 2021-04-30 | Text classification method, apparatus, computer device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113011533A (en) | 2021-06-22
CN113011533B CN113011533B (en) | 2023-10-24 |
Family
ID=76380485
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110482695.1A Active CN113011533B (en) | 2021-04-30 | 2021-04-30 | Text classification method, apparatus, computer device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113011533B (en) |
WO (1) | WO2022227207A1 (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115545009B (en) * | 2022-12-01 | 2023-07-07 | 中科雨辰科技有限公司 | Data processing system for acquiring target text |
CN115563289B (en) * | 2022-12-06 | 2023-03-07 | 中信证券股份有限公司 | Industry classification label generation method and device, electronic equipment and readable medium |
CN115827875B (en) * | 2023-01-09 | 2023-04-25 | 无锡容智技术有限公司 | Text data processing terminal searching method |
CN116205601B (en) * | 2023-02-27 | 2024-04-05 | 开元数智工程咨询集团有限公司 | Internet-based engineering list rechecking and data statistics method and system |
CN116204645B (en) * | 2023-03-02 | 2024-02-20 | 北京数美时代科技有限公司 | Intelligent text classification method, system, storage medium and electronic equipment |
CN115994527B (en) * | 2023-03-23 | 2023-06-09 | 广东聚智诚科技有限公司 | Machine learning-based PPT automatic generation system |
CN116992034B (en) * | 2023-09-26 | 2023-12-22 | 之江实验室 | Intelligent event marking method, device and storage medium |
CN117009534B (en) * | 2023-10-07 | 2024-02-13 | 之江实验室 | Text classification method, apparatus, computer device and storage medium |
CN117034901B (en) * | 2023-10-10 | 2023-12-08 | 北京睿企信息科技有限公司 | Data statistics system based on text generation template |
CN117252514B (en) * | 2023-11-20 | 2024-01-30 | 中铁四局集团有限公司 | Building material library data processing method based on deep learning and model training |
CN117743573B (en) * | 2023-12-11 | 2024-10-18 | 中国科学院文献情报中心 | Corpus automatic labeling method and device, storage medium and electronic equipment |
CN117743857B (en) * | 2023-12-29 | 2024-09-17 | 北京海泰方圆科技股份有限公司 | Text correction model training, text correction method, device, equipment and medium |
CN117951007A (en) * | 2024-01-09 | 2024-04-30 | 航天中认软件测评科技(北京)有限责任公司 | Test case classification method based on theme |
CN117910479B (en) * | 2024-03-19 | 2024-06-04 | 湖南蚁坊软件股份有限公司 | Method, device, equipment and medium for judging aggregated news |
CN117992600B (en) * | 2024-04-07 | 2024-06-11 | 之江实验室 | Service execution method and device, storage medium and electronic equipment |
CN118193743B (en) * | 2024-05-20 | 2024-08-16 | 山东齐鲁壹点传媒有限公司 | Multi-level text classification method based on pre-training model |
CN118332091B (en) * | 2024-06-06 | 2024-08-09 | 中电信数智科技有限公司 | Ancient book knowledge base intelligent question-answering method, device and equipment based on large model technology |
CN118503795B (en) * | 2024-07-18 | 2024-09-20 | 北京睿企信息科技有限公司 | Text label verification method, electronic equipment and storage medium |
CN118503796B (en) * | 2024-07-18 | 2024-09-20 | 北京睿企信息科技有限公司 | Label system construction method, device, equipment and medium |
CN118503399B (en) * | 2024-07-18 | 2024-09-20 | 北京睿企信息科技有限公司 | Standardized text acquisition method, device, equipment and medium |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190034823A1 (en) * | 2017-07-27 | 2019-01-31 | Getgo, Inc. | Real time learning of text classification models for fast and efficient labeling of training data and customization |
WO2019149200A1 (en) * | 2018-02-01 | 2019-08-08 | 腾讯科技(深圳)有限公司 | Text classification method, computer device, and storage medium |
CN109508378A (en) * | 2018-11-26 | 2019-03-22 | 平安科技(深圳)有限公司 | A kind of sample data processing method and processing device |
CN109710770A (en) * | 2019-01-31 | 2019-05-03 | 北京牡丹电子集团有限责任公司数字电视技术中心 | A kind of file classification method and device based on transfer learning |
CN112052331A (en) * | 2019-06-06 | 2020-12-08 | 武汉Tcl集团工业研究院有限公司 | Method and terminal for processing text information |
WO2021008037A1 (en) * | 2019-07-15 | 2021-01-21 | 平安科技(深圳)有限公司 | A-bilstm neural network-based text classification method, storage medium, and computer device |
CN110717039A (en) * | 2019-09-17 | 2020-01-21 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
CN111078887A (en) * | 2019-12-20 | 2020-04-28 | 厦门市美亚柏科信息股份有限公司 | Text classification method and device |
CN111125317A (en) * | 2019-12-27 | 2020-05-08 | 携程计算机技术(上海)有限公司 | Model training, classification, system, device and medium for conversational text classification |
CN111198948A (en) * | 2020-01-08 | 2020-05-26 | 深圳前海微众银行股份有限公司 | Text classification correction method, device and equipment and computer readable storage medium |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254657A (en) * | 2021-07-07 | 2021-08-13 | 明品云(北京)数据科技有限公司 | User data classification method and system |
CN113486176A (en) * | 2021-07-08 | 2021-10-08 | 桂林电子科技大学 | News classification method based on secondary feature amplification |
CN113283235A (en) * | 2021-07-21 | 2021-08-20 | 明品云(北京)数据科技有限公司 | User label prediction method and system |
CN113283235B (en) * | 2021-07-21 | 2021-11-19 | 明品云(北京)数据科技有限公司 | User label prediction method and system |
CN113535960A (en) * | 2021-08-02 | 2021-10-22 | 中国工商银行股份有限公司 | Text classification method, device and equipment |
CN113627509B (en) * | 2021-08-04 | 2024-05-10 | 口碑(上海)信息技术有限公司 | Data classification method, device, computer equipment and computer readable storage medium |
CN113627509A (en) * | 2021-08-04 | 2021-11-09 | 口碑(上海)信息技术有限公司 | Data classification method and device, computer equipment and computer readable storage medium |
CN113609860A (en) * | 2021-08-05 | 2021-11-05 | 湖南特能博世科技有限公司 | Text segmentation method and device and computer equipment |
CN113609860B (en) * | 2021-08-05 | 2023-09-19 | 湖南特能博世科技有限公司 | Text segmentation method and device and computer equipment |
CN113836892A (en) * | 2021-09-08 | 2021-12-24 | 灵犀量子(北京)医疗科技有限公司 | Sample size data extraction method and device, electronic equipment and storage medium |
CN113836892B (en) * | 2021-09-08 | 2023-08-08 | 灵犀量子(北京)医疗科技有限公司 | Sample size data extraction method and device, electronic equipment and storage medium |
CN114706974A (en) * | 2021-09-18 | 2022-07-05 | 北京墨丘科技有限公司 | Technical problem information mining method and device and storage medium |
CN114065772A (en) * | 2021-11-19 | 2022-02-18 | 浙江百应科技有限公司 | Business opportunity identification method and device based on Albert model and electronic equipment |
WO2023093074A1 (en) * | 2021-11-24 | 2023-06-01 | 青岛海尔科技有限公司 | Voice data processing method and apparatus, and electronic device and storage medium |
CN114706961A (en) * | 2022-01-20 | 2022-07-05 | 平安国际智慧城市科技股份有限公司 | Target text recognition method, device and storage medium |
CN114492661A (en) * | 2022-02-14 | 2022-05-13 | 平安科技(深圳)有限公司 | Text data classification method and device, computer equipment and storage medium |
CN114936282A (en) * | 2022-04-28 | 2022-08-23 | 北京中科闻歌科技股份有限公司 | Financial risk cue determination method, apparatus, device and medium |
CN115861606A (en) * | 2022-05-09 | 2023-03-28 | 北京中关村科金技术有限公司 | Method and device for classifying long-tail distribution documents and storage medium |
CN115861606B (en) * | 2022-05-09 | 2023-09-08 | 北京中关村科金技术有限公司 | Classification method, device and storage medium for long-tail distributed documents |
CN115587185B (en) * | 2022-11-25 | 2023-03-14 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and storage medium |
CN115587185A (en) * | 2022-11-25 | 2023-01-10 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and storage medium |
CN116975400A (en) * | 2023-08-03 | 2023-10-31 | 星环信息科技(上海)股份有限公司 | Data hierarchical classification method and device, electronic equipment and storage medium |
CN116975400B (en) * | 2023-08-03 | 2024-05-24 | 星环信息科技(上海)股份有限公司 | Data classification and classification method and device, electronic equipment and storage medium |
CN118535739A (en) * | 2024-06-26 | 2024-08-23 | 上海建朗信息科技有限公司 | Data classification method and system based on keyword weight matching |
Also Published As
Publication number | Publication date |
---|---|
WO2022227207A1 (en) | 2022-11-03 |
CN113011533B (en) | 2023-10-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium | |
CN107229610B (en) | A kind of analysis method and device of affection data | |
JP7164701B2 (en) | Computer-readable storage medium storing methods, apparatus, and instructions for matching semantic text data with tags | |
CN106776562B (en) | Keyword extraction method and extraction system | |
US20230195773A1 (en) | Text classification method, apparatus and computer-readable storage medium | |
CN111709243B (en) | Knowledge extraction method and device based on deep learning | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN110321563B (en) | Text emotion analysis method based on hybrid supervision model | |
CN112256939B (en) | Text entity relation extraction method for chemical field | |
WO2023159758A1 (en) | Data enhancement method and apparatus, electronic device, and storage medium | |
CN107180026B (en) | Event phrase learning method and device based on word embedding semantic mapping | |
US11170169B2 (en) | System and method for language-independent contextual embedding | |
Rizvi et al. | Optical character recognition system for Nastalique Urdu-like script languages using supervised learning | |
CN114065758A (en) | Document keyword extraction method based on hypergraph random walk | |
CN112966068A (en) | Resume identification method and device based on webpage information | |
CN112860898B (en) | Short text box clustering method, system, equipment and storage medium | |
CN111858842A (en) | Judicial case screening method based on LDA topic model | |
CN114881043B (en) | Deep learning model-based legal document semantic similarity evaluation method and system | |
CN111008530A (en) | Complex semantic recognition method based on document word segmentation | |
Gunaseelan et al. | Automatic extraction of segments from resumes using machine learning | |
CN117251524A (en) | Short text classification method based on multi-strategy fusion | |
CN110705285A (en) | Government affair text subject word bank construction method, device, server and readable storage medium | |
CN117573869A (en) | Network connection resource key element extraction method | |
CN117574858A (en) | Automatic generation method of class case retrieval report based on large language model | |
CN117422074A (en) | Method, device, equipment and medium for standardizing clinical information text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||