CN113011533B - Text classification method, apparatus, computer device and storage medium - Google Patents

Text classification method, apparatus, computer device and storage medium

Info

Publication number
CN113011533B
Authority
CN
China
Prior art keywords
text
classification
target
training
model
Prior art date
Legal status
Active
Application number
CN202110482695.1A
Other languages
Chinese (zh)
Other versions
CN113011533A (en)
Inventor
刘翔
谷坤
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110482695.1A priority Critical patent/CN113011533B/en
Priority to PCT/CN2021/097195 priority patent/WO2022227207A1/en
Publication of CN113011533A publication Critical patent/CN113011533A/en
Application granted
Publication of CN113011533B publication Critical patent/CN113011533B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a text classification method, a text classification apparatus, a computer device and a storage medium, wherein the method comprises the following steps: extracting target text data to be analyzed from an original text; preprocessing the target text data to obtain word segmentation results of the target text data; obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation results; and inputting the target character vector, the target word vector and the target position vector into a pre-trained text classification model to obtain a target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model. By adopting the albert model to process the text data, the method effectively improves the efficiency and accuracy of text classification.

Description

Text classification method, apparatus, computer device and storage medium
Technical Field
The present application relates to the field of natural language processing, and in particular, to a text classification method, apparatus, computer device, and storage medium.
Background
With the rapid development of network technology, massive information resources exist in the form of text. How to classify these texts effectively and mine useful information from massive texts quickly, accurately and comprehensively has become one of the research hotspots in the field of natural language processing. Text classification refers to determining a category for each document in a document set according to predefined subject categories. Text classification technology has a wide range of applications in daily life, for example, the division of patent texts by technical field.
Compared with general text, patent text has a special structure, strong domain specialization and many technical terms, so a more targeted classification method is needed. Patent text classification belongs to the field of natural language processing and generally comprises steps such as data preprocessing, text feature representation, classifier selection and effect evaluation, among which text feature representation and classifier selection are the most important and directly affect the accuracy of the classification result.
In the prior art, text classification methods based on traditional machine learning, such as the TF-IDF method, measure the importance of words only by word frequency and then build a feature-value sequence for the document; the words are treated as independent, so word-order information cannot be reflected. TF-IDF is also susceptible to data-set bias: when some category contains too many documents, the IDF is underestimated, which is usually handled by increasing the category weight; and the intra-class and inter-class distribution bias is not considered when TF-IDF is used for feature selection. Text classification methods based on deep learning include Facebook's open-source FastText, the Text-CNN method, the Text-RNN method, and so on. TextCNN performs well on many tasks, but the biggest problem of a CNN is the fixed receptive field of filter_size: on one hand, longer sequence information cannot be modeled; on the other hand, hyperparameter tuning of filter_size is cumbersome. A CNN essentially performs feature extraction over text, while recurrent neural networks (RNN, Recurrent Neural Network) are more commonly used in natural language processing because they express context information better. Although CNNs and RNNs are effective in text classification tasks, they are not intuitive and are poorly interpretable, which is felt keenly when analyzing bad cases.
Disclosure of Invention
The application provides a text classification method, a text classification device, computer equipment and a storage medium.
A first aspect provides a text classification method, the method comprising:
extracting target text data to be analyzed from an original text;
preprocessing the target text data to obtain word segmentation results of the target text data;
inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result, and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
In some embodiments, before the text data to be classified is extracted from the original text, the method further comprises:
extracting keywords from the original text to form a keyword set;
determining word frequency-inverse document frequency of the keyword set in the corpus of each category based on the TF-IDF model;
determining the confidence that the original text belongs to each category based on the word frequency-inverse document frequency of the keyword set of the original text in the corpus of each category;
determining a first-level classification label of the original text according to the confidence that the original text belongs to each category;
and matching the first-level classification label with preset first-level classification label information, and determining whether to adopt the text classification model to perform text classification on the original text according to the matching result.
In some embodiments, the preprocessing of the target text data to obtain word segmentation results includes:
performing stop-word removal or de-duplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain the word segmentation result.
In some embodiments, the method further comprises pre-training the text classification model, the pre-training comprising:
acquiring a first training sample set, wherein the first training sample set comprises a first training text, and the first training text comprises a corresponding first classification label;
based on the first training sample set, pre-training an albert model by taking the first classification label as a classification target to obtain an initial text classification model;
judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold;
if the accuracy is greater than the preset threshold, taking the initial text classification model as the final text classification model;
and if the accuracy is not greater than the preset threshold, correcting the classification label corresponding to the first training text, and iterating the initial text classification model based on the corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
In some embodiments, the determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold includes:
acquiring a second training sample set, wherein the second training sample set contains a second training text;
obtaining a prediction classification label corresponding to the second training text in the second training sample set based on the initial text classification model;
and judging whether the accuracy of the classification result of the initial text classification model is greater than the preset threshold according to the prediction classification label and a second classification label corresponding to the second training text, wherein the second classification label is manually annotated by a user.
In some embodiments, the pre-training the albert model with the first classification label as the classification target based on the first training sample set to obtain an initial text classification model includes:
dividing the first training sample set into training data and verification data according to a preset proportion;
inputting the training data into an initial text classification model to be trained to perform model training;
and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to a verification result.
In some embodiments, the correcting the classification label corresponding to the first training text includes:
auditing the prediction result to obtain a first training text with correct prediction and a first training text with incorrect prediction;
and manually labeling the first training text with the wrong prediction so that its label is correctly annotated.
A second aspect provides a text classification apparatus comprising:
the target text acquisition module is used for extracting target text data to be analyzed from the original text;
the word segmentation module is used for preprocessing the target text data to obtain word segmentation results of the target text data;
the classification module is used for inputting the word segmentation result into a trained text classification model, the text classification model obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result, and obtaining a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
A third aspect provides a computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the text classification method described above.
A fourth aspect provides a storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text classification method described above.
With the text classification method, apparatus, computer device and storage medium, target text data to be analyzed is first extracted from an original text; the target text data is then preprocessed to obtain its word segmentation results; finally, the word segmentation results are input into a trained text classification model, which obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation results and obtains a target classification label of the target text data based on these three vectors. Because the albert model is adopted to process the text data, the resulting word vector sequence contains both the textual information and the context information of the text data, so the full-text semantic information is integrated and the captured text information is more comprehensive, which facilitates the subsequent text classification, improves its accuracy, and improves the classification effect.
Drawings
FIG. 1 is an environmental diagram of an implementation of a text classification method provided in one embodiment;
FIG. 2 is a block diagram of the internal architecture of a computer device in one embodiment;
FIG. 3 is a flow diagram of a method of text classification in one embodiment;
fig. 4 is a block diagram of a text classification apparatus in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It will be understood that the terms first, second, etc. as used herein may be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another element.
Fig. 1 is a diagram of an implementation environment of a text classification method provided in one embodiment, as shown in fig. 1, in which a computer device 110 and a terminal 120 are included.
The computer device 110 is a text classification server; the terminal 120 is a device for obtaining the text to be classified and has an output interface for the text classification results. When text classification is required, the text to be classified is obtained through the terminal 120 and classified by the computer device 110.
It should be noted that the terminal 120 and the computer device 110 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The computer device 110 and the terminal 120 may be connected by Bluetooth, USB (Universal Serial Bus) or another communication connection, which is not limited herein.
FIG. 2 is a schematic diagram of the internal structure of a computer device in one embodiment. As shown in fig. 2, the computer device includes a processor, a storage medium, a memory, and a network API interface connected by a system bus. The storage medium of the computer device stores an operating system, a database and computer readable instructions; the database can store a control information sequence, and the computer readable instructions, when executed by the processor, cause the processor to implement a text classification method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer readable instructions that, when executed by the processor, cause the processor to perform the text classification method. The network API interface of the computer device is used for communicating with the terminal. It will be appreciated by persons skilled in the art that the architecture shown in fig. 2 is merely a block diagram of part of the architecture relevant to the present application and does not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
For ease of understanding, the terms involved in the embodiments of the present application will be first described below.
albert model: a language model published by google in 2018 that trains deep bi-directional representations by combining bi-directional converters in all layers. The albert model combines the advantages of a plurality of natural language processing models, and obtains better effects in a plurality of natural language processing tasks. In the related art, the model input vector of the albert model is the sum of vectors of a word vector (tokenizing), a position vector (locationing), and a sentence vector (segmenteimbedding). The word vector is a vectorization representation of the text, the position vector is used for representing the position of the word in the text, and the sentence vector is used for representing the sequence of sentences in the text.
Pretraining (pre-training): a process for training a neural network model by using a large dataset to learn common features in the dataset. The purpose of the pre-training is to provide good quality model parameters for subsequent neural network model training on a particular data set. Pre-training in embodiments of the present application refers to the process of training an albert model using unlabeled training text.
Fine-tuning: a process for further training a pre-trained neural network model using a specific data set. In general, the amount of data used in the fine-tuning stage is smaller than that used in the pre-training stage, and the fine-tuning stage adopts supervised learning, that is, the training samples in the data set used for fine-tuning contain annotation information. The fine-tuning stage in the embodiments of the present application refers to training the albert model using training text that contains classification labels.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
As shown in fig. 3, in one embodiment, a text classification method is provided, and the text classification method may be applied to the computer device 110, and specifically may include the following steps:
step 101, extracting target text data to be analyzed from an original text;
the original text can be a patent text, and the patent text has the characteristics of special structure, strong specialization, more domain words and the like, and a more specific classification method is needed. Patent text classification belongs to the field of natural language processing, and generally comprises the steps of data preprocessing, text feature representation, classifier selection, effect evaluation and the like, wherein the text feature representation and the classifier selection are most important, and the accuracy of a classification result is directly affected.
In the present embodiment, the text data of the abstract, the claims and the title of the description in the patent text are extracted as the target text data.
Step 102, preprocessing the target text data to obtain word segmentation results of the target text data;
In this embodiment, the purpose of preprocessing the target text data is to extract the useful data in the original text data and to delete the noise data in the original text, so that text data irrelevant to the extraction purpose can be removed from the original text data.
In some embodiments, step 102 may include: performing stop-word removal or de-duplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain the word segmentation result.
When cleaning the data, repeated data in the original text data is removed by de-duplication, and noise data in the original text data is removed by deletion.
Stop words refer to certain characters or words that, in information retrieval, are automatically filtered out before or after processing natural language text in order to save storage space and improve search efficiency; such characters or words are called stop words (Stop Words).
In this embodiment, stop-word removal may remove words in the natural language text that do not contribute to the text features, such as punctuation, modal particles, person names, meaningless garbled characters, and spaces. The method used to select stop words is stop-word-list filtering: the words in a constructed stop-word list are matched one by one against the words in the text data; if a match succeeds, the word is a stop word and needs to be deleted.
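A minimal sketch of stop-word-list filtering as just described; the stop-word list and tokens below are illustrative placeholders, not the patent's actual resources.

```python
stopwords = {"的", "了", "是", ",", "。", " "}  # hypothetical stop-word list

def remove_stopwords(tokens):
    # Match each token against the stop-word list; drop it when a match succeeds.
    return [tok for tok in tokens if tok not in stopwords]

print(remove_stopwords(["本", "发明", "是", "一种", "文本", "分类", "方法", "。"]))
# ['本', '发明', '一种', '文本', '分类', '方法']
```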
To obtain target text data in vector form, the second text data must first be segmented into words. Word segmentation is a basic task in lexical analysis. According to their core idea, word segmentation algorithms fall into two main types: one is dictionary-based word segmentation, which first cuts the text data into words according to a dictionary and then searches for the optimal combination of those words; the other is character-based word segmentation, which first splits sentences into individual characters and then combines the characters into words while searching for an optimal segmentation strategy, which can in turn be converted into a sequence labeling problem. The word segmentation algorithm adopted in this embodiment may include: a rule-based word segmentation method, an understanding-based word segmentation method, or a statistics-based word segmentation method.
The rule-based word segmentation method (for example, word segmentation based on string matching) matches the Chinese character string to be analyzed against the entries of a sufficiently large dictionary according to a certain strategy; if a certain string is found in the dictionary, the match succeeds (a word is identified). Common rule-based word segmentation methods include: forward maximum matching (left-to-right); reverse maximum matching (right-to-left); and minimum segmentation (minimizing the number of words cut out of each sentence). The forward maximum matching method takes a substring of limited length from the string and matches it against the words in the dictionary; if the match succeeds, it proceeds to the next round until the whole string is processed, otherwise it removes one character from the end of the substring and matches again, repeating these steps. The reverse maximum matching method is similar to the forward maximum matching method.
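A minimal sketch of the forward maximum matching method described above, assuming a toy dictionary and a fixed maximum word length; the dictionary entries are illustrative.

```python
DICT = {"文本", "分类", "方法", "文本分类"}  # hypothetical dictionary
MAX_LEN = 4                                  # longest substring we try to match

def forward_max_match(text):
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time;
        # fall back to a single character when nothing in the dictionary matches.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in DICT or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

print(forward_max_match("文本分类方法"))  # ['文本分类', '方法']
```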
The understanding-based word segmentation method achieves word recognition by having the computer simulate a person's understanding of the sentence. Its basic idea is to perform syntactic and semantic analysis while segmenting, and to use the syntactic and semantic information to resolve ambiguity. The statistics-based word segmentation method rests on the observation that, formally, a word is a stable combination of characters: in context, the more often adjacent characters co-occur, the more likely they are to form a word. Therefore, the frequency or probability of characters co-occurring adjacently reflects the credibility of their forming a word. The frequency of each combination of adjacent co-occurring characters in the text data is counted to compute their mutual information. The mutual information reflects how tightly Chinese characters combine with each other; when this tightness is higher than a certain threshold, the character group may be considered to constitute a word. In practical applications, a statistical word segmentation system usually uses a basic word segmentation dictionary for string-matching segmentation while using statistical methods to identify new words, i.e., it combines string frequency statistics with string matching, which retains the speed and efficiency of match-based segmentation while exploiting the advantages of dictionary-free segmentation: recognizing new words from context and disambiguating automatically.
After the word segmentation process, the original text data is represented by a series of keywords, but such text data cannot be processed directly by the subsequent classification algorithm and should be converted into numerical form; therefore, the keywords need to be converted into word-vector form to obtain the text data to be classified in the form of text vectors.
Step 103, inputting the word segmentation result into a trained text classification model, the text classification model obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result, and obtaining a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
The word segmentation result is input into a pre-trained text classification model, which is a pre-trained albert model. In order for the albert model to learn the contextual relationships of characters and the mapping between pinyin and characters, three types of vectors are used when training the albert model in the embodiments of the application, namely a character vector, a word vector and a position vector.
Optionally, the word vector is obtained by converting the words using a word2vec model.
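For instance, a minimal word2vec sketch assuming the gensim library; the segmented corpus below is a toy stand-in for real word segmentation results.

```python
from gensim.models import Word2Vec

segmented_corpus = [
    ["文本", "分类", "方法"],
    ["专利", "文本", "分类"],
]  # hypothetical word segmentation results

model = Word2Vec(sentences=segmented_corpus, vector_size=100, window=5, min_count=1)
vector = model.wv["文本"]  # 100-dimensional word vector for this word
print(vector.shape)        # (100,)
```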
In this embodiment, step 103 may include the steps of:
step 1031, obtaining word vectors corresponding to the text data according to the part of speech and the position information of the text data. In the present embodiment, position information is added to text data using position coding, and the text data to which the position information is added is represented using an initial word vector; acquiring part of speech of text data, and converting the part of speech into a part of speech vector; and adding the initial word vector and the part-of-speech vector to obtain a word vector corresponding to the text data.
Step 1032, inputting the word vector into the albert model for data processing to obtain the word matrix of the text data.
Step 1033, obtaining a word vector sequence of the text data according to the word matrix. In this embodiment, the word matrix is used to predict whether two sentences in the text data are adjacent, to predict the masked words in the two sentences and the part-of-speech features of the masked words, and the part-of-speech features are normalized to obtain the word vector sequence of the text data.
It should be understood that the albert model used in this embodiment is a model that is trained in advance, so that when text data is processed, only the text data needs to be input into the trained albert model to obtain the corresponding word vector sequence.
In order for the albert model to implement text classification, a classifier needs to be set in the albert model. Alternatively, the classification categories and their number are related to the classification task that the text classification model needs to implement, and the classifier may be a multi-class classifier (such as a softmax classifier). The embodiment of the application does not limit the specific type of the classifier.
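A minimal sketch of such a softmax classification head, assuming PyTorch; the hidden size, label count and pooled output are illustrative stand-ins for the albert encoder's actual output.

```python
import torch
import torch.nn as nn

hidden, num_labels = 128, 8                 # hypothetical sizes

classifier = nn.Linear(hidden, num_labels)  # multi-class classification head

pooled_output = torch.randn(1, hidden)      # stand-in for the encoder's pooled vector
logits = classifier(pooled_output)
probs = torch.softmax(logits, dim=-1)       # per-category probabilities
target_label = probs.argmax(dim=-1)         # the predicted classification label
```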
In some embodiments, the above text classification method, before extracting text data to be classified from an original text, further includes:
Step 100a, extracting keywords from the original text to form a keyword set;
Step 100b, determining the word frequency-inverse document frequency of the keyword set in the corpus of each category based on the TF-IDF model;
specifically, text features matched with the keywords in the text features of the corpus of the category are determined, and the word frequency-inverse document frequency of the matched text features is used as the word frequency-inverse document frequency of the keywords. According to punctuation marks such as periods, question marks, exclamation marks, semicolons and the like, texts in a corpus of a certain category are divided into a plurality of sentences, and text features in each sentence are extracted. And respectively establishing a text feature library for each category according to the extracted text features. And respectively counting the frequency of each text feature under each category. And counting the inverse document frequency of each text feature, namely, the natural logarithm value of the quotient of the total category number and the category number containing the text feature, and respectively calculating the word frequency-inverse document frequency of each text feature under each category.
Step 100c, determining the confidence that the original text belongs to each category based on the word frequency-inverse document frequency of the keyword set of the original text in the corpus of each category;
specifically, the following operations are performed for each category separately: determining the number of times that keywords appear in the corpus of the category; determining the class conditional probability of the original text relative to the class according to the word frequency-inverse document frequency of the keywords in the class corpus and the frequency of occurrence of the keywords in the class corpus; and determining the confidence coefficient of the text to be classified belonging to the category according to the class conditional probability of the original text relative to the category.
Step 100d, determining a first-level classification label of the original text according to the confidence that the original text belongs to each category;
specifically, the category with the highest confidence level is used as a first class classification label of the text to be classified in the confidence level of the original text belonging to each category.
Step 100e, matching the first-level classification label with preset first-level classification label information, and determining whether to adopt the text classification model to perform text classification on the original text according to the matching result.
It will be appreciated that the target classification label obtained in steps 101 to 103 is the bottom-level classification label of the patent text; for example, a patent text has multiple levels of classification labels, where the first level has only one category and the second level has at least two categories. Therefore, in this step, first-level classification is first performed through the TF-IDF model according to the keywords of the original document; if the first-level classification label of the patent document does not match the preset first-level classification label, label classification of the original document is not needed. The first-level classification label is a manually set label at a higher level than the bottom-level classification label.
In some embodiments, the text classification method further includes pre-training the text classification model, the pre-training comprising:
step 1001, a first training sample set is obtained, wherein the first training sample set comprises a first training text, and the first training text comprises a corresponding first classification label;
optionally, the first training sample set is a specific data set related to text classification, wherein the training text contains a corresponding classification label, the classification label can be labeled manually, and the classification label belongs to a classification result of the text classification model. In one illustrative example, when a text classification model is used to classify patent text, classification labels include specific and different technical fields, such as cloud computing, image processing, and the like. The embodiment of the application is not limited to the specific content of the classification tag.
Step 1002, pre-training an albert model by taking a first classification label as a classification target based on a first training sample set to obtain an initial text classification model;
the step 1002 may include:
dividing the first training sample set into training data and verification data according to a preset proportion;
inputting training data into an initial text classification model to be trained to perform model training;
and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to the verification result.
In this step, the first training sample set is divided at a ratio of 9:1, with 90% used as the training set and 10% as the validation set. After the model is trained on the 90% of the data, a prediction model is generated, the remaining 10% of the samples are predicted, and the model parameters are adjusted appropriately according to the result to obtain the initial text classification model.
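A minimal sketch of the 9:1 split, assuming scikit-learn; `texts` and `labels` are hypothetical stand-ins for the first training sample set.

```python
from sklearn.model_selection import train_test_split

texts = ["专利文本A", "专利文本B", "专利文本C", "专利文本D", "专利文本E",
         "专利文本F", "专利文本G", "专利文本H", "专利文本I", "专利文本J"]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # hypothetical first classification labels

train_x, val_x, train_y, val_y = train_test_split(
    texts, labels, test_size=0.1, random_state=42)  # 90% training, 10% validation
```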
Step 1003, judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold;
Step 1004, if the accuracy is greater than the preset threshold, taking the initial text classification model as the final text classification model;
Step 1005, if the accuracy is not greater than the preset threshold, correcting the classification labels corresponding to the first training texts, and iterating the initial text classification model based on the corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
It may be appreciated that in step 1005, the initial text classification model is iterated based on the error-corrected first training sample set; that is, the initial text model may be optimized based on all or part of the corrected first training sample set. The number of iterations is determined by judging whether the accuracy of the classification result of the fine-tuned initial text classification model is greater than the preset threshold: if so, iteration stops; if not, the optimization training of the initial text classification model continues.
In the step 1003, determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold may include:
1003a, acquiring a second training sample set, wherein the second training sample set comprises a second training text;
1003b, obtaining a prediction classification label corresponding to the second training text in the second training sample set based on the initial text classification model;
1003c, judging whether the accuracy of the classification result of the initial text classification model is greater than the preset threshold according to the prediction classification label and a second classification label corresponding to the second training text, wherein the second classification label is manually annotated by a user.
In this embodiment, a second training sample set different from the first training sample set is used as the verification data for verifying the accuracy of the classification result of the initial text classification model; the first training sample set expands the training data of the initial classification model, while the second training sample set avoids the problem of low accuracy of the initial text classification model caused by errors in the original classification labels of the first training sample set.
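A minimal sketch of the accuracy check; the preset threshold and labels are hypothetical, since the patent does not fix concrete values.

```python
def classification_accuracy(predicted, labelled):
    correct = sum(p == y for p, y in zip(predicted, labelled))
    return correct / len(labelled)

PRESET_THRESHOLD = 0.95  # hypothetical value

predicted_labels = [0, 1, 1, 0, 1]  # prediction classification labels from the initial model
manual_labels = [0, 1, 0, 0, 1]     # second classification labels annotated by the user

if classification_accuracy(predicted_labels, manual_labels) > PRESET_THRESHOLD:
    print("take the initial model as the final text classification model")
else:
    print("correct the classification labels and iterate the initial model")
```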
In step 1005, performing error correction on the classification label corresponding to the first training text may include:
1005a, auditing the prediction result to obtain a first training text with correct prediction and a first training text with incorrect prediction;
1005b, manually labeling the first training texts with incorrect predictions so that their labels are correctly annotated.
In this embodiment, when the initial predictions of the initial text classification model are inaccurate, the model is iterated so that its predictions become more accurate.
In some embodiments, the computer device employs a gradient descent or back-propagation algorithm to adjust the network parameters of the albert model according to the error between the prediction result and the classification label until the error satisfies a convergence condition.
In one possible implementation, the amount of data used for fine-tuning with the second training sample set is much smaller than that of the first training sample set, since the pre-trained albert model has already learned the context of the text.
Similar to the pre-training process, in order for the text classification model to learn the mapping between text classification and pinyin, the albert model is fine-tuned in the same way, the difference being that the word vectors, position vectors and sentence vectors of the words in the second training text are used as inputs.
In a possible implementation, during fine-tuning the computer device uses the second character vector, the second target word vector and the second target position vector of the second training sample set as the input vectors of the albert model to obtain the text classification prediction output by the albert model, then uses the classification label corresponding to the second training text as supervision to fine-tune the albert model, and finally trains the text classification model.
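A minimal fine-tuning sketch assuming the Hugging Face transformers library; the checkpoint name, label count, sample text and learning rate are illustrative assumptions, not the patent's actual configuration.

```python
import torch
from transformers import AutoTokenizer, AlbertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")  # stand-in checkpoint
model = AlbertForSequenceClassification.from_pretrained("albert-base-v2", num_labels=8)

batch = tokenizer(["a patent text about cloud computing"], return_tensors="pt",
                  truncation=True, padding=True)
labels = torch.tensor([3])  # the annotated classification label used as supervision

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # loss against the classification label
outputs.loss.backward()                  # back propagation of the error
optimizer.step()                         # gradient-descent parameter update
```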
As shown in FIG. 4, in one embodiment, a text classification apparatus is provided, which may be integrated into the computer device 110 described above, and may include:
A target text acquisition module 411, configured to extract target text data to be analyzed from an original text;
the word segmentation module 412 is configured to pre-process the target text data to obtain a word segmentation result of the target text data;
the vector obtaining module 413 is configured to obtain a target character vector, a target position vector and a target sentence vector corresponding to the characters in the target classification text;
the classification module 414 is configured to input the target character vector, the target word vector and the target position vector into the text classification model to obtain the target classification label output by the text classification model, wherein the text classification model is a model trained by using the training method of the text classification model described above.
In one embodiment, a computer device is provided, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: extracting target text data to be analyzed from an original text; preprocessing the target text data to obtain word segmentation results of the target text data; obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation results; and inputting the target character vector, the target word vector and the target position vector into a pre-trained text classification model to obtain a target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model.
In one embodiment, before extracting the text data to be classified from the original text, further comprising: acquiring keywords from an original text based on a TF-IDF model, and forming a keyword set; determining a first-level classification label of the original text according to the keyword set; and matching the first-level classification label with preset first-level classification label information, and determining whether to adopt a text classification model to carry out text classification on the original text according to a matching result.
In one embodiment, the original text is patent text data, and extracting the text data to be classified from the original text includes: extracting the text data of the abstract, the claims and the title of the description in the patent text as the text data to be classified.
In one embodiment, inputting the word segmentation result into a pre-trained albert model to obtain a word vector sequence corresponding to text data, including: acquiring word vectors corresponding to the text data according to the part of speech and the position information of the text data; inputting the word vector into an albert model for data processing to obtain a word matrix of text data; and acquiring a word vector sequence of the text data according to the word matrix.
In one embodiment, determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold includes: acquiring a second training sample set, wherein the second training sample set contains a second training text; obtaining a prediction classification label corresponding to the second training text in the second training sample set based on the initial text classification model; and judging whether the accuracy of the classification result of the initial text classification model is greater than the preset threshold according to the prediction classification label and a second classification label corresponding to the second training text, wherein the second classification label is manually annotated by a user.
In one embodiment, based on the first training sample set, pre-training the albert model with the first classification label as a classification target to obtain an initial text classification model, including: dividing the first training sample set into training data and verification data according to a preset proportion; inputting training data into an initial text classification model to be trained to perform model training; and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to the verification result.
In one embodiment, performing error correction on the classification label corresponding to the first training text includes:
auditing the prediction result to obtain a first training text with correct prediction and a first training text with incorrect prediction;
and manually labeling the first training text with the wrong prediction so that its label is correctly annotated.
In one embodiment, a storage medium storing computer-readable instructions is provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the steps of: extracting target text data to be analyzed from an original text; preprocessing the target text data to obtain word segmentation results of the target text data; obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation results; and inputting the target character vector, the target word vector and the target position vector into a pre-trained text classification model to obtain a target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored in a computer-readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (8)

1. A method of text classification, the method comprising:
extracting target text data to be analyzed from an original text;
preprocessing the target text data to obtain word segmentation results of the target text data;
inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result, and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model;
before extracting the target text data to be analyzed from the original text, the method further comprises:
extracting keywords from the original text to form a keyword set;
determining word frequency-inverse document frequency of the keyword set in the corpus of each category based on the TF-IDF model;
determining the confidence that the original text belongs to each category based on the word frequency-inverse document frequency of the keyword set of the original text in the corpus of each category;
determining a first-level classification label of the original text according to the confidence that the original text belongs to each category;
matching the first-level classification label with preset first-level classification label information, and determining whether to adopt the text classification model to perform text classification on the original text according to the matching result;
the method further comprises the steps of: training the text classification model, the training the text classification model comprising:
acquiring a first training sample set, wherein the first training sample set comprises a first training text, and the first training text comprises a corresponding first classification label;
based on the first training sample set, pre-training an albert model by taking the first classification label as a classification target to obtain an initial text classification model;
judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold;
if the accuracy is greater than the preset threshold, taking the initial text classification model as the final text classification model;
and if the accuracy is not greater than the preset threshold, correcting the classification label corresponding to the first training text, and iterating the initial text classification model based on the corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
2. The text classification method according to claim 1, wherein the preprocessing the target text data to obtain a word segmentation result includes:
and performing stop-word removal or de-duplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain the word segmentation result.
3. The text classification method according to claim 1, wherein the determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold value comprises:
acquiring a second training sample set, wherein the second training sample set contains a second training text;
obtaining a prediction classification label corresponding to the second training text in the second training sample set based on the initial text classification model;
and judging whether the accuracy of the classification result of the initial text classification model is greater than the preset threshold according to the prediction classification label and a second classification label corresponding to the second training text, wherein the second classification label is manually annotated by a user.
4. The text classification method according to claim 1, wherein the pre-training albert model with the first classification label as a classification target based on the first training sample set to obtain an initial text classification model includes:
dividing the first training sample set into training data and verification data according to a preset proportion;
inputting the training data into an initial text classification model to be trained to perform model training;
and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to a verification result.
5. The text classification method according to claim 1, wherein the performing error correction on the classification label corresponding to the first training text includes:
auditing the prediction result to obtain a first training text with correct prediction and a first training text with incorrect prediction;
and manually labeling the first training text with the wrong prediction so that its label is correctly annotated.
6. A text classification apparatus for implementing the text classification method of any one of claims 1 to 5, comprising:
the target text acquisition module is used for extracting target text data to be analyzed from the original text;
the word segmentation module is used for preprocessing the target text data to obtain word segmentation results of the target text data;
the classification module is used for inputting the word segmentation result into a trained text classification model, the text classification model obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result, and obtaining a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
7. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the text classification method of any of claims 1 to 5.
8. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text classification method of any of claims 1 to 5.
CN202110482695.1A 2021-04-30 2021-04-30 Text classification method, apparatus, computer device and storage medium Active CN113011533B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110482695.1A CN113011533B (en) 2021-04-30 2021-04-30 Text classification method, apparatus, computer device and storage medium
PCT/CN2021/097195 WO2022227207A1 (en) 2021-04-30 2021-05-31 Text classification method, apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110482695.1A CN113011533B (en) 2021-04-30 2021-04-30 Text classification method, apparatus, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN113011533A (en) 2021-06-22
CN113011533B (en) 2023-10-24

Family

ID=76380485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110482695.1A Active CN113011533B (en) 2021-04-30 2021-04-30 Text classification method, apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN113011533B (en)
WO (1) WO2022227207A1 (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254657B (en) * 2021-07-07 2021-11-19 明品云(北京)数据科技有限公司 User data classification method and system
CN113486176B (en) * 2021-07-08 2022-11-04 桂林电子科技大学 News classification method based on secondary feature amplification
CN113283235B (en) * 2021-07-21 2021-11-19 明品云(北京)数据科技有限公司 User label prediction method and system
CN113627509A (en) * 2021-08-04 2021-11-09 口碑(上海)信息技术有限公司 Data classification method and device, computer equipment and computer readable storage medium
CN113609860B (en) * 2021-08-05 2023-09-19 湖南特能博世科技有限公司 Text segmentation method and device and computer equipment
CN113836892B (en) * 2021-09-08 2023-08-08 灵犀量子(北京)医疗科技有限公司 Sample size data extraction method and device, electronic equipment and storage medium
CN114706974A (en) * 2021-09-18 2022-07-05 北京墨丘科技有限公司 Technical problem information mining method and device and storage medium
CN114141248A (en) * 2021-11-24 2022-03-04 青岛海尔科技有限公司 Voice data processing method and device, electronic equipment and storage medium
CN115861606B (en) * 2022-05-09 2023-09-08 北京中关村科金技术有限公司 Classification method, device and storage medium for long-tail distributed documents
CN115587185B (en) * 2022-11-25 2023-03-14 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and storage medium
CN115545009B (en) * 2022-12-01 2023-07-07 中科雨辰科技有限公司 Data processing system for acquiring target text
CN115563289B (en) * 2022-12-06 2023-03-07 中信证券股份有限公司 Industry classification label generation method and device, electronic equipment and readable medium
CN115827875B (en) * 2023-01-09 2023-04-25 无锡容智技术有限公司 Text data processing terminal searching method
CN116205601B (en) * 2023-02-27 2024-04-05 开元数智工程咨询集团有限公司 Internet-based engineering list rechecking and data statistics method and system
CN116204645B (en) * 2023-03-02 2024-02-20 北京数美时代科技有限公司 Intelligent text classification method, system, storage medium and electronic equipment
CN115994527B (en) * 2023-03-23 2023-06-09 广东聚智诚科技有限公司 Machine learning-based PPT automatic generation system
CN116975400A (en) * 2023-08-03 2023-10-31 星环信息科技(上海)股份有限公司 Data hierarchical classification method and device, electronic equipment and storage medium
CN116992034B (en) * 2023-09-26 2023-12-22 之江实验室 Intelligent event marking method, device and storage medium
CN117009534B (en) * 2023-10-07 2024-02-13 之江实验室 Text classification method, apparatus, computer device and storage medium
CN117034901B (en) * 2023-10-10 2023-12-08 北京睿企信息科技有限公司 Data statistics system based on text generation template
CN117252514B (en) * 2023-11-20 2024-01-30 中铁四局集团有限公司 Building material library data processing method based on deep learning and model training

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10896385B2 (en) * 2017-07-27 2021-01-19 Logmein, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019149200A1 (en) * 2018-02-01 2019-08-08 腾讯科技(深圳)有限公司 Text classification method, computer device, and storage medium
CN109508378A (en) * 2018-11-26 2019-03-22 平安科技(深圳)有限公司 A kind of sample data processing method and processing device
CN109710770A (en) * 2019-01-31 2019-05-03 北京牡丹电子集团有限责任公司数字电视技术中心 A kind of file classification method and device based on transfer learning
CN112052331A (en) * 2019-06-06 2020-12-08 武汉Tcl集团工业研究院有限公司 Method and terminal for processing text information
WO2021008037A1 (en) * 2019-07-15 2021-01-21 平安科技(深圳)有限公司 A-bilstm neural network-based text classification method, storage medium, and computer device
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN111078887A (en) * 2019-12-20 2020-04-28 厦门市美亚柏科信息股份有限公司 Text classification method and device
CN111125317A (en) * 2019-12-27 2020-05-08 携程计算机技术(上海)有限公司 Model training, classification, system, device and medium for conversational text classification
CN111198948A (en) * 2020-01-08 2020-05-26 深圳前海微众银行股份有限公司 Text classification correction method, device and equipment and computer readable storage medium

Also Published As

Publication number Publication date
WO2022227207A1 (en) 2022-11-03
CN113011533A (en) 2021-06-22

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN113168499A (en) Method for searching patent document
CN113196277A (en) System for retrieving natural language documents
US20240111956A1 (en) Nested named entity recognition method based on part-of-speech awareness, device and storage medium therefor
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111191442A (en) Similar problem generation method, device, equipment and medium
Gunaseelan et al. Automatic extraction of segments from resumes using machine learning
CN114881043A (en) Deep learning model-based legal document semantic similarity evaluation method and system
CN110377753B (en) Relation extraction method and device based on relation trigger word and GRU model
CN112860898A (en) Short text box clustering method, system, equipment and storage medium
Kore et al. Legal document summarization using NLP and ML techniques
Tian et al. Chinese short text multi-classification based on word and part-of-speech tagging embedding
CN114239555A (en) Training method of keyword extraction model and related device
Liu et al. Suggestion mining from online reviews using random multimodel deep learning
JP5342574B2 (en) Topic modeling apparatus, topic modeling method, and program
Kmetty et al. Boosting classification reliability of NLP transformer models in the long run
CN111581339A (en) Method for extracting gene events of biomedical literature based on tree-shaped LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant