CN113011533A - Text classification method and device, computer equipment and storage medium - Google Patents
Text classification method and device, computer equipment and storage medium
- Publication number
- CN113011533A (application CN202110482695.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- classification
- target
- training
- model
- Prior art date
- Legal status: Granted
Classifications
- G06F18/24: Pattern recognition; classification techniques
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F40/211: Natural language analysis; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/284: Natural language analysis; lexical analysis, e.g. tokenisation or collocates
- G06N3/04: Neural networks; architecture, e.g. interconnection topology
- G06N3/08: Neural networks; learning methods
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a text classification method, a text classification device, computer equipment and a storage medium, wherein the method comprises the following steps: extracting target text data to be analyzed from an original text; preprocessing the target text data to obtain a word segmentation result of the target text data; obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result; and inputting the target character vector, the target word vector and the target position vector into a pre-trained text classification model to obtain a target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model. The method processes text data with the albert model and effectively improves the efficiency and accuracy of text classification.
Description
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text classification method, apparatus, computer device, and storage medium.
Background
With the rapid development of network technology, massive information resources exist in the form of text. How to classify these texts effectively and mine useful information from them quickly, accurately and comprehensively has become one of the hot topics in natural language processing research. Text classification assigns each document in a document set to a category according to predefined subject categories. The technique is widely used in daily applications, for example in assigning patent texts to technical fields.
Compared with general text, patent text has a specialized structure, highly technical language and abundant domain-specific vocabulary, so a more targeted classification method is required. Patent text classification belongs to the field of natural language processing and generally comprises data preprocessing, text feature representation, classifier selection and effect evaluation, of which text feature representation and classifier selection are the most important, as they directly affect the accuracy of the classification result.
In the prior art, text classification methods based on traditional machine learning, such as the TF-IDF method, measure the importance of words only by word frequency and then form a feature-value sequence for the document; the words are treated as independent of each other, so sequence information cannot be captured. The method is also susceptible to skewed data sets: if one class contains too many documents, the IDF is underestimated, and the usual remedy is to increase the class weight. Intra-class and inter-class distribution bias is not considered when the method is used for feature selection. Text classification methods based on deep learning include Facebook's open-source FastText, Text-CNN and Text-RNN. TextCNN performs well on many tasks, but CNNs have the fundamental limitation of a fixed filter_size view: longer sequence information cannot be modeled, and tuning the filter_size hyperparameter is cumbersome. In essence, a CNN performs feature extraction on the text, whereas a recurrent neural network (RNN), which is more commonly used in natural language processing, expresses context information better. Although CNNs and RNNs achieve clear results on text classification tasks, they are not intuitive and are poorly interpretable, which becomes especially apparent when analyzing bad cases.
Disclosure of Invention
The application provides a text classification method, a text classification device, computer equipment and a storage medium.
A first aspect provides a text classification method, the method comprising:
extracting target text data to be analyzed from an original text;
preprocessing the target text data to obtain a word segmentation result of the target text data;
inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
In some embodiments, before extracting text data to be classified from the original text, the method further includes:
extracting keywords in the original text to be processed, and forming a keyword set;
determining the word frequency-inverse document frequency of the keyword set in the corpus of each category based on a TF-IDF model;
determining confidence coefficients of the original text belonging to all categories based on word frequency-inverse document frequency of the keyword set of the original text in the corpus of all categories;
determining a primary classification label of the original text according to the confidence coefficient of the original text belonging to each category;
and matching the primary classification label with preset primary classification label information, and determining whether to adopt the text classification model to perform text classification on the original text according to a matching result.
In some embodiments, the preprocessing the text data to obtain a word segmentation result includes:
performing at least one of stop-word removal and deduplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain a word segmentation result.
In some embodiments, the method further comprises pre-training the text classification model, which includes:
acquiring a first training sample set, wherein the first training sample set comprises a first training text, and the first training text comprises a corresponding first classification label;
pre-training an albert model by taking the first classification label as a classification target based on the first training sample set to obtain an initial text classification model;
judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold;
if the accuracy is greater than the preset threshold, taking the initial text classification model as the final text classification model;
and if the accuracy is not greater than the preset threshold, correcting errors in the classification labels corresponding to the first training texts, and iterating the initial text classification model based on the error-corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
In some embodiments, the determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold includes:
acquiring a second training sample set, wherein the second training sample set comprises a second training text;
obtaining a prediction classification label corresponding to a second training text in the second training sample set based on the initial text classification model;
and judging whether the accuracy of the classification result of the initial classification model is greater than a preset threshold value or not according to the prediction classification label and a second classification label corresponding to the second training text, wherein the second classification label is manually labeled by a user.
In some embodiments, the pre-training the albert model with the first classification label as a classification target based on the first training sample set to obtain an initial text classification model includes:
dividing the first training sample set into training data and verification data according to a preset proportion;
inputting the training data into an initial text classification model to be trained for model training;
and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to a verification result.
In some embodiments, the correcting the classification label corresponding to the first training text includes:
reviewing the prediction results to separate correctly predicted first training texts from incorrectly predicted ones;
and manually relabeling the incorrectly predicted first training texts so that their labels are correct.
A second aspect provides a text classification apparatus, including:
the target text acquisition module is used for extracting target text data to be analyzed from the original text;
the word segmentation module is used for preprocessing the target text data to obtain a word segmentation result of the target text data;
the classification module is used for inputting the word segmentation result into a trained text classification model, and the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
A third aspect provides a computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the text classification method described above.
A fourth aspect provides a storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text classification method described above.
The text classification method first extracts the target text data to be analyzed from an original text; it then preprocesses the target text data to obtain a word segmentation result; finally, it inputs the word segmentation result into a trained text classification model, which derives a target character vector, a target word vector and a target position vector corresponding to the target text data from the word segmentation result and obtains a target classification label from those vectors. Because the albert model is used to process the text data, the resulting vector sequence contains both the textual content and the context of the text data, so full-text semantic information is fused in; the encoded information is therefore more comprehensive, which facilitates subsequent classification, improves its accuracy, and improves the overall classification effect.
Drawings
FIG. 1 is a diagram of an implementation environment for a text classification method provided in one embodiment;
FIG. 2 is a block diagram showing an internal configuration of a computer device according to an embodiment;
FIG. 3 is a flow diagram of a method of text classification in one embodiment;
FIG. 4 is a block diagram showing the structure of a text classification apparatus in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
FIG. 1 is a diagram of an implementation environment of the text classification method provided in an embodiment. As shown in fig. 1, the implementation environment includes a computer device 110 and a terminal 120.
The computer device 110 is a text classification server. The terminal 120 acquires the text to be classified and has an output interface for classification results; when text classification is required, the text to be classified is acquired through the terminal 120 and classified by the computer device 110.
It should be noted that the terminal 120 and the computer device 110 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The computer device 110 and the terminal 120 may be connected through Bluetooth, USB (Universal Serial Bus), or other communication connection methods, which are not limited herein.
FIG. 2 is a diagram showing an internal configuration of a computer device according to an embodiment. As shown in fig. 2, the computer device includes a processor, a storage medium, a memory, and a network API interface connected by a system bus. The storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a text classification method. The processor provides the computation and control capability that supports the operation of the whole computer device. The memory may also store computer-readable instructions that, when executed by the processor, cause the processor to perform the text classification method. The network API interface is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 2 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or arrange components differently.
For convenience of understanding, terms referred to in the embodiments of the present application will be first described below.
albert model: a language model published by google in 2018 that trains a deep bi-directional representation by joining bi-directional transducers in all layers. The albert model integrates the advantages of a plurality of natural language processing models and achieves better effect in a plurality of natural language processing tasks. In the related art, the model input vector of the albert model is the sum of the vectors of a word vector (TokenEmbedding), a position vector (PositionEmbedding), and a sentence vector (SegmentEmbedding). The word vector is vectorized representation of characters, the position vector is used for representing positions of the characters in the text, and the sentence vector is used for representing the sequence of sentences in the text.
Pre-training (pre-training): a process in which a neural network model learns the common features of a data set by being trained on a large data set. Pre-training is intended to provide good initial model parameters for subsequent training of the neural network model on a specific data set. In the embodiments of the application, pre-training refers to training the albert model with unlabeled training text.
Fine-tuning (fine-tuning): a process for further training a pre-trained neural network model using a particular data set. In general, the data amount of the data set used in the fine tuning stage is smaller than that of the data set used in the pre-training stage, and the fine tuning stage adopts a supervised learning manner, that is, the training samples in the data set used in the fine tuning stage include labeled information. The fine tuning stage in the embodiment of the present application refers to training the albert model using a training text containing classification labels.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
As shown in fig. 3, in an embodiment, a text classification method is provided, which may be applied to the computer device 110 described above, and specifically includes the following steps:
step 101, extracting target text data to be analyzed from an original text;
the original text can be a patent text, and the patent text has the characteristics of special structure, strong professional, more field vocabularies and the like, and a more targeted classification method needs to be adopted. The patent text classification belongs to the field of natural language processing, and generally comprises the steps of data preprocessing, text feature representation, classifier selection, effect evaluation and the like, wherein the text feature representation and the classifier selection are the most important, and the accuracy of a classification result is directly influenced.
In the present embodiment, text data of the specification abstract, the claims, and the title part of the specification in the patent text is extracted as target text data.
Step 102, preprocessing target text data to obtain word segmentation results of the target text data;
In this embodiment, the target text data is preprocessed to extract useful data and delete noise data from the original text, so that text data irrelevant to the purpose of the extraction can be removed.
In some embodiments, the step 102 may include: performing at least one of stop-word removal and deduplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain a word segmentation result.
When deleting noise data, duplicate data in the original text data is removed by deduplication, and noise data in the original text data is removed by deletion.
Stop words are characters or words that are automatically filtered out before or after processing natural language text in information retrieval, in order to save storage space and improve search efficiency; such characters or words are called stop words (Stop Words).
In this embodiment, stop-word removal can strip words that contribute nothing to the text features, such as punctuation marks, modal particles, names, meaningless garbled characters, spaces, and the like. The method selected here is stop-word-list filtering: the words in the text data are matched one-to-one against a constructed stop-word list, and if a match succeeds, the word is a stop word and is deleted.
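As a concrete illustration, a minimal sketch of such stop-word-list filtering in Python follows; the stop-word entries are placeholder assumptions, not a prescribed list:

```python
# A minimal sketch of stop-word-list filtering, assuming the list has already
# been constructed; the entries below are illustrative placeholders only.
stopwords = {"的", "了", "是", "和", ",", "。", " "}

def filter_stopwords(tokens):
    """One-to-one matching against the stop-word list: a token that matches
    an entry is treated as a stop word and deleted."""
    return [tok for tok in tokens if tok not in stopwords]

print(filter_stopwords(["本", "发明", "涉及", "的", "文本", "分类", "方法", "。"]))
# -> ['本', '发明', '涉及', '文本', '分类', '方法']
```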
To obtain the target text data in vector form, the second text data first needs to be segmented into words. Word segmentation is a basic task in lexical analysis, and segmentation algorithms fall into two broad categories according to their core idea. One is dictionary-based segmentation, which first splits the text into words according to a dictionary and then searches for the optimal combination of those words. The other is character-based segmentation, in which words are built from characters: the sentence is first split into individual characters, the characters are then combined into words, and an optimal segmentation strategy is sought; this can also be cast as a sequence labeling problem. The segmentation algorithm adopted in this embodiment may be a rule-based, understanding-based or statistics-based method.
A rule-based segmentation method (e.g., one based on string matching) matches the Chinese character string to be analyzed against the entries of a "sufficiently large" dictionary according to a certain policy; if a string is found in the dictionary, the match succeeds and a word is recognized. Common rule-based methods include forward maximum matching (scanning left to right), reverse maximum matching (scanning right to left), and least segmentation (minimizing the number of words cut from each sentence). Forward maximum matching takes a substring of bounded length from the text and matches it against the dictionary; if the match succeeds, the next round of matching proceeds until the whole string is processed; otherwise one character is removed from the end of the substring and matching is retried, and so on. Reverse maximum matching works analogously.
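A minimal sketch of forward maximum matching follows; the toy dictionary and the maximum word length are illustrative assumptions:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching: scan left to right, trying the longest
    dictionary entry first and shrinking the window on each failed match;
    unmatched single characters pass through as-is."""
    tokens, i = [], 0
    while i < len(text):
        matched = False
        for size in range(min(max_len, len(text) - i), 1, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                tokens.append(piece)
                i += size
                matched = True
                break
        if not matched:  # no multi-character entry matched at this position
            tokens.append(text[i])
            i += 1
    return tokens

# Toy dictionary; the entries are assumptions for illustration only.
dictionary = {"文本", "分类", "文本分类", "方法"}
print(forward_max_match("文本分类方法", dictionary))  # ['文本分类', '方法']
```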
An understanding-based segmentation method has the computer simulate a person's understanding of a sentence in order to recognize words: syntactic and semantic analysis is performed during segmentation, and syntactic and semantic information is used to resolve ambiguity. A statistics-based segmentation method rests on the observation that a word is formally a stable combination of characters, so the more often adjacent characters co-occur, the more likely they are to constitute a word; the frequency or probability of characters co-occurring with their neighbors therefore reflects the credibility of a word. The mutual information of adjacent co-occurring characters in the text data is computed by counting the frequency of their combinations; it reflects how tightly the characters are bound, and when it exceeds a threshold the character group may be considered to form a word. In practical applications, a statistical segmentation system usually uses a basic segmentation dictionary for string-matching segmentation while using statistical methods to identify new words. Combining string frequency statistics with string matching retains the speed and efficiency of dictionary matching while gaining the advantages of dictionary-free segmentation: recognizing words from context and resolving ambiguity automatically.
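For reference, the open-source jieba segmenter implements exactly this combination of dictionary matching and statistics (a prefix dictionary plus an HMM for out-of-vocabulary words); using it here is an illustrative choice, since this embodiment does not name a specific tool:

```python
import jieba  # open-source Chinese word segmenter

text = "本发明涉及一种文本分类方法"
print(jieba.lcut(text))             # dictionary matching + HMM for new words
print(jieba.lcut(text, HMM=False))  # pure dictionary matching, for comparison
```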
After word segmentation, the original text data is represented by a series of keywords, but data in text form cannot be processed directly by the subsequent classification algorithm and must be converted into numerical form; therefore, the keywords are converted into word vectors to obtain the text data to be classified in text-vector form.
Step 103, inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
In the embodiment of the application, three kinds of vectors are used when training the albert model: character vectors, word vectors and position vectors.
Optionally, the word vectors are obtained by converting the words with a word2vec model.
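A minimal sketch of such a word2vec conversion with the gensim library follows; the toy corpus and the hyperparameters are assumptions:

```python
from gensim.models import Word2Vec

# Each training document is a list of segmented words (toy corpus).
corpus = [
    ["文本", "分类", "方法"],
    ["文本", "数据", "预处理"],
    ["分类", "模型", "训练"],
]
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=10)
vector = model.wv["文本"]  # 100-dimensional word vector for the word "文本"
```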
In this embodiment, step 103 may include the following steps:
Step 1031, obtaining word vectors corresponding to the text data according to the part-of-speech and position information of the text data. In this embodiment, position information is added to the text data using positional coding, the text data with added position information is represented by an initial word vector, the part of speech of the text data is acquired and converted into a part-of-speech vector, and the initial word vector and the part-of-speech vector are added together to obtain the word vector corresponding to the text data.
Step 1032, inputting the word vectors into the albert model for data processing to obtain a word matrix of the text data.
Step 1033, acquiring a word vector sequence of the text data according to the word matrix. In this embodiment, the word matrix is used to predict whether two sentences in the text data are consecutive, as well as the masked words in the two sentences and their part-of-speech features, and the part-of-speech features are normalized to obtain the word vector sequence of the text data.
It should be understood that the albert model used in this embodiment is a model obtained through pre-training, so when processing text data, only the text data needs to be input into the pre-trained albert model to obtain a word vector sequence corresponding to the text data.
In order for the albert model to perform text classification, a classifier needs to be arranged in the albert model. Optionally, the classification categories and their number depend on the classification task that the text classification model needs to implement, and the classifier may be a multi-class classifier (such as a softmax classifier). The specific type of classifier is not limited in the embodiments of the present application.
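A hedged sketch of an albert model with a softmax classification head, using the Hugging Face transformers library, follows. The checkpoint name and the number of labels are assumptions, and the BERT-style tokenizer reflects the fact that publicly available Chinese ALBERT checkpoints typically ship a BERT vocabulary:

```python
import torch
from transformers import BertTokenizer, AlbertForSequenceClassification

# Checkpoint name and number of classes are illustrative assumptions.
name = "voidful/albert_chinese_tiny"
tokenizer = BertTokenizer.from_pretrained(name)
model = AlbertForSequenceClassification.from_pretrained(name, num_labels=8)

inputs = tokenizer("一种文本分类方法", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)    # softmax classifier output
target_label = int(probs.argmax(dim=-1)) # predicted classification label
```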
In some embodiments, before extracting text data to be classified from an original text, the text classification method further includes:
step 100a, extracting keywords in an original text and forming a keyword set;
step 100b, determining word frequency-inverse document frequency of the keyword set in the corpus of each category based on the TF-IDF model;
specifically, text features matched with the keywords in the text features of the corpus of the category are determined, and the word frequency-inverse document frequency of the matched text features is used as the word frequency-inverse document frequency of the keywords. The text in a corpus of a certain category is divided into a plurality of sentences according to punctuation marks such as periods, question marks, exclamation marks, semicolons and the like, and text features in each sentence are extracted. And respectively establishing a text feature library for each category according to the extracted text features. And respectively counting the frequency of each text feature under each category. And counting the inverse document frequency of each text feature, namely the natural logarithm value of the quotient of the total category number and the category number containing the text feature, and respectively calculating the word frequency-inverse document frequency of each text feature under each category.
Step 100c, determining confidence coefficients of the original text belonging to each category based on the word frequency-inverse document frequency of the keyword set of the original text in the corpus of each category;
Specifically, the following operations are performed for each category: determining the number of occurrences of the keywords in the corpus of the category; determining the class-conditional probability of the original text relative to the category according to the word frequency-inverse document frequency of the keywords in the corpus of the category and their number of occurrences; and determining the confidence degree that the text to be classified belongs to the category according to the class-conditional probability of the original text relative to the category.
Step 100d, determining a primary classification label of the original text according to the confidence coefficient of the original text belonging to each category;
Specifically, among the confidence degrees of the original text belonging to each category, the category with the highest confidence is used as the primary classification label of the text to be classified.
And step 100e, matching the primary classification labels with preset primary classification label information, and determining whether to adopt a text classification model to perform text classification on the original text according to a matching result.
It can be understood that the target classification label obtained in steps 101 to 103 above is the bottom-level classification label of the patent text. For example, a patent text may carry three levels of classification labels: there is only one first-level label, while there are at least two second-level labels and at least two third-level labels. In this step, the first-level (primary) classification label is therefore determined from the keywords of the original text via the TF-IDF model; if it does not match the preset primary classification label information of the patent text, the original text does not need to be classified further by the model. The primary classification label sits above the bottom-level label and can be set manually.
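A simplified sketch of this primary classification follows; the per-category corpora and the use of summed TF-IDF as the confidence score are illustrative assumptions, since the embodiment leaves the exact confidence formula open:

```python
import math
from collections import Counter

# Toy per-category corpora of segmented words (placeholders).
corpora = {
    "云计算": ["虚拟机", "云", "计算", "资源", "调度", "云"],
    "图像处理": ["图像", "分割", "像素", "卷积", "图像", "滤波"],
}

def tfidf(term, category):
    """Term frequency within the category corpus, times the category-level
    inverse document frequency ln(total categories / categories containing term)."""
    words = corpora[category]
    tf = Counter(words)[term] / len(words)
    df = sum(term in words for words in corpora.values())
    return tf * math.log(len(corpora) / df) if df else 0.0

def confidences(keywords):
    # Confidence of each category, approximated by the summed TF-IDF of the keyword set.
    return {cat: sum(tfidf(k, cat) for k in keywords) for cat in corpora}

scores = confidences(["图像", "卷积"])        # keyword set of the original text
primary_label = max(scores, key=scores.get)  # -> "图像处理"
```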
In some embodiments, the text classification method further includes pre-training the text classification model, which comprises the following steps:
1001, acquiring a first training sample set, wherein the first training sample set comprises a first training text, and the first training text comprises a corresponding first classification label;
optionally, the first training sample set is a specific data set related to text classification, where the training text includes corresponding classification labels, the classification labels may be labeled manually, and the classification labels belong to a classification result of a text classification model. In one illustrative example, when a text classification model is used to classify patent text, the classification labels include specific different technical fields, such as cloud computing, image processing, and the like. The embodiment of the present application does not limit the specific content of the classification label.
Step 1002, pre-training an albert model by taking a first classification label as a classification target based on a first training sample set to obtain an initial text classification model;
the step 1002 may include:
dividing the first training sample set into training data and verification data according to a preset proportion;
inputting training data into an initial text classification model to be trained for model training;
and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to a verification result.
In this step, the first training sample set is divided in a 9:1 ratio: 90% of it serves as the training set and 10% as the verification set. After the model has been trained on the 90% of the data, a prediction model is generated and used to predict the remaining 10% of the samples, and the model parameters are tuned according to the results to obtain the initial text classification model.
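A minimal sketch of the 9:1 split with scikit-learn follows; the texts and labels are placeholders:

```python
from sklearn.model_selection import train_test_split

texts = [f"训练文本{i}" for i in range(100)]  # placeholder first training texts
labels = [i % 4 for i in range(100)]          # placeholder first classification labels

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)  # 90% training data, 10% verification data
```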
Step 1003, judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold;
step 1004, if the accuracy is greater than the preset threshold, taking the initial text classification model as the final text classification model;
step 1005, if the accuracy is not greater than the preset threshold, correcting errors in the classification labels corresponding to the first training texts, and iterating the initial text classification model based on the error-corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
It can be understood that, in step 1005, iterating the initial text classification model based on the error-corrected first training sample set means that the initial model may be optimized on all or part of the corrected sample set. The number of iterations is determined by checking whether the accuracy of the fine-tuned initial text classification model exceeds the preset threshold: if it does, iteration stops; if not, optimization training of the initial text classification model continues.
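Schematically, this iterate-until-accurate logic can be expressed as follows; train, evaluate and relabel_errors are hypothetical helpers standing in for model training, the accuracy check on verification data, and manual label correction:

```python
PRESET_THRESHOLD = 0.95  # preset accuracy threshold (the value is an assumption)

def iterate_until_accurate(samples, train, evaluate, relabel_errors):
    """Retrain on the error-corrected sample set until accuracy exceeds the threshold."""
    model = train(samples)
    while evaluate(model) <= PRESET_THRESHOLD:
        samples = relabel_errors(model, samples)  # manual error correction of labels
        model = train(samples)                    # iterate the initial model
    return model                                  # final text classification model
```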
In the step 1003, determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold may include:
1003a, obtaining a second training sample set, wherein the second training sample set comprises a second training text;
1003b, obtaining a prediction classification label corresponding to the second training text in the second training sample set based on the initial text classification model;
1003c, judging whether the accuracy of the classification result of the initial classification model is greater than the preset threshold according to the prediction classification labels and the second classification labels corresponding to the second training texts, wherein the second classification labels are manually annotated by a user.
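A minimal sketch of this accuracy check with scikit-learn follows; the label values are placeholders:

```python
from sklearn.metrics import accuracy_score

predicted_labels = [0, 1, 2, 1, 0]  # prediction classification labels from the initial model
manual_labels = [0, 1, 2, 0, 0]     # user-annotated second classification labels

accuracy = accuracy_score(manual_labels, predicted_labels)  # 0.8 for this toy data
exceeds_threshold = accuracy > 0.95
```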
In this embodiment, a second training sample set different from the first training sample set is used as verification data for checking the accuracy of the classification result of the initial text classification model. This extends the data available to the initial classification model and avoids the low accuracy that errors in the original classification labels of the first training sample set would otherwise cause.
In the step 1005, the correcting the classification label corresponding to the first training text may include:
1005a, auditing the prediction result to obtain a first training text with correct prediction and a first training text with wrong prediction;
and 1005b, manually labeling the first training text with the prediction error so as to correctly label the label of the first training text with the prediction error.
In this embodiment, when the initial predictions of the initial text classification model are inaccurate, the model is iterated as described above so that its predictions become more accurate.
In some embodiments, the computer device adjusts the network parameters of the albert model with gradient descent and back propagation according to the error between the prediction result and the classification label, until the error satisfies a convergence condition.
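A hedged sketch of one such parameter update step in PyTorch follows; the checkpoint, optimizer and learning rate are assumptions, and AlbertForSequenceClassification computes the cross-entropy error internally when labels are supplied:

```python
import torch
from transformers import BertTokenizer, AlbertForSequenceClassification

name = "voidful/albert_chinese_tiny"  # illustrative checkpoint, as before
tokenizer = BertTokenizer.from_pretrained(name)
model = AlbertForSequenceClassification.from_pretrained(name, num_labels=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["一种文本分类方法", "一种图像处理装置"]  # placeholder training texts
labels = torch.tensor([0, 1])                    # placeholder classification labels

model.train()
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
loss = model(**batch, labels=labels).loss  # error between prediction and labels
loss.backward()                            # back propagation
optimizer.step()                           # gradient-descent parameter update
optimizer.zero_grad()
```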
In one possible implementation, since the pre-trained albert model has learned the context of the text, the data size of the second training sample set used for fine tuning is much smaller than the data size of the first training sample set.
Similar to the pre-training process, in order for the text classification model to learn the mapping between texts and classification labels, the albert model is fine-tuned with the character vectors, position vectors and sentence vectors of the characters in the second training text as input.
In a possible implementation of the fine-tuning process, the computer device takes the second target character vector, the second target word vector and the second target position vector of the second training sample set as input vectors of the albert model to obtain the text classification prediction result output by the model; the albert model is then fine-tuned with the classification labels corresponding to the second training texts as supervision, and the text classification model is finally obtained through training.
As shown in fig. 4, in one embodiment, a text classification apparatus is provided, which may be integrated in the computer device 110, and may specifically include
A target text obtaining module 411, configured to extract target text data to be analyzed from an original text;
the word segmentation module 412 is configured to pre-process the target text data to obtain a word segmentation result of the target text data;
the vector obtaining module 413 is configured to obtain a target character vector, a target word vector and a target position vector corresponding to the words in the target text to be classified;
the classification module 414 is configured to input the target character vector, the target word vector and the target position vector into the text classification model to obtain the target classification label output by the text classification model, where the text classification model is trained as described in the method above.
In one embodiment, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: extracting target text data to be analyzed from an original text; preprocessing the target text data to obtain a word segmentation result of the target text data; obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result; and inputting the target character vector, the target word vector and the target position vector into a pre-trained text classification model to obtain a target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model.
In one embodiment, before extracting text data to be classified from the original text, the method further includes: extracting key words from the original text based on a TF-IDF model, and forming a key word set; determining a primary classification label of the original text according to the keyword set; and matching the primary classification label with preset primary classification label information, and determining whether to adopt a text classification model to perform text classification on the original text according to a matching result.
In one embodiment, the original text is patent text data, and the extracting text data to be classified from the original text includes: text data of a specification abstract, a claim and a specification title part in a patent text are extracted as text data to be classified.
In one embodiment, inputting the word segmentation result into a pre-trained albert model to obtain a word vector sequence corresponding to the text data, including: acquiring word vectors corresponding to the text data according to the part of speech and the position information of the text data; inputting the word vector into an albert model for data processing to obtain a word matrix of the text data; and acquiring a word vector sequence of the text data according to the word matrix.
In one embodiment, the determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold includes: acquiring a second training sample set, wherein the second training sample set comprises a second training text; obtaining a prediction classification label corresponding to a second training text in a second training sample set based on the initial text classification model; and judging whether the accuracy of the classification result of the initial classification model is greater than a preset threshold value or not according to the predicted classification label and a second classification label corresponding to the second training text, wherein the second classification label is manually labeled by a user.
In one embodiment, pre-training an albert model with a first classification label as a classification target based on a first training sample set to obtain an initial text classification model, includes: dividing the first training sample set into training data and verification data according to a preset proportion; inputting training data into an initial text classification model to be trained for model training; and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to a verification result.
In one embodiment, the correcting the classification label corresponding to the first training text includes:
auditing the prediction result to obtain a first training text with correct prediction and a first training text with wrong prediction;
and manually labeling the first training text with the prediction error so as to correctly label the label of the first training text with the prediction error.
In one embodiment, a storage medium is provided that stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: extracting target text data to be analyzed from an original text; preprocessing the target text data to obtain a word segmentation result of the target text data; obtaining a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result; and inputting the target character vector, the target word vector and the target position vector into a pre-trained text classification model to obtain a target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A method of text classification, the method comprising:
extracting target text data to be analyzed from an original text;
preprocessing the target text data to obtain a word segmentation result of the target text data;
inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
2. The text classification method according to claim 1, further comprising, before extracting text data to be classified from an original text:
extracting keywords in the original text to be processed, and forming a keyword set;
determining the word frequency-inverse document frequency of the keyword set in the corpus of each category based on a TF-IDF model;
determining confidence coefficients of the original text belonging to all categories based on word frequency-inverse document frequency of the keyword set of the original text in the corpus of all categories;
determining a primary classification label of the original text according to the confidence coefficient of the original text belonging to each category;
and matching the primary classification label with preset primary classification label information, and determining whether to adopt the text classification model to perform text classification on the original text according to a matching result.
3. The method according to claim 1, wherein the preprocessing the text data to obtain a word segmentation result comprises:
performing at least one of stop-word removal and deduplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain a word segmentation result.
4. The method of text classification according to claim 1, characterized in that the method further comprises training the text classification model, which comprises:
acquiring a first training sample set, wherein the first training sample set comprises a first training text, and the first training text comprises a corresponding first classification label;
pre-training an albert model by taking the first classification label as a classification target based on the first training sample set to obtain an initial text classification model;
judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold;
if the accuracy is greater than the preset threshold, taking the initial text classification model as a final text classification model;
and if the accuracy is not greater than the preset threshold, correcting errors in the classification labels corresponding to the first training texts, and iterating the initial text classification model based on the error-corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
5. The method for training the text classification model according to claim 4, wherein the determining whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold comprises:
acquiring a second training sample set, wherein the second training sample set comprises a second training text;
obtaining a prediction classification label corresponding to a second training text in the second training sample set based on the initial text classification model;
and judging whether the accuracy of the classification result of the initial classification model is greater than a preset threshold value or not according to the prediction classification label and a second classification label corresponding to the second training text, wherein the second classification label is manually labeled by a user.
6. The method according to claim 4, wherein the pre-training the ALBERT model with the first classification label as the classification target based on the first training sample set to obtain the initial text classification model comprises:
dividing the first training sample set into training data and verification data according to a preset proportion;
inputting the training data into the initial text classification model to be trained for model training;
and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to the verification result.
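A common realization of this split, assuming scikit-learn and an 8:2 preset proportion (the claim does not fix the ratio), is sketched below with placeholder data:

```python
# Sketch of claim 6's division into training and verification data.
from sklearn.model_selection import train_test_split

# Hypothetical first training sample set.
texts  = ["text %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
```

Stratifying on the labels keeps the class distribution of the verification data consistent with the training data, which makes the verification result a fairer basis for optimizing the model.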
7. The method according to claim 4, wherein the performing error correction on the classification label corresponding to the first training text comprises:
auditing the prediction results to obtain the first training texts with correct predictions and the first training texts with incorrect predictions;
and manually re-labeling the incorrectly predicted first training texts so that their classification labels are correct.
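A minimal sketch of the audit step, with illustrative names, simply partitions the first training texts by whether the model's prediction matched the current label; the incorrect partition is what gets handed to the human annotator:

```python
# Sketch of claim 7's audit: split first training texts by prediction correctness.
def split_by_correctness(texts, predicted, current_labels):
    correct   = [t for t, p, g in zip(texts, predicted, current_labels) if p == g]
    incorrect = [t for t, p, g in zip(texts, predicted, current_labels) if p != g]
    return correct, incorrect
```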
8. A text classification apparatus, comprising:
a target text acquisition module, used for extracting target text data to be analyzed from an original text;
a word segmentation module, used for preprocessing the target text data to obtain a word segmentation result of the target text data;
and a classification module, used for inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains a target character vector, a target word vector and a target position vector corresponding to the target text data based on the word segmentation result, and obtains a target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained ALBERT model.
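The three vectors named in the classification module can be combined in the usual BERT/ALBERT way, by summing the embeddings. The PyTorch sketch below illustrates that convention only; the vocabulary sizes, dimension and character/word split are assumptions, not the patented model:

```python
# Sketch of summing character, word and position embeddings (assumes PyTorch).
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, char_vocab=21128, word_vocab=30000, max_len=512, dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, dim)  # target character vectors
        self.word_emb = nn.Embedding(word_vocab, dim)  # target word vectors
        self.pos_emb  = nn.Embedding(max_len, dim)     # target position vectors

    def forward(self, char_ids, word_ids):
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        return self.char_emb(char_ids) + self.word_emb(word_ids) + self.pos_emb(positions)

# Usage with dummy ids: batch of 2 sequences of length 8.
emb = InputEmbedding()(torch.zeros(2, 8, dtype=torch.long), torch.zeros(2, 8, dtype=torch.long))
print(emb.shape)  # torch.Size([2, 8, 128])
```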
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to carry out the steps of the text classification method according to any one of claims 1 to 7.
10. A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the text classification method of any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110482695.1A CN113011533B (en) | 2021-04-30 | 2021-04-30 | Text classification method, apparatus, computer device and storage medium |
PCT/CN2021/097195 WO2022227207A1 (en) | 2021-04-30 | 2021-05-31 | Text classification method, apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110482695.1A CN113011533B (en) | 2021-04-30 | 2021-04-30 | Text classification method, apparatus, computer device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113011533A (en) | 2021-06-22
CN113011533B CN113011533B (en) | 2023-10-24 |
Family
ID=76380485
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110482695.1A Active CN113011533B (en) | 2021-04-30 | 2021-04-30 | Text classification method, apparatus, computer device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113011533B (en) |
WO (1) | WO2022227207A1 (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115545009B (en) * | 2022-12-01 | 2023-07-07 | 中科雨辰科技有限公司 | Data processing system for acquiring target text |
CN115563289B (en) * | 2022-12-06 | 2023-03-07 | 中信证券股份有限公司 | Industry classification label generation method and device, electronic equipment and readable medium |
CN115827875B (en) * | 2023-01-09 | 2023-04-25 | 无锡容智技术有限公司 | Text data processing terminal searching method |
CN116205601B (en) * | 2023-02-27 | 2024-04-05 | 开元数智工程咨询集团有限公司 | Internet-based engineering list rechecking and data statistics method and system |
CN116204645B (en) * | 2023-03-02 | 2024-02-20 | 北京数美时代科技有限公司 | Intelligent text classification method, system, storage medium and electronic equipment |
CN115994527B (en) * | 2023-03-23 | 2023-06-09 | 广东聚智诚科技有限公司 | Machine learning-based PPT automatic generation system |
CN116992034B (en) * | 2023-09-26 | 2023-12-22 | 之江实验室 | Intelligent event marking method, device and storage medium |
CN117009534B (en) * | 2023-10-07 | 2024-02-13 | 之江实验室 | Text classification method, apparatus, computer device and storage medium |
CN117034901B (en) * | 2023-10-10 | 2023-12-08 | 北京睿企信息科技有限公司 | Data statistics system based on text generation template |
CN117252514B (en) * | 2023-11-20 | 2024-01-30 | 中铁四局集团有限公司 | Building material library data processing method based on deep learning and model training |
CN117743573B (en) * | 2023-12-11 | 2024-10-18 | 中国科学院文献情报中心 | Corpus automatic labeling method and device, storage medium and electronic equipment |
CN117743857B (en) * | 2023-12-29 | 2024-09-17 | 北京海泰方圆科技股份有限公司 | Text correction model training, text correction method, device, equipment and medium |
CN117951007A (en) * | 2024-01-09 | 2024-04-30 | 航天中认软件测评科技(北京)有限责任公司 | Test case classification method based on theme |
CN117910479B (en) * | 2024-03-19 | 2024-06-04 | 湖南蚁坊软件股份有限公司 | Method, device, equipment and medium for judging aggregated news |
CN117992600B (en) * | 2024-04-07 | 2024-06-11 | 之江实验室 | Service execution method and device, storage medium and electronic equipment |
CN118193743B (en) * | 2024-05-20 | 2024-08-16 | 山东齐鲁壹点传媒有限公司 | Multi-level text classification method based on pre-training model |
CN118332091B (en) * | 2024-06-06 | 2024-08-09 | 中电信数智科技有限公司 | Ancient book knowledge base intelligent question-answering method, device and equipment based on large model technology |
CN118503795B (en) * | 2024-07-18 | 2024-09-20 | 北京睿企信息科技有限公司 | Text label verification method, electronic equipment and storage medium |
CN118503796B (en) * | 2024-07-18 | 2024-09-20 | 北京睿企信息科技有限公司 | Label system construction method, device, equipment and medium |
CN118503399B (en) * | 2024-07-18 | 2024-09-20 | 北京睿企信息科技有限公司 | Standardized text acquisition method, device, equipment and medium |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190034823A1 (en) * | 2017-07-27 | 2019-01-31 | Getgo, Inc. | Real time learning of text classification models for fast and efficient labeling of training data and customization |
WO2019149200A1 (en) * | 2018-02-01 | 2019-08-08 | 腾讯科技(深圳)有限公司 | Text classification method, computer device, and storage medium |
CN109508378A (en) * | 2018-11-26 | 2019-03-22 | 平安科技(深圳)有限公司 | A kind of sample data processing method and processing device |
CN109710770A (en) * | 2019-01-31 | 2019-05-03 | 北京牡丹电子集团有限责任公司数字电视技术中心 | A kind of file classification method and device based on transfer learning |
CN112052331A (en) * | 2019-06-06 | 2020-12-08 | 武汉Tcl集团工业研究院有限公司 | Method and terminal for processing text information |
WO2021008037A1 (en) * | 2019-07-15 | 2021-01-21 | 平安科技(深圳)有限公司 | A-bilstm neural network-based text classification method, storage medium, and computer device |
CN110717039A (en) * | 2019-09-17 | 2020-01-21 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
CN111078887A (en) * | 2019-12-20 | 2020-04-28 | 厦门市美亚柏科信息股份有限公司 | Text classification method and device |
CN111125317A (en) * | 2019-12-27 | 2020-05-08 | 携程计算机技术(上海)有限公司 | Model training, classification, system, device and medium for conversational text classification |
CN111198948A (en) * | 2020-01-08 | 2020-05-26 | 深圳前海微众银行股份有限公司 | Text classification correction method, device and equipment and computer readable storage medium |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254657A (en) * | 2021-07-07 | 2021-08-13 | 明品云(北京)数据科技有限公司 | User data classification method and system |
CN113486176A (en) * | 2021-07-08 | 2021-10-08 | 桂林电子科技大学 | News classification method based on secondary feature amplification |
CN113283235A (en) * | 2021-07-21 | 2021-08-20 | 明品云(北京)数据科技有限公司 | User label prediction method and system |
CN113283235B (en) * | 2021-07-21 | 2021-11-19 | 明品云(北京)数据科技有限公司 | User label prediction method and system |
CN113535960A (en) * | 2021-08-02 | 2021-10-22 | 中国工商银行股份有限公司 | Text classification method, device and equipment |
CN113627509B (en) * | 2021-08-04 | 2024-05-10 | 口碑(上海)信息技术有限公司 | Data classification method, device, computer equipment and computer readable storage medium |
CN113627509A (en) * | 2021-08-04 | 2021-11-09 | 口碑(上海)信息技术有限公司 | Data classification method and device, computer equipment and computer readable storage medium |
CN113609860A (en) * | 2021-08-05 | 2021-11-05 | 湖南特能博世科技有限公司 | Text segmentation method and device and computer equipment |
CN113609860B (en) * | 2021-08-05 | 2023-09-19 | 湖南特能博世科技有限公司 | Text segmentation method and device and computer equipment |
CN113836892A (en) * | 2021-09-08 | 2021-12-24 | 灵犀量子(北京)医疗科技有限公司 | Sample size data extraction method and device, electronic equipment and storage medium |
CN113836892B (en) * | 2021-09-08 | 2023-08-08 | 灵犀量子(北京)医疗科技有限公司 | Sample size data extraction method and device, electronic equipment and storage medium |
CN114706974A (en) * | 2021-09-18 | 2022-07-05 | 北京墨丘科技有限公司 | Technical problem information mining method and device and storage medium |
CN114065772A (en) * | 2021-11-19 | 2022-02-18 | 浙江百应科技有限公司 | Business opportunity identification method and device based on Albert model and electronic equipment |
WO2023093074A1 (en) * | 2021-11-24 | 2023-06-01 | 青岛海尔科技有限公司 | Voice data processing method and apparatus, and electronic device and storage medium |
CN114706961A (en) * | 2022-01-20 | 2022-07-05 | 平安国际智慧城市科技股份有限公司 | Target text recognition method, device and storage medium |
CN114492661A (en) * | 2022-02-14 | 2022-05-13 | 平安科技(深圳)有限公司 | Text data classification method and device, computer equipment and storage medium |
CN114936282A (en) * | 2022-04-28 | 2022-08-23 | 北京中科闻歌科技股份有限公司 | Financial risk cue determination method, apparatus, device and medium |
CN115861606A (en) * | 2022-05-09 | 2023-03-28 | 北京中关村科金技术有限公司 | Method and device for classifying long-tail distribution documents and storage medium |
CN115861606B (en) * | 2022-05-09 | 2023-09-08 | 北京中关村科金技术有限公司 | Classification method, device and storage medium for long-tail distributed documents |
CN115587185B (en) * | 2022-11-25 | 2023-03-14 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and storage medium |
CN115587185A (en) * | 2022-11-25 | 2023-01-10 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and storage medium |
CN116975400A (en) * | 2023-08-03 | 2023-10-31 | 星环信息科技(上海)股份有限公司 | Data hierarchical classification method and device, electronic equipment and storage medium |
CN116975400B (en) * | 2023-08-03 | 2024-05-24 | 星环信息科技(上海)股份有限公司 | Data classification and classification method and device, electronic equipment and storage medium |
CN118535739A (en) * | 2024-06-26 | 2024-08-23 | 上海建朗信息科技有限公司 | Data classification method and system based on keyword weight matching |
Also Published As
Publication number | Publication date |
---|---|
WO2022227207A1 (en) | 2022-11-03 |
CN113011533B (en) | 2023-10-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium | |
CN107229610B (en) | A kind of analysis method and device of affection data | |
JP7164701B2 (en) | Computer-readable storage medium storing methods, apparatus, and instructions for matching semantic text data with tags | |
CN106776562B (en) | Keyword extraction method and extraction system | |
US20230195773A1 (en) | Text classification method, apparatus and computer-readable storage medium | |
CN111709243B (en) | Knowledge extraction method and device based on deep learning | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN110321563B (en) | Text emotion analysis method based on hybrid supervision model | |
CN112256939B (en) | Text entity relation extraction method for chemical field | |
WO2023159758A1 (en) | Data enhancement method and apparatus, electronic device, and storage medium | |
CN107180026B (en) | Event phrase learning method and device based on word embedding semantic mapping | |
US11170169B2 (en) | System and method for language-independent contextual embedding | |
Rizvi et al. | Optical character recognition system for Nastalique Urdu-like script languages using supervised learning | |
CN114065758A (en) | Document keyword extraction method based on hypergraph random walk | |
CN112966068A (en) | Resume identification method and device based on webpage information | |
CN112860898B (en) | Short text box clustering method, system, equipment and storage medium | |
CN111858842A (en) | Judicial case screening method based on LDA topic model | |
CN114881043B (en) | Deep learning model-based legal document semantic similarity evaluation method and system | |
CN111008530A (en) | Complex semantic recognition method based on document word segmentation | |
Gunaseelan et al. | Automatic extraction of segments from resumes using machine learning | |
CN117251524A (en) | Short text classification method based on multi-strategy fusion | |
CN110705285A (en) | Government affair text subject word bank construction method, device, server and readable storage medium | |
CN117573869A (en) | Network connection resource key element extraction method | |
CN117574858A (en) | Automatic generation method of class case retrieval report based on large language model | |
CN117422074A (en) | Method, device, equipment and medium for standardizing clinical information text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||