CN110245232B - Text classification method, device, medium and computing equipment


Info

Publication number
CN110245232B
Authority
CN
China
Prior art keywords: text, classified, texts, deep learning, category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910480256.XA
Other languages
Chinese (zh)
Other versions
CN110245232A (en)
Inventor
赵振宇
丁长林
张华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Media Technology Beijing Co Ltd
Original Assignee
Netease Media Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Media Technology Beijing Co Ltd
Priority to CN201910480256.XA
Publication of CN110245232A
Application granted
Publication of CN110245232B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a text classification method. The method comprises the following steps: acquiring a text to be classified; obtaining first classification information of the text to be classified by using a first deep learning model; and, in the case that the category of the text to be classified represented by the first classification information is not the first category, determining the category of the text to be classified by using a second deep learning model. The sample data used to optimize the second deep learning model comprises first texts, where the category represented by the first classification information of each first text is not the first category. The method thus determines the text category with two deep learning models, where the sample data of the second deep learning model consists of texts recalled by the first deep learning model. This raises the concentration of effective samples in the sample data, which reduces the labeling effort and improves the accuracy of classification prediction. Embodiments of the invention also provide a text classification apparatus, a medium, and a computing device.

Description

Text classification method, device, medium and computing equipment
Technical Field
Embodiments of the invention relate to the field of text processing, and in particular to a text classification method, a text classification apparatus, a medium, and a computing device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In the Internet field, it is often necessary to pick out non-compliant texts from a large number of texts, so that users are not negatively influenced by the display of such texts. This selection can be performed with a text classification method.
In general, most text classification methods in the natural language processing field are applicable to the detection of non-compliant text (e.g., "vulgar" news). Early detection of non-compliant text relied on manual review by editors, but with the rapid development of the Internet and self-media, the volume of news has grown to the point where inefficient, costly manual review can no longer keep up, and a combined manual-plus-machine review mode has become the mainstream. In machine review, non-compliant text has mainly been detected with dictionary-based methods, which build a keyword list of non-compliant vocabulary and match it against the text content with regular expressions. In recent years, with the strong performance of machine learning and deep learning in text classification, some deep learning models have also been applied to this task.
Dictionary-based text review usually suffers from a low recall rate, because the word lists are not rich enough and cannot capture the semantic relations in the text. Machine learning and deep learning methods typically solve text classification end to end, but in a news scenario non-compliant texts make up only a small fraction of the total, so training data is hard to obtain. Moreover, because non-compliant and compliant texts are extremely unevenly distributed at inference time, it is difficult to guarantee both the online classification accuracy and the recall rate of non-compliant texts.
Disclosure of Invention
In the prior art, when existing text classification methods are used to classify texts in Internet scenarios such as news, the unbalanced distribution of the different text categories leads to low accuracy, and the recall rate is hard to guarantee. Furthermore, because a large number of labeled samples is required to train the model, training data is difficult to obtain.
An improved text classification method is therefore needed, one that can accurately classify texts with an unbalanced category distribution without requiring a large amount of training data.
In this context, embodiments of the present invention aim to increase the concentration of effective samples in the sample texts, and then train a classification model on those high-concentration samples as training data, so as to improve the classification accuracy of the model while reducing the total amount of training data.
In a first aspect of the embodiments of the present invention, a text classification method is provided, comprising: acquiring a text to be classified; obtaining first classification information of the text to be classified by using a first deep learning model; and, in the case that the category of the text to be classified represented by the first classification information is not the first category, determining the category of the text to be classified by using a second deep learning model. The sample data used to optimize the second deep learning model comprises first texts, where the category represented by the first classification information of each first text is not the first category.
In an embodiment of the present invention, determining the category of the text to be classified by using the second deep learning model includes: obtaining second classification information of the text to be classified by using the second deep learning model; and determining that the category of the text to be classified is the category represented by the second classification information.
In another embodiment of the present invention, the first deep learning model is set with a first threshold and the second deep learning model with a second threshold. The first classification information comprises first category information and a plurality of first confidences of the text to be classified with respect to a plurality of predetermined categories, the first category information being determined by the magnitude relationship between the first confidence with respect to the second category and the first threshold. The second classification information comprises second category information and a plurality of second confidences with respect to the predetermined categories, the second category information being determined by the magnitude relationship between the second confidence with respect to the second category and the second threshold. The first threshold is smaller than the second threshold, the first and second category information both characterize the category of the text to be classified, and the plurality of predetermined categories include a first category and a second category.
In another embodiment of the present invention, the first classification information and/or the second classification information comprises a plurality of confidences of the text to be classified with respect to a plurality of predetermined categories, and after the category of the text to be classified is determined, the method further includes: determining a parameter value of the text to be classified as its confidence with respect to the second category. The plurality of predetermined categories include the first category and the second category.
In another embodiment of the present invention, the text classification method further includes: acquiring a plurality of second texts and their actual categories; taking the second texts as input of the first deep learning model to obtain their first classification information; and optimizing the first deep learning model according to the first classification information and the actual categories of the second texts. The first deep learning model comprises a logistic regression model or a long short-term memory (LSTM) network model.
In a further embodiment of the present invention, the proportion of texts whose actual category is the first category is greater among the plurality of second texts than among the plurality of first texts.
In another embodiment of the present invention, the text classification method further includes: acquiring a plurality of first texts and their actual categories; taking the first texts as input of the second deep learning model to obtain their second classification information; and optimizing the second deep learning model according to the second classification information and the actual categories of the first texts. The second deep learning model comprises a support vector machine model, a random forest model, or a long short-term memory network model.
In another embodiment of the present invention, obtaining the plurality of first texts includes: acquiring a plurality of third texts; taking the third texts as input of the first deep learning model to obtain their first classification information; and determining those third texts whose category represented by the first classification information is not the first category as first texts.
In another embodiment of the present invention, before the first classification information of the text to be classified is obtained, the text classification method further includes: preprocessing the text to be classified with a preprocessing model to extract feature information of the text to be classified. The feature information of the text to be classified serves as input of the first deep learning model, which outputs the first classification information; the preprocessing model comprises a term frequency-inverse document frequency model or a word vector model.
In a second aspect of the embodiments of the present invention, a text classification apparatus is provided, including: a to-be-classified text acquisition module, configured to acquire a text to be classified; a first category determination module, configured to obtain first classification information of the text to be classified by using a first deep learning model; and a second category determination module, configured to determine the category of the text to be classified by using a second deep learning model in the case that the category represented by the first classification information is not the first category. The sample data used to optimize the second deep learning model comprises first texts, where the category represented by the first classification information of each first text is not the first category.
In an embodiment of the invention, the second category determination module includes: a second classification information acquisition submodule, configured to obtain second classification information of the text to be classified by using the second deep learning model; and a category determination submodule, configured to determine that the category of the text to be classified is the category represented by the second classification information.
In another embodiment of the present invention, the first deep learning model is set with a first threshold and the second deep learning model with a second threshold, wherein the first classification information comprises first category information and a plurality of first confidences of the text to be classified with respect to a plurality of predetermined categories, the first category information being determined by the magnitude relationship between the first confidence with respect to the second category and the first threshold; the second classification information comprises second category information and a plurality of second confidences with respect to the predetermined categories, the second category information being determined by the magnitude relationship between the second confidence with respect to the second category and the second threshold. The first threshold is smaller than the second threshold, the first and second category information both characterize the category of the text to be classified, and the plurality of predetermined categories include the first category and the second category.
In a further embodiment of the invention, the first classification information and/or the second classification information comprises a plurality of confidences of the text to be classified with respect to a plurality of predetermined categories. The text classification apparatus further comprises a parameter value determination module, configured to determine a parameter value of the text to be classified as its confidence with respect to the second category. The plurality of predetermined categories include the first category and the second category.
In another embodiment of the present invention, the text classification apparatus further includes a first model optimization module, configured to: acquire a plurality of second texts and their actual categories; take the second texts as input of the first deep learning model to obtain their first classification information; and optimize the first deep learning model according to the first classification information and the actual categories of the second texts. The first deep learning model comprises a logistic regression model or a long short-term memory network model.
In yet another embodiment of the present invention, the proportion of texts whose actual category is the first category is greater among the plurality of second texts than among the plurality of first texts.
In a further embodiment of the present invention, the text classification apparatus further includes a second model optimization module, configured to: acquire a plurality of first texts and their actual categories; take the first texts as input of the second deep learning model to obtain their second classification information; and optimize the second deep learning model according to the second classification information and the actual categories of the first texts. The second deep learning model comprises a support vector machine model, a random forest model, or a long short-term memory network model.
In yet another embodiment of the present invention, obtaining the plurality of first texts comprises: acquiring a plurality of third texts; taking the third texts as input of the first deep learning model to obtain their first classification information; and determining those third texts whose category represented by the first classification information is not the first category as first texts.
In a further embodiment of the present invention, the text classification apparatus further includes a preprocessing module, configured to preprocess the text to be classified with a preprocessing model before the first category determination module obtains the first classification information, so as to extract feature information of the text to be classified. The first category determination module is specifically configured to take the feature information of the text to be classified as input of the first deep learning model and output the first classification information; the preprocessing model comprises a term frequency-inverse document frequency model or a word vector model.
In a third aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform a method of text classification as provided according to the first aspect of embodiments of the present invention.
In a fourth aspect of embodiments of the present invention, a computing device is provided. The computing device includes one or more memory units storing executable instructions, and one or more processing units. The processing unit executes the executable instructions to implement the text classification method provided according to the first aspect of the embodiments of the present invention.
According to the text classification method, apparatus, medium, and computing device of the embodiments of the present invention, two deep learning models are used to determine the text category. The sample data of the second deep learning model consists of texts recalled by the first deep learning model, so the concentration of effective samples in that sample data is high. This improves the precision of the second deep learning model and hence the accuracy of the text classification result.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates an application scenario of a text classification method, apparatus, medium, and computer device according to embodiments of the present invention;
FIG. 2A schematically illustrates a flow diagram of a text classification method according to an embodiment of the invention;
FIG. 2B schematically illustrates a flow diagram for determining a category of text to be classified using a second deep learning model, according to an embodiment of the invention;
FIG. 3 schematically illustrates a flow diagram of a text classification method according to another embodiment of the invention;
FIG. 4 schematically illustrates a flow chart for optimizing a first deep learning model in a text classification method according to an embodiment of the present invention;
FIG. 5A schematically illustrates a flow chart for optimizing a second deep learning model in a text classification method according to an embodiment of the present invention;
FIG. 5B schematically shows a flowchart for obtaining a first text according to an embodiment of the invention;
FIG. 6 schematically illustrates a technical flow diagram of a text classification method according to an embodiment of the invention;
FIG. 7 schematically shows a block diagram of a text classification apparatus according to an embodiment of the present invention;
FIG. 8 schematically shows a schematic view of a program product adapted to perform a text classification method according to an embodiment of the present invention; and
FIG. 9 schematically shows a block diagram of a computing device adapted to perform a text classification method according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Thus, the present invention may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a text classification method, a text classification device, a text classification medium and a computing device are provided.
In this context, it is to be understood that the terms referred to are to be interpreted as follows:
Machine learning is a multi-disciplinary field that studies how a computer can simulate or realize human learning behavior, acquire new knowledge and skills, and reorganize existing knowledge structures, drawing on probability theory, statistics, approximation theory, convex analysis, and related theories.
Deep learning, a branch of machine learning, interprets and learns from data by constructing neural networks that loosely simulate the human brain.
Natural language processing is an important branch of machine learning that studies theories and methods for effective communication between humans and computers in natural language; it is a discipline integrating linguistics, computer science, and mathematics.
Text classification, a branch of natural language processing, studies how a computer can automatically classify a set of texts according to a given classification system or standard.
LR, logistic regression, is a generalized linear model commonly used for classification, applied in data mining, economic prediction, and other fields.
SVM, Support Vector Machine, is a linear classifier that performs binary classification of data in a supervised manner; its decision boundary is the maximum-margin hyperplane solved for the training samples.
LSTM, Long Short-Term Memory, is a recurrent neural network suitable for processing and predicting events with relatively long intervals and delays in a time series.
tf-idf, term frequency-inverse document frequency, is a statistical method used to evaluate the importance of a word to a document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus.
word2vec is a word vector model that maps each word to a dense vector with good semantic expressiveness.
Accuracy is an index for evaluating the performance of a machine learning algorithm; for a classification problem, the accuracy of a classifier is the number of correctly classified test samples divided by the total number of test samples.
Recall is likewise an index for evaluating the performance of a machine learning algorithm; for a classification problem, the number of test samples of a given class that the classifier identifies correctly, divided by the actual number of samples of that class, is the classifier's recall for that class.
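In standard notation (textbook definitions supplied here for clarity, not quoted from the patent), the last three terms can be written as:

```latex
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d)\cdot \log\frac{|D|}{|\{d' \in D : t \in d'\}|}
\qquad
\mathrm{accuracy} = \frac{\#\,\text{correctly classified test samples}}{\#\,\text{test samples}}
\qquad
\mathrm{recall}_c = \frac{\#\,\text{samples of class } c \text{ classified as } c}{\#\,\text{actual samples of class } c}
```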
The principles and spirit of the present invention will be explained in detail below with reference to a number of representative embodiments of the invention.
Summary of the Invention
In the prior art, for texts with an unbalanced category distribution, classifying texts with a machine learning model is unattractive, because training such a model typically requires a very large number of samples just to guarantee a certain amount of effective samples (for example, "vulgar" texts among news texts). The many remaining ineffective samples prevent the model from learning the relevant features accurately, so the recall rate of such a classifier is low. The inventors found that if a deep learning model is first used to screen the unevenly distributed samples to obtain a sample set with a higher concentration of effective samples, and that sample set is then used to train the deep learning model that performs the text classification, the precision of the latter model can be improved to a certain extent. Correspondingly, at classification time the text to be classified is passed through both deep learning models, so that the feature distribution it sees is closer to the feature distribution of the training samples, which further improves classification accuracy.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
Reference is first made to fig. 1.
Fig. 1 schematically illustrates an application scenario of a text classification method, apparatus, medium, and computer device according to embodiments of the present invention. It should be noted that fig. 1 is only an example of an application scenario in which the embodiment of the present invention may be applied to help those skilled in the art understand the technical content of the present invention, and does not mean that the embodiment of the present invention may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the application scenario 100 includes terminal devices 111, 112, 113, a server 120, and a network 130. Network 130 is the medium used to provide communication links between end devices 111, 112, 113 and server 120, and may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The terminal devices 111, 112, 113 have, for example, a processing function to classify the text to be classified, and obtain the category of the text to be classified and the confidence degree of the text belonging to the preset category. According to an embodiment of the present invention, the terminal devices 111, 112, 113 include, but are not limited to, desktop computers, laptop portable computers, tablet computers, smart phones, smart wearable devices, or smart appliances, and the like.
According to the embodiment of the present invention, the terminal devices 111, 112, 113 integrate a pre-trained deep learning model, through which the text to be classified is classified. The deep learning model may be trained by the terminal devices 111, 112, 113 using a large number of training samples stored in the server 120, or may be trained by the server 120 itself.
The text 121 to be classified may be, for example, a news text, and classifying it may consist of dividing it into "vulgar" text or "non-vulgar" text. The text 121 to be classified may be stored in the server 120 or locally on the terminal devices 111, 112, 113. To ensure that a sufficient number of training samples can be obtained, the training samples may specifically be stored in the server 120.
The terminal devices 111, 112, 113 may have a display screen, for example, for showing the user the classification result of the text to be classified together with its "vulgarity score" (e.g., the confidence that the text belongs to the "vulgar" category), so that the user can conveniently handle news texts classified as "vulgar".
The server 120 may be a server providing various services, such as providing text to be classified or training samples to the terminal devices 111, 112, 113, or providing a pre-trained deep learning model to the terminal devices (for example only). Alternatively, the server 120 may also have a processing function, for example, to classify the stored text 121 to be classified by using a trained deep learning model.
It should be noted that the text classification method provided by the embodiment of the present invention may be generally executed by the terminal devices 111, 112, 113 or the server 120. Accordingly, the text classification apparatus provided by the embodiment of the present invention may be generally disposed in the terminal devices 111, 112, 113 or the server 120. The text classification method provided by the embodiment of the present invention may also be executed by a server or a server cluster that is different from the server 120 and is capable of communicating with the terminal devices 111, 112, 113 and/or the server 120. Accordingly, the text classification apparatus provided in the embodiment of the present invention may also be disposed in a server or a server cluster different from the server 120 and capable of communicating with the terminal devices 111, 112, and 113 and/or the server 120.
It should be understood that the number and types of terminal devices, networks, servers, text in fig. 1 are merely illustrative. There may be any number and type of terminal devices, networks, servers, and text, as desired for an implementation.
Exemplary method
The text classification method according to an exemplary embodiment of the present invention is described below with reference to fig. 2A to 6 in conjunction with the application scenario of fig. 1. It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
Fig. 2A schematically shows a flowchart of a text classification method according to an embodiment of the present invention, and fig. 2B schematically shows a flowchart of determining a category of a text to be classified using a second deep learning model according to an embodiment of the present invention.
As shown in fig. 2A, the text classification method according to the embodiment of the present invention includes operations S201 to S203. The text classification method may be performed by, for example, the terminal devices 111, 112, 113 or the server 120 in fig. 1.
In operation S201, a text to be classified is acquired.
The text to be classified may be the text 121 stored in the server 120 in fig. 1. It may be, for example, a news text, so that the text classification method according to the embodiment of the present invention can determine whether the news text is "vulgar" text.
In operation S202, according to the text to be classified, first classification information of the text to be classified is obtained by using a first deep learning model.
According to an embodiment of the present invention, operation S202 may specifically take the text to be classified directly as input of the first deep learning model and output the first classification information. The first deep learning model may be, for example, a logistic regression model, a long short-term memory network model, or any deep learning model that can solve a classification problem. In particular, for text scenarios that contain only short sentences and make few semantic demands, the first deep learning model may be a logistic regression model. For scenarios with long sentences and high demands on semantic understanding, a long short-term memory network model can be adopted to improve the accuracy of the first classification information. The first deep learning model may be optimized in advance, for example by the optimization method described with reference to fig. 4, which is not repeated here.
According to an embodiment of the present invention, the first classification information may characterize the category to which the text to be classified belongs as determined by the first deep learning model. Specifically, the first classification information includes first category information characterizing the category of the text to be classified. For news text, for example, the first classification information may characterize the text as "vulgar" or "non-vulgar", and accordingly the first category information may be "vulgar" or "non-vulgar".
According to the embodiment of the present invention, the first classification information may further include a plurality of first confidences of the text to be classified with respect to a plurality of predetermined categories, characterizing the probability that the text belongs to each predetermined category. Further, the first deep learning model may be set with a first threshold as the basis for deriving the first category information from these confidences. Specifically, in the binary classification problem, the first category information may be determined by the magnitude relationship between the first confidence with respect to the second category and the first threshold: when that confidence is greater than the first threshold, the first category information is the second category; otherwise it is the first category. For example, if the predetermined categories are "vulgar" and "non-vulgar", then when the confidence of the text with respect to "vulgar" is greater than the first threshold, the first category information is "vulgar"; otherwise it is "non-vulgar". Here the first category is "non-vulgar" and the second category is "vulgar". The value of the first threshold may be set according to actual requirements, which the embodiments of the present invention do not limit; it may, for example, be any value greater than 0.5 and not greater than 0.7, or any other value.
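As a minimal sketch of this decision rule (the threshold value, category names, and function signature are illustrative assumptions, not taken from the patent):

```python
FIRST_THRESHOLD = 0.6  # deliberately low, favoring recall of the sparse class


def first_stage_category(confidences: dict) -> str:
    """Derive first category information from the first model's confidences.

    `confidences` maps each predetermined category to a confidence,
    e.g. {"vulgar": 0.72, "non-vulgar": 0.28}.
    """
    if confidences["vulgar"] > FIRST_THRESHOLD:
        return "vulgar"      # second category: goes on to the second model
    return "non-vulgar"      # first category: final result, skips model 2
```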
According to the embodiment of the invention, when the texts to be classified come from a class-imbalanced distribution, the method is mainly used to determine whether a text belongs to the sparsely distributed category among the predetermined categories. To increase the recall of the first deep learning model and avoid texts of the sparsely distributed category (the second category) being wrongly assigned to the densely distributed category (the first category), the first threshold should be as small as possible. For news text, for example, a small first threshold avoids texts that are in fact "vulgar" being classified as "non-vulgar" merely because the first deep learning model has learned insufficient features of "vulgar" text.
In operation S203, in a case that the category of the text to be classified represented by the first classification information is not the first category, the category of the text to be classified is determined using the second deep learning model.
According to the embodiment of the present invention, because the first threshold is set small, texts that actually belong to the first category are often mistaken for the second category. For example, texts that are in fact "non-vulgar" are often misclassified as "vulgar". To further improve the accuracy of deciding whether a text belongs to the sparsely distributed category, the texts whose first classification information indicates a category other than the first category (i.e., the second category) must be classified again with high accuracy.
To improve the accuracy with which the second deep learning model distinguishes the sparsely distributed category, the sample data used to optimize it may consist of first texts, i.e., texts whose category represented by the first classification information is not the first category (in other words, is the second category). Because this first classification information comes from the first deep learning model, the proportion of sparse-category texts among the first texts is higher than among texts obtained directly from the server 120. This makes it easier for the second deep learning model to learn comprehensive features of the sparse category.
According to an embodiment of the present invention, operation S203 may specifically take the text to be classified directly as input of the second deep learning model and determine the category of the text according to the model's output. The second deep learning model may be, for example, a support vector machine model, a random forest model, a long short-term memory network model, or any deep learning model that can solve a classification problem. The second deep learning model may be a different type of model from the first, or the same type with different parameters. The second deep learning model may be optimized in advance by the optimization method described with reference to fig. 5A, which is not repeated here.
According to an embodiment of the present invention, as shown in fig. 2B, the determining the category of the text to be classified by using the second deep learning model may include, for example, operations S213 to S223. In operation S213, second classification information of the text to be classified is obtained using the second deep learning model. In operation S223, it is determined that the category of the text to be classified is the category represented by the second classification information.
The second classification information may characterize the category to which the text to be classified belongs as determined by the second deep learning model. Specifically, the second classification information includes second category information characterizing the category of the text. For news text, for example, the second classification information may characterize the text as "vulgar" or "non-vulgar", and accordingly the second category information may be "vulgar" or "non-vulgar".
According to the embodiment of the present invention, the second classification information may further include a plurality of second confidences of the text to be classified with respect to the predetermined categories, characterizing the probability that the text belongs to each of them. Further, the second deep learning model may be set with a second threshold as the basis for deriving the second category information from these confidences. Similarly, the second category information may be determined by the magnitude relationship between the second confidence with respect to the second category and the second threshold: when that confidence is greater than the second threshold, the second category information is the second category; otherwise it is the first category. For example, when the confidence of the text with respect to "vulgar" is greater than the second threshold, the second category information is "vulgar" and the text is determined to be "vulgar" text; otherwise the second category information is "non-vulgar" and the text is determined to be "non-vulgar". To improve the precision with which texts are classified as "vulgar", the second threshold should be greater than the first threshold. The second threshold may be set according to actual requirements, which the embodiments of the present invention do not limit; it may, for example, be a value greater than 0.9, or any other value.
Because the first threshold of the first deep learning model is smaller than the second threshold of the second deep learning model, texts that actually belong to the second category are rarely characterized by the first classification information as the first category. Therefore, when the category represented by the first classification information is the first category, the text to be classified can be determined to belong to the first category directly.
In summary, the text classification method according to the embodiment of the present invention determines the category of a text with two deep learning models, which improves the accuracy and recall of text classification to a certain extent. Moreover, the sample data of the second deep learning model consists of texts recalled by the first deep learning model, so the concentration of effective samples in that sample data is high. This improves the precision of the second deep learning model and therefore the accuracy of the classification results it determines.
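Putting the two stages together, a hedged sketch of the cascade (the model interface, names, and threshold values are assumptions for illustration):

```python
FIRST_THRESHOLD = 0.6    # low threshold: high recall
SECOND_THRESHOLD = 0.95  # high threshold: high precision


def classify(text, model1, model2):
    """Two-stage cascade; both models are assumed to expose a
    predict_proba-style method returning P(text is "vulgar")."""
    p1 = model1.predict_proba(text)       # first confidence w.r.t. "vulgar"
    if p1 <= FIRST_THRESHOLD:
        return "non-vulgar", p1           # first category: final, model 2 skipped
    p2 = model2.predict_proba(text)       # second confidence w.r.t. "vulgar"
    category = "vulgar" if p2 > SECOND_THRESHOLD else "non-vulgar"
    return category, p2                   # the score comes from the deciding model
```

The returned score is the "vulgarity score" discussed below: it comes from the high-recall model when that model already settles the case, and from the high-accuracy model otherwise.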
Fig. 3 schematically shows a flow chart of a text classification method according to another embodiment of the invention.
According to the embodiment of the invention, after the category of the text to be classified is determined, the probability that the text belongs to the second category (the sparsely distributed category) is a useful reference for the user's subsequent handling of the text. Therefore, as shown in fig. 3, the text classification method according to the embodiment of the present invention may further include operation S304 after the category of the text is determined through operations S201 to S203.
In operation S304, a parameter value of the text to be classified is determined as a confidence of the text to be classified with respect to the second category.
For the binary classification problem, when the first classification information obtained in operation S202 characterizes the text to be classified as the first category, the category of the text is determined to be the first category, and the parameter value may be taken as the confidence with respect to the second category contained in the first classification information. When the first classification information characterizes the text as not being the first category, and the category is subsequently determined through operation S203 to be the first or the second category, the parameter value may be taken as the confidence with respect to the second category contained in the second classification information.
Where the predetermined categories are "vulgar" and "non-vulgar", the parameter value may specifically be a "vulgarity score" characterizing how vulgar the text to be classified is. The user can then handle the text according to its degree of vulgarity.
According to the embodiment of the invention, so that the first deep learning model can produce more accurate first classification information from its input, the text to be classified may be preprocessed in advance to extract its features. As shown in fig. 3, the text classification method according to the embodiment of the present invention further includes operation S305 before operation S202.
In operation S305, the text to be classified is preprocessed with a preprocessing model, and feature information of the text is extracted. The input of the first deep learning model in operation S202 is then this extracted feature information, and the output is the first classification information of the text.
The preprocessing specifically consists of first recognizing the textual content of the text to be classified and extracting keywords from the recognition result, then converting the keywords into vectors that characterize them, which form the input of the first deep learning model. The preprocessing model used to turn keywords into vectors may be a term frequency-inverse document frequency model, a word vector model, or any prior-art model that can extract text features, which the present invention does not limit.
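A minimal preprocessing sketch, assuming scikit-learn's tf-idf vectorizer and a toy English corpus (library choice and data are assumptions; real news text would first be tokenized):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for texts fetched from the server (illustrative).
corpus = [
    "celebrity scandal shocks fans at late night party",
    "city council approves new transit budget for next year",
]

vectorizer = TfidfVectorizer(max_features=5000)  # keyword -> tf-idf weight
features = vectorizer.fit_transform(corpus)      # matrix fed to the first model
print(features.shape)                            # (2, vocabulary size)
```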
Fig. 4 schematically shows a flowchart for optimizing the first deep learning model in the text classification method according to the embodiment of the present invention.
As shown in fig. 4, the method for optimizing the first deep learning model according to the embodiment of the present invention may include operations S406 to S408.
In operation S406, a plurality of second texts and actual categories of the plurality of second texts are obtained.
The second texts may be, for example, texts randomly drawn from a text library stored in the server 120, specifically news texts. The actual categories of the second texts are pre-assigned categories. Specifically, the actual category of each second text may be obtained from a label assigned to it by the user, the label indicating that text's actual category. Operation S406 may therefore further include, after the second texts are obtained, obtaining the label the user assigned to each of them. Further, so that the label can accompany the text into the first deep learning model, operation S406 may, after the labels are obtained, also include splicing each second text with its label to form the second sample data corresponding to that text.
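A minimal sketch of forming this sample data (the pairing is one plausible reading of the splicing step; names are illustrative):

```python
def build_samples(second_texts, labels):
    """Splice each second text with its user-assigned label.

    Each (text, actual_category) pair is one training example for model 1.
    """
    return list(zip(second_texts, labels))
```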
Note that after the first deep learning model, a second deep learning model is further used to determine the category of the text to be classified. The embodiment of the invention therefore places only modest demands on the precision of the first deep learning model: when optimizing it, not many second texts are required, which keeps the annotation volume, and hence the annotation cost, low.
In operation S407, first classification information of a plurality of second texts is acquired with the plurality of second texts as input of the first deep learning model.
According to an embodiment of the present invention, operation S407 may specifically input the second texts into the first deep learning model one by one to obtain their first classification information one by one. The implementation of operation S407 is similar to that of operation S202, except that here the first deep learning model is a not-yet-optimized logistic regression model or long short-term memory network model. Specifically, the input of the first deep learning model may be the second sample data corresponding to each second text obtained by the splicing described above.
In operation S408, the first deep learning model is optimized according to the first classification information of the plurality of second texts and the actual classification of the plurality of second texts.
According to the embodiment of the present invention, operation S408 may specifically compute, for each second text, a first loss value with a first loss function, based on the category represented by that text's first classification information and its actual category. Parameters of the first deep learning model are then adjusted according to the first loss values so as to optimize the model.
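To make the loss-driven parameter adjustment concrete, here is a self-contained sketch of optimizing a logistic-regression first model with cross-entropy loss by gradient descent (the data and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))     # stand-in features of the second texts
y = rng.integers(0, 2, size=200)   # stand-in actual categories (1 = "vulgar")

w, b, lr = np.zeros(50), 0.0, 0.1
for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # first confidences
    # First loss value: cross-entropy between prediction and actual category.
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    w -= lr * (X.T @ (p - y)) / len(y)       # adjust parameters per the loss
    b -= lr * np.mean(p - y)
```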
Fig. 5A schematically shows a flowchart for optimizing the second deep learning model in the text classification method according to the embodiment of the present invention, and fig. 5B schematically shows a flowchart for acquiring the first text according to the embodiment of the present invention.
As shown in fig. 5A, the method for optimizing the second deep learning model according to the embodiment of the present invention may include operations S509 to S511.
In operation S509, a plurality of first texts and actual categories of the plurality of first texts are acquired.
The first texts are the texts whose first classification information characterizes a category other than the first category; their first classification information is obtained through the first deep learning model. As shown in fig. 5B, the operation of acquiring the plurality of first texts may include operations S519 to S539.
In operation S519, a plurality of third texts are obtained; in operation S529, acquiring first classification information of a plurality of third texts, with the plurality of third texts as an input of the first deep learning model; in operation S539, it is determined that the third text, of which the category characterized by the first classification information is not the first category, is the first text.
The third texts in operation S519 are texts randomly obtained from the text library stored in the server 120 according to the embodiment of the present invention. The plurality of third texts may include the second text described in fig. 4, or may not include the second text described in fig. 4. The first classification information acquired in operation S529 may be acquired in operation S202 described in fig. 2A, or may be acquired in operation S407 described in fig. 4. Operation S539 determines whether each third text is the first text according to the first classification information of each third text.
According to the embodiment of the present invention, note that the first texts are only those third texts whose first classification information, as determined in operation S539, characterizes a category other than the first category, whereas the second texts acquired in operation S406 of fig. 4 are drawn at random directly from the text library. The proportion of texts whose actual category is the first category is therefore greater among the second texts than among the first texts. For news texts stored in the text library, the proportion of the second texts that are actually "vulgar" is smaller than the proportion of the first texts that are actually "vulgar".
According to an embodiment of the present invention, the actual category of each first text may be obtained from a label assigned to it by the user, the label indicating that text's actual category. Operation S509 may further include, after the first texts are acquired, obtaining the label the user assigned to each of them. Further, so that the label can accompany the text into the second deep learning model, operation S509 may, after the labels are obtained, also include splicing each first text with its label to form the first sample data corresponding to that text.
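A sketch of this recall-based sample collection (helper names, the threshold, and the model interface are assumptions for illustration):

```python
def collect_first_texts(third_texts, model1, threshold=0.6):
    """Keep only the third texts that the high-recall model flags as
    not-first-category (i.e. recalled as "vulgar").

    A human annotator then labels each recalled text with its actual
    category; those (text, label) pairs become the second model's samples.
    """
    return [t for t in third_texts if model1.predict_proba(t) > threshold]
```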
In operation S510, the plurality of first texts are used as input of the second deep learning model, and second classification information of the plurality of first texts is obtained.
According to an embodiment of the present invention, operation S510 may specifically input the first texts into the second deep learning model one by one to obtain their second classification information one by one. The implementation of operation S510 is similar to that of operation S203, except that here the second deep learning model is a not-yet-optimized support vector machine model, random forest model, long short-term memory network model, or the like. Specifically, the input of the second deep learning model may be the first sample data corresponding to each first text obtained by the splicing described above.
In operation S511, the second deep learning model is optimized according to the second classification information and the actual categories of the plurality of first texts.
According to the embodiment of the present invention, operation S511 may specifically compute, for each first text, a second loss value with a second loss function, based on the category represented by that text's second classification information and its actual category. Parameters of the second deep learning model are then adjusted according to the second loss values so as to optimize it. The second loss function and the first loss function may each be, for example, a cross-entropy loss function, a sigmoid loss function, or any other loss function, and they may be the same or different, which the present invention does not limit.
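A minimal training sketch for the second model, assuming scikit-learn classifiers and pre-extracted features (the library choice and names are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


def train_second_model(X_first, y_first, kind="svm"):
    """Fit the high-accuracy model on the recalled, labeled first texts.

    X_first: feature matrix of the first texts; y_first: actual categories.
    Each estimator minimizes its own loss internally during fit().
    """
    model = SVC(probability=True) if kind == "svm" else RandomForestClassifier()
    model.fit(X_first, y_first)
    return model
```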
In summary, the sample data for optimizing the second deep learning model is obtained by labeling the first texts, which are screened out by the first deep learning model and whose characterized category is not the first category. Compared with the texts in the text library (i.e., the input texts of the first deep learning model), the concentration of texts belonging to the first category ("non-trivia") among the first texts is effectively reduced, while the concentration of texts belonging to the second category ("trivia") is increased. For a given required amount of second-category texts in the sample data, the number of first texts that must be labeled is therefore far smaller than the number of library texts required in the prior art; as a purely illustrative example, if 1% of the library texts but 20% of the first texts are actually "trivia", roughly twenty times fewer texts need labeling to obtain the same number of "trivia" samples. The difficulty of obtaining sample data is thereby reduced to a certain degree, the amount of data labeling is reduced, and labeling efficiency is improved.
FIG. 6 schematically illustrates a technical flow diagram of a text classification method according to an embodiment of the invention.
As shown in fig. 6, the text classification method according to the embodiment of the present invention may include a training process and a detection process. Two deep learning models are involved: one takes full-scale data (i.e., text obtained directly from the server 120) as its training sample data or detection input and serves as the high-recall model; the other takes the texts recalled by the high-recall model (i.e., the texts the high-recall model judges to be "trivia") as its training sample data or detection input, serves as the high-accuracy model, and determines the classification result of the recalled texts.
In the training stage, the high-recall model is first trained with a small training sample (in which the proportion of "trivia" texts follows the natural distribution), and is then used to recall the full data so as to obtain the training samples of the high-accuracy model (in which the "trivia" proportion is much higher). To increase the recall amount of the high-recall model, a smaller threshold (the first threshold described above) may be set for it. To ensure that the high-accuracy model has a sufficient number of training samples, after the high-recall model is trained, the full data can continue to be fed into it to obtain further training samples for the high-accuracy model. Once a sufficient number of such training samples have been obtained, they are used to train the high-accuracy model, whose threshold is set larger than that of the high-recall model.
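The training stage can be sketched in Python with scikit-learn stand-ins. These are assumptions: the patent names logistic regression for the first model and a support vector machine among the options for the second, but fixes no library; texts are assumed to be already vectorized into numpy feature arrays with class 1 encoding "trivia", and label_fn stands in for human annotation.

    import numpy as np  # full_X is assumed to be a numpy array for boolean indexing
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    FIRST_THRESHOLD = 0.3   # deliberately small first threshold, hence high recall

    def train_two_stage(small_X, small_y, full_X, label_fn):
        # 1. Train the high-recall model on a small, naturally distributed labeled sample.
        recall_model = LogisticRegression().fit(small_X, small_y)

        # 2. Recall from the full data: keep texts whose "trivia" confidence reaches
        #    the first threshold; these recalled texts are the first texts.
        trivia_conf = recall_model.predict_proba(full_X)[:, 1]
        first_X = full_X[trivia_conf >= FIRST_THRESHOLD]

        # 3. Label only the recalled texts; their "trivia" concentration is much
        #    higher than the full data's, so far fewer labels are needed.
        first_y = label_fn(first_X)

        # 4. Train the high-accuracy model on this concentrated sample.
        accurate_model = SVC(probability=True).fit(first_X, first_y)
        return recall_model, accurate_model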
In the detection stage, each text to be detected (for example, a news item) first passes through the high-recall model, which judges whether it belongs to the first category ("non-trivia"). If the high-recall model judges it to be "non-trivia", the text skips the high-accuracy model entirely; the overall result is "non-trivia", and the "trivia value" is the confidence obtained by the high-recall model for the "trivia" category. If the high-recall model judges the text to be "trivia", the text enters the high-accuracy model for a second judgment: if the output of the high-accuracy model is "trivia", the overall result is "trivia"; if the output is "non-trivia", the overall result is "non-trivia". In either of these two cases, the "trivia value" is the confidence obtained by the high-accuracy model for the "trivia" category.
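A corresponding sketch of the detection cascade follows, under the same assumptions as above: both models are assumed to expose a predict_proba-style confidence, with class 1 encoding "trivia".

    def classify(x, recall_model, accurate_model,
                 first_threshold=0.3, second_threshold=0.7):
        # Stage 1: high-recall model.
        conf1 = recall_model.predict_proba([x])[0, 1]
        if conf1 < first_threshold:
            # Judged "non-trivia": the high-accuracy model is skipped entirely.
            return "non-trivia", conf1          # trivia value from the recall model

        # Stage 2: the high-accuracy model re-judges the recalled text.
        conf2 = accurate_model.predict_proba([x])[0, 1]
        category = "trivia" if conf2 >= second_threshold else "non-trivia"
        return category, conf2                  # trivia value from the accurate model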
In summary, in the text classification method according to the embodiment of the present invention, the training data of the high-accuracy model is determined by a recall model trained with a small data volume, and the "trivia" concentration of that training data is much higher than that of a naturally distributed set, so labeling is more efficient. In addition, during detection all texts to be detected first pass through the high-recall model, so the feature distribution of the texts reaching the high-accuracy model is closer to the distribution of its training samples, which can improve the overall accuracy.
Exemplary devices
Having described the method of the exemplary embodiment of the present invention, the text classification apparatus of the exemplary embodiment of the present invention will next be explained with reference to fig. 7.
Fig. 7 schematically shows a block diagram of a text classification apparatus according to an embodiment of the present invention.
As shown in fig. 7, according to an embodiment of the present invention, the text classification apparatus 700 may include a to-be-classified text acquiring module 710, a first category determining module 720, and a second category determining module 730. The text classification apparatus 700 may be used to implement the text classification method according to an embodiment of the present invention.
The to-be-classified text acquiring module 710 is configured to acquire a text to be classified (operation S201).
The first class determining module 720 is configured to obtain first classification information of the text to be classified by using a first deep learning model according to the text to be classified (operation S202).
The second class determining module 730 is configured to determine the class of the text to be classified by using the second deep learning model if the class of the text to be classified represented by the first classification information is not the first class (operation S203). The sample data of the second deep learning model obtained through optimization comprises a first text, and the category represented by the first classification information of the first text is not the first category.
According to an embodiment of the present invention, as shown in fig. 7, the second category determining module 730 may include a second classification information obtaining sub-module 731 and a category determining sub-module 732. The second classification information obtaining sub-module 731 is configured to obtain second classification information of the text to be classified by using the second deep learning model (operation S213). The category determining sub-module 732 is configured to determine the category of the text to be classified as the category represented by the second classification information (operation S223).
According to an embodiment of the present invention, the first deep learning model is set with a first threshold and the second deep learning model is set with a second threshold. The first classification information includes first category information and a plurality of first confidences of the text to be classified relative to a plurality of predetermined categories, the first category information being determined by the magnitude relation between the first confidence of the text to be classified relative to the second category and the first threshold. The second classification information includes second category information and a plurality of second confidences of the text to be classified relative to the plurality of predetermined categories, the second category information being determined by the magnitude relation between the second confidence of the text to be classified relative to the second category and the second threshold. The first threshold is smaller than the second threshold, the first category information and the second category information each characterize the category of the text to be classified, and the plurality of predetermined categories include the first category and the second category.
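The relation between confidences and thresholds described above can be made concrete with a short sketch (the names are illustrative assumptions; the modules themselves are not limited to this form):

    def category_info(confidences, second_category_index, threshold):
        # confidences: one confidence per predetermined category. The category
        # information depends only on how the confidence for the second category
        # ("trivia") compares with the model's threshold.
        if confidences[second_category_index] >= threshold:
            return "trivia"       # second category
        return "non-trivia"       # first category

    # Because the first threshold is smaller than the second, the first model marks
    # more texts as "trivia" (high recall) than the second model does (high accuracy).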
According to an embodiment of the present invention, the first classification information and/or the second classification information includes a plurality of confidences of the text to be classified relative to a plurality of predetermined categories. As shown in fig. 7, the text classification apparatus 700 may further include a parameter value determining module 740 configured to determine the parameter value of the text to be classified as the confidence of the text to be classified relative to the second category (operation S304), wherein the plurality of predetermined categories includes the first category and the second category.
According to an embodiment of the present invention, as shown in fig. 7, the text classification apparatus 700 may further include a first model optimization module 750 configured to perform the following operations: acquiring a plurality of second texts and the actual categories of the plurality of second texts (operation S406); acquiring first classification information of the plurality of second texts by using the plurality of second texts as input of the first deep learning model (operation S407); and optimizing the first deep learning model according to the first classification information of the plurality of second texts and the actual categories of the plurality of second texts (operation S408). The first deep learning model includes a logistic regression model or a long short-term memory network model.
According to an embodiment of the present invention, the proportion of texts whose actual category is the first category is greater among the plurality of second texts than among the plurality of first texts.
According to an embodiment of the present invention, as shown in fig. 7, the text classification apparatus 700 may further include a second model optimization module 760 configured to perform the following operations: acquiring a plurality of first texts and the actual categories of the plurality of first texts (operation S509); obtaining second classification information of the plurality of first texts by using the plurality of first texts as input of the second deep learning model (operation S510); and optimizing the second deep learning model according to the second classification information of the plurality of first texts and the actual categories of the plurality of first texts (operation S511). The second deep learning model includes a support vector machine model, a random forest model, or a long short-term memory network model.
According to an embodiment of the present invention, obtaining the plurality of first texts includes: acquiring a plurality of third texts (operation S519); acquiring first classification information of the plurality of third texts by using the plurality of third texts as input of the first deep learning model (operation S529); and determining, as the first texts, those third texts whose category characterized by the first classification information is not the first category (operation S539).
According to an embodiment of the present invention, as shown in fig. 7, the text classification apparatus 700 may further include a preprocessing module 770 configured to, before the first category determining module 720 obtains the first classification information of the text to be classified, preprocess the text to be classified by using a preprocessing model and extract the feature information of the text to be classified (operation S305). The first category determining module 720 is then specifically configured to use the feature information of the text to be classified as input of the first deep learning model and to output the first classification information of the text to be classified. The preprocessing model includes a word frequency-inverse document frequency model or a word vector model.
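As one concrete possibility for the preprocessing model, a word frequency-inverse document frequency sketch is shown below. Scikit-learn is an assumed stand-in (the patent names no library), and the example corpus is hypothetical.

    from sklearn.feature_extraction.text import TfidfVectorizer

    text_library = ["example news text one", "example news text two"]  # hypothetical library
    tfidf = TfidfVectorizer().fit(text_library)

    def preprocess(text_to_classify):
        # Returns the feature information that module 720 feeds to the first model.
        return tfidf.transform([text_to_classify])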
Exemplary Medium
Having described the method of an exemplary embodiment of the present invention, a computer-readable storage medium suitable for use with the text classification method of an exemplary embodiment of the present invention is described next with reference to fig. 8.
There is also provided, in accordance with an embodiment of the present invention, a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform a text classification method in accordance with an embodiment of the present invention.
In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code. When the program product runs on a computing device, the program code causes the computing device to perform the steps of the text classification methods according to the various exemplary embodiments of the present invention described in the "Exemplary method" section above. For example, the computing device may perform step S201 shown in fig. 2A: acquiring a text to be classified; step S202: acquiring, according to the text to be classified, first classification information of the text to be classified by using a first deep learning model; and step S203: determining the category of the text to be classified by using a second deep learning model in a case where the category characterized by the first classification information is not the first category.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As shown in fig. 8, a program product 800 suitable for a text classification method according to an embodiment of the present invention is depicted, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
Exemplary computing device
Having described the methods, media, and apparatus of exemplary embodiments of the present invention, a computing device suitable for performing a text classification method of exemplary embodiments of the present invention is described next with reference to FIG. 9.
The embodiment of the invention also provides a computing device. As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
In some possible embodiments, a computing device according to the present invention may include at least one processing unit, and at least one memory unit. Wherein the storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps in the text classification method according to various exemplary embodiments of the present invention described in the above section "exemplary method" of the present specification. For example, the processing unit may perform step S201 as shown in fig. 2A: acquiring a text to be classified; step S202: according to the text to be classified, adopting a first deep learning model to obtain first classification information of the text to be classified; step S203: and under the condition that the category of the text to be classified represented by the first classification information is not the first category, determining the category of the text to be classified by adopting a second deep learning model.
A computing device 900 adapted to perform the text classification method according to this embodiment of the invention is described below with reference to fig. 9. The computing device 900 shown in FIG. 9 is only one example and should not be taken to limit the scope of use and functionality of embodiments of the present invention.
As shown in fig. 9, computing device 900 is embodied in a general purpose computing device. Components of computing device 900 may include, but are not limited to: the at least one processing unit 901, the at least one memory unit 902, and the bus 903 connecting the various system components (including the memory unit 902 and the processing unit 901).
The bus 903 may include a data bus, an address bus, and a control bus.
The storage unit 902 may include volatile memory, such as a random access memory (RAM) 9021 and/or a cache memory 9022, and may further include a read only memory (ROM) 9023.
Storage unit 902 may also include a program/utility 9025 having a set (at least one) of program modules 9024, such program modules 9024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 900 may also communicate with one or more external devices 904 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.); such communication may take place through an input/output (I/O) interface 905. Moreover, computing device 900 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the internet) via the network adapter 906. As shown, the network adapter 906 communicates with the other modules of the computing device 900 over the bus 903. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although in the above detailed description several units/modules or sub-units/sub-modules of the apparatus are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module according to embodiments of the invention. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects imply that features in those aspects cannot be combined to advantage; that division is for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (18)

1. A method of text classification, comprising:
acquiring a text to be classified;
according to the text to be classified, adopting a first deep learning model to obtain first classification information of the text to be classified;
under the condition that the category of the text to be classified represented by the first classification information is not a first category, determining the category of the text to be classified by adopting a second deep learning model;
the sample data of the second deep learning model obtained through optimization comprises a first text, and the category represented by the first classification information of the first text is not the first category;
the first deep learning model serves as a recall model and is trained with a small data volume as its training sample data input, and the second deep learning model serves as an accurate model and is trained with a large data volume, obtained by recalling full data with the first deep learning model serving as the recall model, as its training sample data input;
the method further includes optimizing the first deep learning model, the optimizing the first deep learning model including:
acquiring a plurality of second texts and actual categories of the second texts;
taking the plurality of second texts as the input of the first deep learning model, and acquiring first classification information of the plurality of second texts;
optimizing the first deep learning model according to the first classification information of the second texts and the actual classification of the second texts;
wherein a proportion of texts whose actual category is the first category among the plurality of second texts is greater than a proportion of texts whose actual category is the first category among the plurality of first texts.
2. The method of claim 1, wherein determining the category of the text to be classified using a second deep learning model comprises:
acquiring second classification information of the text to be classified by adopting the second deep learning model; and determining the category of the text to be classified as the category represented by the second classification information.
3. The method of claim 2, wherein the first deep learning model is set with a first threshold and the second deep learning model is set with a second threshold, wherein:
the first classification information comprises first class information and a plurality of first confidence degrees of the text to be classified relative to a plurality of preset classes, and the first class information is determined by the magnitude relation between the first confidence degree of the text to be classified relative to a second class and the first threshold value;
the second classification information comprises second class information and a plurality of second confidences of the text to be classified relative to a plurality of predetermined classes, the second class information is determined by the magnitude relation between the second confidences of the text to be classified relative to the second classes and the second threshold value,
wherein the first threshold is smaller than the second threshold, the first category information and the second category information are used for representing categories of the text to be classified, and the plurality of predetermined categories include the first category and the second category.
4. The method of claim 2, wherein the first classification information and/or the second classification information comprises a plurality of confidences of the text to be classified with respect to a plurality of predetermined classes, and after determining the class of the text to be classified, the method further comprises:
determining the parameter value of the text to be classified as the confidence of the text to be classified relative to a second class,
wherein the plurality of predetermined categories includes the first category and the second category.
5. The method of claim 1, wherein the first deep learning model comprises: a logistic regression model or a long short-term memory network model.
6. The method of claim 1, further comprising:
acquiring a plurality of first texts and actual categories of the first texts;
taking the plurality of first texts as input of a second deep learning model to obtain second classification information of the plurality of first texts; and
optimizing the second deep learning model based on the second classification information of the plurality of first texts and the actual classification of the plurality of first texts,
wherein the second deep learning model comprises: a support vector machine model, a random forest model and a long short-term memory network model.
7. The method of claim 6, wherein the obtaining a plurality of first texts comprises:
acquiring a plurality of third texts;
taking the third texts as the input of the first deep learning model, and acquiring first classification information of the third texts;
determining, as the first text, a third text whose category characterized by the first classification information is not the first category.
8. The method of claim 1, wherein prior to obtaining the first classification information of the text to be classified, the method further comprises:
preprocessing the text to be classified by adopting a preprocessing model, extracting to obtain the characteristic information of the text to be classified,
the feature information of the text to be classified is used as the input of the first deep learning model, and the first classification information of the text to be classified is output; the preprocessing model comprises a word frequency-inverse document frequency model or a word vector model.
9. A text classification apparatus comprising:
the text to be classified acquisition module is used for acquiring a text to be classified;
the first class determination module is used for acquiring first classification information of the text to be classified by adopting a first deep learning model according to the text to be classified; and
the second category determining module is used for determining the category of the text to be classified by adopting a second deep learning model under the condition that the category of the text to be classified represented by the first classification information is not the first category;
the sample data of the second deep learning model obtained through optimization comprises a first text, and the category represented by the first classification information of the first text is not the first category;
the first deep learning model serves as a recall model and is trained with a small data volume as its training sample data input, and the second deep learning model serves as an accurate model and is trained with a large data volume, obtained by recalling full data with the first deep learning model serving as the recall model, as its training sample data input;
the apparatus also includes a first model optimization module to:
acquiring a plurality of second texts and actual categories of the second texts;
taking the plurality of second texts as the input of the first deep learning model, and acquiring first classification information of the plurality of second texts;
optimizing the first deep learning model according to the first classification information of the second texts and the actual classification of the second texts;
wherein a proportion of texts whose actual category is the first category among the plurality of second texts is greater than a proportion of texts whose actual category is the first category among the plurality of first texts.
10. The apparatus of claim 9, wherein the second category determination module comprises:
the second classification information acquisition submodule is used for acquiring second classification information of the text to be classified by adopting a second deep learning model; and
and the category determining submodule is used for determining the category of the text to be classified as the category represented by the second classification information.
11. The apparatus of claim 10, wherein the first deep learning model is set with a first threshold and the second deep learning model is set with a second threshold; wherein:
the first classification information comprises first class information and a plurality of first confidence degrees of the text to be classified relative to a plurality of preset classes, and the first class information is determined by the magnitude relation between the first confidence degree of the text to be classified relative to a second class and the first threshold value;
the second classification information comprises second class information and a plurality of second confidences of the text to be classified relative to a plurality of predetermined classes, the second class information is determined by the magnitude relation between the second confidences of the text to be classified relative to the second classes and the second threshold value,
the first threshold value is smaller than the second threshold value, the first category information and the second category information are used for representing categories of the text to be classified, and the plurality of predetermined categories include the first category and the second category.
12. The apparatus of claim 10, wherein the first classification information and/or the second classification information comprises a plurality of confidences of the text to be classified relative to a plurality of predetermined classes; the device further comprises:
a parameter value determining module for determining the parameter value of the text to be classified as the confidence of the text to be classified relative to the second class,
wherein the plurality of predetermined categories includes the first category and the second category.
13. The apparatus of claim 9, wherein the first deep learning model comprises: a logistic regression model or a long short-term memory network model.
14. The apparatus of claim 9, further comprising a second model optimization module to perform the operations of:
acquiring a plurality of first texts and actual categories of the first texts;
taking the plurality of first texts as input of a second deep learning model to obtain second classification information of the plurality of first texts; and
optimizing the second deep learning model based on the second classification information of the plurality of first texts and the actual classification of the plurality of first texts,
wherein the second deep learning model comprises: a support vector machine model, a random forest model and a long short-term memory network model.
15. The apparatus of claim 14, wherein the obtaining a plurality of first texts comprises:
acquiring a plurality of third texts;
taking the third texts as the input of the first deep learning model, and acquiring first classification information of the third texts;
determining, as the first text, a third text whose category characterized by the first classification information is not the first category.
16. The apparatus of claim 9, further comprising:
the preprocessing module is used for preprocessing the text to be classified by adopting a preprocessing model before the first classification information of the text to be classified is acquired by the first classification determining module, extracting the characteristic information of the text to be classified,
the first category determining module is specifically configured to use the feature information of the text to be classified as the input of the first deep learning model, and to output the first classification information of the text to be classified; the preprocessing model comprises a word frequency-inverse document frequency model or a word vector model.
17. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, implement a method according to any one of claims 1 to 8.
18. A computing device, comprising:
one or more memories storing executable instructions; and
one or more processors executing the executable instructions to implement the method of any one of claims 1-8.
CN201910480256.XA 2019-06-03 2019-06-03 Text classification method, device, medium and computing equipment Active CN110245232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910480256.XA CN110245232B (en) 2019-06-03 2019-06-03 Text classification method, device, medium and computing equipment


Publications (2)

Publication Number Publication Date
CN110245232A CN110245232A (en) 2019-09-17
CN110245232B true CN110245232B (en) 2022-02-18

Family

ID=67886011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910480256.XA Active CN110245232B (en) 2019-06-03 2019-06-03 Text classification method, device, medium and computing equipment

Country Status (1)

Country Link
CN (1) CN110245232B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112711940B (en) * 2019-10-08 2024-06-11 台达电子工业股份有限公司 Information processing system, information processing method and non-transitory computer readable recording medium
CN111243607A (en) * 2020-03-26 2020-06-05 北京字节跳动网络技术有限公司 Method, apparatus, electronic device, and medium for generating speaker information
CN111930939A (en) * 2020-07-08 2020-11-13 泰康保险集团股份有限公司 Text detection method and device
CN112257814A (en) * 2020-11-26 2021-01-22 携程计算机技术(上海)有限公司 Mail labeling method, system, equipment and storage medium based on deep learning
CN113536806B (en) * 2021-07-18 2023-09-08 北京奇艺世纪科技有限公司 Text classification method and device
CN114065759B (en) * 2021-11-19 2023-10-13 深圳数阔信息技术有限公司 Model failure detection method and device, electronic equipment and medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156885B (en) * 2010-02-12 2014-03-26 中国科学院自动化研究所 Image classification method based on cascaded codebook generation
US20170032276A1 (en) * 2015-07-29 2017-02-02 Agt International Gmbh Data fusion and classification with imbalanced datasets
US10685044B2 (en) * 2017-06-07 2020-06-16 Accenture Global Solutions Limited Identification and management system for log entries

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521656A (en) * 2011-12-29 2012-06-27 北京工商大学 Integrated transfer learning method for classification of unbalance samples
CN103593470A (en) * 2013-11-29 2014-02-19 河南大学 Double-degree integrated unbalanced data stream classification algorithm
CN103824092A (en) * 2014-03-04 2014-05-28 国家电网公司 Image classification method for monitoring state of electric transmission and transformation equipment on line
CN106453033A (en) * 2016-08-31 2017-02-22 电子科技大学 Multilevel Email classification method based on Email content
CN107644057A (en) * 2017-08-09 2018-01-30 天津大学 A kind of absolute uneven file classification method based on transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"一种基于级联模型的类别不平衡数据分类方法";刘胥影 等;《南京大学学报(自然科学版)》;20060331;第42卷(第02期);第148-155页 *

Also Published As

Publication number Publication date
CN110245232A (en) 2019-09-17

Similar Documents

Publication Publication Date Title
CN110245232B (en) Text classification method, device, medium and computing equipment
CN107679039B (en) Method and device for determining statement intention
US11062090B2 (en) Method and apparatus for mining general text content, server, and storage medium
CN107908635B (en) Method and device for establishing text classification model and text classification
US9923860B2 (en) Annotating content with contextually relevant comments
US20180075368A1 (en) System and Method of Advising Human Verification of Often-Confused Class Predictions
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
CN111143226B (en) Automatic test method and device, computer readable storage medium and electronic equipment
US20180068221A1 (en) System and Method of Advising Human Verification of Machine-Annotated Ground Truth - High Entropy Focus
CN108121699B (en) Method and apparatus for outputting information
CN112084334B (en) Label classification method and device for corpus, computer equipment and storage medium
CN109359290B (en) Knowledge point determining method of test question text, electronic equipment and storage medium
CN111783450B (en) Phrase extraction method and device in corpus text, storage medium and electronic equipment
CN109657056B (en) Target sample acquisition method and device, storage medium and electronic equipment
CN111079432A (en) Text detection method and device, electronic equipment and storage medium
CN112188312A (en) Method and apparatus for determining video material of news
CN114298050A (en) Model training method, entity relation extraction method, device, medium and equipment
US11416682B2 (en) Evaluating chatbots for knowledge gaps
CN112100377A (en) Text classification method and device, computer equipment and storage medium
CN115798661A (en) Knowledge mining method and device in clinical medicine field
CN115952854B (en) Training method of text desensitization model, text desensitization method and application
CN111666405B (en) Method and device for identifying text implication relationship
CN114841471B (en) Knowledge point prediction method and device, electronic equipment and storage medium
CN116450943A (en) Artificial intelligence-based speaking recommendation method, device, equipment and storage medium
CN113806485B (en) Intention recognition method and device based on small sample cold start and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant