WO2020199591A1 - Text categorization model training method, apparatus, computer device, and storage medium - Google Patents

Text categorization model training method, apparatus, computer device, and storage medium

Info

Publication number
WO2020199591A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample data
preset
sample
information entropy
value
Prior art date
Application number
PCT/CN2019/117095
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020199591A1 publication Critical patent/WO2020199591A1/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/192 - Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 - References adjustable by an adaptive method, e.g. learning

Definitions

  • This application relates to the field of information processing, and in particular to a text classification model training method, apparatus, computer device, and storage medium.
  • Text classification is an important application direction in natural language processing research. It refers to using a classifier to categorize documents containing text, determining the category to which each document belongs so that users can easily retrieve the documents they need.
  • The classifier, also called a classification model, is obtained by training classification criteria or model parameters on a large amount of sample data with category labels.
  • The embodiments of the present application provide a text classification model training method, apparatus, computer device, and storage medium to address the problems of a large training-sample scale and long training time during text classification model training.
  • a text classification model training method including:
  • the intermediate classification model is trained using the first sample data and the third sample data to obtain a text classification model.
  • a text classification model training device including:
  • a primary model establishment module, configured to obtain first sample data with category marks from a preset sample library and establish a primary classification model according to the first sample data;
  • a sample data acquisition module, configured to obtain second sample data without the category marks from the preset sample library;
  • an information entropy calculation module, configured to calculate the information entropy of each of the second sample data to obtain an information entropy value of each of the second sample data;
  • a correlation calculation module, configured to calculate a correlation value of each of the second sample data according to the number of identical phrases contained in the second sample data;
  • a to-be-labeled data selection module, configured to select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than a preset correlation threshold;
  • a labeling module, configured to perform category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;
  • a first model training module, configured to train the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;
  • a second model training module, configured to train the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
  • the intermediate classification model is trained using the first sample data and the third sample data to obtain a text classification model.
  • One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the intermediate classification model is trained using the first sample data and the third sample data to obtain a text classification model.
  • FIG. 1 is a schematic diagram of an application environment of the text classification model training method in an embodiment of the present application;
  • FIG. 2 is a flowchart of the text classification model training method in an embodiment of the present application;
  • FIG. 3 is a flowchart of step S1 of the text classification model training method in an embodiment of the present application;
  • FIG. 4 is a flowchart of step S4 of the text classification model training method in an embodiment of the present application;
  • FIG. 5 is a flowchart of step S5 of the text classification model training method in an embodiment of the present application;
  • FIG. 6 is a schematic diagram of the text classification model training device in an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a computer device in an embodiment of the present application.
  • The text classification model training method provided by the present application can be applied in the application environment shown in FIG. 1.
  • In that environment, the server is a computer device that performs text classification model training and can be a single server or a server cluster. The preset sample library is a database that provides training sample data and can be any relational or non-relational database, such as MS-SQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, or HBase. The server and the preset sample library are connected through a network, which can be wired or wireless.
  • The text classification model training method provided by the embodiments of the present application is applied to the server.
  • In one embodiment, a method for training a text classification model is provided. Its specific implementation process includes the following steps:
  • S1: Obtain first sample data with category marks from a preset sample library, and establish a primary classification model based on the first sample data.
  • the preset sample library is a database that provides training sample data.
  • the preset sample library can be deployed locally on the server or connected to the server through the network.
  • the first sample data is text data with category marks.
  • Text data includes text documents containing text information, text on the Internet, news articles, e-mail bodies, and so on. A category mark is a classification label attached to the text data, restricting the category to which it belongs.
  • Category marks include, but are not limited to, labels such as "science popularization", "sports", "inspirational", and "poetry and prose" that indicate the category of the text data.
  • In the preset sample library, category marks and text data are stored in association, and each piece of text data has a field indicating whether it carries a category mark.
  • The server can obtain the text data carrying category marks as the first sample data through an SQL query statement, as in the sketch below.
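  • A minimal sketch of this retrieval step, assuming a local SQLite sample library with a hypothetical samples table holding text, category, and has_label columns; the table and column names are illustrative, not taken from the patent:

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect("sample_library.db")
cursor = conn.cursor()

# First sample data: text data that carries a category mark.
cursor.execute("SELECT text, category FROM samples WHERE has_label = 1")
first_sample_data = cursor.fetchall()

# Second sample data: text data without a category mark (used in step S2).
cursor.execute("SELECT text FROM samples WHERE has_label = 0")
second_sample_data = [row[0] for row in cursor.fetchall()]

conn.close()
```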
  • The primary classification model is a classification tool constructed from the first sample data. Once established, it can roughly classify sample data that carries category marks.
  • Specifically, the server can perform feature analysis on the first sample data carrying category marks to obtain text feature information, and then store each category mark in association with its text feature information as the primary classification model.
  • For example, the server may perform word segmentation on the text of the first sample data and use high-frequency words as the text feature information.
  • Word segmentation splits the words of a text apart to obtain individual words. As a text processing technique, it is widely used in full-text retrieval, text content mining, and related fields.
  • Alternatively, the server can use a neural-network-based training method to obtain the primary classification model from the first sample data. A feature-based sketch follows.
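  • One possible feature-based reading of S1, assuming the first_sample_data pairs from the query sketch above; jieba is just one choice of segmenter, and the "model" here is simply the stored association of category marks with high-frequency words:

```python
from collections import Counter

import jieba  # one possible word-segmentation library for Chinese text

def text_features(text, top_n=5):
    """Segment the text and keep its highest-frequency words as features."""
    words = [w for w in jieba.lcut(text) if w.strip()]
    return [word for word, _ in Counter(words).most_common(top_n)]

# Store each category mark in association with the text feature information.
primary_model = [
    (text_features(text), category) for text, category in first_sample_data
]
```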
  • S2: Obtain second sample data without category marks from the preset sample library. The second sample data is text data without a category mark; compared with the first sample data, it carries no category mark, so without manual labeling the server does not know which text category the second sample data belongs to or what it expresses.
  • The server can obtain the second sample data from the preset sample library through an SQL query statement.
  • S3: Calculate the information entropy of each second sample data to obtain its information entropy value. Information entropy, introduced by Shannon, is a quantitative measure of the amount of information. The greater the information entropy, the richer the information contained in the sample data and the greater the uncertainty of that information. The information entropy value is the concrete quantified value of the information entropy.
  • The server can determine the information entropy value from how much text the second sample data contains, for example by using the character count of the second sample data as its information entropy value. Intuitively, a 5,000-word article contains more information than a 20-word e-mail body.
  • Specifically, the server counts the characters in each second sample data and uses the character count as that sample's information entropy value.
  • Alternatively, the server uses the number of segmented words remaining after auxiliary words are removed from the second sample data as its information entropy value. Auxiliary words include, but are not limited to, particles such as "吧", "嗯", "的", and "了".
  • Specifically, the server performs word segmentation on the second sample data to obtain a word segmentation set, removes the auxiliary words from it, and uses the number of remaining words as the information entropy value of the second sample data, as sketched below.
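  • A sketch of the two simple entropy-value measures just described; the auxiliary-word set contains only the examples given above and would be extended in practice:

```python
import jieba

AUXILIARY_WORDS = {"吧", "嗯", "的", "了"}  # extend with further particles as needed

def entropy_value_by_char_count(text):
    """Simplest variant: the character count of the second sample data."""
    return len(text)

def entropy_value_by_token_count(text):
    """Alternative variant: the number of segmented words remaining
    after auxiliary words are removed."""
    return sum(1 for w in jieba.lcut(text) if w not in AUXILIARY_WORDS)
```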
  • S4: Calculate the correlation value of each second sample data according to the number of identical phrases contained in the second sample data.
  • The correlation value of the second sample data reflects whether the information it provides is repetitive and redundant. The higher the correlation value, the more repetitive and redundant the information the second sample data provide to each other; the lower the correlation value, the more the information they provide differs.
  • Specifically, the server determines the correlation value from the number of identical phrases the second sample data contain. For example, suppose second sample data A contains the phrases "culture", "civilization", and "history"; B contains "culture", "country", and "history"; and C contains "travel", "mountain", and "country". Then the correlation value between A and B is 2, between A and C is 0, and between B and C is 1.
  • The correlation value of each second sample data can then be determined as the cumulative sum of its correlation values with every other second sample data: the correlation value of A is 2, of B is 3, and of C is 1, as the sketch below verifies.
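  • A short sketch of this cumulative-sum variant, using the phrase lists of A, B, and C from the example above:

```python
def same_phrase_count(phrases_a, phrases_b):
    """Number of identical phrases shared by two samples."""
    return len(set(phrases_a) & set(phrases_b))

samples = {
    "A": ["culture", "civilization", "history"],
    "B": ["culture", "country", "history"],
    "C": ["travel", "mountain", "country"],
}

# Each sample's correlation value: the cumulative sum of its pairwise
# overlaps with every other sample.
correlation = {
    name: sum(
        same_phrase_count(phrases, other)
        for other_name, other in samples.items()
        if other_name != name
    )
    for name, phrases in samples.items()
}
print(correlation)  # {'A': 2, 'B': 3, 'C': 1}
```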
  • S5: Select, as data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is lower than the preset correlation threshold.
  • The preset information entropy threshold and the preset correlation threshold are the conditions for filtering the second sample data that carry no category mark; the data to be labeled is what remains of the second sample data after filtering by these two thresholds.
  • Second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is lower than the preset correlation threshold carry more uncertain information and differ more from one another, making them the preferred data for training the model.
  • For example, the server computes the information entropy value and correlation value of each second sample data, and takes as the data to be labeled the second sample data whose information entropy value is greater than 1,000 and whose correlation value is lower than 100.
  • S6: Perform category labeling on the data to be labeled according to the preset category labeling method to obtain the third sample data.
  • Category labeling attaches category marks to second sample data that have none, for example labeling an article with tags such as "fiction" or "suspense" that reflect its subject matter. The data obtained after category labeling is the third sample data.
  • The preset category labeling method means the server can label the second sample data in any of several ways.
  • For example, the server can extract the keywords of the data to be labeled, say the five words with the highest word frequency, and compare them with the target keywords in a preset category-tag thesaurus; if a keyword matches a target keyword, the corresponding mark is attached to the sample, yielding the third sample data. A sketch of this route follows.
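  • A sketch of the keyword route, assuming the text_features helper from the earlier sketch and a hypothetical thesaurus that maps target keywords to category marks (the patent leaves the thesaurus layout open):

```python
def label_by_keywords(text, category_thesaurus, top_n=5):
    """Return a category mark when one of the sample's top-frequency
    keywords matches a target keyword in the preset thesaurus."""
    for keyword in text_features(text, top_n=top_n):
        if keyword in category_thesaurus:
            return category_thesaurus[keyword]
    return None  # fall back to another route, e.g. an expert-system API

# Hypothetical thesaurus entries for illustration only.
thesaurus = {"篮球": "sports", "股票": "business", "诗歌": "poetry prose"}
```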
  • Alternatively, the server can directly call a third-party expert system for labeling through an API (Application Programming Interface): the second sample data is input to the third-party expert system, which returns the category mark corresponding to that second sample data, thereby yielding the third sample data.
  • S7: According to the preset model training method, use the third sample data to train the primary classification model to obtain the intermediate classification model.
  • The intermediate classification model is the classification model obtained by training on the third sample data on the basis of the primary classification model. It differs from the primary classification model in that its training set is the third sample data, which carries category marks and whose information entropy and correlation values satisfy the specified conditions.
  • Under the preset model training method, the server uses the third sample data as training data and trains the primary classification model with any of several frameworks or algorithms. For example, the server can use existing machine learning frameworks or tools such as Scikit-Learn or TensorFlow.
  • Scikit-Learn (sklearn for short) has built-in classification algorithms such as naive Bayes, decision trees, and random forests, and provides commonly used machine learning facilities including data preprocessing, classification, regression, dimensionality reduction, and model selection.
  • TensorFlow is an open-source software library for numerical computation originally developed by researchers and engineers of the Google Brain team (part of Google's machine intelligence research organization). It was built for research on machine learning and deep neural networks, but its generality has made it widely used in other computational fields as well.
  • Specifically, the server uses the third sample data as input data and calls the built-in training methods in sklearn until the model tends to converge, obtaining the intermediate classification model.
  • S8: According to the preset model training method, use the first sample data and the third sample data to train the intermediate classification model to obtain the text classification model.
  • The text classification model is the final classification model obtained after retraining the intermediate classification model. The server adopts the same preset model training method as in step S7, which is not repeated here.
  • The difference from step S7 is that the first sample data and the third sample data are used together to train the intermediate classification model; that is, the intermediate classification model is iteratively trained with category-marked sample data to improve its classification accuracy.
  • Specifically, the server takes the first sample data and the third sample data as input data and calls the built-in training methods in sklearn until the model tends to converge, obtaining the text classification model. Both stages are sketched below.
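  • A minimal two-stage sketch of S7 and S8 with sklearn's naive Bayes, assuming texts_1/labels_1 hold the first sample data and texts_3/labels_3 the labeled third sample data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer()
vectorizer.fit(texts_1 + texts_3)  # one shared vocabulary for both stages

# S7: train on the third sample data to obtain the intermediate model.
intermediate_model = MultinomialNB().fit(vectorizer.transform(texts_3), labels_3)

# S8: retrain on the first and third sample data together to obtain the
# text classification model. sklearn's fit() refits from scratch, so the
# iterative refinement is expressed as a second fit on the combined set;
# MultinomialNB.partial_fit is the incremental alternative.
text_classification_model = MultinomialNB().fit(
    vectorizer.transform(texts_1 + texts_3), labels_1 + labels_3
)
```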
  • In this embodiment, first sample data with category marks is obtained from the preset sample library and the primary classification model is established from it, so that only a small portion of category-marked sample data is used for training; this reduces the demand for category-marked sample data and saves training cost. Second sample data without category marks is then obtained from the preset sample library, its information entropy and correlation values are calculated, and the second sample data meeting the preset entropy and correlation conditions are labeled. According to the preset model training method, the labeled third sample data, which has high information entropy, low mutual correlation, and category marks, is used to train the primary classification model into the intermediate classification model, improving its classification accuracy. Finally, the intermediate classification model is trained with the first and third sample data to obtain the text classification model; that is, the final text classification model is obtained through stepwise iterative optimization.
  • Further, in one embodiment, step S1, namely obtaining the first sample data with category marks from the preset sample library and establishing the primary classification model from the first sample data, includes the following steps:
  • S11: Select the first sample data with category marks from the preset sample library according to a preset sample selection method.
  • The preset sample selection method selects a certain number of representative category-marked first sample data from the preset sample library. The number should be as small as possible to reduce the demand for sample data; at the same time, the selected first samples should cover the text data categories as fully as possible. For news text data, for example, the selection should try to cover categories such as "politics", "business", "sports", and "style and entertainment".
  • For example, from a library of 3,000 articles, the server can select 30%, that is, 900 articles, and from those 900 select 5 articles representing each text data category as the first sample data.
  • S12: Establish the primary classification model by combining the category-marked first sample data with a preset training algorithm.
  • The preset training algorithm can be any of the various algorithms used to train models in machine learning.
  • The process in which the server establishes the primary classification model from category-marked first sample data is a supervised learning procedure. Supervised learning trains an optimal model from existing training samples, that is, known data and their corresponding outputs; the model belongs to some set of functions, and "optimal" means best under a given evaluation criterion.
  • For example, the server can import the naive Bayes classifier from the sklearn library and call MultinomialNB().fit() for training.
  • The server can use the Joblib library to save the training results. Joblib is part of the SciPy ecosystem and provides tools for pipelining Python jobs. Alternatively, the server can call functions of the pickle library to save the primary classification model, as in the sketch below.
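  • Both persistence routes in one sketch, assuming model is a fitted sklearn estimator such as the MultinomialNB above; the file names are illustrative:

```python
import pickle

import joblib

joblib.dump(model, "primary_model.joblib")   # Joblib route
with open("primary_model.pkl", "wb") as f:   # pickle route
    pickle.dump(model, f)

restored = joblib.load("primary_model.joblib")  # reload later
```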
  • In this embodiment, the server selects first sample data that is as small in number as possible while covering as many sample categories as possible, and then establishes the primary classification model with the preset training algorithm. This keeps the demand for sample data low, further reducing training cost, while the broad coverage of the first sample data widens the range the primary classification model can recognize.
  • Further, in one embodiment, step S3, namely calculating the information entropy of each second sample data to obtain the information entropy value of each second sample data, specifically includes calculating the information entropy according to the following Shannon formula:

    H = -Σ p(x) · log₂ p(x), summed over all phrases x in the second sample data

  • where H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency of occurrence of the phrase.
  • The phrases in the second sample data are the words obtained after the server performs word segmentation on the second sample data, and the frequency of a phrase is the number of times it appears in the second sample data.
  • Specifically, the server first performs word segmentation on each second sample data to obtain a word segmentation set; substituting the frequencies of all the words in the set into the formula yields the information entropy value of the second sample data.
  • In this embodiment, the server calculates the information entropy of the second sample data from the Shannon formula and the word frequencies of its phrases, quantifying the amount of information contained in the sample data more accurately. A sketch follows.
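  • A sketch of the entropy formula, reading p(x) as the relative frequency of each phrase so the p(x) sum to 1 (an assumption; the text only says "frequency of occurrence") and taking the logarithm to base 2 as in Shannon's formula:

```python
import math
from collections import Counter

import jieba

def information_entropy(text):
    """H = -sum over phrases x of p(x) * log2(p(x))."""
    words = jieba.lcut(text)  # word segmentation set, as in step S3
    total = len(words)
    counts = Counter(words)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```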
  • Further, in one embodiment, step S4, namely calculating the correlation value of each second sample data according to the number of identical phrases contained in the second sample data, specifically includes the following steps:
  • S41: Perform word segmentation on each second sample data to obtain N word segmentation sets, where N is the number of second sample data.
  • The server can perform word segmentation in several ways. For example, a regular expression can be used to segment the second sample data into a set of words, that is, a word segmentation set. Understandably, the second sample data and the word segmentation sets correspond one to one.
  • A regular expression (Regular Expression) is a processing method used to retrieve or replace target text within a context.
  • For example, the server can use the regular expression engine built into Perl or Python to segment the second sample data; or it can use the grep tool that ships with Unix systems to obtain a set containing the segmented words. grep (Globally search a Regular Expression and Print) is a powerful text search tool. A regex-based sketch follows.
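  • A regex-based sketch of S41; the pattern is illustrative, and a plain word-character pattern only splits on punctuation and whitespace, so real Chinese text usually needs a dictionary-based segmenter instead:

```python
import re

def segment(text):
    """Split one second sample into a word segmentation set."""
    return set(re.findall(r"\w+", text))

print(segment("people, interest, bank, borrow"))
# {'people', 'interest', 'bank', 'borrow'}
```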
  • S42: For each second sample data, calculate the intersections between its word segmentation set and the word segmentation sets of the other N-1 second sample data, and determine from the number of phrases in each intersection the local correlation values between this second sample data and the other N-1 second sample data, obtaining its N-1 local correlation values.
  • The local correlation value represents the degree of correlation between one second sample data and another. For example, suppose word segmentation set a is {"people", "interest", "bank", "borrow"} and word segmentation set b is {"bank", "borrow", "income"}; their intersection is {"bank", "borrow"}, which contains 2 phrases, so the local correlation value of a and b is 2. If word segmentation set c is {"meeting", "report", "income"}, then the local correlation value of a and c is 0, and that of b and c is 1.
  • S43: Calculate the average of the N-1 local correlation values corresponding to each second sample data, and use this average as the correlation value of that second sample data.
  • Continuing the example, the correlation value of the second sample data corresponding to word segmentation set a is the average of its local correlation values with b and with c, that is, (2 + 0) / 2 = 1; the correlation values of the second sample data corresponding to sets b and c are 1.5 and 0.5, respectively.
  • In this embodiment, the server segments the second sample data, determines the local correlation values between them from the intersections of their word segmentation sets, and averages the local correlation values to obtain each second sample data's correlation value, so that the correlation value reflects the degree of correlation between the second sample data more accurately. The sketch below covers steps S41 to S43.
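  • A sketch covering S41 to S43 end to end, reproducing the a/b/c example above:

```python
def correlation_values(segmented):
    """Average each sample's N-1 local correlation values (set-intersection
    sizes with every other sample) into its correlation value."""
    names = list(segmented)
    return {
        name: sum(
            len(segmented[name] & segmented[other])
            for other in names
            if other != name
        ) / (len(names) - 1)
        for name in names
    }

sets = {
    "a": {"people", "interest", "bank", "borrow"},
    "b": {"bank", "borrow", "income"},
    "c": {"meeting", "report", "income"},
}
print(correlation_values(sets))  # {'a': 1.0, 'b': 1.5, 'c': 0.5}
```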
  • Further, in one embodiment, step S5, namely selecting as the data to be labeled the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is lower than the preset correlation threshold, includes the following steps:
  • S51: Select the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is lower than the preset correlation threshold as candidate sample data.
  • Here the server re-screens the second sample data that meet this specific condition, which both reduces the number of training samples and uncovers sample data that an ordinary classifier finds hard to recognize.
  • S52: Classify the candidate sample data with at least two preset sample classifiers to obtain a classification result.
  • The preset sample classifiers are text classification models, for example the common FastText and TextCNN models.
  • FastText is a word-vector and text classification tool open-sourced by Facebook whose typical application scenario is the supervised text classification problem. It provides a simple and efficient method for text classification and representation learning, with performance comparable to deep learning at much higher speed.
  • TextCNN is an algorithm that classifies text with convolutional neural networks; because of its simple structure and good results, it is widely used in text classification.
  • Different preset sample classifiers may classify the same sample data differently; that is, after the same sample is classified by different models such as FastText and TextCNN, it may be recognized as belonging to different categories. The classification result records the category to which each candidate sample data belongs.
  • S53: From the classification result, select the candidate sample data that belong to different categories at the same time as the data to be labeled.
  • Candidate sample data belonging to different categories at the same time means that different preset classifiers produced different recognition results for the same candidate sample. For example, an article recognized as "historical" by FastText and as "literary and artistic" by TextCNN is hard to recognize, or hard to place into a single category.
  • Specifically, the server determines from the classification result whether each candidate sample data belongs to different categories at the same time.
  • In this embodiment, the server screens the second sample data meeting the specific condition with different preset classifiers and picks out the second sample data that are hard to recognize as the data to be labeled. This removes the simple, easily recognized sample data, further reducing the number of training samples and the training time and improving training efficiency; meanwhile, selecting the hard-to-recognize sample data as the data to be labeled means their labeling benefits the accuracy of model training. A selection sketch follows.
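  • A sketch of the disagreement filter in S52 and S53, assuming each preset classifier exposes a predict(texts) method returning one category per text (the FastText and TextCNN wrappers would need adapting to this interface):

```python
def select_data_to_label(candidates, classifier_a, classifier_b):
    """Keep the candidates that two preset classifiers assign to
    different categories, i.e. the hard-to-recognize samples."""
    preds_a = classifier_a.predict(candidates)
    preds_b = classifier_b.predict(candidates)
    return [
        text
        for text, cat_a, cat_b in zip(candidates, preds_a, preds_b)
        if cat_a != cat_b
    ]
```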
  • In one embodiment, a text classification model training device is provided, and it corresponds one-to-one to the text classification model training method in the above embodiment.
  • As shown in FIG. 6, the text classification model training device includes a primary model establishment module 61, a sample data acquisition module 62, an information entropy calculation module 63, a correlation calculation module 64, a to-be-labeled data selection module 65, a labeling module 66, a first model training module 67, and a second model training module 68.
  • the detailed description of each functional module is as follows:
  • The primary model establishment module 61 is configured to obtain first sample data with category marks from the preset sample library and establish a primary classification model according to the first sample data;
  • the sample data acquisition module 62 is configured to obtain second sample data without category marks from the preset sample library;
  • the information entropy calculation module 63 is configured to calculate the information entropy of each second sample data to obtain the information entropy value of each second sample data;
  • the correlation calculation module 64 is configured to calculate the correlation value of each second sample data according to the number of identical phrases contained in the second sample data;
  • the to-be-labeled data selection module 65 is configured to select, as the data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than the preset correlation threshold;
  • the labeling module 66 is configured to perform category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;
  • the first model training module 67 is configured to train the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;
  • the second model training module 68 is configured to train the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
  • the primary model establishment module 61 includes:
  • the selection sub-module 611 is used to select the first sample data with the category mark from the preset sample library according to the preset sample selection method;
  • the training sub-module 612 is used to establish a primary classification model by combining the first sample data with category labels and a preset training algorithm.
  • the information entropy calculation module 63 includes
  • the information entropy calculation sub-module 631 is configured to calculate the information entropy of each second sample data according to the following formula:
  • H represents the information entropy value of the second sample data
  • x represents the phrase in the second sample data
  • p (x) represents the frequency of occurrence of the phrase.
  • the correlation calculation module 64 includes:
  • the word segmentation sub-module 641 is used to perform word segmentation processing on each second sample data to obtain N word segmentation sets, where N is the number of second sample data;
  • the local correlation calculation sub-module 642 is configured to, for each second sample data, calculate the intersections between its word segmentation set and the word segmentation sets of the other N-1 second sample data, determine the local correlation values between this second sample data and the other N-1 second sample data according to the number of phrases contained in each intersection, and obtain the N-1 local correlation values corresponding to this second sample data;
  • the average value calculation sub-module 643 is used to calculate the average value of N-1 local correlation values corresponding to each second sample data, and use the average value as the correlation value of each second sample data.
  • the data selection module 65 to be labeled includes:
  • the candidate sample selection submodule 651 is configured to select second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than the preset correlation threshold as candidate sample data;
  • the classification sub-module 652 is configured to classify candidate sample data by using at least two preset sample classifiers to obtain a classification result
  • the labeling submodule 653 is used to select candidate sample data belonging to different categories at the same time from the classification result as the data to be labeled.
  • Each module in the text classification model training device described above can be implemented in whole or in part by software, hardware, and a combination thereof.
  • The foregoing modules may be embedded in, or independent of, the processor of the computer device in the form of hardware, or stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to them.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 7.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a text classification model training method.
  • In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the steps of the text classification model training method in the above embodiment are implemented, for example steps S1 to S8 shown in FIG. 2; or the functions of the modules/units of the text classification model training device in the above embodiment are implemented, for example the functions of modules 61 to 68 shown in FIG. 6. To avoid repetition, details are not described here again.
  • In one embodiment, one or more readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media.
  • The readable storage media store computer-readable instructions which, when executed by one or more processors, implement the steps of the text classification model training method in the above method embodiment, or implement the functions of the modules/units of the text classification model training device in the above device embodiment. To avoid repetition, details are not described here again.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed by the present application are a text categorization model training method, apparatus, computer device, and storage medium, said method comprising: obtaining, from a preset sample library, first sample data having a category label and second sample data not having a category label; establishing a primary categorization model according to the first sample data; at the same time, calculating an information entropy value and a correlation value of the second sample data; according to a preset category labeling method, labeling the second sample data whose information entropy value and correlation value meet preset conditions to obtain third sample data; using the third sample data to train the primary categorization model to obtain an intermediate categorization model; using the first sample data and the third sample data to train the intermediate categorization model to obtain a text categorization model. The technical solution of the present application solves the problem, during text categorization model training, of the training sample size being enormous and the training time being long.

Description

Text classification model training method, apparatus, computer device, and storage medium

This application is based on, and claims priority to, Chinese invention application No. 201910247846.8, filed on March 29, 2019 and titled "Text classification model training method, apparatus, computer device, and storage medium".

Technical Field

This application relates to the field of information processing, and in particular to a text classification model training method, apparatus, computer device, and storage medium.

Background

Text classification is an important application direction in natural language processing research. It refers to using a classifier to categorize documents containing text, determining the category to which each document belongs so that users can easily retrieve the documents they need.

The classifier, also called a classification model, is obtained by training classification criteria or model parameters on a large amount of sample data with category labels. The trained classifier is used to recognize text data of unknown categories, realizing automatic classification of large-scale text data. The quality of the classification model therefore directly affects the final classification result.

However, in real large-scale text classification problems, sample data with category labels is very limited, and most samples carry no category label. During the construction of a classification model, this forces manual labeling by experts in the field, which consumes considerable manpower, money, and time; moreover, the training samples are large in scale and the training process also takes a long time.

Summary of the Invention

The embodiments of the present application provide a text classification model training method, apparatus, computer device, and storage medium to address the problems of a large training-sample scale and long training time during text classification model training.

A text classification model training method includes:

obtaining first sample data with category marks from a preset sample library, and establishing a primary classification model according to the first sample data;

obtaining second sample data without the category marks from the preset sample library;

calculating the information entropy of each of the second sample data to obtain an information entropy value of each of the second sample data;

calculating a correlation value of each of the second sample data according to the number of identical phrases contained in the second sample data;

selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than a preset correlation threshold;

performing category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;

training the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;

training the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
A text classification model training apparatus includes:

a primary model establishment module, configured to obtain first sample data with category marks from a preset sample library and establish a primary classification model according to the first sample data;

a sample data acquisition module, configured to obtain second sample data without the category marks from the preset sample library;

an information entropy calculation module, configured to calculate the information entropy of each of the second sample data to obtain an information entropy value of each of the second sample data;

a correlation calculation module, configured to calculate a correlation value of each of the second sample data according to the number of identical phrases contained in the second sample data;

a to-be-labeled data selection module, configured to select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than a preset correlation threshold;

a labeling module, configured to perform category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;

a first model training module, configured to train the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;

a second model training module, configured to train the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.

A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions:

obtaining first sample data with category marks from a preset sample library, and establishing a primary classification model according to the first sample data;

obtaining second sample data without the category marks from the preset sample library;

calculating the information entropy of each of the second sample data to obtain an information entropy value of each of the second sample data;

calculating a correlation value of each of the second sample data according to the number of identical phrases contained in the second sample data;

selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than a preset correlation threshold;

performing category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;

training the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;

training the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.

One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:

obtaining first sample data with category marks from a preset sample library, and establishing a primary classification model according to the first sample data;

obtaining second sample data without the category marks from the preset sample library;

calculating the information entropy of each of the second sample data to obtain an information entropy value of each of the second sample data;

calculating a correlation value of each of the second sample data according to the number of identical phrases contained in the second sample data;

selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than a preset correlation threshold;

performing category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;

training the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;

training the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
The details of one or more embodiments of the present application are set forth in the drawings and description below; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.

Description of the Drawings

To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an application environment of the text classification model training method in an embodiment of the present application;

FIG. 2 is a flowchart of the text classification model training method in an embodiment of the present application;

FIG. 3 is a flowchart of step S1 of the text classification model training method in an embodiment of the present application;

FIG. 4 is a flowchart of step S4 of the text classification model training method in an embodiment of the present application;

FIG. 5 is a flowchart of step S5 of the text classification model training method in an embodiment of the present application;

FIG. 6 is a schematic diagram of the text classification model training device in an embodiment of the present application;

FIG. 7 is a schematic diagram of a computer device in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative work fall within the protection scope of the present application.

The text classification model training method provided by the present application can be applied in the application environment shown in FIG. 1, where the server is a computer device that performs text classification model training and can be a single server or a server cluster. The preset sample library is a database that provides training sample data and can be any relational or non-relational database, such as MS-SQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, or HBase. The server and the preset sample library are connected through a network, which can be wired or wireless. The text classification model training method provided by the embodiments of the present application is applied to the server.
在一实施例中,如图2所示,提供了一种文本分类模型训练方法,其具体实现流程包括如下步骤:In an embodiment, as shown in FIG. 2, a method for training a text classification model is provided. The specific implementation process includes the following steps:
S1:从预设样本库中获取具有类别标记的第一样本数据,并根据第一样本数据建立初级分类模型。S1: Obtain first sample data with category marks from a preset sample library, and establish a primary classification model based on the first sample data.
预设样本库,即提供训练样本数据的数据库。预设样本库可以部署在服务端本地,或者通过网络与服务端相连。The preset sample library is a database that provides training sample data. The preset sample library can be deployed locally on the server or connected to the server through the network.
第一样本数据,是具有类别标记的文本数据。其中,文本数据是包含有文本信息的文本文档、互联网上的文字、新闻、以及电子邮件正文等;类别标记是对文本数据所作的分类标签,是对文本数据的分类限定。The first sample data is text data with category marks. Among them, the text data is a text document containing text information, text on the Internet, news, and the body of an e-mail, etc.; the category tag is a classification label for the text data, which is a classification restriction on the text data.
例如,一篇文章的类别标记为“情感”,则代表该篇文章的内容以与“情感”相关。可以理解地,类别标记还包括但不限于“科普”、“运动”、“励志”、“诗歌散文”等用于表示文本数据所属类别的标记。For example, if the category of an article is marked as "emotion", it means that the content of the article is related to "emotion". Understandably, category tags also include but are not limited to "science popularization", "sports", "inspirational", "poetry prose", etc., used to indicate the category of text data.
具体地,在预设样本库中,类别标记和文本数据是关联存储的,每个文本数据均有表示其是否具有类别标记的字段。服务端可以通过SQL查询语句获取有类别标记的文本数据作为第一样本数据。Specifically, in the preset sample library, the category mark and text data are stored in association, and each text data has a field indicating whether it has a category mark. The server can obtain the text data with the category mark as the first sample data through the SQL query statement.
The primary classification model is a classification tool built from the first sample data. Once built, it can perform a rough classification of category-marked sample data.
Specifically, the server can perform feature analysis on the first sample data to obtain its text feature information, and then store the category marks in association with the text feature information as the primary classification model. For example, the server may apply word segmentation to the text of the first sample data and take the high-frequency words as the text feature information. Word segmentation splits the words of a text apart to obtain individual words; as a text processing technique, it is widely used in full-text retrieval, text content mining, and related fields.
Alternatively, the server can obtain the primary classification model from the first sample data using a neural-network-based training method.
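The word-frequency approach can be illustrated with a minimal sketch. jieba is assumed here as the segmentation tool, and the sample layout (text, label) and the top_k cutoff are illustrative choices rather than part of the described method:

```python
from collections import Counter

import jieba  # assumed segmentation library


def build_primary_model(labeled_samples, top_k=20):
    """Associate each category mark with its highest-frequency words."""
    counters = {}
    for text, label in labeled_samples:
        words = [w for w in jieba.cut(text) if w.strip()]
        counters.setdefault(label, Counter()).update(words)
    # Keep the top_k most frequent words per category as its text features.
    return {label: [w for w, _ in c.most_common(top_k)]
            for label, c in counters.items()}
```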
S2: Obtain second sample data without category marks from the preset sample library.
The second sample data is text data that carries no category mark. That is, unlike the first sample data, the second sample data is unlabeled; without manual labeling, the server does not know which text category a piece of second sample data belongs to or what it expresses.
Specifically, the server can obtain the second sample data from the preset sample library through an SQL query.
S3: Calculate the information entropy of each piece of second sample data to obtain its information entropy value.
Information entropy, introduced by Shannon to measure the amount of information, is a quantitative measure of information content. The larger the information entropy, the richer the information contained in the sample data, and the greater the uncertainty of that information.
The information entropy value is the concrete quantification of the information entropy.
The server can determine the information entropy value from how much text a piece of second sample data contains. For example, the number of characters in the second sample data can be used as its information entropy value. Understandably, a 5000-character article contains more information than an e-mail body of only 20 characters.
Specifically, the server counts the characters in each piece of second sample data and uses that count as its information entropy value.
Alternatively, the server uses the number of segmented words remaining after auxiliary particles are removed from the second sample data as its information entropy value. Auxiliary particles include, but are not limited to, "吧", "嗯", "的", "了", and so on.
Specifically, the server applies word segmentation to the second sample data to obtain a word set, removes the auxiliary particles from it, and takes the number of remaining words as the information entropy value of the second sample data.
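A short sketch of this second estimate, assuming jieba as the tokenizer; the particle set mirrors the examples above and would be extended in practice:

```python
import jieba  # assumed tokenizer

PARTICLES = {"吧", "嗯", "的", "了"}  # auxiliary particles named above


def entropy_value_by_token_count(text):
    """Count the segmented words left after dropping auxiliary particles."""
    tokens = [w for w in jieba.cut(text) if w.strip()]
    return sum(1 for w in tokens if w not in PARTICLES)
```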
S4: Calculate the correlation value of each piece of second sample data according to the number of identical phrases the second sample data share.
The correlation value of the second sample data reflects whether the information the samples provide is repetitive and redundant. The higher the correlation value, the more the second sample data duplicate one another; the lower the correlation value, the more the second sample data differ from one another.
The server determines the correlation value from the number of identical phrases the second sample data contain.
For example, suppose second sample A contains the phrases "culture", "civilization", and "history"; second sample B contains "culture", "country", and "history"; and second sample C contains "travel", "mountains and rivers", and "country". Samples A and B both contain the phrases "culture" and "history", so the correlation value between A and B is 2; likewise, the correlation value between A and C is 0, and between B and C is 1. The correlation value of each second sample can then be determined as the cumulative sum of its correlation values with every other second sample: A's correlation value is 2, B's is 3, and C's is 1.
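The A/B/C example can be reproduced directly with set intersections; the set literals below mirror the phrases in the example:

```python
# Word sets of the three second samples from the example above.
sample_phrases = {
    "A": {"文化", "文明", "历史"},
    "B": {"文化", "国家", "历史"},
    "C": {"旅行", "山川", "国家"},
}

for name, phrases in sample_phrases.items():
    # Cumulative sum of pairwise intersection sizes with every other sample.
    score = sum(len(phrases & other)
                for key, other in sample_phrases.items() if key != name)
    print(name, score)  # A -> 2, B -> 3, C -> 1
```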
S5: Select the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below a preset correlation threshold as the data to be labeled.
The preset information entropy threshold and the preset correlation threshold are the conditions for filtering the unlabeled second sample data.
The data to be labeled is what remains after the second sample data are filtered by the preset information entropy threshold and the preset correlation threshold.
Second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold carry highly uncertain content and differ most from one another, making them the preferred data for training the model.
Specifically, if the preset information entropy threshold is 1000 and the preset correlation threshold is 100, the server screens each piece of second sample data by its information entropy value and correlation value, and takes those with an information entropy value greater than 1000 and a correlation value below 100 as the data to be labeled.
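A direct sketch of this selection rule, using the example thresholds; each sample is assumed to carry precomputed entropy and correlation values:

```python
ENTROPY_THRESHOLD = 1000      # preset information entropy threshold
CORRELATION_THRESHOLD = 100   # preset correlation threshold


def select_to_label(samples):
    """samples: iterable of (text, entropy_value, correlation_value)."""
    return [text for text, entropy, correlation in samples
            if entropy > ENTROPY_THRESHOLD
            and correlation < CORRELATION_THRESHOLD]
```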
S6: Label the data to be labeled by category according to a preset category labeling method to obtain third sample data.
Category labeling is the process of marking second sample data that carries no category mark so that it acquires one. For example, labeling an article by category means attaching tags such as "fiction" or "suspense" that reflect its subject matter. The data obtained after category labeling is the third sample data.
The preset category labeling method means the server can label the second sample data by category in any of several ways.
For example, the server can extract keywords from the second sample data, taking the five words with the highest word frequency as the keywords; it then compares the keywords against the target keywords in a preset category-mark lexicon, and if a keyword matches a target keyword, it labels the second sample data with that target keyword to obtain the third sample data.
Alternatively, the server can directly call a third-party expert system for labeling. For example, using the API (Application Programming Interface) provided by a third-party expert system, it submits the second sample data and receives the corresponding category marks, thereby obtaining the third sample data.
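The keyword-matching variant might look like the following sketch; jieba and the lexicon contents are illustrative assumptions:

```python
from collections import Counter

import jieba  # assumed tokenizer

# Hypothetical category-mark lexicon: target keyword -> category mark.
LABEL_LEXICON = {"小说": "小说", "悬疑": "悬疑"}


def label_by_keywords(text):
    """Take the five most frequent words as keywords and match the lexicon."""
    words = [w for w in jieba.cut(text) if w.strip()]
    keywords = [w for w, _ in Counter(words).most_common(5)]
    return [LABEL_LEXICON[w] for w in keywords if w in LABEL_LEXICON]
```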
S7: Train the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model.
The intermediate classification model is the classification model obtained by training the primary classification model with the third sample data. It differs from the primary classification model in that its training set is the third sample data, which carries category marks and whose information entropy and correlation values satisfy the specified conditions.
The preset model training method means the server uses the third sample data as training data and trains the primary classification model with any of several frameworks or algorithms. For example, the server can use existing machine learning frameworks or tools such as Scikit-Learn or TensorFlow.
Scikit-Learn, abbreviated sklearn, is an open-source, Python-based machine learning library. It has built-in classification algorithms such as naive Bayes, decision trees, and random forests, and supports common machine learning tasks such as data preprocessing, classification, regression, dimensionality reduction, and model selection. TensorFlow is an open-source software library for numerical computation, originally developed by researchers and engineers of the Google Brain team (part of Google's machine intelligence research organization); it is used for machine learning and deep neural network research, but the generality of the system makes it applicable to many other computing fields as well.
Specifically, taking sklearn as an example, the server feeds in the third sample data as input and calls sklearn's built-in training methods until the model converges, yielding the intermediate classification model.
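One possible realization of this step with sklearn is an incremental update of a naive Bayes classifier on the newly labeled samples; the HashingVectorizer/partial_fit combination is an assumption about how the built-in training method could be invoked, not wording from this application:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# alternate_sign=False keeps features non-negative, as naive Bayes requires.
vectorizer = HashingVectorizer(alternate_sign=False)
model = MultinomialNB()


def train_intermediate(model, third_samples, all_classes):
    """third_samples: list of (text, label); all_classes: every known mark."""
    texts, labels = zip(*third_samples)
    # partial_fit updates the existing (primary) model instead of refitting.
    model.partial_fit(vectorizer.transform(texts), labels, classes=all_classes)
    return model
```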
S8: Train the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain the text classification model.
The text classification model is the final classification model obtained by retraining the intermediate classification model.
The preset model training method the server adopts here is the same as the training process of step S7 and is not repeated. The difference from step S7 is that the first sample data and the third sample data are used together to train the intermediate classification model; that is, the intermediate classification model is iteratively trained on all category-marked sample data to improve its classification accuracy.
Specifically, taking sklearn as an example, the server feeds in the first sample data and the third sample data as input and calls sklearn's built-in training methods until the model converges, yielding the text classification model.
In this embodiment, first sample data with category marks is obtained from the preset sample library and a primary classification model is built from it; that is, only a small portion of category-marked sample data is used for the initial training, which reduces the demand for labeled samples and saves training cost. Second sample data without category marks is then obtained from the preset sample library; the information entropy value and correlation value of each piece of second sample data are calculated, and the second sample data whose information entropy and correlation values satisfy the preset conditions are labeled by category. The primary classification model is then trained with the labeled third sample data according to the preset model training method to obtain the intermediate classification model; because the third sample data has high information entropy, low mutual correlation, and category marks, it improves the classification accuracy of the primary classification model. Finally, the intermediate classification model is trained with the first sample data and the third sample data to obtain the text classification model; that is, the final text classification model is optimized through stage-by-stage iteration. The result is a method for training a text classification model with only a small amount of category-marked sample data, so that a well-performing classification model can be obtained from fewer training samples, saving labor cost and speeding up training.
Further, in an embodiment, as shown in Figure 3, step S1, namely obtaining first sample data with category marks from the preset sample library and building a primary classification model from it, specifically includes the following steps:
S11: Select first sample data with category marks from the preset sample library according to a preset sample selection method.
The preset sample selection method selects a certain number of representative, category-marked first samples from the preset sample library. The number should be as small as possible to reduce the demand for sample data, while the selected first samples should cover the text data categories as completely as possible. For example, when selecting news text data, try to cover categories such as "politics", "business", "sports", and "culture and entertainment".
Specifically, if the preset sample library holds 100,000 articles of which 3000 carry category marks, the server may select 30% of those 3000 articles, i.e. 900 articles, and from those 900 select 5 articles representing each text data category as the first sample data.
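A sketch of this example selection; the data layout (text, category) is assumed for illustration:

```python
import random
from collections import defaultdict


def select_first_samples(labeled_articles, fraction=0.3, per_category=5):
    """labeled_articles: list of (text, category) pairs with category marks."""
    pool = random.sample(labeled_articles,
                         int(len(labeled_articles) * fraction))
    by_category = defaultdict(list)
    for text, category in pool:
        by_category[category].append((text, category))
    # Keep a handful of articles per category so coverage stays broad.
    selected = []
    for articles in by_category.values():
        selected.extend(articles[:per_category])
    return selected
```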
S12: Build the primary classification model by combining the category-marked first sample data with a preset training algorithm.
The preset training algorithm can be any of the algorithms used to train models in machine learning. Building the primary classification model from category-marked first sample data is a supervised learning process: supervised learning trains an optimal model from existing training samples, i.e. known data and its corresponding outputs. The model belongs to some set of functions, and "optimal" means best under some evaluation criterion.
Specifically, taking the naive Bayes classification algorithm as an example, the server can import the naive Bayes functions from the sklearn library and then call MultinomialNB().fit() for training.
When training is complete, the server can use the Joblib library to save the trained model. Joblib is part of the SciPy ecosystem and provides tools for pipelining Python jobs. Alternatively, the server can call functions of the pickle library to save the primary classification model.
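Putting the two calls together, a minimal sketch of training and persisting the primary classification model might read as follows; the CountVectorizer choice is an illustrative assumption:

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def build_and_save(first_samples, path="primary_model.joblib"):
    """first_samples: list of (text, label) pairs with category marks."""
    texts, labels = zip(*first_samples)
    vectorizer = CountVectorizer()
    model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
    joblib.dump((vectorizer, model), path)  # persist, as with the Joblib option
    return vectorizer, model
```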
In this embodiment, the server selects, according to the preset sample selection method, first sample data that is as small in quantity as possible while covering the sample data types as broadly as possible, and then builds the primary classification model with the preset training algorithm. This keeps the demand for sample data to a minimum, further reducing training cost, while the broad coverage of the first sample data widens the range of inputs the primary classification model can recognize.
Further, in an embodiment, step S3, namely calculating the information entropy of each piece of second sample data to obtain its information entropy value, specifically includes the following step:
Calculate the information entropy of each piece of second sample data according to the following formula:

H = -\sum_{x} p(x) \log p(x)

where H is the information entropy value of the second sample data, x ranges over the phrases in the second sample data, and p(x) is the frequency of occurrence of the phrase.
The phrases in the second sample data are the words obtained after the server applies word segmentation to the second sample data. The frequency of occurrence of a phrase is how often the phrase appears in the second sample data.
Specifically, the server first applies word segmentation to each piece of second sample data to obtain a word set, and then substitutes the frequencies of all the words in the set into the formula to obtain the information entropy value of that second sample data.
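A direct implementation of the formula, assuming jieba as the tokenizer; p(x) is taken here as the relative frequency of each word, and base-2 logarithms are a conventional choice the formula itself leaves open:

```python
import math
from collections import Counter

import jieba  # assumed tokenizer


def information_entropy(text):
    """H = -sum over words x of p(x) * log2 p(x)."""
    words = [w for w in jieba.cut(text) if w.strip()]
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```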
In this embodiment, the server calculates the information entropy of the second sample data from the Shannon formula and the word frequencies of the phrases in the second sample data, making the quantification of the information content of the sample data more accurate.
Further, in an embodiment, as shown in Figure 4, step S4, namely calculating the correlation value of each piece of second sample data according to the number of identical phrases the second sample data share, specifically includes the following steps:
S41: Apply word segmentation to each piece of second sample data to obtain N word sets, where N is the number of second samples.
Specifically, the server can perform word segmentation in several ways. For example, regular expressions can be used to split the second sample data into a set of individual words, i.e. a word set. Understandably, the second samples and the word sets correspond one to one.
A regular expression (Regular Expression) is a processing method for retrieving or replacing target text in context.
Specifically, the server can use the regular expression engine built into Perl or Python to split the second sample data; alternatively, it can split the second sample data with the grep tool shipped with Unix systems to obtain a set of words. grep (Globally search a Regular Expression and Print) is a powerful text search tool.
S42: For each piece of second sample data, compute the intersections between its word set and the word sets of the other N-1 second samples, and from the number of phrases in each intersection determine the local correlation values between this second sample and the other N-1 second samples, yielding the N-1 local correlation values of this second sample.
To compute the intersection of word sets, the different word sets are compared; the intersection consists of the phrases they share.
A local correlation value represents the degree of correlation between one second sample and another.
For example, if word set a is {"people", "interest", "bank", "lending"} and word set b is {"bank", "lending", "income"}, their intersection is {"bank", "lending"}, which contains 2 phrases, so the local correlation value of a and b is 2. Likewise, if word set c is {"meeting", "report", "income"}, the local correlation value of a and c is 0, and that of b and c is 1.
S43: Compute the average of the N-1 local correlation values of each second sample and take that average as the sample's correlation value.
Continuing with the word sets a, b, and c from step S42: the correlation value of the second sample corresponding to word set a is the average of the local correlation values of a with b and of a with c, i.e. (2+0)/2 = 1. Likewise, the correlation values of the second samples corresponding to word sets b and c are 1.5 and 0.5, respectively.
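Steps S41 to S43 can be sketched as one function; jieba is an assumed tokenizer:

```python
import jieba  # assumed tokenizer


def correlation_values(samples):
    """samples: list of N texts; returns one correlation value per sample."""
    word_sets = [{w for w in jieba.cut(t) if w.strip()} for t in samples]
    n = len(word_sets)
    values = []
    for i in range(n):
        # Local correlation with each of the other N-1 samples, then average.
        local = [len(word_sets[i] & word_sets[j]) for j in range(n) if j != i]
        values.append(sum(local) / (n - 1))
    return values
```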
In this embodiment, the server applies word segmentation to the second sample data, determines the local correlation values between second samples from the intersections of their word sets, and averages the local correlation values to obtain the correlation value of each second sample, so that the correlation value reflects the degree of association between the second samples more accurately.
Further, in an embodiment, as shown in Figure 5, step S5, namely selecting the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold as the data to be labeled, specifically includes the following steps:
S51: Select the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold as candidate sample data.
The server screens the second sample data meeting these specific conditions a second time, both to reduce the number of training samples and to find the sample data that ordinary classifiers have difficulty recognizing. The specific conditions are that the information entropy value exceeds the preset information entropy threshold and the correlation value is below the preset correlation threshold.
S52: Classify the candidate sample data with at least two preset sample classifiers to obtain classification results.
The preset sample classifiers are text classification models, such as the common FastText and Text-CNN models.
FastText is an open-source word vector and text classification tool from Facebook whose typical application scenario is supervised text classification; it offers a simple, efficient approach to text classification and representation learning, with performance comparable to deep learning but faster. TextCNN is an algorithm that classifies text with a convolutional neural network; thanks to its simple structure and good results, it is widely used in text classification.
Different preset sample classifiers may classify the same sample data differently. That is, when the same sample data is classified by different models such as FastText and Text-CNN, it may be assigned different categories.
The classification results include the category assigned to each piece of candidate sample data.
S53: From the classification results, select the candidate sample data assigned to different categories at the same time as the data to be labeled.
Candidate sample data that belongs to different categories at the same time is sample data for which different preset classifiers return different results. For example, an article may be recognized as "history" by FastText but as "literature and art" by Text-CNN, which indicates the article is hard to recognize, or hard to assign simply to one category.
Specifically, the server determines, from the categories assigned to each piece of candidate sample data in the classification results, whether it belongs to different categories at the same time.
In this embodiment, the server uses different preset classifiers to screen the second sample data that meets the specific conditions and picks out the hard-to-recognize second samples as the data to be labeled. This removes the sample data that is simple and easy to recognize, further reducing the number of training samples and the training time and improving training efficiency; at the same time, selecting the sample data that is not easily recognized as the data to be labeled means that, once these samples are labeled by category, they help improve the training accuracy of the model.
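A sketch of this disagreement filter; the classifiers are assumed to expose a predict(text) -> label interface:

```python
def select_hard_samples(candidates, classifiers):
    """Keep candidates on which the preset classifiers disagree."""
    to_label = []
    for text in candidates:
        predictions = {clf.predict(text) for clf in classifiers}
        if len(predictions) > 1:  # assigned to different categories at once
            to_label.append(text)
    return to_label
```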
It should be understood that the step numbers in the above embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic and does not constitute any limitation on the implementation of the embodiments of this application.
In an embodiment, a text classification model training device is provided, corresponding one to one to the text classification model training method of the above embodiments. As shown in Figure 6, the device includes a primary model building module 61, a sample data acquisition module 62, an information entropy calculation module 63, a correlation calculation module 64, a to-be-labeled data selection module 65, a labeling module 66, a first model training module 67, and a second model training module 68. The functional modules are described in detail as follows:
The primary model building module 61 is configured to obtain first sample data with category marks from a preset sample library and build a primary classification model from the first sample data.
The sample data acquisition module 62 is configured to obtain second sample data without category marks from the preset sample library.
The information entropy calculation module 63 is configured to calculate the information entropy of each piece of second sample data to obtain its information entropy value.
The correlation calculation module 64 is configured to calculate the correlation value of each piece of second sample data according to the number of identical phrases the second sample data share.
The to-be-labeled data selection module 65 is configured to select the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below a preset correlation threshold as the data to be labeled.
The labeling module 66 is configured to label the data to be labeled by category according to a preset category labeling method to obtain third sample data.
The first model training module 67 is configured to train the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model.
The second model training module 68 is configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
Further, the primary model building module 61 includes:
a selection submodule 611, configured to select the first sample data with category marks from the preset sample library according to a preset sample selection method; and
a training submodule 612, configured to build the primary classification model by combining the category-marked first sample data with a preset training algorithm.
Further, the information entropy calculation module 63 includes:
an information entropy calculation submodule 631, configured to calculate the information entropy of each piece of second sample data according to the following formula:

H = -\sum_{x} p(x) \log p(x)

where H is the information entropy value of the second sample data, x ranges over the phrases in the second sample data, and p(x) is the frequency of occurrence of the phrase.
Further, the correlation calculation module 64 includes:
a word segmentation submodule 641, configured to apply word segmentation to each piece of second sample data to obtain N word sets, where N is the number of second samples;
a local correlation calculation submodule 642, configured to, for each piece of second sample data, compute the intersections between its word set and the word sets of the other N-1 second samples and, from the number of phrases in each intersection, determine the local correlation values between this second sample and the other N-1 second samples, yielding the N-1 local correlation values of this second sample; and
an average calculation submodule 643, configured to compute the average of the N-1 local correlation values of each second sample and take that average as the sample's correlation value.
Further, the to-be-labeled data selection module 65 includes:
a candidate sample selection submodule 651, configured to select the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold as candidate sample data;
a classification submodule 652, configured to classify the candidate sample data with at least two preset sample classifiers to obtain classification results; and
a labeling submodule 653, configured to select, from the classification results, the candidate sample data assigned to different categories at the same time as the data to be labeled.
For specific limitations on the text classification model training device, refer to the limitations on the text classification model training method above, which are not repeated here. Each module in the device can be implemented in whole or in part by software, hardware, or a combination of the two. The modules can be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In an embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Figure 7. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, computer-readable instructions, and a database, while the internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface communicates with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a text classification model training method. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media.
In an embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor. When the processor executes the computer-readable instructions, it implements the steps of the text classification model training method of the above embodiments, such as steps S1 to S8 shown in Figure 2; alternatively, it implements the functions of the modules/units of the text classification model training device of the above embodiments, such as the functions of modules 61 to 68 shown in Figure 6. To avoid repetition, these are not described again here.
In an embodiment, one or more readable storage media storing computer-readable instructions are provided; the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. Computer-readable instructions are stored on the readable storage media, and when executed by a processor, they implement the text classification model training method of the above method embodiments, or, when executed by one or more processors, they implement the functions of the modules/units of the text classification model training device of the above device embodiment. To avoid repetition, these are not described again here.
A person of ordinary skill in the art can understand that all or part of the processes of the above embodiment methods can be accomplished by instructing the relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium, and when executed may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example; in practical applications, the above functions can be allocated to different functional units and modules as needed, i.e. the internal structure of the device can be divided into different functional units or modules to accomplish all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and they should all be included within the scope of protection of this application.

Claims (20)

1. A text classification model training method, characterized in that the text classification model training method comprises:
obtaining first sample data with category marks from a preset sample library, and building a primary classification model based on the first sample data;
obtaining second sample data without the category marks from the preset sample library;
calculating the information entropy of each piece of the second sample data to obtain the information entropy value of each piece of the second sample data;
calculating the correlation value of each piece of the second sample data according to the number of identical phrases contained in the second sample data;
selecting the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below a preset correlation threshold as data to be labeled;
labeling the data to be labeled by category according to a preset category labeling method to obtain third sample data;
training the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model; and
training the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
2. The text classification model training method according to claim 1, characterized in that obtaining first sample data with category marks from a preset sample library and building a primary classification model based on the first sample data comprises:
selecting the first sample data with the category marks from the preset sample library according to a preset sample selection method; and
building the primary classification model by combining the first sample data with the category marks and a preset training algorithm.
3. The text classification model training method according to claim 1, characterized in that calculating the information entropy of each piece of the second sample data to obtain the information entropy value of each piece of the second sample data comprises:
calculating the information entropy of each piece of the second sample data according to the following formula:

H = -\sum_{x} p(x) \log p(x)

where H is the information entropy value of the second sample data, x ranges over the phrases in the second sample data, and p(x) is the frequency of occurrence of the phrase.
4. The text classification model training method according to claim 1, characterized in that calculating the correlation value of each piece of the second sample data according to the number of identical phrases contained in the second sample data comprises:
applying word segmentation to each piece of the second sample data to obtain N word sets, where N is the number of pieces of the second sample data;
for each piece of the second sample data, computing the intersections between its word set and the word sets of the other N-1 pieces of second sample data and, according to the number of phrases contained in each intersection, determining the local correlation values between this piece of second sample data and the other N-1 pieces, obtaining the N-1 local correlation values corresponding to this piece of second sample data; and
computing the average of the N-1 local correlation values corresponding to each piece of the second sample data and taking the average as the correlation value of each piece of the second sample data.
5. The text classification model training method according to claim 1, characterized in that selecting the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below the preset correlation threshold as data to be labeled comprises:
selecting the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold as candidate sample data;
classifying the candidate sample data with at least two preset sample classifiers to obtain classification results; and
selecting, from the classification results, the candidate sample data belonging to different categories at the same time as the data to be labeled.
6. A text classification model training device, characterized in that the text classification model training device comprises:
a primary model building module, configured to obtain first sample data with category marks from a preset sample library and build a primary classification model based on the first sample data;
a sample data acquisition module, configured to obtain second sample data without the category marks from the preset sample library;
an information entropy calculation module, configured to calculate the information entropy of each piece of the second sample data to obtain the information entropy value of each piece of the second sample data;
a correlation calculation module, configured to calculate the correlation value of each piece of the second sample data according to the number of identical phrases contained in the second sample data;
a to-be-labeled data selection module, configured to select the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below the preset correlation threshold as data to be labeled;
a labeling module, configured to label the data to be labeled by category according to a preset category labeling method to obtain third sample data;
a first model training module, configured to train the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model; and
a second model training module, configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
7. The text classification model training device according to claim 6, characterized in that the primary model building module comprises:
a selection submodule, configured to select the first sample data with the category marks from the preset sample library according to a preset sample selection method; and
a training submodule, configured to build the primary classification model by combining the first sample data with the category marks and a preset training algorithm.
8. The text classification model training device according to claim 6, characterized in that the information entropy calculation module comprises:
an information entropy calculation submodule, configured to calculate the information entropy of each piece of the second sample data according to the following formula:

H = -\sum_{x} p(x) \log p(x)

where H is the information entropy value of the second sample data, x ranges over the phrases in the second sample data, and p(x) is the frequency of occurrence of the phrase.
9. The text classification model training device according to claim 6, characterized in that the correlation calculation module comprises:
a word segmentation submodule, configured to apply word segmentation to each piece of the second sample data to obtain N word sets, where N is the number of pieces of the second sample data;
a local correlation calculation submodule, configured to, for each piece of the second sample data, compute the intersections between its word set and the word sets of the other N-1 pieces of second sample data and, according to the number of phrases contained in each intersection, determine the local correlation values between this piece of second sample data and the other N-1 pieces, obtaining the N-1 local correlation values corresponding to this piece of second sample data; and
an average calculation submodule, configured to compute the average of the N-1 local correlation values corresponding to each piece of the second sample data and take the average as the correlation value of each piece of the second sample data.
10. The text classification model training device according to claim 6, characterized in that the to-be-labeled data selection module comprises:
a candidate sample selection submodule, configured to select the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold as candidate sample data;
a classification submodule, configured to classify the candidate sample data with at least two preset sample classifiers to obtain classification results; and
a labeling submodule, configured to select, from the classification results, the candidate sample data belonging to different categories at the same time as the data to be labeled.
11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
obtaining first sample data with category marks from a preset sample library, and building a primary classification model based on the first sample data;
obtaining second sample data without the category marks from the preset sample library;
calculating the information entropy of each piece of the second sample data to obtain the information entropy value of each piece of the second sample data;
calculating the correlation value of each piece of the second sample data according to the number of identical phrases contained in the second sample data;
selecting the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below the preset correlation threshold as data to be labeled;
labeling the data to be labeled by category according to a preset category labeling method to obtain third sample data;
training the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model; and
training the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
12. The computer device according to claim 11, characterized in that obtaining first sample data with category marks from a preset sample library and building a primary classification model based on the first sample data comprises:
selecting the first sample data with the category marks from the preset sample library according to a preset sample selection method; and
building the primary classification model by combining the first sample data with the category marks and a preset training algorithm.
  13. 如权利要求11所述的计算机设备,其特征在于,所述计算每个所述第二样本数据的信息熵,得到每个所述第二样本数据的信息熵值,包括:The computer device according to claim 11, wherein said calculating the information entropy of each of said second sample data to obtain the information entropy value of each of said second sample data comprises:
    根据如下公式计算每个所述第二样本数据的信息熵:Calculate the information entropy of each of the second sample data according to the following formula:
    H = -Σ_x p(x) log p(x)
    where H denotes the information entropy value of the second sample data, x denotes a phrase in the second sample data, and p(x) denotes the frequency with which the phrase x occurs.
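A minimal sketch of this entropy computation follows, assuming whitespace tokenization in place of Chinese word segmentation and a base-2 logarithm (the claim does not fix the log base):

    import math
    from collections import Counter

    def information_entropy(text):
        words = text.split()  # stand-in for Chinese word segmentation
        total = len(words)
        counts = Counter(words)
        # p(x): frequency of phrase x within this one second sample datum
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(information_entropy("refund refund refund order"))   # low: repetitive
    print(information_entropy("please refund my last order"))  # higher: varied

A sample that repeats a few phrases yields a low entropy value, while a sample with many distinct phrases yields a high one, which is why the method treats high-entropy samples as the informative ones worth labeling.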
  14. The computer device according to claim 11, wherein calculating the correlation value of each second sample data according to the number of identical phrases shared among the second sample data comprises:
    performing word segmentation on each second sample data to obtain N word-segmentation sets, where N is the number of second sample data;
    for each second sample data, computing the intersection between its word-segmentation set and the word-segmentation set of each of the other N-1 second sample data, and determining, from the number of phrases in each intersection, the local correlation value between that second sample data and each of the other N-1 second sample data, thereby obtaining the N-1 local correlation values corresponding to that second sample data; and
    calculating the average of the N-1 local correlation values corresponding to each second sample data, and using the average as the correlation value of that second sample data.
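The correlation computation of claim 14 can be sketched as follows. Normalizing each intersection by the size of the union (Jaccard overlap) is an assumption; the claim only states that the local value is determined from the number of phrases in the intersection:

    samples = ["refund my order", "refund my payment", "reset my password"]

    def correlation_values(texts):
        token_sets = [set(t.split()) for t in texts]   # the N word-segmentation sets
        n = len(token_sets)
        result = []
        for i in range(n):
            local_values = []
            for j in range(n):
                if i == j:
                    continue
                shared = token_sets[i] & token_sets[j]            # intersection
                union = token_sets[i] | token_sets[j]
                local_values.append(len(shared) / len(union))     # one local value
            result.append(sum(local_values) / len(local_values))  # average of the N-1
        return result

    print(correlation_values(samples))

Samples with a low average overlap against the rest of the pool carry less redundant content, which is why the method selects low-correlation samples for labeling.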
  15. The computer device according to claim 11, wherein selecting, as the data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold comprises:
    selecting, as candidate sample data, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold;
    classifying the candidate sample data using at least two preset sample classifiers to obtain classification results; and
    selecting, from the classification results, the candidate sample data simultaneously assigned to different categories as the data to be labeled.
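A minimal sketch of this selection step follows. The threshold values, the entropy and correlation inputs, and the two-member committee of scikit-learn classifiers are all assumptions standing in for the claim's "preset" thresholds and "preset sample classifiers":

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB

    labeled_texts = ["refund my order", "reset my password"]
    labels = ["billing", "account"]
    unlabeled = ["charge appeared twice", "cannot log in"]
    entropy = {"charge appeared twice": 1.6, "cannot log in": 1.5}       # from claim 13
    correlation = {"charge appeared twice": 0.1, "cannot log in": 0.2}   # from claim 14
    ENTROPY_THRESHOLD, CORRELATION_THRESHOLD = 1.0, 0.5                  # assumed presets

    # Keep only high-entropy, low-correlation samples as candidates.
    candidates = [t for t in unlabeled
                  if entropy[t] > ENTROPY_THRESHOLD
                  and correlation[t] < CORRELATION_THRESHOLD]

    vec = CountVectorizer().fit(labeled_texts + unlabeled)
    X, Xc = vec.transform(labeled_texts), vec.transform(candidates)
    committee = [LogisticRegression(max_iter=1000).fit(X, labels),
                 MultinomialNB().fit(X, labels)]

    # Retain the candidates the two classifiers assign to different categories.
    preds = [clf.predict(Xc) for clf in committee]
    to_label = [t for t, a, b in zip(candidates, *preds) if a != b]
    print(to_label)

Candidates on which the classifiers disagree are the ones retained as data to be labeled, a query-by-committee style criterion for picking the most ambiguous samples.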
  16. One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring first sample data carrying category labels from a preset sample library, and building a primary classification model from the first sample data;
    acquiring second sample data without the category labels from the preset sample library;
    calculating the information entropy of each second sample data to obtain an information entropy value for each second sample data;
    calculating a correlation value for each second sample data according to the number of identical phrases shared among the second sample data;
    selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below a preset correlation threshold;
    labeling the data to be labeled by category according to a preset category labeling method to obtain third sample data;
    training the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model; and
    training the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
  17. The readable storage medium according to claim 16, wherein acquiring the first sample data carrying category labels from the preset sample library and building the primary classification model from the first sample data comprises:
    selecting the first sample data carrying category labels from the preset sample library according to a preset sample selection method; and
    building the primary classification model by combining the first sample data carrying category labels with a preset training algorithm.
  18. The readable storage medium according to claim 16, wherein calculating the information entropy of each second sample data to obtain the information entropy value of each second sample data comprises:
    calculating the information entropy of each second sample data according to the following formula:
    H = -Σ_x p(x) log p(x)
    where H denotes the information entropy value of the second sample data, x denotes a phrase in the second sample data, and p(x) denotes the frequency with which the phrase x occurs.
  19. The readable storage medium according to claim 16, wherein calculating the correlation value of each second sample data according to the number of identical phrases shared among the second sample data comprises:
    performing word segmentation on each second sample data to obtain N word-segmentation sets, where N is the number of second sample data;
    for each second sample data, computing the intersection between its word-segmentation set and the word-segmentation set of each of the other N-1 second sample data, and determining, from the number of phrases in each intersection, the local correlation value between that second sample data and each of the other N-1 second sample data, thereby obtaining the N-1 local correlation values corresponding to that second sample data; and
    calculating the average of the N-1 local correlation values corresponding to each second sample data, and using the average as the correlation value of that second sample data.
  20. The readable storage medium according to claim 16, wherein selecting, as the data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold comprises:
    selecting, as candidate sample data, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold;
    classifying the candidate sample data using at least two preset sample classifiers to obtain classification results; and
    selecting, from the classification results, the candidate sample data simultaneously assigned to different categories as the data to be labeled.
PCT/CN2019/117095 2019-03-29 2019-11-11 Text categorization model training method, apparatus, computer device, and storage medium WO2020199591A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910247846.8A CN110110080A (en) 2019-03-29 2019-03-29 Textual classification model training method, device, computer equipment and storage medium
CN201910247846.8 2019-03-29

Publications (1)

Publication Number Publication Date
WO2020199591A1

Family

ID=67484695

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117095 WO2020199591A1 (en) 2019-03-29 2019-11-11 Text categorization model training method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110110080A (en)
WO (1) WO2020199591A1 (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110080A (en) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Textual classification model training method, device, computer equipment and storage medium
CN112711940B (en) * 2019-10-08 2024-06-11 台达电子工业股份有限公司 Information processing system, information processing method and non-transitory computer readable recording medium
CN111026851B (en) * 2019-10-18 2023-09-15 平安科技(深圳)有限公司 Model prediction capability optimization method, device, equipment and readable storage medium
CN111159396B (en) * 2019-12-04 2022-04-22 中国电子科技集团公司第三十研究所 Method for establishing text data classification hierarchical model facing data sharing exchange
CN111081221B (en) * 2019-12-23 2022-10-14 合肥讯飞数码科技有限公司 Training data selection method and device, electronic equipment and computer storage medium
CN111143568A (en) * 2019-12-31 2020-05-12 郑州工程技术学院 Method, device and equipment for buffering during paper classification and storage medium
CN111382268B (en) * 2020-02-25 2023-12-01 北京小米松果电子有限公司 Text training data processing method, device and storage medium
CN111368515B (en) * 2020-03-02 2021-01-26 中国农业科学院农业信息研究所 Industry dynamic interactive report generation method and system based on PDF document fragmentation
CN111767400B (en) * 2020-06-30 2024-04-26 平安国际智慧城市科技股份有限公司 Training method and device for text classification model, computer equipment and storage medium
CN111914061B (en) * 2020-07-13 2021-04-16 上海乐言科技股份有限公司 Radius-based uncertainty sampling method and system for text classification active learning
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment
CN111881983B (en) * 2020-07-30 2024-05-28 平安科技(深圳)有限公司 Data processing method and device based on classification model, electronic equipment and medium
CN111881295A (en) * 2020-07-31 2020-11-03 中国光大银行股份有限公司 Text classification model training method and device and text labeling method and device
CN112069293B (en) * 2020-09-14 2024-04-19 上海明略人工智能(集团)有限公司 Data labeling method, device, electronic equipment and computer readable medium
CN112434736A (en) * 2020-11-24 2021-03-02 成都潜在人工智能科技有限公司 Deep active learning text classification method based on pre-training model
CN112651211A (en) * 2020-12-11 2021-04-13 北京大米科技有限公司 Label information determination method, device, server and storage medium
CN113239128B (en) * 2021-06-01 2022-03-18 平安科技(深圳)有限公司 Data pair classification method, device, equipment and storage medium based on implicit characteristics
CN113590822B (en) * 2021-07-28 2023-08-08 北京百度网讯科技有限公司 Method, device, equipment, storage medium and program product for processing document title
CN113761034B (en) * 2021-09-15 2022-06-17 深圳信息职业技术学院 Data processing method and device
CN117520836A (en) * 2022-07-29 2024-02-06 上海智臻智能网络科技股份有限公司 Training sample generation method, device, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7574409B2 (en) * 2004-11-04 2009-08-11 Vericept Corporation Method, apparatus, and system for clustering and classification
US9292797B2 (en) * 2012-12-14 2016-03-22 International Business Machines Corporation Semi-supervised data integration model for named entity classification
CN106131613B (en) * 2016-07-26 2019-10-01 深圳Tcl新技术有限公司 Smart television video sharing method and video sharing system
CN107025218B (en) * 2017-04-07 2021-03-02 腾讯科技(深圳)有限公司 Text duplicate removal method and device
CN108304427B (en) * 2017-04-28 2020-03-17 腾讯科技(深圳)有限公司 User passenger group classification method and device
CN107506793B (en) * 2017-08-21 2020-12-18 中国科学院重庆绿色智能技术研究院 Garment identification method and system based on weakly labeled image
CN108665158A (en) * 2018-05-08 2018-10-16 阿里巴巴集团控股有限公司 A kind of method, apparatus and equipment of trained air control model
CN109101997B (en) * 2018-07-11 2020-07-28 浙江理工大学 Traceability method for sampling limited active learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063642A (en) * 2010-12-30 2011-05-18 上海电机学院 Selection method for fuzzy neural network sample on basis of active learning
US20150379072A1 (en) * 2014-06-30 2015-12-31 Amazon Technologies, Inc. Input processing for machine learning
CN104166706A (en) * 2014-08-08 2014-11-26 苏州大学 Multi-label classifier constructing method based on cost-sensitive active learning
CN108090231A (en) * 2018-01-12 2018-05-29 北京理工大学 A kind of topic model optimization method based on comentropy
CN110110080A (en) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Textual classification model training method, device, computer equipment and storage medium

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348203A (en) * 2020-11-05 2021-02-09 中国平安人寿保险股份有限公司 Model training method and device, terminal device and storage medium
CN112528022A (en) * 2020-12-09 2021-03-19 广州摩翼信息科技有限公司 Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
CN112632219A (en) * 2020-12-17 2021-04-09 中国联合网络通信集团有限公司 Method and device for intercepting junk short messages
CN112632219B (en) * 2020-12-17 2022-10-04 中国联合网络通信集团有限公司 Method and device for intercepting junk short messages
CN112651447A (en) * 2020-12-29 2021-04-13 广东电网有限责任公司电力调度控制中心 Resource classification labeling method and system based on ontology
CN112651447B (en) * 2020-12-29 2023-09-26 广东电网有限责任公司电力调度控制中心 Ontology-based resource classification labeling method and system
CN112541595A (en) * 2020-12-30 2021-03-23 中国建设银行股份有限公司 Model construction method and device, storage medium and electronic equipment
CN112446441A (en) * 2021-02-01 2021-03-05 北京世纪好未来教育科技有限公司 Model training data screening method, device, equipment and storage medium
CN113793191A (en) * 2021-02-09 2021-12-14 京东科技控股股份有限公司 Commodity matching method and device and electronic equipment
CN113793191B (en) * 2021-02-09 2024-05-24 京东科技控股股份有限公司 Commodity matching method and device and electronic equipment
CN113190154A (en) * 2021-04-29 2021-07-30 北京百度网讯科技有限公司 Model training method, entry classification method, device, apparatus, storage medium, and program
CN113190154B (en) * 2021-04-29 2023-10-13 北京百度网讯科技有限公司 Model training and entry classification methods, apparatuses, devices, storage medium and program
CN113343695A (en) * 2021-05-27 2021-09-03 镁佳(北京)科技有限公司 Text labeling noise detection method and device, storage medium and electronic equipment
WO2023151488A1 (en) * 2022-02-11 2023-08-17 阿里巴巴(中国)有限公司 Model training method, training device, electronic device and computer-readable medium
CN114648980A (en) * 2022-03-03 2022-06-21 科大讯飞股份有限公司 Data classification and voice recognition method and device, electronic equipment and storage medium
CN115994225B (en) * 2023-03-20 2023-06-27 北京百分点科技集团股份有限公司 Text classification method and device, storage medium and electronic equipment
CN115994225A (en) * 2023-03-20 2023-04-21 北京百分点科技集团股份有限公司 Text classification method and device, storage medium and electronic equipment
CN116304058B (en) * 2023-04-27 2023-08-08 云账户技术(天津)有限公司 Method and device for identifying negative information of enterprise, electronic equipment and storage medium
CN116304058A (en) * 2023-04-27 2023-06-23 云账户技术(天津)有限公司 Method and device for identifying negative information of enterprise, electronic equipment and storage medium
CN117973522A (en) * 2024-04-02 2024-05-03 成都派沃特科技股份有限公司 Knowledge data training technology-based application model construction method and system
CN117973522B (en) * 2024-04-02 2024-06-04 成都派沃特科技股份有限公司 Knowledge data training technology-based application model construction method and system

Also Published As

Publication number Publication date
CN110110080A (en) 2019-08-09

Similar Documents

Publication Publication Date Title
WO2020199591A1 (en) Text categorization model training method, apparatus, computer device, and storage medium
CN109871446B (en) Refusing method in intention recognition, electronic device and storage medium
WO2020177230A1 (en) Medical data classification method and apparatus based on machine learning, and computer device and storage medium
CN104699763B (en) The text similarity gauging system of multiple features fusion
CN108536800B (en) Text classification method, system, computer device and storage medium
CN109063217B (en) Work order classification method and device in electric power marketing system and related equipment thereof
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN110633366B (en) Short text classification method, device and storage medium
CN108710894B (en) Active learning labeling method and device based on clustering representative points
CN111209738A (en) Multi-task named entity recognition method combining text classification
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
US20190188277A1 (en) Method and device for processing an electronic document
CN107844533A (en) A kind of intelligent Answer System and analysis method
WO2014085776A2 (en) Web search ranking
CN112270188B (en) Questioning type analysis path recommendation method, system and storage medium
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
CN111191031A (en) Entity relation classification method of unstructured text based on WordNet and IDF
CN110377618B (en) Method, device, computer equipment and storage medium for analyzing decision result
CN110377690B (en) Information acquisition method and system based on remote relationship extraction
US20220414099A1 (en) Using query logs to optimize execution of parametric queries
CN117474010A (en) Power grid language model-oriented power transmission and transformation equipment defect corpus construction method
Marconi et al. Hyperbolic manifold regression
Ronghui et al. Application of Improved Convolutional Neural Network in Text Classification.
CN113342964B (en) Recommendation type determination method and system based on mobile service
CN113987170A (en) Multi-label text classification method based on convolutional neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19923410

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19923410

Country of ref document: EP

Kind code of ref document: A1