WO2020199591A1 - Text categorization model training method, apparatus, computer device, and storage medium - Google Patents

Text categorization model training method, apparatus, computer device, and storage medium

Info

Publication number
WO2020199591A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample data
preset
sample
information entropy
value
Prior art date
Application number
PCT/CN2019/117095
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020199591A1 publication Critical patent/WO2020199591A1/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/192 - Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 - References adjustable by an adaptive method, e.g. learning

Definitions

  • This application relates to the field of information processing, and in particular to a text classification model training method, apparatus, computer device, and storage medium.
  • Text classification is an important application direction in natural language processing research. It refers to using a classifier to categorize documents containing text, determining the category to which each document belongs so that users can easily retrieve the documents they need.
  • The classifier, also called a classification model, is obtained by training classification criteria or model parameters on a large amount of sample data with category labels.
  • The embodiments of the present application provide a text classification model training method, apparatus, computer device, and storage medium to address the problems of a large training-sample scale and long training time during text classification model training.
  • a text classification model training method including:
  • the intermediate classification model is trained using the first sample data and the third sample data to obtain a text classification model.
  • a text classification model training device including:
  • a primary model establishment module, configured to obtain first sample data with category marks from a preset sample library and establish a primary classification model according to the first sample data;
  • a sample data acquisition module, configured to obtain second sample data without the category marks from the preset sample library;
  • an information entropy calculation module, configured to calculate the information entropy of each of the second sample data to obtain an information entropy value of each of the second sample data;
  • a correlation calculation module, configured to calculate a correlation value of each of the second sample data according to the number of identical phrases contained in the second sample data;
  • a to-be-labeled data selection module, configured to select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than a preset correlation threshold;
  • a labeling module, configured to perform category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;
  • a first model training module, configured to train the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;
  • a second model training module, configured to train the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
  • the intermediate classification model is trained using the first sample data and the third sample data to obtain a text classification model.
  • One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the intermediate classification model is trained using the first sample data and the third sample data to obtain a text classification model.
  • FIG. 1 is a schematic diagram of an application environment of the text classification model training method in an embodiment of the present application;
  • FIG. 2 is a flowchart of the text classification model training method in an embodiment of the present application;
  • FIG. 3 is a flowchart of step S1 of the text classification model training method in an embodiment of the present application;
  • FIG. 4 is a flowchart of step S4 of the text classification model training method in an embodiment of the present application;
  • FIG. 5 is a flowchart of step S5 of the text classification model training method in an embodiment of the present application;
  • FIG. 6 is a schematic diagram of the text classification model training device in an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a computer device in an embodiment of the present application.
  • The text classification model training method provided by the present application can be applied in the application environment shown in FIG. 1.
  • In that environment, the server is a computer device that performs text classification model training and can be a single server or a server cluster. The preset sample library is a database that provides training sample data and can be any relational or non-relational database, such as MS-SQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, or HBase. The server and the preset sample library are connected through a network, which can be wired or wireless.
  • The text classification model training method provided by the embodiments of the present application is applied to the server.
  • In one embodiment, a method for training a text classification model is provided. Its specific implementation process includes the following steps:
  • S1: Obtain first sample data with category marks from a preset sample library, and establish a primary classification model based on the first sample data.
  • the preset sample library is a database that provides training sample data.
  • the preset sample library can be deployed locally on the server or connected to the server through the network.
  • the first sample data is text data with category marks.
  • Text data includes text documents containing text information, text on the Internet, news articles, e-mail bodies, and so on. A category mark is a classification label attached to the text data, restricting the category to which it belongs.
  • Category marks include, but are not limited to, labels such as "science popularization", "sports", "inspirational", and "poetry and prose" that indicate the category of the text data.
  • In the preset sample library, category marks and text data are stored in association, and each piece of text data has a field indicating whether it carries a category mark.
  • The server can obtain the text data carrying category marks as the first sample data through an SQL query statement, as in the sketch below.
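  • A minimal sketch of this retrieval step, assuming a local SQLite sample library with a hypothetical samples table holding text, category, and has_label columns; the table and column names are illustrative, not taken from the patent:

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect("sample_library.db")
cursor = conn.cursor()

# First sample data: text data that carries a category mark.
cursor.execute("SELECT text, category FROM samples WHERE has_label = 1")
first_sample_data = cursor.fetchall()

# Second sample data: text data without a category mark (used in step S2).
cursor.execute("SELECT text FROM samples WHERE has_label = 0")
second_sample_data = [row[0] for row in cursor.fetchall()]

conn.close()
```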
  • The primary classification model is a classification tool constructed from the first sample data. Once established, it can roughly classify sample data that carries category marks.
  • Specifically, the server can perform feature analysis on the first sample data carrying category marks to obtain text feature information, and then store each category mark in association with its text feature information as the primary classification model.
  • For example, the server may perform word segmentation on the text of the first sample data and use high-frequency words as the text feature information.
  • Word segmentation splits the words of a text apart to obtain individual words. As a text processing technique, it is widely used in full-text retrieval, text content mining, and related fields.
  • Alternatively, the server can use a neural-network-based training method to obtain the primary classification model from the first sample data. A feature-based sketch follows.
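  • One possible feature-based reading of S1, assuming the first_sample_data pairs from the query sketch above; jieba is just one choice of segmenter, and the "model" here is simply the stored association of category marks with high-frequency words:

```python
from collections import Counter

import jieba  # one possible word-segmentation library for Chinese text

def text_features(text, top_n=5):
    """Segment the text and keep its highest-frequency words as features."""
    words = [w for w in jieba.lcut(text) if w.strip()]
    return [word for word, _ in Counter(words).most_common(top_n)]

# Store each category mark in association with the text feature information.
primary_model = [
    (text_features(text), category) for text, category in first_sample_data
]
```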
  • S2: Obtain second sample data without category marks from the preset sample library. The second sample data is text data without a category mark; compared with the first sample data, it carries no category mark, so without manual labeling the server does not know which text category the second sample data belongs to or what it expresses.
  • The server can obtain the second sample data from the preset sample library through an SQL query statement.
  • S3: Calculate the information entropy of each second sample data to obtain its information entropy value. Information entropy, introduced by Shannon, is a quantitative measure of the amount of information. The greater the information entropy, the richer the information contained in the sample data and the greater the uncertainty of that information. The information entropy value is the concrete quantified value of the information entropy.
  • The server can determine the information entropy value from how much text the second sample data contains, for example by using the character count of the second sample data as its information entropy value. Intuitively, a 5,000-word article contains more information than a 20-word e-mail body.
  • Specifically, the server counts the characters in each second sample data and uses the character count as that sample's information entropy value.
  • Alternatively, the server uses the number of segmented words remaining after auxiliary words are removed from the second sample data as its information entropy value. Auxiliary words include, but are not limited to, particles such as "吧", "嗯", "的", and "了".
  • Specifically, the server performs word segmentation on the second sample data to obtain a word segmentation set, removes the auxiliary words from it, and uses the number of remaining words as the information entropy value of the second sample data, as sketched below.
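  • A sketch of the two simple entropy-value measures just described; the auxiliary-word set contains only the examples given above and would be extended in practice:

```python
import jieba

AUXILIARY_WORDS = {"吧", "嗯", "的", "了"}  # extend with further particles as needed

def entropy_value_by_char_count(text):
    """Simplest variant: the character count of the second sample data."""
    return len(text)

def entropy_value_by_token_count(text):
    """Alternative variant: the number of segmented words remaining
    after auxiliary words are removed."""
    return sum(1 for w in jieba.lcut(text) if w not in AUXILIARY_WORDS)
```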
  • S4: Calculate the correlation value of each second sample data according to the number of identical phrases contained in the second sample data.
  • The correlation value of the second sample data reflects whether the information it provides is repetitive and redundant. The higher the correlation value, the more repetitive and redundant the information the second sample data provide to each other; the lower the correlation value, the more the information they provide differs.
  • Specifically, the server determines the correlation value from the number of identical phrases the second sample data contain. For example, suppose second sample data A contains the phrases "culture", "civilization", and "history"; B contains "culture", "country", and "history"; and C contains "travel", "mountain", and "country". Then the correlation value between A and B is 2, between A and C is 0, and between B and C is 1.
  • The correlation value of each second sample data can then be determined as the cumulative sum of its correlation values with every other second sample data: the correlation value of A is 2, of B is 3, and of C is 1, as the sketch below verifies.
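  • A short sketch of this cumulative-sum variant, using the phrase lists of A, B, and C from the example above:

```python
def same_phrase_count(phrases_a, phrases_b):
    """Number of identical phrases shared by two samples."""
    return len(set(phrases_a) & set(phrases_b))

samples = {
    "A": ["culture", "civilization", "history"],
    "B": ["culture", "country", "history"],
    "C": ["travel", "mountain", "country"],
}

# Each sample's correlation value: the cumulative sum of its pairwise
# overlaps with every other sample.
correlation = {
    name: sum(
        same_phrase_count(phrases, other)
        for other_name, other in samples.items()
        if other_name != name
    )
    for name, phrases in samples.items()
}
print(correlation)  # {'A': 2, 'B': 3, 'C': 1}
```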
  • S5: Select, as data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is lower than the preset correlation threshold.
  • The preset information entropy threshold and the preset correlation threshold are the conditions for filtering the second sample data that carry no category mark; the data to be labeled is what remains of the second sample data after filtering by these two thresholds.
  • Second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is lower than the preset correlation threshold carry more uncertain information and differ more from one another, making them the preferred data for training the model.
  • For example, the server computes the information entropy value and correlation value of each second sample data, and takes as the data to be labeled the second sample data whose information entropy value is greater than 1,000 and whose correlation value is lower than 100.
  • S6: Perform category labeling on the data to be labeled according to the preset category labeling method to obtain the third sample data.
  • Category labeling attaches category marks to second sample data that have none, for example labeling an article with tags such as "fiction" or "suspense" that reflect its subject matter. The data obtained after category labeling is the third sample data.
  • The preset category labeling method means the server can label the second sample data in any of several ways.
  • For example, the server can extract the keywords of the data to be labeled, say the five words with the highest word frequency, and compare them with the target keywords in a preset category-tag thesaurus; if a keyword matches a target keyword, the corresponding mark is attached to the sample, yielding the third sample data. A sketch of this route follows.
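  • A sketch of the keyword route, assuming the text_features helper from the earlier sketch and a hypothetical thesaurus that maps target keywords to category marks (the patent leaves the thesaurus layout open):

```python
def label_by_keywords(text, category_thesaurus, top_n=5):
    """Return a category mark when one of the sample's top-frequency
    keywords matches a target keyword in the preset thesaurus."""
    for keyword in text_features(text, top_n=top_n):
        if keyword in category_thesaurus:
            return category_thesaurus[keyword]
    return None  # fall back to another route, e.g. an expert-system API

# Hypothetical thesaurus entries for illustration only.
thesaurus = {"篮球": "sports", "股票": "business", "诗歌": "poetry prose"}
```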
  • Alternatively, the server can directly call a third-party expert system for labeling through an API (Application Programming Interface): the second sample data is input to the third-party expert system, which returns the category mark corresponding to that second sample data, thereby yielding the third sample data.
  • S7: According to the preset model training method, use the third sample data to train the primary classification model to obtain the intermediate classification model.
  • The intermediate classification model is the classification model obtained by training on the third sample data on the basis of the primary classification model. It differs from the primary classification model in that its training set is the third sample data, which carries category marks and whose information entropy and correlation values satisfy the specified conditions.
  • Under the preset model training method, the server uses the third sample data as training data and trains the primary classification model with any of several frameworks or algorithms. For example, the server can use existing machine learning frameworks or tools such as Scikit-Learn or TensorFlow.
  • Scikit-Learn (sklearn for short) has built-in classification algorithms such as naive Bayes, decision trees, and random forests, and provides commonly used machine learning facilities including data preprocessing, classification, regression, dimensionality reduction, and model selection.
  • TensorFlow is an open-source software library for numerical computation originally developed by researchers and engineers of the Google Brain team (part of Google's machine intelligence research organization). It was built for research on machine learning and deep neural networks, but its generality has made it widely used in other computational fields as well.
  • Specifically, the server uses the third sample data as input data and calls the built-in training methods in sklearn until the model tends to converge, obtaining the intermediate classification model.
  • S8: According to the preset model training method, use the first sample data and the third sample data to train the intermediate classification model to obtain the text classification model.
  • The text classification model is the final classification model obtained after retraining the intermediate classification model. The server adopts the same preset model training method as in step S7, which is not repeated here.
  • The difference from step S7 is that the first sample data and the third sample data are used together to train the intermediate classification model; that is, the intermediate classification model is iteratively trained with category-marked sample data to improve its classification accuracy.
  • Specifically, the server takes the first sample data and the third sample data as input data and calls the built-in training methods in sklearn until the model tends to converge, obtaining the text classification model. Both stages are sketched below.
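  • A minimal two-stage sketch of S7 and S8 with sklearn's naive Bayes, assuming texts_1/labels_1 hold the first sample data and texts_3/labels_3 the labeled third sample data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer()
vectorizer.fit(texts_1 + texts_3)  # one shared vocabulary for both stages

# S7: train on the third sample data to obtain the intermediate model.
intermediate_model = MultinomialNB().fit(vectorizer.transform(texts_3), labels_3)

# S8: retrain on the first and third sample data together to obtain the
# text classification model. sklearn's fit() refits from scratch, so the
# iterative refinement is expressed as a second fit on the combined set;
# MultinomialNB.partial_fit is the incremental alternative.
text_classification_model = MultinomialNB().fit(
    vectorizer.transform(texts_1 + texts_3), labels_1 + labels_3
)
```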
  • In this embodiment, first sample data with category marks is obtained from the preset sample library and the primary classification model is established from it, so that only a small portion of category-marked sample data is used for training; this reduces the demand for category-marked sample data and saves training cost. Second sample data without category marks is then obtained from the preset sample library, its information entropy and correlation values are calculated, and the second sample data meeting the preset entropy and correlation conditions are labeled. According to the preset model training method, the labeled third sample data, which has high information entropy, low mutual correlation, and category marks, is used to train the primary classification model into the intermediate classification model, improving its classification accuracy. Finally, the intermediate classification model is trained with the first and third sample data to obtain the text classification model; that is, the final text classification model is obtained through stepwise iterative optimization.
  • Further, in one embodiment, step S1, namely obtaining the first sample data with category marks from the preset sample library and establishing the primary classification model from the first sample data, includes the following steps:
  • S11: Select the first sample data with category marks from the preset sample library according to a preset sample selection method.
  • The preset sample selection method selects a certain number of representative category-marked first sample data from the preset sample library. The number should be as small as possible to reduce the demand for sample data; at the same time, the selected first samples should cover the text data categories as fully as possible. For news text data, for example, the selection should try to cover categories such as "politics", "business", "sports", and "style and entertainment".
  • For example, from a library of 3,000 articles, the server can select 30%, that is, 900 articles, and from those 900 select 5 articles representing each text data category as the first sample data.
  • S12: Establish the primary classification model by combining the category-marked first sample data with a preset training algorithm.
  • The preset training algorithm can be any of the various algorithms used to train models in machine learning.
  • The process in which the server establishes the primary classification model from category-marked first sample data is a supervised learning procedure. Supervised learning trains an optimal model from existing training samples, that is, known data and their corresponding outputs; the model belongs to some set of functions, and "optimal" means best under a given evaluation criterion.
  • For example, the server can import the naive Bayes classifier from the sklearn library and call MultinomialNB().fit() for training.
  • The server can use the Joblib library to save the training results. Joblib is part of the SciPy ecosystem and provides tools for pipelining Python jobs. Alternatively, the server can call functions of the pickle library to save the primary classification model, as in the sketch below.
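  • Both persistence routes in one sketch, assuming model is a fitted sklearn estimator such as the MultinomialNB above; the file names are illustrative:

```python
import pickle

import joblib

joblib.dump(model, "primary_model.joblib")   # Joblib route
with open("primary_model.pkl", "wb") as f:   # pickle route
    pickle.dump(model, f)

restored = joblib.load("primary_model.joblib")  # reload later
```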
  • In this embodiment, the server selects first sample data that is as small in number as possible while covering as many sample categories as possible, and then establishes the primary classification model with the preset training algorithm. This keeps the demand for sample data low, further reducing training cost, while the broad coverage of the first sample data widens the range the primary classification model can recognize.
  • Further, in one embodiment, step S3, namely calculating the information entropy of each second sample data to obtain the information entropy value of each second sample data, specifically includes calculating the information entropy according to the following Shannon formula:

    H = -Σ p(x) · log₂ p(x), summed over all phrases x in the second sample data

  • where H represents the information entropy value of the second sample data, x represents a phrase in the second sample data, and p(x) represents the frequency of occurrence of the phrase.
  • The phrases in the second sample data are the words obtained after the server performs word segmentation on the second sample data, and the frequency of a phrase is the number of times it appears in the second sample data.
  • Specifically, the server first performs word segmentation on each second sample data to obtain a word segmentation set; substituting the frequencies of all the words in the set into the formula yields the information entropy value of the second sample data.
  • In this embodiment, the server calculates the information entropy of the second sample data from the Shannon formula and the word frequencies of its phrases, quantifying the amount of information contained in the sample data more accurately. A sketch follows.
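  • A sketch of the entropy formula, reading p(x) as the relative frequency of each phrase so the p(x) sum to 1 (an assumption; the text only says "frequency of occurrence") and taking the logarithm to base 2 as in Shannon's formula:

```python
import math
from collections import Counter

import jieba

def information_entropy(text):
    """H = -sum over phrases x of p(x) * log2(p(x))."""
    words = jieba.lcut(text)  # word segmentation set, as in step S3
    total = len(words)
    counts = Counter(words)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```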
  • Further, in one embodiment, step S4, namely calculating the correlation value of each second sample data according to the number of identical phrases contained in the second sample data, specifically includes the following steps:
  • S41: Perform word segmentation on each second sample data to obtain N word segmentation sets, where N is the number of second sample data.
  • The server can perform word segmentation in several ways. For example, a regular expression can be used to segment the second sample data into a set of words, that is, a word segmentation set. Understandably, the second sample data and the word segmentation sets correspond one to one.
  • A regular expression (Regular Expression) is a processing method used to retrieve or replace target text within a context.
  • For example, the server can use the regular expression engine built into Perl or Python to segment the second sample data; or it can use the grep tool that ships with Unix systems to obtain a set containing the segmented words. grep (Globally search a Regular Expression and Print) is a powerful text search tool. A regex-based sketch follows.
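  • A regex-based sketch of S41; the pattern is illustrative, and a plain word-character pattern only splits on punctuation and whitespace, so real Chinese text usually needs a dictionary-based segmenter instead:

```python
import re

def segment(text):
    """Split one second sample into a word segmentation set."""
    return set(re.findall(r"\w+", text))

print(segment("people, interest, bank, borrow"))
# {'people', 'interest', 'bank', 'borrow'}
```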
  • S42: For each second sample data, calculate the intersections between its word segmentation set and the word segmentation sets of the other N-1 second sample data, and determine from the number of phrases in each intersection the local correlation values between this second sample data and the other N-1 second sample data, obtaining its N-1 local correlation values.
  • The local correlation value represents the degree of correlation between one second sample data and another. For example, suppose word segmentation set a is {"people", "interest", "bank", "borrow"} and word segmentation set b is {"bank", "borrow", "income"}; their intersection is {"bank", "borrow"}, which contains 2 phrases, so the local correlation value of a and b is 2. If word segmentation set c is {"meeting", "report", "income"}, then the local correlation value of a and c is 0, and that of b and c is 1.
  • S43: Calculate the average of the N-1 local correlation values corresponding to each second sample data, and use this average as the correlation value of that second sample data.
  • Continuing the example, the correlation value of the second sample data corresponding to word segmentation set a is the average of its local correlation values with b and with c, that is, (2 + 0) / 2 = 1; the correlation values of the second sample data corresponding to sets b and c are 1.5 and 0.5, respectively.
  • In this embodiment, the server segments the second sample data, determines the local correlation values between them from the intersections of their word segmentation sets, and averages the local correlation values to obtain each second sample data's correlation value, so that the correlation value reflects the degree of correlation between the second sample data more accurately. The sketch below covers steps S41 to S43.
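  • A sketch covering S41 to S43 end to end, reproducing the a/b/c example above:

```python
def correlation_values(segmented):
    """Average each sample's N-1 local correlation values (set-intersection
    sizes with every other sample) into its correlation value."""
    names = list(segmented)
    return {
        name: sum(
            len(segmented[name] & segmented[other])
            for other in names
            if other != name
        ) / (len(names) - 1)
        for name in names
    }

sets = {
    "a": {"people", "interest", "bank", "borrow"},
    "b": {"bank", "borrow", "income"},
    "c": {"meeting", "report", "income"},
}
print(correlation_values(sets))  # {'a': 1.0, 'b': 1.5, 'c': 0.5}
```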
  • Further, in one embodiment, step S5, namely selecting as the data to be labeled the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is lower than the preset correlation threshold, includes the following steps:
  • S51: Select the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is lower than the preset correlation threshold as candidate sample data.
  • Here the server re-screens the second sample data that meet this specific condition, which both reduces the number of training samples and uncovers sample data that an ordinary classifier finds hard to recognize.
  • S52: Classify the candidate sample data with at least two preset sample classifiers to obtain a classification result.
  • The preset sample classifiers are text classification models, for example the common FastText and TextCNN models.
  • FastText is a word-vector and text classification tool open-sourced by Facebook whose typical application scenario is the supervised text classification problem. It provides a simple and efficient method for text classification and representation learning, with performance comparable to deep learning at much higher speed.
  • TextCNN is an algorithm that classifies text with convolutional neural networks; because of its simple structure and good results, it is widely used in text classification.
  • Different preset sample classifiers may classify the same sample data differently; that is, after the same sample is classified by different models such as FastText and TextCNN, it may be recognized as belonging to different categories. The classification result records the category to which each candidate sample data belongs.
  • S53: From the classification result, select the candidate sample data that belong to different categories at the same time as the data to be labeled.
  • Candidate sample data belonging to different categories at the same time means that different preset classifiers produced different recognition results for the same candidate sample. For example, an article recognized as "historical" by FastText and as "literary and artistic" by TextCNN is hard to recognize, or hard to place into a single category.
  • Specifically, the server determines from the classification result whether each candidate sample data belongs to different categories at the same time.
  • In this embodiment, the server screens the second sample data meeting the specific condition with different preset classifiers and picks out the second sample data that are hard to recognize as the data to be labeled. This removes the simple, easily recognized sample data, further reducing the number of training samples and the training time and improving training efficiency; meanwhile, selecting the hard-to-recognize sample data as the data to be labeled means their labeling benefits the accuracy of model training. A selection sketch follows.
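  • A sketch of the disagreement filter in S52 and S53, assuming each preset classifier exposes a predict(texts) method returning one category per text (the FastText and TextCNN wrappers would need adapting to this interface):

```python
def select_data_to_label(candidates, classifier_a, classifier_b):
    """Keep the candidates that two preset classifiers assign to
    different categories, i.e. the hard-to-recognize samples."""
    preds_a = classifier_a.predict(candidates)
    preds_b = classifier_b.predict(candidates)
    return [
        text
        for text, cat_a, cat_b in zip(candidates, preds_a, preds_b)
        if cat_a != cat_b
    ]
```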
  • In one embodiment, a text classification model training device is provided, and it corresponds one-to-one to the text classification model training method in the above embodiment.
  • As shown in FIG. 6, the text classification model training device includes a primary model establishment module 61, a sample data acquisition module 62, an information entropy calculation module 63, a correlation calculation module 64, a to-be-labeled data selection module 65, a labeling module 66, a first model training module 67, and a second model training module 68.
  • the detailed description of each functional module is as follows:
  • The primary model establishment module 61 is configured to obtain first sample data with category marks from the preset sample library and establish a primary classification model according to the first sample data;
  • the sample data acquisition module 62 is configured to obtain second sample data without category marks from the preset sample library;
  • the information entropy calculation module 63 is configured to calculate the information entropy of each second sample data to obtain the information entropy value of each second sample data;
  • the correlation calculation module 64 is configured to calculate the correlation value of each second sample data according to the number of identical phrases contained in the second sample data;
  • the to-be-labeled data selection module 65 is configured to select, as the data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than the preset correlation threshold;
  • the labeling module 66 is configured to perform category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;
  • the first model training module 67 is configured to train the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;
  • the second model training module 68 is configured to train the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
  • the primary model establishment module 61 includes:
  • the selection sub-module 611 is used to select the first sample data with the category mark from the preset sample library according to the preset sample selection method;
  • the training sub-module 612 is used to establish a primary classification model by combining the first sample data with category labels and a preset training algorithm.
  • the information entropy calculation module 63 includes
  • the information entropy calculation sub-module 631 is configured to calculate the information entropy of each second sample data according to the following formula:
  • H represents the information entropy value of the second sample data
  • x represents the phrase in the second sample data
  • p (x) represents the frequency of occurrence of the phrase.
  • the correlation calculation module 64 includes:
  • the word segmentation sub-module 641 is used to perform word segmentation processing on each second sample data to obtain N word segmentation sets, where N is the number of second sample data;
  • the local correlation calculation sub-module 642 is configured to, for each second sample data, calculate the intersections between its word segmentation set and the word segmentation sets of the other N-1 second sample data, determine the local correlation values between this second sample data and the other N-1 second sample data according to the number of phrases contained in each intersection, and obtain the N-1 local correlation values corresponding to this second sample data;
  • the average value calculation sub-module 643 is used to calculate the average value of N-1 local correlation values corresponding to each second sample data, and use the average value as the correlation value of each second sample data.
  • the data selection module 65 to be labeled includes:
  • the candidate sample selection submodule 651 is configured to select second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than the preset correlation threshold as candidate sample data;
  • the classification sub-module 652 is configured to classify candidate sample data by using at least two preset sample classifiers to obtain a classification result
  • the labeling submodule 653 is used to select candidate sample data belonging to different categories at the same time from the classification result as the data to be labeled.
  • Each module in the text classification model training device described above can be implemented in whole or in part by software, hardware, and a combination thereof.
  • The foregoing modules may be embedded in, or independent of, the processor of the computer device in the form of hardware, or stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to them.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 7.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a text classification model training method.
  • In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the steps of the text classification model training method in the above embodiment are implemented, for example steps S1 to S8 shown in FIG. 2; or the functions of the modules/units of the text classification model training device in the above embodiment are implemented, for example the functions of modules 61 to 68 shown in FIG. 6. To avoid repetition, details are not described here again.
  • In one embodiment, one or more readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media.
  • The readable storage media store computer-readable instructions which, when executed by one or more processors, implement the steps of the text classification model training method in the above method embodiment, or implement the functions of the modules/units of the text classification model training device in the above device embodiment. To avoid repetition, details are not described here again.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed by the present application are a text categorization model training method, apparatus, computer device, and storage medium, said method comprising: obtaining, from a preset sample library, first sample data having a category label and second sample data not having a category label; establishing a primary categorization model according to the first sample data; at the same time, calculating an information entropy value and a correlation value of the second sample data; according to a preset category labeling method, labeling the second sample data whose information entropy value and correlation value meet preset conditions to obtain third sample data; using the third sample data to train the primary categorization model to obtain an intermediate categorization model; using the first sample data and the third sample data to train the intermediate categorization model to obtain a text categorization model. The technical solution of the present application solves the problem, during text categorization model training, of the training sample size being enormous and the training time being long.

Description

Text classification model training method, apparatus, computer device, and storage medium

This application is based on, and claims priority to, Chinese invention application No. 201910247846.8, filed on March 29, 2019 and titled "Text classification model training method, apparatus, computer device, and storage medium".

Technical Field

This application relates to the field of information processing, and in particular to a text classification model training method, apparatus, computer device, and storage medium.

Background

Text classification is an important application direction in natural language processing research. It refers to using a classifier to categorize documents containing text, determining the category to which each document belongs so that users can easily retrieve the documents they need.

The classifier, also called a classification model, is obtained by training classification criteria or model parameters on a large amount of sample data with category labels. The trained classifier is used to recognize text data of unknown categories, realizing automatic classification of large-scale text data. The quality of the classification model therefore directly affects the final classification result.

However, in real large-scale text classification problems, sample data with category labels is very limited, and most samples carry no category label. During the construction of a classification model, this forces manual labeling by experts in the field, which consumes considerable manpower, money, and time; moreover, the training samples are large in scale and the training process also takes a long time.

Summary of the Invention

The embodiments of the present application provide a text classification model training method, apparatus, computer device, and storage medium to address the problems of a large training-sample scale and long training time during text classification model training.

A text classification model training method includes:

obtaining first sample data with category marks from a preset sample library, and establishing a primary classification model according to the first sample data;

obtaining second sample data without the category marks from the preset sample library;

calculating the information entropy of each of the second sample data to obtain an information entropy value of each of the second sample data;

calculating a correlation value of each of the second sample data according to the number of identical phrases contained in the second sample data;

selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than a preset correlation threshold;

performing category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;

training the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;

training the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
A text classification model training apparatus includes:

a primary model establishment module, configured to obtain first sample data with category marks from a preset sample library and establish a primary classification model according to the first sample data;

a sample data acquisition module, configured to obtain second sample data without the category marks from the preset sample library;

an information entropy calculation module, configured to calculate the information entropy of each of the second sample data to obtain an information entropy value of each of the second sample data;

a correlation calculation module, configured to calculate a correlation value of each of the second sample data according to the number of identical phrases contained in the second sample data;

a to-be-labeled data selection module, configured to select, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than a preset correlation threshold;

a labeling module, configured to perform category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;

a first model training module, configured to train the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;

a second model training module, configured to train the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.

A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions:

obtaining first sample data with category marks from a preset sample library, and establishing a primary classification model according to the first sample data;

obtaining second sample data without the category marks from the preset sample library;

calculating the information entropy of each of the second sample data to obtain an information entropy value of each of the second sample data;

calculating a correlation value of each of the second sample data according to the number of identical phrases contained in the second sample data;

selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than a preset correlation threshold;

performing category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;

training the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;

training the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.

One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:

obtaining first sample data with category marks from a preset sample library, and establishing a primary classification model according to the first sample data;

obtaining second sample data without the category marks from the preset sample library;

calculating the information entropy of each of the second sample data to obtain an information entropy value of each of the second sample data;

calculating a correlation value of each of the second sample data according to the number of identical phrases contained in the second sample data;

selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than a preset correlation threshold;

performing category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;

training the primary classification model using the third sample data according to a preset model training method to obtain an intermediate classification model;

training the intermediate classification model using the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
The details of one or more embodiments of the present application are set forth in the drawings and description below; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.

Description of the Drawings

To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an application environment of the text classification model training method in an embodiment of the present application;

FIG. 2 is a flowchart of the text classification model training method in an embodiment of the present application;

FIG. 3 is a flowchart of step S1 of the text classification model training method in an embodiment of the present application;

FIG. 4 is a flowchart of step S4 of the text classification model training method in an embodiment of the present application;

FIG. 5 is a flowchart of step S5 of the text classification model training method in an embodiment of the present application;

FIG. 6 is a schematic diagram of the text classification model training device in an embodiment of the present application;

FIG. 7 is a schematic diagram of a computer device in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative work fall within the protection scope of the present application.

The text classification model training method provided by the present application can be applied in the application environment shown in FIG. 1, where the server is a computer device that performs text classification model training and can be a single server or a server cluster. The preset sample library is a database that provides training sample data and can be any relational or non-relational database, such as MS-SQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, or HBase. The server and the preset sample library are connected through a network, which can be wired or wireless. The text classification model training method provided by the embodiments of the present application is applied to the server.
在一实施例中,如图2所示,提供了一种文本分类模型训练方法,其具体实现流程包括如下步骤:In an embodiment, as shown in FIG. 2, a method for training a text classification model is provided. The specific implementation process includes the following steps:
S1:从预设样本库中获取具有类别标记的第一样本数据,并根据第一样本数据建立初级分类模型。S1: Obtain first sample data with category marks from a preset sample library, and establish a primary classification model based on the first sample data.
预设样本库,即提供训练样本数据的数据库。预设样本库可以部署在服务端本地,或者通过网络与服务端相连。The preset sample library is a database that provides training sample data. The preset sample library can be deployed locally on the server or connected to the server through the network.
第一样本数据,是具有类别标记的文本数据。其中,文本数据是包含有文本信息的文本文档、互联网上的文字、新闻、以及电子邮件正文等;类别标记是对文本数据所作的分类标签,是对文本数据的分类限定。The first sample data is text data with category marks. Among them, the text data is a text document containing text information, text on the Internet, news, and the body of an e-mail, etc.; the category tag is a classification label for the text data, which is a classification restriction on the text data.
例如,一篇文章的类别标记为“情感”,则代表该篇文章的内容以与“情感”相关。可以理解地,类别标记还包括但不限于“科普”、“运动”、“励志”、“诗歌散文”等用于表示文本数据所属类别的标记。For example, if the category of an article is marked as "emotion", it means that the content of the article is related to "emotion". Understandably, category tags also include but are not limited to "science popularization", "sports", "inspirational", "poetry prose", etc., used to indicate the category of text data.
具体地,在预设样本库中,类别标记和文本数据是关联存储的,每个文本数据均有表示其是否具有类别标记的字段。服务端可以通过SQL查询语句获取有类别标记的文本数据作为第一样本数据。Specifically, in the preset sample library, the category mark and text data are stored in association, and each text data has a field indicating whether it has a category mark. The server can obtain the text data with the category mark as the first sample data through the SQL query statement.
The primary classification model is a classification tool built from the first sample data. Once built, it can perform a rough classification of category-marked sample data.
Specifically, the server can perform feature analysis on the first sample data to obtain its text feature information, and then store the category marks in association with the text feature information as the primary classification model. For example, the server may apply word segmentation to the text of the first sample data and take the high-frequency words as the text feature information. Word segmentation splits the words of a text apart to obtain individual words; as a text processing technique, it is widely used in full-text retrieval, text content mining, and related fields.
Alternatively, the server can obtain the primary classification model from the first sample data using a neural-network-based training method.
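The word-frequency approach can be illustrated with a minimal sketch. jieba is assumed here as the segmentation tool, and the sample layout (text, label) and the top_k cutoff are illustrative choices rather than part of the described method:

```python
from collections import Counter

import jieba  # assumed segmentation library


def build_primary_model(labeled_samples, top_k=20):
    """Associate each category mark with its highest-frequency words."""
    counters = {}
    for text, label in labeled_samples:
        words = [w for w in jieba.cut(text) if w.strip()]
        counters.setdefault(label, Counter()).update(words)
    # Keep the top_k most frequent words per category as its text features.
    return {label: [w for w, _ in c.most_common(top_k)]
            for label, c in counters.items()}
```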
S2: Obtain second sample data without category marks from the preset sample library.
The second sample data is text data that carries no category mark. That is, unlike the first sample data, the second sample data is unlabeled; without manual labeling, the server does not know which text category a piece of second sample data belongs to or what it expresses.
Specifically, the server can obtain the second sample data from the preset sample library through an SQL query.
S3: Calculate the information entropy of each piece of second sample data to obtain its information entropy value.
Information entropy, introduced by Shannon to measure the amount of information, is a quantitative measure of information content. The larger the information entropy, the richer the information contained in the sample data, and the greater the uncertainty of that information.
The information entropy value is the concrete quantification of the information entropy.
The server can determine the information entropy value from how much text a piece of second sample data contains. For example, the number of characters in the second sample data can be used as its information entropy value. Understandably, a 5000-character article contains more information than an e-mail body of only 20 characters.
Specifically, the server counts the characters in each piece of second sample data and uses that count as its information entropy value.
Alternatively, the server uses the number of segmented words remaining after auxiliary particles are removed from the second sample data as its information entropy value. Auxiliary particles include, but are not limited to, "吧", "嗯", "的", "了", and so on.
Specifically, the server applies word segmentation to the second sample data to obtain a word set, removes the auxiliary particles from it, and takes the number of remaining words as the information entropy value of the second sample data.
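A short sketch of this second estimate, assuming jieba as the tokenizer; the particle set mirrors the examples above and would be extended in practice:

```python
import jieba  # assumed tokenizer

PARTICLES = {"吧", "嗯", "的", "了"}  # auxiliary particles named above


def entropy_value_by_token_count(text):
    """Count the segmented words left after dropping auxiliary particles."""
    tokens = [w for w in jieba.cut(text) if w.strip()]
    return sum(1 for w in tokens if w not in PARTICLES)
```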
S4: Calculate the correlation value of each piece of second sample data according to the number of identical phrases the second sample data share.
The correlation value of the second sample data reflects whether the information the samples provide is repetitive and redundant. The higher the correlation value, the more the second sample data duplicate one another; the lower the correlation value, the more the second sample data differ from one another.
The server determines the correlation value from the number of identical phrases the second sample data contain.
For example, suppose second sample A contains the phrases "culture", "civilization", and "history"; second sample B contains "culture", "country", and "history"; and second sample C contains "travel", "mountains and rivers", and "country". Samples A and B both contain the phrases "culture" and "history", so the correlation value between A and B is 2; likewise, the correlation value between A and C is 0, and between B and C is 1. The correlation value of each second sample can then be determined as the cumulative sum of its correlation values with every other second sample: A's correlation value is 2, B's is 3, and C's is 1.
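The A/B/C example can be reproduced directly with set intersections; the set literals below mirror the phrases in the example:

```python
# Word sets of the three second samples from the example above.
sample_phrases = {
    "A": {"文化", "文明", "历史"},
    "B": {"文化", "国家", "历史"},
    "C": {"旅行", "山川", "国家"},
}

for name, phrases in sample_phrases.items():
    # Cumulative sum of pairwise intersection sizes with every other sample.
    score = sum(len(phrases & other)
                for key, other in sample_phrases.items() if key != name)
    print(name, score)  # A -> 2, B -> 3, C -> 1
```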
S5: Select the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below a preset correlation threshold as the data to be labeled.
The preset information entropy threshold and the preset correlation threshold are the conditions for filtering the unlabeled second sample data.
The data to be labeled is what remains after the second sample data are filtered by the preset information entropy threshold and the preset correlation threshold.
Second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold carry highly uncertain content and differ most from one another, making them the preferred data for training the model.
Specifically, if the preset information entropy threshold is 1000 and the preset correlation threshold is 100, the server screens each piece of second sample data by its information entropy value and correlation value, and takes those with an information entropy value greater than 1000 and a correlation value below 100 as the data to be labeled.
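A direct sketch of this selection rule, using the example thresholds; each sample is assumed to carry precomputed entropy and correlation values:

```python
ENTROPY_THRESHOLD = 1000      # preset information entropy threshold
CORRELATION_THRESHOLD = 100   # preset correlation threshold


def select_to_label(samples):
    """samples: iterable of (text, entropy_value, correlation_value)."""
    return [text for text, entropy, correlation in samples
            if entropy > ENTROPY_THRESHOLD
            and correlation < CORRELATION_THRESHOLD]
```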
S6: Label the data to be labeled by category according to a preset category labeling method to obtain third sample data.
Category labeling is the process of marking second sample data that carries no category mark so that it acquires one. For example, labeling an article by category means attaching tags such as "fiction" or "suspense" that reflect its subject matter. The data obtained after category labeling is the third sample data.
The preset category labeling method means the server can label the second sample data by category in any of several ways.
For example, the server can extract keywords from the second sample data, taking the five words with the highest word frequency as the keywords; it then compares the keywords against the target keywords in a preset category-mark lexicon, and if a keyword matches a target keyword, it labels the second sample data with that target keyword to obtain the third sample data.
Alternatively, the server can directly call a third-party expert system for labeling. For example, using the API (Application Programming Interface) provided by a third-party expert system, it submits the second sample data and receives the corresponding category marks, thereby obtaining the third sample data.
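The keyword-matching variant might look like the following sketch; jieba and the lexicon contents are illustrative assumptions:

```python
from collections import Counter

import jieba  # assumed tokenizer

# Hypothetical category-mark lexicon: target keyword -> category mark.
LABEL_LEXICON = {"小说": "小说", "悬疑": "悬疑"}


def label_by_keywords(text):
    """Take the five most frequent words as keywords and match the lexicon."""
    words = [w for w in jieba.cut(text) if w.strip()]
    keywords = [w for w, _ in Counter(words).most_common(5)]
    return [LABEL_LEXICON[w] for w in keywords if w in LABEL_LEXICON]
```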
S7: Train the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model.
The intermediate classification model is the classification model obtained by training the primary classification model with the third sample data. It differs from the primary classification model in that its training set is the third sample data, which carries category marks and whose information entropy and correlation values satisfy the specified conditions.
The preset model training method means the server uses the third sample data as training data and trains the primary classification model with any of several frameworks or algorithms. For example, the server can use existing machine learning frameworks or tools such as Scikit-Learn or TensorFlow.
Scikit-Learn, abbreviated sklearn, is an open-source, Python-based machine learning library. It has built-in classification algorithms such as naive Bayes, decision trees, and random forests, and supports common machine learning tasks such as data preprocessing, classification, regression, dimensionality reduction, and model selection. TensorFlow is an open-source software library for numerical computation, originally developed by researchers and engineers of the Google Brain team (part of Google's machine intelligence research organization); it is used for machine learning and deep neural network research, but the generality of the system makes it applicable to many other computing fields as well.
Specifically, taking sklearn as an example, the server feeds in the third sample data as input and calls sklearn's built-in training methods until the model converges, yielding the intermediate classification model.
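One possible realization of this step with sklearn is an incremental update of a naive Bayes classifier on the newly labeled samples; the HashingVectorizer/partial_fit combination is an assumption about how the built-in training method could be invoked, not wording from this application:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# alternate_sign=False keeps features non-negative, as naive Bayes requires.
vectorizer = HashingVectorizer(alternate_sign=False)
model = MultinomialNB()


def train_intermediate(model, third_samples, all_classes):
    """third_samples: list of (text, label); all_classes: every known mark."""
    texts, labels = zip(*third_samples)
    # partial_fit updates the existing (primary) model instead of refitting.
    model.partial_fit(vectorizer.transform(texts), labels, classes=all_classes)
    return model
```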
S8: Train the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain the text classification model.
The text classification model is the final classification model obtained by retraining the intermediate classification model.
The preset model training method the server adopts here is the same as the training process of step S7 and is not repeated. The difference from step S7 is that the first sample data and the third sample data are used together to train the intermediate classification model; that is, the intermediate classification model is iteratively trained on all category-marked sample data to improve its classification accuracy.
Specifically, taking sklearn as an example, the server feeds in the first sample data and the third sample data as input and calls sklearn's built-in training methods until the model converges, yielding the text classification model.
In this embodiment, first sample data with category marks is obtained from the preset sample library and a primary classification model is built from it; that is, only a small portion of category-marked sample data is used for the initial training, which reduces the demand for labeled samples and saves training cost. Second sample data without category marks is then obtained from the preset sample library; the information entropy value and correlation value of each piece of second sample data are calculated, and the second sample data whose information entropy and correlation values satisfy the preset conditions are labeled by category. The primary classification model is then trained with the labeled third sample data according to the preset model training method to obtain the intermediate classification model; because the third sample data has high information entropy, low mutual correlation, and category marks, it improves the classification accuracy of the primary classification model. Finally, the intermediate classification model is trained with the first sample data and the third sample data to obtain the text classification model; that is, the final text classification model is optimized through stage-by-stage iteration. The result is a method for training a text classification model with only a small amount of category-marked sample data, so that a well-performing classification model can be obtained from fewer training samples, saving labor cost and speeding up training.
Further, in an embodiment, as shown in Figure 3, step S1, namely obtaining first sample data with category marks from the preset sample library and building a primary classification model from it, specifically includes the following steps:
S11: Select first sample data with category marks from the preset sample library according to a preset sample selection method.
The preset sample selection method selects a certain number of representative, category-marked first samples from the preset sample library. The number should be as small as possible to reduce the demand for sample data, while the selected first samples should cover the text data categories as completely as possible. For example, when selecting news text data, try to cover categories such as "politics", "business", "sports", and "culture and entertainment".
Specifically, if the preset sample library holds 100,000 articles of which 3000 carry category marks, the server may select 30% of those 3000 articles, i.e. 900 articles, and from those 900 select 5 articles representing each text data category as the first sample data.
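A sketch of this example selection; the data layout (text, category) is assumed for illustration:

```python
import random
from collections import defaultdict


def select_first_samples(labeled_articles, fraction=0.3, per_category=5):
    """labeled_articles: list of (text, category) pairs with category marks."""
    pool = random.sample(labeled_articles,
                         int(len(labeled_articles) * fraction))
    by_category = defaultdict(list)
    for text, category in pool:
        by_category[category].append((text, category))
    # Keep a handful of articles per category so coverage stays broad.
    selected = []
    for articles in by_category.values():
        selected.extend(articles[:per_category])
    return selected
```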
S12: Build the primary classification model by combining the category-marked first sample data with a preset training algorithm.
The preset training algorithm can be any of the algorithms used to train models in machine learning. Building the primary classification model from category-marked first sample data is a supervised learning process: supervised learning trains an optimal model from existing training samples, i.e. known data and its corresponding outputs. The model belongs to some set of functions, and "optimal" means best under some evaluation criterion.
Specifically, taking the naive Bayes classification algorithm as an example, the server can import the naive Bayes functions from the sklearn library and then call MultinomialNB().fit() for training.
When training is complete, the server can use the Joblib library to save the trained model. Joblib is part of the SciPy ecosystem and provides tools for pipelining Python jobs. Alternatively, the server can call functions of the pickle library to save the primary classification model.
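Putting the two calls together, a minimal sketch of training and persisting the primary classification model might read as follows; the CountVectorizer choice is an illustrative assumption:

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def build_and_save(first_samples, path="primary_model.joblib"):
    """first_samples: list of (text, label) pairs with category marks."""
    texts, labels = zip(*first_samples)
    vectorizer = CountVectorizer()
    model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
    joblib.dump((vectorizer, model), path)  # persist, as with the Joblib option
    return vectorizer, model
```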
In this embodiment, the server selects, according to the preset sample selection method, first sample data that is as small in quantity as possible while covering the sample data types as broadly as possible, and then builds the primary classification model with the preset training algorithm. This keeps the demand for sample data to a minimum, further reducing training cost, while the broad coverage of the first sample data widens the range of inputs the primary classification model can recognize.
Further, in an embodiment, step S3, namely calculating the information entropy of each piece of second sample data to obtain its information entropy value, specifically includes the following step:
Calculate the information entropy of each piece of second sample data according to the following formula:

H = -\sum_{x} p(x) \log p(x)

where H is the information entropy value of the second sample data, x ranges over the phrases in the second sample data, and p(x) is the frequency of occurrence of the phrase.
The phrases in the second sample data are the words obtained after the server applies word segmentation to the second sample data. The frequency of occurrence of a phrase is how often the phrase appears in the second sample data.
Specifically, the server first applies word segmentation to each piece of second sample data to obtain a word set, and then substitutes the frequencies of all the words in the set into the formula to obtain the information entropy value of that second sample data.
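A direct implementation of the formula, assuming jieba as the tokenizer; p(x) is taken here as the relative frequency of each word, and base-2 logarithms are a conventional choice the formula itself leaves open:

```python
import math
from collections import Counter

import jieba  # assumed tokenizer


def information_entropy(text):
    """H = -sum over words x of p(x) * log2 p(x)."""
    words = [w for w in jieba.cut(text) if w.strip()]
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```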
In this embodiment, the server calculates the information entropy of the second sample data from the Shannon formula and the word frequencies of the phrases in the second sample data, making the quantification of the information content of the sample data more accurate.
Further, in an embodiment, as shown in Figure 4, step S4, namely calculating the correlation value of each piece of second sample data according to the number of identical phrases the second sample data share, specifically includes the following steps:
S41: Apply word segmentation to each piece of second sample data to obtain N word sets, where N is the number of second samples.
Specifically, the server can perform word segmentation in several ways. For example, regular expressions can be used to split the second sample data into a set of individual words, i.e. a word set. Understandably, the second samples and the word sets correspond one to one.
A regular expression (Regular Expression) is a processing method for retrieving or replacing target text in context.
Specifically, the server can use the regular expression engine built into Perl or Python to split the second sample data; alternatively, it can split the second sample data with the grep tool shipped with Unix systems to obtain a set of words. grep (Globally search a Regular Expression and Print) is a powerful text search tool.
S42: For each piece of second sample data, compute the intersections between its word set and the word sets of the other N-1 second samples, and from the number of phrases in each intersection determine the local correlation values between this second sample and the other N-1 second samples, yielding the N-1 local correlation values of this second sample.
To compute the intersection of word sets, the different word sets are compared; the intersection consists of the phrases they share.
A local correlation value represents the degree of correlation between one second sample and another.
For example, if word set a is {"people", "interest", "bank", "lending"} and word set b is {"bank", "lending", "income"}, their intersection is {"bank", "lending"}, which contains 2 phrases, so the local correlation value of a and b is 2. Likewise, if word set c is {"meeting", "report", "income"}, the local correlation value of a and c is 0, and that of b and c is 1.
S43: Compute the average of the N-1 local correlation values of each second sample and take that average as the sample's correlation value.
Continuing with the word sets a, b, and c from step S42: the correlation value of the second sample corresponding to word set a is the average of the local correlation values of a with b and of a with c, i.e. (2+0)/2 = 1. Likewise, the correlation values of the second samples corresponding to word sets b and c are 1.5 and 0.5, respectively.
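Steps S41 to S43 can be sketched as one function; jieba is an assumed tokenizer:

```python
import jieba  # assumed tokenizer


def correlation_values(samples):
    """samples: list of N texts; returns one correlation value per sample."""
    word_sets = [{w for w in jieba.cut(t) if w.strip()} for t in samples]
    n = len(word_sets)
    values = []
    for i in range(n):
        # Local correlation with each of the other N-1 samples, then average.
        local = [len(word_sets[i] & word_sets[j]) for j in range(n) if j != i]
        values.append(sum(local) / (n - 1))
    return values
```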
In this embodiment, the server applies word segmentation to the second sample data, determines the local correlation values between second samples from the intersections of their word sets, and averages the local correlation values to obtain the correlation value of each second sample, so that the correlation value reflects the degree of association between the second samples more accurately.
Further, in an embodiment, as shown in Figure 5, step S5, namely selecting the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold as the data to be labeled, specifically includes the following steps:
S51: Select the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold as candidate sample data.
The server screens the second sample data meeting these specific conditions a second time, both to reduce the number of training samples and to find the sample data that ordinary classifiers have difficulty recognizing. The specific conditions are that the information entropy value exceeds the preset information entropy threshold and the correlation value is below the preset correlation threshold.
S52: Classify the candidate sample data with at least two preset sample classifiers to obtain classification results.
The preset sample classifiers are text classification models, such as the common FastText and Text-CNN models.
FastText is an open-source word vector and text classification tool from Facebook whose typical application scenario is supervised text classification; it offers a simple, efficient approach to text classification and representation learning, with performance comparable to deep learning but faster. TextCNN is an algorithm that classifies text with a convolutional neural network; thanks to its simple structure and good results, it is widely used in text classification.
Different preset sample classifiers may classify the same sample data differently. That is, when the same sample data is classified by different models such as FastText and Text-CNN, it may be assigned different categories.
The classification results include the category assigned to each piece of candidate sample data.
S53: From the classification results, select the candidate sample data assigned to different categories at the same time as the data to be labeled.
Candidate sample data that belongs to different categories at the same time is sample data for which different preset classifiers return different results. For example, an article may be recognized as "history" by FastText but as "literature and art" by Text-CNN, which indicates the article is hard to recognize, or hard to assign simply to one category.
Specifically, the server determines, from the categories assigned to each piece of candidate sample data in the classification results, whether it belongs to different categories at the same time.
In this embodiment, the server uses different preset classifiers to screen the second sample data that meets the specific conditions and picks out the hard-to-recognize second samples as the data to be labeled. This removes the sample data that is simple and easy to recognize, further reducing the number of training samples and the training time and improving training efficiency; at the same time, selecting the sample data that is not easily recognized as the data to be labeled means that, once these samples are labeled by category, they help improve the training accuracy of the model.
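A sketch of this disagreement filter; the classifiers are assumed to expose a predict(text) -> label interface:

```python
def select_hard_samples(candidates, classifiers):
    """Keep candidates on which the preset classifiers disagree."""
    to_label = []
    for text in candidates:
        predictions = {clf.predict(text) for clf in classifiers}
        if len(predictions) > 1:  # assigned to different categories at once
            to_label.append(text)
    return to_label
```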
It should be understood that the step numbers in the above embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic and does not constitute any limitation on the implementation of the embodiments of this application.
In an embodiment, a text classification model training device is provided, corresponding one to one to the text classification model training method of the above embodiments. As shown in Figure 6, the device includes a primary model building module 61, a sample data acquisition module 62, an information entropy calculation module 63, a correlation calculation module 64, a to-be-labeled data selection module 65, a labeling module 66, a first model training module 67, and a second model training module 68. The functional modules are described in detail as follows:
The primary model building module 61 is configured to obtain first sample data with category marks from a preset sample library and build a primary classification model from the first sample data.
The sample data acquisition module 62 is configured to obtain second sample data without category marks from the preset sample library.
The information entropy calculation module 63 is configured to calculate the information entropy of each piece of second sample data to obtain its information entropy value.
The correlation calculation module 64 is configured to calculate the correlation value of each piece of second sample data according to the number of identical phrases the second sample data share.
The to-be-labeled data selection module 65 is configured to select the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below a preset correlation threshold as the data to be labeled.
The labeling module 66 is configured to label the data to be labeled by category according to a preset category labeling method to obtain third sample data.
The first model training module 67 is configured to train the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model.
The second model training module 68 is configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
Further, the primary model building module 61 includes:
a selection submodule 611, configured to select the first sample data with category marks from the preset sample library according to a preset sample selection method; and
a training submodule 612, configured to build the primary classification model by combining the category-marked first sample data with a preset training algorithm.
Further, the information entropy calculation module 63 includes:
an information entropy calculation submodule 631, configured to calculate the information entropy of each piece of second sample data according to the following formula:

H = -\sum_{x} p(x) \log p(x)

where H is the information entropy value of the second sample data, x ranges over the phrases in the second sample data, and p(x) is the frequency of occurrence of the phrase.
Further, the correlation calculation module 64 includes:
a word segmentation submodule 641, configured to apply word segmentation to each piece of second sample data to obtain N word sets, where N is the number of second samples;
a local correlation calculation submodule 642, configured to, for each piece of second sample data, compute the intersections between its word set and the word sets of the other N-1 second samples and, from the number of phrases in each intersection, determine the local correlation values between this second sample and the other N-1 second samples, yielding the N-1 local correlation values of this second sample; and
an average calculation submodule 643, configured to compute the average of the N-1 local correlation values of each second sample and take that average as the sample's correlation value.
Further, the to-be-labeled data selection module 65 includes:
a candidate sample selection submodule 651, configured to select the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold as candidate sample data;
a classification submodule 652, configured to classify the candidate sample data with at least two preset sample classifiers to obtain classification results; and
a labeling submodule 653, configured to select, from the classification results, the candidate sample data assigned to different categories at the same time as the data to be labeled.
For specific limitations on the text classification model training device, refer to the limitations on the text classification model training method above, which are not repeated here. Each module in the device can be implemented in whole or in part by software, hardware, or a combination of the two. The modules can be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In an embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Figure 7. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, computer-readable instructions, and a database, while the internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface communicates with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a text classification model training method. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media.
In an embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor. When the processor executes the computer-readable instructions, it implements the steps of the text classification model training method of the above embodiments, such as steps S1 to S8 shown in Figure 2; alternatively, it implements the functions of the modules/units of the text classification model training device of the above embodiments, such as the functions of modules 61 to 68 shown in Figure 6. To avoid repetition, these are not described again here.
In an embodiment, one or more readable storage media storing computer-readable instructions are provided; the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. Computer-readable instructions are stored on the readable storage media, and when executed by a processor, they implement the text classification model training method of the above method embodiments, or, when executed by one or more processors, they implement the functions of the modules/units of the text classification model training device of the above device embodiment. To avoid repetition, these are not described again here.
A person of ordinary skill in the art can understand that all or part of the processes of the above embodiment methods can be accomplished by instructing the relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium, and when executed may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example; in practical applications, the above functions can be allocated to different functional units and modules as needed, i.e. the internal structure of the device can be divided into different functional units or modules to accomplish all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and they should all be included within the scope of protection of this application.

Claims (20)

1. A text classification model training method, characterized in that the text classification model training method comprises:
obtaining first sample data with category marks from a preset sample library, and building a primary classification model based on the first sample data;
obtaining second sample data without the category marks from the preset sample library;
calculating the information entropy of each piece of the second sample data to obtain the information entropy value of each piece of the second sample data;
calculating the correlation value of each piece of the second sample data according to the number of identical phrases contained in the second sample data;
selecting the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below a preset correlation threshold as data to be labeled;
labeling the data to be labeled by category according to a preset category labeling method to obtain third sample data;
training the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model; and
training the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
2. The text classification model training method according to claim 1, characterized in that obtaining first sample data with category marks from a preset sample library and building a primary classification model based on the first sample data comprises:
selecting the first sample data with the category marks from the preset sample library according to a preset sample selection method; and
building the primary classification model by combining the first sample data with the category marks and a preset training algorithm.
3. The text classification model training method according to claim 1, characterized in that calculating the information entropy of each piece of the second sample data to obtain the information entropy value of each piece of the second sample data comprises:
calculating the information entropy of each piece of the second sample data according to the following formula:

H = -\sum_{x} p(x) \log p(x)

where H is the information entropy value of the second sample data, x ranges over the phrases in the second sample data, and p(x) is the frequency of occurrence of the phrase.
4. The text classification model training method according to claim 1, characterized in that calculating the correlation value of each piece of the second sample data according to the number of identical phrases contained in the second sample data comprises:
applying word segmentation to each piece of the second sample data to obtain N word sets, where N is the number of pieces of the second sample data;
for each piece of the second sample data, computing the intersections between its word set and the word sets of the other N-1 pieces of second sample data and, according to the number of phrases contained in each intersection, determining the local correlation values between this piece of second sample data and the other N-1 pieces, obtaining the N-1 local correlation values corresponding to this piece of second sample data; and
computing the average of the N-1 local correlation values corresponding to each piece of the second sample data and taking the average as the correlation value of each piece of the second sample data.
5. The text classification model training method according to claim 1, characterized in that selecting the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below the preset correlation threshold as data to be labeled comprises:
selecting the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold as candidate sample data;
classifying the candidate sample data with at least two preset sample classifiers to obtain classification results; and
selecting, from the classification results, the candidate sample data belonging to different categories at the same time as the data to be labeled.
6. A text classification model training device, characterized in that the text classification model training device comprises:
a primary model building module, configured to obtain first sample data with category marks from a preset sample library and build a primary classification model based on the first sample data;
a sample data acquisition module, configured to obtain second sample data without the category marks from the preset sample library;
an information entropy calculation module, configured to calculate the information entropy of each piece of the second sample data to obtain the information entropy value of each piece of the second sample data;
a correlation calculation module, configured to calculate the correlation value of each piece of the second sample data according to the number of identical phrases contained in the second sample data;
a to-be-labeled data selection module, configured to select the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below the preset correlation threshold as data to be labeled;
a labeling module, configured to label the data to be labeled by category according to a preset category labeling method to obtain third sample data;
a first model training module, configured to train the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model; and
a second model training module, configured to train the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
7. The text classification model training device according to claim 6, characterized in that the primary model building module comprises:
a selection submodule, configured to select the first sample data with the category marks from the preset sample library according to a preset sample selection method; and
a training submodule, configured to build the primary classification model by combining the first sample data with the category marks and a preset training algorithm.
8. The text classification model training device according to claim 6, characterized in that the information entropy calculation module comprises:
an information entropy calculation submodule, configured to calculate the information entropy of each piece of the second sample data according to the following formula:

H = -\sum_{x} p(x) \log p(x)

where H is the information entropy value of the second sample data, x ranges over the phrases in the second sample data, and p(x) is the frequency of occurrence of the phrase.
9. The text classification model training device according to claim 6, characterized in that the correlation calculation module comprises:
a word segmentation submodule, configured to apply word segmentation to each piece of the second sample data to obtain N word sets, where N is the number of pieces of the second sample data;
a local correlation calculation submodule, configured to, for each piece of the second sample data, compute the intersections between its word set and the word sets of the other N-1 pieces of second sample data and, according to the number of phrases contained in each intersection, determine the local correlation values between this piece of second sample data and the other N-1 pieces, obtaining the N-1 local correlation values corresponding to this piece of second sample data; and
an average calculation submodule, configured to compute the average of the N-1 local correlation values corresponding to each piece of the second sample data and take the average as the correlation value of each piece of the second sample data.
10. The text classification model training device according to claim 6, characterized in that the to-be-labeled data selection module comprises:
a candidate sample selection submodule, configured to select the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold as candidate sample data;
a classification submodule, configured to classify the candidate sample data with at least two preset sample classifiers to obtain classification results; and
a labeling submodule, configured to select, from the classification results, the candidate sample data belonging to different categories at the same time as the data to be labeled.
11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
obtaining first sample data with category marks from a preset sample library, and building a primary classification model based on the first sample data;
obtaining second sample data without the category marks from the preset sample library;
calculating the information entropy of each piece of the second sample data to obtain the information entropy value of each piece of the second sample data;
calculating the correlation value of each piece of the second sample data according to the number of identical phrases contained in the second sample data;
selecting the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below the preset correlation threshold as data to be labeled;
labeling the data to be labeled by category according to a preset category labeling method to obtain third sample data;
training the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model; and
training the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
12. The computer device according to claim 11, characterized in that obtaining first sample data with category marks from a preset sample library and building a primary classification model based on the first sample data comprises:
selecting the first sample data with the category marks from the preset sample library according to a preset sample selection method; and
building the primary classification model by combining the first sample data with the category marks and a preset training algorithm.
  13. 如权利要求11所述的计算机设备,其特征在于,所述计算每个所述第二样本数据的信息熵,得到每个所述第二样本数据的信息熵值,包括:The computer device according to claim 11, wherein said calculating the information entropy of each of said second sample data to obtain the information entropy value of each of said second sample data comprises:
    根据如下公式计算每个所述第二样本数据的信息熵:Calculate the information entropy of each of the second sample data according to the following formula:
    H = -Σ_x p(x) log p(x)
    where H denotes the information entropy value of the second sample data, x denotes a phrase in the second sample data, and p(x) denotes the frequency with which the phrase x occurs.
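A minimal sketch of this entropy computation follows, assuming whitespace tokenization in place of Chinese word segmentation and a base-2 logarithm (the claim does not fix the log base):

    import math
    from collections import Counter

    def information_entropy(text):
        words = text.split()  # stand-in for Chinese word segmentation
        total = len(words)
        counts = Counter(words)
        # p(x): frequency of phrase x within this one second sample datum
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(information_entropy("refund refund refund order"))   # low: repetitive
    print(information_entropy("please refund my last order"))  # higher: varied

A sample that repeats a few phrases yields a low entropy value, while a sample with many distinct phrases yields a high one, which is why the method treats high-entropy samples as the informative ones worth labeling.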
  14. The computer device according to claim 11, wherein calculating the correlation value of each second sample data according to the number of identical phrases shared among the second sample data comprises:
    performing word segmentation on each second sample data to obtain N word-segmentation sets, where N is the number of second sample data;
    for each second sample data, computing the intersection between its word-segmentation set and the word-segmentation set of each of the other N-1 second sample data, and determining, from the number of phrases in each intersection, the local correlation value between that second sample data and each of the other N-1 second sample data, thereby obtaining the N-1 local correlation values corresponding to that second sample data; and
    calculating the average of the N-1 local correlation values corresponding to each second sample data, and using the average as the correlation value of that second sample data.
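The correlation computation of claim 14 can be sketched as follows. Normalizing each intersection by the size of the union (Jaccard overlap) is an assumption; the claim only states that the local value is determined from the number of phrases in the intersection:

    samples = ["refund my order", "refund my payment", "reset my password"]

    def correlation_values(texts):
        token_sets = [set(t.split()) for t in texts]   # the N word-segmentation sets
        n = len(token_sets)
        result = []
        for i in range(n):
            local_values = []
            for j in range(n):
                if i == j:
                    continue
                shared = token_sets[i] & token_sets[j]            # intersection
                union = token_sets[i] | token_sets[j]
                local_values.append(len(shared) / len(union))     # one local value
            result.append(sum(local_values) / len(local_values))  # average of the N-1
        return result

    print(correlation_values(samples))

Samples with a low average overlap against the rest of the pool carry less redundant content, which is why the method selects low-correlation samples for labeling.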
  15. The computer device according to claim 11, wherein selecting, as the data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold comprises:
    selecting, as candidate sample data, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold;
    classifying the candidate sample data using at least two preset sample classifiers to obtain classification results; and
    selecting, from the classification results, the candidate sample data simultaneously assigned to different categories as the data to be labeled.
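A minimal sketch of this selection step follows. The threshold values, the entropy and correlation inputs, and the two-member committee of scikit-learn classifiers are all assumptions standing in for the claim's "preset" thresholds and "preset sample classifiers":

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB

    labeled_texts = ["refund my order", "reset my password"]
    labels = ["billing", "account"]
    unlabeled = ["charge appeared twice", "cannot log in"]
    entropy = {"charge appeared twice": 1.6, "cannot log in": 1.5}       # from claim 13
    correlation = {"charge appeared twice": 0.1, "cannot log in": 0.2}   # from claim 14
    ENTROPY_THRESHOLD, CORRELATION_THRESHOLD = 1.0, 0.5                  # assumed presets

    # Keep only high-entropy, low-correlation samples as candidates.
    candidates = [t for t in unlabeled
                  if entropy[t] > ENTROPY_THRESHOLD
                  and correlation[t] < CORRELATION_THRESHOLD]

    vec = CountVectorizer().fit(labeled_texts + unlabeled)
    X, Xc = vec.transform(labeled_texts), vec.transform(candidates)
    committee = [LogisticRegression(max_iter=1000).fit(X, labels),
                 MultinomialNB().fit(X, labels)]

    # Retain the candidates the two classifiers assign to different categories.
    preds = [clf.predict(Xc) for clf in committee]
    to_label = [t for t, a, b in zip(candidates, *preds) if a != b]
    print(to_label)

Candidates on which the classifiers disagree are the ones retained as data to be labeled, a query-by-committee style criterion for picking the most ambiguous samples.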
  16. One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring first sample data carrying category labels from a preset sample library, and building a primary classification model from the first sample data;
    acquiring second sample data without the category labels from the preset sample library;
    calculating the information entropy of each second sample data to obtain an information entropy value for each second sample data;
    calculating a correlation value for each second sample data according to the number of identical phrases shared among the second sample data;
    selecting, as data to be labeled, the second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is below a preset correlation threshold;
    labeling the data to be labeled by category according to a preset category labeling method to obtain third sample data;
    training the primary classification model with the third sample data according to a preset model training method to obtain an intermediate classification model; and
    training the intermediate classification model with the first sample data and the third sample data according to the preset model training method to obtain a text classification model.
  17. The readable storage medium according to claim 16, wherein acquiring the first sample data carrying category labels from the preset sample library and building the primary classification model from the first sample data comprises:
    selecting the first sample data carrying category labels from the preset sample library according to a preset sample selection method; and
    building the primary classification model by combining the first sample data carrying category labels with a preset training algorithm.
  18. The readable storage medium according to claim 16, wherein calculating the information entropy of each second sample data to obtain the information entropy value of each second sample data comprises:
    calculating the information entropy of each second sample data according to the following formula:
    H = -Σ_x p(x) log p(x)
    where H denotes the information entropy value of the second sample data, x denotes a phrase in the second sample data, and p(x) denotes the frequency with which the phrase x occurs.
  19. The readable storage medium according to claim 16, wherein calculating the correlation value of each second sample data according to the number of identical phrases shared among the second sample data comprises:
    performing word segmentation on each second sample data to obtain N word-segmentation sets, where N is the number of second sample data;
    for each second sample data, computing the intersection between its word-segmentation set and the word-segmentation set of each of the other N-1 second sample data, and determining, from the number of phrases in each intersection, the local correlation value between that second sample data and each of the other N-1 second sample data, thereby obtaining the N-1 local correlation values corresponding to that second sample data; and
    calculating the average of the N-1 local correlation values corresponding to each second sample data, and using the average as the correlation value of that second sample data.
  20. The readable storage medium according to claim 16, wherein selecting, as the data to be labeled, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold comprises:
    selecting, as candidate sample data, the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is below the preset correlation threshold;
    classifying the candidate sample data using at least two preset sample classifiers to obtain classification results; and
    selecting, from the classification results, the candidate sample data simultaneously assigned to different categories as the data to be labeled.
PCT/CN2019/117095 2019-03-29 2019-11-11 Text categorization model training method, apparatus, computer device, and storage medium WO2020199591A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910247846.8A CN110110080A (en) 2019-03-29 2019-03-29 Textual classification model training method, device, computer equipment and storage medium
CN201910247846.8 2019-03-29

Publications (1)

Publication Number Publication Date
WO2020199591A1

Family

ID=67484695

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117095 WO2020199591A1 (en) 2019-03-29 2019-11-11 Text categorization model training method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110110080A (en)
WO (1) WO2020199591A1 (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110080A (en) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Textual classification model training method, device, computer equipment and storage medium
CN112711940B (en) * 2019-10-08 2024-06-11 台达电子工业股份有限公司 Information processing system, information processing method and non-transitory computer readable recording medium
CN111026851B (en) * 2019-10-18 2023-09-15 平安科技(深圳)有限公司 Model prediction capability optimization method, device, equipment and readable storage medium
CN111159396B (en) * 2019-12-04 2022-04-22 中国电子科技集团公司第三十研究所 Method for establishing text data classification hierarchical model facing data sharing exchange
CN111081221B (en) * 2019-12-23 2022-10-14 合肥讯飞数码科技有限公司 Training data selection method and device, electronic equipment and computer storage medium
CN111143568A (en) * 2019-12-31 2020-05-12 郑州工程技术学院 Method, device and equipment for buffering during paper classification and storage medium
CN111382268B (en) * 2020-02-25 2023-12-01 北京小米松果电子有限公司 Text training data processing method, device and storage medium
CN111368515B (en) * 2020-03-02 2021-01-26 中国农业科学院农业信息研究所 Industry dynamic interactive report generation method and system based on PDF document fragmentation
CN111767400B (en) * 2020-06-30 2024-04-26 平安国际智慧城市科技股份有限公司 Training method and device for text classification model, computer equipment and storage medium
CN111914061B (en) * 2020-07-13 2021-04-16 上海乐言科技股份有限公司 Radius-based uncertainty sampling method and system for text classification active learning
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment
CN111881983B (en) * 2020-07-30 2024-05-28 平安科技(深圳)有限公司 Data processing method and device based on classification model, electronic equipment and medium
CN111881295A (en) * 2020-07-31 2020-11-03 中国光大银行股份有限公司 Text classification model training method and device and text labeling method and device
CN112069293B (en) * 2020-09-14 2024-04-19 上海明略人工智能(集团)有限公司 Data labeling method, device, electronic equipment and computer readable medium
CN112434736A (en) * 2020-11-24 2021-03-02 成都潜在人工智能科技有限公司 Deep active learning text classification method based on pre-training model
CN112651211A (en) * 2020-12-11 2021-04-13 北京大米科技有限公司 Label information determination method, device, server and storage medium
CN113239128B (en) * 2021-06-01 2022-03-18 平安科技(深圳)有限公司 Data pair classification method, device, equipment and storage medium based on implicit characteristics
CN113590822B (en) * 2021-07-28 2023-08-08 北京百度网讯科技有限公司 Method, device, equipment, storage medium and program product for processing document title
CN113761034B (en) * 2021-09-15 2022-06-17 深圳信息职业技术学院 Data processing method and device
CN117520836A (en) * 2022-07-29 2024-02-06 上海智臻智能网络科技股份有限公司 Training sample generation method, device, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7574409B2 (en) * 2004-11-04 2009-08-11 Vericept Corporation Method, apparatus, and system for clustering and classification
US9292797B2 (en) * 2012-12-14 2016-03-22 International Business Machines Corporation Semi-supervised data integration model for named entity classification
CN106131613B (en) * 2016-07-26 2019-10-01 深圳Tcl新技术有限公司 Smart television video sharing method and video sharing system
CN107025218B (en) * 2017-04-07 2021-03-02 腾讯科技(深圳)有限公司 Text duplicate removal method and device
CN108304427B (en) * 2017-04-28 2020-03-17 腾讯科技(深圳)有限公司 User passenger group classification method and device
CN107506793B (en) * 2017-08-21 2020-12-18 中国科学院重庆绿色智能技术研究院 Garment identification method and system based on weakly labeled image
CN108665158A (en) * 2018-05-08 2018-10-16 阿里巴巴集团控股有限公司 A kind of method, apparatus and equipment of trained air control model
CN109101997B (en) * 2018-07-11 2020-07-28 浙江理工大学 Traceability method for sampling limited active learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063642A (en) * 2010-12-30 2011-05-18 上海电机学院 Selection method for fuzzy neural network sample on basis of active learning
US20150379072A1 (en) * 2014-06-30 2015-12-31 Amazon Technologies, Inc. Input processing for machine learning
CN104166706A (en) * 2014-08-08 2014-11-26 苏州大学 Multi-label classifier constructing method based on cost-sensitive active learning
CN108090231A (en) * 2018-01-12 2018-05-29 北京理工大学 A kind of topic model optimization method based on comentropy
CN110110080A (en) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Textual classification model training method, device, computer equipment and storage medium

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348203A (en) * 2020-11-05 2021-02-09 中国平安人寿保险股份有限公司 Model training method and device, terminal device and storage medium
CN112528022A (en) * 2020-12-09 2021-03-19 广州摩翼信息科技有限公司 Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
CN112632219A (en) * 2020-12-17 2021-04-09 中国联合网络通信集团有限公司 Method and device for intercepting junk short messages
CN112632219B (en) * 2020-12-17 2022-10-04 中国联合网络通信集团有限公司 Method and device for intercepting junk short messages
CN112651447A (en) * 2020-12-29 2021-04-13 广东电网有限责任公司电力调度控制中心 Resource classification labeling method and system based on ontology
CN112651447B (en) * 2020-12-29 2023-09-26 广东电网有限责任公司电力调度控制中心 Ontology-based resource classification labeling method and system
CN112541595A (en) * 2020-12-30 2021-03-23 中国建设银行股份有限公司 Model construction method and device, storage medium and electronic equipment
CN112446441A (en) * 2021-02-01 2021-03-05 北京世纪好未来教育科技有限公司 Model training data screening method, device, equipment and storage medium
CN113793191A (en) * 2021-02-09 2021-12-14 京东科技控股股份有限公司 Commodity matching method and device and electronic equipment
CN113793191B (en) * 2021-02-09 2024-05-24 京东科技控股股份有限公司 Commodity matching method and device and electronic equipment
CN113190154A (en) * 2021-04-29 2021-07-30 北京百度网讯科技有限公司 Model training method, entry classification method, device, apparatus, storage medium, and program
CN113190154B (en) * 2021-04-29 2023-10-13 北京百度网讯科技有限公司 Model training and entry classification methods, apparatuses, devices, storage medium and program
CN113343695A (en) * 2021-05-27 2021-09-03 镁佳(北京)科技有限公司 Text labeling noise detection method and device, storage medium and electronic equipment
WO2023151488A1 (en) * 2022-02-11 2023-08-17 阿里巴巴(中国)有限公司 Model training method, training device, electronic device and computer-readable medium
CN114648980A (en) * 2022-03-03 2022-06-21 科大讯飞股份有限公司 Data classification and voice recognition method and device, electronic equipment and storage medium
CN115994225B (en) * 2023-03-20 2023-06-27 北京百分点科技集团股份有限公司 Text classification method and device, storage medium and electronic equipment
CN115994225A (en) * 2023-03-20 2023-04-21 北京百分点科技集团股份有限公司 Text classification method and device, storage medium and electronic equipment
CN116304058B (en) * 2023-04-27 2023-08-08 云账户技术(天津)有限公司 Method and device for identifying negative information of enterprise, electronic equipment and storage medium
CN116304058A (en) * 2023-04-27 2023-06-23 云账户技术(天津)有限公司 Method and device for identifying negative information of enterprise, electronic equipment and storage medium
CN117973522A (en) * 2024-04-02 2024-05-03 成都派沃特科技股份有限公司 Knowledge data training technology-based application model construction method and system
CN117973522B (en) * 2024-04-02 2024-06-04 成都派沃特科技股份有限公司 Knowledge data training technology-based application model construction method and system

Also Published As

Publication number Publication date
CN110110080A (en) 2019-08-09

Similar Documents

Publication Publication Date Title
WO2020199591A1 (en) Text categorization model training method, apparatus, computer device, and storage medium
CN109871446B (en) Refusing method in intention recognition, electronic device and storage medium
WO2020177230A1 (en) Medical data classification method and apparatus based on machine learning, and computer device and storage medium
CN104699763B (en) The text similarity gauging system of multiple features fusion
CN108536800B (en) Text classification method, system, computer device and storage medium
CN109063217B (en) Work order classification method and device in electric power marketing system and related equipment thereof
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN110633366B (en) Short text classification method, device and storage medium
CN108710894B (en) Active learning labeling method and device based on clustering representative points
CN111209738A (en) Multi-task named entity recognition method combining text classification
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
US20190188277A1 (en) Method and device for processing an electronic document
CN107844533A (en) A kind of intelligent Answer System and analysis method
WO2014085776A2 (en) Web search ranking
CN112270188B (en) Questioning type analysis path recommendation method, system and storage medium
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
CN111191031A (en) Entity relation classification method of unstructured text based on WordNet and IDF
CN110377618B (en) Method, device, computer equipment and storage medium for analyzing decision result
CN110377690B (en) Information acquisition method and system based on remote relationship extraction
US20220414099A1 (en) Using query logs to optimize execution of parametric queries
CN117474010A (en) Power grid language model-oriented power transmission and transformation equipment defect corpus construction method
Marconi et al. Hyperbolic manifold regression
Ronghui et al. Application of Improved Convolutional Neural Network in Text Classification.
CN113342964B (en) Recommendation type determination method and system based on mobile service
CN113987170A (en) Multi-label text classification method based on convolutional neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19923410

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19923410

Country of ref document: EP

Kind code of ref document: A1