WO2020199591A1 - Text classification model training method, apparatus, computer device, and storage medium - Google Patents
- Publication number
- WO2020199591A1 (PCT/CN2019/117095; CN2019117095W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sample data
- preset
- sample
- information entropy
- value
- Prior art date: 2019-03-29
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
Description
- This application relates to the field of information processing, in particular to text classification model training methods, devices, computer equipment and storage media.
- Text classification is an important application direction in the research field of natural language processing. Text classification refers to the use of a classifier to classify data documents containing text, so as to determine the category to which each document belongs, so that users can easily obtain the required documents.
- The classifier is also called a classification model, which is obtained by training classification criteria or model parameters with a large amount of sample data carrying category labels.
- The embodiments of the present application provide a text classification model training method, device, computer equipment, and storage medium, to solve the problems of a large number of training samples and a long training time in the text classification model training process.
- a text classification model training method including:
- the intermediate classification model is trained using the first sample data and the third sample data to obtain a text classification model.
- a text classification model training device including:
- the primary model establishment module is used to obtain first sample data with category marks from a preset sample library, and establish a primary classification model according to the first sample data;
- a sample data acquisition module configured to acquire second sample data without the category mark from the preset sample library
- An information entropy calculation module configured to calculate the information entropy of each of the second sample data to obtain the information entropy value of each of the second sample data;
- the correlation calculation module is configured to calculate the correlation value of each second sample data according to the number of the same phrase in the second sample data
- a data selection module to be labeled configured to select the second sample data whose information entropy value exceeds a preset information entropy threshold and the correlation value is lower than the preset relevance threshold as the data to be labeled;
- the labeling module is used to label the data to be labeled according to the preset category labeling method to obtain the third sample data;
- the first model training module is configured to use the third sample data to train the primary classification model according to a preset model training method to obtain an intermediate classification model;
- the second model training module is configured to use the first sample data and the third sample data to train the intermediate classification model according to the preset model training method to obtain a text classification model.
- a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
- the intermediate classification model is trained using the first sample data and the third sample data to obtain a text classification model.
- One or more readable storage media storing computer readable instructions, when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:
- the intermediate classification model is trained using the first sample data and the third sample data to obtain a text classification model.
- FIG. 1 is a schematic diagram of an application environment of a text classification model training method in an embodiment of the present application
- FIG. 2 is a flowchart of a text classification model training method in an embodiment of the present application;
- FIG. 3 is a flowchart of step S1 in the text classification model training method in an embodiment of the present application;
- FIG. 4 is a flowchart of step S4 in the text classification model training method in an embodiment of the present application;
- FIG. 5 is a flowchart of step S5 in a text classification model training method in an embodiment of the present application
- FIG. 6 is a schematic diagram of a text classification model training device in an embodiment of the present application.
- FIG. 7 is a schematic diagram of a computer device in an embodiment of the present application.
- the text classification model training method provided by this application can be applied to the application environment as shown in Figure 1.
- The server is a computer device for text classification model training, and it can be a single server or a server cluster; the preset sample library is the database that provides the training sample data, and it can be any of various relational or non-relational databases, such as MS-SQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, HBase, etc.; the server and the preset sample library are connected through a network, which can be a wired network or a wireless network.
- the text classification model training method provided by the embodiment of the application is applied to the server.
- a method for training a text classification model is provided.
- the specific implementation process includes the following steps:
- S1 Obtain first sample data with category marks from a preset sample library, and establish a primary classification model based on the first sample data.
- the preset sample library is a database that provides training sample data.
- the preset sample library can be deployed locally on the server or connected to the server through the network.
- the first sample data is text data with category marks.
- Text data is a text document containing text information, such as text on the Internet, news, the body of an e-mail, and so on.
- the category tag is a classification label for the text data, which is a classification restriction on the text data.
- category tags also include but are not limited to "science popularization”, “sports”, “inspirational”, “poetry prose”, etc., used to indicate the category of text data.
- the category mark and text data are stored in association, and each text data has a field indicating whether it has a category mark.
- the server can obtain the text data with the category mark as the first sample data through the SQL query statement.
- the primary classification model is a classification tool constructed based on the first sample data.
- the established primary classification model can roughly classify the sample data with class labels.
- the server can obtain the text feature information of the first sample data by performing feature analysis on the first sample data with the category tag, and then store the category tag and the text feature information as a primary classification model.
- For example, the server may perform word segmentation processing on the text in the first sample data, and use the high-frequency word segments as the text feature information.
- Word segmentation processing splits the text into individual words when processing text information.
- word segmentation is widely used in the fields of full-text retrieval and text content mining.
- the server can use a neural network-based training method to obtain the primary classification model based on the first sample data.
- the second sample data is text data without a category mark. That is, compared with the first sample data, the second sample data does not have a category label. If it is not manually labelled, the server does not know the text category to which the second sample data belongs or the meaning expressed.
- the server can obtain the second sample data from the preset sample library through the SQL query statement.
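- As an illustrative sketch only (the patent does not specify a database schema), the following Python snippet shows how such SQL queries might pull the two sample sets from the preset sample library; the table name samples and the columns text and category_label are hypothetical.

```python
import sqlite3  # any relational database driver could be used instead (MySQL, Oracle, etc.)

# Hypothetical schema: samples(id, text, category_label); category_label is NULL
# when the text data carries no category mark.
conn = sqlite3.connect("preset_sample_library.db")
cur = conn.cursor()

# First sample data: text data that already has a category mark.
cur.execute("SELECT text, category_label FROM samples WHERE category_label IS NOT NULL")
first_sample_data = cur.fetchall()

# Second sample data: text data without a category mark.
cur.execute("SELECT text FROM samples WHERE category_label IS NULL")
second_sample_data = [row[0] for row in cur.fetchall()]

conn.close()
```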
- Information entropy, proposed by Shannon, is a quantitative measure of the amount of information. The greater the information entropy, the richer the amount of information contained in the sample data, and the greater the uncertainty of the information.
- the value of information entropy is a specific quantitative value of information entropy.
- the server can determine the information entropy value according to how much text data is contained in the second sample data. For example, the number of characters in the second sample data is used as the information entropy value. Understandably, the amount of information contained in a 5000-word article is greater than the amount of information contained in an email body of only 20 words.
- the server calculates the number of characters in each second sample data, and uses the number of characters as the information entropy value of each second sample data.
- Alternatively, the server uses the number of word segments remaining after auxiliary words are removed from the second sample data as the information entropy value of the second sample data.
- the auxiliary words include but are not limited to "ba”, “um”, “de”, “le” and so on.
- the server performs word segmentation processing on the second sample data to obtain a word segmentation set, removes auxiliary words in the word segmentation set, and uses the remaining number of word segments as the information entropy value of the second sample data.
- S4 Calculate the correlation value of each second sample data according to the number of the same phrase in the second sample data.
- The correlation value of the second sample data reflects whether the information provided by the second sample data is repetitive and redundant. The higher the correlation value, the more repetitive and redundant the information provided by the second sample data; the lower the correlation value, the greater the difference in the information provided by the second sample data.
- the server determines the relevance value according to the number of identical phrases contained in the second sample data.
- For example, the second sample data A includes the phrases "culture", "civilization", and "history"; the second sample data B includes the phrases "culture", "country", and "history"; and the second sample data C includes the phrases "travel", "mountain", and "country".
- Then the correlation value of A and B is 2, the correlation value of A and C is 0, and the correlation value of B and C is 1.
- the correlation value of each second sample data can be determined by the cumulative sum of the correlation values of the second sample data and each other second sample data. That is, the correlation value of A is 2, the correlation value of B is 3, and the correlation value of C is 1.
- S5 Select the second sample data whose information entropy value exceeds the preset information entropy threshold and the correlation value is lower than the preset correlation threshold as the data to be labeled.
- the preset information entropy threshold and the preset relevance threshold are conditions for filtering the second sample data that does not have a category mark.
- the data to be labeled is the data obtained after filtering the second sample data according to the preset information entropy threshold and the preset relevance threshold.
- Second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is lower than the preset correlation threshold carries information that is highly uncertain and differs greatly from that of the other samples, which makes it the preferred data for training the model.
- For example, the server checks the information entropy value and correlation value of each second sample data, and takes the second sample data whose information entropy value is greater than 1000 and whose correlation value is lower than 100 as the data to be labeled.
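- A minimal sketch of this selection step follows, assuming the information entropy values and correlation values have already been computed for each second sample data and using the example thresholds of 1000 and 100 mentioned above; the variable names are illustrative.

```python
ENTROPY_THRESHOLD = 1000      # preset information entropy threshold (example value)
CORRELATION_THRESHOLD = 100   # preset correlation threshold (example value)

# entropy_values[i] and correlation_values[i] correspond to second_sample_data[i]
data_to_label = [
    sample
    for sample, h, r in zip(second_sample_data, entropy_values, correlation_values)
    if h > ENTROPY_THRESHOLD and r < CORRELATION_THRESHOLD
]
```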
- S6 Perform category labeling on the data to be labeled according to the preset category labeling method to obtain the third sample data.
- the category labeling is a process of labeling the second sample data that does not have a category label, so that the second sample data has a corresponding category label. For example, label an article by category, and add tags such as "fiction” and "suspense" that reflect the content of the subject.
- the data obtained after category labeling is the third sample data.
- the preset category labeling method means that the server can use multiple labeling methods to label the second sample data.
- For example, the server can extract the keywords of the second sample data, that is, use the five words with the highest word frequency as keywords; these keywords are then compared with the target keywords in a preset category tag thesaurus, and if a keyword matches a target keyword, the corresponding target keyword is used to label the second sample data, thereby obtaining the third sample data.
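- The keyword-based labeling described above might be sketched as follows; the segmented input text and the preset category tag thesaurus are placeholders, since the patent fixes neither.

```python
from collections import Counter

def label_by_keywords(segmented_text, category_tag_thesaurus, top_n=5):
    """segmented_text: list of word segments for one second sample data.
    category_tag_thesaurus: dict mapping a target keyword to its category tag."""
    # Use the top_n words with the highest word frequency as keywords.
    keywords = [word for word, _ in Counter(segmented_text).most_common(top_n)]
    # Compare the keywords with the target keywords in the preset thesaurus.
    for keyword in keywords:
        if keyword in category_tag_thesaurus:
            return category_tag_thesaurus[keyword]  # category mark for this sample
    return None  # no match: fall back to another labeling method (e.g. expert system)
```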
- Alternatively, the server can directly call a third-party expert system for labeling through an API (Application Programming Interface): the second sample data is input to the third-party expert system to obtain the category mark corresponding to the second sample data, thereby obtaining the third sample data.
- S7 According to the preset model training method, use the third sample data to train the primary classification model to obtain the intermediate classification model.
- the intermediate classification model is a classification model obtained after training with the third sample data on the basis of the primary classification model.
- the difference between the intermediate classification model and the primary classification model is that the training set of the intermediate classification model is the third sample data that has a category label, and the information entropy value and the correlation value meet certain conditions.
- the preset model training method is that the server uses the third sample data as training data, and uses multiple frameworks or algorithms to train the primary classification model.
- the server can use existing machine learning frameworks or tools, such as Scikit-Learn, TensorFlow, etc.
- Scikit-Learn referred to as sklearn
- Sklearn has built-in classification algorithms such as the naive Bayes algorithm, decision tree algorithm, and random forest algorithm; commonly used machine learning tasks such as data preprocessing, classification, regression, dimensionality reduction, and model selection can be implemented with sklearn.
- TensorFlow is an open-source software library for numerical computation originally developed by researchers and engineers of the Google Brain team (part of Google's Machine Intelligence research organization). It can be used for research on machine learning and deep neural networks, but the versatility of the system means it is also widely used in other computing fields.
- the server uses the third sample data as input data and calls the built-in training method in sklearn until the model tends to converge, and then the intermediate classification model can be obtained.
- S8 According to the preset model training method, use the first sample data and the third sample data to train the intermediate classification model to obtain the text classification model.
- the text classification model is the final classification model obtained after retraining the intermediate classification model.
- the preset model training method adopted by the server is the same as the training process of step S7, and will not be repeated here.
- The difference from the training process of step S7 is that the first sample data and the third sample data are used together to train the intermediate classification model, that is, the intermediate classification model is iteratively trained with category-labeled sample data to improve its classification accuracy.
- the server takes the first sample data and the third sample data as input data, and calls the built-in training method in sklearn until the model tends to converge, and the text classification model can be obtained.
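- The two training passes of steps S7 and S8 could look like the sketch below; vectorizing with TF-IDF and classifying with naive Bayes are illustrative choices (the patent only requires "a preset model training method"), the variables first_texts/first_labels and third_texts/third_labels are assumed lists of texts and their category marks, and each pass here simply fits a fresh pipeline on the corresponding data rather than incrementally updating the previous model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step S7: train on the third sample data to obtain the intermediate classification model.
intermediate_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
intermediate_model.fit(third_texts, third_labels)

# Step S8: retrain on the first and third sample data together to obtain the
# final text classification model.
text_classification_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_classification_model.fit(first_texts + third_texts, first_labels + third_labels)
```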
- In this embodiment, the first sample data with category labels is obtained from the preset sample library and the primary classification model is established according to the first sample data, that is, only a small part of the sample data with category labels is used for training, which reduces the demand for sample data with category marks and saves training costs; the second sample data without category marks is obtained from the preset sample library; the information entropy and correlation of the second sample data are calculated, and category labeling is performed on the second sample data whose information entropy value and correlation value meet the preset conditions; according to the preset model training method, the labeled third sample data is used to train the primary classification model to obtain the intermediate classification model, and because the third sample data has large information entropy, small correlation with each other, and category labels, this optimizes the classification accuracy of the primary classification model; finally, the intermediate classification model is trained with the first sample data and the third sample data to obtain the text classification model, that is, the final text classification model is obtained through step-by-step iterative optimization.
- Further, in step S1, that is, obtaining the first sample data with category tags from the preset sample library and establishing the primary classification model according to the first sample data, the following steps are specifically included:
- S11 Select the first sample data with the category mark from the preset sample library according to the preset sample selection method.
- the preset sample selection method is to select a certain number of representative first sample data with category marks from the preset sample library. Among them, the number is as small as possible to reduce the demand for sample data; at the same time, the first sample selected should cover the text data category as much as possible. For example, for the selection of news text data, try to cover categories such as "politics”, “business”, “sports”, “style and entertainment”.
- For example, assuming the preset sample library contains 3000 articles, the server can select 30% of them, that is, 900 articles, and from those 900 articles select 5 representative articles for each text data category as the first sample data.
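- A sketch of this selection, assuming each library entry already carries its category mark and using the 30% / 5-per-category figures from the example above; the function and variable names are illustrative.

```python
import random
from collections import defaultdict

def select_first_sample_data(labeled_entries, ratio=0.3, per_category=5):
    """labeled_entries: list of (text, category_label) tuples from the sample library."""
    # Draw a small pool, e.g. 900 of 3000 articles.
    pool = random.sample(labeled_entries, int(len(labeled_entries) * ratio))
    by_category = defaultdict(list)
    for text, label in pool:
        by_category[label].append((text, label))
    first_sample_data = []
    for label, entries in by_category.items():
        # Keep a handful of representative articles per text data category.
        first_sample_data.extend(entries[:per_category])
    return first_sample_data
```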
- S12 Establish a primary classification model by combining the first sample data with the category mark and the preset training algorithm.
- The preset training algorithm includes the various algorithms used to train models in machine learning.
- the process in which the server uses the first sample data with category labels to establish the primary classification model belongs to the supervised learning mode.
- supervised learning is to train to obtain an optimal model through existing training samples, that is, known data and its corresponding output.
- This model belongs to a set of certain functions, and optimal means that it is the best under certain evaluation criteria.
- the server can import the naive Bayes function from the sklearn library, and then call MultinomialNB().fit() for training.
- the server can use the Joblib library to realize the function of saving training data.
- Joblib is a part of the SciPy ecology and provides tools for pipelined python work.
- the server can call the function of the pickle library to save the primary classification model.
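- A brief sketch of the training and persistence calls mentioned above; the CountVectorizer feature step, the variable names, and the file names are assumptions rather than part of the patent.

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(first_sample_texts)          # first sample data
primary_model = MultinomialNB().fit(X_train, first_sample_labels)

# Save the primary classification model (and its vectorizer) with Joblib.
joblib.dump((vectorizer, primary_model), "primary_classification_model.joblib")

# The pickle library could be used instead:
# import pickle
# with open("primary_classification_model.pkl", "wb") as f:
#     pickle.dump((vectorizer, primary_model), f)
```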
- In this embodiment, the server selects first sample data that is as small in quantity as possible while covering as many sample data categories as possible; the primary classification model is then established in combination with the preset training algorithm, so that the demand for sample data is kept as small as possible to further reduce training costs, and at the same time, because the first sample data has wide coverage, the recognizable range of the primary classification model is wider.
- Further, in step S3, that is, calculating the information entropy of each second sample data to obtain the information entropy value of each second sample data, the information entropy is specifically calculated according to the following formula: H = -∑ p(x)·log p(x), summed over all phrases x, where:
- H represents the information entropy value of the second sample data
- x represents the phrase in the second sample data
- p(x) represents the frequency of occurrence of the phrase.
- the phrases in the second sample data are words obtained after the server performs word segmentation processing on the second sample data.
- the frequency of the phrase that is, the number of times the phrase appears in the second sample data.
- the server first performs word segmentation processing on each second sample data to obtain a word segmentation set; then, substituting the frequency of all word segmentation in the word segmentation set into the formula, the information entropy value of the second sample data can be obtained.
- the server calculates the information entropy of the second sample data according to the Shannon formula and the word frequency of the phrase in the second sample data, so that the quantification of the amount of information contained in the sample data is more accurate.
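- A minimal sketch of this calculation, assuming the second sample data has already been segmented into phrases; log base 2 is used here, one common convention that the patent leaves unspecified.

```python
import math
from collections import Counter

def information_entropy(word_segments):
    """word_segments: list of phrases obtained from one second sample data."""
    counts = Counter(word_segments)
    total = len(word_segments)
    # p(x): relative frequency of each phrase in the second sample data.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: information_entropy(["bank", "borrow", "bank", "interest"])
```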
- Further, in step S4, that is, calculating the correlation value of each second sample data according to the number of identical phrases in the second sample data, the following steps are specifically included:
- S41 Perform word segmentation processing on each second sample data to obtain N word segmentation sets, where N is the number of second sample data.
- the server can use multiple methods to perform word segmentation processing. For example, a regular expression is used to segment the second sample data to obtain a set consisting of several word segmentation, that is, a word segmentation set. Understandably, there is a one-to-one correspondence between the number of second sample data and the number of word segmentation sets.
- A regular expression (Regular Expression) is a processing method used to retrieve or replace target text in a context.
- For example, the server can use the regular expression engine built into Perl or Python to segment the second sample data; alternatively, the server can segment the second sample data with the grep tool that comes with Unix systems to obtain a set containing several word segments.
- grep (Globally search a Regular Expression and Print) is a powerful text search tool.
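- For whitespace- or punctuation-delimited text, a simple regular-expression segmentation might look like the sketch below; segmenting Chinese text would normally need a dedicated tokenizer, so this is only illustrative.

```python
import re

def segment(text):
    """Return the word segmentation set of one second sample data."""
    # \w+ matches runs of word characters; adjust the pattern to the language at hand.
    return set(re.findall(r"\w+", text.lower()))

word_segmentation_sets = [segment(text) for text in second_sample_data]
```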
- S42 For each second sample data, calculate the intersection between its word segmentation set and the word segmentation sets of the other N-1 second sample data, and determine, according to the number of phrases contained in each intersection, the local correlation value between the second sample data and each of the other N-1 second sample data, obtaining the N-1 local correlation values corresponding to that second sample data.
- The local correlation value represents the degree of correlation between one second sample data and another second sample data.
- For example, the word segmentation set a is represented as {"people", "interest", "bank", "borrow"}, the word segmentation set b is represented as {"bank", "borrow", "income"}, and the word segmentation set c is represented as {"meeting", "report", "income"}.
- The intersection of the word segmentation sets a and b is {"bank", "borrow"}, the number of phrases contained in the intersection is 2, so the local correlation value of the word segmentation sets a and b is 2.
- Similarly, the local correlation value of the word segmentation sets a and c is 0, and the local correlation value of the word segmentation sets b and c is 1.
- S43 Calculate the average value of the N-1 local correlation values corresponding to each second sample data, and use the average value as the correlation value of each second sample data.
- For example, the correlation value of the second sample data corresponding to the word segmentation set a is the average of the local correlation values of the word segmentation sets a and b and of the word segmentation sets a and c, that is, (2 + 0) / 2 = 1.
- the correlation values of the second sample data corresponding to the word segmentation sets b and c are 1.5 and 0.5, respectively.
- In this embodiment, the server performs word segmentation processing on the second sample data, determines the local correlation values between the second sample data from the intersections of the word segmentation sets, and averages the local correlation values to obtain the correlation value of each second sample data, so that the correlation value more accurately reflects the degree of correlation between the second sample data.
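- Steps S41 to S43 might be sketched as follows, reusing the word segmentation sets produced by the previous snippet; the function name is illustrative.

```python
def correlation_values(word_segmentation_sets):
    """Return one correlation value per second sample data (steps S42 and S43)."""
    n = len(word_segmentation_sets)
    values = []
    for i, current in enumerate(word_segmentation_sets):
        # Local correlation value: size of the intersection with each other set.
        local = [len(current & other)
                 for j, other in enumerate(word_segmentation_sets) if j != i]
        # Correlation value: average of the N-1 local correlation values.
        values.append(sum(local) / (n - 1))
    return values

# With the sets a, b, c from the example above, this returns [1.0, 1.5, 0.5].
```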
- Further, in step S5, that is, selecting the second sample data whose information entropy value exceeds the preset information entropy threshold and whose correlation value is lower than the preset correlation threshold as the data to be labeled, the following steps are specifically included:
- S51 Select the second sample data whose information entropy value exceeds the preset information entropy threshold and the correlation value is lower than the preset correlation threshold as candidate sample data.
- the server re-screens the second sample data that meets the specific conditions, which not only reduces the number of training samples, but also finds sample data that is difficult to identify by ordinary classifiers.
- the specific condition means that the information entropy value exceeds the preset information entropy threshold, and the correlation value is lower than the preset correlation threshold.
- S52 Use at least two preset sample classifiers to classify the candidate sample data to obtain a classification result.
- The preset sample classifiers are existing text classification models, for example, the common FastText and Text-CNN models.
- FastText is a word vector and text classification tool open sourced by Facebook, and its typical application scenario is "supervised text classification problem". It provides a simple and efficient method for text classification and characterization learning, with performance comparable to deep learning and faster.
- TextCNN is an algorithm that uses convolutional neural networks to classify text. Because of its simple structure and good effect, it is widely used in the field of text classification.
- Different preset sample classifiers may have different results for classifying the same sample data. That is, after the same sample data is classified by different classification models such as FastText and Text-CNN, it may be recognized as different categories.
- the classification result includes the category to which each candidate sample data belongs.
- S53 Select candidate sample data belonging to different categories at the same time from the classification result as the data to be labeled.
- Candidate sample data belonging to different categories at the same time means that different preset classifiers give different recognition results for the same candidate sample data.
- For example, an article is recognized as "historical" by FastText and at the same time as "literary and artistic" by Text-CNN, which means that the article is difficult to recognize, or difficult to simply assign to a single category.
- the server determines whether the candidate sample data belongs to different categories at the same time according to the category to which the candidate sample data in the classification result belongs.
- In this embodiment, the server screens the second sample data that meets the specific conditions with different preset classifiers and picks out the second sample data that is difficult to identify as the data to be labeled, which removes the simple, easy-to-identify sample data and further reduces the number of training samples and the training time, improving training efficiency; at the same time, selecting the sample data that is not easy to identify as the data to be labeled means that classifying these data to be labeled helps improve the accuracy of model training.
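- The disagreement-based selection of steps S51 to S53 might be sketched as follows; the two classifier objects are placeholders standing in for pretrained models such as FastText and Text-CNN, which the patent names only as examples, and each is assumed to expose a predict() method.

```python
def select_data_to_label(candidate_samples, classifier_a, classifier_b):
    """Keep the candidates that the two preset sample classifiers assign to
    different categories (step S53)."""
    data_to_label = []
    for sample in candidate_samples:
        category_a = classifier_a.predict([sample])[0]
        category_b = classifier_b.predict([sample])[0]
        if category_a != category_b:   # classifiers disagree: hard-to-identify sample
            data_to_label.append(sample)
    return data_to_label
```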
- a text classification model training device is provided, and the text classification model training device corresponds to the text classification model training method in the above-mentioned embodiment one-to-one.
- The text classification model training device includes a primary model building module 61, a sample data acquisition module 62, an information entropy calculation module 63, a correlation calculation module 64, a to-be-labeled data selection module 65, a labeling module 66, a first model training module 67, and a second model training module 68.
- the detailed description of each functional module is as follows:
- the primary model establishment module 61 is configured to obtain the first sample data with category marks from the preset sample library, and establish a primary classification model according to the first sample data;
- the sample data acquisition module 62 is configured to acquire second sample data without a category mark from a preset sample library
- the information entropy calculation module 63 is configured to calculate the information entropy of each second sample data to obtain the information entropy value of each second sample data;
- the correlation calculation module 64 is configured to calculate the correlation value of each second sample data according to the number of the same phrase in the second sample data
- the to-be-labeled data selection module 65 is configured to select the second sample data whose information entropy value exceeds a preset information entropy threshold and whose relevance value is lower than the preset relevance threshold as the data to be labeled;
- the labeling module 66 is configured to perform category labeling on the data to be labeled according to a preset category labeling method to obtain third sample data;
- the first model training module 67 is configured to use the third sample data to train the primary classification model according to the preset model training method to obtain the intermediate classification model;
- the second model training module 68 is configured to use the first sample data and the third sample data to train the intermediate classification model according to a preset model training method to obtain a text classification model.
- the primary model establishment module 61 includes:
- the selection sub-module 611 is used to select the first sample data with the category mark from the preset sample library according to the preset sample selection method;
- the training sub-module 612 is used to establish a primary classification model by combining the first sample data with category labels and a preset training algorithm.
- the information entropy calculation module 63 includes
- the information entropy calculation sub-module 631 is configured to calculate the information entropy of each second sample data according to the following formula: H = -∑ p(x)·log p(x), summed over all phrases x, where:
- H represents the information entropy value of the second sample data
- x represents the phrase in the second sample data
- p(x) represents the frequency of occurrence of the phrase.
- the correlation calculation module 64 includes:
- the word segmentation sub-module 641 is used to perform word segmentation processing on each second sample data to obtain N word segmentation sets, where N is the number of second sample data;
- the local correlation calculation sub-module 642 is configured to, for each second sample data, calculate the intersection between the word segmentation set of the second sample data and the word segmentation sets of the other N-1 second sample data, determine the local correlation value between the second sample data and each of the other N-1 second sample data according to the number of phrases contained in each intersection, and obtain the N-1 local correlation values corresponding to the second sample data;
- the average value calculation sub-module 643 is used to calculate the average value of N-1 local correlation values corresponding to each second sample data, and use the average value as the correlation value of each second sample data.
- the data selection module 65 to be labeled includes:
- the candidate sample selection submodule 651 is configured to select second sample data whose information entropy value exceeds a preset information entropy threshold and whose correlation value is lower than the preset correlation threshold as candidate sample data;
- the classification sub-module 652 is configured to classify candidate sample data by using at least two preset sample classifiers to obtain a classification result
- the labeling submodule 653 is used to select candidate sample data belonging to different categories at the same time from the classification result as the data to be labeled.
- Each module in the text classification model training device described above can be implemented in whole or in part by software, hardware, and a combination thereof.
- the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
- a computer device is provided.
- the computer device may be a server, and its internal structure diagram may be as shown in FIG. 7.
- the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus.
- the processor of the computer device is used to provide calculation and control capabilities.
- the memory of the computer device includes a non-volatile storage medium and an internal memory.
- the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
- the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer-readable instructions are executed by the processor to realize a text classification model training method.
- the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
- a computer device including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor.
- When the processor executes the computer-readable instructions, the steps of the text classification model training method in the above-mentioned embodiment are implemented, for example, steps S1 to S8 shown in FIG. 2.
- Alternatively, when the processor executes the computer-readable instructions, the functions of the modules/units of the text classification model training device in the above-mentioned embodiment are realized, for example, the functions of modules 61 to 68 shown in FIG. 6. To avoid repetition, details are not repeated here.
- one or more readable storage media storing computer readable instructions are provided.
- The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media.
- The readable storage medium stores computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the text classification model training method in the above method embodiment is implemented, or the functions of the modules/units of the text classification model training device in the above device embodiment are realized.
- Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory may include random access memory (RAM) or external cache memory.
- RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), Rambus dynamic RAM (RDRAM), and so on.
Abstract
The present application relates to a text classification model training method, apparatus, computer device, and storage medium. The method includes: obtaining, from a preset sample library, first sample data with a category label and second sample data without a category label; establishing a primary classification model according to the first sample data; calculating an information entropy value and a correlation value of the second sample data; labeling, according to a preset category labeling method, the second sample data whose information entropy value and correlation value meet preset conditions to obtain third sample data; training the primary classification model with the third sample data to obtain an intermediate classification model; and training the intermediate classification model with the first sample data and the third sample data to obtain a text classification model. The technical solution of the present application solves the problem, in text classification model training, that the training sample size is large and the training time is long.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910247846.8A CN110110080A (zh) | 2019-03-29 | 2019-03-29 | 文本分类模型训练方法、装置、计算机设备及存储介质 |
CN201910247846.8 | 2019-03-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020199591A1 true WO2020199591A1 (fr) | 2020-10-08 |
Family
ID=67484695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/117095 WO2020199591A1 (fr) | 2019-03-29 | 2019-11-11 | Procédé, appareil, dispositif informatique, et support d'informations d'entraînement de modèles de catégorisation de textes |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110110080A (fr) |
WO (1) | WO2020199591A1 (fr) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110080A (zh) * | 2019-03-29 | 2019-08-09 | 平安科技(深圳)有限公司 | 文本分类模型训练方法、装置、计算机设备及存储介质 |
CN112711940B (zh) * | 2019-10-08 | 2024-06-11 | 台达电子工业股份有限公司 | 信息处理系统、信息处理法及非暂态电脑可读取记录媒体 |
CN111026851B (zh) * | 2019-10-18 | 2023-09-15 | 平安科技(深圳)有限公司 | 模型预测能力优化方法、装置、设备及可读存储介质 |
CN111159396B (zh) * | 2019-12-04 | 2022-04-22 | 中国电子科技集团公司第三十研究所 | 面向数据共享交换的文本数据分类分级模型的建立方法 |
CN111081221B (zh) * | 2019-12-23 | 2022-10-14 | 合肥讯飞数码科技有限公司 | 训练数据选择方法、装置、电子设备及计算机存储介质 |
CN111143568A (zh) * | 2019-12-31 | 2020-05-12 | 郑州工程技术学院 | 一种论文分类时的缓冲方法、装置、设备及存储介质 |
CN111382268B (zh) * | 2020-02-25 | 2023-12-01 | 北京小米松果电子有限公司 | 文本训练数据处理方法、装置及存储介质 |
CN111368515B (zh) * | 2020-03-02 | 2021-01-26 | 中国农业科学院农业信息研究所 | 基于pdf文档碎片化的行业动态交互式报告生成方法及系统 |
CN111767400B (zh) * | 2020-06-30 | 2024-04-26 | 平安国际智慧城市科技股份有限公司 | 文本分类模型的训练方法、装置、计算机设备和存储介质 |
CN111914061B (zh) * | 2020-07-13 | 2021-04-16 | 上海乐言科技股份有限公司 | 文本分类主动学习的基于半径的不确定度采样方法和系统 |
CN112036166A (zh) * | 2020-07-22 | 2020-12-04 | 大箴(杭州)科技有限公司 | 一种数据标注方法、装置、存储介质及计算机设备 |
CN111881983B (zh) * | 2020-07-30 | 2024-05-28 | 平安科技(深圳)有限公司 | 基于分类模型的数据处理方法、装置、电子设备及介质 |
CN111881295B (zh) * | 2020-07-31 | 2024-08-02 | 中国光大银行股份有限公司 | 文本分类模型训练方法及装置、文本标注方法及装置 |
CN112069293B (zh) * | 2020-09-14 | 2024-04-19 | 上海明略人工智能(集团)有限公司 | 一种数据标注方法、装置、电子设备和计算机可读介质 |
CN112434736B (zh) * | 2020-11-24 | 2024-08-02 | 成都潜在人工智能科技有限公司 | 一种基于预训练模型的深度主动学习文本分类方法 |
CN112651211A (zh) * | 2020-12-11 | 2021-04-13 | 北京大米科技有限公司 | 标签信息确定方法、装置、服务器及存储介质 |
CN113239128B (zh) * | 2021-06-01 | 2022-03-18 | 平安科技(深圳)有限公司 | 基于隐式特征的数据对分类方法、装置、设备和存储介质 |
CN113590822B (zh) * | 2021-07-28 | 2023-08-08 | 北京百度网讯科技有限公司 | 文档标题的处理方法、装置、设备、存储介质及程序产品 |
CN113761034B (zh) * | 2021-09-15 | 2022-06-17 | 深圳信息职业技术学院 | 一种数据处理方法及其装置 |
CN114417882A (zh) * | 2022-01-04 | 2022-04-29 | 马上消费金融股份有限公司 | 一种数据标注方法、装置、电子设备及可读存储介质 |
CN117520836A (zh) * | 2022-07-29 | 2024-02-06 | 上海智臻智能网络科技股份有限公司 | 训练样本的生成方法、装置、设备和存储介质 |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7574409B2 (en) * | 2004-11-04 | 2009-08-11 | Vericept Corporation | Method, apparatus, and system for clustering and classification |
US9292797B2 (en) * | 2012-12-14 | 2016-03-22 | International Business Machines Corporation | Semi-supervised data integration model for named entity classification |
CN106131613B (zh) * | 2016-07-26 | 2019-10-01 | 深圳Tcl新技术有限公司 | 智能电视视频分享方法及视频分享系统 |
CN107025218B (zh) * | 2017-04-07 | 2021-03-02 | 腾讯科技(深圳)有限公司 | 一种文本去重方法和装置 |
CN108304427B (zh) * | 2017-04-28 | 2020-03-17 | 腾讯科技(深圳)有限公司 | 一种用户客群分类方法和装置 |
CN107506793B (zh) * | 2017-08-21 | 2020-12-18 | 中国科学院重庆绿色智能技术研究院 | 基于弱标注图像的服装识别方法及系统 |
CN108665158A (zh) * | 2018-05-08 | 2018-10-16 | 阿里巴巴集团控股有限公司 | 一种训练风控模型的方法、装置及设备 |
CN109101997B (zh) * | 2018-07-11 | 2020-07-28 | 浙江理工大学 | 一种采样受限主动学习的溯源方法 |
-
2019
- 2019-03-29 CN CN201910247846.8A patent/CN110110080A/zh active Pending
- 2019-11-11 WO PCT/CN2019/117095 patent/WO2020199591A1/fr active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063642A (zh) * | 2010-12-30 | 2011-05-18 | 上海电机学院 | 基于主动学习的模糊神经网络样本选择方法 |
US20150379072A1 (en) * | 2014-06-30 | 2015-12-31 | Amazon Technologies, Inc. | Input processing for machine learning |
CN104166706A (zh) * | 2014-08-08 | 2014-11-26 | 苏州大学 | 基于代价敏感主动学习的多标签分类器构建方法 |
CN108090231A (zh) * | 2018-01-12 | 2018-05-29 | 北京理工大学 | 一种基于信息熵的主题模型优化方法 |
CN110110080A (zh) * | 2019-03-29 | 2019-08-09 | 平安科技(深圳)有限公司 | 文本分类模型训练方法、装置、计算机设备及存储介质 |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348203A (zh) * | 2020-11-05 | 2021-02-09 | 中国平安人寿保险股份有限公司 | 模型训练方法、装置、终端设备及存储介质 |
CN112528022A (zh) * | 2020-12-09 | 2021-03-19 | 广州摩翼信息科技有限公司 | 主题类别对应的特征词提取和文本主题类别识别方法 |
CN112633344A (zh) * | 2020-12-16 | 2021-04-09 | 中国平安财产保险股份有限公司 | 质检模型的训练方法、装置、设备及可读存储介质 |
CN112633344B (zh) * | 2020-12-16 | 2024-10-22 | 中国平安财产保险股份有限公司 | 质检模型的训练方法、装置、设备及可读存储介质 |
CN112632219A (zh) * | 2020-12-17 | 2021-04-09 | 中国联合网络通信集团有限公司 | 一种垃圾短信的拦截方法和拦截装置 |
CN112632219B (zh) * | 2020-12-17 | 2022-10-04 | 中国联合网络通信集团有限公司 | 一种垃圾短信的拦截方法和拦截装置 |
CN112651447B (zh) * | 2020-12-29 | 2023-09-26 | 广东电网有限责任公司电力调度控制中心 | 一种基于本体的资源分类标注方法及系统 |
CN112651447A (zh) * | 2020-12-29 | 2021-04-13 | 广东电网有限责任公司电力调度控制中心 | 一种基于本体的资源分类标注方法及系统 |
CN112541595A (zh) * | 2020-12-30 | 2021-03-23 | 中国建设银行股份有限公司 | 模型构建方法及装置、存储介质及电子设备 |
CN112446441A (zh) * | 2021-02-01 | 2021-03-05 | 北京世纪好未来教育科技有限公司 | 模型训练数据筛选方法、装置、设备及存储介质 |
CN113793191B (zh) * | 2021-02-09 | 2024-05-24 | 京东科技控股股份有限公司 | 商品的匹配方法、装置及电子设备 |
CN113793191A (zh) * | 2021-02-09 | 2021-12-14 | 京东科技控股股份有限公司 | 商品的匹配方法、装置及电子设备 |
CN113190154B (zh) * | 2021-04-29 | 2023-10-13 | 北京百度网讯科技有限公司 | 模型训练、词条分类方法、装置、设备、存储介质及程序 |
CN113190154A (zh) * | 2021-04-29 | 2021-07-30 | 北京百度网讯科技有限公司 | 模型训练、词条分类方法、装置、设备、存储介质及程序 |
CN113343695A (zh) * | 2021-05-27 | 2021-09-03 | 镁佳(北京)科技有限公司 | 一种文本标注噪声检测方法、装置、存储介质及电子设备 |
WO2023151488A1 (fr) * | 2022-02-11 | 2023-08-17 | 阿里巴巴(中国)有限公司 | Procédé d'entraînement de modèle, dispositif d'entraînement, dispositif électronique et support lisible par ordinateur |
CN114648980A (zh) * | 2022-03-03 | 2022-06-21 | 科大讯飞股份有限公司 | 数据分类和语音识别方法、装置、电子设备及存储介质 |
CN115994225B (zh) * | 2023-03-20 | 2023-06-27 | 北京百分点科技集团股份有限公司 | 文本的分类方法、装置、存储介质及电子设备 |
CN115994225A (zh) * | 2023-03-20 | 2023-04-21 | 北京百分点科技集团股份有限公司 | 文本的分类方法、装置、存储介质及电子设备 |
CN116304058A (zh) * | 2023-04-27 | 2023-06-23 | 云账户技术(天津)有限公司 | 企业负面信息的识别方法、装置、电子设备及存储介质 |
CN116304058B (zh) * | 2023-04-27 | 2023-08-08 | 云账户技术(天津)有限公司 | 企业负面信息的识别方法、装置、电子设备及存储介质 |
CN117783377A (zh) * | 2024-02-27 | 2024-03-29 | 南昌怀特科技有限公司 | 一种用于牙贴生产的成分分析方法及系统 |
CN117973522A (zh) * | 2024-04-02 | 2024-05-03 | 成都派沃特科技股份有限公司 | 基于知识数据训练技术的应用模型构建方法及系统 |
CN117973522B (zh) * | 2024-04-02 | 2024-06-04 | 成都派沃特科技股份有限公司 | 基于知识数据训练技术的应用模型构建方法及系统 |
Also Published As
Publication number | Publication date |
---|---|
CN110110080A (zh) | 2019-08-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020199591A1 (fr) | Procédé, appareil, dispositif informatique, et support d'informations d'entraînement de modèles de catégorisation de textes | |
CN109871446B (zh) | 意图识别中的拒识方法、电子装置及存储介质 | |
CN111209738B (zh) | 一种联合文本分类的多任务命名实体识别方法 | |
WO2020177230A1 (fr) | Procédé et appareil de classification de données médicales basés sur un apprentissage machine et dispositif informatique et support de stockage | |
CN104699763B (zh) | 多特征融合的文本相似性度量系统 | |
CN108536800B (zh) | 文本分类方法、系统、计算机设备和存储介质 | |
CN110825877A (zh) | 一种基于文本聚类的语义相似度分析方法 | |
CN108710894B (zh) | 一种基于聚类代表点的主动学习标注方法和装置 | |
CN113094578B (zh) | 基于深度学习的内容推荐方法、装置、设备及存储介质 | |
CN110633366B (zh) | 一种短文本分类方法、装置和存储介质 | |
US20190188277A1 (en) | Method and device for processing an electronic document | |
CN112270188B (zh) | 一种提问式的分析路径推荐方法、系统及存储介质 | |
CN107844533A (zh) | 一种智能问答系统及分析方法 | |
WO2014085776A2 (fr) | Classement de recherche internet | |
CN107590177A (zh) | 一种结合监督学习的中文文本分类方法 | |
US20200364216A1 (en) | Method, apparatus and storage medium for updating model parameter | |
CN110377690B (zh) | 一种基于远程关系抽取的信息获取方法和系统 | |
CN111191031A (zh) | 一种基于WordNet和IDF的非结构化文本的实体关系分类方法 | |
CN107977456A (zh) | 一种基于多任务深度网络的多源大数据分析方法 | |
CN110377618B (zh) | 裁决结果分析方法、装置、计算机设备和存储介质 | |
CN117474010A (zh) | 面向电网语言模型的输变电设备缺陷语料库构建方法 | |
US20220414099A1 (en) | Using query logs to optimize execution of parametric queries | |
Marconi et al. | Hyperbolic manifold regression | |
CN115795037B (zh) | 一种基于标签感知的多标签文本分类方法 | |
CN114385808A (zh) | 文本分类模型构建方法与文本分类方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
- 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19923410; Country of ref document: EP; Kind code of ref document: A1 |
- NENP | Non-entry into the national phase | Ref country code: DE |
- 122 | Ep: pct application non-entry in european phase | Ref document number: 19923410; Country of ref document: EP; Kind code of ref document: A1 |