CN112819023A - Sample set acquisition method and device, computer equipment and storage medium - Google Patents
- Publication number: CN112819023A (application CN202010529394.5A)
- Authority: CN (China)
- Prior art keywords: sample, sample set, classification models, training, classification
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/214
- G06F18/24765
- G06F18/254
Abstract
The application relates to a method and an apparatus for acquiring a sample set, a computer device, and a storage medium. The method comprises the following steps: searching objects based on keywords of a label, and obtaining a sample set of the label from the found positive samples containing the keywords and negative samples not containing the keywords; selecting K training sets from the sample set and training initial classification models respectively to obtain K classification models; predicting each sample in the sample set with the K classification models, and obtaining, from the K prediction results output for each sample by the K classification models, a classification result indicating whether the sample belongs to the label; and updating the sample set according to the classification result of each sample, taking the K classification models as the initial classification models, and returning to the step of selecting K training sets from the sample set and training the initial classification models to obtain K classification models, iterating until an iteration stop condition is met, to obtain the labeled sample set. The method improves the acquisition efficiency of the sample set.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for obtaining a sample set, a computer device, and a storage medium.
Background
Machine learning aims to give a machine a learning ability similar to that of a human being; it specifically studies how a computer can simulate or realize human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures, so as to continuously improve its own performance.
Machine learning generally requires a large amount of labeled data: a generalized model is built by continuously learning from and optimizing on the labeled data, so that the machine can classify or predict new data passed through the model. The sample set, and the label of each sample in it, therefore play a critical role in artificial intelligence technology. Taking text classification as an example, each label requires a certain amount of labeled data to train a text classification model; the trained model then predicts a text and determines its classification label. In practice, however, high-quality data labeled with concept labels is scarce, so a portion of the data is often extracted by some method or rule to obtain training samples, and each sample is manually labeled to produce a sample set for model training.
However, manual labeling requires a lot of time, resulting in inefficient acquisition of the sample set.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a computer device and a storage medium for acquiring a sample set, which can improve efficiency.
A method of obtaining a sample set, the method comprising:
searching an object based on keywords of a label, and obtaining a sample set of the label according to the found positive sample with the keywords and the found negative sample without the keywords;
selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models;
respectively predicting each sample in the sample set by using the K classification models, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models;
and updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as initial classification models, iterating, returning to the step of selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models until an iteration stop condition is met, thereby obtaining the sample set of the label.
An apparatus for obtaining a sample set, the apparatus comprising:
the search acquisition module is used for searching an object based on keywords of the label and obtaining a sample set of the label according to the found positive sample with the keywords and the found negative sample without the keywords;
the training module is used for selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models;
the prediction module is used for predicting each sample in the sample set by using the K classification models respectively, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models;
and the iteration module is used for updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as initial classification models, iterating, returning to the step of selecting K training sets from the sample set, respectively training the initial classification models to obtain K classification models until an iteration stop condition is met, and obtaining the sample set of the label.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
searching an object based on keywords of a label, and obtaining a sample set of the label according to the found positive sample with the keywords and the found negative sample without the keywords;
selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models;
respectively predicting each sample in the sample set by using the K classification models, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models;
and updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as initial classification models, iterating, returning to the step of selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models until an iteration stop condition is met, thereby obtaining the sample set of the label.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
searching an object based on keywords of a label, and obtaining a sample set of the label according to the found positive sample with the keywords and the found negative sample without the keywords;
selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models;
respectively predicting each sample in the sample set by using the K classification models, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models;
and updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as initial classification models, iterating, returning to the step of selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models until an iteration stop condition is met, thereby obtaining the sample set of the label.
In the method, the apparatus, the computer device, and the storage medium for obtaining a sample set, the keywords of a label are searched and a sample set is preliminarily determined from the search results. On this basis, K training sets are selected from the sample set to train an initial classification model, yielding K classification models. The K classification models then predict the samples of the sample set, and whether each sample belongs to the label is determined from the results of the K models; because the results of multiple classifiers are fused, classification accuracy is improved. The sample set is then updated according to the classification results for iterative training, and when training ends the sample set of the label is obtained. Because the sample set of the label, and the label of each sample in it, are determined by keyword search and model training, no manual labeling is needed, which improves the acquisition efficiency of the sample set.
Drawings
FIG. 1 is a diagram of an example of an application environment for a method for obtaining a sample set in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a method for obtaining a sample set in one embodiment;
- FIG. 3 is a diagram of the topology structure of the fastText model in one embodiment;
FIG. 4 is a diagram illustrating a text representation method based on the fastText model in one embodiment;
FIG. 5 is a diagram illustrating an exemplary embodiment of a sample set acquisition method;
FIG. 6 is a diagram illustrating iteration of a data set and model in one embodiment;
FIG. 7 is a block diagram showing an example of an apparatus for acquiring a sample set according to an embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer simulates or realizes human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures, so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The scheme provided by the embodiment of the application relates to the technology of obtaining an artificial intelligence sample set and the like, and is specifically explained by the following embodiment:
the method for acquiring the sample set provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The server 104 obtains a sample set and obtains a model for label classification based on the sample set training. The server 104 classifies the objects based on the label classification model, and sets classification labels for the objects according to the classification result. The server can mark classification labels on the objects according to the classification labels, optimize content distribution to the terminal 102 according to the classification labels, and improve user experience.
The server searches the object based on the keyword of the label, and obtains a sample set of the label according to the found positive sample with the keyword and the negative sample without the keyword; selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models; predicting each sample in the sample set by using K classification models respectively, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models; updating the sample set according to the classification result of each sample in the sample set, taking K classification models as initial classification models, iterating, returning to select K training sets from the sample set, respectively training the initial classification models to obtain K classification models until the iteration stop condition is met, and obtaining the labeled sample set. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a method for obtaining a sample set is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 202, searching the object based on the keyword of the label, and obtaining a sample set of the label according to the found positive sample with the keyword and the negative sample without the keyword.
Tags are used to describe attributes of things; the same thing has different attributes from different angles, so one thing may have multiple tags. For example, consider an article whose content is a news report about an incident involving the basketball player Kobe. From the perspective of article category, the article belongs to news, so a "news" tag may be added to it. From the perspective of content, the article involves the person Kobe, and Kobe is a well-known basketball player, so a "basketball" tag may be added to it. The tag names are determined in advance by data mining, and the method uses the mined tags to determine a training set for machine learning.
Keywords are words that describe the characteristics of a tag's content. For each tag, keywords are set according to the characteristics of the content corresponding to that tag. That is, a keyword is directional and representative: it captures the content characteristics of the tag and has a correspondence that points to the tag. Generally, if an object carries a certain tag, it usually contains the keywords corresponding to that tag. For example, for the tag named "basketball", the corresponding keywords typically include basketball organizations such as "NBA" and "CBA" and basketball stars such as "Kobe" and "Jordan". Likewise, keywords for the tag named "fashion" typically include fashion brands such as "Chanel" and "Dior" and the names of fashion figures.
The object is the target of label classification. Depending on the actual application scenario, the object may be a text object or a non-text object; non-text objects include images, video, audio, and other forms. It can be understood that, because the search is a keyword search, whatever the form of the object, using the method of the present application presupposes that the object has a text description. That is, for a non-text object, the method can be used to construct a training set only if the object has a text description — for example, the entities in an image have been recognized, audio has been transcribed into text, or a summary of a video has been extracted.
A search is performed over a large number of objects based on the keywords. If a found object contains the keywords, it is taken as a positive sample; if it does not, it is taken as a negative sample. Taking articles as an example, articles containing the keywords are found among a large number of articles and used as positive samples, while articles not containing the keywords are used as negative samples. Taking images as an example, a text description has been set for each image in advance according to the entities it contains. For example, for a photograph of a basketball game, a text description is obtained in advance from the entity tags of the entities contained in the photograph, such as "NBA", "Kobe", and "Lakers". A search over a large number of photographs based on the keywords then takes photographs whose text description contains the keywords as positive samples, and photographs whose text description does not as negative samples. The sample set of the label is obtained from the positive and negative samples.
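The keyword-search step above can be sketched as follows. This is a minimal illustration with naive substring matching; the corpus, the keyword list, and the function name are illustrative, not part of the patent:

```python
def build_initial_sample_set(objects, keywords):
    """Split objects into positive samples (text contains a keyword)
    and negative samples (text contains no keyword) for one label."""
    positives, negatives = [], []
    for text in objects:
        if any(kw in text for kw in keywords):
            positives.append(text)
        else:
            negatives.append(text)
    return positives, negatives

# Hypothetical corpus for the "basketball" label.
articles = [
    "The NBA finals start tonight",
    "New fashion week highlights from Paris",
    "Kobe scored 60 points in his final game",
]
pos, neg = build_initial_sample_set(articles, ["NBA", "Kobe", "CBA"])
```

In practice the matching would run against an indexed search service rather than a Python loop, but the resulting positive/negative split is the same.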
And 204, selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models.
K training sets are selected from the sample set, and the initial classification model is trained with each training set to obtain K different classification models. To facilitate subsequent voting, K is set to an odd number. A neural network model may be flexibly chosen as the initial classification model according to the class of object in the actual business scenario. If the object is text, the initial classification model may be a convolutional neural network (CNN), a long short-term memory network (LSTM), fastText (a text classifier open-sourced by Facebook in 2016), or the like. If the object is an image, the initial classification model may be a convolutional neural network (CNN).
Specifically, the process of training the initial classification model with the K training sets is similar to model training elsewhere in machine learning: the difference between the prediction result output by the initial classification model and the labeling result is back-propagated, and the model parameters are adjusted, to obtain the K classification models corresponding to the K training sets.
Here, the K training sets may be obtained by randomly drawing a certain number of samples over different ranges from the sample set, or by a K-fold cross-validation approach: the sample set is divided into K equal parts, and each training set consists of K-1 of those parts.
The sample set comprises positive samples and negative samples, and a sample's attribute (positive or negative) serves as its label within the sample set. The process of training the K classification models can therefore be regarded as training classification models that judge whether a sample belongs to the label.
And step 206, predicting each sample in the sample set by using K classification models respectively, and obtaining whether each sample belongs to the classification result of the label according to K prediction results of each sample output by the K classification models.
For the K different, preliminarily trained classification models, each sample in the sample set is taken as a validation set and predicted with the K models respectively, producing K prediction results per sample. That is, the K models each predict every sample in the sample set; each classification model outputs its own prediction result for each sample, so the K models output K prediction results for each sample.
Since the K classification models judge whether a sample belongs to the label, the prediction result output by a classification model takes one of two values: 1, indicating that the sample belongs to the label, and 0, indicating that the sample does not belong to the label.
Specifically, the predicting of each sample in the sample set by using K classification models respectively, and obtaining a classification result whether each sample belongs to a label according to K prediction results of each sample output by the K classification models, includes: predicting each sample in the sample set by using K classification models respectively to obtain K prediction results of each sample output by the K classification models; voting the classification of the samples according to the K prediction results; and obtaining the classification result of whether the sample belongs to the label according to the prediction result with the highest vote.
Voting is performed over the K prediction results of each sample, and the prediction result with the most votes is selected as the classification result of whether the sample belongs to the label. For example, if more than half of the K prediction results output by the K classification models for a sample indicate that the sample belongs to the label, the sample is determined to belong to the label; if more than half indicate that it does not, the sample is determined not to belong to the label. The same method yields, for every sample in the training set, a classification result of whether it belongs to the label.
In this embodiment, each sample of the training set is predicted with the K preliminarily trained classification models, and whether the sample belongs to the label is determined by voting over the K prediction results. Compared with determining the classification result from a single classification model, this fuses the results of multiple classifiers and improves classification accuracy.
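The majority vote over K binary predictions described above can be sketched in a few lines. This is an illustrative fragment, not the patent's implementation; with K odd, a tie cannot occur:

```python
from collections import Counter

def vote(predictions):
    """Fuse K binary predictions (1 = belongs to the label,
    0 = does not) by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

# Five classifiers vote on one sample: three of five say "belongs".
assert vote([1, 1, 0, 1, 0]) == 1
# Three of five say "does not belong".
assert vote([0, 0, 1, 0, 1]) == 0
```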
And 208, updating the sample set according to the classification result of each sample in the sample set, taking K classification models as initial classification models, iterating, returning to the step of selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models until the iteration stop condition is met, thereby obtaining the sample set of the label.
Specifically, the training set is updated according to the classification results determined from the prediction results output by the K models: samples the K classification models determine to belong to the label are added to the positive samples, and samples determined not to belong to the label are added to the negative samples, giving an updated training set. The K classification models are then taken as the initial classification models, and the process returns to selecting K training sets from the sample set and training the initial classification models to obtain K classification models, until an iteration stop condition is met and the labeled sample set is obtained.
The iteration stop condition includes the prediction results of the samples in the sample set becoming stable, or the prediction accuracy reaching a set value. When the iteration stop condition is reached, the most recently determined sample set is taken as the sample set of the label.
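The train–predict–relabel loop with the "stable predictions" stop condition can be sketched as follows. The helper names `train_k_models` and `predict_all` are hypothetical placeholders for the training and voting steps described above:

```python
def iterate_sample_set(samples, labels, train_k_models, predict_all,
                       max_rounds=10):
    """Repeat train -> predict -> relabel until the voted labels stop
    changing (the 'stable predictions' stop condition), or until a
    round budget is exhausted."""
    for _ in range(max_rounds):
        models = train_k_models(samples, labels)      # K classifiers
        new_labels = predict_all(models, samples)     # voted 0/1 labels
        if new_labels == labels:                      # stable: stop iterating
            break
        labels = new_labels                           # update the sample set
    return samples, labels
```

A prediction-accuracy threshold could replace or supplement the equality check, matching the alternative stop condition mentioned above.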
It will be appreciated that the resulting sample set of the label includes positive and negative samples, both determined by voting over the prediction results of the K classification models, and therefore of high accuracy.
In the method for acquiring the training set, the keywords of the label are searched and a sample set is preliminarily determined from the search results. On this basis, K training sets are selected from the sample set to train the initial classification model, yielding K classification models. The K classification models then predict the samples of the sample set, and whether each sample belongs to the label is determined from the results of the K models; because the results of multiple classifiers are fused, classification accuracy is improved. The sample set is then updated according to the classification results for iterative training, and when training ends the sample set of the label is obtained. Because the sample set of the label, and the label of each sample in it, are determined by keyword search and model training, no manual labeling is needed, which improves the acquisition efficiency of the sample set.
In another embodiment, selecting K training sets from the sample set, and training the initial classification models respectively to obtain K classification models, including: and randomly dividing the sample set into K sample subsets, selecting different K-1 sample subsets to form K training sets respectively, and training the initial classification model according to the K training sets respectively to obtain K classification models.
Specifically, the sample set is randomly divided into K sample subsets, and K training sets are formed by selecting different groups of K-1 subsets, with the corresponding remaining subset serving as the validation set. Because each training set is formed from a different group of K-1 divided subsets and validated on the remaining one, the validation sets of all the training sets together cover every sample in the sample set. The K training sets then train the initial classification model respectively, yielding K classification models. Taking K = 5 as an example, the sample set is randomly divided into 5 parts, and 5 different groups of 4 parts are selected to train the initial classification model, obtaining 5 classification models. For example, an initial sample set of 100 samples is divided into 5 sample subsets of 20 samples each, denoted A, B, C, D, and E. The sample subsets and validation set assigned to each classification model are shown in Table 1.
Table 1. Sample subset allocation

| Classification model | Training subsets | Validation set |
| --- | --- | --- |
| First classification model | A, B, C, D | E |
| Second classification model | A, C, D, E | B |
| Third classification model | A, B, D, E | C |
| Fourth classification model | A, B, C, E | D |
| Fifth classification model | B, C, D, E | A |
The initial classification model is trained with the training subsets A, B, C, D to obtain the first classification model, with subset E as the validation set. It is trained with A, C, D, E to obtain the second classification model, with subset B as the validation set; with A, B, D, E to obtain the third classification model, with subset C as the validation set; with A, B, C, E to obtain the fourth classification model, with subset D as the validation set; and with B, C, D, E to obtain the fifth classification model, with subset A as the validation set. Thus each of the sample subsets A, B, C, D, and E serves as a validation set.
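The leave-one-subset-out assignment of Table 1 can be sketched as follows. The round-robin fold assignment and the function name are illustrative; any random partition into K equal subsets would serve equally well:

```python
def k_fold_splits(samples, k=5):
    """Partition samples into k subsets; each model trains on k-1
    subsets and validates on the remaining one, as in Table 1."""
    folds = [samples[i::k] for i in range(k)]  # k disjoint subsets
    splits = []
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, folds[i]))       # (training set, validation set)
    return splits

# 100 samples -> 5 models, each trained on 80 and validated on 20.
splits = k_fold_splits(list(range(100)), k=5)
```

Together the five validation sets cover all 100 samples, which is why every sample receives K predictions in the voting step.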
In this embodiment, the sample set is randomly divided into K sample subsets, and K training sets are formed by selecting different groups of K-1 subsets, so that all samples are fully utilized; the K training sets then train the initial classification model to obtain K classification models. This makes full use of the data set to train the models when the sample size is insufficient, tests the effectiveness of the algorithm, and updates the sample set according to the test results.
The training process of the classification model is similar to model training elsewhere in machine learning: the difference between the prediction result output by the initial classification model and the positive or negative sample label is back-propagated, and the model parameters are adjusted, to obtain the K classification models corresponding to the K training sets. For example, a sample set of 100 samples is divided into 5 subsets, and the resulting five training sets are used to train the initial classification model, yielding five classification models. A positive sample is labelled 1, indicating that it belongs to the label; a negative sample is labelled 0, indicating that it does not. If the prediction result for a positive sample is 0 (not belonging to the label), the prediction differs from the sample label, and the model parameters are adjusted by back-propagating that difference.
The K training sets selected from the sample set are independent, so the K independent classification models can be trained in parallel. The model structure of the initial classification model can be chosen flexibly according to the type of object in the actual service scenario. When the object is text, the initial classification model may be a convolutional neural network (CNN), a long short-term memory network (LSTM), fasttext (a text classifier open-sourced by Facebook in 2016), or the like. When the object is an image, the initial classification model may be a convolutional neural network (CNN).
Taking a text object as an example, the classification model is explained below using the fasttext model structure.
Training the initial classification models according to the K training sets to obtain K classification models includes: converting the words of each sample in the K training sets into N-gram bag-of-words vectors; converting the K training sets into word pairs of central words and context words according to the word order and the N-gram bag-of-words vectors; and inputting the word pairs of the K training sets into a single-hidden-layer neural network with a skip-gram structure for training, to obtain K classification models.
Specifically, each sample in the training set is preprocessed by word segmentation and the like, and the words of each sample are converted into N-gram bag-of-words vectors, in which each word is represented as a bag of character n-grams: a vector representation is associated with each character n-gram, and the word representation is the sum of these representations. For example, with tri-grams the word "where" can be represented as ["<wh", "whe", "her", "ere", "re>", "<where>"], and the word vector of "where" can be initialized as the sum of these six one-hot vectors. Then the training corpus is converted into word pairs of the form "central word-context word", which are input as training samples into the neural network shown in fig. 3; this trains the word vector and n-gram vectors of each word, and averaging the word vectors of all words in a text gives the vector of the whole text. The process of obtaining a text vector is shown in fig. 4.
The single-hidden-layer neural network with a skip-gram structure can adopt the fasttext model structure. The fasttext model uses a single-hidden-layer neural network with a skip-gram structure; it has few parameters and trains quickly, and the combined prediction of several independent fasttext models is highly robust, so it is well suited to repeated iteration over the training set to improve the sample quality of the training set.
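The character n-gram decomposition used by fasttext can be sketched in a few lines. This follows the common fasttext convention of wrapping the word in boundary symbols "<" and ">" and appending the whole word as one extra token; the function name and the illustrative word "where" are assumptions, not taken from the patent.

```python
def char_ngrams(word, n=3):
    """Represent a word as a bag of character n-grams, fasttext-style:
    the word is wrapped in boundary symbols '<' and '>', all n-grams
    are extracted, and the whole wrapped word is appended as one
    additional token."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    grams.append(wrapped)
    return grams
```

For "where" with n=3 this yields six tokens, and the word vector can be initialized as the sum of the six corresponding one-hot (or embedding) vectors, as described above.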
In another embodiment, searching for objects based on a keyword of the label, and obtaining a sample set of the label from positive samples with the keyword and negative samples without it, includes: acquiring keywords related to the label; searching for objects according to the keywords to obtain positive samples with the keywords and negative samples without them; and extracting positive and negative samples in a preset proportion to obtain the sample set of the label.
The training set comprises positive and negative samples, and the proportion of positive to negative samples can be set according to the proportion of the labelled objects in the actual service scenario. For example, if articles under the basketball label account for 1:100 of all articles, the ratio of positive to negative samples in the training set can be set to 1:99. In practice, if the disparity between positive and negative sample numbers is too large, the share of positive samples can be increased appropriately, for example setting the ratio of positive to negative samples in the training set to 10:90.
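Drawing samples at a preset ratio, merging them, and shuffling the order (as described here and in the next paragraph) can be sketched as follows; the function name, label encoding (1 for positive, 0 for negative), and fixed seed are illustrative assumptions.

```python
import random

def build_sample_set(positives, negatives, pos_ratio=0.10, total=1000, seed=0):
    """Draw positives and negatives at a preset ratio (e.g. 10:90),
    merge them, and shuffle the order to obtain the labelled sample set."""
    rng = random.Random(seed)
    n_pos = int(total * pos_ratio)
    n_neg = total - n_pos
    picked = [(s, 1) for s in rng.sample(positives, n_pos)]
    picked += [(s, 0) for s in rng.sample(negatives, n_neg)]
    rng.shuffle(picked)  # disorder the sequence before training
    return picked
```

With `pos_ratio=0.10` and `total=1000`, the resulting sample set contains 100 positive and 900 negative samples in random order.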
In practical application, during the keyword search, objects containing the keyword are used as the preliminary positive sample set and objects without the keyword as the preliminary negative sample set. Samples are extracted from the two preliminary sets in the preset proportion, merged, and shuffled to obtain the labelled sample set.
Specifically, searching the object according to the keyword to obtain a positive sample with the keyword and a negative sample without the keyword, including:
The text description of a non-text object, or a text object itself, is searched according to the keywords to obtain positive samples with the keywords and negative samples without them. In the present application, the object of label classification may be a text object or a non-text object: labels can be attached to either, and a label classification model can be trained for either, i.e. the objects to which the method applies include both text and non-text objects. Note that since the sample set is determined by keyword search, the method of the present application assumes that the object has a text description.
For a non-text object, a keyword search can be performed on its text description to determine the sample set of the label, and a label classification model for non-text objects can then be trained from the perspective of that text description. The text description of a non-text object is acquired by calling a recognition model, selected according to the type of the non-text object, to recognize the object; different types of non-text objects correspond to different recognition models. For example, for video and images, a convolutional neural network identifies the entities in the video or image, entity tags are set accordingly, and the text description is obtained; for audio, a speech recognition model recognizes the audio content to obtain the text description of the audio.
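The type-dependent choice of recognition model can be sketched as a simple dispatch table. The recognizers below are placeholders standing in for the real models (a CNN entity tagger for video/image, a speech recognition model for audio); all names and return strings are illustrative assumptions.

```python
def describe(obj_type, payload):
    """Dispatch a non-text object to a type-specific recognition step
    to obtain its text description. Each recognizer here is a stub for
    the corresponding model described in the text."""
    recognizers = {
        "image": lambda p: f"entities: {p}",   # stub for CNN entity tagging
        "video": lambda p: f"entities: {p}",   # stub for CNN entity tagging
        "audio": lambda p: f"transcript: {p}", # stub for speech recognition
    }
    if obj_type not in recognizers:
        raise ValueError(f"no recognition model for type {obj_type!r}")
    return recognizers[obj_type](payload)
```

The resulting text description is then what the keyword search of the previous paragraph operates on.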
In this way, sample sets can be constructed for various types of files, providing a foundation for building label classification models for each of them.
In another embodiment, as shown in fig. 5, the method for obtaining the sample set includes:
step 502, searching the object based on the keyword of the label, and obtaining a sample set of the label according to the positive sample with the keyword and the negative sample without the keyword.
In this embodiment, the keywords of the label are selected manually, so the sample set determined from them reflects a small amount of human intervention; on the other hand, it is precisely these manually chosen keywords that allow a high-quality sample set to be collected from the unlabelled data set.
And step 504, selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models.
And step 506, predicting each sample in the sample set by using the K classification models respectively, and obtaining a classification result of whether each sample belongs to the label according to the K prediction results of each sample output by the K classification models.
And step 508, determining the prediction accuracy according to the classification result of each sample in the sample set.
The prediction accuracy is the ratio of the number of correctly predicted samples in the sample set to the total number of samples. A sample is predicted correctly when the classification result determined by voting over the K classification models' results matches its label, and predicted wrongly otherwise. Here the sample's label means its positive or negative annotation: a positive label indicates that the sample belongs to the label, a negative label that it does not. For example, if a sample is annotated as positive but the vote over the K classification models says it does not belong to the label, the prediction is wrong; if the vote says it does belong, the prediction is correct.
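The voting and accuracy computation just described can be sketched as follows; the function names and the 0/1 label encoding are illustrative assumptions.

```python
from collections import Counter

def vote(predictions):
    """Majority vote over the K prediction results for one sample."""
    return Counter(predictions).most_common(1)[0][0]

def prediction_accuracy(sample_labels, k_predictions):
    """Ratio of correctly predicted samples: a sample counts as correct
    when the voted classification result equals its current label."""
    correct = sum(1 for label, preds in zip(sample_labels, k_predictions)
                  if vote(preds) == label)
    return correct / len(sample_labels)
```

For instance, a positive sample (label 1) whose five predictions are [1, 1, 0, 1, 1] is voted as belonging to the label and counts as correct; one whose predictions are [0, 0, 0, 1, 1] is voted as not belonging and counts as wrong.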
Step 510, judging whether the prediction accuracy reaches the set value. If not, go to step 512; if so, go to step 514.
The prediction accuracy reaching the set value is the condition for stopping iteration, i.e. the training target. The set value may, for example, be 99%.
In other embodiments, the iteration stop condition may instead be that the prediction results over the sample set have stabilized, i.e. the prediction accuracy changes little between successive iterations; the allowed difference can be set according to the required accuracy.
And step 512, updating the sample set according to the classification result of each sample in the sample set, and taking the K classification models as the initial classification models. After step 512, training returns to step 504 and iterates until the prediction accuracy reaches the set value.
Specifically, using the classification result voted by the K classification models for each sample, samples determined to belong to the label are added to the positive samples and samples determined not to belong are added to the negative samples, giving the updated training set. The K classification models are taken as the new initial classification models, and the process returns to the step of selecting K training sets from the sample set and training the initial classification models to obtain K classification models, until the iteration stop condition is met.
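The relabelling step can be sketched as a one-pass update: each sample's label is overwritten by its voted classification result. The tuple layout (sample, 0/1 label) and the function name are illustrative assumptions.

```python
def update_sample_set(samples, voted_results):
    """Relabel each (sample, label) pair with its voted classification
    result: samples voted as belonging to the label go to the positive
    set (label 1), the rest to the negative set (label 0)."""
    return [(sample, 1 if belongs else 0)
            for (sample, _), belongs in zip(samples, voted_results)]
```

The updated sample set is then split into K training sets again for the next round of training.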
Step 514, a sample set of labels is obtained.
The iterative interplay of data set and model is shown in fig. 6: the quality of the data set and the real performance of the model promote each other until a balanced state is finally reached. Note that when the model is used to optimize the data set, erroneous predictions can be corrected with a small amount of human intervention; whether the few wrongly predicted samples need such manual correction can be judged from the model's actual prediction performance.
The application further provides an application scenario applying the sample set acquisition method. Specifically, the application of the sample set acquisition method in the application scenario is as follows:
(1) Find several keywords related to the label, take articles containing the keywords from a large collection of articles as the preliminary positive sample set, and take articles not containing them as the preliminary negative sample set.
(2) Extract an appropriate number of positive and negative samples from the preliminary positive and negative sample sets, merge them, and shuffle the order to obtain the sample set.
(3) Perform five-fold cross-validation on the sample set using the fasttext model shown in fig. 3. Specifically, the sample set is randomly divided into five parts; four parts are selected in turn as the training set and the remaining part as the verification set, giving five training sets on which five independent fasttext models are trained.
specifically, each sample in the training set is pre-processed by word segmentation and the like, and words of each sample in the training set are respectively converted into N-gram (N-gram) bag-of-word vectors. Wherein each word is represented as a bag of n characters. A vector representation is associated with each character n-gram and the word representation is the sum of these representations. For example, the appearance of the word "the rest year of the day" can be expressed as [ "the rest year of the day", the rest year "," the day "," the rest year "," the year "], and the word vector of the word" the rest year of the day "can be initialized to the sum of six one-hot vectors. Then, the training corpus is converted into word pairs in the form of "central word-context word", the word pairs are used as training samples and input into the neural network shown in fig. 3, finally, word vectors and gram vectors of each word can be trained, and after all word vectors in the text are added and averaged, the whole text vector can be obtained. The process of obtaining a text vector is shown in fig. 4.
The single hidden layer neural network of the word skipping model structure can adopt a fasttext model structure. The Fastext model adopts a single hidden layer neural network with a skip-gram structure, the sample parameters of the Fasttext model are few, the training speed is high, the robustness of the prediction result of a plurality of independent Fasttext models is extremely strong, the Fastext model is suitable for repeated iteration on a training set, and the sample quality of the training set is improved.
(4) Use the five independent models to predict over the whole sample set; each sample obtains five prediction results indicating whether it belongs to the articles under the concept label, and these results are voted on to obtain the sample's unique prediction result (whether it belongs to the class label).
(5) Readjust the positive and negative samples in the training set according to each sample's unique prediction result: samples predicted not to belong to the label are added to the negative sample set, and samples predicted to belong are added to the positive sample set.
(6) Return to step (3) and iterate repeatedly until the prediction results for the training samples are stable or the prediction accuracy on the sample set approaches 100%.
This sample set acquisition method can quickly collect a training set of high labelling quality from an unlabelled data set, with little human intervention and no need for large-scale manual annotation, so that training on large data volumes can be completed in a short time. The whole data-set-and-model iteration scheme can be applied effectively and promptly to building the label system of a recommendation system: a new concept label in the data set can be responded to quickly and a classification model trained in time, which noticeably helps and promotes the downstream recommendation system.
It should be understood that although the steps in the flowcharts of figs. 2 and 5 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, there is no strict restriction on the order, and the steps may be performed in other orders. Moreover, at least some of the steps in figs. 2 and 5 may comprise multiple sub-steps or stages, which need not be performed at the same moment or in sequence, but may be performed in turn or alternately with other steps or with sub-steps of other steps.
In one embodiment, as shown in fig. 7, an apparatus for obtaining a sample set is provided, and the apparatus may adopt a software module or a hardware module, or a combination of the two modules, as a part of a computer device, and specifically includes:
a search obtaining module 702, configured to search for an object based on a keyword of a tag, and obtain a sample set of the tag according to a positive sample with the keyword and a negative sample without the keyword;
a training module 704, configured to select K training sets from the sample set, and train the initial classification models respectively to obtain K classification models;
the prediction module 706 is configured to respectively use K classification models to predict each sample in the sample set, and obtain a classification result of whether each sample belongs to a label according to K prediction results of each sample output by the K classification models;
and the iteration module 708 is configured to update the sample set according to the classification result of each sample in the sample set, use the K classification models as initial classification models, perform iteration to return to a step of selecting K training sets from the sample set, and train the initial classification models respectively to obtain K classification models until an iteration stop condition is met, so as to obtain a labeled sample set.
The device for acquiring the sample set searches with the keywords of the label, determines the sample set preliminarily from the search results, selects K training sets from the sample set to train the initial classification model into K classification models, predicts the samples of the sample set with the K classification models, and obtains the classification result of whether each sample belongs to the label from the K models' results. The sample set of the label, and the labels of all samples in it, are thus determined by keyword search and model training; no manual labelling is needed, which improves the acquisition efficiency of the sample set.

In one embodiment, the training module comprises:
the training set acquisition module is used for randomly dividing the sample set into K sample subsets, selecting different K-1 sample subsets to respectively form K training sets, and taking the corresponding residual sample subset as a verification set; the full validation set includes each sample in the sample set.
And the classification model training module is used for training the initial classification model according to the K training sets respectively to obtain K classification models.
In another embodiment, the search acquisition module includes:
and the keyword acquisition module is used for acquiring keywords related to the label.
And the searching module is used for searching the object according to the keyword to obtain a positive sample with the keyword and a negative sample without the keyword.
And the sample set acquisition module is used for extracting the positive sample and the negative sample according to a preset proportion to obtain a sample set of the label.
In another embodiment, the search module is configured to search for a text description or a text object of the non-text object according to the keyword, and obtain a positive sample with the keyword and a negative sample without the keyword.
In another embodiment, the classification model training module is configured to convert words of each sample in the K training sets into N-tuple model bag-of-words vectors, respectively; converting K training sets into word pairs of central words and context words according to the word order and the N-element model word bag vector; and respectively inputting the word pairs of the K training sets into a single hidden layer neural network of a jumping word model structure for training to obtain K classification models.
In another embodiment, the prediction module includes:
and the prediction result acquisition module is used for predicting each sample in the sample set by using K classification models respectively to obtain K prediction results of each sample output by the K classification models.
And the voting module is used for voting the classification of the samples according to the K prediction results.
And the classification module is used for obtaining a classification result of whether the sample belongs to the label or not according to the prediction result with the highest vote.
In another embodiment, the apparatus for obtaining a sample set further comprises:
a prediction accuracy acquisition module to: and determining the prediction accuracy according to the classification result of each sample in the sample set. Wherein the iteration stop condition comprises: the prediction accuracy reaches a set value.
For the specific definition of the acquiring apparatus of the sample set, reference may be made to the above definition of the acquiring method of the sample set, and details are not described here. The modules in the device for acquiring the sample set can be implemented in whole or in part by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the acquired data of the sample set. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of acquiring a sample set.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A method of obtaining a sample set, the method comprising:
searching an object based on keywords of a label, and obtaining a sample set of the label according to the found positive sample with the keywords and the found negative sample without the keywords;
selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models;
respectively predicting each sample in the sample set by using the K classification models, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models;
and updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as initial classification models, iterating, returning to the step of selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models until an iteration stop condition is met, thereby obtaining the sample set of the label.
2. The method of claim 1, wherein the selecting K training sets from the sample set, and respectively training initial classification models to obtain K classification models comprises:
randomly dividing the sample set into K sample subsets, and selecting different K-1 sample subsets to respectively form K training sets;
and training the initial classification models according to the K training sets respectively to obtain K classification models.
3. The method of claim 1, wherein the searching for the object based on the keyword of the tag, and obtaining the sample set of the tag according to the found positive sample with the keyword and the found negative sample without the keyword comprises:
acquiring keywords related to the label;
searching an object according to the keyword to obtain a positive sample with the keyword and a negative sample without the keyword;
and extracting the positive sample and the negative sample according to a preset proportion to obtain a sample set of the label.
4. The method of claim 3, wherein the searching for objects according to the keyword to obtain a positive sample with the keyword and a negative sample without the keyword comprises:
and searching text description or text objects of the non-text objects according to the keywords to obtain a positive sample with the keywords and a negative sample without the keywords.
5. The method of claim 2, wherein the training the initial classification model according to the K training sets respectively to obtain K classification models comprises:
respectively converting words of each sample in the K training sets into N-element model word bag vectors;
converting the K training sets into word pairs of central words and context words according to the word order and the N-element model word bag vector;
and respectively inputting the word pairs of the K training sets into a single hidden layer neural network of a jumping word model structure for training to obtain K classification models.
6. The method of claim 1, wherein the predicting the samples in the sample set by using the K classification models respectively and obtaining whether each sample belongs to the labeled classification result according to K prediction results of each sample output by the K classification models comprises:
predicting each sample in the sample set by using the K classification models respectively to obtain K prediction results of each sample output by the K classification models;
voting the classification of the samples according to the K prediction results;
and obtaining a classification result of whether the sample belongs to the label or not according to the prediction result with the highest vote.
7. The method of claim 1, further comprising: determining the prediction accuracy according to the classification result of each sample in the sample set;
the iteration stop condition includes: the prediction accuracy reaches a set value.
8. An apparatus for obtaining a sample set, the apparatus comprising:
the search acquisition module is used for searching an object based on keywords of the label and obtaining a sample set of the label according to the found positive sample with the keywords and the found negative sample without the keywords;
the training module is used for selecting K training sets from the sample set, and respectively training the initial classification models to obtain K classification models;
the prediction module is used for predicting each sample in the sample set by using the K classification models respectively, and obtaining a classification result of whether each sample belongs to the label according to K prediction results of each sample output by the K classification models;
and the iteration module is used for updating the sample set according to the classification result of each sample in the sample set, taking the K classification models as initial classification models, iterating, returning to the step of selecting K training sets from the sample set, respectively training the initial classification models to obtain K classification models until an iteration stop condition is met, and obtaining the sample set of the label.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010529394.5A CN112819023A (en) | 2020-06-11 | 2020-06-11 | Sample set acquisition method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112819023A true CN112819023A (en) | 2021-05-18 |
Family
ID=75853154
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010529394.5A Pending CN112819023A (en) | 2020-06-11 | 2020-06-11 | Sample set acquisition method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112819023A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113408291A (en) * | 2021-07-09 | 2021-09-17 | 平安国际智慧城市科技股份有限公司 | Training method, device and equipment for Chinese entity recognition model and storage medium |
CN113761925A (en) * | 2021-07-23 | 2021-12-07 | 中国科学院自动化研究所 | Named entity identification method, device and equipment based on noise perception mechanism |
CN113761925B (en) * | 2021-07-23 | 2022-10-28 | 中国科学院自动化研究所 | Named entity identification method, device and equipment based on noise perception mechanism |
CN114997535A (en) * | 2022-08-01 | 2022-09-02 | 联通(四川)产业互联网有限公司 | Intelligent analysis method and system platform for big data produced in whole process of intelligent agriculture |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109376222B (en) | Question-answer matching degree calculation method, question-answer automatic matching method and device | |
CN112819023A (en) | Sample set acquisition method and device, computer equipment and storage medium | |
EP2973038A1 (en) | Classifying resources using a deep network | |
CN109284406B (en) | Intention identification method based on difference cyclic neural network | |
CN109635083B (en) | Document retrieval method for searching topic type query in TED (tele) lecture | |
CN113298197B (en) | Data clustering method, device, equipment and readable storage medium | |
CN111382283B (en) | Resource category label labeling method and device, computer equipment and storage medium | |
CN108959305A (en) | A kind of event extraction method and system based on internet big data | |
CN113821670A (en) | Image retrieval method, device, equipment and computer readable storage medium | |
CN112749274A (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN113569001A (en) | Text processing method and device, computer equipment and computer readable storage medium | |
CN113408282B (en) | Method, device, equipment and storage medium for topic model training and topic prediction | |
CN114565104A (en) | Language model pre-training method, result recommendation method and related device | |
CN114358109A (en) | Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment | |
CN113704534A (en) | Image processing method and device and computer equipment | |
CN112100377A (en) | Text classification method and device, computer equipment and storage medium | |
CN112131345A (en) | Text quality identification method, device, equipment and storage medium | |
CN113537206A (en) | Pushed data detection method and device, computer equipment and storage medium | |
CN111538898A (en) | Web service package recommendation method and system based on combined feature extraction | |
CN111259650A (en) | Text automatic generation method based on class mark sequence generation type countermeasure model | |
CN114936327B (en) | Element recognition model acquisition method and device, computer equipment and storage medium | |
CN113627447B (en) | Label identification method, label identification device, computer equipment, storage medium and program product | |
CN111476037A (en) | Text processing method and device, computer equipment and storage medium | |
CN115129863A (en) | Intention recognition method, device, equipment, storage medium and computer program product | |
CN114358188A (en) | Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40048280; Country of ref document: HK |
| SE01 | Entry into force of request for substantive examination | |