CN109740750B - Data collection method and device - Google Patents

Data collection method and device

Info

Publication number
CN109740750B
Authority
CN
China
Prior art keywords
collection
sample data
sample
data
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811542893.7A
Other languages
Chinese (zh)
Other versions
CN109740750A (en)
Inventor
李超然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN201811542893.7A priority Critical patent/CN109740750B/en
Publication of CN109740750A publication Critical patent/CN109740750A/en
Application granted granted Critical
Publication of CN109740750B publication Critical patent/CN109740750B/en
Legal status: Active

Abstract

The invention provides a data collection method and a device, wherein the method comprises the following steps: receiving sample data to be collected; acquiring the current ratio of sample data belonging to the category to which the sample data to be collected belongs in a sample collection dataset, wherein the sample collection dataset is a dataset with a fixed size; determining the collection probability of the sample data of the category to which the sample data to be collected belongs according to the current proportion and the target proportion of the sample data of the category to which the sample data to be collected belongs; and adding the sample data to be collected into the sample collection data set according to the collection probability so as to train a neural network model. By the scheme, the sample data set meeting the class distribution requirement of machine learning can be obtained under the condition that new samples are continuously generated.

Description

Data collection method and device
Technical Field
The invention relates to the technical field of deep learning, in particular to a data collection method and device.
Background
Neural networks commonly used in deep learning require a large amount of sample data for training. If the class distribution of the sample data in the sample data set is unbalanced, the neural network model will train poorly. For a classification problem, class imbalance means that the numbers of sample data in the different classes differ greatly. For example, in a two-class problem with 100 sample data (100 rows of data, each row being the representation of one sample), if 80 samples belong to class 1 and the remaining 20 belong to class 2, then class 1 : class 2 = 80 : 20 = 4 : 1, which is a class imbalance. In reinforcement learning, the interaction of the AI (artificial intelligence) with the environment produces a large amount of sample data, and if this sample data is classified, different classes of sample data are generated with different probabilities.
The class imbalance of sample data in a sample data set is a typical problem in machine learning. For a specific, fixed sample data set, a common solution is to undersample the classes with many samples or to oversample the classes with few samples, so that a class-balanced sample data set is obtained by resampling. Another approach is to generate new sample data artificially from the existing sample data. There are also methods that do not start from the data set itself but improve the effect of model training by adding penalties to the classifier's algorithm. All of these methods address model training on a fixed data set.
Resampling is difficult when the sample size is very large, or when the sample size is unknown and new samples are continuously generated. Therefore, the class-imbalance problem is not well solved for situations in which sample data is continuously generated, as in reinforcement learning.
Disclosure of Invention
In view of this, the present invention provides a data collection method and apparatus, so as to obtain a sample data set that meets the class distribution requirement of machine learning under the condition that new samples are continuously generated.
In order to achieve the purpose, the invention is realized by adopting the following scheme:
in an embodiment of the present invention, a data collection method includes:
receiving sample data to be collected;
acquiring the current ratio of sample data belonging to the category to which the sample data to be collected belongs in a sample collection dataset, wherein the sample collection dataset is a dataset with a fixed size;
determining the collection probability of the sample data of the category to which the sample data to be collected belongs according to the current proportion and the target proportion of the sample data of the category to which the sample data to be collected belongs;
and adding the sample data to be collected into the sample collection data set according to the collection probability so as to train a neural network model.
In an embodiment of the present invention, a computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the steps of the method according to the above-mentioned embodiment.
In an embodiment of the present invention, a computer-readable storage medium, on which a computer program is stored, is characterized in that the program, when executed by a processor, implements the steps of the method described in the above-mentioned embodiment.
According to the data collection method, computer device, and computer-readable storage medium described above, collecting sample data with a fixed-size data set makes it possible to know the current proportion of sample data of each category; a reasonable collection probability can be determined from the current proportion and the expected proportion of a given category of sample data; and whether new sample data is added to the data set is decided according to the collection probability, so that the sample data in the data set better matches the class distribution required for training the neural network model. Therefore, a sample data set that meets the class-distribution requirement of neural network model training can be collected even while new sample data is continuously generated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
FIG. 1 is a schematic flow chart diagram of a data collection method according to an embodiment of the invention;
FIG. 2 is a flow chart illustrating a method for determining a collection probability according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating a method for determining a collection probability according to another embodiment of the present invention;
FIG. 4 is a flow chart illustrating a method for adding sample data to be collected to a sample collection dataset according to collection probability according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart diagram of a data collection method according to another embodiment of the present invention;
FIG. 6 is a flow chart illustrating a data collection method according to an embodiment of the invention;
FIG. 7 is a schematic structural diagram of a data collection device according to an embodiment of the present invention;
FIG. 8 is a block diagram of a collection probability determination module according to an embodiment of the invention;
FIG. 9 is a schematic diagram of the structure of a collection probability determination module according to another embodiment of the present invention;
FIG. 10 is a schematic diagram of a data collection unit according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a data collection device according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
FIG. 1 is a flow chart of a data collection method according to an embodiment of the invention. As shown in FIG. 1, the data collection method of some embodiments may include:
step S110: receiving sample data to be collected;
step S120: acquiring the current ratio of sample data belonging to the category to which the sample data to be collected belongs in a sample collection dataset, wherein the sample collection dataset is a dataset with a fixed size;
step S130: determining the collection probability of the sample data of the category to which the sample data to be collected belongs according to the current proportion and the target proportion of the sample data of the category to which the sample data to be collected belongs;
step S140: and adding the sample data to be collected into the sample collection data set according to the collection probability so as to train a neural network model.
In step S110, the sample data to be collected may be sample data in a data stream, and data may be extracted from the data stream in real time without knowing the total number of samples. New sample data may be generated continuously by a signal source, for example during reinforcement learning. Each sample data may contain metadata (i.e., the data itself) and a category label, and different sample data may belong to different categories. The received sample data to be collected can be temporarily stored in a data set and read out for subsequent processing.
In step S120, the data in the sample collection data set can be used directly for neural network model training and, as needed, may contain only metadata or data pairs consisting of metadata and a category label. The size of the sample collection data set refers to the maximum number of data items the data set can hold; the data set itself may be implemented in a variety of ways, such as a queue or a linked list. The size of the sample collection data set should generally be larger than the total number of data classes, and its specific value may depend on the training requirements of the neural network model; it may be, for example, one hundred times the total number of classes. Sample data generated by a large number of signal sources may already have been collected in the sample collection data set. The current proportion of sample data of a given category in the sample collection data set can be obtained by counting the total number of samples of that category and dividing by the total number of sample data in the sample collection data set or by the size of the sample collection data set. The category may be obtained from the category label in each data pair stored in the sample collection data set, or by counting the labels in a dedicated category label data set that corresponds to the sample collection data set.
In step S130, the target proportion of sample data of a given class is the desired proportion and may be set according to the training requirements of the neural network model. In particular, it may be determined from the total number of classes n_c and the like; for example, the target proportion of each class of sample data may then be 1/n_c. If the current proportion of sample data of a class is smaller than the desired proportion, there are too few samples of that class; otherwise there are too many. If the current proportion is less than the target proportion, a larger collection probability may be set, and if the current proportion is greater than the target proportion, a smaller collection probability may be set.
In the above step S140, the collection probability may be implemented by means of a random number. If the sample collection data set is full, old sample data may be replaced by the newly added sample data, for example, the sample data that was added to the sample collection data set earliest. If the sample collection data set is not full, the new sample data may be added to it directly.
In this embodiment, collecting sample data with a fixed-size data set makes it possible to know the current proportion of sample data of each category; a reasonable collection probability can be determined from the current proportion and the expected proportion of a given category of sample data; and whether new sample data is added to the data set is decided according to the collection probability, so that the sample data in the data set better matches the class distribution required for training the neural network model. Therefore, this scheme can collect a sample data set that meets the class-distribution requirement of machine learning even while new sample data is continuously generated. In the case where different categories of sample data are generated with different probabilities, the generated sample data is filtered as it is collected, so that the class composition of the collected sample data set approaches the expected class distribution, for example, tends toward balance; and because the data set keeps taking in new data from the data stream, the samples used for model training are kept up to date.
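As an illustration of steps S110 to S140, the following Python sketch shows one possible way to filter a skewed data stream into a fixed-size, roughly class-balanced data set. It is only a simplified example: the function names, the fixed 0.9/0.1 collection probabilities, and the demo stream are assumptions for illustration, not part of the embodiment.

```python
import random
from collections import Counter, deque

# Illustrative sketch of steps S110-S140; the fixed 0.9/0.1 probabilities and all
# names here are assumptions, not taken from the patent text.
DATASET_SIZE = 200              # fixed size of the sample collection data set
N_CLASSES = 2                   # total number of classes
TARGET_RATIO = 1.0 / N_CLASSES  # uniform target proportion per class

dataset = deque(maxlen=DATASET_SIZE)   # (metadata, label) pairs; oldest drops out when full

def collect(sample, label):
    """Decide whether to add one incoming sample to the fixed-size data set."""
    counts = Counter(lbl for _, lbl in dataset)
    total = len(dataset)
    current_ratio = counts[label] / total if total else 0.0   # step S120
    # Step S130: a larger probability when the class is under-represented.
    probability = 0.9 if current_ratio <= TARGET_RATIO else 0.1
    # Step S140: accept with that probability; the deque evicts the oldest item when full.
    if random.random() <= probability:
        dataset.append((sample, label))

# Example: a stream that produces class 0 four times as often as class 1.
for _ in range(10_000):
    lbl = 0 if random.random() < 0.8 else 1
    collect({"x": random.random()}, lbl)
print(Counter(lbl for _, lbl in dataset))   # roughly balanced despite the skewed stream
```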
In some embodiments, the step S120 of obtaining a current percentage of sample data in the sample collection dataset belonging to the category to which the sample data to be collected belongs may include:
counting and calculating the proportion of the labels of the category to which the sample data to be collected belongs in the category label data set to obtain the current proportion of the sample data of the category to which the sample data to be collected belongs in the sample collection data set; the category label dataset is used for storing a category label of each sample data in the sample collection dataset, and the size of the category label dataset is the same as that of the sample collection dataset.
The category labels in the category label data set may be added to the category label data set as sample data is collected into the sample collection data set, or the labels may be added one by one after all of the current sample data in the sample collection data set has been obtained. When the category label of a piece of sample data needs to be added to the category label data set, the label can be separated from the sample data and, after any required conversion, added to the category label data set. The category label data set may be used to store only the category labels corresponding to the sample data in the sample collection data set.
In this embodiment, the class label data set is used to specially store the class label of each sample data in the sample collection data set, so that the current class condition of the sample data in the sample collection data set can be conveniently and rapidly counted.
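For example, the statistics over the category label data set can be as simple as counting labels. In the sketch below, the variable and function names and the use of a fixed-size container are illustrative assumptions.

```python
from collections import Counter, deque

DATASET_SIZE = 200
label_dataset = deque(maxlen=DATASET_SIZE)   # holds only the label of each collected sample

def current_ratio(label):
    """Fraction of the collection data set currently occupied by `label`."""
    if not label_dataset:
        return 0.0
    return Counter(label_dataset)[label] / len(label_dataset)

label_dataset.extend([0, 0, 0, 1])
print(current_ratio(0))   # 0.75
print(current_ratio(1))   # 0.25
```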
In some embodiments, the step S130 of determining the collection probability of the sample data of the category to which the sample data to be collected belongs according to the current proportion and the target proportion of the sample data of the category to which the sample data to be collected belongs may include:
step S131: determining a first probability as a collection probability of sample data of a category to which the sample data to be collected belongs if the current ratio is less than or equal to a target ratio of the category to which the sample data to be collected belongs, and determining a second probability as the collection probability if the current ratio is greater than the target ratio; the first probability is greater than the second probability.
The specific values of the first probability and the second probability may be determined according to the deviation (e.g., the difference or the mean square error) between the current proportion and the target proportion.
In this embodiment, when the current proportion of a certain category is less than or equal to the target proportion, the data of that category is scarce, and more sample data of the category can be obtained through a larger first probability; when the current proportion of the category is greater than the target proportion, the data of that category is plentiful, and less sample data of the category is obtained through a smaller second probability. Therefore, as new sample data is continuously received, the proportion of sample data of the category approaches the desired proportion.
FIG. 2 is a flow chart illustrating a method for determining a collection probability according to an embodiment of the present invention. As shown in fig. 2, the step S131, namely, determining a first probability as the collection probability of the sample data of the category to which the sample data to be collected belongs if the current ratio is less than or equal to the target ratio of the category to which the sample data to be collected belongs, and determining a second probability as the collection probability if the current ratio is greater than the target ratio, may include:
step S1311: obtaining the current category distribution of sample data in the sample collection dataset;
step S1312: calculating a mean square error between the current class distribution and a target class distribution of sample data in the sample collection dataset;
step S1313: and under the condition that the mean square error is less than or equal to an error threshold set according to the total number of samples of the sample collection data set, when the current occupation ratio is less than or equal to a target occupation ratio of a category to which the sample data to be collected belongs, determining a first probability obtained by adding the mean square error to 0.5 as the collection probability of the sample data of the category to which the sample data to be collected belongs, and when the current occupation ratio is greater than the target occupation ratio, determining a second probability obtained by subtracting the mean square error from 0.5 as the collection probability of the sample data of the category to which the sample data to be collected belongs.
In step S1311, the current class distribution may be the proportion of each class of sample data in the sample collection data set. In the above step S1312, assume that the total number of classes is n_c, that the current proportion of the i-th class of sample data is q_i, and that its target proportion is p_i; the mean square error can then be expressed as mse = (1/n_c) * Σ_{i=1..n_c} (q_i - p_i)^2.
In the above step S1313, the error threshold may be set according to the size n_tar of the sample collection data set, and may be, for example, 5/n_tar. When the mean square error is less than or equal to the error threshold set according to the total number of samples of the sample collection data set, the current class distribution can be considered close to the target class distribution; if the current proportion is less than or equal to the target proportion, sample data of that class is collected with the slightly larger probability 0.5 + mse, and if the current proportion is greater than the target proportion, there is already enough sample data of that class and it is collected with the slightly smaller probability 0.5 - mse.
In this embodiment, the mean square error is calculated from the current class distribution and the target class distribution, and the first or second probability is determined by letting the mean square error fluctuate around one half (0.5), so that the collection probability meets the class-adjustment requirement without making the sample class distribution oscillate too much.
In other embodiments, the mean of the class proportions in the current class distribution and the mean of the class proportions in the sample collection data set may be calculated, and the difference between the two means may be used in place of the mean square error: when the current proportion is less than or equal to the target proportion of the category to which the sample data to be collected belongs, a first probability obtained by adding this difference to 0.5 is determined as the collection probability of sample data of that category, and when the current proportion is greater than the target proportion, a second probability obtained by subtracting the difference from 0.5 is determined as the collection probability.
FIG. 3 is a flow chart illustrating a method for determining a collection probability according to another embodiment of the present invention. As shown in fig. 3, the method for determining the collection probability shown in fig. 2 may further include:
step S1314: and when the mean square error is larger than an error threshold value set according to the total number of samples of the sample collection data set, determining a first probability taken from one end, close to 1, in the range of (0.5,1) as the collection probability of the sample data of the class to which the sample data to be collected belongs when the current ratio is smaller than or equal to a target ratio of the class to which the sample data to be collected belongs, and determining a second probability taken from one end, close to 0, in the range of (0,0.5) as the collection probability of the sample data of the class to which the sample data to be collected belongs when the current ratio is smaller than or equal to the target ratio.
In step S1314, a mean square error greater than the error threshold set according to the total number of samples of the sample collection data set indicates that the current class distribution differs greatly from the target class distribution. "Close to 1" may refer to a value in the range (0.75, 1), e.g., 0.9 or 0.99. "Close to 0" may refer to a value in the range (0, 0.25), e.g., 0.1 or 0.15. In this case, taking the collection probability from the end of (0.5, 1) close to 1 allows the target proportion of the needed class to be reached more quickly, while taking it from the end of (0, 0.5) close to 0 slows, as much as possible, the growth of the classes that already have plenty of sample data.
In this embodiment, when the current class distribution differs greatly from the target class distribution, a large collection probability is set to rapidly increase the collection speed of sample data of the needed classes, and a small collection probability is set to collect as little sample data of the over-represented classes as possible, so that the sample data of each class can reach its target proportion as soon as possible.
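The two branches above can be combined into a single probability rule. The sketch below follows the values given in the description (error threshold of 5 divided by the data set size, probabilities 0.5 ± mse, and 0.99/0.01 for the far-from-target case); the function signature and variable names are assumptions for illustration.

```python
# Sketch of the collection-probability rule; the constants 0.99 / 0.01 and the
# threshold 5 / dataset_size follow the examples in the description, while the
# helper name and signature are illustrative assumptions.
def collection_probability(current_ratio, target_ratio, current_dist, target_dist, dataset_size):
    n_classes = len(target_dist)
    # Mean square error between the current and target class distributions.
    mse = sum((q - p) ** 2 for q, p in zip(current_dist, target_dist)) / n_classes
    error_threshold = 5.0 / dataset_size          # threshold set from the data set size
    under_represented = current_ratio <= target_ratio
    if mse <= error_threshold:
        # Distributions are already close: fluctuate gently around 0.5.
        return 0.5 + mse if under_represented else 0.5 - mse
    # Distributions differ a lot: collect the scarce class aggressively,
    # and almost never collect over-represented classes.
    return 0.99 if under_represented else 0.01

# Two classes, target 50/50, current 80/20: class 1 is scarce.
print(collection_probability(0.2, 0.5, [0.8, 0.2], [0.5, 0.5], 200))   # 0.99
```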
Fig. 4 is a flowchart illustrating a method for adding sample data to be collected to a sample collection dataset according to a collection probability according to an embodiment of the present invention. As shown in fig. 4, the adding the sample data to be collected to the sample collection dataset according to the collection probability in step S140 may include:
step S141: generating a random number;
step S142: adding the sample data to be collected to the sample collection dataset if the random number is less than or equal to the collection probability; in the event that the random number is greater than the collection probability, not adding the sample data to be collected to the sample collection dataset.
In the above step S141, the random number may be generated by various random number generation means. In the step S142, when the random number is less than or equal to the collection probability, it may be determined that the sample data to be collected needs to be added to the sample collection dataset, and at this time, an addition identification value may be returned, and the sample data to be collected may be added to the sample collection dataset according to the addition identification value. In the case that the random number is greater than the collection probability, it may be determined that the sample data to be collected does not need to be added to the sample collection data set, and the sample data to be collected may be directly discarded or subjected to other processing.
In this embodiment, the sample data is collected according to the determined collection probability by using the random number, so that the sample data can be automatically collected according to the target category distribution.
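Steps S141 and S142 amount to a single uniform random draw compared against the collection probability; a minimal sketch (the helper name and True/False return convention are illustrative assumptions) is:

```python
import random

def should_collect(collection_probability):
    """Accept the sample if a uniform random number falls below the collection probability."""
    return random.random() <= collection_probability

if should_collect(0.75):
    print("add the sample to the collection data set")
else:
    print("discard the sample")
```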
In some embodiments, in the step S142, in the case that the random number is less than or equal to the collection probability, adding the sample data to be collected to the sample collection dataset may include:
and under the condition that the random number is less than or equal to the collection probability, if the sample collection data set is full, replacing the sample data added earliest in the sample collection data set with the sample data to be collected.
In this embodiment, when the sample collection data set is full, the new sample data is used to replace the old sample data, so that the required sample data can be collected while the size of the data set is kept fixed.
If the sample collection data set is not full, the sample data to be collected can be directly added into the sample collection data set so as to improve the collection speed of the sample data.
In other embodiments, if the sample collection data set is full, sample data of a category whose current proportion is higher than its target proportion may be found and removed instead. This can increase the speed at which the target category distribution is reached.
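The two replacement policies described above, replacing the earliest-added sample or evicting a sample of an over-represented category, can be sketched as follows; the function names and the list-based data set are illustrative assumptions.

```python
from collections import Counter

def add_replacing_oldest(dataset, sample, max_size):
    """Policy 1: when full, drop the sample that was added earliest."""
    if len(dataset) >= max_size:
        dataset.pop(0)
    dataset.append(sample)

def add_replacing_overrepresented(dataset, sample, max_size, target_ratios):
    """Policy 2: when full, evict a sample of a class whose share exceeds its target."""
    if len(dataset) >= max_size:
        counts = Counter(lbl for _, lbl in dataset)
        for i, (_, lbl) in enumerate(dataset):
            if counts[lbl] / len(dataset) > target_ratios[lbl]:
                del dataset[i]
                break
        else:                         # no over-represented class found: fall back to the oldest
            dataset.pop(0)
    dataset.append(sample)

data = [("a", 0), ("b", 0), ("c", 1)]
add_replacing_overrepresented(data, ("d", 1), 3, {0: 0.5, 1: 0.5})
print(data)                           # one class-0 sample was evicted to make room
```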
FIG. 5 is a flow chart of a data collection method according to another embodiment of the invention. As shown in fig. 5, after step S140, that is, after adding the sample data to be collected to the sample collection data set according to the collection probability, the data collection method shown in fig. 1 may further include:
step S150: and adding the category label corresponding to the sample data to be collected into the category label data set.
The category label data set is used to store the category label of each sample data in the sample collection data set, and the size of the category label data set is the same as the size of the sample collection data set.
In this embodiment, adding the sample data to be collected to the sample collection data set according to the collection probability means that it has been decided to add that sample data to the sample collection data set. In this case, the category label corresponding to the sample data to be collected is added to the category label data set, so that the category label data set is updated synchronously and its labels correspond to the sample data in the sample collection data set, which makes it convenient to compute the current proportion of each category of sample data.
In order that those skilled in the art will better understand the present invention, the following description will illustrate the implementation of the present invention in a specific embodiment.
FIG. 6 is a flow chart of a data collection method according to an embodiment of the invention. As shown in fig. 6, it is assumed that different kinds of sample data are continuously generated by one signal source. The sample data generated includes the data itself (metadata) and the category label. By utilizing the data collection method of the embodiment of the invention, the sample data in the data stream can be analyzed and counted, and then whether the sample data needs to be added into a sample collection data set with a fixed size or not is determined.
Define the signal source as S; it continuously generates sample data D_sm. Each generated sample data D_sm contains metadata d and a category label l, i.e., D_sm = (d, l). Assume that the number of classes among the labels is n_c. In this embodiment, a sample collection data set of fixed size is used to collect the sample data generated by the signal source S. A fixed-size data set is one whose number of data items has a fixed upper limit; for example, if the total size of the sample collection data set is set to n, the sample collection data set holds at most n sample data items. When the sample collection data set is full and new sample data needs to be added, old sample data in the sample collection data set may be replaced according to a set rule, for example, the oldest sample data in the sample collection data set is replaced by the new sample data.
First, the data sets used for statistics and the data set used to store the real sample data may be initialized. The data sets used for statistics hold only the class labels of sample data and may include a data stream statistics data set D_st1 and a category label data set D_st2. The data set used to store the real sample data may store only the metadata of the sample data, or may store data pairs (sample data) consisting of the metadata and the corresponding category label, and may comprise the sample collection data set D_tar.
The data stream statistics data set D_st1 can be used to estimate the probability with which sample data of each category appears in the data stream; its size may be set to n_st1 = n_c * 100. The larger the total size of D_st1, the more accurate the estimated occurrence probability of each category of sample data. Initially, the data stream statistics data set D_st1 collects each sample data in the data stream; once D_st1 is full, new sample data may replace the oldest sample data currently in D_st1.
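As a sketch (the variable names are assumptions), D_st1 can be kept as a fixed-size window of the most recent labels, from which the occurrence probability of each category in the stream is estimated:

```python
from collections import Counter, deque

n_c = 3
d_st1 = deque(maxlen=n_c * 100)              # window size n_c * 100 as in the description

def update_stream_stats(label):
    d_st1.append(label)                      # the oldest label is dropped automatically when full

def stream_class_probabilities():
    counts = Counter(d_st1)
    total = len(d_st1)
    return {cls: counts[cls] / total for cls in counts} if total else {}

for lbl in [0, 0, 1, 0, 2, 0]:
    update_stream_stats(lbl)
print(stream_class_probabilities())         # e.g. {0: 0.666..., 1: 0.166..., 2: 0.166...}
```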
The category label data set D_st2 can be used to count the proportion of the sample data of each class currently stored in the sample collection data set D_tar. Its total size n_st2 is the same as the total size n_tar of the sample collection data set D_tar, i.e., n_st2 = n_tar. The difference between the category label data set D_st2 and the sample collection data set D_tar is that D_st2 stores only the class label of each sample data, not its metadata (the data itself). When a new sample data arrives, whether it is added to the sample collection data set D_tar is decided by applying a set rule to the statistics of the category label data set D_st2.
Assume that the requirement on the sample collection data set D_tar is that every class of sample data occupies an equal proportion of D_tar. The target class distribution is the desired distribution of sample data classes, i.e., d_dst = {p_i | i = 1, 2, 3, …, n_c}, where p_i is the desired proportion of class-i sample data in the sample collection data set D_tar. The current class distribution is the distribution of the sample data classes currently in D_tar, i.e., d_cur = {q_i | i = 1, 2, 3, …, n_c}, where q_i is the current proportion of class-i sample data in D_tar. When the sample collection data set D_tar is not full, every new sample data may be added; when it is full, the set judgment rule is executed according to the statistics of the category label data set D_st2, and whether the new sample data is added to the sample collection data set D_tar is decided according to the result of that rule.
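The target and current class distributions defined above, and the split into under- and over-represented categories used in the steps below, can be computed directly from the category label data set D_st2. In the following sketch, the set names S_x and S_y follow the description, while the variable names and the uniform target distribution are illustrative assumptions.

```python
from collections import Counter

n_c = 3
d_st2 = [0, 0, 0, 0, 1, 2]                  # labels of the samples currently in D_tar

target_dist = {i: 1.0 / n_c for i in range(n_c)}                    # desired proportions p_i
counts = Counter(d_st2)
current_dist = {i: counts[i] / len(d_st2) for i in range(n_c)}      # current proportions q_i

s_x = {i for i in range(n_c) if target_dist[i] > current_dist[i]}   # under-represented classes
s_y = set(range(n_c)) - s_x                                          # the remaining classes

print(current_dist)   # {0: 0.666..., 1: 0.166..., 2: 0.166...}
print(s_x, s_y)       # {1, 2} {0}
```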
The data collection process may include the steps of:
(1) Update the data in the data stream statistics data set D_st1. This gives the distribution of the current data stream, can be used to count the class distribution of sample data in the data stream, and serves as a reference for the subsequent probability values. The data stream statistics data set D_st1 may also be used to temporarily store new sample data, which is taken out and added to the sample collection data set D_tar when needed. In other embodiments, the data stream statistics data set D_st1 may be omitted, and new sample data may be received directly from the signal source when deciding whether to add it to the sample collection data set D_tar.
(2) The current categories can be divided into two different sets according to the desired class distribution (the target class distribution). If the desired proportion (target proportion) of a class is larger than the proportion of that class's sample data currently in the sample collection data set D_tar, i.e., the sample collection data set D_tar has too little sample data of this category, new sample data of that class is assigned to set S_x; otherwise it is assigned to set S_y.
(3) Calculate the mean square error mse between the current class distribution and the target class distribution, i.e., mse = (1/n_c) * Σ_{i=1..n_c} (q_i - p_i)^2. If mse < 5/n_st2, the following step (4) is performed; otherwise, step (5) is performed.
(4) If the new sample data belongs to set S_x, it is added to the sample collection data set D_tar with probability p_acc = 0.5 + mse; otherwise, it is added with probability p_acc = 0.5 - mse. Then the subsequent step (6) is performed.
(5) If the new sample data belongs to set S_x, it is added to the sample collection data set D_tar with probability p_xthr; otherwise, it is added with probability p_ythr. The probabilities p_xthr and p_ythr are set thresholds. p_xthr can take a value in (0.5, 1); to reach the target distribution more quickly, p_xthr may be taken as 0.99. p_ythr can take a value in (0, 0.5); to reach the target distribution more quickly, p_ythr may be taken as 0.01. Then the subsequent step (6) is performed.
(6) Generate a random number. If the random number is less than the probability, it is determined that the new sample data is to be added to the sample collection data set D_tar and True is returned; otherwise False is returned. The result is then returned to the sample collection data set D_tar: if True is returned, the new sample data is added to the sample collection data set D_tar and the class label of the new sample is simultaneously added to the category label data set D_st2; otherwise, the data is discarded. When the sample collection data set D_tar is not full, every new sample data is added to the sample collection data set D_tar and the class label of the new sample is added to the category label data set D_st2; when the sample collection data set D_tar is full, the judgment rule above is executed to decide whether to add the new sample data.
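Putting steps (1) to (6) together, a compact end-to-end sketch might look as follows. The names d_st1, d_st2, d_tar, p_xthr and p_ythr mirror the description; the two-class demo stream, the uniform target distribution, and the overall code structure are illustrative assumptions rather than the reference implementation. With the heavily skewed demo stream, the collected data set is pulled much closer to the target distribution than the stream itself; with the coarse threshold used here (5/N_TAR with N_TAR = 200) it settles near, rather than exactly at, the 50/50 target.

```python
import random
from collections import Counter, deque

N_C = 2                                  # number of classes n_c
N_TAR = 200                              # size n_tar of the sample collection data set D_tar
P_XTHR, P_YTHR = 0.99, 0.01              # thresholds used when mse is large

d_st1 = deque(maxlen=N_C * 100)          # data stream statistics (labels only)
d_st2 = deque(maxlen=N_TAR)              # labels of the samples currently in d_tar
d_tar = deque(maxlen=N_TAR)              # the collected (metadata, label) pairs

def handle_sample(metadata, label):
    d_st1.append(label)                                      # step (1): update stream statistics
    if len(d_tar) < N_TAR:                                   # not yet full: accept everything
        d_tar.append((metadata, label))
        d_st2.append(label)
        return True
    counts = Counter(d_st2)
    current = [counts[i] / len(d_st2) for i in range(N_C)]   # current class distribution q_i
    target = [1.0 / N_C] * N_C                               # target class distribution p_i
    in_s_x = target[label] > current[label]                  # step (2): under-represented class?
    mse = sum((q - p) ** 2 for q, p in zip(current, target)) / N_C   # step (3)
    if mse < 5.0 / N_TAR:
        p_acc = 0.5 + mse if in_s_x else 0.5 - mse           # step (4)
    else:
        p_acc = P_XTHR if in_s_x else P_YTHR                 # step (5)
    if random.random() < p_acc:                              # step (6)
        d_tar.append((metadata, label))                      # oldest entry in d_tar drops out
        d_st2.append(label)                                  # keep the label data set in sync
        return True
    return False

# Demo: class 0 appears 90% of the time in the stream.
for _ in range(20_000):
    lbl = 0 if random.random() < 0.9 else 1
    handle_sample({"x": random.random()}, lbl)
print(Counter(d_st2))   # much closer to 50/50 than the 90/10 stream
```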
This embodiment addresses the class imbalance of data generated by a data stream in the case where new data is continuously produced and the total amount of data is unknown, a case in which the prior-art approach of resampling a complete, fixed-size data set cannot solve the class-imbalance problem. Data is extracted from the data stream in real time without knowing the total number of samples, and old data in the data set can be replaced by new data so that the content of the data set is kept up to date. The embodiment thus solves the problems of how to collect a fixed-size data set while new samples are continuously generated, and of how to decide whether to replace data in the existing data set with new data when the data set is full.
Based on the same inventive concept as the data collection method shown in fig. 1, the embodiment of the present invention further provides a data collection device, as described in the following embodiments. Because the principle of solving the problems of the data collection device is similar to that of the data collection method, the implementation of the data collection device can be referred to the implementation of the data collection method, and repeated details are not repeated.
Fig. 7 is a schematic structural diagram of a data collection device according to an embodiment of the present invention. As shown in fig. 7, the data collection device of some embodiments may include: a data receiving unit 210, a current proportion obtaining unit 220, a collection probability determining unit 230, and a data collection unit 240, which are connected in sequence.
A data receiving unit 210, configured to receive sample data to be collected;
a current proportion obtaining unit 220, configured to obtain the current proportion, in a sample collection dataset, of sample data belonging to the category to which the sample data to be collected belongs, wherein the sample collection dataset is a dataset with a fixed size;
a collection probability determining unit 230, configured to determine, according to the current ratio and a target ratio of sample data of a category to which the sample data to be collected belongs, a collection probability of the sample data of the category to which the sample data to be collected belongs;
a data collection unit 240, configured to add the sample data to be collected to the sample collection dataset according to the collection probability, so as to train a neural network model.
In some embodiments, the current proportion obtaining unit 220 may include: and a current ratio acquisition module.
The current proportion obtaining module is used for counting and calculating the proportion of the label of the category to which the sample data to be collected belongs in the category label data set to obtain the current proportion of the sample data of the category to which the sample data to be collected belongs in the sample collection data set; the category label dataset is used for storing a category label of each sample data in the sample collection dataset, and the size of the category label dataset is the same as that of the sample collection dataset.
In some embodiments, the collection probability determination unit 230 may include: a collection probability determination module.
A collection probability determination module, configured to determine a first probability as a collection probability of sample data of a category to which the sample data to be collected belongs if the current ratio is less than or equal to a target ratio of the category to which the sample data to be collected belongs, and determine a second probability as the collection probability if the current ratio is greater than the target ratio; the first probability is greater than the second probability.
Fig. 8 is a schematic structural diagram of a collection probability determination module according to an embodiment of the present invention. As shown in fig. 8, the collection probability determination module may include: a current category distribution obtaining module 2311, a mean square error calculating module 2312, and a first collection probability generating module 2313, which are connected in sequence.
A current category distribution obtaining module 2311, configured to obtain a current category distribution of sample data in the sample collection dataset;
a mean square error calculation module 2312 for calculating a mean square error between the current class distribution and a target class distribution of sample data in the sample collection dataset;
a first collection probability generating module 2313, configured to, when the mean square error is less than or equal to an error threshold set according to the total number of samples of the sample collection data set, determine a first probability obtained by adding the mean square error to 0.5 as the collection probability of sample data of the category to which the sample data to be collected belongs when the current proportion is less than or equal to the target proportion of that category, and determine a second probability obtained by subtracting the mean square error from 0.5 as the collection probability of sample data of that category when the current proportion is greater than the target proportion.
Fig. 9 is a schematic structural diagram of a collection probability determination module according to another embodiment of the present invention. As shown in fig. 9, the collection probability determination module shown in fig. 8 may further include: the second collection probability generation module 2314 is coupled to the mean square error calculation module 2312.
A second collection probability generating module 2314, configured to, when the mean square error is greater than an error threshold set according to the total number of samples of the sample collection data set, determine a first probability taken from the end close to 1 of the range (0.5, 1) as the collection probability of sample data of the category to which the sample data to be collected belongs when the current ratio is less than or equal to the target ratio of that category, and determine a second probability taken from the end close to 0 of the range (0, 0.5) as the collection probability of sample data of that category when the current ratio is greater than the target ratio.
FIG. 10 is a schematic diagram of a data collection unit according to an embodiment of the invention. As shown in fig. 10, the data collection unit 240 may include: a random number generation module 241 and a data collection module 242, which are connected to each other.
A random number generating module 241, configured to generate a random number;
a data collection module 242, configured to add the sample data to be collected to the sample collection dataset if the random number is less than or equal to the collection probability; in the event that the random number is greater than the collection probability, not adding the sample data to be collected to the sample collection dataset.
In some embodiments, the data collection module 242 may include: a sample collection dataset update module.
And a sample collection data set updating module, configured to, when the random number is less than or equal to the collection probability, replace, by the sample data to be collected, sample data that is added earliest in the sample collection data set if the sample collection data set is full.
Fig. 11 is a schematic structural diagram of a data collection device according to another embodiment of the present invention. As shown in fig. 11, the data collection apparatus shown in fig. 7 may further include: the category label dataset update module 250 is connected to the data collection unit 240.
A category label data set updating module 250, configured to add a category label corresponding to the sample data to be collected to the category label data set.
The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the steps of the method described in the above embodiment are implemented.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method described in the above embodiments.
In summary, in the data collection method, the data collection apparatus, the computer device and the computer readable storage medium according to the embodiments of the present invention, the current ratio of each type of sample data can be known by collecting the sample data using the data set with a fixed size; the reasonable collection probability can be determined according to the current proportion and the expected proportion of a certain type of sample data; and determining whether to add new sample data into the data set according to the collection probability, so that the sample data in the data set can be more consistent with the class distribution condition required by the training of the neural network model. Therefore, the scheme can collect and obtain the sample data set meeting the class distribution requirement of machine learning under the condition that new sample data is generated continuously.
In the description herein, reference to the description of the terms "one embodiment," "a particular embodiment," "some embodiments," "for example," "an example," "a particular example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The sequence of steps involved in the various embodiments is provided to schematically illustrate the practice of the invention, and the sequence of steps is not limited and can be suitably adjusted as desired.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of data collection, comprising:
receiving sample data to be collected;
acquiring the current ratio of sample data belonging to the category to which the sample data to be collected belongs in a sample collection dataset, wherein the sample collection dataset is a dataset with a fixed size;
determining the collection probability of the sample data of the category to which the sample data to be collected belongs according to the current proportion and the target proportion of the sample data of the category to which the sample data to be collected belongs;
adding the sample data to be collected to the sample collection dataset according to the collection probability for training a neural network model;
wherein, the determining the collection probability of the sample data of the category to which the sample data to be collected belongs according to the current ratio and the target ratio of the sample data of the category to which the sample data to be collected belongs includes:
and calculating the mean square error according to the current class distribution and the target class distribution of the sample data in the sample collection data set, and determining the collection probability in such a manner that it fluctuates by the mean square error above and below a probability of one half, wherein the current class distribution is the distribution of the classes of sample data in the sample collection data set.
2. The method of claim 1, wherein obtaining a current fraction of sample data in a sample collection dataset belonging to a category to which the sample data to be collected belongs comprises:
counting and calculating the proportion of the labels of the category to which the sample data to be collected belongs in the category label data set to obtain the current proportion of the sample data of the category to which the sample data to be collected belongs in the sample collection data set; the category label dataset is used for storing a category label of each sample data in the sample collection dataset, and the size of the category label dataset is the same as that of the sample collection dataset.
3. The data collection method of claim 1, wherein determining the collection probability of sample data of a category to which the sample data to be collected belongs according to the current fraction and a target fraction of the sample data of the category to which the sample data to be collected belongs comprises:
determining a first probability as a collection probability of sample data of a category to which the sample data to be collected belongs if the current ratio is less than or equal to a target ratio of the category to which the sample data to be collected belongs, and determining a second probability as the collection probability if the current ratio is greater than the target ratio; the first probability is greater than the second probability.
4. The data collection method according to claim 3, wherein determining a first probability as a collection probability of sample data of a category to which the sample data to be collected belongs in a case where the current proportion is less than or equal to a target proportion of the category to which the sample data to be collected belongs, and determining a second probability as the collection probability in a case where the current proportion is greater than the target proportion, comprises:
obtaining the current category distribution of sample data in the sample collection dataset;
calculating a mean square error between the current class distribution and a target class distribution of sample data in the sample collection dataset;
and under the condition that the mean square error is less than or equal to an error threshold set according to the total number of samples of the sample collection data set, when the current occupation ratio is less than or equal to a target occupation ratio of a category to which the sample data to be collected belongs, determining a first probability obtained by adding the mean square error to 0.5 as the collection probability of the sample data of the category to which the sample data to be collected belongs, and when the current occupation ratio is greater than the target occupation ratio, determining a second probability obtained by subtracting the mean square error from 0.5 as the collection probability of the sample data of the category to which the sample data to be collected belongs.
5. The data collection method according to claim 4, wherein a first probability is determined as a collection probability of sample data of a category to which the sample data to be collected belongs in a case where the current proportion is less than or equal to a target proportion of the category to which the sample data to be collected belongs, and a second probability is determined as the collection probability in a case where the current proportion is greater than the target proportion, further comprising:
and when the mean square error is larger than an error threshold value set according to the total number of samples of the sample collection data set, determining a first probability taken from one end, close to 1, in the range of (0.5,1) as the collection probability of the sample data of the class to which the sample data to be collected belongs when the current ratio is smaller than or equal to a target ratio of the class to which the sample data to be collected belongs, and determining a second probability taken from one end, close to 0, in the range of (0,0.5) as the collection probability of the sample data of the class to which the sample data to be collected belongs when the current ratio is greater than the target ratio.
6. The data collection method of claim 1, wherein adding the sample data to be collected to the sample collection dataset according to the collection probability comprises:
generating a random number;
adding the sample data to be collected to the sample collection dataset if the random number is less than or equal to the collection probability; in the event that the random number is greater than the collection probability, not adding the sample data to be collected to the sample collection dataset.
7. The data collection method of claim 6, wherein adding the sample data to be collected to the sample collection dataset if the random number is less than or equal to the collection probability comprises:
and, in the case where the random number is less than or equal to the collection probability, if the sample collection dataset is full, replacing the earliest-added sample data in the sample collection dataset with the sample data to be collected.
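One possible reading of claim 7 in code keeps the sample collection dataset at a fixed size by discarding the earliest-added sample when it is full; the helper name and the list-based storage are assumptions for illustration.

    def add_with_replacement(sample_collection: list, new_sample, max_size: int) -> None:
        # Claim 7: if the fixed-size sample collection dataset is full, replace the
        # earliest-added sample data with the sample data to be collected.
        if len(sample_collection) >= max_size:
            sample_collection.pop(0)  # drop the oldest sample
        sample_collection.append(new_sample)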
8. The data collection method of claim 2, wherein, after adding the sample data to be collected to the sample collection dataset according to the collection probability, the method further comprises:
and adding the category label corresponding to the sample data to be collected to the category label dataset.
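To keep the category label dataset of claim 8 aligned with the sample collection dataset, the two can be stored in parallel bounded queues so that the oldest sample and its label are evicted together; the names and the capacity below are assumptions for illustration.

    from collections import deque

    MAX_SIZE = 10000  # illustrative capacity of the fixed-size sample collection dataset

    sample_collection = deque(maxlen=MAX_SIZE)
    category_labels = deque(maxlen=MAX_SIZE)  # stays index-aligned with sample_collection

    def collect(sample, label) -> None:
        # Claims 7 and 8: append the sample and its category label together so that,
        # once the dataset is full, the earliest pair is evicted as a unit.
        sample_collection.append(sample)
        category_labels.append(label)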
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any one of claims 1 to 8 are implemented when the processor executes the program.
10. A computer-readable storage medium on which a computer program is stored which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN201811542893.7A 2018-12-17 2018-12-17 Data collection method and device Active CN109740750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811542893.7A CN109740750B (en) 2018-12-17 2018-12-17 Data collection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811542893.7A CN109740750B (en) 2018-12-17 2018-12-17 Data collection method and device

Publications (2)

Publication Number Publication Date
CN109740750A CN109740750A (en) 2019-05-10
CN109740750B true CN109740750B (en) 2021-06-15

Family

ID=66360404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811542893.7A Active CN109740750B (en) 2018-12-17 2018-12-17 Data collection method and device

Country Status (1)

Country Link
CN (1) CN109740750B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529172A (en) * 2019-09-18 2021-03-19 华为技术有限公司 Data processing method and data processing apparatus
US20220147668A1 (en) * 2020-11-10 2022-05-12 Advanced Micro Devices, Inc. Reducing burn-in for monte-carlo simulations via machine learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897918A (en) * 2017-02-24 2017-06-27 上海易贷网金融信息服务有限公司 A kind of hybrid machine learning credit scoring model construction method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909981B (en) * 2015-12-23 2020-08-25 阿里巴巴集团控股有限公司 Model training method, sample balancing method, model training device, sample balancing device and personal credit scoring system
CN105975992A (en) * 2016-05-18 2016-09-28 天津大学 Unbalanced data classification method based on adaptive upsampling
CN105975993A (en) * 2016-05-18 2016-09-28 天津大学 Unbalanced data classification method based on boundary upsampling
CN108920477A (en) * 2018-04-11 2018-11-30 华南理工大学 A kind of unbalanced data processing method based on binary tree structure
CN108960561A (en) * 2018-05-04 2018-12-07 阿里巴巴集团控股有限公司 A kind of air control model treatment method, device and equipment based on unbalanced data
CN108647727A (en) * 2018-05-10 2018-10-12 广州大学 Unbalanced data classification lack sampling method, apparatus, equipment and medium
CN108694413A (en) * 2018-05-10 2018-10-23 广州大学 Adaptively sampled unbalanced data classification processing method, device, equipment and medium


Also Published As

Publication number Publication date
CN109740750A (en) 2019-05-10

Similar Documents

Publication Publication Date Title
CN109871954B (en) Training sample generation method, abnormality detection method and apparatus
CN112241494B (en) Key information pushing method and device based on user behavior data
CN109471847B (en) I/O congestion control method and control system
CN109327480B (en) Multi-step attack scene mining method
CN111754345A (en) Bit currency address classification method based on improved random forest
CN110224850A (en) Telecommunication network fault early warning method, device and terminal device
CN107003992A (en) Perception associative memory for neural language performance identifying system
CN109740750B (en) Data collection method and device
CN106911591A (en) The sorting technique and system of network traffics
CN111160959A (en) User click conversion estimation method and device
Chu et al. Co-training based on semi-supervised ensemble classification approach for multi-label data stream
CN112907632A (en) Single-towing ship target identification method and device
Li et al. Scalable random forests for massive data
CN113705215A (en) Meta-learning-based large-scale multi-label text classification method
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
CN106909492B (en) Method and device for tracking service data
CN104468276B (en) Network flow identification method based on random sampling multi-categorizer
CN111027599B (en) Clustering visualization method and device based on random sampling
CN111737371B (en) Data flow detection classification method and device capable of dynamically predicting
CN113420733A (en) Efficient distributed big data acquisition implementation method and system
CN112819527A (en) User grouping processing method and device
CN111860334A (en) Cascade vehicle type classification method and device based on confusion matrix
Köktürk et al. Model-free expectation maximization for divisive hierarchical clustering of multicolor flow cytometry data
CN117520994B (en) Method and system for identifying abnormal air ticket searching user based on user portrait and clustering technology
Leshem Improvement of adaboost algorithm by using random forests as weak learner and using this algorithm as statistics machine learning for traffic flow prediction. Research proposal for a Ph. D

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200201

Address after: 100041, room 2, building 3, building 30, Xing Xing street, Shijingshan District, Beijing,

Applicant after: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Address before: 100083 the first floor of the western small building, No. 18, No. 18, Xue Qing Lu Jia, Beijing

Applicant before: Beijing Shenji Intelligent Technology Co., Ltd.

GR01 Patent grant