WO2019200782A1 - Sample data classification method, model training method, electronic device and storage medium - Google Patents

Sample data classification method, model training method, electronic device and storage medium

Info

Publication number
WO2019200782A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
distance
value
density
samples
Prior art date
Application number
PCT/CN2018/100157
Other languages
English (en)
Chinese (zh)
Inventor
王晨羽
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019200782A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/906 Clustering; Classification

Definitions

  • the present application relates to the field of data processing, and in particular, to a sample data classification method, a model training method, an electronic device, and a storage medium.
  • a sample data classification method comprising:
  • the sample data is clustered into a plurality of subsets based on the at least one cluster center and features of each sample.
  • a model training method comprising:
  • the sample data of each category is classified by using the sample data classification method described in any embodiment to obtain a plurality of subsets of each category;
  • the subsets with the same sorting position are read from the plurality of sorted subsets of each category in turn as training samples of the model, and the model is trained.
  • An electronic device comprising a memory for storing at least one instruction and a processor for executing the at least one instruction to implement the sample data classification method of any embodiment and/or the model training method of any embodiment.
  • a non-volatile readable storage medium storing at least one instruction that, when executed by a processor, implements the sample data classification method of any embodiment and/or the model training method of any embodiment.
  • the present application calculates the features of each sample in the sample data; calculates a distance set of each sample according to those features, the distance set of each sample including the distance between that sample and each of the remaining samples; calculates the density value and the density distance value of each sample according to its distance set; determines at least one cluster center according to the density value and density distance value of each sample; and clusters the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
  • the present application trains from easy to difficult to avoid the difficult training samples being eliminated, thereby improving the adaptability of the model parameters.
  • FIG. 1 is a flow chart of a first preferred embodiment of a sample data classification method of the present application.
  • FIG. 2 is a flow chart of a first preferred embodiment of the method of training a model of the present application.
  • FIG. 3 is a block diagram showing the program of a first preferred embodiment of the sample data sorting apparatus of the present application.
  • FIG. 4 is a block diagram of a program of a first preferred embodiment of the model training device of the present application.
  • FIG. 5 is a schematic structural diagram of a preferred embodiment of an electronic device in at least one example of the present application.
  • Referring to FIG. 1, a flowchart of a first preferred embodiment of the sample data classification method of the present application is shown.
  • the order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
  • the electronic device calculates characteristics of each sample of the sample data.
  • the sample data includes, but is not limited to, pre-acquired data and data crawled from the network. In large-scale sample data collection, some data has low correlation with the category indicated by the sample data, or is simply erroneous. To improve the accuracy of model training, the sample data must be classified so as to automatically detect the simple samples that are easy to learn during model training and the difficult samples that are not, thereby achieving classification of the sample data.
  • the features of each sample are extracted using a feature extraction model.
  • the feature extraction model includes, but is not limited to, a deep convolutional neural network model.
  • features of the sample data are extracted by a deep convolutional neural network.
  • For example, the portion of any network (VGG-16, ResNet-50, etc.) in front of the Soft-max classification layer can be regarded as a feature extractor, and the output of that layer is taken as the extracted feature.
  • the deep convolutional neural network model is composed of one input layer, 20 convolution layers, 6 pooling layers, 3 hidden layers, and 1 classification layer.
  • the model architecture of the deep convolutional neural network model is shown in FIG. 3, wherein Conv a-b (for example, Conv 3-64) indicates that the convolution kernels of that layer have dimension a × a and that the layer has b convolution kernels; Maxpool2 indicates that the pooling kernel of the pooling layer has dimension 2 × 2; FC-c (for example, FC-6) indicates that the hidden layer (i.e., the fully connected layer) has c output nodes; Soft-max indicates that the classification layer classifies the input image using the Soft-max classifier.
  • the training sample is used for training learning to obtain a trained deep convolutional neural network model.
  • Importing the sample data into the trained deep convolutional neural network model can accurately and automatically extract features of each sample in the sample data.
  • the larger the training sample set, the more accurate the features extracted by the trained deep convolutional neural network model.
  • the deep convolutional neural network model can also be in other forms of expression, and the present application does not impose any limitation.
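  • The embodiments describe the feature extractor only in prose; the following is a minimal sketch of that step, assuming a PyTorch/torchvision ResNet-50 backbone rather than the patent's custom 20-convolution-layer architecture. The function names and preprocessing are illustrative, not taken from the patent.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def build_feature_extractor():
    # Drop the final fully connected (classification) layer, so the output
    # of the layer in front of the Soft-max classifier is the feature.
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    return torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

def extract_feature(extractor, image_path):
    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return extractor(x).flatten(1).squeeze(0)  # 2048-d feature vector
```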
  • the electronic device calculates a distance set of each sample according to characteristics of each sample.
  • the distance set of each sample includes the distance between that sample and each of the remaining samples, where the remaining samples corresponding to a sample are all samples in the sample data other than the sample itself.
  • For example, with three samples, each sample's distance set contains two distances, so the distances form a 3*2 (or, transposed, 2*3) distance matrix.
  • the distance includes, without limitation, a Euclidean distance, a cosine distance, and the like.
  • Each distance value in the distance matrix is greater than zero. For example, when the calculated cosine distance is less than 0, the absolute value of the calculated cosine distance is taken.
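  • As a sketch of how the distance sets described above can be assembled (cosine distance assumed; `distance_matrix` and `distance_set` are illustrative helper names, not from the patent):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def distance_matrix(features):
    """features: (n_samples, n_dims) array of extracted features."""
    D = squareform(pdist(features, metric="cosine"))  # n x n, zero diagonal
    # The embodiment takes absolute values so every entry is non-negative.
    return np.abs(D)

def distance_set(D, i):
    """The distance set of sample i: row i without the self-distance."""
    return np.delete(D[i], i)
```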
  • the electronic device calculates a density value of each sample according to a distance set of each sample and calculates a density distance value of each sample.
  • each distance in the distance set of each sample is compared with a distance threshold, the number of distances greater than the distance threshold is determined, and that number is taken as the density value of the sample.
  • the density value of any one of the samples is calculated as follows:
  • ⁇ i represents the density value of the ith sample in the sample data
  • D ij represents the distance between the ith sample and the jth sample in the sample data
  • d c represents the distance threshold
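  • The equation itself did not survive extraction into this text. From the variable definitions above, a hedged LaTeX reconstruction is:

```latex
% Hedged reconstruction of the density value from the definitions above.
% The comparison direction follows the translated text ("greater than");
% classical density-peaks clustering counts D_{ij} < d_c instead.
\rho_i = \sum_{j \neq i} \mathbb{1}\left( D_{ij} > d_c \right)
```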
  • calculating the density distance value of each sample comprises:
  • for the sample having the largest density value, selecting the maximum distance in its distance set as its density distance value.
  • for each sample in a second sample set, selecting as its density distance value the minimum distance between that sample and any sample having a larger density value; the second sample set includes the samples of the sample data other than the sample having the largest density value.
  • the density distance value of each sample is calculated as follows:
  • ⁇ i represents the density distance value of the i-th sample
  • ⁇ i represents the density value of the i-th sample
  • ⁇ j represents the density value of the j-th sample
  • D ij represents the distance between the i-th sample and the j-th sample.
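  • As above, the equation image is missing; a hedged reconstruction matching the description (the densest sample takes the maximum distance in its distance set, every other sample takes its minimum distance to a denser sample) is:

```latex
\delta_i =
  \begin{cases}
    \min\limits_{j:\,\rho_j > \rho_i} D_{ij}, & \text{if some } \rho_j > \rho_i,\\[4pt]
    \max\limits_{j} D_{ij},                   & \text{if } \rho_i = \max_k \rho_k.
  \end{cases}
```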
  • the electronic device determines at least one cluster center according to a density value of each sample and a density distance value of each sample.
  • the determining of at least one cluster center according to the density value and the density distance value of each sample comprises:
  • calculating a cluster metric value for each sample, where the cluster metric value of each sample is equal to the product of the sample's density value and its density distance value.
  • determining the at least one cluster center according to the cluster metric value of each sample includes:
  • selecting samples whose cluster metric value is greater than a threshold as cluster centers.
  • the threshold is configured according to the cluster metric values of the samples; for example, the mean of the cluster metric values of all samples is calculated and taken as the threshold.
  • the electronic device clusters the sample data into a plurality of subsets based on the at least one cluster center and characteristics of each sample.
  • the sample data is clustered into a plurality of subsets according to a distance set of samples corresponding to each cluster center in the at least one cluster center by using a clustering algorithm.
  • the clustering algorithm includes, but is not limited to, a k-means clustering algorithm, a hierarchical clustering algorithm, and the like.
  • a sample whose distance from each cluster center of the at least one cluster center exceeds a distance threshold is determined to be an error sample. This can effectively eliminate erroneous samples.
  • the larger the density value of a sample, the more samples are similar to it.
  • the larger the density distance value of a sample, the farther the subset to which the sample belongs is from other subsets. Therefore, after clustering according to the above embodiment, the distance between samples in the same subset is shortened, and the distance between samples in different subsets becomes larger. A consolidated sketch of this clustering procedure is given after this discussion.
  • the denser the samples of a subset, the more similar the features of the pictures they represent and the more similar the data in the subset is to the category represented by the sample data; these are simple samples, and the model can easily learn their characteristics.
  • clustering the sample data through the cluster center selected in the above embodiment can also effectively eliminate the erroneous samples, thereby improving the accuracy of the parameters of the subsequent training model.
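  • Putting the pieces together, the following is a hedged sketch of the clustering procedure described above: density values, density distance values, center selection by the cluster metric (density × density distance) against its mean, nearest-center assignment, and error-sample elimination. The density count here uses the classical density-peaks convention (distances smaller than d_c); the translated text reads "greater than", which appears to be a translation slip. All names, and the separate `error_threshold` parameter, are illustrative.

```python
import numpy as np

def density_peak_subsets(D, d_c, error_threshold):
    """D: symmetric pairwise distance matrix; d_c: distance threshold."""
    n = D.shape[0]
    # Density value: how many other samples lie within d_c of sample i.
    rho = (D < d_c).sum(axis=1) - 1  # subtract the zero self-distance
    # Density distance value: min distance to a denser sample, or the
    # max distance in the row for the globally densest sample.
    delta = np.empty(n)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        delta[i] = D[i, denser].min() if denser.size else D[i].max()
    gamma = rho * delta                           # cluster metric value
    centers = np.where(gamma > gamma.mean())[0]   # centers: above-mean metric
    labels = centers[np.argmin(D[:, centers], axis=1)]  # nearest center
    # Samples farther than the threshold from every center are errors.
    errors = np.where(D[:, centers].min(axis=1) > error_threshold)[0]
    return centers, labels, errors
```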
  • FIG. 2 is a flow chart of a second preferred embodiment of the model training method of the present application.
  • the order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
  • the electronic device acquires sample data of each category.
  • the trained model is used to identify the category to which a picture to be detected belongs. For example, the model is a vehicle part identification model used to identify which vehicle part a picture to be tested shows. In this case, sample data of the various parts of the vehicle must be obtained, and the sample data of one part belongs to one category.
  • the electronic device classifies sample data of each category to obtain multiple subsets of each category.
  • the sample data of each category is classified using the sample data classification method of the first preferred embodiment.
  • the processing of step S21 is the same as the data classification method in the first preferred embodiment and will not be described in detail herein.
  • the electronic device calculates the relevance of each subset of the plurality of subsets of each category to the category to which the subset belongs.
  • the distance between samples in the same subset can be shortened, and the distance between samples in different subsets becomes larger.
  • the denser the samples of a subset, the more similar the features of the pictures they represent, the more relevant the data in the subset is to the category in which the subset is located, and the higher the similarity; these are simple samples.
  • the sparser the samples of a subset, the more diverse the pictures they represent, and the more difficult the samples.
  • the number of samples included in each subset is taken as the relevance of the subset to the category in which it is located. For example, for a category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the number of samples in the first subset is 40, the number in the second subset is 100, and the number in the third subset is 10, the value 40 indicates the relevance of the first subset to its category.
  • the electronic device sorts the plurality of subsets of each category from high to low according to the relevance of each subset to its category, obtaining a plurality of sorted subsets of each category.
  • For example, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the number of samples in the first subset is 40, in the second subset 100, and in the third subset 10, the plurality of sorted subsets of that category is: the second subset, the first subset, and the third subset.
  • the electronic device sequentially reads, from a plurality of sorted subsets of each category, a subset with the same sorting position as a training sample of the model, and trains the model.
  • the first subset of each category is read as training samples of the model, and the model is trained until a first termination condition is reached; then the second subset of each category is read and added to the model's training samples, and training continues until all subsets of each category have been used as training samples.
  • because the subsets of each category are sorted from high to low relevance to their category, simple samples are ranked first and are easier to train on, while difficult samples are ranked later and are harder to train on. This divides the training of the model into multiple subtasks that are performed from easy to difficult according to task difficulty, so that difficult training samples are not discarded and the model can learn the characteristics of each category from easy to difficult, thereby improving the adaptability of the model parameters.
  • the higher the ranking position of a subset, the larger its corresponding weight.
  • For example, there are two categories, category A and category B.
  • the sorted subsets in category A are: subset A1, subset A2, the weight corresponding to subset A1 is 1, and the weight corresponding to subset A2 is 0.5.
  • the sorted subset in category B is subset B1 and subset B2, the weight corresponding to subset B1 is 1, and the weight corresponding to subset B2 is 0.5.
  • First, subset A1 and subset B1 are read to train the model. After the first termination condition is reached, subset A2 and subset B2 are read and added to the model's training samples, so that subset A1, subset B1, subset A2, and subset B2 are all used as training samples, and the model is trained until training ends. A sketch of this staged schedule is given below.
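  • The staged, weighted schedule in the example above can be sketched as follows; `train_until` is a hypothetical helper standing in for one round of training to a termination condition, and the per-stage weights (1, 0.5, ...) follow the example.

```python
def curriculum_train(model, sorted_subsets_by_category, stage_weights, train_until):
    """sorted_subsets_by_category: {category: [easiest subset, ..., hardest]}."""
    n_stages = max(len(s) for s in sorted_subsets_by_category.values())
    samples, weights = [], []
    for stage in range(n_stages):
        # Add the stage-th subset of every category to the training pool.
        for subsets in sorted_subsets_by_category.values():
            if stage < len(subsets):
                samples.extend(subsets[stage])
                weights.extend([stage_weights[stage]] * len(subsets[stage]))
        # Train on everything accumulated so far until this stage's
        # termination condition is reached.
        train_until(model, samples, weights, stage)
    return model
```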
  • a sample picture of each part of the vehicle is obtained, the samples of one part forming one category. The samples of each part are processed by the sample data classification method of the first preferred embodiment to obtain a plurality of subsets for that part; the plurality of subsets of each part are then sorted using the method of the second preferred embodiment, and the vehicle part identification model is trained based on the plurality of sorted subsets of each part.
  • in this way, the training of the vehicle part recognition model is divided into a plurality of subtasks performed from easy to difficult according to task difficulty, so that difficult training samples are not discarded; the model learns the characteristics of the sample pictures of the various parts from easy to difficult, thereby improving the adaptability of the model parameters.
  • the present application classifies the training sample data of each category into a plurality of subsets according to the difficulty level, so that the distance between the samples in the same subset becomes shorter, and the distance between the samples in different subsets becomes larger.
  • the denser the samples of a subset, the more similar the features of the pictures they represent and the more similar the data in the subset is to the category represented by the sample data; these are simple samples, and the model can easily learn their characteristics.
  • the sparser the samples of a subset and the more diverse the pictures they represent, the more complicated the data of the subset is considered to be; these are difficult samples.
  • the plurality of subsets of the training sample data are sorted from easy to difficult, so that the training of the model is divided into multiple subtasks performed from easy to difficult according to task difficulty; difficult training samples are thus not discarded, allowing the model to learn the characteristics of each category from easy to difficult and thereby improving the adaptability of the model parameters.
  • the sample data classification device 3 includes, but is not limited to, one or more of the following modules: a calculation module 30, a determination module 31, and a clustering module 32.
  • a unit referred to in this application refers to a series of computer readable instruction segments that can be executed by a processor of the sample data classification device 3 and that are capable of performing a fixed function, which are stored in a memory. The function of each unit will be detailed in the subsequent embodiments.
  • the calculation module 30 calculates features of each sample of sample data.
  • the sample data includes, but is not limited to, pre-acquired data and data crawled from the network. In large-scale sample data collection, some data has low correlation with the category indicated by the sample data, or is simply erroneous. To improve the accuracy of model training, the sample data must be classified so as to automatically detect the simple samples that are easy to learn during model training and the difficult samples that are not, thereby achieving classification of the sample data.
  • the calculation module 30 extracts features of each sample using a feature extraction model.
  • the feature extraction model includes, but is not limited to, a deep convolutional neural network model.
  • features of the sample data are extracted by a deep convolutional neural network.
  • For example, the portion of any network (VGG-16, ResNet-50, etc.) in front of the Soft-max classification layer can be regarded as a feature extractor, and the output of that layer is taken as the extracted feature.
  • the deep convolutional neural network model is composed of one input layer, 20 convolution layers, 6 pooling layers, 3 hidden layers, and 1 classification layer.
  • the model architecture of the deep convolutional neural network model is shown in FIG. 3, wherein Conv a-b (for example, Conv 3-64) indicates that the convolution kernels of that layer have dimension a × a and that the layer has b convolution kernels; Maxpool2 indicates that the pooling kernel of the pooling layer has dimension 2 × 2; FC-c (for example, FC-6) indicates that the hidden layer (i.e., the fully connected layer) has c output nodes; Soft-max indicates that the classification layer classifies the input image using the Soft-max classifier.
  • the training sample is used for training learning to obtain a trained deep convolutional neural network model.
  • Importing the sample data into the trained deep convolutional neural network model can accurately and automatically extract features of each sample in the sample data.
  • the larger the training sample set, the more accurate the features extracted by the trained deep convolutional neural network model.
  • the deep convolutional neural network model can also be in other forms of expression, and the present application does not impose any limitation.
  • the calculation module 30 calculates a distance set for each sample based on the characteristics of each sample.
  • the distance set of each sample includes the distance between that sample and each of the remaining samples, where the remaining samples corresponding to a sample are all samples in the sample data other than the sample itself.
  • For example, with three samples, each sample's distance set contains two distances, so the distances form a 3*2 (or, transposed, 2*3) distance matrix.
  • the distance includes, without limitation, a Euclidean distance, a cosine distance, and the like.
  • Each distance value in the distance matrix is greater than zero. For example, when the calculated cosine distance is less than zero, the absolute value of the calculated cosine distance is taken.
  • the calculation module 30 calculates a density value for each sample and calculates a density distance value for each sample based on the distance set of each sample.
  • the calculation module 30 compares each distance in the distance set of each sample with a distance threshold, determines the number of distances greater than the distance threshold, and takes that number as the density value of the sample. The larger the density value of a sample, the more samples are similar to it.
  • the density value of any one of the samples is calculated as follows:
  • ⁇ i represents the density value of the ith sample in the sample data
  • D ij represents the distance between the ith sample and the jth sample in the sample data
  • d c represents the distance threshold
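  • As in the method embodiment, the equation image is missing; the same hedged reconstruction applies:

```latex
% Comparison direction follows the translated text ("greater than");
% classical density-peaks clustering counts D_{ij} < d_c instead.
\rho_i = \sum_{j \neq i} \mathbb{1}\left( D_{ij} > d_c \right)
```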
  • the calculation, by the calculation module 30, of the density distance value of each sample includes:
  • for the sample having the largest density value, selecting the maximum distance in its distance set as its density distance value.
  • for each sample in a second sample set, selecting as its density distance value the minimum distance between that sample and any sample having a larger density value; the second sample set includes the samples of the sample data other than the sample having the largest density value.
  • the density distance value of each sample is calculated as follows:
  • ⁇ i represents the density distance value of the i-th sample
  • ⁇ i represents the density value of the i-th sample
  • ⁇ j represents the density value of the j-th sample
  • D ij represents the distance between the i-th sample and the j-th sample.
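  • As in the method embodiment, the equation image is missing; the same hedged reconstruction applies:

```latex
\delta_i =
  \begin{cases}
    \min\limits_{j:\,\rho_j > \rho_i} D_{ij}, & \text{if some } \rho_j > \rho_i,\\[4pt]
    \max\limits_{j} D_{ij},                   & \text{if } \rho_i = \max_k \rho_k.
  \end{cases}
```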
  • the determining module 31 determines at least one cluster center according to the density value of each sample and the density distance value of each sample.
  • the determination by the determining module 31 of the at least one cluster center according to the density value and the density distance value of each sample comprises:
  • calculating a cluster metric value for each sample, where the cluster metric value of each sample is equal to the product of the sample's density value and its density distance value.
  • the determination by the determining module 31 of the at least one cluster center according to the cluster metric value of each sample includes:
  • selecting samples whose cluster metric value is greater than a threshold as cluster centers.
  • the threshold is configured according to the cluster metric values of the samples; for example, the mean of the cluster metric values of all samples is calculated and taken as the threshold.
  • the clustering module 32 clusters the sample data into a plurality of subsets based on the at least one cluster center and features of each sample.
  • the clustering module 32 clusters the sample data into a plurality of subsets according to a distance set of samples corresponding to each of the cluster centers in the at least one cluster center.
  • the clustering algorithm includes, but is not limited to, a k-means clustering algorithm, a hierarchical clustering algorithm, and the like.
  • the determining module 31 determines a sample whose distance from each cluster center of the at least one cluster center exceeds a distance threshold to be an error sample. This can effectively eliminate erroneous samples.
  • the larger the density value of a sample, the more samples are similar to it.
  • the larger the density distance value of a sample, the farther the subset to which the sample belongs is from other subsets. Therefore, after clustering according to the above embodiment, the distance between samples in the same subset is shortened, and the distance between samples in different subsets becomes larger.
  • the denser the samples of a subset, the more similar the features of the pictures they represent and the more similar the data in the subset is to the category represented by the sample data; these are simple samples, and the model can easily learn their characteristics.
  • clustering the sample data through the cluster center selected in the above embodiment can also effectively eliminate the erroneous samples, thereby improving the accuracy of the parameters of the subsequent training model.
  • the model training device 4 includes, but is not limited to, one or more of the following modules: a data acquisition module 40, a data clustering module 41, a correlation calculation module 42, a ranking module 43, and a training module 44.
  • a unit referred to in this application refers to a series of computer readable instruction segments that can be executed by a processor of the model training device 4 and that are capable of performing a fixed function, which is stored in a memory. The function of each unit will be detailed in the subsequent embodiments.
  • the data acquisition module 40 acquires sample data for each category.
  • the trained model is used to identify the category to which a picture to be detected belongs. For example, the model is a vehicle part identification model used to identify which vehicle part a picture to be tested shows. In this case, sample data of the various parts of the vehicle must be obtained, and the sample data of one part belongs to one category.
  • the data clustering module 41 classifies the sample data for each category to obtain a plurality of subsets of each category.
  • the sample data of each category is classified using the sample data classification method of the first preferred embodiment.
  • the data clustering module 41 is used to implement the sample data classification method in the first preferred embodiment, which is not described in detail herein.
  • the relevance calculation module 42 calculates the relevance of each subset of the plurality of subsets of each category to the category in which each subset is located.
  • the distance between samples in the same subset can be shortened, and the distance between samples in different subsets becomes larger.
  • the denser the samples of a subset, the more similar the features of the pictures they represent, the more relevant the data in the subset is to the category in which the subset is located, and the higher the similarity; these are simple samples.
  • the sparser the samples of a subset, the more diverse the pictures they represent, and the more difficult the samples.
  • the number of samples included in each subset is taken as the relevance of the subset to the category in which it is located. For example, for a category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the number of samples in the first subset is 40, the number in the second subset is 100, and the number in the third subset is 10, the value 40 indicates the relevance of the first subset to its category.
  • the sorting module 43 sorts the plurality of subsets of each category from high to low according to the relevance of each subset to its category, obtaining a plurality of sorted subsets of each category.
  • For example, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the number of samples in the first subset is 40, in the second subset 100, and in the third subset 10, the plurality of sorted subsets of that category is: the second subset, the first subset, and the third subset.
  • the training module 44 sequentially reads the subsets with the same sorting position from the plurality of sorted subsets of each category as training samples of the model, and trains the model.
  • the training module 44 reads the first subset of each category from the plurality of sorted subsets of each category as training samples of the model and trains the model; after a first termination condition is reached, it reads the second subset of each category, adds it to the model's training samples, and continues training until all subsets of each category have been used as training samples.
  • the training of the model is thus divided into multiple subtasks performed from easy to difficult according to task difficulty, so that difficult training samples are not discarded and the model can learn the characteristics of each category from easy to difficult.
  • the higher the ranking position of a subset, the larger its corresponding weight.
  • For example, there are two categories, category A and category B.
  • the sorted subsets in category A are: subset A1, subset A2, the weight corresponding to subset A1 is 1, and the weight corresponding to subset A2 is 0.5.
  • the sorted subset in category B is subset B1 and subset B2, the weight corresponding to subset B1 is 1, and the weight corresponding to subset B2 is 0.5.
  • First, subset A1 and subset B1 are read to train the model. After the first termination condition is reached, subset A2 and subset B2 are read and added to the model's training samples, so that subset A1, subset B1, subset A2, and subset B2 are all used as training samples, and the model is trained until training ends.
  • a sample picture of each part of the vehicle is obtained, the samples of one part forming one category. The samples of each part are processed by the sample data classification method of the first preferred embodiment to obtain a plurality of subsets for that part; the plurality of subsets of each part are then sorted using the method of the second preferred embodiment, and the vehicle part identification model is trained based on the plurality of sorted subsets of each part.
  • in this way, the training of the vehicle part recognition model is divided into a plurality of subtasks performed from easy to difficult according to task difficulty, so that difficult training samples are not discarded; the model learns the characteristics of the sample pictures of the various parts from easy to difficult, thereby improving the adaptability of the model parameters.
  • the present application classifies the training sample data of each category into a plurality of subsets according to the difficulty level, so that the distance between the samples in the same subset becomes shorter, and the distance between the samples in different subsets becomes larger.
  • the denser the samples of a subset, the more similar the features of the pictures they represent and the more similar the data in the subset is to the category represented by the sample data; these are simple samples, and the model can easily learn their characteristics.
  • the sparser the samples of a subset and the more diverse the pictures they represent, the more complicated the data of the subset is considered to be; these are difficult samples.
  • the plurality of subsets of the training sample data are sorted from easy to difficult, so that the training of the model is divided into multiple subtasks performed from easy to difficult according to task difficulty; difficult training samples are thus not discarded, allowing the model to learn the characteristics of each category from easy to difficult and thereby improving the adaptability of the model parameters.
  • the above-described integrated unit implemented in the form of a software program module can be stored in a non-volatile readable storage medium.
  • the software program module described above is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform part of the steps of the method of each embodiment of the present application.
  • the electronic device 5 comprises at least one transmitting device 51, at least one memory 52, at least one processor 53, at least one receiving device 54, and at least one communication bus.
  • the communication bus is used to implement connection communication between these components.
  • the electronic device 5 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and the like.
  • the electronic device 5 may also comprise a network device and/or a user device.
  • the network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
  • the electronic device 5 can be, but is not limited to, any electronic product that can interact with a user through a keyboard, a touch pad, or a voice control device, such as a tablet, a smart phone, a personal digital assistant (PDA), smart wearable devices, camera equipment, monitoring equipment, and other terminals.
  • the network in which the electronic device 5 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
  • the receiving device 54 and the sending device 51 may be wired transmission ports, or may be wireless devices, for example, including antenna devices, for performing data communication with other devices.
  • the memory 52 is used to store program code.
  • the memory 52 may be a circuit with a storage function that has no physical form within an integrated circuit, such as RAM (Random-Access Memory), FIFO (First In First Out), and the like.
  • the memory 52 may also be a memory having a physical form, such as a memory stick, a TF card (Trans-flash Card), a smart media card, a secure digital card, a flash memory card, and other storage devices.
  • the processor 53 can include one or more microprocessors and digital processors.
  • the processor 53 can call program code stored in the memory 52 to perform related functions.
  • the various modules described in FIG. 3 are program code stored in the memory 52 and executed by the processor 53 to implement a sample data classification method; and/or the various modules described in FIG. 4 are program code stored in the memory 52 and executed by the processor 53 to implement a model training method.
  • the processor 53, also known as a central processing unit (CPU), is a very large-scale integrated circuit serving as the computing core (Core) and control unit (Control Unit).
  • the embodiment of the present application further provides a non-volatile readable storage medium having stored thereon computer instructions that, when executed by an electronic device including one or more processors, cause the electronic device to perform the method as described above.
  • the memory 52 in the electronic device 5 stores a plurality of instructions to implement a sample data classification method, and the processor 53 can execute the plurality of instructions to implement:
  • calculating the features of each sample in the sample data; calculating a distance set of each sample according to those features, the distance set of each sample including the distance between that sample and each of the remaining samples; calculating the density value and the density distance value of each sample according to its distance set; determining at least one cluster center according to the density value and density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
  • the processor executing the plurality of instructions further implements:
  • each distance in the distance set of each sample is compared with a distance threshold, the number of distances greater than the distance threshold is determined, and that number is taken as the density value of each sample.
  • the calculating the density distance value of each sample comprises:
  • for the sample having the largest density value, the maximum distance in its distance set is selected as its density distance value; for each remaining sample, the minimum distance between it and any sample having a larger density value is selected as its density distance value.
  • the processor executing the plurality of instructions further implements:
  • At least one cluster center is determined based on the cluster metric of each sample.
  • the cluster metric value of each sample is equal to the product of the density value of each sample and the density distance value of each sample.
  • when determining the at least one cluster center according to the cluster metric value of each sample, the processor executing the plurality of instructions further implements:
  • sorting the cluster metric values from large to small and selecting, from the sorted cluster metric values, the samples ranked in the top preset number of positions as cluster centers; or
  • selecting samples whose cluster metric value is greater than a threshold as cluster centers.
  • the processor executing the plurality of instructions further implements:
  • a sample having a distance from each cluster center of the at least one cluster center exceeding a distance threshold is determined as an error sample.
  • the plurality of instructions corresponding to the sample data classification method of any embodiment are stored in the memory 52 and executed by the processor 53, and are not described in detail herein.
  • the memory 52 in the electronic device 5 stores a plurality of instructions to implement a model training method
  • the processor 53 can execute the plurality of instructions to implement:
  • acquiring sample data of each category; classifying the sample data of each category to obtain a plurality of subsets of each category; calculating the relevance of each subset of each category to the category to which it belongs; sorting the plurality of subsets of each category from high to low relevance to obtain a plurality of sorted subsets of each category; and sequentially reading the subsets with the same sorting position from the plurality of sorted subsets of each category as training samples of the model, and training the model.
  • the higher the ranking position of a subset, the larger its corresponding weight.
  • the above-described features of the present application can be implemented by an integrated circuit that controls the electronic device to implement the sample data classification method of any of the above embodiments. That is, the integrated circuit of the present application is installed in the electronic device so that the electronic device performs the following functions: calculating the features of each sample in the sample data; calculating a distance set of each sample according to those features, the distance set of each sample including the distance between that sample and each of the remaining samples; calculating the density value and the density distance value of each sample according to its distance set; determining at least one cluster center according to the density value and density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
  • the functions that can be implemented by the sample data classification method of any embodiment can be installed in the electronic device through the integrated circuit of the present application, so that the electronic device performs the functions implemented by the sample data classification method of any embodiment, which are not detailed here.
  • similarly, the above-described features of the present application can be implemented by an integrated circuit that controls the electronic device to implement the model training method of any of the above embodiments. That is, the integrated circuit of the present application is installed in the electronic device so that the electronic device performs the following functions: acquiring sample data of each category; classifying the sample data of each category to obtain a plurality of subsets of each category; calculating the relevance of each subset of each category to the category to which it belongs; sorting the plurality of subsets of each category from high to low relevance to obtain a plurality of sorted subsets of each category; and sequentially reading the subsets with the same sorting position from the plurality of sorted subsets of each category as training samples of the model, and training the model.
  • the functions that can be implemented by the model training method of any embodiment can be installed in the electronic device through the integrated circuit of the present application, so that the electronic device performs the functions implemented by the model training method of any embodiment, which are not detailed here.
  • the disclosed apparatus may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • in actual implementation there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a non-volatile readable storage medium.
  • a computer device which may be a personal computer, server or network device, etc.
  • the foregoing storage medium includes: a U disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a sample data classification method. The method comprises: calculating features of each sample in sample data; according to the features of each sample, calculating a distance set for that sample, the distance set for that sample comprising a distance between that sample and each of the remaining samples corresponding to that sample; according to the distance set of each sample, calculating a density value of that sample and calculating a density distance value of that sample; determining at least one cluster center according to the density value of each sample and the density distance value of that sample; and clustering the sample data into a plurality of subsets according to the cluster center(s) and the features of each sample. The present invention also relates to a model training method using the sample data classification method, and to an electronic device. In the present invention, training is performed in a sequence of task difficulty from easy to difficult, to avoid eliminating samples that are difficult to train, which improves the adaptability of model parameters.
PCT/CN2018/100157 2018-04-18 2018-08-13 Sample data classification method, model training method, electronic device and storage medium WO2019200782A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810350730.2A CN108595585B (zh) 2018-04-18 Sample data classification method, model training method, electronic device and storage medium
CN201810350730.2 2018-04-18

Publications (1)

Publication Number Publication Date
WO2019200782A1 true WO2019200782A1 (fr) 2019-10-24

Family

ID=63613517

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100157 WO2019200782A1 (fr) 2018-08-13 Sample data classification method, model training method, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN108595585B (fr)
WO (1) WO2019200782A1 (fr)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109378003B (zh) * 2018-11-02 2021-10-01 科大讯飞股份有限公司 一种声纹模型训练的方法和系统
CN109299279B (zh) * 2018-11-29 2020-08-21 奇安信科技集团股份有限公司 一种数据处理方法、设备、系统和介质
CN109508750A (zh) * 2018-12-03 2019-03-22 斑马网络技术有限公司 用户起讫点聚类分析方法、装置及存储介质
CN109671007A (zh) * 2018-12-27 2019-04-23 沈阳航空航天大学 一种基于多维度数据的火车站附近打车难易度评估方法
CN109856307B (zh) * 2019-03-27 2021-04-16 大连理工大学 一种代谢组分子变量综合筛选技术
CN109993234B (zh) * 2019-04-10 2021-05-28 百度在线网络技术(北京)有限公司 一种无人驾驶训练数据分类方法、装置及电子设备
CN110141226B (zh) * 2019-05-29 2022-03-15 清华大学深圳研究生院 自动睡眠分期方法、装置、计算机设备及计算机存储介质
CN111079830A (zh) * 2019-12-12 2020-04-28 北京金山云网络技术有限公司 目标任务模型的训练方法、装置和服务器
CN111414952B (zh) * 2020-03-17 2023-10-17 腾讯科技(深圳)有限公司 行人重识别的噪声样本识别方法、装置、设备和存储介质
CN112131362B (zh) * 2020-09-22 2023-12-12 腾讯科技(深圳)有限公司 对话语句生成方法和装置、存储介质及电子设备
CN112132239B (zh) * 2020-11-24 2021-03-16 北京远鉴信息技术有限公司 一种训练方法、装置、设备和存储介质
CN112884040B (zh) * 2021-02-19 2024-04-30 北京小米松果电子有限公司 训练样本数据的优化方法、系统、存储介质及电子设备
CN113035347A (zh) * 2021-03-15 2021-06-25 武汉中旗生物医疗电子有限公司 一种心电数据病症识别方法、设备及存储介质
CN112990337B (zh) * 2021-03-31 2022-11-29 电子科技大学中山学院 一种面向目标识别的多阶段训练方法
CN113837000A (zh) * 2021-08-16 2021-12-24 天津大学 一种基于任务排序元学习的小样本故障诊断方法
CN115979891B (zh) * 2023-03-16 2023-06-23 中建路桥集团有限公司 高压液气混合流体喷射破碎及固化粘性土的检测方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140270495A1 (en) * 2013-03-14 2014-09-18 Microsoft Corporation Multiple Cluster Instance Learning for Image Classification
CN105653598A * 2015-12-22 2016-06-08 北京奇虎科技有限公司 Method and device for determining related news
CN106874923A * 2015-12-14 2017-06-20 阿里巴巴集团控股有限公司 Method and device for determining the style classification of a commodity

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106447676B * 2016-10-12 2019-01-22 浙江工业大学 Image segmentation method based on a fast density clustering algorithm
CN107180075A * 2017-04-17 2017-09-19 浙江工商大学 Automatic label generation method integrating text classification with hierarchical clustering analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140270495A1 (en) * 2013-03-14 2014-09-18 Microsoft Corporation Multiple Cluster Instance Learning for Image Classification
CN106874923A * 2015-12-14 2017-06-20 阿里巴巴集团控股有限公司 Method and device for determining the style classification of a commodity
CN105653598A * 2015-12-22 2016-06-08 北京奇虎科技有限公司 Method and device for determining related news

Also Published As

Publication number Publication date
CN108595585A (zh) 2018-09-28
CN108595585B (zh) 2019-11-12

Similar Documents

Publication Publication Date Title
WO2019200782A1 (fr) Procédé de classification de données d'échantillon, procédé d'entraînement de modèle, dispositif électronique et support de stockage
US10438091B2 (en) Method and apparatus for recognizing image content
WO2019200781A1 (fr) Procédé et dispositif de reconnaissance de reçu, et support de stockage
WO2019119505A1 (fr) Procédé et dispositif de reconnaissance faciale, dispositif informatique et support d'enregistrement
US11238310B2 (en) Training data acquisition method and device, server and storage medium
JP6144839B2 (ja) 画像を検索するための方法およびシステム
US10013637B2 (en) Optimizing multi-class image classification using patch features
CN109817339B (zh) 基于大数据的患者分组方法和装置
CN107209861A (zh) 使用否定数据优化多类别多媒体数据分类
WO2020114108A1 (fr) Procédé et dispositif d'interprétation de résultats de regroupement
CN111291817A (zh) 图像识别方法、装置、电子设备和计算机可读介质
US10423817B2 (en) Latent fingerprint ridge flow map improvement
CN111401339A (zh) 识别人脸图像中的人的年龄的方法、装置及电子设备
WO2022088411A1 (fr) Procédé et appareil de détection d'image, procédé et appareil d'entraînement de modèle associé, ainsi que dispositif, support et programme
CN118094118B (zh) 数据集质量评估方法、系统、电子设备及存储介质
JP7341962B2 (ja) 学習データ収集装置、学習装置、学習データ収集方法およびプログラム
CN111401343B (zh) 识别图像中人的属性的方法、识别模型的训练方法和装置
Shoohi et al. DCGAN for Handling Imbalanced Malaria Dataset based on Over-Sampling Technique and using CNN.
Liong et al. Automatic traditional Chinese painting classification: A benchmarking analysis
CN113656660A (zh) 跨模态数据的匹配方法、装置、设备及介质
CN110414562B (zh) X光片的分类方法、装置、终端及存储介质
CN107944363A (zh) 人脸图像处理方法、系统及服务器
CN112632000A (zh) 日志文件聚类方法、装置、电子设备和可读存储介质
CN111382712A (zh) 一种手掌图像识别方法、系统及设备
JP2020140488A (ja) 情報処理装置、情報処理方法及びプログラム

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18915611

Country of ref document: EP

Kind code of ref document: A1