CN112465020B - Training data set generation method and device, electronic equipment and storage medium


Info

Publication number
CN112465020B
CN112465020B (application CN202011351822.6A)
Authority
CN
China
Prior art keywords
data set
feature vector
feature
clustering
cluster
Prior art date
Legal status
Active
Application number
CN202011351822.6A
Other languages
Chinese (zh)
Other versions
CN112465020A
Inventor
张发恩
纪双西
Current Assignee
Ainnovation Hefei Technology Co ltd
Original Assignee
Ainnovation Hefei Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Ainnovation Hefei Technology Co., Ltd.
Priority to CN202011351822.6A
Publication of CN112465020A
Application granted
Publication of CN112465020B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training data set generation method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a classified source data set and an unclassified target data set; extracting a first feature vector set of the source data set and a second feature vector set of the target data set through a feature extractor; determining a class-center feature vector corresponding to the source data set according to the first feature vector set, and determining cluster labels of the target data set and the average feature vector within each cluster according to the second feature vector set; iteratively optimizing the feature extractor to minimize the overall difference between the feature vectors of samples in the source data set and the class-center feature vector, and between the feature vectors of elements within a cluster and the average feature vector within that cluster; and obtaining a training data set according to the cluster labels of the target data set and the elements within the clusters. The method reduces the workload and cost of manual labeling and improves labeling precision.

Description

Training data set generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for generating a training data set, an electronic device, and a computer-readable storage medium.
Background
Commodity classification and identification in retail scenes frequently faces problems such as packaging differences across product lines, fast iteration of product packaging, variation in picture characteristics during image acquisition, an enormous number of product categories, and redundancy in some category data. When a new project is started, it is therefore difficult to prepare classification-model training data from a small amount of data with a quick, simple algorithm, and the initial training set has to be formed by manually labeling a large amount of data. How to pre-partition massive unlabeled data, improve local sampling quality, reduce the amount of manually labeled initial data, and quickly form an initial training set that improves the efficiency of subsequent data collection is thus an important and urgent problem in current work.
At present, image classification data are prepared mainly by manually labeling the full set of collected pictures. The amount of data to be processed at one time can be huge, and fully manual labeling typically leads to low labeling precision and high labeling cost, which in turn hampers the iterative optimization of subsequent models.
Disclosure of Invention
The embodiment of the application provides a method for generating a training data set, which automates classification labeling, reduces manual labeling cost, and improves labeling precision.
The embodiment of the application provides a method for generating a training data set, which comprises the following steps:
acquiring a classified source data set and an unclassified target data set;
extracting, by a feature extractor, a first set of feature vectors of the source data set and a second set of feature vectors of the target data set;
determining a class-center feature vector corresponding to the source data set according to the first feature vector set, and determining cluster labels of the target data set and an average feature vector within each cluster according to the second feature vector set;
by iteratively optimizing the feature extractor, minimizing the overall differences between the feature vectors of samples in the source data set and the class-center feature vector, and between the feature vectors of elements within a cluster and the average feature vector within the cluster;
and obtaining a training data set according to the cluster labels of the target data set and the elements within the clusters.
In an embodiment, the extracting, by a feature extractor, a first set of feature vectors of the source data set and a second set of feature vectors of the target data set includes:
extracting a feature vector of each sample in the source data set through a feature extractor to obtain the first feature vector set;
and extracting the feature vector of each element in the target data set through a feature extractor to obtain the second feature vector set.
In an embodiment, the determining, according to the first feature vector set, a class-center feature vector corresponding to the source data set includes:
calculating the mean of the feature vectors of a plurality of samples in the first feature vector set to obtain the class-center feature vector corresponding to the source data set.
In an embodiment, the determining the cluster label and the average feature vector within the cluster of the target data set according to the second feature vector set includes:
clustering and dividing the second feature vector set by using a clustering algorithm to obtain a cluster;
determining a cluster label of each cluster and the average feature vector within the cluster according to the feature vector of each element in the cluster.
in an embodiment, after performing cluster partitioning on the second feature vector set, the method further includes:
obtaining an unclustered isolated point set;
and generating label information of the isolated point set and the isolated point set characteristic vector according to the characteristic vector of each element in the isolated point set.
In one embodiment, the optimizing the feature extractor to minimize overall differences between the feature vectors of the samples in the source data set and the class-center feature vectors, and between the feature vectors of the elements within the clusters and the average feature vector within the clusters includes:
by iteratively optimizing the feature extractor, minimizing the overall difference between the feature vector of each element in the isolated point set and the isolated-point-set feature vector, between the feature vectors of samples in the source data set and the class-center feature vector, and between the feature vectors of elements within a cluster and the average feature vector within the cluster.
In an embodiment, the obtaining a training data set according to the cluster label and the elements in the cluster of the target data set includes:
obtaining a plurality of first elements from the cluster according to a first sampling proportion, obtaining a plurality of second elements from the isolated point set according to a second sampling proportion, and forming the training data set by the plurality of first elements and the plurality of second elements;
and if the number of any class of elements in the training data set is less than a threshold value, retrieving class samples similar to the class of elements from the corresponding cluster, and expanding the training data set.
An embodiment of the present application further provides a device for generating a training data set, where the device includes:
the data set acquisition module is used for acquiring the classified source data set and the unclassified target data set;
the characteristic extraction module is used for extracting a first characteristic vector set of the source data set and a second characteristic vector set of the target data set through a characteristic extractor;
the characteristic clustering module is used for determining a center-like characteristic vector corresponding to the source data set according to the first characteristic vector set, and determining a clustering label and an average characteristic vector in a clustering cluster of the target data set according to the second characteristic vector set;
a model optimization module for minimizing overall differences between the feature vectors of the samples in the source data set and the center-like feature vectors, and between the feature vectors of the elements in the clusters and the average feature vector in the clusters by iteratively optimizing the feature extractor;
and the training set obtaining module is used for obtaining a training data set according to the clustering label and the clustering elements of the target data set.
An embodiment of the present application further provides an electronic device, where the electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above-described training data set generation method.
An embodiment of the present application further provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is executable by a processor to perform the method for generating the training data set.
According to the technical scheme provided by the embodiment of the application, the classification characteristics of the classified source data set and the data characteristics of the target data set are fully utilized, and an efficient, task-specific feature extractor is obtained through training, so that the target data set with unknown class attributes can be clustered and partitioned. This reduces the workload of manual labeling, improves labeling precision, builds understanding of unknown data more quickly, and accelerates the iterative optimization of subsequent classification models.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for generating a training data set according to an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a method for generating a training data set according to another embodiment of the present application;
fig. 4 is a block diagram of an apparatus for generating a training data set according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and cannot be construed as indicating or implying relative importance.
Fig. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device 100 may be configured to perform the method for generating a training data set provided in the embodiment of the present application. As shown in fig. 1, the electronic device 100 includes: one or more processors 102, and one or more memories 104 storing instructions executable by the processors 102. Wherein the processor 102 is configured to execute a method for generating a training data set provided in the following embodiments of the present application.
The processor 102 may be a gateway, or may be an intelligent terminal, or may be a device including a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or other form of processing unit having data processing capability and/or instruction execution capability, and may process data of other components in the electronic device 100, and may control other components in the electronic device 100 to perform desired functions.
The memory 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the training data set generation method described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
In one embodiment, the electronic device 100 shown in FIG. 1 may also include an input device 106, an output device 108, and a data acquisition device 110, which may be interconnected via a bus system 112 and/or other type of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device 100 may have other components and structures as desired.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like. The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like. The data acquisition device 110 may acquire an image of a subject and store the acquired image in the memory 104 for use by other components. Illustratively, the data acquisition device 110 may be a camera.
In an embodiment, the devices in the example electronic device 100 for implementing the training data set generation method of the embodiment of the present application may be integrally disposed or may be separately disposed, such as integrally disposing the processor 102, the memory 104, the input device 106 and the output device 108, and separately disposing the data acquisition device 110.
In an embodiment, the example electronic device 100 for implementing the method for generating a training data set of the embodiment of the present application may be implemented as an intelligent terminal such as a smartphone, a tablet computer, a desktop computer, a server, and the like.
Fig. 2 is a flowchart illustrating a method for generating a training data set according to an embodiment of the present application. As shown in fig. 2, the method includes: step S210-step S250.
Step S210: a classified source data set and an unclassified target data set are obtained.
Here the source data set includes a number of classified images; the labels of the first type of image may be denoted by "1", those of the second type by "2", those of the third type by "3", and so on.
The target data set contains a large number of unclassified images whose labels are unknown. To distinguish the two, the images in the source data set are referred to as samples and the images in the target data set as elements. In an embodiment, the target data set and the source data set may be image sets from a retail scene: existing product images are classified according to a retail product classification criterion, and the classes remaining after removing visually and semantically ambiguous ones form the source data set (Xo, Yo), where Xo are the sample images and Yo the sample labels. The source data set defines the base criteria for the feature training of the classification model. The target data set Xt is obtained by running a generic detection model over captured real-scene images and cropping the unclassified product images along the detection boxes.
Step S220: a first set of feature vectors of the source data set and a second set of feature vectors of the target data set are extracted by a feature extractor.
The feature extractor is used for extracting feature vectors of samples in the source data set and feature vectors of elements in the target data set. In an embodiment, a feature extractor network may be built on a basic deep-model framework such as ResNet or Inception, and model parameters pre-trained on ImageNet are imported to obtain an initial feature extractor. The initial feature extractor then extracts a first feature vector set Fs of the source data set and a second feature vector set Ft of the target data set. The first feature vector set Fs is the set of feature vectors of all samples in the source data set; the second feature vector set Ft is the set of feature vectors of all elements in the target data set.
In an embodiment, a feature vector of each sample in the source data set may be extracted by a feature extractor, so as to obtain the first feature vector set; and extracting the feature vector of each element in the target data set through a feature extractor to obtain the second feature vector set. The feature vector refers to a feature representing an image in the form of a vector.
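For illustration, the following sketch shows one way such an initial feature extractor could be assembled; PyTorch, the ResNet-50 backbone, and the helper name extract_features are assumptions of this example rather than part of the disclosure:

    import torch
    import torchvision.models as models

    # Illustrative sketch: an ImageNet-pretrained ResNet backbone with the
    # classification head removed (requires torchvision >= 0.13).
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])

    def extract_features(images: torch.Tensor) -> torch.Tensor:
        """Map a batch of images (N, 3, H, W) to feature vectors (N, 2048)."""
        feature_extractor.eval()
        with torch.no_grad():
            return feature_extractor(images).flatten(1)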
Step S230: determine a class-center feature vector corresponding to the source data set according to the first feature vector set, and determine cluster labels of the target data set and the average feature vector within each cluster according to the second feature vector set.
The class-center feature vector Ws refers to the center of the feature vectors in the first feature vector set. In an embodiment, the mean of the feature vectors of a plurality of samples in the first feature vector set may be calculated and used as the class-center feature vector corresponding to the source data set. The feature vectors of the plurality of samples may be all of the feature vectors in the first feature vector set, or only some of them.
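As a minimal sketch of this step (assuming NumPy arrays; the per-class grouping reflects the "class center" reading of the text, and the function name is illustrative):

    import numpy as np

    def class_center_vectors(Fs: np.ndarray, Yo: np.ndarray) -> dict:
        """One class-center feature vector per source class: the mean of that
        class's sample feature vectors in the first feature vector set Fs."""
        return {int(c): Fs[Yo == c].mean(axis=0) for c in np.unique(Yo)}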
The cluster labels characterize the category to which an element in the target data set belongs. For example, 1, 2, 3, … may represent different categories. Feature vectors that are close to one another form a cluster, and the average feature vector Ct within a cluster may be the mean of all the feature vectors in that cluster.
In an embodiment, a clustering algorithm may be used to partition the second feature vector set Ft into clusters, and the cluster label of each cluster and the average feature vector within the cluster are determined according to the feature vectors of its elements. There may be multiple clusters. In one embodiment, the number of cluster categories (i.e., the number of clusters) is preset to m, together with an average silhouette coefficient threshold thr1 and a sample-count threshold thr2. The second feature vector set Ft is partitioned with an existing clustering algorithm (such as K-means, density-based clustering, or hierarchical clustering), the silhouette coefficient of each element is calculated, and the average silhouette coefficient Sm_i within each cluster and the number of samples n_i (i = 1, …, m) in the cluster are computed. If Sm_i < thr1 or n_i < thr2, the labels of that cluster are changed to label = -1 (i.e., its elements are defined as unclustered isolated points); elements of the other clusters keep their original labels (label >= 0). After cluster partitioning, besides the clusters, a set of unclustered isolated points (the isolated point set for short) is therefore obtained. The cluster labels may be denoted 0, 1, 2, 3, …, and the label information of the unclustered isolated point set may be denoted "-1". In an embodiment, an isolated-point-set feature vector Ot may be computed from the feature vector of each element in the isolated point set; it may be the set of feature vectors labeled "-1" in the second feature vector set Ft.
The silhouette coefficient s of an element is

s = (b - a) / max(a, b)

where a is the average distance from the element to the other elements of its own cluster, and b is the average distance from the element to the elements of the nearest other cluster. The value of s lies in [-1, 1]; values close to 1 indicate better clustering performance, and values close to -1 indicate worse clustering performance.
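A minimal sketch of this clustering-and-filtering step, assuming scikit-learn's KMeans and silhouette_samples (the function name and return convention are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples

    def cluster_target_features(Ft: np.ndarray, m: int, thr1: float, thr2: int):
        """Cluster Ft into m clusters; demote weak or small clusters to outliers."""
        labels = KMeans(n_clusters=m, n_init=10).fit_predict(Ft)
        s = silhouette_samples(Ft, labels)                  # silhouette coefficient per element
        for i in range(m):
            mask = labels == i
            if s[mask].mean() < thr1 or mask.sum() < thr2:  # Sm_i < thr1 or n_i < thr2
                labels[mask] = -1                           # unclustered isolated points
        Ct = {i: Ft[labels == i].mean(axis=0) for i in set(labels.tolist()) if i >= 0}
        Ot = Ft[labels == -1]                               # isolated-point-set feature vectors
        return labels, Ct, Ot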
Step S240: iteratively optimize the feature extractor so that the overall difference between the feature vectors of the source data set samples and the class-center feature vector, and between the feature vectors of elements within a cluster and the average feature vector within the cluster, is minimized.
In one embodiment, the overall difference may be the sum of the difference x between the feature vectors of samples in the source data set and the class-center feature vector and the difference y between the feature vectors of elements within a cluster and the average feature vector within the cluster (i.e., x + y). Both differences may be expressed as Euclidean distances.
In one embodiment, the feature extractor may be iteratively optimized to minimize the overall difference between the feature vector of each element in the isolated point set and the isolated-point-set feature vector, between the feature vectors of samples in the source data set and the class-center feature vector, and between the feature vectors of elements within a cluster and the average feature vector within the cluster.
For example, the parameters of the feature extractor may be optimized using the already obtained feature vector sets (Fs, Ft), the corresponding class prototype feature vectors (Ws, Ct, Ot), and an objective loss function similar to the following, to obtain a new feature extractor:
L = Σ_{x ∈ Xs} ||f(x) - Ws||^2 + Σ_{x ∈ Xt, label ≥ 0} ||f(x) - Ct||^2 + Σ_{x ∈ Xt, label = -1} ||f(x) - Ot||^2, where f(·) denotes the current feature extractor.
After one pass of data training, the updated model parameters are used to re-extract the data feature sets (Fs', Ft'), the class prototype features (Ws, Ct, Ot) are iteratively updated, and the value L of the loss function is computed, until L stops decreasing or the preset maximum number of iterations is reached. In one embodiment, the loss function may also be cross-entropy, triplet loss, or another loss function. By continuously optimizing the feature extractor, an accurate second feature vector set of the target data set is obtained, from which the clustering algorithm yields accurate clusters, cluster labels, the isolated point set, and its label information.
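Because the text only says the loss is "similar to" the function shown, the sketch below assumes plain squared Euclidean distances to the three prototype sets; the variable names follow the notation above, but the exact loss form and the use of PyTorch are assumptions:

    import torch

    def overall_difference(fs, ws, ftc, ct, fto, ot):
        """Sum of mean squared Euclidean distances between features and their
        prototypes: source samples vs. class centers (Ws), clustered target
        elements vs. in-cluster means (Ct), and outliers vs. the isolated-point
        vector (Ot); each prototype tensor holds one row per feature row."""
        return (((fs - ws) ** 2).sum(dim=1).mean()
                + ((ftc - ct) ** 2).sum(dim=1).mean()
                + ((fto - ot) ** 2).sum(dim=1).mean())

In the iterative loop, this loss would be minimised with a standard optimiser (e.g. torch.optim.Adam over the extractor's parameters), after which the features and prototypes are recomputed with the updated extractor until L plateaus or the iteration limit is reached.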
Step S250: and obtaining a training data set according to the clustering label of the target data set and the elements in the clustering.
The training data set may include a number of sample images, which may be elements (i.e., images) within a cluster, and sample labels, which may be cluster labels of the cluster.
In an embodiment, a plurality of first elements may be sampled from within the clusters at a first sampling rate (e.g., 10%), and a plurality of second elements from the isolated point set at a second sampling rate (e.g., 20%-30%); together, the first and second elements form the training data set. Since the target data set is divided into multiple clusters, a corresponding proportion of images is drawn from each cluster at the first sampling rate, and these images, together with the proportion drawn from the isolated point set, jointly form the training data set. In one embodiment, the sampled images can be stored according to their original classification, and the image closest to the average feature is provided as a sample illustration for each class.
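A sketch of this sampling step under the stated assumptions (the example rates above; the fixed 25% outlier rate and the function name are illustrative):

    import numpy as np

    def sample_initial_training_set(labels: np.ndarray,
                                    rate_cluster: float = 0.10,
                                    rate_outlier: float = 0.25,
                                    seed: int = 0) -> np.ndarray:
        """Return indices into the target data set: a proportion of each
        cluster plus a larger proportion of the isolated point set (label -1)."""
        rng = np.random.default_rng(seed)
        picked = []
        for lab in np.unique(labels):
            idx = np.flatnonzero(labels == lab)
            rate = rate_outlier if lab == -1 else rate_cluster
            k = max(1, round(rate * len(idx)))
            picked.extend(rng.choice(idx, size=k, replace=False).tolist())
        return np.asarray(picked)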
In an embodiment, the training data set may be manually checked, screened, and classified according to the project definition requirements, and a preliminary classification data set to be recognized and a negative sample data set are obtained for training the classification model.
In an embodiment, if the number of any class elements in the training data set is less than a threshold, class samples similar to the class elements are retrieved from the corresponding cluster clusters, and the training data set is expanded.
The training data set may contain multiple classes of elements (i.e., images). If the number of images of a certain class is less than a threshold, the accuracy of the trained model may be affected; the goal is a training data set that satisfies the desired data distribution and quantity. The feature extractor optimized above may be used to extract the class-average feature of each class of elements in the training data set (i.e., the mean of the feature vectors of the elements in that class), and elements (i.e., class samples) similar to the class-average feature are then retrieved from the corresponding cluster until the class has enough elements. Manual checking, screening, and cleaning can be performed again as needed to obtain a better training data set.
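A sketch of this retrieval-based expansion, assuming Euclidean distance to the class-average feature (the function name and array layout are illustrative):

    import numpy as np

    def expand_small_class(class_feats: np.ndarray,
                           cluster_feats: np.ndarray,
                           cluster_indices: np.ndarray,
                           needed: int) -> np.ndarray:
        """Retrieve the `needed` cluster elements closest to the class-average
        feature, topping up a class whose count fell below the threshold."""
        class_center = class_feats.mean(axis=0)
        dists = np.linalg.norm(cluster_feats - class_center, axis=1)
        return cluster_indices[np.argsort(dists)[:needed]]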
According to the technical scheme provided by the embodiment of the application, the classification characteristics of the source data set and the data characteristics of the target data set are fully utilized to train a more efficient, task-specific feature extractor, and the target data set with unknown class attributes is cluster-partitioned and sampled. The whole process improves the diversity and distribution consistency of sampling, greatly compresses possible data redundancy, reduces the workload of manual labeling, builds understanding of unknown data faster, and accelerates the iterative optimization of subsequent classification models.
Fig. 3 is a detailed flowchart illustrating a method for generating a training data set according to another embodiment of the present application, and as shown in fig. 3, the method includes the following steps:
step 1, acquiring a classified source data set;
step 2, acquiring an unclassified target data set;
step 3, building a feature extractor network on a basic deep-model framework such as ResNet, importing model parameters pre-trained on ImageNet, and obtaining an initial feature extractor;
step 3 (continued): the source data set and the target data set are sent into the feature extractor together, and a new feature extractor and the target-data clustering result are obtained; the specific process is as follows:
3.1, acquiring the feature vector set Fs of the source data set with the current feature extractor, and computing the class-center feature vector Ws;
3.2, acquiring the feature vector set Ft of the target data set, and obtaining the cluster labels and the in-cluster average feature vectors Ct using a clustering algorithm;
A. presetting the number of cluster categories m, an average silhouette coefficient threshold thr1, and a sample-count threshold thr2;
B. clustering and partitioning Ft with an existing clustering algorithm (such as K-means), calculating the silhouette coefficient of each element, and computing the average silhouette coefficient Sm_i within each cluster and the number of samples n_i in the cluster (i = 1, …, m); if Sm_i < thr1 or n_i < thr2, the cluster's labels are changed to label = -1 (i.e., defined as unclustered isolated points), while elements of the other clusters keep their original labels (label >= 0);
C. according to the cluster labels, the average feature vector Ct is computed within each cluster whose label satisfies label >= 0; for the unclustered elements (label = -1), the isolated-point-set feature vector is taken as Ot = Ft(label == -1);
3.3, optimizing the model parameters using the obtained feature vector sets (Fs, Ft), the corresponding class prototype feature vectors (Ws, Ct, Ot), and a target loss function similar to the following, to obtain a new feature extractor;
L = Σ_{x ∈ Xs} ||f(x) - Ws||^2 + Σ_{x ∈ Xt, label ≥ 0} ||f(x) - Ct||^2 + Σ_{x ∈ Xt, label = -1} ||f(x) - Ot||^2
after one pass of data training, the updated model parameters are used to re-acquire the data feature sets (Fs', Ft') and iteratively update the class prototype features (Ws, Ct, Ot); the process then returns to step 3.1 for cyclic iterative training until the L value stops decreasing or the preset maximum number of iterations is reached;
step 4, sampling the clusters and the isolated point set of the target data set obtained above: typically, sampling within each cluster at a certain rate (e.g., 10%) and at an increased rate for the isolated point set (e.g., 20%-30%); the sampled images are then stored according to their original classification, and the picture closest to the average feature is given as a sample illustration;
step 5, manually checking, screening and classifying the sampled data according to project definition requirements to form a preliminary classified data set to be identified and a negative sample data set;
step 6, expanding the data set by retrieval, according to the sample data volume of each category in the classified data set of step 5: when the data volume of a certain class of samples is smaller than a threshold, the feature extractor trained in step 3 and all samples of that class are used to extract the class-average feature, which is then searched against the corresponding cluster until the class has a sufficient amount of sample data; finally, a classification training data set satisfying the desired data distribution and quantity is obtained through manual checking, screening, and cleaning.
The following are embodiments of the apparatus of the present application, which can be used to perform embodiments of the method for generating the training data set of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the training data set generation method of the present application.
Fig. 4 is a block diagram of a training data set generation apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes: a data set acquisition module 410, a feature extraction module 420, a feature clustering module 430, a model optimization module 440, and a training set acquisition module 450.
A data set obtaining module 410, configured to obtain a classified source data set and an unclassified target data set;
a feature extraction module 420, configured to extract, by a feature extractor, a first feature vector set of the source data set and a second feature vector set of the target data set;
a feature clustering module 430, configured to determine a class-center feature vector corresponding to the source data set according to the first feature vector set, and determine cluster labels of the target data set and the average feature vector within each cluster according to the second feature vector set;
a model optimization module 440, configured to iteratively optimize the feature extractor to minimize the overall differences between the feature vectors of samples in the source data set and the class-center feature vectors, and between the feature vectors of elements within the clusters and the average feature vector within each cluster;
a training set obtaining module 450, configured to obtain a training data set according to the clustering label and the elements in the clustering cluster of the target data set.
The implementation processes of the functions and actions of each module in the apparatus are specifically described in the implementation processes of the corresponding steps in the generation method of the training data set, and are not described herein again.
In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, or portions thereof, may be substantially embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

Claims (6)

1. A method of generating a training data set, the method comprising:
acquiring a classified source data set and an unclassified target data set;
extracting, by a feature extractor, a first set of feature vectors of the source data set and a second set of feature vectors of the target data set;
determining a class-center feature vector corresponding to the source data set according to the first feature vector set, and determining a clustering label and an average feature vector in a clustering cluster of the target data set according to the second feature vector set;
by iteratively optimizing the feature extractor, minimizing overall differences between the feature vector of the sample in the source data set and the class-center feature vector, and between the feature vector of an element within a cluster and an average feature vector within the cluster;
obtaining a training data set according to the clustering label of the target data set and the elements in the clustering;
wherein the determining the clustering label and the average feature vector in the clustering of the target data set according to the second feature vector set comprises:
clustering and dividing the second feature vector set by using a clustering algorithm to obtain a cluster;
determining a clustering label of the clustering cluster and an average characteristic vector in the clustering cluster according to the characteristic vector of each element in the clustering cluster;
after clustering the second feature vector set, the method further includes:
obtaining an unclustered isolated point set;
generating label information of the isolated point set and isolated point set characteristic vectors according to the characteristic vector of each element in the isolated point set;
wherein, the obtaining of the training data set according to the clustering label and the clustering intra-cluster element of the target data set comprises:
obtaining a plurality of first elements from the cluster according to a first sampling proportion, obtaining a plurality of second elements from the isolated point set according to a second sampling proportion, wherein the plurality of first elements and the plurality of second elements form the training data set;
if the number of any type of elements in the training data set is smaller than a threshold value, retrieving a type sample similar to the type of elements from a corresponding clustering cluster, and expanding the training data set;
wherein said iteratively optimizing said feature extractor to minimize overall differences between the feature vectors of the samples in the source data set and the class-center feature vectors, and between the feature vectors of the elements within a cluster and the average feature vector within a cluster, comprises:
and by iteratively optimizing the feature extractor, the overall difference between the feature vector of each element in the isolated point set and the isolated-point-set feature vector, between the feature vector of the sample in the source data set and the class-center feature vector, and between the feature vector of the element in the cluster and the average feature vector in the cluster is minimized.
2. The method of claim 1, wherein the extracting, by a feature extractor, a first set of feature vectors of the source data set and a second set of feature vectors of the target data set comprises:
extracting a feature vector of each sample in the source data set through a feature extractor to obtain the first feature vector set;
and extracting the feature vector of each element in the target data set through a feature extractor to obtain the second feature vector set.
3. The method according to claim 2, wherein the determining the class-center feature vector corresponding to the source data set according to the first feature vector set comprises:
and calculating the mean value of the feature vectors of a plurality of samples in the first feature vector set to obtain the class center feature vector corresponding to the source data set.
4. An apparatus for generating a training data set, the apparatus comprising:
the data set acquisition module is used for acquiring the classified source data set and the unclassified target data set;
a feature extraction module, configured to extract, by a feature extractor, a first feature vector set of the source data set and a second feature vector set of the target data set;
the feature clustering module is used for determining a class-center feature vector corresponding to the source data set according to the first feature vector set, and determining a clustering label and an average feature vector in a clustering cluster of the target data set according to the second feature vector set;
a model optimization module for minimizing overall differences between the feature vectors of the source data set samples and the class center feature vectors, and between the feature vectors of the elements within the clusters and the average feature vector within the clusters by iteratively optimizing the feature extractor;
the training set obtaining module is used for obtaining a training data set according to the clustering label and the clustering elements of the target data set;
wherein the determining the clustering label and the average feature vector in the clustering of the target data set according to the second feature vector set comprises:
clustering and dividing the second feature vector set by using a clustering algorithm to obtain a cluster;
determining a clustering label of the clustering cluster and an average characteristic vector in the clustering cluster according to the characteristic vector of each element in the clustering cluster;
after clustering and dividing the second feature vector set, the method further includes:
obtaining an unclustered isolated point set;
generating label information of the isolated point set and isolated point set characteristic vectors according to the characteristic vector of each element in the isolated point set;
wherein, the obtaining of the training data set according to the clustering label and the clustering intra-cluster element of the target data set comprises:
obtaining a plurality of first elements from the cluster according to a first sampling proportion, obtaining a plurality of second elements from the isolated point set according to a second sampling proportion, wherein the plurality of first elements and the plurality of second elements form the training data set;
if the number of any type of elements in the training data set is smaller than a threshold value, retrieving a type sample similar to the type of elements from a corresponding clustering cluster, and expanding the training data set;
wherein said iteratively optimizing said feature extractor to minimize overall differences between the feature vectors of the samples in the source data set and the class-center feature vectors, and between the feature vectors of the elements within a cluster and the average feature vector within a cluster, comprises:
and by iteratively optimizing the feature extractor, the overall difference between the feature vector of each element in the isolated point set and the isolated-point-set feature vector, between the feature vector of the sample in the source data set and the class-center feature vector, and between the feature vector of the element in the cluster and the average feature vector in the cluster is minimized.
5. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of generating a training data set of any of claims 1-3.
6. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the method of generating a training data set according to any one of claims 1-3.
CN202011351822.6A 2020-11-25 2020-11-25 Training data set generation method and device, electronic equipment and storage medium Active CN112465020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011351822.6A CN112465020B (en) 2020-11-25 2020-11-25 Training data set generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011351822.6A CN112465020B (en) 2020-11-25 2020-11-25 Training data set generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112465020A CN112465020A (en) 2021-03-09
CN112465020B true CN112465020B (en) 2023-04-07

Family

ID=74808783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011351822.6A Active CN112465020B (en) 2020-11-25 2020-11-25 Training data set generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112465020B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239963B (en) * 2021-04-13 2024-03-01 联合汽车电子有限公司 Method, device, equipment, vehicle and storage medium for processing vehicle data
CN113239964B (en) * 2021-04-13 2024-03-01 联合汽车电子有限公司 Method, device, equipment and storage medium for processing vehicle data
CN112990377B (en) * 2021-05-08 2021-08-13 创新奇智(北京)科技有限公司 Visual category discovery method and device, electronic equipment and storage medium
CN113723507A (en) * 2021-08-30 2021-11-30 联仁健康医疗大数据科技股份有限公司 Data classification identification determination method and device, electronic equipment and storage medium
CN114332500A (en) * 2021-09-14 2022-04-12 腾讯科技(深圳)有限公司 Image processing model training method and device, computer equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252627A (en) * 2013-06-28 2014-12-31 广州华多网络科技有限公司 SVM (support vector machine) classifier training sample acquiring method, training method and training system

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130097103A1 (en) * 2011-10-14 2013-04-18 International Business Machines Corporation Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set
US9443314B1 (en) * 2012-03-29 2016-09-13 Google Inc. Hierarchical conditional random field model for labeling and segmenting images
CN103530689B (en) * 2013-10-31 2016-01-20 中国科学院自动化研究所 A kind of clustering method based on degree of depth study
US20170046510A1 (en) * 2015-08-14 2017-02-16 Qualcomm Incorporated Methods and Systems of Building Classifier Models in Computing Devices
CN107067025B (en) * 2017-02-15 2020-12-22 重庆邮电大学 Text data automatic labeling method based on active learning
US11023710B2 (en) * 2019-02-20 2021-06-01 Huawei Technologies Co., Ltd. Semi-supervised hybrid clustering/classification system
CN109961095B (en) * 2019-03-15 2023-04-28 深圳大学 Image labeling system and method based on unsupervised deep learning
CN110472082B (en) * 2019-08-02 2022-04-01 Oppo广东移动通信有限公司 Data processing method, data processing device, storage medium and electronic equipment
CN110570312B (en) * 2019-09-17 2021-05-28 深圳追一科技有限公司 Sample data acquisition method and device, computer equipment and readable storage medium
CN111178380B (en) * 2019-11-15 2023-07-04 腾讯科技(深圳)有限公司 Data classification method and device and electronic equipment
CN111126470B (en) * 2019-12-18 2023-05-02 创新奇智(青岛)科技有限公司 Image data iterative cluster analysis method based on depth measurement learning
CN111539451B (en) * 2020-03-26 2023-08-15 平安科技(深圳)有限公司 Sample data optimization method, device, equipment and storage medium
CN111680753A (en) * 2020-06-10 2020-09-18 创新奇智(上海)科技有限公司 Data labeling method and device, electronic equipment and storage medium
CN111738351B (en) * 2020-06-30 2023-12-19 创新奇智(重庆)科技有限公司 Model training method and device, storage medium and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252627A (en) * 2013-06-28 2014-12-31 广州华多网络科技有限公司 SVM (support vector machine) classifier training sample acquiring method, training method and training system

Also Published As

Publication number Publication date
CN112465020A 2021-03-09

Similar Documents

Publication Publication Date Title
CN112465020B (en) Training data set generation method and device, electronic equipment and storage medium
JP6005837B2 (en) Image analysis apparatus, image analysis system, and image analysis method
CN113255694B (en) Training image feature extraction model and method and device for extracting image features
US8051021B2 (en) System and method for resource adaptive classification of data streams
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN108027814B (en) Stop word recognition method and device
CN110858217A (en) Method and device for detecting microblog sensitive topics and readable storage medium
CN110633594A (en) Target detection method and device
CN110895533B (en) Form mapping method and device, computer equipment and storage medium
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN110532449B (en) Method, device, equipment and storage medium for processing service document
CN110738047A (en) Microblog user interest mining method and system based on image-text data and time effect
CN110209895B (en) Vector retrieval method, device and equipment
JP7330338B2 (en) Human image archiving method, device and storage medium based on artificial intelligence
CN115203408A (en) Intelligent labeling method for multi-modal test data
CN114610953A (en) Data classification method, device, equipment and storage medium
JP2004341948A (en) Concept extraction system, concept extraction method, program therefor, and storing medium thereof
CN113627124A (en) Processing method and device for font migration model and electronic equipment
CN117235137B (en) Professional information query method and device based on vector database
JP4199594B2 (en) OBJECT IDENTIFICATION DEVICE, ITS PROGRAM, AND RECORDING MEDIUM CONTAINING THE PROGRAM
CN113569019B (en) Method, system, equipment and storage medium for knowledge extraction based on chat conversation
CN117688138B (en) Long text similarity comparison method based on paragraph division
CN115114412B (en) Method for retrieving information in document, electronic device and storage medium
CN112990377B (en) Visual category discovery method and device, electronic equipment and storage medium
CN109408706B (en) Image filtering method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant