CN112465020B - Training data set generation method and device, electronic equipment and storage medium


Info

Publication number
CN112465020B
CN112465020B (application CN202011351822.6A)
Authority
CN
China
Prior art keywords
data set
feature vector
feature
clustering
cluster
Prior art date
Legal status
Active
Application number
CN202011351822.6A
Other languages
Chinese (zh)
Other versions
CN112465020A
Inventor
张发恩
纪双西
Current Assignee
Ainnovation Hefei Technology Co ltd
Original Assignee
Ainnovation Hefei Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Ainnovation Hefei Technology Co., Ltd.
Priority to CN202011351822.6A
Publication of CN112465020A
Application granted
Publication of CN112465020B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training data set generation method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a classified source data set and an unclassified target data set; extracting a first feature vector set of the source data set and a second feature vector set of the target data set through a feature extractor; determining a class-center feature vector corresponding to the source data set according to the first feature vector set, and determining cluster labels of the target data set and the average feature vector within each cluster according to the second feature vector set; iteratively optimizing the feature extractor to minimize the overall difference between the feature vectors of samples in the source data set and the class-center feature vector, and between the feature vectors of elements within a cluster and the average feature vector within that cluster; and obtaining a training data set according to the cluster labels of the target data set and the elements within the clusters. The method reduces the workload and cost of manual labeling and improves labeling precision.

Description

Training data set generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for generating a training data set, an electronic device, and a computer-readable storage medium.
Background
Commodity classification and identification in retail scenes frequently faces problems such as packaging differences across product lines, fast iteration of product packaging, variation in picture characteristics during image acquisition, an enormous number of product categories, and redundancy in some category data. When a new project is started, it is therefore difficult to prepare classification-model training data from a small amount of data with a quick, simple algorithm, and the initial training set has to be formed by manually labeling a large amount of data. How to pre-partition massive unlabeled data, improve local sampling quality, reduce the amount of manually labeled initial data, and quickly form an initial training set that improves the efficiency of subsequent data collection is thus an important and urgent problem in current work.
At present, image classification data are prepared mainly by manually labeling the full set of collected pictures. The amount of data to be processed at one time can be huge, and fully manual labeling typically leads to low labeling precision and high labeling cost, which in turn hampers the iterative optimization of subsequent models.
Disclosure of Invention
The embodiment of the application provides a method for generating a training data set, which automates classification labeling, reduces manual labeling cost, and improves labeling precision.
The embodiment of the application provides a method for generating a training data set, which comprises the following steps:
acquiring a classified source data set and an unclassified target data set;
extracting, by a feature extractor, a first set of feature vectors of the source data set and a second set of feature vectors of the target data set;
determining a class-center feature vector corresponding to the source data set according to the first feature vector set, and determining cluster labels of the target data set and an average feature vector within each cluster according to the second feature vector set;
by iteratively optimizing the feature extractor, minimizing the overall differences between the feature vectors of samples in the source data set and the class-center feature vector, and between the feature vectors of elements within a cluster and the average feature vector within the cluster;
and obtaining a training data set according to the cluster labels of the target data set and the elements within the clusters.
In an embodiment, the extracting, by a feature extractor, a first set of feature vectors of the source data set and a second set of feature vectors of the target data set includes:
extracting a feature vector of each sample in the source data set through a feature extractor to obtain the first feature vector set;
and extracting the feature vector of each element in the target data set through a feature extractor to obtain the second feature vector set.
In an embodiment, the determining, according to the first feature vector set, a class-center feature vector corresponding to the source data set includes:
calculating the mean of the feature vectors of a plurality of samples in the first feature vector set to obtain the class-center feature vector corresponding to the source data set.
In an embodiment, the determining the cluster label and the average feature vector within the cluster of the target data set according to the second feature vector set includes:
clustering and dividing the second feature vector set by using a clustering algorithm to obtain a cluster;
determining a cluster label of each cluster and the average feature vector within the cluster according to the feature vector of each element in the cluster.
in an embodiment, after performing cluster partitioning on the second feature vector set, the method further includes:
obtaining an unclustered isolated point set;
and generating label information of the isolated point set and the isolated point set characteristic vector according to the characteristic vector of each element in the isolated point set.
In one embodiment, the optimizing the feature extractor to minimize overall differences between the feature vectors of the samples in the source data set and the class-center feature vectors, and between the feature vectors of the elements within the clusters and the average feature vector within the clusters includes:
by iteratively optimizing the feature extractor, minimizing the overall difference between the feature vector of each element in the isolated point set and the isolated-point-set feature vector, between the feature vectors of samples in the source data set and the class-center feature vector, and between the feature vectors of elements within a cluster and the average feature vector within the cluster.
In an embodiment, the obtaining a training data set according to the cluster label and the elements in the cluster of the target data set includes:
obtaining a plurality of first elements from the cluster according to a first sampling proportion, obtaining a plurality of second elements from the isolated point set according to a second sampling proportion, and forming the training data set by the plurality of first elements and the plurality of second elements;
and if the number of any class of elements in the training data set is less than a threshold value, retrieving class samples similar to the class of elements from the corresponding cluster, and expanding the training data set.
An embodiment of the present application further provides a device for generating a training data set, where the device includes:
the data set acquisition module is used for acquiring the classified source data set and the unclassified target data set;
the characteristic extraction module is used for extracting a first characteristic vector set of the source data set and a second characteristic vector set of the target data set through a characteristic extractor;
the characteristic clustering module is used for determining a center-like characteristic vector corresponding to the source data set according to the first characteristic vector set, and determining a clustering label and an average characteristic vector in a clustering cluster of the target data set according to the second characteristic vector set;
a model optimization module for minimizing overall differences between the feature vectors of the samples in the source data set and the center-like feature vectors, and between the feature vectors of the elements in the clusters and the average feature vector in the clusters by iteratively optimizing the feature extractor;
and the training set obtaining module is used for obtaining a training data set according to the clustering label and the clustering elements of the target data set.
An embodiment of the present application further provides an electronic device, where the electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above-described training data set generation method.
An embodiment of the present application further provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is executable by a processor to perform the method for generating the training data set.
According to the technical scheme provided by the embodiment of the application, the classification characteristics of the classified source data set and the data characteristics of the target data set are fully utilized, and an efficient, task-specific feature extractor is obtained through training, so that the target data set with unknown class attributes can be clustered and partitioned. This reduces the workload of manual labeling, improves labeling precision, builds understanding of unknown data more quickly, and accelerates the iterative optimization of subsequent classification models.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for generating a training data set according to an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a method for generating a training data set according to another embodiment of the present application;
fig. 4 is a block diagram of an apparatus for generating a training data set according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and cannot be construed as indicating or implying relative importance.
Fig. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device 100 may be configured to perform the method for generating a training data set provided in the embodiment of the present application. As shown in fig. 1, the electronic device 100 includes: one or more processors 102, and one or more memories 104 storing instructions executable by the processors 102. Wherein the processor 102 is configured to execute a method for generating a training data set provided in the following embodiments of the present application.
The processor 102 may be a gateway, or may be an intelligent terminal, or may be a device including a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or other form of processing unit having data processing capability and/or instruction execution capability, and may process data of other components in the electronic device 100, and may control other components in the electronic device 100 to perform desired functions.
The memory 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the training data set generation method described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
In one embodiment, the electronic device 100 shown in FIG. 1 may also include an input device 106, an output device 108, and a data acquisition device 110, which may be interconnected via a bus system 112 and/or other type of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device 100 may have other components and structures as desired.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like. The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like. The data acquisition device 110 may acquire an image of a subject and store the acquired image in the memory 104 for use by other components. Illustratively, the data acquisition device 110 may be a camera.
In an embodiment, the devices in the example electronic device 100 for implementing the training data set generation method of the embodiment of the present application may be integrally disposed or may be separately disposed, such as integrally disposing the processor 102, the memory 104, the input device 106 and the output device 108, and separately disposing the data acquisition device 110.
In an embodiment, the example electronic device 100 for implementing the method for generating a training data set of the embodiment of the present application may be implemented as an intelligent terminal such as a smartphone, a tablet computer, a desktop computer, a server, and the like.
Fig. 2 is a flowchart illustrating a method for generating a training data set according to an embodiment of the present application. As shown in fig. 2, the method includes: step S210-step S250.
Step S210: a classified source data set and an unclassified target data set are obtained.
Here the source data set includes a number of classified images; the labels of the first type of image may be denoted by "1", those of the second type by "2", those of the third type by "3", and so on.
The target data set contains a large number of unclassified images whose labels are unknown. To distinguish the two, the images in the source data set are referred to as samples and the images in the target data set as elements. In an embodiment, the target data set and the source data set may be image sets from a retail scene: existing product images are classified according to a retail product classification criterion, and the classes remaining after removing visually and semantically ambiguous ones form the source data set (Xo, Yo), where Xo are the sample images and Yo the sample labels. The source data set defines the base criteria for the feature training of the classification model. The target data set Xt is obtained by running a generic detection model over captured real-scene images and cropping the unclassified product images along the detection boxes.
Step S220: a first set of feature vectors of the source data set and a second set of feature vectors of the target data set are extracted by a feature extractor.
The feature extractor is used for extracting feature vectors of samples in the source data set and feature vectors of elements in the target data set. In an embodiment, a feature extractor network may be built on a basic deep-model framework such as ResNet or Inception, and model parameters pre-trained on ImageNet are imported to obtain an initial feature extractor. The initial feature extractor then extracts a first feature vector set Fs of the source data set and a second feature vector set Ft of the target data set. The first feature vector set Fs is the set of feature vectors of all samples in the source data set; the second feature vector set Ft is the set of feature vectors of all elements in the target data set.
In an embodiment, a feature vector of each sample in the source data set may be extracted by a feature extractor, so as to obtain the first feature vector set; and extracting the feature vector of each element in the target data set through a feature extractor to obtain the second feature vector set. The feature vector refers to a feature representing an image in the form of a vector.
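For illustration, the following sketch shows one way such an initial feature extractor could be assembled; PyTorch, the ResNet-50 backbone, and the helper name extract_features are assumptions of this example rather than part of the disclosure:

    import torch
    import torchvision.models as models

    # Illustrative sketch: an ImageNet-pretrained ResNet backbone with the
    # classification head removed (requires torchvision >= 0.13).
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])

    def extract_features(images: torch.Tensor) -> torch.Tensor:
        """Map a batch of images (N, 3, H, W) to feature vectors (N, 2048)."""
        feature_extractor.eval()
        with torch.no_grad():
            return feature_extractor(images).flatten(1)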
Step S230: determine a class-center feature vector corresponding to the source data set according to the first feature vector set, and determine cluster labels of the target data set and the average feature vector within each cluster according to the second feature vector set.
The class-center feature vector Ws refers to the center of the feature vectors in the first feature vector set. In an embodiment, the mean of the feature vectors of a plurality of samples in the first feature vector set may be calculated and used as the class-center feature vector corresponding to the source data set. The feature vectors of the plurality of samples may be all of the feature vectors in the first feature vector set, or only some of them.
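As a minimal sketch of this step (assuming NumPy arrays; the per-class grouping reflects the "class center" reading of the text, and the function name is illustrative):

    import numpy as np

    def class_center_vectors(Fs: np.ndarray, Yo: np.ndarray) -> dict:
        """One class-center feature vector per source class: the mean of that
        class's sample feature vectors in the first feature vector set Fs."""
        return {int(c): Fs[Yo == c].mean(axis=0) for c in np.unique(Yo)}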
The cluster labels characterize the category to which an element in the target data set belongs. For example, 1, 2, 3, … may represent different categories. Feature vectors that are close to one another form a cluster, and the average feature vector Ct within a cluster may be the mean of all the feature vectors in that cluster.
In an embodiment, a clustering algorithm may be used to partition the second feature vector set Ft into clusters, and the cluster label of each cluster and the average feature vector within the cluster are determined according to the feature vectors of its elements. There may be multiple clusters. In one embodiment, the number of cluster categories (i.e., the number of clusters) is preset to m, together with an average silhouette coefficient threshold thr1 and a sample-count threshold thr2. The second feature vector set Ft is partitioned with an existing clustering algorithm (such as K-means, density-based clustering, or hierarchical clustering), the silhouette coefficient of each element is calculated, and the average silhouette coefficient Sm_i within each cluster and the number of samples n_i (i = 1, …, m) in the cluster are computed. If Sm_i < thr1 or n_i < thr2, the labels of that cluster are changed to label = -1 (i.e., its elements are defined as unclustered isolated points); elements of the other clusters keep their original labels (label >= 0). After cluster partitioning, besides the clusters, a set of unclustered isolated points (the isolated point set for short) is therefore obtained. The cluster labels may be denoted 0, 1, 2, 3, …, and the label information of the unclustered isolated point set may be denoted "-1". In an embodiment, an isolated-point-set feature vector Ot may be computed from the feature vector of each element in the isolated point set; it may be the set of feature vectors labeled "-1" in the second feature vector set Ft.
The silhouette coefficient s of an element is

s = (b - a) / max(a, b)

where a is the average distance from the element to the other elements of its own cluster, and b is the average distance from the element to the elements of the nearest other cluster. The value of s lies in [-1, 1]; values close to 1 indicate better clustering performance, and values close to -1 indicate worse clustering performance.
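A minimal sketch of this clustering-and-filtering step, assuming scikit-learn's KMeans and silhouette_samples (the function name and return convention are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples

    def cluster_target_features(Ft: np.ndarray, m: int, thr1: float, thr2: int):
        """Cluster Ft into m clusters; demote weak or small clusters to outliers."""
        labels = KMeans(n_clusters=m, n_init=10).fit_predict(Ft)
        s = silhouette_samples(Ft, labels)                  # silhouette coefficient per element
        for i in range(m):
            mask = labels == i
            if s[mask].mean() < thr1 or mask.sum() < thr2:  # Sm_i < thr1 or n_i < thr2
                labels[mask] = -1                           # unclustered isolated points
        Ct = {i: Ft[labels == i].mean(axis=0) for i in set(labels.tolist()) if i >= 0}
        Ot = Ft[labels == -1]                               # isolated-point-set feature vectors
        return labels, Ct, Ot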
Step S240: iteratively optimize the feature extractor so that the overall difference between the feature vectors of the source data set samples and the class-center feature vector, and between the feature vectors of elements within a cluster and the average feature vector within the cluster, is minimized.
In one embodiment, the overall difference may be the sum of the difference x between the feature vectors of samples in the source data set and the class-center feature vector and the difference y between the feature vectors of elements within a cluster and the average feature vector within the cluster (i.e., x + y). Both differences may be expressed as Euclidean distances.
In one embodiment, the feature extractor may be iteratively optimized to minimize the overall difference between the feature vector of each element in the isolated point set and the isolated-point-set feature vector, between the feature vectors of samples in the source data set and the class-center feature vector, and between the feature vectors of elements within a cluster and the average feature vector within the cluster.
For example, the parameters of the feature extractor may be optimized using the already obtained feature vector sets (Fs, Ft), the corresponding class prototype feature vectors (Ws, Ct, Ot), and an objective loss function similar to the following, to obtain a new feature extractor:
L = Σ_{x ∈ Xs} ||f(x) - Ws||^2 + Σ_{x ∈ Xt, label ≥ 0} ||f(x) - Ct||^2 + Σ_{x ∈ Xt, label = -1} ||f(x) - Ot||^2, where f(·) denotes the current feature extractor.
After one pass of data training, the updated model parameters are used to re-extract the data feature sets (Fs', Ft'), the class prototype features (Ws, Ct, Ot) are iteratively updated, and the value L of the loss function is computed, until L stops decreasing or the preset maximum number of iterations is reached. In one embodiment, the loss function may also be cross-entropy, triplet loss, or another loss function. By continuously optimizing the feature extractor, an accurate second feature vector set of the target data set is obtained, from which the clustering algorithm yields accurate clusters, cluster labels, the isolated point set, and its label information.
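Because the text only says the loss is "similar to" the function shown, the sketch below assumes plain squared Euclidean distances to the three prototype sets; the variable names follow the notation above, but the exact loss form and the use of PyTorch are assumptions:

    import torch

    def overall_difference(fs, ws, ftc, ct, fto, ot):
        """Sum of mean squared Euclidean distances between features and their
        prototypes: source samples vs. class centers (Ws), clustered target
        elements vs. in-cluster means (Ct), and outliers vs. the isolated-point
        vector (Ot); each prototype tensor holds one row per feature row."""
        return (((fs - ws) ** 2).sum(dim=1).mean()
                + ((ftc - ct) ** 2).sum(dim=1).mean()
                + ((fto - ot) ** 2).sum(dim=1).mean())

In the iterative loop, this loss would be minimised with a standard optimiser (e.g. torch.optim.Adam over the extractor's parameters), after which the features and prototypes are recomputed with the updated extractor until L plateaus or the iteration limit is reached.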
Step S250: and obtaining a training data set according to the clustering label of the target data set and the elements in the clustering.
The training data set may include a number of sample images, which may be elements (i.e., images) within a cluster, and sample labels, which may be cluster labels of the cluster.
In an embodiment, a plurality of first elements may be sampled from within the clusters at a first sampling rate (e.g., 10%), and a plurality of second elements from the isolated point set at a second sampling rate (e.g., 20%-30%); together, the first and second elements form the training data set. Since the target data set is divided into multiple clusters, a corresponding proportion of images is drawn from each cluster at the first sampling rate, and these images, together with the proportion drawn from the isolated point set, jointly form the training data set. In one embodiment, the sampled images can be stored according to their original classification, and the image closest to the average feature is provided as a sample illustration for each class.
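A sketch of this sampling step under the stated assumptions (the example rates above; the fixed 25% outlier rate and the function name are illustrative):

    import numpy as np

    def sample_initial_training_set(labels: np.ndarray,
                                    rate_cluster: float = 0.10,
                                    rate_outlier: float = 0.25,
                                    seed: int = 0) -> np.ndarray:
        """Return indices into the target data set: a proportion of each
        cluster plus a larger proportion of the isolated point set (label -1)."""
        rng = np.random.default_rng(seed)
        picked = []
        for lab in np.unique(labels):
            idx = np.flatnonzero(labels == lab)
            rate = rate_outlier if lab == -1 else rate_cluster
            k = max(1, round(rate * len(idx)))
            picked.extend(rng.choice(idx, size=k, replace=False).tolist())
        return np.asarray(picked)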
In an embodiment, the training data set may be manually checked, screened, and classified according to the project definition requirements, and a preliminary classification data set to be recognized and a negative sample data set are obtained for training the classification model.
In an embodiment, if the number of any class elements in the training data set is less than a threshold, class samples similar to the class elements are retrieved from the corresponding cluster clusters, and the training data set is expanded.
The training data set may contain multiple classes of elements (i.e., images). If the number of images of a certain class is less than a threshold, the accuracy of the trained model may be affected; the goal is a training data set that satisfies the desired data distribution and quantity. The feature extractor optimized above may be used to extract the class-average feature of each class of elements in the training data set (i.e., the mean of the feature vectors of the elements in that class), and elements (i.e., class samples) similar to the class-average feature are then retrieved from the corresponding cluster until the class has enough elements. Manual checking, screening, and cleaning can be performed again as needed to obtain a better training data set.
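A sketch of this retrieval-based expansion, assuming Euclidean distance to the class-average feature (the function name and array layout are illustrative):

    import numpy as np

    def expand_small_class(class_feats: np.ndarray,
                           cluster_feats: np.ndarray,
                           cluster_indices: np.ndarray,
                           needed: int) -> np.ndarray:
        """Retrieve the `needed` cluster elements closest to the class-average
        feature, topping up a class whose count fell below the threshold."""
        class_center = class_feats.mean(axis=0)
        dists = np.linalg.norm(cluster_feats - class_center, axis=1)
        return cluster_indices[np.argsort(dists)[:needed]]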
According to the technical scheme provided by the embodiment of the application, the classification characteristics of the source data set and the data characteristics of the target data set are fully utilized to train a more efficient, task-specific feature extractor, and the target data set with unknown class attributes is cluster-partitioned and sampled. The whole process improves the diversity and distribution consistency of sampling, greatly compresses possible data redundancy, reduces the workload of manual labeling, builds understanding of unknown data faster, and accelerates the iterative optimization of subsequent classification models.
Fig. 3 is a detailed flowchart illustrating a method for generating a training data set according to another embodiment of the present application, and as shown in fig. 3, the method includes the following steps:
step 1, acquiring a classified source data set;
step 2, acquiring an unclassified target data set;
step 3, building a feature extractor network on a basic deep-model framework such as ResNet, importing model parameters pre-trained on ImageNet, and obtaining an initial feature extractor;
step 3 (continued): the source data set and the target data set are sent into the feature extractor together, and a new feature extractor and the target-data clustering result are obtained; the specific process is as follows:
3.1, acquiring the feature vector set Fs of the source data set with the current feature extractor, and computing the class-center feature vector Ws;
3.2, acquiring the feature vector set Ft of the target data set, and obtaining the cluster labels and the in-cluster average feature vectors Ct using a clustering algorithm;
A. presetting the number of cluster categories m, an average silhouette coefficient threshold thr1, and a sample-count threshold thr2;
B. clustering and partitioning Ft with an existing clustering algorithm (such as K-means), calculating the silhouette coefficient of each element, and computing the average silhouette coefficient Sm_i within each cluster and the number of samples n_i in the cluster (i = 1, …, m); if Sm_i < thr1 or n_i < thr2, the cluster's labels are changed to label = -1 (i.e., defined as unclustered isolated points), while elements of the other clusters keep their original labels (label >= 0);
C. according to the cluster labels, the average feature vector Ct is computed within each cluster whose label satisfies label >= 0; for the unclustered elements (label = -1), the isolated-point-set feature vector is taken as Ot = Ft(label == -1);
3.3, optimizing the model parameters using the obtained feature vector sets (Fs, Ft), the corresponding class prototype feature vectors (Ws, Ct, Ot), and a target loss function similar to the following, to obtain a new feature extractor;
L = Σ_{x ∈ Xs} ||f(x) - Ws||^2 + Σ_{x ∈ Xt, label ≥ 0} ||f(x) - Ct||^2 + Σ_{x ∈ Xt, label = -1} ||f(x) - Ot||^2
after one pass of data training, the updated model parameters are used to re-acquire the data feature sets (Fs', Ft') and iteratively update the class prototype features (Ws, Ct, Ot); the process then returns to step 3.1 for cyclic iterative training until the L value stops decreasing or the preset maximum number of iterations is reached;
step 4, sampling the clusters and the isolated point set of the target data set obtained above: typically, sampling within each cluster at a certain rate (e.g., 10%) and at an increased rate for the isolated point set (e.g., 20%-30%); the sampled images are then stored according to their original classification, and the picture closest to the average feature is given as a sample illustration;
step 5, manually checking, screening and classifying the sampled data according to project definition requirements to form a preliminary classified data set to be identified and a negative sample data set;
step 6, expanding the data set by retrieval, according to the sample data volume of each category in the classified data set of step 5: when the data volume of a certain class of samples is smaller than a threshold, the feature extractor trained in step 3 and all samples of that class are used to extract the class-average feature, which is then searched against the corresponding cluster until the class has a sufficient amount of sample data; finally, a classification training data set satisfying the desired data distribution and quantity is obtained through manual checking, screening, and cleaning.
The following are embodiments of the apparatus of the present application, which can be used to perform embodiments of the method for generating the training data set of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the training data set generation method of the present application.
Fig. 4 is a block diagram of a training data set generation apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes: a data set acquisition module 410, a feature extraction module 420, a feature clustering module 430, a model optimization module 440, and a training set acquisition module 450.
A data set obtaining module 410, configured to obtain a classified source data set and an unclassified target data set;
a feature extraction module 420, configured to extract, by a feature extractor, a first feature vector set of the source data set and a second feature vector set of the target data set;
a feature clustering module 430, configured to determine a class-center feature vector corresponding to the source data set according to the first feature vector set, and determine cluster labels of the target data set and the average feature vector within each cluster according to the second feature vector set;
a model optimization module 440, configured to iteratively optimize the feature extractor to minimize the overall differences between the feature vectors of samples in the source data set and the class-center feature vectors, and between the feature vectors of elements within the clusters and the average feature vector within each cluster;
a training set obtaining module 450, configured to obtain a training data set according to the clustering label and the elements in the clustering cluster of the target data set.
The implementation processes of the functions and actions of each module in the apparatus are specifically described in the implementation processes of the corresponding steps in the generation method of the training data set, and are not described herein again.
In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, or portions thereof, may be substantially embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

Claims (6)

1. A method of generating a training data set, the method comprising:
acquiring a classified source data set and an unclassified target data set;
extracting, by a feature extractor, a first set of feature vectors of the source data set and a second set of feature vectors of the target data set;
determining a class-center feature vector corresponding to the source data set according to the first feature vector set, and determining a clustering label and an average feature vector in a clustering cluster of the target data set according to the second feature vector set;
by iteratively optimizing the feature extractor, minimizing overall differences between the feature vector of the sample in the source data set and the class-center feature vector, and between the feature vector of an element within a cluster and an average feature vector within the cluster;
obtaining a training data set according to the clustering label of the target data set and the elements in the clustering;
wherein the determining the clustering label and the average feature vector in the clustering of the target data set according to the second feature vector set comprises:
clustering and dividing the second feature vector set by using a clustering algorithm to obtain a cluster;
determining a clustering label of the clustering cluster and an average characteristic vector in the clustering cluster according to the characteristic vector of each element in the clustering cluster;
after clustering the second feature vector set, the method further includes:
obtaining an unclustered isolated point set;
generating label information of the isolated point set and isolated point set characteristic vectors according to the characteristic vector of each element in the isolated point set;
wherein, the obtaining of the training data set according to the clustering label and the clustering intra-cluster element of the target data set comprises:
obtaining a plurality of first elements from the cluster according to a first sampling proportion, obtaining a plurality of second elements from the isolated point set according to a second sampling proportion, wherein the plurality of first elements and the plurality of second elements form the training data set;
if the number of any type of elements in the training data set is smaller than a threshold value, retrieving a type sample similar to the type of elements from a corresponding clustering cluster, and expanding the training data set;
wherein said iteratively optimizing said feature extractor to minimize overall differences between the feature vectors of the samples in the source data set and the class-center feature vectors, and between the feature vectors of the elements within a cluster and the average feature vector within a cluster, comprises:
and by iteratively optimizing the feature extractor, the overall difference between the feature vector of each element in the isolated point set and the isolated-point-set feature vector, between the feature vector of the sample in the source data set and the class-center feature vector, and between the feature vector of the element in the cluster and the average feature vector in the cluster is minimized.
2. The method of claim 1, wherein the extracting, by a feature extractor, a first set of feature vectors of the source data set and a second set of feature vectors of the target data set comprises:
extracting a feature vector of each sample in the source data set through a feature extractor to obtain the first feature vector set;
and extracting the feature vector of each element in the target data set through a feature extractor to obtain the second feature vector set.
3. The method according to claim 2, wherein the determining the class-center feature vector corresponding to the source data set according to the first feature vector set comprises:
and calculating the mean value of the feature vectors of a plurality of samples in the first feature vector set to obtain the class center feature vector corresponding to the source data set.
4. An apparatus for generating a training data set, the apparatus comprising:
the data set acquisition module is used for acquiring the classified source data set and the unclassified target data set;
a feature extraction module, configured to extract, by a feature extractor, a first feature vector set of the source data set and a second feature vector set of the target data set;
the feature clustering module is used for determining a class-center feature vector corresponding to the source data set according to the first feature vector set, and determining a clustering label and an average feature vector in a clustering cluster of the target data set according to the second feature vector set;
a model optimization module for minimizing overall differences between the feature vectors of the source data set samples and the class center feature vectors, and between the feature vectors of the elements within the clusters and the average feature vector within the clusters by iteratively optimizing the feature extractor;
the training set obtaining module is used for obtaining a training data set according to the clustering label and the clustering elements of the target data set;
wherein the determining the clustering label and the average feature vector in the clustering of the target data set according to the second feature vector set comprises:
clustering and dividing the second feature vector set by using a clustering algorithm to obtain a cluster;
determining a clustering label of the clustering cluster and an average characteristic vector in the clustering cluster according to the characteristic vector of each element in the clustering cluster;
after clustering and dividing the second feature vector set, the method further includes:
obtaining an unclustered isolated point set;
generating label information of the isolated point set and isolated point set characteristic vectors according to the characteristic vector of each element in the isolated point set;
wherein, the obtaining of the training data set according to the clustering label and the clustering intra-cluster element of the target data set comprises:
obtaining a plurality of first elements from the cluster according to a first sampling proportion, obtaining a plurality of second elements from the isolated point set according to a second sampling proportion, wherein the plurality of first elements and the plurality of second elements form the training data set;
if the number of any type of elements in the training data set is smaller than a threshold value, retrieving a type sample similar to the type of elements from a corresponding clustering cluster, and expanding the training data set;
wherein said iteratively optimizing said feature extractor to minimize overall differences between the feature vectors of the samples in the source data set and the class-center feature vectors, and between the feature vectors of the elements within a cluster and the average feature vector within a cluster, comprises:
and by iteratively optimizing the feature extractor, the overall difference between the feature vector of each element in the isolated point set and the isolated-point-set feature vector, between the feature vector of the sample in the source data set and the class-center feature vector, and between the feature vector of the element in the cluster and the average feature vector in the cluster is minimized.
5. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of generating a training data set of any of claims 1-3.
6. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the method of generating a training data set according to any one of claims 1-3.
CN202011351822.6A 2020-11-25 2020-11-25 Training data set generation method and device, electronic equipment and storage medium Active CN112465020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011351822.6A CN112465020B (en) 2020-11-25 2020-11-25 Training data set generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011351822.6A CN112465020B (en) 2020-11-25 2020-11-25 Training data set generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112465020A CN112465020A (en) 2021-03-09
CN112465020B true CN112465020B (en) 2023-04-07

Family

ID=74808783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011351822.6A Active CN112465020B (en) 2020-11-25 2020-11-25 Training data set generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112465020B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239963B (en) * 2021-04-13 2024-03-01 联合汽车电子有限公司 Method, device, equipment, vehicle and storage medium for processing vehicle data
CN113239964B (en) * 2021-04-13 2024-03-01 联合汽车电子有限公司 Method, device, equipment and storage medium for processing vehicle data
CN112990377B (en) * 2021-05-08 2021-08-13 创新奇智(北京)科技有限公司 Visual category discovery method and device, electronic equipment and storage medium
CN113723507A (en) * 2021-08-30 2021-11-30 联仁健康医疗大数据科技股份有限公司 Data classification identification determination method and device, electronic equipment and storage medium
CN114332500A (en) * 2021-09-14 2022-04-12 腾讯科技(深圳)有限公司 Image processing model training method and device, computer equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252627A (en) * 2013-06-28 2014-12-31 广州华多网络科技有限公司 SVM (support vector machine) classifier training sample acquiring method, training method and training system

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130097103A1 (en) * 2011-10-14 2013-04-18 International Business Machines Corporation Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set
US9443314B1 (en) * 2012-03-29 2016-09-13 Google Inc. Hierarchical conditional random field model for labeling and segmenting images
CN103530689B (en) * 2013-10-31 2016-01-20 中国科学院自动化研究所 A kind of clustering method based on degree of depth study
US20170046510A1 (en) * 2015-08-14 2017-02-16 Qualcomm Incorporated Methods and Systems of Building Classifier Models in Computing Devices
CN107067025B (en) * 2017-02-15 2020-12-22 重庆邮电大学 Text data automatic labeling method based on active learning
US11023710B2 (en) * 2019-02-20 2021-06-01 Huawei Technologies Co., Ltd. Semi-supervised hybrid clustering/classification system
CN109961095B (en) * 2019-03-15 2023-04-28 深圳大学 Image labeling system and method based on unsupervised deep learning
CN110472082B (en) * 2019-08-02 2022-04-01 Oppo广东移动通信有限公司 Data processing method, data processing device, storage medium and electronic equipment
CN110570312B (en) * 2019-09-17 2021-05-28 深圳追一科技有限公司 Sample data acquisition method and device, computer equipment and readable storage medium
CN111178380B (en) * 2019-11-15 2023-07-04 腾讯科技(深圳)有限公司 Data classification method and device and electronic equipment
CN111126470B (en) * 2019-12-18 2023-05-02 创新奇智(青岛)科技有限公司 Image data iterative cluster analysis method based on depth measurement learning
CN111539451B (en) * 2020-03-26 2023-08-15 平安科技(深圳)有限公司 Sample data optimization method, device, equipment and storage medium
CN111680753A (en) * 2020-06-10 2020-09-18 创新奇智(上海)科技有限公司 Data labeling method and device, electronic equipment and storage medium
CN111738351B (en) * 2020-06-30 2023-12-19 创新奇智(重庆)科技有限公司 Model training method and device, storage medium and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252627A (en) * 2013-06-28 2014-12-31 广州华多网络科技有限公司 SVM (support vector machine) classifier training sample acquiring method, training method and training system

Also Published As

Publication number Publication date
CN112465020A 2021-03-09

Similar Documents

Publication Publication Date Title
CN112465020B (en) Training data set generation method and device, electronic equipment and storage medium
JP6005837B2 (en) Image analysis apparatus, image analysis system, and image analysis method
CN113255694B (en) Training image feature extraction model and method and device for extracting image features
US8051021B2 (en) System and method for resource adaptive classification of data streams
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN108027814B (en) Stop word recognition method and device
CN110858217A (en) Method and device for detecting microblog sensitive topics and readable storage medium
CN110633594A (en) Target detection method and device
CN110895533B (en) Form mapping method and device, computer equipment and storage medium
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN110532449B (en) Method, device, equipment and storage medium for processing service document
CN110738047A (en) Microblog user interest mining method and system based on image-text data and time effect
CN110209895B (en) Vector retrieval method, device and equipment
JP7330338B2 (en) Human image archiving method, device and storage medium based on artificial intelligence
CN115203408A (en) Intelligent labeling method for multi-modal test data
CN114610953A (en) Data classification method, device, equipment and storage medium
JP2004341948A (en) Concept extraction system, concept extraction method, program therefor, and storing medium thereof
CN113627124A (en) Processing method and device for font migration model and electronic equipment
CN117235137B (en) Professional information query method and device based on vector database
JP4199594B2 (en) OBJECT IDENTIFICATION DEVICE, ITS PROGRAM, AND RECORDING MEDIUM CONTAINING THE PROGRAM
CN113569019B (en) Method, system, equipment and storage medium for knowledge extraction based on chat conversation
CN117688138B (en) Long text similarity comparison method based on paragraph division
CN115114412B (en) Method for retrieving information in document, electronic device and storage medium
CN112990377B (en) Visual category discovery method and device, electronic equipment and storage medium
CN109408706B (en) Image filtering method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant