CN113313210A - Method and apparatus for data processing - Google Patents

Method and apparatus for data processing

Info

Publication number
CN113313210A
CN113313210A
Authority
CN
China
Prior art keywords
data
index
sample
information
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110718388.9A
Other languages
Chinese (zh)
Inventor
张武强
王宝锋
支蓉
郭子杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mercedes Benz Group AG
Original Assignee
Daimler AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daimler AG filed Critical Daimler AG
Priority to CN202110718388.9A
Publication of CN113313210A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model based on distances to training or reference patterns
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods


Abstract

The invention relates to the field of computer technology. The invention provides a method for data processing, comprising the following steps. S1: dividing the data samples into labeled first data and unlabeled second data according to whether annotation information exists. S2: obtaining, from the first data, a first index characterizing the annotation value of a sample. S3: obtaining, from the first data and the second data and based on self-supervised learning, a second index characterizing the annotation value of a sample, the second index being different from the first index. S4: screening the data to be annotated out of the second data by means of the first index and the second index, and annotating said data. The invention also provides a method for object detection, an apparatus for data processing and a computer program product. The invention aims to screen out the sample data of highest annotation value while jointly considering the information content and the diversity of the data samples, thereby reducing annotation cost and improving data utilization.

Description

Method and apparatus for data processing
Technical Field
The present invention relates to a method for data processing, a method for object detection, an apparatus for data processing and a computer program product.
Background
With the rise and development of artificial intelligence, deep-learning-based technologies such as object detection, semantic segmentation and pedestrian pose estimation provide important environment perception information for downstream path planning and decision-making in the field of automated driving. To ensure the robustness of a perception model, its training data usually has to be labeled manually at high quality. However, because labeling is expensive and the volume of newly acquired raw data is large, labeling all of the newly acquired data is not feasible.
In one known method, a classifier is used to compute the total information content of the samples, only a subset of the samples is selected for labeling based on that information content, and the classifier is then retrained with the updated samples.
A method for selecting sample images is also known, in which an unlabeled image set and a labeled image set are first obtained, an image classification model is trained using the labeled set as the training set, an uncertainty index and a representativeness index are computed from the classification results, and the annotation value is determined on the basis of these two indices. Finally, sample images of high annotation value are selected from the unlabeled images for manual labeling, thereby optimizing the performance of the image classification model.
However, these solutions still have many shortcomings. In particular, most existing labeling schemes address only classification tasks; for localization/regression tasks on images, whose labeling cost is higher, the related techniques for active-learning data selection are not yet mature. Moreover, existing labeling methods require labeled information to derive a basis for sample diversity, so the information contained in the unlabeled data cannot be exploited effectively, which results in an incomplete expression of diversity.
Against this background, it is desirable to provide an improved data processing method that reduces data annotation cost, avoids unnecessary annotation and improves the efficiency of data utilization.
Disclosure of Invention
It is an object of the present invention to provide a method for data processing, a method for object detection, an apparatus for data processing and a computer program product to solve at least some of the problems of the prior art.
According to a first aspect of the present invention, there is provided a method for data processing, the method comprising the steps of:
S1: dividing the data samples into labeled first data and unlabeled second data according to whether annotation information exists;
S2: obtaining, from the first data, a first index characterizing the annotation value of a sample;
S3: obtaining, from the first data and the second data and based on self-supervised learning, a second index characterizing the annotation value of a sample, the second index being different from the first index; and
S4: screening the data to be annotated out of the second data by means of the first index and the second index, and annotating said data.
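For illustration only, the four steps can be sketched as the following selection loop. This is a minimal sketch: all helper names (train_task_model, spatial_entropy, train_feature_extractor, nearest_neighbor_distances) are hypothetical placeholders, not components defined by the invention.

```python
# Hypothetical sketch of steps S1-S4; all helpers are placeholders.
def select_data_to_annotate(samples, budget):
    # S1: divide by presence of annotation information
    first_data = [s for s in samples if s["label"] is not None]
    second_data = [s for s in samples if s["label"] is None]

    # S2: first index -- information content from the trained task model
    task_model = train_task_model(first_data)
    info = [spatial_entropy(task_model.predict(s["image"])) for s in second_data]

    # S3: second index -- diversity from a self-supervised feature extractor
    extractor = train_feature_extractor(first_data + second_data)
    diversity = nearest_neighbor_distances(extractor, first_data, second_data)

    # S4: fuse both indices and screen out the most valuable samples
    importance = [i * d for i, d in zip(info, diversity)]
    order = sorted(range(len(second_data)), key=lambda k: importance[k], reverse=True)
    return [second_data[k] for k in order[:budget]]
```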
The invention is based in particular on the following technical concept: instead of inducing a single evaluation index from the labeled data alone to measure the annotation value of a sample, the introduction of a self-supervised learning strategy makes it possible to also exploit the information in the unlabeled data for establishing an additional evaluation index, so that the importance contribution of a sample is estimated from more angles. This ensures that the data finally screened out have a higher annotation value.
Optionally, step S2 comprises: training a target task model, in particular a target detection model, with the aid of the first data, and estimating the information content of the second data on the basis of the prediction results of the trained target task model, this information content being used as the first index.
The following technical advantage is thereby achieved in particular: the information content is also called uncertainty, and by introducing an evaluation index of this kind, the most valuable samples can be screened out from the perspective of the richness, or uncertainty, of the information they contain. When such samples are labeled with priority and used for subsequent model training, training time can be shortened considerably while the performance of the network is preserved.
Optionally, step S3 comprises: training an expression feature extractor, in particular a variational autoencoder, with the aid of the first data and the second data, extracting the expression features of the second data on the basis of the trained expression feature extractor, and determining the diversity of the second data from the expression features, this diversity being used as the second index.
The following technical advantage is thereby achieved in particular: the invention provides a feature extraction scheme based on self-supervised learning, so that the diversity index is no longer established from the labeled data alone. The unlabeled data can thus be exploited more fully, the data utilization rate is improved, network convergence is accelerated overall, and the data acquisition pressure is relieved to a certain extent.
In the sense of the invention, diversity can also be understood as "rarity" or "representativeness", which reflects the overall data distribution of the samples. Since samples of high diversity usually have low similarity to the other samples, a small amount of labeling work suffices to retain the rare samples that carry the important information, while the influence of redundant information on training quality and training time is filtered out.
Optionally, step S4 comprises: fusing the first index and the second index into importance information according to information fusion theory, and selecting the data of highest importance from the second data as the data to be annotated on the basis of the importance information.
The following technical advantage is thereby achieved in particular: fusing the two different evaluation indices of a data sample expresses its importance comprehensively. In the fusion stage, the two independent evaluation indices constrain and balance each other, so that the finally screened data satisfy both considerations at once; this raises the information value of the data to be annotated and further reduces the annotation cost.
Optionally, the method further comprises the step of: merging the annotated data into the first data, and updating the first data and the second data divided in step S1 according to whether annotation information exists.
The following technical advantage is thereby achieved in particular: through continuous iterative updating, the labeled data set gradually approaches optimal quality, finally yielding an accurate division into valuable and non-valuable data. Manual labeling and importance measurement therefore need not be carried out one by one on all the original samples, which reduces the time overhead. In addition, performing model training with labeled data of higher quality lowers the generalization error of the model to a certain extent and thus improves its performance.
Optionally, the information content of the second data is estimated by means of the spatial-domain information entropy of the prediction result, which is formulated as follows:
Entropy_spatial = -Σ_u Σ_v p(u,v) · log p(u,v),
where Entropy_spatial is the value of the spatial-domain information entropy, u is the abscissa in the image space domain, v is the ordinate in the image space domain, and p(u,v) is the probability that the random variable is located at position (u,v) in the image space domain.
The following technical advantage is thereby achieved in particular: among target task types there are not only classification tasks but also localization/regression tasks, and the traditional information entropy cannot be used to estimate the sample information content of an object detection task. By taking the spatial-domain information entropy as the measure of sample information content, the data processing method becomes suitable not only for classification tasks but also for localization/regression tasks with higher annotation cost (such as image object detection, person keypoint detection and semantic segmentation of images), so that active data selection saves more annotation cost and the field of application is broadened.
Optionally, the nearest-neighbor distance of each sample of the second data in the feature space domain is calculated by means of a k-center kernel function, and the diversity contribution of each sample is estimated on the basis of this nearest-neighbor distance.
The following technical advantage is thereby achieved in particular: by introducing the kernel function, the diversity contribution of the features is reflected continuously in the nearest-neighbor distance in the feature space domain, which remedies a deficiency of traditional classification algorithms in the binary classification problem.
According to a second aspect of the present invention, there is provided a method for target detection, the method comprising the steps of:
training a target detection model by using data processed by the method of the first aspect of the invention as training data; and
carrying out target detection by means of the trained target detection model.
According to a third aspect of the present invention, there is provided an apparatus for data processing, the apparatus comprising a processor and a computer readable storage device communicatively connected to the processor, the computer readable storage device having stored thereon a computer program for implementing the method according to the first aspect of the present invention when the computer program is executed by the processor.
According to a fourth aspect of the present invention, a computer program product is provided, wherein the computer program product comprises a computer program for implementing the method according to the first aspect of the present invention when executed by a computer.
Drawings
The principles, features and advantages of the present invention may be better understood by describing the invention in more detail below with reference to the accompanying drawings. The drawings comprise:
fig. 1 shows a schematic block diagram of an apparatus for data processing according to an exemplary embodiment of the present invention;
fig. 2 illustrates a schematic block diagram of an apparatus for processing data according to another exemplary embodiment of the present invention;
FIG. 3 shows a schematic block diagram of the feature extractor of the apparatus of FIG. 2;
FIG. 4 shows a flow diagram of a method for data processing according to an example embodiment of the present invention; and
fig. 5 shows a flow chart of a method for object detection according to an exemplary embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and exemplary embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of the invention.
Fig. 1 shows a schematic block diagram of an apparatus for data processing according to an exemplary embodiment of the present invention. The apparatus 100 includes a processor 10 and a computer-readable storage device 20 communicatively connected to the processor 10. The computer-readable storage device 20 stores a computer program which, when executed by the processor 10, implements the method for data processing explained in detail below.
According to an exemplary embodiment, an output device 30 is provided in communicative connection with the processor 10. By means of the output device 30, the processed data can be output as a training data set for use in the subsequent training of a task model.
According to an exemplary embodiment, an input device 40 is provided in communicative connection with the processor 10. By means of the input device 40, the user can actively add annotation information to the sample data or view the sample data, and furthermore, the user can modify the existing annotation information by means of the input device 40. The input device 40 may include, for example: a keyboard, a mouse, and/or a touch screen.
According to an exemplary embodiment, a camera 50 is provided in communicative connection with the processor 10. With the aid of the camera 50, raw data samples for model training can be acquired. In the automotive field, raw data samples may be collected by a dedicated camera mounted on the vehicle; alternatively, they may be collected by cameras of intelligent infrastructure devices arranged along the road. The raw data samples can be distinguished during acquisition, inter alia, by a timestamp serving as a unique code, in order to prevent similar individuals from recurring multiple times.
According to an exemplary embodiment, the raw data samples may also be stored in a further storage device (not shown) communicatively connected to the processor 10 and provided at an appropriate time.
Fig. 2 illustrates a schematic block diagram of an apparatus for processing data according to another exemplary embodiment of the present invention. Features and details of the apparatus 100' shown in fig. 2 and described below in connection with fig. 2 may be understood as additions or substitutions to features and details of the apparatus 100 shown in fig. 1 and described in connection with fig. 1. In particular, these features and details may be combined in any suitable manner.
The device 100' includes a data update module 110, a target task model 120, a feature extractor 130, a fusion module 140, and a labeling module 150.
The data updating module 110 is configured to divide the original data samples into labeled first data and unlabeled second data according to whether a data sample carries annotation information, and may also adjust or update the ratio of first to second data on the basis of this division.
The target task model 120 may be a perception model, in particular a target detection model, corresponding to a target task defined by the user. The target task model 120 receives the annotated first data from the data update module 110 and is trained using the first data. Pre-trained with the first data in this way, the target task model 120 can predict the target task results, and a first index characterizing the annotation value of a sample is obtained from it.
The feature extractor 130 receives the first data and the second data from the data update module 110 and is trained on them in a self-supervised manner. A second index characterizing the annotation value of a sample can be obtained by means of the feature extractor 130.
In the fusion module 140, the sample information content obtained from the target task model 120 and the sample diversity obtained from the feature extractor 130 are fused into sample importance information using an information fusion technique. On the basis of this importance information, the fusion module 140 actively selects data of high importance from the second data as the data to be annotated.
The annotation module 150 serves to provide the corresponding annotation information for the high-importance data actively selected by the fusion module 140. The annotation module 150 is also coupled to the data update module 110, so that the first data and the second data are re-divided in the data update module 110 on the basis of the updated annotation information.
Fig. 3 shows a schematic block diagram of the feature extractor of the apparatus in Fig. 2. In the present embodiment, a variational autoencoder 300 is built and trained as the feature extractor. The variational autoencoder 300 includes a feature encoder 301 and a feature decoder 302. As an example, the first data and the second data in the form of original image data X may be input into the feature encoder 301, where the input image data are encoded into corresponding feature vectors z. After passing through the feature decoder 302, the feature vector z is restored to image data X̂.
Here the variational autoencoder (VAE) network enables automatic feature extraction by sampling from the variance vector of the middle layer and adding the sample to the mean vector; through this process, latent variables acting as interventions can be injected externally. Compared with a conventional feature extractor, the variational autoencoder used in this embodiment does not require matched labeled data for the decoupled learning of target features; instead, it maps the input data into a low-dimensional feature space in a self-supervised manner, so that the feature expression of each sample is obtained directly. The metric of a sample's annotation value can thus be obtained while making more effective use of the unlabeled data.
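For illustration, a minimal PyTorch sketch of such a variational autoencoder follows; the fully connected architecture and all layer sizes are assumptions of the sketch, not the concrete design of the embodiment:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal variational autoencoder; dimensions are illustrative only."""
    def __init__(self, in_dim=3 * 64 * 64, z_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 512), nn.ReLU())
        self.fc_mu = nn.Linear(512, z_dim)      # mean vector of the middle layer
        self.fc_logvar = nn.Linear(512, z_dim)  # log-variance vector
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 512), nn.ReLU(),
            nn.Linear(512, in_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # reparameterization: sample from the variance vector and add to the mean
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.decoder(z).view_as(x)      # restored image data X-hat
        return x_hat, mu, logvar
```

The non-sampled mean vector mu of the middle layer can then serve as the low-dimensional feature expression of a sample.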
Fig. 4 shows a flow chart of a method for data processing according to an exemplary embodiment of the present invention.
In step S1, the data sample is divided into labeled first data and unlabeled second data according to whether or not there is label information. This classification process can be carried out in particular manually by a human being, but can also be performed instead by means of a simple pre-trained classifier.
In step S21, a target task model is built and trained with the first data. The target task model may in particular be a computer-vision perception model, for example an object detection model, which can be used to implement localization/regression tasks on images and includes, for example: a person detection model, a pedestrian keypoint detection model and an image semantic segmentation model.
For example, a convolutional neural network composed of convolutional layers, activation layers and other structures may be trained; Fast R-CNN, for instance, may be used as the object detection model, whose input is image data and whose output is the predicted position, size and class information of the target bounding boxes. Through continued training on the labeled first data, the network parameters of the convolutional neural network are iteratively updated until the performance of the object detection model is optimal for the available data scale.
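As an illustration only, a detector of this family could be trained with torchvision's off-the-shelf Faster R-CNN, which stands in here for the model named above; the class count and the dummy sample are assumptions of the sketch:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(num_classes=5)  # class count is illustrative
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# one dummy labeled sample standing in for the first data
images = [torch.rand(3, 256, 256)]
targets = [{"boxes": torch.tensor([[40.0, 40.0, 120.0, 160.0]]),  # x1, y1, x2, y2
            "labels": torch.tensor([1])}]

model.train()
loss_dict = model(images, targets)  # classification and box-regression losses
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```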
Next, in step S22, the information content of the second data is estimated on the basis of the trained target task model. For example, the second data are input into the pre-trained target task model to obtain the corresponding prediction results, and the information entropy of these prediction results can be used to estimate the sample information content of the second data.
Here, if the target task is a simple classification task, the sample information content can be estimated using the traditional information entropy, which is expressed as follows:
Entropy = -Σ_i p_i · log p_i,
where i runs over the different categories, p_i is the probability of occurrence of the random event of category i, and Entropy is the expected information entropy of the sample image. The larger the overall information entropy, the more label information the sample is considered to contain, and the higher therefore its annotation value in terms of information content.
Such a traditional entropy-based calculation can evaluate the sample information content in a classic image classification task. For some target tasks, however, such as spatial regression/localization tasks on images, the traditional information entropy cannot fully represent the information content or uncertainty of a sample and is therefore unsuitable.
To solve this problem, the invention uses the spatial-domain information entropy to estimate the information content of a data sample, so that the uncertainty of a sample is measured from the perspective of the spatial domain. The spatial-domain information entropy is formulated as follows:
Entropy_spatial = -Σ_u Σ_v p(u,v) · log p(u,v),
where Entropy_spatial is the value of the spatial-domain information entropy, u is the abscissa in the image space domain, v is the ordinate in the image space domain, and p(u,v) is the probability that the random variable is located at position (u,v) in the image space domain. The smaller the chance (probability) of the random variable appearing at a position in the spatial domain, the greater the uncertainty and the greater the information entropy.
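Both entropies reduce to the same computation over a normalized probability array; a minimal sketch follows, in which the heatmap is a random stand-in for a model's spatial prediction:

```python
import numpy as np

def entropy(p):
    """Entropy = -sum(p * log(p)) over the positive entries of p."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                          # convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log(p)))

# Traditional entropy over class probabilities p_i:
print(entropy([0.7, 0.2, 0.1]))

# Spatial-domain entropy: normalize the per-pixel prediction map so it is a
# probability distribution p(u, v) over positions, then apply the same formula.
heatmap = np.random.rand(64, 64)          # stand-in for a spatial prediction
p_uv = heatmap / heatmap.sum()
print(entropy(p_uv))                      # Entropy_spatial of this sample
```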
In step S31, a feature extractor is built and trained with the first data and the second data. For example, a variational autoencoder can be trained with the first and second data and used as the data representation feature extractor, whereby high-dimensional raw data are mapped to a low-dimensional feature representation. The network parameters of the variational autoencoder are iteratively updated over the course of training, finally yielding the trained feature extractor.
The loss function employed in the training process may include two parts:
the first part is based on the self-supervised image reconstruction loss (characteristic L2 loss), which is formulated as follows:
Figure BDA0003135934760000083
wherein X is the input original image data,
Figure BDA0003135934760000084
for the generated restored image data, { lcAnd the H is the characteristic layer corresponding to VGG 19.
The second part is the KL divergence distance, which is formulated as follows:
L_KL = KL(q_E(z|X) || p_D(z|X)),
where X is the input original image data, z is the extracted overall image feature, and q_E and p_D denote, respectively, the distributions over the (mean, non-sampled) feature vectors obtained in the middle layers of the encoder and decoder of the feature extraction network used.
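A sketch of this two-part loss under simplifying assumptions: the reconstruction term is computed in pixel space here rather than on VGG19 feature layers, and the KL term uses the standard closed form against a unit-Gaussian prior:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    # Part 1: reconstruction loss. The text compares X and X-hat on VGG19
    # feature layers {l_c}; a plain pixel-space L2 term is substituted here.
    l_rec = F.mse_loss(x_hat, x, reduction="sum")
    # Part 2: KL divergence of q_E(z|X) from a standard normal prior, in
    # closed form for a diagonal Gaussian parameterized by (mu, logvar).
    l_kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return l_rec + l_kl
```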
In step S32, the expression features of the second data are extracted with the trained feature extractor and their diversity is determined. The diversity of the data samples can be expressed in the low-dimensional feature space by the distances between different samples in the feature space domain. For example, the nearest-neighbor distance of each sample can be obtained via a k-center kernel function, and the diversity contribution of the sample is measured on the basis of this distance: the greater the nearest-neighbor distance of a sample, the greater its diversity contribution.
Here, the complex mathematical relationship between features and target classes can be converted, via the kernel-based nearest-neighbor distance, into a metric that minimizes sample similarity; in particular, setting a threshold on the nearest-neighbor distance avoids the selection of near-duplicate samples and thereby improves the diversity of the training samples.
Common distance measures include the KL divergence distance, the Euclidean distance, the Manhattan distance, the cosine similarity, and so on. Different distance formulas suit different target task types and usage scenarios.
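A sketch of the nearest-neighbor-distance computation with the Euclidean distance; the feature arrays are random stand-ins for the extractor's output:

```python
import numpy as np

def nearest_neighbor_distances(query_feats, reference_feats):
    """Distance from each query feature to its nearest reference feature."""
    # pairwise Euclidean distances, shape (n_query, n_reference)
    d = np.linalg.norm(query_feats[:, None, :] - reference_feats[None, :, :], axis=-1)
    return d.min(axis=1)

rng = np.random.default_rng(0)
feats_unlabeled = rng.random((100, 128))  # stand-in feature expressions
feats_labeled = rng.random((20, 128))
diversity = nearest_neighbor_distances(feats_unlabeled, feats_labeled)
# the larger a sample's nearest-neighbor distance, the larger its contribution
```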
It should be noted here that although steps S31-S32 for obtaining the second index are shown after steps S21-S22 for obtaining the first index, this ordering is merely exemplary. The two indices are obtained independently of each other and need not be computed in sequence; depending on the circumstances, their order may be swapped or the two computations may run in parallel.
In step S41, the information content index and the diversity index are fused, and the most valuable data are screened out of the second data as the data to be annotated on the basis of the fusion result. For example, an importance index may be defined as the result of the information fusion; as one example, it may be defined as the product of a sample's information entropy and its nearest-neighbor distance. The second data can then be ordered by the magnitude of this importance index, and the data samples of highest importance are screened out as the samples to be annotated.
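This product fusion and ranking can be sketched as follows; the per-sample arrays are random stand-ins for the two indices computed above:

```python
import numpy as np

rng = np.random.default_rng(1)
info_entropy = rng.random(100)  # first index: spatial information entropy
nn_distance = rng.random(100)   # second index: nearest-neighbor distance

importance = info_entropy * nn_distance              # fused importance index
budget = 10
to_annotate = np.argsort(importance)[::-1][:budget]  # highest importance first
```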
In step S42, it is determined whether the data annotation budget has reached its upper limit; as an example, it may be checked whether a preset number of loop iterations has been reached.
If the upper limit has not been reached, the screened data to be annotated are annotated in step S44 and the annotated data are merged into the first data. The number of samples contained in the first data and in the second data thus changes.
To accommodate this change, the process jumps from step S44 back to step S1, and the division of the sample data is carried out anew according to the current annotation state, yielding updated first data and updated second data.
If it is determined in step S42 that the upper limit of the annotation budget has been reached, the current first data can be output as the training data set and the data processing ends in step S43.
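Taken together, steps S41-S44 form the following iteration, sketched with hypothetical helper names that mirror the flow of fig. 4:

```python
# All helpers are hypothetical placeholders for the components described above.
while not annotation_budget_reached():                # S42
    ranked = rank_by_importance(second_data)          # S41
    batch = ranked[:batch_size]
    first_data += annotate(batch)                     # S44: manual annotation
    second_data = [s for s in second_data if s not in batch]
    retrain_models(first_data, second_data)           # back to S1/S2/S3

training_set = first_data                             # S43: output the first data
```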
Fig. 5 shows a flow chart of a method for object detection according to an exemplary embodiment of the present invention.
In step 201, data processed by means of the method shown in fig. 4 are acquired as the training data set. For example, the finally obtained first data, or a subset thereof, can be used directly as the training data set.
In step 202, the object detection model is trained on the basis of the training data set. Depending on the application scenario and user requirements, the task type of the object detection model may be freely defined; in particular it may be person keypoint detection, pedestrian pose recognition, image semantic segmentation and the like. Because the training data set contains labeled data optimized for the target task in terms of information content and diversity distribution, the generalization capability of the model can be improved and the training process greatly accelerated.
In step 203, target detection is performed by means of the trained target detection model.
Although specific embodiments of the invention have been described herein in detail, they have been presented for purposes of illustration only and are not to be construed as limiting the scope of the invention. Various substitutions, alterations, and modifications may be devised without departing from the spirit and scope of the present invention.

Claims (10)

1. A method for data processing, the method comprising the steps of:
S1: dividing the data samples into labeled first data and unlabeled second data according to whether annotation information exists;
S2: obtaining, from the first data, a first index characterizing the annotation value of a sample;
S3: obtaining, from the first data and the second data and based on self-supervised learning, a second index characterizing the annotation value of a sample, the second index being different from the first index; and
S4: screening the data to be annotated out of the second data by means of the first index and the second index, and annotating said data.
2. The method according to claim 1, wherein step S2 comprises: training a target task model, in particular a target detection model, with the aid of the first data, and estimating the information content of the second data on the basis of the prediction results of the trained target task model, this information content being used as the first index.
3. The method according to claim 1 or 2, wherein step S3 comprises: training an expression feature extractor, in particular a variational autoencoder, with the aid of the first data and the second data, extracting the expression features of the second data on the basis of the trained expression feature extractor, and determining the diversity of the second data from the expression features, this diversity being used as the second index.
4. The method according to any one of claims 1 to 3, wherein step S4 comprises: fusing the first index and the second index into importance information according to information fusion theory, and selecting the data of highest importance from the second data as the data to be annotated on the basis of the importance information.
5. The method according to any one of claims 1 to 4, further comprising the step of: merging the annotated data into the first data, and updating the first data and the second data divided in step S1 according to whether annotation information exists.
6. The method according to claim 2, wherein the information content of the second data is estimated by means of the spatial-domain information entropy of the prediction result, which is formulated as follows:
Entropy_spatial = -Σ_u Σ_v p(u,v) · log p(u,v),
where Entropy_spatial is the value of the spatial-domain information entropy, u is the abscissa in the image space domain, v is the ordinate in the image space domain, and p(u,v) is the probability that the random variable is located at position (u,v) in the image space domain.
7. The method according to claim 3, wherein the nearest-neighbor distance of each sample of the second data in the feature space domain is calculated by means of a k-center kernel function, and the diversity contribution of each sample is estimated on the basis of the nearest-neighbor distance.
8. A method for target detection, the method comprising the steps of:
training a target detection model using data processed according to the method of any one of claims 1 to 7 as training data; and
carrying out target detection by means of the trained target detection model.
9. An apparatus (100) for data processing, the apparatus (100) comprising a processor (10) and a computer readable storage device (20) communicatively connected to the processor (10), the computer readable storage device (20) having stored therein a computer program for implementing the method according to any one of claims 1 to 7 when the computer program is executed by the processor (10).
10. A computer program product, wherein the computer program product comprises a computer program for implementing the method according to any one of claims 1 to 7 when executed by a computer.

Priority Applications (1)

Application Number: CN202110718388.9A
Priority / Filing Date: 2021-06-28
Title: Method and apparatus for data processing

Publications (1)

Publication Number: CN113313210A
Publication Date: 2021-08-27

Family ID: 77380580

Family Applications (1)

Application Number: CN202110718388.9A (pending)
Priority / Filing Date: 2021-06-28
Title: Method and apparatus for data processing

Country Status (1)

CN: CN113313210A (pending)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114509266A * 2022-01-05 2022-05-17 东南大学 (Southeast University) Bearing health monitoring method based on fault feature fusion
CN114509266B * 2022-01-05 2023-12-01 东南大学 (Southeast University) Bearing health monitoring method based on fault feature fusion


Legal Events

Date Code Title Description
PB01 Publication