WO2024022149A1 - Data enhancement method and apparatus, and electronic device - Google Patents

Data enhancement method and apparatus, and electronic device

Info

Publication number
WO2024022149A1
Authority
WO
WIPO (PCT)
Prior art keywords
original image
image
category
target detection
categories
Prior art date
Application number
PCT/CN2023/107709
Other languages
French (fr)
Chinese (zh)
Inventor
吕永春
朱徽
周迅溢
蒋宁
吴海英
Original Assignee
马上消费金融股份有限公司
Priority date
Filing date
Publication date
Application filed by 马上消费金融股份有限公司
Publication of WO2024022149A1


Classifications

    • G06N 3/0464 Computing arrangements based on biological models; neural networks; convolutional networks [CNN, ConvNet]
    • G06V 10/25 Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/774 Pattern recognition or machine learning; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/803 Fusion of input or preprocessed data
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 2201/07 Indexing scheme relating to image or video recognition or understanding; target detection

Abstract

Provided in the embodiments of the present application are a data enhancement method and apparatus, and an electronic device. The data enhancement method comprises: acquiring an original image and a background image set, wherein one original image corresponds to one background image set; performing target detection on the original image according to a target detection network, so as to obtain a first detection frame of the original image; and fusing an area corresponding to the first detection frame with at least one background image in the background image set corresponding to the original image, so as to obtain an enhanced image for the original image. In this way, the data enhancement effect can be improved.

Description

Data enhancement method, apparatus, and electronic device
This application claims priority to Chinese patent application No. 202210904226.9, titled "Data enhancement method, apparatus, and electronic device" and filed on July 29, 2022, the content of which is incorporated herein by reference.
Technical Field
The present application relates to the field of image processing technology, and in particular to a data enhancement method, an apparatus, and an electronic device.
Background
In recent years, deep learning has been widely applied in image processing, computer vision, and other fields. However, as network depth increases, overfitting in large-scale deep neural networks becomes increasingly severe, which in turn degrades performance. An important cause of overfitting is an insufficient amount of training data. To expand the available training data, a variety of data enhancement techniques for image data have been proposed. Currently, common image enhancement schemes obtain new images by flipping, cropping, translating, and color-transforming existing images, thereby expanding the image data.
Summary of the Invention
Embodiments of the present application provide a data enhancement method, an apparatus, and an electronic device.
In one aspect, an embodiment of the present application provides a data enhancement method, including:
acquiring an original image and a background image set, where one original image corresponds to one background image set;
performing target detection on the original image according to a target detection network to obtain a first detection frame of the original image; and
fusing an area corresponding to the first detection frame with at least one background image in the background image set corresponding to the original image to obtain an enhanced image of the original image.
It can be seen that, in the data enhancement method of the embodiments of the present application, the first detection frame can be obtained by performing target detection on the original image, and the area corresponding to the first detection frame in the original image is fused with at least one background image in the background image set corresponding to the original image to obtain at least one enhanced image corresponding to the original image, thereby achieving data enhancement of the original image.
In one aspect, an embodiment of the present application provides a data enhancement apparatus, including:
an acquisition module, configured to acquire an original image and a background image set, where one original image corresponds to one background image set;
a target detection module, configured to perform target detection on the original image according to a target detection network to obtain a first detection frame of the original image; and
a fusion module, configured to fuse the area corresponding to the first detection frame with at least one background image in the background image set corresponding to the original image to obtain an enhanced image of the original image.
In one aspect, an embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the above data enhancement method.
In one aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the above data enhancement method.
In one aspect, the present disclosure provides a computer program that, when executed by a processor, implements the steps of the above data enhancement method.
In one aspect, the present disclosure provides a computer program product including a computer program that, when executed by a processor, implements the steps of the above data enhancement method.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Figure 1 is a flowchart of a data enhancement method provided by an embodiment of the present application;
Figure 2 is a flowchart of a data enhancement method provided by an embodiment of the present application;
Figure 3 is a schematic diagram of convolutional neural network training provided by an embodiment of the present application;
Figure 4 is a schematic diagram of a data enhancement method provided by an embodiment of the present application;
Figure 5 is an application scenario diagram of a data enhancement method provided by an embodiment of the present application;
Figure 6 is a schematic structural diagram of a data enhancement apparatus provided by an embodiment of the present application;
Figure 7 is a schematic structural diagram of a data enhancement apparatus provided by an embodiment of the present application;
Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
When a network is trained on image data, the image data used for training is also called image training samples, and this data directly affects the quality of training. Overfitting is a common problem during network training, and an important cause of overfitting is an insufficient number of image training samples. Image data enhancement is an important way to increase the sample size. However, enhanced image data currently obtained by flipping, cropping, translating, or color-transforming the original image provides limited additional information relative to the original image, so the enhancement effect is poor. Even if the resulting enhanced images are used together with the original images for network training, the enhanced images differ little from the original images and provide little additional information, so overfitting remains likely. Based on this, embodiments of the present application provide a data enhancement method in which the first detection frame obtained by performing target detection on the original image is fused with an additional background image to obtain an enhanced image. In this way, the enhanced image differs substantially from the original image and can provide more additional information, improving the image enhancement effect; subsequently training a network with the original images and with enhanced images that differ substantially from them and provide more additional information can reduce overfitting.
It should be noted that the method can be applied to, and executed by, an electronic device, which may be any device capable of implementing data enhancement, including but not limited to a terminal device or a server device.
Referring to Figure 1, Figure 1 is a flowchart of a data enhancement method provided by an embodiment of the present application. As shown in Figure 1, the method includes the following steps:
Step 101: Acquire an original image and a background image set, where one original image corresponds to one background image set.
The original image may also be called the image to be enhanced, and there is a correspondence between original images and background image sets. The number of original images is not specifically limited here; there may be one or more, and correspondingly there may be one or more background image sets. Each background image set may include one or more background images, which is not limited by the embodiments of the present application.
Step 102: Perform target detection on the original image according to a target detection network to obtain a first detection frame of the original image.
After the original image is acquired, target detection can be performed on it according to the target detection network to obtain its first detection frame. Target detection can be understood as detecting the position of a target in an image. The result of target detection can be represented by a detection frame, also called a bounding box; one implementation is a rectangular frame, and the detected target lies within the corresponding detection frame. It should be noted that although the embodiments of the present application take a rectangular frame as an example, they are not limited to this, and the technical solutions of the embodiments are equally applicable to detection frames of other shapes.
It should be noted that if there are multiple original images, target detection needs to be performed on them one by one to obtain the first detection frame of each original image. Of course, if there are multiple target detection networks, each target detection network can be used to perform target detection on the original image, and the first detection frame corresponding to the original image is then obtained based on the detection results.
Step 103: Fuse the area corresponding to the first detection frame with at least one background image in the background image set corresponding to the original image to obtain an enhanced image of the original image.
After the first detection frame of the original image is obtained, the area corresponding to the first detection frame can be extracted from the original image according to that frame, and the extracted area can then be fused with at least one background image in the background image set corresponding to the original image, so that an enhanced image corresponding to the original image is obtained and image data enhancement is achieved. As mentioned above, there may be more than one background image in the background image set; during fusion, depending on actual needs, the area corresponding to the first detection frame may be fused with every background image, or only with some (one or more) of the background images in the set, which is not limited by the embodiments of the present application.
The area corresponding to the first detection frame in the original image can be understood as the area determined in the original image by the vertex coordinates of the first detection frame. Taking a rectangular frame as an example, the area corresponding to the first detection frame may be the area enclosed in the original image by the four vertex coordinates of the frame.
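A minimal sketch of this fusion step is given below, assuming the fusion is implemented by pasting the detection-frame region of the original image onto a background image of the same size at the same coordinates; the patent does not prescribe a particular fusion operator, so the paste-based blending and the function name are illustrative assumptions.

```python
import numpy as np

def fuse_box_with_background(original: np.ndarray,
                             box: tuple,
                             background: np.ndarray) -> np.ndarray:
    """Paste the region of `original` enclosed by `box` onto `background`.

    `box` is (x1, y1, x2, y2) in pixel coordinates, i.e. the rectangle
    determined by the four vertex coordinates of the first detection frame.
    The background is assumed to have the same size as the original image
    (it could also be resized to match before calling this function).
    """
    x1, y1, x2, y2 = box
    enhanced = background.copy()
    # Copy the detection-frame region of the original image into the background.
    enhanced[y1:y2, x1:x2] = original[y1:y2, x1:x2]
    return enhanced

# One enhanced image per background image in the set (illustrative usage):
# enhanced_images = [fuse_box_with_background(img, first_box, bg) for bg in background_set]
```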
It should be noted that because the background image set may contain at least one background image, at least one enhanced image corresponding to the original image can be obtained; this at least one enhanced image may be called an enhanced image set. The original image and its corresponding enhanced image set can subsequently be used to train a deep neural network. This not only increases the number of training samples; because the enhanced images obtained in the embodiments of the present application differ substantially from the original image, they also provide more additional information, which reduces overfitting during training.
In the data enhancement method of this embodiment, the first detection frame can be obtained by performing target detection on the original image, and the area corresponding to the first detection frame in the original image is fused with at least one background image in the background image set corresponding to the original image to obtain at least one enhanced image corresponding to the original image, thereby achieving data enhancement of the original image. Unlike the related art, in which the entire original image is flipped or otherwise transformed, the embodiments of the present application obtain the first detection frame of the original image and fuse it with a background image to produce the enhanced image. An enhanced image that differs substantially from the original image can thus be obtained, and an enhanced image obtained in this way can provide more additional information, improving the image enhancement effect.
The data enhancement method provided by the present application has been described above with reference to Figure 1. In practice, to improve the accuracy of the first detection frame of the original image, more than one target detection network can be used to perform target detection on the original image, so that the first detection frame is derived from the combined detection results. This process is described in detail below with reference to Figure 2. It should be noted that in the following embodiments there may be more than one original image, but it can be understood that the processing for each original image is similar.
Referring to Figure 2, Figure 2 is a flowchart of a data enhancement method provided by an embodiment of the present application. As shown in Figure 2, the method includes the following steps:
Step 201: Acquire N original images and N background image sets, where one original image corresponds to one background image set and N is an integer greater than or equal to 1.
The original images may also be called the images to be enhanced; the N original images correspond one-to-one to the N background image sets.
Each of the N background image sets includes at least one background image, that is, each background image set may include one or more background images. Among the N background image sets, the number of background images in each set may be the same or different, which is not limited by the embodiments of the present application.
Step 202: Perform target detection on each of the N original images according to M target detection networks to obtain M target detection frames for each original image, where M is an integer greater than 1.
Similar to step 101, target detection can be understood as detecting the position of a target in an image. The result of target detection can be represented by a detection frame, also called a bounding box; one implementation is a rectangular frame, and the detected target lies within the corresponding target detection frame.
In addition, the M target detection networks may be the networks obtained at M iteration rounds during iterative training of an initial detection network, where the smallest of the M iteration rounds is greater than a target round. The target round may be preset according to the actual situation, or set according to the maximum number of training rounds configured for model training, which is not specifically limited in this embodiment. For example, the target round may be set to a round between half of the maximum training round and the maximum training round, the M iteration rounds may be M iterations of training performed after the initial detection network has been trained for the target number of rounds, and the largest of the M iteration rounds is less than or equal to the maximum training round of model training.
For example, if the maximum number of training rounds for model training is 40, the target round may be set to 20 and M may be 20. The initial detection network is first trained for 20 iterations; starting from the 21st iteration, the network obtained after each iteration is recorded until the 40th iteration is completed. In this way, 20 networks can be recorded, that is, the 20 target detection networks are the networks obtained at rounds 21 through 40, respectively.
Step 203: Determine the first detection frame of each original image, where the first detection frame of an original image is the detection frame determined from the M target detection frames of that original image.
After the M target detection frames of each original image are obtained, they can be processed to obtain the first detection frame of each original image.
As an implementation, the M target detection frames of each original image (also called the M target detection frames corresponding to each original image) may be rectangular frames. It should be noted that although the embodiments of the present application take a rectangular frame as an example, they are not limited to this, and the technical solutions of the embodiments are equally applicable to target detection frames of other shapes.
The processing of the M target detection frames is similar for each original image. Taking one of the N original images as an example, the M target detection frames of that original image can be averaged to obtain its first detection frame. For example, a target detection frame can be represented by four vertex coordinates, and the four vertex coordinates of the M target detection frames of the original image can be averaged to obtain the first detection frame of the original image; that is, each vertex coordinate of the first detection frame is the average of the corresponding vertex coordinates of the M target detection frames. It should be noted that any vertex coordinate includes two component coordinates, and averaging means averaging the same component coordinate of that vertex across the M target detection frames. For example, suppose the M target detection frames of an original image include a first target detection frame with vertices J11(X11, Y11), J12(X12, Y12), J13(X13, Y13), and J14(X14, Y14) and a second target detection frame with vertices J21(X21, Y21), J22(X22, Y22), J23(X23, Y23), and J24(X24, Y24), where J11, J12, J13, and J14 are the top-left, top-right, bottom-left, and bottom-right vertices of the first target detection frame, and J21, J22, J23, and J24 are the top-left, top-right, bottom-left, and bottom-right vertices of the second target detection frame. Averaging gives the four vertex coordinates of the first detection frame of the original image: ((X11+X21)/2, (Y11+Y21)/2), ((X12+X22)/2, (Y12+Y22)/2), ((X13+X23)/2, (Y13+Y23)/2), and ((X14+X24)/2, (Y14+Y24)/2). Applying a similar process to the M target detection frames of each original image yields the first detection frame of each original image.
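A minimal sketch of this averaging step is shown below, assuming each target detection frame is represented by its four (x, y) vertex coordinates listed in a fixed order (top-left, top-right, bottom-left, bottom-right); the function name and the example coordinates are illustrative.

```python
import numpy as np

def average_detection_frames(frames: np.ndarray) -> np.ndarray:
    """Average M rectangular detection frames into a single first detection frame.

    `frames` has shape (M, 4, 2): M frames, each with four (x, y) vertices
    listed in the same order (top-left, top-right, bottom-left, bottom-right).
    Each vertex of the result is the mean of the corresponding vertices,
    averaging the x and y components separately.
    """
    return frames.mean(axis=0)

# Example with two frames, as in the text: the first vertex of the result is
# ((X11 + X21) / 2, (Y11 + Y21) / 2), and likewise for the other three vertices.
frames = np.array([
    [[10, 20], [110, 20], [10, 120], [110, 120]],   # first target detection frame
    [[14, 24], [114, 24], [14, 124], [114, 124]],   # second target detection frame
], dtype=float)
first_frame = average_detection_frames(frames)       # [[12, 22], [112, 22], [12, 122], [112, 122]]
```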
Step 204: Fuse the area corresponding to the first detection frame in each original image with at least one background image in the background image set corresponding to that original image to obtain at least one enhanced image corresponding to each original image.
For each original image, after its first detection frame is determined, the area corresponding to the first detection frame can be extracted from the original image according to that frame, and the extracted area can then be fused with at least one background image in the background image set corresponding to the original image, so that at least one enhanced image corresponding to the original image is obtained and image data enhancement is achieved. It should be noted that in this fusion process, the area corresponding to the first detection frame may be fused separately with every background image in the corresponding background image set (in which case the number of enhanced images corresponding to the original image equals the number of background images in the set), or fused separately with only some of the background images in the set (in which case the number of enhanced images corresponding to the original image is smaller than the number of background images in the set). Performing a similar fusion process for each original image yields at least one enhanced image corresponding to each original image.
It should be noted that obtaining at least one enhanced image corresponding to each of the N original images can be understood as obtaining an enhanced image set corresponding to each original image, that is, obtaining N enhanced image sets, where the enhanced image set corresponding to any original image includes the at least one enhanced image corresponding to that original image. The N original images and the N enhanced image sets can be used for subsequent training of a deep neural network. Training the deep neural network with the N original images and the N enhanced image sets not only increases the number of training samples; because the enhanced images obtained in the embodiments of the present application differ substantially from the original images, they also provide more additional information, which reduces overfitting during training.
In addition, the area corresponding to the first detection frame in the original image can be understood as the area determined in the original image by the vertex coordinates of the first detection frame. Taking a rectangular frame as an example, the area corresponding to the first detection frame may be the area enclosed in the original image by the four vertex coordinates of the frame.
In the data enhancement method of this embodiment, M target detection frames of the original image can be obtained through M different target detection networks, and these M target detection frames are used to determine the first detection frame of the original image, improving the accuracy of the first detection frame. The area corresponding to the first detection frame in the original image is fused with at least one background image in the background image set corresponding to the original image to obtain at least one enhanced image corresponding to the original image, thereby achieving data enhancement of the original image. Unlike the related art, in which the entire original image is flipped or otherwise transformed, the embodiments of the present application obtain the first detection frame of the original image and fuse it with a background image to produce the enhanced image. An enhanced image that differs substantially from the original image can thus be obtained, and an enhanced image obtained in this way can provide more additional information, improving the image enhancement effect.
In one embodiment, performing target detection on each of the N original images according to the M target detection networks to obtain M target detection frames for each of the N original images includes:
inputting the N original images into the M target detection networks for feature extraction to obtain M feature maps of each original image;
normalizing the M feature maps of each original image to obtain M heat maps of each original image; and
computing the M target detection frames of the M heat maps of each original image as the M target detection frames of that original image.
As an example, the target detection network may include a convolutional neural network, the convolutional neural network may include multiple convolutional layers, and the feature map may be the feature map output by the last convolutional layer of the convolutional neural network. Using a convolutional neural network to extract features from the original image can capture more detailed features of the original image and yield a feature map that better characterizes it. The M feature maps of the original image are then normalized to obtain M heat maps of the original image, so that the target detection frames in the image can subsequently be determined. As an example, a feature map can be normalized to a heat map with pixel values in the range [0, 1].
As an example, the above convolutional neural network can be obtained through iterative training with the SimSiam self-supervised method. This self-supervised method directly maximizes the similarity between two views of the same image without using negative samples and without a momentum encoder. As shown in Figure 3, an image x is randomly augmented twice (for example, by rotation, color processing, etc.) to obtain two different views x1 and x2 as input. The two views x1 and x2 are each passed through an encoder network to obtain a first vector z1 and a second vector z2, respectively; z1 is processed by a projection layer (which may be a multi-layer perceptron) to obtain a third vector p1, and z2 is processed by the projection layer to obtain a fourth vector p2. The encoder network is encoder f in Figure 3, and the projection layer is projector h in Figure 3. The stop-gradient operation shown in Figure 3 (stop-grad in Figure 3) is the key to preventing model collapse. The negative cosine similarity is then minimized:
D(p1, z2) = -(p1 / ||p1||2) · (z2 / ||z2||2)
where the cosine similarity corresponds to "similarity" in Figure 3, ||·||2 denotes the l2 norm, and D(p1, z2) is the negative cosine similarity between p1 and z2.
The loss function L takes the symmetric form:
L = (1/2) D(p1, stopgrad(z2)) + (1/2) D(p2, stopgrad(z1))
where D(p2, z1) is the negative cosine similarity between p2 and z1, expressed as follows:
D(p2, z1) = -(p2 / ||p2||2) · (z1 / ||z1||2)
The model is trained with the above loss function for n (for example, 40) rounds using a stochastic gradient descent (SGD) optimizer. It should be noted that the above encoder network may be a convolutional neural network (which may include a feature extraction network and a conversion layer, where the last convolutional layer of the feature extraction network outputs a feature map and the conversion layer converts the feature map into a vector); once this training is complete, the convolutional neural network has been trained. It should also be noted that the training set used during training may include the above N original images.
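A minimal sketch of this training objective is shown below, assuming PyTorch modules `encoder` and `projector` play the roles of encoder f and projector h in Figure 3; the module names are illustrative, and only the loss computation is sketched, not the full SimSiam training loop.

```python
import torch
import torch.nn.functional as F

def negative_cosine_similarity(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """D(p, z): negative cosine similarity, with stop-gradient applied to z."""
    z = z.detach()  # stop-grad: z contributes no gradient, preventing collapse
    return -F.cosine_similarity(p, z, dim=-1).mean()

def simsiam_loss(x1, x2, encoder, projector):
    """Symmetric loss L = D(p1, z2)/2 + D(p2, z1)/2 for two views x1, x2.

    `encoder` plays the role of encoder f and `projector` of projector h
    in Figure 3; both are assumed to be torch.nn.Module instances.
    """
    z1, z2 = encoder(x1), encoder(x2)
    p1, p2 = projector(z1), projector(z2)
    return 0.5 * negative_cosine_similarity(p1, z2) + \
           0.5 * negative_cosine_similarity(p2, z1)

# Training would minimize this loss with an SGD optimizer for n (e.g. 40) rounds.
```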
The above self-supervised learning can capture the approximate position information of the target, and the embodiments of the present application use this property to estimate the object bounding box in the feature map corresponding to an image. The position information in the early stage of self-supervised training may not be accurate enough, so the convolutional neural networks obtained after m (for example, 20) rounds of iterative training, up to the convergence round (the training round at the end of training, for example, the n rounds mentioned above), can be used to extract features from an original image A. That is, the convolutional neural networks obtained at training rounds m+1 through n are used to extract features from the original image A, yielding (n - m), that is, M feature maps (taking the feature map output by the last convolutional layer of each convolutional neural network). Each feature map is normalized to [0, 1] to generate a heat map, and the target detection frame in each heat map is then computed as follows:
B = K(l[R > i])
where R denotes the heat map, i denotes the threshold for activation points (that is, the preset pixel threshold), and l is an indicator function: l[R > i] is 1 for a pixel of R whose value is greater than i and 0 otherwise. Applying this to every pixel of R binarizes R and yields a binarized image. K is the function that computes the rectangular closure, that is, the function that computes the target detection frame; the K function returns the target detection frame of the binarized image of the heat map R.
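A minimal sketch of this computation is shown below, assuming the heat map is produced by min-max normalization of the feature map and the rectangular closure K is taken as the tightest axis-aligned rectangle enclosing the above-threshold pixels; the function names and the default threshold value are illustrative assumptions.

```python
import numpy as np

def feature_map_to_heat_map(feature_map: np.ndarray) -> np.ndarray:
    """Normalize a feature map to a heat map R with values in [0, 1]."""
    fmin, fmax = feature_map.min(), feature_map.max()
    return (feature_map - fmin) / (fmax - fmin + 1e-8)

def heat_map_to_detection_frame(R: np.ndarray, i: float = 0.5):
    """B = K(l[R > i]): binarize the heat map, then take the rectangular closure.

    l[R > i] is the indicator image (1 where the pixel value exceeds the
    threshold i, 0 elsewhere); K is taken here as the tightest axis-aligned
    rectangle enclosing all non-zero pixels, returned as (x1, y1, x2, y2).
    """
    binary = (R > i).astype(np.uint8)   # indicator function l[R > i]
    ys, xs = np.nonzero(binary)
    if len(xs) == 0:                    # no activation above the threshold
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```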
Since M convolutional neural networks are recorded and each outputs one heat map, M heat maps of the original image A are obtained. The M target detection frames of these M heat maps are computed as the M target detection frames of the original image A, and the results are averaged; for example, if the target detection frames are rectangular, the four vertex coordinates of the M target detection frames can be averaged in turn to obtain the final first detection frame of the original image A.
In one embodiment, computing the M target detection frames of the M heat maps of each original image as the M target detection frames of that original image includes:
for each heat map, binarizing the heat map according to a preset pixel threshold to obtain a binarized image of the heat map, and computing the target detection frame of the heat map based on the binarized image of the heat map.
The binarization can set the values of pixels in the heat map greater than the preset pixel threshold to a first value, for example 1, and the values of pixels less than or equal to the preset pixel threshold to a second value, for example 0, so that the value of any pixel in the resulting binarized image is either the first value or the second value. As an example, the binarization can be implemented as the indicator function l[R > i] mentioned above. Because only two pixel values exist in the binarized image, computing the target detection frame from the binarized image can improve the accuracy of the computed detection frame.
In one embodiment, acquiring the N background image sets includes:
for each original image, performing the following operations:
determining, according to the category of the original image, at least one category matching the category of the original image, where the similarity between each of the at least one category and the category of the original image is greater than a preset threshold;
acquiring at least one reference image corresponding to each of the at least one category to obtain a reference image set corresponding to the original image; and
acquiring the background image in each reference image of the reference image set to obtain the background image set corresponding to the original image.
The category of the original image can also be understood as the category of the target in the original image. Multiple categories can be set in advance, and the category of the original image may be one of them; for example, the multiple categories may be, but are not limited to, person, pig, sheep, cat, dog, deer, horse, and bird. The category of the original image can be obtained in advance. A background image can be understood as an image with the target removed, for example, the background area remaining after the target is removed from a reference image.
In this embodiment, the background image of a reference image is fused with the area of the first detection frame, and the reference image is an image whose category matches the category of the original image; this can reduce the difference between the area of the first detection frame and the background image of the reference image and improve the plausibility of the enhanced image obtained by fusion.
In one embodiment, for each original image, before determining, according to the category of the original image, at least one category matching the category of the original image, the method further includes:
determining the similarity between every two categories among multiple categories, where the multiple categories include the category of the original image and the at least one category;
where determining, according to the category of the original image, at least one category matching the category of the original image includes:
determining, according to the similarity between the category of the original image and the remaining categories among the multiple categories, at least one category matching the category of the original image from the remaining categories, where the remaining categories are the categories among the multiple categories other than the category of the original image, and the similarity between each of the at least one matching category and the category of the original image is greater than the preset threshold.
It can be understood that the multiple categories include the category of the original image as well as the remaining categories; therefore, a category matching the category of the original image can be selected from the remaining categories as the above at least one category based on the similarity between categories. That is, in this embodiment, the background image of a reference image is fused with the area of the first detection frame, and the reference image is an image corresponding to at least one category whose similarity to the category of the original image, determined from the similarity between categories, is greater than the preset threshold; this can reduce the difference between the area of the first detection frame and the background image of the reference image and improve the plausibility of the enhanced image obtained by fusion.
As an example, the at least one category may be a single category, that is, there is one category matching the original image, and this matching category may be the category with the greatest similarity to the category of the original image among the multiple categories.
For example, as shown in Figures 4 and 5, an original image A is input into the M target detection networks, M target detection frames of the original image are extracted, and the first detection frame J of the original image A is obtained from the M target detection frames. According to the recorded similarity between every two of the multiple categories, the category with the highest similarity to the category of the original image A is determined, the background area D in the reference image C corresponding to that category is obtained, and the first detection frame J is fused into the background area D in the reference image C to obtain an enhanced image Q. It should be noted that there may be one or more reference images corresponding to the category with the highest similarity, that is, there may be one or more reference images C, and correspondingly one or more enhanced images Q may be obtained based on the reference images C.
In one embodiment, determining the similarity between every two categories among the multiple categories includes:
inputting the multiple categories into a semantic model for semantic analysis to obtain a semantic vector representation of each of the multiple categories; and
calculating the similarity between the semantic vector representations of every two categories among the multiple categories.
There are many kinds of semantic models, which are not specifically limited in this embodiment; for example, the semantic model may be a GloVe model, word2vec (a word vector model), or the like. Since in this embodiment the categories are semantically analyzed by a semantic model, a semantic vector representation (also called a word vector representation) of each category can be extracted. It can be understood that the semantic vector representation of a category can be used to characterize the semantic information of that category (that is, to express the semantic information of the category in the form of a vector). The similarity between two categories, also called semantic similarity, can be calculated from their semantic vector representations; for example, the cosine similarity between the semantic vector representations of the two categories is used as the similarity between them. In this way, the accuracy of the similarity between categories can be improved.
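A minimal sketch of this step is shown below, assuming pre-trained word vectors (for example, GloVe or word2vec vectors) are available as a dictionary mapping each category name to a vector; the function and variable names are illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two semantic vector representations."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def most_similar_category(target: str, categories: list, embeddings: dict) -> str:
    """Return the category (other than `target`) whose word vector is most
    similar to the word vector of `target`.

    `embeddings` maps category names (e.g. "dog", "cat", "horse") to
    pre-trained word vectors such as GloVe or word2vec vectors.
    """
    others = [c for c in categories if c != target]
    sims = [cosine_similarity(embeddings[target], embeddings[c]) for c in others]
    return others[int(np.argmax(sims))]

# Illustrative usage: for the category "dog", this might return "cat" if their
# word vectors are the most similar among the predefined categories.
```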
It should be noted that by using the semantic similarity between categories, a category similar to the category of the original image is obtained, and the area corresponding to the first detection frame in the original image is fused with the background image of a reference image corresponding to at least one category similar to the category of the original image, so that image reconstruction is achieved and an enhanced image of the original image is obtained. To a certain extent, this constrains and reduces the possibility that the reconstructed enhanced image is implausible, improving the plausibility of the enhanced image. For example, for an image of a dog, the background may be grass, furniture, and the like, and for its most similar category, cat, the background is likely to be similar as well. Taking the dog as the foreground and fusing it with the background of an image of a cat, the category most similar to dog, yields an enhanced image that can serve as an augmented sample and improves the plausibility of the enhanced image obtained by fusion. The data enhancement method proposed in the embodiments of the present application changes the original image in a way that is simple, reasonable, and substantial, can provide more additional information, and therefore has greater potential to resist overfitting.
It should be noted that the original images that can be used include, but are not limited to, expression data, face images, natural-world biological classification images, and the like; original images with rich background information work best. Before data enhancement is performed on the original images, self-supervised learning is first carried out with the original images to obtain position estimates of the targets in them. The semantic similarity between categories is then calculated according to the categories to obtain the most similar category. Data enhancement is then performed based on the original image, the first detection frame, the most similar category, and so on, for use in the subsequent deep neural network training process. During training, for a given image, more sample images (enhanced images) are reconstructed by blending its target with the target-removed background areas of reference images whose categories are similar to its category.
The data enhancement method of the embodiments of the present application is applicable to image data enhancement during the training of a deep neural network. For example, during deep neural network training, an insufficient number of original images easily leads to overfitting; to reduce overfitting, image data enhancement needs to be performed on the original images. In the embodiments of the present application, target detection is performed on the original image to determine its first detection frame, and the area corresponding to the first detection frame in the original image is fused with at least one background image in the background image set corresponding to the original image to obtain at least one enhanced image corresponding to the original image, achieving image data enhancement. It can be understood that the enhanced image is obtained by fusing the original image with an additional background image, so that, relative to the original image, the enhanced image can provide more additional information. Using the original images and the obtained enhanced images for deep neural network training not only increases the number of training samples; because the enhanced images obtained in the embodiments of the present application differ substantially from the original images, they also provide more additional information, which reduces overfitting during training.
参见图6,图6是本申请实施例提供的数据增强装置的结构图,能实现上述实施例中数据增强方法的细节,并达到相同的效果。如图6所示,数据增强装置600,包括:Referring to Figure 6, Figure 6 is a structural diagram of a data enhancement device provided by an embodiment of the present application, which can implement the details of the data enhancement method in the above embodiment and achieve the same effect. As shown in Figure 6, data enhancement device 600 includes:
获取模块601,用于获取原始图像以及背景图像集,一张原始图像对应一个背景图像集;The acquisition module 601 is used to acquire the original image and the background image set. One original image corresponds to one background image set;
目标检测模块602,用于根据目标检测网络,对原始图像进行目标检测,获得原始图像的第一检测框;The target detection module 602 is used to perform target detection on the original image according to the target detection network and obtain the first detection frame of the original image;
融合模块603,用于对第一检测框对应的区域以及原始图像对应的背景图像集中至少一张背景图像进行融合,得到原始图像的增强图像。The fusion module 603 is used to fuse the area corresponding to the first detection frame and at least one background image in the background image set corresponding to the original image to obtain an enhanced image of the original image.
在一个实施例中,目标检测网络为M个;In one embodiment, there are M target detection networks;
目标检测模块602具体用于:The target detection module 602 is specifically used for:
根据M个目标检测网络,对原始图像进行目标检测,获得原始图像的M个目标检测框;According to M target detection networks, target detection is performed on the original image to obtain M target detection frames of the original image;
根据原始图像的M个目标检测框确定原始图像的第一检测框。The first detection frame of the original image is determined based on the M target detection frames of the original image.
在一个实施例中,目标检测模块602具体用于:In one embodiment, the target detection module 602 is specifically used to:
将原始图像输入M个目标检测网络进行特征提取,得到原始图像的M张特征图;Input the original image into M target detection networks for feature extraction, and obtain M feature maps of the original image;
对原始图像的M张特征图进行归一化处理,得到原始图像的M张热力图;Normalize the M feature maps of the original image to obtain M heat maps of the original image;
计算原始图像的M张热力图的目标检测框,作为原始图像的M个目标检测框。 Calculate the target detection frames of M heat maps of the original image as M target detection frames of the original image.
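下面是上述“特征图归一化为热力图”步骤的一个极简示意；按通道取均值与最小-最大归一化仅为示意性假设，实施例仅要求将M张特征图归一化为M张热力图。A minimal sketch of the feature-map-to-heat-map step above follows; channel-wise averaging and min-max normalization are illustrative assumptions, since the embodiment only requires that the M feature maps be normalized into M heat maps.

    import numpy as np

    def feature_map_to_heatmap(feature_map, eps=1e-8):
        # feature_map: C x H x W activations from one target detection network
        activation = feature_map.mean(axis=0)                 # aggregate channels per location
        a_min, a_max = activation.min(), activation.max()
        return (activation - a_min) / (a_max - a_min + eps)   # heat map in [0, 1]

    # One heat map per target detection network, i.e. M heat maps per original image:
    # heatmaps = [feature_map_to_heatmap(fm) for fm in feature_maps]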
在一个实施例中,目标检测模块602具体用于:In one embodiment, the target detection module 602 is specifically used to:
对于每张热力图,根据预设像素阈值对热力图进行二值化处理,得到热力图的二值化图像,并基于热力图的二值化图像计算热力图的目标检测框。For each heat map, the heat map is binarized according to the preset pixel threshold to obtain a binary image of the heat map, and the target detection frame of the heat map is calculated based on the binarized image of the heat map.
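上述二值化及检测框计算可草拟如下；0.5的像素阈值仅为示意性假设，实施例只限定按预设像素阈值二值化并基于二值化图像计算检测框。The binarization and box computation described above can be sketched as follows; the 0.5 pixel threshold is an illustrative assumption, as the embodiment only specifies binarizing with a preset pixel threshold and deriving the box from the binary image.

    import numpy as np

    def heatmap_to_box(heatmap, pixel_threshold=0.5):
        # Binarize the heat map with the preset pixel threshold.
        binary = (heatmap >= pixel_threshold).astype(np.uint8)
        ys, xs = np.nonzero(binary)
        if xs.size == 0:
            h, w = heatmap.shape
            return 0, 0, w, h                      # nothing above threshold: fall back to full image
        # Tightest axis-aligned rectangle enclosing all foreground pixels.
        return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1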
在一个实施例中,目标检测模块602具体用于:In one embodiment, the target detection module 602 is specifically used to:
对于每张原始图像,对原始图像的M个目标检测框进行平均处理,得到原始图像的第一检测框。For each original image, the M target detection frames of the original image are averaged to obtain the first detection frame of the original image.
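对M个目标检测框取平均得到第一检测框的一种直接做法是逐坐标求均值，示意如下（仅为“平均处理”的一种假设性解读）。One straightforward reading of averaging the M target detection frames into the first detection frame is coordinate-wise averaging, sketched below (an assumed interpretation of the "averaging" operation).

    import numpy as np

    def average_boxes(boxes):
        # boxes: M rectangular detection frames, each (x1, y1, x2, y2)
        boxes = np.asarray(boxes, dtype=np.float32)    # shape (M, 4)
        x1, y1, x2, y2 = boxes.mean(axis=0)
        return int(round(x1)), int(round(y1)), int(round(x2)), int(round(y2))

    # first_box = average_boxes(m_boxes)   # m_boxes holds the M detection frames of one image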
在一个实施例中,获取模块601具体用于:In one embodiment, the acquisition module 601 is specifically used to:
根据原始图像的类别,确定与原始图像的类别匹配的至少一种类别;determining at least one category that matches the category of the original image based on the category of the original image;
获取至少一种类别中每种类别对应的至少一张参考图像,以得到原始图像对应的参考图像集;Obtain at least one reference image corresponding to each category in at least one category to obtain a reference image set corresponding to the original image;
获取参考图像集的每张参考图像中的背景图像,以得到原始图像对应的背景图像集。Obtain the background image in each reference image of the reference image set to obtain the background image set corresponding to the original image.
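上述三个获取步骤大致可示意如下；其中 images_by_category、detect_box 与 remove_target_region 均为假设性辅助函数，“背景图像”既可通过遮挡参考图像中的目标区域得到，也可通过图像修复等方式得到，此处不作限定。The three acquisition steps above can be sketched roughly as follows; images_by_category, detect_box and remove_target_region are hypothetical helpers, and the "background image" could be obtained by masking or by in-painting the target region of the reference image, neither of which is prescribed here.

    def build_background_set(matched_categories, images_by_category,
                             detect_box, remove_target_region):
        # matched_categories: categories similar to the original image's category
        # images_by_category: dict mapping a category to its list of reference images
        reference_set = []
        for category in matched_categories:
            reference_set.extend(images_by_category.get(category, []))
        background_set = []
        for ref in reference_set:
            box = detect_box(ref)                      # locate the target in the reference image
            background_set.append(remove_target_region(ref, box))
        return background_set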
在一个实施例中,装置600还包括:In one embodiment, the apparatus 600 further includes:
相似度确定模块,确定多个类别中每两个类别之间的相似度,多个类别包括原始图像的类别以及至少一种类别;a similarity determination module that determines the similarity between each two categories in a plurality of categories, the plurality of categories including the category of the original image and at least one category;
其中,获取模块601具体用于:Among them, the acquisition module 601 is specifically used for:
根据原始图像的类别与多个类别中其余类别之间的相似度，从其余类别中确定与原始图像的类别匹配的至少一种类别，其余类别为多个类别中除原始图像的类别之外的类别，至少一种类别中每种类别与原始图像的类别之间的相似度大于预设阈值。According to the similarity between the category of the original image and the remaining categories among the plurality of categories, at least one category that matches the category of the original image is determined from the remaining categories, where the remaining categories are the categories among the plurality of categories other than the category of the original image, and the similarity between each of the at least one category and the category of the original image is greater than a preset threshold.
在一个实施例中,相似度确定模块具体用于:In one embodiment, the similarity determination module is specifically used to:
将多个类别输入语义模型进行语义分析,获得多个类别中每个类别的语义向量表示;Input multiple categories into the semantic model for semantic analysis, and obtain the semantic vector representation of each category in the multiple categories;
计算多个类别中每两个类别的语义向量表示之间的相似度。Calculate the similarity between the semantic vector representations of each two categories in multiple categories.
在一个实施例中,每两个类别的语义向量表示之间的相似度为每两个类别的语义向量表示之间的余弦相似度。In one embodiment, the similarity between the semantic vector representations of each two categories is a cosine similarity between the semantic vector representations of each two categories.
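类别相似度计算可示意如下；其中 embed 代表任一将类别名称映射为语义向量的语义模型，0.5的阈值为示意性假设。The category-similarity computation can be sketched as follows; embed stands for any semantic model that maps a category label to a semantic vector, and the 0.5 threshold is an illustrative assumption.

    import numpy as np

    def cosine_similarity(u, v, eps=1e-8):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

    def matched_categories(categories, embed, original_category, threshold=0.5):
        # Return the categories whose semantic similarity to the original image's
        # category exceeds the preset threshold.
        vectors = {c: embed(c) for c in categories}     # semantic vector per category label
        anchor = vectors[original_category]
        return [c for c in categories
                if c != original_category and cosine_similarity(vectors[c], anchor) > threshold]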
在一个实施例中,第一检测框为矩形检测框。In one embodiment, the first detection frame is a rectangular detection frame.
本申请实施例提供的数据增强装置能够实现上述实施例中数据增强方法实现的各个过程,技术特征一一对应,为避免重复,这里不再赘述。The data enhancement device provided by the embodiment of the present application can implement each process implemented by the data enhancement method in the above embodiment, and the technical features correspond one to one. To avoid duplication, they will not be described again here.
参见图7,图7是本申请实施例提供的数据增强装置的结构图,能实现上述实施例中数据增强方法的细节,并达到相同的效果。如图7所示,数据增强装置700,包括:Referring to Figure 7, Figure 7 is a structural diagram of a data enhancement device provided by an embodiment of the present application, which can implement the details of the data enhancement method in the above embodiment and achieve the same effect. As shown in Figure 7, data enhancement device 700 includes:
第一获取模块701,用于获取N张原始图像以及N个背景图像集,一张原始图像对应一个背景图像集,N为大于1的整数;The first acquisition module 701 is used to acquire N original images and N background image sets. One original image corresponds to one background image set, and N is an integer greater than 1;
目标检测模块702,用于根据M个目标检测网络,对N张原始图像中每张原始图像进行目标检测,获得每张原始图像的M个目标检测框,M为大于1的整数;The target detection module 702 is used to perform target detection on each of the N original images according to M target detection networks, and obtain M target detection frames for each original image, where M is an integer greater than 1;
第一确定模块703,用于确定每张原始图像的第一检测框,原始图像的第一检测框为通过原始图像的M个目标检测框确定的检测框;The first determination module 703 is used to determine the first detection frame of each original image. The first detection frame of the original image is the detection frame determined by the M target detection frames of the original image;
融合模块704,用于对每张原始图像中第一检测框对应的区域以及原始图像对应的背景图像集中至少一张背景图像进行融合,得到每张原始图像对应的至少一张增强图像。The fusion module 704 is used to fuse the area corresponding to the first detection frame in each original image and at least one background image in the background image set corresponding to the original image, to obtain at least one enhanced image corresponding to each original image.
在一个实施例中,目标检测模块702,包括:In one embodiment, the target detection module 702 includes:
提取模块，用于将N张原始图像输入M个目标检测网络进行特征提取，得到每张原始图像的M张特征图；The extraction module is used to input the N original images into the M target detection networks for feature extraction, to obtain M feature maps of each original image;
归一化处理模块,用于对每张原始图像的M张特征图进行归一化处理,得到每张原始图像的M张热力图;The normalization processing module is used to normalize the M feature maps of each original image to obtain M heat maps of each original image;
检测框确定模块,用于计算每张原始图像的M张热力图的M个目标检测框,作为每张原始图像的M个目标检测框。The detection frame determination module is used to calculate M target detection frames of M heat maps of each original image as M target detection frames of each original image.
在一个实施例中,检测框确定模块,包括二值化处理模块和检测框计算模块;In one embodiment, the detection frame determination module includes a binarization processing module and a detection frame calculation module;
对于每张热力图，二值化处理模块，用于根据预设像素阈值对热力图进行二值化处理，得到热力图的二值化图像；检测框计算模块，用于基于热力图的二值化图像，计算热力图的目标检测框。For each heat map, the binarization processing module is used to binarize the heat map according to a preset pixel threshold to obtain a binary image of the heat map; the detection frame calculation module is used to calculate the target detection frame of the heat map based on the binary image of the heat map.
在一个实施例中,第一获取模块701,包括类别确定模块、第一图像获取模块和第二图像获取模块;In one embodiment, the first acquisition module 701 includes a category determination module, a first image acquisition module and a second image acquisition module;
对于每张原始图像，类别确定模块用于根据原始图像的类别，确定与原始图像的类别匹配的至少一种类别；第一图像获取模块，用于获取至少一种类别中每种类别对应的至少一张参考图像，以得到原始图像对应的参考图像集；第二图像获取模块，用于获取参考图像集的每张参考图像中的背景图像，以得到原始图像对应的背景图像集。For each original image, the category determination module is used to determine, according to the category of the original image, at least one category that matches the category of the original image; the first image acquisition module is used to acquire at least one reference image corresponding to each of the at least one category, to obtain the reference image set corresponding to the original image; the second image acquisition module is used to acquire the background image in each reference image of the reference image set, to obtain the background image set corresponding to the original image.
在一个实施例中,装置700还包括:In one embodiment, the apparatus 700 further includes:
相似度确定模块,用于确定多个类别中每两个类别之间的相似度,多个类别包括原始图像的类别以及至少一种类别;a similarity determination module, configured to determine the similarity between each two categories in a plurality of categories, the plurality of categories including the category of the original image and at least one category;
其中,类别确定模块,用于:Among them, the category determination module is used for:
根据原始图像的类别与多个类别中其余类别之间的相似度，从其余类别中确定与原始图像的类别匹配的至少一种类别，其余类别为多个类别中除原始图像的类别之外的类别，至少一种类别中每种类别与原始图像的类别之间的相似度大于预设阈值。According to the similarity between the category of the original image and the remaining categories among the plurality of categories, at least one category that matches the category of the original image is determined from the remaining categories, where the remaining categories are the categories among the plurality of categories other than the category of the original image, and the similarity between each of the at least one category and the category of the original image is greater than a preset threshold.
在一个实施例中,相似度确定模块,包括:In one embodiment, the similarity determination module includes:
向量表示获取模块,用于将多个类别输入语义模型进行语义分析,获得多个类别中每个类别的语义向量表示;The vector representation acquisition module is used to input multiple categories into the semantic model for semantic analysis and obtain the semantic vector representation of each category in the multiple categories;
相似度计算模块,用于计算多个类别中每两个类别的语义向量表示之间的相似度。The similarity calculation module is used to calculate the similarity between the semantic vector representations of each two categories in multiple categories.
在一个实施例中,每两个类别的语义向量表示之间的相似度为每两个类别的语义向量表示之间的余弦相似度。In one embodiment, the similarity between the semantic vector representations of each two categories is a cosine similarity between the semantic vector representations of each two categories.
在一个实施例中,第一确定模块703,用于:In one embodiment, the first determination module 703 is used for:
对于每张原始图像,对原始图像的M个目标检测框进行平均处理,得到原始图像的第一检测框。For each original image, the M target detection frames of the original image are averaged to obtain the first detection frame of the original image.
在一个实施例中,第一检测框为矩形检测框。In one embodiment, the first detection frame is a rectangular detection frame.
本申请实施例提供的数据增强装置能够实现上述实施例中数据增强方法实现的各个过程,技术特征一一对应,为避免重复,这里不再赘述。The data enhancement device provided by the embodiment of the present application can implement each process implemented by the data enhancement method in the above embodiment, and the technical features correspond one to one. To avoid duplication, they will not be described again here.
图8为实现本申请各个实施例的一种电子设备的硬件结构示意图。FIG. 8 is a schematic diagram of the hardware structure of an electronic device that implements various embodiments of the present application.
该电子设备800包括但不限于：射频单元801、网络模块802、音频输出单元803、输入单元804、传感器805、显示单元806、用户输入单元807、接口单元808、存储器809、处理器810、以及电源811等部件。本领域技术人员可以理解，图8中示出的电子设备结构并不构成对电子设备的限定，电子设备可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。在本申请实施例中，电子设备包括但不限于手机、平板电脑、笔记本电脑、掌上电脑、车载终端、可穿戴设备、以及计步器等。The electronic device 800 includes but is not limited to: a radio frequency unit 801, a network module 802, an audio output unit 803, an input unit 804, a sensor 805, a display unit 806, a user input unit 807, an interface unit 808, a memory 809, a processor 810, a power supply 811, and other components. Those skilled in the art can understand that the structure of the electronic device shown in Figure 8 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown in the figure, combine certain components, or use a different arrangement of components. In the embodiments of the present application, electronic devices include but are not limited to mobile phones, tablet computers, notebook computers, palmtop computers, vehicle-mounted terminals, wearable devices, pedometers, and the like.
其中,处理器810,用于:Among them, processor 810 is used for:
获取N张原始图像以及N个背景图像集,一张原始图像对应一个背景图像集,N为大于1的整数;Obtain N original images and N background image sets. One original image corresponds to one background image set, and N is an integer greater than 1;
根据M个目标检测网络,对N张原始图像中每张原始图像进行目标检测,获得每张原始图像的M个目标检测框,M为大于1的整数;According to M target detection networks, target detection is performed on each of the N original images to obtain M target detection frames for each original image, where M is an integer greater than 1;
确定每张原始图像的第一检测框,原始图像的第一检测框为通过原始图像的M个目标检测框确定的检测框;Determine the first detection frame of each original image. The first detection frame of the original image is the detection frame determined by the M target detection frames of the original image;
对每张原始图像中第一检测框对应的区域以及原始图像对应的背景图像集中至少一张背景图像进行融合,得到每张原始图像对应的至少一张增强图像。The area corresponding to the first detection frame in each original image and at least one background image in the background image set corresponding to the original image are fused to obtain at least one enhanced image corresponding to each original image.
在一个实施例中,处理器810,具体用于:In one embodiment, the processor 810 is specifically configured to:
将N张原始图像输入M个目标检测网络进行特征提取,得到每张原始图像的M张特征图;Input N original images into M target detection networks for feature extraction, and obtain M feature maps of each original image;
对每张原始图像的M张特征图进行归一化处理,得到每张原始图像的M张热力图;Normalize the M feature maps of each original image to obtain M heat maps of each original image;
计算每张原始图像的M张热力图的M个目标检测框,作为每张原始图像的M个目标检测框。Calculate M target detection frames of M heat maps of each original image as M target detection frames of each original image.
在一个实施例中,处理器810,具体用于:In one embodiment, the processor 810 is specifically configured to:
对于每张热力图,根据预设像素阈值对热力图进行二值化处理,得到热力图的二值化图像,并基于热力图的二值化图像计算热力图的目标检测框。For each heat map, the heat map is binarized according to the preset pixel threshold to obtain a binary image of the heat map, and the target detection frame of the heat map is calculated based on the binarized image of the heat map.
在一个实施例中,处理器810,具体用于:In one embodiment, the processor 810 is specifically configured to:
对于每张原始图像,执行如下操作:For each original image, do the following:
根据原始图像的类别,确定与原始图像的类别匹配的至少一种类别;determining at least one category that matches the category of the original image based on the category of the original image;
获取至少一种类别中每种类别对应的至少一张参考图像,以得到原始图像对应的参考图像集;Obtain at least one reference image corresponding to each category in at least one category to obtain a reference image set corresponding to the original image;
获取参考图像集的每张参考图像中的背景图像,以得到原始图像对应的背景图像集。Obtain the background image in each reference image of the reference image set to obtain the background image set corresponding to the original image.
在一个实施例中,处理器810,还用于:In one embodiment, processor 810 is also used to:
确定多个类别中每两个类别之间的相似度，多个类别包括原始图像的类别以及至少一种类别；Determine the similarity between every two categories among a plurality of categories, where the plurality of categories includes the category of the original image and the at least one category;
其中,处理器810,还具体用于:Among them, the processor 810 is also specifically used for:
根据原始图像的类别与多个类别中其余类别之间的相似度，从其余类别中确定与原始图像的类别匹配的至少一种类别，其余类别为多个类别中除原始图像的类别之外的类别，至少一种类别中每种类别与原始图像的类别之间的相似度大于预设阈值。According to the similarity between the category of the original image and the remaining categories among the plurality of categories, at least one category that matches the category of the original image is determined from the remaining categories, where the remaining categories are the categories among the plurality of categories other than the category of the original image, and the similarity between each of the at least one category and the category of the original image is greater than a preset threshold.
在一个实施例中,处理器810,还具体用于:In one embodiment, the processor 810 is also specifically configured to:
将多个类别输入语义模型进行语义分析,获得多个类别中每个类别的语义向量表示;Input multiple categories into the semantic model for semantic analysis, and obtain the semantic vector representation of each category in the multiple categories;
计算多个类别中每两个类别的语义向量表示之间的相似度。 Calculate the similarity between the semantic vector representations of each two categories in multiple categories.
在一个实施例中,每两个类别的语义向量表示之间的相似度为每两个类别的语义向量表示之间的余弦相似度。In one embodiment, the similarity between the semantic vector representations of each two categories is a cosine similarity between the semantic vector representations of each two categories.
在一个实施例中,处理器810,还具体用于:In one embodiment, the processor 810 is also specifically configured to:
对于每张原始图像,对原始图像的M个目标检测框进行平均处理,得到原始图像的第一检测框。For each original image, the M target detection frames of the original image are averaged to obtain the first detection frame of the original image.
在一个实施例中,第一检测框为矩形检测框。In one embodiment, the first detection frame is a rectangular detection frame.
本申请实施例同样具有与上述数据增强方法实施例相同的有益技术效果,具体在此不再赘述。此外,处理器810还可执行与图6对应的实施例中各模块的操作。The embodiments of the present application also have the same beneficial technical effects as the above-mentioned data enhancement method embodiments, and details will not be described again here. In addition, the processor 810 can also perform operations of each module in the embodiment corresponding to FIG. 6 .
应理解的是，本申请实施例中，射频单元801可用于收发信息或通话过程中，信号的接收和发送，具体的，将来自基站的下行数据接收后，给处理器810处理；另外，将上行的数据发送给基站。通常，射频单元801包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器、双工器等。此外，射频单元801还可以通过无线通信系统与网络和其他设备通信。It should be understood that, in the embodiments of the present application, the radio frequency unit 801 can be used to receive and send signals during the process of sending and receiving information or during a call. Specifically, downlink data from the base station is received and then handed to the processor 810 for processing; in addition, uplink data is sent to the base station. Generally, the radio frequency unit 801 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier, a duplexer, etc. In addition, the radio frequency unit 801 can also communicate with the network and other devices through a wireless communication system.
电子设备通过网络模块802为用户提供了无线的宽带互联网访问,如帮助用户收发电子邮件、浏览网页和访问流式媒体等。The electronic device provides users with wireless broadband Internet access through the network module 802, such as helping users send and receive emails, browse web pages, and access streaming media.
音频输出单元803可以将射频单元801或网络模块802接收的或者在存储器809中存储的音频数据转换成音频信号并且输出为声音。而且,音频输出单元803还可以提供与电子设备800执行的特定功能相关的音频输出(例如,呼叫信号接收声音、消息接收声音等等)。音频输出单元803包括扬声器、蜂鸣器以及受话器等。The audio output unit 803 may convert the audio data received by the radio frequency unit 801 or the network module 802 or stored in the memory 809 into an audio signal and output it as a sound. Furthermore, the audio output unit 803 may also provide audio output related to a specific function performed by the electronic device 800 (eg, call signal reception sound, message reception sound, etc.). The audio output unit 803 includes a speaker, a buzzer, a receiver, and the like.
输入单元804用于接收音频或视频信号。输入单元804可以包括图形处理器（Graphics Processing Unit，GPU）8041和麦克风8042，图形处理器8041对在视频捕获模式或图像捕获模式中由图像捕获装置（如摄像头）获得的静态图片或视频的图像数据进行处理。处理后的图像帧可以显示在显示单元806上。经图形处理器8041处理后的图像帧可以存储在存储器809（或其它存储介质）中或者经由射频单元801或网络模块802进行发送。麦克风8042可以接收声音，并且能够将这样的声音处理为音频数据。处理后的音频数据可以在电话通话模式的情况下转换为可经由射频单元801发送到移动通信基站的格式输出。The input unit 804 is used to receive audio or video signals. The input unit 804 may include a graphics processing unit (GPU) 8041 and a microphone 8042. The graphics processor 8041 processes image data of still pictures or videos obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The processed image frames may be displayed on the display unit 806. The image frames processed by the graphics processor 8041 may be stored in the memory 809 (or other storage media) or sent via the radio frequency unit 801 or the network module 802. The microphone 8042 can receive sound and can process such sound into audio data. In the phone call mode, the processed audio data can be converted into a format that can be sent to a mobile communication base station via the radio frequency unit 801 and output.
电子设备800还包括至少一种传感器805，比如光传感器、运动传感器以及其他传感器。具体地，光传感器包括环境光传感器及接近传感器，其中，环境光传感器可根据环境光线的明暗来调节显示面板8061的亮度，接近传感器可在电子设备800移动到耳边时，关闭显示面板8061和/或背光。作为运动传感器的一种，加速计传感器可检测各个方向上（一般为三轴）加速度的大小，静止时可检测出重力的大小及方向，可用于识别电子设备姿态（比如横竖屏切换、相关游戏、磁力计姿态校准）、振动识别相关功能（比如计步器、敲击）等；传感器805还可以包括指纹传感器、压力传感器、虹膜传感器、分子传感器、陀螺仪、气压计、湿度计、温度计、红外线传感器等，在此不再赘述。The electronic device 800 further includes at least one sensor 805, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel 8061 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 8061 and/or the backlight when the electronic device 800 is moved to the ear. As a type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (usually three axes), can detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of the electronic device (such as portrait/landscape switching, related games, magnetometer attitude calibration) and vibration-recognition-related functions (such as pedometer, tapping); the sensor 805 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which will not be described in detail here.
显示单元806用于显示由用户输入的信息或提供给用户的信息。显示单元806可包括显示面板8061，可以采用液晶显示器（Liquid Crystal Display，LCD）、有机发光二极管（Organic Light-Emitting Diode，OLED）等形式来配置显示面板8061。The display unit 806 is used to display information input by the user or information provided to the user. The display unit 806 may include a display panel 8061, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
用户输入单元807可用于接收输入的数字或字符信息，以及产生与电子设备的用户设置以及功能控制有关的键信号输入。具体地，用户输入单元807包括触控面板8071以及其他输入设备8072。触控面板8071，也称为触摸屏，可收集用户在其上或附近的触摸操作（比如用户使用手指、触笔等任何适合的物体或附件在触控面板8071上或在触控面板8071附近的操作）。触控面板8071可包括触摸检测装置和触摸控制器两个部分。其中，触摸检测装置检测用户的触摸方位，并检测触摸操作带来的信号，将信号传送给触摸控制器；触摸控制器从触摸检测装置上接收触摸信息，并将它转换成触点坐标，再送给处理器810，接收处理器810发来的命令并加以执行。此外，可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板8071。除了触控面板8071，用户输入单元807还可以包括其他输入设备8072。具体地，其他输入设备8072可以包括但不限于物理键盘、功能键（比如音量控制按键、开关按键等）、轨迹球、鼠标、操作杆，在此不再赘述。The user input unit 807 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 807 includes a touch panel 8071 and other input devices 8072. The touch panel 8071, also called a touch screen, can collect touch operations performed by the user on or near it (such as operations performed by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 8071). The touch panel 8071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 810, and receives and executes commands sent by the processor 810. In addition, the touch panel 8071 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 8071, the user input unit 807 may also include other input devices 8072. Specifically, the other input devices 8072 may include but are not limited to a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which will not be described again here.
进一步的，触控面板8071可覆盖在显示面板8061上，当触控面板8071检测到在其上或附近的触摸操作后，传送给处理器810以确定触摸事件的类型，随后处理器810根据触摸事件的类型在显示面板8061上提供相应的视觉输出。虽然在图8中，触控面板8071与显示面板8061是作为两个独立的部件来实现电子设备的输入和输出功能，但是在某些实施例中，可以将触控面板8071与显示面板8061集成而实现电子设备的输入和输出功能，具体此处不做限定。Further, the touch panel 8071 may cover the display panel 8061. When the touch panel 8071 detects a touch operation on or near it, the operation is transmitted to the processor 810 to determine the type of touch event, and the processor 810 then provides a corresponding visual output on the display panel 8061 according to the type of touch event. Although in Figure 8 the touch panel 8071 and the display panel 8061 are shown as two independent components implementing the input and output functions of the electronic device, in some embodiments the touch panel 8071 and the display panel 8061 may be integrated to implement the input and output functions of the electronic device, which is not limited here.
接口单元808为外部装置与电子设备800连接的接口。例如，外部装置可以包括有线或无线头戴式耳机端口、外部电源（或电池充电器）端口、有线或无线数据端口、存储卡端口、用于连接具有识别模块的装置的端口、音频输入/输出（I/O）端口、视频I/O端口、耳机端口等等。接口单元808可以用于接收来自外部装置的输入（例如，数据信息、电力等等）并且将接收到的输入传输到电子设备800内的一个或多个元件或者可以用于在电子设备800和外部装置之间传输数据。The interface unit 808 is an interface for connecting an external device to the electronic device 800. For example, the external device may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 808 may be used to receive input (for example, data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic device 800, or may be used to transmit data between the electronic device 800 and an external device.
存储器809可用于存储软件程序以及各种数据。存储器809可主要包括存储程序区和存储数据区，其中，存储程序区可存储操作系统、至少一个功能所需的应用程序（比如声音播放功能、图像播放功能等）等；存储数据区可存储根据手机的使用所创建的数据（比如音频数据、电话本等）等。此外，存储器809可以包括高速随机存取存储器，还可以包括非易失性存储器，例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。The memory 809 can be used to store software programs and various data. The memory 809 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required for at least one function (such as a sound playback function, an image playback function, etc.), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), and the like. In addition, the memory 809 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
处理器810是电子设备的控制中心，利用各种接口和线路连接整个电子设备的各个部分，通过运行或执行存储在存储器809内的软件程序和/或模块，以及调用存储在存储器809内的数据，执行电子设备的各种功能和处理数据，从而对电子设备进行整体监控。处理器810可包括一个或多个处理单元；优选的，处理器810可集成应用处理器和调制解调处理器，其中，应用处理器主要处理操作系统、用户界面和应用程序等，调制解调处理器主要处理无线通信。可以理解的是，上述调制解调处理器也可以不集成到处理器810中。The processor 810 is the control center of the electronic device. It connects various parts of the entire electronic device by using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 809 and invoking data stored in the memory 809, thereby monitoring the electronic device as a whole. The processor 810 may include one or more processing units; preferably, the processor 810 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, etc., and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may also not be integrated into the processor 810.
电子设备800还可以包括给各个部件供电的电源811（比如电池），优选的，电源811可以通过电源管理系统与处理器810逻辑相连，从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。The electronic device 800 may further include a power supply 811 (such as a battery) that supplies power to each component. Preferably, the power supply 811 may be logically connected to the processor 810 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system.
另外，电子设备800包括一些未示出的功能模块，在此不再赘述。In addition, the electronic device 800 includes some functional modules that are not shown, which will not be described again here.
优选的，本申请实施例还提供一种电子设备，包括处理器810，存储器809，存储在存储器809上并可在所述处理器810上运行的计算机程序，该计算机程序被处理器810执行时实现上述数据增强方法实施例的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。Preferably, an embodiment of the present application further provides an electronic device, including a processor 810, a memory 809, and a computer program stored on the memory 809 and executable on the processor 810. When the computer program is executed by the processor 810, each process of the above data enhancement method embodiments is implemented and the same technical effects can be achieved; to avoid repetition, details are not described again here.
本申请实施例还提供一种计算机可读存储介质，计算机可读存储介质上存储有计算机程序，该计算机程序被处理器执行时实现上述数据增强方法实施例的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。其中，所述的计算机可读存储介质，如只读存储器（Read-Only Memory，简称ROM）、随机存取存储器（Random Access Memory，简称RAM）、磁碟或者光盘等。An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, each process of the above data enhancement method embodiments is implemented and the same technical effects can be achieved; to avoid repetition, details are not described again here. The computer-readable storage medium is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
本申请实施例还提供一种计算机程序,该计算机程序被处理器执行时实现上述数据增强方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。Embodiments of the present application also provide a computer program that, when executed by a processor, implements each process of the above data enhancement method embodiment and can achieve the same technical effect. To avoid duplication, the details will not be described here.
本申请实施例还提供一种计算机程序产品，包括计算机程序，该计算机程序被处理器执行时实现上述数据增强方法实施例的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。An embodiment of the present application further provides a computer program product, including a computer program. When the computer program is executed by a processor, each process of the above data enhancement method embodiments is implemented and the same technical effects can be achieved; to avoid repetition, details are not described again here.
需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。It should be noted that, in this document, the terms "comprising", "including" or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that includes the element.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质（如ROM/RAM、磁碟、光盘）中，包括若干指令用以使得一台终端（可以是手机，计算机，服务器，空调器，或者网络设备等）执行本申请各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
上面结合附图对本申请的实施例进行了描述，但是本申请并不局限于上述的具体实施方式，上述的具体实施方式仅仅是示意性的，而不是限制性的，本领域的普通技术人员在本申请的启示下，在不脱离本申请宗旨和权利要求所保护的范围情况下，还可做出很多形式，均属于本申请的保护之内。The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific implementations. The above specific implementations are merely illustrative rather than restrictive. Inspired by the present application, those of ordinary skill in the art can make many further forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Claims (15)

  1. 一种数据增强方法,包括:A data augmentation method that includes:
    获取原始图像以及背景图像集,一张原始图像对应一个背景图像集;Obtain the original image and background image set, one original image corresponds to one background image set;
    根据目标检测网络,对所述原始图像进行目标检测,获得所述原始图像的第一检测框;According to the target detection network, perform target detection on the original image to obtain the first detection frame of the original image;
    对所述第一检测框对应的区域以及所述原始图像对应的背景图像集中至少一张背景图像进行融合,得到所述原始图像的增强图像。The area corresponding to the first detection frame and at least one background image in the background image set corresponding to the original image are fused to obtain an enhanced image of the original image.
  2. 根据权利要求1所述的方法,其中,所述目标检测网络为M个;The method according to claim 1, wherein there are M target detection networks;
    所述根据目标检测网络,对所述原始图像进行目标检测,获得所述原始图像的第一检测框,包括:The method of performing target detection on the original image and obtaining the first detection frame of the original image according to the target detection network includes:
    根据M个所述目标检测网络,对所述原始图像进行目标检测,获得所述原始图像的M个目标检测框;Perform target detection on the original image according to M target detection networks, and obtain M target detection frames of the original image;
    根据所述原始图像的M个目标检测框确定所述原始图像的第一检测框。The first detection frame of the original image is determined according to the M target detection frames of the original image.
  3. 根据权利要求2所述的方法,其中,所述根据M个所述目标检测网络,对所述原始图像进行目标检测,获得所述原始图像的M个目标检测框,包括:The method according to claim 2, wherein performing target detection on the original image according to M target detection networks to obtain M target detection frames of the original image includes:
    将所述原始图像输入M个所述目标检测网络进行特征提取,得到所述原始图像的M张特征图;Input the original image into M target detection networks for feature extraction, and obtain M feature maps of the original image;
    对所述原始图像的M张特征图进行归一化处理,得到所述原始图像的M张热力图;Perform normalization processing on the M feature maps of the original image to obtain M heat maps of the original image;
    计算所述原始图像的M张热力图的目标检测框,作为所述原始图像的M个目标检测框。Calculate the target detection frames of the M heat maps of the original image as the M target detection frames of the original image.
  4. 根据权利要求3所述的方法,其中,所述计算所述原始图像的M张热力图的M个目标检测框,作为所述原始图像的M个目标检测框,包括:The method according to claim 3, wherein calculating the M target detection frames of the M heat maps of the original image as the M target detection frames of the original image includes:
    对于每张热力图，根据预设像素阈值对所述热力图进行二值化处理，得到所述热力图的二值化图像，并基于所述热力图的二值化图像计算所述热力图的目标检测框。For each heat map, the heat map is binarized according to a preset pixel threshold to obtain a binary image of the heat map, and the target detection frame of the heat map is calculated based on the binarized image of the heat map.
  5. 根据权利要求2至4中任一项所述的方法,其中,所述确定所述原始图像的第一检测框,包括:The method according to any one of claims 2 to 4, wherein determining the first detection frame of the original image includes:
    对所述原始图像的M个所述目标检测框进行平均处理,得到所述原始图像的第一检测框。The M target detection frames of the original image are averaged to obtain the first detection frame of the original image.
  6. 根据权利要求1至5中任一项所述的方法,其中,所述获取所述背景图像集,包括:The method according to any one of claims 1 to 5, wherein said obtaining the background image set includes:
    根据所述原始图像的类别,确定与所述原始图像的类别匹配的至少一种类别;determining at least one category that matches the category of the original image according to the category of the original image;
    获取所述至少一种类别中每种类别对应的至少一张参考图像,以得到所述原始图像对应的参考图像集;Obtain at least one reference image corresponding to each category in the at least one category to obtain a reference image set corresponding to the original image;
    获取所述参考图像集的每张参考图像中的背景图像,以得到所述原始图像对应的背景图像集。Obtain the background image in each reference image of the reference image set to obtain the background image set corresponding to the original image.
  7. 根据权利要求6所述的方法,其中,所述根据所述原始图像的类别,确定与所述原始图像的类别匹配的至少一种类别之前,所述方法还包括:The method according to claim 6, wherein before determining at least one category matching the category of the original image according to the category of the original image, the method further includes:
    确定多个类别中每两个类别之间的相似度,所述多个类别包括所述原始图像的类别以及所述至少一种类别;determining a degree of similarity between each two categories in a plurality of categories, the plurality of categories including a category of the original image and the at least one category;
    其中,所述根据所述原始图像的类别,确定与所述原始图像的类别匹配的至少一种类别,包括:Wherein, determining at least one category that matches the category of the original image according to the category of the original image includes:
    根据所述原始图像的类别与所述多个类别中其余类别之间的相似度，从所述其余类别中确定与所述原始图像的类别匹配的至少一种类别，所述其余类别为所述多个类别中除所述原始图像的类别之外的类别，所述至少一种类别中每种类别与所述原始图像的类别之间的相似度大于预设阈值。According to the similarity between the category of the original image and the remaining categories among the plurality of categories, at least one category that matches the category of the original image is determined from the remaining categories, where the remaining categories are the categories among the plurality of categories other than the category of the original image, and the similarity between each of the at least one category and the category of the original image is greater than a preset threshold.
  8. 根据权利要求7所述的方法,其中,所述确定多个类别中每两个类别之间的相似度,包括:The method according to claim 7, wherein determining the similarity between each two categories in the plurality of categories includes:
    将所述多个类别输入语义模型进行语义分析,获得所述多个类别中每个类别的语义向量表示;Input the multiple categories into the semantic model for semantic analysis, and obtain the semantic vector representation of each category in the multiple categories;
    计算所述多个类别中每两个类别的语义向量表示之间的相似度。The similarity between the semantic vector representations of each two categories in the plurality of categories is calculated.
  9. 根据权利要求8所述的方法,其中,所述每两个类别的语义向量表示之间的相似度为所述每两个类别的语义向量表示之间的余弦相似度。The method according to claim 8, wherein the similarity between the semantic vector representations of every two categories is a cosine similarity between the semantic vector representations of every two categories.
  10. 根据权利要求1至9中任一项所述的方法,其中,所述第一检测框为矩形检测框。The method according to any one of claims 1 to 9, wherein the first detection frame is a rectangular detection frame.
  11. 一种数据增强装置,包括:A data enhancement device including:
    获取模块,用于获取原始图像以及背景图像集,一张原始图像对应一个背景图像集;The acquisition module is used to obtain the original image and the background image set. One original image corresponds to one background image set;
    目标检测模块,用于根据目标检测网络,对所述原始图像进行目标检测,获得所述原始图像的第一检测框;A target detection module, configured to perform target detection on the original image according to the target detection network, and obtain the first detection frame of the original image;
    融合模块,用于对所述第一检测框对应的区域以及所述原始图像对应的背景图像集中至少一张背景图像进行融合,得到所述原始图像的增强图像。A fusion module configured to fuse the area corresponding to the first detection frame and at least one background image in the background image set corresponding to the original image to obtain an enhanced image of the original image.
  12. 一种电子设备，包括：存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现如权利要求1至10中任一项所述的数据增强方法中的步骤。An electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps in the data enhancement method according to any one of claims 1 to 10.
  13. 一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求1至10中任一项所述的数据增强方法中的步骤。A computer-readable storage medium having a computer program stored on the computer-readable storage medium. When the computer program is executed by a processor, the steps in the data enhancement method as claimed in any one of claims 1 to 10 are implemented. .
  14. 一种计算机程序产品,包括计算机可执行指令,当处理器执行所述计算机可执行指令时,实现如权利要求1至10中任一项所述的数据增强方法中的步骤。A computer program product includes computer-executable instructions. When a processor executes the computer-executable instructions, the steps in the data enhancement method according to any one of claims 1 to 10 are implemented.
  15. 一种计算机程序,当处理器执行所述计算机程序时,实现如权利要求1至10中任一项所述的数据增强方法中的步骤。 A computer program that, when executed by a processor, implements the steps in the data enhancement method as claimed in any one of claims 1 to 10.
PCT/CN2023/107709 2022-07-29 2023-07-17 Data enhancement method and apparatus, and electronic device WO2024022149A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210904226.9A CN117541770A (en) 2022-07-29 2022-07-29 Data enhancement method and device and electronic equipment
CN202210904226.9 2022-07-29

Publications (1)

Publication Number Publication Date
WO2024022149A1 true WO2024022149A1 (en) 2024-02-01

Family

ID=89705367

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/107709 WO2024022149A1 (en) 2022-07-29 2023-07-17 Data enhancement method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN117541770A (en)
WO (1) WO2024022149A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190279346A1 (en) * 2018-03-07 2019-09-12 Adobe Inc. Image-blending via alignment or photometric adjustments computed by a neural network
WO2020038065A1 (en) * 2018-08-21 2020-02-27 中兴通讯股份有限公司 Image processing method, terminal, and computer storage medium
CN112258504A (en) * 2020-11-13 2021-01-22 腾讯科技(深圳)有限公司 Image detection method, device and computer readable storage medium
CN112348765A (en) * 2020-10-23 2021-02-09 深圳市优必选科技股份有限公司 Data enhancement method and device, computer readable storage medium and terminal equipment
CN112581522A (en) * 2020-11-30 2021-03-30 平安科技(深圳)有限公司 Method and device for detecting position of target object in image, electronic equipment and storage medium
CN113012054A (en) * 2019-12-20 2021-06-22 舜宇光学(浙江)研究院有限公司 Sample enhancement method and training method based on sectional drawing, system and electronic equipment thereof
CN113688957A (en) * 2021-10-26 2021-11-23 苏州浪潮智能科技有限公司 Target detection method, device, equipment and medium based on multi-model fusion

Also Published As

Publication number Publication date
CN117541770A (en) 2024-02-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23845361

Country of ref document: EP

Kind code of ref document: A1