WO2022011892A1

WO2022011892A1 - Network training method and apparatus, target detection method and apparatus, and electronic device

Info

Publication number: WO2022011892A1
Application number: PCT/CN2020/125972
Authority: WO
Inventors: 窦浩轩; 王意如; 甘伟豪; 路少卿; 武伟; 闫俊杰
Original assignee: 北京市商汤科技开发有限公司
Priority date: 2020-07-15
Filing date: 2020-11-02
Publication date: 2022-01-20
Also published as: CN111881956A; JP2022544893A; CN111881956B; TWI780751B; TW202205151A; KR20220009965A

Abstract

Embodiments of the present disclosure relate to a network training method and apparatus, a target detection method and apparatus, and an electronic device. The network training method comprises: inputting an unannotated sample image into a target detection network to obtain a target detection result, the result comprising an image area, feature information, and classification probability of a target; determining a category confidence of the target according to the classification probability of the target; for a first target having a category confidence greater than or equal to a first threshold, using a sample image where the first target is located as an annotated image, and adding the annotated image into a training set; for a second target having a category confidence less than the first threshold, performing feature-related mining on the second target, determining a fourth target, and adding a sample image where the fourth target is located into the training set; and training the target detection network according to the training set.

Description

Network training method and device, target detection method and device, and electronic equipment

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is based on the Chinese patent application with the application number of 202010681178.2 and the filing date of July 15, 2020, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is incorporated herein by reference.

TECHNICAL FIELD

The present disclosure relates to the field of computer technologies, and in particular, to a network training method and device, a target detection method and device, and electronic equipment.

BACKGROUND

Computer vision is an important direction of artificial intelligence technology. In computer vision processing, it is usually necessary to detect targets (such as pedestrians and objects) in images or videos. Target detection on large-scale long-tail data has important applications in many fields, such as abnormal object detection in urban surveillance, abnormal behavior detection and emergency alarms. However, long-tail data is huge in volume and its positive and negative samples are severely imbalanced: most of the images are background images, and only a small part of the images contain detectable targets. As a result, target detection performs poorly on long-tail data.

SUMMARY OF THE INVENTION

The embodiments of the present disclosure propose a technical solution for network training and target detection.

According to an aspect of the embodiments of the present disclosure, a network training method is provided, including:

inputting an unlabeled first sample image into a target detection network for processing to obtain a target detection result of the first sample image, the target detection result including an image area, feature information and a classification probability of a target in the first sample image; determining a category confidence of the target according to the classification probability of the target; for a first target, among the targets, whose category confidence is greater than or equal to a first threshold, taking the first sample image where the first target is located as a labeled second sample image and adding it to a training set, wherein annotation information of the second sample image includes the image area of the first target and the category corresponding to the category confidence of the first target, and the training set includes labeled third sample images; for a second target, among the targets, whose category confidence is less than the first threshold, performing feature correlation mining on the second target according to feature information of a third target in the third sample images, determining, through the feature correlation mining, a fourth target and the first sample image where the fourth target is located from the second targets, and adding the first sample image where the fourth target is located to the training set as a fourth sample image; and training the target detection network according to annotation information of the fourth sample image and the second sample image, the third sample image and the fourth sample image in the training set.

In a possible implementation manner, training the target detection network according to the annotation information of the fourth sample image and the second sample image, the third sample image and the fourth sample image in the training set includes:

determining, according to the categories of the targets in the positive sample images of the training set, a first number of samples to be drawn from the positive sample images of each category, where a positive sample image is a sample image that includes a target; sampling the positive sample images of each category according to the first number for that category to obtain a plurality of fifth sample images; sampling the negative sample images of the training set to obtain a plurality of sixth sample images, where a negative sample image is a sample image that does not include a target; and training the target detection network according to the fifth sample images and the sixth sample images.
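The category-wise sampling described above can be illustrated with a minimal Python sketch. The source does not say how the first number for each category is chosen; the equal split across categories used here is an assumption, and the function name `sample_training_images`, its parameters and the sample data are hypothetical.

```python
import random


def sample_training_images(positives_by_category, negatives,
                           num_positive, num_negative, seed=0):
    """Draw fifth (positive) and sixth (negative) sample images.

    positives_by_category: dict mapping category -> list of positive image ids.
    negatives: list of negative (background-only) image ids.
    """
    if not positives_by_category:
        raise ValueError("at least one positive category is required")
    rng = random.Random(seed)
    # Assumption: the "first number" is an equal share for every category.
    first_number = max(1, num_positive // len(positives_by_category))
    fifth = []
    for images in positives_by_category.values():
        # Never ask for more images than the category actually has.
        fifth.extend(rng.sample(images, min(first_number, len(images))))
    sixth = rng.sample(negatives, min(num_negative, len(negatives)))
    return fifth, sixth


positives = {"person": ["p1", "p2", "p3", "p4"], "vehicle": ["v1", "v2"]}
negatives = ["n%d" % i for i in range(10)]
fifth, sixth = sample_training_images(positives, negatives, 4, 4)
```

With an equal split, a rare category such as "vehicle" contributes as many images per batch as a common one, which is one way of countering class imbalance; other weighting schemes would fit the text equally well.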

In a possible implementation manner, performing feature correlation mining on the second target according to the feature information of the third target in the third sample image, and determining, through the feature correlation mining, the fourth target and the first sample image where the fourth target is located from the second targets includes: determining the information entropy of the second target according to the classification probability of the second target; selecting a fifth target from the second targets according to the category confidence and information entropy of the second target; determining, according to the categories of the third targets in the third sample images and the total number of sample images to be mined, a second number of sample images to be mined for each category; and determining the fourth target and the first sample image where the fourth target is located from the fifth targets according to the feature information of the third targets in the third sample images, the feature information of the fifth targets and the second number of sample images to be mined for each category.

In a possible implementation manner, selecting a fifth target from the second targets according to the category confidence and information entropy of the second target includes: sorting the second targets by category confidence and by information entropy respectively, and selecting a third number of sixth targets and a fourth number of seventh targets; and combining the sixth targets and the seventh targets to obtain the fifth targets.
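The selection step can be sketched in Python as follows. The source does not state the sort directions; taking the lowest-confidence targets as sixth targets and the highest-entropy targets as seventh targets is an assumption, as are the natural-log Shannon entropy, the names `information_entropy` and `select_fifth_targets`, and the sample data.

```python
import math


def information_entropy(class_probs):
    """Shannon entropy (natural log) of one target's classification probabilities."""
    return -sum(p * math.log(p) for p in class_probs if p > 0)


def select_fifth_targets(second_targets, third_number, fourth_number):
    """second_targets: dict mapping target id -> list of class probabilities."""
    # Assumption: least confident targets first.
    by_confidence = sorted(second_targets,
                           key=lambda t: max(second_targets[t]))
    # Assumption: most uncertain (highest entropy) targets first.
    by_entropy = sorted(second_targets,
                        key=lambda t: information_entropy(second_targets[t]),
                        reverse=True)
    sixth = by_confidence[:third_number]
    seventh = by_entropy[:fourth_number]
    # Combine the two selections, dropping sixth targets already among the seventh.
    seventh_ids = set(seventh)
    remaining = [t for t in sixth if t not in seventh_ids]
    return remaining + seventh


targets = {
    "t1": [0.5, 0.5],
    "t2": [0.9, 0.1],
    "t3": [0.6, 0.4],
    "t4": [0.98, 0.02],
}
fifth = select_fifth_targets(targets, third_number=2, fourth_number=1)
```

Here "t1" is picked by both rankings, so it appears only once in the combined result alongside "t3".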

In a possible implementation manner, determining the second number of sample images to be mined for each category according to the categories of the third targets in the third sample images and the total number of sample images to be mined includes: determining the proportion of the third targets of each category according to the categories of the third targets in the third sample images; determining the sampling proportion of each category according to the proportion of the third targets of each category; and determining the second number of sample images to be mined for each category according to the sampling proportion of each category.
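A minimal Python sketch of this quota computation follows. The source does not specify how the sampling proportion is derived from the category proportion; the normalized inverse proportion used here (rarer categories get a larger share) is an assumption, and `mining_quota` and the sample data are hypothetical.

```python
from collections import Counter


def mining_quota(third_target_categories, total_to_mine):
    """Return the second number (images to mine) for each category."""
    counts = Counter(third_target_categories)
    total = sum(counts.values())
    # Proportion of labeled third targets in each category.
    proportion = {c: n / total for c, n in counts.items()}
    # Assumption: sampling proportion is the normalized inverse proportion.
    inverse = {c: 1.0 / p for c, p in proportion.items()}
    norm = sum(inverse.values())
    sampling = {c: w / norm for c, w in inverse.items()}
    # Rounding means the quotas may not sum exactly to total_to_mine.
    return {c: round(total_to_mine * s) for c, s in sampling.items()}


labeled_categories = ["person"] * 8 + ["vehicle"] * 2
quota = mining_quota(labeled_categories, total_to_mine=100)
```

In this example "person" makes up 80% of the labeled targets and receives a quota of 20 images, while "vehicle" at 20% receives 80.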

In a possible implementation manner, determining the fourth target and the first sample image where the fourth target is located from the fifth targets according to the feature information of the third targets in the third sample images, the feature information of the fifth targets and the second number of sample images to be mined for each category includes: according to the distances between the feature information of the third targets of a first category and the feature information of each fifth target, determining, for each fifth target, the third target of the first category with the smallest distance from that fifth target as an eighth target, the first category being any one of the categories of the third targets; and determining the fifth target whose eighth target is at the largest distance as the fourth target.

In a possible implementation manner, determining the fourth target and the first sample image where the fourth target is located from the fifth targets according to the feature information of the third targets in the third sample images, the feature information of the fifth targets and the second number of sample images to be mined for each category further includes: adding the determined fourth target to the third targets of the first category, and removing the determined fourth target from the unlabeled fifth targets.

In a possible implementation manner, the method further includes: inputting the third sample image into the target detection network for processing to obtain feature information of the third target in the third sample image.

In a possible implementation manner, before the step of inputting the unlabeled first sample image into the target detection network for processing to obtain the target detection result of the first sample image, the method further includes:

The target detection network is pre-trained by using the labeled third sample image.

In a possible implementation manner, the first sample image includes a long-tail image.

According to an aspect of the embodiments of the present disclosure, a target detection method is provided, the method including: inputting an image to be processed into a target detection network for processing to obtain a target detection result of the image to be processed, where the target detection result includes the position and category of a target in the image to be processed, and the target detection network is trained according to the above-mentioned network training method.

According to an aspect of the embodiments of the present disclosure, a network training apparatus is provided, including:

a target detection part, configured to input an unlabeled first sample image into a target detection network for processing to obtain a target detection result of the first sample image, the target detection result including an image area, feature information and a classification probability of a target in the first sample image;

a confidence determination part, configured to determine the category confidence of the target according to the classification probability of the target;

a labeling part, configured to, for a first target, among the targets, whose category confidence is greater than or equal to a first threshold, take the first sample image where the first target is located as a labeled second sample image and add it to a training set, wherein the labeling information of the second sample image includes the image area of the first target and the category corresponding to the category confidence of the first target, and the training set includes labeled third sample images;

a feature mining part, configured to, for a second target, among the targets, whose category confidence is less than the first threshold, perform feature correlation mining on the second target according to the feature information of a third target in the third sample images, determine, through the feature correlation mining, a fourth target and the first sample image where the fourth target is located from the second targets, and add the first sample image where the fourth target is located to the training set as a fourth sample image;

a training part, configured to train the target detection network according to the label information of the fourth sample image and the second sample image, the third sample image and the fourth sample image in the training set.

In a possible implementation manner, the training part includes: a sampling quantity determination subsection, configured to determine, according to the categories of the targets in the positive sample images of the training set, a first number of samples to be drawn from the positive sample images of each category, where a positive sample image is a sample image that includes a target; a first sampling subsection, configured to sample the positive sample images of each category according to the first number for that category to obtain a plurality of fifth sample images; a second sampling subsection, configured to sample the negative sample images of the training set to obtain a plurality of sixth sample images, where a negative sample image is a sample image that does not include a target; and a training subsection, configured to train the target detection network according to the fifth sample images and the sixth sample images.

In a possible implementation manner, the feature mining part includes: an information entropy determination subsection, configured to determine the information entropy of the second target according to the classification probability of the second target; a target selection subsection, configured to select a fifth target from the second targets according to the category confidence and information entropy of the second target; a mining quantity determination subsection, configured to determine, according to the categories of the third targets in the third sample images and the total number of sample images to be mined, a second number of sample images to be mined for each category; and a target and image determination subsection, configured to determine the fourth target and the first sample image where the fourth target is located from the fifth targets according to the feature information of the third targets in the third sample images, the feature information of the fifth targets and the second number of sample images to be mined for each category.

In a possible implementation manner, the target selection subsection is configured to: sort the second targets by category confidence and by information entropy respectively, and select a third number of sixth targets and a fourth number of seventh targets; and combine the sixth targets and the seventh targets to obtain the fifth targets.

In a possible implementation manner, the mining quantity determination subsection is configured to: determine the proportion of the third targets of each category according to the categories of the third targets in the third sample images; determine the sampling proportion of each category according to the proportion of the third targets of each category; and determine the second number of sample images to be mined for each category according to the sampling proportion of each category.

In a possible implementation manner, the target and image determination subsection is configured to: according to the distances between the feature information of the third targets of a first category and the feature information of each fifth target, determine, for each fifth target, the third target of the first category with the smallest distance from that fifth target as an eighth target, the first category being any one of the categories of the third targets; and determine the fifth target whose eighth target is at the largest distance as the fourth target.

In a possible implementation manner, the target and image determination subsection is further configured to: add the determined fourth target to the third targets of the first category, and remove the determined fourth target from the unlabeled fifth targets.

In a possible implementation manner, the apparatus further includes: a feature extraction part, configured to input the third sample image into the target detection network for processing to obtain the feature information of the third target in the third sample image.

In a possible implementation manner, before the target detection part, the apparatus further includes: a pre-training part configured to pre-train the target detection network by using the labeled third sample images.

According to an aspect of the embodiments of the present disclosure, a target detection apparatus is provided, the apparatus includes: a detection processing part configured to input an image to be processed into a target detection network for processing, and obtain a target detection result of the to-be-processed image , the target detection result includes the position and category of the target in the to-be-processed image, and the target detection network is trained according to the above-mentioned network training method.

In a possible implementation manner, before the step of determining the first number of samples to be drawn from the positive sample images of each category according to the categories of the targets in the positive sample images of the training set, the method includes: sampling the positive sample images and the negative sample images to obtain the same or a similar number of positive sample images and negative sample images.

In a possible implementation manner, the total number of sample images to be mined is 5% to 25% of the total number of the first sample images.

In a possible implementation manner, combining the sixth targets and the seventh targets to obtain the fifth targets includes: removing from the sixth targets any target that is the same as a seventh target, to obtain the remaining targets among the sixth targets that differ from the seventh targets; and taking the remaining targets and the seventh targets as the fifth targets.

In a possible implementation manner, the method further includes: when the number of fourth sample images of the first category reaches the second number of sample images to be mined for the first category, ending the feature correlation mining for the first category.

In a possible implementation manner, after the step of determining, according to the distances between the feature information of the third targets of the first category and the feature information of each fifth target, the third target of the first category with the smallest distance from each fifth target as the eighth target, the method further includes: when the number of first sample images where the fourth targets are located reaches the second number of sample images to be mined for the first category, ending the determination of the eighth target.

In a possible implementation manner, after the step of determining, according to the distances between the feature information of the third targets of the first category and the feature information of each fifth target, the third target of the first category with the smallest distance from each fifth target as the eighth target, the method further includes: when the number of first sample images where the fourth targets are located has not reached the second number of sample images to be mined for the first category, and the set storing the feature information of the fifth targets is empty, ending the determination of the eighth target.
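The per-category mining loop described in the preceding implementations (nearest labeled target for each candidate, pick the candidate whose nearest labeled target is farthest, move it to the labeled set, stop on quota or empty pool) can be sketched in Python as follows. The source only says "distance"; the Euclidean metric is an assumption, and `mine_category`, its parameters and the sample data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mine_category(labeled_features, candidates, quota):
    """Mine fourth targets for one category.

    labeled_features: feature vectors of the third targets of this category.
    candidates: dict mapping fifth-target id -> (image id, feature vector).
    quota: second number of sample images to mine for this category.
    """
    if not labeled_features:
        raise ValueError("at least one labeled feature is required")
    labeled = list(labeled_features)
    pool = dict(candidates)
    mined_targets, mined_images = [], []
    # Stop when the image quota is reached or no candidates are left.
    while pool and len(mined_images) < quota:
        # Distance from each candidate to its nearest labeled target
        # (the "eighth target" in the text).
        nearest = {tid: min(euclidean(feat, ref) for ref in labeled)
                   for tid, (_, feat) in pool.items()}
        # The candidate farthest from everything labeled is the fourth target.
        chosen = max(nearest, key=nearest.get)
        image_id, feat = pool.pop(chosen)   # remove from the fifth targets
        labeled.append(feat)                # add to the third targets
        mined_targets.append(chosen)
        if image_id not in mined_images:
            mined_images.append(image_id)
    return mined_targets, mined_images


labeled = [(0.0, 0.0)]
candidates = {
    "c1": ("img1", (1.0, 0.0)),
    "c2": ("img2", (5.0, 0.0)),
    "c3": ("img3", (2.0, 0.0)),
}
mined_targets, mined_images = mine_category(labeled, candidates, quota=2)
```

Because each mined target joins the labeled set before the next round, candidates close to an already mined target become less attractive, so the mined images tend to cover regions of feature space that the labeled data does not.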

In a possible implementation manner, inputting the third sample image into the target detection network for processing to obtain the feature information of the third target in the third sample image includes: inputting the third sample image into the target detection network to obtain the feature vector output by a hidden layer of the target detection network; and determining the feature vector as the feature information of the third target.

According to an aspect of the embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; a memory configured to store instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory, to perform the above-mentioned network training method, or to perform the above-mentioned target detection method.

According to an aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium on which computer program instructions are stored, where the computer program instructions, when executed by a processor, implement the above-mentioned network training method or the above-mentioned target detection method.

According to the embodiments of the present disclosure, the target detection results of unlabeled sample images can be obtained through the target detection network; pseudo-labeling and feature correlation mining are respectively performed according to the target detection results, so that high-value sample images are labeled, collected and added to the training set; and the target detection network is trained on the expanded training set, thereby increasing the amount of positive sample data in the training set, alleviating the imbalance between positive and negative samples, and improving the training effect of the target detection network.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not limiting of the disclosed embodiments. Other features and aspects of embodiments of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated into and constitute a part of the specification, illustrate embodiments consistent with the present disclosure, and together with the description, serve to explain the technical solutions of the embodiments of the present disclosure.

FIG. 1 shows a flowchart of a network training method according to an embodiment of the present disclosure.

FIG. 2 shows a schematic diagram of a processing procedure of a network training method according to an embodiment of the present disclosure.

FIG. 3 shows a block diagram of a network training apparatus according to an embodiment of the present disclosure.

FIG. 4 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

FIG. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numbers in the figures denote elements that have the same or similar functions. While various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean that A exists alone, that A and B both exist, or that B exists alone. In addition, the term "at least one" herein refers to any one of a plurality of items or any combination of at least two of them. For example, "including at least one of A, B, and C" can mean including any one or more elements selected from the set consisting of A, B, and C.

In addition, in order to better illustrate the embodiments of the present disclosure, numerous specific details are given in the following detailed description. It should be understood by those skilled in the art that the embodiments of the present disclosure may be practiced without certain specific details. In some instances, methods, means, components and circuits well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the embodiments of the present disclosure.

FIG. 1 shows a flowchart of a network training method according to an embodiment of the present disclosure. As shown in FIG. 1 , the network training method includes:

In step S11, an unlabeled first sample image is input into a target detection network for processing to obtain a target detection result of the first sample image, where the target detection result includes the image area, feature information and classification probability of a target in the first sample image;

In step S12, the category confidence of the target is determined according to the classification probability of the target;

In step S13, for a first target, among the targets, whose category confidence is greater than or equal to a first threshold, the first sample image where the first target is located is taken as a labeled second sample image and added to a training set, wherein the labeling information of the second sample image includes the image area of the first target and the category corresponding to the category confidence of the first target, and the training set includes labeled third sample images;

In step S14, for a second target, among the targets, whose category confidence is less than the first threshold, feature correlation mining is performed on the second target according to the feature information of a third target in the third sample images; through the feature correlation mining, a fourth target and the first sample image where the fourth target is located are determined from the second targets, and the first sample image where the fourth target is located is added to the training set as a fourth sample image;

In step S15, the target detection network is trained according to the label information of the fourth sample image and the second sample image, the third sample image and the fourth sample image in the training set.

In a possible implementation manner, the method may be executed by an electronic device such as a terminal device or a server. The terminal device may be a user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling computer-readable instructions stored in a memory. Alternatively, the method may be performed by a server.

For example, the first sample image may be an image acquired by an image acquisition device (e.g., a camera). The first sample images may include large-scale long-tail images, that is, most of the images are background images, and a small part of the images include detectable targets. Detectable targets may include, for example, human bodies, faces, vehicles, objects, and the like. For example, in the security field, images of a certain geographical area can be collected by cameras, and people may pass through the area only a small part of the time, so most of the collected images are background images, and only a small part of the images include human faces and/or human bodies. In this case, the collected images form a long-tail dataset. The embodiments of the present disclosure do not limit the acquisition method of the first sample image or the category of the target in the first sample image.

In a possible implementation manner, a target detection network may be preset to detect the position (ie, detection frame) and category of the target in the image. The target detection network may be, for example, a convolutional neural network, and the embodiment of the present disclosure does not limit the network structure of the target detection network.

In a possible implementation manner, before step S11, the method further includes: pre-training the target detection network by using the labeled third sample image. That is to say, a training set may be preset, and the training set includes the labeled third sample images, and the labeling information of the third sample images may include the detection frame and category of the target in the image. According to the training set, the target detection network can be pre-trained by the method in the related art, so that the target detection network has a certain detection accuracy.

However, the pre-trained object detection network has poor detection effect on large-scale long-tail images. Therefore, the unlabeled first sample image can be used to further train the object detection network through active learning.

In a possible implementation manner, in step S11, the unlabeled first sample image may be input into the target detection network for processing to obtain the target detection result of the first sample image. The target detection result may include the image area, feature information and classification probability of the target in the first sample image. The image area where the target is located may be the detection frame in the image; the feature information of the target may be, for example, the feature vector output by a hidden layer (such as a convolution layer) of the target detection network; and the classification probability of the target may represent the posterior probability that the target belongs to each category.

In a possible implementation manner, the target in the first sample image may also be referred to as an instance, and one or more targets may be detected in each first sample image. In practice, the number of detected targets may be several to dozens of times the number of images.

In a possible implementation manner, in step S12, the maximum value of the classification probability of the target may be obtained and determined as the category confidence of the target.

In a possible implementation manner, in step S13, for a target whose category confidence is greater than or equal to the first threshold (which may be referred to as a first target), the first sample image where the first target is located may be taken as a labeled sample image (which may be referred to as a second sample image) and added to the training set. The image area of the first target is taken as the labeled image area, and the category corresponding to the category confidence of the first target is taken as the labeled category of the first target. The same second sample image may be labeled multiple times, once for each first target it contains. The first threshold is, for example, 0.99, and the embodiments of the present disclosure do not limit the value of the first threshold.
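As a minimal Python illustration of step S12, the category confidence is the largest classification probability and the predicted category is its index. The function name `category_confidence` and the sample probabilities are hypothetical.

```python
def category_confidence(class_probs):
    """Return (confidence, category index) for one detected target.

    class_probs: posterior probability of the target for each category.
    """
    if not class_probs:
        raise ValueError("class_probs must not be empty")
    # Index of the largest probability; its value is the category confidence.
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return class_probs[best], best


confidence, category = category_confidence([0.05, 0.90, 0.05])
```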

In a possible implementation manner, the process of step S13 may be called pseudo-labeling. That is, the image where the target with higher confidence is located is regarded as a high-value sample, and the target detection inference result is directly used as the target annotation result. In this way, the number of positive sample data in the training set can be expanded to solve the problem of difficult collection of positive samples.
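As an illustration only, steps S12 and S13 can be sketched as follows. The dictionary layout (`image_id`, `box`, `probs`) and the function name `split_by_confidence` are hypothetical and are not part of the disclosure; the threshold 0.99 is the example value given above.

```python
def split_by_confidence(detections, first_threshold=0.99):
    """Split detected targets into first targets (pseudo-labeled) and second targets.

    The record layout used here is a hypothetical choice for this sketch.
    """
    first_targets, second_targets = [], []
    for det in detections:
        probs = det["probs"]
        # Step S12: the category confidence is the maximum classification probability.
        category = max(range(len(probs)), key=probs.__getitem__)
        confidence = probs[category]
        record = dict(det, category=category, confidence=confidence)
        if confidence >= first_threshold:
            # Step S13: the detection frame and category become the annotation.
            first_targets.append(record)
        else:
            # Left for feature correlation mining in step S14.
            second_targets.append(record)
    return first_targets, second_targets


detections = [
    {"image_id": "img_0", "box": (10, 20, 50, 80), "probs": [0.995, 0.003, 0.002]},
    {"image_id": "img_1", "box": (5, 5, 40, 40), "probs": [0.40, 0.35, 0.25]},
]
first, second = split_by_confidence(detections)
```

In this sketch, the image `img_0` would be added to the training set with the detection frame and category 0 as its annotation, while the target in `img_1` remains a second target.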

In a possible implementation manner, in step S14, for a target whose category confidence is less than the first threshold (which may be referred to as a second target), feature correlation mining may be performed on the second targets according to the feature information of the target in the labeled third sample image in the training set (which may be referred to as a third target), and targets that meet the requirements (which may be referred to as fourth targets) are mined from the second targets. For example, the distance or correlation between the feature information of the third target and the feature information of the second target can be calculated, a preset number of targets can be selected according to the distance or the correlation, and the selected preset number of targets can be used as the fourth targets.

In a possible implementation manner, the first sample image where the mined fourth target is located may be taken as the fourth sample image and added to the training set, so as to complete the processing process of feature correlation mining. In this way, the number of sample data in the training set can be further expanded.

In a possible implementation manner, the annotation information of the fourth sample image may be obtained by manual annotation, for example, manually determining the detection frame and category of the target in the fourth sample image. This embodiment of the present disclosure does not limit this.

In a possible implementation manner, in step S15, after obtaining the label information of the fourth sample image, the target detection network can be trained according to the second sample image, the third sample image and the fourth sample image in the training set.

In a possible implementation manner, through the processing of step S11, the target detection result of each first sample image is obtained; through the processing of step S12, the category confidence of the target in each first sample image is obtained. In step S13, the sample image of the first target whose category confidence is greater than or equal to the first threshold can be added to the training set, and the labeled second sample image can be obtained by pseudo-labeling; in step S14, mining can be performed on the second targets whose category confidence is less than the first threshold.

In a possible implementation manner, step S14 may include:

According to the classification probability of the second target, determine the information entropy of the second target;

selecting a fifth target from the second target according to the category confidence and information entropy of the second target;

According to the category of the third target in the third sample image and the total number of sample images to be mined, respectively determine the second number of sample images to be mined in each category;

According to the feature information of the third target in the third sample image, the feature information of the fifth target and the second number of sample images to be mined in each category, determine, from the fifth target, the fourth target and the first sample image where the fourth target is located.

For example, according to the classification probability of the second target, the information entropy of the second target can be calculated to indicate the degree of uncertainty of the second target. That is, the greater the information entropy of the second target, the greater its degree of uncertainty; conversely, the smaller the information entropy of the second target, the smaller its degree of uncertainty. The embodiment of the present disclosure does not limit the calculation method of the information entropy.
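Since the calculation method of the information entropy is not limited, the following is only one possible choice: a minimal sketch that assumes the Shannon entropy (in nats) of the classification probability. The function name `information_entropy` is hypothetical.

```python
import math


def information_entropy(probs):
    """Shannon entropy of a classification probability vector (assumed choice)."""
    # Zero probabilities contribute nothing, so they are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


confident = information_entropy([0.98, 0.01, 0.01])
uncertain = information_entropy([1 / 3, 1 / 3, 1 / 3])
```

Under this choice, a near one-hot classification probability yields a small entropy, and a uniform classification probability over C categories yields the maximum value, log C.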

In a possible implementation manner, according to the category confidence and information entropy of the second target, targets that satisfy a certain condition (which may be referred to as fifth targets) may be selected from a plurality of second targets; for example, targets with higher category confidence, targets with higher information entropy, and the like may be selected.

In a possible implementation manner, the step of selecting a fifth target from the second target according to the category confidence and information entropy of the second target may include:

According to the category confidence and information entropy of the second target, the second targets are sorted respectively, and a third number of sixth targets and a fourth number of seventh targets are selected;

The sixth target and the seventh target are combined to obtain the fifth target.

That is, the plurality of second targets are sorted according to their category confidence; according to the sorting result, a preset third number of targets (which may be referred to as sixth targets) are selected from the plurality of second targets. Similarly, the plurality of second targets are sorted according to their information entropy; according to the sorting result, a preset fourth number of targets (which may be referred to as seventh targets) are selected from the plurality of second targets. The third number and the fourth number may each be 3K, where K represents the number of sample images to be mined and is, for example, 10000. In actual processing, the value of K may be 5% to 25% of the total number of unlabeled first sample images. The embodiments of the present disclosure do not limit the value of K or the quantitative relationship between the third number, the fourth number and K.

It should be understood that those skilled in the art can set the number K of sample images to be mined, the third number and the fourth number according to the actual situation, and the third number and the fourth number may be different, which is not limited in this embodiment of the present disclosure.

In a possible implementation manner, the selected sixth target and the seventh target may be combined, and the combined multiple targets may be used as the fifth target, so as to remove possible duplicate targets therein. In actual processing, about 6K fifth objects are available.

The above processing method can be called bootstrapping. In this way, a certain number of high-probability positive samples and negative samples can be selected from the second targets at the same time for subsequent feature correlation mining, which reduces the calculation amount of feature correlation mining and improves processing efficiency.
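The selection and combination described above can be sketched as follows. The record fields (`id`, `confidence`, `entropy`) and the function name `select_fifth_targets` are hypothetical; the sketch assumes that the targets ranked highest by each criterion are the ones selected.

```python
def select_fifth_targets(second_targets, third_number, fourth_number):
    """Take the top targets by confidence and by entropy, then merge without duplicates."""
    by_confidence = sorted(second_targets, key=lambda t: t["confidence"], reverse=True)
    by_entropy = sorted(second_targets, key=lambda t: t["entropy"], reverse=True)
    sixth = by_confidence[:third_number]   # sixth targets
    seventh = by_entropy[:fourth_number]   # seventh targets
    seen, fifth = set(), []
    for target in sixth + seventh:
        # A target selected by both criteria is kept only once.
        if target["id"] not in seen:
            seen.add(target["id"])
            fifth.append(target)
    return fifth


second_targets = [
    {"id": "a", "confidence": 0.90, "entropy": 0.30},
    {"id": "b", "confidence": 0.50, "entropy": 1.05},
    {"id": "c", "confidence": 0.85, "entropy": 0.95},
    {"id": "d", "confidence": 0.40, "entropy": 0.60},
]
fifth = select_fifth_targets(second_targets, third_number=2, fourth_number=2)
```

Here target `c` is selected by both criteria and appears once, so three fifth targets remain rather than four.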

In a possible implementation manner, the step of respectively determining the second number of sample images to be mined in each category according to the category of the third target in the third sample image and the total number of sample images to be mined may include:

According to the category of the third object in the third sample image, determine the proportion of the third object of each category;

According to the proportion of the third target of each category, determine the sampling proportion of each category;

According to the sampling proportion of each category, the second quantity of sample images to be mined in each category is determined respectively.

For example, according to the category of the third target in the labeled third sample image in the training set, the proportion $f_c$ of the third targets of each category can be determined; according to the proportion $f_c$, the sampling weight $\hat{r}_c$ of each category can be calculated by the following formulas:

$$r_c = \max\left(f_c,\; t \cdot \exp\left(\frac{f_c}{t} - 1\right)\right) \quad (1)$$

$$\hat{r}_c = \frac{r_c}{\sum_{i=1}^{C} r_i} \quad (2)$$

In formulas (1) and (2), $r_c$ represents the sampling value of category c; t is a hyperparameter with a value of, for example, 0.1; C represents the number of categories; $r_i$ represents the sampling value of the i-th category.

Through the processing of formulas (1) and (2), the sampling proportion corresponding to the category with a smaller proportion can be increased, and the sampling proportion corresponding to the category with a larger proportion can be reduced, thereby alleviating the quantity imbalance between samples of different categories in order to improve the training effect of the network.

In a possible implementation manner, according to the sampling weight of each category and the total number (K) of sample images to be mined, the second number of sample images to be mined for each category can be determined. Further, feature correlation mining may be performed according to the second number.
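One way to turn sampling weights into per-category counts is sketched below. The disclosure does not state how fractional counts are handled; the largest-remainder rounding used here, the function name `allocate_second_numbers`, and the example category names are assumptions made for illustration.

```python
def allocate_second_numbers(sampling_weights, total_k):
    """Split total_k images to be mined across categories in proportion to their weights."""
    total_weight = sum(sampling_weights.values())
    exact = {c: total_k * w / total_weight for c, w in sampling_weights.items()}
    counts = {c: int(v) for c, v in exact.items()}
    # Assumed rounding rule: images lost to truncation go to the categories
    # with the largest fractional parts, so the counts sum exactly to total_k.
    leftover = total_k - sum(counts.values())
    ranked = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for c in ranked[:leftover]:
        counts[c] += 1
    return counts


second_numbers = allocate_second_numbers(
    {"car": 0.5, "bicycle": 0.25, "scooter": 0.25}, total_k=10
)
```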

That is, the labeled third sample image in the training set can be input into the target detection network, and the feature information of the third sample image, such as a feature vector, is output from the hidden layer (eg, convolution layer) of the target detection network. In this way, the features of the third sample image can be obtained to facilitate subsequent feature correlation mining.

In a possible implementation manner, the step of determining, from the fifth target, the fourth target and the first sample image where the fourth target is located according to the feature information of the third target in the third sample image, the feature information of the fifth target and the second number of sample images to be mined in each category includes:

According to the distance between the feature information of the third target of the first category and the feature information of each fifth target, respectively determine, among the third targets of the first category, the third target with the smallest distance from each fifth target as the eighth target, where the first category is any one of the categories of the third target;

The target with the largest distance among the eighth targets is determined as the fourth target.

For example, after the second number of sample images to be mined in each category is determined, a k-center method may be used to mine a corresponding number of sample images from the sample images where the fifth targets are located. For any one of the multiple categories of the third target (which may be referred to as the first category), the distance between the feature information of the third targets of the first category and the feature information of each fifth target may be calculated; the distance can be, for example, the Euclidean distance. For any fifth target, the third target with the smallest distance from that fifth target among the third targets of the first category can be determined, so that the third target with the smallest distance from each fifth target is determined, which may be referred to as the eighth target.

In a possible implementation manner, the fifth target whose distance to its corresponding eighth target is the largest may be selected and determined as the fourth target obtained by this feature correlation mining, as shown in the following formula:

$$u = \arg\max_{f_j \in \mathcal{F}} \; \min_{g_l \in \mathcal{G}_c} \operatorname{dist}(f_j, g_l) \quad (3)$$

In formula (3), u represents the fourth target obtained by feature correlation mining; $\operatorname{dist}(f_j, g_l)$ represents the distance between the feature information $f_j$ of the j-th fifth target and the feature information $g_l$ of the l-th third target of the first category c; $\mathcal{F}$ represents the set of feature information of the fifth targets; $\mathcal{G}_c$ represents the set of feature information of the third targets of the first category c.

In a possible implementation manner, the first sample image where the fourth target is located can be determined, and the sample image is added to the training set as the fourth sample image, thereby completing the feature correlation mining process this time.

In a possible implementation manner, the step of determining, from the fifth target, the fourth target and the first sample image where the fourth target is located according to the feature information of the third target in the third sample image, the feature information of the fifth target and the second number of sample images to be mined in each category further includes:

The determined fourth object is added to the third object of the first category, and the determined fourth object is removed from the unlabeled fifth object.

That is to say, the fourth target obtained by this feature correlation mining is regarded as a labeled target and is removed from the unlabeled targets. In this case, the feature information of the fourth target may be added to the set of feature information of the third targets of the first category c and removed from the set of feature information of the fifth targets. In this way, in the next round of feature correlation mining, the two updated sets can be mined by formula (3), and the above process can be repeated.

In a possible implementation manner, when the number of fourth sample images of the first category reaches the second number of the first category, or when the second number has not been reached but the fifth targets are exhausted (the set of feature information of the fifth targets is empty), the feature correlation mining of the first category can be completed.
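The iterative mining for a single category, including the set updates and the two stopping conditions, can be sketched as below. This is a simplified illustration, not the implementation of the disclosure: the function name `mine_category` and the identifiers in the example are hypothetical, the Euclidean distance is the example distance named above, and the sketch assumes that each mined target lies in a different sample image, so that counting targets equals counting images.

```python
import math


def mine_category(labeled_feats, unlabeled_feats, second_number):
    """Greedy k-center mining for one category.

    labeled_feats: feature vectors of the third targets of the category (non-empty).
    unlabeled_feats: dict mapping a fifth-target id to its feature vector.
    """
    labeled = list(labeled_feats)
    pool = dict(unlabeled_feats)
    mined = []
    # Stop when the quota is reached or the fifth targets are exhausted.
    while pool and len(mined) < second_number:
        # For every fifth target, take the distance to its nearest labeled target,
        # then pick the fifth target for which this distance is largest.
        u = max(pool, key=lambda j: min(math.dist(pool[j], g) for g in labeled))
        mined.append(u)
        # The mined target moves from the unlabeled set to the labeled set.
        labeled.append(pool.pop(u))
    return mined


labeled = [(0.0, 0.0)]
pool = {"p": (1.0, 0.0), "q": (5.0, 0.0), "r": (5.5, 0.0)}
mined = mine_category(labeled, pool, second_number=2)
```

In this example the first round selects `r`, the target farthest from the labeled features. In the second round `q` is close to the newly added `r`, so `p` is selected instead, which shows how the set update spreads the mined targets over the feature space.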

In this way, feature correlation mining can be performed on each category separately, and finally a sufficient number of fourth sample images (usually K sample images) can be obtained, so as to further expand the number of sample images in the training set and alleviate the imbalance between positive and negative samples.

In a possible implementation manner, manual annotation may be performed on the mined fourth sample images to obtain annotation information of the fourth sample images. Since the fourth sample images may include both positive sample images (that is, fourth sample images that include a target) and negative sample images (that is, fourth sample images that do not include a target), the annotation information of a fourth sample image can include sample category information indicating whether the image is a positive sample image or a negative sample image and, for a positive sample image, the image frame where the target is located and the category of the target.

In a possible implementation manner, after the manual annotation is completed, the target detection network may be trained in step S15 according to the annotation information of the fourth sample image and the second sample image, the third sample image and the fourth sample image in the training set.

Wherein, step S15 may include: according to the categories of the targets in the positive sample images of the training set, respectively determining the first number of samples sampled from the positive sample images of each category, the positive sample images being the sample images including the target in the image ;

Sampling the positive sample images of each category according to the first quantity sampled in the positive sample images of each category to obtain a plurality of fifth sample images;

Sampling the negative sample images of the training set to obtain a plurality of sixth sample images, where the negative sample images are sample images that do not include the target in the image;

The object detection network is trained according to the fifth sample image and the sixth sample image.

For example, the target detection network can be trained by resampling. Resampling increases the sampling frequency of low-frequency data, which improves the performance of the network on such data and further alleviates the imbalance between positive and negative samples.

In a possible implementation manner, the positive sample images and the negative sample images in the training set (including the second sample image, the third sample image and the fourth sample image) may be sampled respectively, so that the numbers of sampled positive sample images and negative sample images are the same or similar.

In a possible implementation manner, for the positive sample image, the total number of samples of the positive sample image may be preset. According to the categories of the objects in the positive sample images in the training set, the first number of samples sampled from the positive sample images of each category is determined respectively.

Similar to the previous processing process, according to the category of the target in the positive sample image, the proportion of the target of each category can be determined; according to the proportion, the sampling proportion of each category can be calculated by the following formula:

In formula (4), $R_h$ represents the sampling proportion of the positive sample images of the h-th category; $q_h$ represents the proportion of the targets of the h-th category; $t_1$ is a hyperparameter whose value is, for example, 0.1.

Through the processing of formula (4), the sampling proportion corresponding to a category with a smaller proportion can be increased, and the sampling proportion corresponding to a category with a larger proportion can be reduced, so as to alleviate the imbalance in the number of positive sample images of different categories and improve the training effect of the network.

In a possible implementation manner, the first number of positive sample images of each category may be determined according to the sampling proportion of positive sample images of each category and the total number of samples of positive sample images.

In a possible implementation manner, for any category, a first number of positive sample images may be randomly sampled from the positive sample images of the category according to the first number of the category, as the fifth sample image. The positive sample images of each category are sampled respectively, and the fifth sample image with the total number of samples can be obtained.

In a possible implementation manner, for the negative sample images, the negative sample images in the training set can be directly randomly sampled according to the preset total number of samples to obtain the sixth sample image with the total number of samples. The total number of samples of negative sample images may be the same as or different from the total number of samples of positive sample images, which is not limited in this embodiment of the present disclosure.
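The sampling of the fifth and sixth sample images can be sketched as follows. The disclosure only states that the images are randomly sampled; sampling with replacement (so that a rare category can be drawn more often than it occurs) is an assumption of this sketch, as are the function name `resample_training_set`, its parameters and the example identifiers.

```python
import random


def resample_training_set(positives_by_category, negatives, first_numbers,
                          negative_total, seed=0):
    """Draw fifth (positive) and sixth (negative) sample images for training."""
    rng = random.Random(seed)
    fifth = []
    for category, images in positives_by_category.items():
        # Assumed: sampling with replacement, first_numbers[category] draws per category.
        fifth.extend(rng.choices(images, k=first_numbers[category]))
    # Negative sample images are drawn directly, without a per-category split.
    sixth = rng.choices(negatives, k=negative_total)
    return fifth, sixth


positives = {"car": ["c0", "c1", "c2", "c3"], "scooter": ["s0"]}
negatives = ["n0", "n1", "n2"]
fifth, sixth = resample_training_set(
    positives, negatives, first_numbers={"car": 3, "scooter": 3}, negative_total=6
)
```

In this example the single scooter image is drawn three times, so both categories contribute equally to the fifth sample images, and the numbers of positive and negative samples are the same.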

In a possible implementation manner, the target detection network can be trained according to the fifth sample images and the sixth sample images. That is, the fifth and sixth sample images are respectively input into the target detection network to obtain their target detection results; the loss of the target detection network is determined according to the target detection results and the annotation information; and the parameters of the target detection network are adjusted by back-propagating the loss. After multiple rounds of iteration, the trained target detection network is obtained when a preset condition (such as network convergence) is satisfied.

In this way, the detection effect of the trained object detection network for long-tailed images can be significantly improved.

In a possible implementation manner, before step S11, the step of pre-training the target detection network by using the labeled third sample image can also be performed by the above-mentioned resampling training method, thereby improving the pre-training effect of the target detection network.

In practical applications, the entire process of steps S11-S15 can be repeated to achieve continuous incremental training. That is to say, when unlabeled sample images are collected again, the target detection network obtained by this training can be used as the initial target detection network, the expanded training set can be used as the initial training set, and the process of pseudo-labeling, feature correlation mining and resampling training can be repeated, so as to continuously improve the performance of the target detection network.

FIG. 2 shows a schematic diagram of a processing procedure of a network training method according to an embodiment of the present disclosure. As shown in FIG. 2, the data source includes a large number of unlabeled first sample images 20. The first sample images 20 are input into the target detection network for prediction, and the target detection result 21 of each first sample image 20 is obtained. The target detection result 21 includes the image area (not shown), the feature vector and the classification probability of the target in the first sample image.

As shown in FIG. 2, in this example, the target detection network may include a CNN backbone network 211, a feature pyramid network (FPN) 212, and a fully connected network 213, such as a bbox head. After the first sample image 20 is input into the target detection network, it is processed by the CNN backbone network 211 and the FPN 212 to obtain a feature map 214 of the first sample image, and the feature map 214 is processed by the fully connected network 213 to obtain the target detection result 21.

In this example, the category confidence of the target can be determined according to the classification probability of the target. For the first targets whose category confidence is greater than or equal to the first threshold (for example, 0.99), the first sample image where each first target is located is used as the second sample image 22, and pseudo-labeling is performed on the second sample image 22; that is, the image area of the first target and the category corresponding to the category confidence of the first target are used as the annotation information of the second sample image 22. The labeled second sample image 22 is added to the training set 25, thereby expanding the positive samples in the training set.

In this example, for the second targets whose category confidence is less than the first threshold, a certain number of fifth targets are selected by the bootstrapping method, and the sample images 23 where the fifth targets are located are obtained. According to the feature vector (not shown) of the third target in the labeled third sample image in the training set, feature correlation mining is performed on the fifth targets to determine the fourth target and the first sample image where the fourth target is located, which is used as the fourth sample image 24. The fourth sample image 24 is manually labeled and added to the training set 25, so as to further expand the labeled images in the training set.

In this example, after two expansions, the training set 25 includes the labeled second sample image, third sample image and fourth sample image. The training set 25 is resampled to balance the number of positive and negative samples and the number of positive samples of different categories, so as to obtain a resampled training set 26; the target detection network is then trained according to the resampled training set 26, thereby completing the entire training process.

According to an embodiment of the present disclosure, a target detection method is also provided, the method comprising:

Inputting the image to be processed into the target detection network for processing to obtain the target detection result of the image to be processed, where the target detection result includes the position and category of the target in the image to be processed, and the target detection network is trained according to the above-mentioned network training method.

That is to say, the target detection network trained by the above method can be deployed to realize the target detection of the image to be processed. The image to be processed may be, for example, an image collected by an image collection device (eg, a camera), and the image may include a target to be detected, such as a human body, a face, a vehicle, an object, and the like. This embodiment of the present disclosure does not limit this.

In a possible implementation manner, the to-be-processed image may be input into a target detection network for processing to obtain a target detection result of the to-be-processed image. The target detection result includes the position and category of the target in the image to be processed, such as the detection frame where the face in the image to be processed is located and the identity corresponding to the face.

In this way, the detection accuracy of target detection can be improved, and target detection of large-scale long-tail image data can be realized.

According to the network training method of the embodiment of the present disclosure, an active learning mining method is used to mine potential unlabeled data, and a semi-supervised learning method is used to assist in labeling unlabeled data and to expand the quantity of positive sample data. This addresses the problems of large data size and difficult positive sample collection in large-scale long-tail detection and, to a certain extent, alleviates the imbalance between positive and negative samples, so that the model performance is effectively improved under limited annotation and computing resources.

According to the network training method of the embodiment of the present disclosure, the target detection network is trained by means of resampling, which can address the negative impact of the imbalance between positive and negative samples on network training and alleviate the negative impact of the imbalance between positive samples of different categories, so that the target detection network can effectively converge during training and the network performance is improved.

According to the network training method of the embodiment of the present disclosure, by using the active learning method, potentially high-value samples that help improve the model can be mined from a huge amount of unlabeled data, and the model performance can be effectively improved under limited labeling and computing resources, saving much of the manpower and computing cost required to apply deep learning models to new businesses. By using the resampling method, the target detection network can be effectively trained on unbalanced samples without much manual parameter tuning, which saves the labor cost required to apply deep learning models to new businesses.

The network training method according to the embodiment of the present disclosure can be applied to fields such as intelligent video analysis and security. With limited labor and computing resources, the method can be used to iterate online on a detection network for potential targets in intelligent video analysis or intelligent monitoring, quickly meeting the performance requirements of the business at low labor and computing cost, while allowing network performance to be improved continuously afterwards.

The network training method of the embodiments of the present disclosure can be applied to online intelligent video analysis or intelligent monitoring, so as to rapidly iterate online potential target detection applications in intelligent video analysis or intelligent monitoring under limited labor and computing resources. It can quickly achieve the performance required by the business with less labor and computing cost, and the performance of the model can continue to be improved afterwards.

It can be understood that the above method embodiments mentioned in the embodiments of the present disclosure can be combined with each other to form a combined embodiment without violating the principle and logic. Those skilled in the art can understand that, in the above method of the specific embodiment, the specific execution order of each step should be determined by its function and possible internal logic.

In addition, the embodiments of the present disclosure also provide a network training apparatus, a target detection apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any network training method or target detection method provided by the embodiments of the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding records in the method section, which will not be repeated.

FIG. 3 shows a block diagram of a network training apparatus according to an embodiment of the present disclosure. The apparatus includes a processor (not shown in FIG. 3) for executing program parts stored in a memory (not shown in FIG. 3); as shown in FIG. 3, the program parts stored in the memory include:

The target detection part 31 is configured to input the unlabeled first sample image into the target detection network for processing to obtain the target detection result of the first sample image, where the target detection result includes the image area, feature information and classification probability of the target in the first sample image;

a confidence level determination part 32, configured to determine the category confidence level of the object according to the classification probability of the object;

The labeling part 33 is configured to, for a first target among the targets whose category confidence is greater than or equal to the first threshold, take the first sample image where the first target is located as a labeled second sample image and add it to the training set, where the annotation information of the second sample image includes the image area of the first target and the category corresponding to the category confidence of the first target, and the training set includes the labeled third sample image;

The feature mining part 34 is configured to, for a second target among the targets whose category confidence is less than the first threshold, perform feature correlation mining on the second target according to the feature information of the third target in the third sample image, determine, from the second target, the fourth target and the first sample image where the fourth target is located, and add the first sample image where the fourth target is located to the training set as the fourth sample image;

The training part 35 is configured to train the target detection network according to the label information of the fourth sample image, the second sample image, the third sample image and the fourth sample image in the training set.

In a possible implementation manner, the apparatus further includes: a pre-training part configured to pre-train the target detection network by using the labeled third sample image.

In a possible implementation manner, the sampling quantity determination sub-section is further configured to: before respectively determining, according to the categories of the targets in the positive sample images of the training set, the first number of samples sampled from the positive sample images of each category, sample the positive sample images and negative sample images in the training set to obtain the same or similar numbers of positive sample images and negative sample images.

In a possible implementation manner, the target selection subsection is further configured to: remove, from the sixth targets, the targets that are the same as the seventh targets to obtain the remaining targets of the sixth targets that differ from the seventh targets; and take the remaining targets and the seventh targets as the fifth targets.

In a possible implementation manner, the method further includes: after the third target with the smallest distance from each fifth target among the third targets of the first category is respectively determined as the eighth target according to the distance between the feature information of the third target of the first category and the feature information of each fifth target, ending the determination of the eighth target when the number of first sample images where the fourth target is located reaches the second number of sample images to be mined of the first category.

In a possible implementation manner, the target and image determination subsection is further configured to: after determining, according to the distance between the feature information of the third targets of the first category and the feature information of each fifth target, the third target of the first category with the smallest distance from each fifth target and using it as the eighth target, end the determination of the eighth target when the number of first sample images where the fourth target is located has not reached the second number of sample images to be mined for the first category and the set storing the feature information of the fifth targets is empty.
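The two termination conditions above can be seen in a small sketch of the mining loop for one category. It follows one reading of the claims: for each fifth target, find its nearest third target of the category, then select the fifth target whose nearest distance is largest, add it to the labeled pool, and remove it from the candidates. The function name `mine_category`, the use of Euclidean distance, and the simplification of counting one image per selected target are all assumptions.

```python
import math


def mine_category(labeled_feats, candidate_feats, quota):
    """Greedy farthest-first selection for a single category.

    labeled_feats:   non-empty list of feature vectors of third targets.
    candidate_feats: dict mapping a fifth-target id to its feature vector.
    quota:           second number of sample images to mine for this category.
    """
    labeled = list(labeled_feats)
    pool = dict(candidate_feats)
    mined = []
    # Stop when the quota is reached, or when no candidates remain.
    while len(mined) < quota and pool:
        # Distance from each candidate to its nearest labeled feature.
        nearest = {
            t: min(math.dist(f, g) for g in labeled) for t, f in pool.items()
        }
        # Choose the candidate that is farthest from everything labeled so far.
        chosen = max(nearest, key=nearest.get)
        mined.append(chosen)
        # The chosen target joins the labeled pool and leaves the candidates.
        labeled.append(pool.pop(chosen))
    return mined
```

Under this reading, each selected target is the one least well represented by the currently labeled features, so the mined images tend to cover regions of feature space that the labeled set misses.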

In a possible implementation manner, the feature extraction part is further configured to: input the third sample image into the target detection network to obtain a feature vector output by a hidden layer of the target detection network; and determine the feature vector as the feature information of the third target.

According to an aspect of the present disclosure, there is provided a target detection apparatus, the apparatus including: a detection processing part configured to input an image to be processed into a target detection network for processing, and obtain a target detection result of the to-be-processed image, where the target detection result includes the position and category of the target in the to-be-processed image, and the target detection network is trained according to the above-mentioned network training method.

In some embodiments, the functions or included parts of the apparatus provided in the embodiments of the present disclosure may be configured to execute the methods described in the above method embodiments, and the specific implementation may refer to the above method embodiments. For brevity, details are not repeated here.

In some embodiments, a "part" may be part of a circuit, part of a processor, or part of a program or software; it may also be a unit, and it may be modular or non-modular.

Embodiments of the present disclosure further provide a computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the foregoing method is implemented. The computer-readable storage medium may be a non-volatile computer-readable storage medium.

An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory configured to store instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.

Embodiments of the present disclosure also provide a computer program product, including computer-readable code; when the computer-readable code is run on a device, a processor in the device executes instructions configured to implement the network training method or the target detection method provided in any of the above embodiments.

Embodiments of the present disclosure also provide another computer program product configured to store computer-readable instructions, which, when executed, cause the computer to perform the operations of the network training method or the target detection method provided by any of the foregoing embodiments.

The electronic device may be provided as a terminal, server or other form of device.

FIG. 4 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure. For example, electronic device 800 may be a terminal such as a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, or personal digital assistant.

Referring to FIG. 4, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, phone calls, data communications, camera operations, and recording operations. The processing component 802 can include one or more processors 820 to execute instructions to perform all or some of the steps of the methods described above. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.

Memory 804 is configured to store various types of data to support operation of the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like. Memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.

Power supply component 806 provides power to various components of electronic device 800. Power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 800.

Multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundaries of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.

Audio component 810 is configured to output and/or input audio signals. For example, audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when electronic device 800 is in an operating mode, such as a calling mode, recording mode, or voice recognition mode. The received audio signal may be further stored in memory 804 or transmitted via communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.

Sensor component 814 includes one or more sensors for providing status assessments of various aspects of electronic device 800. For example, the sensor component 814 can detect the on/off state of the electronic device 800 and the relative positioning of components, such as the display and the keypad of the electronic device 800. The sensor component 814 can also detect a change in the position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and changes in the temperature of the electronic device 800. Sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. Sensor component 814 may also include a light sensor, such as a complementary metal oxide semiconductor (CMOS) or charge coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

Communication component 816 is configured to facilitate wired or wireless communication between electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as wireless network (WiFi), second generation mobile communication technology (2G) or third generation mobile communication technology (3G), or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, electronic device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, and is used to perform the above method.

In an exemplary embodiment, a non-volatile computer-readable storage medium, such as a memory 804 comprising computer program instructions executable by the processor 820 of the electronic device 800 to perform the above method is also provided.

FIG. 5 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. Referring to FIG. 5, electronic device 1900 includes processing component 1922, which further includes one or more processors, and a memory resource represented by memory 1932 configured to store instructions executable by processing component 1922, such as an application program. An application program stored in memory 1932 may include one or more modules, each corresponding to a set of instructions. Additionally, the processing component 1922 is configured to execute instructions to perform the above-described methods.

The electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical user interface based operating system introduced by Apple (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open source Unix-like operating system (Linux™), the open source Unix-like operating system (FreeBSD™) or the like.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as memory 1932 comprising computer program instructions executable by processing component 1922 of electronic device 1900 to perform the above-described method.

Embodiments of the present disclosure may be systems, methods and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the embodiments of the present disclosure.

A computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically coded devices such as punch cards or raised structures in grooves having instructions stored thereon, and any suitable combination of the above. Computer-readable storage media, as used herein, are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.

Computer program instructions for carrying out operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be personalized by utilizing state information of the computer readable program instructions, and these electronic circuits can execute the computer readable program instructions to implement various aspects of the present disclosure.

Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowchart and/or block diagrams. These computer readable program instructions can also be stored in a computer readable storage medium; the instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer readable medium on which the instructions are stored comprises an article of manufacture including instructions that implement various aspects of the functions/acts specified in one or more blocks of the flowchart and/or block diagrams.

Computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, thereby causing the instructions executing on the computer, other programmable data processing apparatus, or other equipment to implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or actions, or can be implemented in a combination of dedicated hardware and computer instructions.

The computer program product can be specifically implemented by hardware, software or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).

Various embodiments of the present disclosure have been described above, and the foregoing descriptions are exemplary, not exhaustive, and not limiting of the disclosed embodiments. Numerous modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the various embodiments, the practical application or improvement over the technology in the marketplace, or to enable others of ordinary skill in the art to understand the various embodiments disclosed herein.

Industrial Applicability

The embodiments of the present disclosure relate to a network training method and apparatus, a target detection method and apparatus, and an electronic device. The network training method includes: inputting unlabeled sample images into a target detection network for processing to obtain a target detection result, the result including the image area, feature information and classification probability of the target; determining the category confidence of the target according to the classification probability of the target; for a first target whose category confidence is greater than or equal to a first threshold, using the sample image where the first target is located as a labeled image and adding it to the training set; for a second target whose category confidence is less than the first threshold, performing feature correlation mining on the second target, determining a fourth target from the second targets, and adding the sample image where the fourth target is located to the training set; and training the target detection network according to the sample images in the training set. The embodiments of the present disclosure can improve the training effect of the target detection network.
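The confidence-based split summarized above can be illustrated with a short sketch. It assumes that the category confidence is the maximum of the classification probabilities and that the pseudo-label is the index of that maximum; the function name `split_by_confidence` and the tuple layout of a detection are hypothetical.

```python
def split_by_confidence(detections, threshold):
    """Separate confident detections from those left for feature mining.

    detections: list of (image_id, box, class_probs) tuples.
    threshold:  the first threshold on category confidence.
    Returns (pseudo_labeled, uncertain).
    """
    pseudo_labeled, uncertain = [], []
    for image_id, box, probs in detections:
        confidence = max(probs)  # assumed definition of category confidence
        if confidence >= threshold:
            # Confident: keep the box and the most probable class as the label.
            pseudo_labeled.append((image_id, box, probs.index(confidence)))
        else:
            # Not confident: keep the probabilities for later mining.
            uncertain.append((image_id, box, probs))
    return pseudo_labeled, uncertain
```

Detections in the first list correspond to first targets whose images are added to the training set directly, while the second list holds the second targets that are passed on to feature correlation mining.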

Claims

A network training method comprising:

Inputting an unlabeled first sample image into a target detection network for processing to obtain a target detection result of the first sample image, the target detection result including the image area, feature information and classification probability of a target in the first sample image;

According to the classification probability of the target, determine the category confidence of the target;

For a first target, among the targets, whose category confidence is greater than or equal to a first threshold, taking the first sample image where the first target is located as a labeled second sample image and adding it to a training set, wherein the labeling information of the second sample image includes the image area of the first target and the category corresponding to the category confidence of the first target, and the training set includes a labeled third sample image;

For a second target, among the targets, whose category confidence is less than the first threshold, performing feature correlation mining on the second target according to the feature information of a third target in the third sample image, determining, through the feature correlation mining, a fourth target and the first sample image where the fourth target is located from among the second targets, and adding the first sample image where the fourth target is located to the training set as a fourth sample image;

The target detection network is trained according to the label information of the fourth sample image, the second sample image, the third sample image and the fourth sample image in the training set.
The method according to claim 1, wherein training the target detection network according to the label information of the fourth sample image, the second sample image, the third sample image and the fourth sample image in the training set includes:

According to the category of the target in the positive sample images of the training set, respectively determine the first number of samples sampled from the positive sample images of each category, and the positive sample images are sample images including the target in the image;

Sampling the positive sample images of each category according to the first quantity sampled in the positive sample images of each category to obtain a plurality of fifth sample images;

Sampling the negative sample images of the training set to obtain a plurality of sixth sample images, where the negative sample images are sample images that do not include the target in the image;

The object detection network is trained according to the fifth sample image and the sixth sample image.
The method according to claim 1 or 2, wherein performing feature correlation mining on the second target according to the feature information of the third target in the third sample image, and determining, through the feature correlation mining, the fourth target and the first sample image where the fourth target is located from the second target, includes:

According to the classification probability of the second target, determine the information entropy of the second target;

selecting a fifth target from the second target according to the category confidence and information entropy of the second target;

According to the category of the third target in the third sample image and the total number of sample images to be mined, respectively determine the second number of sample images to be mined in each category;

According to the feature information of the third target in the third sample image, the feature information of the fifth target and the second number of sample images to be mined in each category, determine, from the fifth target, the fourth target and the first sample image where the fourth target is located.
The method according to claim 3, wherein, selecting a fifth target from the second target according to the category confidence and information entropy of the second target, comprising:

According to the category confidence and information entropy of the second target, the second targets are sorted respectively, and a third number of sixth targets and a fourth number of seventh targets are selected;

The sixth target and the seventh target are combined to obtain the fifth target.
The method according to claim 3 or 4, wherein determining the second number of sample images to be mined for each category according to the category of the third target in the third sample image and the total number of sample images to be mined includes:

According to the category of the third object in the third sample image, determine the proportion of the third object of each category;

According to the proportion of the third target of each category, determine the sampling proportion of each category;

According to the sampling proportion of each category, the second quantity of sample images to be mined in each category is determined respectively.
The method according to any one of claims 3-5, wherein determining, from the fifth target, the fourth target and the first sample image where the fourth target is located according to the feature information of the third target in the third sample image, the feature information of the fifth target and the second number of sample images to be mined in each category includes:

According to the distance between the feature information of the third targets of a first category and the feature information of each fifth target, determine, for each fifth target, the third target of the first category with the smallest distance from that fifth target, and use it as an eighth target, the first category being any one of the categories of the third target;

The target with the largest distance among the eighth targets is determined as the fourth target.
The method according to claim 6, wherein determining, from the fifth target, the fourth target and the first sample image where the fourth target is located according to the feature information of the third target in the third sample image, the feature information of the fifth target and the second number of sample images to be mined in each category further includes:

The determined fourth object is added to the third object of the first category, and the determined fourth object is removed from the unlabeled fifth object.
The method according to any one of claims 1-7, wherein the method further comprises:

The third sample image is input into the target detection network for processing to obtain feature information of the third target in the third sample image.
The method according to any one of claims 1-8, wherein before the step of inputting the unlabeled first sample image into a target detection network for processing to obtain a target detection result of the first sample image, the method further includes:

The target detection network is pre-trained by using the labeled third sample images.
The method of any one of claims 1-9, wherein the first sample image comprises a long-tail image.
The method according to claim 2, wherein, before the step of respectively determining the first number of samples sampled from the positive sample images of each category according to the categories of the targets in the positive sample images of the training set, the method further includes:

Sampling the positive sample images and negative sample images in the training set to obtain the same or similar number of positive sample images and negative sample images.
The method according to claim 3, wherein the total number of the sample images to be mined is 5% to 25% of the total number of the first sample images.
The method according to claim 4, wherein the combining the sixth target and the seventh target to obtain the fifth target comprises:

Remove the same target as the seventh target in the sixth target, and obtain the remaining target that is different from the seventh target in the sixth target;

The remaining target and the seventh target are taken as the fifth target.
The method according to claim 6, wherein after determining, according to the distance between the feature information of the third targets of the first category and the feature information of each fifth target, the third target of the first category with the smallest distance from each fifth target and using it as the eighth target, the method further includes:

When the number of the first sample images where the fourth target is located reaches the second number of the sample images to be mined of the first category, the determination of the eighth target is ended.
The method according to claim 7, wherein after determining, according to the distance between the feature information of the third targets of the first category and the feature information of each fifth target, the third target of the first category with the smallest distance from each fifth target and using it as the eighth target, the method further includes:

When the number of the first sample images where the fourth target is located has not reached the second number of the sample images to be mined of the first category, and the set storing the feature information of the fifth target is empty, the determination of the eighth target is ended.
The method according to claim 8, wherein the inputting the third sample image into the target detection network for processing to obtain the feature information of the third target in the third sample image comprises:

Inputting the third sample image into the target detection network to obtain the feature vector output by the hidden layer of the target detection network;

The feature vector is determined as feature information of the third target.
A target detection method, the method comprising:

Input the to-be-processed image into a target detection network for processing to obtain a target detection result of the to-be-processed image, where the target detection result includes the position and category of the target in the to-be-processed image,

The target detection network is obtained by training according to the network training method of any one of claims 1-10.
A network training device, comprising:

The target detection part is configured to input the unlabeled first sample image into the target detection network for processing, and obtain the target detection result of the first sample image, the target detection result including the image area, feature information and classification probability of the target in the first sample image;

a confidence determination part, configured to determine the category confidence of the target according to the classification probability of the target;

a labeling part, configured to, for a first target among the targets whose category confidence is greater than or equal to a first threshold, take the first sample image where the first target is located as a labeled second sample image and add it to a training set, wherein the labeling information of the second sample image includes the image area of the first target and the category corresponding to the category confidence of the first target, and the training set includes a labeled third sample image;

a feature mining part, configured to, for a second target among the targets whose category confidence is less than the first threshold, perform feature correlation mining on the second target according to the feature information of a third target in the third sample image, determine from the second targets a fourth target and the first sample image where the fourth target is located, and add the first sample image where the fourth target is located to the training set as a fourth sample image;

a training part, configured to train the target detection network according to the labeling information of the fourth sample image, and the second sample image, the third sample image and the fourth sample image in the training set.
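The division of labor between the confidence determination part, the labeling part and the feature mining part can be sketched roughly as below. This is a minimal sketch under stated assumptions: the category confidence is taken to be the largest classification probability, and the dictionary fields `image`, `box` and `probs` as well as the function name `split_by_confidence` are hypothetical.

```python
def split_by_confidence(detections, threshold):
    """Split detections into pseudo-labeled first targets and second targets to mine.

    Each detection is a dict with hypothetical keys:
      "image": identifier of the unlabeled first sample image,
      "box":   image area of the target,
      "probs": classification probabilities over the categories.
    """
    pseudo_labeled, to_mine = [], []
    for det in detections:
        probs = det["probs"]
        # Assumption: category confidence = maximum classification probability.
        category = max(range(len(probs)), key=probs.__getitem__)
        confidence = probs[category]
        if confidence >= threshold:
            # First target: its image joins the training set with this label.
            pseudo_labeled.append(
                {"image": det["image"], "box": det["box"], "category": category}
            )
        else:
            # Second target: left for feature correlation mining.
            to_mine.append(det)
    return pseudo_labeled, to_mine


detections = [
    {"image": "img_001", "box": (0, 0, 10, 10), "probs": [0.9, 0.1]},
    {"image": "img_002", "box": (5, 5, 20, 20), "probs": [0.55, 0.45]},
]
labeled, candidates = split_by_confidence(detections, threshold=0.8)
```

With a threshold of 0.8, the first detection is pseudo-labeled with category 0, and the second is passed on as a candidate for feature correlation mining.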
An electronic device comprising:

a processor;

a memory configured to store processor-executable instructions;

wherein the processor is configured to invoke the instructions stored in the memory to execute the network training method as claimed in any one of claims 1 to 16, or to execute the target detection method as claimed in claim 17.
A computer-readable storage medium on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the network training method according to any one of claims 1 to 16, or the target detection method according to claim 17.