CN114898201A - Target detection method, device, equipment and medium - Google Patents

Target detection method, device, equipment and medium

Info

Publication number
CN114898201A
Authority
CN
China
Prior art keywords
image
target frame
sample
training
network
Prior art date
Legal status
Granted
Application number
CN202210807521.2A
Other languages
Chinese (zh)
Other versions
CN114898201B (en)
Inventor
杨雪峰
余言勋
王亚运
刘智辉
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202210807521.2A priority Critical patent/CN114898201B/en
Publication of CN114898201A publication Critical patent/CN114898201A/en
Application granted granted Critical
Publication of CN114898201B publication Critical patent/CN114898201B/en
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/764 Using classification, e.g. of video objects
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a target detection method, a target detection device, target detection equipment and a target detection medium, and relates to the technical field of computer vision. The method comprises the following steps: an image to be processed is input into a target detection network to determine whether the image to be processed contains an object to be searched of a target type and the position area of the object to be searched in the image. The target detection network is obtained by training based on an annotated target frame and predicted target frames annotated with a first sample type, wherein the first sample type of a predicted target frame is determined according to the similarity between the image features of the training image in the predicted target frame and the image features of the training image in the annotated target frame. Therefore, in the training stage of the target detection network, the model is corrected not only based on the coordinate difference between the predicted target frame and the annotated target frame, but also based on the image feature similarity information between the predicted target frame and the annotated target frame, so that the detection precision of the model is improved.

Description

Target detection method, device, equipment and medium
Technical Field
The embodiment of the application relates to the technical field of computer vision, in particular to a target detection method, device, equipment and medium.
Background
Object detection is one of the computer vision tasks. Unlike classification and recognition tasks, target detection needs to determine not only whether an image contains the object to be searched, but also the position area of the object to be searched in the image. When a neural network model for target detection is trained, in the related art, the sample type of each preset predicted target frame (Anchor) is mostly determined according to the Intersection over Union (IoU) between that predicted target frame and the annotated target frame (Ground Truth, GT), and the model parameters are corrected according to the classification loss and coordinate regression loss between each predicted target frame and the annotated target frame after the sample type is determined, so that the trained model can identify the object to be searched in the annotated target frame and can also determine the position area of the object to be searched in the image. However, when the intersection-over-union ratio between the predicted target frame and the annotated target frame is high but the image feature difference is large, the detection accuracy of the network model is affected if the above method is adopted for training.
Disclosure of Invention
The embodiment of the application provides a target detection method, a target detection device, target detection equipment and a target detection medium, which are used for improving the detection precision of a target detection model.
In a first aspect, an embodiment of the present application provides a target detection method, where the method includes:
inputting an image to be processed into a trained target detection network, identifying the image to be processed through the target detection network, and determining whether the image to be processed contains an object to be searched of a target type and a position area of the object to be searched in the image to be processed;
the target detection network is obtained by training at least based on an annotated target frame and a predicted target frame which is annotated with a first sample type, wherein the first sample type of the predicted target frame is determined based on image feature similarity information, and the image feature similarity information is determined according to the similarity between the image features of a training image in each predicted target frame of the first sample type and the image features of the training image in the annotated target frame; the marking target frame is a target frame for marking the object to be searched in the training image.
In the embodiment of the application, the image to be processed is input into the target detection network to determine whether the image to be processed contains the object to be searched of the target type and the position area of the object to be searched in the image. The target detection network is obtained by training based on an annotated target frame and predicted target frames annotated with a first sample type, wherein the first sample type of a predicted target frame is determined according to the similarity between the image features of the training image in the predicted target frame and the image features of the training image in the annotated target frame. Therefore, in the training stage of the target detection network, the model is corrected not only based on the coordinate difference between the predicted target frame and the annotated target frame, but also based on the image feature similarity information between the predicted target frame and the annotated target frame, so that the detection precision of the model is improved.
In some possible embodiments, the image feature similarity information is obtained through a trained contrast learning network, and the contrast learning network is obtained through training based on the training image and a prediction target frame labeled with a second sample type;
and determining the second sample type of the prediction target frame according to the intersection ratio between the prediction target frame and the labeling target frame.
In the embodiment of the application, a contrast learning mechanism is introduced: a contrast learning network is trained based on the labeling target frame and predicted target frames whose sample types are determined by the intersection-over-union ratio, so that the trained contrast learning network can identify the image feature similarity information between a predicted target frame and the labeling target frame; the target detection network is then trained according to this image feature similarity information to improve the detection precision of the model.
In some possible embodiments, the first sample type is obtained by:
inputting each feature image into the trained contrast learning network, wherein each feature image is a plurality of images determined according to the image features of the training images in the prediction target frames of the second sample types; performing the following process through the comparative learning network:
for each feature image, determining the image feature similarity information according to the similarity of the feature image and the image feature of the labeling target frame;
and determining a first sample type of the prediction target frame corresponding to the feature image according to the comparison result of the image feature similarity information and the similarity threshold.
According to the embodiment of the application, the feature images determined according to the image features of the training image in the predicted target frames of each second sample type are input into the trained contrast learning network, the similarity between each feature image and the image features in the labeling target frame is identified through the contrast learning network, and the sample type of the predicted target frame corresponding to each feature image is then reassigned according to the similarity, so as to obtain the first sample type of the predicted target frame.
In some possible embodiments, the trained target detection network is obtained by:
inputting the training image into a basic detection network, training the basic detection network in an iteration mode, determining whether a first convergence condition is met according to a detection loss value obtained by each iteration of the basic detection network, and taking the basic detection network as the target detection network after the first convergence condition is met; each iteration is as follows:
for any one first sample type of predicted target frame, determining a classification loss value and a coordinate regression loss value between the predicted target frame and the labeled target frame based on basic detection network parameters before the iteration of the current round;
determining a designated target frame selected in the iteration according to a coordinate regression loss value, inputting the designated target frame and the labeled target frame into the contrast learning network, determining the similarity between the image features of the training image in the designated target frame and the image features of the training image in the labeled target frame through the contrast learning network, and determining a contrast loss value based on the similarity;
and determining the detection loss value based on the classification loss value, the coordinate regression loss value and the contrast loss value, and adjusting the basic detection network parameters according to the detection loss value.
In the embodiment of the application, multiple rounds of training are performed on the basic detection network in an iterative manner until the detection loss value obtained by iteration meets the preset convergence condition. In each iteration, a classification loss value and a coordinate regression loss value between each predicted target frame of the first sample type and the labeled target frame are first determined based on the network parameters before the current iteration; the designated target frame with the highest confidence is then determined according to the coordinate regression loss values, the image feature similarity between the designated target frame and the labeled target frame is determined through the contrast learning network, and a contrast loss value is determined according to the feature similarity. Finally, the detection loss value used for correcting the basic detection network is determined based on the classification loss value, the coordinate regression loss value and the contrast loss value, and the basic detection network parameters are adjusted according to the detection loss value. Since the first sample type in the training process is determined according to the image feature similarity between the predicted target frame and the labeled target frame, the recognition precision of the target detection network is improved.
In some possible embodiments, the determining the designated target box selected in the current iteration according to the coordinate regression loss value includes:
and taking the predicted target frame with the minimum coordinate regression loss value as the specified target frame.
In the embodiment of the application, the predicted target frame of the first sample type whose coordinate regression loss value with respect to the labeling target frame is minimum is used as the specified target frame, and the contrast loss value used for training the basic detection network is determined according to the image feature similarity between the specified target frame and the labeling target frame.
In some possible embodiments, the trained comparative learning network is obtained by:
inputting a sample image into a basic learning network, training the basic learning network in an iterative mode, determining whether a second convergence condition is met according to a learning loss value obtained by each iteration of the basic learning network, and taking the basic learning network as the comparison learning network after the second convergence condition is met; wherein each sample image pair comprises two sample images with different image characteristics, and the sample images are determined according to the image characteristics of the training images in the prediction target frame of the second sample type; each iteration is as follows:
determining the feature similarity between the sample images in the sample image pairs based on the basic learning network parameters before the iteration of the current round aiming at each sample image pair;
determining a contrast threshold according to the second sample type of the prediction target frame corresponding to each sample image;
and determining the learning loss value according to the comparison result of the feature similarity and the comparison threshold, and adjusting the basic learning network parameters according to the learning loss value.
In the embodiment of the application, the basic learning network is trained for multiple rounds in an iterative manner until the learning loss value obtained by iteration meets the convergence condition. In each iteration, the feature similarity between the sample images in a sample image pair is first determined based on the network parameters before the current iteration, a contrast threshold is then determined according to the second sample types of the predicted target frames corresponding to the sample images, and the learning loss value used for training the network model is determined accordingly. Since the sample images used for training are determined from the training image according to the image features in the predicted target frames of the second sample type, after the trained contrast learning network is obtained, the specified target frame selected based on the first sample type can be input into the contrast learning network to determine the contrast loss value between the specified target frame and the labeled target frame, and the detection loss value used for training the target detection network is determined based on the contrast loss value, so that the recognition precision of the target detection network is improved.
In some possible embodiments, the determining a contrast threshold according to the second sample type of the prediction target frame corresponding to each sample image includes:
if the second sample types of the prediction target frames corresponding to the sample images are the same, taking a first threshold value as the comparison threshold value;
if the types of second samples of the corresponding prediction target frames of the sample images are different, taking a second threshold value as the comparison threshold value; wherein the second threshold is less than the first threshold.
In the embodiment of the application, the contrast threshold corresponding to a sample image pair whose sample images are of the same type is set to be larger, and otherwise it is set to be smaller, so as to realize the network optimization goal, which is to decrease the distance between the feature vectors of the training image corresponding to predicted target frames of the second sample type belonging to the same sample type, and to increase the distance otherwise. In the process of training the basic detection network, the first sample type of a predicted target frame is determined according to the feature recognition result of the trained contrast learning network on the predicted target frame and the labeled target frame.
In a second aspect, an embodiment of the present application provides an object detection apparatus, including:
the image recognition module is configured to input a to-be-processed image into a trained target detection network, and recognize the to-be-processed image through the target detection network;
the detection information module is configured to determine whether the image to be processed contains an object to be searched of a target type and a position area of the object to be searched in the image to be processed;
the target detection network is obtained by training at least based on an annotated target frame and a predicted target frame which is annotated with a first sample type, wherein the first sample type of the predicted target frame is determined based on image feature similarity information, and the image feature similarity information is determined according to the similarity between the image features of a training image in each predicted target frame of the first sample type and the image features of the training image in the annotated target frame; the marking target frame is a target frame for marking the object to be searched in the training image.
In some possible embodiments, the image feature similarity information is obtained through a trained contrast learning network, and the contrast learning network is obtained through training based on the training image and a prediction target frame labeled with a second sample type;
and determining the second sample type of the prediction target frame according to the intersection ratio between the prediction target frame and the labeling target frame.
In some possible embodiments, the first sample type is obtained by:
inputting each feature image into the trained contrast learning network, wherein each feature image is a plurality of images determined according to the image features of the training images in the prediction target frames of the second sample types; performing the following process through the comparative learning network:
for each feature image, determining the image feature similarity information according to the similarity of the feature image and the image feature of the labeling target frame;
and determining a first sample type of the prediction target frame corresponding to the feature image according to the comparison result of the image feature similarity information and the similarity threshold.
In some possible embodiments, the apparatus further comprises:
a basic detection network training module, configured to obtain the trained target detection network in the following manner:
inputting the training image into a basic detection network, training the basic detection network in an iteration mode, determining whether a first convergence condition is met according to a detection loss value obtained by each iteration of the basic detection network, and taking the basic detection network as the target detection network after the first convergence condition is met; each iteration is as follows:
for any one first sample type of predicted target frame, determining a classification loss value and a coordinate regression loss value between the predicted target frame and the labeled target frame based on basic detection network parameters before the iteration of the current round;
determining a designated target frame selected in the iteration according to a coordinate regression loss value, inputting the designated target frame and the labeled target frame into the comparison learning network, determining the similarity between the image characteristics of the training image in the designated target frame and the image characteristics of the training image in the labeled target frame through the comparison learning network, and determining a contrast loss value based on the similarity;
and determining the detection loss value based on the classification loss value, the coordinate regression loss value and the contrast loss value, and adjusting the basic detection network parameters according to the detection loss value.
In some possible embodiments, when performing the determining of the specified target box selected in the current iteration according to the coordinate regression loss value, the basic detection network training module is configured to:
and taking the predicted target frame with the minimum coordinate regression loss value as the specified target frame.
In some possible embodiments, the apparatus further comprises:
a learning network training module configured to perform deriving the trained comparative learning network by:
inputting a sample image into a basic learning network, training the basic learning network in an iterative mode, determining whether a second convergence condition is met according to a learning loss value obtained by each iteration of the basic learning network, and taking the basic learning network as the comparison learning network after the second convergence condition is met; wherein each sample image pair comprises two sample images with different image characteristics, and the sample images are determined according to the image characteristics of the training images in the prediction target frame of the second sample type; each iteration is as follows:
determining the feature similarity between the sample images in the sample image pairs based on the basic learning network parameters before the iteration of the current round aiming at each sample image pair;
determining a contrast threshold according to the second sample type of the prediction target frame corresponding to each sample image;
and determining the learning loss value according to the comparison result of the feature similarity and the comparison threshold, and adjusting the basic learning network parameters according to the learning loss value.
In some possible embodiments, when performing the determining of the contrast threshold according to the second sample type of the prediction target frame corresponding to each sample image, the learning network training module is configured to:
if the second sample types of the prediction target frames corresponding to the sample images are the same, taking a first threshold value as the comparison threshold value;
if the types of the second samples of the corresponding prediction target frames of the sample images are different, taking a second threshold value as the comparison threshold value; wherein the second threshold is less than the first threshold.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and executing the steps comprised in the method of any one of the first aspect according to the obtained program instructions.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method of any one of the first aspects.
In a fifth aspect, an embodiment of the present application provides a computer program product, where the computer program product includes: computer program code which, when run on a computer, causes the computer to perform the method of any of the first aspects.
Drawings
FIG. 1 is a schematic diagram of intersection over union provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of image features with a high intersection-over-union ratio but a large feature difference, provided by an embodiment of the present application;
FIG. 3 is an overall flowchart of training a target detection network according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a sample image provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a predicted target frame of a first sample type according to an embodiment of the present disclosure;
fig. 6 is an overall flowchart of a target detection method according to an embodiment of the present application;
FIG. 7 is a block diagram of an object detection device 700 according to an embodiment of the present disclosure;
fig. 8 is a schematic view of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
The terms "first" and "second" in the description and claims of the present application and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the term "comprises" and any variations thereof, which are intended to cover non-exclusive protection. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. The term "a plurality" in the present application may mean at least two, for example, two, three or more, and the embodiments of the present application are not limited.
For a target detection algorithm, the allocation of positive and negative samples has a significant impact on the accuracy of the algorithm. The positive and negative samples are for the predicted target frames generated by the program, i.e., a plurality of anchor frames of predetermined fixed sizes. Each predicted target frame is provided with a corresponding sample type; for example, in a face detection task, the image content in a predicted target frame whose sample type is a positive sample is a face to be detected, while the image content in a negative-sample predicted target frame is not a face but background such as a window, a seat and the like. In the related technology, the sample type of each preset predicted target frame is determined according to the intersection-over-union ratio between the preset predicted target frame and the labeling target frame. The intersection-over-union ratio is the ratio of the overlapping area of the predicted target frame and the labeling target frame to the area of their union, specifically as shown in fig. 1, where A is the labeling target frame and B is the predicted target frame. The overlapping area between the labeling target frame A and the predicted target frame B is the black area shown in fig. 1, and the intersection-over-union ratio of the predicted target frame is (A ∩ B)/(A ∪ B).
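For illustration, the intersection-over-union computation described above can be sketched as follows; this is a minimal example, and the (x1, y1, x2, y2) box format is an assumption rather than something specified by the embodiment:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)              # area of A ∩ B
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                                # area of A ∪ B
    return inter / union if union > 0 else 0.0
```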
The object to be searched is not limited in the embodiment of the application, and a person skilled in the art can set the object according to actual requirements, for example, the object to be searched can include but is not limited to at least one of a human body, an animal, a plant, an article, a building, and the like;
the above-mentioned conventional allocation method of positive and negative samples only depends on the size of the cross-over ratio, and does not consider whether the features of the images in the frame are similar, for example, the object to be searched is bamboo, i.e., the characterization detection network is used to identify whether there is bamboo in the input image, and the location area of the bamboo in the input image. Specifically, as shown in fig. 2, the intersection ratio between the labeling target frame a and the prediction target frame B shown in fig. 2 is high, but the non-overlapping area of the labeling target frame a contains more image features representing bamboo, and the overlapping area contains fewer image features representing bamboo, which results in a larger difference in image features between the labeling target frame a and the prediction target frame B. It can be known from fig. 2 that the bamboo features in the labeled target frame a are small but the intersection is large, and the frame is determined to be a better position of the object to be searched in the process of detecting the network training, thereby affecting the detection accuracy of the model.
Considering that a contrast learning algorithm obtains general features of a data set by enabling a model to learn, without labels, which data features are similar or different, and therefore has the capability of identifying feature similarity between different images, the present application intends to train a contrast learning network after allocating a sample type to each predicted target frame through the above-mentioned conventional intersection-over-union method, and, in the stage of training the basic detection network, to allocate a sample type to each predicted target frame again by using the image feature similarity between the labeling target frame and the predicted target frame, so as to determine the loss value generated by the contrast learning network after the sample type is changed. The loss values of the basic detection network and the contrast learning network are then fused to train the basic detection network so as to obtain a converged target detection network.
The inventive concept of the application is as follows: the image to be processed is input into a target detection network to determine whether the image to be processed contains the object to be searched of the target type and the position area of the object to be searched in the image. The target detection network is obtained by training based on an annotated target frame and predicted target frames annotated with a first sample type, wherein the first sample type of a predicted target frame is determined according to the similarity between the image features of the training image in the predicted target frame and the image features of the training image in the annotated target frame. Therefore, in the training stage of the target detection network, the model is corrected not only based on the coordinate difference between the predicted target frame and the annotated target frame, but also based on the image feature similarity information between the predicted target frame and the annotated target frame, so that the detection precision of the model is improved.
To facilitate understanding of the technical solutions provided by the embodiments of the present application, a training process of the target detection network of the present application is described below, where the training phase of the target detection network is divided into three phases as shown in fig. 3 according to a time sequence, including:
The first stage: step 301, training an initial basic detection network based on the labeling target frame and the predicted target frames of the second sample type to obtain a converged basic detection network;
The second stage: step 302, training a basic learning network based on the labeling target frame and the predicted target frames of the second sample type to obtain a converged contrast learning network;
The third stage: step 303, identifying image feature similarity information between the labeling target frame and the predicted target frames of the second sample type based on the contrast learning network, so as to determine the first sample type of each predicted target frame according to the image feature similarity information, and training the basic detection network according to the predicted target frames of the first sample type to obtain a converged target detection network.
It should be noted that, the above-mentioned sequence relationship between the first stage and the second stage is only a time sequence in the experimental process of the present application, and does not indicate that the first stage must be executed before the second stage. The following explains the training process in the above three stages respectively:
The first stage: the initial basic detection network is trained based on the labeling target frame and the predicted target frames of the second sample type to obtain a converged basic detection network.
In the embodiment of the present application, sample type division is performed in advance on each predicted target frame generated in the program based on anchor points, according to the conventional sample allocation manner shown in fig. 1. Specifically, a predicted target frame whose intersection-over-union ratio with the labeling target frame is higher than 0.5 is set as a positive sample, and a predicted target frame whose intersection-over-union ratio is not higher than 0.5 is set as a negative sample. For convenience of describing the technical solution provided by the embodiment of the present application, the sample type allocated by the conventional intersection-over-union method is referred to as the second sample type of the predicted target frame.
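As a rough sketch of this allocation (reusing the iou() helper from the earlier example; the 0.5 threshold follows the text above), the second sample type could be assigned like this:

```python
def assign_second_sample_type(pred_boxes, gt_box, threshold=0.5):
    """Label each predicted target frame (anchor) as a positive or negative sample
    according to its intersection-over-union ratio with the labeling target frame."""
    return ["positive" if iou(box, gt_box) > threshold else "negative" for box in pred_boxes]
```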
In the training process, the initial basic detection network is trained in a traditional detection model training manner. Specifically, the classification result of the predicted target frame of each second sample type is determined through the network model parameters, where the classification result represents the network model's recognition result of the image features in the predicted target frame. A classification loss value is then determined based on the classification result and the class to which the object to be detected belongs, and a coordinate regression loss value is determined based on the coordinate difference between the predicted target frame of each second sample type and the labeling target frame. The network model parameters are then corrected according to the total model loss, and model convergence is determined when the total network loss calculated with the corrected model parameters is not higher than a preset threshold, so as to obtain a converged basic detection network.
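A schematic, hedged sketch of one first-stage training iteration follows, assuming a PyTorch-style detector that returns per-frame class logits and box predictions; all function and variable names here are illustrative and not taken from the embodiment:

```python
import torch
import torch.nn.functional as F

def first_stage_step(detector, optimizer, image, anchors, sample_types, gt_class, gt_box):
    """One iteration of the conventional detection training described above."""
    cls_logits, box_preds = detector(image, anchors)        # shapes [N, num_classes], [N, 4]
    pos = torch.tensor([t == "positive" for t in sample_types])

    # Classification targets: the class of the object to be detected for positive
    # predicted target frames, background (class 0) for negative ones.
    cls_targets = torch.where(pos,
                              torch.full((len(sample_types),), gt_class, dtype=torch.long),
                              torch.zeros(len(sample_types), dtype=torch.long))
    loss_cls = F.cross_entropy(cls_logits, cls_targets)

    # Coordinate regression loss between positive predicted frames and the labeling target frame.
    loss_reg = F.smooth_l1_loss(box_preds[pos], gt_box.expand_as(box_preds[pos]))

    loss = loss_cls + loss_reg                              # total model loss of the first stage
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```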
The second stage: the basic learning network is trained based on the labeling target frame and the predicted target frames of the second sample type to obtain a converged contrast learning network.
before training the contrast learning network, in the embodiment of the present application, sample type division is performed on each prediction target frame generated based on an anchor point in a program in advance according to the conventional sample allocation manner shown in fig. 1, so as to obtain each prediction target frame carrying a second sample type. Then, as shown in fig. 4, the training image is input into the basic detection network trained in the first stage to determine the image features of the training image in the prediction target frame of each second sample type, and a sample image with the same size as the prediction target frame is generated according to the image features of the training image in the prediction target frames of the second sample types. Thus, a sample image corresponding to the prediction target frame of each second sample type can be obtained. Next, the sample images are paired into sample image pairs, and the two sample images in each sample image pair are allocated to have different image characteristics.
In the training process of the basic learning network, the sample images are input into the basic learning network, and the basic learning network is trained in an iterative manner. Whether the learning loss value obtained in each round of the basic learning network meets a second convergence condition is determined, and after the second convergence condition is met, the basic learning network whose iteration has finished is taken as the contrast learning network. The second convergence condition may be, for example, that the learning loss value is not higher than a preset threshold.
In practice, each iteration is as follows: firstly, feature recognition is respectively performed on the sample images in each sample image pair based on the basic learning network parameters before the current iteration, so as to obtain the feature vectors of the sample images in different dimensions; the feature vectors of the sample images are then compared to determine the feature similarity between the sample images.
Further, for each sample image pair, determining a comparison threshold according to a second sample type of the prediction target frame corresponding to each sample image in the sample image pair, determining a learning loss value according to a comparison result of the feature similarity and the comparison threshold, and finally adjusting the basic learning network parameters according to the learning loss value until the learning loss value calculated by the adjusted learning network parameters meets the second convergence condition.
In particular, for two sample images Z_i and Z_j within a sample image pair, in the embodiment of the present application the loss value function of the contrast learning can be expressed as the following formula (1):

L_(i,j) = -log [ exp(sim(Z_i, Z_j)/τ) / Σ_{k=1}^{N} 1_[k≠i] · exp(sim(Z_i, Z_k)/τ) ]    formula (1)

wherein L_(i,j) represents the feature loss value of Z_i and Z_j; sim() is a similarity calculation function, such as cosine similarity or Euclidean distance; τ is the temperature parameter used for controlling the degree of difference and is a preset value; 1_[k≠i] is an indicator function whose value is 1 when k is not equal to i and 0 otherwise; N represents the number of sample images; and k represents the index of a sample image.
After the loss value corresponding to each training sample pair is determined through formula (1), if the sample types of the predicted target frames corresponding to the sample images in a sample image pair are the same (namely, both are positive samples or both are negative samples), the first threshold is used as the contrast threshold; otherwise, the second threshold is used as the contrast threshold. It should be understood that the first threshold and the second threshold may be set according to actual situations, but it is necessary to ensure that the first threshold is greater than the second threshold. The reason is that the network optimization aims to reduce the distance between the feature vectors of sample images corresponding to predicted target frames belonging to the same sample type, and conversely to increase the distance. In this way, the contrast learning network can automatically learn the similarity of the image features in the predicted target frames of each second sample type through optimization.
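A minimal PyTorch-style rendering of formula (1) and of the contrast threshold choice is sketched below, under the assumption that sim() is cosine similarity; the numeric threshold values are placeholders, and how the threshold enters the final learning loss is left open here since the embodiment only states that the feature similarity is compared with the contrast threshold:

```python
import torch
import torch.nn.functional as F

def pair_contrast_loss(z, i, j, tau=0.1):
    """Formula (1): loss of the pair (Z_i, Z_j) among the N sample-image feature vectors z."""
    sim = F.cosine_similarity(z[i].unsqueeze(0), z, dim=1) / tau   # sim(Z_i, Z_k) / tau for all k
    mask = torch.ones_like(sim, dtype=torch.bool)
    mask[i] = False                                                # indicator function 1[k != i]
    # -log( exp(sim(Z_i,Z_j)/tau) / sum_{k != i} exp(sim(Z_i,Z_k)/tau) )
    return -(sim[j] - torch.logsumexp(sim[mask], dim=0))

def contrast_threshold(type_i, type_j, first=0.8, second=0.3):
    """First (larger) threshold when the two second sample types match, second otherwise."""
    return first if type_i == type_j else second
```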
It should be noted that the positive and negative samples allocated for the contrast learning network are of the second sample type determined by the conventional intersection-over-union method. Because the second sample type ignores the difference in image features between the predicted target frame and the labeling target frame, directly training the network model with predicted target frames of the second sample type would affect the detection precision of the model. In order to solve this problem, in the embodiment of the present application the initial detection network and the initial learning network are trained with the predicted target frames of the second sample type in the above training process. The reason is that the parameters of the initial learning network are randomly initialized at the beginning of training, so the network does not yet have the capability of distinguishing whether the features of the sample images are similar; if it were directly applied to the basic detection network in this state, a large negative effect would be caused. Therefore, in the embodiment of the application, the basic detection network is trained in the following third stage in combination with the contrast learning network, and in the third stage the contrast loss value caused by the change of the sample type of the predicted target frame is added to the loss function used for correcting the model parameters, so that the positive and negative sample allocation mechanism of the basic detection network is optimized by using the contrast learning network, thereby obtaining a target detection network with higher detection precision. See the third stage below for details.
The third stage: the image feature similarity between the labeling target frame and the predicted target frame of each second sample type is identified based on the contrast learning network, and the sample types of the predicted target frames are reallocated according to the image feature similarity. The basic detection network is then trained based on the predicted target frames of the first sample type obtained after the sample types are reallocated, so as to obtain a converged target detection network.
According to the embodiment of the application, during training the training image is input into the basic detection network and the basic detection network is trained in an iterative manner; whether a first convergence condition is met is determined from the detection loss value obtained in each round of the basic detection network, and after the network model meets the first convergence condition, the basic detection network whose iteration has finished is taken as the target detection network. The first convergence condition may be, for example, that the detection loss value is not higher than a preset threshold.
In each iteration process, firstly, the coordinate regression loss value between the prediction target frame and the labeling target frame of each first sample type is determined based on the basic detection network parameters before the iteration of the current round.
During implementation, the feature vectors of the labeled target frames are determined according to the basic detection network parameters before the iteration of the current round, and the classification results of the prediction target frames of the first sample types are determined according to the basic detection network parameters. And then determining a classification loss value based on the classification result and the belonged classification of the object to be detected, and determining a coordinate regression loss value based on the coordinate difference between the prediction target frame and the labeling target frame of each first sample type.
Here, the predicted target frames of the first sample type are obtained by reassigning a sample type to each predicted target frame of the second sample type based on the similarity between the image features of the training image in the labeling target frame and those in each predicted target frame of the second sample type. Specifically, the training image is input into the basic detection network obtained in the first stage, so as to obtain the image features of the training image identified by the basic detection network in the predicted target frame of each second sample type. Next, a corresponding feature image is generated from each image feature, and each feature image is input into the contrast learning network obtained in the second stage, so as to determine, through the contrast learning network, the feature similarity between each feature image and the image features in the labeling target frame (i.e. the image feature similarity information). Further, a sample type is assigned to the predicted target frame corresponding to each feature image according to the comparison result of the feature similarity and the similarity threshold, and the reassigned sample type is the first sample type.
Specifically, as shown in fig. 5, assuming that the prediction target frame B of the second sample type corresponding to the feature image is a negative sample, if the feature similarity between the image feature in the labeling target frame a and the feature image satisfies the similarity threshold, the sample type of the prediction target frame B is modified into a positive sample, that is, the first sample type of the prediction target frame B is a positive sample. On the contrary, if the prediction target frame B of the second sample type corresponding to the feature image is a positive sample, but the feature similarity between the image feature in the labeling target frame a and the feature image does not satisfy the similarity threshold, the sample type of the prediction target frame B is modified into a negative sample, that is, the first sample type of the prediction target frame B is a negative sample. And redistributing the sample types of the prediction target frames of the second sample types through the process to obtain the prediction target frame of the first sample type.
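A sketch of this reassignment step is given below, assuming the trained contrast learning network exposes a similarity score between a feature image and the image features of the labeling target frame; the similarity() call is a hypothetical interface:

```python
def assign_first_sample_type(contrast_net, feature_images, gt_features, sim_threshold):
    """Re-label each predicted target frame: a frame whose in-frame image features are
    similar enough to those of the labeling target frame becomes a positive sample,
    otherwise it becomes a negative sample."""
    first_types = []
    for feat_img in feature_images:
        similarity = contrast_net.similarity(feat_img, gt_features)   # hypothetical API
        first_types.append("positive" if similarity >= sim_threshold else "negative")
    return first_types
```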
Then, the specified target frame selected in the current iteration is determined according to the coordinate regression loss values. In implementation, the predicted target frame with the smallest coordinate regression loss value among the predicted target frames of the first sample type can be used as the specified target frame, which represents the predicted target frame whose recognition result the current network is most confident about among the predicted target frames of the first sample type.
After the specified target frame is determined, the specified target frame and the labeling target frame are input into the contrast learning network, so that the contrast learning network determines the feature similarity between the specified target frame and the labeling target frame, and a contrast loss value is determined based on the feature similarity. Finally, the detection loss value is determined based on the classification loss value, the coordinate regression loss value and the contrast loss value, and the basic detection network parameters are adjusted according to the detection loss value.
Specifically, the detection loss value used for training the basic detection network is determined by the following formula (2):
L_od = α·L_cls + β·L_reg + γ·L_c    formula (2)

wherein L_od is the detection loss value; L_cls is the classification loss value; L_reg is the coordinate regression loss value; L_c is the contrast loss value; and α, β and γ are preset weight coefficients.
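Putting the pieces together, a hedged sketch of how formula (2) could be assembled in one third-stage iteration is shown below; the per-frame loss lists, the crop_to_box helper and the contrast_loss() call on the contrast learning network are assumed interfaces, and the default weights are placeholders:

```python
def detection_loss(cls_losses, reg_losses, contrast_net, pred_boxes, gt_box,
                   feature_map, crop_to_box, alpha=1.0, beta=1.0, gamma=1.0):
    """Formula (2): L_od = alpha * L_cls + beta * L_reg + gamma * L_c, where the per-frame
    classification and coordinate regression losses are given for the first-sample-type frames."""
    # The specified target frame is the predicted frame with the smallest coordinate
    # regression loss, i.e. the frame the current network is most confident about.
    best = min(range(len(pred_boxes)), key=lambda k: reg_losses[k])
    # Contrast loss from the feature similarity between the specified and labeling target frames.
    loss_c = contrast_net.contrast_loss(crop_to_box(feature_map, pred_boxes[best]),
                                        crop_to_box(feature_map, gt_box))      # hypothetical API
    loss_cls = sum(cls_losses) / len(cls_losses)
    loss_reg = sum(reg_losses) / len(reg_losses)
    return alpha * loss_cls + beta * loss_reg + gamma * loss_c
```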
Specifically, in the training process, the initial detection network and the initial learning network are first trained based on the predicted target frames of the second sample type, so as to obtain a converged basic detection network and a converged contrast learning network. Then, in the third stage of training the basic detection network, the first sample type is assigned to each predicted target frame by using the image feature similarity between the labeling target frame and the predicted target frame of each second sample type. The classification loss value and the coordinate regression loss value output by the basic detection network are obtained based on the predicted target frames of the first sample type and the labeling target frame, and the contrast loss value generated by the contrast learning network due to the change of the sample type is obtained. Finally, the detection loss value used for training the basic detection network is obtained by fusing the losses of the basic detection network and the contrast learning network, and the basic detection network is trained with the detection loss value until convergence.
The trained target detection network in the embodiment of the application can be applied to the detection of objects in violation (such as vehicles and billboards) and pedestrians in the traffic field, pedestrian re-identification tasks and the like, and the target detection network trained through the above training process can more accurately recognize the position area of the object to be searched in the input image. It should be understood that the above application fields are only illustrations of the application fields of the target detection network trained in the present application, and are not limitations of the application fields.
Based on the same inventive concept, the embodiment of the present application provides a target detection method, specifically as shown in fig. 6, including the following steps:
step 601: inputting an image to be processed into a trained target detection network, and identifying the image to be processed through the target detection network;
step 602: determining whether the image to be processed contains an object to be searched in a target type and a position area of the object to be searched in the image to be processed;
the target basis detection network is obtained by training at least based on an annotated target frame and a predicted target frame which is annotated with a first sample type, wherein the first sample type of the predicted target frame is determined based on image feature similarity information, and the image feature similarity information is determined according to the similarity of the image features of a training image in each predicted target frame of the first sample type and the image features of the training image in the annotated target frame; the marking target frame is a target frame for marking the object to be searched in the training image.
Based on the same inventive concept, an embodiment of the present application provides an object detection apparatus 700, specifically as shown in fig. 7, including:
an image recognition module 701 configured to perform input of a to-be-processed image into a trained target detection network, and recognize the to-be-processed image through the target detection network;
a detection information module 702 configured to perform determining whether the image to be processed contains an object to be searched for in a target type and a position area of the object to be searched for in the image to be processed;
the target detection network is obtained by training at least based on an annotated target frame and a predicted target frame which is annotated with a first sample type, wherein the first sample type of the predicted target frame is determined based on image feature similarity information, and the image feature similarity information is determined according to the similarity between the image features of a training image in each predicted target frame of the first sample type and the image features of the training image in the annotated target frame; the marking target frame is a target frame for marking the object to be searched in the training image.
In some possible embodiments, the image feature similarity information is obtained through a trained contrast learning network, and the contrast learning network is obtained through training based on the training image and a prediction target frame labeled with a second sample type;
and determining the second sample type of the prediction target frame according to the intersection ratio between the prediction target frame and the labeling target frame.
In some possible embodiments, the first sample type is obtained by:
inputting each feature image into the trained contrast learning network, wherein each feature image is a plurality of images determined according to the image features of the training images in the prediction target frames of the second sample types; performing the following process through the comparative learning network:
for each feature image, determining the image feature similarity information according to the similarity of the feature image and the image feature of the labeling target frame;
and determining a first sample type of the prediction target frame corresponding to the feature image according to the comparison result of the image feature similarity information and the similarity threshold.
In some possible embodiments, the apparatus further comprises:
a basic detection network training module, configured to obtain the trained target detection network in the following manner:
inputting the training image into a basic detection network and training the basic detection network iteratively, determining whether a first convergence condition is met according to the detection loss value obtained in each iteration of the basic detection network, and taking the basic detection network as the target detection network once the first convergence condition is met; each iteration proceeds as follows:
for any predicted target frame of the first sample type, determining a classification loss value and a coordinate regression loss value between the predicted target frame and the annotated target frame based on the basic detection network parameters before the current iteration;
determining the designated target frame selected in the current iteration according to the coordinate regression loss values, inputting the designated target frame and the annotated target frame into the contrastive learning network, determining, through the contrastive learning network, the similarity between the image features of the training image within the designated target frame and the image features of the training image within the annotated target frame, and determining a contrast loss value based on the similarity;
and determining the detection loss value based on the classification loss value, the coordinate regression loss value, and the contrast loss value, and adjusting the basic detection network parameters according to the detection loss value.
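One training iteration of the basic detection network could be sketched as below. This assumes a detector that returns per-box class scores and coordinates, a pre-trained (and here differentiable) contrastive network, an assumed differentiable crop helper crop_fn, cross-entropy and smooth-L1 as the classification and regression losses, and an illustrative contrast-loss weight; none of these specifics are fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def detection_iteration(detector, contrastive_net, optimizer, image, anno_box,
                        anno_label, first_type_idx, crop_fn, w_contrast=1.0):
    """One illustrative iteration: classification + regression losses over the
    first-sample-type predictions, plus a contrast loss between the designated
    box (lowest regression loss) and the annotated box.
    anno_label: scalar long tensor holding the class of the object to be searched for."""
    scores, boxes = detector(image.unsqueeze(0))   # assumed shapes: (1, N, C), (1, N, 4)
    cls_loss = 0.0
    reg_losses = []
    for idx in first_type_idx:                     # indices of first-sample-type predictions
        cls_loss = cls_loss + F.cross_entropy(scores[0, idx].unsqueeze(0),
                                              anno_label.unsqueeze(0))
        reg_losses.append(F.smooth_l1_loss(boxes[0, idx], anno_box))
    reg_losses = torch.stack(reg_losses)
    # Designated target frame: the prediction with the smallest coordinate regression loss.
    designated = first_type_idx[int(torch.argmin(reg_losses))]
    feat_pred = contrastive_net(crop_fn(image, boxes[0, designated]).unsqueeze(0))
    feat_anno = contrastive_net(crop_fn(image, anno_box).unsqueeze(0))
    contrast_loss = 1.0 - F.cosine_similarity(feat_pred, feat_anno).mean()
    detection_loss = cls_loss + reg_losses.sum() + w_contrast * contrast_loss
    optimizer.zero_grad()
    detection_loss.backward()
    optimizer.step()
    return detection_loss.item()
```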
In some possible embodiments, in determining the designated target frame selected in the current iteration according to the coordinate regression loss values, the basic detection network training module is configured to:
take the predicted target frame with the smallest coordinate regression loss value as the designated target frame.
In some possible embodiments, the apparatus further comprises:
a learning network training module, configured to obtain the trained contrastive learning network as follows:
inputting sample images into a basic learning network and training the basic learning network iteratively, determining whether a second convergence condition is met according to the learning loss value obtained in each iteration of the basic learning network, and taking the basic learning network as the contrastive learning network once the second convergence condition is met; wherein each sample image pair comprises two sample images with different image features, and the sample images are determined according to the image features of the training image within the predicted target frames of the second sample type; each iteration proceeds as follows:
for each sample image pair, determining the feature similarity between the sample images in the pair based on the basic learning network parameters before the current iteration;
determining a contrast threshold according to the second sample types of the predicted target frames corresponding to the sample images;
and determining the learning loss value according to the comparison between the feature similarity and the contrast threshold, and adjusting the basic learning network parameters according to the learning loss value.
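One iteration of this contrastive training could be sketched as below, assuming cosine similarity as the feature similarity and a hinge-style learning loss against the contrast threshold; the threshold values 0.8 and 0.3 are illustrative only and simply satisfy the stated requirement that the second threshold be smaller than the first.

```python
import torch
import torch.nn.functional as F

def contrastive_iteration(base_net, optimizer, sample_pairs,
                          first_thresh=0.8, second_thresh=0.3):
    """sample_pairs: iterable of (img_a, img_b, type_a, type_b), where the images
    are crops from the training image and the types are second sample types."""
    learning_loss = torch.zeros(())
    for img_a, img_b, type_a, type_b in sample_pairs:
        feat_a = base_net(img_a.unsqueeze(0))      # assumed (1, D) embedding
        feat_b = base_net(img_b.unsqueeze(0))
        similarity = F.cosine_similarity(feat_a, feat_b).mean()
        if type_a == type_b:
            # Same second sample type: similarity should reach at least the first threshold.
            learning_loss = learning_loss + F.relu(first_thresh - similarity)
        else:
            # Different second sample types: similarity should stay below the
            # (smaller) second threshold.
            learning_loss = learning_loss + F.relu(similarity - second_thresh)
    optimizer.zero_grad()
    learning_loss.backward()
    optimizer.step()
    return learning_loss.item()
```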
In some possible embodiments, in determining the contrast threshold according to the second sample types of the predicted target frames corresponding to the sample images, the learning network training module is configured to:
take a first threshold as the contrast threshold if the second sample types of the predicted target frames corresponding to the sample images are the same;
and take a second threshold as the contrast threshold if the second sample types of the predicted target frames corresponding to the sample images are different, wherein the second threshold is less than the first threshold.

The electronic device 130 according to this embodiment of the present application is described below with reference to fig. 8. The electronic device 130 shown in fig. 8 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in fig. 8, the electronic device 130 is represented in the form of a general-purpose electronic device. The components of the electronic device 130 may include, but are not limited to: at least one processor 131, at least one memory 132, and a bus 133 that connects the various system components (including the memory 132 and the processor 131).
Bus 133 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a processor or local bus using any of a variety of bus architectures.
The memory 132 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.
Memory 132 may also include a program/utility 1325 having a set (at least one) of program modules 1324, such program modules 1324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The electronic device 130 may also communicate with one or more external devices 134 (e.g., a keyboard, a pointing device, etc.), with one or more devices that enable a user to interact with the electronic device 130, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 130 to communicate with one or more other electronic devices. Such communication may occur via the input/output (I/O) interface 135. The electronic device 130 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 136. As shown, the network adapter 136 communicates with the other modules of the electronic device 130 over the bus 133. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 130, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 132 comprising instructions, is also provided; the instructions are executable by the processor 131 of the electronic device 130 to perform the above-described method. Alternatively, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, there is also provided a computer program product comprising computer programs/instructions which, when executed by the processor 131, implement any one of the object detection methods as provided herein.
In an exemplary embodiment, aspects of an object detection method provided by the present application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of an object detection method according to various exemplary embodiments of the present application described above in this specification when the program product is run on the computer device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for performing object detection of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the C programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device, or entirely on the remote electronic device or server. Where a remote electronic device is involved, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external electronic device (for example, through the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided and embodied by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus, so that a series of operational steps are performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the present application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of object detection, the method comprising:
inputting an image to be processed into a trained target detection network, and recognizing the image to be processed through the target detection network;
determining whether the image to be processed contains an object to be searched for of a target type, and a location area of the object to be searched for in the image to be processed;
wherein the target detection network is obtained through training based on at least an annotated target frame and predicted target frames annotated with a first sample type; the first sample type of a predicted target frame is determined based on image feature similarity information, and the image feature similarity information is determined according to the similarity between the image features of a training image within each predicted target frame of the first sample type and the image features of the training image within the annotated target frame; and the annotated target frame is a target frame that annotates the object to be searched for in the training image.
2. The method according to claim 1, wherein the image feature similarity information is obtained through a trained contrastive learning network, and the contrastive learning network is obtained through training based on the training image and predicted target frames annotated with a second sample type;
and the second sample type of a predicted target frame is determined according to the intersection-over-union (IoU) between the predicted target frame and the annotated target frame.
3. The method according to claim 2, wherein the first sample type is obtained as follows:
inputting each feature image into the trained contrastive learning network, wherein each feature image is one of a plurality of images determined according to the image features of the training image within the predicted target frames of the second sample type, and performing the following process through the contrastive learning network:
for each feature image, determining the image feature similarity information according to the similarity between the image features of the feature image and the image features of the annotated target frame;
and determining the first sample type of the predicted target frame corresponding to the feature image according to the comparison between the image feature similarity information and a similarity threshold.
4. The method according to any one of claims 1 to 3, wherein the trained target detection network is obtained as follows:
inputting the training image into a basic detection network and training the basic detection network iteratively, determining whether a first convergence condition is met according to the detection loss value obtained in each iteration of the basic detection network, and taking the basic detection network as the target detection network once the first convergence condition is met; each iteration proceeds as follows:
for any predicted target frame of the first sample type, determining a classification loss value and a coordinate regression loss value between the predicted target frame and the annotated target frame based on the basic detection network parameters before the current iteration;
determining the designated target frame selected in the current iteration according to the coordinate regression loss values, inputting the designated target frame and the annotated target frame into the contrastive learning network, determining, through the contrastive learning network, the similarity between the image features of the training image within the designated target frame and the image features of the training image within the annotated target frame, and determining a contrast loss value based on the similarity;
and determining the detection loss value based on the classification loss value, the coordinate regression loss value, and the contrast loss value, and adjusting the basic detection network parameters according to the detection loss value.
5. The method according to claim 4, wherein determining the designated target frame selected in the current iteration according to the coordinate regression loss values comprises:
taking the predicted target frame with the smallest coordinate regression loss value as the designated target frame.
6. The method according to claim 5, wherein the trained contrastive learning network is obtained as follows:
inputting sample images into a basic learning network and training the basic learning network iteratively, determining whether a second convergence condition is met according to the learning loss value obtained in each iteration of the basic learning network, and taking the basic learning network as the contrastive learning network once the second convergence condition is met; wherein each sample image pair comprises two sample images with different image features, and the sample images are determined according to the image features of the training image within the predicted target frames of the second sample type; each iteration proceeds as follows:
for each sample image pair, determining the feature similarity between the sample images in the pair based on the basic learning network parameters before the current iteration;
determining a contrast threshold according to the second sample types of the predicted target frames corresponding to the sample images;
and determining the learning loss value according to the comparison between the feature similarity and the contrast threshold, and adjusting the basic learning network parameters according to the learning loss value.
7. The method according to claim 6, wherein determining the contrast threshold according to the second sample types of the predicted target frames corresponding to the sample images comprises:
taking a first threshold as the contrast threshold if the second sample types of the predicted target frames corresponding to the sample images are the same;
and taking a second threshold as the contrast threshold if the second sample types of the predicted target frames corresponding to the sample images are different, wherein the second threshold is less than the first threshold.
8. An object detection apparatus, characterized in that the apparatus comprises:
an image recognition module, configured to input an image to be processed into a trained target detection network and recognize the image to be processed through the target detection network;
a detection information module, configured to determine whether the image to be processed contains an object to be searched for of a target type, and a location area of the object to be searched for in the image to be processed;
wherein the target detection network is obtained through training based on at least an annotated target frame and predicted target frames annotated with a first sample type; the first sample type of a predicted target frame is determined based on image feature similarity information, and the image feature similarity information is determined according to the similarity between the image features of a training image within each predicted target frame of the first sample type and the image features of the training image within the annotated target frame; and the annotated target frame is a target frame that annotates the object to be searched for in the training image.
9. An electronic device, comprising:
a memory for storing program instructions;
a processor, configured to call the program instructions stored in the memory and, according to the obtained program instructions, execute the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method of any of claims 1-7.
CN202210807521.2A 2022-07-11 2022-07-11 Target detection method, device, equipment and medium Active CN114898201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210807521.2A CN114898201B (en) 2022-07-11 2022-07-11 Target detection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210807521.2A CN114898201B (en) 2022-07-11 2022-07-11 Target detection method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114898201A true CN114898201A (en) 2022-08-12
CN114898201B CN114898201B (en) 2022-10-28

Family

ID=82729654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210807521.2A Active CN114898201B (en) 2022-07-11 2022-07-11 Target detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114898201B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170316281A1 (en) * 2016-04-28 2017-11-02 Microsoft Technology Licensing, Llc Neural network image classifier
CN111339839A (en) * 2020-02-10 2020-06-26 广州众聚智能科技有限公司 Intensive target detection and metering method
CN113033397A (en) * 2021-03-25 2021-06-25 开放智能机器(上海)有限公司 Target tracking method, device, equipment, medium and program product
CN114283316A (en) * 2021-09-16 2022-04-05 腾讯科技(深圳)有限公司 Image identification method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Qian: "Real-time object detection in workshops based on YOLOv4", Modern Computer: Graphics and Images *

Also Published As

Publication number Publication date
CN114898201B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
CN109117831B (en) Training method and device of object detection network
CN110363252B (en) End-to-end trend scene character detection and identification method and system
CN110570389B (en) Vehicle damage identification method and device
CN111046980B (en) Image detection method, device, equipment and computer readable storage medium
CN110942072B (en) Quality score based on quality assessment, detection model training and detection method and device
CN110163205B (en) Image processing method, device, medium and computing equipment
CN114912612A (en) Bird identification method and device, computer equipment and storage medium
US8867851B2 (en) Sparse coding based superpixel representation using hierarchical codebook constructing and indexing
CN110046279B (en) Video file feature prediction method, medium, device and computing equipment
CN112052818A (en) Unsupervised domain adaptive pedestrian detection method, unsupervised domain adaptive pedestrian detection system and storage medium
CN110163060B (en) Method for determining crowd density in image and electronic equipment
US20120219221A1 (en) System and method for effectively performing a scene rectification procedure
US20210183074A1 (en) Apparatus and method for tracking multiple objects
US10733537B2 (en) Ensemble based labeling
CN114330588A (en) Picture classification method, picture classification model training method and related device
CN114898201B (en) Target detection method, device, equipment and medium
CN113223011A (en) Small sample image segmentation method based on guide network and full-connection conditional random field
CN107948721A (en) The method and apparatus of pushed information
CN110728229B (en) Image processing method, device, equipment and storage medium
Andéol et al. Conformal prediction for trustworthy detection of railway signals
CN116049667A (en) Pseudo tag generation method, pseudo tag generation device, pseudo tag generation equipment and pseudo tag generation medium
CN115273148A (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN116777814A (en) Image processing method, apparatus, computer device, storage medium, and program product
CN110705695B (en) Method, device, equipment and storage medium for searching model structure
CN112712101A (en) Method for detecting and re-identifying objects by means of a neural network, neural network and control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant