CN112613539A - Method, device, equipment and medium for constructing classification network and object detection model

Method, device, equipment and medium for constructing classification network and object detection model

Info

Publication number
CN112613539A
Authority
CN
China
Prior art keywords
object detection
classification
network
images
loss value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011453575.0A
Other languages
Chinese (zh)
Inventor
李帮怀
袁野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN202011453575.0A priority Critical patent/CN112613539A/en
Publication of CN112613539A publication Critical patent/CN112613539A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention provide a method, apparatus, device and medium for constructing a classification network and an object detection model: feature extraction is performed on a plurality of images of a sample object to obtain a plurality of corresponding feature images, the plurality of images having different image distribution domains; feature images corresponding to images with different image distribution domains are input into a preset network in pairs to obtain a plurality of first classification results and a plurality of second classification results; the parameters of the preset network are updated according to the distances between the first classification results and the second classification results and the class labels of the sample objects; and the preset network after multiple updates is determined as the classification network.

Description

Method, device, equipment and medium for constructing classification network and object detection model
Technical Field
The invention relates to the technical field of computer processing, in particular to a method, a device, equipment and a medium for constructing a classification network and an object detection model.
Background
Object detection is a core technology of computer vision that aims to locate each object in a picture and assign it the correct category, and vision algorithms based on object detection have wide application. At present, object detection is generally performed based on deep neural networks.
However, current object detection has a significant problem: when the same object is photographed under different environmental conditions, such as day, night, or foggy weather, or in different picture styles, such as oil painting or sketch, the detection results for that object differ greatly. Deep neural networks in the related art therefore cannot perform object detection accurately under variable environmental conditions.
Disclosure of Invention
In view of the above problems, embodiments of the present invention propose a classification network and object detection model construction method, apparatus, device and medium that overcome, or at least partially solve, these problems.
In order to solve the above problem, a first aspect of the present invention discloses a method for constructing a classification network, where the method includes:
respectively extracting the features of a plurality of images of a sample object to obtain a plurality of corresponding feature images, wherein the plurality of images have different image distribution domains;
inputting feature images corresponding to two images with different image distribution domains into a preset network in pairs to obtain a plurality of first classification results and a plurality of second classification results;
updating the parameters of the preset network according to the distances between the first classification results and the second classification results and the class labels of the sample objects;
and determining the preset network after being updated for multiple times as a classification network.
Optionally, the updating the parameter of the preset network according to the distance between the first classification results and the second classification results and the class label of the sample object includes:
determining a target loss value according to the distance between the plurality of first classification results and the plurality of second classification results;
determining a first classification loss value according to the plurality of first classification results and the class label of the sample object, and determining a second classification loss value according to the plurality of second classification results and the class label of the sample object;
and updating the parameters of the preset network according to the target loss value, the first classification loss value and the second classification loss value.
Optionally, the method further comprises:
processing the plurality of first classification results to determine a first result center, and processing the plurality of second classification results to determine a second result center;
determining a target loss value according to distances between the first classification results and the second classification results, including:
and determining a target loss value according to the difference between the first result center and the second result center.
Optionally, determining a target loss value according to a difference between the first result center and the second result center includes:
determining a target loss value according to the following formula:
$$\mathrm{Loss}_{cosine} = 1 - \frac{\bar{c}_1 \cdot \bar{c}_2}{\lVert \bar{c}_1 \rVert\,\lVert \bar{c}_2 \rVert}$$

wherein $\mathrm{Loss}_{cosine}$ is the target loss value, $\bar{c}_1$ is the vector representation of the first classification result center, and $\bar{c}_2$ is the vector representation of the second classification result center.
In a second aspect of the embodiments of the present invention, there is provided a method for constructing an object detection model, where the method includes:
obtaining an original object detection model, wherein the original object detection model comprises an original classification sub-network and an original regression sub-network;
and replacing the original classification sub-network with the classification network described in the embodiment of the first aspect to obtain a target object detection model.
In a third aspect of the embodiments of the present invention, there is provided a method for constructing an object detection model, where the method includes:
obtaining a plurality of images of a sample object, the plurality of images having different image distribution domains;
inputting every two images with different image distribution domains into a preset network in pairs to obtain a plurality of first object detection results and a plurality of second object detection results;
comparing the plurality of first object detection results with the plurality of second object detection results to determine a first target object detection result and a second target object detection result characterizing the same object class;
updating the parameters of the preset network according to the distance between the first target object detection result and the second target object detection result and the position label and the category label of the sample object;
and determining the preset network after multiple updates as an object detection model.
Optionally, updating the parameters of the preset network according to the distance between the first target object detection result and the second target object detection result, and the position label and the category label of the sample object, includes:
determining a target loss value according to the distance between the first target object detection result and the second target object detection result;
determining a first object detection loss value according to the plurality of first object detection results and the position labels and the category labels of the sample objects, and determining a second object detection loss value according to the plurality of second object detection results and the position labels and the category labels of the sample objects;
and updating the parameters of the preset network according to the target loss value, the first object detection loss value and the second object detection loss value.
Optionally, after obtaining the object detection model, the method further includes:
obtaining an image to be detected, wherein the image to be detected has any one of the different image distribution domains;
and inputting the image to be detected into the object detection model to obtain the position and the category of the object contained in the image to be detected.
A fourth aspect of the present invention provides a classification network constructing apparatus, including:
the characteristic extraction module is used for respectively extracting the characteristics of a plurality of images of a sample object to obtain a plurality of corresponding characteristic images, and the plurality of images have different image distribution domains;
the classification result obtaining module is used for inputting the feature images corresponding to the images with different image distribution domains into a preset network in pairs to obtain a plurality of first classification results and a plurality of second classification results;
the updating module is used for updating the parameters of the preset network according to the distances between the first classification results and the second classification results and the class labels of the sample objects;
and the determining module is used for determining the preset network after being updated for multiple times as the classification network.
In a fifth aspect of the embodiments of the present invention, there is provided an object detection model building apparatus, including:
a model obtaining module for obtaining an original object detection model, wherein the original object detection model comprises an original classification sub-network and an original regression sub-network;
and a replacing module, configured to replace the original classification subnetwork with the classification network described in the embodiment of the first aspect, so as to obtain a target object detection model.
In a sixth aspect of the embodiments of the present invention, there is provided an object detection model building apparatus, including:
a sample image obtaining module for obtaining a plurality of images of a sample object, the plurality of images having different image distribution domains;
the detection result obtaining module is used for inputting every two images with different image distribution domains into a preset network in pairs to obtain a plurality of first object detection results and a plurality of second object detection results;
a screening module for comparing the plurality of first object detection results with the plurality of second object detection results and determining a first target object detection result and a second target object detection result representing the same object class;
the updating module is used for updating the parameters of the preset network according to the distance between the first target object detection result and the second target object detection result and the position label and the category label of the sample object;
and the determining module is used for determining the preset network after being updated for multiple times as the object detection model.
The embodiment of the invention also discloses an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the classification network construction method of the first aspect, or the object detection model construction method of the second aspect, or the object detection model construction method of the third aspect.
Embodiments of the present invention also disclose a computer-readable storage medium storing a computer program for causing a processor to execute the classification network construction method according to the embodiment of the first aspect of the present invention, or the object detection model construction method according to the second aspect of the present invention, or the object detection model construction method according to the third aspect of the present invention.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, feature extraction can be performed on the plurality of images of the sample object to obtain a plurality of corresponding feature images, the plurality of images having different image distribution domains; feature image pairs corresponding to two images with different image distribution domains are input into a preset network to obtain a plurality of first classification results and a plurality of second classification results; the preset network is updated according to the distances between the first classification results and the second classification results and the class labels of the sample objects; and the preset network after multiple updates is determined as the classification network.
In this embodiment, after the feature image pairs corresponding to two images with different image distribution domains are input into the preset network, the distances between the resulting first classification results and second classification results can represent how differently the preset network recognizes images captured in different shooting states. When the parameters of the preset network are updated according to both the class labels of the sample objects and these distances, the difference in how the preset network classifies images of objects shot in different states can be reduced as much as possible, so that the classification results for the sample object shot in different states tend to be consistent. The classification network obtained by training can therefore adapt to object detection on images shot in different shooting states, improving the accuracy of object detection when the state of the photographed object changes.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
Fig. 1 is a schematic diagram of a technical concept proposed in an embodiment of the present invention;
FIG. 2 is a flow chart of steps of a method for constructing a classification network according to an embodiment of the present invention;
FIG. 3 is a flow chart of the steps for updating a default network in the practice of the present invention;
FIG. 4 is a flow chart of the steps in constructing an object detection model in the practice of the present invention;
FIG. 5 is a model architecture diagram of an object detection model in the practice of the present invention;
FIG. 6 is a flow chart of the steps for updating a default network in the practice of the present invention;
FIG. 7 is a block diagram of a classification network construction device in an embodiment of the present invention;
FIG. 8 is a block diagram of an object detection model building apparatus in accordance with the present invention;
fig. 9 is a block diagram showing a configuration of an object detection model constructing apparatus according to still another embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below to clearly and completely describe the technical solutions in the embodiments of the present invention. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without creative effort, shall fall within the protection scope of the present invention.
The inventor proposes the following technical idea to solve the problems in the related art: in the model training process, additional supervision is applied to the classification results of images of the same type of object shot under different shooting states, so that the classification results of such images are drawn close to each other in a high-dimensional space, thereby enhancing the model's detection capability on images shot under different shooting states.
To achieve the technical idea, referring to fig. 1, a schematic diagram of a technical idea proposed by an embodiment of the present application is shown.
As shown in fig. 1, in the process of model training, two sample images shot in different shooting states for the same area may be input to the neural network to obtain the classification results output by the neural network for the two sample images, and the difference between these classification results may be determined, so that the parameters of the neural network are updated according to both this difference and the losses corresponding to the classification results. For example, the parameters are updated according to the difference S together with the classification losses L1 and L2 of the two classification results, so that the gap between the classification results of images shot under different shooting states is continuously narrowed during training and the network's classifications of the two images tend to be consistent.
In conjunction with the technical concept shown in fig. 1, the embodiment of the present application shows an example of constructing a neural network with object classification capability, which is described in sections 1.1 and 1.2 below.
1.1 the process of constructing the classification network:
referring to fig. 2, a flowchart illustrating steps of a classification network construction method is shown, where the method may be applied to an intelligent terminal, and as shown in fig. 2, the method may specifically include the following steps:
step S201: and respectively carrying out feature extraction on the plurality of images of the sample object to obtain a plurality of corresponding feature images, wherein the plurality of images have different image distribution domains.
In this embodiment, the sample object may refer to any type of object, and specifically, a plurality of images obtained by photographing the sample object in different photographing states may be obtained, and feature extraction may be performed on the plurality of images to obtain feature images corresponding to the plurality of images. The different shooting states may refer to different shooting environment conditions, different shooting picture styles, or different shooting scenes.
Since the plurality of images are captured in different capturing states, the obtained plurality of images may have different image distribution domains, and the image distribution domains in this embodiment may refer to different background information of the images. Thus, a plurality of feature images obtained after feature extraction also have different background features.
For example, different environmental conditions produce images with different backgrounds, such as night and daytime scenes, or different weather backgrounds such as heavy fog, heavy rain, and clear sky. As in fig. 1, a vehicle may be photographed in clear weather to obtain a first-domain sample image A, and in foggy weather to obtain a second-domain sample image B; after feature extraction is performed on the second-domain sample image B, the background features of the resulting feature image represent a foggy background.
Further, for example, the sample object may be photographed in different picture styles, yielding images of the sample object in different styles, such as a sketch-style image and an oil-painting-style image. In this way, the feature images obtained by performing feature extraction on images of different styles have different image distribution domains, which can also be understood as different background styles. Of course, the images may also be taken in different background scenes, for example, in different street scenes.
It is understood that the plurality of images obtained by photographing the sample object in different photographing states may also be photographed from different camera angles; that is, the images may differ both in photographing state and in camera angle.
Step S202: and inputting the characteristic images corresponding to the images with different image distribution domains into a preset network in pairs to obtain a plurality of first classification results and a plurality of second classification results.
In this embodiment, after obtaining a plurality of feature images, each two feature images with different image distribution domains may be combined into a sample pair, so that a plurality of sample pairs may be obtained, where each sample pair includes a feature image in one image distribution domain and a feature image in another image distribution domain.
In one scenario, the two feature images included in the sample pair may be images taken from the same camera angle, so that the difficulty in subsequently comparing the classification results of the two feature images can be reduced. Of course, in other scenarios, the two feature images included in the sample image pair may also be images taken under different camera perspectives.
The preset network may be a convolutional neural network, specifically, the structure of the preset network may be set according to actual requirements, and the preset network may be used to identify the category to which the object in the image belongs.
In this embodiment, two feature images in each group of sample pairs may be input to the preset network in pairs, so as to obtain a first classification result output by the preset network for classifying the feature images in one of the image distribution domains, and a second classification result output by the preset network for classifying the feature images in the other image distribution domain. That is, the first classification result represents a result of the preset network classifying the object of the image shot in one shooting state, and the second classification result represents a result of the preset network classifying the object of the image shot in another different shooting state.
As shown in fig. 1, the feature image of the first-domain sample image A and the feature image of the second-domain sample image B may be input to the neural network for classification and recognition, so that the neural network classifies objects captured under different weather conditions.
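As an illustration of this step only, the following minimal PyTorch sketch feeds such feature-image pairs through a preset network; the network structure, the 256 feature channels, and the 10 classes are assumptions made for the example, not details fixed by this embodiment.

```python
import torch
import torch.nn as nn

# Hypothetical preset network: maps a feature image to class logits.
# The 256-channel input and 10-class output are assumptions for the sketch.
preset_net = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(256, 10),
)

def classify_pair(feat_a: torch.Tensor, feat_b: torch.Tensor):
    """Input a sample pair (one feature image per image distribution domain)
    and return the first and second classification results."""
    first_result = preset_net(feat_a)   # classification of the domain-A feature image
    second_result = preset_net(feat_b)  # classification of the domain-B feature image
    return first_result, second_result

# Usage: a batch of 8 sample pairs of 256x7x7 feature images.
feats_a = torch.randn(8, 256, 7, 7)
feats_b = torch.randn(8, 256, 7, 7)
first_results, second_results = classify_pair(feats_a, feats_b)
```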
Step S203: and updating the parameters of the preset network according to the distances between the first classification results and the second classification results and the class labels of the sample objects.
In practice, because a sample pair input to the preset network consists of the feature images corresponding to two images in different image distribution domains, and feature images in different image distribution domains have different background features, the accuracy of the preset network in classifying the two feature images of the sample pair may differ.
Illustratively, as shown in fig. 1, the second-domain sample image B is a car photographed on a foggy day, and the first-domain sample image A is a car photographed on a clear day. The clarity of the feature image corresponding to the second-domain sample image B therefore differs from that of the feature image corresponding to the first-domain sample image A, so the accuracy of the preset network in classifying the two feature images may differ, which may prevent similar objects in different image distribution domains from being aggregated together in the feature map output by the last layer.
Therefore, in the embodiment of the application, in order to enable similar objects in different image distribution domains to be mutually gathered in the feature map output by the last layer, the difference between similar objects in different external environments can be continuously reduced in the training process.
Specifically, distances between a plurality of first classification results and a plurality of second classification results may be determined, and then when the parameter of the preset network is updated according to the first classification result, the second classification result, and the class label, the parameter of the preset network is updated by combining the distances between the first classification result and the second classification result. The preset network parameter may refer to a model parameter of the preset network.
It is understood that the class label refers to the true category to which the sample object belongs.
Step S204: and determining the preset network after being updated for multiple times as a classification network.
In this embodiment, training may be ended when the preset network has been updated a preset number of times or has converged, and the preset network at the end of training is used as the classification network.
In an example, referring to fig. 3, a flowchart illustrating a step of updating a preset network is shown, and as shown in fig. 3, the method may specifically include the following steps:
step S2041: and determining a target loss value according to the distance between the plurality of first classification results and the plurality of second classification results.
In this embodiment, two kinds of loss are used when updating the preset network: one is the gap between the classification results, and the other is a loss determined from the classification results and the class label. The gap between the classification results is taken as the target loss value.
Thus, at each training iteration, the distance between the first classification results and the second classification results output by the preset network can be determined and taken as the target loss value. The target loss value therefore represents the difference between the classification results for feature images of different image distribution domains; the smaller this difference, the better the preset network overcomes the influence of different image distribution domains and the more accurately it can classify objects.
In one example, when determining the target loss value, the target loss value may be determined according to a difference between result centers of the classification results, and specifically, the plurality of first classification results may be processed first to determine a first result center, and the plurality of second classification results may be processed to determine a second result center; and then, determining a target loss value according to the difference between the first result center and the second result center.
When the target loss value is determined according to the difference between the first result center and the second result center, it can be determined by the following formula (1):

$$\mathrm{Loss}_{cosine} = 1 - \frac{\bar{c}_1 \cdot \bar{c}_2}{\lVert \bar{c}_1 \rVert\,\lVert \bar{c}_2 \rVert} \qquad (1)$$

wherein $\mathrm{Loss}_{cosine}$ is the target loss value, $\bar{c}_1$ is the vector representation of the first result center, and $\bar{c}_2$ is the vector representation of the second result center.
In this embodiment, processing the first classification results may mean averaging their feature vectors to obtain a mean vector, and this mean is the first result center; that is, the first result center can be represented by a vector. The second result center can be obtained in the same way. Thereafter, the target loss value can be obtained according to formula (1).
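A minimal sketch of this center computation and of formula (1), under the assumption that the classification results are batched feature vectors and that each result center is their mean:

```python
import torch
import torch.nn.functional as F

def cosine_target_loss(first_results: torch.Tensor,
                       second_results: torch.Tensor) -> torch.Tensor:
    """Target loss of formula (1): cosine distance between the center of the
    first classification results and the center of the second ones."""
    c1 = first_results.mean(dim=0)   # first result center (mean feature vector)
    c2 = second_results.mean(dim=0)  # second result center
    # 1 - cos(c1, c2) is 0 when the two centers are perfectly aligned.
    return 1.0 - F.cosine_similarity(c1, c2, dim=0)
```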
Step S2042: determining a first classification loss value according to the plurality of first classification results and the class label of the sample object, and determining a second classification loss value according to the plurality of second classification results and the class label of the sample object.
In this embodiment, during each training iteration, a first classification loss value for object classification on the feature images of one image distribution domain may be determined according to the first classification results and the class label output by the current iteration, and a second classification loss value for object classification on the feature images of the other image distribution domain may be determined according to the second classification results and the class label.
The first classification loss value and the second classification loss value can both represent the difference between the classification result of classifying the sample object and the real class.
Step S2043: and updating the parameters of the preset network according to the target loss value, the first classification loss value and the second classification loss value.
In this embodiment, the overall loss of the preset network may be determined according to the target loss value, the first classification loss value, the second classification loss value, and the respective corresponding weights, and then the parameters of the preset network may be updated according to the overall loss.
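Continuing the sketch above, one update of the preset network might combine the three loss values as follows; the use of cross-entropy for the classification losses and the equal default weights are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def update_preset_network(optimizer: torch.optim.Optimizer,
                          first_results: torch.Tensor,
                          second_results: torch.Tensor,
                          labels: torch.Tensor,
                          w_target: float = 1.0,
                          w_cls: float = 1.0) -> float:
    """One parameter update from the target loss plus the two classification
    losses (weights w_target and w_cls are illustrative)."""
    loss_target = cosine_target_loss(first_results, second_results)
    loss_cls1 = F.cross_entropy(first_results, labels)   # first classification loss
    loss_cls2 = F.cross_entropy(second_results, labels)  # second classification loss
    total = w_target * loss_target + w_cls * (loss_cls1 + loss_cls2)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```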
According to the technical scheme of the embodiment of the application, the trained classification network can be obtained, so that the classification network can be used for classifying objects and can also be used as a sub-network in other object detection networks to identify the classes of the objects.
1.2 the construction process of the object detection model.
There may be two ways to construct the object detection model. The first, described in section 1.2.1, directly replaces the classification sub-network in an original object detection model with the classification network trained by the present application. The second, described in section 1.2.2, trains a preset network comprising a classification branch and a positioning branch, following a training process similar to that of the classification network, to obtain the object detection model.
1.2.1 Method I of constructing an object detection model
First, an original object detection model is obtained, which includes an original classification subnetwork and an original regression subnetwork.
And then, replacing the original classification sub-network with a classification network trained through the process of section 1.1, thereby obtaining a target object detection model.
In this embodiment, the regression sub-network may generate a location box for the sample object, and the classification sub-network may be used to classify the object in the location box.
Of course, in an example, the original object detection model may further include a feature extraction module, where the original classification sub-network and the original regression sub-network are respectively connected to the output end of the feature extraction module, and the feature extraction module is configured to perform feature extraction on the image input to the original object detection model, and input the extracted feature image to the original classification sub-network and the original regression sub-network simultaneously. After the original classification subnetwork is replaced by the classification network trained through the process of section 1.1, the extracted feature images can be input into the classification network and the original regression subnetwork.
After the original classification sub-network is replaced by the classification network trained through the 1.1-section process, the target object detection model can be trained by utilizing a plurality of sample images of the sample object in different shooting states to obtain a final target object detection model, so that the object classification accuracy of the target object detection model can be further improved.
It can be understood that when the classification network is substituted into the original object detection model as the classification sub-network, no modification of the original regression sub-network is required; therefore, the classification network of the present application can be substituted into any network model with an object classification function, which broadens the application range of the classification network.
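The replacement itself can be sketched as a simple module swap; the attribute name classification_head is hypothetical, since the structure of the original object detection model is not fixed here:

```python
import torch.nn as nn

def replace_classification_subnetwork(detection_model: nn.Module,
                                      trained_classifier: nn.Module) -> nn.Module:
    """Swap the original classification sub-network for the classification
    network trained as in section 1.1; the regression sub-network is untouched."""
    detection_model.classification_head = trained_classifier  # assumed attribute name
    return detection_model
```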
1.2.2 method II of constructing an object detection model
Referring to fig. 4 and 5, fig. 4 shows a flowchart of the steps for constructing an object detection model, and fig. 5 shows a schematic diagram of the principle of training the object detection model in the second way. As shown in fig. 4, the method may specifically include the following steps:
step S401: a plurality of images of the sample object is obtained, the plurality of images having different image distribution domains.
In this embodiment, the process of obtaining the multiple images of the sample object is described with reference to the step S201, and is not described herein again.
Each image may include a plurality of sample objects. As shown in fig. 5, each image input to the neural network on the left side includes a plurality of vehicles; that is, image A and image B each include 5 sample objects, namely 5 vehicles, and each vehicle is one sample object.
Step S402: and inputting every two images with different image distribution domains into a preset network in pairs to obtain a plurality of first object detection results and a plurality of second object detection results.
As shown in fig. 5, the preset network may include a backbone network, a classification branch, and a positioning branch, where the backbone network performs feature extraction on each input image and feeds the extracted feature images to the classification branch and the positioning branch, respectively.
The classification branch processes the input feature image to obtain the category of each sample object, and the positioning branch processes the input feature image to obtain the position of each sample object. In this embodiment, the category and position of every sample object can be obtained for each of the two images input in pairs, where the category and position of one sample object constitute one object detection result. Since one image includes a plurality of sample objects, a plurality of object detection results can be obtained for each image: the plurality of first object detection results correspond to one of the images input to the preset network in pairs, and the plurality of second object detection results correspond to the other.
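A sketch of such a preset network is given below, assuming a small convolutional backbone and anchor-style branch heads; the layer sizes and anchor count are illustrative only:

```python
import torch
import torch.nn as nn

class PresetDetectionNet(nn.Module):
    """Backbone shared by a classification branch and a positioning branch."""
    def __init__(self, num_classes: int, num_anchors: int = 9):
        super().__init__()
        self.backbone = nn.Sequential(              # feature extraction
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        # Per-location class scores and box offsets, as in anchor-based detectors.
        self.cls_branch = nn.Conv2d(64, num_anchors * num_classes, 3, padding=1)
        self.loc_branch = nn.Conv2d(64, num_anchors * 4, 3, padding=1)

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)
        return self.cls_branch(feats), self.loc_branch(feats)
```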
Step S403: comparing the plurality of first object detection results with the plurality of second object detection results to determine a first target object detection result and a second target object detection result characterizing the same object class.
In this embodiment, for one image input to the preset network, the types and positions of a plurality of sample objects in the image may be detected, so as to obtain a plurality of first object detection results, where each first object detection result corresponds to the type and position of one sample object. Similarly, for another image input to the preset network, a plurality of second object detection results may be obtained, each of the second object detection results corresponding to a category and a position of one sample object.
The first target object detection result and the second target object detection result of the same object type can be determined from the plurality of first object detection results and the plurality of second object detection results. The first object detection result and the second object detection result with the same object type are screened out according to the object type represented by the first object detection result and the object type represented by the second object detection result, so that the similar objects in different image distribution domains can be found.
As shown in fig. 5, since images A and B each include 5 sample objects, 5 object detection results can be obtained for each image, each including the position and category of one vehicle. Since the 5 sample objects all belong to the same category, namely the vehicle category, the object detection results belonging to the vehicle category among the first object detection results can be taken as first target object detection results, and those among the second object detection results can be taken as second target object detection results.
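The screening of same-class results can be sketched as a class-keyed pairing; here each detection result is assumed, for illustration only, to be a (class_id, box, score_vector) tuple:

```python
from collections import defaultdict

def match_same_class(first_results, second_results):
    """Pair every first object detection result with the second object
    detection results that predict the same class."""
    by_class = defaultdict(list)
    for cls_id, box, scores in second_results:
        by_class[cls_id].append((box, scores))
    matched = []
    for cls_id, box, scores in first_results:
        for box2, scores2 in by_class[cls_id]:
            matched.append(((cls_id, box, scores), (cls_id, box2, scores2)))
    return matched  # (first target result, second target result) pairs
```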
Step S404: and updating the parameters of the preset network according to the distance between the first target object detection result and the second target object detection result, and the position label and the category label of the sample object.
Step S405: and determining the preset network after multiple updates as an object detection model.
After the first target object detection result and the second target object detection result of the same object class are screened out, since objects of the same class under different image distribution domains need to be aggregated together, the difference between same-class objects from different image distribution domains is continuously reduced during training; that is, the distance between the first target object detection result and the second target object detection result of the same object class is continuously reduced.
For example, as shown in fig. 1, the first target object detection result and the second target object detection result that both belong to the vehicle class need to be drawn close to each other, so that the features belonging to vehicles can be grouped together in the high-dimensional feature map.
In specific implementation, the distance between the first target object detection result and the second target object detection result may be used as one loss for updating the preset network. In addition, since the first object detection results and the second object detection results each include a position detection result and a category detection result, an object detection loss can be determined from these results together with the position labels and category labels. Thus, the preset network can be updated multiple times according to both the object detection loss and the distance between the first target object detection result and the second target object detection result.
In an example, referring to fig. 6, a flowchart illustrating a step of updating the preset network is shown, and as shown in fig. 6, the method may specifically include the following steps:
step S601: and determining a target loss value according to the distance between the first target object detection result and the second target object detection result.
In this embodiment, the process of determining the target loss value may refer to the process described in section 1.1 above, and is not described herein again.
Step S602: determining a first object detection loss value according to the plurality of first object detection results and the position labels and the category labels of the sample objects, and determining a second object detection loss value according to the plurality of second object detection results and the position labels and the category labels of the sample objects.
As shown in fig. 5, in this embodiment, since the first object detection result and the second object detection result both include the position detection result and the category detection result, at each training, the first position loss value L3 may be determined according to the position detection result and the position label in the plurality of first object detection results, and the first category loss value L1 may be determined according to the category detection result and the category label in the plurality of first object detection results, so as to determine the first object detection loss value according to the first position loss value L3 and the first category loss value L1.
Similarly, the second position loss value L4 may be determined according to the position detection result and the position tag in the second object detection results, and the second category loss value L2 may be determined according to the category detection result and the category tag in the second object detection results, so that the second object detection loss value may be determined according to the second position loss value L4 and the second category loss value L2.
Step S603: and updating the parameters of the preset network according to the target loss value, the first object detection loss value and the second object detection loss value.
In one example, the overall loss of the preset network may be determined according to a target loss value, the first object detection loss value, the second object detection loss value, and respective corresponding weights, so that the preset network is updated according to the overall loss.
As shown in fig. 5, in an application scenario, since the preset network has a classification branch and a positioning branch, when updating, parameters of the classification branch may be updated according to the first class loss value L1, the second class loss value L2, and the target loss value S; then, the parameters of the positioning branch can be updated according to the first position loss value L3 and the second position loss value L4.
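One possible sketch of this branch-wise update uses two optimizers, each holding one branch's parameters; how the shared backbone's parameters are assigned is left out of the sketch and would be a design choice:

```python
import torch

def update_branches(cls_optimizer: torch.optim.Optimizer,
                    loc_optimizer: torch.optim.Optimizer,
                    L1: torch.Tensor, L2: torch.Tensor,
                    L3: torch.Tensor, L4: torch.Tensor,
                    S: torch.Tensor) -> None:
    """Update the classification branch from L1 + L2 + S and the positioning
    branch from L3 + L4 (all arguments are scalar loss tensors)."""
    cls_optimizer.zero_grad()
    loc_optimizer.zero_grad()
    total = (L1 + L2 + S) + (L3 + L4)
    total.backward()          # one backward pass through the shared graph
    cls_optimizer.step()      # steps only the classification-branch parameters
    loc_optimizer.step()      # steps only the positioning-branch parameters
```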
In one example, after the object detection model is obtained, it may be used for object detection. Specifically, an image to be detected may be obtained, where the image to be detected has any one of the different image distribution domains; and the image to be detected is input into the object detection model to obtain the position and the category of the object contained in the image to be detected.
In this example, after obtaining the object detection model, the object detection model may perform object detection on the input image to be detected. The image to be detected may be an image photographed in any photographing state, for example, an image photographed in any external environment, and the image to be detected may include an object to be recognized.
The object detection model can identify the position and the category of an object contained in the image to be detected, so as to obtain a detection result of the position and the category of the object.
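Inference with the trained model can then be sketched as follows; decoding the raw branch outputs into final boxes and categories (score thresholding, non-maximum suppression) depends on the head layout and is assumed to happen elsewhere:

```python
import torch

@torch.no_grad()
def detect(model: torch.nn.Module, image: torch.Tensor):
    """Run the object detection model on one image to be detected and return
    the raw category scores and position predictions of the two branches."""
    model.eval()
    cls_out, loc_out = model(image.unsqueeze(0))  # add a batch dimension
    return cls_out, loc_out
```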
When the object detection model is trained, sample images shot in different shooting states are used as training samples, so that the images in the training samples have different image distribution domains, and the parameters of the preset network are updated according to the differences between the detection results of same-class objects in different image distribution domains. As training deepens, these differences become smaller and smaller, so that by the end of training the detection results of same-class objects in different image distribution domains are aggregated together. The object detection model is therefore suitable for detecting images shot in different shooting states, which improves the generalization ability of the model and the accuracy of object detection when the shooting state changes.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 7, a block diagram of a structure of a classification network constructing apparatus according to an embodiment of the present invention is shown, and as shown in fig. 7, the apparatus may specifically include the following modules:
a feature extraction module 701, configured to perform feature extraction on a plurality of images of a sample object, respectively, to obtain a plurality of corresponding feature images, where the plurality of images have different image distribution domains;
a classification result obtaining module 702, configured to input feature images corresponding to two images with different image distribution domains into a preset network in pairs, so as to obtain a plurality of first classification results and a plurality of second classification results;
an updating module 703, configured to update the parameter of the preset network according to the distance between the first classification results and the second classification results and the class label of the sample object;
a determining module 704, configured to determine the preset network after multiple updates as the classification network.
The update module 703 may specifically include the following units:
a first loss value determining unit, configured to determine a target loss value according to distances between the plurality of first classification results and the plurality of second classification results;
a second loss value determination unit configured to determine a first classification loss value according to the plurality of first classification results and the class label of the sample object, and determine a second classification loss value according to the plurality of second classification results and the class label of the sample object;
and the updating unit is used for updating the parameters of the preset network according to the target loss value, the first classification loss value and the second classification loss value.
Optionally, the apparatus may further include the following modules:
the processing module is used for processing the plurality of first classification results to determine a first result center, and processing the plurality of second classification results to determine a second result center;
the first loss value determining unit is specifically configured to determine a target loss value according to a difference between the first result center and the second result center.
Optionally, the first loss value determining unit is specifically configured to determine the target loss value according to the following formula:
$$\mathrm{Loss}_{cosine} = 1 - \frac{\bar{c}_1 \cdot \bar{c}_2}{\lVert \bar{c}_1 \rVert\,\lVert \bar{c}_2 \rVert}$$

wherein $\mathrm{Loss}_{cosine}$ is the target loss value, $\bar{c}_1$ is the vector representation of the first classification result center, and $\bar{c}_2$ is the vector representation of the second classification result center.
Referring to fig. 8, a structural block diagram of an object detection model building apparatus according to an embodiment of the present application is shown, and as shown in fig. 8, the apparatus may specifically include the following modules:
a model obtaining module 801, configured to obtain an original object detection model, where the original object detection model includes an original classification subnetwork and an original regression subnetwork;
a replacing module 802, configured to replace the original classification sub-network with the classification network described in section 1.1, to obtain a target object detection model.
Referring to fig. 9, a structural block diagram of an apparatus for constructing an object detection model according to an embodiment of the present application is shown, and as shown in fig. 9, the apparatus may specifically include the following modules:
a sample image obtaining module 901, configured to obtain a plurality of images of a sample object, where the plurality of images have different image distribution domains;
a detection result obtaining module 902, configured to input each two images with different image distribution domains into a preset network in pairs, so as to obtain a plurality of first object detection results and a plurality of second object detection results;
a screening module 903, configured to compare the multiple first object detection results with the multiple second object detection results, and determine a first target object detection result and a second target object detection result that represent the same object class;
an updating module 904, configured to update the parameters of the preset network according to the distance between the first target object detection result and the second target object detection result, and the position label and the category label of the sample object;
a determining module 905, configured to determine the preset network after multiple updates as an object detection model.
Optionally, the updating module 904 may specifically include the following units:
a first loss determining unit configured to determine a target loss value according to a distance between the first target object detection result and the second target object detection result;
a second loss determining unit configured to determine a first object detection loss value according to the plurality of first object detection results and the position tags and the category tags of the sample objects, and determine a second object detection loss value according to the plurality of second object detection results and the position tags and the category tags of the sample objects;
and the updating unit is used for updating the parameters of the preset network according to the target loss value, the first object detection loss value and the second object detection loss value.
Optionally, the apparatus may further include the following modules:
the image acquisition module is used for acquiring an image to be detected, and the image to be detected has any image distribution domain in the different image distribution domains;
and the detection module is used for inputting the image to be detected into the object detection model to obtain the position and the category of the object contained in the image to be detected.
It should be noted that the device embodiments are similar to the method embodiments, so that the description is simple, and reference may be made to the method embodiments for relevant points.
Embodiments of the present invention further provide an electronic device, which may include a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor is configured to execute the classification network building method or the object detection model building method.
Embodiments of the present invention further provide a computer-readable storage medium storing a computer program, which enables a processor to execute the classification network construction method or the object detection model construction method according to the embodiments of the present invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The classification network construction method, the object detection model construction method, and the corresponding apparatus, device, and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the methods and their core ideas. A person skilled in the art may, following the ideas of the present invention, make changes to the specific implementations and the scope of application; in summary, the contents of this specification should not be construed as limiting the present invention.

Claims (13)

1. A method for constructing a classification network, the method comprising:
respectively extracting the features of a plurality of images of a sample object to obtain a plurality of corresponding feature images, wherein the plurality of images have different image distribution domains;
inputting feature images corresponding to two images with different image distribution domains into a preset network in pairs to obtain a plurality of first classification results and a plurality of second classification results;
updating the parameters of the preset network according to the distances between the first classification results and the second classification results and the class labels of the sample objects;
and determining the preset network after multiple updates as a classification network.
2. The method of claim 1, wherein updating the parameters of the preset network according to the distances between the first classification results and the second classification results and the class labels of the sample objects comprises:
determining a target loss value according to the distance between the plurality of first classification results and the plurality of second classification results;
determining a first classification loss value according to the plurality of first classification results and the class label of the sample object, and determining a second classification loss value according to the plurality of second classification results and the class label of the sample object;
and updating the parameters of the preset network according to the target loss value, the first classification loss value and the second classification loss value.
3. The method of claim 2, further comprising:
processing the plurality of first classification results to determine a first result center, and processing the plurality of second classification results to determine a second result center;
wherein determining a target loss value according to the distances between the first classification results and the second classification results comprises:
determining a target loss value according to the distance between the first result center and the second result center.
4. The method of claim 3, wherein determining a target loss value based on a distance between the first result center and the second result center comprises:
determining a target loss value according to the following formula:
Loss_cosine = 1 − (A · B) / (‖A‖ ‖B‖)
wherein Loss_cosine is the target loss value, A is the vector representation of the first result center, and B is the vector representation of the second result center (an illustrative sketch of this loss follows the claims).
5. An object detection model construction method, characterized in that the method comprises:
obtaining an original object detection model, wherein the original object detection model comprises an original classification sub-network and an original regression sub-network;
replacing the original classification sub-network with the classification network of any one of claims 1-4 to obtain a target object detection model.
6. An object detection model construction method, characterized in that the method comprises:
obtaining a plurality of images of a sample object, the plurality of images having different image distribution domains;
inputting images with different image distribution domains into a preset network in pairs to obtain a plurality of first object detection results and a plurality of second object detection results;
comparing the plurality of first object detection results with the plurality of second object detection results to determine a first target object detection result and a second target object detection result characterizing the same object class (an illustrative matching sketch follows the claims);
updating the parameters of the preset network according to the distance between the first target object detection result and the second target object detection result and the position label and the category label of the sample object;
and determining the preset network after multiple updates as an object detection model.
7. The method of claim 6, wherein updating the parameters of the preset network according to the distance between the first target object detection result and the second target object detection result and the position label and the category label of the sample object comprises:
determining a target loss value according to the distance between the first target object detection result and the second target object detection result;
determining a first object detection loss value according to the plurality of first object detection results and the position labels and the category labels of the sample objects, and determining a second object detection loss value according to the plurality of second object detection results and the position labels and the category labels of the sample objects;
and updating the parameters of the preset network according to the target loss value, the first object detection loss value and the second object detection loss value.
8. The method of claim 6 or 7, wherein after obtaining the object detection model, the method further comprises:
obtaining an image to be detected, wherein the image to be detected has any one of the different image distribution domains;
and inputting the image to be detected into the object detection model to obtain the position and the category of the object contained in the image to be detected.
9. A classification network construction apparatus, the apparatus comprising:
the characteristic extraction module is used for respectively extracting the characteristics of a plurality of images of a sample object to obtain a plurality of corresponding characteristic images, and the plurality of images have different image distribution domains;
the classification result obtaining module is used for inputting the feature images corresponding to the images with different image distribution domains into a preset network in pairs to obtain a plurality of first classification results and a plurality of second classification results;
the updating module is used for updating the parameters of the preset network according to the distances between the first classification results and the second classification results and the class labels of the sample objects;
and the determining module is used for determining the preset network after multiple updates as the classification network.
10. An object detection model construction apparatus, characterized in that the apparatus comprises:
a model obtaining module for obtaining an original object detection model, wherein the original object detection model comprises an original classification sub-network and an original regression sub-network;
a replacing module, configured to replace the original classification sub-network with the classification network according to any one of claims 1 to 4, so as to obtain a target object detection model.
11. An object detection model construction apparatus, characterized in that the apparatus comprises:
a sample image obtaining module for obtaining a plurality of images of a sample object, the plurality of images having different image distribution domains;
the detection result obtaining module is used for inputting images with different image distribution domains into a preset network in pairs to obtain a plurality of first object detection results and a plurality of second object detection results;
a screening module for comparing the plurality of first object detection results with the plurality of second object detection results and determining a first target object detection result and a second target object detection result representing the same object class;
the updating module is used for updating the parameters of the preset network according to the distance between the first target object detection result and the second target object detection result and the position label and the category label of the sample object;
and the determining module is used for determining the preset network after multiple updates as the object detection model.
12. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the classification network construction method according to any one of claims 1 to 4, or the object detection model construction method according to claim 5, or the object detection model construction method according to any one of claims 6 to 8.
13. A computer-readable storage medium storing a computer program for causing a processor to execute the classification network construction method according to any one of claims 1 to 4, or the object detection model construction method according to claim 5, or the object detection model construction method according to any one of claims 6 to 8.
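As a reading aid only (not part of the claims), the training step of claims 1 to 4 could be sketched in PyTorch as follows. Everything the claims leave open is an assumption here: that the preset network acts as a classification head over per-domain feature images, that a result center is the mean of a group of classification results, and that the per-domain classification losses are cross-entropy; all identifiers are illustrative.

```python
import torch
import torch.nn.functional as F

def classification_train_step(preset_network, optimizer,
                              features_domain_a, features_domain_b,
                              class_labels):
    """One update over paired feature images of the same sample objects,
    drawn from two different image distribution domains, shape (N, D)."""
    # Feed the paired feature images to the preset network, yielding the
    # first and second classification results (claim 1).
    first_results = preset_network(features_domain_a)
    second_results = preset_network(features_domain_b)

    # Reduce each plurality of results to a result center (claim 3);
    # taking the mean is an assumption.
    first_center = first_results.mean(dim=0)
    second_center = second_results.mean(dim=0)

    # Target loss as the cosine distance between the two centers (claim 4):
    # Loss_cosine = 1 - (A . B) / (|A| |B|).
    target_loss = 1.0 - F.cosine_similarity(first_center, second_center, dim=0)

    # Per-domain classification losses against the class labels of the
    # sample objects (claim 2); cross-entropy is an assumption.
    first_cls_loss = F.cross_entropy(first_results, class_labels)
    second_cls_loss = F.cross_entropy(second_results, class_labels)

    loss = target_loss + first_cls_loss + second_cls_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating this step over many batches and then freezing the preset network corresponds to determining it as the classification network; claims 5 and 10 then amount to swapping this network in for the original classification sub-network of a detection model.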
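Similarly, claim 6 pairs first and second object detection results that characterize the same object class, but does not spell out how the comparison is performed; one plausible, purely illustrative choice is a greedy class-based matching:

```python
def match_results_by_class(first_results, second_results):
    """Pair detection results from the two domains that predict the same
    object class. Each result is assumed to be a (score_vector, class_id)
    tuple; greedy first-match pairing is an assumption."""
    matched_pairs = []
    used = set()
    for vec_a, cls_a in first_results:
        for j, (vec_b, cls_b) in enumerate(second_results):
            if j not in used and cls_a == cls_b:
                matched_pairs.append((vec_a, vec_b))  # same-class pair
                used.add(j)
                break
    return matched_pairs
```

The distance between each matched pair then drives the target loss of claim 7, which is combined with the two object detection losses computed from the position and category labels of the sample objects.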
CN202011453575.0A 2020-12-11 2020-12-11 Method, device, equipment and medium for constructing classification network and object detection model Pending CN112613539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011453575.0A CN112613539A (en) 2020-12-11 2020-12-11 Method, device, equipment and medium for constructing classification network and object detection model


Publications (1)

Publication Number Publication Date
CN112613539A true CN112613539A (en) 2021-04-06

Family

ID=75233119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011453575.0A Pending CN112613539A (en) 2020-12-11 2020-12-11 Method, device, equipment and medium for constructing classification network and object detection model

Country Status (1)

Country Link
CN (1) CN112613539A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108689A (en) * 2017-12-18 2018-06-01 北京工业大学 A kind of intelligent parking method for detecting parking stalls
CN108304859A (en) * 2017-12-29 2018-07-20 达闼科技(北京)有限公司 Image-recognizing method and cloud system
CN108446688A (en) * 2018-05-28 2018-08-24 北京达佳互联信息技术有限公司 Facial image Sexual discriminating method, apparatus, computer equipment and storage medium
CN108764334A (en) * 2018-05-28 2018-11-06 北京达佳互联信息技术有限公司 Facial image face value judgment method, device, computer equipment and storage medium
CN109754009A (en) * 2018-12-29 2019-05-14 北京沃东天骏信息技术有限公司 Item identification method, device, vending system and storage medium
WO2020134102A1 (en) * 2018-12-29 2020-07-02 北京沃东天骏信息技术有限公司 Article recognition method and device, vending system, and storage medium
WO2020155518A1 (en) * 2019-02-03 2020-08-06 平安科技(深圳)有限公司 Object detection method and device, computer device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114385846A (en) * 2021-12-23 2022-04-22 北京旷视科技有限公司 Image classification method, electronic device, storage medium and program product
CN114782748A (en) * 2022-04-27 2022-07-22 广州文远知行科技有限公司 Vehicle door detection method and device, storage medium and automatic driving method

Similar Documents

Publication Publication Date Title
CN108596277B (en) Vehicle identity recognition method and device and storage medium
CN110516514B (en) Modeling method and device of target detection model
CN109034086B (en) Vehicle weight identification method, device and system
CN109190504B (en) Automobile image data processing method and device and readable storage medium
CN112613539A (en) Method, device, equipment and medium for constructing classification network and object detection model
CN112215205B (en) Target identification method and device, computer equipment and storage medium
CN114331949A (en) Image data processing method, computer equipment and readable storage medium
CN114708304B (en) Cross-camera multi-target tracking method, device, equipment and medium
CN113420871A (en) Image quality evaluation method, image quality evaluation device, storage medium, and electronic device
CN111626357A (en) Image identification method based on neural network model
CN115100489A (en) Image processing method, device and equipment and readable storage medium
CN111242176A (en) Computer vision task processing method and device and electronic system
CN114937248A (en) Vehicle tracking method and device for cross-camera, electronic equipment and storage medium
CN110399868B (en) Coastal wetland bird detection method
CN110688512A (en) Pedestrian image search algorithm based on PTGAN region gap and depth neural network
CN114399638A (en) Semantic segmentation network training method, equipment and medium based on patch learning
CN114040094A (en) Method and equipment for adjusting preset position based on pan-tilt camera
CN113628251B (en) Smart hotel terminal monitoring method
CN112364946B (en) Training method of image determination model, and method, device and equipment for image determination
CN114299012A (en) Object surface defect detection method and system based on convolutional neural network
CN111310536A (en) Machine continuous learning method for neural network object classification and monitoring camera equipment thereof
CN113807247B (en) Pedestrian re-identification efficient labeling method and device based on graph rolling network
CN113627253B (en) Target re-identification method and device
CN116681123B (en) Perception model training method, device, computer equipment and storage medium
CN116758494B (en) Intelligent monitoring method and system for vehicle-mounted video of internet-connected vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination