CN113902899A - Training method, target detection method, device, electronic device and storage medium - Google Patents

Training method, target detection method, device, electronic device and storage medium Download PDF

Info

Publication number
CN113902899A
CN113902899A (application CN202111156486.4A)
Authority
CN
China
Prior art keywords
sample image
feature map
model
preset model
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111156486.4A
Other languages
Chinese (zh)
Inventor
谌强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111156486.4A
Publication of CN113902899A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a training method, a target detection method, an apparatus, an electronic device and a storage medium, relating to the field of artificial intelligence technology, and in particular to computer vision and deep learning technology. The specific implementation scheme is as follows: inputting a first sample image and a second sample image into a first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image; inputting a third sample image and a fourth sample image into a second preset model to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image; based on a first contrast loss function, adjusting model parameters of the first preset model by using the first feature map, the second feature map, the third feature map and the fourth feature map until a preset condition is met, wherein the values of the model parameters of the second preset model are the same as the values of the model parameters of the first preset model; and determining the first preset model obtained when the preset condition is met as a detection model.

Description

Training method, target detection method, device, electronic device and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to computer vision and deep learning techniques, and more particularly to a training method, an object detection method, an apparatus, an electronic device, and a storage medium.
Background
Target detection is a fundamental task in the field of computer vision, widely applied in fields such as intelligent security, intelligent transportation, and human-computer interaction.
Object detection may refer to separating an object of interest from an image and determining the class and location of the object.
Disclosure of Invention
The disclosure provides a training method, a target detection method, an apparatus, an electronic device, and a storage medium.
According to an aspect of the present disclosure, there is provided a training method of a detection model, including: inputting a first sample image and a second sample image into a first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image; inputting a third sample image and a fourth sample image into a second preset model to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image, wherein the first sample image and the third sample image are a positive sample image pair, and the second sample image and the fourth sample image are a negative sample image pair; adjusting model parameters of the first preset model by using the first feature map, the second feature map, the third feature map and the fourth feature map based on a first contrast loss function until a preset condition is met, wherein the values of the model parameters of the second preset model are the same as the values of the model parameters of the first preset model; and determining a first preset model obtained under the condition that the preset condition is met as the detection model.
According to another aspect of the present disclosure, there is provided an object detection method including: inputting an image to be processed into a detection model, and obtaining the category and the position of each object included in the image to be processed, wherein the detection model is trained by using the method.
According to another aspect of the present disclosure, there is provided a training apparatus for detecting a model, including: the first obtaining module is used for inputting a first sample image and a second sample image into a first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image; a second obtaining module, configured to input a third sample image and a fourth sample image into a second preset model, so as to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image, where the first sample image and the third sample image are a positive sample image pair, and the second sample image and the fourth sample image are a negative sample image pair; an adjusting module, configured to adjust a model parameter of the first preset model based on a first contrast loss function by using the first feature map, the second feature map, the third feature map, and the fourth feature map until a preset condition is satisfied, where a value of the model parameter of the second preset model is the same as a value of the model parameter of the first preset model; and the determining module is used for determining a first preset model obtained under the condition that the preset condition is met as the detection model.
According to another aspect of the present disclosure, there is provided an object detecting apparatus including: a seventh obtaining module, configured to input an image to be processed into a detection model, and obtain a category and a position of each object included in the image to be processed, where the detection model is trained by using the apparatus as described above.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture of a training method, a target detection method, and an apparatus to which a detection model may be applied, according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a training method of a detection model according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of a training process of a detection model according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates an example schematic of a training process of a detection model according to an embodiment of this disclosure;
FIG. 5 schematically illustrates a flow chart of a target detection method according to an embodiment of the present disclosure;
FIG. 6 schematically shows a block diagram of a training apparatus of a detection model according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of an object detection apparatus according to an embodiment of the disclosure; and
FIG. 8 schematically shows a block diagram of an electronic device adapted to implement the training method of a detection model and the target detection method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Target detection can be achieved by using a detection model trained on a sample image set. One way to train the detection model is to pre-train a preset model using a publicly available sample image set to obtain a pre-trained model, and then fine-tune the pre-trained model using a sample image set corresponding to a scene to obtain a detection model for that scene.
A sample image set corresponding to a scene can hardly cover all situations of that scene. For example, environmental changes such as variations in weather or lighting may cause large changes in sample images of the same scene. In addition, a robust detection model may need to be deployed in multiple scenes, some of which may differ greatly from the scenes covered by the sample image set. As a result, a detection model trained in the above way may produce various bad cases, requiring repeated retraining and fine-tuning. This is time-consuming and labor-intensive, and even after fine-tuning on the bad cases the performance of the detection model may remain unsatisfactory. In short, detection models trained in the above way have poor robustness.
Therefore, the embodiments of the present disclosure provide a scheme for improving the robustness of a detection model: a first sample image and a second sample image are input into a first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image. A third sample image and a fourth sample image are input into a second preset model to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image. The first sample image and the third sample image are a positive sample image pair, and the second sample image and the fourth sample image are a negative sample image pair. Based on a first contrast loss function, the model parameters of the first preset model are adjusted by using the first feature map, the second feature map, the third feature map and the fourth feature map until a preset condition is met. The values of the model parameters of the second preset model are the same as the values of the model parameters of the first preset model. The first preset model obtained when the preset condition is met is determined as the detection model.
The first feature map and the third feature map are feature maps corresponding to the positive sample image pair, and the second feature map and the fourth feature map are feature maps corresponding to the negative sample image pair. Processing the first, second, third and fourth feature maps with the first contrast loss function in a contrastive-learning manner enables the first preset model to learn scene-invariant information and enhances the quality of the features it extracts, so that the detection model can achieve higher robustness across different scenes.
Fig. 1 schematically illustrates an exemplary system architecture of a training method, an object detection method and apparatus to which a detection model may be applied, according to an embodiment of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the training method, the target detection method, and the apparatus for the detection model may be applied may include a terminal device, but the terminal device may implement the training method, the target detection method, and the apparatus for the detection model provided in the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a knowledge reading application, a web browser application, a search application, an instant messaging tool, a mailbox client, and/or social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be various types of servers providing various services, such as a background management server (for example only) providing support for content browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the training method of the detection model and the target detection method provided by the embodiments of the present disclosure may be generally executed by the terminal device 101, 102, or 103. Accordingly, the training device and the target detection device for detecting the model provided by the embodiment of the present disclosure may also be disposed in the terminal device 101, 102, or 103.
Alternatively, the training method and the target detection method of the detection model provided by the embodiment of the present disclosure may also be generally performed by the server 105. Accordingly, the training device and the target detection device of the detection model provided by the embodiment of the present disclosure may be generally disposed in the server 105. The training method and the target detection method of the detection model provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the training device and the target detection device of the detection model provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
For example, the server 105 may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak service scalability found in conventional physical hosts and VPS (Virtual Private Server) services. The server 105 may also be a server of a distributed system or a server that incorporates a blockchain.
For example, the server 105 inputs a first sample image and a second sample image into a first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image, and inputs a third sample image and a fourth sample image into a second preset model to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image. The first and third sample images are a positive sample image pair and the second and fourth sample images are a negative sample image pair. And based on the first contrast loss function, adjusting the model parameters of the first preset model by using the first characteristic diagram, the second characteristic diagram, the third characteristic diagram and the fourth characteristic diagram until the preset conditions are met. The values of the model parameters of the second predetermined model are the same as the values of the model parameters of the first predetermined model. And determining a first preset model obtained under the condition that a preset condition is met as a detection model. Or a server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105 adjusts the model parameters of the first preset model based on the first comparison loss function by using the first feature map, the second feature map, the third feature map and the fourth feature map until the preset condition is satisfied.
For example, the server 105 inputs the image to be processed into the detection model, and obtains the category and the position of each object included in the image to be processed. Or the server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105 inputs the image to be processed into the detection model, and the category and the position of each object included in the image to be processed are obtained.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically shows a flow chart of a training method of a detection model according to an embodiment of the present disclosure.
As shown in FIG. 2, the method 200 includes operations S210-S240.
In operation S210, the first sample image and the second sample image are input into the first preset model, and a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image are obtained.
In operation S220, the third sample image and the fourth sample image are input into the second preset model, and a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image are obtained. The first and third sample images are a positive sample image pair and the second and fourth sample images are a negative sample image pair.
In operation S230, model parameters of the first preset model are adjusted based on the first contrast loss function by using the first feature map, the second feature map, the third feature map, and the fourth feature map until a preset condition is satisfied. The values of the model parameters of the second predetermined model are the same as the values of the model parameters of the first predetermined model.
In operation S240, a first preset model obtained in a case where a preset condition is satisfied is determined as a detection model.
According to an embodiment of the present disclosure, the first preset model and the second preset model may be two models having the same network structure, and the values of their model parameters may be the same. The first preset model and the second preset model may be one-stage detection models or two-stage detection models. The two-stage detection models may include R-CNN (Region-based Convolutional Neural Network), Fast R-CNN, or Mask R-CNN.
According to embodiments of the present disclosure, a positive sample image pair may refer to two sample images of the same scene, and a negative sample image pair may refer to two sample images of different scenes. The positive and negative sample image pairs may be sample image pairs corresponding to at least one scene, and there may be a plurality of each. The first sample image and the third sample image may be a positive sample image pair, and the second sample image and the fourth sample image may be a negative sample image pair. At least one of the first sample image and the third sample image may be a directly acquired image, or may be obtained by processing a preset image. The preset image may include the first sample image, the third sample image, or another sample image. The first sample image may be the same as or different from the second sample image, the third sample image, and the fourth sample image. The second sample image may be the same as or different from the first sample image and the third sample image. The third sample image may be the same as or different from the first sample image, the second sample image, and the fourth sample image. The fourth sample image may be the same as or different from the first sample image and the third sample image.
According to an embodiment of the present disclosure, the first contrast loss function may be used to make the similarity between the first sample image and the third sample image as large as possible, and the similarity between the second sample image and the fourth sample image as small as possible. The similarity measure may be set according to actual service requirements and is not limited herein. For example, it may be a cosine similarity, a Pearson correlation coefficient, a Euclidean distance, or a Jaccard distance.
According to an embodiment of the present disclosure, the preset condition may be used as a condition for determining whether the first preset model is trained completely. The preset condition may include that the number of training times is greater than or equal to a time threshold. Alternatively, the preset condition may include convergence of an output value of the loss function. The loss function may comprise a first comparative loss function. The detection model may be used to detect the location and class of objects in the image.
According to the embodiment of the disclosure, for the positive sample image pair, the first sample image may be processed by using the first preset model to obtain the first feature map corresponding to the first sample image, and the third sample image may be processed by using the second preset model to obtain the third feature map corresponding to the third sample image. For the negative sample image pair, the second sample image may be processed by using the first preset model to obtain the second feature map corresponding to the second sample image, and the fourth sample image may be processed by using the second preset model to obtain the fourth feature map corresponding to the fourth sample image.
According to the embodiment of the disclosure, after the first feature map, the second feature map, the third feature map and the fourth feature map are obtained, they may be input into the first contrast loss function to obtain an output value, and the model parameters of the first preset model are then adjusted according to the output value until the preset condition is satisfied. The first contrast loss function can be processed by using a gradient descent algorithm, such as stochastic gradient descent, to obtain a gradient vector, and the model parameters of the first preset model are adjusted according to the gradient vector by means of back propagation. During training, the second preset model does not participate in the back-propagation update of model parameters. Instead, after the values of the model parameters of the first preset model are determined, the values of the model parameters of the second preset model are set to be consistent with them.
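To make this parameter-update flow concrete, the following is a minimal sketch of one training iteration, assuming PyTorch; model_a and model_b stand for the backbone networks of the first and second preset models, the image tensors and margin are illustrative, and the loss is the image-level term of equation (1) below.

```python
import torch

def train_step(model_a, model_b, optimizer, img1, img2, img3, img4, margin=1.0):
    # First preset model: first and second sample images (gradients flow here).
    fmap1 = model_a(img1).flatten(1)            # first feature map  (positive pair)
    fmap2 = model_a(img2).flatten(1)            # second feature map (negative pair)
    # Second preset model: third and fourth sample images, no back propagation.
    with torch.no_grad():
        fmap3 = model_b(img3).flatten(1)        # third feature map  (positive pair)
        fmap4 = model_b(img4).flatten(1)        # fourth feature map (negative pair)

    # First contrast loss, per equation (1): y = 1 for the positive pair
    # (pull together), y = 0 for the negative pair (push at least `margin` apart).
    d_pos = torch.norm(fmap1 - fmap3, p=2, dim=1)
    d_neg = torch.norm(fmap2 - fmap4, p=2, dim=1)
    loss = (d_pos.pow(2) + torch.clamp(margin - d_neg, min=0).pow(2)).mean() / 2

    optimizer.zero_grad()
    loss.backward()                             # adjust the first preset model only
    optimizer.step()

    # Set the second preset model's parameter values to match the first's.
    model_b.load_state_dict(model_a.state_dict())
    return loss.item()
```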
According to the embodiment of the disclosure, the first feature map and the third feature map are feature maps corresponding to the positive sample image pair, and the second feature map and the fourth feature map are feature maps corresponding to the negative sample image pair. Processing the four feature maps with the first contrast loss function in a contrastive-learning manner enables the first preset model to learn scene-invariant information and enhances the quality of the features it extracts, so that the detection model can achieve higher robustness across different scenes.
According to an embodiment of the present disclosure, operation S230 may include the following operations.
A first output value is obtained by using the first feature map and the third feature map based on the first contrast loss function. A second output value is obtained by using the second feature map and the fourth feature map based on the first contrast loss function. The model parameters of the first preset model are adjusted according to the output value until the output value converges, where the output value includes the first output value and the second output value. The first preset model obtained when the output value converges is determined as the detection model.
According to the embodiment of the disclosure, the first feature map and the third feature map may be input into the first contrast loss function to obtain the first output value, and the second feature map and the fourth feature map may be input into the first contrast loss function to obtain the second output value. An output value is obtained from the first output value and the second output value, and the model parameters of the first preset model are adjusted according to this output value.
According to an embodiment of the present disclosure, the first contrast loss function may be determined according to the following equation (1).
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y_n \, d_n^2 + (1 - y_n) \, \max(\mathrm{margin} - d_n,\ 0)^2 \right] \tag{1}$$
According to an embodiment of the present disclosure, N characterizes the number of sample image pairs, and y characterizes the label of whether the two sample images of a pair match: y = 1 characterizes the two sample images as matching, and y = 0 as not matching. d characterizes the Euclidean distance between the two sample images, and margin characterizes a preset threshold. Two sample images may be considered matching if they are a positive sample image pair, and mismatched if they are a negative sample image pair.
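Transcribed directly into code, equation (1) might look like the following PyTorch sketch; the function name and the (N, D) embedding layout are assumptions for illustration.

```python
import torch

def first_contrast_loss(a: torch.Tensor, b: torch.Tensor,
                        y: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """a, b: (N, D) paired embeddings; y: (N,), 1 = matching, 0 = not matching."""
    d = torch.norm(a - b, p=2, dim=1)                            # Euclidean distance d
    matched = y * d.pow(2)                                       # y = 1: pull together
    unmatched = (1 - y) * torch.clamp(margin - d, min=0).pow(2)  # y = 0: push apart
    return (matched + unmatched).sum() / (2 * a.size(0))         # average over 2N
```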
According to an embodiment of the disclosure, the output values further comprise a third output value and a fourth output value.
According to an embodiment of the present disclosure, the training method of the detection model may further include the following operations.
And respectively inputting the first feature map and the second feature map into a first preset model to obtain a first example vector of each object included in the first sample image and a second example vector of each object included in the second sample image. And respectively inputting the third feature map and the fourth feature map into a second preset model to obtain a third example vector of each object included in the third sample image and a fourth example vector of each object included in the fourth sample image. And obtaining a third output value by using the first example vector and the third example vector based on the second contrast loss function. And obtaining a fourth output value by using the second example vector and the fourth example vector based on the second contrast loss function.
According to embodiments of the present disclosure, an instance vector may be used to characterize a feature of an object included in a sample image. The sample image may include at least one object.
According to an embodiment of the present disclosure, the second contrast loss function may be used to make the similarity between an object in the first sample image and the same object in the third sample image as large as possible, and the similarity between an object in the second sample image and an object in the fourth sample image as small as possible.
According to an embodiment of the present disclosure, for the positive sample image pair, the first feature map may be processed using the first preset model to obtain the first instance vector of each object in the first sample image, and the third feature map may be processed using the second preset model to obtain the third instance vector of each object in the third sample image. For the negative sample image pair, the second feature map may be processed using the first preset model to obtain the second instance vector of each object in the second sample image, and the fourth feature map may be processed using the second preset model to obtain the fourth instance vector of each object in the fourth sample image.
According to an embodiment of the present disclosure, after the first, second, third, and fourth instance vectors are obtained, the first instance vector and the third instance vector may be input into the second contrast loss function to obtain the third output value, and the second instance vector and the fourth instance vector may be input into the second contrast loss function to obtain the fourth output value.
According to an embodiment of the present disclosure, an output value is obtained from the first output value, the second output value, the third output value, and the fourth output value. The model parameters of the first preset model are then adjusted according to this output value until the preset condition is met.
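For illustration, if the second contrast loss function is assumed to take the same form as equation (1) (the disclosure does not fix its exact form), the four output values could be combined as follows, reusing the first_contrast_loss helper sketched above; all shapes and names are hypothetical.

```python
import torch

# Hypothetical flattened feature maps (image pairs) and instance vectors (object pairs).
fmap1, fmap3 = torch.randn(4, 256), torch.randn(4, 256)   # positive image pair
fmap2, fmap4 = torch.randn(4, 256), torch.randn(4, 256)   # negative image pair
inst1, inst3 = torch.randn(8, 128), torch.randn(8, 128)   # positive-pair instances
inst2, inst4 = torch.randn(8, 128), torch.randn(8, 128)   # negative-pair instances

out1 = first_contrast_loss(fmap1, fmap3, torch.ones(4))   # first output value
out2 = first_contrast_loss(fmap2, fmap4, torch.zeros(4))  # second output value
out3 = first_contrast_loss(inst1, inst3, torch.ones(8))   # third output value
out4 = first_contrast_loss(inst2, inst4, torch.zeros(8))  # fourth output value

# A plain sum is an assumption; the disclosure only states that the four values
# are combined into one output value used to adjust the first preset model.
output_value = out1 + out2 + out3 + out4
```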
According to the embodiment of the disclosure, processing the first, second, third and fourth instance vectors with the second contrast loss function effectively keeps the instance vectors consistent, enabling the first preset model to recognize objects of different classes across different scenes, reducing false detections and further improving the robustness of the detection model.
According to an embodiment of the present disclosure, the first feature map and the second feature map are respectively input into the first preset model, and the obtaining of the first instance vector of each object included in the first sample image and the second instance vector of each object included in the second sample image may include the following operations.
And respectively inputting the first feature map and the second feature map into a detection head of a first preset model to obtain a first instance vector of each object included in the first sample image and a second instance vector of each object included in the second sample image.
According to an embodiment of the present disclosure, the step of inputting the third feature map and the fourth feature map into the second preset model respectively to obtain the third instance vector of each object included in the third sample image and the fourth instance vector of each object included in the fourth sample image may include the following operations.
And respectively inputting the third feature map and the fourth feature map into a detection head of a second preset model to obtain a third instance vector of each object included in the third sample image and a fourth instance vector of each object included in the fourth sample image.
According to an embodiment of the present disclosure, the first preset model and the second preset model may each include a detection head. The detection head may be used to determine the position and class of objects. The detection head may include a candidate region network (RPN, Region Proposal Network) and ROI Pooling; alternatively, the detection head may include a candidate region network and ROIAlign. ROI refers to Region of Interest.
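As a brief illustration of the ROI step, the sketch below uses torchvision.ops.roi_align to pool a fixed-size instance feature for each candidate region; the feature-map shape, the boxes, and the spatial scale are hypothetical placeholders for what the candidate region network would produce.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 50)              # (N, C, H, W) backbone output
boxes = torch.tensor([[0, 4.0, 4.0, 20.0, 20.0],       # (batch_index, x1, y1, x2, y2)
                      [0, 10.0, 12.0, 30.0, 28.0]])    # two candidate regions
instance_feats = roi_align(feature_map, boxes,
                           output_size=(7, 7),         # fixed-size region features
                           spatial_scale=1.0 / 8)      # feature-map stride vs. image
print(instance_feats.shape)                            # torch.Size([2, 256, 7, 7])
```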
According to an embodiment of the present disclosure, operation S210 may include the following operations.
And respectively inputting the first sample image and the second sample image into a backbone network of a first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image.
According to an embodiment of the present disclosure, operation S220 may include the following operations.
And inputting the third sample image and the fourth sample image into a backbone network of a second preset model to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image.
According to an embodiment of the present disclosure, the first preset model and the second preset model may each include a backbone (i.e., backbone) network. The backbone network may be used for feature extraction of images.
According to an embodiment of the present disclosure, the second sample image is obtained by processing the first sample image using a data enhancement method.
According to an embodiment of the present disclosure, the data enhancement method may include at least one of a geometric transformation method and a pixel transformation method. The geometric transformation method may include at least one of flipping, rotating, cropping, scaling, translating, and jittering. The pixel transformation method may include at least one of adjusting sharpness, adjusting contrast, adjusting brightness, and adjusting saturation.
According to the embodiment of the disclosure, the first sample image may be processed by a data enhancement method to obtain a second sample image, so that the second sample image and the first sample image are images in the same scene.
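A minimal sketch of such a data enhancement pipeline, assuming torchvision; the particular transforms and the file name are illustrative, not prescribed by the disclosure.

```python
from PIL import Image
from torchvision import transforms

# Combine a geometric transformation with pixel transformations so the derived
# image shows the same scene with a different appearance.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=1.0),      # geometric: flipping
    transforms.ColorJitter(brightness=0.4,       # pixel: brightness,
                           contrast=0.4,         # contrast,
                           saturation=0.4),      # and saturation
])

first_sample = Image.open("first_sample.jpg")    # hypothetical file
second_sample = augment(first_sample)            # same scene, different appearance
```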
According to an embodiment of the present disclosure, the first sample image may be obtained by processing the second sample image using a data enhancement method.
The training method of the detection model according to the embodiment of the present disclosure is further described with reference to fig. 3 to 4.
Fig. 3 schematically shows a schematic diagram of a training process of a detection model according to an embodiment of the present disclosure.
As shown in fig. 3, the first preset model includes a backbone network 303 and a detection head 314. The second preset model includes a backbone network 308 and a detection head 317.
In the training process 300 of the detection model, the first sample image 301 is input to the backbone network 303, and a first feature map 304 corresponding to the first sample image 301 is obtained. The second sample image 302 is input to the backbone network 303, and a second feature map 305 corresponding to the second sample image 302 is obtained.
The third sample image 306 is input to the backbone network 308, and a third feature map 309 corresponding to the third sample image 306 is obtained. The fourth sample image 307 is input to the backbone network 308, and a fourth feature map 310 corresponding to the fourth sample image 307 is obtained.
The first profile 304 and the third profile 309 are input to a first contrast loss function 311 to obtain a first output value 312. The second profile 305 and the fourth profile 310 are input to a first contrast loss function 311 to obtain a second output value 313.
The first feature map 304 is input to the detection head 314, resulting in a first instance vector 315 for each object comprised by the first sample image 301. The second feature map 305 is input to the detection head 314, resulting in a second instance vector 316 for each object included in the second sample image 302.
The third feature map 309 is input to the detection head 317, and a third instance vector 318 of each object included in the third sample image 306 is obtained. The fourth feature map 310 is input to the detection head 317, and a fourth instance vector 319 of each object included in the fourth sample image 307 is obtained.
The first 315 and third 318 instance vectors are input to a second contrast loss function 320, resulting in a third output value 321. The second instance vector 316 and the fourth instance vector 319 are input to a second contrast loss function 320, resulting in a fourth output value 322.
An output value 323 is obtained from the first output value 312, the second output value 313, the third output value 321, and the fourth output value 322.
According to the output value 323, model parameters of the backbone network 303 and the detection head 314 included in the first preset model are adjusted until a preset condition is met. And determining a first preset model obtained by training under the condition of meeting the preset condition as a detection model.
Referring to fig. 4, the training process of the detection model is further described, taking as an example a detection head that includes a candidate region network and ROIAlign.
Fig. 4 schematically shows an example schematic of a training process of a detection model according to an embodiment of the disclosure.
As shown in fig. 4, the first preset model 401 includes a backbone network 4010 and a detection head 4011. The detection head 4011 includes a candidate region network 4011A and ROIAlign 4011B. The second preset model 402 includes a backbone network 4020 and a detection head 4021. The detection head 4021 includes a candidate region network 4021A and ROIAlign 4021B.
In the training process 400 of the detection model, the first sample image 403 is input to the backbone network 4010 to obtain a first feature map. A preset number of initial regions (i.e., initial ROIs) are set for each pixel position of the first feature map, and these initial regions are input into the candidate region network 4011A to obtain, for each initial region, a classification result as foreground or background together with its position. At least one candidate region is determined from the preset number of initial regions according to the classification results and positions. The first feature map and the at least one candidate region are then input into ROIAlign 4011B to obtain a candidate feature map 405 (i.e., a Proposal Feature Map). The candidate feature map 405 is classified to obtain a first instance vector 406 of each object included in the first sample image 403, and is input into Box Head 407 to obtain the position of each object.
The third sample image 404 is input to the backbone network 4020 to obtain a second feature map. A preset number of initial regions (i.e., initial ROIs) are set for each pixel position of the second feature map, and these initial regions are input into the candidate region network 4021A to obtain, for each initial region, a classification result as foreground or background together with its position. At least one candidate region is determined from the preset number of initial regions according to the classification results and positions. The second feature map and the at least one candidate region are input into ROIAlign 4021B to obtain a candidate feature map 408. The candidate feature map 408 is classified to obtain a third instance vector 409 of each object included in the third sample image 404.
In the training process, the first contrast loss function is introduced so that the backbone network can learn scene-invariant information, and the second contrast loss function is introduced so that the instance vectors maintain consistency across different scenes. Model parameters are shared between the backbone network 4010 and the backbone network 4020, between the candidate region network 4011A and the candidate region network 4021A, and between ROIAlign 4011B and ROIAlign 4021B.
The above is merely an exemplary embodiment, but is not limited thereto, and other training methods of detection models and target detection methods known in the art may be included as long as improvement of the robustness of the detection model can be achieved.
In the technical scheme of the present disclosure, the processes of collecting, storing, using, processing, transmitting, providing, disclosing and the like of the personal information of the related users all conform to the regulations of the related laws and regulations, and do not violate the good custom of the public order.
Fig. 5 schematically shows a flow chart of a target detection method according to an embodiment of the present disclosure.
As shown in fig. 5, the method 500 includes operation S510.
In operation S510, the image to be processed is input to the detection model, and the category and the position of each object included in the image to be processed are obtained.
According to an embodiment of the present disclosure, a detection model is trained using a training method of a detection model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the detection model may include a backbone network and a detection head. The image to be processed may be input into the backbone network of the detection model to obtain a feature map corresponding to the image to be processed. The feature map is then input into the detection head of the detection model to obtain an instance vector of each object included in the image to be processed, and each instance vector is classified to obtain the category of each object. In addition, inputting the feature map into the detection head yields the candidate region corresponding to each object, and the position of each object is obtained by determining the positions of these candidate regions.
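For illustration, the sketch below runs inference with torchvision's Faster R-CNN as a stand-in for the trained detection model (the model described in this disclosure is not publicly available); it only demonstrates the input/output contract described above, and the 0.5 confidence threshold is an assumption.

```python
import torch
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # stand-in detection model
model.eval()

image = to_tensor(Image.open("to_process.jpg"))      # hypothetical image file
with torch.no_grad():
    prediction = model([image])[0]                   # one dict per input image

# Category and position of each detected object.
for label, box, score in zip(prediction["labels"],
                             prediction["boxes"],
                             prediction["scores"]):
    if score > 0.5:                                  # confidence threshold (assumption)
        print(f"class {label.item()}: box {box.tolist()}")
```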
Fig. 6 schematically shows a block diagram of a training apparatus of a detection model according to an embodiment of the present disclosure.
As shown in fig. 6, the training apparatus 600 for detecting a model may include a first obtaining module 610, a second obtaining module 620, an adjusting module 630, and a determining module 640.
The first obtaining module 610 is configured to input the first sample image and the second sample image into a first preset model, and obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image.
The second obtaining module 620 is configured to input the third sample image and the fourth sample image into a second preset model, so as to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image. The first and third sample images are a positive sample image pair and the second and fourth sample images are a negative sample image pair.
An adjusting module 630, configured to adjust a model parameter of the first preset model by using the first feature map, the second feature map, the third feature map, and the fourth feature map based on the first contrast loss function until a preset condition is satisfied. The values of the model parameters of the second predetermined model are the same as the values of the model parameters of the first predetermined model.
A determining module 640, configured to determine a first preset model obtained when a preset condition is met as a detection model.
According to an embodiment of the present disclosure, the adjusting module 630 may include a first obtaining unit, a second obtaining unit, an adjusting unit, and a determining unit.
And the first obtaining unit is used for obtaining a first output value by utilizing the first characteristic diagram and the third characteristic diagram based on the first comparison loss function.
And the second obtaining unit is used for obtaining a second output value by utilizing the second characteristic diagram and the fourth characteristic diagram based on the first comparison loss function.
And the adjusting unit is used for adjusting the model parameters of the first preset model according to the output value until the output value converges. The output value includes the first output value and the second output value.
A determination unit configured to determine a first preset model obtained in a case where the output value converges as a detection model.
According to an embodiment of the present disclosure, the output values further include a third output value and a fourth output value;
according to an embodiment of the present disclosure, the training apparatus 600 for detecting a model may further include a third obtaining module, a fourth obtaining module, a fifth obtaining module, and a sixth obtaining module.
And the third obtaining module is used for respectively inputting the first characteristic diagram and the second characteristic diagram into the first preset model to obtain a first example vector of each object included in the first sample image and a second example vector of each object included in the second sample image.
And the fourth obtaining module is used for respectively inputting the third feature map and the fourth feature map into the second preset model to obtain a third instance vector of each object included in the third sample image and a fourth instance vector of each object included in the fourth sample image.
And the fifth obtaining module is used for obtaining a third output value by utilizing the first example vector and the third example vector based on the second contrast loss function.
And a sixth obtaining module, configured to obtain a fourth output value by using the second instance vector and the fourth instance vector based on the second contrast loss function.
According to an embodiment of the present disclosure, the first obtaining module 610 may include a third obtaining unit.
And the third obtaining unit is used for respectively inputting the first sample image and the second sample image into the backbone network of the first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image.
According to an embodiment of the present disclosure, the second obtaining module 620 may include a fourth obtaining unit.
And the fourth obtaining unit is used for inputting the third sample image and the fourth sample image into the backbone network of the second preset model to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image.
According to an embodiment of the present disclosure, the third obtaining module may include a fifth obtaining unit.
A fifth obtaining unit, configured to input the first feature map and the second feature map into a detection head of the first preset model, respectively, to obtain a first instance vector of each object included in the first sample image and a second instance vector of each object included in the second sample image.
According to an embodiment of the present disclosure, the fourth obtaining module may include a sixth obtaining unit.
And the sixth obtaining unit is used for respectively inputting the third feature map and the fourth feature map into the detection head of the second preset model to obtain a third instance vector of each object included in the third sample image and a fourth instance vector of each object included in the fourth sample image.
According to an embodiment of the present disclosure, the second sample image is obtained by processing the first sample image using a data enhancement method.
Fig. 7 schematically shows a block diagram of an object detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the object detection apparatus 700 may include a seventh obtaining module 710.
A seventh obtaining module 710, configured to input the image to be processed into the detection model, and obtain a category and a position of each object included in the image to be processed.
According to an embodiment of the present disclosure, a detection model is trained using a training apparatus of a detection model according to an embodiment of the present disclosure.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present disclosure, a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described above.
According to an embodiment of the disclosure, a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
Fig. 8 schematically shows a block diagram of an electronic device adapted to implement the training method of a detection model and the target detection method according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the electronic device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 801 executes the methods and processes described above, such as the training method of the detection model or the target detection method. For example, in some embodiments, the training method of the detection model or the target detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the training method of the detection model or the target detection method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the training method of the detection model or the target detection method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on a machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server that incorporates a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
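Likewise for illustration only, the claims below further recite a second contrast loss computed on per-object instance vectors produced by a detection head, a data enhancement method for deriving the second sample image from the first, and an inference step that returns the category and position of each object. The sketch continues under the same PyTorch assumption as above; the head structure, the fixed number of object slots, the flip-based augmentation, and all names are invented for illustration.

import torch
import torch.nn.functional as F
from torch import nn

class DetectionHead(nn.Module):
    # Stand-in head mapping a feature map to one instance vector per object
    # slot, plus class logits and box coordinates for each slot.
    def __init__(self, channels=64, slots=10, num_classes=80, dim=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.inst = nn.Linear(channels, slots * dim)
        self.cls = nn.Linear(dim, num_classes)
        self.box = nn.Linear(dim, 4)
        self.slots, self.dim = slots, dim

    def forward(self, feat):
        pooled = self.pool(feat).flatten(1)                  # (N, C)
        inst = self.inst(pooled).view(-1, self.slots, self.dim)
        return inst, self.cls(inst), self.box(inst)

def second_loss(inst_q, inst_k, temperature=0.1):
    # Same cosine form as before, applied per object slot; the disclosure
    # only names "a second contrast loss function", so this is an assumption.
    q = F.normalize(inst_q, dim=-1)
    k = F.normalize(inst_k, dim=-1)
    return -(q * k).sum(dim=-1).div(temperature).mean()

def augment(image):
    # One assumed example of a data enhancement method for deriving the
    # second sample image from the first (a horizontal flip).
    return torch.flip(image, dims=[-1])

@torch.no_grad()
def detect(image, model, head):
    # Inference: category and position of each object in the image.
    inst, logits, boxes = head(model(image))
    return logits.argmax(dim=-1), boxes

In a full training step, second_loss terms for the positive and negative instance-vector pairs would be added to the feature-map loss before back-propagation; at inference time, detect would return per-object categories and bounding boxes from the trained detection model.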

Claims (17)

1. A training method of a detection model, comprising:
inputting a first sample image and a second sample image into a first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image;
inputting a third sample image and a fourth sample image into a second preset model to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image, wherein the first sample image and the third sample image are a positive sample image pair, and the second sample image and the fourth sample image are a negative sample image pair;
adjusting, based on a first contrast loss function, model parameters of the first preset model by using the first feature map, the second feature map, the third feature map, and the fourth feature map until a preset condition is met, wherein values of the model parameters of the second preset model are the same as values of the model parameters of the first preset model; and
determining a first preset model obtained under the condition that the preset condition is met as the detection model.
2. The method according to claim 1, wherein the adjusting, based on the first contrast loss function, the model parameters of the first preset model by using the first feature map, the second feature map, the third feature map, and the fourth feature map until the preset condition is met comprises:
obtaining a first output value by using the first feature map and the third feature map based on the first contrast loss function;
obtaining a second output value by using the second feature map and the fourth feature map based on the first contrast loss function;
adjusting the model parameters of the first preset model according to output values until the output values converge, wherein the output values comprise the first output value and the second output value; and
determining a first preset model obtained under the condition that the output values converge as the detection model.
3. The method of claim 2, wherein the output values further comprise a third output value and a fourth output value;
the method further comprises the following steps:
inputting the first feature map and the second feature map into the first preset model respectively to obtain a first instance vector of each object included in the first sample image and a second instance vector of each object included in the second sample image;
inputting the third feature map and the fourth feature map into the second preset model respectively to obtain a third instance vector of each object included in the third sample image and a fourth instance vector of each object included in the fourth sample image;
obtaining the third output value by using the first instance vector and the third instance vector based on a second contrast loss function; and
obtaining the fourth output value by using the second instance vector and the fourth instance vector based on the second contrast loss function.
4. The method according to any one of claims 1 to 3, wherein the inputting a first sample image and a second sample image into a first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image comprises:
respectively inputting the first sample image and the second sample image into a backbone network of the first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image;
and wherein the inputting a third sample image and a fourth sample image into a second preset model to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image comprises:
respectively inputting the third sample image and the fourth sample image into a backbone network of the second preset model to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image.
5. The method according to claim 3, wherein the inputting the first feature map and the second feature map into the first preset model respectively to obtain a first instance vector of each object included in the first sample image and a second instance vector of each object included in the second sample image comprises:
respectively inputting the first feature map and the second feature map into a detection head of the first preset model to obtain a first instance vector of each object included in the first sample image and a second instance vector of each object included in the second sample image;
the inputting the third feature map and the fourth feature map into the second preset model respectively to obtain a third instance vector of each object included in the third sample image and a fourth instance vector of each object included in the fourth sample image includes:
respectively inputting the third feature map and the fourth feature map into a detection head of the second preset model to obtain a third instance vector of each object included in the third sample image and a fourth instance vector of each object included in the fourth sample image.
6. The method according to any one of claims 1 to 5, wherein the second sample image is obtained by processing the first sample image using a data enhancement method.
7. A method of target detection, comprising:
inputting an image to be processed into a detection model to obtain a category and a position of each object included in the image to be processed,
wherein the detection model is trained using the method according to any one of claims 1-6.
8. A training apparatus for a detection model, comprising:
the first obtaining module is used for inputting a first sample image and a second sample image into a first preset model to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image;
a second obtaining module, configured to input a third sample image and a fourth sample image into a second preset model, so as to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image, where the first sample image and the third sample image are a positive sample image pair, and the second sample image and the fourth sample image are a negative sample image pair;
an adjusting module, configured to adjust, based on a first contrast loss function, model parameters of the first preset model by using the first feature map, the second feature map, the third feature map, and the fourth feature map until a preset condition is met, wherein values of the model parameters of the second preset model are the same as values of the model parameters of the first preset model; and
a determining module, configured to determine a first preset model obtained under the condition that the preset condition is met as the detection model.
9. The apparatus of claim 8, wherein the adjustment module comprises:
a first obtaining unit, configured to obtain a first output value by using the first feature map and the third feature map based on the first contrast loss function;
a second obtaining unit, configured to obtain a second output value by using the second feature map and the fourth feature map based on the first contrast loss function;
an adjusting unit, configured to adjust the model parameters of the first preset model according to output values until the output values converge, wherein the output values comprise the first output value and the second output value; and
a determining unit, configured to determine a first preset model obtained under the condition that the output values converge as the detection model.
10. The apparatus of claim 9, wherein the output values further comprise a third output value and a fourth output value;
the device further comprises:
a third obtaining module, configured to input the first feature map and the second feature map into the first preset model respectively, so as to obtain a first instance vector of each object included in the first sample image and a second instance vector of each object included in the second sample image;
a fourth obtaining module, configured to input the third feature map and the fourth feature map into the second preset model respectively, so as to obtain a third instance vector of each object included in the third sample image and a fourth instance vector of each object included in the fourth sample image;
a fifth obtaining module, configured to obtain the third output value by using the first instance vector and the third instance vector based on a second contrast loss function; and
a sixth obtaining module, configured to obtain the fourth output value by using the second instance vector and the fourth instance vector based on the second contrast loss function.
11. The apparatus of any of claims 8-10, wherein the first obtaining module comprises:
a third obtaining unit, configured to input the first sample image and the second sample image into a backbone network of the first preset model respectively, so as to obtain a first feature map corresponding to the first sample image and a second feature map corresponding to the second sample image;
the second obtaining module includes:
a fourth obtaining unit, configured to input the third sample image and the fourth sample image into a backbone network of the second preset model, so as to obtain a third feature map corresponding to the third sample image and a fourth feature map corresponding to the fourth sample image.
12. The apparatus of claim 10, wherein the third obtaining module comprises:
a fifth obtaining unit, configured to input the first feature map and the second feature map into a detection head of the first preset model respectively, so as to obtain a first instance vector of each object included in the first sample image and a second instance vector of each object included in the second sample image;
the fourth obtaining module includes:
a sixth obtaining unit, configured to input the third feature map and the fourth feature map into a detection head of the second preset model respectively, so as to obtain a third instance vector of each object included in the third sample image and a fourth instance vector of each object included in the fourth sample image.
13. The apparatus according to any one of claims 8-12, wherein the second sample image is obtained by processing the first sample image using a data enhancement method.
14. An object detection device comprising:
a seventh obtaining module, configured to input an image to be processed into a detection model, so as to obtain a category and a position of each object included in the image to be processed,
wherein the detection model is trained using an apparatus according to any one of claims 8 to 13.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6 or claim 7.
16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6 or claim 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 6 or claim 7.
CN202111156486.4A 2021-09-29 2021-09-29 Training method, target detection method, device, electronic device and storage medium Pending CN113902899A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111156486.4A CN113902899A (en) 2021-09-29 2021-09-29 Training method, target detection method, device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111156486.4A CN113902899A (en) 2021-09-29 2021-09-29 Training method, target detection method, device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN113902899A (en) 2022-01-07

Family

ID=79189438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111156486.4A Pending CN113902899A (en) 2021-09-29 2021-09-29 Training method, target detection method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113902899A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612651A (en) * 2022-03-11 2022-06-10 北京百度网讯科技有限公司 ROI detection model training method, detection method, device, equipment and medium
CN114898111A (en) * 2022-04-26 2022-08-12 北京百度网讯科技有限公司 Pre-training model generation method and device, and target detection method and device


Similar Documents

Publication Publication Date Title
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN112861885B (en) Image recognition method, device, electronic equipment and storage medium
CN113222942A (en) Training method of multi-label classification model and method for predicting labels
CN114429633B (en) Text recognition method, training method and device of model, electronic equipment and medium
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN115358392B (en) Training method of deep learning network, text detection method and device
CN113902899A (en) Training method, target detection method, device, electronic device and storage medium
CN112800919A (en) Method, device and equipment for detecting target type video and storage medium
CN113643260A (en) Method, apparatus, device, medium and product for detecting image quality
CN115861400A (en) Target object detection method, training method and device and electronic equipment
CN110633717A (en) Training method and device for target detection model
CN113657249B (en) Training method, prediction method, device, electronic equipment and storage medium
CN114898266A (en) Training method, image processing method, device, electronic device and storage medium
CN114494747A (en) Model training method, image processing method, device, electronic device and medium
CN110895811A (en) Image tampering detection method and device
CN113947701A (en) Training method, object recognition method, device, electronic device and storage medium
CN114724144B (en) Text recognition method, training device, training equipment and training medium for model
CN114926322B (en) Image generation method, device, electronic equipment and storage medium
CN113032251B (en) Method, device and storage medium for determining service quality of application program
CN114782771A (en) Training method, image retrieval method, image processing method, device and equipment
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN114581711A (en) Target object detection method, apparatus, device, storage medium, and program product
CN114093006A (en) Training method, device and equipment of living human face detection model and storage medium
CN113947146A (en) Sample data generation method, model training method, image detection method and device
CN116824609B (en) Document format detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination