CN114187459A

CN114187459A - Training method and device of target detection model, electronic equipment and storage medium

Info

Publication number: CN114187459A
Application number: CN202111307878.6A
Authority: CN
Inventors: 张为明; 张伟; 谭啸; 孙昊
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-11-05
Filing date: 2021-11-05
Publication date: 2022-03-15

Abstract

The disclosure provides a training method and device of a target detection model, electronic equipment and a storage medium, and relates to the technical field of computer vision and deep learning. The specific implementation scheme is as follows: constructing an initial first target detection model, wherein a first trunk network in the first target detection model is obtained by training a positive sample pair and a negative sample pair; training an initial first target detection model by using the sample image and the corresponding sample target to obtain a trained first target detection model; training an initial second target detection model by adopting the sample image, a sample target corresponding to the sample image and the intermediate representation of the sample image output by the first target detection model; the sample image does not need to be marked manually, so that the labor cost is low; and the heavy-weight first target detection model is trained, and then the light-weight second target detection model is distilled, so that the accuracy of the second target detection model is improved.

Description

Training method and device of target detection model, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, in particular to computer vision and deep learning, and more specifically to a training method and apparatus for a target detection model, an electronic device, and a storage medium.

Background

At present, many business target detection tasks have high real-time requirements, so most of them use a lightweight backbone network to construct a target detection model and train that model on labeled training data.

Disclosure of Invention

The disclosure provides a training method and device for a target detection model, an electronic device and a storage medium.

According to an aspect of the present disclosure, there is provided a training method of a target detection model, including: constructing an initial first target detection model, wherein a first trunk network in the first target detection model is obtained by training a positive sample pair and a negative sample pair; the images in the positive sample pair are obtained by performing different image preprocessing on the same image; the images of the negative sample pairs are obtained by respectively carrying out image preprocessing on different images; training the initial first target detection model by adopting a sample image and a corresponding sample target to obtain a trained first target detection model; training an initial second target detection model by using the sample image, a sample target corresponding to the sample image and the intermediate representation of the sample image output by the first target detection model to obtain a trained second target detection model; the network layer number of the second backbone network in the second target detection model is smaller than the network layer number of the first backbone network.

According to another aspect of the present disclosure, there is provided a training apparatus for a target detection model, including: a construction module, configured to construct an initial first target detection model, wherein a first backbone network in the first target detection model is obtained by training with positive sample pairs and negative sample pairs; the images in a positive sample pair are obtained by performing different image preprocessing on the same image; the images in a negative sample pair are obtained by performing image preprocessing on different images respectively; a first training module, configured to train the initial first target detection model by using sample images and the corresponding sample targets to obtain a trained first target detection model; and a second training module, configured to train an initial second target detection model by using the sample image, the sample target corresponding to the sample image and the intermediate representation of the sample image output by the first target detection model to obtain a trained second target detection model; the number of network layers of the second backbone network in the second target detection model is smaller than that of the first backbone network.

According to still another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of training an object detection model set forth above in the present disclosure.

According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the training method of the object detection model set forth above in the present disclosure.

According to yet another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when being executed by a processor, realizes the steps of the training method of the object detection model proposed above in the present disclosure.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;

FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;

FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;

FIG. 4 is a block diagram of an electronic device used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the related art, many business target detection tasks have high real-time requirements, so most of them use a lightweight backbone network to construct a target detection model and train that model on labeled training data.

However, in this approach the lightweight target detection model is trained directly on labeled training data: the labeling cost is high, the model's ability to learn feature representations is limited, and the accuracy of the resulting model is poor.

In order to solve the above problems, the present disclosure provides a training method and apparatus for a target detection model, an electronic device, and a storage medium.

Fig. 1 is a schematic diagram of a first embodiment of the present disclosure, and it should be noted that the method for training the target detection model according to the embodiment of the present disclosure may be applied to a device for training the target detection model, and the device may be configured in an electronic device, so that the electronic device may perform a target detection function.

The electronic device may be any device having a computing capability, for example, a Personal Computer (PC), a mobile terminal, a server, and the like, and the mobile terminal may be a hardware device having various operating systems, touch screens, and/or display screens, such as an in-vehicle device, a mobile phone, a tablet Computer, a Personal digital assistant, and a wearable device.

As shown in fig. 1, the training method of the target detection model may include the following steps:

Step 101, constructing an initial first target detection model, wherein a first backbone network in the first target detection model is obtained by training with positive sample pairs and negative sample pairs; the images in a positive sample pair are obtained by performing different image preprocessing on the same image; the images in a negative sample pair are obtained by performing image preprocessing on different images respectively.

In an embodiment of the disclosure, the image preprocessing comprises at least one of the following: image color transformation, image geometric transformation, and jigsaw (block-shuffling) processing of a plurality of blocks in the image. The image color transformation may, for example, apply at least one of the following operations to the original image: Gaussian noise, Gaussian blur, color distortion, etc. The image geometric transformation may, for example, apply at least one of the following operations to the original image: cropping, rotating, flipping, etc. The jigsaw processing may, for example, divide the image into a plurality of blocks and shuffle the positions of those blocks.
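The three kinds of preprocessing named above can be sketched as follows. This is a minimal illustration only, assuming a grayscale image stored as a list of pixel rows whose height and width are divisible by the grid size; the function names `color_distort`, `horizontal_flip` and `shuffle_tiles` are hypothetical and do not come from the disclosure.

```python
import random


def color_distort(image, scale):
    """Color transformation: scale each pixel intensity and clamp to [0, 255]."""
    return [[min(255, max(0, int(p * scale))) for p in row] for row in image]


def horizontal_flip(image):
    """Geometric transformation: mirror every row of the image."""
    return [row[::-1] for row in image]


def shuffle_tiles(image, grid, rng):
    """Jigsaw processing: cut the image into grid x grid blocks and permute them.

    Assumes the image height and width are divisible by `grid`.
    """
    height, width = len(image), len(image[0])
    th, tw = height // grid, width // grid
    # Collect the blocks in row-major order.
    tiles = [
        [row[c * tw:(c + 1) * tw] for row in image[r * th:(r + 1) * th]]
        for r in range(grid)
        for c in range(grid)
    ]
    rng.shuffle(tiles)
    # Stitch the permuted blocks back into a full image.
    out = []
    for r in range(grid):
        for y in range(th):
            stitched = []
            for c in range(grid):
                stitched.extend(tiles[r * grid + c][y])
            out.append(stitched)
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled = shuffle_tiles(image, 2, random.Random(0))
```

The shuffled image keeps exactly the same pixels as the original; only the block positions change.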

In the embodiment of the present disclosure, the two images in the positive sample pair are obtained by performing different image preprocessing on the same image. For example, one image is obtained by performing color transformation on image A, and the other is obtained by performing geometric transformation on image A. As another example, one image is obtained by performing geometric transformation on image A, and the other is obtained by performing jigsaw processing on image A.

In the embodiment of the present disclosure, the two images in the negative sample pair are obtained by performing image preprocessing on different images respectively. The same image preprocessing may be applied to both images, or a different image preprocessing may be applied to each. For example, the two images in the negative sample pair are obtained by performing color transformation on image A and color transformation on image B. As another example, one image in the negative sample pair is obtained by performing color transformation on image A, and the other is obtained by performing geometric transformation on image B.
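The difference between the two kinds of pair can be made concrete with a short sketch. The helper names `make_positive_pair` and `make_negative_pair`, and the toy preprocessings `flip` and `brighten`, are assumptions introduced here for illustration.

```python
def make_positive_pair(image, preprocess_1, preprocess_2):
    """Two different preprocessings applied to the SAME source image."""
    return preprocess_1(image), preprocess_2(image)


def make_negative_pair(image_a, image_b, preprocess_1, preprocess_2):
    """Preprocessings (identical or different) applied to two DIFFERENT images."""
    return preprocess_1(image_a), preprocess_2(image_b)


def flip(image):
    """Toy geometric transformation: mirror every row."""
    return [row[::-1] for row in image]


def brighten(image):
    """Toy color transformation: add a constant and clamp at 255."""
    return [[min(255, p + 50) for p in row] for row in image]


image_a = [[10, 20], [30, 40]]
image_b = [[200, 210], [220, 230]]

positive = make_positive_pair(image_a, flip, brighten)
negative = make_negative_pair(image_a, image_b, brighten, brighten)
```

Neither helper needs a label: whether a pair is positive or negative follows only from whether its two views share a source image.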

In the embodiment of the present disclosure, the first backbone network may be trained with positive and negative sample pairs as follows, for example. A positive sample pair is input into the first backbone network to obtain the features of its two images; a first similarity between those features is calculated; and the coefficients of the first backbone network are adjusted so as to make the first similarity as large as possible. A negative sample pair is input into the first backbone network to obtain the features of its two images; a second similarity between those features is calculated; and the coefficients of the first backbone network are adjusted so as to make the second similarity as small as possible.

As another example, a positive sample pair and a negative sample pair are each input into the first backbone network to obtain the features of the two images in the positive sample pair and the features of the two images in the negative sample pair; a first similarity between the features of the two images in the positive sample pair and a second similarity between the features of the two images in the negative sample pair are then calculated; and a loss function is constructed from the reciprocal of the difference between the first similarity and the second similarity and used to adjust the coefficients of the first backbone network, so that the first similarity becomes as large as possible and the second similarity as small as possible.
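A loss built from the reciprocal of the similarity difference might look like the sketch below. The disclosure does not name the similarity measure or say how a non-positive difference is handled; cosine similarity, the `offset` of 2 (which keeps the denominator positive because cosine similarities lie in [-1, 1]) and the small `eps` are all assumptions made here. Feature vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(positive_features, negative_features, offset=2.0, eps=1e-6):
    """Reciprocal of (first similarity - second similarity), shifted to stay positive.

    `offset` and `eps` are assumptions: they keep the denominator above zero
    even when the negative pair is more similar than the positive pair.
    """
    first_similarity = cosine_similarity(*positive_features)
    second_similarity = cosine_similarity(*negative_features)
    return 1.0 / (first_similarity - second_similarity + offset + eps)


aligned = ([1.0, 0.0], [1.0, 0.0])     # identical features, similarity 1
orthogonal = ([1.0, 0.0], [0.0, 1.0])  # unrelated features, similarity 0

good = contrastive_loss(aligned, orthogonal)  # similar positives, dissimilar negatives
bad = contrastive_loss(orthogonal, aligned)   # the reverse situation
```

Here `good` is smaller than `bad`, so minimizing the loss pushes the first similarity up and the second similarity down, as the text requires.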

In an embodiment of the present disclosure, the first target detection model may include: a first backbone network and a target detection network. The first backbone network is used for extracting image features from the sample image; the target detection network is used for predicting targets in the sample image based on the image features. The target detection network may be, for example, a Training-Time-Friendly Network (TTFNet).

In the embodiment of the present disclosure, in order to extract more features from the image and improve the accuracy of the trained first target detection model, the first backbone network may be, for example, a 101-layer residual network. A first target detection model built on this residual network achieves high accuracy after training.

In the embodiment of the disclosure, self-supervised contrastive learning is used: the images of positive and negative sample pairs are input, a loss is calculated on the outputs, and the network learns features under which the images of a positive sample pair become increasingly similar and the images of a negative sample pair increasingly dissimilar. This avoids manual labeling, reduces the manual labeling cost, and ensures the accuracy of the trained first target detection model.

Step 102, training the initial first target detection model by using the sample image and the corresponding sample target to obtain a trained first target detection model.

In the embodiment of the present disclosure, the sample image and the corresponding sample object are data in a specific application scenario. Taking an application scene as a vehicle detection scene as an example, the sample image may be a vehicle image, and the sample target may be position information of a vehicle in the vehicle image. Taking an application scene as a road element detection scene as an example, the sample image may be a road image, and the sample target may be position information of a road element in the road image, and the like.

In the embodiment of the present disclosure, the training device for the target detection model may perform step 102 by, for example, determining a loss function of the first target detection model based on the sample target and the predicted target output by the first target detection model; and adjusting, based on the value of the loss function, the coefficients of the first target detection model other than those of the first backbone network, thereby implementing training.

In the embodiment of the present disclosure, when the initial first target detection model is trained by using the sample image and the corresponding sample target, the coefficients of the first trunk network are not adjusted, and only the coefficients of the first target detection model except the coefficients of the first trunk network are adjusted, so that the accuracy of the first trunk network can be ensured, and the accuracy of the trained first target detection model is improved.
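Keeping the backbone coefficients fixed while the rest of the model is updated can be sketched with a toy gradient step. The parameter names, the `backbone.` prefix convention and the plain gradient-descent rule are assumptions for illustration; the disclosure does not specify an optimizer.

```python
def sgd_step(params, grads, lr, frozen_prefix="backbone."):
    """Apply one gradient-descent step to every coefficient that is not frozen.

    Coefficients whose name starts with `frozen_prefix` are returned unchanged,
    which mirrors leaving the first backbone network untouched in step 102.
    """
    updated = {}
    for name, value in params.items():
        if name.startswith(frozen_prefix):
            updated[name] = value                     # frozen: keep pretrained value
        else:
            updated[name] = value - lr * grads[name]  # trainable: descend
    return updated


# Hypothetical coefficients: one in the backbone, one in the detection network.
params = {"backbone.conv1": 1.0, "head.fc": 1.0}
grads = {"backbone.conv1": 2.0, "head.fc": 2.0}

updated = sgd_step(params, grads, lr=0.5)
```

Only `head.fc` moves; `backbone.conv1` keeps the value it obtained from contrastive pre-training.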

Step 103, training an initial second target detection model by using the sample image, the sample target corresponding to the sample image and the intermediate representation of the sample image output by the first target detection model to obtain a trained second target detection model; the number of network layers of the second backbone network in the second target detection model is smaller than that of the first backbone network.

In the embodiment of the present disclosure, the intermediate representation of the sample image output by the first target detection model may be the representation output by any network layer of the target detection network in the first target detection model after the sample image is input into the first target detection model. For example, it may be the representation output by the network layer that precedes the fully connected layer in the target detection network.

In the disclosed embodiment, the second backbone network may be, for example, a 34-layer residual network.

In the embodiment of the present disclosure, taking an application scenario as an example of vehicle detection, the trained second target detection model may be used for detecting vehicle position information, that is, inputting a vehicle image to be detected into the trained second target detection model, and acquiring the position information of the vehicle output by the second target detection model. Taking an application scene as road element detection as an example, the trained second target detection model may be used for detecting position information of a road element, that is, a road image to be detected is input into the trained second target detection model, and the position information of the road element output by the second target detection model is obtained.

The training method of the target detection model of the embodiment of the disclosure constructs an initial first target detection model, wherein a first trunk network in the first target detection model is obtained by training a positive sample pair and a negative sample pair; the images in the positive sample pair are obtained by performing different image preprocessing on the same image; the images of the negative sample pairs are obtained by respectively carrying out image preprocessing on different images; training an initial first target detection model by using the sample image and the corresponding sample target to obtain a trained first target detection model; training an initial second target detection model by adopting the sample image, the sample target corresponding to the sample image and the intermediate representation of the sample image output by the first target detection model to obtain a trained second target detection model; the sample image does not need to be marked manually, so that the labor cost is low; and the heavy-weight first target detection model is trained, and then the light-weight second target detection model is distilled, so that the accuracy of the second target detection model is improved.

To obtain the trained second target detection model and ensure its accuracy, a plurality of sub-loss functions may be constructed from the sample image, the sample target corresponding to the sample image, and the intermediate representation of the sample image output by the first target detection model; a total loss function is determined from them, and the coefficients are adjusted according to the value of the total loss function to implement training. Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in Fig. 2, the embodiment may include the following steps:

Step 201, constructing an initial first target detection model, wherein a first backbone network in the first target detection model is obtained by training with positive sample pairs and negative sample pairs; the images in a positive sample pair are obtained by performing different image preprocessing on the same image; the images in a negative sample pair are obtained by performing image preprocessing on different images respectively.

Step 202, training the initial first target detection model by using the sample image and the corresponding sample target to obtain a trained first target detection model.

Step 203, constructing a first sub-loss function based on the sample target corresponding to the sample image and the prediction target output by the second target detection model.

In the embodiment of the present disclosure, the training device of the target detection model may input the sample image into the second target detection model, and obtain the predicted target output by the second target detection model; and determining the similarity between a sample target corresponding to the sample image and the prediction target, and constructing a first sub-loss function according to the reciprocal of the similarity, so that the greater the similarity between the sample target and the prediction target, the smaller the first sub-loss function.
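As one possible reading of the first sub-loss, the sketch below measures similarity between the sample target and the predicted target with intersection over union (IoU) and takes its reciprocal. The disclosure does not name a similarity measure; IoU, the `(x1, y1, x2, y2)` box format, the `eps` term and the function names are assumptions. Boxes are assumed to have positive area.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


def first_sub_loss(sample_box, predicted_box, eps=1e-6):
    """Reciprocal of the similarity; eps (an assumption) avoids division by zero."""
    return 1.0 / (iou(sample_box, predicted_box) + eps)


sample = (0.0, 0.0, 2.0, 2.0)
exact = first_sub_loss(sample, (0.0, 0.0, 2.0, 2.0))    # perfect prediction
partial = first_sub_loss(sample, (1.0, 0.0, 3.0, 2.0))  # half-overlapping prediction
missed = first_sub_loss(sample, (5.0, 5.0, 6.0, 6.0))   # no overlap at all
```

The loss grows as the prediction drifts away from the sample target, so `exact < partial < missed`.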

Step 204, constructing a second sub-loss function based on the intermediate representation of the sample image and the predicted intermediate representation output by the second target detection model.

In the embodiment of the present disclosure, the training apparatus for the target detection model may input the sample image into the first target detection model and the second target detection model, respectively, and obtain the intermediate representation output by the first target detection model and the predicted intermediate representation output by the second target detection model; determining the similarity between the intermediate representation and the predicted intermediate representation, and constructing a second sub-loss function according to the reciprocal of the similarity, so that the greater the similarity between the intermediate representation and the predicted intermediate representation, the smaller the second sub-loss function.
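For the second sub-loss, one simple choice is to derive the similarity from the mean squared distance between the two representations. That choice, the function names, and the requirement that both representations be flat vectors of equal length are assumptions; in practice the student's representation may first need to be projected to the teacher's shape.

```python
def feature_similarity(teacher_repr, student_repr):
    """Similarity in (0, 1]: equals 1 when the two representations coincide."""
    squared = [(t - s) ** 2 for t, s in zip(teacher_repr, student_repr)]
    mean_squared_distance = sum(squared) / len(squared)
    return 1.0 / (1.0 + mean_squared_distance)


def second_sub_loss(teacher_repr, student_repr):
    """Reciprocal of the similarity between the two intermediate representations."""
    return 1.0 / feature_similarity(teacher_repr, student_repr)


matched = second_sub_loss([1.0, 2.0], [1.0, 2.0])     # student reproduces teacher
mismatched = second_sub_loss([0.0, 0.0], [1.0, 1.0])  # student is far from teacher
```

With this particular similarity the reciprocal reduces to one plus the mean squared distance, so the sub-loss behaves like a familiar squared-error feature-matching term.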

Step 205, determining a total loss function according to the first sub-loss function and the second sub-loss function.

In the embodiment of the present disclosure, the process of determining the total loss function by the training apparatus of the target detection model may be, for example, determining a weight of the first sub-loss function and a weight of the second sub-loss function; and adding the first sub-loss function and the second sub-loss function according to the corresponding weights to obtain a total loss function.
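The weighted combination can be written in a single line. The default weights below are placeholders chosen for illustration; the disclosure does not give concrete values.

```python
def total_loss(first_sub_loss, second_sub_loss, w_first=1.0, w_second=0.5):
    """Weighted sum of the detection sub-loss and the distillation sub-loss.

    The default weights are illustrative assumptions, not values from the text.
    """
    return w_first * first_sub_loss + w_second * second_sub_loss


default_mix = total_loss(2.0, 4.0)
custom_mix = total_loss(2.0, 4.0, w_first=0.5, w_second=0.25)
```

Raising `w_second` makes the second model follow the first model's intermediate representation more closely, while raising `w_first` gives more weight to the sample targets.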

Step 206, adjusting the coefficients of the second target detection model according to the value of the total loss function to implement training.

In summary, an initial first target detection model is constructed, wherein a first backbone network in the first target detection model is obtained by training with positive sample pairs and negative sample pairs; the images in a positive sample pair are obtained by performing different image preprocessing on the same image; the images in a negative sample pair are obtained by performing image preprocessing on different images respectively; the initial first target detection model is trained by using the sample images and the corresponding sample targets to obtain a trained first target detection model; a first sub-loss function is constructed based on the sample target corresponding to the sample image and the predicted target output by the second target detection model; a second sub-loss function is constructed based on the intermediate representation of the sample image and the predicted intermediate representation output by the second target detection model; a total loss function is determined according to the first sub-loss function and the second sub-loss function; and the coefficients of the second target detection model are adjusted according to the value of the total loss function to implement training. The sample image does not need manual labeling, so the labor cost is low; and because the heavyweight first target detection model is trained first and the lightweight second target detection model is then distilled from it, the accuracy of the second target detection model is improved.

In order to more clearly illustrate the above embodiments, an example will now be given.

For example, taking vehicle detection as the application scenario, a 101-layer residual network (resnet101) is first used as the first backbone network, and self-supervised contrastive learning of the first backbone network is completed on images of arbitrary scenes; the self-supervised contrastive learning framework may be, for example, MoCo v2. A pre-trained first target detection model is then constructed from the trained resnet101 network and a target detection network, and retraining of the first target detection model is completed on the sample images of the vehicle detection scenario and the corresponding sample targets. The target detection network may be a Training-Time-Friendly Network (TTFNet). Finally, the lightweight second target detection model is distilled from the first target detection model to obtain a trained second target detection model, which is used to perform target detection on images to be detected in the vehicle detection scenario.

In the embodiment of the present disclosure, the first target detection model serves as a teacher model that assists the training of the second target detection model, which serves as a student model. The teacher model has many coefficients, a complex structure and a strong learning ability; migrating the knowledge learned by the teacher model to the student model, whose learning ability is relatively weak, can enhance the generalization ability of the student model. The second backbone network in the second target detection model may be a 34-layer residual network (resnet34); the first target detection model plays a guiding role, and the second target detection model performs the actual target prediction task.

In the disclosed embodiment, the input of the second object detection model is the image of the vehicle to be detected, and the corresponding output is the position information of the vehicle.

In order to implement the above embodiments, the present disclosure further provides a training apparatus for a target detection model.

Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in Fig. 3, the training apparatus 300 for the target detection model includes: a construction module 310, a first training module 320, and a second training module 330.

The construction module 310 is configured to construct an initial first target detection model, where a first backbone network in the first target detection model is obtained by training with positive sample pairs and negative sample pairs; the images in a positive sample pair are obtained by performing different image preprocessing on the same image; the images in a negative sample pair are obtained by performing image preprocessing on different images respectively. The first training module 320 is configured to train the initial first target detection model by using the sample image and the corresponding sample target to obtain a trained first target detection model. The second training module 330 is configured to train an initial second target detection model by using the sample image, the sample target corresponding to the sample image, and the intermediate representation of the sample image output by the first target detection model, so as to obtain a trained second target detection model; the number of network layers of the second backbone network in the second target detection model is smaller than that of the first backbone network.

As a possible implementation manner of the embodiment of the present disclosure, the first backbone network and the second backbone network are residual networks.

As a possible implementation manner of the embodiment of the present disclosure, the first training module 320 is specifically configured to determine a loss function of the first target detection model based on the sample target and a predicted target output by the first target detection model; and adjusting coefficients except for the coefficient of the first trunk network in the first target detection model based on the value of the loss function to realize training.

As a possible implementation manner of the embodiment of the present disclosure, the second training module 330 is specifically configured to construct a first sub-loss function based on a sample target corresponding to the sample image and a prediction target output by the second target detection model; constructing a second sub-loss function based on the intermediate representation of the sample image and the predicted intermediate representation output by the second target detection model; determining a total loss function according to the first sub-loss function and the second sub-loss function; and adjusting the coefficient of the second target detection model according to the value of the total loss function to realize training.

As a possible implementation manner of the embodiment of the present disclosure, the image preprocessing includes at least one of the following: image color transformation, image geometric transformation, and jigsaw (block-shuffling) processing of a plurality of blocks in the image.

With the training apparatus for the target detection model of the embodiment of the disclosure, an initial first target detection model is constructed, wherein a first backbone network in the first target detection model is obtained by training with positive sample pairs and negative sample pairs; the images in a positive sample pair are obtained by performing different image preprocessing on the same image; the images in a negative sample pair are obtained by performing image preprocessing on different images respectively; the initial first target detection model is trained by using the sample image and the corresponding sample target to obtain a trained first target detection model; and an initial second target detection model is trained by using the sample image, the sample target corresponding to the sample image and the intermediate representation of the sample image output by the first target detection model to obtain a trained second target detection model. The sample image does not need manual labeling, so the labor cost is low; and because the heavyweight first target detection model is trained first and the lightweight second target detection model is then distilled from it, the accuracy of the second target detection model is improved.

In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information are all carried out with the users' consent, comply with the relevant laws and regulations, and do not violate public order and good customs.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 4 shows a schematic block diagram of an example electronic device 400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 4, the electronic device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. The RAM 403 can also store various programs and data required for the operation of the electronic device 400. The computing unit 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.

A number of components in the electronic device 400 are connected to the I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, or the like; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408 such as a magnetic disk, optical disk, or the like; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the electronic device 400 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 401 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 401 executes the methods and processes described above, such as the training method of the target detection model. For example, in some embodiments, the training method of the target detection model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the training method of the target detection model described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the target detection model.

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SoCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that the various forms of flows shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, and no limitation is imposed herein as long as the desired results of the technical solutions of the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method of training a target detection model, comprising:

constructing an initial first target detection model, wherein a first backbone network in the first target detection model is obtained by training with positive sample pairs and negative sample pairs; the images in a positive sample pair are obtained by performing different image preprocessing on the same image; the images in a negative sample pair are obtained by respectively performing image preprocessing on different images;

training the initial first target detection model using a sample image and a corresponding sample target to obtain a trained first target detection model;

training an initial second target detection model using the sample image, the sample target corresponding to the sample image, and an intermediate representation of the sample image output by the first target detection model to obtain a trained second target detection model; wherein the number of network layers of a second backbone network in the second target detection model is smaller than the number of network layers of the first backbone network.

2. The method of claim 1, wherein the first and second backbone networks are residual networks.

3. The method of claim 1, wherein the training the initial first target detection model using the sample images and the corresponding sample targets to obtain a trained first target detection model comprises:

determining a loss function of the first target detection model based on the sample target and a predicted target output by the first target detection model;

and adjusting, based on the value of the loss function, the coefficients of the first target detection model other than the coefficients of the first backbone network, so as to carry out the training.
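The coefficient adjustment recited in claim 3 can be pictured as a gradient step that skips the backbone. The following is a minimal sketch under stated assumptions: plain gradient descent is assumed as the update rule, parameters are held in a dictionary, and the `backbone.` name prefix and the function `sgd_step` are hypothetical names introduced only for illustration.

```python
def sgd_step(params, grads, lr=0.01, frozen_prefix="backbone."):
    """Return updated parameters, leaving the backbone coefficients unchanged.

    params / grads: dicts mapping parameter names to scalar values and
    to the gradient of the loss with respect to each value.
    frozen_prefix: hypothetical naming convention marking backbone
    coefficients, which are kept fixed during this training stage.
    """
    updated = {}
    for name, value in params.items():
        if name.startswith(frozen_prefix):
            # Backbone coefficient: already trained on sample pairs, not adjusted.
            updated[name] = value
        else:
            # Any other coefficient: adjusted based on the loss gradient.
            updated[name] = value - lr * grads[name]
    return updated
```

For example, calling `sgd_step({"backbone.w": 1.0, "head.w": 2.0}, {"backbone.w": 0.5, "head.w": 0.5}, lr=0.1)` leaves `backbone.w` at its original value and moves only `head.w`.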

4. The method of claim 1, wherein the training an initial second target detection model using the sample image, a sample target corresponding to the sample image, and an intermediate representation of the sample image output by the first target detection model to obtain a trained second target detection model comprises:

constructing a first sub-loss function based on the sample target corresponding to the sample image and a predicted target output by the second target detection model;

constructing a second sub-loss function based on the intermediate representation of the sample image and a predicted intermediate representation output by the second target detection model;

determining a total loss function according to the first sub-loss function and the second sub-loss function;

and adjusting the coefficients of the second target detection model according to the value of the total loss function, so as to carry out the training.
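The combination of the two sub-loss functions in claim 4 can be sketched as follows. The claim only states that the total loss is determined from both sub-losses; the weighted sum, the mean squared error used for the second sub-loss, the `weight` parameter, and the function names are assumptions made for this example. The sketch also assumes that the two intermediate representations have the same length.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(detection_loss, teacher_feat, student_feat, weight=1.0):
    """Combine the two sub-losses into one value (assumed weighted sum).

    detection_loss: first sub-loss, computed elsewhere from the sample
    target and the target predicted by the second model.
    teacher_feat: intermediate representation output by the first model.
    student_feat: predicted intermediate representation of the second model.
    weight: hypothetical balancing factor for the second sub-loss.
    """
    distill_loss = mse(teacher_feat, student_feat)  # second sub-loss
    return detection_loss + weight * distill_loss
```

With identical representations the second sub-loss vanishes and the total equals the detection loss; any mismatch between the two models' representations raises the total in proportion to `weight`.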

5. The method of claim 1, wherein the image pre-processing comprises at least one of: image color transformation, image geometric transformation and picture mosaic processing of a plurality of blocks in the image.
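The three kinds of image preprocessing listed in claim 5 can be sketched on a small grayscale image stored as a list of rows. The concrete operations chosen here (brightness scaling, horizontal flip, and jigsaw-style shuffling of square blocks as one reading of the mosaic processing of blocks), the parameter ranges, and all function names are assumptions for illustration, and the block shuffle assumes that the image height and width are multiples of the block size.

```python
import random


def color_jitter(img, scale):
    """Color transformation: scale every pixel and clamp it to [0, 255]."""
    return [[min(255, max(0, int(p * scale))) for p in row] for row in img]


def hflip(img):
    """Geometric transformation: mirror each row horizontally."""
    return [row[::-1] for row in img]


def block_shuffle(img, block, rng):
    """Block processing: cut the image into block x block tiles and shuffle them."""
    h, w = len(img), len(img[0])
    tiles = [[r[x:x + block] for r in img[y:y + block]]
             for y in range(0, h, block) for x in range(0, w, block)]
    rng.shuffle(tiles)
    per_row = w // block
    out = []
    for i in range(0, len(tiles), per_row):
        row_tiles = tiles[i:i + per_row]
        for k in range(block):
            # Stitch the k-th line of each tile in this tile row together.
            out.append([p for t in row_tiles for p in t[k]])
    return out


def make_positive_pair(img, rng):
    """Two differently preprocessed views of the same image."""
    view_a = color_jitter(img, rng.uniform(0.6, 1.4))
    view_b = block_shuffle(hflip(img), 2, rng)
    return view_a, view_b
```

Applying `make_positive_pair` to one image yields a positive sample pair; preprocessed views taken from two different images would form a negative sample pair.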

6. A training apparatus for a target detection model, comprising:

a construction module, configured to construct an initial first target detection model, wherein a first backbone network in the first target detection model is obtained by training with positive sample pairs and negative sample pairs; the images in a positive sample pair are obtained by performing different image preprocessing on the same image; the images in a negative sample pair are obtained by respectively performing image preprocessing on different images;

a first training module, configured to train the initial first target detection model using a sample image and a corresponding sample target to obtain a trained first target detection model;

a second training module, configured to train an initial second target detection model using the sample image, the sample target corresponding to the sample image, and an intermediate representation of the sample image output by the first target detection model to obtain a trained second target detection model; wherein the number of network layers of a second backbone network in the second target detection model is smaller than the number of network layers of the first backbone network.

7. The apparatus of claim 6, wherein the first and second backbone networks are residual networks.

8. The apparatus of claim 6, wherein the first training module is specifically configured to,

9. The apparatus of claim 6, wherein the second training module is specifically configured to,

10. The apparatus of claim 6, wherein the image pre-processing comprises at least one of: image color transformation, image geometric transformation and picture mosaic processing of a plurality of blocks in the image.

11. An electronic device, comprising:

at least one processor; and

a memory, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.

12. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5.

13. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1-5.