CN115841605A - Target detection network training and target detection method, electronic device and storage medium


Info

Publication number: CN115841605A
Application number: CN202210989314.3A
Authority: CN (China)
Prior art keywords: network, target, image, target detection, class
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李志远 (Li Zhiyuan), 庄月清 (Zhuang Yueqing), 李伯勋 (Li Boxun)
Current Assignee: Beijing Megvii Technology Co Ltd
Original Assignee: Beijing Megvii Technology Co Ltd
Application filed by Beijing Megvii Technology Co Ltd

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a target detection network training method, a target detection method, an electronic device, a storage medium and a computer program product. The training method comprises the following steps: acquiring a first training image and first annotation data; inputting the first training image into a new class backbone network of a new class target detection network and into a first teacher backbone network, respectively, to obtain a first image feature and a first teacher image feature, wherein the first teacher backbone network is the base class backbone network of a trained base class target detection network; inputting the first image feature into a new class subsequent network structure of the new class target detection network to obtain a first target prediction result; calculating a first regularization loss term based on the first image feature and the first teacher image feature; calculating a first prediction loss term based on the first target prediction result and the first annotation data; and optimizing the parameters of the new class target detection network based on the loss terms, with the parameters of the first teacher backbone network kept unchanged. A network with good performance can thus be obtained even when trained on a small number of samples.

Description

Target detection network training and target detection method, electronic device and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular to a target detection network training method and a target detection method, together with the corresponding electronic devices, storage media, and computer program products.
Background
As a basic research direction in the field of computer vision, target detection has been widely applied in real life, for example in the fields of face recognition, automatic driving, security control, and the like.
However, the performance of existing target detectors can only be guaranteed by relying on huge amounts of manually labeled data. In data-scarce scenarios, a target detector tends to overfit the limited data and therefore cannot perform well.
Therefore, how to train a target detector from small samples is a very important problem in the field of computer vision.
Disclosure of Invention
The present application has been made in view of the above problems. The application provides a target detection network training method, an electronic device, a storage medium and a computer program product, and a target detection method, an electronic device, a storage medium and a computer program product.
According to an aspect of the present application, a target detection network training method is provided, including: acquiring a first training image and corresponding first annotation data, wherein the first annotation data is used for indicating the position and category of a target contained in the first training image; inputting the first training image into a new class backbone network of a new class target detection network and into a first teacher backbone network, respectively, to obtain a first image feature output by the new class backbone network and a first teacher image feature output by the first teacher backbone network, wherein the first teacher backbone network is the base class backbone network of a trained base class target detection network; inputting the first image feature into a new class subsequent network structure of the new class target detection network to obtain a first target prediction result output by the new class subsequent network structure, wherein the first target prediction result comprises position information and category information of the target in the first training image; calculating a first regularization loss term based on the first image feature and the first teacher image feature; calculating a first prediction loss term based on the first target prediction result and the first annotation data; and optimizing the parameters of the new class target detection network based on the first regularization loss term and the first prediction loss term to obtain a trained new class target detection network, wherein the parameters of the first teacher backbone network are kept unchanged while the parameters of the new class target detection network are being optimized.
Illustratively, the method further comprises: acquiring a second training image and corresponding second annotation data, wherein the second annotation data is used for indicating the position and category of a target contained in the second training image; inputting the second training image into a base class backbone network of the base class target detection network and into a second teacher backbone network, respectively, to obtain a second image feature output by the base class backbone network and a second teacher image feature output by the second teacher backbone network, wherein the second teacher backbone network is the backbone network of a pre-trained neural network model; inputting the second image feature into a base class subsequent network structure of the base class target detection network to obtain a second target prediction result output by the base class subsequent network structure, wherein the second target prediction result comprises position information and category information of the target in the second training image; calculating a second regularization loss term based on the second image feature and the second teacher image feature; calculating a second prediction loss term based on the second target prediction result and the second annotation data; and optimizing the parameters of the base class target detection network based on the second regularization loss term and the second prediction loss term to obtain a trained base class target detection network, wherein the parameters of the second teacher backbone network are kept unchanged while the parameters of the base class target detection network are being optimized.
Illustratively, before the second training image is input into the base class backbone network of the base class target detection network and the second teacher backbone network, respectively, the method further comprises: randomly initializing at least part of the parameters of the base class subsequent network structure; and initializing the parameters of the base class backbone network to the parameters of the second teacher backbone network.
Exemplarily, the base class subsequent network structure comprises a base class region-of-interest network and a base class frame detection head which are sequentially connected, the pre-trained neural network model comprises a backbone network and a target region-of-interest network which are sequentially connected, each of the base class region-of-interest network and the target region-of-interest network is used for outputting region features of a region of interest, and the base class frame detection head is used for outputting corresponding position information and category information. Randomly initializing at least part of the parameters of the base class subsequent network structure comprises the following steps: randomly initializing the parameters of the base class frame detection head; and initializing the parameters of the base class region-of-interest network to the parameters of the target region-of-interest network.
Illustratively, before the first training image is input into the new class backbone network of the new class target detection network and the first teacher backbone network, respectively, the method further comprises: randomly initializing at least part of the parameters of the new class subsequent network structure; and initializing the parameters of the new class backbone network to the parameters of the first teacher backbone network.
Exemplarily, the new class subsequent network structure comprises a new class region-of-interest network and a new class frame detection head which are sequentially connected, the trained base class target detection network comprises a base class backbone network, a base class region-of-interest network and a base class frame detection head which are sequentially connected, each of the new class region-of-interest network and the base class region-of-interest network is used for outputting region features of a region of interest, and each of the new class frame detection head and the base class frame detection head is used for outputting corresponding position information and category information. Randomly initializing at least part of the parameters of the new class subsequent network structure comprises: randomly initializing the parameters of the new class frame detection head; and initializing the parameters of the new class region-of-interest network to the parameters of the base class region-of-interest network.
According to another aspect of the present application, a target detection method is provided, including: acquiring a target image; and inputting the target image into the trained new class target detection network to obtain a target image prediction result output by the trained new class target detection network, wherein the target image prediction result comprises position information and category information of the target in the target image.
According to another aspect of the present application, an electronic device is provided, comprising a processor and a memory, wherein the memory stores computer program instructions which, when executed by the processor, perform the above target detection network training method.
According to another aspect of the present application, an electronic device is provided, comprising a processor and a memory, wherein the memory stores computer program instructions which, when executed by the processor, perform the above target detection method.
According to another aspect of the present application, a storage medium is provided, on which program instructions are stored, the program instructions being used, when executed, to perform the above target detection network training method.
According to another aspect of the present application, a storage medium is provided, on which program instructions are stored, the program instructions being used, when executed, to perform the above target detection method.
According to another aspect of the application, a computer program product is provided, comprising a computer program which, when run, performs the above target detection network training method.
According to another aspect of the present application, a computer program product is provided, comprising a computer program which, when run, performs the above target detection method.
According to the target detection network training method, the target detection method, and the corresponding electronic devices, storage media, and computer program products of the embodiments of the application, the parameters of the pre-trained base class backbone network can be used to assist in training the backbone network of the current new class target detection network. Because the new class target detection network is trained with the base class knowledge regularization method, its generalization performance can be well guaranteed even when it is trained on limited samples. Therefore, the algorithm can greatly improve the performance of a new class target detection network trained on small samples.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 illustrates a schematic block diagram of an example electronic device for implementing a target detection network training or target detection method and apparatus according to embodiments of the present application;
FIG. 2 shows a schematic flow diagram of a target detection network training method according to one embodiment of the present application;
FIG. 3 illustrates a schematic diagram of target detection network training for the Faster RCNN network;
FIG. 4 shows a schematic flow diagram of a target detection method according to an embodiment of the present application;
FIG. 5 shows a schematic block diagram of a target detection network training apparatus according to one embodiment of the present application;
FIG. 6 shows a schematic block diagram of a target detection apparatus according to an embodiment of the present application;
FIG. 7 shows a schematic block diagram of an electronic device according to one embodiment of the present application; and
FIG. 8 shows a schematic block diagram of an electronic device according to one embodiment of the present application.
Detailed Description
In recent years, technical research based on artificial intelligence, such as computer vision, deep learning, machine learning, image processing, and image recognition, has developed rapidly. Artificial intelligence (AI) is an emerging science and technology that studies and develops theories, methods, techniques and application systems for simulating and extending human intelligence. Artificial intelligence is a comprehensive discipline involving many technical categories, such as chips, big data, cloud computing, the internet of things, distributed storage, deep learning, machine learning and neural networks. Computer vision is an important branch of artificial intelligence in which machines are used to perceive and identify the world; computer vision technologies generally include face recognition, liveness detection, fingerprint recognition and anti-counterfeiting verification, biometric recognition, face detection, pedestrian detection, target detection, pedestrian recognition, image processing, image recognition, image semantic understanding, image retrieval, character recognition, video processing, video content recognition, three-dimensional reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, robot navigation and positioning, and the like. With the research and progress of artificial intelligence technology, the technology has been applied in many fields, such as security control, city management, traffic management, building management, park management, face-based access, face-based attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile phone imaging, cloud services, smart homes, wearable devices, unmanned driving, automatic driving, intelligent medical treatment, face-based payment, face-based unlocking, fingerprint unlocking, identity verification, smart screens, smart televisions, cameras, the mobile internet, live webcasts, beauty applications, medical cosmetology, intelligent temperature measurement, and the like.
In order to make the objects, technical solutions and advantages of the present application more apparent, exemplary embodiments according to the present application will be described in detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the application described in the application without inventive step, shall fall within the scope of protection of the application.
In order to at least partially solve the technical problem, embodiments of the present application provide an object detection network training method, an electronic device, a storage medium, and a computer program product, and an object detection method, an electronic device, a storage medium, and a computer program product. According to the target detection network training method, the pre-trained parameters of the backbone network can be used for assisting in training the backbone network of the current target detection network. The algorithm can greatly improve the performance of the target detection network based on small sample training. The target detection network training method and the target detection method according to the embodiment of the application can be applied to any field needing target detection, including but not limited to the fields of face recognition, fingerprint recognition, character recognition, identity authentication, automatic driving, safety prevention and control and the like.
First, an example electronic device 100 for implementing a target detection network training or target detection method and apparatus according to an embodiment of the present application is described with reference to fig. 1.
As shown in FIG. 1, electronic device 100 includes one or more processors 102, one or more memory devices 104. Optionally, the electronic device 100 may also include an input device 106, an output device 108, and an image capture device 110, which may be interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 102 may be implemented in hardware as at least one of a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic array (PLA), or a microprocessor. The processor 102 may be one of, or a combination of several of, a central processing unit (CPU), a graphics processing unit (GPU), an application specific integrated circuit (ASIC), or other forms of processing units having data processing capability and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the client functionality (implemented by the processor) of the embodiments of the application described below and/or other desired functionality. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images and/or sounds) to an external (e.g., user), and may include one or more of a display, a speaker, etc. Alternatively, the input device 106 and the output device 108 may be integrated together, implemented using the same interactive device (e.g., a touch screen).
The image capture device 110 may capture images and store the captured images in the storage device 104 for use by other components. The image photographing device 110 may be a separate camera or a camera in a mobile terminal, etc. It should be understood that the image capture device 110 is merely an example, and the electronic device 100 may not include the image capture device 110. In this case, other devices having image capturing capabilities may be used to capture an image and transmit the captured image to the electronic device 100.
For example, an example electronic device for implementing the target detection network training or target detection method and apparatus according to the embodiments of the present application may be implemented on a device such as a personal computer or a remote server.
Next, a target detection network training method according to an embodiment of the present application will be described with reference to fig. 2. FIG. 2 shows a schematic flow diagram of a target detection network training method 200 according to one embodiment of the present application. As shown in fig. 2, the target detection network training method 200 includes steps S210, S220, S230, S240, S250, and S260.
In step S210, a first training image and corresponding first annotation data are obtained, where the first annotation data is used to indicate a position and a category of a target included in the first training image.
The training images described herein may be any images, and any source and form of images are intended to fall within the scope of the present application. Optionally, the training image may be a static image, or may be any video frame in a dynamic video. The training image may be an original image captured by an image capturing device (e.g., a separate camera or a camera of a mobile terminal, etc.), or may be an image obtained after preprocessing (such as digitizing, normalizing, smoothing, etc.) the original image. Note that the preprocessing of the original image may include an operation of extracting a sub-image including the target from the original image acquired by the image acquisition device to obtain a training image. The target described herein may be any object including, but not limited to: a person or a part of a human body (such as a human face), an animal, a vehicle, a building, a character, and the like.
The number of first training images obtained in step S210 may be one or more. The one or more first training images may constitute a first data set, i.e., the new class data set. The new class data set contains a relatively small amount of data; that is, there is relatively little annotation information for the new classes.
In step S220, the first training image is respectively input into the new class backbone network of the new class target detection network and the first teacher backbone network to obtain a first image feature output by the new class backbone network and a first teacher image feature output by the first teacher backbone network, where the first teacher backbone network is a base class backbone network of the trained base class target detection network.
The backbone network of a target detection network is mainly used for feature extraction; the extracted features can be processed by a subsequent network structure to obtain the position information of where a target is located and the category information of the category to which the target belongs.
For example, and without limitation, the various backbone networks described herein, including the new class backbone network, the first teacher backbone network, and the second teacher backbone network, may be implemented using similar network structures, for example using the same number of network layers, the same convolution kernel sizes, etc.; after training, however, these backbone networks may have different parameter values (including the weights and biases of neurons, etc.).
The backbone network may include one or more of: convolutional layers, pooling layers, activation function layers, and the like. By way of example and not limitation, the subsequent network structure may include a region candidate network, a region-of-interest head, a frame detection head, and the like. The region candidate network is used for detecting a plurality of candidate regions where targets may be located. The region-of-interest head is used for extracting region features of a region of interest based on the image features output by the backbone network and the candidate regions output by the region candidate network. The frame detection head is used for obtaining position information and category information of the target based on the region features of the region of interest.
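By way of reference, the following minimal sketch uses torchvision's public Faster R-CNN implementation, an analogous open-source detector rather than the network of this application, to show the same three-part decomposition (torchvision >= 0.13 is assumed for the weights argument):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights=None)  # untrained, for structure only
backbone = model.backbone      # backbone network: feature extraction
rpn = model.rpn                # region candidate network: candidate regions
roi_heads = model.roi_heads    # RoI head + box predictor (frame detection head)

# The backbone alone maps an image batch to multi-scale feature maps.
image = torch.randn(1, 3, 512, 512)
features = backbone(image)     # an OrderedDict of FPN feature maps
```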
In step S220, the first teacher backbone network is the base class backbone network of the trained base class target detection network, and has a structure similar to that of the new class backbone network. Optionally, the initial parameters of the new class backbone network may be the same as the parameters of the first teacher backbone network. The first teacher backbone network is a pre-trained backbone network; for example, it may be a backbone network pre-trained on a data set with a larger number of images (which may be referred to as the second data set). The second data set may also be referred to as the base class data set. The number of first training images contained in the first data set may be smaller than the number of images in the second data set. In addition, optionally, the images in the first data set may be partially identical to, or completely different from, the images in the second data set. Optionally, the first data set may also be a subset of the second data set.
In step S230, the first image feature is input into a new class subsequent network structure of the new class target detection network to obtain a first target prediction result output by the new class subsequent network structure, where the first target prediction result includes location information and category information of a target in the first training image.
As described above, subsequent network structures in the object detection network may output location information and category information of the object. Those skilled in the art can understand the form and meaning of the position information and the category information output by the target detection algorithm, which are not described herein in detail. Illustratively, the category information may be a probability distribution of a category to which the target belongs.
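For instance, applying a softmax to raw class scores yields such a probability distribution (a generic illustration, not code from this application):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, -1.0])  # raw class scores for 3 categories
probs = F.softmax(logits, dim=0)         # ≈ tensor([0.786, 0.175, 0.039]), sums to 1
```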
In step S240, a first regularization loss term is calculated based on the first image feature and the first teacher image feature.
The first regularization loss term may be calculated based on the following equation:

$$L_{reg} = \left\| F_B(I) - \widetilde{F}_B(I) \right\|_2^2$$

where $\widetilde{F}_B$ denotes the first teacher backbone network, $F_B$ denotes the new class backbone network, $I$ denotes the first training image, $\widetilde{F}_B(I)$ denotes the first teacher image feature, and $F_B(I)$ denotes the first image feature.
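A minimal PyTorch sketch of such a feature regularization term is shown below; the mean-squared form and the function name are assumptions for illustration, since only a distance between the two feature maps is specified:

```python
import torch

def feature_regularization_loss(student_feat: torch.Tensor,
                                teacher_feat: torch.Tensor) -> torch.Tensor:
    # The teacher feature is detached: the teacher backbone stays frozen and
    # receives no gradients; only the new class backbone is pushed toward it.
    return torch.mean((student_feat - teacher_feat.detach()) ** 2)
```

Detaching the teacher feature corresponds to keeping the first teacher backbone parameters unchanged, as required in step S260.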
In step S250, a first predicted loss term is calculated based on the first target prediction result and the first annotation data.
The first prediction loss term may be calculated by substituting the first target prediction result and the first annotation data into a preset first target loss function. The first target loss function may be any suitable loss function, which is not limited in this application. By way of example and not limitation, the first prediction loss term can be decomposed into a classification loss and a regression loss. The classification loss measures the difference between the category information in the first target prediction result and the category information in the first annotation data, and the regression loss measures the difference between the position information in the first target prediction result and the position information in the first annotation data. It is to be understood that the position information may be represented by the coordinate information of a bounding box, and the regression loss may then measure the difference between the bounding box in the first target prediction result and the bounding box in the first annotation data. By way of example and not limitation, the classification loss can be calculated using a cross-entropy loss function, and the regression loss can be calculated using a mean absolute error (L1) loss function or a mean squared error (L2) loss function, among others.
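The following hedged sketch illustrates such a composite prediction loss (the function name and tensor shapes are assumed; a full Faster RCNN loss also involves proposal matching and sampling, which is omitted here):

```python
import torch
import torch.nn.functional as F

def prediction_loss(cls_logits: torch.Tensor,   # [N, num_classes] class scores
                    box_preds: torch.Tensor,    # [N, 4] predicted box coordinates
                    gt_labels: torch.Tensor,    # [N] ground-truth category indices
                    gt_boxes: torch.Tensor      # [N, 4] ground-truth boxes
                    ) -> torch.Tensor:
    cls_loss = F.cross_entropy(cls_logits, gt_labels)  # classification loss
    reg_loss = F.l1_loss(box_preds, gt_boxes)          # regression loss (L1)
    return cls_loss + reg_loss
```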
In step S260, the parameters of the new class target detection network are optimized based on the first regularization loss term and the first prediction loss term to obtain a trained new class target detection network, where the parameters of the first teacher backbone network are kept unchanged while the parameters of the new class target detection network are being optimized.
Optionally, the first regularization loss term and the first prediction loss term may be weighted and summed with a preset first weight to obtain a first total loss term. Subsequently, the parameters of the new class target detection network can be optimized by back propagation and gradient descent based on the first total loss term. The loss calculation and parameter optimization can be carried out iteratively until the new class target detection network converges. This training process of the new class target detection network can be understood by those skilled in the art and is not described in detail herein.
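Combining the two loss terms, a single optimization step might be sketched as follows (the attribute names backbone/head, the weighting scheme, and the function signature are assumptions for illustration):

```python
import torch

def train_step(student, teacher_backbone, optimizer,
               image, targets, prediction_loss_fn, first_weight=1.0):
    # Teacher backbone is frozen: evaluate it without building a graph.
    with torch.no_grad():
        teacher_feat = teacher_backbone(image)

    student_feat = student.backbone(image)     # first image feature
    preds = student.head(student_feat)         # new class subsequent structure

    total_loss = (prediction_loss_fn(preds, targets)
                  + first_weight * torch.mean((student_feat - teacher_feat) ** 2))

    optimizer.zero_grad()
    total_loss.backward()   # gradients flow only into the student network
    optimizer.step()
    return total_loss.item()
```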
The target detection network training method 200 is illustrated below with the Faster RCNN network as the new class target detection network.
Referring to fig. 3, a schematic diagram of target detection network training for the Faster RCNN network is shown. In fig. 3, by introducing a regularization loss term, the features extracted by the backbone network of the Faster RCNN to be trained are constrained to be as close as possible to the features extracted by the backbone network of a pre-trained Faster RCNN. In particular, assume that the Faster RCNN framework comprises a backbone network $F_B$, a region candidate network $F_{RPN}$, and a detection head $F_{RCNN}$. The detection head $F_{RCNN}$ may include a region-of-interest head (RoI Head) and a box predictor (Box Predictor). The region candidate network is shown in fig. 3 as the RPN. The pre-trained Faster RCNN can be pre-trained on, for example, a base class data set with a larger number of images.
It is to be understood that, when the process shown in fig. 3 is applied to the new class target detection network, the training image shown in fig. 3 may be the first training image, the teacher backbone network may be the first teacher backbone network, i.e., the base class backbone network, and the backbone network may be the new class backbone network. The RPN, RoI head and frame detection head may be the new class subsequent network structure. A teacher backbone network $\widetilde{F}_B$ may be introduced, whose parameters can be initialized to the parameters of the backbone network of the pre-trained base class Faster RCNN. Subsequently, the distance between the features extracted by $F_B$ and those extracted by $\widetilde{F}_B$ is added to the loss term of the Faster RCNN as a constraint, serving as the base class knowledge regularization loss. During training, $\widetilde{F}_B$ is frozen and does not participate in parameter updates.
The base class can be understood as a class for which the number of labeled samples is not limited; that is, any number of labeled base class samples may be used. The new class can be understood as a class for which the number of labeled samples is limited; for example, only K labeled samples may be used for training, where K may be a small number such as 1, 2, 5, 10, 20, or 30. A labeled sample consists of a training image and its corresponding annotation data. The base class and the new class are concepts adopted in the field of few-shot learning, and their difference and meaning can be understood by those skilled in the art and are not described in detail herein. After the new class target detection network is trained on the new class data set, its three components can be obtained: the new class backbone network $F_B^{novel}$, the new class region candidate network $F_{RPN}^{novel}$, and the new class detection head $F_{RCNN}^{novel}$.
According to the target detection network training method provided by the embodiments of the application, the parameters of the pre-trained base class backbone network can be used to assist in training the backbone network of the current new class target detection network. Because the new class target detection network is trained with the base class knowledge regularization method, its generalization performance can be well guaranteed even when it is trained on limited samples. Therefore, the algorithm can greatly improve the performance of a new class target detection network trained on small samples. For example, experimental results on the COCO few-shot dataset show that the mean average precision (mAP) of a new class target detection network obtained with the training method 200 can be improved by 8 points when each new class has only 10 to 30 labeled samples.
Illustratively, the object detection network training method according to the embodiment of the present application may be implemented in a device, an apparatus, or a system having a memory and a processor.
The target detection network training method according to the embodiment of the application can be deployed at an image acquisition end, for example, at a personal terminal or a server end with an image acquisition function.
Alternatively, the target detection network training method according to the embodiment of the present application may also be distributively deployed at a server side (or a cloud side) and a personal terminal. For example, a training image may be collected at a client, the client transmits the collected image to a server (or a cloud), and the server (or the cloud) performs target detection network training based on the training image.
According to an embodiment of the application, the method further comprises: acquiring a second training image and corresponding second annotation data, wherein the second annotation data is used for indicating the position and category of a target contained in the second training image; inputting the second training image into a base class backbone network of the base class target detection network and into a second teacher backbone network, respectively, to obtain a second image feature output by the base class backbone network and a second teacher image feature output by the second teacher backbone network, wherein the second teacher backbone network is the backbone network of a pre-trained neural network model; inputting the second image feature into a base class subsequent network structure of the base class target detection network to obtain a second target prediction result output by the base class subsequent network structure, wherein the second target prediction result comprises position information and category information of the target in the second training image; calculating a second regularization loss term based on the second image feature and the second teacher image feature; calculating a second prediction loss term based on the second target prediction result and the second annotation data; and optimizing the parameters of the base class target detection network based on the second regularization loss term and the second prediction loss term to obtain a trained base class target detection network, wherein the parameters of the second teacher backbone network are kept unchanged while the parameters of the base class target detection network are being optimized.
The training process for the base class target detection network is similar to the training process for the new class target detection network, and the training process for the base class target detection network can be understood with reference to the training process for the new class target detection network, which is not described herein again.
The number of second training images may be one or more. The one or more second training images may constitute a second data set. The number of second training images contained in the second data set may be smaller than the number of images in a large-scale pre-training data set (e.g., ImageNet). In addition, optionally, the images in the second data set may be partially the same as, or entirely different from, the images in the large-scale pre-training data set. Alternatively, the second data set may be a subset of the large-scale pre-training data set.
Illustratively, the second regularization loss term may be calculated based on the following equation:

$$L_{reg}' = \left\| F_B(I) - \widetilde{F}_B(I) \right\|_2^2$$

where, for this stage, $\widetilde{F}_B$ denotes the second teacher backbone network (i.e., the backbone network of the pre-trained neural network model), $F_B$ denotes the base class backbone network, and $I$ denotes the second training image.
the second target prediction result and the second annotation data may be substituted into a preset second target loss function to calculate a second prediction loss term. The second objective function may be the same as or different from the first objective loss function.
Optionally, the second regularization loss term and the second prediction loss term may be weighted and summed with a preset second weight to obtain a second total loss term. Subsequently, the parameters of the base class target detection network may be optimized by back propagation and gradient descent based on the second total loss term. The second weight may be the same as or different from the first weight.
The general training procedure of the present embodiment is described below.
A pre-trained neural network model may be obtained first; it may be obtained by pre-training a neural network model on a large-scale pre-training dataset (e.g., ImageNet). The neural network model may be any suitable network model, such as a target classification network or a target detection network. Subsequently, a base class target detection network can be trained on a base class data set with abundant annotations, based on the parameters of the backbone network of the pre-trained neural network model (the pre-training knowledge regularization method). After the base class target detection network is trained, the new class target detection network can be fine-tuned on the new class data set with limited annotations, based on the parameters of the backbone network of the base class target detection network, to obtain the trained new class target detection network (the base class knowledge regularization method). In this way, the base class target detection network can first be trained with a large amount of base class annotation data, and the new class target detection network can then be trained so that it obtains good detection capability on new classes for which annotation data is scarce.
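The following pseudocode sketch outlines how the three stages might be orchestrated; every helper name in it is hypothetical and not an API of any real library:

```python
# Hypothetical orchestration; load_pretrained_model and train_detector are
# invented helpers standing in for the steps described in the text.

# Stage 1: a model pre-trained on a large-scale dataset supplies the teacher.
pretrained_model = load_pretrained_model("imagenet")

# Stage 2: train the base class detector, regularized toward the pre-trained
# backbone (pre-training knowledge regularization).
base_detector = train_detector(dataset=base_class_dataset,
                               teacher_backbone=pretrained_model.backbone)

# Stage 3: fine-tune the new class detector on scarce data, regularized toward
# the base class backbone (base class knowledge regularization).
new_detector = train_detector(dataset=new_class_dataset,
                              teacher_backbone=base_detector.backbone)
```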
With reference to fig. 3, the training method of the base class target detection network is described by taking the Faster RCNN network as an example. It is to be understood that when the process shown in fig. 3 is applied to the base class target detection network, the training image shown in fig. 3 may be the second training image, the teacher backbone network may be the backbone network of the pre-trained neural network model, and the backbone network may be the base class backbone network. The RPN, RoI head, and frame detection head may be the base class subsequent network structure.
In fig. 3, the features extracted by the backbone network of the base class target detection network may be constrained to be as close as possible to the features extracted by the backbone network of the neural network model pre-trained on a large-scale pre-training dataset, by introducing another regularization loss term. For example, the backbone network $\widetilde{F}_B$ of the pre-trained neural network model may be introduced, and the distance between the features extracted by the base class backbone network $F_B$ and those extracted by the pre-trained backbone network $\widetilde{F}_B$ is added to the loss term of the Faster RCNN as a constraint, serving as the pre-training knowledge regularization loss. During training, $\widetilde{F}_B$ is frozen and does not participate in parameter updates.
After the fine-tuning on the base class data set (i.e., the second data set) is completed, the three components of the base class target detection network can be obtained: the base class backbone network $F_B^{base}$, the base class region candidate network $F_{RPN}^{base}$, and the base class detection head $F_{RCNN}^{base}$.
The base class target detection network is trained based on a pre-training knowledge regularization method, so that the base class target detection network can keep general feature expression capability while learning features suitable for detection tasks. The new class target detection network is trained through a base class knowledge regularization method, so that the generalization performance of the new class target detection network can be well guaranteed even if the new class target detection network is trained on a limited sample.
According to an embodiment of the present application, before the second training image is input into the base class backbone network of the base class target detection network and the second teacher backbone network, respectively, the method may further include: randomly initializing at least part of the parameters of the base class subsequent network structure; and initializing the parameters of the base class backbone network to the parameters of the second teacher backbone network.
The embodiment of initializing the parameters of the base class backbone network to the parameters of the second teacher backbone network has been described above and is not repeated here. Note that initializing the parameters of the base class backbone network to the parameters of the second teacher backbone network is only an example and not a limitation of the present application. In order for the base class target detection network to maintain good general feature expression capability, it is desirable that the features extracted by the base class backbone network be as close as possible to the features extracted by the backbone network of the pre-trained neural network model. Initializing the parameters of the base class backbone network to the parameters of the second teacher backbone network and fine-tuning from this initialization is a convenient way of learning on this basis, and helps to quickly train an ideal base class target detection network. However, it is understood that the base class backbone network can have other suitable initialization parameters.
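In PyTorch terms, this initialization can amount to copying the teacher's weights, as in the following sketch (the ResNet-50 architecture is only an assumed stand-in; the text merely requires both backbones to share a structure):

```python
import torchvision

# Two backbones with identical architecture; the student backbone is
# initialized from the teacher's parameters before training begins.
teacher_backbone = torchvision.models.resnet50(weights=None)
student_backbone = torchvision.models.resnet50(weights=None)
student_backbone.load_state_dict(teacher_backbone.state_dict())
```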
According to an embodiment of the application, the base class subsequent network structure comprises a base class region-of-interest network and a base class frame detection head which are sequentially connected, and the pre-trained neural network model comprises a backbone network and a target region-of-interest network which are sequentially connected, wherein each of the base class region-of-interest network and the target region-of-interest network is used for outputting region features of a region of interest, and the base class frame detection head is used for outputting corresponding position information and category information. Randomly initializing at least part of the parameters of the base class subsequent network structure comprises the following steps: randomly initializing the parameters of the base class frame detection head; and initializing the parameters of the base class region-of-interest network to the parameters of the target region-of-interest network.
Alternatively, the pre-trained neural network model may be any suitable network capable of image processing, and the backbone network thereof may perform feature extraction on the image. Illustratively, the pre-trained neural network model may be an object classification network or an object detection network. In case the pre-trained neural network model is a target classification network, it may further include a target box detection head connected to the target region-of-interest network, the target box detection head being configured to output corresponding class information. In case the pre-trained neural network model is a target detection network, it may further include a target box detection head connected to the target region-of-interest network, the target box detection head being configured to output corresponding location information and category information.
It is noted that although the embodiments of the present application are described herein with the subsequent network structure of the target detection network including a region-of-interest network and a frame detection head, it is understood that this is only an example and not a limitation of the present application, and the target detection network may have other suitable network structures. For example, the subsequent network structure may not be partitioned into a region-of-interest network and a frame detection head; instead, the entire subsequent network structure may be a single whole containing several network layers, which directly outputs the required position information and category information.
By way of example and not limitation, the target region-of-interest network may include the region candidate network and the region-of-interest head in the pre-trained neural network model, and the base class region-of-interest network may include the region candidate network and the region-of-interest head in the base class target detection network. Similarly, the new class region-of-interest network may include the region candidate network and the region-of-interest head in the new class target detection network. The meaning of the region candidate network and the region-of-interest head can be understood by those skilled in the art with reference to fig. 3 and the above description, and is not repeated here. It should be noted that implementing the region-of-interest network as a region candidate network plus a region-of-interest head is merely an example and not a limitation of the present application; if the network structure of the target detection network changes, the network structure of the corresponding region-of-interest network may also change appropriately.
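For reference, torchvision's public Faster R-CNN implementation exposes a comparable decomposition (an analogous open-source model, not the patent's code; the mapping to the terms above is approximate):

```python
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights=None)
rpn = model.rpn                                # region candidate network
roi_align = model.roi_heads.box_roi_pool       # RoI feature extraction (RoIAlign)
box_head = model.roi_heads.box_head            # region-of-interest head
box_predictor = model.roi_heads.box_predictor  # frame detection head
```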
The base class target detection network can inherit the parameters of the backbone network and the region-of-interest network of the pre-trained neural network model, while the parameters of the base class frame detection head can be randomly initialized. That is, the base class target detection network is fine-tuned on the basis of the pre-trained neural network model, so that it can maintain the general processing capability of the pre-trained neural network model while enhancing its detection capability on the base classes.
It should be noted that randomly initializing the parameters of the base class frame detection head is only an example and not a limitation of the present application; the parameters of the base class frame detection head may be set in other suitable ways, for example, to the parameters of a frame detection head in the pre-trained neural network model or to other empirical values. In addition, although in this embodiment the base class target detection network inherits the parameters of the region-of-interest network of the pre-trained neural network model, this is only an example and not a limitation of the present application; the parameters of the base class region-of-interest network of the base class target detection network may also be randomly initialized or obtained in other ways.
According to an embodiment of the application, before the first training image is input into the new class backbone network of the new class target detection network and the first teacher backbone network, respectively, the method further includes: randomly initializing at least part of the parameters of the new class subsequent network structure; and initializing the parameters of the new class backbone network to the parameters of the first teacher backbone network.
The implementation and advantages of initializing the parameters of the base class backbone network to the parameters of the second teacher backbone network have been described above, and the scheme of initializing the parameters of the new class backbone network to the parameters of the first teacher backbone network is similar to the above, and will not be described here again.
According to an embodiment of the application, the new class subsequent network structure comprises a new class region-of-interest network and a new class frame detection head which are sequentially connected, and the trained base class target detection network comprises a base class backbone network, a base class region-of-interest network and a base class frame detection head which are sequentially connected, wherein each of the new class region-of-interest network and the base class region-of-interest network is used for outputting region features of a region of interest, and each of the new class frame detection head and the base class frame detection head is used for outputting corresponding position information and category information. Randomly initializing at least part of the parameters of the new class subsequent network structure comprises the following steps: randomly initializing the parameters of the new class frame detection head; and initializing the parameters of the new class region-of-interest network to the parameters of the base class region-of-interest network.
The new class target detection network can inherit the parameters of the backbone network and the region-of-interest network of the base class target detection network, while the parameters of the new class frame detection head can be randomly initialized. That is, the new class target detection network is fine-tuned on the basis of the base class target detection network, so that it can enhance its detection capability on the new classes while maintaining the detection capability of the base class target detection network.
It should be noted that randomly initializing the parameters of the new class frame detection head is only an example and not a limitation of the present application; the parameters of the new class frame detection head may be set in other suitable ways, for example, to the parameters of the base class frame detection head or to other empirical values. In addition, although in this embodiment the new class target detection network inherits the parameters of the region-of-interest network of the base class target detection network, this is only an example and not a limitation of the present application; the parameters of the new class region-of-interest network of the new class target detection network may also be randomly initialized or obtained in other ways.
Randomly initializing the parameters of the new class frame detection head facilitates quickly fine-tuning the new class target detection network so that it is suited to detecting the new classes.
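A sketch of such a re-initialization in PyTorch is shown below (the normal initialization with std 0.01 is an assumed choice, not specified in the text):

```python
import torch.nn as nn

def reset_frame_detection_head(head: nn.Module) -> None:
    # Randomly re-initialize only the detection head; the backbone and
    # region-of-interest network keep their inherited parameters.
    for m in head.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            nn.init.normal_(m.weight, std=0.01)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```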
According to an embodiment of the application, the new class target detection network is one of the following networks: the Faster Region-based Convolutional Neural Network (Faster RCNN), RetinaNet, the Fully Convolutional One-Stage anchor-free detector (FCOS), and the Adaptive Training Sample Selection network (ATSS); the base class target detection network is one of: Faster RCNN, RetinaNet, FCOS, and ATSS.
Although the implementation of the target detection network training method is described above by taking the Faster RCNN framework as an example, this is not a limitation of the present application. Any existing or future deep-learning-based target detection framework, e.g., RetinaNet, FCOS, ATSS, etc., may employ this technique.
According to another aspect of the present application, a target detection method is provided. FIG. 4 shows a schematic flow diagram of a target detection method 400 according to one embodiment of the present application. As shown in fig. 4, the target detection method 400 includes steps S410 and S420.
In step S410, a target image is acquired.
Similar to the training images, the target images described herein may be any images, and any source and form of images should fall within the scope of the present application. Alternatively, the target image may be a still image or any video frame in a dynamic video. The target image may be an original image captured by an image capturing device (e.g., a separate camera or a camera of a mobile terminal), or may be an image obtained after preprocessing (such as digitizing, normalizing, smoothing, etc.) the original image. Note that the preprocessing of the original image may include an operation of extracting a sub-image containing the target from the original image acquired by the image acquisition device to obtain a target image.
In step S420, a target image is input into the trained new class target detection network to obtain a target image prediction result (first target image prediction result) output by the trained new class target detection network, where the target image prediction result includes position information and category information of a target in the target image.
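A minimal inference sketch under the assumption of a torchvision-style detector API (variable and function names are illustrative):

```python
import torch

def detect(new_class_detector, image_tensor: torch.Tensor):
    # image_tensor: [3, H, W], values in [0, 1]; torchvision-style detectors
    # take a list of such tensors and return one dict per image.
    new_class_detector.eval()
    with torch.no_grad():
        outputs = new_class_detector([image_tensor])
    # outputs[0] holds 'boxes' (position information) and 'labels'/'scores'
    # (category information) under this assumed API.
    return outputs[0]
```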
The form and meaning of the target image prediction result are similar to the first target prediction result and the second target prediction result, and are not described herein again.
Because the trained new class target detection network performs well under small-sample training, the prediction results obtained by performing target detection with this network are more accurate even though it was trained on small samples.
It is to be appreciated that the method 400 may further include: inputting the target image into the trained base class target detection network to obtain a target image prediction result (second target image prediction result) output by the trained base class target detection network, wherein the target image prediction result comprises position information and category information of the target in the target image.
The first target image prediction result and the second target image prediction result may include category information corresponding to different categories, respectively. For example, the first target image prediction result may include category information corresponding to the new category, and the second target image prediction result may include category information corresponding to the base category.
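A minimal sketch of such dual-network inference is given below, assuming each network returns an iterable of (box, score, label) predictions whose labels cover the new classes and the base classes respectively; the merge step shown here is a simple concatenation and is not mandated by this application:

```python
import torch

@torch.no_grad()
def detect_all_classes(image, new_net, base_net):
    """Run both detectors and pool their complementary predictions."""
    new_net.eval()
    base_net.eval()
    new_preds = new_net(image)    # first target image prediction result (new classes)
    base_preds = base_net(image)  # second target image prediction result (base classes)
    # In practice one might additionally apply cross-set non-maximum
    # suppression where boxes from the two sets overlap.
    return list(new_preds) + list(base_preds)
```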
According to another aspect of the present application, a target detection network training apparatus is provided. FIG. 5 shows a schematic block diagram of an object detection network training apparatus 500 according to one embodiment of the present application.
As shown in fig. 5, the target detection network training apparatus 500 according to the embodiment of the present application includes an obtaining module 510, a first input module 520, a second input module 530, a first calculating module 540, a second calculating module 550, and an optimizing module 560. The modules may respectively perform the steps of the target detection network training method described above with reference to fig. 2. Only the main functions of the components of the target detection network training apparatus 500 will be described below, and details that have been described above will be omitted.
The obtaining module 510 is configured to obtain a first training image and corresponding first labeling data, where the first labeling data is used to indicate a position and a category of a target included in the first training image. The obtaining module 510 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 105.
The first input module 520 is configured to input the first training image into a new class backbone network of the new class target detection network and a first teacher backbone network, respectively, to obtain a first image feature output by the new class backbone network and a first teacher image feature output by the first teacher backbone network, where the first teacher backbone network is a base class backbone network of the trained base class target detection network. The first input module 520 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 105.
The second input module 530 is configured to input the first image feature into a new class subsequent network structure of the new class target detection network to obtain a first target prediction result output by the new class subsequent network structure, where the first target prediction result includes location information and category information of a target in the first training image. The second input module 530 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 105.
The first calculation module 540 is configured to calculate a first regularization loss term based on the first image feature and the first teacher image feature. The first calculation module 540 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 105.
The second calculation module 550 is configured to calculate a first predicted loss term based on the first target prediction result and the first annotation data. The second computing module 550 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 105.
The optimization module 560 is configured to optimize the parameters of the new class target detection network based on the first regularization loss term and the first prediction loss term to obtain a trained new class target detection network, where the parameters of the first teacher backbone network remain unchanged during the process of optimizing the parameters of the new class target detection network. The optimization module 560 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 105.
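For illustration only, one training step of the scheme carried out by the modules above could be sketched as follows; the submodule names `backbone` and `subsequent`, the MSE form of the regularization term, and the balancing weight `reg_weight` are assumptions rather than details fixed by this application, and `pred_loss_fn` stands for the framework-specific detection loss:

```python
import torch
import torch.nn.functional as F

def train_step(new_net, teacher_backbone, optimizer, image, targets,
               pred_loss_fn, reg_weight=1.0):
    """One optimization step of the described training scheme (a sketch)."""
    # First teacher image feature: computed without gradients, so the
    # teacher backbone's parameters remain unchanged throughout training.
    with torch.no_grad():
        teacher_feat = teacher_backbone(image)

    student_feat = new_net.backbone(image)    # first image feature
    preds = new_net.subsequent(student_feat)  # first target prediction result

    # First regularization loss term: pull the student's features toward the
    # frozen teacher's features (assumes both feature maps share one shape).
    reg_loss = F.mse_loss(student_feat, teacher_feat)
    # First prediction loss term: supervised detection loss on the labels.
    pred_loss = pred_loss_fn(preds, targets)

    loss = pred_loss + reg_weight * reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only `new_net`'s parameters are registered with the optimizer and the teacher forward pass runs under `torch.no_grad()`, the teacher backbone stays fixed exactly as the optimization module requires.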
According to another aspect of the present application, there is provided an object detecting apparatus. FIG. 6 shows a schematic block diagram of an object detection apparatus 600 according to one embodiment of the present application.
As shown in fig. 6, the object detection apparatus 600 according to the embodiment of the present application includes an obtaining module 610 and an input module 620. The modules may respectively perform the steps of the object detection method described above with reference to fig. 4. Only the main functions of the components of the object detection apparatus 600 will be described below, and details that have been described above will be omitted.
The obtaining module 610 is configured to obtain a target image. The obtaining module 610 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 105.
The input module 620 is configured to input the target image into the trained new class target detection network to obtain a target image prediction result output by the trained new class target detection network, where the target image prediction result includes position information and category information of the target in the target image. The input module 620 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 105.
FIG. 7 shows a schematic block diagram of an electronic device 700 according to one embodiment of the present application. The electronic device 700 includes a memory 710 and a processor 720.
The memory 710 stores computer program instructions for implementing the corresponding steps in the method for object detection network training according to an embodiment of the present application.
Processor 720 is operative to execute the computer program instructions stored in memory 710 to perform the corresponding steps of the object detection network training method according to the embodiments of the present application.
In one embodiment, the computer program instructions, when executed by the processor 720, are used for: acquiring a first training image and corresponding first annotation data, wherein the first annotation data are used for indicating the position and the category of a target contained in the first training image; inputting the first training image into a new class backbone network of a new class target detection network and a first teacher backbone network respectively to obtain a first image feature output by the new class backbone network and a first teacher image feature output by the first teacher backbone network, wherein the first teacher backbone network is a base class backbone network of a trained base class target detection network; inputting the first image feature into a new class subsequent network structure of the new class target detection network to obtain a first target prediction result output by the new class subsequent network structure, wherein the first target prediction result comprises position information and category information of the target in the first training image; calculating a first regularization loss term based on the first image feature and the first teacher image feature; calculating a first prediction loss term based on the first target prediction result and the first annotation data; and optimizing the parameters of the new class target detection network based on the first regularization loss term and the first prediction loss term to obtain a trained new class target detection network, wherein the parameters of the first teacher backbone network are kept unchanged in the process of optimizing the parameters of the new class target detection network.
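Expressed as a formula, one plausible reading of this combined objective is the following, where $\lambda$ is an assumed balancing weight, $f_{\mathrm{new}}$ and $f_{\mathrm{teacher}}$ denote the new class backbone and the frozen teacher backbone, and the squared L2 distance is one possible choice of feature discrepancy:

```latex
\mathcal{L} \;=\;
\underbrace{\mathcal{L}_{\mathrm{pred}}\!\left(\hat{y},\, y\right)}_{\text{first prediction loss term}}
\;+\; \lambda\,
\underbrace{\bigl\lVert f_{\mathrm{new}}(x) - f_{\mathrm{teacher}}(x) \bigr\rVert_2^2}_{\text{first regularization loss term}}
```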
Illustratively, the electronic device 700 may further include an image capture device 730. The image acquisition device 730 is optional. The image capture device 730 may be configured to capture training images and transmit the training images to the memory 710 for storage and/or to the processor 720 for target detection network training.
FIG. 8 shows a schematic block diagram of an electronic device 800 according to one embodiment of the present application. Electronic device 800 includes memory 810 and processor 820.
The memory 810 stores computer program instructions for implementing corresponding steps in an object detection method according to an embodiment of the application.
Processor 820 is configured to execute computer program instructions stored in memory 810 to perform the corresponding steps of the object detection method according to embodiments of the present application.
In one embodiment, the computer program instructions, when executed by the processor 820, are for: acquiring a target image; and inputting the target image into the trained new-class target detection network to obtain a target image prediction result output by the trained new-class target detection network, wherein the target image prediction result comprises position information and class information of the target in the target image.
Illustratively, the electronic device 800 may further include an image capture device 830. The image acquisition device 830 is optional. The image capture device 830 may be used to capture a target image and transmit the target image to the memory 810 for storage and/or to the processor 820 for target detection.
In addition, according to an embodiment of the present application, there is also provided a storage medium, on which program instructions are stored, and when the program instructions are executed by a computer or a processor, the storage medium is configured to perform corresponding steps of the object detection network training method according to the embodiment of the present application, and is configured to implement corresponding modules in the object detection network training apparatus according to the embodiment of the present application. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media.
In one embodiment, the program instructions, when executed by a computer or a processor, may cause the computer or the processor to implement the respective functional modules of the object detection network training apparatus according to the embodiment of the present application, and/or may perform the object detection network training method according to the embodiment of the present application.
In one embodiment, the program instructions are operable when executed to perform the steps of: acquiring a first training image and corresponding first annotation data, wherein the first annotation data are used for indicating the position and the category of a target contained in the first training image; inputting the first training image into a new class backbone network of a new class target detection network and a first teacher backbone network respectively to obtain a first image feature output by the new class backbone network and a first teacher image feature output by the first teacher backbone network, wherein the first teacher backbone network is a base class backbone network of a trained base class target detection network; inputting the first image feature into a new class subsequent network structure of the new class target detection network to obtain a first target prediction result output by the new class subsequent network structure, wherein the first target prediction result comprises position information and category information of the target in the first training image; calculating a first regularization loss term based on the first image feature and the first teacher image feature; calculating a first prediction loss term based on the first target prediction result and the first annotation data; and optimizing the parameters of the new class target detection network based on the first regularization loss term and the first prediction loss term to obtain a trained new class target detection network, wherein the parameters of the first teacher backbone network are kept unchanged in the process of optimizing the parameters of the new class target detection network.
Furthermore, according to an embodiment of the present application, there is also provided a computer program product comprising a computer program which, when run, performs the above object detection network training method 200.
In one embodiment, the computer program is operable when executed to perform the steps of: acquiring a first training image and corresponding first annotation data, wherein the first annotation data are used for indicating the position and the category of a target contained in the first training image; inputting the first training image into a new class backbone network of a new class target detection network and a first teacher backbone network respectively to obtain a first image feature output by the new class backbone network and a first teacher image feature output by the first teacher backbone network, wherein the first teacher backbone network is a base class backbone network of a trained base class target detection network; inputting the first image feature into a new class subsequent network structure of the new class target detection network to obtain a first target prediction result output by the new class subsequent network structure, wherein the first target prediction result comprises position information and category information of the target in the first training image; calculating a first regularization loss term based on the first image feature and the first teacher image feature; calculating a first prediction loss term based on the first target prediction result and the first annotation data; and optimizing the parameters of the new class target detection network based on the first regularization loss term and the first prediction loss term to obtain a trained new class target detection network, wherein the parameters of the first teacher backbone network are kept unchanged in the process of optimizing the parameters of the new class target detection network.
Furthermore, according to an embodiment of the present application, there is also provided a storage medium, on which program instructions are stored, and when the program instructions are executed by a computer or a processor, the storage medium is configured to perform corresponding steps of the object detection method according to the embodiment of the present application, and to implement corresponding modules in the object detection device according to the embodiment of the present application. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media.
In one embodiment, the program instructions, when executed by a computer or a processor, may cause the computer or the processor to implement the respective functional modules of the object detection apparatus according to the embodiment of the present application, and/or may perform the object detection method according to the embodiment of the present application.
In one embodiment, the program instructions are operable when executed to perform the steps of: acquiring a target image; and inputting the target image into the trained new-class target detection network to obtain a target image prediction result output by the trained new-class target detection network, wherein the target image prediction result comprises position information and class information of the target in the target image.
Furthermore, according to an embodiment of the present application, there is also provided a computer program product, comprising a computer program for performing the above object detection method 400 when the computer program is run.
In one embodiment, the computer program is operable when executed to perform the steps of: acquiring a target image; and inputting the target image into the trained new-class target detection network to obtain a target image prediction result output by the trained new-class target detection network, wherein the target image prediction result comprises position information and class information of the target in the target image.
The modules in the electronic device according to embodiments of the present application may be implemented by a processor of the electronic device that performs target detection network training or target detection according to embodiments of the present application running computer program instructions stored in a memory, or may be implemented when computer instructions stored in the computer-readable storage medium of a computer program product according to embodiments of the present application are run by a computer.
Although the example embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above-described example embodiments are merely illustrative and are not intended to limit the scope of the present application thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present application. All such changes and modifications are intended to be included within the scope of the present application as claimed in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, a division of a unit is only one type of division of a logical function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the present application, various features of the present application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various application aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments herein include certain features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the application and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some of the modules in an object detection network training apparatus or object detection apparatus according to embodiments of the present application. The present application may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.
The above description is only for the specific embodiments of the present application or the description thereof, and the protection scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope disclosed in the present application, and all the changes or substitutions should be covered by the protection scope of the present application. The protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A target detection network training method comprises the following steps:
acquiring a first training image and corresponding first annotation data, wherein the first annotation data are used for indicating the position and the category of a target contained in the first training image;
inputting the first training image into a new class backbone network of a new class target detection network and a first teacher backbone network respectively to obtain a first image feature output by the new class backbone network and a first teacher image feature output by the first teacher backbone network, wherein the first teacher backbone network is a base class backbone network of a trained base class target detection network;
inputting the first image feature into a new class subsequent network structure of the new class target detection network to obtain a first target prediction result output by the new class subsequent network structure, wherein the first target prediction result comprises position information and category information of a target in the first training image;
computing a first regularization loss term based on the first image feature and the first teacher image feature;
calculating a first prediction loss term based on the first target prediction result and the first annotation data;
and optimizing the parameters of the new class target detection network based on the first regularization loss term and the first prediction loss term to obtain a trained new class target detection network, wherein the parameters of the first teacher backbone network are kept unchanged in the process of optimizing the parameters of the new class target detection network.
2. The method of claim 1, wherein the method further comprises:
acquiring a second training image and corresponding second annotation data, wherein the second annotation data are used for indicating the position and the category of a target contained in the second training image;
inputting the second training image into a base class backbone network of a base class target detection network and a second teacher backbone network respectively to obtain a second image feature output by the base class backbone network and a second teacher image feature output by the second teacher backbone network, wherein the second teacher backbone network is a backbone network in a pre-trained neural network model;
inputting the second image characteristics into a base class subsequent network structure of the base class target detection network to obtain a second target prediction result output by the base class subsequent network structure, wherein the second target prediction result comprises position information and category information of targets in the second training image;
computing a second regularization loss term based on the second image features and the second teacher image features;
calculating a second prediction loss term based on the second target prediction result and the second annotation data;
and optimizing the parameters of the base class target detection network based on the second regularization loss term and the second prediction loss term to obtain the trained base class target detection network, wherein the parameters of the second teacher backbone network are kept unchanged in the process of optimizing the parameters of the base class target detection network.
3. The method of claim 2, wherein prior to said inputting the second training image into a base class backbone network of a base class target detection network and a second teacher backbone network, respectively, the method further comprises:
randomly initializing at least part of the parameters of the base class subsequent network structure;
initializing the parameters of the base class backbone network to the parameters of the second teacher backbone network.
4. The method according to claim 3, wherein the base class subsequent network structure includes a base class region of interest network and a base class frame detection head connected in sequence, the pre-trained neural network model includes a backbone network and a target region of interest network connected in sequence, either of the base class region of interest network and the target region of interest network is used for outputting region features of regions of interest, and the base class frame detection head is used for outputting corresponding position information and category information,
the randomly initializing at least part of the parameters of the base class subsequent network structure comprises:
randomly initializing the parameters of the base class frame detection head;
initializing the parameters of the base class region of interest network to the parameters of the target region of interest network.
5. The method of any of claims 1 to 4, wherein prior to said inputting the first training image into the new class backbone network of the new class target detection network and the first teacher backbone network, respectively, the method further comprises:
randomly initializing at least part of the parameters of the new class subsequent network structure;
initializing the parameters of the new class backbone network to the parameters of the first teacher backbone network.
6. The method according to claim 5, wherein the new class subsequent network structure comprises a new class region of interest network and a new class frame detection head connected in sequence, the trained base class target detection network comprises a base class backbone network, a base class region of interest network and a base class frame detection head connected in sequence, either of the new class region of interest network and the base class region of interest network is used for outputting region features of a region of interest, and either of the new class frame detection head and the base class frame detection head is used for outputting corresponding position information and category information,
the randomly initializing at least part of the parameters of the new class subsequent network structure comprises:
randomly initializing the parameters of the new class frame detection head;
and initializing the parameters of the new class region of interest network to the parameters of the base class region of interest network.
7. A method of target detection, comprising:
acquiring a target image;
inputting the target image into the new class target detection network trained according to any one of claims 1 to 6 to obtain a target image prediction result output by the trained new class target detection network, wherein the target image prediction result comprises position information and category information of a target in the target image.
8. An electronic device comprising a processor and a memory, wherein the memory has stored therein computer program instructions for execution by the processor to perform the object detection network training method of any one of claims 1 to 6; alternatively, the computer program instructions are for performing the object detection method of claim 7 when executed by the processor.
9. A storage medium on which program instructions are stored, the program instructions being operable when executed to perform the method of object detection network training of any one of claims 1 to 6; alternatively, the program instructions are operable, when executed, to perform the object detection method of claim 7.
10. A computer program product comprising a computer program for performing, when running, the method of object detection network training according to any one of claims 1 to 6; alternatively, the computer program is operative, when running, to perform the object detection method of claim 7.
CN202210989314.3A 2022-08-17 2022-08-17 Target detection network training and target detection method, electronic device and storage medium Pending CN115841605A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210989314.3A CN115841605A (en) 2022-08-17 2022-08-17 Target detection network training and target detection method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210989314.3A CN115841605A (en) 2022-08-17 2022-08-17 Target detection network training and target detection method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115841605A true CN115841605A (en) 2023-03-24

Family

ID=85575381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210989314.3A Pending CN115841605A (en) 2022-08-17 2022-08-17 Target detection network training and target detection method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115841605A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117975111A (en) * 2024-01-05 2024-05-03 国网冀北电力有限公司信息通信分公司 Training method of target detection network model, target detection method and related equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination