CN112949692A - Target detection method and device


Info

Publication number
CN112949692A
Authority
CN
China
Prior art keywords
target detection
detection
yolo
sampling
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110148538.7A
Other languages
Chinese (zh)
Inventor
张一凡
刘杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Goertek Inc
Original Assignee
Goertek Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Goertek Inc filed Critical Goertek Inc
Priority to CN202110148538.7A priority Critical patent/CN112949692A/en
Publication of CN112949692A publication Critical patent/CN112949692A/en
Priority to PCT/CN2021/130102 priority patent/WO2022166293A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0004 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection


Abstract

The application discloses a target detection method and a target detection device. The method comprises: setting at least one adjustment mode for the down-sampling structure of the YOLO-v4 backbone network based on the characteristics of the target to be detected; adjusting the down-sampling structure of the YOLO-v4 backbone network by using the adjustment mode, and constructing a target detection model based on YOLO-v4; and inputting a detection image into the target detection model, extracting a down-sampling feature map of the detection image with the target detection model, and obtaining a target detection result according to the down-sampling feature map, where the size of the down-sampling feature map is determined by the adjusted down-sampling structure. The benefit of this technical solution is that the accuracy of target detection can be improved by adjusting the down-sampling structure. Taking an industrial defect detection scene as an example, scratches, fine fibers and similar defects are imaged as linear, small-volume targets; if the original down-sampling structure is used to process the detection image, repeated down-sampling significantly reduces the detection performance, and the improved target detection model effectively solves this problem.

Description

Target detection method and device
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a target detection method and apparatus.
Background
YOLO (short for You Only Look Once; the term has no established Chinese name in the industry for the time being) is a typical single-stage target detection technology, i.e., information such as the position and category of a target is regressed directly from the original image; the technology has now developed to its fourth version, YOLO-v4. Fig. 1 shows a schematic diagram of the network structure of YOLO-v4, which can be seen to include a down-sampling structure composed of a plurality of down-sampling layers. However, this arrangement has some disadvantages; for example, in industrial defect detection scenarios, some defects are still difficult to identify accurately, so the technology still has room for improvement.
It should be noted that the statements herein merely provide background information related to the present application and may not necessarily constitute prior art.
Disclosure of Invention
The embodiment of the application provides a target detection method and a target detection device, so that the target detection precision is further improved.
The embodiment of the application adopts the following technical scheme:
in a first aspect, an embodiment of the present application provides a target detection method, including: setting at least one adjusting mode for a down-sampling structure of a backbone network of YOLO-v4 based on the characteristics of a target to be detected; adjusting a down-sampling structure of a backbone network of YOLO-v4 by using an adjusting mode, and constructing a target detection model based on YOLO-v 4; inputting the detection image into a target detection model, extracting a down-sampling feature map of the detection image by the target detection model, and obtaining a target detection result according to the down-sampling feature map; the size of the downsampled feature map is determined according to the adjusted downsampled structure.
In some embodiments, in the target detection method, adjusting the downsampling structure of the YOLO-v4 backbone network by using an adjustment method includes: the step size of at least one down-sampling layer in the down-sampling structure is adjusted.
In some embodiments, in the target detection method, adjusting the downsampling structure of the YOLO-v4 backbone network by using an adjustment method includes: one or more downsampling layers in the downsampling structure are deleted.
In some embodiments, in the target detection method, adjusting the downsampling structure of the YOLO-v4 backbone network by using an adjustment method includes: deleting either the 1/4 down-sampling layer or the 1/32 down-sampling layer.
In some embodiments, in the target detection method, adjusting the downsampling structure of the YOLO-v4 backbone network by using an adjustment method further includes: and reducing the number of channels of each network structure originally connected behind the deleted down-sampling layer by half.
In some embodiments, in the target detection method, constructing a YOLO-v 4-based target detection model includes: and adding a detection branch on the basis of the specified down-sampling layer in the adjusted down-sampling structure.
In some embodiments, in the target detection method, constructing a target detection model based on YOLO-v4 further includes: and setting an anchor frame used by each detection branch in the target detection model according to the added detection branches.
In some embodiments, in the target detection method, setting an anchor frame used by each detection branch in the target detection model according to the added detection branches includes: distributing a first preset number of anchor frame groups among the detection branches, where the first preset number is the number of anchor frame groups used by the original YOLO-v4 backbone network and each detection branch is assigned at least one anchor frame group; or increasing the number of anchor frame groups from the first preset number to a second preset number, and evenly distributing the second preset number of anchor frame groups among the detection branches.
In some embodiments, the target detection method further comprises: and pruning the target detection model.
In a second aspect, an embodiment of the present application further provides an object detection apparatus, including: the adjusting unit is used for setting at least one adjusting mode for the down-sampling structure of the backbone network of the YOLO-v4 based on the characteristics of the target to be detected; the system comprises a construction unit, a target detection unit and a target detection unit, wherein the construction unit is used for adjusting a down-sampling structure of a backbone network of YOLO-v4 by using an adjustment mode and constructing a target detection model based on YOLO-v 4; the detection unit is used for inputting the detection image into the target detection model, extracting a down-sampling feature map of the detection image by the target detection model, and obtaining a target detection result according to the down-sampling feature map; the size of the downsampled feature map is determined according to the adjusted downsampled structure.
In some embodiments, in the object detection apparatus, the construction unit is configured to adjust a step size of at least one down-sampling layer in the down-sampling structure.
In some embodiments, in the object detection apparatus, the construction unit is configured to delete one or more downsampled layers in the downsampled structure.
In some embodiments, in the target detection apparatus, the construction unit is configured to delete either the 1/4 down-sampling layer or the 1/32 down-sampling layer.
In some embodiments, in the target detection apparatus, the construction unit is configured to reduce the number of channels of each network structure originally connected after the deleted downsampling layer by half.
In some embodiments, in the object detection apparatus, the construction unit is configured to add the detection branch based on a specified downsampling layer in the adjusted downsampling structure.
In some embodiments, in the object detection apparatus, the construction unit is configured to set an anchor frame used by each detection branch in the object detection model according to the added detection branches.
In some embodiments, in the target detection apparatus, the construction unit is configured to distribute a first preset number of anchor frame groups among the detection branches, where the first preset number is the number of anchor frame groups used by the original YOLO-v4 backbone network and each detection branch is assigned at least one anchor frame group; or to increase the number of anchor frame groups from the first preset number to a second preset number and evenly distribute the second preset number of anchor frame groups among the detection branches.
In some embodiments, the object detection apparatus further comprises: and the pruning unit is used for carrying out pruning processing on the target detection model.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform a method of object detection as described in any one of the above.
In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium storing one or more programs, which when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the object detection method as described in any one of the above.
The embodiments of the application adopt at least one technical solution that can achieve the following beneficial effects: YOLO-v4 is selected to construct a target detection model, and an adjustment mode is set for the down-sampling structure based on the characteristics of the target to be detected, so that the adjusted target detection model produces a down-sampling feature map of adjusted size, on the basis of which higher detection accuracy can be obtained. Taking an industrial defect detection scene as an example, scratches, fine fibers and the like are imaged as linear targets with small volumes; if the original down-sampling structure is used to process the detection image, repeated down-sampling significantly reduces the detection performance, and the improved target detection model effectively solves this problem.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 shows a schematic diagram of the network structure of YOLO-v 4;
FIG. 2 is a diagram of the feature map size output by each down-sampling layer, shown on the basis of the network structure of FIG. 1;
FIG. 3 shows a schematic flow diagram of a target detection method according to an embodiment of the present application;
FIG. 4 is a diagram of the feature map sizes output by each down-sampling layer in the network structure of a target detection model according to one embodiment of the present application;
FIG. 5 is a diagram of the feature map sizes output by each down-sampling layer in the network structure of a target detection model according to another embodiment of the present application;
FIG. 6 is a diagram of the feature map sizes output by each down-sampling layer in the network structure of a target detection model according to yet another embodiment of the present application;
FIG. 7 is a diagram of the feature map sizes output by each down-sampling layer in the network structure of a target detection model according to yet another embodiment of the present application;
FIG. 8 illustrates a network diagram of an object detection model according to one embodiment of the present application;
FIG. 9 shows a network diagram of an object detection model according to another embodiment of the present application;
FIG. 10 shows a schematic structural diagram of a target detection apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 2 shows the feature map size output by each down-sampling layer on the basis of the network structure shown in fig. 1. As shown in fig. 2, when the input image has a size of 416 × 416 (pixels, the same below) and is split into the three RGB channels (i.e., 416 × 416 × 3 in fig. 2; the numbers marked on the following layers have the same meaning and are not explained one by one), the input image is first processed into a 416 × 416 × 32 feature map (the corresponding network structure is not shown in fig. 1), and then passes in sequence through the 1/2 down-sampling layer to obtain a 208 × 208 feature map, the 1/4 down-sampling layer to obtain a 104 × 104 feature map, the 1/8 down-sampling layer to obtain a 52 × 52 feature map, the 1/16 down-sampling layer to obtain a 26 × 26 feature map, and the 1/32 down-sampling layer to obtain a 13 × 13 feature map.
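For illustration only (not part of the original patent text), the size progression described above can be reproduced with a short Python sketch; the assumption that every down-sampling layer halves the spatial resolution with a stride-2 operation follows the standard YOLO-v4 backbone:

    # Minimal sketch: reproduce the feature map sizes in fig. 2, assuming each
    # down-sampling layer is a stride-2 operation that halves the spatial size
    # (an assumption consistent with the standard YOLO-v4 backbone).
    input_size = 416  # input image: 416 x 416 x 3 (RGB)

    size = input_size
    print(f"stem output: {size} x {size} x 32")
    for name in ["1/2", "1/4", "1/8", "1/16", "1/32"]:
        size //= 2  # stride-2 down-sampling halves height and width
        print(f"{name} down-sampling layer output: {size} x {size}")
    # prints 208, 104, 52, 26 and 13, matching fig. 2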
The reason why the original YOLO-v4 is designed in this way is that a feature map with a smaller size can be obtained through multiple times of downsampling, and the inference speed of the model can be greatly improved by carrying out target detection on the smaller feature map.
However, the inventor has found that while multiple down-sampling operations do not reduce detection accuracy for common natural objects that are imaged as planar and relatively large, for objects that are imaged as linear and small in volume (in particular, fine fiber defects and fine impurity defects in industrial inspection), multiple down-sampling operations significantly reduce the detection performance of the model.
That is, the idea of the prior art is to down-sample as much as possible, and the design idea of the present invention is to reduce down-sampling, thereby achieving an improvement in accuracy.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 3 shows a schematic flow diagram of a target detection method according to an embodiment of the present application. As shown in fig. 3, the method includes:
step S310, setting at least one adjusting mode for the down-sampling structure of the backbone network of YOLO-v4 based on the characteristics of the target to be detected.
The target to be detected can be any of various objects to be detected, such as vehicles or defects. The "characteristics" here do not refer to tensor features obtained with a neural network, but to appearance characteristics such as being elongated or small in size. For the purpose of distinction, the tensor features obtained by the neural network are hereinafter referred to as "feature maps".
As described above, for a target to be detected that is imaged as linear and small in volume, too many down-sampling operations reduce the detection accuracy; therefore, the adjustment mode here may be one that reduces the number or degree of down-sampling performed by the down-sampling structure.
And S320, adjusting the downsampling structure of the backbone network of the YOLO-v4 by using an adjusting mode, and constructing a target detection model based on the YOLO-v 4.
Step S330, inputting the detection image into a target detection model, extracting a down-sampling feature map of the detection image by the target detection model, and obtaining a target detection result according to the down-sampling feature map; the size of the downsampled feature map is determined according to the adjusted downsampled structure.
Here, the specific detection method of YOLO-v4 is not changed: referring to fig. 1 and fig. 2, the detection branches led out from the down-sampling layers obtain a target detection result from the extracted down-sampling feature maps, using anchor frame groups and operations such as up-sampling, concatenation and convolution. To improve the detection effect, adding a detection branch can also be considered; a corresponding embodiment will be described later.
It can be seen that the method shown in fig. 3 can improve target detection accuracy by adjusting the down-sampling structure. Taking an industrial defect detection scene as an example, scratches, fine fibers and the like are imaged as linear targets with small volumes; if the original down-sampling structure is used to process the detection image, repeated down-sampling significantly reduces the detection performance, and the improved target detection model effectively solves this problem.
In some embodiments, in the target detection method, adjusting the downsampling structure of the YOLO-v4 backbone network by using an adjustment method includes: the step size of at least one down-sampling layer in the down-sampling structure is adjusted.
For example, in the YOLO-v4 backbone network, the step size (stride) of the 1/8 down-sampling layer is 2; as can be seen from fig. 2, taking a detection image of size 416 × 416 as an example, the down-sampling feature map obtained by the 1/8 down-sampling layer has a size of 52 × 52. If the step size is adjusted from 2 to 1, the effect shown in fig. 4 is obtained: still taking a 416 × 416 detection image as an example, the down-sampling feature map obtained by the 1/8 down-sampling layer becomes 104 × 104, the same size as that obtained by the 1/4 down-sampling layer (only the number of channels changes).
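As a non-authoritative sketch of this adjustment mode (the patent gives no code, so the layer layout, channel counts and PyTorch implementation below are assumptions), the stride change could look like this:

    import torch.nn as nn

    def downsample_layer(in_ch, out_ch, stride=2):
        # Assumed implementation of a down-sampling layer: a single 3x3 convolution
        # followed by batch normalization and a leaky ReLU, as in CSPDarknet-53.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),
        )

    # Original 1/8 down-sampling layer: stride 2, turns a 104 x 104 map into 52 x 52.
    original_1_8 = downsample_layer(128, 256, stride=2)

    # Adjustment mode: step size changed from 2 to 1, so a 104 x 104 input stays
    # 104 x 104 (only the number of channels changes), as in fig. 4.
    adjusted_1_8 = downsample_layer(128, 256, stride=1)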
The target detection model constructed from the original YOLO-v4 was used as comparative example 1, and a model that is identical except that the step size of the 1/8 down-sampling layer is set to 1 was used as embodiment 1. After training on the same sample set, both models were used to detect the images in an experimental set. The experimental data show that embodiment 1 outperforms comparative example 1 on multiple indicators: an advantage of about 6 percentage points in mean average precision (mAP), about 1 percentage point in recall, and about 4 percentage points in precision.
In some embodiments, in the target detection method, adjusting the downsampling structure of the YOLO-v4 backbone network by using an adjustment method includes: one or more downsampling layers in the downsampling structure are deleted.
Deleting a down-sampling layer reduces the number of down-sampling operations very directly. However, fewer down-sampling operations also lead to relatively large down-sampling feature maps, which in turn increases the training and inference time of the target detection model.
The inventor has found a more balanced scheme through experiments. In some embodiments, in the target detection method, adjusting the downsampling structure of the YOLO-v4 backbone network by using an adjustment method includes: deleting either the 1/4 down-sampling layer or the 1/32 down-sampling layer. FIG. 5 shows the feature map size output by each down-sampling layer after the 1/4 down-sampling layer is removed; fig. 6 shows the feature map size output by each down-sampling layer after the 1/32 down-sampling layer is removed.
The target detection model corresponding to fig. 5 is 255 MB, and the target detection model corresponding to fig. 6 is only 73.7 MB; compared with the 256 MB target detection model of comparative example 1 above, the model size is reduced, saving storage space on the equipment where the target detection model is deployed.
In some embodiments, in the target detection method, adjusting the downsampling structure of the YOLO-v4 backbone network by using an adjustment method further includes: reducing by half the number of channels of each network structure originally connected after the deleted down-sampling layer. For example, fig. 7 shows the feature map size output by each down-sampling layer after the 1/4 down-sampling layer is removed and the number of channels of each subsequent network structure is halved.
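The following sketch (illustrative only; the backbone is reduced here to a plain chain of stride-2 convolutions with CSPDarknet-53-like channel counts, which is an assumption) shows how deleting a down-sampling layer and halving the channels of the structures after it could be combined:

    import torch.nn as nn

    # Stage name and output channels of each down-sampling layer (assumed values).
    STAGES = [("1/2", 64), ("1/4", 128), ("1/8", 256), ("1/16", 512), ("1/32", 1024)]

    def build_adjusted_backbone(delete_stage="1/4", halve_after_delete=True):
        layers, in_ch, deleted = [], 32, False
        for name, out_ch in STAGES:
            if name == delete_stage:
                deleted = True  # adjustment mode: delete this down-sampling layer
                continue
            if deleted and halve_after_delete:
                out_ch //= 2  # halve the channels of structures after the deleted layer
            layers.append(nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1))
            in_ch = out_ch
        return nn.Sequential(*layers)

    # Backbone corresponding to fig. 7: 1/4 layer removed, later channels halved.
    backbone = build_adjusted_backbone(delete_stage="1/4", halve_after_delete=True)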
The target detection model corresponding to fig. 7 is only 64.2 MB; compared with the 256 MB target detection model of comparative example 1 above, the model size is again reduced, saving storage space on the equipment where the target detection model is deployed.
In some embodiments, in the target detection method, constructing a YOLO-v 4-based target detection model includes: and adding a detection branch on the basis of the specified down-sampling layer in the adjusted down-sampling structure.
Referring to fig. 1, the original YOLO-v4 backbone network has three detection branches, which are respectively connected to the 1/8 down-sampling layer, the 1/16 down-sampling layer and the 1/32 down-sampling layer.
Adding detection branches allows target detection to be performed on more down-sampling feature maps, which can improve detection accuracy. For example, fig. 8 shows a network schematic diagram of a target detection model according to an embodiment of the present application. Compared with the original YOLO-v4 backbone network shown in fig. 1, the model in fig. 8 introduces a new detection branch at the 1/4 down-sampling layer, giving four detection branches in total, so the model can additionally perform target detection on the down-sampling feature map obtained at the 1/4 down-sampling layer and thereby improve accuracy.
It should be noted that the scheme of increasing the detection branches and the scheme of reducing the down-sampling may be used in combination, for example, the down-sampling structure is adjusted first, and then the arrangement manner of the detection branches is adjusted based on the adjusted down-sampling structure. As in the network structure shown in fig. 7, although the 1/4 down-sampling layer is deleted, an additional detection branch can be led out from the 1/2 down-sampling layer; as with the network architecture shown in fig. 5, although the 1/32 downsampling layer is omitted, detection branches may be drawn based on the 1/2 downsampling layer, the 1/4 downsampling layer, the 1/8 downsampling layer, and the 1/16 downsampling layer, among others.
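A minimal sketch of adding a detection branch on a specified down-sampling layer is given below (the 1x1-convolution head, the anchor count of three per branch and the channel numbers are assumptions; the patent does not prescribe them):

    import torch.nn as nn

    def detection_branch(in_channels, num_anchors=3, num_classes=80):
        # Each anchor predicts 4 box offsets + 1 objectness score + class scores.
        return nn.Conv2d(in_channels, num_anchors * (5 + num_classes), kernel_size=1)

    # Original YOLO-v4: branches on the 1/8, 1/16 and 1/32 feature maps.
    # Adjusted model of fig. 8: one extra branch on the 1/4 feature map.
    branches = nn.ModuleDict({
        "p2_1_4": detection_branch(128),    # new branch on the 1/4 feature map
        "p3_1_8": detection_branch(256),
        "p4_1_16": detection_branch(512),
        "p5_1_32": detection_branch(1024),
    })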
In some embodiments, in the target detection method, constructing a target detection model based on YOLO-v4 further includes: and setting an anchor frame used by each detection branch in the target detection model according to the added detection branches.
The anchor frame is a reference frame used during target detection; its specific use can follow the prior art, and only the number and allocation of the anchor frames need to be adjusted according to the scheme provided by the embodiments of the present application.
Referring to fig. 1, in the original YOLO-v4 backbone network, three anchor frame (anchor) groups with serial numbers 0, 1, and 2 are used for the detection branches led out from the 1/8 downsampling layer; three anchor frame groups with serial numbers of 3, 4 and 5 are used for detection branches led out from the 1/16 down-sampling layer; the detection branches from the 1/32 downsampling layer use three anchor block sets numbered 6, 7, and 8.
Since the detection branch is newly added, it is necessary to determine which anchor boxes are used by the newly added detection branch.
In some embodiments, in the target detection method, setting an anchor frame used by each detection branch in the target detection model according to the added detection branches includes: distributing a first preset number of anchor frame groups among the detection branches, where the first preset number is the number of anchor frame groups used by the original YOLO-v4 backbone network and each detection branch is assigned at least one anchor frame group; or increasing the number of anchor frame groups from the first preset number to a second preset number, and evenly distributing the second preset number of anchor frame groups among the detection branches.
Two possible anchor frame allocation schemes are shown, one of which may be selected for use depending on the actual requirements. In one scheme, 9 anchor frame groups used in the backbone network of the original YOLO-v4 may be re-allocated to all current detection branches, for example, referring to fig. 8, the newly added detection branch (derived from 1/4 downsampling layer) uses an anchor frame group with sequence number 0; three anchor frame groups with serial numbers of 1, 2 and 3 are used for detection branches led out from the 1/8 down-sampling layer; three anchor frame groups with serial numbers of 4, 5 and 6 are used for detection branches led out from the 1/16 down-sampling layer; the detection branch from the 1/32 down-sampling layer uses two sets of anchor boxes numbered 7 and 8.
Alternatively, in another scheme, each detection branch may use the same number of anchor frame groups, for example, referring to fig. 9, a network diagram of a target detection model according to another embodiment of the present application is shown, in which three anchor frame groups are also used for the newly added detection branch, and a total of 12 anchor frame groups are used.
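For clarity, the two allocation schemes can be written down as plain mappings from detection branch to anchor-group indices; the concrete indices below simply mirror the fig. 8 and fig. 9 examples above and are otherwise not prescribed by the patent:

    # Scheme 1: redistribute the original 9 anchor frame groups over the 4 branches,
    # each branch receiving at least one group (fig. 8).
    allocation_redistribute = {
        "1/4":  [0],
        "1/8":  [1, 2, 3],
        "1/16": [4, 5, 6],
        "1/32": [7, 8],
    }

    # Scheme 2: increase the total from 9 to 12 groups and assign every branch the
    # same number of groups (fig. 9).
    allocation_expand = {
        "1/4":  [0, 1, 2],
        "1/8":  [3, 4, 5],
        "1/16": [6, 7, 8],
        "1/32": [9, 10, 11],
    }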
In some embodiments, the target detection method further comprises: and pruning the target detection model.
Reducing down-sampling and adding detection branches in the backbone network can improve the detection performance for linear or small-volume targets, but they also increase the amount of computation. To reduce the computation introduced by these operations and to reduce the risk of network overfitting, pruning may be performed on the target detection model.
For example, a network pruning algorithm may be selected: the target detection model is first sparsely trained to obtain sparsified γ parameters (provided that the target detection model uses batch normalization (BN) layers, which carry the γ parameters), and the input channels and/or output channels of the convolutional layers are then pruned based on the sparsified γ parameters.
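A minimal sketch of γ-based channel selection is shown below (network-slimming style; the sparsity penalty, the keep ratio and the thresholding rule are assumptions, since the patent names the idea but not a specific algorithm):

    import torch
    import torch.nn as nn

    def bn_l1_penalty(model: nn.Module) -> torch.Tensor:
        # Sparse training: add an L1 penalty on the gamma (weight) of every BN layer,
        # e.g. loss = task_loss + l1_lambda * bn_l1_penalty(model).
        return sum(m.weight.abs().sum() for m in model.modules()
                   if isinstance(m, nn.BatchNorm2d))

    def channels_to_keep(bn: nn.BatchNorm2d, keep_ratio: float = 0.7) -> torch.Tensor:
        # After sparse training, keep only the channels whose gamma is large enough;
        # the corresponding input/output channels of adjacent convolutions are pruned.
        gammas = bn.weight.detach().abs()
        threshold = torch.quantile(gammas, 1.0 - keep_ratio)
        return gammas >= threshold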
The embodiment of the application also provides a target detection device, which is used for realizing the target detection method.
Specifically, fig. 10 shows a schematic structural diagram of a target detection apparatus according to an embodiment of the present application. As shown in fig. 10, the target detection apparatus 1000 includes:
the adjusting unit 1010 is configured to set at least one adjusting mode for a downsampling structure of the YOLO-v4 backbone network based on characteristics of the target to be detected.
The target to be detected can be any of various objects to be detected, such as vehicles or defects. The "characteristics" here do not refer to tensor features obtained with a neural network, but to appearance characteristics such as being elongated or small in size. For the purpose of distinction, the tensor features obtained by the neural network are hereinafter referred to as "feature maps".
As described above, for a target to be detected that is imaged as linear and small in volume, too many down-sampling operations reduce the detection accuracy; therefore, the adjustment mode here may be one that reduces the number or degree of down-sampling performed by the down-sampling structure.
The building unit 1020 is configured to adjust a downsampling structure of the YOLO-v4 backbone network by using an adjustment mode, and build a target detection model based on the YOLO-v 4.
The detection unit 1030 is configured to input the detection image into the target detection model, extract a downsampling feature map of the detection image by the target detection model, and obtain a target detection result according to the downsampling feature map; the size of the downsampled feature map is determined according to the adjusted downsampled structure.
Here, the specific detection method of YOLO-v4 is not changed: referring to fig. 1 and fig. 2, the detection branches led out from the down-sampling layers obtain a target detection result from the extracted down-sampling feature maps, using anchor frame groups and operations such as up-sampling, concatenation and convolution. To improve the detection effect, adding a detection branch can also be considered; a corresponding embodiment will be described later.
It can be seen that the apparatus shown in fig. 10 can improve target detection accuracy by adjusting the down-sampling structure. Taking an industrial defect detection scene as an example, scratches, fine fibers and the like are imaged as linear targets with small volumes; if the original down-sampling structure is used to process the detection image, repeated down-sampling significantly reduces the detection performance, and the improved target detection model effectively solves this problem.
In some embodiments, in the target detection apparatus, the constructing unit 1020 is configured to adjust a step size of at least one down-sampling layer in the down-sampling structure.
In some embodiments, in the object detection apparatus, the constructing unit 1020 is configured to delete one or more downsampled layers in the downsampled structure.
In some embodiments, in the target detection apparatus, the construction unit 1020 is configured to delete either the 1/4 down-sampling layer or the 1/32 down-sampling layer.
In some embodiments, in the target detection apparatus, the constructing unit 1020 is configured to reduce the number of channels of each network structure originally connected after the deleted downsampling layer by half.
In some embodiments, in the target detection apparatus, the constructing unit 1020 is configured to add the detection branch based on a specified downsampling layer in the adjusted downsampling structure.
In some embodiments, in the target detection apparatus, the constructing unit 1020 is configured to set an anchor frame used by each detection branch in the target detection model according to the added detection branches.
In some embodiments, in the target detection apparatus, the constructing unit 1020 is configured to distribute a first preset number of anchor frame groups among the detection branches, where the first preset number is the number of anchor frame groups used by the original YOLO-v4 backbone network and each detection branch is assigned at least one anchor frame group; or to increase the number of anchor frame groups from the first preset number to a second preset number and evenly distribute the second preset number of anchor frame groups among the detection branches.
In some embodiments, the object detection apparatus further comprises: and the pruning unit is used for carrying out pruning processing on the target detection model.
It can be understood that the target detection apparatus can implement the steps of the target detection method provided in the foregoing embodiments, and the related explanations of the target detection method are applicable to the target detection apparatus and are not repeated here.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 11, at the hardware level, the electronic device includes a processor and optionally an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 11, but that does not indicate only one bus or one type of bus.
And the memory is used for storing programs. In particular, the program may include program code including computer operating instructions. The memory may include both memory and non-volatile storage and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the target detection apparatus at the logical level; fig. 11 does not limit the number of target detection apparatuses in the present application. The processor is configured to execute the program stored in the memory, and is specifically configured to perform the following operations:
setting at least one adjusting mode for a down-sampling structure of a backbone network of YOLO-v4 based on the characteristics of a target to be detected; adjusting a down-sampling structure of a backbone network of YOLO-v4 by using an adjusting mode, and constructing a target detection model based on YOLO-v 4; inputting the detection image into a target detection model, extracting a down-sampling feature map of the detection image by the target detection model, and obtaining a target detection result according to the down-sampling feature map; the size of the downsampled feature map is determined according to the adjusted downsampled structure.
The method performed by the target detection apparatus according to the embodiment shown in fig. 3 of the present application may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps and logic blocks disclosed in the embodiments of the present application may thus be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random-access memory, a flash memory, a read-only memory, a programmable or electrically erasable programmable read-only memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The electronic device may further execute the method executed by the target detection apparatus in fig. 3 and implement the functions of the target detection apparatus in the embodiment shown in fig. 10, which is not repeated here in this embodiment of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing one or more programs, where the one or more programs include instructions that, when executed by an electronic device including a plurality of application programs, enable the electronic device to perform the method performed by the target detection apparatus in the embodiment shown in fig. 3, and are specifically configured to perform:
setting at least one adjusting mode for a down-sampling structure of a backbone network of YOLO-v4 based on the characteristics of a target to be detected; adjusting a down-sampling structure of a backbone network of YOLO-v4 by using an adjusting mode, and constructing a target detection model based on YOLO-v 4; inputting the detection image into a target detection model, extracting a down-sampling feature map of the detection image by the target detection model, and obtaining a target detection result according to the down-sampling feature map; the size of the downsampled feature map is determined according to the adjusted downsampled structure.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random-access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of object detection, comprising:
setting at least one adjusting mode for a down-sampling structure of a backbone network of YOLO-v4 based on the characteristics of a target to be detected;
adjusting the down-sampling structure of the backbone network of the YOLO-v4 by using the adjusting mode, and constructing a target detection model based on the YOLO-v 4;
inputting a detection image into the target detection model, extracting a down-sampling feature map of the detection image by the target detection model, and obtaining a target detection result according to the down-sampling feature map; the size of the downsampled feature map is determined according to the adjusted downsampled structure.
2. The method of claim 1, wherein the adjusting the downsampling structure of the YOLO-v4 backbone network using the adjustment manner comprises:
and adjusting the step size of at least one down-sampling layer in the down-sampling structure.
3. The method of claim 1, wherein the adjusting the downsampling structure of the YOLO-v4 backbone network using the adjustment manner comprises:
deleting one or more downsampling layers in the downsampling structure.
4. The method of claim 3, wherein the adjusting the downsampling structure of the YOLO-v4 backbone network using the adjustment manner comprises:
any of the 1/4 downsampling layers and 1/32 downsampling layers were deleted.
5. The method of claim 3, wherein the adjusting the downsampling structure of the YOLO-v4 backbone network using the adjustment method further comprises:
and reducing the number of channels of each network structure originally connected behind the deleted down-sampling layer by half.
6. The method of claim 1, wherein constructing a YOLO-v 4-based target detection model comprises:
and adding a detection branch on the basis of the specified down-sampling layer in the adjusted down-sampling structure.
7. The method of claim 6, wherein constructing a YOLO-v 4-based target detection model further comprises:
and setting an anchor frame used by each detection branch in the target detection model according to the added detection branches.
8. The method of claim 7, wherein setting an anchor frame used by each detection branch in the target detection model according to the added detection branches comprises:
distributing a first preset number of anchor frame groups among the detection branches, wherein the first preset number is the number of anchor frame groups used by the original YOLO-v4 backbone network, and each detection branch is assigned at least one anchor frame group;
or,
increasing the number of anchor frame groups from the first preset number to a second preset number, and evenly distributing the second preset number of anchor frame groups among the detection branches.
9. The method of any one of claims 1 to 8, further comprising:
and pruning the target detection model.
10. An object detection device, comprising:
the adjusting unit is used for setting at least one adjusting mode for the down-sampling structure of the backbone network of the YOLO-v4 based on the characteristics of the target to be detected;
the construction unit is used for adjusting the downsampling structure of the backbone network of the YOLO-v4 by using the adjusting mode, and constructing a target detection model based on the YOLO-v 4;
the detection unit is used for inputting a detection image into the target detection model, extracting a down-sampling feature map of the detection image by the target detection model, and obtaining a target detection result according to the down-sampling feature map; the size of the downsampled feature map is determined according to the adjusted downsampled structure.
CN202110148538.7A 2021-02-03 2021-02-03 Target detection method and device Pending CN112949692A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110148538.7A CN112949692A (en) 2021-02-03 2021-02-03 Target detection method and device
PCT/CN2021/130102 WO2022166293A1 (en) 2021-02-03 2021-11-11 Target detection method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110148538.7A CN112949692A (en) 2021-02-03 2021-02-03 Target detection method and device

Publications (1)

Publication Number Publication Date
CN112949692A true CN112949692A (en) 2021-06-11

Family

ID=76242151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110148538.7A Pending CN112949692A (en) 2021-02-03 2021-02-03 Target detection method and device

Country Status (2)

Country Link
CN (1) CN112949692A (en)
WO (1) WO2022166293A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661614B (en) * 2022-12-09 2024-05-24 江苏稻源科技集团有限公司 Target detection method based on lightweight YOLO v1
CN116363124B (en) * 2023-05-26 2023-08-01 南京杰智易科技有限公司 Steel surface defect detection method based on deep learning


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9858496B2 (en) * 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images
CN110633594A (en) * 2018-06-21 2019-12-31 北京京东尚科信息技术有限公司 Target detection method and device
CN111860064B (en) * 2019-04-30 2023-10-20 杭州海康威视数字技术股份有限公司 Video-based target detection method, device, equipment and storage medium
CN111462050B (en) * 2020-03-12 2022-10-11 上海理工大学 YOLOv3 improved minimum remote sensing image target detection method and device and storage medium
CN112949692A (en) * 2021-02-03 2021-06-11 歌尔股份有限公司 Target detection method and device

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170196509A1 (en) * 2014-06-25 2017-07-13 Canary Medical Inc. Devices, systems and methods for using and monitoring heart valves
CN107316054A (en) * 2017-05-26 2017-11-03 昆山遥矽微电子科技有限公司 Non-standard character recognition methods based on convolutional neural networks and SVMs
CN110632608A (en) * 2018-06-21 2019-12-31 北京京东尚科信息技术有限公司 Target detection method and device based on laser point cloud
CN109784386A (en) * 2018-12-29 2019-05-21 天津大学 A method of it is detected with semantic segmentation helpers
CN110503070A (en) * 2019-08-29 2019-11-26 电子科技大学 Traffic automation monitoring method based on Aerial Images object detection process technology
CN111488804A (en) * 2020-03-19 2020-08-04 山西大学 Labor insurance product wearing condition detection and identity identification method based on deep learning
CN111553406A (en) * 2020-04-24 2020-08-18 上海锘科智能科技有限公司 Target detection system, method and terminal based on improved YOLO-V3
CN111738987A (en) * 2020-06-01 2020-10-02 湖南品信生物工程有限公司 Automatic identification method and device for multitask cervical cancer cells
CN111899227A (en) * 2020-07-06 2020-11-06 北京交通大学 Automatic railway fastener defect acquisition and identification method based on unmanned aerial vehicle operation
CN111967480A (en) * 2020-09-07 2020-11-20 上海海事大学 Multi-scale self-attention target detection method based on weight sharing
CN112200773A (en) * 2020-09-17 2021-01-08 苏州慧维智能医疗科技有限公司 Large intestine polyp detection method based on encoder and decoder of cavity convolution
CN112232371A (en) * 2020-09-17 2021-01-15 福州大学 American license plate recognition method based on YOLOv3 and text recognition

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
PAOLO GALEONE: "TensorFlow 2.0 Neural Network Practice" (Chinese edition), 30 June 2020
LIU Xiabi, MA Xiaohong, GAO Yixuan: "Artificial Intelligence: Machine Learning and Neural Networks" (in Chinese), 31 August 2020
MING Yue: "Multi-Source Visual Information Perception and Recognition" (in Chinese), Beijing University of Posts and Telecommunications Press
LI Xin, ZHOU Wei, DUAN Zhemin: "Microprocessor System-Level On-Chip Temperature Sensing Technology" (in Chinese), 30 March 2019
FEI Linyu: "Research on Video-Based Pedestrian Tracking Methods" (in Chinese), China Master's Theses Full-Text Database (Information Science and Technology), 15 December 2020 (2020-12-15)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022166293A1 (en) * 2021-02-03 2022-08-11 歌尔股份有限公司 Target detection method and apparatus
CN113962931A (en) * 2021-09-08 2022-01-21 宁波海棠信息技术有限公司 Foreign matter defect detection method for magnetic reed switch

Also Published As

Publication number Publication date
WO2022166293A1 (en) 2022-08-11

Similar Documents

Publication Publication Date Title
CN112949692A (en) Target detection method and device
CN108305158B (en) Method, device and equipment for training wind control model and wind control
KR102631381B1 (en) Convolutional neural network processing method and apparatus
CN107391527A (en) A kind of data processing method and equipment based on block chain
CN107679700A (en) Business flow processing method, apparatus and server
CN111966334B (en) Service processing method, device and equipment
CN109145003B (en) Method and device for constructing knowledge graph
CN113298050B (en) Lane line recognition model training method and device and lane line recognition method and device
WO2022166294A1 (en) Target detection method and apparatus
CN108920183B (en) Service decision method, device and equipment
CN109784207B (en) Face recognition method, device and medium
CN112766397A (en) Classification network and implementation method and device thereof
CN113763412A (en) Image processing method and device, electronic equipment and computer readable storage medium
CN111080309B (en) Data processing method, device and equipment for multiple objects or multiple models
CN112749602A (en) Target query method, device, equipment and storage medium
CN107368281B (en) Data processing method and device
CN116185545A (en) Page rendering method and device
CN113344145A (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN114359935A (en) Model training and form recognition method and device
CN112330666B (en) Image processing method, system, device and medium based on improved twin network
CN115545938B (en) Method, device, storage medium and equipment for executing risk identification service
CN115034386A (en) Service execution method, device, storage medium and electronic equipment
CN114743004A (en) Multi-scale pavement full-factor semantic segmentation method and device, electronic equipment and storage medium
CN115964449A (en) Vehicle track simplifying method and device, storage medium and electronic equipment
CN116935176A (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210611)