CN112116032A - Object detection device and method and terminal equipment

Object detection device and method and terminal equipment

Info

Publication number
CN112116032A
CN112116032A (application CN201910542145.7A)
Authority
CN
China
Prior art keywords
unit
convolutional layer
shuffle
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910542145.7A
Other languages
Chinese (zh)
Inventor
康昊 (Kang Hao)
谭志明 (Tan Zhiming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201910542145.7A (published as CN112116032A)
Priority to JP2020092988A (published as JP7428075B2)
Publication of CN112116032A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/192 Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 References adjustable by an adaptive method, e.g. learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention provide an object detection apparatus and method, and a terminal device. Because all convolutional layers in the shuffle unit used for feature extraction have the same number of input and output channels, no feature expansion or compression is needed, which reduces the memory and performance demands on the processor and increases the processing speed. In addition, the shuffle unit has at least one depthwise separable convolutional layer, which greatly reduces the FLOPs and other demands on processor memory and performance. Compared with networks such as Yolo-Darknet53, the processor requirements are greatly reduced while the recognition accuracy is almost the same. The embodiments therefore provide a lightweight detection method with high processing speed and high recognition accuracy, which can be applied to terminal devices with limited memory and performance and still achieve a good recognition effect.

Description

Object detection device and method and terminal equipment
Technical Field
The present invention relates to the field of information technology.
Background
In recent years, with the help of deep learning, research in the field of computer vision has advanced greatly. Deep learning refers to a set of algorithms that apply various machine learning methods on hierarchical neural networks to solve problems involving images, text, and other data. The core of deep learning is feature learning, which aims to obtain hierarchical feature information through a layered neural network, thereby removing the previous need to design features by hand.
Several popular deep learning methods have emerged; for example, the Yolo network is a promising deep learning method for object recognition and detection. The Yolo-Darknet53 network uses the Darknet53 structure as its backbone network for feature extraction. As a single-stage method it has a fast processing speed, and it achieves high recognition accuracy because it performs multi-scale object detection and uses better classifiers.
It should be noted that the above background description is provided only to set forth the technical solutions of the present invention clearly and completely and to aid understanding by those skilled in the art. These solutions should not be considered well known to those skilled in the art merely because they are set forth in the background section of the invention.
Disclosure of Invention
However, wide and deep neural networks with high recognition accuracy, such as the Yolo-Darknet53 network, place high demands on processor memory and processing speed. For example, the Yolo-Darknet53 network requires 739.8M floating-point operations (FLOPs), and its processing speed is 1375.8 ms per frame on a central processing unit (CPU) and 37.0 ms per frame on a graphics processing unit (GPU). Certain terminal devices, such as vehicle-mounted devices, can only support about 100M FLOPs. Therefore, these current neural networks with high recognition accuracy cannot be applied to some mobile devices.
Embodiments of the present invention provide an object detection apparatus and method, and a terminal device. Because all convolutional layers in the shuffle unit used for feature extraction have the same number of input and output channels, no feature expansion or compression is needed, which reduces the memory and performance demands on the processor and increases the processing speed. In addition, the shuffle unit has at least one depthwise separable convolutional layer, which greatly reduces the FLOPs and other demands on processor memory and performance. Compared with networks such as Yolo-Darknet53, the processor requirements are greatly reduced while the recognition accuracy is almost the same. The embodiments therefore provide a lightweight detection method with high processing speed and high recognition accuracy, which can be applied to terminal devices with limited memory and performance and still achieve a good recognition effect.
According to a first aspect of embodiments of the present invention, there is provided an object detection apparatus, the apparatus comprising: a feature extraction unit for extracting features in an input image; and a detection unit configured to detect an object in the input image based on the features extracted by the feature extraction unit, wherein the feature extraction unit includes at least one shuffle unit, the shuffle unit including a plurality of convolutional layers each having the same number of input channels and output channels, the plurality of convolutional layers including at least one depthwise separable convolutional layer.
According to a second aspect of embodiments of the present invention, there is provided a terminal device, which includes the apparatus according to the first aspect of embodiments of the present invention.
According to a third aspect of embodiments of the present invention, there is provided an object detection method, the method including: extracting features in an input image by using a feature extraction unit; and detecting, by using a detection unit, an object in the input image based on the features extracted by the feature extraction unit, wherein the feature extraction unit includes at least one shuffle unit, the shuffle unit including a plurality of convolutional layers each having the same number of input channels and output channels, the plurality of convolutional layers including at least one depthwise separable convolutional layer.
The invention has the following beneficial effects: because all convolutional layers in the shuffle unit used for feature extraction have the same number of input and output channels, no feature expansion or compression is needed, which reduces the memory and performance demands on the processor and increases the processing speed. In addition, the shuffle unit has at least one depthwise separable convolutional layer, which greatly reduces the FLOPs and other demands on processor memory and performance. Compared with networks such as Yolo-Darknet53, the processor requirements are greatly reduced while the recognition accuracy is almost the same. A lightweight detection method with high processing speed and high recognition accuracy is therefore provided, which can be applied to terminal devices with limited memory and performance and achieve a good recognition effect.
Specific embodiments of the present invention are disclosed in detail with reference to the following description and drawings, indicating the manner in which the principles of the invention may be employed. It should be understood that the embodiments of the invention are not so limited in scope. The embodiments of the invention include many variations, modifications and equivalents within the spirit and scope of the appended claims.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments, in combination with or instead of the features of the other embodiments.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps or components.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
fig. 1 is a schematic view of an object detection apparatus according to embodiment 1 of the present invention;
fig. 2 shows a detection result of the object detection apparatus 10 according to embodiment 1 of the present invention for an input image;
fig. 3 is a schematic diagram of the feature extraction unit 100 according to embodiment 1 of the present invention;
fig. 4 is a schematic diagram of the first shuffle unit 103 according to embodiment 1 of the present invention;
fig. 5 is a schematic diagram of the second shuffle unit 104 according to embodiment 1 of the present invention;
fig. 6 is a schematic diagram of a terminal device according to embodiment 2 of the present invention;
fig. 7 is a schematic block diagram of a system configuration of a terminal device of embodiment 2 of the present invention;
fig. 8 is a schematic diagram of an object detection method according to embodiment 3 of the present invention.
Detailed Description
The foregoing and other features of the invention will become apparent from the following description taken in conjunction with the accompanying drawings. In the description and drawings, particular embodiments of the invention have been disclosed in detail as being indicative of some of the embodiments in which the principles of the invention may be employed, it being understood that the invention is not limited to the embodiments described, but, on the contrary, is intended to cover all modifications, variations, and equivalents falling within the scope of the appended claims.
Embodiment 1
The embodiment of the invention provides an object detection device. Fig. 1 is a schematic view of an object detection apparatus according to embodiment 1 of the present invention.
As shown in fig. 1, the object detection device 10 includes:
a feature extraction unit 100 for extracting features in an input image; and
a detection unit 200 for detecting an object in the input image based on the features extracted by the feature extraction unit 100,
wherein the feature extraction unit 100 includes at least one shuffle unit, the shuffle unit including a plurality of convolutional layers, each of the convolutional layers having the same number of input channels and output channels, the plurality of convolutional layers including at least one depthwise separable convolutional layer.
Fig. 2 shows a detection result of the object detection device 10 according to embodiment 1 of the present invention for an input image. As shown in fig. 2, each object in the image can be accurately detected by the object detection device 10.
It can be seen from the above embodiments that, because all convolutional layers in the shuffle unit used for feature extraction have the same number of input and output channels, no feature expansion or compression is needed, which reduces the memory and performance demands on the processor and increases the processing speed. In addition, the shuffle unit has at least one depthwise separable convolutional layer, which greatly reduces the FLOPs and other demands on processor memory and performance. Compared with networks such as Yolo-Darknet53, the processor requirements are greatly reduced while the recognition accuracy is almost the same. A lightweight detection method with high processing speed and high recognition accuracy is therefore provided, which can be applied to terminal devices with limited memory and performance and achieve a good recognition effect.
In this embodiment, the input image may be an image obtained in real time or obtained in advance. For example, the input images may be video images captured by an in-vehicle apparatus, each input image corresponding to one frame of the video.
In the present embodiment, the feature extraction unit 100 is configured to extract features in an input image. The feature extraction unit 100 includes at least one shuffle unit, the shuffle unit including a plurality of convolutional layers each having the same number of input channels and output channels, the plurality of convolutional layers including at least one depthwise separable convolutional layer.
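As an orientation aid, the following sketch shows what a depthwise separable convolutional layer is in general: a per-channel (depthwise) convolution followed by a 1×1 (pointwise) convolution. The patent names no implementation framework; PyTorch is used here purely for illustration, and the channel and feature-map sizes are assumed, not taken from the patent.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Factorizes a standard KxK convolution into a per-channel (depthwise)
    KxK convolution followed by a 1x1 (pointwise) cross-channel mix."""
    def __init__(self, channels: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        padding = kernel_size // 2
        # groups=channels -> one filter per input channel (the depthwise step)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   stride=stride, padding=padding,
                                   groups=channels, bias=False)
        # 1x1 convolution mixes information across channels (the pointwise step)
        self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# The layer keeps the number of input and output channels equal, as in the
# patent's shuffle units, so no feature expansion or compression occurs.
x = torch.randn(1, 116, 28, 28)        # (batch, channels, H, W); sizes illustrative
y = DepthwiseSeparableConv(116)(x)
assert y.shape == x.shape
```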
In this embodiment, the at least one shuffle unit may include at least one first shuffle unit that is a shuffle unit having a step size (stride) of 1 and/or at least one second shuffle unit that is a shuffle unit having a step size of 2.
Hereinafter, the structure of the feature extraction unit 100 of the present embodiment is exemplarily described.
Fig. 3 is a schematic diagram of the feature extraction unit 100 according to embodiment 1 of the present invention. As shown in fig. 3, the feature extraction unit 100 includes:
a first convolutional layer 101 that processes the input image;
a pooling layer 102 that pools the features output by the first convolutional layer 101;
a plurality of first shuffle units 103; and
a plurality of second shuffle units 104.
In this embodiment, the first convolutional layer 101 and the pooling layer 102 may use existing structures.
In the present embodiment, the number and the order of arrangement of the first shuffle units 103 and the second shuffle units 104 may be set according to actual needs. That is, a preset rule may be determined according to actual needs, and the number and the order of the first shuffle units 103 and the second shuffle units 104 may be determined according to the preset rule.
As shown in fig. 3, "type" indicates the type of each layer of the feature extraction unit, "filters" (the channel parameter) indicates the number of channels, "size" indicates the size of the feature map processed by each layer, "stride" indicates the stride of each layer, and "output" indicates the size of the output features.
In this embodiment, the channel parameters, the size, the stride, and the size of the output features of each layer may be determined according to actual needs.
As shown in fig. 3, a number followed by × indicates how many times the corresponding layer is repeated; for example, 7× indicates that the corresponding layer is repeated 7 times, and 3× indicates that the corresponding layer is repeated 3 times.
As shown in fig. 3, the first convolutional layer 101 extracts features from the input image and inputs the extracted features to the pooling layer 102 for pooling; the pooled features are then input to the plurality of first shuffle units 103 and second shuffle units 104, which are arranged in sequence, for shuffle processing, and the extracted features are finally output for detection by the detection unit 200.
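The shuffle processing performed by the shuffle modules is not spelled out in the text; in ShuffleNet-style designs it is conventionally a group-wise channel permutation implemented as a reshape, transpose, and flatten. A minimal sketch under that assumption (PyTorch, two groups matching the two branches of a shuffle unit):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels across `groups` so information mixes between
    the branches of a shuffle unit (reshape -> transpose -> flatten)."""
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # interleave the groups
    return x.view(n, c, h, w)                  # flatten back to (N, C, H, W)

# Channels [0..5] with 2 groups become [0, 3, 1, 4, 2, 5]:
x = torch.arange(6, dtype=torch.float32).view(1, 6, 1, 1)
print(channel_shuffle(x).flatten().tolist())   # [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
```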
Hereinafter, the configurations of the first shuffle unit 103 and the second shuffle unit 104 are described by way of example.
Fig. 4 is a schematic diagram of the first shuffle unit 103 according to embodiment 1 of the present invention. As shown in fig. 4, the first shuffle unit 103 includes:
a first channel splitting module 401 that splits the features input to the first shuffle unit 103 into a first partial feature and a second partial feature;
a second convolutional layer 402, which processes the second partial feature;
a first depthwise separable convolutional layer 403 that processes the second partial feature after it passes through the second convolutional layer 402;
a third convolutional layer 404 that processes the second partial feature after it passes through the first depthwise separable convolutional layer 403;
a first merging module 405 that merges the first partial feature and the second partial feature that has passed through the third convolutional layer 404; and
a first shuffle module 406 that shuffles the combined first partial features and second partial features.
As shown in fig. 4, after passing through the first channel splitting module 401, the input features are split into two parts, a first partial feature and a second partial feature. The first partial feature passes along the left branch to the first merging module 405 without any processing. The second partial feature enters the right branch: it is first input to the 1×1 second convolutional layer 402; the features output by the second convolutional layer 402 are normalized and activated and then input to the 3×3 first depthwise separable convolutional layer 403; the features output by the first depthwise separable convolutional layer 403 are normalized and then input to the 1×1 third convolutional layer 404; and the features output by the third convolutional layer 404 are normalized and activated and then output to the first merging module 405. The first merging module 405 merges the first partial feature and the second partial feature and inputs the merged features to the first shuffle module 406, which shuffles the merged features and outputs the result.
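A minimal sketch of the stride-1 unit traced above, assuming PyTorch, batch normalization for the "normalize" steps, ReLU for the "activate" steps (the text does not name the activation), an even channel split, and the ShuffleNet-style channel shuffle from the earlier sketch:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(n, c, h, w))

class FirstShuffleUnit(nn.Module):
    """Stride-1 unit of fig. 4: split -> (identity | 1x1 -> 3x3 DW -> 1x1) ->
    concat -> channel shuffle. Every convolution keeps the channel count."""
    def __init__(self, channels: int):
        super().__init__()
        c = channels // 2          # channels per branch after the split
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False),                       # layer 402
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),             # normalize + activate
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),  # DW layer 403
            nn.BatchNorm2d(c),                                    # normalize only
            nn.Conv2d(c, c, 1, bias=False),                       # layer 404
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),             # normalize + activate
        )

    def forward(self, x):
        first, second = x.chunk(2, dim=1)     # channel splitting module 401
        out = torch.cat((first, self.branch(second)), dim=1)  # merging module 405
        return channel_shuffle(out)           # shuffle module 406

y = FirstShuffleUnit(116)(torch.randn(1, 116, 28, 28))  # shape is preserved
```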
Fig. 5 is a schematic diagram of the second shuffle unit 104 in embodiment 1 of the present invention. As shown in fig. 5, the input features include a third partial feature and a fourth partial feature, and the second shuffle unit 104 includes:
a second depthwise separable convolutional layer 501 that processes the third partial feature;
a fourth convolutional layer 502 that processes the third partial feature after it passes through the second depthwise separable convolutional layer 501;
a fifth convolutional layer 503 that processes the fourth partial feature;
a third depthwise separable convolutional layer 504 that processes the fourth partial feature after it passes through the fifth convolutional layer 503;
a sixth convolutional layer 505 that processes the fourth partial feature after it passes through the third depthwise separable convolutional layer 504;
a second merging module 506 that merges the third partial feature that has passed through the fourth convolutional layer 502 and the fourth partial feature that has passed through the sixth convolutional layer 505; and
a second shuffle module 507 that shuffles the merged third partial feature and fourth partial feature.
As shown in fig. 5, the input third partial feature enters the left branch: it is first input to the 3×3 second depthwise separable convolutional layer 501; the features output by the second depthwise separable convolutional layer 501 are normalized and then input to the 1×1 fourth convolutional layer 502; and the features output by the fourth convolutional layer 502 are normalized and activated and then output to the second merging module 506. The input fourth partial feature enters the right branch: it is first input to the 1×1 fifth convolutional layer 503; the features output by the fifth convolutional layer 503 are normalized and activated and then input to the 3×3 third depthwise separable convolutional layer 504; the features output by the third depthwise separable convolutional layer 504 are normalized and then input to the 1×1 sixth convolutional layer 505; and the features output by the sixth convolutional layer 505 are normalized and activated and then output to the second merging module 506. The second merging module 506 merges the third partial feature and the fourth partial feature and inputs the merged features to the second shuffle module 507, which shuffles the merged features and outputs the result.
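A minimal sketch of the stride-2 unit traced above, under the same assumptions as before (PyTorch, batch normalization, ReLU). The text does not fix how the third and fourth partial features are obtained; this sketch follows the ShuffleNet V2 convention in which both branches receive the full input, so the concatenation halves the spatial size and doubles the channel count while every convolution still keeps its input and output channel numbers equal:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(n, c, h, w))

class SecondShuffleUnit(nn.Module):
    """Stride-2 unit of fig. 5. Left branch: 3x3 DW (stride 2) -> 1x1.
    Right branch: 1x1 -> 3x3 DW (stride 2) -> 1x1. Concatenation halves
    the spatial size and doubles the channel count."""
    def __init__(self, channels: int):
        super().__init__()
        c = channels
        self.left = nn.Sequential(
            nn.Conv2d(c, c, 3, stride=2, padding=1, groups=c, bias=False),  # DW 501
            nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False),                                 # conv 502
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )
        self.right = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False),                                 # conv 503
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, stride=2, padding=1, groups=c, bias=False),  # DW 504
            nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False),                                 # conv 505
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Both branches receive the full input (assumed, per ShuffleNet V2);
        # the concatenation (merging module 506) doubles the channel count.
        out = torch.cat((self.left(x), self.right(x)), dim=1)
        return channel_shuffle(out)           # shuffle module 507

y = SecondShuffleUnit(116)(torch.randn(1, 116, 28, 28))
assert y.shape == (1, 232, 14, 14)
```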
In the present embodiment, the first, second, third, fourth, fifth, and sixth convolutional layers 101, 402, 404, 502, 503, and 505 may be ordinary convolutional layers, and the first, second, and third depthwise separable convolutional layers 403, 501, and 504 may be existing depthwise separable convolutional layers.
Because the number of input channels and output channels of each convolutional layer is the same, that is, no convolutional layer needs to perform feature expansion or compression, the memory and performance demands on the processor are reduced and the processing speed is increased.
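A back-of-envelope multiply-accumulate count shows where the savings from the depthwise separable layers come from; the channel and feature-map sizes below are illustrative, not the patent's:

```python
# Rough multiply-accumulate counts for one layer with C channels in and out,
# an H x W feature map, and a k x k kernel (illustrative numbers only).
C, H, W, k = 116, 28, 28, 3

standard  = C * C * k * k * H * W   # ordinary k x k convolution
depthwise = C * k * k * H * W       # one k x k filter per channel
pointwise = C * C * H * W           # 1 x 1 cross-channel mix

# The separable pair costs roughly 1/k^2 + 1/C of the standard layer
# (about 0.12 of the cost with these sizes).
print(f"standard : {standard:,}")
print(f"separable: {depthwise + pointwise:,}")
print(f"ratio    : {(depthwise + pointwise) / standard:.3f}")
```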
The structure of the feature extraction unit 100 of the present embodiment is exemplarily described above.
After the feature extraction unit 100 extracts features from an input image, the detection unit 200 detects an object in the input image based on the features extracted by the feature extraction unit 100.
In the present embodiment, the detection unit 200 may use an existing network structure; for example, the detection unit 200 includes a Yolo network. The Yolo network detects the objects in the input image based on the extracted features; for the principle and process by which the Yolo network detects objects, reference may be made to the prior art, which is not repeated here.
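Purely as an illustration of how such a backbone feeds the detection unit, a single-scale Yolo-style prediction layer is a 1×1 convolution that maps the backbone features to per-anchor box, objectness, and class outputs. The anchor count, class count, and feature sizes below are assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

# A single-scale Yolo-style prediction layer (illustrative; the patent simply
# reuses the existing Yolo detection design). For each of A anchors per grid
# cell it predicts 4 box offsets, 1 objectness score, and K class scores.
A, K = 3, 6                       # e.g. 6 classes as in table 1 (person ... truck)
head = nn.Conv2d(232, A * (4 + 1 + K), kernel_size=1)

features = torch.randn(1, 232, 13, 13)      # backbone output (sizes assumed)
pred = head(features)                        # (1, A*(5+K), 13, 13)
pred = pred.view(1, A, 5 + K, 13, 13)        # split out the per-anchor fields
boxes, obj, cls = pred[:, :, :4], pred[:, :, 4], pred[:, :, 5:]
```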
Table 1 shows a parameter comparison between the object detection apparatus according to the embodiment of the present invention and the existing network structure.
                Yolo-Darknet53   Object detection device 10   Object detection device 10'
FLOPs           739.8M           108.7M                       67.7M
CPU (ms/frame)  1378.5           585.3                        310.0
GPU (ms/frame)  37.0             25.7                         20.7
mAP (%)         48.29            46.70                        44.31
APperson (%)    22.67            17.22                        15.64
APbicycle (%)   32.97            26.91                        23.06
APcar (%)       77.58            76.49                        74.41
APbus (%)       63.19            69.52                        67.50
APvan (%)       27.88            22.76                        20.17
APtruck (%)     65.46            67.28                        65.10

TABLE 1
As shown in table 1, the first data column gives the parameters of the existing Yolo-Darknet53, the second column gives the parameters of the object detection device 10 of the present embodiment, and the third column gives the parameters of the object detection device 10' of the present embodiment. The object detection device 10 and the object detection device 10' have the same structure but different parameters; for example, the channel parameters of the object detection device 10' are smaller than those of the object detection device 10. FLOPs denotes the number of floating-point operations, CPU denotes the processing speed on a CPU, GPU denotes the processing speed on a GPU, mAP denotes the mean average precision, and APperson, APbicycle, APcar, APbus, APvan and APtruck denote the recognition accuracy for persons, bicycles, cars, buses, vans and trucks, respectively. It can be seen that, compared with the Yolo-Darknet53 network, the demands of the object detection device 10 and the object detection device 10' of the present embodiment on processor memory and performance are greatly reduced, while the recognition accuracy is substantially equal to that of the Yolo-Darknet53 network.
It can be seen from the above embodiments that, because all convolutional layers in the shuffle unit used for feature extraction have the same number of input and output channels, no feature expansion or compression is needed, which reduces the memory and performance demands on the processor and increases the processing speed. In addition, the shuffle unit has at least one depthwise separable convolutional layer, which greatly reduces the FLOPs and other demands on processor memory and performance. Compared with networks such as Yolo-Darknet53, the processor requirements are greatly reduced while the recognition accuracy is almost the same. A lightweight detection method with high processing speed and high recognition accuracy is therefore provided, which can be applied to terminal devices with limited memory and performance and achieve a good recognition effect.
Embodiment 2
An embodiment of the present invention further provides a terminal device, and fig. 6 is a schematic diagram of the terminal device in embodiment 2 of the present invention. As shown in fig. 6, the terminal device 600 includes an object detection apparatus 601, and the structure and function of the object detection apparatus 601 are the same as those described in embodiment 1, and are not described again here.
Fig. 7 is a schematic block diagram of a system configuration of a terminal device according to embodiment 2 of the present invention. As shown in fig. 7, the terminal device 700 may include a central processing unit 701 and a memory 702, the memory 702 being coupled to the central processing unit 701. This figure is exemplary; other types of structures may be used in addition to or instead of this structure to implement telecommunications or other functions.
As shown in fig. 7, the terminal device 700 may further include: an input unit 703, a display 704, and a power source 705.
In one embodiment, the functionality of the object detection apparatus described in embodiment 1 may be integrated into the central processing unit 701. The central processing unit 701 may be configured to: extract features in an input image by using a feature extraction unit; and detect, by using a detection unit, an object in the input image based on the features extracted by the feature extraction unit, wherein the feature extraction unit includes at least one shuffle unit, the shuffle unit including a plurality of convolutional layers each having the same number of input channels and output channels, the plurality of convolutional layers including at least one depthwise separable convolutional layer.
For example, the at least one shuffle unit includes at least one first shuffle unit that is a shuffle unit having a step size of 1 and/or at least one second shuffle unit that is a shuffle unit having a step size of 2.
In another embodiment, the object detection apparatus described in embodiment 1 may be configured separately from the central processing unit 701; for example, the object detection apparatus may be configured as a chip connected to the central processing unit 701, with the functions of the object detection apparatus implemented under the control of the central processing unit 701.
It is not necessary that the terminal device 700 in this embodiment include all of the components shown in fig. 7.
As shown in fig. 7, the central processing unit 701, sometimes referred to as a controller or operation control unit, may include a microprocessor or other processor device and/or logic device; the central processing unit 701 receives input and controls the operation of the components of the terminal device 700.
The memory 702 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or another suitable device. The central processing unit 701 may execute programs stored in the memory 702 to implement information storage, processing, and the like. The functions of the other components are similar to those in the prior art and are not described in detail here. The components of the terminal device 700 may be implemented by dedicated hardware, firmware, software, or a combination thereof without departing from the scope of the invention.
It can be seen from the above embodiments that, because all convolutional layers in the shuffle unit used for feature extraction have the same number of input and output channels, no feature expansion or compression is needed, which reduces the memory and performance demands on the processor and increases the processing speed. In addition, the shuffle unit has at least one depthwise separable convolutional layer, which greatly reduces the FLOPs and other demands on processor memory and performance. Compared with networks such as Yolo-Darknet53, the processor requirements are greatly reduced while the recognition accuracy is almost the same. A lightweight detection method with high processing speed and high recognition accuracy is therefore provided, which can be applied to terminal devices with limited memory and performance and achieve a good recognition effect.
Embodiment 3
The embodiment of the invention also provides an object detection method, which corresponds to the object detection apparatus of embodiment 1.
Fig. 8 is a schematic diagram of an object detection method according to embodiment 3 of the present invention. As shown in fig. 8, the method includes:
step 801: extracting features in the input image by using a feature extraction unit; and
step 802: detecting, by using a detection unit, the object in the input image based on the features extracted by the feature extraction unit.
Wherein the feature extraction unit comprises at least one shuffle unit, the shuffle unit comprising a plurality of convolutional layers, the number of input channels and output channels of each of the plurality of convolutional layers being the same, the plurality of convolutional layers comprising at least one depthwise separable convolutional layer.
In this embodiment, the specific implementation method of the above steps is the same as that described in embodiment 1, and is not repeated here.
It can be seen from the above embodiments that, because all convolutional layers in the shuffle unit used for feature extraction have the same number of input and output channels, no feature expansion or compression is needed, which reduces the memory and performance demands on the processor and increases the processing speed. In addition, the shuffle unit has at least one depthwise separable convolutional layer, which greatly reduces the FLOPs and other demands on processor memory and performance. Compared with networks such as Yolo-Darknet53, the processor requirements are greatly reduced while the recognition accuracy is almost the same. A lightweight detection method with high processing speed and high recognition accuracy is therefore provided, which can be applied to terminal devices with limited memory and performance and achieve a good recognition effect.
An embodiment of the present invention also provides a computer-readable program which, when executed in an object detection apparatus or a terminal device, causes a computer to execute, in the object detection apparatus or the terminal device, the object detection method described in embodiment 3.
An embodiment of the present invention further provides a storage medium storing a computer-readable program, where the computer-readable program causes a computer to execute, in an object detection apparatus or a terminal device, the object detection method described in embodiment 3.
The object detection method executed in the object detection apparatus or the terminal device described in connection with the embodiments of the present invention may be directly embodied as hardware, a software module executed by a processor, or a combination of the two. For example, one or more of the functional blocks and/or one or more combinations of the functional blocks illustrated in fig. 1 may correspond to software modules of a computer program flow or to hardware modules. The software modules may correspond to the steps shown in fig. 8, and the hardware modules may be implemented, for example, by realizing the software modules in a field-programmable gate array (FPGA).
A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium; or the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The software module may be stored in the memory of the mobile terminal or in a memory card that is insertable into the mobile terminal. For example, if the terminal device employs a relatively large capacity MEGA-SIM card or a large capacity flash memory device, the software module may be stored in the MEGA-SIM card or the large capacity flash memory device.
One or more of the functional blocks and/or one or more combinations of the functional blocks described with respect to fig. 1 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any suitable combination thereof designed to perform the functions described herein. One or more of the functional blocks and/or one or more combinations of the functional blocks described with respect to fig. 1 may also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in communication with a DSP, or any other such configuration.
While the invention has been described with reference to specific embodiments, it will be apparent to those skilled in the art that these descriptions are illustrative and not intended to limit the scope of the invention. Various modifications and alterations of this invention will become apparent to those skilled in the art based upon the spirit and principles of this invention, and such modifications and alterations are also within the scope of this invention.

Claims (10)

1. An object detection apparatus, the apparatus comprising:
a feature extraction unit for extracting features in an input image; and
a detection unit for detecting an object in the input image based on the features extracted by the feature extraction unit,
wherein the feature extraction unit comprises at least one shuffle unit, the shuffle unit comprising a plurality of convolutional layers, each convolutional layer of the plurality of convolutional layers having the same number of input channels and output channels, the plurality of convolutional layers comprising at least one depthwise separable convolutional layer.
2. The apparatus of claim 1, wherein,
the at least one shuffle unit includes at least one first shuffle unit that is a shuffle unit having a step size of 1 and/or at least one second shuffle unit that is a shuffle unit having a step size of 2.
3. The apparatus of claim 2, wherein,
the feature extraction unit further includes:
a first convolution layer that processes the input image; and
a pooling layer that pools the features output from the first convolutional layer and inputs the pooled features to the first shuffle unit or the second shuffle unit.
4. The apparatus of claim 2, wherein,
the first shuffle unit includes:
a first channel splitting module that splits a feature input to the first shuffle unit into a first partial feature and a second partial feature;
a second convolutional layer that processes the second partial feature;
a first depthwise separable convolutional layer that processes the second partial feature after it passes through the second convolutional layer;
a third convolutional layer that processes the second partial feature after it passes through the first depthwise separable convolutional layer;
a first merging module that merges the first partial feature and the second partial feature that has passed through the third convolutional layer; and
a first shuffle module that shuffles the merged first partial feature and second partial feature.
5. The apparatus of claim 2, wherein,
the features input to the second shuffle unit include a third partial feature and a fourth partial feature,
the second shuffle unit includes:
a second depthwise separable convolutional layer that processes the third partial feature;
a fourth convolutional layer that processes the third partial feature after it passes through the second depthwise separable convolutional layer;
a fifth convolutional layer that processes the fourth partial feature;
a third depthwise separable convolutional layer that processes the fourth partial feature after it passes through the fifth convolutional layer;
a sixth convolutional layer that processes the fourth partial feature after it passes through the third depthwise separable convolutional layer;
a second merging module that merges the third partial feature that has passed through the fourth convolutional layer and the fourth partial feature that has passed through the sixth convolutional layer; and
a second shuffle module that shuffles the merged third partial feature and fourth partial feature.
6. The apparatus of claim 2, wherein,
the at least one shuffle unit includes a plurality of first shuffle units and a plurality of second shuffle units,
the plurality of first shuffle units and the plurality of second shuffle units are ordered according to a preset rule.
7. The apparatus of claim 1, wherein,
the detection unit comprises a Yolo network.
8. A terminal device comprising the apparatus of any one of claims 1-7.
9. A method of object detection, the method comprising:
extracting features in the input image by using a feature extraction unit; and
detecting an object in the input image by a detection unit based on the features extracted by the feature extraction unit,
wherein the feature extraction unit comprises at least one shuffle unit, the shuffle unit comprising a plurality of convolutional layers, each convolutional layer of the plurality of convolutional layers having the same number of input channels and output channels, the plurality of convolutional layers comprising at least one depthwise separable convolutional layer.
10. The method of claim 9, wherein,
the at least one shuffle unit includes at least one first shuffle unit that is a shuffle unit having a step size of 1 and/or at least one second shuffle unit that is a shuffle unit having a step size of 2.
CN201910542145.7A 2019-06-21 2019-06-21 Object detection device and method and terminal equipment Pending CN112116032A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910542145.7A CN112116032A (en) 2019-06-21 2019-06-21 Object detection device and method and terminal equipment
JP2020092988A JP7428075B2 (en) 2019-06-21 2020-05-28 Object detection device, object detection method and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910542145.7A CN112116032A (en) 2019-06-21 2019-06-21 Object detection device and method and terminal equipment

Publications (1)

Publication Number Publication Date
CN112116032A (en) 2020-12-22

Family

ID=73796365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910542145.7A Pending CN112116032A (en) 2019-06-21 2019-06-21 Object detection device and method and terminal equipment

Country Status (2)

Country Link
JP (1) JP7428075B2 (en)
CN (1) CN112116032A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177560A (en) * 2021-04-27 2021-07-27 西安电子科技大学 Universal lightweight deep learning vehicle detection method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288075A (en) * 2018-02-02 2018-07-17 沈阳工业大学 A kind of lightweight small target detecting method improving SSD
CN109271946A (en) * 2018-09-28 2019-01-25 清华大学深圳研究生院 A method of attention object real-time detection is realized in mobile phone terminal
CN109299722A (en) * 2018-08-16 2019-02-01 北京旷视科技有限公司 Characteristic pattern processing method, device and system and storage medium for neural network
CN109670517A (en) * 2018-12-24 2019-04-23 北京旷视科技有限公司 Object detection method, device, electronic equipment and target detection model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111448581B (en) 2017-10-24 2023-12-05 L'Oréal System and method for image processing using deep neural networks


Also Published As

Publication number Publication date
JP2021002333A (en) 2021-01-07
JP7428075B2 (en) 2024-02-06

Similar Documents

Publication Publication Date Title
EP3564854B1 (en) Facial expression recognition method, apparatus, electronic device, and storage medium
CN110414507B (en) License plate recognition method and device, computer equipment and storage medium
CN108388879B (en) Target detection method, device and storage medium
CN108021933B (en) Neural network recognition device and recognition method
CN110781784A (en) Face recognition method, device and equipment based on double-path attention mechanism
CN110378301B (en) Pedestrian re-identification method and system
CN110084238B (en) Finger vein image segmentation method and device based on LadderNet network and storage medium
EP3564855A1 (en) Method and device for license plate positioning
CN111444917A (en) License plate character recognition method and device, electronic equipment and storage medium
CN109726678B (en) License plate recognition method and related device
CN103761515B (en) A kind of face feature extraction method and device based on LBP
EP1933270A1 (en) Image search method and device
WO2018058530A1 (en) Target detection method and device, and image processing apparatus
EP3213257B1 (en) Image processing system
CN111079613B (en) Gesture recognition method and device, electronic equipment and storage medium
US11605210B2 (en) Method for optical character recognition in document subject to shadows, and device employing method
US20170185864A1 (en) Image processing system
CN114022748B (en) Target identification method, device, equipment and storage medium
CN112116032A (en) Object detection device and method and terminal equipment
CN111709377B (en) Feature extraction method, target re-identification method and device and electronic equipment
CN110659631A (en) License plate recognition method and terminal equipment
CN111626313B (en) Feature extraction model training method, image processing method and device
CN109871779B (en) Palm print identification method and electronic equipment
CN111222367B (en) Fingerprint identification method and device, storage medium and terminal
CN110969640A (en) Video image segmentation method, terminal device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination