CN112464930A - Target detection network construction method, target detection method, device and storage medium

Publication number: CN112464930A
Application number: CN201910857984.8A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 徐航, 黎嘉伟, 李震国, 张维, 梁小丹
Assignee: Huawei Technologies Co., Ltd.
Legal status: Pending

Classifications

    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI] (image preprocessing for image or video recognition)
    • G06F 18/253: Fusion techniques of extracted features (pattern recognition)
    • G06N 3/045: Combinations of networks (neural network architectures)
    • G06N 3/084: Backpropagation, e.g. using gradient descent (neural network learning methods)

Abstract

The application provides a construction method of a target detection network, a target detection method, a target detection apparatus, and a computer-readable storage medium. The application relates to the field of artificial intelligence, and in particular to the field of computer vision. The construction method includes the following steps: determining a search space of the target detection network; determining an initial network architecture of the target detection network according to the search space; and iteratively updating the initial network architecture according to the search space until a target detection network meeting preset requirements is obtained. The selectable connection relationships of the feature fusion layer in the target detection network include, for any two adjacent layers of the feature fusion layer, a connection between any node of one layer and any node of the other layer. The method and the apparatus can reduce the complexity of the target detection network.

Description

Target detection network construction method, target detection method, device and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and more particularly, to a method of constructing an object detection network, an object detection method, an apparatus, and a storage medium.
Background
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, AI basic theory, and the like.
With the rapid development of artificial intelligence technology, neural networks (for example, deep neural networks) have in recent years achieved great success in processing and analyzing various media signals such as images, videos, and speech. A well-performing neural network tends to have an elaborate network structure, which requires highly skilled and experienced human experts to expend a great deal of effort to construct. To construct neural networks better, it has been proposed to construct them by neural architecture search (NAS) methods, automatically searching for a neural network structure and thereby obtaining one with excellent performance.
Target detection, one of the basic tasks in the field of computer vision, generally locates a target object in an image and assigns a corresponding label to the target object. A current mainstream target detection system mainly comprises a backbone network, a feature fusion layer, a region proposal network (RPN), and a region-based convolutional neural network (RCNN) head.
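For orientation, the following minimal sketch (in Python/PyTorch) shows how these four components are typically composed in a two-stage detector. All class and variable names are illustrative assumptions, not taken from the patent.

```python
# Minimal structural sketch of a two-stage detector; names are illustrative.
import torch.nn as nn

class TwoStageDetector(nn.Module):
    def __init__(self, backbone, fusion_layer, rpn, rcnn_head):
        super().__init__()
        self.backbone = backbone    # extracts multi-scale feature maps from the image
        self.fusion = fusion_layer  # fuses features across scales
        self.rpn = rpn              # proposes candidate object regions
        self.rcnn_head = rcnn_head  # classifies and refines each proposal

    def forward(self, images):
        features = self.backbone(images)         # e.g. a list of feature maps
        fused = self.fusion(features)
        proposals = self.rpn(fused)
        return self.rcnn_head(fused, proposals)  # boxes and class labels
```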
In the traditional scheme, experts manually design a target detection network according to certain strategies, which consumes a large amount of labor and time, and the performance of the resulting target detection network is often mediocre.
Disclosure of Invention
The application provides a target detection network construction method, a target detection method, an apparatus, and a storage medium, so as to construct a target detection network with lower complexity.
In a first aspect, a method for constructing a target detection network is provided, where the target detection network includes a backbone network, a feature fusion layer, a region proposal network (RPN), and a region-based convolutional neural network (RCNN), and the method includes: determining a search space of the target detection network, where the search space of the target detection network includes a search space of the feature fusion layer; determining an initial network architecture of the target detection network according to the search space of the target detection network; and iteratively updating the initial network architecture of the target detection network according to the search space of the target detection network until a target detection network meeting preset requirements is obtained.
The search space of the target detection network comprises a search space of a feature fusion layer.
The search space of the feature fusion layer includes the selectable connection relationships of the feature fusion layer, which specifically include a connection between any node of one layer and any node of the other layer in any two adjacent layers of the multi-layer neural network.
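As a concrete illustration of this connection rule, the sketch below (a hypothetical helper, not from the patent) enumerates every candidate edge when any node of one layer may connect to any node of the adjacent layer:

```python
from itertools import product

def fusion_candidate_edges(nodes_per_layer):
    """List every selectable connection: between each pair of adjacent
    layers, any node of one layer may link to any node of the other."""
    edges = []
    for layer in range(len(nodes_per_layer) - 1):
        for i, j in product(range(nodes_per_layer[layer]),
                            range(nodes_per_layer[layer + 1])):
            edges.append(((layer, i), (layer + 1, j)))
    return edges

# Three layers of 4 nodes each: 2 adjacent pairs x 4 x 4 = 32 candidate edges.
print(len(fusion_candidate_edges([4, 4, 4])))  # 32
```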
In addition, the feature fusion layer in the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer. That is, when determining the initial network architecture of the target detection network according to the search space of the target detection network, the network architecture of the feature fusion layer in the initial network architecture of the target detection network may be specifically determined according to the search space of the feature fusion layer.
It should be understood that, when the initial network architecture of the target detection network is determined in the search space of the target detection network, the number of network layers of the target detection network and the number of nodes included in the target detection network may be predetermined, and specifically, the number of network layers of the target detection network and the number of nodes included in the target detection network may be determined according to an application requirement or a requirement of the target detection network on the target detection performance.
For example, when the requirement on the target detection performance of the target detection network is high, the number of network layers of the target detection network and the number of nodes included in the target detection network may be relatively large, and when the requirement on the target detection speed/complexity of the target detection network is high, the number of network layers of the target detection network and the number of nodes included in the target detection network may be relatively small.
In this application, because the search space of the feature fusion layer includes more selectable connection relationships, the network architecture of the feature fusion layer in the initial network architecture of the target detection network can be determined more reasonably according to these connection relationships, and the network architecture of the feature fusion layer can be updated accordingly, which can reduce the complexity of the finally obtained target detection network.
Specifically, because the search space of the feature fusion layer allows freer selectable connection relationships, a more streamlined network structure of the feature fusion layer can be obtained, compared with manually setting the network architecture, when the feature fusion layer in the initial network architecture of the target detection network is determined according to this search space and then updated. This can ultimately reduce the complexity of the target detection network and the storage space it occupies when deployed.
In addition, because the search space of the feature fusion layer contains more selectable connection relationships, determining the feature fusion layer in the initial network architecture of the target detection network according to this search space and updating its network architecture can ultimately yield a target detection network with better performance.
The method for constructing the object detection network according to the first aspect may be an automatic method for constructing a neural network, and the method for constructing the object detection network according to the first aspect may be automatically performed by an object detection network constructing apparatus.
Optionally, in the target detection network, a network architecture of the backbone network and a network structure of the RPN are predetermined.
That is to say, in the process of updating the initial network architecture of the target detection network, only the feature fusion layer and the RCNN in the initial network architecture of the target detection network may be updated.
In addition, the network architecture of the backbone network and the network architecture of the RPN may also be network architectures that are not determined in advance, so that the network architectures of the backbone network, the feature fusion layer, the RPN, and the RCNN of the target detection network may be updated in the process of updating the initial network architecture of the target detection network.
With reference to the first aspect, in some implementations of the first aspect, iteratively updating the initial network architecture of the target detection network according to the search space of the target detection network until a target detection network meeting preset requirements is obtained includes: iteratively updating the initial network architecture of the target detection network according to the search space of the target detection network so as to reduce the value of a loss function corresponding to the target detection network, thereby obtaining a target detection network meeting the preset requirements.
Wherein the loss function includes a target detection error of the target detection network and/or a complexity of the target detection network.
It should be understood that, in the iterative update process, the initial network architecture of the target detection network may be iteratively updated according to the value of the loss function corresponding to the target detection network, so that the value of the loss function corresponding to the target detection network is as small as possible until the target detection network meeting the preset requirement is obtained.
Specifically, when an initial network architecture of the target detection network is updated iteratively, a network structure (a connection relationship between different nodes in the network) of the target detection network may be adjusted, a value of a loss function corresponding to the target detection network is calculated after each adjustment, and then the network structure of the target detection network is updated according to the value of the loss function corresponding to the target detection network, so that iteration is continued until a target detection network meeting preset requirements is obtained.
In the iterative updating process, the value of the loss function corresponding to the target detection network can be calculated after each update of the network architecture. If the value of the loss function meets the requirement, updating stops, and the network architecture obtained at this point is the target detection network meeting the preset requirements; if not, the network parameters of the target detection network can continue to be updated according to the value of the loss function, until a target detection network meeting the preset requirements is obtained.
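The patent does not fix a particular search algorithm; the following sketch shows one simple realization of the loop described above, where the loss combines the detection error with a complexity penalty. All function names, the mutation strategy, and the weight `lam` are assumptions for illustration.

```python
def search(search_space, detection_error, complexity,
           max_updates=100, lam=0.1, target=None):
    """Hypothetical sketch: iteratively update the architecture, keeping
    changes that lower loss = detection_error + lam * complexity."""
    arch = search_space.initial_architecture()          # assumed API
    best = detection_error(arch) + lam * complexity(arch)
    for _ in range(max_updates):
        candidate = search_space.mutate(arch)           # adjust connections/ops
        loss = detection_error(candidate) + lam * complexity(candidate)
        if loss < best:                                 # keep the improvement
            arch, best = candidate, loss
        if target is not None and best <= target:       # preset requirement met
            break
    return arch
```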
With reference to the first aspect, in certain implementations of the first aspect, the search space of the feature fusion layer further includes the selectable operation types of the feature fusion layer, where the selectable operation types include the convolution operations corresponding to a connection between any node of one layer and any node of the other layer in two adjacent layers of the multi-layer neural network.
The convolution operations include a dilated (atrous) convolution operation.
In this application, when the selectable operation types in the feature fusion layer include a dilated convolution operation, substantially the same target detection performance can be achieved with fewer convolution parameters.
In addition, when the selectable operation types in the feature fusion layer include a dilated convolution operation, a target detection network with better target detection performance can be obtained with substantially the same number of convolution parameters.
Specifically, for the same number of parameters, dilated convolution yields a larger receptive field than conventional convolution, so the finally obtained target detection network has better target detection performance.
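The receptive-field claim can be checked directly. In the PyTorch sketch below (an illustration, not the patent's code), a 3x3 convolution with dilation 2 covers a 5x5 region while using exactly as many parameters as a standard 3x3 convolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)
standard = nn.Conv2d(16, 16, kernel_size=3, padding=1)             # 3x3 receptive field
dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)  # 5x5 receptive field

same_params = (sum(p.numel() for p in standard.parameters())
               == sum(p.numel() for p in dilated.parameters()))
print(same_params)                            # True: identical parameter count
print(standard(x).shape == dilated(x).shape)  # True: same output resolution
```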
With reference to the first aspect, in certain implementations of the first aspect, the RCNN includes a plurality of basic units, each composed of at least two nodes; the search space of the target detection network further includes a search space of the RCNN; the search space of the RCNN includes the search space of each of the plurality of basic units; the search space of each basic unit includes the selectable connection relationships of that basic unit; and the selectable connection relationships of each basic unit include a connection between any two nodes within that basic unit. The RCNN in the initial network architecture of the target detection network is determined according to the search space of the RCNN.
It should be understood that, within each basic unit, nodes are connected in the direction from input to output.
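Concretely, if the nodes of a basic unit are numbered from input to output, the selectable connections are all forward edges, as in this hypothetical sketch:

```python
from itertools import combinations

def basic_unit_candidate_edges(num_nodes):
    """All selectable connections inside one basic unit: any two nodes may
    be connected, oriented from input toward output (i < j)."""
    return list(combinations(range(num_nodes), 2))

# A basic unit with 4 nodes has 6 candidate internal connections.
print(basic_unit_candidate_edges(4))
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```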
In this application, any two nodes in each basic unit of the RCNN can be connected in the search space of the RCNN, so the initial network structure of the RCNN can be determined more reasonably according to this more relaxed search space, and the initial network structure of the RCNN can be updated, which can reduce the complexity of the target detection network.
Further, in this application, when the feature fusion layer in the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer and its architecture is updated, and the RCNN in the initial network architecture is determined according to the search space of the RCNN and its architecture is updated, the network structures of the feature fusion layer and the RCNN can be updated and optimized simultaneously, so that the finally optimized RCNN structure is better matched with the feature fusion layer structure, yielding a target detection network with better target detection performance.
With reference to the first aspect, in certain implementations of the first aspect, the search space of each basic unit further includes the selectable operation types of that basic unit, and the selectable operation types of each basic unit include the convolution operations corresponding to a connection between any two nodes in that basic unit, where the convolution operations include a dilated convolution operation.
In this application, when the selectable operation types in the RCNN include a dilated convolution operation, substantially the same target detection performance can be achieved with fewer convolution parameters.
In addition, when the selectable operation types of the RCNN include a dilated convolution operation, a target detection network with better target detection performance can be obtained with substantially the same number of convolution parameters.
Specifically, for the same number of parameters, dilated convolution yields a larger receptive field than conventional convolution, so the finally obtained target detection network has better target detection performance.
With reference to the first aspect, in certain implementations of the first aspect, the convolution operations corresponding to a connection between any two nodes in each basic unit include a dilated convolution operation with a dilation rate of 2.
Because the resolution of the feature maps processed in the RCNN is generally low, dilated convolutions with small dilation rates are more suitable there. Therefore, for each basic unit in the RCNN, when the selectable operations include dilated convolution with a dilation rate of 2, the performance of the finally obtained RCNN, and hence of the finally obtained target detection network, can be improved.
With reference to the first aspect, in certain implementations of the first aspect, at least two basic units of the plurality of basic units are respectively composed of different numbers of nodes.
In this application, when at least two of the basic units in the RCNN can be composed of different numbers of nodes, the composition of the basic units in the RCNN is more flexible, which enlarges the set of possible RCNN network structures when the initial network structure of the RCNN is determined and updated. This makes it easier to search for a better RCNN structure and makes it more likely that a target detection network with better target detection performance is finally obtained.
With reference to the first aspect, in certain implementations of the first aspect, a resolution of the input feature map of each basic unit is the same as a resolution of the output feature map of each basic unit.
When the resolution of the input feature map of each basic unit in the RCNN is the same as the resolution of its output feature map, the basic units do not change the resolution of the feature map during processing, which helps to retain the information of the feature map. It also facilitates skip connections; otherwise, each skip connection would additionally need to align the size of its output with the feature map to be input, which is inefficient.
With reference to the first aspect, in certain implementations of the first aspect, the target detection network meeting the preset requirements satisfies at least one of the following conditions: the detection performance of the target detection network meets a preset performance requirement; the number of updates to the network architecture of the target detection network is greater than or equal to a preset number; the complexity of the target detection network is less than or equal to a preset complexity.
With reference to the first aspect, in certain implementations of the first aspect, the complexity of the target detection network is determined according to at least one of: the number or size of the model parameters of the target detection network, the memory access cost (MAC) of the target detection network, and the number of floating-point operations (FLOPs) of the target detection network.
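Putting the two preceding paragraphs together, a stopping test might look like the following sketch. The scalarization weights and thresholds are assumptions; the patent allows any one of the listed measures or a combination.

```python
def complexity(num_params, mac, flops, w=(1.0, 0.0, 0.0)):
    """Scalar complexity from parameter count, memory access cost (MAC),
    and floating-point operations (FLOPs); weights are illustrative."""
    return w[0] * num_params + w[1] * mac + w[2] * flops

def meets_preset_requirement(performance, num_updates, model_complexity,
                             perf_target, max_updates, max_complexity):
    """Any one of the three conditions listed above suffices."""
    return (performance >= perf_target
            or num_updates >= max_updates
            or model_complexity <= max_complexity)
```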
In a second aspect, a method for constructing a target detection network is provided, where the target detection network includes a backbone network, a feature fusion layer, a region proposal network (RPN), and a region-based convolutional neural network (RCNN), and the method includes: determining a search space of the target detection network, where the RCNN includes a plurality of basic units, each composed of at least two nodes, the search space of the target detection network includes a search space of the RCNN, the search space of the RCNN includes the search space of each of the plurality of basic units, the search space of each basic unit includes the selectable connection relationships of that basic unit, and the selectable connection relationships of each basic unit include a connection between any two nodes within that basic unit; determining an initial network architecture of the target detection network according to the search space of the target detection network, where the RCNN in the initial network architecture is determined according to the search space of the RCNN; and iteratively updating the initial network architecture of the target detection network according to the search space of the target detection network until a target detection network meeting preset requirements is obtained.
It should be understood that the search space of the RCNN may be composed of the search space of each base unit in the RCNN.
In this method, any two nodes in each basic unit of the RCNN can be connected in the search space of the RCNN. Compared with manually setting a network architecture, the initial network structure of the RCNN can be determined more reasonably according to this more relaxed search space and then updated, which can reduce the complexity of the target detection network.
In addition, because the search space of the RCNN contains more selectable connection relationships, determining the RCNN in the initial network architecture of the target detection network according to the search space of the RCNN and updating its network architecture can ultimately yield a target detection network with better performance.
The method for constructing the object detection network according to the second aspect may be an automatic method for constructing a neural network, and the method for constructing the object detection network according to the second aspect may be automatically performed by an object detection network constructing apparatus.
Optionally, the network architecture of the backbone network and the network structure of the RPN are predetermined.
If the network architecture of the backbone network and the network architecture of the RPN are predetermined, only the feature fusion layer and the RCNN in the initial network architecture of the target detection network may be updated when the initial network architecture of the target detection network is updated.
With reference to the second aspect, in some implementations of the second aspect, iteratively updating the initial network architecture of the target detection network according to the search space of the target detection network until a target detection network meeting preset requirements is obtained includes: iteratively updating the initial network architecture of the target detection network according to the search space of the target detection network so as to reduce the value of a loss function corresponding to the target detection network, thereby obtaining a target detection network meeting the preset requirements.
Wherein the loss function includes a target detection error of the target detection network and/or a complexity of the target detection network.
In the iterative updating process, the initial network architecture of the target detection network can be iteratively updated according to the value of the loss function corresponding to the target detection network, so that the value of the loss function corresponding to the target detection network is as small as possible until the target detection network meeting the preset requirement is obtained.
Specifically, in the iterative updating process, the value of the loss function corresponding to the target detection network may be calculated after each update of the network architecture. If the value of the loss function meets the requirement, updating stops, and the network architecture obtained at this point is the target detection network meeting the preset requirements; if not, the network parameters of the target detection network may continue to be updated according to the value of the loss function, until a target detection network meeting the preset requirements is obtained.
With reference to the second aspect, in some implementations of the second aspect, the search space of each basic unit further includes the selectable operation types of that basic unit, and the selectable operation types of each basic unit include the convolution operations corresponding to a connection between any two nodes in that basic unit, where the convolution operations include a dilated convolution operation.
When the selectable operation types in the RCNN include a dilated convolution operation, substantially the same target detection performance can be achieved with fewer convolution parameters.
When the selectable operation types of the RCNN include a dilated convolution operation, a target detection network with better target detection performance can be obtained with substantially the same number of convolution parameters.
Specifically, for the same number of parameters, dilated convolution yields a larger receptive field than conventional convolution, so the finally obtained target detection network has better target detection performance.
With reference to the second aspect, in some implementations of the second aspect, the convolution operations corresponding to a connection between any two nodes in each basic unit include a dilated convolution operation with a dilation rate of 2.
Because the resolution of the feature maps processed in the RCNN is generally low, dilated convolutions with small dilation rates are more suitable there. Therefore, for each basic unit in the RCNN, when the selectable operations include dilated convolution with a dilation rate of 2, the performance of the finally obtained RCNN, and hence of the finally obtained target detection network, can be improved.
With reference to the second aspect, in certain implementations of the second aspect, at least two of the plurality of basic units are respectively composed of different numbers of nodes.
When at least two of the basic units in the RCNN can be composed of different numbers of nodes, the composition of the basic units in the RCNN is more flexible, which enlarges the set of possible RCNN network structures when the initial network structure of the RCNN is determined and updated. This makes it easier to search for a better RCNN structure and makes it more likely that a target detection network with better target detection performance is finally obtained.
With reference to the second aspect, in certain implementations of the second aspect, the resolution of the input feature map of each basic unit is the same as the resolution of the output feature map of that basic unit.
When the resolution of the input feature map of each basic unit in the RCNN is the same as the resolution of its output feature map, the basic units do not change the resolution of the feature map during processing, which helps to retain the information of the feature map.
With reference to the second aspect, in some implementations of the second aspect, the search space of the target detection network further includes a search space of the feature fusion layer; the search space of the feature fusion layer includes the selectable connection relationships of the feature fusion layer, which include a connection between any node of one layer and any node of the other layer in two adjacent layers of the multi-layer neural network; and the feature fusion layer in the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer.
In this application, when the feature fusion layer in the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer and its architecture is updated, and the RCNN in the initial network architecture is determined according to the search space of the RCNN and its architecture is updated, the network structures of the feature fusion layer and the RCNN can be updated and optimized simultaneously, so that the finally optimized RCNN structure is better matched with the feature fusion layer structure, yielding a target detection network with better target detection performance.
With reference to the second aspect, in some implementations of the second aspect, the search space of the feature fusion layer further includes the selectable operation types of the feature fusion layer, and the selectable operation types include the convolution operations corresponding to a connection between any node of one layer and any node of the other layer in two adjacent layers of the multi-layer neural network, where these convolution operations include a dilated convolution operation.
When the selectable operation types in the feature fusion layer include a dilated convolution operation, substantially the same target detection performance can be achieved with fewer convolution parameters.
In addition, when the selectable operation types in the feature fusion layer include a dilated convolution operation, a target detection network with better target detection performance can be obtained with substantially the same number of convolution parameters.
Specifically, for the same number of parameters, dilated convolution yields a larger receptive field than conventional convolution, so the finally obtained target detection network has better target detection performance.
With reference to the second aspect, in certain implementations of the second aspect, the target detection network meeting the preset requirements satisfies at least one of the following conditions: the detection performance of the target detection network meets a preset performance requirement; the number of updates to the network architecture of the target detection network is greater than or equal to a preset number; the complexity of the target detection network is less than or equal to a preset complexity.
With reference to the second aspect, in some implementations of the second aspect, the complexity of the target detection network is determined according to at least one of: the number or size of the model parameters of the target detection network, the memory access cost (MAC) of the target detection network, and the number of floating-point operations (FLOPs) of the target detection network.
In a third aspect, a target detection method is provided, including: acquiring an image; and processing the image with a target detection network to obtain a target detection result of the image, where the target detection result includes the position of a detected target in the image and a classification result of the detected target. The target detection network includes a backbone network, a feature fusion layer, a region proposal network (RPN), and a region-based convolutional neural network (RCNN); the target detection network meets preset requirements; the target detection network is obtained by iteratively updating an initial network architecture of the target detection network according to a search space of the target detection network; and the initial network architecture is determined according to the search space of the target detection network. The search space of the target detection network includes a search space of the feature fusion layer; the feature fusion layer in the initial network architecture is determined according to the search space of the feature fusion layer; the search space of the feature fusion layer includes the selectable connection relationships of the feature fusion layer; and the selectable connection relationships include a connection between any node of one layer and any node of the other layer in two adjacent layers of the multi-layer neural network.
Because the search space of the feature fusion layer used in constructing the target detection network contains more selectable connection relationships, the feature fusion layer determined according to this search space can perform feature fusion better, and the final target detection network performs better at target detection.
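Once a network meeting the preset requirements has been found, applying it is ordinary inference. The following usage sketch is an assumption about the surrounding tooling; the file name, tensor shape, and return convention are all hypothetical.

```python
import torch

detector = torch.load("searched_detector.pt")  # a trained, searched network (hypothetical file)
detector.eval()
image = torch.randn(1, 3, 800, 800)            # stand-in for an acquired image
with torch.no_grad():
    boxes, labels = detector(image)            # positions and classification results
```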
With reference to the third aspect, in certain implementations of the third aspect, the search space of the feature fusion layer further includes the selectable operation types of the feature fusion layer, and the selectable operation types include the convolution operations corresponding to a connection between any node of one layer and any node of the other layer in two adjacent layers of the neural network, where the convolution operations include a dilated convolution operation.
When the selectable operation types in the feature fusion layer include a dilated convolution operation, substantially the same target detection performance can be achieved with fewer convolution parameters. In addition, for the same number of convolution parameters, dilated convolution yields a larger receptive field than conventional convolution, so the finally obtained target detection network has better target detection performance.
With reference to the third aspect, in some implementations of the third aspect, the RCNN includes a plurality of basic units, each composed of at least two nodes; the search space of the target detection network further includes a search space of the RCNN; the search space of the RCNN includes the search space of each of the plurality of basic units; the search space of each basic unit includes the selectable connection relationships of that basic unit; and the selectable connection relationships of each basic unit include a connection between any two nodes within that basic unit. The RCNN in the initial network architecture of the target detection network is determined according to the search space of the RCNN.
With reference to the third aspect, in some implementations of the third aspect, the search space of each basic unit further includes the selectable operation types of that basic unit, where the selectable operation types of each basic unit include the convolution operations corresponding to a connection between any two nodes in that basic unit, and these convolution operations include a dilated convolution operation.
When the selectable operation types in the RCNN include a dilated convolution operation, substantially the same target detection performance can be achieved with fewer convolution parameters. In addition, when the selectable operation types of the RCNN include a dilated convolution operation, a target detection network with better target detection performance can be obtained with substantially the same number of convolution parameters.
Specifically, for the same number of convolution parameters, dilated convolution yields a larger receptive field than conventional convolution, so the target detection network has better target detection performance.
With reference to the third aspect, in certain implementations of the third aspect, the convolution operations corresponding to a connection between any two nodes in each basic unit include a dilated convolution operation with a dilation rate of 2.
Because the resolution of the feature maps processed in the RCNN is generally low, dilated convolution operations with small dilation rates are more suitable there. For each basic unit in the RCNN, when the selectable operations include dilated convolution with a dilation rate of 2, the target detection performance of the final target detection network can be improved.
With reference to the third aspect, in certain implementations of the third aspect, at least two basic units of the plurality of basic units are respectively composed of different numbers of nodes.
With reference to the third aspect, in certain implementations of the third aspect, a resolution of the input feature map of each basic unit is the same as a resolution of the output feature map of each basic unit.
When the resolution of the input feature map of each basic unit in the RCNN is the same as the resolution of its output feature map, the basic units do not change the resolution of the feature map during processing, which helps to retain the information of the feature map and thereby maintains the target detection performance of the target detection network.
With reference to the third aspect, in certain implementations of the third aspect, the target detection network satisfies at least one of the following conditions: the detection performance of the target detection network meets a preset performance requirement; the number of updates to the network architecture of the target detection network is greater than or equal to a preset number; the complexity of the target detection network is less than or equal to a preset complexity.
With reference to the third aspect, in certain implementations of the third aspect, the complexity of the target detection network is determined according to at least one of: the number or size of the model parameters of the target detection network, the memory access cost (MAC) of the target detection network, and the number of floating-point operations (FLOPs) of the target detection network.
In a fourth aspect, a target detection method is provided, including: acquiring an image; and processing the image with a target detection network to obtain a target detection result of the image, where the target detection result includes the position of a detected target in the image and a classification result of the detected target. The target detection network includes a backbone network, a feature fusion layer, a region proposal network (RPN), and a region-based convolutional neural network (RCNN); the target detection network meets preset requirements; the target detection network is obtained by iteratively updating an initial network architecture of the target detection network according to a search space of the target detection network; and the initial network architecture is determined according to the search space of the target detection network. The RCNN includes a plurality of basic units, each composed of at least two nodes; the search space of the target detection network includes a search space of the RCNN; the search space of the RCNN includes the search space of each of the plurality of basic units; the search space of each basic unit includes the selectable connection relationships of that basic unit; the selectable connection relationships of each basic unit include a connection between any two nodes within that basic unit; and the RCNN in the initial network architecture is determined according to the search space of the RCNN.
Because the search space of the RCNN used in constructing the target detection network contains more selectable connection relationships, the RCNN determined according to this search space can process the features of candidate regions better, and the final target detection network performs better at target detection. In addition, the network structure of the RCNN in the initial network architecture can be determined more reasonably according to the larger set of selectable connection relationships, and the network architecture of the RCNN can be updated, which can reduce the complexity of the finally obtained target detection network.
Specifically, for the search space of the RCNN, because a search space with freer selectable connection relationships is adopted, a more streamlined network structure of the RCNN can be obtained, compared with manually setting the network architecture, when the RCNN in the initial network architecture of the target detection network is determined according to the search space of the RCNN and its architecture is updated. This can ultimately reduce the complexity of the target detection network and the storage space it occupies when deployed.
With reference to the fourth aspect, in some implementations of the fourth aspect, the search space of each basic unit further includes the selectable operation types of that basic unit, the selectable operation types of each basic unit include the convolution operations corresponding to a connection between any two nodes in that basic unit, and these convolution operations include a dilated convolution operation.
When the selectable operation types in the RCNN include a dilated convolution operation, the target detection network can achieve substantially the same target detection performance with fewer convolution parameters, and, with substantially the same number of convolution parameters, better detection performance.
With reference to the fourth aspect, in some implementations of the fourth aspect, the convolution operations corresponding to a connection between any two nodes in each basic unit include a dilated convolution operation with a dilation rate of 2.
Because the resolution of the feature maps processed in the RCNN is generally low, dilated convolutions with small dilation rates are more suitable there. Therefore, for each basic unit in the RCNN, when the selectable operations include dilated convolution with a dilation rate of 2, the performance of the finally obtained RCNN, and hence of the finally obtained target detection network, can be improved.
With reference to the fourth aspect, in certain implementations of the fourth aspect, at least two of the plurality of basic units are respectively composed of different numbers of nodes.
With reference to the fourth aspect, in certain implementations of the fourth aspect, the resolution of the input feature map of each base unit is the same as the resolution of the output feature map of each base unit.
When the resolution of the input feature map of each basic unit in the RCNN is the same as the resolution of its output feature map, the basic units do not change the resolution of the feature map during processing, which helps to retain the information of the feature map.
With reference to the fourth aspect, in some implementations of the fourth aspect, the search space of the target detection network further includes a search space of the feature fusion layer; the search space of the feature fusion layer includes the selectable connection relationships of the feature fusion layer, which include a connection between any node of one layer and any node of the other layer in two adjacent layers of the multi-layer neural network; and the feature fusion layer in the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer.
Because the search space of the feature fusion layer used in constructing the target detection network contains more selectable connection relationships, the feature fusion layer determined according to this search space can perform feature fusion better, and the final target detection network performs better at target detection.
With reference to the fourth aspect, in some implementations of the fourth aspect, the search space of the feature fusion layer further includes the selectable operation types of the feature fusion layer, and the selectable operation types include the convolution operations corresponding to a connection between any node of one layer and any node of the other layer in two adjacent layers of the neural network, where these convolution operations include a dilated convolution operation.
When the selectable operation types in the feature fusion layer include a dilated convolution operation, substantially the same target detection performance can be achieved with fewer convolution parameters. In addition, for the same number of convolution parameters, dilated convolution yields a larger receptive field than conventional convolution, so the finally obtained target detection network has better target detection performance.
With reference to the fourth aspect, in certain implementations of the fourth aspect, the object detection network satisfies at least one of the following conditions: the detection performance of the target detection network meets the preset performance requirement; updating the network architecture of the target detection network for more than or equal to a preset number; the complexity of the target detection network is less than or equal to a preset complexity.
With reference to the fourth aspect, in some implementations of the fourth aspect, the complexity of the target detection network is determined according to at least one of: the number or size of the model parameters of the target detection network, the memory access cost (MAC) of the target detection network, and the number of floating-point operations (FLOPs) of the target detection network.
In a fifth aspect, an apparatus for constructing an object detection network is provided, the apparatus including means for performing the method in any one of the implementations of the first aspect or the second aspect.
In a sixth aspect, an object detection apparatus is provided, which includes means for performing the method in any one of the implementations of the third aspect or the fourth aspect.
In a seventh aspect, an apparatus for constructing an object detection network is provided, the apparatus including: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the method in any one of the implementations of the first aspect or the second aspect.
In an eighth aspect, an object detection apparatus is provided, including: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the method in any one of the implementations of the third aspect or the fourth aspect.
In a ninth aspect, a computer-readable medium is provided, which stores program code for execution by a device, the program code comprising instructions for performing the method in any one of the implementations of the first aspect to the fourth aspect.
In a tenth aspect, a computer program product comprising instructions is provided which, when run on a computer, causes the computer to perform the method in any one of the implementations of the first aspect to the fourth aspect.
In an eleventh aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the method in any one of the implementations of the first aspect to the fourth aspect.
Optionally, as an implementation, the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to perform the method in any one of the implementations of the first aspect to the fourth aspect.
Drawings
Fig. 1 is a schematic structural diagram of a system architecture provided in an embodiment of the present application;
Fig. 2 is a schematic diagram of target detection using a convolutional neural network model provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of a chip hardware structure according to an embodiment of the present application;
Fig. 4 is a diagram of a system architecture provided by an embodiment of the present application;
Fig. 5 is a schematic view of a target detection system according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a target detection system in the field of autonomous driving;
Fig. 7 is a schematic flowchart of a method of constructing a target detection network according to an embodiment of the present application;
Fig. 8 is a schematic diagram of one possible structure of a feature fusion layer;
Fig. 9 is a schematic diagram of one possible structure of an RCNN;
Fig. 10 is a schematic diagram of one possible structure of a feature fusion layer;
Fig. 11 is a schematic diagram of one possible structure of an RCNN;
Fig. 12 is a schematic flowchart of a method of constructing a target detection network according to an embodiment of the present application;
Fig. 13 is a schematic flowchart of a target detection method according to an embodiment of the present application;
Fig. 14 is a schematic block diagram of an apparatus for constructing a target detection network according to an embodiment of the present application;
Fig. 15 is a schematic block diagram of a target detection apparatus according to an embodiment of the present application;
Fig. 16 is a schematic block diagram of an apparatus for constructing a target detection network according to an embodiment of the present application;
Fig. 17 is a schematic block diagram of a target detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The solution of this application can be applied to fields that require target detection (for example, detecting pedestrians in images), such as assisted driving, autonomous driving, safe cities, and intelligent terminals. Two common application scenarios are briefly introduced below.
Application scenario 1: assisted/autonomous driving systems
In advanced driver assistance systems (ADAS) and autonomous driving systems (ADS), pedestrians and obstacles on the road surface need to be detected and avoided; in particular, to avoid collisions with pedestrians, accurate target detection is required.
Application scenario 2: safe city/video surveillance systems
In safe city systems and video surveillance systems, target detection (for example, detecting pedestrians or vehicles) can be performed in real time, the detection results can be marked, and the results can be sent to an analysis unit of the system to search for criminal suspects, missing persons, specific vehicles, and the like.
The scheme of the application relates to the construction of a neural network and the target detection by using the neural network, and in order to better understand the scheme of the application, the related terms and concepts of the neural network are introduced below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the arithmetic unit may be as shown in equation (1):

$$h_{W,b}(x) = f\left(W^{T}x\right) = f\left(\sum_{s=1}^{n} W_{s} x_{s} + b\right) \qquad (1)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which performs a non-linear transformation on features in the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be a region composed of several neural units.
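As an illustration only (not part of the patent text), equation (1) can be sketched in a few lines of Python; the function name and the sigmoid default used here are choices made for this sketch, not specified by the source.

```python
import numpy as np

def neural_unit(x, W, b, f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Single neural unit: f(sum_s W_s * x_s + b), with sigmoid as the
    example activation function mentioned above."""
    return f(np.dot(W, x) + b)

# Example with n = 3 inputs and arbitrary weights and bias.
out = neural_unit(np.array([0.5, -1.0, 2.0]),
                  np.array([0.1, 0.4, -0.2]),
                  b=0.3)
```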
(2) Deep neural network
Deep Neural Networks (DNNs), also called multi-layer neural networks, can be understood as neural networks with multiple hidden layers. The DNNs are divided according to the positions of different layers, and the neural networks inside the DNNs can be divided into three categories: input layer, hidden layer, output layer. Generally, the first layer is an input layer, the last layer is an output layer, and the middle layers are hidden layers. The layers are all connected, that is, any neuron of the ith layer is necessarily connected with any neuron of the (i + 1) th layer.
Although a DNN appears complex, the work of each layer is actually not complex; each layer simply computes the following linear relational expression:

$$\vec{y} = \alpha\left(W\vec{x} + \vec{b}\right)$$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is an offset vector, $W$ is a weight matrix (also called coefficients), and $\alpha(\cdot)$ is an activation function. Each layer merely performs this simple operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, there are also many coefficients $W$ and offset vectors $\vec{b}$. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.

In summary, the coefficient from the $k$th neuron at layer $L-1$ to the $j$th neuron at layer $L$ is defined as $W^{L}_{jk}$.
Note that the input layer has no $W$ parameter. In a deep neural network, more hidden layers enable the network to better depict complex situations in the real world. In theory, a model with more parameters has higher complexity and a larger "capacity", which means that it can complete more complex learning tasks. Training the deep neural network is thus a process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors $W$ of many layers).
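The layer-by-layer computation and the indexing convention above can be illustrated with a minimal sketch (an illustration under the definitions given here, not an implementation from the patent):

```python
import numpy as np

def dnn_forward(x, weights, biases, alpha=np.tanh):
    """Forward pass of a DNN: y = alpha(W x + b) at every layer.

    weights[L][j, k] plays the role of W^{L}_{jk}: the coefficient from
    neuron k of layer L-1 to neuron j of layer L (the input layer has
    no W parameter, so the list starts at the first hidden layer).
    """
    h = x
    for W, b in zip(weights, biases):
        h = alpha(W @ h + b)
    return h

# A three-layer DNN: 4 inputs -> 5 hidden neurons -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(2, 5))]
biases = [np.zeros(5), np.zeros(2)]
y = dnn_forward(np.ones(4), weights, biases)
```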
(3) Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor consisting of convolutional layers and sub-sampling layers, and the feature extractor can be regarded as a filter. A convolutional layer is a neuron layer in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer, one neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of some neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Weight sharing may be understood as meaning that the manner of extracting image information is independent of location. The convolution kernel can be initialized as a matrix of random size, and reasonable weights can be learned during the training of the convolutional neural network. In addition, a direct benefit of weight sharing is reducing the connections between the layers of the convolutional neural network while also reducing the risk of overfitting.
(4) Residual error network
The residual network is a deep convolutional network proposed in 2015. Compared with a conventional convolutional neural network, it is easier to optimize and can improve accuracy by increasing depth considerably. The core of the residual network is to resolve the side effect of increasing depth (the degradation problem), so that network performance can be improved by simply increasing the network depth. A residual network generally contains many sub-modules with the same structure, and a number is usually appended to ResNet to indicate the number of times the sub-module is repeated; for example, ResNet50 indicates that there are 50 sub-modules in the residual network.
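For illustration, a residual sub-module of the kind described above might be sketched in PyTorch as follows (a generic sketch; the exact block design of the ResNet variants differs):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual sub-module: output = ReLU(F(x) + x), where the identity
    shortcut lets the network be deepened without the degradation problem."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity shortcut
```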
(6) Classifier
Many neural network architectures include a classifier for classifying objects in an image. A classifier generally consists of a fully connected layer and a softmax function (which may be called a normalized exponential function), and can output probabilities of different classes according to the input.
(7) Loss function
In the process of training a deep neural network, it is expected that the output of the network is as close as possible to the value that is really desired to be predicted. Therefore, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the really desired target value (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, "how to compare the difference between the predicted value and the target value" needs to be defined in advance. This is the role of the loss function (loss function) or objective function (objective function): they are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
(8) Back propagation algorithm
The neural network can adopt a Back Propagation (BP) algorithm to correct the numerical values of the parameters in the initial neural network model in the training process, so that the reconstruction error loss of the neural network model is smaller and smaller. Specifically, the error loss is generated by transmitting the input signal in the forward direction until the output, and the parameters in the initial neural network model are updated by reversely propagating the error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion with error loss as a dominant factor, aiming at obtaining the optimal parameters of the neural network model, such as a weight matrix.
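The interplay of forward propagation, the loss function, and back propagation can be summarized in a minimal PyTorch training step (an illustrative sketch, with an arbitrary model and dummy data):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()          # one possible loss function

x = torch.randn(4, 8)                    # dummy input batch
target = torch.tensor([0, 1, 1, 0])      # dummy target values

optimizer.zero_grad()
pred = model(x)                          # forward propagation
loss = loss_fn(pred, target)             # difference from the target value
loss.backward()                          # back propagation of the error loss
optimizer.step()                         # update parameters to reduce the loss
```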
Some basic contents of the neural network are briefly described above, and some specific neural networks that may be used in image data processing are described below.
The system architecture of the embodiment of the present application is described in detail below with reference to fig. 1.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in FIG. 1, the system architecture 100 includes an execution device 110, a training device 120, a database 130, a client device 140, a data storage system 150, and a data collection system 160.
In addition, the execution device 110 includes a calculation module 111, an I/O interface 112, a preprocessing module 113, and a preprocessing module 114. The calculation module 111 may include the target model/rule 101; the preprocessing module 113 and the preprocessing module 114 are optional.
The data acquisition device 160 is used to acquire training data. For the target detection method of the embodiment of the present application, the training data may include a training image (including a pedestrian) and annotation data, where coordinates of a bounding box (bounding box) in which the pedestrian exists in the training image are given in the annotation data. After the training data is collected, data collection device 160 stores the training data in database 130, and training device 120 trains target model/rule 101 based on the training data maintained in database 130.
The following describes how the training device 120 obtains the target model/rule 101 based on the training data. The training device 120 performs object detection on input training images and compares the output target detection results (bounding boxes of targets such as pedestrians or vehicles in the image, and the confidence of each bounding box) with the annotated results, until the difference between the target detection results output by the training device 120 and the pre-annotated results is smaller than a certain threshold, thereby completing the training of the target model/rule 101.
The target model/rule 101 can be used to implement the target detection method of the embodiment of the present application, that is, the target detection result of the image to be processed can be obtained by inputting the image to be processed (after the relevant preprocessing) into the target model/rule 101. The target model/rule 101 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 130 may not necessarily all come from the collection of the data collection device 160, and may also be received from other devices. It should be noted that, the training device 120 does not necessarily perform the training of the target model/rule 101 based on the training data maintained by the database 130, and may also obtain the training data from the cloud or other places for performing the model training.
The target model/rule 101 obtained by training with the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 1. The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server or a cloud. In fig. 1, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices. A user may input data to the I/O interface 112 through the client device 140, where the input data may include an image to be processed input by the client device. The client device 140 may specifically be a terminal device.
The preprocessing module 113 and the preprocessing module 114 are used for preprocessing the input data (such as an image to be processed) received by the I/O interface 112. In this embodiment, the preprocessing module 113 and the preprocessing module 114 may be absent, or only one preprocessing module may be present. When the preprocessing module 113 and the preprocessing module 114 are absent, the calculation module 111 may be used to process the input data directly.
In the process that the execution device 110 preprocesses the input data or in the process that the calculation module 111 of the execution device 110 executes the calculation or other related processes, the execution device 110 may call the data, the code, and the like in the data storage system 150 for corresponding processes, and may store the data, the instruction, and the like obtained by corresponding processes in the data storage system 150.
Finally, the I/O interface 112 presents the results of the processing, such as the target detection results calculated by the target model/rule 101, to the client device 140 for presentation to the user.
Specifically, the target detection result obtained by the processing of the target model/rule 101 in the calculation module 111 may be processed by the preprocessing module 113 (or may be processed by the preprocessing module 114), and then the processing result is sent to the I/O interface, and then the I/O interface sends the processing result to the client device 140 for display.
It should be understood that, when the preprocessing module 113 and the preprocessing module 114 are not present in the system architecture 100, the computing module 111 may also transmit the processed target detection result to the I/O interface, and then the I/O interface sends the processing result to the client device 140 for display.
It should be noted that the training device 120 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In fig. 1, the user may manually give input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
As shown in fig. 1, the target model/rule 101 obtained by training according to the training device 120 may be a neural network in the embodiment of the present application, and specifically, the neural network provided in the embodiment of the present application may be a CNN (convolutional neural network), a Deep Convolutional Neural Network (DCNN), or the like.
Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 2. As described in the introduction of the basic concept above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in fig. 2, a Convolutional Neural Network (CNN)200 may include an input layer 210, a convolutional/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230. The relevant contents of these layers are described in detail below.
Convolutional layer/pooling layer 220:
Convolutional layer:
The convolutional layer/pooling layer 220 shown in fig. 2 may include layers 221 to 226 as examples. In one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The inner working principle of a convolutional layer will be described below by taking convolutional layer 221 as an example.
The convolutional layer 221 may include many convolution operators, also called kernels. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually slid over the input image in the horizontal direction pixel by pixel (or two pixels by two pixels, depending on the value of the stride), so as to extract specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends to the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns) are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices may be used to extract different features in the image: for example, one weight matrix extracts image edge information, another weight matrix extracts a particular color of the image, and yet another weight matrix blurs unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), so the convolution feature maps extracted by them also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct prediction.
When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (e.g., 221) tends to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 200 increases, the features extracted by later convolutional layers (e.g., 226) become more complex, for example features with high-level semantics; features with higher semantics are more suitable for the problem to be solved.
A pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. For the layers 221-226 illustrated as 220 in fig. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller image. The average pooling operator may compute the average of the pixel values within a certain range of the image as the result of average pooling. The maximum pooling operator may take the pixel with the largest value within a particular range as the result of maximum pooling. In addition, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the output image represents the average value or maximum value of a corresponding sub-region of the input image.
Fully connected layer 230:
after processing by convolutional layer/pooling layer 220, convolutional neural network 200 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (required class information or other relevant information), the convolutional neural network 200 needs to generate one or a set of the required number of classes of output using the fully-connected layer 230. Accordingly, a plurality of hidden layers (231, 232 to 23n shown in fig. 2) and an output layer 240 may be included in the fully-connected layer 230, and parameters included in the hidden layers may be pre-trained according to the related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the fully connected layer 230, the last layer of the whole convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the categorical cross entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 200 is completed (in fig. 2, the propagation from 210 to 240 is forward propagation), back propagation (in fig. 2, the propagation from 240 to 210 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
It should be understood that the Convolutional Neural Network (CNN)200 shown in fig. 2 may be adopted to perform the target detection method of the embodiment of the present application, and as shown in fig. 2, after the image to be processed is processed by the input layer 210, the convolutional/pooling layer 220 and the fully-connected layer 230, the detection result of the image to be processed (the bounding box in which the pedestrian exists in the image to be processed and the confidence of the bounding box in which the pedestrian exists in the image) may be obtained.
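A toy network following the input layer → convolutional/pooling layers → fully connected layers → output layer pattern of fig. 2 could look as follows (the dimensions are illustrative, not taken from the patent):

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),  # convolutional layer
    nn.MaxPool2d(2),                            # pooling layer
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 128), nn.ReLU(),    # hidden fully connected layer
    nn.Linear(128, 10),                         # output layer
)
# For a 3 x 224 x 224 input, the two 2x poolings leave 56 x 56 feature maps.
```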
Fig. 3 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor 50. The chip may be provided in the execution device 110 as shown in fig. 1 to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 1 to complete the training work of the training apparatus 120 and output the target model/rule 101. The algorithms for the various layers in the convolutional neural network shown in fig. 2 can all be implemented in a chip as shown in fig. 3.
A neural-network processing unit (NPU) 50 is mounted as a coprocessor on a host CPU, and the host CPU allocates tasks. The core part of the NPU is the arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract data from a memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 503 fetches the data corresponding to the matrix B from the weight memory 502 and buffers it in each PE in the arithmetic circuit 503. The arithmetic circuit 503 takes the matrix a data from the input memory 501 and performs matrix operation with the matrix B, and partial or final results of the obtained matrix are stored in an accumulator (accumulator) 508.
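The accumulate-partial-results pattern can be illustrated with a simple blocked matrix multiplication (a software analogy only; the actual circuit behavior is as described above):

```python
import numpy as np

def tiled_matmul(A, B, tile=2):
    """C = A @ B computed slice by slice along the inner dimension;
    partial products are summed into C, which plays the role of the
    accumulator 508."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))                           # the accumulator
    for t in range(0, k, tile):
        C += A[:, t:t + tile] @ B[t:t + tile, :]   # accumulate partial results
    return C
```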
The vector calculation unit 507 may further process the output of the operation circuit 503, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculation of non-convolution/non-FC layers in a neural network, such as pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 507 can store the processed output vector into the unified buffer 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as to a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuit 503, for example for use in subsequent layers of the neural network.
The unified memory 506 is used to store input data as well as output data.
A direct memory access controller (DMAC) 505 transfers input data in an external memory to the input memory 501 and/or the unified memory 506, stores weight data from the external memory in the weight memory 502, and stores data in the unified memory 506 into the external memory.
A Bus Interface Unit (BIU) 510, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through a bus.
An instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;
the controller 504 is configured to invoke the instructions cached in the instruction fetch buffer 509 to control the working process of the operation accelerator.
Generally, the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are on-chip memories, the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
In addition, in the present application, the operations of the layers in the convolutional neural network shown in fig. 2 may be performed by the operation circuit 503 or the vector calculation unit 507.
As shown in fig. 4, the present embodiment provides a system architecture 300. The system architecture includes a local device 301, a local device 302, and an execution device 210 and a data storage system 250, wherein the local device 301 and the local device 302 are connected with the execution device 210 through a communication network.
The execution device 210 may be implemented by one or more servers. Optionally, the execution device 210 may be used with other computing devices, such as: data storage, routers, load balancers, and the like. The execution device 210 may be disposed on one physical site or distributed across multiple physical sites. The execution device 210 may use data in the data storage system 250 or call program code in the data storage system 250 to implement the method of searching a neural network structure of the embodiments of the present application.
Specifically, the execution device 210 may perform the following process: determining a search space of a target detection network; determining an initial network architecture of the target detection network according to a search space of the target detection network; and according to the search space of the target detection network, iteratively updating the initial network architecture of the target detection network until the target detection network meeting the preset requirement is obtained.
Through the above process, the execution device 210 can build a target neural network, and the target neural network can be used for target detection.
The user may operate respective user devices (e.g., local device 301 and local device 302) to interact with the execution device 210. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, and so forth.
The local devices of each user may interact with the enforcement device 210 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, etc., or any combination thereof.
In one implementation, the local device 301 and the local device 302 acquire the relevant parameters of the target neural network from the execution device 210, deploy the target neural network on the local device 301 and the local device 302, and perform target detection by using the target neural network.
In another implementation, the execution device 210 may directly deploy a target neural network, and the execution device 210 acquires the to-be-processed image from the local device 301 and the local device 302 (the local device 301 and the local device 302 may upload the to-be-processed image to the execution device 210), performs target detection on the to-be-processed image according to the target neural network, and sends a target detection result to the local device 301 and the local device 302.
The execution device 210 may also be referred to as a cloud device, and in this case, the execution device 210 is generally deployed in the cloud.
The object detection system is described in detail below with reference to fig. 5.
FIG. 5 is a schematic diagram of an object detection system according to an embodiment of the present application.
As shown in fig. 5, the target detection system includes a backbone network (backbone), a feature fusion layer, a candidate region generation network (region proposal network, RPN), and a regional convolutional neural network (RCNN). The four network structures in the target detection network are described in detail below.
Backbone network:
The backbone network is used for extracting low-level picture information and is a common structure of vision-based deep neural network models. In practice, the backbone network is usually obtained by fine-tuning the architecture of a general deep convolutional neural network. For example, the backbone network may be obtained by fine-tuning the architecture of a Visual Geometry Group (VGG) network, which was proposed by the Visual Geometry Group of Oxford University. For another example, the backbone network may be obtained by fine-tuning a deep residual network (ResNet) architecture.
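For illustration, a ResNet-based backbone that exposes multi-scale features for the feature fusion layer could be built with torchvision (assuming a recent torchvision; the node names C3-C5 are a common convention, not mandated by the patent):

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Expose multi-scale, multi-level features for the feature fusion layer.
backbone = create_feature_extractor(
    resnet50(weights=None),
    return_nodes={"layer2": "C3", "layer3": "C4", "layer4": "C5"})

feats = backbone(torch.randn(1, 3, 800, 800))
# feats["C3"], feats["C4"], feats["C5"] are feature maps at strides 8, 16, 32.
```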
As shown in fig. 5, the backbone network may perform feature extraction on the image to be detected to obtain image features of the image to be detected.
A characteristic fusion layer:
the feature fusion layer is used for screening and fusing multi-scale and multi-level features extracted from the backbone network to generate more compact and expressive feature vectors so as to facilitate further processing after the feature vectors are input into a classifier. The feature fusion layer is widely used in the neural network design based on multi-scale and multi-level features. In practical application, on one hand, the pyramid structure can be used for adjusting the size, the shape and the weight of features with different scales, and the results are added and fused into a single feature vector, and on the other hand, the features with different levels are connected through skip links, so that multi-level features with higher expressive force are mined.
As shown in fig. 5, the feature fusion layer performs fusion processing on the image features of the to-be-detected image extracted by the backbone network to obtain multi-level features of the to-be-detected image.
RPN:
The RPN is a fast regression classifier for generating coarse target locations and class label information. In practical applications, the RPN can be implemented by a simple two-layer network consisting of a binary classifier and bounding box regression (bounding box regression is a regression model used in target detection: near a target location obtained by a sliding window, it finds a regression window that is closer to the real window and has a smaller loss function value).
As shown in fig. 5, the RPN processes the multi-level features of the to-be-detected image obtained from the feature fusion layer to obtain a preliminary classification detection result of the target.
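A minimal sketch of the two-branch structure described above (a shared convolution, a binary classifier, and bounding box regression; anchor handling and proposal selection are omitted, and the class name is hypothetical):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Simple two-layer RPN head: a shared 3x3 convolution followed by a
    binary objectness classifier and a per-anchor bounding-box regressor."""
    def __init__(self, in_channels, num_anchors):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_anchors, 1)      # object vs. background
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box regression deltas

    def forward(self, feature):
        t = self.conv(feature).relu()
        return self.cls(t), self.reg(t)
```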
RCNN:
The RCNN, which may also be referred to as an RCNN header, is a unique part of the target detection network and is used to further optimize the preliminary classification detection result obtained by the RPN. Through the combination of the RCNN and the RPN, the target detection system can rapidly remove a large number of invalid image areas and can intensively and finely detect more potential image areas, so that the target detection effect is improved.
As shown in fig. 5, the RCNN further processes the preliminary classification detection result obtained by the RPN to obtain the classification of the target and the bounding box of the target.
The target detection system obtained by the construction method of the target detection network in the embodiment of the application can be applied to an automatic driving scenario. In the object recognition scenario of automatic driving, objects such as vehicles and pedestrians on the road need to be recognized accurately. An automatic driving system needs to respond to road emergencies in real time, so the target detection system it carries is required to produce efficient and accurate target recognition results on limited hardware resources. The target detection system constructed by the construction method of the target detection network in the embodiment of the application can detect targets effectively, thereby improving the target detection effect and the safety performance of automatic driving.
Fig. 6 is a schematic diagram of an object detection system applied in the field of automatic driving.
As shown in fig. 6, video data consisting of a series of road pictures can be acquired by the in-vehicle camera. The video data is processed by the target detection module, so that the obstacles and the positions of the obstacles in the road picture can be detected. Next, the target detection module may input the detection result (the obstacle present in the road screen and the position of the obstacle) to the action detection module, so that the action detection module can generate a driving operation signal according to the detection result and transmit the driving operation signal to a corresponding execution device of the automatic driving, thereby implementing automatic control of the vehicle.
In addition, the object detection system in the object detection module of fig. 6 may be a neural network (model) obtained by the method for constructing an object detection network according to the embodiment of the present application.
Fig. 7 is a schematic flowchart of a method for constructing an object detection network according to an embodiment of the present application. The method shown in fig. 7 may be performed by a device for constructing an object detection network according to an embodiment of the present application (for example, the method shown in fig. 7 may be performed by a device shown in fig. 14 or fig. 16, hereinafter), and the method shown in fig. 7 includes steps 1001 to 1003, which are described in detail below.
1001. A search space of the target detection network is determined.
The search space of the target detection network comprises a search space of a feature fusion layer.
For the search space of the feature fusion layer, optional connection relationships of the feature fusion layer may be included.
Specifically, the selectable connection relation of the feature fusion layer may include connection of any one node of one layer of the neural network with any one node of the other layer of the neural network in two adjacent layers of the neural networks of the feature fusion layer.
For example, as shown in fig. 8, the feature fusion layer includes a layer A neural network, a layer B neural network, and a layer C neural network. The constituent nodes of each layer of the neural network are as follows.

Layer A neural network: P_{0}_1, P_{0}_2, P_{0}_3, and P_{0}_4.

Layer B neural network: P_{1}_1, P_{1}_2, P_{1}_3, and P_{1}_4.

Layer C neural network: P_{2}_1, P_{2}_2, P_{2}_3, and P_{2}_4.
In fig. 8, any node in the layer A neural network may be connected to any node in the layer B neural network, and any node in the layer B neural network may be connected to any node in the layer C neural network.

Because skip links between all layers are considered both from the layer A neural network to the layer B neural network and from the layer B neural network to the layer C neural network, a feature fusion layer with a better network structure can be constructed.

For convenience of illustration, fig. 8 shows only the connection relationship of one node in the layer A neural network with the nodes in the layer B neural network, and of one node in the layer B neural network with the nodes in the layer C neural network.
Specifically, as shown in fig. 8, the node P_{0}_2 in the layer A neural network may be connected with the nodes P_{1}_1, P_{1}_2, P_{1}_3, and P_{1}_4 in the layer B neural network; the node P_{1}_1 in the layer B neural network may be connected with the nodes P_{2}_1, P_{2}_2, P_{2}_3, and P_{2}_4 in the layer C neural network. It should be understood that fig. 8 only shows one specific case of the feature fusion layer; the application does not limit the number of neural network layers included in the feature fusion layer or the number of nodes included in each layer.
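The "any node to any node of the adjacent layer" rule can be made concrete by enumerating the candidate edges of fig. 8 (an illustrative encoding only; the node naming mirrors the figure):

```python
# Layers A, B, C, each with 4 nodes: P_{0}_1 ... P_{2}_4.
layers = [[f"P_{{{level}}}_{i}" for i in range(1, 5)] for level in range(3)]

candidate_edges = [
    (src, dst)
    for upper, lower in zip(layers, layers[1:])  # adjacent layer pairs only
    for src in upper
    for dst in lower
]
# 4 x 4 edges per adjacent pair, so 32 candidate connections in total.
assert len(candidate_edges) == 32
```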
In addition, the search space of the target detection network may further include a search space of the RCNN.
The RCNN of the target detection network includes a plurality of basic units, each basic unit of the plurality of basic units is composed of at least two nodes, the search space of the RCNN includes the search space of each basic unit of the plurality of basic units, the search space of each basic unit includes the optional connection relationship of each basic unit, and the optional connection relationship of each basic unit includes the connection between any two nodes in each basic unit.
For example, as shown in fig. 9, the RCNN includes a base unit 1 above fig. 9 and a base unit 2 below fig. 9, and the base unit 1 and the base unit 2 are described below, respectively.
Base unit 1:
The basic unit 1 is composed of 7 nodes, including 2 input nodes (H_{0}), 4 intermediate nodes (0, 1, 2, 3), and 1 output node (H_{1}).

In the basic unit 1, any two nodes can be directly connected with each other, except that the 2 input nodes cannot be directly connected with each other.
The basic unit 2:
The basic unit 2 is composed of 7 nodes, including 2 input nodes (H_{0}, H_{1}), 4 intermediate nodes (0, 1, 2, 3), and 1 output node (H_{2}).

In the basic unit 2, any two nodes can be directly connected with each other, except that the 2 input nodes cannot be directly connected with each other.
Fig. 9 described above shows only one possible connection relationship between the base unit 1 and the base unit 2.
For the above-described basic unit 1 and basic unit 2, the nodes inside each basic unit are connected in the direction from input to output.
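The candidate connections of such a basic unit can be enumerated in the same spirit (a sketch for basic unit 2; the ordering of the node list encodes the input-to-output direction):

```python
from itertools import combinations

# Nodes of basic unit 2, ordered from inputs to output.
nodes = ["H_{0}", "H_{1}", "n0", "n1", "n2", "n3", "H_{2}"]
inputs = {"H_{0}", "H_{1}"}

candidate_edges = [
    (u, v)
    for u, v in combinations(nodes, 2)       # u precedes v: input-to-output
    if not (u in inputs and v in inputs)     # the 2 input nodes may not connect
]
# Every other pair of nodes may be directly connected: 21 - 1 = 20 edges.
```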
1002. And determining an initial network architecture of the target detection network according to the search space of the target detection network.
The initial network architecture of the target detection network includes a network architecture of a feature fusion layer, and the network architecture of the feature fusion layer is determined according to a search space of the feature fusion layer, that is, in step 1002, the network architecture of the feature fusion layer in the initial network architecture of the target detection network may be determined according to the search space of the feature fusion layer.
In addition, the initial network architecture of the target detection network may also include a network architecture of the RCNN, which is determined according to the search space of the RCNN, that is, in step 1002, the network architecture of the RCNN may be determined according to the search space of the RCNN.
It should be understood that, when determining the initial network architecture of the target detection network in step 1002, the number of network layers of the target detection network and the number of nodes included in the target detection network may be predetermined.
Specifically, before the target detection network is constructed, the network level number of the target detection network and the number of nodes included in the target detection network may be determined according to the application requirement or the requirement of the target detection performance of the target detection network to be constructed.
For example, when the requirement on the target detection performance of the target detection network is high, the number of network layers of the target detection network may be relatively large, and the number of nodes included in the target detection network may also be relatively large. Conversely, when the requirement on the target detection performance is low, the number of network layers may be relatively small, and the number of nodes may also be relatively small.
1003. And according to the search space of the target detection network, iteratively updating the initial network architecture of the target detection network until the target detection network meeting the preset requirement is obtained.
It should be understood that, in step 1003, when the initial network architecture of the target detection network is iteratively updated, either the network architecture of the feature fusion layer or the network architecture of the RCNN in the initial network architecture may be updated alone, or both may be updated simultaneously.
Specifically, in step 1003, after iteratively updating the network architecture of the target detection network each time, it may be determined whether the target detection network of the updated network meets the requirement, and if the preset requirement is not met, the network architecture of the target detection network is continuously updated until the target detection network meeting the preset requirement is obtained.
In the above object detection network, the network architecture of the backbone network and the network structure of the RPN may be predetermined. In this way, in the process of updating the initial network architecture of the target detection network, only the network architecture of the feature fusion layer in the initial network architecture of the target detection network or the network architecture of the RCNN in the initial network architecture of the target detection network may be updated.
In addition, the network architecture of the backbone network and the network architecture of the RPN may also be network architectures that are not determined in advance, so that the network architectures of the backbone network, the feature fusion layer, the RPN, and the RCNN of the target detection network may all be updated in the process of updating the initial network architecture of the target detection network.
In this application, because the search space of the feature fusion layer includes more selectable connection relationships of the feature fusion layer, the feature fusion layer in the initial network architecture of the target detection network can be determined more reasonably according to these selectable connection relationships, the network architecture of the feature fusion layer can be updated accordingly, and the complexity of the finally obtained target detection network can be simplified.
Specifically, in the present application, for the search space of the feature fusion layer, because a search space with a more free optional connection relationship is adopted, when the feature fusion layer in the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer and the network architecture of the feature fusion layer is updated, a more simplified network architecture of the feature fusion layer can be obtained, so that the complexity of the target detection network can be finally simplified, and the storage space required to be occupied when the target detection network is deployed is reduced.
In addition, as the search space of the feature fusion layer contains more optional connection relations, the feature fusion layer in the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer, and the network architecture of the feature fusion layer is updated, so that the target detection network with better performance can be constructed finally.
In the method, any two nodes in each basic unit of the RCNN can be connected in the search space of the RCNN, so that the initial network structure of the RCNN can be determined more reasonably according to the looser search space of the RCNN, the initial network structure of the RCNN is updated, and the complexity of a target detection network can be simplified.
In addition, as the search space of the RCNN contains more optional connection relations, the network architecture of the RCNN in the initial network architecture of the target detection network is determined according to the search space of the RCNN, and the network architecture of the RCNN is updated, so that the target detection network with better performance can be constructed finally.
The method shown in fig. 7 may be an automatic construction method of a neural network, and the method shown in fig. 7 may be automatically performed by the construction apparatus of the object detection network according to the embodiment of the present application.
When the network architecture of the feature fusion layer of the initial network architecture of the target detection network and the network architecture of the RCNN in the target detection network are updated simultaneously, the finally obtained feature fusion layer can be matched with the structure of the RCNN better, and therefore the performance of the finally obtained target detection network can be improved.
Optionally, the step 1003 specifically includes: and according to the search space of the target detection network, iteratively updating the initial network architecture of the target detection network so as to reduce the value of a loss function corresponding to the target detection network, thereby obtaining the target detection network meeting the preset requirement.
The loss function may include an object detection error of the object detection network and/or a complexity of the object detection network.
When the initial network architecture of the target detection network is updated iteratively, the network structure (the connection relationship between different nodes in the network) of the target detection network can be adjusted, the value of the loss function corresponding to the target detection network is calculated after each adjustment, and then the network structure of the target detection network is updated according to the value of the loss function corresponding to the target detection network, so that iteration is continued until the target detection network meeting the preset requirement is obtained.
Specifically, in the above process, a value (also referred to as a function value) of the loss function corresponding to the target detection network may be determined after each update of the network architecture of the target detection network. If the value of the loss function is already smaller than a preset threshold, the updating of the network architecture may be stopped; the target detection network obtained at this point is the target detection network meeting the preset requirement. If the value of the loss function is not smaller than the preset threshold, whether to continue updating the network architecture of the target detection network may be determined according to the value of the loss function, and the iteration continues until a target detection network meeting the preset requirement is obtained.
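The iteration described above can be summarized in a short Python sketch (an illustration only: evaluate_error, complexity, and propose_update are placeholders standing in for the training, measurement, and search machinery, which the patent does not pin down here):

```python
def search(initial_arch, search_space, max_iters=100, threshold=0.1, lam=1e-3):
    """Iteratively update the architecture until the loss (detection error
    plus a weighted complexity term) falls below a preset threshold."""
    # evaluate_error, complexity, propose_update are hypothetical helpers,
    # not defined by the patent.
    arch = initial_arch
    for _ in range(max_iters):
        loss = evaluate_error(arch) + lam * complexity(arch)
        if loss < threshold:
            break                                    # preset requirement met
        arch = propose_update(arch, search_space, loss)
    return arch
```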
Optionally, the target detection network meeting the preset requirement satisfies at least one of the following conditions (1) to (3):
(1) the detection performance of the target detection network meets the preset performance requirement;
(2) updating the network architecture of the target detection network for more than or equal to a preset number;
(3) the complexity of the target detection network is less than or equal to a preset complexity.
The complexity of the target detection network (which may also be referred to as the complexity of the network structure of the target detection network) may be determined according to at least one of the number or size of the model parameters of the target detection network, the memory access cost (MAC) of the target detection network, and the number of floating-point operations of the target detection network.
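Two of the listed complexity indicators are easy to measure directly; for instance (a sketch assuming a PyTorch model; FLOP and MAC counting would require a profiler and is not shown):

```python
import torch.nn as nn

def num_params(model: nn.Module) -> int:
    """Number of model parameters, one of the complexity indicators."""
    return sum(p.numel() for p in model.parameters())

def model_size_mb(model: nn.Module) -> float:
    """Approximate parameter size, assuming 32-bit floating-point weights."""
    return num_params(model) * 4 / 2**20
```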
For the search space of the feature fusion layer, in addition to the optional connection relationship of the feature fusion layer, an optional operation type of the feature fusion layer may be included.
Specifically, the selectable operation types of the feature fusion layer may include a convolution operation corresponding to connection of any one node of one layer of the neural network and any one node of the other layer of the neural network in two adjacent layers of the multilayer neural network.
The convolution operation corresponding to the connection of any node of one layer of the neural network and any node of the other layer of the neural network in the two adjacent layers of the neural networks comprises a hole convolution operation.
Optional operations of the feature fusion layer may include any of the following:
(1) no connection;

(2) 5 × 5 hole convolution with an interval number (dilation rate) of 2;

(3) skip connection (identity mapping);

(4) 5 × 5 hole convolution with an interval number of 3;

(5) 3 × 3 hole convolution with an interval number of 2;

(6) 3 × 3 depthwise separable convolution;

(7) 3 × 3 hole convolution with an interval number of 3;

(8) 5 × 5 depthwise separable convolution.
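One plausible PyTorch realization of this operation set is sketched below (the paddings are chosen so every operation preserves the spatial size; the names mirror the dil_conv_..._r... notation used in tables 3 and 4, and "none" marks the no-connection option — this mapping is an assumption, not taken from the patent):

```python
import torch.nn as nn

def make_op(name, channels):
    """Hypothetical mapping from an operation name in the search space
    to a PyTorch module."""
    ops = {
        "none":            lambda: None,               # no connection
        "skip_connect":    lambda: nn.Identity(),      # identity mapping
        "dil_conv_3x3_r2": lambda: nn.Conv2d(channels, channels, 3, padding=2, dilation=2),
        "dil_conv_3x3_r3": lambda: nn.Conv2d(channels, channels, 3, padding=3, dilation=3),
        "dil_conv_5x5_r2": lambda: nn.Conv2d(channels, channels, 5, padding=4, dilation=2),
        "dil_conv_5x5_r3": lambda: nn.Conv2d(channels, channels, 5, padding=6, dilation=3),
        "sep_conv_3x3":    lambda: nn.Sequential(      # depthwise separable
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1)),
        "sep_conv_5x5":    lambda: nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, groups=channels),
            nn.Conv2d(channels, channels, 1)),
    }
    return ops[name]()
```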
Compared with the conventional convolution operator, with the same number of learnable parameters, the hole convolution operator has a larger receptive field, so it can be used to extract features with a larger visual range from the image. Alternatively, to extract features in the same visual range, using the hole convolution operator can reduce the number and size of the network parameters of the feature fusion layer. Therefore, when the optional operation types in the feature fusion layer include the hole convolution operation, substantially the same target detection performance can be achieved with fewer convolution parameters.
In addition, when the optional operation type in the feature fusion layer comprises a hole convolution operation, the target detection network with better target detection performance can be obtained under the condition that the convolution parameter quantity is basically the same.
Specifically, with the same number of parameters, convolution processing using hole convolution obtains a larger receptive field than conventional convolution, so the finally obtained target detection network has better target detection performance.
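As a worked note on this point (the formula is standard, not quoted from the patent): the effective kernel size of a k × k hole convolution with interval number (dilation rate) r is

$$k_{\text{eff}} = r\,(k-1) + 1,$$

so a 3 × 3 hole convolution with r = 3 covers a 7 × 7 region using only 9 learnable weights, whereas a conventional convolution would need 7 × 7 = 49 weights for the same receptive field.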
The network structure of the feature fusion layer of the finally obtained target detection network is described below with reference to the drawings.
As shown in fig. 10, the feature fusion layer includes the same number of neural network layers as the feature fusion layer shown in fig. 8, and the nodes in each layer are the same. Fig. 10 shows one possible connection manner of the finally obtained feature fusion layer. Specifically, in fig. 10, P_{1}_1 and P_{1}_4 in the layer B neural network receive the outputs of all nodes (4 nodes) in the layer A neural network, and P_{1}_2 and P_{1}_3 in the layer B neural network receive the outputs of 3 nodes in the layer A neural network; P_{2}_1 and P_{2}_4 in the layer C neural network receive the outputs of all nodes (4 nodes) in the layer B neural network, and P_{2}_2 and P_{2}_3 in the layer C neural network receive the outputs of 3 nodes in the layer B neural network.
Table 1 shows the input nodes corresponding to each node in the layer B neural network.
TABLE 1
Node in layer B neural network | Corresponding input nodes
P_{1}_1 | P_{0}_1, P_{0}_2, P_{0}_3, and P_{0}_4
P_{1}_2 | P_{0}_2, P_{0}_3, and P_{0}_4
P_{1}_3 | P_{0}_2, P_{0}_3, and P_{0}_4
P_{1}_4 | P_{0}_1, P_{0}_2, P_{0}_3, and P_{0}_4
Table 2 shows the input nodes corresponding to each node in the layer C neural network.
TABLE 2
Node in layer C neural network | Corresponding input nodes
P_{2}_1 | P_{1}_1, P_{1}_2, P_{1}_3, and P_{1}_4
P_{2}_2 | P_{1}_2, P_{1}_3, and P_{1}_4
P_{2}_3 | P_{1}_2, P_{1}_3, and P_{1}_4
P_{2}_4 | P_{1}_1, P_{1}_2, P_{1}_3, and P_{1}_4
In addition, fig. 10 also shows one possible selection of the operation types between nodes. Specifically, the operations between some of the nodes of the layer A neural network and the layer B neural network are shown in table 3; apart from the operations listed in table 3, the operations on the remaining connections between the layer A neural network and the layer B neural network are all identity mappings.
TABLE 3
Node in layer B neural network | Corresponding input node | Corresponding operation
P_{1}_1 | P_{0}_3 | dil_conv_5×5_r3
P_{1}_2 | P_{0}_3 | dil_conv_5×5_r3
P_{1}_4 | P_{0}_4 | dil_conv_5×5_r2
The operations between some of the nodes of the layer B neural network and the layer C neural network are shown in table 4; apart from the operations listed in table 4, the operations on the remaining connections between the layer B neural network and the layer C neural network are all identity mappings.
TABLE 4
Node in layer C neural network | Corresponding input node | Corresponding operation
P_{2}_2 | P_{1}_1 | dil_conv_5×5_r3
P_{2}_4 | P_{1}_4 | dil_conv_5×5_r3
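As a reading aid, the wiring in tables 1 and 3 can be replayed in code. The sketch below is a hypothetical reconstruction, not the patent's implementation; it assumes, for simplicity, that all four layer A feature maps share one resolution and channel count, whereas a real feature fusion layer would resample between pyramid levels.

```python
import torch
import torch.nn as nn

def dil_conv(c, k, r):
    # Hole convolution with resolution-preserving padding.
    return nn.Conv2d(c, c, k, padding=r * (k - 1) // 2, dilation=r)

c = 64
OPS = {
    "identity":        nn.Identity(),
    "dil_conv_5x5_r2": dil_conv(c, 5, 2),
    "dil_conv_5x5_r3": dil_conv(c, 5, 3),
}

# Inputs to each layer B node (table 1).
B_INPUTS = {
    "P1_1": ["P0_1", "P0_2", "P0_3", "P0_4"],
    "P1_2": ["P0_2", "P0_3", "P0_4"],
    "P1_3": ["P0_2", "P0_3", "P0_4"],
    "P1_4": ["P0_1", "P0_2", "P0_3", "P0_4"],
}
# Non-identity edge operations (table 3); all other edges are identity.
B_EDGE_OPS = {
    ("P1_1", "P0_3"): "dil_conv_5x5_r3",
    ("P1_2", "P0_3"): "dil_conv_5x5_r3",
    ("P1_4", "P0_4"): "dil_conv_5x5_r2",
}

# Each layer B node sums the (possibly transformed) layer A inputs.
layer_a = {f"P0_{i}": torch.randn(1, c, 32, 32) for i in range(1, 5)}
layer_b = {
    node: sum(OPS[B_EDGE_OPS.get((node, src), "identity")](layer_a[src])
              for src in srcs)
    for node, srcs in B_INPUTS.items()
}
```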
The search space of the target detection network in step 1002 may further include a search space of the RCNN. The RCNN may include a plurality of basic units, each of which is composed of at least two nodes. The search space of the RCNN includes a search space of each of the plurality of basic units, the search space of each basic unit includes the selectable connection relationships of that basic unit, and the selectable connection relationships of each basic unit include a connection between any two nodes within that basic unit.
When the search space of the target detection network in step 1002 includes the search space of the RCNN, the RCNN in the initial network architecture of the target detection network may be determined according to the search space of the RCNN. That is, in step 1002, the network architecture of the RCNN in the initial network architecture of the target detection network may be determined according to the search space of the RCNN.
In the present application, the search space of the RCNN allows any two nodes within each basic unit of the RCNN to be connected. The initial network structure of the RCNN can therefore be determined more reasonably from this looser search space and then updated, which helps to simplify the complexity of the target detection network.
Further, in the present application, the feature fusion layer of the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer and then updated, and the RCNN of the initial network architecture is likewise determined according to the search space of the RCNN and then updated. Compared with manually setting the network architecture, this makes it possible to update and optimize the network structures of the feature fusion layer and the RCNN simultaneously, so that the finally optimized network structure of the RCNN better matches the network structure of the feature fusion layer, and a target detection network with better target detection performance can be obtained.
Optionally, in the search space of the RCNN, the search space of each basic unit further includes a selectable operation type of each basic unit, where the selectable operation type of each basic unit includes a convolution operation corresponding to a connection between any two nodes in each basic unit, and the convolution operation includes a hole convolution operation.
In the present application, when the optional operation type in the RCNN includes a hole convolution operation, substantially the same target detection performance can be achieved with fewer convolution parameters.
In addition, when the optional operation type of the RCNN includes a hole convolution operation, a target detection network with better target detection performance can be obtained under the condition that the convolution parameter quantities are substantially the same.
Specifically, under the condition of the same parameter number, the cavity convolution is adopted for convolution processing, so that a larger receptive field can be obtained compared with the traditional convolution, and therefore the finally obtained target detection network has better target detection performance.
Optionally, the convolution operation corresponding to the connection between any two nodes in each basic unit includes a hole convolution operation with an interval number of 2.
Because the feature maps processed in the RCNN generally have a low resolution, they are better suited to hole convolutions with a small interval number. Therefore, for each basic unit in the RCNN, when the selectable operations include a hole convolution with an interval number of 2, the performance of the finally obtained RCNN can be improved, which in turn improves the performance of the finally obtained target detection network.
Alternatively, the basic unit in the RCNN may be composed of different numbers of nodes.
In the conventional scheme, the basic units in the RCNN are generally composed of the same number of nodes, which is not flexible enough when optimizing the RCNN network structure. In the present application, the basic units in the RCNN may be composed of different numbers of nodes, so the composition of the basic units is freer. This increases the number of possible RCNN network structures when the initial network structure of the RCNN is determined and updated, makes it easier to search for a better RCNN network structure, and makes it more likely that a target detection network with better target detection performance is finally obtained.
In addition, the number of node configurations of each basic unit in the RCNN may be predetermined before the target detection network is constructed.
Optionally, the resolution of the input feature map of each basic unit is the same as the resolution of the output feature map of each basic unit.
When the resolution of the input feature map of each basic unit in the RCNN is the same as the resolution of the output feature map of each basic unit, each basic unit in the RCNN does not change the resolution of the feature map when processing the feature map, so that the information of the feature map is kept.
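The properties just listed (free pairwise connections, per-unit node counts fixed in advance, and resolution-preserving operations) can be captured in a small sketch. The BasicUnit class below is an assumed illustration, not code from the patent.

```python
import torch
import torch.nn as nn

class BasicUnit(nn.Module):
    """A basic unit as a DAG: any earlier node may feed any later node.

    Node 0 is the unit's input and the last node is its output; the
    node count is configurable per unit and may differ between units.
    """
    def __init__(self, num_nodes, edges):
        # edges: {(src, dst): resolution-preserving nn.Module}
        super().__init__()
        self.num_nodes = num_nodes
        self.edges = nn.ModuleDict(
            {f"{s}_{d}": op for (s, d), op in edges.items()})

    def forward(self, x):
        states = [x]
        for d in range(1, self.num_nodes):
            # Each node sums the transformed outputs of its predecessors.
            inputs = [self.edges[f"{s}_{d}"](states[s])
                      for s in range(d) if f"{s}_{d}" in self.edges]
            states.append(sum(inputs) if inputs else states[-1])
        return states[-1]

c = 256
unit = BasicUnit(4, {
    (0, 1): nn.Conv2d(c, c, 3, padding=2, dilation=2),  # hole conv, r=2
    (0, 2): nn.Identity(),
    (1, 3): nn.Conv2d(c, c, 3, padding=1),
    (2, 3): nn.Conv2d(c, c, 3, padding=2, dilation=2),  # hole conv, r=2
})
roi_features = torch.randn(1, c, 7, 7)      # RoI feature maps are small
assert unit(roi_features).shape == roi_features.shape  # resolution kept
```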
The network structure of the RCNN of the target detection network obtained finally is described below with reference to the drawings.
As shown in fig. 11, the basic units included in the RCNN and the number of nodes included in each basic unit are all the same as those of the RCNN shown in fig. 9, and the network architecture shown by the RCNN shown in fig. 11 may be the network architecture of the finally obtained RCNN.
Specifically, as shown in fig. 11, for basic unit 1 of the RCNN, the feature maps from two identical input nodes (H_{0}) are processed by intermediate nodes 0 to 3 and then weighted and summed at the output node to obtain the output (H_{1}) of basic unit 1. The operations from the input nodes to the intermediate nodes and from the intermediate nodes to the output node are all 5×5 convolutions (conv_5×5).
For basic unit 2 of the RCNN, the feature maps of input nodes H_{0} and H_{1} are processed by intermediate nodes 0 to 3 and then weighted and summed at the output node to obtain the output (H_{2}) of basic unit 2. The operations from the input nodes to the intermediate nodes and from the intermediate nodes to the output node are likewise 5×5 convolutions (conv_5×5).
The following describes the process of the method for constructing the object detection network according to the embodiment of the present application in more detail with reference to fig. 12.
Fig. 12 is a flowchart of a method for constructing an object detection network according to an embodiment of the present application. The method shown in fig. 12 may be executed by the apparatus for constructing the object detection network according to the embodiment of the present application, and the method shown in fig. 12 includes steps 2001 to 2006, which are described in detail below.
2001. Initialize the network architecture of the target detection network.
In step 2001, an initial network architecture of the target detection network may be determined.
Specifically, since the target detection network includes the backbone network, the feature fusion layer, the RPN, and the RCNN, in step 2001, an initial network architecture of the target detection network, that is, a network architecture of the backbone network, the feature fusion layer, the RPN, and the RCNN in the target detection network, needs to be determined.
The backbone network and the RPN may be pre-constructed networks; therefore, initializing the network structure of the target detection network is equivalent to initializing the network structures of the feature fusion layer and the RCNN.
After step 2001, the initial network structure of the feature fusion layer and the initial network structure of the RCNN may be determined.
2002a. Evaluate the performance of the feature fusion layer.
Specifically, in step 2002a, the performance of the feature fusion layer and the resources occupied by the feature fusion layer may be evaluated, so that the network architecture of the feature fusion layer may be updated subsequently according to the evaluation condition of the performance and the resources occupied by the feature fusion layer.
2002b. Evaluate the performance of the RCNN.
Similar to step 2002a, in step 2002b the performance and resource occupation of the RCNN may also be evaluated, so that the network architecture of the RCNN can subsequently be updated according to its performance and resource occupation.
Step 2002a and step 2002b may occur simultaneously or sequentially, and the order of occurrence of step 2002a and step 2002b is not limited in this application.
2003a. Update the network architecture of the feature fusion layer.
After the performance and the resource occupation of the feature fusion layer are obtained in step 2002a, the network architecture of the feature fusion layer may be updated according to the performance and the resource occupation of the feature fusion layer.
Specifically, when the performance of the feature fusion layer does not meet the requirement, the complexity of the feature fusion layer can be increased appropriately during the update so that the updated feature fusion layer performs better; when the feature fusion layer occupies too many resources, its complexity can be reduced during the update so that the updated feature fusion layer has a simpler structure and occupies fewer resources.
2003b. Update the network architecture of the RCNN.
After the performance and the resource occupation of the RCNN are obtained through the step 2002b, the network architecture of the RCNN may be updated according to the performance and the resource occupation of the RCNN.
Specifically, when the performance of the RCNN does not meet the requirement, the complexity of the RCNN may be increased when updating the RCNN, so that the updated RCNN has better performance; when the resources occupied by the RCNN are too much, the complexity of the RCNN can be reduced when the RCNN is updated, so that the updated RCNN has a simpler structure, and the resources occupied by the updated RCNN are reduced.
It should be understood that the above steps 2003a and 2003b may occur simultaneously or sequentially, and the sequence of steps 2003a and 2003b is not limited in this application.
2004. Update the network parameters of the target detection network.
After the network architectures of the feature fusion layer and the RCNN are updated, the network parameters of the target detection network may be updated, where the network parameters of the target detection network include network parameters of the backbone network, the feature fusion layer, the RPN, and the RCNN (the network parameters may specifically include convolution parameters).
2005. Determine whether the target detection network satisfies a preset condition.
Specifically, in step 2005, it may be determined whether the target detection network satisfies any of the following conditions:
(1) the detection performance of the target detection network meets a preset requirement;
(2) the number of network architecture updates of the target detection network reaches a preset number;
(3) the complexity of the target detection network is less than or equal to a preset complexity.
When the target detection network satisfies any one of the above conditions (1) to (3), it may be determined that the target detection network satisfies the preset requirement.
That the performance of the target detection network meets the preset requirement may specifically mean that the accuracy of the target detection network in target detection is greater than a certain accuracy threshold. For example, when the accuracy of target detection performed by the target detection network is greater than 60%, it is determined that the performance of the target detection network meets the preset requirement.
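For concreteness, the test in step 2005 might look like the sketch below; the thresholds and the state container are assumptions, not values fixed by this application (apart from the 60% example above).

```python
from dataclasses import dataclass

@dataclass
class SearchState:
    accuracy: float      # detection accuracy of the current network
    num_updates: int     # architecture updates performed so far
    complexity: float    # e.g. parameter count, MACs, or FLOPs

def meets_preset_condition(s: SearchState,
                           acc_threshold: float = 0.60,
                           max_updates: int = 100,
                           max_complexity: float = 50e6) -> bool:
    # The search may stop when ANY of conditions (1)-(3) holds.
    return (s.accuracy > acc_threshold          # condition (1)
            or s.num_updates >= max_updates     # condition (2)
            or s.complexity <= max_complexity)  # condition (3)
```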
When it is determined in step 2005 that the target detection network satisfies the preset condition, the process proceeds to step 2006, and the target detection network is output. When it is determined in step 2005 that the target detection network does not satisfy the preset condition, steps 2002a and 2002b are continuously performed to continuously update the network architecture of the target detection network.
2006. Output the target detection network.
After it is determined in step 2005 that the target detection network satisfies the preset condition, the construction of the target detection network is completed, and the target detection network can be output.
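Putting steps 2001 to 2006 together, the overall search loop can be summarized as follows. This is a skeleton under assumed interfaces: the evaluation, architecture-update, and weight-training routines are passed in as callables, because the patent does not prescribe a specific algorithm for them.

```python
def construct_detection_network(init_arch, evaluate, update_arch,
                                train_weights, meets_preset_condition):
    net = init_arch()                                  # step 2001
    while True:
        fusion_stats = evaluate(net, part="fusion")    # step 2002a
        rcnn_stats = evaluate(net, part="rcnn")        # step 2002b
        # Steps 2003a/2003b: grow a part whose performance falls
        # short; shrink a part that occupies too many resources.
        net = update_arch(net, fusion_stats, rcnn_stats)
        train_weights(net)                             # step 2004
        if meets_preset_condition(net):                # step 2005
            return net                                 # step 2006
```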
Fig. 13 is a schematic flowchart of an object detection method according to an embodiment of the present application. The method shown in fig. 13 may be performed by the object detection apparatus according to the embodiment of the present application, for example, the method shown in fig. 13 may be performed by the apparatus shown in fig. 15 or fig. 17.
The method shown in fig. 13 includes steps 3001 and 3002, which are described in detail below, along with associated content.
3001. Acquire an image.
3002. Process the image using the target detection network to obtain a target detection result of the image.
The target detection result of the image obtained in step 3002 includes the position of the detection target in the image and the classification result to which the detection target belongs.
The target detection network used in step 3002 may be constructed according to the method for constructing a target detection network in the embodiment of the present application. The above definition and explanation of the object detection network in the introduction of the embodiments of the present application also apply to the object detection network in step 3002.
Specifically, the target detection network used in step 3002 may be constructed according to the method shown in fig. 7 or fig. 12.
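As a usage illustration, steps 3001 and 3002 might be driven as below. The output convention (boxes, labels, scores) and the callable net are assumptions; only torchvision.io.read_image is a real library call.

```python
import torch
from torchvision.io import read_image

def detect(net, image_path):
    # Step 3001: acquire the image (uint8 CHW -> normalized float batch).
    image = read_image(image_path).float().unsqueeze(0) / 255.0
    # Step 3002: run the constructed target detection network; the
    # result gives each detected target's position and class.
    with torch.no_grad():
        boxes, labels, scores = net(image)
    return boxes, labels, scores
```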
According to the target detection method of the present application, the search space of the feature fusion layer used when constructing the target detection network contains more selectable connection relationships for the feature fusion layer. The feature fusion layer determined from this search space can therefore perform feature fusion better, so that the final target detection network performs better in target detection.
The initial network architecture of the target detection network in the method shown in fig. 13 is determined according to the search space of the target detection network, and the final network architecture of the target detection network may be obtained by iteratively updating the initial network architecture of the target detection network according to the search space of the target detection network.
When the target detection network adopted in the method shown in fig. 13 is obtained, the search space of the target detection network may include a search space of the feature fusion layer and/or a search space of the RCNN.
When the search space of the target detection network comprises the search space of the feature fusion layer, the feature fusion layer in the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer, and the final network architecture of the feature fusion layer in the target detection network is obtained by iteratively updating the network architecture of the feature fusion layer in the initial network architecture of the target detection network according to the search space of the feature fusion layer.
When the search space of the target detection network comprises the search space of the RCNN, the RCNN in the initial network architecture of the target detection network is determined according to the search space of the RCNN, and the final network architecture of the RCNN in the target detection network is obtained by iteratively updating the network architecture of the RCNN in the initial network architecture of the target detection network according to the search space of the RCNN.
Optionally, the search space of the target detection network includes a search space of a feature fusion layer, the search space of the feature fusion layer includes a selectable connection relationship of the feature fusion layer, and the selectable connection relationship of the feature fusion layer includes a connection between any one node of one neural network and any one node of another neural network in two adjacent neural networks of the multi-layer neural network of the feature fusion layer.
Optionally, the search space of the feature fusion layer further includes a selectable operation type of the feature fusion layer, where the selectable operation type of the feature fusion layer includes a convolution operation corresponding to a connection of any node of one layer of neural network in two adjacent layers of neural networks in the multilayer neural network and any node of another layer of neural network, and the convolution operation includes a hole convolution operation.
When the selectable operation types in the feature fusion layer include hole convolution operations, substantially the same target detection performance can be achieved with fewer convolution parameters. In addition, for the same number of convolution parameters, convolution processing with hole convolution obtains a larger receptive field than conventional convolution, so better target detection performance can be obtained when the target detection network is used for target detection.
Optionally, the RCNN includes a plurality of basic units, each basic unit in the plurality of basic units is configured by at least two nodes, the search space of the target detection network further includes a search space of the RCNN, the search space of the RCNN includes the search space of each basic unit in the plurality of basic units, the search space of each basic unit includes the selectable connection relationship of each basic unit, and the selectable connection relationship of each basic unit includes a connection between any two nodes in each basic unit; the RCNN in the initial network architecture of the target detection network is determined according to a search space of the RCNN.
Optionally, the search space of each basic unit further includes a selectable operation type of each basic unit, where the selectable operation type of each basic unit includes a convolution operation corresponding to a connection between any two nodes in each basic unit, and the convolution operation includes a hole convolution operation.
When the optional operation types in the RCNN include hole convolution operations, substantially the same target detection performance can be achieved with fewer convolution parameters. In addition, when the optional operation type of the RCNN includes a hole convolution operation, better target detection performance can be achieved under the condition that the convolution parameter quantities are substantially the same.
Specifically, for the same number of convolution parameters, convolution processing with hole convolution obtains a larger receptive field than conventional convolution, so the target detection network has better target detection performance when used for target detection.
Optionally, the convolution operation corresponding to the connection between any two nodes in each basic unit includes a hole convolution operation with an interval number of 2.
Because the resolution of the feature map processed in the RCNN is generally low, it is more suitable to adopt the hole convolution operation with a small number of intervals, and for each basic unit in the RCNN, when the optional operation includes the hole convolution with the number of intervals of 2, the target detection performance of the target detection network can be finally improved.
Optionally, at least two basic units in the plurality of basic units are respectively composed of nodes with different numbers.
Optionally, the resolution of the input feature map of each basic unit is the same as the resolution of the output feature map of each basic unit.
When the resolution of the input feature map of each basic unit in the RCNN is the same as the resolution of its output feature map, the resolution of the feature map is not changed when each basic unit in the RCNN processes the feature map, which helps preserve the information in the feature map and yields a better target detection effect when the target detection network is used for target detection.
Optionally, the target detection network satisfies at least one of the following conditions: the detection performance of the target detection network meets a preset performance requirement; the number of network architecture updates of the target detection network is greater than or equal to a preset number; the complexity of the target detection network is less than or equal to a preset complexity.
Optionally, the complexity of the target detection network is determined according to at least one of the number or size of the model parameters of the target detection network, the memory access cost MAC of the target detection network, and the number of floating point operations of the target detection network.
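Of the three indicators, the parameter statistics are directly computable from a model; MAC and floating-point operation counts normally require a forward-pass profiler. A minimal helper (an assumption for illustration) is:

```python
import torch.nn as nn

def parameter_stats(model: nn.Module):
    # Number and size of the model parameters; MAC and FLOP counts
    # would additionally need a forward-pass profiling tool.
    num = sum(p.numel() for p in model.parameters())
    size_bytes = sum(p.numel() * p.element_size()
                     for p in model.parameters())
    return {"num_params": num, "param_bytes": size_bytes}
```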
To illustrate the effect of the method for constructing a target detection network in the embodiments of the present application, the complexity of the neural networks obtained with this method and the accuracy of target detection performed with them are analyzed below using specific test results.
TABLE 5
(The contents of table 5 are provided as an image in the original publication and are not reproduced here.)
Table 5 shows the complexity of the target detection network obtained with different schemes.
The specific schemes of the existing schemes 1 to 6 are as follows:
the existing scheme 1: lead anchoring faster rcnn (guided anchoring master rcnn);
existing scheme 2: path aggregation network (path aggregation network) was published in CVPR in 2018;
existing scheme 3: real-time object detection with region-aware protocol Networks (NIPS) using the area-probing network, published in 2015;
existing scheme 4: feature pyramid network (feature pyramid network), published in the CVPR in 2017;
existing scheme 5: a relationship network for object detection (CVPR), published in the CVPR in 2018;
existing scheme 6: loss of focus (focal length for dense object detection) was published in ICCV in 2017.
As can be seen from table 5, when testing on the COCO data set with ResNet-50 and ResNet-101 as the backbone network, respectively, the average accuracy (mAP) of target detection performed with the neural network obtained by the present application is higher than that of the neural networks obtained with the existing schemes.
Specifically, when the backbone network is ResNet-50, the average accuracy of target detection with the neural network obtained by the present application is 40.5, while the average accuracy with the neural network obtained by existing scheme 1 or existing scheme 2 is 39.8; compared with existing schemes 1 and 2, the average accuracy is therefore improved by 40.5 - 39.8 = 0.7.
When the backbone network is ResNet-101, the average accuracy of target detection with the neural network obtained by the scheme of the present application is 42.5, while the average accuracy with the neural networks obtained by existing schemes 3 to 6 is at most 39.1; compared with existing schemes 3 to 6, the average accuracy is therefore improved by at least 42.5 - 39.1 = 3.4 (compared with existing scheme 3, whose average accuracy is 34.9, the improvement is 42.5 - 34.9 = 7.6).
TABLE 6
(The contents of table 6 are provided as an image in the original publication and are not reproduced here.)
Table 6 shows the number of parameters of the target detection networks constructed by the existing scheme and by the scheme of the present application, together with the average accuracy of target detection with those networks. As shown in table 6, the total number of parameters of the target detection networks constructed by the scheme of the present application on the three data sets (the VOC data set, the COCO data set, and the BDD data set) is lower than that of the target detection network constructed by the existing scheme. In addition, the mAP on the test data set (PASCAL VOC data set) of the target detection networks constructed by the scheme of the present application on the three data sets is also higher than the mAP on the same test data set of the target detection network constructed by the existing scheme.
TABLE 7
(The contents of table 7 are provided as an image in the original publication and are not reproduced here.)
Similar to table 6, table 7 also shows the parameters included in the target detection network constructed by the existing scheme and the scheme of the present application, and the average accuracy when the target detection network constructed by the existing scheme and the scheme of the present application is used for target detection.
As shown in table 7, the total number of parameters of the target detection networks constructed by the scheme of the present application on the three data sets (the BDD data set, the COCO data set, and the VOC data set) is lower than that of the target detection network constructed by the existing scheme. In addition, the mAP on the test data set (BDD data set) of the target detection networks constructed by the scheme of the present application on the three data sets is higher than the mAP on the same test data set of the target detection network constructed by the existing scheme.
The method for constructing the object detection network and the method for detecting the object in the embodiment of the present application are described in detail with reference to the accompanying drawings, and the following describes the apparatus for constructing the object detection network and the apparatus for detecting the object in the embodiment of the present application with reference to the accompanying drawings.
It is to be understood that the construction apparatus of the object detection network described hereinafter is capable of executing the respective steps of the construction method of the object detection network of the embodiment of the present application, and the object detection apparatus described hereinafter is capable of executing the respective steps of the object detection method of the embodiment of the present application, and the repetitive description thereof is appropriately omitted below when describing the construction apparatus of the object detection network and the object detection apparatus of the embodiment of the present application.
Fig. 14 is a schematic block diagram of a construction apparatus of an object detection network according to an embodiment of the present application. The apparatus 5000 shown in fig. 14 includes a determining unit 5001 and a constructing unit 5002.
The apparatus 5000 may perform the steps of the method for constructing the target detection network according to the embodiment of the present application, and specifically, the apparatus 5000 may perform the method shown in fig. 7 or the method shown in fig. 12.
Specifically, when the apparatus 5000 executes the method shown in fig. 7, the determining unit 5001 may be specifically configured to execute steps 1001 and 1002, and the constructing unit 5002 may be configured to execute step 1003.
When the apparatus 5000 performs the method shown in fig. 12, the determining unit 5001 may be specifically configured to perform step 2001, and the constructing unit 5002 may be configured to perform steps 2002 to 2006, where step 2002 includes steps 2002a and 2002b, and step 2003 includes steps 2003a and 2003 b.
The determining unit 5001 and the constructing unit 5002 in the apparatus 5000 described above correspond to the processor 3002 in the apparatus 3000 shown in fig. 16 below.
Fig. 15 is a schematic block diagram of an object detection apparatus according to an embodiment of the present application. The apparatus 6000 shown in fig. 15 includes an acquiring unit 6001 and a detecting unit 6002.
The apparatus 6000 may perform the steps of the target detection method according to the embodiments of the present application; specifically, the apparatus 6000 may perform the method shown in fig. 13.
Specifically, when the apparatus 6000 performs the method shown in fig. 13, the acquiring unit 6001 may be configured to perform step 3001, and the detecting unit 6002 may be configured to perform step 3002.
The acquiring unit 6001 and the detecting unit 6002 in the apparatus 6000 described above correspond to the processor 4002 in the apparatus 4000 shown in fig. 17 below.
Fig. 16 is a schematic hardware configuration diagram of a neural network structure search apparatus according to an embodiment of the present application. The neural network structure search apparatus 3000 shown in fig. 16 (the apparatus 3000 may specifically be a computer device) includes a memory 3001, a processor 3002, a communication interface 3003, and a bus 3004. The memory 3001, the processor 3002, and the communication interface 3003 are communicatively connected to each other via the bus 3004.
The memory 3001 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 3001 may store a program, and the processor 3002 is configured to execute the steps of the method of constructing the object detection network according to the embodiment of the present application when the program stored in the memory 3001 is executed by the processor 3002.
The processor 3002 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the method for constructing the target detection network according to the embodiments of the present application.
The processor 3002 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method for constructing the object detection network of the present application may be implemented by integrated logic circuits of hardware in the processor 3002 or by instructions in the form of software.
The processor 3002 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or a register. The storage medium is located in the memory 3001, and the processor 3002 reads the information in the memory 3001 and, in combination with its hardware, completes the functions to be performed by the units included in the neural network structure search apparatus, or performs the method for constructing the target detection network according to the embodiments of the present application.
The communication interface 3003 enables communication between the apparatus 3000 and other devices or communication networks using transceiver means such as, but not limited to, a transceiver. For example, information of the neural network to be constructed and training data required in constructing the neural network may be acquired through the communication interface 3003.
The bus 3004 may include a pathway to transfer information between various components of the apparatus 3000 (e.g., memory 3001, processor 3002, communication interface 3003).
Fig. 17 is a schematic hardware configuration diagram of an object detection apparatus according to an embodiment of the present application. The object detection apparatus 4000 shown in fig. 17 includes a memory 4001, a processor 4002, a communication interface 4003, and a bus 4004. The memory 4001, the processor 4002 and the communication interface 4003 are communicatively connected to each other via a bus 4004.
Memory 4001 may be a ROM, a static storage device, and a RAM. The memory 4001 may store programs, and when the programs stored in the memory 4001 are executed by the processor 4002, the processor 4002 and the communication interface 4003 are used to execute the respective steps of the object detection method of the embodiment of the present application.
The processor 4002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the functions required by the units in the target detection apparatus according to the embodiments of the present application, or to perform the target detection method according to the embodiments of the present application.
Processor 4002 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the target detection method of the embodiment of the present application may be implemented by an integrated logic circuit of hardware in the processor 4002 or an instruction in the form of software.
The processor 4002 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the target detection method disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or a register. The storage medium is located in the memory 4001, and the processor 4002 reads the information in the memory 4001 and, in combination with its hardware, completes the functions to be performed by the units included in the target detection apparatus of the embodiments of the present application, or performs the target detection method of the embodiments of the present application.
Communication interface 4003 enables communication between apparatus 4000 and other devices or a communication network using transceiver means such as, but not limited to, a transceiver. For example, the image to be processed may be acquired through the communication interface 4003.
Bus 4004 may include a pathway to transfer information between various components of apparatus 4000 (e.g., memory 4001, processor 4002, communication interface 4003).
It should be noted that although the apparatus 3000 and the apparatus 4000 described above show only a memory, a processor, and a communication interface, in a specific implementation process, those skilled in the art will understand that the apparatus 3000 and the apparatus 4000 may further include other devices necessary for normal operation. Also, according to specific needs, those skilled in the art will understand that the apparatus 3000 and the apparatus 4000 may further include hardware components for performing other additional functions. Moreover, those skilled in the art will understand that the apparatus 3000 and the apparatus 4000 may also include only the components necessary to implement the embodiments of the present application, and need not include all of the components shown in figs. 16 and 17.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (38)

1. A method for constructing a target detection network, wherein the target detection network comprises a backbone network, a feature fusion layer, a region proposal network (RPN), and a region-based convolutional neural network (RCNN), and the method is characterized by comprising the following steps:
determining a search space of the target detection network, wherein the search space of the target detection network comprises a search space of the feature fusion layer, the search space of the feature fusion layer comprises a selectable connection relation of the feature fusion layer, and the selectable connection relation of the feature fusion layer comprises a connection of any node of one layer of neural network and any node of the other layer of neural network in two adjacent layers of neural networks of the feature fusion layer;
determining an initial network architecture of the target detection network according to a search space of the target detection network, wherein a feature fusion layer in the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer;
and according to the search space of the target detection network, iteratively updating the initial network architecture of the target detection network until the target detection network meeting the preset requirement is obtained.
2. The method according to claim 1, wherein the iteratively updating the initial network architecture of the target detection network according to the search space of the target detection network until the target detection network satisfying a preset requirement is obtained includes:
and according to the search space of the target detection network, iteratively updating the initial network architecture of the target detection network to reduce the value of a loss function corresponding to the target detection network, so as to obtain the target detection network meeting the preset requirement, wherein the loss function comprises a target detection error of the target detection network and/or the complexity of the target detection network.
3. The construction method according to claim 1 or 2, wherein the search space of the feature fusion layer further includes a selectable operation type of the feature fusion layer, the selectable operation type of the feature fusion layer includes a convolution operation corresponding to connection of any one node of one layer of the neural network and any one node of the other layer of the neural network in two adjacent layers of the multilayer neural network, and the convolution operation includes a hole convolution operation.
4. The construction method according to any one of claims 1-3, wherein the RCNN comprises a plurality of basic units, each basic unit of the plurality of basic units is composed of at least two nodes, the search space of the target detection network further comprises the search space of the RCNN, the search space of the RCNN comprises the search space of each basic unit of the plurality of basic units, the search space of each basic unit comprises the selectable connection relation of each basic unit, and the selectable connection relation of each basic unit comprises the connection between any two nodes in each basic unit;
the RCNN in an initial network architecture of the target detection network is determined according to a search space of the RCNN.
5. The construction method according to claim 4, wherein the search space of each basic unit further includes a selectable operation type of each basic unit, the selectable operation type of each basic unit includes a convolution operation corresponding to a connection between any two nodes in each basic unit, and the convolution operation includes a hole convolution operation.
6. The construction method according to claim 5, wherein the convolution operation corresponding to the connection between any two nodes in each basic unit comprises a hole convolution operation with an interval number of 2.
7. The construction method according to any one of claims 4 to 6, wherein at least two basic units of the plurality of basic units are respectively constituted by different numbers of nodes.
8. The construction method according to any one of claims 4 to 7, wherein the resolution of the input feature map of each basic unit is the same as the resolution of the output feature map of each basic unit.
9. The construction method according to any one of claims 2 to 8, wherein the target detection network satisfying preset requirements satisfies at least one of the following conditions:
the detection performance of the target detection network meets the preset performance requirement;
the number of updates to the network architecture of the target detection network is greater than or equal to a preset number;
the complexity of the target detection network is less than or equal to a preset complexity.
10. The method of claim 9, wherein the complexity of the target detection network is determined according to at least one of a number or a size of model parameters of the target detection network, a memory access cost MAC of the target detection network, and a number of floating point operations of the target detection network.
11. A method for constructing a target detection network, wherein the target detection network comprises a backbone network, a feature fusion layer, a region proposal network (RPN), and a region-based convolutional neural network (RCNN), and the method is characterized by comprising the following steps:
determining a search space of the target detection network, wherein the RCNN includes a plurality of basic units, each basic unit of the plurality of basic units is composed of at least two nodes, the search space of the target detection network includes the search space of the RCNN, the search space of the RCNN includes the search space of each basic unit of the plurality of basic units, the search space of each basic unit includes a selectable connection relation of each basic unit, and the selectable connection relation of each basic unit includes a connection between any two nodes in each basic unit;
determining an initial network architecture of the target detection network according to the search space of the target detection network, wherein the RCNN in the initial network architecture of the target detection network is determined according to the search space of the RCNN;
and according to the search space of the target detection network, iteratively updating the initial network architecture of the target detection network until the target detection network meeting the preset requirement is obtained.
12. The method according to claim 11, wherein the iteratively updating the initial network architecture of the target detection network according to the search space of the target detection network until the target detection network satisfying a preset requirement is obtained includes:
and according to the search space of the target detection network, iteratively updating the initial network architecture of the target detection network to reduce the value of a loss function corresponding to the target detection network, so as to obtain the target detection network meeting the preset requirement, wherein the loss function comprises a target detection error of the target detection network and/or the complexity of the target detection network.
13. The construction method according to claim 11 or 12, wherein the search space of each basic unit further includes a selectable operation type of each basic unit, the selectable operation type of each basic unit includes a convolution operation corresponding to a connection between any two nodes in each basic unit, and the convolution operation includes a hole convolution operation.
14. The construction method according to claim 13, wherein the hole convolution operation includes a hole convolution operation of an interval number of 2.
15. The construction method according to any one of claims 11 to 14, wherein at least two basic units of the plurality of basic units are respectively constituted by different numbers of nodes.
16. The construction method according to any one of claims 11 to 15, wherein the resolution of the input feature map of each basic unit is the same as the resolution of the output feature map of each basic unit.
17. The construction method according to any one of claims 11 to 16, wherein the target detection network satisfying preset requirements satisfies at least one of the following conditions:
the detection performance of the target detection network meets the preset performance requirement;
the number of updates to the network architecture of the target detection network is greater than or equal to a preset number;
the complexity of the target detection network is less than or equal to a preset complexity.
18. The method of claim 17, wherein the complexity of the target detection network is determined according to at least one of a number or a size of model parameters of the target detection network, a memory access cost MAC of the target detection network, and a number of floating point operations of the target detection network.
19. A method of object detection, comprising:
acquiring an image;
processing the image by adopting a target detection network to obtain a target detection result of the image, wherein the target detection result comprises the position of a detection target in the image and a classification result of the detection target;
the target detection network comprises a backbone network, a feature fusion layer, a region proposal network (RPN), and a region-based convolutional neural network (RCNN), the target detection network meets preset requirements, the target detection network is obtained by iteratively updating an initial network architecture of the target detection network according to a search space of the target detection network, and the initial network architecture of the target detection network is determined according to the search space of the target detection network;
the search space of the target detection network comprises the search space of the feature fusion layer, the feature fusion layer in the initial network architecture of the target detection network is determined according to the search space of the feature fusion layer, the search space of the feature fusion layer comprises the optional connection relation of the feature fusion layer, and the optional connection relation of the feature fusion layer comprises the connection of any node of one layer of neural network in two adjacent layers of neural networks in the multilayer neural network with any node in the other layer of neural network.
20. The object detection method of claim 19, wherein the search space of the feature fusion layer further includes a selectable operation type of the feature fusion layer, the selectable operation type of the feature fusion layer includes a convolution operation corresponding to a connection of any one node of one layer of the neural network with any one node of the other layer of the neural network in two adjacent layers of the multi-layer neural network, wherein the convolution operation includes a hole convolution operation.
21. The object detection method according to claim 19 or 20, wherein the RCNN includes a plurality of basic units, each basic unit of the plurality of basic units is composed of at least two nodes, the search space of the target detection network further includes the search space of the RCNN, the search space of the RCNN includes the search space of each basic unit of the plurality of basic units, the search space of each basic unit includes the selectable connection relation of each basic unit, and the selectable connection relation of each basic unit includes a connection between any two nodes within each basic unit;
the RCNN in an initial network architecture of the target detection network is determined according to a search space of the RCNN.
22. The object detection method of claim 21, wherein the search space of each basic unit further includes a selectable operation type of the each basic unit, the selectable operation type of the each basic unit includes a convolution operation corresponding to a connection between any two nodes in the each basic unit, and the convolution operation includes a hole convolution operation.
23. The object detection method of claim 22, wherein the hole convolution operation includes a hole convolution operation of an interval number of 2.
24. The object detection method of any one of claims 21-23, wherein at least two basic units of the plurality of basic units are respectively composed of different numbers of nodes.
25. The object detection method of any one of claims 21-24, wherein the resolution of the input feature map of each elementary unit is the same as the resolution of the output feature map of each elementary unit.
26. The object detection method according to any of claims 19-25, wherein the target detection network satisfies at least one of the following conditions:
the detection performance of the target detection network meets the preset performance requirement;
the number of updates to the network architecture of the target detection network is greater than or equal to a preset number;
the complexity of the target detection network is less than or equal to a preset complexity.
27. The target detection method of claim 26, wherein the complexity of the target detection network is determined according to at least one of: the number or size of the model parameters of the target detection network, the memory access cost (MAC) of the target detection network, and the number of floating-point operations (FLOPs) of the target detection network.
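The complexity measures of claim 27 can be sketched for any PyTorch model as below. The FLOP count here is a simplified per-layer multiply-accumulate estimate, an assumption of this sketch rather than the patent's own formula.

```python
# Simplified complexity measures: parameter count, parameter size in bytes,
# and a per-layer multiply-accumulate (FLOP) estimate for a convolution.
import torch.nn as nn

def parameter_count(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def parameter_size_bytes(model: nn.Module) -> int:
    return sum(p.numel() * p.element_size() for p in model.parameters())

def conv_flops(conv: nn.Conv2d, out_h: int, out_w: int) -> int:
    # multiply-accumulates for one conv layer, ignoring the bias term
    k_h, k_w = conv.kernel_size
    return (conv.in_channels // conv.groups) * k_h * k_w \
        * conv.out_channels * out_h * out_w

layer = nn.Conv2d(3, 64, 3, padding=1)
print(parameter_count(layer))         # 3*64*3*3 + 64 = 1792
print(conv_flops(layer, 224, 224))    # MACs for a 224x224 output map
```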
28. A target detection method, comprising:
acquiring an image;
processing the image with a target detection network to obtain a target detection result of the image, wherein the target detection result comprises the position of a detection target in the image and a classification result for the detection target;
wherein the target detection network comprises a backbone network, a feature fusion layer, a region proposal network (RPN), and a region-based convolutional neural network (RCNN); the target detection network meets a preset requirement; the target detection network is obtained by iteratively updating an initial network architecture of the target detection network according to a search space of the target detection network; and the initial network architecture of the target detection network is determined according to the search space of the target detection network;
the RCNN includes a plurality of basic units, each of the plurality of basic units is composed of at least two nodes, the search space of the target detection network includes a search space of the RCNN, the search space of the RCNN includes a search space of each of the plurality of basic units, the search space of each basic unit includes selectable connection relationships of that basic unit, the selectable connection relationships of each basic unit include a connection between any two nodes within that basic unit, and the RCNN in the initial network architecture of the target detection network is determined according to the search space of the RCNN.
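A minimal sketch of the inference flow of claim 28 follows, using torchvision's stock Faster R-CNN purely as a stand-in for the searched target detection network (the patent leaves the concrete architecture to the search procedure): acquire an image, run it through backbone, feature fusion, RPN and RCNN, and read off positions and classification results.

```python
# Hedged inference sketch; the stock torchvision detector stands in for the
# searched network and is NOT the patent's architecture.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)       # stand-in for an acquired image in [0, 1]
with torch.no_grad():
    result = detector([image])[0]     # backbone -> fusion -> RPN -> RCNN

# positions of the detected targets and their classification results
print(result["boxes"].shape, result["labels"].shape, result["scores"].shape)
```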
29. The target detection method of claim 28, wherein the search space of each basic unit further includes selectable operation types of that basic unit, the selectable operation types of each basic unit include a convolution operation corresponding to the connection between any two nodes in that basic unit, and the convolution operation includes a dilated convolution operation.
30. The target detection method of claim 29, wherein the dilated convolution operation includes a dilated convolution with a dilation rate of 2.
31. The target detection method of any one of claims 28-30, wherein at least two of the plurality of basic units are composed of different numbers of nodes.
32. The target detection method of any one of claims 28-31, wherein the resolution of the input feature map of each basic unit is the same as the resolution of the output feature map of that basic unit.
33. The target detection method according to any one of claims 28-32, wherein the target detection network satisfies at least one of the following conditions:
the detection performance of the target detection network meets a preset performance requirement;
the number of updates to the network architecture of the target detection network is greater than or equal to a preset number;
the complexity of the target detection network is less than or equal to a preset complexity.
34. The target detection method of claim 33, wherein the complexity of the target detection network is determined according to at least one of: the number or size of the model parameters of the target detection network, the memory access cost (MAC) of the target detection network, and the number of floating-point operations (FLOPs) of the target detection network.
35. An apparatus for constructing a target detection network, comprising:
a memory for storing a program; and
a processor for executing the program stored in the memory, the processor being configured to perform the construction method of any one of claims 1-10 or 11-18 when the program stored in the memory is executed.
36. A target detection apparatus, comprising:
a memory for storing a program; and
a processor for executing the program stored in the memory, the processor being configured to perform the target detection method of any one of claims 19-27 or 28-34 when the program stored in the memory is executed.
37. A computer-readable storage medium, wherein the computer-readable storage medium stores program code for execution by a device, the program code comprising instructions for performing the construction method of any one of claims 1-10 or 11-18.
38. A computer-readable storage medium, wherein the computer-readable storage medium stores program code for execution by a device, the program code comprising instructions for performing the target detection method of any one of claims 19-27 or 28-34.
CN201910857984.8A 2019-09-09 2019-09-09 Target detection network construction method, target detection method, device and storage medium Pending CN112464930A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910857984.8A CN112464930A (en) 2019-09-09 2019-09-09 Target detection network construction method, target detection method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910857984.8A CN112464930A (en) 2019-09-09 2019-09-09 Target detection network construction method, target detection method, device and storage medium

Publications (1)

Publication Number Publication Date
CN112464930A true CN112464930A (en) 2021-03-09

Family

ID=74807601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910857984.8A Pending CN112464930A (en) 2019-09-09 2019-09-09 Target detection network construction method, target detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112464930A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906814A (en) * 2021-03-10 2021-06-04 江苏禹空间科技有限公司 Target detection method and system based on NAS network
EP4057183A1 (en) * 2021-03-10 2022-09-14 Aptiv Technologies Limited Methods and systems for object detection
CN113344086A (en) * 2021-06-16 2021-09-03 深圳市商汤科技有限公司 Man-machine loop method, device, system, electronic equipment and storage medium
CN113850284A (en) * 2021-07-04 2021-12-28 天津大学 Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction
CN113850284B (en) * 2021-07-04 2023-06-23 天津大学 Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction
CN113673704A (en) * 2021-07-05 2021-11-19 中国电子科技集团公司第十五研究所 Relational network reasoning optimization method based on software and hardware cooperative acceleration
CN113673704B (en) * 2021-07-05 2022-07-01 中国电子科技集团公司第十五研究所 Relational network reasoning optimization method based on software and hardware cooperative acceleration

Similar Documents

Publication Publication Date Title
CN110175671B (en) Neural network construction method, image processing method and device
CN110378381B (en) Object detection method, device and computer storage medium
CN110298262B (en) Object identification method and device
CN110188795B (en) Image classification method, data processing method and device
CN110070107B (en) Object recognition method and device
WO2021043112A1 (en) Image classification method and apparatus
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
CN112445823A (en) Searching method of neural network structure, image processing method and device
CN111291809B (en) Processing device, method and storage medium
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN111368972B (en) Convolutional layer quantization method and device
CN111797983A (en) Neural network construction method and device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN113011575A (en) Neural network model updating method, image processing method and device
CN112418392A (en) Neural network construction method and device
CN110222717B (en) Image processing method and device
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
CN111401517B (en) Method and device for searching perceived network structure
CN111310604A (en) Object detection method and device and storage medium
CN113705769A (en) Neural network training method and device
CN110222718B (en) Image processing method and device
CN114255361A (en) Neural network model training method, image processing method and device
CN113570029A (en) Method for obtaining neural network model, image processing method and device
CN112215332A (en) Searching method of neural network structure, image processing method and device
CN111914997A (en) Method for training neural network, image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination