CN114972182A - Object detection method and device - Google Patents

Object detection method and device

Info

Publication number
CN114972182A
CN114972182A (Application No. CN202210395863.8A)
Authority
CN
China
Prior art keywords
confidence
target
depth value
depth
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210395863.8A
Other languages
Chinese (zh)
Inventor
屈展
李卓凌
周洋
黄梓钊
袁凯轮
刘健庄
江立辉
张洪波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202210395863.8A
Publication of CN114972182A
Legal status: Pending (Current)

Classifications

    • G06T 7/00 — Image analysis
    • G06T 7/0002 — Inspection of images, e.g. flaw detection
    • G06T 7/50 — Depth or shape recovery
    • G06N 20/00 — Machine learning
    • G06N 3/02 — Neural networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods

Abstract

The application discloses an object detection method, which relates to the field of artificial intelligence and comprises the following steps: acquiring a target image; detecting an object in the target image through a machine learning model, and outputting a first depth value of the object and target information, wherein the target information comprises at least one of the following: geometric features of the object, key point information of the object, and position features of the object; predicting at least one second depth value of the object according to the target information; and fusing the first depth value and the at least one second depth value to obtain the target depth of the object. Because the depth of the target center point is estimated, based on different principles, from the related attribute information (target information) output by the machine learning model, at least one differentiated and diversified depth prediction can be obtained, and a more accurate depth value of the object can be determined from the plurality of depth values obtained in these different ways.

Description

Object detection method and device
Technical Field
The application relates to the field of artificial intelligence, in particular to an object detection method and device.
Background
Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis and military affairs. It studies how to use cameras/video cameras and computers to acquire the data and information about a photographed object that we need. Figuratively speaking, the computer is given eyes (a camera or video camera) and a brain (an algorithm) so that it can identify, track and measure objects in place of human eyes, enabling the computer to perceive the environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make an artificial system "perceive" from images or multidimensional data. Generally, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses the computer in place of the brain to process and interpret that information. The ultimate research goal of computer vision is to enable a computer to observe and understand the world visually like a human and to adapt to the environment autonomously.
At present, visual perception networks have more and more functions, including image classification, 2D detection, semantic segmentation (Mask), key point detection, linear-object detection (such as lane line or stop line detection in automatic driving), drivable area detection, and the like. In addition, a visual perception system has the advantages of low cost, non-contact operation, small size and a large amount of information. With the continuous improvement of the accuracy of visual perception algorithms, they have become a key technology of many current artificial intelligence systems and are widely applied; for example, in Advanced Driving Assistance Systems (ADAS) and Automatic Driving Systems (ADS), they identify dynamic obstacles (people or vehicles) and static objects (traffic lights, traffic signs or traffic cones) on the road surface, and in the terminal photographing and beautifying function of a mobile phone, they identify the masks and key points of the human body to realize slimming effects and the like.
In both Automatic Driving Systems (ADS) and Advanced Driving Assistance Systems (ADAS), a vehicle needs to perceive the other traffic participants in the road environment in real time to ensure driving safety and support its next driving decision. Among all the sensors of an intelligent vehicle, the camera provides the information with the highest resolution and the richest detail. In addition, a visual perception system has the advantages of low cost and small size, so perceiving and modeling the other traffic participants near the current vehicle through visual signals is a key technology in the intelligent-vehicle perception system. Target detection of surrounding vehicles and pedestrians is the most important perception function and is already widely used, for example: estimating the positions of vehicles in the surrounding environment in ADAS and ADS so as to avoid collisions and keep a safe distance; and predicting the orientation and size of surrounding vehicles and, on that basis, their future motion trajectories, so that the autonomous vehicle can plan a safe and comfortable travel route, and the like.
Detecting traffic participants in an image and obtaining the relevant information is the most challenging problem in camera-based autonomous driving perception systems. Traffic participants come in many types, and environmental factors such as their distance from the camera, the degree of occlusion and the illumination introduce further interference. Improving the accuracy of target detection is therefore the key for an intelligent vehicle to accurately perceive the positions of surrounding traffic participants and make its next driving decision.
With the development of deep learning technology, using Convolutional Neural Networks (CNNs) for target detection has become a trend in the field. In existing research, according to how the target depth information is obtained, target detection methods can generally be divided into two kinds: direct estimation based on a neural network, and geometric inference based on estimating key points of the target and other target information. The direct-estimation methods require the neural network to learn the depth of each target relative to the camera from the target's appearance in the image, and in the inference stage the depth, orientation and size of the target are obtained directly from the network.
Disclosure of Invention
The application provides an object detection method, which can improve the precision of depth values obtained when an object in an image is detected.
In a first aspect, the present application provides an object detection method, including: acquiring a target image; detecting an object in the target image through a machine learning model, and outputting a first depth value of the object and target information, wherein the target information comprises at least one of the following information: geometric features of the object, keypoint information of the object, and location features of the object; predicting at least one second depth value of the object according to the target information; and fusing the first depth value and the at least one second depth value to obtain the target depth of the object.
In one possible implementation, the target image may include at least one object, and the object in this embodiment may be any one of the objects in the target image.
In a possible implementation, the inference accuracy of the machine learning model alone is not high enough, and once the camera used to capture the image or the shooting scene changes, the accuracy often drops significantly. In the embodiment of the present application, the depth of the center point of the target is estimated, based on different principles, from the related attribute information of the object (i.e., the target information in the embodiment of the present application) output by the machine learning model, so that at least one differentiated and diversified depth prediction can be obtained, and a more accurate depth value of the object can be determined from the plurality of depth values of the object obtained in these different ways.
In one possible implementation, after the detecting of the object in the target image through the machine learning model, the method further includes: outputting a first confidence of the first depth value and a second confidence of each of the second depth values; and fusing the first confidence and the at least one second confidence to obtain a target confidence of the target depth.
By using different theoretical assumptions and different combinations of target attributes, the method and the device simultaneously generate multiple groups of depth estimates together with the uncertainty of each estimate. In subsequent processing, depth predictions with high reliability are selected and fused according to their uncertainty, while unreliable, erroneous predictions are discarded, so that a robust depth estimation result is finally produced. This greatly improves the detection robustness and accuracy of the system in different environments; the uncertainty also helps the system generate a more reliable evaluation score for the prediction result, further improving detection accuracy and system interpretability.
In one possible implementation, the at least one second depth value of the object may be predicted by an N-point perspective algorithm PnP.
Specifically, the keypoint information in the target information may include 2D positions of a plurality of keypoints of the object on the image, the geometric feature of the object includes a physical size of the object, and the positional feature of the object includes an orientation angle of the object. And predicting at least one second depth value of the object by an N-point perspective algorithm PnP according to the target information.
In the present embodiment, the depth of the center point of the object is of interest, so the 3D center point of the object can also be defined as a key point. Given its 2D position (u_c, v_c) on the image, the location of the object center point can be expressed in terms relating only to z:

[x_c, y_c, z_c] = [ (u_c − c_x)·z/f_x, (v_c − c_y)·z/f_y, z ]

where (f_x, f_y, c_x, c_y) are the camera intrinsic parameters. Based on this, each key point can now be expressed as 2 linear equations with respect to the z value, so that two sets of estimates of the z value of the target center point can be calculated.
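As an illustration of how a single key point yields linear equations in the center depth, the following sketch solves for z under the pinhole camera model. The function name, the closed-form rearrangement and all numeric values are assumptions made for illustration; this is one way to set up the PnP-style equations, not necessarily the exact formulation used in the embodiment.

```python
def center_depth_from_keypoint(u_i, v_i, u_c, v_c, offset_cam, fx, fy, cx, cy):
    """Solve for the object-center depth z from one key point's 2D position.

    offset_cam = (dx, dy, dz) is the key point's 3D offset from the object
    center, already rotated into camera coordinates using the predicted
    physical size and orientation angle. Each key point yields two linear
    equations in z (one from u, one from v); here they are solved in closed
    form, while in practice the equations from several key points can be
    stacked and solved jointly by least squares (the usual PnP setup).
    """
    dx, dy, dz = offset_cam
    # Derived from u_i = fx * (x_c + dx) / (z + dz) + cx, with
    # x_c = (u_c - cx) * z / fx (and the analogous relation for v).
    z_from_u = (fx * dx - (u_i - cx) * dz) / (u_i - u_c)
    z_from_v = (fy * dy - (v_i - cy) * dz) / (v_i - v_c)
    return z_from_u, z_from_v

# Made-up example: a key point 0.8 m to the right of, 0.6 m above and 1.5 m
# in front of the object center (camera coordinates, y pointing down).
print(center_depth_from_keypoint(
    u_i=700.0, v_i=300.0, u_c=660.0, v_c=330.0,
    offset_cam=(0.8, -0.6, 1.5), fx=1000.0, fy=1000.0, cx=640.0, cy=360.0))
```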
In one possible implementation, the depth may also be predicted based on the length of the object in a preset direction. Specifically, the geometric features of the object include a plurality of pixel sizes of the 3D envelope box of the object in the preset direction and the physical size of the object in the preset direction; at least one second depth value of the object can then be predicted from the geometric relationship between the pixel sizes and the physical size.
Taking the preset direction as the vertical direction as an example (so that the length in the preset direction is a height), the network can predict the physical height H of the target as well as the projection points of the corner points and the center point of the target 3D bounding box onto the upper and lower surfaces of the bounding box (i.e., the predefined key points). Since the physical height of the target and the distance between the key points are related, the depth estimate of a key point can be obtained by the following formula:

z = f·H/h

where f is the camera intrinsic parameter (focal length), which can be obtained in advance, H is the physical height of the target estimated by the network, and h is the vertical pixel distance between a pair of key points located on the upper and lower surfaces of the target 3D bounding box.
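The formula above can be written directly as a small helper; the numeric values below are made-up assumptions used only to illustrate the similar-triangle relation.

```python
def depth_from_height(f: float, H: float, h: float) -> float:
    """Depth of a key-point pair from the relation z = f * H / h.

    f: camera focal length in pixels (intrinsic, known in advance)
    H: physical height of the target predicted by the network, in metres
    h: vertical pixel distance between a pair of key points on the upper and
       lower surfaces of the target's 3D bounding box
    """
    return f * H / h

# Made-up example: a 1.5 m tall target spanning 60 px appears 25 m away
# for a 1000 px focal length.
print(depth_from_height(f=1000.0, H=1.5, h=60.0))  # 25.0
```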
In one possible implementation, the fusing of the first confidence and the at least one second confidence includes: acquiring a first weight of the first confidence and a first weight of each second confidence; and performing a weighted summation of the first confidence and the at least one second confidence according to the first weights.

Here, the first weight of the first confidence and of each second confidence is to be understood as the first weight corresponding to the first confidence and the first weight corresponding to each second confidence, respectively.
In a possible implementation, the fusion may be a weighted summation. Specifically, a second weight of the first depth value and a second weight of each of the second depth values may be obtained, and the first depth value and the at least one second depth value may then be weighted and summed according to the second weights.

Here, the second weight of the first depth value and of each second depth value is to be understood as the second weight corresponding to the first depth value and the second weight corresponding to each second depth value, respectively.
In one possible implementation, the second weight of the first depth value and each of the second depth values may be determined based on a confidence (or uncertainty) of the first depth value and each of the second depth values, wherein the confidence (or uncertainty) may describe a reliability degree of the depth values, and the higher the confidence is, the higher the corresponding weight is, i.e., the value of the second weight is positively correlated to the confidence of the corresponding depth value.
The confidence of the first depth value may be an output of the machine learning model, while the confidences of the second depth values obtained in different ways may be updated as trainable parameters during model training, so that the updated confidence of each second depth value is obtained when the model converges. In a possible implementation, the second weight of each second depth value may also be preset, which is not limited here.
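A minimal sketch of such a confidence-weighted fusion is given below; normalizing the confidences into weights is one possible choice that satisfies the positive correlation described above, not necessarily the mapping used in the embodiment, and all numeric values are arbitrary examples.

```python
from typing import Sequence

def fuse_depth_values(depths: Sequence[float], confidences: Sequence[float]) -> float:
    """Weighted sum of depth estimates, with weights positively correlated to
    their confidences (here simply the normalized confidences; other monotone
    mappings such as a softmax would also satisfy the description above)."""
    total = sum(confidences)
    weights = [c / total for c in confidences]          # second weights
    return sum(w * d for w, d in zip(weights, depths))

# Made-up example: the first (network) depth plus two second (geometry) depths.
depths = [25.3, 24.1, 26.0]
confidences = [0.8, 0.6, 0.3]
print(fuse_depth_values(depths, confidences))  # 25.0
```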
In one possible implementation, the method further comprises: obtaining a confidence of a 3D envelope box of the object output by the machine learning model; and fusing the confidence of the 3D envelope box of the object and the target confidence to obtain the geometric confidence of the object.
In one possible implementation, the fusing the confidence of the 3D envelope box of the object and the target confidence includes: obtaining a confidence of a 3D envelope box of the object and a third weight of the target confidence; and according to the third weight, carrying out weighted summation on the confidence coefficient of the 3D envelope box of the object and the target confidence coefficient.
In one possible implementation, the method further comprises: obtaining a confidence of a 2D envelope box of the object output by the machine learning model; and obtaining the detection confidence of the object according to the confidence of the 2D envelope box of the object and the geometric confidence.
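As a sketch of how these confidences could be combined: the geometric confidence follows the weighted-sum description above, while combining the 2D envelope box confidence with the geometric confidence by a product is only an assumption, since the embodiment states only that the detection confidence is obtained from the two. The weight and confidence values are arbitrary examples.

```python
def geometric_confidence(conf_3d_box: float, target_conf: float, w3: float = 0.5) -> float:
    """Weighted sum of the 3D envelope box confidence and the target (depth)
    confidence; w3 plays the role of the third weight described above."""
    return w3 * conf_3d_box + (1.0 - w3) * target_conf

def detection_confidence(conf_2d_box: float, geo_conf: float) -> float:
    """One simple (assumed) way to combine the 2D envelope box confidence with
    the geometric confidence."""
    return conf_2d_box * geo_conf

print(detection_confidence(0.9, geometric_confidence(0.7, 0.8)))  # 0.675
```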
In a second aspect, the present application provides an object detection apparatus comprising:
the acquisition module is used for acquiring a target image;
an object detection module, configured to detect an object in the target image through a machine learning model, and output a first depth value of the object and target information, where the target information includes at least one of the following information: geometric features of the object, keypoint information of the object, and location features of the object;
a depth value prediction module for predicting at least one second depth value of the object according to the target information;
and the fusion module is used for fusing the first depth value and the at least one second depth value to obtain the target depth of the object.
In one possible implementation, the apparatus further comprises:
a confidence prediction module for outputting a first confidence of the first depth value and a second confidence of each of the second depth values after the detection of the object in the target image by the machine learning model;
the fusion module is further configured to fuse the first confidence level and the at least one second confidence level to obtain a target confidence level of the target depth.
In one possible implementation, the keypoint information comprises 2D locations on the image of a plurality of keypoints of the object;
the depth value prediction module is specifically configured to:
and predicting at least one second depth value of the object by an N-point perspective algorithm PnP according to the target information.
In one possible implementation, the geometric characteristic of the object includes a physical dimension of the object, and the positional characteristic of the object includes an orientation angle of the object.
In one possible implementation, the geometric features of the object include a plurality of pixel sizes of a 3D envelope box of the object in a preset direction, and a physical size of the object in the preset direction;
the depth value prediction module is specifically configured to:
predicting at least one second depth value of the object according to the plurality of pixel sizes and a geometric relationship between the physical sizes.
In a possible implementation, the fusion module is specifically configured to:
acquiring the first confidence degrees and a first weight of each second confidence degree;
and according to the first weight, carrying out weighted summation on the first confidence coefficient and the at least one second confidence coefficient.
In a possible implementation, the fusion module is specifically configured to:
acquiring the first depth value and a second weight of each second depth value;
weighted summing the first depth value and the at least one second depth value according to the second weight.
In a possible implementation, the obtaining module is further configured to:
obtaining a confidence of a 3D envelope box of the object output by the machine learning model;
the fusion module is further configured to:
and fusing the confidence of the 3D envelope box of the object and the target confidence to obtain the geometric confidence of the object.
In a possible implementation, the fusion module is specifically configured to:
obtaining a confidence of a 3D envelope box of the object and a third weight of the target confidence;
and according to the third weight, carrying out weighted summation on the confidence coefficient of the 3D envelope box of the object and the target confidence coefficient.
In a possible implementation, the obtaining module is further configured to:
obtaining a confidence of a 2D envelope box of the object output by the machine learning model;
and the confidence determining module is used for obtaining the detection confidence of the object according to the confidence of the 2D envelope box of the object and the geometric confidence.
In a third aspect, an embodiment of the present application provides an object detection apparatus, which may include a memory, a processor, and a bus system, where the memory is used to store a program, and the processor is used to execute the program in the memory to perform the method according to the first aspect and any optional method thereof.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the first aspect and any optional method thereof.
In a fifth aspect, embodiments of the present application provide a computer program, which when run on a computer, causes the computer to perform the first aspect and any optional method thereof.
In a sixth aspect, the present application provides a chip system, which includes a processor configured to support an execution device or a training device in implementing the functions involved in the above aspects, for example, transmitting or processing the data and/or information involved in the above methods. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence body framework;
FIGS. 2a and 2b are schematic diagrams of an application system framework of the present invention;
FIG. 3 is an illustration of an application scenario of the present application;
FIG. 4 is an illustration of an application scenario of the present application;
FIG. 5 is a schematic diagram of a system architecture of the present application;
FIG. 6 is a schematic diagram of a neural network according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a neural network according to an embodiment of the present application;
fig. 8 is a hardware structure of a chip according to an embodiment of the present disclosure;
fig. 9 is a flowchart illustrating an object detection method according to an embodiment of the present application;
fig. 10a and 10b are schematic diagrams of backbone networks according to embodiments of the present application;
FIG. 11 is a schematic of the structure of an FPN;
FIG. 12a is a schematic of a header;
FIG. 12b is a schematic representation of the RPN layer of a header;
fig. 13 is a schematic diagram of a network structure in the present embodiment;
FIG. 14 is a schematic representation of a heatmap provided by an embodiment of the present application;
FIG. 15 is a schematic illustration of angles provided by an embodiment of the present application;
FIG. 16 is a schematic illustration of heights provided by an embodiment of the present application;
FIG. 17a is a schematic flow chart of a system provided in an embodiment of the present application;
fig. 17b is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 19 is a schematic structural diagram of a training apparatus according to an embodiment of the present application;
fig. 20 is a schematic structural diagram of a chip according to an embodiment of the present disclosure.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenes, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The general workflow of the artificial intelligence system will be described first, please refer to fig. 1, which shows a schematic structural diagram of an artificial intelligence body framework, and the artificial intelligence body framework is explained below from two dimensions of "intelligent information chain" (horizontal axis) and "IT value chain" (vertical axis). Where "intelligent information chain" reflects a list of processes processed from the acquisition of data. For example, the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making and intelligent execution and output can be realized. In this process, the data undergoes a "data-information-knowledge-wisdom" process of consolidation. The "IT value chain" reflects the value of artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (provision and processing technology implementation) to the industrial ecological process of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform. Communicating with the outside through a sensor; the computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA and the like); the basic platform comprises distributed computing framework, network and other related platform guarantees and supports, and can comprise cloud storage and computing, interconnection and intercommunication networks and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can be used for performing symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference refers to the process of simulating human intelligent reasoning in a computer or intelligent system: using formalized information, the machine thinks and solves problems according to an inference control strategy; the typical functions are searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, turning intelligent information decisions into products and realizing practical applications. The main application fields include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and the like.
The embodiments of the present application are mainly applied to fields that need to complete various perception tasks, such as driving assistance, automatic driving and mobile phone terminals. As shown in fig. 2a and 2b, a single picture is obtained from a video by frame extraction and sent to the perception network shown in fig. 2a or 2b of the present invention, so as to obtain the 2D, 3D, Mask (mask), key point and other information of the objects of interest in the picture. The detection results are output to a post-processing module for processing; for example, they are sent to the planning control unit of an automatic driving system for decision making, or sent to the beautifying algorithm of a mobile phone terminal for processing to obtain a beautified picture. The two application scenarios of the ADAS/ADS visual perception system and mobile phone beautification are briefly introduced below.
Application scenario 1: ADAS/ADS visual perception system
As shown in fig. 3, in ADAS and ADS, multiple types of 2D target detection need to be performed in real time, including: dynamic obstacles (Pedestrian, Cyclist, Tricycle, Car, Truck, Bus), static obstacles (TrafficCone, TrafficStick, FireHydrant, Motorcycle, Bicycle) and traffic signs (TrafficSign, GuideSign, Billboard, TrafficLight_Red/TrafficLight_Yellow/TrafficLight_Green/TrafficLight_Black, RoadSign). In addition, in order to accurately acquire the region occupied by a dynamic obstacle in 3-dimensional space, 3D estimation also needs to be performed on the dynamic obstacle and a 3D box output. In order to fuse with lidar data, the Mask of the dynamic obstacle needs to be acquired so that the laser point cloud hitting the dynamic obstacle can be filtered out; in order to park accurately in a parking space, the 4 key points of the parking space need to be detected simultaneously; and in order to perform composition positioning, the key points of static objects need to be detected. With the technical solution provided in the embodiments of the present application, all or part of these functions can be completed.
For example, the technical scheme provided by the embodiment of the application can be applied to adaptive cruise in auxiliary driving and advanced auxiliary driving.
The adaptive cruise function in ADAS requires the speed of the host vehicle to be adaptively adjusted according to the position and speed of the vehicle ahead in the lane, thereby enabling automatic cruising without collision. When there are no other traffic participants (targets) in front of the host vehicle in its lane, the vehicle travels at the preset speed or the road speed limit. When the perception system of the host vehicle detects that another traffic participant has entered the area in front of the host vehicle, for example, the host vehicle automatically reduces its speed according to the position and speed of that vehicle, so as to avoid a collision caused by the deceleration of the vehicle ahead.
For example, the technical scheme provided by the embodiment of the application can be applied to target track prediction in automatic auxiliary driving and monitoring.
In trajectory prediction, the road scene is perceived through a camera, and information such as the position, orientation and size of the important traffic participants in the environment is obtained through a target detection algorithm; the movement speed and direction of each target can be obtained by accumulating multi-frame detection results, so that the future motion trajectory of the target can be predicted and used as the basis for the subsequent decision and control of the autonomous vehicle. For example, the future direction of motion of surrounding vehicles can be predicted for an autonomous vehicle; or, in a monitoring scenario, the future motion of a pedestrian can be predicted by detecting the pedestrian's orientation and position, thereby identifying in advance where a crowd is likely to appear.
Application scenario 2: mobile phone beauty function
As shown in fig. 4, on a mobile phone, the masks and key points of the human body can be detected by the object detection method provided in the embodiments of the present application, and the corresponding parts of the human body can be enlarged or reduced, for example through waist-slimming and hip-shaping operations, so as to output a beautified picture.
Application scenario 3: image classification scene:
after the object recognition device obtains the image to be classified, the object recognition method is adopted to obtain the object category in the image to be classified, and then the image to be classified can be classified according to the object category of the object in the image to be classified. For photographers, many photographs are taken every day, with animals, people, and plants. The method can quickly classify the photos according to the content in the photos, and can be divided into photos containing animals, photos containing people and photos containing plants.
When the number of images is large, manual classification is inefficient, people tire easily when handling the same task for a long time, and the classification results then contain large errors; the images can instead be classified by adopting the method of the present application.
Application scenario 4: Commodity classification
after the object recognition device acquires the image of the commodity, the object recognition method is adopted to acquire the commodity category in the image of the commodity, and then the commodity is classified according to the commodity category. For various commodities in large shopping malls or supermarkets, the object detection method can be used for classifying the commodities, so that the time overhead and the labor cost are reduced.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) Object recognition: determining the category of an image object by using image processing, machine learning, computer graphics and other related methods.
(2) Neural network
The neural network may be composed of neural units. A neural unit may be an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit may be:

output = f( Σ_{s=1}^{n} W_s·x_s + b )

where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network and convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field; the local receptive field may be a region composed of several neural units.
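A single neural unit as described above can be illustrated as follows, using a sigmoid as the activation function f; the input and weight values are arbitrary examples.

```python
import math

def neural_unit(x, w, b):
    """Output of a single neural unit: f(sum_s W_s * x_s + b), here with a
    sigmoid activation function f."""
    s = sum(ws * xs for ws, xs in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-s))      # sigmoid activation

print(neural_unit(x=[0.5, -1.0, 2.0], w=[0.3, 0.8, -0.1], b=0.2))
```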
(3) Deep neural network
Deep Neural Networks (DNNs) can be understood as neural networks with many hidden layers; there is no special threshold for "many", and a multilayer neural network and a deep neural network are essentially the same thing. Dividing a DNN according to the position of its layers, the neural network inside the DNN can be divided into three categories: the input layer, the hidden layers and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is necessarily connected to any neuron of the (i+1)-th layer. Although a DNN looks complex, the work of each layer is not complex; it is simply the following linear relational expression:

y = α(W·x + b)

where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the number of coefficient matrices W and offset vectors b is also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example. In a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}: the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of layer (L−1) to the j-th neuron of layer L is defined as W^L_{jk}. Note that the input layer has no W parameters. In a deep neural network, more hidden layers enable the network to better model complex situations in the real world. In theory, a model with more parameters has higher complexity and larger "capacity", which means that it can accomplish more complex learning tasks.
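A minimal sketch of one fully connected layer following the expression above; ReLU is chosen here as the activation function α purely for illustration, and the matrix sizes are arbitrary.

```python
import numpy as np

def dnn_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One fully connected layer: y = alpha(W @ x + b), with ReLU as alpha."""
    return np.maximum(0.0, W @ x + b)

# W[j, k] corresponds to the coefficient W^L_{jk} from the k-th neuron of
# layer L-1 to the j-th neuron of layer L (made-up sizes: 4 inputs, 3 outputs).
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = np.zeros(3)
x = rng.standard_normal(4)
print(dnn_layer(x, W, b))
```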
(4) Convolutional Neural Networks (CNN) are a type of deep neural Network with convolutional structures. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter and the convolution process may be viewed as convolving an input image or convolved feature plane (feature map) with a trainable filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The underlying principle is: the statistics of a certain part of the image are the same as the other parts. Meaning that image information learned in one part can also be used in another part. We can use the same learned image information for all locations on the image. In the same convolution layer, a plurality of convolution kernels can be used to extract different image information, and generally, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
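To illustrate how one shared weight matrix (convolution kernel) is slid over the whole image, a plain single-channel "valid" convolution can be sketched as follows (as is common in deep learning, it is implemented as a cross-correlation, i.e., the kernel is not flipped); the image and kernel values are arbitrary.

```python
import numpy as np

def conv2d_single_channel(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Valid 2D convolution with one shared weight matrix: the same kernel is
    applied at every image position, which is exactly the weight sharing
    described above."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 3x3 edge-style kernel applied to a made-up 5x5 image.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)
print(conv2d_single_channel(image, kernel))
```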
(5) Back propagation algorithm
The convolutional neural network can adopt a Back Propagation (BP) algorithm to correct the size of parameters in the initial super-resolution model in the training process, so that the reconstruction error loss of the super-resolution model is smaller and smaller. Specifically, error loss occurs when an input signal is transmitted in a forward direction until the input signal is output, and parameters in an initial super-resolution model are updated by reversely propagating error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion with error loss as a dominant factor, aiming at obtaining the optimal parameters of the super-resolution model, such as a weight matrix.
(6) Recurrent Neural Networks (RNNs) are used to process sequence data.
In a traditional neural network model, the layers from the input layer through the hidden layer to the output layer are fully connected, while the nodes within each layer are unconnected. Although ordinary neural networks solve many problems, they are still incapable of handling many others. For example, to predict the next word in a sentence, you generally need to use the preceding words, because the words in a sentence are not independent. RNNs are called recurrent neural networks because the current output of a sequence is also related to the previous outputs. Concretely, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes within the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length. The training of an RNN is the same as that of a conventional CNN or DNN.
Why is a recurrent neural network needed when there are already convolutional neural networks? The reason is simple: a convolutional neural network has a precondition that its elements are independent of one another, as are its inputs and outputs, such as cats and dogs. But in the real world many elements are interconnected, such as stock prices changing over time; or, a person says: "I like to travel, my favorite place is Yunnan, and in the future, when I have the chance, I will go to ____." Humans all know to fill in "Yunnan" here, because they infer it from the context. But how can a machine do that? The RNN was created for this purpose: to give machines the ability to remember like humans. Therefore, the output of an RNN needs to depend on both the current input information and the historical memory information.
(7) Loss function
In the process of training a deep neural network, it is desirable that the output of the network be as close as possible to the value that is really wanted. Therefore, the predicted value of the current network can be compared with the really desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, i.e., parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is done by loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher its output value (loss), the larger the difference, and training the deep neural network becomes the process of reducing this loss as much as possible.
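As a minimal example of a loss function in the sense described above, the mean squared error below grows as the predicted values move away from the desired target values; the numbers are arbitrary.

```python
def mse_loss(predicted, target):
    """A simple loss function (mean squared error): the larger the difference
    between the predicted values and the desired target values, the larger
    the loss."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

print(mse_loss([2.5, 0.0, 2.1], [3.0, -0.5, 2.0]))  # 0.17
```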
(8) Back propagation algorithm
The neural network can adopt a Back Propagation (BP) algorithm to correct the size of parameters in the initial neural network model in a training process, so that the reconstruction error loss of the neural network model is smaller and smaller. Specifically, the error loss is generated by transmitting the input signal in the forward direction until the output, and the parameters in the initial neural network model are updated by reversely propagating the error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion with error loss as a dominant factor, aiming at obtaining the optimal parameters of the neural network model, such as a weight matrix.
The following describes a system architecture provided by the embodiments of the present application.
Referring to fig. 5, the present embodiment provides a system architecture 100. As shown in the system architecture 100, the data collecting device 160 is configured to collect training data, which in this embodiment of the present application includes: an image or image block of the object and a category of the object; and stores the training data into the database 130, and the training device 120 trains to obtain a machine learning model based on the training data maintained in the database 130, where the machine learning model may include a CNN feature extraction model (explaining that the feature extraction model here is a model obtained by training in the training phase as described above, and may be a machine learning model for feature extraction, etc.) and a head (header). The CNN feature extraction model can be used for realizing the machine learning model provided by the embodiment of the application, namely, the image or the image block to be recognized is input into the CNN feature extraction model after relevant preprocessing, and information such as 2D, 3D, Mask, key points and the like of an object of interest of the image or the image block to be recognized can be obtained. The CNN feature extraction model in the embodiment of the present application may specifically be a CNN convolutional neural network. It should be noted that, in practical applications, the training data maintained in the database 130 may not necessarily all come from the acquisition of the data acquisition device 160, and may also be received from other devices. It should be noted that, the training device 120 does not necessarily perform the training of the CNN feature extraction model based on the training data maintained by the database 130, and may also obtain the training data from the cloud or other places for performing the model training.
The target model/rule obtained by training according to the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 5, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR), a vehicle-mounted terminal, or a server or a cloud. In fig. 5, the execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include: an image or image block or picture to be recognized.
During the input data preprocessing performed by the execution device 120 or the processing related to the computation performed by the computation module 111 of the execution device 120 (such as performing the function implementation of the machine learning model in the present application), the execution device 120 may call the data, the code, and the like in the data storage system 150 for corresponding processing, and may store the data, the instruction, and the like obtained by corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing results, such as the 2D, 3D, Mask, keypoints, confidence, etc. information of the object of interest in the image or image block or picture obtained as described above, to the client device 140, and provides it to the user.
Alternatively, the client device 140 may be a planning control unit in an automatic driving system, or a beauty algorithm module in a mobile phone terminal.
It should be noted that the training device 120 may generate corresponding target models/rules based on different training data for different targets or different tasks, and the corresponding target models/rules may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 5, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 5 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 5, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
As shown in fig. 5, a CNN feature extraction model is obtained by training according to the training device 120, and the CNN feature extraction model may be a CNN convolutional neural network in the embodiment of the present application or a machine learning model to be described in the following embodiment.
Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 5. As described in the introduction of the basic concept above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
The structure of the neural network specifically adopted in the image processing method according to the embodiment of the present application may be as shown in fig. 6. In fig. 6, Convolutional Neural Network (CNN)100 may include an input layer 210, a convolutional/pooling layer 220 (where pooling is optional), and a neural network layer 230. The input layer 210 may obtain an image to be processed, and deliver the obtained image to be processed to the convolutional layer/pooling layer 220 and the following neural network layer 230 for processing, so as to obtain a processing result of the image. The following describes the internal layer structure in CNN 100 in fig. 6 in detail.
Convolutional layer/pooling layer 220:
and (3) rolling layers:
the convolutional layer/pooling layer 220 shown in fig. 6 may include layers such as 221 and 226, for example: in one implementation, 221 is a convolutional layer, 222 is a pooling layer, 223 is a convolutional layer, 224 is a pooling layer, 225 is a convolutional layer, 226 is a pooling layer; in another implementation, 221, 222 are convolutional layers, 223 is a pooling layer, 224, 225 are convolutional layers, and 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or may be used as the input of another convolutional layer to continue the convolution operation.
The inner working principle of a convolutional layer will be described below by taking convolutional layer 221 as an example.
Convolution layer 221 may include a number of convolution operators, also called kernels, whose role in image processing is to act as a filter to extract specific information from the input image matrix, and the convolution operator may be essentially a weight matrix, which is usually predefined, and during the convolution operation on the image, the weight matrix is usually processed pixel by pixel (or two pixels by two pixels … …, depending on the value of the step size stride) in the horizontal direction on the input image, so as to complete the task of extracting specific features from the image. The size of the weight matrix should be related to the size of the image, and it should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix will produce a single depth dimension of the convolved output, but in most cases not a single weight matrix is used, but a plurality of weight matrices of the same size (row by column), i.e. a plurality of matrices of the same type, are applied. The outputs of each weight matrix are stacked to form the depth dimension of the convolved image, where the dimension is understood to be determined by "plurality" as described above. Different weight matrices may be used to extract different features in the image, e.g., one weight matrix to extract image edge information, another weight matrix to extract a particular color of the image, yet another weight matrix to blur unwanted noise in the image, etc. The plurality of weight matrices have the same size (row × column), the sizes of the convolution feature maps extracted by the plurality of weight matrices having the same size are also the same, and the extracted plurality of convolution feature maps having the same size are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct prediction.
When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (e.g., 221) tend to extract general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 200 increases, the later convolutional layers (e.g., 226) extract increasingly complex features, such as features with high-level semantics; features with higher semantics are more applicable to the problem to be solved.
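To make the channel-stacking behaviour above concrete, the following is a minimal PyTorch sketch (not code from this application); the kernel count, kernel size and stride are illustrative values only.

```python
import torch
import torch.nn as nn

# A single convolutional layer: 3 input channels (RGB), 64 weight matrices
# (kernels). Each kernel spans the full input depth (3) and produces one
# output channel; stacking the 64 outputs forms the depth dimension 64.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)

image = torch.randn(1, 3, 224, 224)   # one RGB image, H = W = 224
features = conv(image)

print(conv.weight.shape)   # torch.Size([64, 3, 3, 3]) -> 64 kernels of depth 3
print(features.shape)      # torch.Size([1, 64, 224, 224]) -> depth dimension 64
```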
A pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer is often periodically introduced after a convolutional layer. In the layers 221 to 226 illustrated by 220 in fig. 6, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may comprise an average pooling operator and/or a maximum pooling operator for sampling the input image to a smaller size. The average pooling operator may compute the average of the pixel values within a certain range as the result of average pooling; the maximum pooling operator may take the pixel with the largest value within that range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the output image represents the average or maximum value of the corresponding sub-region of the input image.
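A small sketch, again with illustrative values, showing that pooling reduces only the spatial size while the channel count is unchanged:

```python
import torch
import torch.nn as nn

features = torch.randn(1, 64, 224, 224)           # output of a convolutional layer

avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)  # each output pixel = mean of a 2x2 region
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # each output pixel = max of a 2x2 region

print(avg_pool(features).shape)  # torch.Size([1, 64, 112, 112])
print(max_pool(features).shape)  # torch.Size([1, 64, 112, 112])
```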
The neural network layer 230:
After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet sufficient to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 uses the neural network layer 230 to generate one output or a set of outputs whose number equals the number of required classes. Accordingly, the neural network layer 230 may include a plurality of hidden layers (231, 232 to 23n shown in fig. 6) and an output layer 240; the parameters contained in the hidden layers may be pre-trained on training data related to a specific task type, where the task type may include, for example, image recognition, image classification and image super-resolution reconstruction.
After the hidden layers in the neural network layer 230, the last layer of the whole convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the categorical cross entropy and is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 200 (propagation in the direction from 210 to 240 in fig. 6) is completed, back propagation (propagation in the direction from 240 to 210 in fig. 6) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in fig. 6 is only an example of a convolutional neural network; in specific applications, the convolutional neural network may also exist in the form of other network models.
The structure of the neural network specifically adopted in the image processing method according to the embodiment of the present application may be as shown in fig. 7. In fig. 7, a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120 (where pooling is optional), and a neural network layer 130. Compared with fig. 6, in the convolutional layer/pooling layer 120 in fig. 7 a plurality of convolutional layers/pooling layers operate in parallel, and the features extracted by each of them are all input to the neural network layer 130 for processing.
It should be noted that the convolutional neural networks shown in fig. 6 and fig. 7 are only examples of two possible convolutional neural networks of the image processing method according to the embodiment of the present application, and in a specific application, the convolutional neural networks used in the image processing method according to the embodiment of the present application may also exist in the form of other network models.
In addition, the structure of the convolutional neural network obtained by the neural network structure search method according to the embodiment of the present application may be as shown in the convolutional neural network structures in fig. 6 and 7.
Fig. 8 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor NPU 50. The chip may be provided in the execution device 110 as shown in fig. 5 to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 5 to complete the training work of the training apparatus 120 and output the target model/rule. The algorithms for the various layers in the convolutional neural networks shown in fig. 6 and 7 can be implemented in a chip as shown in fig. 8.
The neural network processor NPU 50 is mounted as a coprocessor on a host central processing unit (CPU), and the host CPU distributes tasks to it. The core portion of the NPU is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to fetch data from memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit then fetches the matrix A data from the input memory 501, performs a matrix operation with matrix B, and stores the partial or final results of the resulting matrix in an accumulator 508.
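The accumulation behaviour described above (weights of B buffered on the PEs, data of A streamed in, partial results kept in the accumulator) can be mimicked functionally in a few lines of NumPy. This is only a sketch of the arithmetic, not a model of the actual circuit; the tile size is an arbitrary assumption.

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 4) -> np.ndarray:
    """Compute C = A @ B by accumulating partial results over tiles of the
    shared dimension, similar to how partial sums are kept in the accumulator."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    accumulator = np.zeros((m, n), dtype=a.dtype)   # plays the role of accumulator 508
    for start in range(0, k, tile):
        a_tile = a[:, start:start + tile]           # data fetched from the input memory
        b_tile = b[start:start + tile, :]           # weights buffered from the weight memory
        accumulator += a_tile @ b_tile              # partial result accumulated
    return accumulator

a = np.random.rand(8, 16).astype(np.float32)
b = np.random.rand(16, 8).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-5)
```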
The vector calculation unit 507 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculation of non-convolution/non-FC layers in a neural network, such as pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 507 can store the processed output vector to the unified buffer 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 503, for example for use in subsequent layers in a neural network.
The unified memory 506 is used to store input data as well as output data.
A direct memory access controller (DMAC) 505 is used to transfer the input data in the external memory to the input memory 501 and/or the unified memory 506, to store the weight data in the external memory into the weight memory 502, and to store the data in the unified memory 506 into the external memory.
A Bus Interface Unit (BIU) 510, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through a bus.
An instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;
the controller 504 is configured to invoke the instructions cached in the instruction fetch memory 509 to control the working process of the operation accelerator.
Optionally, the input data in this application is a picture, and the output data is 2D, 3D, Mask, key points, etc. of the object of interest in the picture.
Generally, the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are On-Chip memories, and the external memory is a memory external to the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
The execution apparatus 110 in fig. 5 described above can execute the steps of the object detection method in the embodiment of the present application, and the CNN model shown in fig. 6 and 7 and the chip shown in fig. 8 can also be used to execute the steps of the object detection method in the embodiment of the present application. The object detection method according to the embodiment of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 9, fig. 9 is a flowchart illustrating a flow of an object detection method according to an embodiment of the present disclosure, and as shown in fig. 9, the object detection method according to the embodiment of the present disclosure may include steps 901 to 903, which are described in detail below.
901. Acquiring a target image, detecting an object in the target image through a machine learning model, and outputting a first depth value and target information of the object, wherein the target information includes at least one of the following information: geometric features of the object, keypoint information of the object, positional features of the object.
In one possible implementation, the target image may include at least one object, and the object in this embodiment may be any one of the objects in the target image.
In this embodiment of the application, the architecture of the machine learning model may be the architecture shown in fig. 2a, and may be composed of a feature extraction network and a detection head (header), where the feature extraction network may include a backbone network (backbone) and a feature pyramid network (FPN), the FPN being optional.
In the embodiment of the application, the backbone network is used for receiving an input picture, performing convolution processing on it, and outputting feature maps with different resolutions corresponding to the picture; that is to say, it outputs feature maps of different sizes corresponding to the picture. The backbone thus completes the extraction of basic features and provides corresponding features for subsequent detection.
Specifically, the backbone network may perform a series of convolution operations on an input picture to obtain feature maps at different scales. These feature maps provide the basic features for the subsequent detection modules. The backbone network may take various forms, such as a Visual Geometry Group (VGG) network, a residual neural network (Resnet), or the core structure of GoogLeNet (Inception-Net).
The backbone network may perform convolution processing on an input image to generate a plurality of convolution feature maps with different scales, where each feature map is a matrix of size H × W × C, with H the height of the feature map, W its width, and C its number of channels.
The backbone may adopt various existing convolutional network frameworks, such as VGG16, Resnet50 or Inception-Net; Resnet18 is described here as an example of the backbone. This flow is shown in fig. 10a.
Assume that the resolution of the input picture is H × W × 3 (height H, width W, and 3 channels, i.e., the three RGB channels). The input picture is convolved by the first convolution module Res18-Conv1 of Resnet18 (convolution module 1 in the figure) to generate feature map C1; this feature map is downsampled relative to the input image and its number of channels is expanded to 64, so the resolution of C1 is H/4 × W/4 × 64. Convolution module 1 is composed of a plurality of convolution layers, and the subsequent convolution modules are similar; fig. 10b is a structural schematic diagram of a convolution module, and as shown in fig. 10b, convolution module 1 may include a plurality of convolution layers (convolution layer 1 to convolution layer N). C1 undergoes a convolution operation in the 2nd convolution module Res18-Conv2 of Resnet18 (convolution module 2 in the figure) to obtain feature map C2, whose resolution is consistent with that of C1. C2 then passes through the 3rd convolution module Res18-Conv3 of Resnet18 (convolution module 3 in the figure) to generate feature map C3, which is further downsampled relative to C2 with the number of channels doubled, giving a resolution of H/8 × W/8 × 128. Finally, C3 is processed by Res18-Conv4 (convolution module 4 in the figure) to generate feature map C4 with a resolution of H/16 × W/16 × 256.
As can be seen from fig. 10a, Resnet18 performs multiple levels of convolution processing on the input picture to obtain feature maps of different scales: C1/C2/C3/C4. The lower-level feature maps have larger width and height and fewer channels, and mainly contain low-level image features (such as edges and textures); the higher-level feature maps have smaller width and height and more channels, and mainly contain high-level image features (such as shapes and object-level features). The subsequent 2D detection process makes further predictions based on these feature maps.
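The shrinking resolution and growing channel count from C1 to C4 can be reproduced with a stack of strided convolution blocks. The sketch below is a simplified stand-in for Res18-Conv1 to Res18-Conv4 (no residual connections), with strides chosen only so that the H/4, H/8 and H/16 resolutions described above come out; it is not the Resnet18 implementation itself.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    # A stand-in for one convolution module of the backbone.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

conv1 = conv_block(3, 64, stride=4)     # C1: H/4  x W/4  x 64
conv2 = conv_block(64, 64, stride=1)    # C2: same resolution as C1
conv3 = conv_block(64, 128, stride=2)   # C3: H/8  x W/8  x 128
conv4 = conv_block(128, 256, stride=2)  # C4: H/16 x W/16 x 256

x = torch.randn(1, 3, 384, 1280)        # an example H x W input image
c1 = conv1(x); c2 = conv2(c1); c3 = conv3(c2); c4 = conv4(c3)
for name, f in [("C1", c1), ("C2", c2), ("C3", c3), ("C4", c4)]:
    print(name, tuple(f.shape))
```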
In this embodiment of the present application, the backbone network backbone includes a plurality of convolution modules, each convolution module includes a plurality of convolution layers, each convolution module can perform convolution processing on an input feature map to obtain feature maps with different resolutions, and a first convolution layer included in the backbone network backbone in this embodiment of the present application is one of the plurality of convolution layers included in the backbone network backbone.
It should be noted that the backbone network in the embodiment of the present application may also be referred to as a backbone network, and is not limited herein.
It should be noted that the backbone network backbone shown in fig. 10a and 10b is only one implementation manner, and does not constitute a limitation to the present application.
In the embodiment of the present application, the FPN is connected to the backbone network backbone, and the FPN may perform convolution processing on a plurality of feature maps with different resolutions generated by the backbone network backbone to construct the feature pyramid.
Referring to fig. 11, fig. 11 is a structural schematic diagram of an FPN. Convolution module 1 is used to process the topmost feature map C4 and may include at least one convolution layer; for example, convolution module 1 may use a dilated (hole) convolution and a 1 × 1 convolution to reduce the number of channels of the topmost feature map C4 to 256, yielding the topmost feature map P4 of the feature pyramid. The next-level feature map C3 is laterally linked: its number of channels is reduced to 256 using a 1 × 1 convolution (convolution module 2), and the result is added pixel by pixel to feature map P4 to obtain feature map P3. By analogy, from top to bottom, a feature pyramid Φp = {feature map P4, feature map P3, feature map P2, feature map P1} is constructed.
In this embodiment, the FPN includes a plurality of convolution modules, each convolution module includes a plurality of convolution layers, and each convolution module can perform convolution processing on the input feature map.
It should be noted that the FPN shown in fig. 11 is only one implementation and does not limit the present application.
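A minimal sketch of the top-down construction described above, assuming three input levels with the channel counts used in the backbone sketch and 256 pyramid channels; the nearest-neighbour upsampling before the pixel-wise addition is an assumption added so that feature maps of different resolutions can be added.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Top-down feature pyramid: 1x1 lateral convolutions reduce every level to
    256 channels, and each level is fused with the (upsampled) level above it
    by pixel-wise addition."""
    def __init__(self, in_channels=(64, 128, 256), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, feats):                       # feats = [C2, C3, C4], low to high level
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        pyramid = [laterals[-1]]                    # P4 from the topmost feature map
        for lat in reversed(laterals[:-1]):         # build P3, P2 from top to bottom
            top = F.interpolate(pyramid[0], size=lat.shape[-2:], mode="nearest")
            pyramid.insert(0, lat + top)            # pixel-wise addition
        return pyramid                              # [P2, P3, P4]

c2 = torch.randn(1, 64, 96, 320)
c3 = torch.randn(1, 128, 48, 160)
c4 = torch.randn(1, 256, 24, 80)
for p in TinyFPN()((c2, c3, c4)):
    print(tuple(p.shape))
```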
In this embodiment of the application, a Header is connected to the FPN. The Header can complete the detection of the 2D frame of a task according to the feature maps provided by the FPN, and output the 2D frame and 3D frame of the task object, the corresponding confidence, and the like. The structure of the Header is described next. Referring to fig. 12a, which is a schematic diagram of the Header, the Header includes three modules: a candidate region proposal network (RPN), an ROI-ALIGN module, and an RCNN module.
The RPN module may be configured to predict, on one or more feature maps provided by the FPN, the region where the task object is located and output a candidate 2D frame matching that region. Alternatively, it can be understood that the RPN predicts, on one or more feature maps output by the FPN, the regions where the task object may exist and gives a frame for these regions; these are called candidate regions (Proposals). For example, when the Header is responsible for detecting cars, its RPN layer predicts candidate frames in which a car may exist; when the Header is responsible for detecting persons, its RPN layer predicts candidate frames in which a person may exist. Of course, these Proposals are inaccurate: on the one hand they do not necessarily contain the object of the task, and on the other hand the boxes are not tight.
The 2D candidate region prediction process may be implemented by the RPN module of the Header, which predicts the regions where the task object may exist according to the feature maps provided by the FPN and provides candidate frames (also called candidate regions, Proposals) for these regions. In this embodiment, if the Header is responsible for detecting cars, its RPN layer predicts candidate frames where a car may exist.
The basic structure of the RPN layer may be as shown in fig. 12b. The feature map RPN Hidden is generated by convolution module 1 (e.g., a 3 × 3 convolution) from the feature map provided by the FPN. The RPN layer of the Header then predicts Proposals from the RPN Hidden. Specifically, the RPN layer of the Header predicts the coordinates and confidence of a Proposal at each position of the RPN Hidden through convolution module 2 and convolution module 3, respectively (for example, a 1 × 1 convolution each). The higher this confidence, the greater the probability that an object of the task is present in this Proposal; for example, a larger score for a certain Proposal in the Header indicates a greater probability that a vehicle is present. The Proposals predicted by each RPN layer then pass through a Proposal merging module: redundant Proposals are removed according to the degree of overlap between Proposals (this process may be implemented by, but is not limited to, an NMS algorithm), and the N (N < K) Proposals with the largest scores are selected from the remaining K Proposals as candidate regions where an object may exist. As can be seen from fig. 12b, these Proposals are inaccurate: on the one hand they do not necessarily contain the object of the task, and on the other hand the boxes are not tight. Therefore, the RPN module performs only a coarse detection, which needs to be refined by the subsequent RCNN module. When the RPN module regresses the coordinates of a Proposal, it regresses the coordinates relative to an Anchor instead of directly regressing the absolute values of the coordinates. The better these Anchors match the actual objects, the greater the probability that the RPN can detect the objects.
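The 'remove redundant Proposals by overlap, then keep the N highest-scoring ones' step can be sketched with a plain IoU-based NMS. The boxes and scores below are random stand-ins for RPN outputs, and the IoU threshold and N are illustrative values.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, boxes given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms_top_n(boxes, scores, iou_thr=0.7, top_n=100):
    """Remove redundant Proposals by overlap, then keep the top_n highest-scoring ones."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0 and len(keep) < top_n:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thr]
    return keep

boxes = np.random.rand(500, 4) * 100
boxes[:, 2:] += boxes[:, :2]                     # make x2 > x1 and y2 > y1
scores = np.random.rand(500)
print(len(nms_top_n(boxes, scores, top_n=50)))   # at most 50 surviving Proposals
```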
The ROI-ALIGN module is used for extracting, from a feature map provided by the FPN, the features of the region where each candidate 2D frame is located, according to the regions predicted by the RPN module. That is, the ROI-ALIGN module mainly extracts, on a certain feature map, the features of the region where each Proposal is located according to the Proposals provided by the RPN module, and resizes them to a fixed size to obtain the features of each Proposal. It is understood that the ROI-ALIGN module can use, but is not limited to, feature extraction methods such as ROI-POOLING, ROI-ALIGN, PS-ROI-POOLING and PS-ROI-ALIGN (position-sensitive region-of-interest extraction).
The RCNN module is used for performing convolution processing, through a neural network, on the features of the region where each candidate 2D frame is located to obtain the confidence that the candidate 2D frame belongs to each object category, and for adjusting the coordinates of the candidate 2D frame through the neural network so that the adjusted 2D frame matches the shape of the actual object better than the candidate 2D frame; an adjusted 2D frame whose confidence is greater than a preset threshold is selected as the 2D frame of the region. That is to say, the RCNN module mainly refines the features of each Proposal provided by the ROI-ALIGN module, obtains the confidence of each category to which each Proposal belongs (for example, for a vehicle task, 4 scores for background/Car/Truck/Bus may be given), adjusts the coordinates of the 2D frame of the Proposal, and outputs a more compact 2D frame. These 2D frames are merged by non-maximum suppression (NMS) and output as the final 2D frames.
The 2D candidate region refinement and classification is mainly implemented by the RCNN module of the Header in fig. 12a, which further regresses more compact 2D box coordinates from the features of each Proposal extracted by the ROI-ALIGN module, classifies the Proposal, and outputs its confidence for each category. There are many implementations of the RCNN, one of which is shown in fig. 12b. The feature output by the ROI-ALIGN module may be of size N × 14 × 14 × 256 (the features of the Proposals). In the RCNN module it is first processed by convolution module 4 of Resnet18 (Res18-Conv5), giving an output feature size of N × 7 × 7 × 512; it is then processed by a Global Avg Pool (average pooling layer), which averages the 7 × 7 features within each channel of the input features to obtain N × 512 features, where each 1 × 512-dimensional feature vector represents the features of one Proposal. Next, the exact coordinates of the box (an output vector N × 4, where the 4 values indicate the x/y coordinates of the box center and the width and height of the box) and the confidence of the box category (in Header0, a score for background/Car/Truck/Bus needs to be given) are regressed through 2 fully connected layers (FC). Finally, several boxes with the largest scores are selected through a box-merging operation, and repeated boxes are removed through an NMS operation, so as to obtain compact box outputs.
In some practical application scenarios, the sensing network may further include other Headers, and 3D/Mask/Keypoint detection may be further performed on the basis of the detected 2D frames. Taking 3D as an example, the ROI-ALIGN module extracts, on the feature map output by the FPN, the features of the region where each 2D frame is located according to the accurate 2D frames provided by the Header. Assuming the number of 2D frames is M, the feature output by the ROI-ALIGN module is of size M × 14 × 14 × 256; it is first processed by convolution module 5 of Resnet18 (for example, Res18-Conv5), giving an output feature size of M × 7 × 7 × 512, and then processed by a Global Avg Pool (average pooling layer), which averages the 7 × 7 features of each channel in the input features to obtain M × 512 features, where each 1 × 512-dimensional feature vector represents the features of one 2D frame. Next, the orientation angle (orientation, an M × 1 vector), the centroid point coordinates (centroid, an M × 2 vector, where the 2 values represent the x/y coordinates of the centroid) and the length, width and height (dimension) of the object in the frame are regressed through 3 fully connected layers (FC).
In this embodiment, the header includes at least one convolution module, each convolution module includes at least one convolution layer, each convolution module can perform convolution processing on the input feature map, and a third convolution layer included in the header in this embodiment is one of the plurality of convolution layers included in the header.
It should be noted that the header shown in fig. 12a and 12b is only one implementation and does not limit the present application.
Illustratively, the feature extraction network may be implemented by a CNN module, which includes a feature extractor (encoder) and a plurality of prediction decoders (heads). The encoder performs convolution processing on the input RGB image to extract the high-level semantic features of the image; the architecture can adopt any standard image classification model, gradually reducing the feature resolution through pooling or strided convolution, and finally obtaining the high-level semantic features of the image. As shown in fig. 13, this embodiment employs CenterNet as the encoder to downsample the image spatial resolution to 1/16 of the original size, with an output feature dimension of 64; the model parameters are initialized with values trained on the ImageNet image classification dataset.
For each attribute that needs to be predicted, an independent head network is constructed. It takes the features obtained by the encoder as input and produces a high-resolution prediction through two convolutions and upsampling, supplementing the detail information in the features with low-level features from the encoder; a BN layer and a nonlinear ReLU layer are inserted between the two convolutions. As shown by head1 and head2 in fig. 13, the intermediate structure of each head is identical; only in the final stage are different output dimensions used according to the predicted attribute, and the spatial resolution of the output stage is 1/4 of the original image size.
It should be noted that the encoder and head structures, and the hyper-parameters they contain (including the number of convolutional layers, the types of activation layers, the intermediate feature dimensions, the resolution, etc.), are not limited; any common architecture for dense prediction tasks may be used.
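As a concrete illustration of one such head, the sketch below assumes two 3 × 3 convolutions with BN and ReLU between them, bilinear upsampling to the 1/4 resolution, and a 1 × 1 lateral convolution on a low-level encoder feature to supply detail; the channel counts and the lateral-convolution choice are assumptions, not the exact structure of fig. 13.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionHead(nn.Module):
    """One head per predicted attribute: encoder feature (1/16) -> prediction (1/4)."""
    def __init__(self, enc_ch=64, low_ch=64, mid_ch=64, out_ch=1):
        super().__init__()
        self.conv1 = nn.Conv2d(enc_ch, mid_ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(mid_ch)              # BN + ReLU between the two convolutions
        self.lateral = nn.Conv2d(low_ch, mid_ch, 1)   # low-level encoder feature for detail
        self.conv2 = nn.Conv2d(mid_ch, out_ch, 3, padding=1)  # out_ch depends on the attribute

    def forward(self, enc_feat, low_feat):
        x = F.relu(self.bn(self.conv1(enc_feat)))
        x = F.interpolate(x, size=low_feat.shape[-2:], mode="bilinear", align_corners=False)
        x = x + self.lateral(low_feat)                # supplement detail information
        return self.conv2(x)                          # spatial resolution = 1/4 of the input image

enc = torch.randn(1, 64, 24, 80)                      # 1/16-resolution encoder output
low = torch.randn(1, 64, 96, 320)                     # 1/4-resolution low-level feature
heatmap_head = PredictionHead(out_ch=3)               # e.g. C = 3 object categories
print(tuple(heatmap_head(enc, low).shape))            # (1, 3, 96, 320)
```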
In the embodiment of the present application, the machine learning model may detect an object in an image, and then predict a target attribute of the object, where the main purpose of target attribute prediction is to analyze visual information in each pixel of the image and its neighborhood, determine whether an object of interest exists around the pixel, and information related to the object, where the prediction content may include, but is not limited to, heatmap, 2D Box offset, orientation, physical size, keypoint offset, and uncertainty of some attributes.
The following description is made separately:
In a possible implementation, the output dimensionality of the thermodynamic diagram (Heatmap) is C; each dimension corresponds to one target category and expresses the probability that each pixel in the image feature map belongs to that target category, with a value range of 0-1. When the heatmap value of a certain pixel is high in the c-th dimension, it indicates that, with high probability, a target object of category c is present near that pixel.
In one possible implementation, the feature map predicts a 2D bounding box of the object for each pixel position, the bounding box giving the rectangular area occupied by the object in the image, and the 2D box actually giving the 2D detection result of the object.
In one possible implementation, for each CNN-processed picture, its predicted heatmap is first decoded to determine whether there is a target in the picture. In this embodiment, each target may be represented by the center point of its 3D bounding box; the closer a pixel is to the centroid of the target, the larger its heatmap value, reaching the maximum value 1 at the centroid position. In this embodiment there are multiple categories of objects (such as pedestrians and vehicles), and the heatmap value Heatmap_c(k) for each pixel k of the input image I can be calculated using the following formula:
Heatmap_c(k) = 1 / (1 + exp(-f_k(W, I)))
where f_k(W, I) is the output of the model at the k-th pixel when the parameters of the CNN are W and the input is I; the formula normalizes the output of any model into a probability value in the range 0-1. As shown by the Heatmap in fig. 14, the lighter the color in the map, the larger the heat value.
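Decoding such a heatmap into candidate object centers is commonly done by keeping only local maxima above a score threshold; the sketch below uses a 3 × 3 max-pooling comparison and a threshold of 0.3, both of which are illustrative assumptions rather than values taken from this application.

```python
import torch
import torch.nn.functional as F

def decode_heatmap(heatmap: torch.Tensor, score_thr: float = 0.3):
    """heatmap: (C, H, W) with values in [0, 1]. Returns (class, y, x, score) tuples
    for pixels that are local maxima above the threshold (candidate object centers)."""
    pooled = F.max_pool2d(heatmap.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    peaks = (heatmap == pooled) & (heatmap > score_thr)   # keep only local maxima
    cls, ys, xs = torch.nonzero(peaks, as_tuple=True)
    scores = heatmap[cls, ys, xs]
    return list(zip(cls.tolist(), ys.tolist(), xs.tolist(), scores.tolist()))

hm = torch.zeros(2, 96, 320)
hm[0, 40, 100] = 0.9       # a strong center for category 0
hm[1, 10, 50] = 0.2        # below the threshold, ignored
print(decode_heatmap(hm))  # [(0, 40, 100, 0.9...)]
```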
In one possible implementation, the feature map predicts a 3-dimensional size for each pixel location of the object, which corresponds to the length, width, and height of the object in the real physical world.
In one possible implementation, the feature map predicts, for each pixel position, the orientation angle of the object with respect to the direction of the camera's line of sight. Since this angle is related to the appearance of the object in the image, it can be directly predicted by the CNN model from that appearance, and it may be referred to as the local angle. Once this angle is obtained, it is combined with the angle of the object relative to the camera's line of sight to recover the orientation angle of the object in the camera coordinate system, which may be referred to as the global angle. The relationship between the global angle and the local angle is illustrated in fig. 15.
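The relationship in fig. 15 is commonly written as global angle = local angle + viewing-ray angle, i.e. r_y = α + arctan(x / z) for an object center at (x, y, z) in the camera coordinate system. The sketch below encodes that generic conversion; the wrap-around handling is an implementation choice.

```python
import math

def local_to_global_angle(alpha: float, x: float, z: float) -> float:
    """Recover the global orientation angle r_y in the camera coordinate system from
    the local (appearance-based) angle alpha and the viewing-ray angle arctan(x / z)."""
    ry = alpha + math.atan2(x, z)
    # wrap to (-pi, pi] for convenience
    while ry > math.pi:
        ry -= 2.0 * math.pi
    while ry <= -math.pi:
        ry += 2.0 * math.pi
    return ry

# An object seen 5 m to the right and 20 m ahead, with a predicted local angle of 0.1 rad:
print(local_to_global_angle(0.1, x=5.0, z=20.0))
```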
In one possible implementation, this embodiment obtains, at each centroid position, the network's estimates of the orientation, size and depth offset of the target, and then decodes the global orientation angle, physical length, width and height, and centroid depth of each target in the camera coordinate system according to the following formulas. For a certain category c, this embodiment first counts the average length, width and height of targets of that category in the data, denoted (l̄_c, w̄_c, h̄_c).
Then the size offsets (δ_l, δ_w, δ_h) predicted by the network are obtained, and the predicted physical size of each target can be obtained by:
(l, w, h) = (l̄_c · exp(δ_l), w̄_c · exp(δ_w), h̄_c · exp(δ_h))
This embodiment predicts the center point depth offset z_o of the target through the CNN and restores it to the absolute depth z_r of the target centroid by:
z_r = 1 / sigmoid(z_o) - 1
In one possible implementation, similar to the 2D bounding box, the feature map predicts the location of the target keypoint in the image for each pixel location.
Illustratively, this example defines 11 keypoints k_i, i = {1, ..., 11}, for each target: the 8 corner points of the target's 3D bounding box, the center point (i.e., centroid point) of the 3D bounding box, and the projection points of the center of the 3D bounding box onto the upper and lower surfaces of the bounding box; these keypoints are shown as the yellow, red and blue points in fig. 13, respectively. Similar to the 2D box decoding process, this embodiment obtains the position of each keypoint of a target in the image from the network's estimate of the corresponding keypoint offset at the centroid position:
(u_i, v_i) = (u_c, v_c) + (δu_i, δv_i)
In one possible implementation, the feature map predicts, for each pixel position, the depth of the center of mass point of the object, which is a physical quantity in 3D space, referring to the z value in the center (x, y, z) of the object.
Furthermore, in one possible implementation, in addition to predicting the values of the above 3D attributes, each pixel location of the feature map can also give the uncertainty of these attribute predictions. The uncertainty prediction uses a principle based on Bayesian networks: during the training of the network, the prediction of each of the above attributes is modeled as fitting a Gaussian distribution, and this uncertainty can be represented by the variance of the Gaussian distribution.
As described above, through the machine learning model, the pixel at the centroid point of the target can directly predict the depth of the target; at this point the first depth value of the object, directly estimated by the network, is obtained.
902. At least one second depth value of the object is predicted according to the target information.
In a possible implementation, the inference accuracy of the machine learning model alone is not sufficient, and once the camera used for shooting the image or the shooting scene changes, the accuracy often drops significantly. In the embodiment of the present application, the depth of the center point of the target is estimated based on different principles from the related attribute information of the object output by the machine learning model (i.e., the target information in the embodiment of the present application), so that at least one differentiated and diversified depth prediction can be obtained, and a more accurate depth value of the object can be determined based on the multiple depth values of the object obtained in these different ways.
Next, how to predict at least one second depth value of the object based on the target information is described:
in one possible implementation, the at least one second depth value of the object may be predicted by an N-point perspective algorithm PnP.
Specifically, the keypoint information in the target information may include 2D positions of a plurality of keypoints of the object on the image, the geometric feature of the object includes a physical size of the object, and the positional feature of the object includes an orientation angle of the object. And predicting at least one second depth value of the object by an N-point perspective algorithm PnP according to the target information.
In one possible implementation, a structure may be built for each keypointStarting from a 3D point P in the camera coordinate system c With 2D points P in image space 2d And the correspondence between the two points can be represented by an internal reference matrix K of the camera:
λ · P_2d = K · P_c, i.e., λ · [u, v, 1]^T = K · [x_c, y_c, z_c]^T
where the relationship between P_c and the position P_o of the keypoint in the local coordinate system of the target can then be obtained through the extrinsic (pose) parameters:
P_c = R · P_o + T
where R is the rotation determined by the target's global orientation angle θ, R = [[cos θ, 0, sin θ], [0, 1, 0], [-sin θ, 0, cos θ]], and T = [x, y, z]^T is the position of the target center point in the camera coordinate system.
Thus, each keypoint P_2d detected on the image yields an equation in the position T = [x, y, z]^T of the target center point in the camera coordinate system; eliminating λ gives two linear equations in T, whose coefficients (for example, A = [x_o sin θ - z_o cos θ]) depend only on the keypoint's local coordinates P_o, the orientation θ, the camera intrinsics and the observed image coordinates.
the above formula defines two linear equations for T, and when the number of key points existing on the target exceeds 2, T can be solved. This method is the so-called N-point perspective algorithm PnP.
In this embodiment, the depth of the center point of the object is of interest, so the 3D center point of the object can itself be defined as a keypoint. Given its 2D position (u_c, v_c) on the image, the position of the object center point can be expressed in terms that depend only on z:
P_c = [(u_c - c_u) · z / f_x, (v_c - c_v) · z / f_y, z]^T
where f_x, f_y and (c_u, c_v) are the focal lengths and principal point from the intrinsic matrix K. Based on this, each keypoint can now be expressed as 2 linear equations with respect to the value of z, so that two sets of estimates of the z value of the target center point can be calculated.
The object 3D information in this embodiment includes the physical length, width and height (l, w, h) of the object, the global orientation angle r_y of the target, and the positions on the image of the center point and corner points of the object's 3D bounding box (9 keypoints). The coordinates P_o of each corner keypoint in the target's local coordinate system can be obtained from the length, width and height (for example, P_o = [±l/2, ±h/2, ±w/2]^T), and its coordinates (u, v) in the image are obtained by the keypoint decoding module; meanwhile, the coordinates of the centroid keypoint in the image are (u_c, v_c). According to the PnP-based depth resolution principle, two estimates of the z value can be obtained from each keypoint; for example, the 8 corner points of the 3D box in this embodiment can thus form 16 sets of depth estimates.
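Under the pinhole relation written above (λ · P_2d = K · (R · P_o + T)), eliminating λ turns every keypoint into two equations that are linear in T = [x, y, z]^T, and T (in particular its depth component z) can be solved by least squares once more than two keypoints are available. The NumPy sketch below builds and solves such a system; it is a generic PnP-style illustration under the stated assumptions (rotation about the vertical axis, an example intrinsic matrix and car size), not the exact equations of this application.

```python
import numpy as np

def solve_center_position(keypoints_2d, keypoints_local, theta, K):
    """keypoints_2d: (N, 2) pixel coordinates; keypoints_local: (N, 3) coordinates P_o in
    the object frame; theta: global orientation angle; K: 3x3 intrinsic matrix.
    Returns T = [x, y, z], the object center in the camera frame (least squares)."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0,           1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    A, b = [], []
    for (u, v), p_o in zip(keypoints_2d, keypoints_local):
        q = R @ p_o                                   # rotated keypoint; P_c = q + T
        # u = (fx * (q_x + x) + cx * (q_z + z)) / (q_z + z)  ->  linear in x and z
        A.append([fx, 0.0, cx - u]); b.append(-(fx * q[0] + (cx - u) * q[2]))
        # v = (fy * (q_y + y) + cy * (q_z + z)) / (q_z + z)  ->  linear in y and z
        A.append([0.0, fy, cy - v]); b.append(-(fy * q[1] + (cy - v) * q[2]))
    T, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return T

K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
l, w, h = 3.9, 1.6, 1.5                               # assumed physical size of a car
corners = np.array([[sx * l / 2, sy * h / 2, sz * w / 2]
                    for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
T_true = np.array([2.0, 1.0, 25.0])
theta = 0.3
R = np.array([[np.cos(theta), 0, np.sin(theta)], [0, 1, 0], [-np.sin(theta), 0, np.cos(theta)]])
cam = (R @ corners.T).T + T_true
proj = (K @ cam.T).T
uv = proj[:, :2] / proj[:, 2:3]
print(solve_center_position(uv, corners, theta, K))   # approximately [2.0, 1.0, 25.0]
```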
In one possible implementation, the depth prediction may also be based on a length of the object in a preset direction. Specifically, the geometric features of the object include a plurality of pixel sizes of a 3D envelope box of the object in a preset direction, and a physical size of the object in the preset direction; furthermore, at least one second depth value of the object may be predicted based on the plurality of pixel sizes and a geometric relationship between the physical sizes.
Taking the vertical direction as the preset direction as an example (the length in the preset direction is then a height), the network may predict the physical height H of the target as well as the projection points of the corner points and the center point of the target's 3D bounding box onto the upper and lower surfaces of the bounding box (i.e., the predefined keypoints). Since the physical height of the target and the distance between these keypoints are related, a depth estimate for the keypoints can be obtained by the following formula:
z = f · H / h
where f is the camera intrinsic parameter, which can be obtained in advance, H is the target physical height estimated by the network, and H is the vertical distance between a pair of key points located on the upper and lower surfaces of the target 3D bounding box.
For example, the 10 keypoints in this embodiment can be used to compute 3 sets of pixel heights of the target: the vertical distance between the projection points of the target centroid forms one set of pixel heights, as shown by the Zc line in fig. 16(a); the vertical distances between the 8 keypoints at the corner positions of the target's 3D bounding box form 4 sets of pixel heights, as shown by the Z1 and Z3 bars in fig. 16(b) and the Z2 and Z4 bars in fig. 16(c). In this embodiment, because it is the centroid depth of the object that is to be calculated, the module further divides the 4 groups of heights into two groups, in which the pixel heights at diagonal positions are averaged to obtain the pixel height at the centroid position. One target depth estimate can be solved from each set of pixel heights, which forms 3 sets of depth estimates.
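A one-line application of the z = f · H / h relation above, with illustrative numbers:

```python
def depth_from_pixel_height(f: float, physical_height: float, pixel_height: float) -> float:
    """Pinhole relation: centroid depth = focal length * physical height / pixel height."""
    return f * physical_height / pixel_height

# A 1.5 m tall target whose 3D-box keypoints are 45 pixels apart vertically,
# seen by a camera with a 720-pixel focal length:
print(depth_from_pixel_height(720.0, 1.5, 45.0))   # 24.0 m
```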
It should be understood that the above method for calculating the at least one second depth value is merely illustrative, and the depth value of the object may also be predicted by performing an operation on the target information, and the application is not limited thereto.
903. And fusing the first depth value and the at least one second depth value to obtain the target depth of the object.
In one possible implementation, after obtaining the first depth value and the at least one second depth value, the first depth value and the at least one second depth value may be fused to obtain a target depth of the object.
In a possible implementation, the merging manner may be a weighted summation, and specifically, a second weight of each of the first depth values and the second depth values may be obtained; and performing a weighted summation of the first depth value and the at least one second depth value according to the second weight.
In one possible implementation, the second weight of the first depth value and each of the second depth values may be determined based on a confidence (or uncertainty) of the first depth value and each of the second depth values, wherein the confidence (or uncertainty) may describe a degree of reliability of the depth values, and the higher the confidence of the depth value, the greater its corresponding weight, i.e., the value of the second weight is positively correlated with the confidence of the corresponding depth value.
The confidence of the first depth value may be an output of a machine learning model, the confidence of the second depth values obtained in different ways may be updated as trainable parameters during model training, and the confidence of each updated second depth value may be obtained when the model converges. In a possible implementation, the second weight of each second depth value may also be preset, which is not limited herein.
Illustratively, the multiple sets of depth estimates obtained by the depth calculation system each come with an uncertainty that describes the reliability of that estimate; in the embodiment of the present application, the uncertainties are used to filter out abnormal estimates and to fuse the remaining estimates. The fusion of the remaining estimates may be performed as a weighted sum: when there are N sets of estimates z_i with variances σ_i² in the estimate set, the weighting parameter of each estimate is:
w_i = (1/σ_i²) / Σ_j (1/σ_j²)
The essence of this formula is a weighted summation of several Gaussian random variables; according to probability theory, the summed result is still a Gaussian random variable, whose mean and variance satisfy:
z = Σ_i w_i · z_i,  σ² = Σ_i w_i² · σ_i²
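A minimal NumPy sketch of this uncertainty-weighted fusion, including a simple outlier-filtering step; the fixed variance cut-off used here is an assumed criterion for illustration, since the application only states that abnormal estimates are filtered before fusion.

```python
import numpy as np

def fuse_depths(depths, variances, max_var=4.0):
    """Filter out high-variance (unreliable) estimates, then fuse the rest with
    inverse-variance weights: w_i = (1/s_i^2) / sum_j (1/s_j^2)."""
    depths = np.asarray(depths, dtype=float)
    variances = np.asarray(variances, dtype=float)
    keep = variances < max_var                      # discard abnormal estimates
    d, v = depths[keep], variances[keep]
    weights = (1.0 / v) / np.sum(1.0 / v)
    fused_depth = np.sum(weights * d)               # fused mean
    fused_var = np.sum(weights ** 2 * v)            # fused variance
    return fused_depth, fused_var

# Direct CNN estimate, two PnP-based estimates and one height-based estimate (illustrative):
depths = [24.3, 23.8, 31.0, 24.6]
variances = [0.4, 0.9, 9.0, 0.6]                    # the 31.0 m estimate is clearly unreliable
print(fuse_depths(depths, variances))
```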
In one possible implementation, a first confidence of the first depth value and a second confidence of each second depth value may also be output, and the first confidence and the at least one second confidence may be fused to obtain a target confidence of the target depth.
In a possible implementation, the fusion mode may be weighted summation, and specifically, the first confidence and the first weight of each second confidence may be obtained; and according to the first weight, carrying out weighted summation on the first confidence coefficient and the at least one second confidence coefficient. The determining manner of the first weight may refer to the determining manner of the second weight in the above embodiments, and is not described herein again.
In one possible implementation, for each predicted object, a confidence in the prediction, i.e., the reliability of the prediction, is also calculated. The overall confidence is calculated as follows:
P_m = P_3d|2d · P_2d
This formula expresses two things simultaneously: the probability that the target exists in the image, and whether the predicted shape and position of the target in 3D space are accurate, i.e., the geometric confidence. Since the uncertainty about the target position (the target confidence) has already been obtained, it can be used to calculate the geometric confidence of the target; that is, the higher the estimated uncertainty of the target, the lower the confidence.
In one possible implementation, a confidence of the 3D envelope box of the object output by the machine learning model may be obtained; and fusing the confidence of the 3D envelope box of the object and the target confidence to obtain the geometric confidence of the object.
In one possible implementation, a confidence of the 3D envelope box of the object and a third weight of the target confidence may be obtained; and according to the third weight, carrying out weighted summation on the confidence coefficient of the 3D envelope box of the object and the target confidence coefficient.
Specifically, for each predicted target, the system will also calculate the confidence of the prediction, i.e., the reliability of the prediction. The overall confidence is calculated as follows:
P_m = P_3d|2d · P_2d
This formula expresses two things simultaneously: the probability that the target exists in the image, and whether the predicted shape and position of the target in 3D space are accurate, i.e., the geometric confidence. Since the uncertainty about the target position has already been obtained, it can be used to calculate the geometric confidence of the target; that is, the confidence is lower when the uncertainty of the target estimate is higher:
d = 1 - min{σ², 1};
The overall geometric confidence is composed of two parts, a centroid depth confidence d_c and a 3D bounding box shape confidence d_b:
P_3d|2d = ω_c · d_c + ω_b · d_b
The determination of the weights ω_c and ω_b can refer to the description of the second weight above; similar parts are not repeated.
In one possible implementation, a confidence of a 2D envelope box of the object output by the machine learning model may be obtained; and obtaining the detection confidence of the object according to the confidence of the 2D envelope box of the object and the geometric confidence.
Exemplarily, referring to fig. 17a, fig. 17a is a schematic diagram of an overall flow of an embodiment of the present application;
next, a training flow schematic of the machine learning model in the embodiment of the present application is described:
The neural network used in this embodiment needs to be trained. In the training process, the manually pre-labeled position, orientation and size information, in the camera coordinate system, of objects of the categories of interest is used to supervise the 2D bounding box, center point heatmap and 3D attribute outputs of the network; the form of the objective function can use related methods in the prior art. The specific part of this embodiment is to train the 20 sets of depth estimates together with their uncertainties; the supervised cost function for each set of depth and uncertainty is defined by the following formula:
L_dep = (z - z*)² / (2σ²) + (1/2) · log σ²,  where z* is the ground-truth depth
furthermore, the target is fused in depth
Figure BDA0003598908670000256
And target physical size
Figure BDA0003598908670000252
Orientation information
Figure BDA0003598908670000253
Combining, generating a bounding box of the target in 3D space, calculating a cost function from the corner points of the predicted bounding box and the corner points of the true bounding box, the objective function being used to generate a 3D shape uncertainty σ as shown in the following equation b
Figure BDA0003598908670000254
To demonstrate the technical effect of this embodiment, the experimental results of this embodiment on the industry-standard public dataset KITTI are summarized as follows. As shown in Table 1, the estimates generated by using network-based direct estimation alone, height-based estimation alone, PnP-based estimation alone, or their pairwise combinations show a large precision gap compared with the diversified depth estimation system of the present invention. The results show that the diversified depth estimation module of the first embodiment can maximally exploit depth cues from multiple tasks and generate more accurate depth estimates.
Table 1 shows the results of the verification experiment of the diversified depth calculation system in the Kitti verification set
As shown in Table 2, for multiple sets of depth estimates, the accuracy achieved by hard selection, average fusion or direct weighted fusion cannot reach that of the depth selection and fusion module in this embodiment. The experimental results show that the uncertainty-based abnormal-estimate filtering and fusion strategy can fuse multiple sets of estimates well and brings a performance improvement.
Table 2 results of the verification experiment of the depth selection fusion module in Kitti verification set
As shown in Table 3, different ways of modeling the 3D geometric confidence affect the final accuracy. Without the 3D geometric confidence calculation module of this embodiment, modeling the confidence with only one uncertainty out of the 20 sets of depth estimates, or using only one of σ_d and σ_b, cannot achieve the effect of the geometric confidence proposed by the present invention. The experimental results show that the uncertainty-based geometric confidence estimation can effectively model the positional accuracy of a target in 3D space, improves the 3D detection precision, and is clearly superior to 3D IoU estimation.
Table 3 results of the verification experiment of the confidence calculation module in the Kitti verification set
Table 4 compares this embodiment with the current industry-best methods on the KITTI test set. In the main evaluation category, Car, this embodiment is substantially better than the current industry-best level (17.14 vs 14.17, a precision improvement of about 20%). This embodiment also ranks first in the Cyclist category and second in the Pedestrian category.
Table 4 comparison of the performance of this embodiment with other methods on the Car category in the KITTI official test set
Table 5 comparison of the performance of this embodiment with other methods on the Cyclist and Pedestrian categories in the KITTI official test set
The embodiment of the application provides an object detection method, which comprises the following steps: acquiring a target image; detecting an object in the target image through a machine learning model, and outputting a first depth value of the object and target information, wherein the target information comprises at least one of the following information: geometric features of the object, keypoint information of the object, and position features of the object; predicting at least one second depth value of the object according to the target information; and fusing the first depth value and the at least one second depth value to obtain the target depth of the object. The method uses different theoretical assumptions and different combinations of target attributes to simultaneously generate multiple sets of depth estimates together with the uncertainty of each estimate. In subsequent processing, high-reliability depth predictions are selected and fused according to these uncertainties, while unreliable, erroneous predictions are discarded, so that a robust depth estimation result is finally produced; this greatly improves the robustness and accuracy of the system's detection in different environments. In addition, the uncertainty also helps the system produce a more reliable evaluation score for the prediction result, further improving the detection accuracy and the interpretability of the system.
Referring to fig. 17b, fig. 17b is a schematic structural diagram of an object detection apparatus according to an embodiment of the present disclosure. As shown in fig. 17b, an object detection apparatus 1700 according to an embodiment of the present disclosure includes:
an acquisition module 1701 for acquiring a target image;
an object detection module 1702, configured to detect an object in the target image through a machine learning model, and output a first depth value of the object and target information, where the target information includes at least one of the following information: geometric features of the object, keypoint information of the object, and location features of the object;
the description of the obtaining module 1701 and the object detecting module 1702 may refer to the description of step 901 in the above embodiment, and will not be described herein again.
A depth value prediction module 1703, configured to predict at least one second depth value of the object according to the target information;
the depth value prediction module 1703 may refer to the description of step 902 in the foregoing embodiment, and is not described herein again.
A fusion module 1704, configured to fuse the first depth value and the at least one second depth value to obtain a target depth of the object.
The description of the fusion module 1704 may refer to the description of step 903 in the above embodiment, and is not repeated here.
In one possible implementation, the apparatus further comprises:
a confidence prediction module for outputting a first confidence of the first depth value and a second confidence of each of the second depth values after the detection of the object in the target image by the machine learning model;
the fusion module is further configured to fuse the first confidence level and the at least one second confidence level to obtain a target confidence level of the target depth.
In one possible implementation, the keypoint information comprises 2D locations on the image of a plurality of keypoints of the object;
the depth value prediction module is specifically configured to:
and predicting at least one second depth value of the object by an N-point perspective algorithm PnP according to the target information.
In one possible implementation, the geometric characteristic of the object includes a physical dimension of the object, and the positional characteristic of the object includes an orientation angle of the object.
In one possible implementation, the geometric features of the object include a plurality of pixel sizes of a 3D envelope box of the object in a preset direction, and a physical size of the object in the preset direction;
the depth value prediction module is specifically configured to:
predicting at least one second depth value of the object according to the plurality of pixel sizes and a geometric relationship between the physical sizes.
In a possible implementation, the fusion module is specifically configured to:
acquiring the first confidence degrees and a first weight of each second confidence degree;
and according to the first weight, carrying out weighted summation on the first confidence coefficient and the at least one second confidence coefficient.
In a possible implementation, the fusion module is specifically configured to:
acquiring the first depth value and a second weight of each second depth value;
weighted summing the first depth value and the at least one second depth value according to the second weight.
In one possible implementation, the obtaining module is further configured to:
obtaining a confidence of a 3D envelope box of the object output by the machine learning model;
the fusion module is further configured to:
and fusing the confidence of the 3D envelope box of the object and the target confidence to obtain the geometric confidence of the object.
In a possible implementation, the fusion module is specifically configured to:
obtaining a confidence of a 3D envelope box of the object and a third weight of the target confidence;
and according to the third weight, carrying out weighted summation on the confidence coefficient of the 3D envelope box of the object and the target confidence coefficient.
In one possible implementation, the obtaining module is further configured to:
obtaining a confidence of a 2D envelope box of the object output by the machine learning model;
and the confidence determining module is used for obtaining the detection confidence of the object according to the confidence of the 2D envelope box of the object and the geometric confidence.
Referring to fig. 18, fig. 18 is a schematic structural diagram of an execution device provided in an embodiment of the present application. The execution device 1800 may be embodied as a virtual reality (VR) device, a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a monitoring data processing device, or a server, which is not limited herein. Specifically, the execution device 1800 includes: a receiver 1801, a transmitter 1802, a processor 1803, and a memory 1804 (the execution device 1800 may include one or more processors 1803; one processor is used as an example in fig. 18), where the processor 1803 may include an application processor 18031 and a communication processor 18032. In some embodiments of the present application, the receiver 1801, the transmitter 1802, the processor 1803, and the memory 1804 may be connected by a bus or in another manner.
The memory 1804 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1803. A portion of the memory 1804 may also include a non-volatile random access memory (NVRAM). The memory 1804 stores operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 1803 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The method disclosed in the embodiments of the present application may be applied to the processor 1803 or implemented by the processor 1803. The processor 1803 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by an integrated logic circuit of hardware in the processor 1803 or by instructions in the form of software. The processor 1803 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1803 may implement or perform the methods, steps, and logical blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed with reference to the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable read-only memory, or a register. The storage medium is located in the memory 1804, and the processor 1803 reads the information in the memory 1804 and completes the steps of the above method in combination with its hardware.
The receiver 1801 may be configured to receive input digit or character information and generate signal input related to the settings and function control of the execution device. The transmitter 1802 may be configured to output digit or character information through a first interface; the transmitter 1802 may be further configured to send an instruction to a disk group through the first interface to modify data in the disk group; and the transmitter 1802 may further include a display device such as a display screen.
Referring to fig. 19, fig. 19 is a schematic structural diagram of a training device provided in an embodiment of the present application. Specifically, the training device 1900 is implemented by one or more servers. The training device 1900 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 1919 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing an application program 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transitory or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), and each module may include a series of instruction operations for the training device. Further, the central processing unit 1919 may be configured to communicate with the storage medium 1930, to perform, on the training device 1900, the series of instruction operations in the storage medium 1930.
The training device 1900 may further include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, and one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
In this embodiment, the central processing unit 1919 is configured to perform actions related to model training in the above embodiments.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the program causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
The execution device, the training device, or the terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored by the storage unit to cause the chip in the execution device to execute the data processing method described in the above embodiment, or to cause the chip in the training device to execute the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, referring to fig. 20, fig. 20 is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be a neural-network processing unit (NPU) 2000. The NPU 2000 is mounted on a host CPU as a coprocessor, and the host CPU allocates tasks. The core part of the NPU is an arithmetic circuit 2003, and a controller 2004 controls the arithmetic circuit 2003 to extract matrix data from a memory and perform a multiplication operation.
In some implementations, the arithmetic circuit 2003 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuitry 2003 is a two-dimensional systolic array. The arithmetic circuit 2003 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2003 is a general purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2002 and buffers it in each PE in the arithmetic circuit. The arithmetic circuit takes the matrix a data from the input memory 2001 and performs matrix arithmetic with the matrix B, and partial results or final results of the obtained matrix are stored in an accumulator (accumulator) 2008.
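As a toy, NumPy-level illustration of the data flow just described (the weights of matrix B staged across the PEs, matrix A streamed in, and partial results summed in the accumulator), the following computes the same result as C = A @ B; it is of course not NPU code, only a functional sketch.

```python
import numpy as np


def matmul_like_npu(a, b):
    """Functionally C = A @ B, written as a sequence of accumulated
    partial results, mirroring the accumulator 2008 behaviour above."""
    m, k = a.shape
    k_b, n = b.shape
    assert k == k_b, "inner dimensions must match"
    accumulator = np.zeros((m, n))
    for step in range(k):
        # each step contributes one rank-1 partial product, which the
        # accumulator sums into the final output matrix C
        accumulator += np.outer(a[:, step], b[step, :])
    return accumulator
```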
The unified memory 2006 is used to store input data and output data. Weight data is transferred directly to the weight memory 2002 through a direct memory access controller (DMAC) 2005, and input data is also transferred to the unified memory 2006 through the DMAC.
A bus interface unit (BIU) 2010 is used for interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 2009.
The bus interface unit 2010 is used by the instruction fetch buffer 2009 to obtain instructions from an external memory, and is further used by the storage unit access controller 2005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2006 or to transfer weight data to the weight memory 2002 or to transfer input data to the input memory 2001.
The vector calculation unit 2007 includes a plurality of arithmetic processing units, and performs further processing on the output of the arithmetic circuit 2003 when necessary, such as vector multiplication, vector addition, an exponential operation, a logarithmic operation, or a magnitude comparison. It is mainly used for network computation at non-convolutional/fully connected layers in the neural network, such as batch normalization, pixel-level summation, and upsampling of a feature plane.
In some implementations, the vector calculation unit 2007 can store a processed output vector to the unified memory 2006. For example, the vector calculation unit 2007 may apply a linear function or a nonlinear function to the output of the arithmetic circuit 2003, for example, perform linear interpolation on a feature plane extracted by a convolutional layer, or apply a nonlinear function to a vector of accumulated values to generate an activation value. In some implementations, the vector calculation unit 2007 generates a normalized value, a pixel-level summed value, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 2003, for example, for use in a subsequent layer of the neural network.
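A rough sketch of the kind of post-processing attributed to the vector calculation unit above (a batch-normalization-style rescaling followed by an activation); the parameter names and the ReLU choice are illustrative assumptions, not part of the hardware description.

```python
import numpy as np


def vector_unit_postprocess(matrix_output, gamma, beta, eps=1e-5):
    """Normalize the matrix-unit output per channel, then apply an
    activation, producing activation values for the next layer."""
    mean = matrix_output.mean(axis=0)
    var = matrix_output.var(axis=0)
    normalized = gamma * (matrix_output - mean) / np.sqrt(var + eps) + beta
    return np.maximum(normalized, 0.0)  # ReLU activation
```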
The instruction fetch buffer 2009 connected to the controller 2004 is used to store instructions used by the controller 2004.
the unified memory 2006, the input memory 2001, the weight memory 2002, and the instruction fetch memory 2009 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including special-purpose integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components and the like. Generally, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures implementing the same function may be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is preferable in most cases. Based on such an understanding, the technical solutions of the present application may be substantially embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the method according to the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, all or some of the processes or functions described in the embodiments of the present application are produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a training device or a data center, integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Claims (23)

1. An object detection method, comprising:
acquiring a target image, detecting an object in the target image through a machine learning model, and outputting a first depth value and target information of the object, wherein the target information comprises at least one of the following information: geometric features of the object, keypoint information of the object, and location features of the object;
predicting at least one second depth value of the object according to the target information;
and fusing the first depth value and the at least one second depth value to obtain the target depth of the object.
2. The method of claim 1, wherein after detecting the object in the target image through the machine learning model, the method further comprises:
outputting a first confidence of the first depth value and a second confidence of each of the second depth values;
and fusing the first confidence coefficient and the at least one second confidence coefficient to obtain a target confidence coefficient of the target depth.
3. The method according to claim 1 or 2, wherein said fusing the first depth value and the at least one second depth value comprises:
acquiring a second weight of the first depth value and a second weight of each second depth value, wherein the value of each second weight is positively correlated with the confidence of the corresponding depth value;
and performing a weighted summation of the first depth value and the at least one second depth value according to the second weights.
4. The method according to any one of claims 1 to 3, wherein the keypoint information comprises 2D positions on the image of a plurality of keypoints of the object;
predicting at least one second depth value of the object according to the target information, including:
predicting at least one second depth value of the object by using a perspective-n-point (PnP) algorithm according to the target information.
5. The method of claim 4, wherein the geometric characteristic of the object comprises a physical dimension of the object and the positional characteristic of the object comprises an orientation angle of the object.
6. The method according to any one of claims 1 to 5, wherein the geometric features of the object comprise a plurality of pixel sizes of a 3D envelope box of the object in a preset direction, and a physical size of the object in the preset direction;
predicting at least one second depth value of the object according to the target information, including:
predicting at least one second depth value of the object according to a geometric relationship between the plurality of pixel sizes and the physical size.
7. The method of any of claims 2 to 6, wherein said fusing said first confidence level and said at least one second confidence level comprises:
acquiring a first weight of the first confidence and a first weight of each second confidence;
and performing a weighted summation of the first confidence and the at least one second confidence according to the first weights.
8. The method of any of claims 2 to 7, further comprising:
obtaining a confidence of a 3D envelope box of the object output by the machine learning model;
and fusing the confidence of the 3D envelope box of the object and the target confidence to obtain the geometric confidence of the object.
9. The method of claim 8, wherein fusing the confidence of the 3D envelope box of the object and the target confidence comprises:
obtaining a third weight of the confidence of the 3D envelope box of the object and of the target confidence;
and performing a weighted summation of the confidence of the 3D envelope box of the object and the target confidence according to the third weight.
10. The method according to claim 8 or 9, characterized in that the method further comprises:
obtaining a confidence of a 2D envelope box of the object output by the machine learning model;
and obtaining the detection confidence of the object according to the confidence of the 2D envelope box of the object and the geometric confidence.
11. An object detecting device, comprising:
the acquisition module is used for acquiring a target image;
an object detection module, configured to detect an object in the target image through a machine learning model, and output a first depth value of the object and target information, where the target information includes at least one of the following information: geometric features of the object, keypoint information of the object, and location features of the object;
a depth value prediction module for predicting at least one second depth value of the object according to the target information;
and the fusion module is used for fusing the first depth value and the at least one second depth value to obtain the target depth of the object.
12. The apparatus of claim 11, further comprising:
a confidence prediction module for outputting a first confidence of the first depth value and a second confidence of each of the second depth values after the detection of the object in the target image by the machine learning model;
the fusion module is further configured to fuse the first confidence level and the at least one second confidence level to obtain a target confidence level of the target depth.
13. The apparatus according to claim 11 or 12, wherein the keypoint information comprises 2D positions on the image of a plurality of keypoints of the object;
the depth value prediction module is specifically configured to:
predict at least one second depth value of the object by using a perspective-n-point (PnP) algorithm according to the target information.
14. The apparatus of claim 13, wherein the geometric characteristic of the object comprises a physical dimension of the object and the positional characteristic of the object comprises an orientation angle of the object.
15. The apparatus according to any one of claims 11 to 14, wherein the geometric features of the object comprise a plurality of pixel sizes of a 3D envelope box of the object in a preset direction, and a physical size of the object in the preset direction;
the depth value prediction module is specifically configured to:
predict at least one second depth value of the object according to a geometric relationship between the plurality of pixel sizes and the physical size.
16. The device according to any one of claims 12 to 15, wherein the fusion module is specifically configured to:
acquire a first weight of the first confidence and a first weight of each second confidence;
and perform a weighted summation of the first confidence and the at least one second confidence according to the first weights.
17. The apparatus according to any one of claims 11 to 16, wherein the fusion module is specifically configured to:
acquire a second weight of the first depth value and a second weight of each second depth value, wherein the value of each second weight is positively correlated with the confidence of the corresponding depth value;
and perform a weighted summation of the first depth value and the at least one second depth value according to the second weights.
18. The apparatus according to any one of claims 12 to 17, wherein the obtaining module is further configured to:
obtain a confidence of a 3D envelope box of the object output by the machine learning model;
the fusion module is further configured to:
fuse the confidence of the 3D envelope box of the object and the target confidence to obtain the geometric confidence of the object.
19. The apparatus according to claim 18, wherein the fusion module is specifically configured to:
obtain a third weight of the confidence of the 3D envelope box of the object and of the target confidence;
and perform a weighted summation of the confidence of the 3D envelope box of the object and the target confidence according to the third weight.
20. The apparatus of claim 18 or 19, wherein the obtaining module is further configured to:
obtain a confidence of a 2D envelope box of the object output by the machine learning model;
and the apparatus further comprises a confidence determining module, configured to obtain the detection confidence of the object according to the confidence of the 2D envelope box of the object and the geometric confidence.
21. A computer storage medium storing one or more instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 10.
22. A computer program product comprising computer readable instructions which, when run on a computer device, cause the computer device to perform the method of any one of claims 1 to 10.
23. A system, comprising at least one processor and at least one memory, wherein the processor and the memory are connected through a communication bus and communicate with each other;
the at least one memory is for storing code;
the at least one processor is configured to execute the code to perform the method of any of claims 1 to 10.
CN202210395863.8A 2022-04-15 2022-04-15 Object detection method and device Pending CN114972182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210395863.8A CN114972182A (en) 2022-04-15 2022-04-15 Object detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210395863.8A CN114972182A (en) 2022-04-15 2022-04-15 Object detection method and device

Publications (1)

Publication Number Publication Date
CN114972182A true CN114972182A (en) 2022-08-30

Family

ID=82977440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210395863.8A Pending CN114972182A (en) 2022-04-15 2022-04-15 Object detection method and device

Country Status (1)

Country Link
CN (1) CN114972182A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115713758A (en) * 2022-11-10 2023-02-24 国能黄骅港务有限责任公司 Carriage identification method, system and device and storage medium
CN115713758B (en) * 2022-11-10 2024-03-19 国能黄骅港务有限责任公司 Carriage identification method, system, device and storage medium
CN117893538A (en) * 2024-03-15 2024-04-16 成都方昇科技有限公司 Semiconductor device quality detection method, device and system based on machine vision
CN117893538B (en) * 2024-03-15 2024-05-31 成都方昇科技有限公司 Semiconductor device quality detection method, device and system based on machine vision

Similar Documents

Publication Publication Date Title
CN110298262B (en) Object identification method and device
CN110378381B (en) Object detection method, device and computer storage medium
CN110070107B (en) Object recognition method and device
WO2021043112A1 (en) Image classification method and apparatus
CN111401517B (en) Method and device for searching perceived network structure
CN110222717B (en) Image processing method and device
WO2021018106A1 (en) Pedestrian detection method, apparatus, computer-readable storage medium and chip
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN111368972B (en) Convolutional layer quantization method and device
CN111310604A (en) Object detection method and device and storage medium
CN110222718A (en) The method and device of image procossing
CN112529904A (en) Image semantic segmentation method and device, computer readable storage medium and chip
CN111797881A (en) Image classification method and device
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
CN115375781A (en) Data processing method and device
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN112258565B (en) Image processing method and device
US20230401826A1 (en) Perception network and data processing method
WO2022217434A1 (en) Cognitive network, method for training cognitive network, and object recognition method and apparatus
CN110705564A (en) Image recognition method and device
CN114972182A (en) Object detection method and device
CN113065575A (en) Image processing method and related device
CN115249269A (en) Object detection method, computer program product, storage medium, and electronic device
CN113534189A (en) Weight detection method, human body characteristic parameter detection method and device
CN115731530A (en) Model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination