CN116597213A - Target detection method, training device, electronic equipment and storage medium - Google Patents
Target detection method, training device, electronic equipment and storage medium
- Publication number
- CN116597213A (application number CN202310564785.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- sample
- feature
- image feature
- target object
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The present disclosure provides a target detection method, a training method, an apparatus, an electronic device, and a storage medium, and relates to the technical field of image processing, in particular to the technical fields of autonomous driving and deep learning. The specific implementation scheme is as follows: performing image feature extraction on an image to be processed to obtain a first image feature and at least one second image feature, wherein the feature scale of the first image feature is larger than the feature scale of the second image feature; determining a depth information feature from the at least one second image feature; and performing target detection on a target object related to the image to be processed according to the depth information feature and the first image feature to obtain a target object detection result.
Description
Technical Field
The present disclosure relates to the field of image processing technology, and in particular to the fields of autonomous driving technology and deep learning technology.
Background
With the rapid development of artificial intelligence technology, more and more terminals can identify target objects, such as obstacles and signboards, in the surrounding space based on artificial intelligence algorithms, which assists the terminals in performing corresponding operations. For example, an unmanned vehicle can detect the precise position of an obstacle in the surrounding space based on acquired images, and then realize automated driving functions such as automatic parking and automatic obstacle avoidance according to the detection result.
Disclosure of Invention
The present disclosure provides a target detection method, training method, apparatus, electronic device, storage medium, and computer program product.
According to an aspect of the present disclosure, there is provided a target detection method, including: performing image feature extraction on an image to be processed to obtain a first image feature and at least one second image feature, wherein the feature scale of the first image feature is larger than the feature scale of the second image feature; determining a depth information feature from the at least one second image feature; and performing target detection on a target object related to the image to be processed according to the depth information feature and the first image feature to obtain a target object detection result.
According to another aspect of the present disclosure, there is provided a training method of a deep learning model, including: acquiring a training sample, wherein the training sample includes a sample image and a sample label related to a sample target object; inputting the sample image into an image feature extraction network of the deep learning model, and outputting a sample first image feature and at least one sample second image feature, wherein the feature scale of the sample first image feature is larger than the feature scale of the sample second image feature; inputting the at least one sample second image feature into a depth detection network of the deep learning model, and outputting a sample depth information feature; inputting the sample depth information feature and the sample first image feature into a target detection network of the deep learning model, and outputting a sample target object detection result; and training the deep learning model according to the sample target object detection result and the sample label to obtain a trained deep learning model.
According to another aspect of the present disclosure, there is provided a target detection apparatus, including: an image feature extraction module, configured to perform image feature extraction on an image to be processed to obtain a first image feature and at least one second image feature, wherein the feature scale of the first image feature is larger than the feature scale of the second image feature; a depth information feature determining module, configured to determine a depth information feature according to the at least one second image feature; and a target object detection result obtaining module, configured to perform target detection on a target object related to the image to be processed according to the depth information feature and the first image feature to obtain a target object detection result.
According to another aspect of the present disclosure, there is provided a training apparatus of a deep learning model, including: a training sample acquisition module, configured to acquire a training sample, wherein the training sample includes a sample image and a sample label related to a sample target object; a sample image feature extraction module, configured to input the sample image into an image feature extraction network of the deep learning model and output a sample first image feature and at least one sample second image feature, wherein the feature scale of the sample first image feature is larger than the feature scale of the sample second image feature; a sample depth information feature obtaining module, configured to input the at least one sample second image feature into a depth detection network of the deep learning model and output a sample depth information feature; a sample target object detection result obtaining module, configured to input the sample depth information feature and the sample first image feature into a target detection network of the deep learning model and output a sample target object detection result; and a training module, configured to train the deep learning model according to the sample target object detection result and the sample label to obtain a trained deep learning model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with an embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a method provided according to an embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to embodiments of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are provided for a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:
Fig. 1 schematically illustrates an exemplary system architecture to which target detection methods and apparatus may be applied, according to embodiments of the present disclosure.
Fig. 2 schematically illustrates a flow chart of a target detection method according to an embodiment of the disclosure.
Fig. 3 schematically illustrates a schematic diagram of a target detection method according to an embodiment of the present disclosure.
Fig. 4 schematically illustrates a schematic diagram of a depth detection network according to an embodiment of the present disclosure.
Fig. 5 schematically illustrates a flowchart of a training method of a deep learning model according to an embodiment of the present disclosure.
Fig. 6 schematically illustrates a schematic diagram of a training method of a deep learning model according to an embodiment of the present disclosure.
Fig. 7 schematically illustrates a block diagram of an object detection apparatus according to an embodiment of the present disclosure.
Fig. 8 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that may be used to implement the target detection method and the training method of the deep learning model according to embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solutions of the present disclosure, the collection, storage, and use of the personal information involved all comply with the provisions of relevant laws and regulations, necessary security measures have been taken, and public order and good customs are not violated.
With the rapid development of artificial intelligence technology, terminals such as unmanned vehicles and intelligent robots can identify target objects, such as obstacles and signboards, in the surrounding space based on a target detection algorithm and then perform corresponding operations. The inventors have found that conventional target detection algorithms suffer from low detection precision; improving the detection precision typically requires increasing the computational overhead of target detection, which leads to problems such as long computation time and delayed target detection results, making it difficult for the related terminal to operate normally.
Embodiments of the present disclosure provide a target detection method, training method, apparatus, electronic device, storage medium, and computer program product. The target detection method comprises the following steps: extracting image features of an image to be processed to obtain a first image feature and at least one second image feature, wherein the feature scale of the first image feature is larger than that of the second image feature; determining depth information features from the at least one second image feature; and performing target detection on a target object related to the image to be processed according to the depth information characteristic and the first image characteristic to obtain a target object detection result.
According to the embodiments of the present disclosure, the first image feature and the second image feature, which have different feature scales, are extracted from the image to be processed, and the depth information feature is generated from the second image feature with the smaller feature scale, so that the amount of calculation and the calculation time required to obtain the depth information feature can be reduced. At the same time, the first image feature fully retains the image information in the image to be processed. Performing target detection according to the depth information feature and the first image feature can therefore reduce the overall computational overhead and computation time required for target detection while ensuring target detection precision, improve target detection efficiency, reduce the dependence of target detection on hardware computing performance, and at least partially avoid delays in the target object detection result. Accordingly, the target detection method provided by the embodiments of the present disclosure can improve the efficiency of detecting target objects such as obstacles and moving vehicles in scenarios such as assisted driving, intelligent unmanned vehicles, and automatic control of unmanned aerial vehicles, and improve the timeliness of obtaining the target object detection result.
Fig. 1 schematically illustrates an exemplary system architecture to which target detection methods and apparatus may be applied, according to embodiments of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied, provided to help those skilled in the art understand the technical content of the present disclosure; it does not mean that embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the object detection method and apparatus may be applied may include a vehicle, and the vehicle may implement the object detection method and apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include a vehicle 101, a network 102, and a server 103. Network 102 is the medium used to provide a communication link between vehicle 101 and server 103. Network 102 may include various connection types, such as wired and/or wireless communication links, and the like.
A user may use the vehicle 101 to interact with the server 103 via the network 102, to receive or send messages and the like. The vehicle 101 may be equipped with an image acquisition device, such as a monocular camera. The vehicle 101 may also be provided with a data processing device, such as a chip, adapted to process the acquired images.
The vehicle 101 may be a vehicle with intelligent driving assistance, including but not limited to a passenger car, a truck, and a special work vehicle.
The server 103 may be a server that provides various services, such as a background management server (merely an example) that provides support for the intelligent driving assistance functions of the vehicle 101. The background management server may analyze received data such as requests, and feed the processing result (e.g., a vehicle running control signal generated according to the request) back to the vehicle. The server 103 may also be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server 103 may also be a server of a distributed system or a server incorporating a blockchain.
It should be noted that, the object detection method provided in the embodiment of the present disclosure may be generally performed by the vehicle 101. Accordingly, the object detection device provided by the embodiment of the present disclosure may also be provided in the vehicle 101.
Alternatively, the object detection method provided by the embodiments of the present disclosure may be generally performed by the server 103. Accordingly, the object detection apparatus provided by the embodiments of the present disclosure may be generally provided in the server 103. The object detection method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 103 and is capable of communicating with the vehicle 101 and/or the server 103. Accordingly, the object detection apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 103 and is capable of communicating with the vehicle 101 and/or the server 103.
It should be understood that the numbers of vehicles, networks, and servers in fig. 1 are merely illustrative. There may be any number of vehicles, networks, and servers according to implementation needs.
Fig. 2 schematically illustrates a flow chart of a target detection method according to an embodiment of the disclosure.
As shown in fig. 2, the object detection method includes operations S210 to S230.
In operation S210, image feature extraction is performed on an image to be processed, so as to obtain a first image feature and at least one second image feature, where a feature scale of the first image feature is greater than a feature scale of the second image feature.
In operation S220, depth information features are determined from at least one second image feature.
In operation S230, a target object related to the image to be processed is detected according to the depth information feature and the first image feature, so as to obtain a target object detection result.
According to an embodiment of the present disclosure, the image to be processed may be an image obtained by capturing, with an image acquisition device, a space containing the target object. The image acquisition device may be, for example, a monocular device such as a monocular camera. Alternatively, the image acquisition device may be a surround-view image acquisition device. The embodiments of the present disclosure do not limit the specific device type of the image acquisition device.
According to embodiments of the present disclosure, image feature extraction may be performed on an image to be processed based on a neural network algorithm, for example, image feature extraction may be performed on an image to be processed based on a convolutional neural network algorithm. But not limited thereto, image feature extraction may be performed on the image to be processed based on other types of neural network algorithms, and the embodiment of the present disclosure does not limit the specific algorithm type for extracting the image feature, and those skilled in the art may select according to actual requirements.
According to the embodiments of the present disclosure, the at least one second image feature may be obtained by scaling the image to be processed and performing image feature extraction on the scaled image, so that a second image feature with a smaller feature scale is obtained. Alternatively, the second image feature with a smaller feature scale may be obtained by downsampling the first image feature. The embodiments of the present disclosure do not limit the specific manner of obtaining the second image feature, and those skilled in the art may choose according to actual requirements.
It should be noted that, the embodiments of the present disclosure do not limit the number of each of the first image feature and the second image feature.
According to the embodiments of the present disclosure, the at least one second image feature may be processed based on a neural network algorithm to extract image information related to the depth attribute of the image to be processed, thereby obtaining the depth information feature. The at least one second image feature may be processed, for example, based on a convolutional neural network algorithm, or based on a long short-term memory (LSTM) network algorithm. The embodiments of the present disclosure do not limit the specific algorithm type used to obtain the depth information feature, and those skilled in the art may choose according to actual requirements.
According to the embodiment of the disclosure, the first image features and the second image features with different feature scales are extracted from the image to be processed, and the depth information features are generated according to the second image features with smaller feature scales, so that the calculated amount and the calculated time for obtaining the depth information features can be reduced, and meanwhile, the image information in the image to be processed is fully reserved according to the first image features with larger feature scales.
According to embodiments of the present disclosure, the depth information feature and the first image feature may be processed based on an object detection algorithm, for example, the depth information feature and the first image feature may be processed based on a LSS (Lift Splat Shoot) algorithm. But not limited thereto, the depth information feature and the first image feature may be processed based on other types of object detection algorithms, and embodiments of the present disclosure are not limited in the specific type of algorithm that processes the depth information feature and the first image feature.
According to the embodiment of the disclosure, the first image features with larger feature scales fully retain image information in the image to be processed, and the depth information features generated according to the second image features with smaller feature scales can accelerate the calculation speed of obtaining the depth information features and reduce the calculation cost. Therefore, the target object related to the image to be processed is subjected to target detection according to the depth information feature and the first image feature, so that the calculation cost and the calculation time required by the target detection can be reduced as a whole, the target detection efficiency is improved, the dependence of the target detection on the hardware calculation performance is reduced, the time delay of the target object detection result is at least partially avoided, and the instantaneity of the target object detection result is improved.
It should be noted that the target object detection result may be a two-dimensional detection result representing the target object, or may also be a three-dimensional detection result representing the target object. The embodiment of the present disclosure does not limit the specific type of the target object detection result, as long as any one or more object attributes such as the size, type, motion state, and the like of the target object can be represented.
According to an embodiment of the present disclosure, performing image feature extraction on the image to be processed to obtain the first image feature and the at least one second image feature may include: convolving the image to be processed at least once to obtain an initial image feature; downsampling the initial image feature at least once to obtain the first image feature; and downsampling the first image feature at least once to obtain the at least one second image feature.
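For illustration only, the following is a minimal sketch of this three-step extraction scheme. The use of PyTorch, the channel counts, and the specific layer choices are assumptions made for the example, not details taken from the disclosure; the 8x/16x/32x scales anticipate the example of fig. 3.

```python
import torch
from torch import nn

class MultiScaleExtractor(nn.Module):
    """Sketch: an initial convolution, one downsampling stage producing the
    first (larger-scale) feature, and further downsampling stages producing
    the second (smaller-scale) features."""

    def __init__(self, in_channels: int = 3, channels: int = 64):
        super().__init__()
        # At least one convolution over the image to be processed -> initial image feature.
        self.initial = nn.Sequential(
            nn.Conv2d(in_channels, channels, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Downsample the initial feature -> first image feature (8x overall here).
        self.first = nn.Sequential(
            nn.Conv2d(channels, channels * 2, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # Downsample the first feature -> second image features (16x and 32x here).
        self.second_1 = nn.Conv2d(channels * 2, channels * 4, 3, stride=2, padding=1)
        self.second_2 = nn.Conv2d(channels * 4, channels * 8, 3, stride=2, padding=1)

    def forward(self, image: torch.Tensor):
        initial = self.initial(image)
        first = self.first(initial)
        second_1 = self.second_1(first)
        second_2 = self.second_2(second_1)
        return first, [second_1, second_2]
```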
Fig. 3 schematically illustrates a schematic diagram of a target detection method according to an embodiment of the present disclosure.
As shown in fig. 3, the image 301 to be processed may be an image acquired by a surround-view image acquisition device mounted on a vehicle, for example an image of the space surrounding the vehicle acquired by image acquisition devices mounted at a plurality of positions on the vehicle.
The image 301 to be processed may be input to a deep learning model including an image feature extraction network 310, a deep detection network 320, and a target detection network 330 to implement the target detection method provided in the embodiments of the present disclosure.
Specifically, convolving the image to be processed at least once may be implemented by inputting the image 301 to be processed into the initial image feature extraction layer 311 of the image feature extraction network 310 and outputting the initial image feature 302. The initial image feature extraction layer 311 may be, for example, a lightweight backbone network layer constructed based on RepVGG (Visual Geometry Group) network blocks, which improves the calculation speed of obtaining the initial image feature 302 and reduces the calculation overhead.
As shown in fig. 3, downsampling the initial image feature at least once may be implemented by inputting the initial image feature 302 into the first sampling layer 312 and outputting the first image feature 303. The first sampling layer 312 may be constructed based on a convolutional network and a pooling network, and the first image feature 303 may be, for example, an image feature downsampled by a factor of 8 relative to the original size of the image 301 to be processed.
As shown in fig. 3, the first image feature is downsampled at least once to obtain at least one second image feature, for example, the first image feature 303 may be input to the second downsampling layer 313, and the 1 st second image feature 304 is output.
According to the embodiments of the present disclosure, the at least one second image feature is obtained by downsampling the first image feature at least once, which avoids the large calculation overhead, long calculation time, and redundant fitting calculation that would result from downsampling the image to be processed directly, thereby reducing the time needed to obtain the second image feature with the smaller feature scale and improving the overall target detection efficiency.
According to the embodiment of the disclosure, the first image feature with larger feature size is obtained through extraction, so that the first image feature contains more image information in the image to be processed, and object attribute detection results such as a detection frame and a position corresponding to the target object can be obtained accurately according to the first image feature. By extracting the second image features with smaller feature sizes, the second image features can accurately correspond to semantic features in the image to be processed, so that an accurate classification result of the target object can be obtained later, or accurate semantic segmentation of the image to be processed can be facilitated. Therefore, the target object related to the image to be processed is subjected to target detection by fusing the first image feature and the second image feature, so that the requirements of detection accuracy of an object attribute detection result and target object classification can be met at the same time.
According to an embodiment of the present disclosure, the second image features include a plurality of second image features having different feature scales therebetween.
Wherein determining the depth information feature from the at least one second image feature may comprise: fusing the plurality of second image features to obtain fused image features; and detecting the depth information of the image to be processed according to the fusion image characteristics to obtain the depth information characteristics.
As shown in fig. 3, the 1 st second image feature 304 may also be input to the third downsampling layer 314, outputting the 2 nd second image feature 305. The 1 st second image feature 304 may be an image feature obtained by downsampling 16 times with respect to the original size of the image 301 to be processed, and the 2 nd second image feature 305 may be an image feature obtained by downsampling 32 times with respect to the original size of the image 301 to be processed.
As shown in fig. 3, fusing the plurality of second image features may be implemented, for example, by inputting the 1st second image feature 304 and the 2nd second image feature 305 into the fusion layer 321 of the depth detection network 320 and outputting the fused image feature. The fusion layer 321 may be constructed based on a multi-layer perceptron algorithm, or based on a concatenation (splicing) operation. Embodiments of the present disclosure do not limit the specific algorithm used to construct the fusion layer 321.
As shown in fig. 3, detecting the depth information of the image to be processed according to the fused image feature to obtain the depth information feature may be implemented by inputting the fused image feature output by the fusion layer 321 into the depth feature detection layer 322 and outputting the depth information feature 306. The depth feature detection layer 322 may be constructed based on a convolutional neural network. By performing feature extraction on the fused image feature, deep fusion between the 1st second image feature 304 and the 2nd second image feature 305 is realized, the receptive field of the feature representation is enlarged, and semantic features in the image 301 to be processed are extracted, so that the depth information feature can learn more comprehensive semantic information from the image 301 to be processed, improving the depth representation precision of the depth information feature.
It should be appreciated that the number of second image features 304, 305 shown in fig. 3 is merely exemplary and is not intended to limit the number of second image features, and that one skilled in the art may select the actual number of second image features to be obtained according to actual needs.
According to an embodiment of the present disclosure, performing target detection on the target object related to the image to be processed according to the depth information feature and the first image feature to obtain the target object detection result may include: determining a bird's-eye view feature corresponding to the image to be processed according to the depth information feature and the first image feature; and inputting the bird's-eye view feature into a first target object detection layer constructed based on an attention mechanism, and outputting the target object detection result.
As shown in fig. 3, the depth information feature 306 and the first image feature 303 may be input into the bird's-eye view feature detection layer 331 of the target detection network 330, which outputs the bird's-eye view feature. The bird's-eye view feature detection layer 331 may be, for example, a bird's-eye view detector constructed based on the LSS (Lift, Splat, Shoot) algorithm. The bird's-eye view feature may be input into the first target object detection layer 332, which outputs the target object detection result 307. The first target object detection layer 332 may be an end-to-end detection head constructed based on a Transformer algorithm.
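For readers unfamiliar with LSS-style bird's-eye view detectors, the following is a minimal sketch of the "lift" step, under the assumption that the depth information feature is represented as a per-pixel distribution over D discrete depth bins; this representation and the tensor shapes are assumptions for illustration, not details fixed by the disclosure.

```python
import torch

def lift_to_frustum(depth_logits: torch.Tensor, image_feature: torch.Tensor) -> torch.Tensor:
    """LSS-style lift step (sketch): weight the per-pixel image feature by a
    per-pixel depth distribution, producing a frustum of features that can
    subsequently be splatted onto the bird's-eye-view grid.

    depth_logits:  (B, D, H, W)  unnormalised scores over D depth bins
    image_feature: (B, C, H, W)  first image feature (context feature)
    returns:       (B, C, D, H, W) frustum feature
    """
    depth_prob = depth_logits.softmax(dim=1)                      # (B, D, H, W)
    frustum = depth_prob.unsqueeze(1) * image_feature.unsqueeze(2)  # outer product over C and D
    return frustum
```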
According to an embodiment of the present disclosure, the target object detection result may be, for example, a detection frame corresponding to the target object, a classification result of the target object, a position of the target object, or the like. The embodiment of the present disclosure does not limit the specific type of the target object detection result.
According to the embodiments of the present disclosure, generating the target object detection result with an end-to-end detection head constructed based on a Transformer algorithm at least avoids the excessive computation time caused by constructing the target object detection head based on a non-maximum suppression (NMS) algorithm, thereby improving the overall efficiency of target detection.
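As an illustration of what such an end-to-end, NMS-free detection head can look like, here is a minimal DETR-style sketch in PyTorch; the query count, feature width, and output box parameterization are assumptions made for the example, not details of the disclosed detection layer 332.

```python
import torch
from torch import nn

class QueryDetectionHead(nn.Module):
    """Sketch of an attention-based (Transformer) detection head: learned object
    queries attend to the flattened BEV feature and are decoded into class
    scores and box parameters, with no NMS post-processing."""

    def __init__(self, d_model: int = 256, num_queries: int = 100,
                 num_classes: int = 10, box_dim: int = 7):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.cls_head = nn.Linear(d_model, num_classes)
        self.box_head = nn.Linear(d_model, box_dim)  # e.g. x, y, z, w, l, h, yaw

    def forward(self, bev_feature: torch.Tensor):
        # bev_feature: (B, C, H, W) -> flatten the spatial grid into a token sequence
        b, c, h, w = bev_feature.shape
        memory = bev_feature.flatten(2).transpose(1, 2)            # (B, H*W, C)
        tgt = self.queries.weight.unsqueeze(0).expand(b, -1, -1)   # (B, Q, C)
        decoded = self.decoder(tgt, memory)                        # (B, Q, C)
        return self.cls_head(decoded), self.box_head(decoded)
```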
According to an embodiment of the present disclosure, fusing the plurality of second image features may include: determining at least one third image feature and at least one fourth image feature from the plurality of second image features, the third image feature having a feature scale greater than the feature scale of the fourth image feature; performing convolution on the fourth image feature at least once to obtain a target fourth image feature; and obtaining a fusion image feature according to the third image feature and the target fourth image feature.
Fig. 4 schematically illustrates a schematic diagram of a depth detection network according to an embodiment of the present disclosure.
As shown in fig. 4, from the two second image features, the second image feature with the larger feature scale may be determined as the third image feature 401, and the second image feature with the smaller feature scale may be determined as the fourth image feature 402. Convolving the fourth image feature at least once may be implemented by inputting the fourth image feature 402 into the convolution layer 411 of the depth detection network 410 and outputting the target fourth image feature 403. The fused image feature may be obtained from the third image feature and the target fourth image feature, for example by concatenating the third image feature 401 input into the depth detection network 410 with the target fourth image feature 403. The resulting fused image feature is input into the depth detection layer 412, constructed based on a convolutional neural network algorithm, which outputs the depth information feature 404.
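A minimal sketch of the depth detection network of fig. 4 is given below. The upsampling of the target fourth image feature before concatenation, the channel counts, and the number of depth bins are assumptions made so that the example runs, not details specified by the disclosure.

```python
import torch
from torch import nn
import torch.nn.functional as F

class DepthDetectionNet(nn.Module):
    """Sketch: convolve the smaller (fourth) feature, bring it to the spatial
    size of the larger (third) feature, concatenate, and run a convolutional
    depth detection layer. The interpolation step is an assumption made so
    that the two feature scales match for concatenation."""

    def __init__(self, third_ch: int, fourth_ch: int, depth_bins: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(fourth_ch, fourth_ch, 3, padding=1)   # conv on the fourth feature
        self.depth_layer = nn.Sequential(                           # depth detection layer
            nn.Conv2d(third_ch + fourth_ch, third_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(third_ch, depth_bins, 1),
        )

    def forward(self, third: torch.Tensor, fourth: torch.Tensor) -> torch.Tensor:
        target_fourth = self.conv(fourth)
        target_fourth = F.interpolate(target_fourth, size=third.shape[-2:],
                                      mode="bilinear", align_corners=False)
        fused = torch.cat([third, target_fourth], dim=1)             # fused image feature
        return self.depth_layer(fused)                               # depth information feature
```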
It should be appreciated that the depth detection network 410 shown in fig. 4 may be used in the object detection method provided in embodiments of the present disclosure to facilitate extraction of depth image features that accurately characterize depth information.
Fig. 5 schematically illustrates a flowchart of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in FIG. 5, the training method includes operations S510-S550.
In operation S510, a training sample is acquired, the training sample including a sample image and a sample label associated with a sample target object.
In operation S520, a sample image is input to an image feature extraction network of the deep learning model, and a sample first image feature and at least one sample second image feature are output, the feature scale of the sample first image feature being larger than the feature scale of the sample second image feature.
In operation S530, the at least one sample second image feature is input into a depth detection network of the deep learning model, and a sample depth information feature is output.
In operation S540, the sample depth information feature and the sample first image feature are input to the target detection network of the deep learning model, and the sample target object detection result is output.
In operation S550, the deep learning model is trained according to the sample target object detection result and the sample label, and the trained deep learning model is obtained.
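For orientation, the following is a minimal single-step training sketch corresponding to operations S520-S550. The network, loss, and optimizer objects are placeholders passed in by the caller; the structure is an assumption for illustration rather than the exact training procedure of the disclosure.

```python
import torch

def train_step(extractor, depth_net, det_net, criterion, optimizer,
               sample_image: torch.Tensor, sample_label: torch.Tensor) -> float:
    """One training step: forward through the three networks, compare the
    sample target object detection result with the sample label, and update
    the deep learning model parameters."""
    first_feat, second_feats = extractor(sample_image)     # S520: sample first / second features
    depth_feat = depth_net(*second_feats)                  # S530: sample depth information feature
    detection = det_net(depth_feat, first_feat)            # S540: sample target object detection result
    loss = criterion(detection, sample_label)              # S550: loss against the sample label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```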
According to the embodiments of the present disclosure, the trained deep learning model obtained by the training method provided by the embodiments of the present disclosure can be applied to the target detection method described above.
According to an embodiment of the present disclosure, the sample image may be an image obtained by capturing, with an image acquisition device, a space containing the sample target object. The image acquisition device may be, for example, a monocular device such as a monocular camera, or a surround-view image acquisition device. The embodiments of the present disclosure do not limit the specific device type of the image acquisition device.
According to the embodiments of the present disclosure, the sample first image feature, which has the larger feature scale, fully retains the image information in the sample image, while generating the sample depth information feature from the sample second image feature with the smaller feature scale accelerates the calculation of the sample depth information feature and reduces the calculation overhead. Therefore, the trained deep learning model can reduce the overall calculation overhead and calculation time required for target detection, thereby improving target detection efficiency, reducing the dependence of target detection on hardware computing performance, at least partially avoiding delays in the target object detection result, and improving the real-time performance of the target object detection result.
It should be noted that the technical terms involved in the training method of the deep learning model provided by the embodiments of the present disclosure, including but not limited to the sample image, the sample first image feature, and the sample second image feature, have the same or corresponding technical attributes as the corresponding technical terms involved in the target detection method provided by the above embodiments, including but not limited to the image to be processed, the first image feature, and the second image feature, and are not repeated here.
Fig. 6 schematically illustrates a schematic diagram of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 6, the deep learning model may include an image feature extraction network 610, a depth detection network 620, an object detection network 630, and a depth information detection layer 640.
The sample image 601 may be an image acquired by a surround-view image acquisition device mounted on a vehicle, for example an image of the space surrounding the vehicle acquired by image acquisition devices mounted at a plurality of positions on the vehicle. There may be one or more sample images; the embodiments of the present disclosure do not limit the specific number of sample images.
As shown in fig. 6, the sample image 601 may be input into the initial image feature extraction layer 611 of the image feature extraction network 610, which outputs the sample initial image feature 602. The initial image feature extraction layer 611 may be, for example, a lightweight backbone network layer constructed based on RepVGG (Visual Geometry Group) network blocks, which improves the calculation speed of obtaining the sample initial image feature 602 and reduces the calculation overhead.
According to the embodiments of the present disclosure, the image feature extraction network may be constructed by combining RepVGG (Visual Geometry Group) network blocks with lightweight SimSPPF (Simplified Spatial Pyramid Pooling-Fast) network blocks, further improving target detection efficiency.
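The disclosure does not detail the internal structure of the SimSPPF block; the following sketch follows the commonly used SPPF-style design (a 1x1 reduction, three chained 5x5 max-poolings, concatenation, and a 1x1 projection), which is an assumption for illustration only.

```python
import torch
from torch import nn

class SPPFBlock(nn.Module):
    """SPPF-style pooling block (sketch). Details are assumptions based on
    common SPPF/SimSPPF designs, not taken from the disclosure."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        hidden = in_ch // 2
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, hidden, 1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.project = nn.Sequential(nn.Conv2d(hidden * 4, out_ch, 1), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        p1 = self.pool(x)          # chained poolings enlarge the receptive field
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.project(torch.cat([x, p1, p2, p3], dim=1))
```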
As shown in fig. 6, the sample initial image feature 602 may be input into the first sampling layer 612, which outputs the sample first image feature 603. The first sampling layer 612 may be constructed based on a convolutional network and a pooling network, and the sample first image feature 603 may be, for example, an image feature downsampled by a factor of 8 relative to the original size of the sample image 601. The sample first image feature 603 may be input into the second downsampling layer 613, which outputs the 1st sample second image feature 604.
As shown in fig. 6, the 1st sample second image feature 604 may also be input into the third downsampling layer 614, which outputs the 2nd sample second image feature 605. The 1st sample second image feature 604 may be an image feature downsampled by a factor of 16 relative to the original size of the sample image 601, and the 2nd sample second image feature 605 may be an image feature downsampled by a factor of 32 relative to the original size of the sample image 601.
As shown in fig. 6, the 1st sample second image feature 604 and the 2nd sample second image feature 605 may be input into the depth detection network 620, which outputs the sample depth information feature 606. The depth detection network 620 may be constructed based on any type of neural network algorithm, such as a multi-layer perceptron algorithm or a convolutional neural network algorithm. Through the depth detection network 620, deep fusion between the 1st sample second image feature 604 and the 2nd sample second image feature 605 is realized, the receptive field of the feature representation is enlarged, and semantic features in the sample image 601 are extracted, so that the sample depth information feature 606 can learn more comprehensive semantic information from the sample image 601, improving the depth representation precision of the depth information feature.
It should be appreciated that the number of sample second image features 604, 605 shown in fig. 6 is merely exemplary and is not intended to limit the number of second image features, and that one skilled in the art may select the actual number of second image features to be obtained according to actual needs.
As shown in fig. 6, inputting the sample depth information feature and the sample first image feature into the target detection network of the deep learning model and outputting the sample target object detection result may be implemented, for example, by inputting the sample depth information feature 606 and the sample first image feature 603 into the bird's-eye view feature detection layer 631 of the target detection network 630, which outputs the sample bird's-eye view feature. The sample bird's-eye view feature is input into the first target object detection layer 632, which outputs the sample first target object detection result 6071; the sample bird's-eye view feature may also be input into the second target object detection layer 633, which outputs the sample second target object detection result 6072.
According to an embodiment of the present disclosure, the sample target object detection results may include the sample first target object detection result 6071 corresponding to the first target object detection layer 632 and the sample second target object detection result 6072 corresponding to the second target object detection layer 633. The first target object detection layer 632 may be constructed based on an attention mechanism, for example based on an attention neural network algorithm. The second target object detection layer 633 may be constructed based on a non-maximum suppression algorithm.
According to embodiments of the present disclosure, training the deep learning model may include: training a deep learning model according to the sample first target object detection result, the sample second target object detection result and the sample label.
As shown in fig. 6, the sample labels may include a sample detection result label 6081 corresponding to the sample first target object detection result 6071 and the sample second target object detection result 6072. A first loss value between the sample detection result label 6081 and the sample first target object detection result 6071, and a second loss value between the sample detection result label 6081 and the sample second target object detection result 6072, may be calculated based on a loss function. The model parameters of the deep learning model are then iteratively adjusted according to the first loss value and the second loss value to obtain the trained deep learning model.
According to the embodiment of the disclosure, the deep learning model is trained through the first loss value and the second loss value, so that the obtained trained deep learning model can retain the respective target object detection advantages of the first target object detection layer constructed based on the attention mechanism and the second target object detection layer constructed based on the non-maximum suppression algorithm, the robustness of the deep learning model is further improved, the convergence speed of the loss function is increased, and the training speed is improved.
As shown in fig. 6, the sample labels may also include a sample depth information label 6082 corresponding to the sample depth information feature 606. The sample depth information 6073 is output by inputting the sample depth information feature 606 into the depth information detection layer 640 of the deep learning model. A third loss value between the sample depth information 6073 and the sample depth information label 6082 is calculated based on a loss function, and the model parameters of the deep learning model are iteratively adjusted according to the first loss value, the second loss value, and the third loss value to obtain the trained deep learning model.
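A minimal sketch of combining the three loss values described above is shown below; the loss weights and the criterion objects are illustrative assumptions, not values specified by the disclosure.

```python
import torch

def total_loss(det1, det2, depth_pred, det_label, depth_label,
               det_criterion, depth_criterion,
               w1: float = 1.0, w2: float = 1.0, w3: float = 0.5) -> torch.Tensor:
    """Combine the three supervision signals used during training (sketch)."""
    loss1 = det_criterion(det1, det_label)              # first loss: attention-based head vs. label
    loss2 = det_criterion(det2, det_label)              # second loss: NMS-based head vs. label
    loss3 = depth_criterion(depth_pred, depth_label)    # third loss: depth supervision
    return w1 * loss1 + w2 * loss2 + w3 * loss3
```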
According to an embodiment of the present disclosure, when the trained deep learning model is applied to the target detection method, the target object detection result may be generated only by the first target object detection layer. Generating the target object detection result with the end-to-end first target object detection layer constructed based on an attention network algorithm at least avoids the excessive computation time caused by constructing the target object detection head based on a non-maximum suppression (NMS) algorithm, thereby improving the overall efficiency of target detection.
Fig. 7 schematically illustrates a block diagram of an object detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the object detection device 700 includes: an image feature extraction module 710, a depth information feature determination module 720, and a target object detection result obtaining module 730.
The image feature extraction module 710 is configured to perform image feature extraction on an image to be processed, so as to obtain a first image feature and at least one second image feature, where a feature scale of the first image feature is larger than a feature scale of the second image feature.
The depth information feature determining module 720 is configured to determine a depth information feature according to the at least one second image feature.
And the target object detection result obtaining module 730 is configured to perform target detection on a target object related to the image to be processed according to the depth information feature and the first image feature, so as to obtain a target object detection result.
According to an embodiment of the present disclosure, the second image features include a plurality of second image features having different feature scales therebetween.
Wherein the depth information feature determination module comprises: a fusion image feature obtaining unit and a depth information feature obtaining unit.
And the fusion image characteristic obtaining unit is used for fusing the plurality of second image characteristics to obtain fusion image characteristics.
And the depth information feature obtaining unit is used for carrying out depth information detection on the image to be processed according to the fusion image features to obtain depth information features.
According to an embodiment of the present disclosure, a fused image feature obtaining unit includes: the image feature determining subunit, the target fourth image feature obtaining subunit and the fused image feature obtaining subunit.
An image feature determination subunit configured to determine at least one third image feature and at least one fourth image feature from the plurality of second image features, the third image feature having a feature scale that is greater than the feature scale of the fourth image feature.
And the target fourth image feature obtaining subunit is used for carrying out convolution on the fourth image feature at least once to obtain the target fourth image feature.
And the fusion image characteristic obtaining subunit is used for obtaining the fusion image characteristic according to the third image characteristic and the target fourth image characteristic.
According to an embodiment of the present disclosure, the target object detection result obtaining module includes: a bird's eye view feature obtaining unit and a target object detection result obtaining unit.
And the bird's-eye view feature obtaining unit is used for determining the bird's-eye view feature corresponding to the image to be processed according to the depth information feature and the first image feature.
And the target object detection result obtaining unit is used for inputting the bird's-eye view feature into the first target object detection layer constructed based on the attention mechanism and outputting the target object detection result.
According to an embodiment of the present disclosure, an image feature extraction module includes: an initial image feature obtaining unit, a first image feature obtaining unit, and a second image feature obtaining unit.
The initial image feature obtaining unit is used for carrying out convolution on the image to be processed at least once to obtain initial image features.
The first image feature obtaining unit is used for carrying out downsampling on the initial image feature at least once to obtain a first image feature.
And the second image feature obtaining unit is used for carrying out at least one downsampling on the first image feature to obtain at least one second image feature.
Fig. 8 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 8, the training apparatus 800 of the deep learning model includes: a training sample obtaining module 810, a sample image feature extraction module 820, a sample depth information feature obtaining module 830, a sample target object detection result obtaining module 840 and a training module 850.
The training sample obtaining module 810 is configured to obtain a training sample, where the training sample includes a sample image and a sample label associated with a sample target object.
The sample image feature extraction module 820 is configured to input a sample image to an image feature extraction network of the deep learning model, and output a sample first image feature and at least one sample second image feature, where a feature scale of the sample first image feature is larger than a feature scale of the sample second image feature.
The sample depth information feature obtaining module 830 is configured to input the at least one sample second image feature to a depth detection network of the deep learning model and output a sample depth information feature.
The sample target object detection result obtaining module 840 is configured to input the sample depth information feature and the sample first image feature to a target detection network of the deep learning model, and output a sample target object detection result.
The training module 850 is configured to train the deep learning model according to the sample target object detection result and the sample label to obtain a trained deep learning model.
According to an embodiment of the present disclosure, the target detection network includes a first target object detection layer constructed based on an attention mechanism and a second target object detection layer constructed based on a non-maximum suppression algorithm, and the sample target object detection result includes a sample first target object detection result corresponding to the first target object detection layer and a sample second target object detection result corresponding to the second target object detection layer.
The training module comprises a training unit.
The training unit is configured to train the deep learning model according to the sample first target object detection result, the sample second target object detection result, and the sample label.
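To make the dual-head training concrete, the following hedged sketch shows one training iteration; the sub-network attribute names, the loss function, and the equal weighting of the two detection results are assumptions for illustration only.

```python
def train_step(model, optimizer, loss_fn, sample_image, sample_label):
    """One illustrative optimization step over a (sample image, sample label) pair."""
    first_feat, second_feats = model.image_feature_net(sample_image)    # image feature extraction network
    depth_feat = model.depth_net(second_feats)                          # depth detection network
    first_result, second_result = model.detection_net(depth_feat, first_feat)  # two detection layers
    # Supervise both the attention-based and the NMS-based detection results with the same label.
    loss = loss_fn(first_result, sample_label) + loss_fn(second_result, sample_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```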
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform the method described above.
According to an embodiment of the present disclosure, a computer program product includes a computer program which, when executed by a processor, implements the method described above.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that may be used to implement the target detection method and the training method of the deep learning model of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 901 performs the respective methods and processes described above, such as the target detection method and the training method of the deep learning model. For example, in some embodiments, the target detection method and the training method of the deep learning model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the target detection method and the training method of the deep learning model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the target detection method and the training method of the deep learning model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.
Claims (17)
1. A target detection method comprising:
extracting image features of an image to be processed to obtain a first image feature and at least one second image feature, wherein the feature scale of the first image feature is larger than that of the second image feature;
determining depth information features from at least one of the second image features; and
carrying out target detection on a target object related to the image to be processed according to the depth information feature and the first image feature to obtain a target object detection result.
2. The method of claim 1, wherein the second image features comprise a plurality of the second image features having different feature scales therebetween;
wherein said determining depth information features from at least one of said second image features comprises:
fusing a plurality of the second image features to obtain fused image features; and
carrying out depth information detection on the image to be processed according to the fused image features to obtain the depth information features.
3. The method of claim 2, wherein the fusing the plurality of the second image features comprises:
determining at least one third image feature and at least one fourth image feature from a plurality of the second image features, the third image feature having a feature scale that is larger than the feature scale of the fourth image feature;
performing convolution on the fourth image feature at least once to obtain a target fourth image feature; and
obtaining the fused image features according to the third image feature and the target fourth image feature.
4. The method according to claim 1, wherein the performing target detection on the target object related to the image to be processed according to the depth information feature and the first image feature to obtain the target object detection result includes:
determining a bird's eye view feature corresponding to the image to be processed according to the depth information feature and the first image feature; and
inputting the bird's eye view feature to a first target object detection layer constructed based on an attention mechanism, and outputting the target object detection result.
5. The method of claim 1, wherein the performing image feature extraction on the image to be processed to obtain the first image feature and the at least one second image feature comprises:
carrying out convolution on the image to be processed at least once to obtain an initial image feature;
downsampling the initial image feature at least once to obtain the first image feature; and
downsampling the first image feature at least once to obtain the at least one second image feature.
6. A training method of a deep learning model, comprising:
obtaining a training sample, wherein the training sample comprises a sample image and a sample label related to a sample target object;
inputting the sample image into an image feature extraction network of the deep learning model, and outputting a sample first image feature and at least one sample second image feature, wherein the feature scale of the sample first image feature is larger than that of the sample second image feature;
inputting at least one sample second image feature into a depth detection network of the deep learning model, and outputting sample depth information features;
inputting the sample depth information features and the sample first image features to a target detection network of the deep learning model, and outputting a sample target object detection result; and
training the deep learning model according to the sample target object detection result and the sample label to obtain a trained deep learning model.
7. The method of claim 6, wherein the target detection network comprises a first target object detection layer constructed based on an attention mechanism and a second target object detection layer constructed based on a non-maximum suppression algorithm, the sample target object detection results comprising a sample first target object detection result corresponding to the first target object detection layer and a sample second target object detection result corresponding to the second target object detection layer;
wherein the training the deep learning model according to the sample target object detection result and the sample label includes:
training the deep learning model according to the sample first target object detection result, the sample second target object detection result, and the sample label.
8. An object detection apparatus comprising:
an image feature extraction module, configured to perform image feature extraction on an image to be processed to obtain a first image feature and at least one second image feature, wherein a feature scale of the first image feature is larger than a feature scale of the second image feature;
a depth information feature determining module, configured to determine a depth information feature according to at least one of the second image features; and
a target object detection result obtaining module, configured to perform target detection on a target object related to the image to be processed according to the depth information feature and the first image feature to obtain a target object detection result.
9. The apparatus of claim 8, wherein the second image features comprise a plurality of the second image features having different feature scales therebetween;
wherein the depth information feature determination module comprises:
a fused image feature obtaining unit, configured to fuse a plurality of the second image features to obtain a fused image feature; and
a depth information feature obtaining unit, configured to perform depth information detection on the image to be processed according to the fused image feature to obtain the depth information feature.
10. The apparatus according to claim 9, wherein the fused image feature obtaining unit includes:
an image feature determining subunit, configured to determine at least one third image feature and at least one fourth image feature from a plurality of the second image features, wherein a feature scale of the third image feature is greater than a feature scale of the fourth image feature;
a target fourth image feature obtaining subunit, configured to perform convolution on the fourth image feature at least once to obtain a target fourth image feature; and
a fused image feature obtaining subunit, configured to obtain the fused image feature according to the third image feature and the target fourth image feature.
11. The apparatus of claim 8, wherein the target object detection result obtaining module comprises:
a bird's eye view feature obtaining unit, configured to determine a bird's eye view feature corresponding to the image to be processed according to the depth information feature and the first image feature; and
a target object detection result obtaining unit, configured to input the bird's eye view feature into a first target object detection layer constructed based on an attention mechanism and output the target object detection result.
12. The apparatus of claim 8, wherein the image feature extraction module comprises:
an initial image feature obtaining unit, configured to perform convolution on the image to be processed at least once to obtain an initial image feature;
a first image feature obtaining unit, configured to perform downsampling on the initial image feature at least once to obtain the first image feature; and
a second image feature obtaining unit, configured to perform downsampling on the first image feature at least once to obtain the at least one second image feature.
13. A training device for a deep learning model, comprising:
a training sample obtaining module, configured to obtain a training sample, wherein the training sample includes a sample image and a sample label related to a sample target object;
a sample image feature extraction module, configured to input the sample image into an image feature extraction network of the deep learning model and output a sample first image feature and at least one sample second image feature, wherein a feature scale of the sample first image feature is larger than a feature scale of the sample second image feature;
a sample depth information feature obtaining module, configured to input at least one sample second image feature into a depth detection network of the deep learning model and output a sample depth information feature;
a sample target object detection result obtaining module, configured to input the sample depth information feature and the sample first image feature to a target detection network of the deep learning model and output a sample target object detection result; and
a training module, configured to train the deep learning model according to the sample target object detection result and the sample label to obtain a trained deep learning model.
14. The apparatus of claim 13, wherein the target detection network comprises a first target object detection layer constructed based on an attention mechanism and a second target object detection layer constructed based on a non-maximum suppression algorithm, the sample target object detection results comprising a sample first target object detection result corresponding to the first target object detection layer and a sample second target object detection result corresponding to the second target object detection layer;
wherein the training module includes:
a training unit, configured to train the deep learning model according to the sample first target object detection result, the sample second target object detection result, and the sample label.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
16. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310564785.4A CN116597213A (en) | 2023-05-18 | 2023-05-18 | Target detection method, training device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116597213A true CN116597213A (en) | 2023-08-15 |
Family
ID=87604157
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310564785.4A Pending CN116597213A (en) | 2023-05-18 | 2023-05-18 | Target detection method, training device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597213A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220066456A1 (en) * | 2016-02-29 | 2022-03-03 | AI Incorporated | Obstacle recognition method for autonomous robots |
CN111160378A (en) * | 2018-11-07 | 2020-05-15 | 电子科技大学 | Depth estimation system based on single image multitask enhancement |
WO2022143366A1 (en) * | 2021-01-04 | 2022-07-07 | 北京沃东天骏信息技术有限公司 | Image processing method and apparatus, electronic device, medium, and computer program product |
CN114187454A (en) * | 2021-12-09 | 2022-03-15 | 西南科技大学 | Novel significance target detection method based on lightweight network |
CN115410181A (en) * | 2022-09-22 | 2022-11-29 | 西安交通大学 | Double-head decoupling alignment full scene target detection method, system, device and medium |
CN115965923A (en) * | 2022-12-29 | 2023-04-14 | 东软睿驰汽车技术(大连)有限公司 | Bird's-eye view target detection method and device based on depth estimation and electronic equipment |
CN115861400A (en) * | 2023-02-15 | 2023-03-28 | 北京百度网讯科技有限公司 | Target object detection method, training method and device and electronic equipment |
Non-Patent Citations (2)
Title |
---|
CODY READING et al.: "Categorical Depth Distribution Network for Monocular 3D Object Detection", 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2021, pages 8553-8554 *
XIE Minghong et al.: "Crowded pedestrian detection method combining anchor-free and anchor-based algorithms", Journal of Electronics & Information Technology, vol. 45, no. 5, page 2 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113674421B (en) | 3D target detection method, model training method, related device and electronic equipment | |
EP3852008A2 (en) | Image detection method and apparatus, device, storage medium and computer program product | |
CN115147809B (en) | Obstacle detection method, device, equipment and storage medium | |
CN113837305B (en) | Target detection and model training method, device, equipment and storage medium | |
CN112863187B (en) | Detection method of perception model, electronic equipment, road side equipment and cloud control platform | |
CN113378712A (en) | Training method of object detection model, image detection method and device thereof | |
CN116052097A (en) | Map element detection method and device, electronic equipment and storage medium | |
CN115861400A (en) | Target object detection method, training method and device and electronic equipment | |
CN115147831A (en) | Training method and device of three-dimensional target detection model | |
CN113591569A (en) | Obstacle detection method, obstacle detection device, electronic apparatus, and storage medium | |
CN114549961B (en) | Target object detection method, device, equipment and storage medium | |
CN114429631B (en) | Three-dimensional object detection method, device, equipment and storage medium | |
CN112861811B (en) | Target identification method, device, equipment, storage medium and radar | |
CN116597213A (en) | Target detection method, training device, electronic equipment and storage medium | |
CN113869147A (en) | Target detection method and device | |
CN113139463A (en) | Method, apparatus, device, medium and program product for training a model | |
CN115049895B (en) | Image attribute identification method, attribute identification model training method and device | |
CN113807236B (en) | Method, device, equipment, storage medium and program product for lane line detection | |
CN114495042B (en) | Target detection method and device | |
CN116663650B (en) | Training method of deep learning model, target object detection method and device | |
CN116229209B (en) | Training method of target model, target detection method and device | |
CN113963322B (en) | Detection model training method and device and electronic equipment | |
CN113361524B (en) | Image processing method and device | |
CN115240171B (en) | Road structure sensing method and device | |
CN116168366B (en) | Point cloud data generation method, model training method, target detection method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |