CN115965935B - Object detection method, device, electronic apparatus, storage medium, and program product

Publication number: CN115965935B
Authority: CN (China)
Prior art keywords: network, detection, convolution, attribute, convolution layer
Legal status: Active (granted)
Application number: CN202211675878.6A
Other languages: Chinese (zh)
Other versions: CN115965935A
Inventors: 张珂, 魏亚男, 蔡叶荷, 王瑞
Current Assignee: Guangzhou Woya Technology Co ltd
Original Assignee: Guangzhou Woya Technology Co ltd
Application filed by Guangzhou Woya Technology Co ltd
Priority to CN202211675878.6A
Publication of CN115965935A (application publication)
Application granted
Publication of CN115965935B (granted publication)


Abstract

Embodiments of the present disclosure relate to a target detection method, apparatus, electronic device, storage medium, and program product. The method comprises: acquiring a target image to be detected, and detecting the target image through a pre-trained target detection model to obtain a detection result. The target detection model comprises a detection network and a plurality of attribute identification networks, wherein the detection network and each attribute identification network share the same convolution layer; the detection network is used for identifying the category information and the position information of at least one detection object in the target image; different attribute identification networks are used for identifying different attributes of the detection object. With this method, the time consumption caused by parameter stacking can be reduced.

Description

Object detection method, device, electronic apparatus, storage medium, and program product
Technical Field
The embodiment of the disclosure relates to the technical field of automatic driving, in particular to a target detection method, a target detection device, electronic equipment, a storage medium and a program product.
Background
In the perception scenario of an autonomous vehicle, the perception system performs environment perception and outputs the detected obstacle attributes, such as position, direction and speed. A multitask model is a model that outputs the obstacle attributes of position, direction, speed and so on in a unified way from one large model.
The existing multitask model uses three parallel branches to extract features separately, combines the extracted features, and then identifies the obstacle attributes from the combined result.
However, the three parallel branches in the conventional technology suffer from parameter stacking, which results in high time consumption.
Disclosure of Invention
Embodiments of the present disclosure provide a target detection method, apparatus, electronic device, storage medium, and program product, which can alleviate the high time consumption caused by parameter stacking in the existing multitask model.
In a first aspect, an embodiment of the present disclosure provides a target detection method, including:
acquiring a target image to be detected;
detecting a target image through a pre-trained target detection model to obtain a detection result;
the target detection model comprises a detection network and a plurality of attribute identification networks, wherein the detection network and each attribute identification network share the same convolution layer; the detection network is used for identifying the category information and the position information of at least one detection object in the target image; different attribute identification networks are used to identify different attributes of the detection object.
In a second aspect, embodiments of the present disclosure provide an object detection apparatus, the apparatus comprising:
The acquisition module is used for acquiring a target image to be detected;
the detection module is used for detecting the target image through a pre-trained target detection model to obtain a detection result;
the target detection model comprises a detection network and a plurality of attribute identification networks, wherein the detection network and each attribute identification network share the same convolution layer; the detection network is used for identifying the category information and the position information of at least one detection object in the target image; different attribute identification networks are used to identify different attributes of the detection object.
In a third aspect, an embodiment of the disclosure provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method of the first aspect when the processor executes the computer program.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method of the first aspect.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect described above.
The target detection method, apparatus, electronic device, storage medium, and program product provided by the embodiments of the present disclosure acquire a target image to be detected and detect the target image through a pre-trained target detection model to obtain a detection result. Because the existing detection model suffers from high time consumption caused by parameter stacking, the embodiments of the present disclosure redesign the target detection model at the level of the model structure: the target detection model has no parallel branches, and the detection network and each attribute identification network in the target detection model share one convolution layer, so the whole model is free of parameter stacking and has fewer parameters, which reduces the time consumption caused by parameter stacking. In addition, in the existing detection model it is difficult to directionally optimize or prune a single obstacle attribute, whereas the target detection model decouples the obstacle attributes, which makes it easy to add, delete, or directionally optimize a single obstacle attribute without affecting the other obstacle attribute tasks. Further, the target image is detected by the pre-trained target detection model to obtain the category information, position information, and attribute information of at least one detection object in the target image, and this information is aggregated to obtain the detection result.
Drawings
FIG. 1 is a schematic diagram of a conventional multitasking model structure;
FIG. 2 is a schematic diagram of a multitasking model structure in one embodiment;
FIG. 3 is a schematic diagram of a multitasking model structure in one embodiment;
FIG. 4 is a diagram of an application environment for a target detection method in one embodiment;
FIG. 5 is a flow chart of a method of detecting targets in one embodiment;
FIG. 6 is a flowchart illustrating a plurality of attribute information acquisition steps of each detection object according to an embodiment;
FIG. 7 is one of the flow diagrams of the feature map acquisition in one embodiment;
FIG. 8 is a second flow chart of a feature map acquisition in one embodiment;
FIG. 9 is a flowchart illustrating steps for obtaining category information and location information of each detected object according to an embodiment;
FIG. 10 is a schematic diagram of dilated convolutions with different dilation rates in one embodiment;
FIG. 11 is a visualization of the saw-tooth (gridding) effect in one embodiment;
FIG. 12 is a flowchart illustrating a step of acquiring a plurality of attribute information of each detection object according to an embodiment;
FIG. 13 is a flow chart of a method for detecting targets according to another embodiment;
FIG. 14 is a block diagram of an object detection device in one embodiment;
fig. 15 is an internal structural diagram of an electronic device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the embodiments of the present disclosure will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the disclosed embodiments and are not intended to limit the disclosed embodiments.
First, before the technical solutions of the embodiments of the present disclosure are described in detail, the technical background or technical evolution on which the embodiments are based is described. Typically, in automatic driving, a computer system takes over the maneuvering of the vehicle from a human driver. When an autonomous vehicle runs on a road, the three major elements of perception, decision, and control are inseparable. Perception is the key link that allows the vehicle to "see the road": like a human driver, the autonomous vehicle must read and understand the traffic elements on the road. In short, perception takes the data of various sensors and the information of a high-precision map as input and, through a series of calculations and processing, accurately understands the environment around the autonomous vehicle.
In an autonomous vehicle, perception is completed by the perception system. The perception system takes data collected by vehicle-mounted cameras and information from a high-precision map as input, performs a series of calculations and processing, and finally outputs, through front fusion or rear fusion, information such as the position, shape, category, and speed of obstacles, as well as semantic understanding of special scenes such as construction areas, traffic lights, and traffic warning lights. The objects perceived by the perception system fall into two categories: static objects, namely roads, traffic signs, static obstacles, and the like; and dynamic objects, namely vehicles, pedestrians, moving obstacles, and the like. For a dynamic object, in addition to knowing its specific category, information such as position, speed, and direction must be tracked, and the next behavior of the object is predicted from the tracking result.
Whether the perception system can effectively output obstacle attributes such as position, direction, and speed determines how many obstacles the autonomous vehicle can "see" and whether it can adapt to a complex and changing traffic environment. As the intelligence level of smart vehicles rises, the demand for image-based perception increases.
A multitask model is a model that outputs the obstacle attributes of position, direction, speed, and so on from a single large model. In the conventional technology, the multitask model realizes end-to-end multi-task recognition by using obstacle detection, element semantic segmentation, and obstacle attribute recognition as three heads. This scheme can output the contour, position, direction, depth, and other relevant attributes of the obstacles from the perception system, and is shown in fig. 1.
The backbone network (backbone) in fig. 1 is the trunk of the model, here DLA34; the hierarchical fusion network here uses DLA-UP and IDA-UP; and the following detection head, semantic segmentation head, and obstacle recognition head implement obstacle detection, element semantic segmentation, and obstacle attribute recognition respectively. It should be noted that, because the target attributes of obstacles are difficult to obtain from the feature map of the input image through a single attribute head, obstacle attribute recognition is performed through three parallel branches, the parallel results are then combined, and finally the obstacle attribute recognition results are output through a first recognition head and a second recognition head. The first recognition head can recognize whether a vehicle on the road surface has an open door, the height of the vehicle on the road surface in the input image, and whether roadblocks on the road, such as traffic cones and water-filled barriers, are in a movable state; the second recognition head can recognize the angle of an obstacle on the road surface in the input image.
Against this background, through long-term model simulation research and development and the collection, demonstration, and verification of experimental data, the applicant found that in the existing technical scheme, because obstacle attribute recognition requires combining the outputs of three parallel branches, the network parameters after combination are three times those of a single attribute head, and this parameter stacking makes the perception system of the autonomous vehicle highly time-consuming. In addition, the three obstacle attributes in the prior art (the door-open state, the vehicle height in the input image, and the movable state of roadblocks such as traffic cones and water-filled barriers) are output by one recognition head, so the three attributes are strongly coupled, and it is difficult to support directional optimization or the arbitrary addition, deletion, and correction of a single obstacle attribute.
The applicant first considered designing the multitask model structure in the form of fig. 2, in which the attributes of each obstacle, obstacle detection, and element semantic segmentation are each made into an independent output head connected after the feature extractor formed by the backbone network and the hierarchical fusion network. This makes the attributes independent of one another, so that any of them can be added or deleted, and it also reduces the parameter count. In terms of parameter statistics, changing the model structure shown in fig. 1 to the model structure shown in fig. 2 reduces the network delay by 7.15% and the parameter count by 4.1%.
Meanwhile, model training was performed for the model structure shown in fig. 2, and the model training results are shown in table 1 as compared with the training results of the model structure shown in fig. 1:
table 1 comparison of the results of the model structure of fig. 2 and the model structure of fig. 1
Contrast item    Detection accuracy    Dynamic obstacle speed error    Dynamic obstacle direction error
Change           -1.22%                +1.77%                          +16.36%
Here, the detection accuracy (Detection Mean Average Precision, Detection mAP) is used to evaluate the sample-averaged precision of the detection task; a larger value indicates better performance. The dynamic obstacle speed error (Speed Error) is used to evaluate the error in estimating the speed of vehicles, pedestrians, and the like on the road; a smaller value indicates better performance. The dynamic obstacle direction error (Heading Error) is used to evaluate the error in estimating the direction of vehicles, pedestrians, and the like on the road; a smaller value indicates better performance.
From the results in table 1, it can be seen that although the structure of fig. 2 achieves reduced time consumption and low coupling of the target attributes, its performance does not meet expectations: the detection accuracy drops by 1.22%, the speed error of dynamic obstacles increases by 1.77%, and the direction error of dynamic obstacles increases by 16.36%.
Based on the above, the applicant visualized the intermediate feature layers of the model to analyze the cause of this result. It was found that the feature map obtained by inputting the image collected by a camera or similar sensor into the backbone network and the hierarchical fusion network has no obvious foreground; however, when that feature map is further passed through the detection network, target position key points are obtained that give the positions of targets such as vehicles and persons on the road. Therefore, if the obstacle attribute networks are combined with the target position key points obtained from the detection network, this has a positive effect on the performance of the obtained obstacle attribute information.
Based on this, the applicant designed the model structure of fig. 3, and trained and evaluated the model structure of fig. 3, and the model training results compared to the model structure of fig. 1 are shown in table 2:
table 2 comparison of the results of the model structure of fig. 3 with the model structure of fig. 1
Contrast item    Detection accuracy    Dynamic obstacle speed error    Dynamic obstacle direction error
Change           +0.43%                -1.15%                          -2.75%
From the results in table 2, it can be seen that the performance of the model structure of fig. 3 is positively improved: the detection accuracy increases by 0.43%, the speed error of dynamic obstacles decreases by 1.15%, and the direction error of dynamic obstacles decreases by 2.75%. The structure still has the advantages of a small parameter count and low coupling between obstacle target attributes; in terms of parameter statistics, changing the model shown in fig. 1 to the model shown in fig. 3 reduces the network delay by 7.15% and the parameter count by 4.1%.
Accordingly, the model structure shown in fig. 3 is determined as the multitask model structure of the embodiments of the present disclosure. In addition, in arriving at and describing the technical solutions of the following embodiments, the applicant has made a great deal of creative work.
The following describes a technical scheme related to an embodiment of the present disclosure in conjunction with a scenario in which the embodiment of the present disclosure is applied.
The target detection method provided by the embodiment of the present disclosure may be applied to an electronic device in an autonomous vehicle 10 as shown in fig. 4. The electronic device may be a vehicle-mounted central control device or a terminal disposed in a vehicle, and the embodiment of the disclosure does not limit a specific form of the electronic device.
In one embodiment, as shown in fig. 5, a target detection method is provided, and the method is applied to the electronic device in fig. 4 for illustration, and includes the following steps:
step 201, acquiring a target image to be detected.
The target image to be detected may be a long-focus (telephoto) view, a short-focus view, a left-front view, or a right-front view collected by a lidar or camera sensor at preset intervals. The preset interval is the period after which the autonomous vehicle needs to collect an image again, and may be, for example, several seconds or several tens of seconds.
In the embodiment of the disclosure, in the process of driving an automatic driving automobile, the vehicle-mounted camera sensor can acquire target images to be detected and send the target images to be detected to the electronic equipment one by one, and the electronic equipment acquires the target images to be detected. It should be noted that, the electronic device is provided with a target detection model, and the target detection model can detect a target image to be detected.
Step 202, detecting a target image through a pre-trained target detection model to obtain a detection result.
The object detection model may be a multitasking model mentioned in the background art, and is used for detecting attribute information of an object image, and the object detection model may include, but is not limited to, a detection network and a plurality of attribute identification networks. The detection network is used for identifying the category information and the position information of at least one detection object in the target image, and the plurality of attribute identification networks are used for identifying different attributes of the detection object. The detection object may be a vehicle, a pedestrian, a person riding a two-wheeled vehicle, a person riding a tricycle, a cone, a roadblock, etc. in the target image, and the different attribute of the detection object may be whether a door of the vehicle is closed, a height of the vehicle on the target image, whether the roadblock is in a movable state, an angle of the obstacle on the target image, etc.
It should be noted that, because dilated convolution ensures a larger receptive field while keeping the parameter count unchanged, the pre-trained target detection model may apply several dilated convolutions in the detection network and in each of the attribute identification networks. For example, the detection network may be given two dilated convolutions and each attribute identification network three dilated convolutions, with the detection network and each attribute identification network sharing one convolution layer. The shared convolution layer may be the first dilated convolution layer of the detection network or the second dilated convolution layer of the detection network.
In the embodiment of the present disclosure, the electronic device may detect the target image to be detected obtained in the step 201 through a detection network and a plurality of attribute recognition networks in a pre-trained target detection model, specifically detect category information, position information and attribute information of the detection object in the target image, and aggregate the information to obtain a detection result.
According to the target detection method provided by the embodiment of the present disclosure, the target image to be detected is acquired and detected through the pre-trained target detection model to obtain the detection result. Because the existing detection model suffers from high time consumption caused by parameter stacking, the embodiment of the present disclosure redesigns the target detection model at the level of the model structure: the target detection model has no step of combining parallel attribute-head branches, and the detection network and each attribute identification network share one convolution layer, so the whole model is free of parameter stacking and has fewer parameters, which reduces the time consumption caused by parameter stacking. In addition, in the existing detection model it is difficult to directionally optimize or prune a single obstacle attribute, whereas the target detection model decouples the obstacle attributes, which makes it easy to add, delete, or directionally optimize a single obstacle attribute without affecting the other obstacle attribute tasks. Further, the target image is detected by the pre-trained target detection model to obtain the category information, position information, and attribute information of at least one detection object in the target image, and this information is aggregated to obtain the detection result.
In an embodiment, on the basis of the embodiment shown in fig. 5, the target detection model further includes a feature extraction network. The process of detecting the target image through the pre-trained target detection model to obtain the detection result is described below. As shown in fig. 6, "detecting the target image through a pre-trained target detection model to obtain a detection result" may include the following steps:
step 301, extracting features of the target image through a feature extraction network to obtain a feature map.
The feature extraction network may include, but is not limited to, a backbone network and a fusion network.
In the embodiment of the present disclosure, the target image to be detected obtained in step 201 may be input into the pre-trained target detection model. The target detection model first resizes the target image from 1920×1200 to 1088×704 and normalizes the pixels of the 1088×704 image; the normalized image is then input into the feature extraction network for feature extraction, and the extracted features are aggregated to obtain a feature map of the target image.
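For illustration only, the following minimal PyTorch-style sketch mirrors the resizing and normalization described above (from 1920×1200 to 1088×704); the normalization constants and function names are assumptions, not values taken from the disclosure:

import torch
import torch.nn.functional as F

def preprocess(frame: torch.Tensor) -> torch.Tensor:
    # frame: uint8 image tensor of shape (3, 1200, 1920), i.e. a 1920x1200 camera frame.
    x = frame.float().unsqueeze(0)                                  # (1, 3, 1200, 1920)
    # Resize to 1088x704 (width x height), as described in this step.
    x = F.interpolate(x, size=(704, 1088), mode="bilinear", align_corners=False)
    # Normalize pixels to roughly zero mean / unit variance; these statistics are placeholders.
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1) * 255.0
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1) * 255.0
    return (x - mean) / std                                         # (1, 3, 704, 1088)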
Alternatively, the process of extracting the features of the target image through the feature extraction network to obtain the feature map may be described, as shown in fig. 7, where the step of extracting the features of the target image through the feature extraction network to obtain the feature map may include the following steps:
And step 401, inputting the target image into a backbone network for feature extraction to obtain candidate features.
The backbone network may be a Backbone network, which is used to extract candidate features of the target image. It should be noted that the candidate features may include a series of foreground information features of the target detection image, such as contour features and key point features of the detection objects.
In the embodiment of the disclosure, the target image may be input into the backbone network to extract its features, and the extracted features are used as candidate features of the target image. Illustratively, the target image is input into the backbone network to extract candidate features; during this process the spatial size of the feature map is continuously reduced and the number of feature channels is continuously increased until feature extraction is complete, at which point the size has been reduced to 34×22.
And step 402, inputting the candidate features into a fusion network for fusion to obtain a feature map.
The fusion network may be a neck network, which is configured to fuse the candidate features of the target image extracted by the backbone network and obtain the feature map from the fused features.
In the embodiment of the present disclosure, the candidate features acquired in step 401 may be input into the fusion network, where they are fused, and the fused feature image is used as the feature map. Illustratively, the candidate features are input into the neck network for feature-layer fusion, and a series of sampling and convolution operations in the neck network yield a 272×176 feature map.
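A minimal sketch of this two-stage feature extraction, assuming PyTorch: the placeholder backbone and neck below only reproduce the spatial sizes mentioned in these steps (a 1088×704 input, a 34×22 backbone output, and a 272×176 fused feature map); every layer choice and channel width is an assumption rather than the patent's actual backbone and fusion network.

import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    # Placeholder backbone: downsamples the 1088x704 image by a factor of 32 to 34x22.
    def __init__(self, out_ch=64):
        super().__init__()
        layers, ch = [], 3
        for _ in range(5):                       # five stride-2 stages -> /32
            layers += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
            ch = out_ch
        self.body = nn.Sequential(*layers)
    def forward(self, x):
        return self.body(x)                      # (N, 64, 22, 34)

class TinyNeck(nn.Module):
    # Placeholder fusion network: upsamples the /32 features back to /4 (272x176).
    def __init__(self, ch=64):
        super().__init__()
        ups = []
        for _ in range(3):                       # three x2 upsamplings -> /4
            ups += [nn.ConvTranspose2d(ch, ch, 2, stride=2), nn.ReLU(inplace=True)]
        self.up = nn.Sequential(*ups)
    def forward(self, x):
        return self.up(x)                        # (N, 64, 176, 272)

feat = TinyNeck()(TinyBackbone()(torch.randn(1, 3, 704, 1088)))
print(feat.shape)                                # torch.Size([1, 64, 176, 272])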
Step 302, detecting the feature map through a detection network to obtain category information and position information of each detection object.
The detection network may be a Detection network used to identify the category information and position information of each detection object in the target image. The detection objects may include, but are not limited to, vehicles, pedestrians, cyclists, tricycles, cones, roadblocks, and the like in the target image. The category information indicates the category to which a detection object belongs; for example, if the detection object is a vehicle, the category information is the vehicle class, and if the detection object is a roadblock, the category information is the obstacle class. The position information may be the position coordinates of the detection object on the feature map; for example, if the detection object is a vehicle and its position information on the feature map is (26, 35), the vehicle lies at abscissa 26 and ordinate 35 in a coordinate system whose origin is the bottom-left corner of the feature map and whose x-axis runs along the bottom edge of the feature map.
In the embodiment of the present disclosure, the feature map obtained in the above step 301 may be input into the detection network in the preset target detection model for detection. The detection network mainly detects the category information and position information of each detection object in the target image, and the detected category information and position information are used as the category information and position information of the detection object. For example, the detection object may be enclosed in a bounding box on the feature map, with its category information and position information displayed at the upper-right corner of the box. It should be noted that one feature map may display the category and position information of one detection object or of a plurality of detection objects.
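The disclosure does not fix a particular output encoding for the detection network, but the "target position key points" mentioned in the background analysis suggest a center-keypoint style output. Purely as an assumed illustration (the decoding scheme and all names below are hypothetical), the sketch extracts the top-scoring peaks from a per-class score map over the 272×176 feature map and scales them back to 1088×704 image coordinates (a factor of 4):

import torch
import torch.nn.functional as F

def decode_centers(score_map: torch.Tensor, k: int = 10, stride: int = 4):
    # score_map: (num_classes, 176, 272) raw scores over the feature map (assumed layout).
    hm = torch.sigmoid(score_map)
    # Keep only local maxima (3x3 peak test), a common key-point decoding step.
    pooled = F.max_pool2d(hm.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    hm = hm * (pooled == hm).float()
    scores, idx = hm.flatten().topk(k)
    c, h, w = hm.shape
    cls = torch.div(idx, h * w, rounding_mode="floor")
    rem = idx - cls * h * w
    y = torch.div(rem, w, rounding_mode="floor")
    x = rem - y * w
    # (x, y) are positions on the 272x176 feature map; multiplying by the stride of 4
    # maps them back onto the 1088x704 target image.
    return [(int(cls[i]), int(x[i]) * stride, int(y[i]) * stride, float(scores[i]))
            for i in range(k)]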
And 303, respectively carrying out attribute identification on the feature map through a plurality of attribute identification networks to obtain a plurality of attribute information of each detection object.
The attribute identification networks may be used to identify attribute information of each detection object in the target image, for example, the open or closed state of a vehicle door, whether a roadblock on the road surface is in a movable state, the angle of an obstacle on the road surface in the bird's-eye view, and the like.
Optionally, the plurality of attribute identification networks may include at least two of a door identification network, a height identification network, a barrier identification network, and an obstacle angle identification network. It should be noted that each attribute identification network includes three convolution layers, and its first convolution layer is shared with the first convolution layer of the detection network.
Here, the door recognition network is used to recognize whether the vehicle door is closed; illustratively, 0 indicates that the vehicle door is in a closed state and 1 indicates that it is in an open state. The height recognition network is used to recognize the height of the vehicle on the bird's-eye view, namely the distance from the head to the tail of the vehicle on the target image, generally expressed as a floating point number. The barrier identification network is used to identify whether a roadblock is movable, where the roadblock may be a traffic cone, a water-filled barrier (water horse), or the like on the road; illustratively, 0 indicates that the roadblock is not movable and 1 indicates that it is movable. The obstacle angle recognition network is used to recognize the angle of an obstacle on the bird's-eye view, which represents the included angle between the obstacle's position on the target image and the driver's viewing direction; this angle carries the meaning of direction in the physical world and is generally expressed as a floating point number.
In the embodiment of the present disclosure, the feature map obtained in step 301 may be input into the plurality of attribute recognition networks in the preset target detection model for attribute recognition; each attribute recognition network mainly recognizes attribute information of each detection object in the feature image. It should be noted that one attribute recognition network recognizes one kind of attribute information of each detection object in the feature map, so the plurality of attribute recognition networks together recognize a plurality of attribute information of each detection object. For example, assume that a vehicle door in the open state is denoted as 1 and a door in the closed state is denoted as 0. If the door of the vehicle on the feature map is open and its height on the bird's-eye view is 1.34 cm, then the detection object is the vehicle, the door attribute information of the vehicle is 1, and the height identification information of the vehicle is 1.34 cm.
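As an assumed illustration of how these attribute encodings could be read out (the map layout, the 0.5 thresholds, and all names below are hypothetical, not taken from the disclosure), the following sketch reads the four attribute outputs at the feature-map position (x, y) of one detected object:

import torch

def read_attributes(door_map, height_map, movable_map, angle_map, x, y):
    # Each *_map is assumed to be a (1, 176, 272) output of one attribute identification network;
    # (x, y) is the detection object's position on the 272x176 feature map.
    door_open = int(torch.sigmoid(door_map[0, y, x]) > 0.5)    # 1 = door open, 0 = door closed
    height = float(height_map[0, y, x])                        # vehicle height, floating point number
    movable = int(torch.sigmoid(movable_map[0, y, x]) > 0.5)   # 1 = roadblock movable, 0 = not movable
    angle = float(angle_map[0, y, x])                          # obstacle angle on the bird's-eye view
    return {"door": door_open, "height": height, "movable": movable, "angle": angle}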
According to the method for obtaining the detection result, the feature extraction network is used for carrying out feature extraction on the target image to obtain the feature image, the detection network is used for detecting the feature image to obtain the category information and the position information of each detection object, and the plurality of attribute recognition networks are used for respectively carrying out attribute recognition on the feature image to obtain the plurality of attribute information of each detection object. According to the embodiment of the disclosure, the feature extraction network is used for obtaining the feature map, and then the category information and the position information of each detection object are obtained according to the feature map and the detection network, so that the obtained category information and position information of each detection object are more accurate; further, the plurality of attribute information of each detection object is obtained according to the feature map and the plurality of attribute identification networks, so that the obtained plurality of attribute information of each detection object is more accurate.
In one embodiment, on the basis of the embodiment shown in fig. 6, the object detection model further includes a semantic segmentation network, as shown in fig. 8, and the method further includes:
and 304, carrying out semantic segmentation on the feature map through a semantic segmentation network to obtain semantic segmentation results of all the detection objects.
The semantic segmentation network may be a network that classifies feature images at a pixel level and outputs contour information of each category. It should be noted that the segmentation class of the semantic segmentation network in the embodiments of the present disclosure may include 26 classes, such as sky, building, small animals, birds, pedestrians, lamp posts, and the like.
The semantic segmentation may be that each pixel in the feature map is assigned a class label (such as an automobile, a building, a ground, a sky, etc.), and represented by a different color.
In the embodiment of the present disclosure, the feature map obtained in step 301 may be input into the semantic segmentation network in the preset target detection model for semantic segmentation, and the semantic segmentation result of the feature map is used as the semantic segmentation result of the detection objects. The semantic segmentation network may consist of three convolution layers: the first layer is a deconvolution (transposed convolution) with a convolution kernel of 2, the second layer is a regular convolution with a convolution kernel of 3, and the third layer is a deconvolution with a convolution kernel of 2. The 272×176 feature map output by the feature extraction network in step 301 is transformed, through the two deconvolutions of the first and third layers, into a 704×1088 semantic segmentation result, which is displayed on the feature map to obtain the semantic segmentation map.
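A minimal PyTorch-style sketch of the three-layer segmentation head just described; the channel widths are assumptions, and only the kernel sizes and the 272×176 to 704×1088 size change follow the text above:

import torch
import torch.nn as nn

class SegHead(nn.Module):
    def __init__(self, in_ch=64, mid_ch=64, num_classes=26):
        super().__init__()
        self.head = nn.Sequential(
            nn.ConvTranspose2d(in_ch, mid_ch, kernel_size=2, stride=2),       # 272x176 -> 544x352
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1),              # regular 3x3 convolution
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid_ch, num_classes, kernel_size=2, stride=2), # -> 1088x704
        )
    def forward(self, feat):
        return self.head(feat)            # (N, 26, 704, 1088): one score per class per pixel

seg = SegHead()(torch.randn(1, 64, 176, 272))
print(seg.shape)                          # torch.Size([1, 26, 704, 1088])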
According to the method for obtaining the semantic segmentation result, the semantic segmentation is carried out on the feature map through the semantic segmentation network, and the semantic segmentation result of each detection object is obtained. Since the semantic segmentation result of each detection object is obtained through the feature map, the obtained semantic segmentation result of each detection object is more accurate.
In one embodiment, on the basis of the embodiment shown in fig. 8, the detection network includes a first convolution layer and a second convolution layer, and the detection network and each attribute identification network share the first convolution layer. The process of detecting the feature map through the detection network to obtain the category information and position information of each detection object is described below. As shown in fig. 9, "detecting the feature map through the detection network to obtain category information and position information of each detection object" may include the following steps:
and step 501, inputting the feature map into a first convolution layer to perform expansion convolution, and obtaining a first convolution result output by the first convolution layer.
The convolution parameters of the first convolution layer may be parameters obtained through a number of experiments; the first convolution layer may be a dilated convolution with a dilation rate of 2 and a convolution kernel of 3. The detection network and each attribute identification network share the first convolution layer.
The process of obtaining the convolution parameters of the first convolution layer is described in detail below:
Fig. 10 is a schematic diagram of dilated convolutions with dilation rates (r) of 1, 2, and 3. It can be seen that a dilated convolution can provide a larger receptive field, without changing the number of parameters, simply by varying the dilation rate. However, stacking multiple convolution kernels with the same dilation rate may cause some pixels of the target image never to participate in the computation, so such an arrangement of convolution kernels is ineffective. To avoid the saw-tooth (gridding) effect shown in fig. 11 that arises from stacking multiple dilated convolutions, the embodiment of the disclosure experimented with several arrangements of convolution kernels; the results are recorded in table 3.
TABLE 3 Effect of various convolution kernels in different arrangements
In the embodiment of the present disclosure, based on the statistics of the results in table 3 and considering stability and performance together, the present disclosure selects the combination of a 3×3 conv with r=2 and a 3×3 conv with r=6 and embeds this combination into the structural design of the target detection model, thereby completing the detection of the target image in the embodiment of the present disclosure.
In summary, the convolution parameters of the first convolution layer are designed with a dilation rate of 2 and a convolution kernel of 3.
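As a quick arithmetic check on this choice (not part of the disclosure), the receptive field of a stack of stride-1 convolutions grows by (kernel_size - 1) × dilation per layer, so the selected pair of 3×3 convolutions with dilation rates 2 and 6 covers a 17×17 window of the feature map:

def receptive_field(layers):
    # layers: list of (kernel_size, dilation) pairs, all with stride 1.
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

print(receptive_field([(3, 2)]))          # 5  : the shared first dilated convolution alone
print(receptive_field([(3, 2), (3, 6)]))  # 17 : the r=2 then r=6 stack selected above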
In the embodiment of the present disclosure, the feature map obtained in step 301 may be input into the first convolution layer in the preset target detection model for dilated convolution, so as to obtain a dilated convolution result of the feature map, which is used as the first convolution result output by the first convolution layer.
Step 502, inputting the first convolution result into the second convolution layer to obtain category information and position information of each detection object.
Wherein the second convolution layer may be a conventional convolution with a convolution kernel of 1.
In the embodiment of the present disclosure, the first convolution result output by the first convolution layer in the above step 501 may be input into the second convolution layer to perform convolution, to obtain a convolution result of the second convolution layer of the feature map, and the convolution result may be used as the category information and the position information of each detection object.
According to the above method for obtaining the category information and position information of the detection objects, the feature map is input into the first convolution layer for dilated convolution to obtain the first convolution result output by the first convolution layer, and the first convolution result is then input into the second convolution layer to obtain the category information and position information of each detection object. Because the first convolution layer, whose parameters are obtained through experiments, performs the first dilated convolution and the second convolution layer performs a conventional convolution, and because the detection network and each attribute identification network share the first convolution layer, the category information and position information of each detection object are obtained with fewer parameters, and the reduction in convolution parameters reduces the time consumed by the convolution process.
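A minimal PyTorch-style sketch of this detection branch: a shared 3×3 convolution with dilation rate 2 followed by a 1×1 convolution. The channel widths, the number of classes, and the split of output channels into category and position maps are assumptions for illustration only:

import torch
import torch.nn as nn

num_classes = 6   # e.g. vehicle, pedestrian, cyclist, tricyclist, cone, roadblock (assumed count)
shared_conv = nn.Sequential(                       # first convolution layer, shared with attribute heads
    nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2),
    nn.ReLU(inplace=True),
)
detect_conv = nn.Conv2d(64, num_classes + 2, kernel_size=1)   # second layer: conventional 1x1 convolution

feat = torch.randn(1, 64, 176, 272)                # 272x176 feature map from the fusion network
first_result = shared_conv(feat)                   # "first convolution result", reused by attribute heads
det_out = detect_conv(first_result)                # per-class score maps plus assumed position channels
print(det_out.shape)                               # torch.Size([1, 8, 176, 272])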
In one embodiment, on the basis of the embodiment shown in fig. 9, each attribute identification network further includes a third convolution layer and a fourth convolution layer. The process of performing attribute recognition on the feature map through the plurality of attribute recognition networks to obtain the plurality of attribute information of each detection object is described below. As shown in fig. 12, "performing attribute recognition on the feature map through the plurality of attribute recognition networks to obtain a plurality of attribute information of each detection object" may include the following steps:
step 601, inputting the first convolution result into a third convolution layer to obtain a second convolution result.
The parameters of the third convolution layer may be those listed in table 3 in step 501: a dilation rate of 6 and a convolution kernel of 3.
In the embodiment of the present disclosure, the first convolution result output by the first convolution layer in the step 501 may be input to the third convolution layer to perform convolution, to obtain a convolution result of the third convolution layer of the feature map, and the convolution result is used as the second convolution result.
Step 602, inputting the second convolution result into a fourth convolution layer to obtain a plurality of attribute information of each detection object.
Wherein the fourth convolution layer may be a conventional convolution with a convolution kernel of 1.
In the embodiment of the present disclosure, the second convolution result in the above step 601 may be input into a fourth convolution layer to perform convolution, to obtain a convolution result of the fourth convolution layer of the feature map, and the convolution result is used as a plurality of attribute information of each detection object.
According to the above method for obtaining the plurality of attribute information of each detection object, the first convolution result is input into the third convolution layer to obtain the second convolution result, and the second convolution result is input into the fourth convolution layer to obtain the plurality of attribute information of each detection object. Because the third convolution layer, whose parameters are obtained through experiments, performs a dilated convolution and the fourth convolution layer performs a conventional convolution, and because the detection network and each attribute identification network share the first convolution layer, the plurality of attribute information of each detection object are obtained with fewer parameters, and the reduction in convolution parameters reduces the time consumed by the convolution process.
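Continuing the detection-branch sketch above, each attribute identification network reuses the shared first convolution result and adds its own third layer (a 3×3 convolution with dilation rate 6) and fourth layer (a 1×1 convolution); the single-channel output per attribute and all names are assumptions:

import torch
import torch.nn as nn

def attribute_head(out_ch: int) -> nn.Sequential:
    # Third layer (3x3, dilation rate 6) plus fourth layer (conventional 1x1 convolution).
    return nn.Sequential(
        nn.Conv2d(64, 64, kernel_size=3, dilation=6, padding=6),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, out_ch, kernel_size=1),
    )

heads = nn.ModuleDict({
    "door": attribute_head(1),      # whether the vehicle door is closed
    "height": attribute_head(1),    # vehicle height on the bird's-eye view
    "movable": attribute_head(1),   # whether the roadblock is movable
    "angle": attribute_head(1),     # obstacle angle on the bird's-eye view
})
first_result = torch.randn(1, 64, 176, 272)        # shared first convolution result (see detection sketch)
attr_maps = {name: head(first_result) for name, head in heads.items()}
print({name: tuple(m.shape) for name, m in attr_maps.items()})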
An embodiment of the present disclosure is described below with reference to a specific scenario. Fig. 13 is a flowchart of a target detection method provided in an embodiment of the present disclosure; as shown in fig. 13, the method may include the following steps:
step 701, acquiring a target image to be detected.
And 702, inputting the target image into a backbone network for feature extraction to obtain candidate features.
And 703, inputting the candidate features into a fusion network for fusion to obtain a feature map.
Step 704, inputting the feature map into the first convolution layer for dilated convolution, and obtaining a first convolution result output by the first convolution layer.
Step 705, inputting the first convolution result to the second convolution layer to obtain the category information and the position information of each detection object.
Step 706, inputting the first convolution result into the third convolution layer to obtain a second convolution result.
Step 707, inputting the second convolution result into the fourth convolution layer to obtain a plurality of attribute information of each detection object.
The third convolution layer and the fourth convolution layer comprise at least two of a vehicle door identification network, a height identification network, a roadblock identification network and an obstacle angle identification network; the vehicle door identification network is used for identifying whether the vehicle door is closed or not; the height recognition network is used for recognizing the height of the vehicle on the aerial view; the roadblock recognition network is used for recognizing whether the roadblock is movable or not; the obstacle angle recognition network is used for recognizing the angle of the obstacle on the aerial view.
Step 708, performing semantic segmentation on the feature map through a semantic segmentation network to obtain a semantic segmentation result of each detection object.
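To tie steps 701 to 708 together, the following compact PyTorch-style sketch assembles an end-to-end forward pass; the placeholder backbone and neck, every layer width, and the output channel counts are assumptions made only for illustration:

import torch
import torch.nn as nn

class MultiTaskDetector(nn.Module):
    # Compact end-to-end sketch of steps 701-708; all layer choices are assumptions.
    def __init__(self, ch=64, num_classes=6, seg_classes=26):
        super().__init__()
        stages, in_ch = [], 3
        for _ in range(5):                                     # step 702: backbone, /32 downsampling
            stages += [nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
            in_ch = ch
        self.backbone = nn.Sequential(*stages)
        self.neck = nn.Sequential(*[                           # step 703: fusion network, back to /4
            m for _ in range(3)
            for m in (nn.ConvTranspose2d(ch, ch, 2, stride=2), nn.ReLU(inplace=True))])
        self.shared = nn.Sequential(                           # step 704: shared dilated convolution
            nn.Conv2d(ch, ch, 3, dilation=2, padding=2), nn.ReLU(inplace=True))
        self.detect = nn.Conv2d(ch, num_classes + 2, 1)        # step 705: category + position maps
        self.attr = nn.ModuleDict({name: nn.Sequential(        # steps 706-707: attribute heads
            nn.Conv2d(ch, ch, 3, dilation=6, padding=6), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 1)) for name in ("door", "height", "movable", "angle")})
        self.seg = nn.Sequential(                              # step 708: semantic segmentation head
            nn.ConvTranspose2d(ch, ch, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, seg_classes, 2, stride=2))

    def forward(self, image):
        feat = self.neck(self.backbone(image))                 # 272x176 feature map
        shared = self.shared(feat)                             # shared first convolution result
        out = {"detection": self.detect(shared), "segmentation": self.seg(feat)}
        for name, head in self.attr.items():
            out[name] = head(shared)
        return out

outputs = MultiTaskDetector()(torch.randn(1, 3, 704, 1088))    # step 701: the target image
print({k: tuple(v.shape) for k, v in outputs.items()})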
According to the target detection method provided by the embodiment of the present disclosure, the target image to be detected is acquired and detected through the pre-trained target detection model to obtain the detection result. Because the existing detection model suffers from high time consumption caused by parameter stacking, the embodiment of the present disclosure redesigns the target detection model at the level of the model structure: the target detection model has no step of combining parallel attribute-head branches, and the detection network and each attribute identification network share one convolution layer, so the whole model is free of parameter stacking and has fewer parameters, which reduces the time consumption caused by parameter stacking. In addition, in the existing detection model it is difficult to directionally optimize or prune a single obstacle attribute, whereas the target detection model decouples the obstacle attributes, which makes it easy to add, delete, or directionally optimize a single obstacle attribute without affecting the other obstacle attribute tasks. Further, the target image is detected by the pre-trained target detection model to obtain the category information, position information, and attribute information of at least one detection object in the target image, and this information is aggregated to obtain the detection result.
It should be understood that, although the steps in the flowcharts of FIGS. 1-13 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-13 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; these sub-steps or stages are likewise not necessarily performed in sequence, and they may be performed in turn or alternately with at least a part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 14, there is provided an object detection apparatus including: an acquisition module 801, a detection module 802, wherein:
an acquiring module 801, configured to acquire a target image to be detected.
The detection module 802 is configured to detect the target image through a pre-trained target detection model, so as to obtain a detection result. The target detection model comprises a detection network and a plurality of attribute identification networks, wherein the detection network and each attribute identification network share the same convolution layer; the detection network is used for identifying the category information and the position information of at least one detection object in the target image; different attribute identification networks are used to identify different attributes of the detection object.
In one embodiment, the object detection model further includes a feature extraction network, and the detection module 802 includes: extraction unit, detection unit and recognition element, wherein:
the extraction unit is specifically used for extracting the characteristics of the target image through the characteristic extraction network to obtain a characteristic image;
the detection unit is specifically used for detecting the feature map through a detection network to obtain category information and position information of each detection object;
the identification unit is specifically configured to perform attribute identification on the feature map through a plurality of attribute identification networks, so as to obtain a plurality of attribute information of each detection object.
In one embodiment, the object detection model further includes a semantic segmentation network, and the detection module 802 further includes: the segmentation unit is specifically used for carrying out semantic segmentation on the feature map through a semantic segmentation network to obtain semantic segmentation results of each detection object.
In one embodiment, the detection network includes a first convolution layer and a second convolution layer, the detection network and each attribute identification network share the first convolution layer, and the detection unit is specifically configured to input the feature map into the first convolution layer for dilated convolution, so as to obtain a first convolution result output by the first convolution layer; and to input the first convolution result into the second convolution layer to obtain the category information and position information of each detection object.
In one embodiment, each attribute identification network further includes a third convolution layer and a fourth convolution layer, and the identification unit is specifically configured to input the first convolution result into the third convolution layer to obtain a second convolution result; and inputting the second convolution result into a fourth convolution layer to obtain a plurality of attribute information of each detection object.
In one embodiment, the feature extraction network includes a backbone network and a fusion network, and the extraction unit is specifically configured to input a target image into the backbone network to perform feature extraction, so as to obtain candidate features; and inputting the candidate features into a fusion network for fusion to obtain a feature map.
In one embodiment, the plurality of attribute identification networks includes at least two of a door identification network, a height identification network, a barrier identification network, and an obstacle angle identification network; the vehicle door identification network is used for identifying whether the vehicle door is closed or not; the height recognition network is used for recognizing the height of the vehicle on the aerial view; the roadblock identification network is used for identifying whether the roadblock is movable or not; the obstacle angle recognition network is used to recognize the angle of the obstacle on the bird's eye view.
For specific limitations of the object detection device, reference may be made to the above limitations of the object detection method, and no further description is given here. The respective modules in the above-described object detection apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the electronic device, or may be stored in software in a memory in the electronic device, so that the processor may call and execute operations corresponding to the above modules.
Fig. 15 is a block diagram of an electronic device 1300, according to an example embodiment. For example, the electronic device 1300 may be a mobile phone, an in-vehicle central control, a digital broadcast terminal, a messaging device, a tablet device, a personal digital assistant, or the like.
Referring to fig. 15, an electronic device 1300 may include one or more of the following components: a processing component 1302, a memory 1304, a power component 1306, a multimedia component 1308, an audio component 1310, an input/output (I/O) interface 1312, a sensor component 1314, and a communication component 1316. Wherein the memory has stored thereon a computer program or instructions that run on the processor.
The processing component 1302 generally controls overall operation of the electronic device 1300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1302 may include one or more processors 1320 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 1302 can include one or more modules that facilitate interactions between the processing component 1302 and other components. For example, the processing component 1302 may include a multimedia module to facilitate interaction between the multimedia component 1308 and the processing component 1302.
The memory 1304 is configured to store various types of data to support operations at the electronic device 1300. Examples of such data include instructions for any application or method operating on the electronic device 1300, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1304 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply assembly 1306 provides power to the various components of the electronic device 1300. The power components 1306 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 1300.
The multimedia component 1308 includes a touch-sensitive display screen that provides an output interface between the electronic device 1300 and a user. In some embodiments, the touch-sensitive display screen may include a liquid crystal display (LCD) and a touch panel (TP). The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 1308 includes a front-facing camera and/or a rear-facing camera. When the electronic device 1300 is in an operation mode, such as a shooting mode or a video mode, the front-facing camera and/or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 1310 is configured to output and/or input audio signals. For example, the audio component 1310 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 1300 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 1304 or transmitted via the communication component 1316. In some embodiments, the audio component 1310 also includes a speaker for outputting audio signals.
The I/O interface 1312 provides an interface between the processing component 1302 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1314 includes one or more sensors for providing status assessments of various aspects of the electronic device 1300. For example, the sensor assembly 1314 may detect an on/off state of the electronic device 1300 and the relative positioning of components, such as the display and keypad of the electronic device 1300. The sensor assembly 1314 may also detect a change in position of the electronic device 1300 or of a component of the electronic device 1300, the presence or absence of user contact with the electronic device 1300, the orientation or acceleration/deceleration of the electronic device 1300, and a change in temperature of the electronic device 1300. The sensor assembly 1314 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 1314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1316 is configured to facilitate wired or wireless communication between the electronic device 1300 and other devices. The electronic device 1300 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 1316 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1316 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the above-described object detection methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as the memory 1304, including instructions executable by the processor 1320 of the electronic device 1300 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided which, when executed by a processor, implements the above-described method. The computer program product includes one or more computer instructions. When these computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present disclosure are implemented in whole or in part.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when the program is executed, it may include the steps of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided by the present disclosure may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above embodiments merely represent a few implementations of the present disclosure, and although they are described in detail, they are not to be construed as limiting the scope of the invention. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the spirit of the disclosed embodiments, and such modifications and improvements fall within the protection scope of the disclosed embodiments. Accordingly, the protection scope of the disclosed embodiments shall be subject to the appended claims.

Claims (8)

1. A method of target detection, the method comprising:
acquiring a target image to be detected;
performing feature extraction on the target image through a feature extraction network to obtain a feature map;
detecting the feature map through a detection network to obtain category information and position information of each detection object;
respectively carrying out attribute identification on the feature map through a plurality of attribute identification networks to obtain a plurality of attribute information of each detection object; different attribute identification networks are used for identifying different attributes of the detection object;
wherein the detection network comprises a first convolution layer and a second convolution layer; the detection network and each attribute identification network share the first convolution layer; the step of detecting the feature map through a detection network to obtain category information and position information of each detection object comprises the following steps:
inputting the feature map into the first convolution layer to perform dilated convolution to obtain a first convolution result output by the first convolution layer;
and inputting the first convolution result into the second convolution layer to obtain category information and position information of each detection object.
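As a reading aid for claim 1, the sketch below shows one plausible arrangement of the shared first convolution layer (a dilated convolution) and the second convolution layer that yields category and position information. The dilation rate, channel counts, number of classes, and box parameterization are assumptions rather than limitations of the claim.

```python
import torch
import torch.nn as nn


class DetectionBranch(nn.Module):
    """Sketch of claim 1: a shared dilated first convolution layer followed by a
    second convolution layer that outputs per-location class scores and box offsets."""

    def __init__(self, in_channels: int = 256, num_classes: int = 10, box_params: int = 4):
        super().__init__()
        # First convolution layer: dilated convolution, shared with the attribute heads.
        self.first_conv = nn.Conv2d(in_channels, in_channels,
                                    kernel_size=3, padding=2, dilation=2)
        # Second convolution layer: maps the shared result to class + position channels.
        self.second_conv = nn.Conv2d(in_channels, num_classes + box_params, kernel_size=1)
        self.num_classes = num_classes

    def forward(self, feature_map: torch.Tensor):
        first_result = torch.relu(self.first_conv(feature_map))  # first convolution result
        out = self.second_conv(first_result)
        class_info = out[:, : self.num_classes]                  # category information
        position_info = out[:, self.num_classes:]                # position information
        return first_result, class_info, position_info


if __name__ == "__main__":
    branch = DetectionBranch()
    first_result, cls, pos = branch(torch.randn(1, 256, 64, 64))
    print(first_result.shape, cls.shape, pos.shape)
```

The first convolution result is returned alongside the detection outputs so that the attribute branches can reuse it without recomputing the dilated convolution.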
2. The method according to claim 1, wherein the method further comprises:
and carrying out semantic segmentation on the feature map through a semantic segmentation network to obtain semantic segmentation results of the detection objects.
3. The method of claim 1, wherein each of the attribute identification networks further comprises a third convolution layer and a fourth convolution layer; the step of respectively carrying out attribute identification on the feature map through the plurality of attribute identification networks to obtain a plurality of attribute information of each detection object comprises:
inputting the first convolution result into the third convolution layer to obtain a second convolution result;
and inputting the second convolution result into the fourth convolution layer to obtain a plurality of attribute information of each detection object.
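Continuing the sketch above, the attribute branch of claim 3 can be read as a third and a fourth convolution layer applied to the first convolution result that the detection branch already produced, so the dilated convolution is computed only once. The layer widths and the activation used below are assumptions.

```python
import torch
import torch.nn as nn


class AttributeBranch(nn.Module):
    """Sketch of claim 3: a third and a fourth convolution layer that reuse the
    first convolution result shared with the detection network."""

    def __init__(self, in_channels: int = 256, num_attribute_outputs: int = 1):
        super().__init__()
        self.third_conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.fourth_conv = nn.Conv2d(in_channels, num_attribute_outputs, kernel_size=1)

    def forward(self, first_conv_result: torch.Tensor) -> torch.Tensor:
        second_result = torch.relu(self.third_conv(first_conv_result))  # second convolution result
        return self.fourth_conv(second_result)                          # attribute information


if __name__ == "__main__":
    # In the full model, first_conv_result would come from the shared dilated
    # convolution of the detection branch; here it is a random stand-in.
    branch = AttributeBranch(num_attribute_outputs=1)
    attributes = branch(torch.randn(1, 256, 64, 64))
    print(attributes.shape)
```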
4. The method according to claim 1, wherein the feature extraction network includes a backbone network and a fusion network, and the performing feature extraction on the target image through the feature extraction network to obtain a feature map comprises:
inputting the target image into the backbone network for feature extraction to obtain candidate features;
and inputting the candidate features into the fusion network for fusion to obtain the feature map.
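Claim 4 only requires that a backbone network produce candidate features and that a fusion network merge them into the feature map. The sketch below uses a toy two-stage backbone and an FPN-style top-down fusion purely as an illustration; the specific backbone depth, number of scales, and fusion rule are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BackboneAndFusion(nn.Module):
    """Sketch of claim 4: a backbone network yields multi-scale candidate features,
    and a fusion network merges them into the feature map used downstream."""

    def __init__(self, out_channels: int = 256):
        super().__init__()
        # Toy backbone producing two candidate feature levels (an assumption).
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Fusion network: lateral 1x1 convolutions plus top-down addition (FPN-style assumption).
        self.lateral1 = nn.Conv2d(64, out_channels, kernel_size=1)
        self.lateral2 = nn.Conv2d(128, out_channels, kernel_size=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        c1 = self.stage1(image)   # candidate features, higher resolution
        c2 = self.stage2(c1)      # candidate features, lower resolution
        p2 = self.lateral2(c2)
        p1 = self.lateral1(c1) + F.interpolate(p2, size=c1.shape[-2:], mode="nearest")
        return p1                 # fused feature map fed to the detection and attribute branches


if __name__ == "__main__":
    model = BackboneAndFusion()
    feature_map = model(torch.randn(1, 3, 256, 256))
    print(feature_map.shape)  # torch.Size([1, 256, 128, 128])
```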
5. The method of any of claims 1-4, wherein the plurality of attribute identification networks includes at least two of a vehicle door identification network, a height identification network, a barrier identification network, and an obstacle angle identification network;
the vehicle door identification network is used for identifying whether a vehicle door is closed;
the height identification network is used for identifying the height of a vehicle in the bird's eye view;
the barrier identification network is used for identifying whether a barrier is movable;
the obstacle angle identification network is used for identifying the angle of an obstacle in the bird's eye view.
6. An object detection device, the device comprising:
an acquisition module, configured to acquire a target image to be detected;
a detection module, configured to perform feature extraction on the target image through a feature extraction network to obtain a feature map; detect the feature map through a detection network to obtain category information and position information of each detection object; and respectively perform attribute identification on the feature map through a plurality of attribute identification networks to obtain a plurality of attribute information of each detection object; wherein different attribute identification networks are used for identifying different attributes of the detection object;
the detection network comprises a first convolution layer and a second convolution layer; the detection network and each attribute identification network share the first convolution layer; the detection module is configured to input the feature map into the first convolution layer to perform dilated convolution to obtain a first convolution result output by the first convolution layer, and to input the first convolution result into the second convolution layer to obtain the category information and the position information of each detection object.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
CN202211675878.6A 2022-12-26 2022-12-26 Object detection method, device, electronic apparatus, storage medium, and program product Active CN115965935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211675878.6A CN115965935B (en) 2022-12-26 2022-12-26 Object detection method, device, electronic apparatus, storage medium, and program product

Publications (2)

Publication Number Publication Date
CN115965935A (en) 2023-04-14
CN115965935B (en) 2023-09-12

Family

ID=87354447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211675878.6A Active CN115965935B (en) 2022-12-26 2022-12-26 Object detection method, device, electronic apparatus, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN115965935B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563835B (en) * 2023-05-11 2024-01-26 梅卡曼德(北京)机器人科技有限公司 Transfer method, transfer device and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222604A (en) * 2019-05-23 2019-09-10 复钧智能科技(苏州)有限公司 Target identification method and device based on shared convolutional neural networks
CN111507226A (en) * 2020-04-10 2020-08-07 北京觉非科技有限公司 Road image recognition model modeling method, image recognition method and electronic equipment
CN113344012A (en) * 2021-07-14 2021-09-03 马上消费金融股份有限公司 Article identification method, device and equipment
CN114359787A (en) * 2021-12-08 2022-04-15 深圳云天励飞技术股份有限公司 Target attribute identification method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Vehicle appearance recognition with a shared lightweight convolutional neural network; Kang Qing et al.; Laser & Optoelectronics Progress; Vol. 58, No. 2; pp. 1-10 *

Also Published As

Publication number Publication date
CN115965935A (en) 2023-04-14

Similar Documents

Publication Publication Date Title
CN111178253B (en) Visual perception method and device for automatic driving, computer equipment and storage medium
US20210103742A1 (en) Spatiotemporal relationship reasoning for pedestrian intent prediction
US20210350145A1 (en) Object recognition method of autonomous driving device, and autonomous driving device
Cai et al. Saliency-based pedestrian detection in far infrared images
US11288507B2 (en) Object detection in image based on stochastic optimization
CN107944351B (en) Image recognition method, image recognition device and computer-readable storage medium
US11527077B2 (en) Advanced driver assist system, method of calibrating the same, and method of detecting object in the same
CN112200129B (en) Three-dimensional target detection method and device based on deep learning and terminal equipment
CN106971185B (en) License plate positioning method and device based on full convolution network
CN110544217A (en) image processing method and device, electronic equipment and storage medium
CN116824533A (en) Remote small target point cloud data characteristic enhancement method based on attention mechanism
CN115965935B (en) Object detection method, device, electronic apparatus, storage medium, and program product
WO2023072093A1 (en) Virtual parking space determination method, display method and apparatus, device, medium, and program
Liu et al. Vehicle detection and ranging using two different focal length cameras
CN115641518A (en) View sensing network model for unmanned aerial vehicle and target detection method
CN113815627A (en) Method and system for determining a command of a vehicle occupant
CN112669615A (en) Parking space detection method and system based on camera
CN111709377A (en) Feature extraction method, target re-identification method and device and electronic equipment
CN114677517A (en) Semantic segmentation network model for unmanned aerial vehicle and image segmentation identification method
Hasan Yusuf et al. Real-time car parking detection with deep learning in different lighting scenarios
Song et al. Vision-based parking space detection: A mask R-CNN approach
CN108873097B (en) Safety detection method and device for parking of vehicle carrying plate in unmanned parking garage
CN115223143A (en) Image processing method, apparatus, device, and medium for automatically driving vehicle
CN108256444B (en) Target detection method for vehicle-mounted vision system
CN116206363A (en) Behavior recognition method, apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant