CN115170903A - Vehicle scene image processing method and system and electronic equipment - Google Patents

Vehicle scene image processing method and system and electronic equipment

Info

Publication number
CN115170903A
CN115170903A
Authority
CN
China
Prior art keywords
prediction
information
loss function
target
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210720991.5A
Other languages
Chinese (zh)
Inventor
胡启昶
李发成
陈宇
张如高
虞正华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Magic Vision Intelligent Technology Co ltd
Original Assignee
Shenzhen Magic Vision Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Magic Vision Intelligent Technology Co ltd filed Critical Shenzhen Magic Vision Intelligent Technology Co ltd
Priority to CN202210720991.5A priority Critical patent/CN115170903A/en
Publication of CN115170903A publication Critical patent/CN115170903A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a vehicle scene image processing method, a vehicle scene image processing system and electronic equipment, which can be used for carrying out efficient and accurate three-dimensional detection and identification on a vehicle target. The method comprises the following steps: acquiring a basic scene image, and generating training sample data according to the basic scene image; performing feature extraction and target prediction on training sample data by using a full convolution single-stage neural network to generate a prediction target and corresponding prediction information; and calculating and determining a network loss function according to the prediction information, and optimizing the full convolution single-stage neural network according to the network loss function. The system comprises a training sample preparation unit, a sample feature extraction unit, a target information prediction unit and a neural network optimization unit. The electronic device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor to implement the vehicle scene image processing method.

Description

Vehicle scene image processing method and system and electronic equipment
Technical Field
The application relates to the technical field of image recognition, in particular to a vehicle scene image processing method and system and electronic equipment.
Background
With the continuous development of artificial intelligence technology, related automatic driving and driver-assistance systems have also been widely developed and applied. In automatic driving and driver-assistance systems, the most critical component is the perception system, which uses various vehicle-mounted sensing devices to recognize and understand the environment around the vehicle. In a vision-based perception system, vehicle targets around and in front of the vehicle need to be accurately and quickly identified and located from scene images, providing the data basis for path planning.
In some related technologies, a two-stage neural network detection framework is adopted: a region of interest is detected in the image, a vehicle target is extracted within that region, a key point detection model is then applied to the extracted vehicle target to predict its 3D key points, and a post-processing algorithm computes and completes the missing points. Such an algorithm is highly complex; in multi-target complex scenes its computational load and computation time increase sharply, and its timeliness can hardly meet the deployment requirements of real-time computation.
Disclosure of Invention
In view of this, embodiments of the present application provide a vehicle scene image processing method and system, and an electronic device, so as to solve the problem of poor target-recognition timeliness of the perception system in multi-target scenes.
In a first aspect, an embodiment of the present application provides a vehicle scene image processing method, where the method includes:
acquiring a basic scene image, determining three-dimensional information of a vehicle target in the basic scene image, and combining the basic scene image and the three-dimensional information to generate training sample data;
taking the training sample data as the input of a feature extraction network in a full convolution single-stage neural network, and performing feature extraction on the training sample data by using the feature extraction network to generate a feature image;
taking the characteristic image as the input of a prediction network in the full-convolution single-stage neural network, performing target prediction according to the characteristic image by using the prediction network, determining a prediction target and outputting corresponding prediction information, wherein the prediction information comprises prediction category information and prediction three-dimensional information of the prediction target;
and comparing the prediction information with the training sample data, calculating and determining a network loss function of the full convolution single-stage neural network, and optimizing the full convolution single-stage neural network according to the network loss function.
Optionally, the training sample data further includes regression box information of the vehicle target;
the prediction network is provided with a category output channel, a three-dimensional output channel, a regression output channel and a central output channel;
the predicting the target according to the characteristic image by using the prediction network, determining the prediction target and outputting corresponding prediction information comprises the following steps:
outputting the prediction category information of the prediction target using the category output channel;
outputting the predicted three-dimensional information of the predicted target using the three-dimensional output channel;
outputting the prediction regression frame information of the prediction target by using the regression output channel;
outputting predicted centrality information of the predicted target by using the central output channel;
the prediction information further includes the prediction regression box information and the prediction centrality information.
Optionally, the predicted three-dimensional information includes relative distances between a plurality of predicted key points and corresponding feature points of the vehicle target in the feature image.
Optionally, the comparing the prediction information with the training sample data to calculate and determine a network loss function of the fully-convolutional single-stage neural network includes:
calculating and determining corresponding class loss functions, three-dimensional loss functions, regression loss functions and center loss functions according to the training sample data aiming at the predicted class information, the predicted three-dimensional information, the predicted regression frame information and the predicted centrality information;
and calculating and determining the network loss function by combining the class loss function, the center loss function, the three-dimensional loss function and the regression loss function.
Optionally, the determining, according to the training sample data, a category loss function, a three-dimensional loss function, a regression loss function, and a center loss function corresponding to the prediction category information, the prediction three-dimensional information, the prediction regression frame information, and the prediction centrality information by calculation respectively includes:
calculating the class loss function corresponding to the prediction class information using a focal loss function;
calculating the three-dimensional loss function and the regression loss function respectively corresponding to the predicted three-dimensional information and the predicted regression frame information by using a smooth distance loss function;
calculating the central loss function corresponding to the predicted centrality information using a two-class cross entropy loss function.
Optionally, the optimizing the full convolution single-stage neural network according to the network loss function includes:
and performing iterative optimization and adjustment on the network parameters of the full convolution single-stage neural network by means of gradient back-propagation according to the network loss function.
Optionally, the method further includes:
acquiring a scene image to be detected, and inputting the scene image to be detected into the optimized full convolution single-stage neural network;
determining a plurality of candidate targets and corresponding prediction information according to the output of the full convolution single-stage neural network;
and screening and filtering the candidate targets according to the prediction information, and determining a plurality of vehicle targets and corresponding three-dimensional information in the scene image to be detected.
Optionally, the filtering the candidate targets according to the prediction information includes:
carrying out category judgment on the candidate targets according to the predicted category information, and filtering out non-vehicle category targets in the candidate targets;
calculating and determining category confidence degrees of the candidate targets according to the prediction category information and the prediction centrality information;
and performing deduplication filtering on the candidate targets by adopting a non-maximum suppression algorithm based on the category confidence.
In a second aspect, an embodiment of the present application further provides a vehicle scene image processing system, where the system includes:
the training sample preparation unit is used for acquiring a basic scene image, determining three-dimensional information of a vehicle target in the basic scene image, and combining the basic scene image and the three-dimensional information to generate training sample data;
the sample feature extraction unit is used for taking the training sample data as the input of a feature extraction network in a full convolution single-stage neural network, and performing feature extraction on the training sample data by using the feature extraction network to generate a feature image;
the target information prediction unit is used for taking the characteristic image as the input of a prediction network in the full-convolution single-stage neural network, performing target prediction according to the characteristic image by using the prediction network, determining a prediction target and outputting corresponding prediction information, wherein the prediction information comprises prediction type information and prediction three-dimensional information of the prediction target; and
and the neural network optimization unit is used for comparing the prediction information with the training sample data, calculating and determining a network loss function of the full convolution single-stage neural network, and optimizing the full convolution single-stage neural network according to the network loss function.
The system is used for executing the vehicle scene image processing method according to the first aspect.
In a third aspect, an embodiment of the present application further provides a vehicle scene image processing electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the vehicle scene image processing method according to the first aspect is implemented.
As can be seen from the above, the method, the system and the electronic device for processing the vehicle scene image provided by the embodiment of the application have the following beneficial technical effects:
(1) The method comprises the steps of adopting a full convolution single-stage neural network as a detection framework, taking a basic scene image and corresponding three-dimensional information as input data in a model training stage, correspondingly setting a prediction target and corresponding predicted three-dimensional information as output, and carrying out training optimization on the full convolution single-stage neural network. The neural network model is trained in an end-to-end mode to directly output the three-dimensional information of the vehicle target, so that the suboptimal problem can be avoided, and the effective stable optimization of the neural network model is realized. Therefore, in the target identification stage, when the optimized full convolution single-stage neural network is used for target identification, the accuracy of the vehicle target identification result is ensured. And in the target identification stage, when a vehicle target is identified, only a two-dimensional vehicle scene image is required to be used as input data of the full convolution single-stage neural network, and the full convolution single-stage neural network can directly output three-dimensional information of a predicted target. The method saves the intermediate process and post-processing step of converting two-dimensional information into three-dimensional information, can greatly simplify the complexity of the method, can still ensure the execution efficiency of the method when facing a multi-target complex scene, and meets the requirement of timeliness.
(2) The training sample data further comprises regression box information of the vehicle target, and the prediction information output correspondingly to the training sample data further comprises prediction regression box information of the prediction target. By adding regression box information to training sample data, the input information amount of the feature extraction network can be increased, and the generated feature information contained in the feature image is richer. When the prediction network carries out prediction based on the characteristic image, the prediction result is more accurate. And adding the prediction regression box information in the prediction information output by the prediction network, wherein the prediction regression box information corresponds to the regression box information in the training sample data, and the comparison result of the prediction regression box information and the regression box information can be used as another index for measuring the optimization effect of the full-convolution single-stage neural network. Therefore, by adding the prediction regression frame information in the prediction information, the training optimization effect of the full convolution single-stage neural network can be improved, and the target recognition result of the full convolution single-stage neural network is more accurate.
In addition, the prediction centrality information is additionally arranged in the prediction information, and the prediction centrality information can effectively inhibit the occurrence of low-quality prediction targets, ensure the target prediction effect of the prediction network and improve the target identification performance of the full convolution single-stage neural network. The prediction centrality information can also be used as another index for measuring the optimization effect of the full convolution single-stage neural network, so that the training optimization effect of the full convolution single-stage neural network is further improved.
(3) The prediction network takes the relative distances between a plurality of predicted key points of the vehicle target and the corresponding feature points in the feature image as the predicted three-dimensional information; the coordinate information of the predicted key points can then be determined by adding the coordinates of the corresponding feature points of the vehicle target, instead of directly outputting the coordinate data of the predicted key points. By outputting the relative distances from the predicted key points to the feature points, the prediction network avoids directly outputting their absolute coordinates. In this way, effective training optimization can be carried out, and the determined coordinates of the predicted key points are ensured to be more stable.
(4) The network parameters of the full convolution single-stage neural network are iteratively optimized and adjusted by gradient back-propagation according to the network loss function. When the network loss function is calculated, a suitable loss function is selected for each of the prediction category information, the predicted three-dimensional information, the prediction regression box information and the prediction centrality information, and the network loss function of the whole full convolution single-stage neural network is then computed from these. Iteratively optimizing the network parameters of the full convolution single-stage neural network in this way can improve iteration efficiency while ensuring the target-recognition accuracy of the optimized network.
(5) Processing a scene image to be detected by using the optimized full convolution single-stage neural network, determining a plurality of candidate targets in the scene image to be detected, further screening and filtering the plurality of candidate targets according to corresponding prediction information, ensuring the accuracy of target types, executing de-duplication operation, and finally accurately determining a plurality of mutually independent vehicle targets and corresponding three-dimensional information.
Drawings
The features and advantages of the present application will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the present application in any way, and in which:
FIG. 1 is a schematic diagram illustrating a method for processing a vehicle scene image according to one or more embodiments of the present disclosure;
FIG. 2 is a schematic diagram illustrating a plurality of three-dimensional key points of a vehicle object in a vehicle scene image processing method according to one or more alternative embodiments of the present application;
FIG. 3 is a schematic diagram illustrating a method for determining a prediction target and outputting corresponding prediction information in a vehicle scene image processing method according to one or more alternative embodiments of the present application;
FIG. 4 is a schematic diagram illustrating a method for computationally determining a network loss function in a vehicle scene image processing method according to one or more alternative embodiments of the present application;
fig. 5 is a schematic diagram illustrating a method for performing target detection and identification on a scene image to be detected in a vehicle scene image processing method according to one or more optional embodiments of the present application;
fig. 6 is a schematic diagram illustrating a method for filtering and screening a plurality of candidate targets in a vehicle scene image processing method according to one or more optional embodiments of the present application;
FIG. 7 is a schematic diagram illustrating a vehicle scene image processing system according to one or more alternative embodiments of the present application;
fig. 8 is a schematic structural diagram of an electronic device for processing a vehicle scene image according to one or more alternative embodiments of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
With the continuous development of artificial intelligence technology, the related automatic driving and driving assistance system has also been widely developed and applied. In the automatic driving and driving-assistant system, the most key component is a sensing system, which utilizes various vehicle-mounted sensing devices to recognize and understand the surrounding environment of the vehicle. In a vision-based perception system, vehicle targets around and in front of a vehicle need to be accurately and quickly identified and positioned according to scene images, a safe and feasible driving route is planned, and collision with other vehicles is avoided, so that how to accurately identify and detect the vehicle targets in three dimensions in real time is an important problem in an automatic driving and auxiliary driving system.
In the current methods for detecting vehicle targets with vehicle-mounted equipment, the method based on a two-dimensional forward-facing monocular camera has the advantages of low cost and rich information, and has become the most widely applied approach. In some related technologies, a two-stage detection framework is adopted: a region of interest of the vehicle target is detected in the image, the vehicle target is extracted within that region, a key point detection model is applied to the extracted target to predict its three-dimensional key points, and finally a post-processing algorithm computes and completes the missing points. Such an algorithm is highly complex; in multi-target complex scenes its computational load and computation time increase sharply, and its timeliness can hardly meet the deployment requirements of real-time computation. Other related technologies are based on lidar point cloud data; such methods need to preprocess the point cloud data and extract target vehicle information to realize 3D vehicle detection, so the point cloud data collected by the lidar must be manually labeled, which is costly and time-consuming. These methods have high execution cost, heavy computation and poor practicability.
In order to solve the problems, the technical scheme of the application provides a vehicle scene image processing method, a system and electronic equipment, wherein a full convolution single-stage neural network is used as a detection model, a two-dimensional vehicle scene image and corresponding three-dimensional information are used as input in a model training optimization stage, and predicted three-dimensional information of a vehicle target is used as output; in the target identification stage, only a two-dimensional vehicle scene image is required to be used as input data of the full convolution single-stage neural network, and three-dimensional information of a predicted target can be directly output. The method is more direct, has lower hardware cost, can realize accurate and efficient recognition of the vehicle target, and is beneficial to wide deployment and application.
The technical solution of the present application is described below with reference to specific examples.
In a first aspect, an embodiment of the present application provides a vehicle scene image processing method.
As shown in fig. 1, a method for processing a vehicle scene image according to one or more optional embodiments of the present application includes:
s1: acquiring a basic scene image, determining three-dimensional information of a vehicle target in the basic scene image, and combining the basic scene image and the three-dimensional information to generate training sample data.
The basic scene image can be an image of the environment scene in any direction and at any angle around the vehicle; images captured by vehicle-mounted camera devices at different positions on the vehicle can be collected as the basic scene image, which is a two-dimensional plane image. Taking the scene in front of the vehicle as an example, the image captured by the vehicle-mounted front monocular camera is collected as the basic scene image. Generally, a vehicle-mounted camera device shoots high-frame-rate video, and a plurality of images containing a vehicle target may be periodically selected from the video stream data at a certain frame interval as the basic scene image, as sketched below. Such a way balances the total amount of data against data diversity.
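For illustration only, the following minimal sketch shows periodic frame sampling at a fixed interval; the OpenCV usage, video file name and interval value are assumptions, not part of the disclosed method.

```python
import cv2  # assumption: the onboard video stream is readable by OpenCV


def sample_frames(video_path, frame_interval=30):
    """Periodically select frames from an onboard video stream as base scene images."""
    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % frame_interval == 0:   # keep one frame every `frame_interval` frames
            frames.append(frame)
        index += 1
    capture.release()
    return frames


# base_images = sample_frames("front_camera.mp4", frame_interval=30)
```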
For each basic scene image, a key point labeling method can be adopted to mark a plurality of three-dimensional key points (V_1, V_2, V_3, …, V_n) of the vehicle target in the two-dimensional image as the three-dimensional information. As shown in FIG. 2, in some alternative embodiments, the eight vertices of the cubic space corresponding to the vehicle target may be selected as its eight three-dimensional key points (V_1, V_2, V_3, V_4, …, V_8). The cubic space encloses the vehicle target, several faces of the cubic space are tangent to the side faces of the vehicle target, and the length, width and height of the vehicle target are the length, width and height of the cubic space. Optionally, the midpoints of the edges of the cubic space may additionally be selected as three-dimensional key points.
In some alternative embodiments, a plurality of side faces defining the vehicle target may be identified, and for each side face a plurality of points may be chosen at equal intervals on its contour edge as the three-dimensional key points. The three-dimensional key points of the plurality of side faces are then taken as the three-dimensional information.
S2: and taking the training sample data as the input of a feature extraction network in a full convolution single-stage neural network, and performing feature extraction on the training sample data by using the feature extraction network to generate a feature image.
The training sample data comprises the basic scene image and the corresponding three-dimensional information, and the three-dimensional information is input into the feature extraction network (backbone) together with the image as its label. Feature extraction is performed by a feed-forward pass through the feature extraction network to generate the corresponding feature image.
The feature extraction network backbone can be a ResNet, MobileNet or ShuffleNet network, among others. The backbone provides a plurality of feature extraction channels, through which the feature image F is finally output; the feature image contains H × W feature points, where H and W respectively represent the height and width of the feature image.
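The following is a minimal sketch of such a backbone, assuming a torchvision ResNet-18 with its classification head removed; the input resolution and library choice are illustrative assumptions.

```python
import torch
import torchvision

# Backbone: any fully convolutional network (ResNet / MobileNet / ShuffleNet) fits here.
resnet = torchvision.models.resnet18(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc

image = torch.randn(1, 3, 384, 640)      # one base scene image (N, C, H_in, W_in)
feature_image = backbone(image)          # (1, 512, H, W): an H x W grid of feature points
print(feature_image.shape)               # e.g. torch.Size([1, 512, 12, 20])
```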
S3: and taking the characteristic image as the input of a prediction network in the full-convolution single-stage neural network, performing target prediction according to the characteristic image by using the prediction network, determining a prediction target and outputting corresponding prediction information, wherein the prediction information comprises prediction type information and prediction three-dimensional information of the prediction target.
The prediction network (head) performs target prediction based on a plurality of feature points on the feature image, determines whether a prediction target exists at each of the H × W feature point positions, and correspondingly outputs the predicted three-dimensional information of the prediction target. The predicted three-dimensional information may include a plurality of predicted three-dimensional key points corresponding to the three-dimensional information in the training sample data.
S4: and comparing the prediction information with the training sample data, calculating and determining a network loss function of the full convolution single-stage neural network, and optimizing the full convolution single-stage neural network according to the network loss function.
And comparing the prediction information with the training sample data, and calculating and determining the network loss function of the full convolution single-stage neural network according to the difference between the prediction information and the training sample data, wherein the network loss function is used for measuring and determining the accuracy of the prediction result of the full convolution single-stage neural network. And sequentially inputting a plurality of training sample data into the full convolution single-stage neural network, calculating a corresponding network loss function according to the output of the full convolution single-stage neural network, and continuously adjusting and optimizing network parameters of the full convolution single-stage neural network according to the network loss function until the network loss function meets the requirement of a preset threshold value.
When the full convolution single-stage neural network is optimized according to the network loss function, its network parameters can be iteratively optimized and adjusted by gradient back-propagation, as sketched below.
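A minimal training-loop sketch of this gradient back-propagation optimization follows; the optimizer choice (SGD), learning rate, epoch count and stopping threshold are assumptions, and compute_network_loss stands in for the network loss function described later.

```python
import torch


def train(model, data_loader, compute_network_loss, epochs=12, lr=1e-3, threshold=0.05):
    """Iteratively optimize the network parameters by gradient back-propagation."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        for images, targets in data_loader:
            predictions = model(images)                         # prediction information
            loss = compute_network_loss(predictions, targets)   # network loss function
            optimizer.zero_grad()
            loss.backward()                                     # gradient back-propagation
            optimizer.step()                                    # adjust network parameters
        if loss.item() < threshold:                             # stop once the loss meets a preset threshold
            break
    return model
```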
After the full convolution single-stage neural network is optimized, the full convolution single-stage neural network can be utilized to process a scene image to be detected, only the two-dimensional scene image to be detected is used as neural network input, and a plurality of vehicle targets and corresponding three-dimensional information of the scene image to be detected can be rapidly and accurately identified.
According to the vehicle scene image processing method, a full convolution single-stage neural network is used as a detection framework, a basic scene image and corresponding three-dimensional information are used as input data in a model training stage, a prediction target and corresponding prediction three-dimensional information are correspondingly set as output, and training optimization is carried out on the full convolution single-stage neural network. The neural network model is trained in an end-to-end mode to directly output the three-dimensional information of the vehicle target, so that the suboptimal problem can be avoided, and the effective stable optimization of the neural network model is realized. Therefore, in the target identification stage, when the optimized full-convolution single-stage neural network is used for target identification, the accuracy of the vehicle target identification result is ensured. And in the target identification stage, when a vehicle target is identified, only a two-dimensional vehicle scene image is required to be used as input data of the full convolution single-stage neural network, and the full convolution single-stage neural network can directly output three-dimensional information of a predicted target. The method saves the intermediate process and post-processing step of converting two-dimensional information into three-dimensional information, can greatly simplify the complexity of the method, can still ensure the execution efficiency of the method when facing a multi-target complex scene, and meets the requirement of timeliness.
In one or more optional embodiments of the present application, in a method for processing an image of a vehicle scene, the training sample data further includes regression box information of the vehicle target.
In order to accurately locate and detect the vehicle target, a regression box of the vehicle target can be determined in the basic scene image; the generated regression box should wrap the vehicle target as tightly and completely as possible. In some alternative embodiments, the regression box of the vehicle target may be determined from the coordinate data of the plurality of three-dimensional key points of the vehicle target.
B = (min_i x_i, min_i y_i, max_i x_i, max_i y_i)
where B represents the regression box information, and x_i and y_i respectively represent the abscissa and the ordinate of the i-th three-dimensional key point. In this way the coordinates of the lower-left corner point and the upper-right corner point of the regression box are determined, and these two corner coordinates are used as the regression box information.
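A small sketch of this computation, assuming the key points have already been projected into the image plane as (x_i, y_i) pairs:

```python
import numpy as np


def regression_box(keypoints_2d):
    """Compute the regression box B from the projected 3D key points of one vehicle.

    keypoints_2d: array of shape (n, 2) holding (x_i, y_i) for each key point.
    Returns (x_min, y_min, x_max, y_max), i.e. the lower-left and upper-right corners.
    """
    keypoints_2d = np.asarray(keypoints_2d, dtype=np.float32)
    x_min, y_min = keypoints_2d.min(axis=0)
    x_max, y_max = keypoints_2d.max(axis=0)
    return x_min, y_min, x_max, y_max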
In some optional embodiments, the prediction network is provided with a category output channel, a three-dimensional output channel, a regression output channel and a center output channel.
As shown in fig. 3, the performing, by using the prediction network, the target prediction according to the feature image, determining a prediction target, and outputting corresponding prediction information includes:
s301: and outputting the prediction category information of the prediction target by using the category output channel.
S302: outputting the predicted three-dimensional information of the predicted target using the three-dimensional output channel.
S303: and outputting the prediction regression frame information of the prediction target by using the regression output channel.
S304: and outputting the predicted centrality information of the predicted target by using the central output channel.
The prediction information further includes the prediction regression box information and the prediction centrality information.
The centrality information defines the position relation between the position of the feature point on the feature image and the corresponding vehicle target regression frame. The centrality value ranges from 0 to 1, wherein 1 represents that the feature point is located at the center of the vehicle target regression frame, and 0 represents that the feature point is located at the edge of the vehicle target regression frame.
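The application does not write out an explicit centrality formula; for illustration only, the sketch below uses an FCOS-style centerness, which matches the stated behaviour (1 at the box center, approaching 0 at the edge). The exact formula is an assumption.

```python
import math


def centerness(px, py, box):
    """Centrality of feature point (px, py) w.r.t. regression box (x_min, y_min, x_max, y_max):
    1 at the box center, approaching 0 at the edges.
    The exact FCOS-style formula here is an assumption; the patent only specifies the 0-1 behaviour."""
    x_min, y_min, x_max, y_max = box
    l, r = px - x_min, x_max - px     # distances to the left / right edges
    t, b = py - y_min, y_max - py     # distances to the top / bottom edges
    if min(l, r, t, b) <= 0:
        return 0.0                    # point outside (or on the edge of) the box
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```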
The prediction network comprises 4 convolutional layers, corresponding respectively to the category output channel, the three-dimensional output channel, the regression output channel and the central output channel, and outputs the prediction category information, the predicted three-dimensional information, the prediction regression box information and the prediction centrality information through these channels respectively.
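A minimal sketch of such a four-channel prediction head follows; the channel widths, key-point count and class count are illustrative assumptions.

```python
import torch
from torch import nn


class PredictionHead(nn.Module):
    """Sketch of the prediction network: four parallel convolutional output channels."""

    def __init__(self, in_channels=512, n_keypoints=8, n_classes=1):
        super().__init__()
        self.cls_channel = nn.Conv2d(in_channels, n_classes, kernel_size=3, padding=1)        # category
        self.kpt_channel = nn.Conv2d(in_channels, 2 * n_keypoints, kernel_size=3, padding=1)  # 3D key-point offsets
        self.box_channel = nn.Conv2d(in_channels, 4, kernel_size=3, padding=1)                # regression box
        self.ctr_channel = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)                # centrality

    def forward(self, feature_image):
        return {
            "class": self.cls_channel(feature_image),       # prediction category information
            "keypoints": self.kpt_channel(feature_image),   # predicted three-dimensional information
            "box": self.box_channel(feature_image),         # prediction regression box information
            "centerness": self.ctr_channel(feature_image),  # predicted centrality information
        }
```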
In the vehicle scene image processing method, the training sample data further includes regression box information of the vehicle target, and the prediction information output correspondingly thereto further includes prediction regression box information of the prediction target. By adding regression box information in training sample data, the input information amount of the feature extraction network can be increased, and the generated feature information contained in the feature image is richer. When the prediction network carries out prediction based on the characteristic image, the prediction result is more accurate. And adding the prediction regression box information in the prediction information output by the prediction network, wherein the prediction regression box information corresponds to the regression box information in the training sample data, and the comparison result of the prediction regression box information and the regression box information can be used as another index for measuring the optimization effect of the full-convolution single-stage neural network. Therefore, by adding the prediction regression frame information in the prediction information, the training optimization effect of the full convolution single-stage neural network can be improved, and the target recognition result of the full convolution single-stage neural network is more accurate.
The prediction centrality information is additionally arranged in the prediction information, and can effectively inhibit the occurrence of a low-quality prediction regression box, ensure the target prediction effect of the prediction network and improve the target identification performance of the full convolution single-stage neural network. The prediction centrality information can also be used as another index for measuring the optimization effect of the full convolution single-stage neural network, so that the training optimization effect of the full convolution single-stage neural network is further improved.
In order to make the predicted three-dimensional key point coordinates more stable, the prediction network can be set to predict the relative distance from the three-dimensional key point to the feature point so as to avoid directly predicting the absolute coordinates of the three-dimensional key point.
In one or more optional embodiments of the present application, the predicted three-dimensional information includes relative distances between a plurality of predicted key points and corresponding feature points of the vehicle target in the feature image.
For each three-dimensional key point of the target vehicle, the prediction network learns
(Δx_i, Δy_i) = (x_i − x_center, y_i − y_center)
i.e. the relative distance from the three-dimensional key point to the corresponding feature point coordinates of the target vehicle on the feature image, where (x_center, y_center) is the feature point on the feature image F closest to the center of the vehicle target. The three-dimensional output channel outputs the relative distances (Δx, Δy) of the plurality of three-dimensional key points; adding the coordinates (x_center, y_center) of the corresponding feature point of the target vehicle then determines the position of each predicted three-dimensional key point in the image coordinate system.
The prediction network takes the relative distances between a plurality of predicted key points of the vehicle target and the corresponding feature points in the feature image as the predicted three-dimensional information; the coordinate information of the plurality of predicted key points can then be determined by adding the coordinates of the corresponding feature points, rather than directly outputting the coordinate data of the predicted key points. By outputting the relative distances from the predicted key points to the feature points, the prediction network avoids directly outputting their absolute coordinates. In this way, effective training optimization can be carried out, and the determined coordinates of the predicted key points are ensured to be more stable.
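A small sketch of this relative-distance encoding and decoding, assuming key points and feature points are expressed in the same coordinate system (stride handling between the feature map and the input image is omitted for brevity):

```python
import numpy as np


def encode_keypoints(keypoints_2d, feature_point):
    """Relative distances (dx_i, dy_i) from each 3D key point to the feature point
    (x_center, y_center) closest to the vehicle center -- the training targets."""
    return np.asarray(keypoints_2d, dtype=np.float32) - np.asarray(feature_point, dtype=np.float32)


def decode_keypoints(offsets, feature_point):
    """Recover key-point positions in the image coordinate system by adding the
    feature-point coordinates back onto the predicted relative distances."""
    return np.asarray(offsets, dtype=np.float32) + np.asarray(feature_point, dtype=np.float32)
```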
As shown in fig. 4, in a vehicle scene image processing method provided in one or more optional embodiments of the present application, the comparing the prediction information with the training sample data to calculate and determine a network loss function of the full-convolution single-stage neural network includes:
s401: and respectively calculating and determining a corresponding category loss function, three-dimensional loss function, regression loss function and central loss function according to the training sample data aiming at the prediction category information, the prediction three-dimensional information, the prediction regression frame information and the prediction centrality information.
For the prediction category information, the corresponding class loss function may be calculated using the focal loss function, Focal Loss (FL):
FL(p) = −α · (1 − p)^γ · log(p), if y = 1
FL(p) = −(1 − α) · p^γ · log(1 − p), if y = −1
where FL(p) represents the class loss function, p ∈ [0,1] represents the probability that the prediction target output by the prediction network belongs to the vehicle class, and y ∈ {1, −1} marks the vehicle-target class and the background class respectively. α and γ are tuning parameters of the loss function, which can be adjusted flexibly according to actual conditions; in some alternative embodiments, α may be set to 0.25 and γ to 2.
For the predicted three-dimensional information, the corresponding three-dimensional loss function may be calculated using the smoothed distance loss function Smooth L1:
L_reg(p_r, t_1) = 0.5 · (p_r − t_1)^2, if |p_r − t_1| < 1
L_reg(p_r, t_1) = |p_r − t_1| − 0.5, otherwise
where L_reg(p_r, t_1) represents the three-dimensional loss function, p_r represents the relative distance between a vehicle three-dimensional key point and the corresponding feature point determined from the predicted three-dimensional information output by the three-dimensional output channel, and t_1 represents the relative distance between the three-dimensional key point determined from the training sample data and the feature point.
For the prediction regression box information, the corresponding regression loss function may likewise be calculated using the smoothed distance loss function:
L_reg(p_q, t_2) = 0.5 · (p_q − t_2)^2, if |p_q − t_2| < 1
L_reg(p_q, t_2) = |p_q − t_2| − 0.5, otherwise
where L_reg(p_q, t_2) represents the regression loss function, p_q represents the regression quantity determined from the prediction regression box information output by the regression output channel, and t_2 represents the regression quantity determined from the training sample data.
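Both the three-dimensional loss and the regression loss use the same Smooth L1 form; a small sketch (with the conventional β = 1 threshold) follows.

```python
import torch


def smooth_l1(prediction, target, beta=1.0):
    """Smoothed distance (Smooth L1) loss used for both the three-dimensional loss
    L_reg(p_r, t_1) and the regression-box loss L_reg(p_q, t_2)."""
    diff = torch.abs(torch.as_tensor(prediction, dtype=torch.float32)
                     - torch.as_tensor(target, dtype=torch.float32))
    loss = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()
```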
For the predicted centrality information, the corresponding center loss function may be calculated using a binary cross-entropy loss function:
L_cls(c, p_c) = E_c[−c · log(p_c) − (1 − c) · log(1 − p_c)]
where L_cls(c, p_c) represents the center loss function, c represents the centrality determined from the feature point coordinates and the regression box information, E_c is the mathematical expectation over the centrality c, and p_c represents the predicted centrality information output by the prediction network.
S402: and synthesizing the class loss function, the central loss function, the three-dimensional loss function and the regression loss function to calculate and determine the network loss function.
The network loss function is:
LOSS = β1 · FL(p) + β2 · L_reg(p_r, t_1) + β3 · L_reg(p_q, t_2) + β4 · L_cls(c, p_c)
where β1, β2, β3 and β4 are adjustment parameters of the network loss function LOSS, which can be adjusted flexibly according to actual conditions. In some alternative embodiments, β1, β2, β3 and β4 may all be set to 1, i.e. the direct sum of the class loss function, the three-dimensional loss function, the regression loss function and the center loss function is taken as the network loss function.
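A sketch of assembling the overall network loss from library implementations; the dictionary keys and the use of torchvision's sigmoid_focal_loss in place of the hand-written losses above are assumptions made for brevity.

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss


def network_loss(pred, target, betas=(1.0, 1.0, 1.0, 1.0)):
    """LOSS = b1*FL + b2*L_reg(3D) + b3*L_reg(box) + b4*L_cls(centrality).
    `pred` and `target` are dicts of tensors; key names are illustrative assumptions."""
    b1, b2, b3, b4 = betas
    cls_loss = sigmoid_focal_loss(pred["class_logits"], target["class"],
                                  alpha=0.25, gamma=2.0, reduction="mean")
    kpt_loss = F.smooth_l1_loss(pred["keypoint_offsets"], target["keypoint_offsets"])
    box_loss = F.smooth_l1_loss(pred["box"], target["box"])
    ctr_loss = F.binary_cross_entropy_with_logits(pred["centerness_logits"], target["centerness"])
    return b1 * cls_loss + b2 * kpt_loss + b3 * box_loss + b4 * ctr_loss
```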
In the vehicle scene image processing method, the network parameters of the full convolution single-stage neural network are iteratively optimized and adjusted by gradient back-propagation according to the network loss function. When the network loss function is calculated, a suitable loss function is selected for each of the prediction category information, the predicted three-dimensional information, the prediction regression box information and the prediction centrality information, and the network loss function of the whole full convolution single-stage neural network is then computed from these. Iteratively optimizing the network parameters in this way can improve iteration efficiency while ensuring the target-recognition accuracy of the optimized full convolution single-stage neural network.
Fig. 5 shows a schematic diagram of a method for performing target recognition detection on a scene image to be detected. As shown in fig. 5, a method for processing an image of a vehicle scene according to one or more alternative embodiments of the present application further includes:
s501: and acquiring a scene image to be detected, and inputting the scene image to be detected into the optimized full convolution single-stage neural network.
S502: determining a plurality of candidate targets and corresponding prediction information according to the output of the full convolution single-stage neural network.
S503: and screening and filtering the candidate targets according to the prediction information, and determining a plurality of vehicle targets and corresponding three-dimensional information in the scene image to be detected.
Considering that there may be targets of non-vehicle categories among the multiple candidate targets output by the full convolution single-stage neural network, filtering may be performed according to the prediction category information corresponding to the candidate targets. Since repeated recognition may also occur among the candidate targets belonging to the vehicle category, further de-duplication processing is required for these candidate targets.
As shown in fig. 6, in a method for processing an image of a vehicle scene according to one or more optional embodiments of the present application, the filtering a plurality of candidate targets according to the prediction information includes:
s601: and carrying out category judgment on the candidate targets according to the predicted category information, and filtering out non-vehicle category targets in the candidate targets.
The prediction category information and the prediction centrality information corresponding to the candidate target are processed with the nonlinear Sigmoid function, so that both output values lie in the range [0,1].
The nonlinear Sigmoid function is calculated as:
S(x) = 1 / (1 + e^(−x))
The output value S_cls corresponding to the prediction category information is compared with a preset threshold value to judge whether the candidate target belongs to the vehicle class; targets of non-vehicle categories are filtered out.
S602: and calculating and determining category confidence degrees of the candidate targets according to the prediction category information and the prediction centrality information.
Sig = S_cls × ctn
where S_cls and ctn respectively represent the output values of the prediction category information and the prediction centrality information after Sigmoid processing, and Sig represents the class confidence of the candidate target.
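A small sketch of this Sigmoid activation, threshold-based category judgment and class-confidence computation (the 0.5 threshold is an assumed value):

```python
import torch


def class_confidence(class_logit, centerness_logit, threshold=0.5):
    """Sigmoid-activate the raw category and centrality outputs, mark vehicle-class
    candidates against a preset threshold, and return the class confidence Sig = S_cls * ctn."""
    s_cls = torch.sigmoid(torch.as_tensor(class_logit, dtype=torch.float32))
    ctn = torch.sigmoid(torch.as_tensor(centerness_logit, dtype=torch.float32))
    is_vehicle = s_cls >= threshold          # category judgment against a preset threshold
    return is_vehicle, s_cls * ctn
```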
S603: and performing deduplication filtering on the candidate targets by adopting a non-maximum suppression algorithm based on the category confidence.
A set W may be constructed using a plurality of the candidate objects with non-vehicle classes filtered out, the set being filtered using a non-maximum suppression algorithm. The specific filtering method comprises the following steps:
step1: and according to the category confidence degrees of the candidate objects in the set W, arranging the candidate objects in a descending order.
Step2: and selecting the candidate target T with the maximum category confidence coefficient in the set W, and filtering out targets with the overlapping rate of the candidate target T being more than or equal to a preset overlapping threshold value.
Step3: when the overlapping rates of other candidate targets in the set W and the candidate target T are smaller than the preset overlapping rate, taking out the candidate target T from the combination W, and putting the candidate target T into the set W fin l
Step4: and (5) circularly executing Step2-3 until the set W is empty. Set W final The plurality of targets included in (a) is the vehicle target that is ultimately identified.
According to the vehicle scene image processing method, the optimized full convolution single-stage neural network is used for processing a scene image to be detected, a plurality of candidate targets in the scene image to be detected are determined, the candidate targets are further screened and filtered according to corresponding prediction information, the target category accuracy is guaranteed, the de-duplication operation is executed, and finally a plurality of mutually independent vehicle targets and corresponding three-dimensional information can be accurately determined.
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that the foregoing describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same purpose as the first aspect, in a second aspect, the embodiment of the application provides a vehicle scene image processing system.
As shown in fig. 7, a vehicle scene image processing system according to one or more alternative embodiments of the present application includes:
a training sample preparation unit 701, configured to collect a basic scene image, determine three-dimensional information of a vehicle target in the basic scene image, and combine the basic scene image and the three-dimensional information to generate training sample data;
a sample feature extraction unit 702, configured to use the training sample data as an input of a feature extraction network in a full convolution single-stage neural network, perform feature extraction on the training sample data by using the feature extraction network, and generate a feature image;
a target information prediction unit 703, configured to use the feature image as an input of a prediction network in the full-convolution single-stage neural network, perform target prediction according to the feature image by using the prediction network, determine a prediction target, and output corresponding prediction information, where the prediction information includes prediction category information and prediction three-dimensional information of the prediction target; and
and the neural network optimization unit 704 is configured to compare the prediction information with the training sample data, calculate and determine a network loss function of the full convolution single-stage neural network, and optimize the full convolution single-stage neural network according to the network loss function.
In a vehicle scene image processing system provided in one or more optional embodiments of the present application, the training sample data further includes regression box information of the vehicle target;
the prediction network is provided with a category output channel, a three-dimensional output channel, a regression output channel and a central output channel;
the target information prediction unit 703 is further configured to output the prediction category information of the prediction target by using the category output channel; outputting the predicted three-dimensional information of the predicted target using the three-dimensional output channel; outputting the prediction regression frame information of the prediction target by utilizing the regression output channel; and outputting the predicted centrality information of the predicted target by using the central output channel. The prediction information further includes the prediction regression box information and the prediction centrality information.
In one or more optional embodiments of the present application, the predicted three-dimensional information includes relative distances between a plurality of predicted key points and corresponding feature points of the vehicle target in the feature image.
In one or more optional embodiments of the present application, in the vehicle scene image processing system, the neural network optimization unit 704 is further configured to calculate and determine a corresponding class loss function, three-dimensional loss function, regression loss function, and central loss function for the predicted class information, the predicted three-dimensional information, the predicted regression box information, and the predicted centrality information, respectively, according to the training sample data; and synthesizing the class loss function, the central loss function, the three-dimensional loss function and the regression loss function to calculate and determine the network loss function.
In a vehicle scene image processing system provided in one or more optional embodiments of the present application, the neural network optimization unit 704 is further configured to calculate the category loss function corresponding to the prediction category information by using a focal loss function; calculate the three-dimensional loss function and the regression loss function respectively corresponding to the prediction three-dimensional information and the prediction regression box information by using a smooth distance loss function; and calculate the center loss function corresponding to the prediction centrality information by using a two-class cross entropy loss function.
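A hedged sketch of these loss terms and their combination is given below: torchvision's sigmoid_focal_loss stands in for the focal loss, smooth L1 stands in for the smooth distance loss, and binary cross entropy implements the two-class cross entropy on the centrality branch. Equal weighting of the four terms is an assumption; the application only requires that they be combined into one network loss function.

import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def compute_network_loss(pred: dict, target: dict) -> torch.Tensor:
    """pred and target hold per-location tensors for the four branches (layout assumed)."""
    cls_loss = sigmoid_focal_loss(pred["cls"], target["cls"], reduction="mean")  # category loss
    kpt_loss = F.smooth_l1_loss(pred["kpt"], target["kpt"])                       # three-dimensional loss
    box_loss = F.smooth_l1_loss(pred["box"], target["box"])                       # regression loss
    ctr_loss = F.binary_cross_entropy_with_logits(pred["ctr"], target["ctr"])     # center loss
    return cls_loss + kpt_loss + box_loss + ctr_loss                              # combined network loss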
In one or more optional embodiments of the present application, in the vehicle scene image processing system, the neural network optimization unit 704 is further configured to perform iterative optimization and adjustment on a network parameter of the full convolution single-stage neural network by using a gradient back propagation manner according to the network loss function.
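This iterative adjustment corresponds to the standard optimizer loop; the minimal sketch below assumes SGD, a fixed learning rate, and a fixed number of epochs, none of which are specified by this application.

import torch

def optimize_network(network: torch.nn.Module, loss_fn, data_loader,
                     num_epochs: int = 12, lr: float = 1e-2) -> None:
    """Iteratively adjusts the parameters of the full convolution single-stage
    neural network according to the network loss."""
    optimizer = torch.optim.SGD(network.parameters(), lr=lr, momentum=0.9)
    for _ in range(num_epochs):
        for images, targets in data_loader:
            loss = loss_fn(network(images), targets)  # network loss for the current batch
            optimizer.zero_grad()
            loss.backward()                           # back-propagate gradients to all network parameters
            optimizer.step()                          # adjust parameters along the gradient direction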
The vehicle scene image processing system provided in one or more optional embodiments of the present application further comprises a vehicle target recognition unit. The vehicle target recognition unit is configured to acquire a scene image to be detected and input the scene image to be detected into the optimized full convolution single-stage neural network; determine a plurality of candidate targets and corresponding prediction information according to the output of the full convolution single-stage neural network; and screen and filter the plurality of candidate targets according to the prediction information, so as to determine a plurality of vehicle targets and their corresponding three-dimensional information in the scene image to be detected.
In a vehicle scene image processing system provided in one or more optional embodiments of the present application, the vehicle target recognition unit is further configured to perform category determination on the plurality of candidate targets according to the prediction category information and filter out non-vehicle targets among the plurality of candidate targets; calculate and determine a category confidence of each remaining candidate target according to the prediction category information and the prediction centrality information; and perform de-duplication filtering on the candidate targets based on the category confidences by using a non-maximum suppression algorithm.
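The screening and filtering stage can be pictured as the post-processing sketch below; the way the category confidence combines class probability with centrality, the thresholds, and the use of torchvision's nms operator are assumptions made for illustration.

import torch
from torchvision.ops import nms

def filter_candidates(boxes: torch.Tensor, cls_probs: torch.Tensor, ctr_probs: torch.Tensor,
                      vehicle_class: int = 0, score_thresh: float = 0.3,
                      iou_thresh: float = 0.5) -> torch.Tensor:
    """boxes: (N, 4) candidate regression boxes; cls_probs: (N, C) class probabilities;
    ctr_probs: (N,) centrality probabilities. Returns indices of the kept vehicle targets."""
    labels = cls_probs.argmax(dim=1)                  # category determination per candidate
    keep = labels == vehicle_class                    # filter out non-vehicle candidates
    scores = cls_probs[:, vehicle_class] * ctr_probs  # category confidence (combination assumed)
    keep &= scores > score_thresh
    idx = keep.nonzero(as_tuple=True)[0]
    kept = nms(boxes[idx], scores[idx], iou_thresh)   # de-duplication by non-maximum suppression
    return idx[kept]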
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and each unit is described separately. Of course, when the present application is implemented, the functionality of the various units may be implemented in the same piece or pieces of software and/or hardware.
The system of the above embodiment is used for implementing the corresponding vehicle scene image recognition method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to the method of any embodiment, the present disclosure further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and operable on the processor, and when the processor executes the program, the vehicle scene image recognition method according to any embodiment is implemented.
Fig. 8 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the vehicle scene image recognition method provided in the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of a ROM (read-only memory), a RAM (random access memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and called by the processor 1010 for execution.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component within the device (not shown in the figure) or may be external to the device to provide corresponding functionality. The input devices may include a keyboard, a mouse, a touch screen, a microphone, and various sensors, and the output devices may include a display, a speaker, a vibrator, an indicator light, and the like.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above device shows only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above device may also include only the components necessary to implement the embodiments of the present specification, and need not include all of the components shown in the figures.
The electronic device of the above embodiment is used for implementing the corresponding vehicle scene image recognition method in any one of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-described embodiment methods, the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the vehicle scene image recognition method according to any of the above embodiments.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the vehicle scene image recognition method according to any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown within the provided figures, for simplicity of illustration and discussion and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the embodiments of the application, in view of the fact that specifics with respect to the implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., such specifics should be well within the purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative rather than restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, the discussed embodiments may be used with other memory architectures, such as dynamic RAM (DRAM).
The embodiments of the present application are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like that can be made within the spirit and principles of the embodiments of the present application are intended to be included within the scope of protection of the present application.

Claims (10)

1. A vehicle scene image processing method is characterized by comprising the following steps:
acquiring a basic scene image, determining three-dimensional information of a vehicle target in the basic scene image, and combining the basic scene image and the three-dimensional information to generate training sample data;
taking the training sample data as the input of a feature extraction network in a full convolution single-stage neural network, and performing feature extraction on the training sample data by using the feature extraction network to generate a feature image;
taking the feature image as the input of a prediction network in the full convolution single-stage neural network, performing target prediction according to the feature image by using the prediction network, determining a prediction target and outputting corresponding prediction information, wherein the prediction information comprises prediction category information and prediction three-dimensional information of the prediction target;
and comparing the prediction information with the training sample data, calculating and determining a network loss function of the full convolution single-stage neural network, and optimizing the full convolution single-stage neural network according to the network loss function.
2. The method according to claim 1, wherein the training sample data further includes regression box information of the vehicle target;
the prediction network is provided with a category output channel, a three-dimensional output channel, a regression output channel and a central output channel;
the performing target prediction according to the feature image by using the prediction network, determining a prediction target and outputting corresponding prediction information comprises:
outputting the prediction category information of the prediction target using the category output channel;
outputting the predicted three-dimensional information of the predicted target using the three-dimensional output channel;
outputting the prediction regression box information of the prediction target by using the regression output channel;
outputting predicted centrality information of the predicted target by using the central output channel;
the prediction information further includes the prediction regression box information and the prediction centrality information.
3. The method of claim 2, wherein the predicted three-dimensional information comprises relative distances between a plurality of predicted keypoints and corresponding feature points of the vehicle target in the feature image.
4. The method of claim 2, wherein said comparing said prediction information to said training sample data to computationally determine a network loss function for said fully-convolutional single-stage neural network comprises:
calculating and determining, according to the training sample data, a corresponding category loss function, three-dimensional loss function, regression loss function, and center loss function for the prediction category information, the prediction three-dimensional information, the prediction regression box information, and the prediction centrality information, respectively; and
combining the category loss function, the center loss function, the three-dimensional loss function, and the regression loss function to calculate and determine the network loss function.
5. The method according to claim 4, wherein the calculating and determining, according to the training sample data, the category loss function, the three-dimensional loss function, the regression loss function, and the center loss function respectively corresponding to the prediction category information, the prediction three-dimensional information, the prediction regression box information, and the prediction centrality information comprises:
calculating the category loss function corresponding to the prediction category information by using a focal loss function;
calculating the three-dimensional loss function and the regression loss function respectively corresponding to the prediction three-dimensional information and the prediction regression box information by using a smooth distance loss function; and
calculating the center loss function corresponding to the prediction centrality information by using a two-class cross entropy loss function.
6. The method of claim 1, wherein the optimizing the full convolution single-stage neural network according to the network loss function comprises:
performing iterative optimization adjustment on the network parameters of the full convolution single-stage neural network in a gradient back-propagation manner according to the network loss function.
7. The method of claim 2, further comprising:
acquiring a scene image to be detected, and inputting the scene image to be detected into the optimized full convolution single-stage neural network;
determining a plurality of candidate targets and corresponding prediction information according to the output of the full convolution single-stage neural network;
and screening and filtering the candidate targets according to the prediction information, and determining a plurality of vehicle targets and corresponding three-dimensional information in the scene image to be detected.
8. The method of claim 7, wherein the screening and filtering the plurality of candidate targets according to the prediction information comprises:
performing category determination on the plurality of candidate targets according to the prediction category information, and filtering out non-vehicle category targets among the plurality of candidate targets;
calculating and determining category confidence degrees of the candidate targets according to the prediction category information and the prediction centrality information;
and performing deduplication filtering on the candidate targets by adopting a non-maximum suppression algorithm based on the category confidence.
9. A vehicle scene image processing system, comprising:
the training sample preparation unit is used for acquiring a basic scene image, determining three-dimensional information of a vehicle target in the basic scene image, and combining the basic scene image and the three-dimensional information to generate training sample data;
the sample feature extraction unit is used for taking the training sample data as the input of a feature extraction network in a full convolution single-stage neural network, and performing feature extraction on the training sample data by using the feature extraction network to generate a feature image;
the target information prediction unit is used for taking the feature image as the input of a prediction network in the full convolution single-stage neural network, performing target prediction according to the feature image by using the prediction network, determining a prediction target and outputting corresponding prediction information, wherein the prediction information comprises prediction category information and prediction three-dimensional information of the prediction target; and
the neural network optimization unit is used for comparing the prediction information with the training sample data, calculating and determining a network loss function of the full convolution single-stage neural network, and optimizing the full convolution single-stage neural network according to the network loss function.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.
CN202210720991.5A 2022-06-16 2022-06-16 Vehicle scene image processing method and system and electronic equipment Pending CN115170903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210720991.5A CN115170903A (en) 2022-06-16 2022-06-16 Vehicle scene image processing method and system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210720991.5A CN115170903A (en) 2022-06-16 2022-06-16 Vehicle scene image processing method and system and electronic equipment

Publications (1)

Publication Number Publication Date
CN115170903A true CN115170903A (en) 2022-10-11

Family

ID=83487776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210720991.5A Pending CN115170903A (en) 2022-06-16 2022-06-16 Vehicle scene image processing method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN115170903A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661577A (en) * 2022-11-01 2023-01-31 吉咖智能机器人有限公司 Method, apparatus, and computer-readable storage medium for object detection
CN115661577B (en) * 2022-11-01 2024-04-16 吉咖智能机器人有限公司 Method, apparatus and computer readable storage medium for object detection

Similar Documents

Publication Publication Date Title
US11423695B2 (en) Face location tracking method, apparatus, and electronic device
EP3605386B1 (en) Method and apparatus for obtaining vehicle loss assessment image, server and terminal device
CN109188457B (en) Object detection frame generation method, device, equipment, storage medium and vehicle
US9563953B2 (en) Systems and methods for determining a seam
WO2018191437A1 (en) Image-based vehicle loss assessment method, apparatus, and system, and electronic device
US20150248765A1 (en) Depth sensing using an rgb camera
CN112101305B (en) Multi-path image processing method and device and electronic equipment
EP3204888A1 (en) Spatial pyramid pooling networks for image processing
EP3093822B1 (en) Displaying a target object imaged in a moving picture
CN108921131B (en) Method and device for generating face detection model and three-dimensional face image
CN112904331B (en) Method, device, equipment and storage medium for determining moving track
US10891740B2 (en) Moving object tracking apparatus, moving object tracking method, and computer program product
US9633444B2 (en) Method and device for image segmentation
US9659235B2 (en) Low-dimensional structure from high-dimensional data
US11120298B2 (en) Tensor image mapping device, method, and computer program product
US10089764B2 (en) Variable patch shape synthesis
CN108960012A (en) Feature point detecting method, device and electronic equipment
CN115170903A (en) Vehicle scene image processing method and system and electronic equipment
CN113393494A (en) Model training and target tracking method and device, electronic equipment and storage medium
CN111027551B (en) Image processing method, apparatus and medium
CN113095257A (en) Abnormal behavior detection method, device, equipment and storage medium
CN116958873A (en) Pedestrian tracking method, device, electronic equipment and readable storage medium
CN112991429B (en) Box volume measuring method, device, computer equipment and storage medium
CN114612572A (en) Laser radar and camera external parameter calibration method and device based on deep learning
CN115100286B (en) Unmanned aerial vehicle acquisition viewpoint determining method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination