CN111178253B - Visual perception method and device for automatic driving, computer equipment and storage medium - Google Patents
- Publication number: CN111178253B (application CN201911382829.1A)
- Authority: CN (China)
- Prior art keywords: image, network, classification, visual perception, lane line
- Legal status: Active
Classifications
- G06V20/588—Recognition of the road, e.g. of lane markings; recognition of the vehicle driving pattern in relation to the road
- G06F18/214—Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/267—Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects, e.g. traffic signs, traffic lights or roads
- G06V20/582—Recognition of traffic signs
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
Abstract
The application relates to a visual perception method and device for automatic driving, computer equipment and a storage medium. The method comprises the following steps: acquiring a collected visual perception image; inputting the visual perception image into the backbone network of a trained multi-task neural network, and extracting shared features of the visual perception image through the backbone network to obtain a shared feature map; inputting the shared feature map into each branch network of the multi-task neural network, each branch network performing its own task classification based on the shared feature map and outputting a classification result for that task; and extracting the classification results of the corresponding tasks according to preset visual perception targets and fusing them to obtain a visual perception result, wherein the visual perception result comprises at least one of lane line information, road sign information, traffic area road condition information and road obstacle information. The method improves the accuracy of visual perception.
Description
Technical Field
The present disclosure relates to the field of automatic driving technology, and in particular, to a visual perception method, apparatus, computer device, and storage medium for automatic driving.
Background
With the continuous improvement of computer software and hardware capabilities and the general improvement of sensor precision, automatic driving has become an important research field that receives wide attention from academia and industry. An automatic driving system can be roughly divided into three layers: a perception layer, a decision layer and a control layer. The perception layer is the basis of the three layers and is responsible for perceiving and identifying the environment around the vehicle. It relies on the cooperation of multiple sensing technologies, such as cameras, millimeter-wave radar, lidar, ultrasonic radar, infrared night vision, and positioning and navigation sensors such as GPS (Global Positioning System) and IMU (Inertial Measurement Unit). In addition, elements that are not active detection sensors but provide cooperative global data assistance, such as high-precision maps and vehicle networking technology, can further extend the vehicle's environment sensing capability. Through mutual complementation and fusion, these perception technologies together allow the vehicle to meet the very high safety requirements of driving scenarios.
In recent years, with the rapid development of deep learning, the accuracy of many traditional computer vision tasks has been greatly improved. Moreover, cameras are inexpensive and can complement work that other sensors cannot complete, so vision-based perception algorithms have been widely researched and have reached practical deployment in automatic driving and driver assistance. However, existing visual perception algorithms rely on traditional hand-crafted feature extraction and cannot adapt to the complex working conditions of real environments, which reduces the accuracy of visual perception.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a visual perception method, apparatus, computer device, and storage medium for automatic driving that can improve visual perception accuracy.
A visual perception method for automatic driving, the method comprising:
acquiring an acquired visual perception image;
inputting the visual perception image into a backbone network of a trained multi-task neural network, and extracting shared features of the visual perception image through the backbone network to obtain a shared feature map;
inputting the shared feature map into each branch network in the multi-task neural network respectively, classifying corresponding tasks by each branch network based on the shared feature map, and outputting classification results of the corresponding tasks;
and extracting classification results of the corresponding tasks according to preset visual perception targets and fusing them to obtain a visual perception result, wherein the visual perception result comprises at least one of lane line information, road sign information, traffic area road condition information and road obstacle information.
In one embodiment, the branch network includes a lane line detection network, a lane line instance segmentation network, and a line classification network;
the step of inputting the shared feature map into each branch network in the multi-task neural network, respectively performing corresponding task classification by each branch network based on the shared feature map, and outputting classification results of the corresponding tasks includes:
inputting the shared feature map into the lane line detection network, and detecting lane lines based on the shared feature map through the lane line detection network to obtain a binary lane line image;
inputting the shared feature map into the lane line instance segmentation network, and performing lane line instance segmentation based on the shared feature map through the lane line instance segmentation network to obtain a lane line instance segmentation image;
inputting the shared feature map into the linear classification network, and carrying out lane line type classification based on the shared feature map through the linear classification network to obtain lane line type images.
In one embodiment, the extracting the classification result of the corresponding task according to the preset visual perception target and fusing the classification result to obtain the visual perception result includes:
extracting a binary lane line image, a lane line instance segmentation image and a lane line type image from the classification result according to a preset visual perception target;
Clustering the image matrix corresponding to the lane line instance segmentation image to obtain a lane line instance clustering image;
according to the lane line example cluster image and the lane line type image, carrying out example classification and line type classification on the binary lane line image to obtain a lane line image;
and performing curve fitting on each lane line in the lane line image, and calculating the average confidence coefficient of the lane line pixel points corresponding to each lane line to obtain lane line information.
In one embodiment, the branch network comprises a pavement marking classification network;
the step of inputting the shared feature map into each branch network in the multi-task neural network, respectively performing corresponding task classification by each branch network based on the shared feature map, and outputting classification results of the corresponding tasks includes:
and inputting the shared feature map into the road sign classification network, and carrying out road sign detection classification based on the shared feature map through the road sign classification network to obtain a road sign classification image.
In one embodiment, the extracting the classification result of the corresponding task according to the preset visual perception target and fusing the classification result to obtain the visual perception result includes:
Extracting road sign classified images from the classified results according to preset visual perception targets;
extracting pavement marker images corresponding to the pavement markers from the pavement marker classification images;
and respectively carrying out ellipse fitting on each pavement marking image, and calculating the average confidence coefficient of the corresponding pixel point of each pavement marking image to obtain pavement marking information.
In one embodiment, the branch network includes a traffic zone detection network and a vehicular pedestrian instance segmentation network;
the step of inputting the shared feature map into each branch network in the multi-task neural network, respectively performing corresponding task classification by each branch network based on the shared feature map, and outputting classification results of corresponding tasks, including:
inputting the shared feature map into a passing area detection network, and carrying out area detection based on the shared feature map through the passing area detection network to obtain an area classification image;
and inputting the shared feature map into a vehicle-pedestrian instance segmentation network, and carrying out vehicle-pedestrian instance segmentation on the basis of the shared feature map through the vehicle-pedestrian instance segmentation network to obtain a vehicle-pedestrian instance segmentation image.
In one embodiment, the extracting the classification result of the corresponding task according to the preset visual perception target and fusing the classification result to obtain the visual perception result includes at least one of the following:
a first item:
extracting a regional classification image from the classification result according to a preset visual perception target;
extracting passable region images and road edge images from the region classification images, and acquiring first requirements;
parameterizing the passable region image and the road edge image according to the first requirement, and respectively calculating the average confidence coefficient of the pixel points corresponding to the passable region image and the average confidence coefficient of the pixel points corresponding to the road edge image to obtain traffic region road condition information;
the second item:
extracting a region classification image and a vehicle pedestrian instance segmentation image from the classification result according to a preset visual perception target;
extracting a vehicle image and a pedestrian image from the region classification image;
clustering the image matrix corresponding to the vehicle pedestrian instance segmentation image to obtain a vehicle pedestrian instance clustering image;
performing instance classification on the vehicle image and the pedestrian image according to the vehicle pedestrian instance clustering image to obtain a vehicle instance image and a pedestrian instance image;
Respectively performing matrix fitting on the vehicle example image and the pedestrian example image, and respectively calculating the average confidence coefficient of the pixel points corresponding to the vehicle example image and the average confidence coefficient of the pixel points corresponding to the pedestrian example image to obtain traffic area road condition information;
third item:
extracting a regional classification image from the classification result according to a preset visual perception target;
extracting a traffic sign image from the region classification image;
and carrying out rectangular fitting on the traffic sign image, and calculating the average confidence coefficient of the pixel points corresponding to the traffic sign image to obtain traffic area road condition information.
In one embodiment, the branch network comprises a road obstacle classification network;
the step of inputting the shared feature map into each branch network in the multi-task neural network, respectively performing corresponding task classification by each branch network based on the shared feature map, and outputting classification results of corresponding tasks, including:
inputting the shared feature map into the road surface obstacle classification network, and carrying out obstacle detection classification based on the shared feature map through the road surface obstacle classification network to obtain an obstacle classification image.
In one embodiment, the extracting the classification result of the corresponding task according to the preset visual perception target and fusing the classification result to obtain the visual perception result includes:
extracting an obstacle classification image from the classification result according to a preset visual perception target;
extracting an obstacle image corresponding to each obstacle from the obstacle classified image, and acquiring a second requirement;
and parameterizing the obstacle images according to the second requirement, and calculating the average confidence coefficient of the corresponding pixel points of each obstacle image to obtain road obstacle information.
In one embodiment, the method further comprises:
acquiring a training data set; the training data set comprises a plurality of types of training samples corresponding to the branch networks and labeling results of the training samples;
invoking a data loader corresponding to each branch network in the multi-task neural network, wherein the data loader acquires training samples corresponding to each branch network and labeling results of the training samples from the training data set;
inputting each training sample into a backbone network to be trained of the multi-task neural network to be trained, and extracting shared features of the training samples through the backbone network to be trained to obtain a shared sample feature map;
respectively inputting the shared sample feature map into each branch network to be trained in the multi-task neural network to be trained, respectively classifying corresponding tasks by each branch network to be trained based on the shared sample feature map, and outputting training classification results of the corresponding tasks;
determining a loss function of each branch network according to the training classification result and the labeling result of the training sample;
linearly superposing the loss functions to obtain a global loss function (a sketch of this step follows this embodiment);
and carrying out back propagation on the multi-task neural network for obtaining the training classification result according to the global loss function, and carrying out iterative training to obtain a trained multi-task neural network.
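Purely for illustration, the following sketch shows how the per-branch losses described above can be linearly superposed into a global loss that is then back-propagated through the whole network; the branch names, weights and loss choices are assumptions, not values fixed by this disclosure. The helper is framework-agnostic and works with PyTorch loss tensors or plain floats.

```python
def global_loss(branch_losses, weights=None):
    """Linearly superpose per-branch losses into one global loss.
    branch_losses: dict of branch name -> scalar loss (tensor or float).
    weights: optional dict of linear weights (default 1.0 per branch)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * loss for name, loss in branch_losses.items())

# One training iteration (sketch): each branch's data loader supplies its samples,
# the shared backbone produces a shared sample feature map, each branch computes
# its own loss, and the superposed loss is back-propagated through the whole network.
# loss = global_loss({"lane_seg": l1, "line_type": l2, "obstacle": l3})
# loss.backward(); optimizer.step()
```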
A visual perception device for automatic driving, the device comprising:
the acquisition module is used for acquiring the acquired visual perception image;
the extraction module is used for inputting the visual perception image into a backbone network of a trained multi-task neural network, and extracting shared features of the visual perception image through the backbone network to obtain a shared feature map;
the classification module is used for respectively inputting the shared feature map into each branch network in the multi-task neural network, each branch network classifying its corresponding task based on the shared feature map and outputting the classification result of the corresponding task;
the fusion module is used for extracting classification results of the corresponding tasks according to preset visual perception targets and fusing them to obtain a visual perception result, wherein the visual perception result comprises at least one of lane line information, road sign information, traffic area road condition information and road obstacle information.
A computer device comprising a memory storing a computer program and a processor that implements the steps of the visual perception method for automatic driving described above when executing the computer program.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the visual perception method for automatic driving described above.
According to the above visual perception method and device for automatic driving, computer equipment and storage medium, after the collected visual perception image is acquired, the shared feature map of the visual perception image is extracted through the backbone network of the trained multi-task neural network, each branch network of the multi-task neural network classifies its corresponding task based on the shared feature map to obtain the classification results output by the branch networks, and finally the corresponding classification results are fused according to the visual perception targets to obtain visual perception results such as lane line information, road sign information, traffic area road condition information and obstacle information. In this way, the visual perception tasks are treated as a whole and the trained multi-task neural network detects and classifies a plurality of visual perception tasks simultaneously, so the method quickly adapts to the detection requirements of multiple tasks in a real environment and improves the accuracy of visual perception.
Drawings
FIG. 1 is an application environment diagram of a visual perception method for automatic driving in one embodiment;
FIG. 2 is a flow diagram of a visual perception method for automatic driving in one embodiment;
FIG. 3 is a schematic diagram of a binary lane line image in one embodiment;
FIG. 4 is a schematic diagram of an example segmentation image of lane lines in one embodiment;
FIG. 5 is a schematic view of a lane line type image in one embodiment;
FIG. 6 is a schematic diagram of a pavement marker classification image in one embodiment;
FIG. 7 is a schematic diagram of a region classified image in one embodiment;
FIG. 8 is a schematic diagram of an obstacle classification image in one embodiment;
FIG. 9 is a schematic diagram of a structure of a multi-tasking neural network in one embodiment;
FIG. 10 is a flow chart of a method of training a multi-tasking neural network in one embodiment;
FIG. 11 is a block diagram of a visual perception device for automatic driving in one embodiment;
FIG. 12 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The visual perception method for automatic driving can be applied to an application environment shown in fig. 1. The application environment relates to the image pickup apparatus 102 and the computer apparatus 104, and can be applied to an automatic driving system. Wherein the image capturing apparatus 102 communicates with the computer apparatus 104 via a network. After the image capture device 102 captures the visual perception image, the visual perception image is transmitted to the computer device 104. After the computer equipment 104 acquires the acquired visual perception image, the computer equipment 104 inputs the visual perception image into a trunk network of the trained multi-task neural network, and the sharing characteristic of the visual perception image is extracted through the trunk network to obtain a sharing characteristic diagram; the computer equipment 104 inputs the shared feature map into each branch network in the multi-task neural network respectively, and each branch network classifies the corresponding task based on the shared feature map and outputs the classification result of the corresponding task; the computer device 104 extracts and fuses the classification results of the corresponding tasks according to the preset visual perception targets to obtain visual perception results, wherein the visual perception results comprise at least one of lane line information, road sign information, traffic area road condition information and road obstacle information. Among them, the image pickup apparatus 102 includes, but is not limited to, a camera, a video camera, or an apparatus carrying an image pickup function. The computer device 104 may be a terminal or a server, which may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the server may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, there is provided a visual perception method of automatic driving, which is described by taking an example that the method is applied to the computer device in fig. 1, and includes the following steps:
step S202, acquiring an acquired visual perception image.
Wherein, the visual perception image is an image which needs to be visually perceived and is acquired by the image pickup device. For example, an image captured by an image capturing apparatus mounted on an autonomous vehicle in an autonomous system. The mounting positions of the image pickup apparatus include, but are not limited to, front view, side view, rear view, and built-in.
Specifically, when the automatic driving vehicle starts automatic driving, the image capturing device performs image capturing on a driving area within a shooting range to obtain a visual perception image. Then, the image pickup apparatus transmits the acquired visual sense image to the computer apparatus, from which the computer apparatus acquires the acquired visual sense image.
Step S204, the visual perception image is input into a trunk network of the trained multi-task neural network, and the sharing characteristic of the visual perception image is extracted through the trunk network to obtain a sharing characteristic diagram.
The multi-task neural network is a neural network that combines a plurality of visual perception tasks and comprises a backbone network and a branch network for each task. The output of the backbone network serves as the input of every branch network. In this embodiment, the multi-task neural network has been trained in advance and can perform the visual perception tasks directly. The backbone network is shared by the plurality of tasks; this sharing saves computation while improving the generalization capability of the backbone network.
Specifically, after the computer device acquires the visual perception image, a trained multi-tasking neural network is invoked. And inputting the visual perception image into a backbone network of the multi-task neural network, and extracting image features of the visual perception image by the backbone network to obtain a shared feature map. For example, the shared features of the visual perception image are obtained by performing operations such as convolution pooling on the visual perception image through each network layer in the backbone network.
Step S206, the shared feature map is respectively input into each branch network in the multi-task neural network, each branch network respectively classifies the corresponding task based on the shared feature map, and the classification result of the corresponding task is output.
The branch network is a network structure for further extracting image features required by specific tasks based on the information of the shared feature map. In the present embodiment, the branch network includes, but is not limited to, a lane line semantic segmentation network, a lane line instance segmentation network, a line type classification network, a road sign classification network, a traffic area detection network, a vehicle pedestrian instance segmentation network, and a road obstacle classification network.
Specifically, after the computer device obtains the sharing feature map output by the backbone network of the multi-task neural network, the sharing feature map is input to each branch network in the multi-task neural network. For example, the shared feature map is input into a lane line semantic segmentation network, a lane line instance segmentation network, a line type classification network, a road sign classification network, a traffic area detection network, a vehicle pedestrian instance segmentation network, and a road obstacle classification network, respectively. And then, the branch network which receives the shared feature map further extracts the image features of the visual perception image aiming at different tasks according to the input shared feature map, and outputs corresponding results according to the extracted deep image features.
It should be appreciated that when the shared feature map is input to each branch network, the shared feature map may be input to a specified branch network by an instruction. For example, the shared feature map may be assigned to any one or more of seven branch networks of the lane line semantic division network, the lane line instance division network, the line type classification network, the road sign classification network, the traffic area detection network, the vehicle pedestrian instance division network, and the road obstacle classification network.
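As a concrete illustration of the shared-backbone/branch structure just described, here is a minimal PyTorch-style sketch; the layer sizes, branch names and class counts are assumptions chosen only to show the data flow, not the network of this embodiment.

```python
import torch
from torch import nn

class MultiTaskPerceptionNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared backbone: turns the visual perception image into one shared feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Branch heads: each consumes the shared feature map independently.
        self.branches = nn.ModuleDict({
            "lane_seg":      nn.Conv2d(64, 2, 1),   # lane / background
            "lane_instance": nn.Conv2d(64, 4, 1),   # per-pixel embedding for clustering
            "line_type":     nn.Conv2d(64, 5, 1),   # solid, dashed, double line, ...
            "road_marking":  nn.Conv2d(64, 16, 1),  # road sign classes
            "traffic_area":  nn.Conv2d(64, 6, 1),   # passable area, road edge, vehicle, ...
            "vp_instance":   nn.Conv2d(64, 4, 1),   # vehicle/pedestrian instance embedding
            "obstacle":      nn.Conv2d(64, 3, 1),   # road obstacle classes
        })

    def forward(self, image, tasks=None):
        shared = self.backbone(image)            # shared feature map
        tasks = tasks or list(self.branches)     # run all branches, or only the requested ones
        return {name: self.branches[name](shared) for name in tasks}

# outputs = MultiTaskPerceptionNet()(torch.rand(1, 3, 256, 512), tasks=["lane_seg", "line_type"])
```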
And step S208, extracting classification results of the corresponding tasks according to the preset visual perception targets, and fusing to obtain visual perception results.
The preset visual perception target is a visual perception task configured for a visual perception layer in an automatic driving system, and corresponds to an obtained visual perception result, and the visual perception target comprises at least one of lane lines, road marks, road conditions of a passing area and road obstacles. The visual perception result comprises at least one of lane line information, road sign information, traffic zone road condition information and road obstacle information. The fusion is to systematically fuse the classified results output by each branch network. It is understood that the classification result is parameterized so as to facilitate upper layer processing.
Specifically, after the classification results output by the branch tasks are obtained, corresponding classification results are extracted from the classification results output by the plurality of branch networks according to visual perception targets preset in the automatic driving system. And then, parameterizing the extracted corresponding classification result to obtain a corresponding visual perception result.
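A minimal sketch of this target-driven extraction and fusion step, assuming the branch outputs are kept in a dict keyed by branch name as in the earlier sketch; the target names, branch groupings and parameterizer callables are hypothetical.

```python
def fuse(branch_outputs, perception_targets, parameterizers):
    """For each configured perception target, pick only the branch outputs it needs
    and hand them to that target's parameterization routine (e.g. curve, ellipse or
    rectangle fitting), returning one fused perception result per target."""
    needed = {
        "lane_lines":   ["lane_seg", "lane_instance", "line_type"],
        "road_signs":   ["road_marking"],
        "traffic_area": ["traffic_area", "vp_instance"],
        "obstacles":    ["obstacle"],
    }
    return {
        target: parameterizers[target](
            {name: branch_outputs[name] for name in needed[target]})
        for target in perception_targets
    }
```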
According to the above visual perception method for automatic driving, after the collected visual perception image is acquired, the shared feature map of the visual perception image is extracted through the backbone network of the trained multi-task neural network, each branch network of the multi-task neural network classifies its corresponding task based on the shared feature map to obtain the classification results output by the branch networks, and finally the corresponding classification results are fused according to the visual perception targets to obtain visual perception results such as lane line information, road sign information, traffic area road condition information and obstacle information. The method treats the visual perception tasks as a whole and uses the trained multi-task neural network to detect and classify a plurality of visual perception tasks simultaneously, so it can quickly adapt to the detection requirements of multiple tasks in a real environment and improves the accuracy of visual perception. Meanwhile, because there is no coupling between tasks in this multi-task design, branch tasks can be added or removed as required without retraining the network and without affecting the accuracy of the other tasks.
In one embodiment, the method includes respectively inputting the shared feature map into each branch network in the multi-task neural network, respectively classifying the corresponding tasks based on the shared feature map, and outputting classification results of the corresponding tasks, including: inputting the shared feature map into a lane line semantic segmentation network, and detecting lane lines based on the shared feature map through the lane line semantic segmentation network to obtain a binary lane line image; inputting the shared feature image into a lane line instance segmentation network, and carrying out lane line instance segmentation on the basis of the shared feature image through the lane line instance segmentation network to obtain a lane line instance segmentation image; inputting the shared feature map into a linear classification network, and carrying out lane line type classification based on the shared feature map through the linear classification network to obtain lane line type images.
The lane line semantic segmentation network is a network for performing target detection on lane lines in the visual perception image based on semantic segmentation. The lane line instance segmentation network is a network for performing object detection on lane lines in a visual perception image based on instance segmentation. Semantic segmentation refers to classifying all pixels on an image, while different instances belonging to the same object do not need to be distinguished separately. The example segmentation can be used for distinguishing individual information while segmenting specific semantics, namely, different individuals of the same object on the image can be marked. In this embodiment, only the lane lines and other image areas are marked in the image output by the lane line semantic segmentation network based on semantic segmentation. And different lane lines can be distinguished while marking the region where the lane line is located and other regions in the output image of the lane line example segmentation network based on the example segmentation. For example, points on the same lane have the same label value, and point labels on different lanes have different label values. The line type classification network is a network for detecting the line type of each lane line in the visual perception image, and the line type comprises a solid line, a broken line, a double line and the like.
Specifically, after a shared feature map output by a backbone network of the multi-task neural network is obtained, the shared feature map is respectively input into a lane line semantic segmentation network, a lane line instance segmentation network and a line type classification network.
The lane line semantic segmentation network further extracts the characteristics of the lane lines in the visual perception image based on the shared feature map, so that the binary lane line image is obtained through semantic segmentation. The binary lane line image refers to an image having pixel values of only 1 and 0. When the lane line on the visual perception image is represented by 1 and the other image areas are represented by 0, then the pixel value of the pixel point corresponding to the lane line on the binary lane line image is 1, and the pixel value of the pixel point corresponding to the other image areas is 0. Thus, the detected lane line and other area images are distinguished by the binary lane line image. As shown in fig. 3, a schematic diagram of a binary lane line image is provided. A schematic diagram of a binary lane line image is output by a lane line semantic segmentation network, and referring to fig. 3, the pixel value of a point on a lane line is 1 (white area), and the pixel value of other points is 0 (black area).
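As an illustration of how such a binary lane line image (and the per-pixel confidences used later for averaging) can be produced from the segmentation output, here is a small sketch; the two-channel logit layout and the 0.5 threshold are assumptions.

```python
import numpy as np

def binary_lane_mask(seg_logits):
    """seg_logits: float array of shape (2, H, W) with channels (background, lane).
    Returns the binary lane image (1 = lane pixel, 0 = other) and a per-pixel
    lane confidence map for the later average-confidence computation."""
    shifted = seg_logits - seg_logits.max(axis=0, keepdims=True)
    prob = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)  # softmax
    mask = (prob[1] > 0.5).astype(np.uint8)
    return mask, prob[1]
```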
The lane line example segmentation network further extracts the features of the spatial positions of the lane lines in the visual perception image based on the shared feature map, so as to obtain a lane line example segmentation image. The lane line example segmentation image can be understood as an image obtained by performing spatial division on the visual perception image according to the spatial position relation of the lane lines. As shown in fig. 4, a schematic diagram of a lane line example cluster image is provided. The lane line instance clustering image is obtained by clustering lane line instance segmentation images output by a lane line instance segmentation network. Referring to fig. 4, different gray values represent different clustered regions in the figure.
The line type classification network further extracts the corresponding characteristics of the line types of the lane lines in the visual perception image based on the shared characteristic image, so that the line types of the lane lines are determined, and the line type image of the lane lines is obtained. As shown in fig. 5, a schematic diagram of a lane line type image is provided. The lane line type image is output by the line type classification network, and referring to fig. 5, different gray values in the figure represent different line types, and the line type of each lane line can be obtained by combining the lane line type image.
In one embodiment, the method includes respectively inputting the shared feature map into each branch network in the multi-task neural network, respectively classifying the corresponding tasks based on the shared feature map, and outputting classification results of the corresponding tasks, including: and inputting the shared feature map into a road sign classification network, and carrying out road sign detection classification based on the shared feature map through the road sign classification network to obtain a road sign classification image.
The road sign classification network is a network for detecting road signs on visual perception images, and the road signs comprise, but are not limited to, stop lines, zebra stripes, forbidden regions, straight running signs, left-turn signs, right-turn signs, straight left-turn, straight right-turn, straight left-turn, turning around signs, left-turn turning around, diamond signs, inverted triangle signs, road characters, opposite lane arrows and the like.
Specifically, after the shared feature map output by the backbone network of the multi-task neural network is obtained, the shared feature map is input to the road sign classification network. The road sign classification network further extracts features related to the road signs based on the shared feature map, so as to detect the road signs included in the visual perception image, and outputs the road sign classification image after marking the road signs according to different types of road signs. As shown in fig. 6, a schematic representation of a pavement marker classification image is provided. The pavement marking image is output by a pavement marking classification network, and referring to fig. 6, different gray values represent different pavement marking regions.
In one embodiment, the method includes respectively inputting the shared feature map into each branch network in the multi-task neural network, respectively classifying the corresponding tasks based on the shared feature map, and outputting classification results of the corresponding tasks, including: inputting the shared feature map into a traffic area detection network, and carrying out area detection based on the shared feature map through the traffic area detection network to obtain an area classification image; and inputting the shared feature map into a vehicle-pedestrian instance segmentation network, and carrying out instance segmentation on the vehicle-pedestrian by the vehicle-pedestrian instance segmentation network based on the shared feature map to obtain a vehicle-pedestrian instance segmentation image.
The traffic area detection network is a network for detecting the road conditions of the traffic area in the visual perception image, covering the passable area (the area where vehicles can travel), vehicles, pedestrians, traffic signs, road edges and the like. The vehicle pedestrian instance segmentation network is a network for detecting vehicles and pedestrians in the visual perception image based on instance segmentation.
Specifically, after the shared feature map output by the backbone network of the multi-task neural network is obtained, the shared feature map is respectively input into the traffic area detection network and the vehicle pedestrian instance segmentation network. The traffic area detection network obtains a passable area, vehicles, pedestrians, traffic signs and road edges included in the visual perception image based on feature detection further extracted by the shared feature map, and obtains an area classification image. It is understood that the region classification image includes at least one of a passable region, a vehicle, a pedestrian, a traffic sign, and a road edge thereon. As shown in fig. 7, a schematic diagram of a region classification image is provided. The region classification image is output by the traffic region detection network, referring to fig. 7, the gray region in the lower half is a passable region, the gray region in the upper right corner is a traffic sign, the white line is a road edge, the gray-black region around the line is a vehicle, and the black region is other regions not including the passable region, the traffic sign, the road edge and the pedestrians of the vehicle.
The vehicle-pedestrian instance segmentation network works similarly to the lane line instance segmentation network: it further extracts the features of vehicles and pedestrians in the visual perception image based on the shared feature map, determines the individual information of vehicles and pedestrians, and obtains a vehicle-pedestrian instance segmentation image. The vehicle-pedestrian instance segmentation image is then clustered to obtain a vehicle-pedestrian instance clustered image, which is combined with the vehicle image and pedestrian image obtained by semantic segmentation in the traffic area detection network to further obtain a vehicle instance image and a pedestrian instance image. The vehicle-pedestrian instance segmentation image can be understood as an image obtained by instance segmentation of the vehicles and pedestrians in the visual perception image, in which each vehicle individual and each pedestrian individual is marked with a different label.
In one embodiment, the method includes respectively inputting the shared feature map into each branch network in the multi-task neural network, respectively classifying the corresponding tasks based on the shared feature map, and outputting classification results of the corresponding tasks, including: inputting the shared feature map into a road surface obstacle classification network, and detecting and classifying the obstacle based on the shared feature map through the road surface obstacle classification network to obtain an obstacle classification image.
Wherein the road obstacle classification network is a network for detecting road obstacles on the visual perception image, including but not limited to road blocks, cones, and the like.
Specifically, after the shared feature map output by the backbone network of the multi-task neural network is obtained, the shared feature map is input into the road surface obstacle classification network. The road surface obstacle classification network further extracts features related to road surface obstacles based on the shared feature map, detects the road surface obstacles contained in the visual perception image, marks them according to obstacle type, and outputs an obstacle classification image. As shown in fig. 8, a schematic diagram of an obstacle classification image is provided. The obstacle classification image is output by the road surface obstacle classification network; referring to fig. 8, the white area is an obstacle and the black area is the remaining area that does not contain an obstacle.
In one embodiment, the visual perception results include lane line information. Extracting classification results of corresponding tasks according to preset visual perception targets for fusion, wherein the steps of obtaining the visual perception results include: extracting a binary lane line image, a lane line instance segmentation image and a lane line type image from a classification result according to a preset visual perception target; clustering the image matrix corresponding to the lane line instance segmentation image to obtain a lane line instance clustering image; according to the lane line example cluster image and the lane line type image, carrying out example classification and line type classification on the binary lane line image to obtain a lane line image; and performing curve fitting on each lane line in the lane line image, and calculating the average confidence coefficient of the lane line pixel points corresponding to each lane line to obtain lane line information.
Specifically, when the preset visual perception target is a lane line, the binary lane line image, the lane line instance segmentation image and the lane line type image are obtained from the classification results. The image matrix corresponding to the lane line instance segmentation image is then clustered with a clustering algorithm to obtain a lane line instance clustered image, so that lane line pixels belonging to the same region are clustered into one class. Any clustering algorithm may be used, including but not limited to DBSCAN (Density-Based Spatial Clustering of Applications with Noise), the mean-shift algorithm and K-means clustering.
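For illustration, a sketch of this clustering step using DBSCAN (any of the algorithms named above would do); the embedding dimensionality and the eps/min_samples values are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_lane_instances(embedding, lane_mask, eps=0.5, min_samples=50):
    """Cluster lane pixels in the instance-embedding space so that pixels of the
    same lane line receive the same instance label.
    embedding: (D, H, W) per-pixel embedding; lane_mask: (H, W) binary lane image."""
    ys, xs = np.nonzero(lane_mask)
    feats = embedding[:, ys, xs].T                     # (N_lane_pixels, D)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    instance_map = np.zeros(lane_mask.shape, dtype=np.int32)
    instance_map[ys, xs] = labels + 1                  # 0 = background / noise
    return instance_map
```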
After the lane line instance clustered image is obtained, instance classification is performed on the binary lane line image according to the lane line instance clustered image, and line type classification is performed on the binary lane line image according to the lane line type image, giving the lane line image. Because the lane line instance clustered image divides the space according to the spatial position relation of each lane line (as shown in fig. 4; for convenience of visualization, this embodiment uses a two-dimensional spatial representation), combining it with the binary lane line image output by the lane line semantic segmentation network allows the points on the same lane line to be given the same label while points on different lane lines receive different labels, thereby achieving lane line instance segmentation. The lane line type image carries the line type of each lane line, so the line type of each lane line can be further determined and a line type flag bit obtained for each lane line. Therefore, each lane line in the lane line image obtained through instance classification and line type classification has a corresponding color and line type that match the lane lines on the actual road. For example, the lane lines included in the lane line image may be any one or more of a white solid line, a white broken line, a yellow solid line, a yellow broken line, a double yellow line, a solid-broken line, a broken-solid line, a deceleration strip and an entrance/exit ramp thick line.
Then, since the layers above the perception layer cannot directly determine the lane line situation from the lane line image in order to perform subsequent automatic driving control, the lane line image needs to be parameterized to facilitate upper-layer processing. The parameterization can be achieved by curve fitting each lane line in the lane line image. Meanwhile, the confidence of each lane line in the lane line image is calculated, and the credibility of the lane line is determined through this confidence. The confidence of each lane line is the average of the confidences of the pixel points in that lane line region: after the confidence of every pixel point forming the lane line is obtained, the average over all these pixel points is taken as the confidence of the lane line. After the upper layer obtains the curve fitting result of a lane line, it determines the credibility of the lane line from the corresponding confidence and then performs the corresponding automatic driving control. It can be understood that the finally obtained lane line information includes the curve fitting parameters, the line type flag bits and the corresponding confidences. The curve fitting may use a RANSAC (Random Sample Consensus) framework, which improves robustness. The confidences of the lane line pixel points are output when the binary lane line image, the lane line instance segmentation image and the lane line type image are produced during lane line detection.
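The following sketch illustrates the parameterization of one lane line instance: a RANSAC-style polynomial fit x = f(y) plus the average confidence of its pixels. The polynomial degree, iteration count and inlier tolerance are assumptions.

```python
import numpy as np

def fit_lane(ys, xs, pixel_conf, degree=2, iters=100, tol=3.0, seed=0):
    """ys, xs: pixel coordinates of one lane line instance; pixel_conf: their confidences.
    Returns (polynomial coefficients of x = f(y), average lane confidence)."""
    rng = np.random.default_rng(seed)
    best_coeffs, best_count = None, -1
    for _ in range(iters):
        sample = rng.choice(len(ys), size=degree + 1, replace=False)
        coeffs = np.polyfit(ys[sample], xs[sample], degree)
        inliers = np.abs(np.polyval(coeffs, ys) - xs) < tol
        if inliers.sum() > best_count:
            best_coeffs = np.polyfit(ys[inliers], xs[inliers], degree)  # refit on inliers
            best_count = inliers.sum()
    return best_coeffs, float(np.mean(pixel_conf))
```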
In one embodiment, the visual perception result includes road sign information. The extracting and fusing the classification results of the corresponding tasks according to the preset visual perception target to obtain the visual perception result includes: extracting a road sign classification image from the classification results according to the preset visual perception target; extracting a road sign image corresponding to each road sign from the road sign classification image; and performing ellipse fitting on each road sign image and calculating the average confidence of the pixel points corresponding to each road sign image to obtain road sign information.
Specifically, when the preset visual perception target is a road sign, a road sign classification image is obtained from the classification result. And extracting pavement marker images corresponding to the pavement markers from the pavement marker classification images. For example, when the road sign classification image includes the straight sign, the left turn sign and the right turn sign at the same time, the image areas where the straight sign, the left turn sign and the right turn sign are located are extracted from the road sign classification image, and the corresponding road sign image is obtained. The extraction of the image region may employ any extraction method, for example, a connected region-based extraction method.
Then, since the layer above the perception layer cannot directly determine the road sign situation from the road sign image in order to perform subsequent automatic driving control, the road sign image needs to be parameterized to facilitate upper-layer processing. Each extracted road sign image is parameterized, the confidence corresponding to each road sign image is calculated, and the credibility of the road sign is determined from this confidence. The parameterization may be an ellipse fit to each road sign image. Meanwhile, the confidence of the pixel points corresponding to each road sign image is obtained and averaged; the resulting average confidence is the confidence of the road sign image. The road sign information therefore includes the ellipse fitting parameters, the road sign category flag bit and the corresponding confidence. After the upper layer obtains the ellipse fitting result of a road sign, it determines the credibility of the road sign from the corresponding confidence and then performs the corresponding automatic driving control. The confidence of each pixel point of the road sign image is output when the road sign classification network detects the road sign. The road sign category flag bits represent the different categories of road signs and are derived from the output of the road sign classification network.
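As an OpenCV-based sketch of this step (connected-component extraction of each marking region, an ellipse fit, and confidence averaging); the background id and the minimum region size are assumptions.

```python
import cv2
import numpy as np

def parameterize_road_markings(marking_class_map, pixel_conf, min_pixels=200):
    """marking_class_map: (H, W) int map from the road sign classification branch;
    pixel_conf: (H, W) per-pixel confidences.  Returns, per marking region, its
    class id, the fitted ellipse parameters and the average pixel confidence."""
    results = []
    for class_id in np.unique(marking_class_map):
        if class_id == 0:                              # 0 assumed to be background
            continue
        mask = (marking_class_map == class_id).astype(np.uint8)
        n_regions, labels = cv2.connectedComponents(mask)
        for region in range(1, n_regions):
            ys, xs = np.nonzero(labels == region)
            if len(xs) < min_pixels:                   # skip tiny spurious regions
                continue
            pts = np.stack([xs, ys], axis=1).astype(np.float32)
            ellipse = cv2.fitEllipse(pts)              # ((cx, cy), (axes), angle)
            results.append((int(class_id), ellipse, float(pixel_conf[ys, xs].mean())))
    return results
```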
In one embodiment, the visual perception result includes traffic area road condition information, which comprises at least one of a passable area, road edges, traffic signs, vehicles and pedestrians. Therefore, when the system fuses to obtain the traffic area road condition information, it can process three items separately: the passable area and road edges; traffic signs; and vehicles and pedestrians.
Extracting classification results of corresponding tasks according to preset visual perception targets for fusion, wherein the steps of obtaining the visual perception results include: obtaining a region classification image from the classification result; extracting passable region images and road edge images from the region classification images, and acquiring first requirements; parameterizing the passable region image and the road edge image according to the first requirement, and respectively calculating the average confidence coefficient of the pixels corresponding to the passable region image and the average confidence coefficient of the pixels corresponding to the road edge image to obtain the road condition information of the passable region.
Specifically, when the preset visual perception target is the passable area and road edge part of the traffic zone road condition, a region classification image is obtained from the classification result, and the passable region image and the road edge image are extracted from it. Any extraction method may be used for this step.
The upper layer above the perception layer cannot directly determine the passable area and the road edge from the extracted images and therefore cannot perform the subsequent automatic driving control on that basis, so the passable area image and the road edge image also need to be parameterized. A first requirement is acquired first, and each extracted passable area image and road edge image is parameterized according to it. The first requirement can be understood as an instruction issued by the upper layer according to how it needs the passable area and road edge results to be expressed; different requirements lead to different parameterization methods. For example, the passable area image may be expressed as n sample points in the column direction, while the road edge image may be treated like a lane line and fitted with a curve.
At the same time, the confidence corresponding to each passable area image and each road edge image is calculated: the confidences of the pixel points belonging to each image are obtained and averaged, and the resulting average confidence is taken as the confidence of that passable area image or road edge image. The traffic zone road condition information therefore comprises the parameterized results of the passable area image and the road edge image together with the corresponding confidences. After the upper layer obtains these parameterized results, it determines the credibility of the passable area and the road edge from the corresponding confidences and then performs the corresponding automatic driving control. The per-pixel confidences of the passable area image and the road edge image are output by the traffic area detection network when it performs detection.
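A minimal sketch of one possible parameterization is given below; it is not from the patent. It assumes `area_mask` marks passable-area pixels, `edge_mask` marks road-edge pixels, and `conf_map` holds per-pixel confidences from the traffic area detection branch. The "first requirement" is assumed here to mean n column-direction sample points for the passable area and a quadratic curve for the road edge.

```python
import numpy as np

def parameterize_area_and_edge(area_mask, edge_mask, conf_map, n_columns=32):
    h, w = area_mask.shape
    # Passable area: for n sampled columns, record the nearest passable row
    # (i.e. the boundary of the drivable region in that column).
    cols = np.linspace(0, w - 1, n_columns).astype(int)
    boundary = []
    for c in cols:
        rows = np.nonzero(area_mask[:, c])[0]
        boundary.append((int(c), int(rows.min())) if rows.size else (int(c), None))
    area_conf = float(conf_map[area_mask > 0].mean()) if area_mask.any() else 0.0

    # Road edge: fit a quadratic x = f(y), analogous to lane-line curve fitting.
    ys, xs = np.nonzero(edge_mask)
    edge_poly = np.polyfit(ys, xs, deg=2) if ys.size >= 3 else None
    edge_conf = float(conf_map[edge_mask > 0].mean()) if edge_mask.any() else 0.0

    return {"area_boundary": boundary, "area_confidence": area_conf,
            "edge_curve": edge_poly, "edge_confidence": edge_conf}
```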
In one embodiment, extracting and fusing the classification results of the corresponding tasks according to the preset visual perception target to obtain the visual perception result includes: extracting a region classification image and a vehicle-pedestrian instance segmentation image from the classification result according to the preset visual perception target; extracting a vehicle image and a pedestrian image from the region classification image; clustering the image matrix corresponding to the vehicle-pedestrian instance segmentation image to obtain a vehicle-pedestrian instance clustered image; performing instance classification on the vehicle image and the pedestrian image according to the vehicle-pedestrian instance clustered image to obtain a vehicle instance image and a pedestrian instance image; and performing rectangular fitting on the vehicle instance image and the pedestrian instance image respectively, and respectively calculating the average confidence of the pixel points corresponding to each, so as to obtain the traffic zone road condition information.
Specifically, when the preset visual perception target is the vehicles and pedestrians in the traffic zone road condition, a region classification image and a vehicle-pedestrian instance segmentation image are obtained from the classification result, and the vehicle image and the pedestrian image are extracted from the region classification image; any extraction method may be used. The image matrix corresponding to the vehicle-pedestrian instance segmentation image is then clustered with a clustering algorithm to obtain the vehicle-pedestrian instance clustered image. Any clustering algorithm may be employed, including but not limited to DBSCAN (Density-Based Spatial Clustering of Applications with Noise), the mean-shift algorithm and K-means clustering. Pixels belonging to the same vehicle, or to the same pedestrian, are clustered into one class.
After the vehicle-pedestrian instance clustered image is obtained, instance classification is performed on the vehicle and pedestrian images according to it: combining the clustered image with the vehicle and pedestrian images distinguishes individual vehicles from individual pedestrians and yields the vehicle instance image and the pedestrian instance image. To facilitate the upper-layer processing, the vehicle instance image and the pedestrian instance image are then parameterized, which may be done by rectangular fitting: a rectangle is fitted to each vehicle instance and each pedestrian instance. At the same time, the confidence of each vehicle in the vehicle instance image and of each pedestrian in the pedestrian instance image is calculated, and the credibility of each detected vehicle and pedestrian is determined from these confidences. The confidences of the pixel points belonging to a vehicle instance are averaged, and the resulting average confidence is taken as the confidence of that vehicle instance; pedestrian instances are treated in the same way. After the upper layer obtains the rectangular fitting results of the vehicle and pedestrian instance images, it determines the credibility of the vehicles and pedestrians from the corresponding confidences and then performs the corresponding automatic driving control. The traffic zone road condition information therefore includes the rectangular fitting parameters of the vehicle and pedestrian instance images, the target category flag bits and the corresponding confidences. The per-pixel confidences of the vehicle and pedestrian instance images are output when the traffic area detection network and the vehicle-pedestrian instance segmentation network perform their detection tasks, and the target category flag bits, which represent targets of different classes, are obtained from the outputs of these two networks.
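A hedged sketch of this clustering-plus-rectangle step follows; it is not the patent's code. It assumes `embeddings` is an (H, W, D) per-pixel output of the vehicle-pedestrian instance segmentation branch, `class_mask` labels pixels as vehicle (1) or pedestrian (2) from the region classification image, and `conf_map` is the per-pixel confidence; DBSCAN is one of the clustering choices the text names.

```python
import numpy as np
import cv2
from sklearn.cluster import DBSCAN

def fit_vehicle_pedestrian_boxes(embeddings, class_mask, conf_map,
                                 eps=0.5, min_samples=50):
    ys, xs = np.nonzero(class_mask > 0)
    if ys.size == 0:
        return []
    feats = embeddings[ys, xs]                        # (N, D) pixel embeddings
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)

    instances = []
    for inst_id in set(labels) - {-1}:                # -1 is DBSCAN noise
        sel = labels == inst_id
        px, py = xs[sel], ys[sel]
        x, y, w, h = cv2.boundingRect(
            np.stack([px, py], axis=1).astype(np.int32))   # rectangle fit
        instances.append({
            "box": (x, y, w, h),
            "class_flag": int(np.bincount(class_mask[py, px]).argmax()),
            "confidence": float(conf_map[py, px].mean()),
        })
    return instances
```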
In one embodiment, extracting and fusing the classification results of the corresponding tasks according to the preset visual perception target to obtain the visual perception result includes: extracting a region classification image from the classification result according to the preset visual perception target; extracting a traffic sign image from the region classification image; and performing rectangular fitting on the traffic sign image and calculating the average confidence of the pixel points corresponding to the traffic sign image, so as to obtain the traffic zone road condition information.
Specifically, when the preset visual perception target is the traffic signs in the traffic zone road condition, a region classification image is obtained from the classification result, and the traffic sign image corresponding to each traffic sign is extracted from it. Any extraction method may be used.
Each extracted traffic sign image is then parameterized and its confidence calculated. Since most traffic signs are rectangular, rectangular fitting can be used so that the parameterization matches the actual shape of the signs: a rectangle is fitted to each traffic sign image. At the same time, the confidences of the pixel points belonging to each traffic sign image are obtained and averaged, and the resulting average confidence is taken as the confidence of that traffic sign image. The traffic zone road condition information therefore includes the rectangular fitting parameters of the traffic sign images, the traffic sign category flag bits and the corresponding confidences. After the upper layer obtains the rectangular fitting result of a traffic sign, it determines the credibility of the sign from the corresponding confidence and then performs the corresponding automatic driving control. The per-pixel confidences of the traffic sign images are output by the traffic area detection network when it detects traffic signs, and the traffic sign category flag bits, which represent traffic signs of different classes, are obtained from the output of the traffic area detection network.
In one embodiment, the visual perception result includes road obstacle information. Extracting and fusing the classification results of the corresponding tasks according to the preset visual perception target to obtain the visual perception result includes: extracting an obstacle classification image from the classification result according to the preset visual perception target; extracting an obstacle image corresponding to each obstacle from the obstacle classification image, and acquiring a second requirement; and parameterizing the obstacle images according to the second requirement and calculating the average confidence of the pixel points corresponding to each obstacle image, so as to obtain the road obstacle information.
Specifically, when the preset visual perception target is road obstacle information, a road obstacle classification image is obtained from the classification result, and an obstacle image corresponding to each obstacle is extracted from it. For example, when the obstacle classification image contains both roadblocks and traffic cones, the image areas where the roadblocks and cones are located are extracted from the classification image to obtain the corresponding obstacle images. Any extraction method may be used for extracting the image areas.
The obstacle images are then parameterized. A second requirement is acquired first, and each extracted obstacle image is parameterized according to it. The second requirement can be understood as an instruction issued by the upper layer according to how it needs the obstacle results to be expressed; different requirements lead to different parameterization methods. At the same time, the confidence corresponding to each obstacle image is calculated: the confidences of the pixel points belonging to each obstacle image are obtained and averaged, and the resulting average confidence is taken as the confidence of that obstacle image. The road obstacle information therefore includes the parameterized results of the obstacle images and the corresponding confidences. After the upper layer obtains the parameterized obstacle detection results, it determines the credibility of the obstacles from the corresponding confidences and then performs the corresponding automatic driving control. The per-pixel confidences of the obstacle images are output by the road obstacle classification network when it detects road obstacles.
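The sketch below illustrates requirement-driven parameterization under loose assumptions; the actual interface of the "second requirement" is not specified in the text, so the requirement strings and parameterization choices here are hypothetical. `obstacle_mask` and `conf_map` are assumed outputs of the road obstacle classification branch.

```python
import cv2
import numpy as np

def parameterize_obstacle(obstacle_mask, conf_map, requirement="bounding_box"):
    ys, xs = np.nonzero(obstacle_mask)
    if ys.size == 0:
        return None
    pts = np.stack([xs, ys], axis=1).astype(np.int32)
    if requirement == "bounding_box":
        params = cv2.boundingRect(pts)                 # (x, y, w, h)
    elif requirement == "convex_hull":
        params = cv2.convexHull(pts).reshape(-1, 2)    # polygon outline
    else:
        raise ValueError(f"unknown requirement: {requirement}")
    return {"params": params,
            "confidence": float(conf_map[ys, xs].mean())}  # average pixel confidence
```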
As shown in fig. 9, a schematic structural diagram of the multi-task neural network is provided, and the visual perception method for automatic driving is explained in detail below based on this structural diagram.
Specifically, the multi-task neural network includes a backbone network 90 with a multi-resolution U-shaped structure and seven branch networks: a lane line semantic segmentation network 901, a lane line instance segmentation network 902, a line type classification network 903, a pavement marking classification network 904, a traffic zone detection network 905, a vehicle-pedestrian instance segmentation network 906, and a road obstacle classification network 907. The seven branch networks may perform their up-sampling operations using either deconvolution or pixel shuffle.
After the visual perception image is acquired, it is input into the backbone network 90 of the multi-resolution U-shaped structure, which extracts the image features of the visual perception image and produces the shared feature map. The shared feature map is then input to the lane line semantic segmentation network 901 for semantic-segmentation-based lane line detection, to the lane line instance segmentation network 902 for instance-segmentation-based lane line detection, to the line type classification network 903 for lane line type detection, to the road sign classification network 904 for road sign detection, to the traffic area detection network 905 for detection of passable areas, road edges, vehicles and pedestrians, to the vehicle-pedestrian instance segmentation network 906 for instance-segmentation-based vehicle and pedestrian detection, and to the road obstacle classification network 907 for obstacle detection. Finally, system fusion is carried out on the outputs of the seven branch networks 901-907 to obtain the lane line information, road sign information, traffic zone road condition information and road obstacle information respectively.
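A structural sketch of this "shared backbone plus seven branch networks" topology is given below in PyTorch. The patent does not disclose layer configurations, so the backbone, the heads and the output channel counts here are placeholders that only mirror the topology of fig. 9.

```python
import torch
import torch.nn as nn

class MultiTaskPerceptionNet(nn.Module):
    def __init__(self, backbone: nn.Module, heads: nn.ModuleDict):
        super().__init__()
        self.backbone = backbone       # stand-in for the multi-resolution U-shaped extractor
        self.heads = heads             # seven task-specific branch networks

    def forward(self, image: torch.Tensor) -> dict:
        shared = self.backbone(image)  # shared feature map
        return {name: head(shared) for name, head in self.heads.items()}

# Example wiring with placeholder heads (the real branches are U-shaped decoders
# with deconvolution or pixel-shuffle up-sampling; channel counts are assumed).
heads = nn.ModuleDict({
    name: nn.Conv2d(64, out_ch, kernel_size=1)
    for name, out_ch in {
        "lane_semantic": 2, "lane_instance": 8, "line_type": 5,
        "road_marking": 10, "traffic_area": 6, "vehicle_pedestrian": 8,
        "road_obstacle": 4,
    }.items()
})
model = MultiTaskPerceptionNet(nn.Conv2d(3, 64, 3, padding=1), heads)
outputs = model(torch.randn(1, 3, 256, 512))   # dict of seven branch outputs
```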
The lane line information is obtained by fusing the binary lane line image, the lane line instance segmentation image and the lane line type image output by the lane line semantic segmentation network 901, the lane line instance segmentation network 902 and the line type classification network 903, and includes the parameterized lane lines and the corresponding confidences. First, instance classification and line type classification are applied to the binary lane line image using the lane line instance clustered image, obtained by clustering the lane line instance segmentation image, together with the lane line type image, yielding the lane line image. A curve is then fitted to each lane line in the lane line image; the fitting result is the parameterized lane line. Finally, the average confidence of the pixel points corresponding to each lane line is calculated to obtain that lane line's confidence.
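The fusion of the three lane-line outputs can be sketched as follows; this is an illustrative assumption, not the patent's code. It assumes the instance segmentation output has already been clustered into `instance_labels` (H, W, with 0 as background), and that `line_type_map` and `conf_map` are the line type classification output and the per-pixel confidence map.

```python
import numpy as np

def fuse_lane_lines(binary_lane_mask, instance_labels, line_type_map, conf_map):
    lanes = []
    for inst_id in np.unique(instance_labels):
        if inst_id == 0:
            continue
        sel = (instance_labels == inst_id) & (binary_lane_mask > 0)
        ys, xs = np.nonzero(sel)
        if ys.size < 3:
            continue
        coeffs = np.polyfit(ys, xs, deg=2)         # curve fit x = f(y)
        line_type = int(np.bincount(line_type_map[ys, xs]).argmax())
        lanes.append({
            "curve": coeffs,                        # parameterized lane line
            "line_type": line_type,                 # e.g. solid / dashed flag
            "confidence": float(conf_map[ys, xs].mean()),
        })
    return lanes
```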
The pavement marking information includes the parameterized pavement markings and the corresponding confidences. Each pavement marker image is extracted from the pavement marker classification image output by the pavement marker classification network 904, an ellipse is fitted to it to obtain the parameterized pavement marking, and the average confidence of its corresponding pixel points is calculated to obtain the confidence of the pavement marking.
The traffic zone road condition information comprises the parameterized passable areas, road edges, traffic signs, vehicles and pedestrians, together with their corresponding confidences. The passable area image, road edge image, traffic sign image, vehicle image and pedestrian image are extracted from the region classification image output by the traffic area detection network 905, and the passable area image and road edge image are parameterized according to the upper-layer requirement. The vehicle-pedestrian instance segmentation image output by the vehicle-pedestrian instance segmentation network 906 is clustered to obtain the vehicle-pedestrian instance clustered image; combining this clustered image with the vehicle image and the pedestrian image yields the vehicle instance image and the pedestrian instance image, to each of which a rectangle is fitted, the fitting results being the parameterized vehicles and pedestrians. A rectangle is likewise fitted to the traffic sign image to obtain the parameterized traffic sign. Then the average confidences of the pixel points corresponding to the passable area image and the road edge image are calculated to obtain the confidences of the passable area and the road edge; the average confidences of the pixel points corresponding to each vehicle and each pedestrian in the instance images are calculated to obtain the confidences of the vehicles and pedestrians; and the average confidence of the pixel points corresponding to the traffic sign image is calculated to obtain the confidence of the traffic sign.
The road obstacle information includes the parameterized road obstacles and the corresponding confidences. Each obstacle image is extracted from the road obstacle classification image output by the road obstacle classification network 907 and parameterized according to the requirement, giving the parameterized road obstacle, and the average confidence of each obstacle image's pixel points is calculated to obtain the confidence of the obstacle.
The classifications of the lane lines, the road signs, the traffic zone road conditions and the road obstacles are shown in table 1 below:
TABLE 1
It should be understood that, in this embodiment, the outputs of the multi-task neural network are divided among the seven branch networks 901-907 and finally fused into lane line information, road sign information, traffic zone road condition information and road obstacle information. These four categories are divided according to the actual requirements of this embodiment, specifically according to how dense or sparse the visually perceived objects are. Lane lines, road surfaces and road edges appear on driving roads with high probability and are easy to collect; they are dense samples, and the corresponding outputs of several branch networks can be maintained together. Road markings and road obstacles appear with very low probability and are hard to collect; they are sparse samples and can be maintained independently. The division can also be based on detection accuracy requirements or update frequency: tasks with high accuracy requirements are put into separate branches, while tasks with low accuracy requirements can be maintained together; frequently updated tasks are maintained separately, and infrequently updated tasks are maintained together.
Compared with conventional single-task recognition methods, this embodiment treats the visual perception tasks as a whole and performs feature extraction with a single neural network, which adapts to the detection requirements of each task in real environments while maintaining detection accuracy. At the same time, the labeled data of each task are maintained separately: all detection targets do not need to be labeled simultaneously in the same image, and different tasks label different targets, which facilitates large-scale data production and reduces the cost of data labeling and maintenance.
In one embodiment, as shown in fig. 10, a training method of a multi-task neural network is provided, which specifically includes the following steps:
step S1002, a training data set is obtained; the training data set comprises a plurality of types of training samples corresponding to each branch network and labeling results of each training sample.
The training data set consists of visual perception images used for training the multi-task neural network. Since the trained network is a multi-branch, multi-task neural network, the acquired training set should include the various types of training samples corresponding to each branch network. The training samples in the training data set also have corresponding labeling results, which can be produced in advance by operating a labeling tool on the training samples. In this embodiment, a single image does not need to be labeled for all branch-network tasks; each training sample only needs the labels for the task of its corresponding branch network, and other content may be left unlabeled.
In addition, to increase the diversity of the training samples and ensure the expressive power of the neural network, the training samples corresponding to each branch in the training data set can be visual perception images taken under many different conditions, for example different lighting conditions, different scenes and different mounting angles of the image capturing device. They may include images of sunny days, rainy days, daytime, nighttime, highways, urban areas, large-vehicle viewing angles, small-vehicle viewing angles and so on.
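One illustrative way to organize such a per-task, partially labeled training set is shown below; this is an assumption for clarity, not the patent's data format, and all paths and task names are hypothetical.

```python
# Each sample carries labels only for the task of its own branch network;
# no image needs annotations for every branch at once.
training_manifest = [
    {"image": "data/rain_highway_0001.png",
     "task": "lane_semantic",      "label": "labels/lane/0001.png"},
    {"image": "data/night_urban_0042.png",
     "task": "road_marking",       "label": "labels/marking/0042.png"},
    {"image": "data/day_urban_0107.png",
     "task": "vehicle_pedestrian", "label": "labels/vehped/0107.png"},
    # ... samples for the other branch-network tasks
]
```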
Step S1004, calling the data loader corresponding to each branch network in the multi-task neural network; each data loader obtains the training samples corresponding to its branch network, together with their labeling results, from the training data set.
The data loader is a program for loading data; the data loaders of this embodiment load training samples from the training data set. Because different branch networks have different training tasks, a corresponding data loader is written in advance for each branch network and is dedicated to acquiring the training samples that branch network needs.
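A minimal sketch of per-branch data loaders follows, assuming PyTorch and the manifest format assumed earlier; `TaskDataset` and the task names are illustrative, and image decoding is omitted.

```python
from torch.utils.data import Dataset, DataLoader

class TaskDataset(Dataset):
    """Filters the manifest down to the samples labeled for one task."""
    def __init__(self, manifest, task):
        self.items = [m for m in manifest if m["task"] == task]
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        item = self.items[idx]
        return item["image"], item["label"]   # loading/decoding omitted in this sketch

TASKS = ["lane_semantic", "lane_instance", "line_type", "road_marking",
         "traffic_area", "vehicle_pedestrian", "road_obstacle"]

def build_task_loaders(manifest, batch_size=8):
    # One dedicated loader per branch network, as described in the text.
    return {task: DataLoader(TaskDataset(manifest, task),
                             batch_size=batch_size, shuffle=True)
            for task in TASKS}
```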
Step S1006, inputting each training sample into the to-be-trained backbone network of the to-be-trained multi-task neural network, and extracting the shared features of the training sample through the to-be-trained backbone network to obtain a shared sample feature map.
Step S1008, inputting the shared sample feature map into each corresponding to-be-trained branch network of the to-be-trained multi-task neural network; each to-be-trained branch network classifies its corresponding task based on the shared sample feature map and outputs the training classification result of that task.
Step S1010, determining the loss function of each branch network according to the training classification result and the labeling result of the training sample.
Specifically, the multi-task neural network is first initialized when training starts: for example, the number of iterations is determined, the amount of data each data loader can load is set, and the optimizer gradients are cleared. Training begins once initialization is complete: the training samples loaded by each data loader from the training data set for its branch network are input into the to-be-trained multi-task neural network.
The training process is illustrated with one branch network. Assume it is the first branch network; the data loader 1 corresponding to the first branch network loads training sample 1, obtained from the training data set, for the task of the first branch network. If the first branch network is the pavement marker classification network, data loader 1 loads training samples labeled for pavement marker classification. Training sample 1 is then input into the to-be-trained backbone network of the to-be-trained multi-task neural network; after the backbone network extracts features from training sample 1 and obtains shared sample feature map 1, the feature map is input into the first branch network. The first branch network performs further feature extraction on shared sample feature map 1 and outputs the corresponding training classification result 1, completing the forward pass of the first branch network. Finally, because training classification result 1 is a predicted value and labeling result 1 of training sample 1 is the ground truth, loss function 1 of the first branch network can be determined from training classification result 1 and labeling result 1. If there are a second, third, fourth, ..., nth branch network, they are trained on the same principle as the first branch network, which is not repeated here.
Step S1012, linearly superposing the loss functions to obtain a global loss function.
Step S1014, performing back propagation and iterative training on the multi-task neural network according to the global loss function after the training classification results are obtained, to obtain the trained multi-task neural network.
Specifically, after the forward propagation of the branch networks of all tasks is completed, the loss function corresponding to each branch network is obtained. All the loss functions are then linearly superposed, and the resulting function is the global loss function. Linear superposition can be understood as assigning each branch network a weight according to the importance of its task or the demands of the visual perception, and computing the weighted sum of the losses according to these weights. The linear superposition formula is as follows:
L = a_1 × loss_1 + a_2 × loss_2 + … + a_n × loss_n
where L is the global loss function, a_1, a_2, …, a_n are the weights of the first to nth branch networks, and loss_1, loss_2, …, loss_n are their respective loss functions.
After the global loss function is obtained, the whole multi-task neural network is trained by back propagation according to the global loss function and the optimizer gradients are updated, completing one round of training. When the iteration count indicates that further training is needed, the multi-task neural network is trained iteratively on the same principle: the next training samples and their labeling results are obtained through the data loaders and the process above is repeated until the number of iterations is reached. The network finally obtained is the trained multi-task neural network. In this embodiment, joint training on the multi-task training data set achieves end-to-end training of the multiple tasks. Training with branch-specific training samples and incomplete training data sets carrying different labeling results, rather than with the complete training sets of the traditional approach, reduces the data cost and lowers the coupling between the branch-network modules, thereby improving the expressive power of each branch network.
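The sketch below shows one joint training iteration that follows this description; it is illustrative only. It assumes `model` is a shared-backbone multi-head network like the earlier sketch, `loaders` the per-task data loaders yielding tensors, `criteria` a per-task loss-function dict, and `weights` the a_1 … a_n coefficients of the linear superposition.

```python
import torch

def train_one_iteration(model, loaders, criteria, weights, optimizer, device):
    model.train()
    optimizer.zero_grad()                      # clear optimizer gradients
    total_loss = torch.zeros((), device=device)
    for task, loader in loaders.items():
        images, labels = next(iter(loader))    # one batch for this branch network
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)                # forward: shared backbone + all heads
        branch_loss = criteria[task](outputs[task], labels)
        total_loss = total_loss + weights[task] * branch_loss   # linear superposition
    total_loss.backward()                      # back-propagate the global loss
    optimizer.step()                           # update backbone and all branches
    return float(total_loss)
```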
It should be understood that, although the steps in the flowcharts of fig. 2 and fig. 10 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 and fig. 10 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 11, there is provided a visual perception apparatus for automatic driving, comprising: an acquisition module 1102, an extraction module 1104, a classification module 1106, and a fusion module 1108, wherein:
an acquisition module 1102 is configured to acquire an acquired visual perception image.
The extraction module 1104 is configured to input the visual perception image into a backbone network of the trained multi-task neural network, and extract the sharing feature of the visual perception image through the backbone network to obtain a sharing feature map.
The classification module 1106 is configured to input the shared feature graphs into each branch network in the multi-task neural network, and each branch network performs classification of a corresponding task based on the shared feature graphs, and output a classification result of the corresponding task.
The fusion module 1108 is configured to extract classification results of corresponding tasks according to preset visual perception targets, and fuse the classification results to obtain visual perception results, where the visual perception results include at least one of lane line information, road sign information, traffic area road condition information and road obstacle information.
In one embodiment, the classification module 1106 is further configured to input the shared feature map into a lane line semantic segmentation network, and perform lane line detection based on the shared feature map through the lane line semantic segmentation network to obtain a binary lane line image; inputting the shared feature image into a lane line instance segmentation network, and carrying out lane line instance segmentation on the basis of the shared feature image through the lane line instance segmentation network to obtain a lane line instance segmentation image; inputting the shared feature map into a linear classification network, and carrying out lane line type classification based on the shared feature map through the linear classification network to obtain lane line type images.
In one embodiment, classification module 1106 is further configured to input the shared feature map into a pavement marker classification network, and perform pavement marker detection classification based on the shared feature map through the pavement marker classification network to obtain a pavement marker classification image.
In one embodiment, the classification module 1106 is further configured to input the shared feature map into a traffic area detection network, and perform area detection based on the shared feature map through the traffic area detection network to obtain an area classification image; and inputting the shared feature map into a vehicle-pedestrian instance segmentation network, and carrying out instance segmentation on the vehicle-pedestrian by the vehicle-pedestrian instance segmentation network based on the shared feature map to obtain a vehicle-pedestrian instance segmentation image.
In one embodiment, the classification module 1106 is further configured to input the shared feature map into a road obstacle classification network, and perform detection classification of the obstacle based on the shared feature map through the road obstacle classification network to obtain an obstacle classification image.
In one embodiment, the fusion module 1108 is further configured to extract a binary lane line image, a lane line instance segmentation image, and a lane line image from the classification result according to the preset visual perception target; clustering the image matrix corresponding to the lane line instance segmentation image to obtain a lane line instance clustering image; according to the lane line example cluster image and the lane line type image, carrying out example classification and line type classification on the binary lane line image to obtain a lane line image; and performing curve fitting on each lane line in the lane line image, and calculating the average confidence coefficient of the lane line pixel points corresponding to each lane line to obtain lane line information.
In one embodiment, the fusion module 1108 is further configured to extract a pavement marker classification image from the classification result according to a preset visual perception target; extracting pavement marker images corresponding to the pavement markers from the pavement marker classified images; and respectively carrying out ellipse fitting on each pavement marking image, and calculating the average confidence coefficient of the corresponding pixel point of each pavement marking image to obtain pavement marking information.
In one embodiment, the fusion module 1108 is further configured to obtain a region classification image from the classification result; extracting passable region images and road edge images from the region classification images, and acquiring first requirements; parameterizing the passable region image and the road edge image according to the first requirement, and respectively calculating the average confidence coefficient of the pixels corresponding to the passable region image and the average confidence coefficient of the pixels corresponding to the road edge image to obtain the road condition information of the passable region.
In one embodiment, the fusion module 1108 is further configured to extract a region classification image and a vehicle-pedestrian instance segmentation image from the classification result according to a preset visual perception target; extract a vehicle image and a pedestrian image from the region classification image; cluster the image matrix corresponding to the vehicle-pedestrian instance segmentation image to obtain a vehicle-pedestrian instance clustered image; perform instance classification on the vehicle image and the pedestrian image according to the vehicle-pedestrian instance clustered image to obtain a vehicle instance image and a pedestrian instance image; and perform rectangular fitting on the vehicle instance image and the pedestrian instance image respectively, and respectively calculate the average confidence of the pixel points corresponding to the vehicle instance image and to the pedestrian instance image, to obtain the traffic zone road condition information.
In one embodiment, the fusion module 1108 is further configured to extract a region classification image from the classification result according to a preset visual perception target; extracting a traffic sign image from the region classification image; and carrying out rectangular fitting on the traffic sign image, and calculating the average confidence coefficient of the pixel points corresponding to the traffic sign image to obtain traffic area road condition information.
In one embodiment, the fusion module 1108 is further configured to extract an obstacle image corresponding to each obstacle from the obstacle classification images, and obtain a second requirement; and parameterizing the obstacle images according to the second requirement, and calculating the average confidence coefficient of the corresponding pixel points of each obstacle image to obtain road obstacle information.
In one embodiment, the vision perception device for automatic driving further comprises a training module for acquiring a training data set; the training data set comprises a plurality of types of training samples corresponding to each branch network and labeling results of each training sample; invoking a data loader corresponding to each branch network in the multi-task neural network, and acquiring training samples corresponding to each branch network and labeling results of the training samples from a training data set by the data loader; inputting each training sample into a to-be-trained backbone network of a to-be-trained multi-task neural network, and extracting sharing characteristics of the training samples through the to-be-trained backbone network to obtain a sharing sample characteristic diagram; respectively inputting the shared sample feature graphs into each to-be-trained branch network in the corresponding to-be-trained multi-task neural network, respectively classifying the corresponding tasks by each to-be-trained branch network based on the shared sample feature graphs, and outputting training classification results of the corresponding tasks; determining a loss function of each branch network according to the training classification result and the labeling result of the training sample; linearly superposing the loss functions to obtain a global loss function; and carrying out back propagation on the multi-task neural network after the training classification result is obtained according to the global loss function, and carrying out iterative training to obtain the trained multi-task neural network.
For specific limitations on the visual perception means of autopilot, reference may be made to the above limitations on the visual perception method of autopilot, and no further description is given here. The above-described modules in the visual perception device for autopilot may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 12. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a visual perception method of autopilot. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 12 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, which when executed by the processor implements the steps of the vision-awareness method of autopilot provided in any one of the embodiments of the present application.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, implements the steps of the vision-awareness method of autopilot provided in any one of the embodiments of the present application.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.
Claims (12)
1. A method of visual perception of autopilot, the method comprising:
acquiring an acquired visual perception image;
inputting the visual perception image into a trunk network of a trained multi-task neural network, and extracting sharing characteristics of the visual perception image through the trunk network to obtain a sharing characteristic diagram;
inputting the shared feature map into each branch network in the multi-task neural network respectively, classifying corresponding tasks by each branch network based on the shared feature map, and outputting classification results of the corresponding tasks;
According to a preset visual perception target, extracting classification results of corresponding tasks to be fused, and obtaining a visual perception result, wherein the visual perception result comprises at least one of lane line information, road sign information, traffic area road condition information and road obstacle information;
the training method of the multi-task neural network comprises the following steps:
acquiring a training data set; the training data set comprises a plurality of types of training samples corresponding to the branch networks and labeling results of the training samples;
invoking a data loader corresponding to each branch network in the multi-task neural network, wherein the data loader acquires training samples corresponding to each branch network and labeling results of the training samples from the training data set;
inputting each training sample into a main network to be trained of a multi-task neural network to be trained, and extracting shared features of the training samples through the main network to be trained to obtain a shared sample feature map;
respectively inputting the shared sample feature graphs into each to-be-trained branch network in the corresponding to-be-trained multi-task neural network, respectively classifying corresponding tasks by each to-be-trained branch network based on the shared sample feature graphs, and outputting training classification results of the corresponding tasks;
Determining a loss function of each branch network according to the training classification result and the labeling result of the training sample;
linearly superposing the loss functions to obtain a global loss function;
performing back propagation and iterative training on the multi-task neural network with the training classification result according to the global loss function, and obtaining a next training sample and a corresponding labeling result through the data loader to perform iterative training until the iteration times are met, so as to obtain a trained multi-task neural network;
and writing a corresponding data loader for each branch network in advance, wherein the data loader is specially used for acquiring the training samples required by the branch network.
2. The method of claim 1, wherein the branch network comprises a lane line semantic segmentation network, a lane line instance segmentation network, and a line classification network;
the step of inputting the shared feature map into each branch network in the multi-task neural network, and the step of classifying the corresponding tasks based on the shared feature map and outputting classification results of the corresponding tasks includes:
Inputting the shared feature map into the lane line semantic segmentation network, and detecting lane lines based on the shared feature map through the lane line semantic segmentation network to obtain a binary lane line image;
inputting the shared feature map into the lane line instance segmentation network, and performing lane line instance segmentation based on the shared feature map through the lane line instance segmentation network to obtain a lane line instance segmentation image;
inputting the shared feature map into the linear classification network, and carrying out lane line type classification based on the shared feature map through the linear classification network to obtain lane line type images.
3. The method according to claim 1, wherein the step of extracting the classification result of the corresponding task according to the preset visual perception target and fusing the classification result to obtain the visual perception result includes:
extracting a binary lane line image, a lane line instance segmentation image and a lane line type image from the classification result according to a preset visual perception target;
clustering the image matrix corresponding to the lane line instance segmentation image to obtain a lane line instance clustering image;
according to the lane line example cluster image and the lane line type image, carrying out example classification and line type classification on the binary lane line image to obtain a lane line image;
And performing curve fitting on each lane line in the lane line image, and calculating the average confidence coefficient of the lane line pixel points corresponding to each lane line to obtain lane line information.
4. The method of claim 1, wherein the branch network comprises a pavement marking classification network;
the step of inputting the shared feature map into each branch network in the multi-task neural network, and the step of classifying the corresponding tasks based on the shared feature map and outputting classification results of the corresponding tasks includes:
and inputting the shared feature map into the road sign classification network, and carrying out road sign detection classification based on the shared feature map through the road sign classification network to obtain a road sign classification image.
5. The method according to claim 1, wherein the step of extracting the classification result of the corresponding task according to the preset visual perception target and fusing the classification result to obtain the visual perception result includes:
extracting road sign classified images from the classified results according to preset visual perception targets;
extracting pavement marker images corresponding to the pavement markers from the pavement marker classification images;
And respectively carrying out ellipse fitting on each pavement marking image, and calculating the average confidence coefficient of the corresponding pixel point of each pavement marking image to obtain pavement marking information.
6. The method of claim 1, wherein the branch network comprises a traffic zone detection network and a vehicular pedestrian instance segmentation network;
the step of inputting the shared feature map into each branch network in the multi-task neural network, respectively performing corresponding task classification by each branch network based on the shared feature map, and outputting classification results of corresponding tasks, including:
inputting the shared feature map into a passing area detection network, and carrying out area detection based on the shared feature map through the passing area detection network to obtain an area classification image;
and inputting the shared feature map into a vehicle-pedestrian instance segmentation network, and carrying out vehicle-pedestrian instance segmentation on the basis of the shared feature map through the vehicle-pedestrian instance segmentation network to obtain a vehicle-pedestrian instance segmentation image.
7. The method according to claim 1, wherein the step of extracting the classification result of the corresponding task according to the preset visual perception target and fusing the classification result to obtain the visual perception result includes at least one of the following:
A first item:
extracting a regional classification image from the classification result according to a preset visual perception target;
extracting passable region images and road edge images from the region classification images, and acquiring first requirements;
parameterizing the passable region image and the road edge image according to the first requirement, and respectively calculating the average confidence coefficient of the pixel points corresponding to the passable region image and the average confidence coefficient of the pixel points corresponding to the road edge image to obtain traffic region road condition information;
the second item:
extracting a region classification image and a vehicle pedestrian instance segmentation image from the classification result according to a preset visual perception target;
extracting a vehicle image and a pedestrian image from the region classification image;
clustering the image matrix corresponding to the vehicle pedestrian instance segmentation image to obtain a vehicle pedestrian instance clustering image;
performing instance classification on the vehicle image and the pedestrian image according to the vehicle pedestrian instance clustering image to obtain a vehicle instance image and a pedestrian instance image;
respectively performing matrix fitting on the vehicle example image and the pedestrian example image, and respectively calculating the average confidence coefficient of the pixel points corresponding to the vehicle example image and the average confidence coefficient of the pixel points corresponding to the pedestrian example image to obtain traffic area road condition information;
Third item:
extracting a regional classification image from the classification result according to a preset visual perception target;
extracting a traffic sign image from the region classification image;
and carrying out rectangular fitting on the traffic sign image, and calculating the average confidence coefficient of the pixel points corresponding to the traffic sign image to obtain traffic area road condition information.
8. The method of claim 1, wherein the branch network comprises a road obstacle classification network;
the step of inputting the shared feature map into each branch network in the multi-task neural network, respectively performing corresponding task classification by each branch network based on the shared feature map, and outputting classification results of corresponding tasks, including:
inputting the shared feature map into the road surface obstacle classification network, and carrying out obstacle detection classification based on the shared feature map through the road surface obstacle classification network to obtain an obstacle classification image.
9. The method according to claim 1, wherein the step of extracting the classification result of the corresponding task according to the preset visual perception target and fusing the classification result to obtain the visual perception result includes:
Extracting an obstacle classification image from the classification result according to a preset visual perception target;
extracting an obstacle image corresponding to each obstacle from the obstacle classified image, and acquiring a second requirement;
and parameterizing the obstacle images according to the second requirement, and calculating the average confidence coefficient of the corresponding pixel points of each obstacle image to obtain road obstacle information.
10. A vision-aware device for autopilot, the device comprising:
the acquisition module is used for acquiring the acquired visual perception image;
the extraction module is used for inputting the visual perception image into a trunk network of a trained multi-task neural network, and extracting the sharing characteristics of the visual perception image through the trunk network to obtain a sharing characteristic diagram;
the classification module is used for respectively inputting the shared feature graphs into each branch network in the multi-task neural network, respectively classifying corresponding tasks based on the shared feature graphs, and outputting classification results of the corresponding tasks;
the fusion module is used for extracting classification results of corresponding tasks according to preset visual perception targets to fuse to obtain visual perception results, wherein the visual perception results comprise at least one of lane line information, road sign information, traffic area road condition information and road obstacle information;
the training module is used for acquiring a training data set, wherein the training data set comprises a plurality of types of training samples corresponding to the branch networks and labeling results of the training samples; invoking a data loader corresponding to each branch network in the multi-task neural network, wherein each data loader acquires, from the training data set, the training samples corresponding to its branch network and the labeling results of those training samples; inputting each training sample into a to-be-trained backbone network of a to-be-trained multi-task neural network, and extracting shared features of the training samples through the to-be-trained backbone network to obtain a shared sample feature map; respectively inputting the shared sample feature map into each to-be-trained branch network of the to-be-trained multi-task neural network, each to-be-trained branch network respectively performing classification for the corresponding task based on the shared sample feature map and outputting a training classification result of the corresponding task; determining a loss function of each branch network according to the training classification result and the labeling result of the training sample; linearly superposing the loss functions to obtain a global loss function; and performing back propagation and iterative training on the multi-task neural network according to the global loss function, acquiring the next training sample and the corresponding labeling result through the data loaders, and repeating the iterative training until the number of iterations is reached, so as to obtain the trained multi-task neural network;
wherein a corresponding data loader is written in advance for each branch network, and each data loader is dedicated to acquiring the training samples required by its branch network.
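To make the training module of claim 10 easier to follow, here is a simplified PyTorch sketch of the overall flow: one data loader per branch, shared feature extraction through the backbone, per-branch losses linearly superposed into a global loss, and back-propagation for a fixed number of iterations. The architectures, loss weights and loader contents are placeholders, not the patented configuration.

```python
# Simplified multi-task training loop under the assumptions stated above.
import itertools
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_multitask(backbone: nn.Module,
                    branches: dict[str, nn.Module],
                    loaders: dict[str, DataLoader],
                    criteria: dict[str, nn.Module],
                    weights: dict[str, float],
                    num_iterations: int = 10000,
                    lr: float = 1e-3) -> None:
    params = itertools.chain(backbone.parameters(),
                             *(b.parameters() for b in branches.values()))
    optimizer = torch.optim.Adam(params, lr=lr)
    # One endless iterator per branch: each loader only serves the training
    # samples and labeling results of its own task (assumed (image, label) batches).
    iters = {task: iter(itertools.cycle(loader)) for task, loader in loaders.items()}

    for step in range(num_iterations):
        optimizer.zero_grad()
        global_loss = torch.zeros(())
        for task, branch in branches.items():
            images, labels = next(iters[task])
            shared_map = backbone(images)       # shared sample feature map
            prediction = branch(shared_map)     # task-specific classification
            # Linear superposition of the per-branch losses into a global loss.
            global_loss = global_loss + weights[task] * criteria[task](prediction, labels)
        global_loss.backward()                  # back propagation
        optimizer.step()
```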
11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 9.
12. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911382829.1A CN111178253B (en) | 2019-12-27 | 2019-12-27 | Visual perception method and device for automatic driving, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111178253A CN111178253A (en) | 2020-05-19 |
CN111178253B (en) | 2024-02-27
Family
ID=70655869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911382829.1A Active CN111178253B (en) | 2019-12-27 | 2019-12-27 | Visual perception method and device for automatic driving, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111178253B (en) |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021237727A1 (en) * | 2020-05-29 | 2021-12-02 | Siemens Aktiengesellschaft | Method and apparatus of image processing |
CN111738310B (en) * | 2020-06-04 | 2023-12-01 | 科大讯飞股份有限公司 | Material classification method, device, electronic equipment and storage medium |
CN111785085B (en) | 2020-06-11 | 2021-08-27 | 北京航空航天大学 | Visual perception and perception network training method, device, equipment and storage medium |
CN112380923A (en) * | 2020-10-26 | 2021-02-19 | 天津大学 | Intelligent autonomous visual navigation and target detection method based on multiple tasks |
CN112418061B (en) * | 2020-11-19 | 2024-01-23 | 城云科技(中国)有限公司 | Vehicle forbidden region determining method and system |
CN112418236B (en) * | 2020-11-24 | 2022-10-18 | 重庆邮电大学 | Automobile drivable area planning method based on multitask neural network |
CN112528864A (en) * | 2020-12-14 | 2021-03-19 | 北京百度网讯科技有限公司 | Model generation method and device, electronic equipment and storage medium |
CN112507943B (en) * | 2020-12-18 | 2023-09-29 | 华南理工大学 | Visual positioning navigation method, system and medium based on multitasking neural network |
CN112489131B (en) * | 2020-12-26 | 2024-04-05 | 上海有个机器人有限公司 | Method, device, medium and robot for constructing cost map based on pavement detection |
CN112818792A (en) * | 2021-01-25 | 2021-05-18 | 北京百度网讯科技有限公司 | Lane line detection method, lane line detection device, electronic device, and computer storage medium |
CN112902981B (en) * | 2021-01-26 | 2024-01-09 | 中国科学技术大学 | Robot navigation method and device |
CN112950642A (en) * | 2021-02-25 | 2021-06-11 | 中国工商银行股份有限公司 | Point cloud instance segmentation model training method and device, electronic equipment and medium |
CN113569620B (en) * | 2021-05-24 | 2024-09-13 | 惠州市德赛西威智能交通技术研究院有限公司 | Pavement marking instantiation identification method based on monocular vision |
CN113378769A (en) * | 2021-06-28 | 2021-09-10 | 北京百度网讯科技有限公司 | Image classification method and device |
CN113428177B (en) * | 2021-07-16 | 2023-03-14 | 中汽创智科技有限公司 | Vehicle control method, device, equipment and storage medium |
CN113971729A (en) * | 2021-10-29 | 2022-01-25 | 北京百度网讯科技有限公司 | Image segmentation method, training method, device, electronic device and storage medium |
CN114048536A (en) * | 2021-11-18 | 2022-02-15 | 重庆邮电大学 | Road structure prediction and target detection method based on multitask neural network |
CN114339049A (en) * | 2021-12-31 | 2022-04-12 | 深圳市商汤科技有限公司 | Video processing method and device, computer equipment and storage medium |
CN114648745A (en) * | 2022-02-14 | 2022-06-21 | 成都臻识科技发展有限公司 | Road detection method, device and equipment based on deep learning and storage medium |
CN114255351B (en) * | 2022-02-28 | 2022-05-27 | 魔门塔(苏州)科技有限公司 | Image processing method, device, medium, equipment and driving system |
CN114332590B (en) * | 2022-03-08 | 2022-06-17 | 北京百度网讯科技有限公司 | Joint perception model training method, joint perception method, device, equipment and medium |
CN115035494A (en) * | 2022-07-04 | 2022-09-09 | 小米汽车科技有限公司 | Image processing method, image processing device, vehicle, storage medium and chip |
CN114926726B (en) * | 2022-07-20 | 2022-10-28 | 陕西欧卡电子智能科技有限公司 | Unmanned ship sensing method based on multitask network and related equipment |
CN115661556B (en) * | 2022-10-20 | 2024-04-12 | 南京领行科技股份有限公司 | Image processing method and device, electronic equipment and storage medium |
CN118097624B (en) * | 2024-04-23 | 2024-07-19 | 广汽埃安新能源汽车股份有限公司 | Vehicle environment sensing method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10635927B2 (en) * | 2017-03-06 | 2020-04-28 | Honda Motor Co., Ltd. | Systems for performing semantic segmentation and methods thereof |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108921013A (en) * | 2018-05-16 | 2018-11-30 | 浙江零跑科技有限公司 | A kind of visual scene identifying system and method based on deep neural network |
CN109447169A (en) * | 2018-11-02 | 2019-03-08 | 北京旷视科技有限公司 | The training method of image processing method and its model, device and electronic system |
CN109858372A (en) * | 2018-12-29 | 2019-06-07 | 浙江零跑科技有限公司 | A kind of lane class precision automatic Pilot structured data analysis method |
CN110008808A (en) * | 2018-12-29 | 2019-07-12 | 北京迈格威科技有限公司 | Panorama dividing method, device and system and storage medium |
CN109886272A (en) * | 2019-02-25 | 2019-06-14 | 腾讯科技(深圳)有限公司 | Point cloud segmentation method, apparatus, computer readable storage medium and computer equipment |
CN110197658A (en) * | 2019-05-30 | 2019-09-03 | 百度在线网络技术(北京)有限公司 | Method of speech processing, device and electronic equipment |
CN110533097A (en) * | 2019-08-27 | 2019-12-03 | 腾讯科技(深圳)有限公司 | A kind of image definition recognition methods, device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Multi-scale pedestrian detection based on convolutional neural networks; Hu Kui; Zhang Dongping; Yang Li; Journal of China Jiliang University (中国计量大学学报), Issue 04; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111178253A (en) | 2020-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111178253B (en) | Visual perception method and device for automatic driving, computer equipment and storage medium | |
Dewangan et al. | RCNet: road classification convolutional neural networks for intelligent vehicle system | |
Ni et al. | An improved deep network-based scene classification method for self-driving cars | |
John et al. | Saliency map generation by the convolutional neural network for real-time traffic light detection using template matching | |
Kortli et al. | Deep embedded hybrid CNN–LSTM network for lane detection on NVIDIA Jetson Xavier NX | |
CN113128348B (en) | Laser radar target detection method and system integrating semantic information | |
Lin et al. | Vaid: An aerial image dataset for vehicle detection and classification | |
Mahaur et al. | Road object detection: a comparative study of deep learning-based algorithms | |
Ding et al. | Fast lane detection based on bird’s eye view and improved random sample consensus algorithm | |
CN111274926B (en) | Image data screening method, device, computer equipment and storage medium | |
Balaska et al. | Enhancing satellite semantic maps with ground-level imagery | |
KC | Enhanced pothole detection system using YOLOX algorithm | |
Gluhaković et al. | Vehicle detection in the autonomous vehicle environment for potential collision warning | |
Li et al. | A lane detection network based on IBN and attention | |
Guan et al. | Road marking extraction in UAV imagery using attentive capsule feature pyramid network | |
Haris et al. | Lane lines detection under complex environment by fusion of detection and prediction models | |
Azimjonov et al. | A vision-based real-time traffic flow monitoring system for road intersections | |
Prykhodchenko et al. | Road scene classification based on street-level images and spatial data | |
CN112699711A (en) | Lane line detection method, lane line detection device, storage medium, and electronic apparatus | |
CN110909656A (en) | Pedestrian detection method and system with integration of radar and camera | |
Yusuf et al. | Enhancing Vehicle Detection and Tracking in UAV Imagery: A Pixel Labeling and Particle Filter Approach | |
Rani et al. | Traffic sign detection and recognition using deep learning-based approach with haze removal for autonomous vehicle navigation | |
Ouyang et al. | Multiview cnn model for sensor fusion based vehicle detection | |
Vellaidurai et al. | A novel oyolov5 model for vehicle detection and classification in adverse weather conditions | |
Thomas et al. | Pothole and speed bump classification using a five-layer simple convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| TA01 | Transfer of patent application right | Effective date of registration: 20230428. Address after: 1201, 12th Floor, Building 1, No. 10 Caihefang Road, Haidian District, Beijing, 100080; Applicant after: Youjia Innovation (Beijing) Technology Co.,Ltd. Address before: 518051 1101, west block, Skyworth semiconductor design building, 18 Gaoxin South 4th Road, Gaoxin community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province; Applicant before: SHENZHEN MINIEYE INNOVATION TECHNOLOGY Co.,Ltd. |
| GR01 | Patent grant | |