CN114792417B - Model training method, image recognition method, device, equipment and storage medium - Google Patents

Model training method, image recognition method, device, equipment and storage medium

Info

Publication number
CN114792417B
CN114792417B CN202210171695.4A
Authority
CN
China
Prior art keywords
information
image
point cloud
image data
neural network
Prior art date
Legal status
Active
Application number
CN202210171695.4A
Other languages
Chinese (zh)
Other versions
CN114792417A (en)
Inventor
郭湘
韩文韬
鲁赵晗
熊邦国
林家彦
陈连胜
韩旭
Current Assignee
Guangzhou Weride Technology Co Ltd
Original Assignee
Guangzhou Weride Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Weride Technology Co Ltd filed Critical Guangzhou Weride Technology Co Ltd
Priority to CN202210171695.4A priority Critical patent/CN114792417B/en
Publication of CN114792417A publication Critical patent/CN114792417A/en
Application granted granted Critical
Publication of CN114792417B publication Critical patent/CN114792417B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention relates to the technical field of image processing and discloses a model training method, an image recognition method, a device, equipment and a storage medium.

Description

Model training method, image recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a model training method, an image recognition method, an apparatus, a device, and a storage medium.
Background
With the development of artificial intelligence technology, and especially of research on automatic driving of automobiles, hardware such as laser radars and image acquisition devices is mounted on the vehicle and, in cooperation with image-analysis software, recognizes objects automatically, thereby realizing unmanned driving of the vehicle.
At present, the perception task of sensing objects around the vehicle during driving is mainly realized by arranging several common types of sensors to choose from, such as cameras (images) and laser radars (laser point clouds). Laser point cloud perception offers good 3D position sensing and strong anti-interference capability, and works under various illumination conditions; however, it is expensive, high-performance mechanically rotating laser radars have a complex structure, and automotive-grade qualification is difficult to achieve, particularly in the analysis stage after the laser point cloud has been collected. Image perception is inexpensive and can reach automotive grade, but an image carries no 3D information, so the recognition accuracy is low. Therefore, an image perception scheme that achieves both rapid recognition and 3D information labeling is highly desirable.
Disclosure of Invention
The invention mainly aims to solve the problem that, in automatic driving, perception and recognition of objects during driving based on image perception or laser point cloud perception alone has low accuracy and efficiency.
The first aspect of the present invention provides a model training method, which includes:
acquiring image data and laser point cloud data corresponding to the image data;
identifying 3D information for each frame of point cloud in the laser point cloud data, and projecting the identified 3D information into the image data to obtain supervision image data;
extracting pixel-by-pixel information in the image data by using a neural network, and predicting corresponding object information based on the pixel-by-pixel information;
and performing depth training of three-dimensional perception capability on a preset visual neural network according to the supervision image data and the corresponding object information to obtain a supervision image model, so that the visual neural network acquires the perception capability of both the image data and the laser point cloud.
In this embodiment, in a first implementation manner of the first aspect of the present invention, the identifying 3D information of each frame of point cloud in the laser point cloud data and projecting the identified 3D information into the image data to obtain the supervision image data includes:
sequentially inputting each frame of point cloud in the laser point cloud data into a laser point cloud neural network in order of acquisition time, and detecting 3D object frames through the laser point cloud neural network to obtain 3D information corresponding to each object;
And acquiring a parameter relation between the camera and the laser radar, and projecting 3D information corresponding to each object onto the image data based on the parameter relation to obtain the supervision image data.
In this embodiment, in a second implementation manner of the first aspect of the present invention, the performing, by using the laser point cloud neural network, 3D object frame detection to obtain 3D information corresponding to an object includes:
acquiring a depth map of each frame of point cloud through the laser point cloud neural network, and segmenting the point clouds belonging to the same object based on the depth map to obtain a plurality of point cloud sets;
constructing an object detection frame based on the point cloud set, and matching the object detection frame with a plurality of preset object frames to obtain object category information of the point cloud set;
and fusing the object type information into the object detection frame to obtain 3D information corresponding to the object.
In this embodiment, in a third implementation manner of the first aspect of the present invention, the acquiring a parameter relationship between a camera and a laser radar, and projecting 3D information corresponding to each object onto the image data based on the parameter relationship, to obtain supervised image data includes:
determining first coordinate information of each point cloud based on the 3D information;
Acquiring internal parameters of a camera and internal parameters of a laser radar, and calculating a parameter relation between the camera and the laser radar based on the internal parameters of the camera and the laser radar;
calculating second coordinate information of the point cloud in a coordinate system of the camera based on the parameter relation and the first coordinate information;
and projecting corresponding 3D information onto the image according to the second coordinate information to obtain the supervision image data.
In a fourth implementation manner of the first aspect of the present invention, the extracting pixel-by-pixel information in the image using a neural network and predicting corresponding object information based on the pixel-by-pixel information includes:
performing a first convolution operation on each channel picture in the image by using a residual convolution neural network to obtain a plurality of characteristic tensors;
fusing the plurality of feature tensors by using an FPN (feature pyramid network) to obtain a final feature map;
constructing a unit convolution kernel according to the final feature map;
performing a second convolution operation on the final feature map based on the unit convolution kernel to obtain a multidimensional tensor;
and predicting based on the multidimensional tensor to obtain object information.
In a fifth implementation manner of the first aspect of the present invention, the performing depth training of the three-dimensional perception capability of the preset visual neural network according to the supervisory image data and the corresponding object information to obtain a supervisory image model includes:
Defining a loss function of model training according to a preset visual neural network and parameter variables in the 3D information;
calculating a loss value between the supervisory image data and corresponding object information based on the loss function;
according to the loss value, utilizing a deep neural network training optimizer to carry out optimization adjustment on parameters of the visual neural network to obtain an optimization model;
and performing depth training of three-dimensional perceptibility on the optimization model by using the supervision image data and the object information to obtain a supervision image model.
The second aspect of the present invention provides an image recognition method, the image recognition method comprising:
collecting an image to be identified;
inputting the image to be identified into a supervision image model to identify 3D information, and generating supervision signals of all objects in the image to be identified based on the identified 3D information, wherein the supervision image model is trained according to the provided model training method;
and performing supervision and identification on the corresponding object based on the supervision signal to obtain an identification result.
A third aspect of the present invention provides a model training apparatus comprising:
The acquisition module is used for acquiring image data and laser point cloud data corresponding to the image data;
the point cloud identification module is used for identifying 3D information of each frame of point cloud in the laser point cloud data and projecting the identified 3D information into the image data to obtain supervision image data;
the image recognition module is used for extracting pixel-by-pixel information in the image data by utilizing a neural network and predicting corresponding object information based on the pixel-by-pixel information;
and the model training module is used for performing depth training of three-dimensional perception capability on a preset visual neural network according to the supervision image data and the corresponding object information to obtain a supervision image model, so that the visual neural network acquires the perception capability of both the image data and the laser point cloud.
In this embodiment, in a first implementation manner of the third aspect of the present invention, the point cloud identifying module includes:
the point cloud detection unit is used for sequentially inputting each frame of point cloud in the laser point cloud data into a laser point cloud neural network according to the sequence of the acquisition time, and carrying out 3D object frame detection through the laser point cloud neural network to obtain 3D information corresponding to an object;
And the projection unit is used for acquiring the parameter relation between the camera and the laser radar, and projecting the 3D information corresponding to each object onto the image data based on the parameter relation to obtain the supervision image data.
In this embodiment, in a second implementation manner of the third aspect of the present invention, the point cloud detection unit is specifically configured to:
acquiring a depth map of each frame of point cloud through the laser point cloud neural network, and segmenting the point clouds belonging to the same object based on the depth map to obtain a plurality of point cloud sets;
constructing an object detection frame based on the point cloud set, and matching the object detection frame with a plurality of preset object frames to obtain object category information of the point cloud set;
and fusing the object type information into the object detection frame to obtain 3D information corresponding to the object.
In this embodiment, in a third implementation manner of the third aspect of the present invention, the projection unit is specifically configured to:
determining first coordinate information of each point cloud based on the 3D information;
acquiring internal parameters of a camera and internal parameters of a laser radar, and calculating a parameter relation between the camera and the laser radar based on the internal parameters of the camera and the laser radar;
calculating second coordinate information of the point cloud in a coordinate system of the camera based on the parameter relation and the first coordinate information;
And projecting corresponding 3D information onto the image according to the second coordinate information to obtain the supervision image data.
In this embodiment, in a fourth implementation manner of the third aspect of the present invention, the image identifying module includes:
the first convolution unit is used for carrying out a first convolution operation on each channel picture in the image by using a residual convolution neural network to obtain a plurality of characteristic tensors;
the fusion unit is used for fusing the plurality of feature tensors by using the FPN network to obtain a final feature map;
a construction unit for constructing a unit convolution kernel according to the final feature map;
the second convolution unit is used for carrying out a second convolution operation on the final feature map based on the unit convolution kernel to obtain a multidimensional tensor;
and the prediction unit is used for predicting based on the multidimensional tensor to obtain object information.
In this embodiment, in a fifth implementation manner of the third aspect of the present invention, the model training module includes:
the function definition unit is used for defining a loss function of model training according to a preset visual neural network and parameter variables in the 3D information;
a calculation unit for calculating a loss value between the supervisory image data and the corresponding object information based on the loss function;
The optimizing unit is used for optimizing and adjusting parameters of the visual neural network by utilizing a deep neural network training optimizer according to the loss value to obtain an optimizing model;
and the training unit is used for carrying out depth training of the three-dimensional perception capability on the optimized model by utilizing the supervision image data and the object information to obtain a supervision image model.
A fourth aspect of the present invention provides an image recognition apparatus comprising:
the shooting module is used for acquiring an image to be identified;
the identification module is used for inputting the image to be identified into a supervision image model to identify 3D information and generating supervision signals of all objects in the image to be identified based on the identified 3D information, wherein the supervision image model is obtained by training by using the provided model training method;
and the supervision module is used for carrying out supervision and identification on the corresponding object based on the supervision signal to obtain an identification result.
A fifth aspect of the present invention provides a computer apparatus comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the computer device to perform the steps of the model training method provided above, or the steps of the image recognition method provided above.
A sixth aspect of the present invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the steps of the model training method provided above, or the steps of the image recognition method provided above.
The beneficial effects are that:
According to the technical scheme, the laser point cloud data are recognized by a laser point cloud neural network and the corresponding 3D information is output; a convolutional neural network predicts the object information of the image corresponding to the laser point cloud data to obtain a prediction result; finally, deep learning of three-dimensional perception capability is performed on the visual neural network based on the prediction result and the corresponding 3D information, so that a supervision image model with both image recognition perception capability and point cloud recognition perception capability is obtained. Objects in images are supervised and recognized based on the supervision image model, which realizes image-based automatic object labeling and supervision: objects can be supervised and recognized without manual labeling, and the scene recognition capability, recognition accuracy and recognition efficiency of the unmanned vehicle are improved. Meanwhile, the experience of automatic driving of the vehicle is also improved.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of a model training method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a second embodiment of a model training method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a third embodiment of an image recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of a model training apparatus in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of another embodiment of a model training apparatus in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of an image recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of one embodiment of a computer device in an embodiment of the invention.
Detailed Description
The embodiment of the invention provides a model training method, an image recognition method, a device, equipment and a storage medium, which are used to solve the problem that existing automatic driving systems recognize objects in vision with low efficiency and accuracy during driving.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below. Referring to fig. 1, a first embodiment of the model training method in the embodiment of the present invention is mainly a training method based on laser point clouds and images, in which the visual neural network used for image recognition in an unmanned vehicle learns the three-dimensional recognition information of the laser point cloud, so as to acquire the recognition capability of the laser point cloud and obtain a supervision image model. The method specifically includes the following steps:
101. Acquiring image data and laser point cloud data corresponding to the image data;
It is to be understood that the execution body of the present invention may be a model training device, and may also be a terminal or a server, which is not limited herein. The embodiment of the invention is described by taking a terminal, namely the automatic driving controller on a vehicle, as the execution body.
In this embodiment, the image data is a set of pictures of each time frame collected by a camera, a video camera or a driving recorder installed on the automatic driving system, and this picture set constitutes the image data. The laser point cloud data is a set of point clouds of each time frame acquired by a laser radar on the automatic driving system.
In the present embodiment, the acquired image data and the laser point cloud data contain the same object, which can be understood as an obstacle, a tracking target, or the like. Specifically, after the image data and the laser point cloud are obtained through the camera and the laser radar, data segments of the same time point or the same time period are intercepted from the camera and the laser radar according to the acquisition time relationship, where the data segments include image segments and point cloud segments. Objects in the data segments, such as obstacles, are then identified; when the same obstacle is identified in both data segments, segments of a certain duration centered on that obstacle are intercepted, the intercepted image segment is taken as the image data, and the intercepted point cloud segment is taken as the laser point cloud data, thereby realizing the correspondence between the image data and the laser point cloud data.
In practical application, when the image data and the laser point cloud data are acquired, data containing multiple types of objects are acquired respectively, data extraction is performed for each object, and the extracted data are associated with each other, so that multiple data pairs are obtained. Meanwhile, the data pairs are classified to obtain a plurality of data pair sets.
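The time-based association described above can be illustrated with a short sketch; all function and variable names below are assumptions for illustration and are not part of the patent:

```python
# Illustrative sketch only: pairing camera frames and lidar frames by
# nearest acquisition timestamp. cloud_stamps must be sorted ascending.
from bisect import bisect_left

def pair_by_timestamp(image_stamps, cloud_stamps, max_gap=0.05):
    """Return (image_idx, cloud_idx) pairs whose timestamps differ by
    less than max_gap seconds."""
    pairs = []
    for i, t_img in enumerate(image_stamps):
        j = bisect_left(cloud_stamps, t_img)
        # candidate neighbours in the sorted lidar timestamp list
        candidates = [k for k in (j - 1, j) if 0 <= k < len(cloud_stamps)]
        if not candidates:
            continue
        k = min(candidates, key=lambda k: abs(cloud_stamps[k] - t_img))
        if abs(cloud_stamps[k] - t_img) < max_gap:
            pairs.append((i, k))
    return pairs
```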
102. 3D information is identified for each frame of point cloud in the laser point cloud data, and the identified 3D information is projected into the image data to obtain the supervision image data;
In this embodiment, the supervision image data is combined data including point cloud labeling information and an image; it takes the form of a picture in which labeling information exists, the labeling information being information output from the laser point cloud.
In practical application, 3D information is identified from the laser point cloud data by means of a laser point cloud model. The laser point cloud data is input into the laser point cloud model, and the laser point cloud model detects all objects in the point cloud data by recognizing and analyzing each point cloud. Taking all the objects as targets, the three-dimensional spatial coordinate information of the point cloud is extracted, point cloud features are obtained based on that coordinate information, a 3D object detection frame of each object is constructed based on the point cloud features, and the 3D information corresponding to each object is output.
Specifically, the laser point cloud model is used to extract the three-dimensional coordinate information of an object in the laser point cloud data, where the three-dimensional coordinate information includes the target center point and the length and width, and may further include information such as the orientation of the object. Boundary information corresponding to the object is calculated based on the three-dimensional coordinate information, a 3D detection frame is constructed based on the boundary information, and the 3D detection frame is projected into the image data as the 3D information to construct the supervision image data.
In this embodiment, the 3D information is projected into the image data by means of a coordinate transformation. When data are collected through the camera and the laser radar, a coordinate system is constructed for each of them on the basis of the same field of view, and a transformation function is determined from the two coordinate systems. After the 3D information is identified through the laser point cloud model, the 2D information of each point cloud in the image data is calculated from the 3D information and the transformation function, and the supervision image data is obtained by mapping based on the 2D information.
103. Extracting pixel-by-pixel information in the image data by using a neural network, and predicting corresponding object information based on the pixel-by-pixel information;
in this embodiment, the neural network is a deep convolutional neural network, and the deep convolutional neural network includes a first convolutional neural network and a second convolutional neural network, where the first convolutional neural network is used for feature extraction of the image data; the second convolution neural network is used for unit convolution, namely dimension reduction to extract image pixel characteristics.
In practical application, the image data are images containing the three RGB channels. Specifically, multiple three-channel images are acquired as a set when the image data is collected, and all of the images are input into the neural network. The first convolutional neural network performs time-sequence division and extracts tensors of the same object at consecutive times; these tensors are input into the second convolutional neural network, which performs unit convolution, that is, dimension-reduction processing, to obtain multiple single-dimension tensors, and object information is predicted based on the single-dimension tensors.
Specifically, when predicting the object information, the object information can be identified through a target recognition model. Optionally, a plurality of single-dimension tensors are combined to construct a pixel boundary, the object recognition model is used to identify the pixel boundary, and the matching object information is found in a preset object class library.
104. And performing depth training of three-dimensional perception capability on a preset visual neural network according to the supervision image data and the corresponding object information to obtain a supervision image model, so that the visual neural network acquires the perception capability of both the image data and the laser point cloud.
In this embodiment, the visual neural network of the image recognition system in the unmanned vehicle is called, and the supervision image data and the corresponding object information are input into it. The deep learning neural network is used to learn and classify the labeling information in the supervision image data, and the classified information is fed to the visual neural network; the visual neural network learns and classifies this information to acquire the feature recognition capability of the laser point cloud and outputs a corresponding training result. The training result is compared with the corresponding object information, and the model parameters of the visual neural network are adjusted based on the comparison result to obtain the supervision image model.
In this embodiment, after the 3D information of an object is extracted from the laser point cloud, the 3D information is projected into the image data; pixels are extracted based on the image data and the corresponding object information is predicted; the visual neural network is then trained with the image data containing the 3D information and the object information obtained from the image data to obtain a supervision image model, so that the model possesses both point cloud perception and image perception capabilities. Recognizing objects with this model during driving improves recognition efficiency and accuracy. Meanwhile, the model is used to label environmental data acquired during automatic driving tests, so that information can be identified rapidly without manual intervention, large-scale, high-quality supervision data can be obtained, and subsequent test evaluation is facilitated.
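A minimal sketch of the overall training flow summarized in this embodiment is given below, in PyTorch style; lidar_net, visual_net, project_fn and loss_fn are hypothetical stand-ins, since the patent does not prescribe concrete interfaces for them:

```python
# Minimal sketch of the training flow described above (PyTorch style).
# All callables are passed in as parameters; nothing here is the patent's API.
def build_supervised_sample(image, point_cloud, lidar_net, project_fn, calib):
    boxes_3d = lidar_net(point_cloud)          # 3D information from the point cloud network
    supervision = project_fn(boxes_3d, calib)  # 3D info projected into the image as supervision
    return image, supervision

def train_step(visual_net, optimizer, image, supervision, loss_fn):
    pred = visual_net(image)                   # pixel-wise prediction by the visual network
    loss = loss_fn(pred, supervision)          # compare prediction with the supervision signal
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```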
Referring to fig. 2, a second embodiment of the model training method according to the embodiment of the present invention includes:
201. shooting image data through a camera on the unmanned vehicle, and acquiring laser point cloud data through a laser radar on the unmanned vehicle;
In this step, the image data and the laser point cloud data are matched data, that is, they contain the same target objects, and they are used as model training samples. During acquisition, the fields of view of the camera and the laser radar are controlled to cover the same object, and the acquisition orientations are kept the same.
In practical applications, the image data is an image set, preferably including N images; each image has three RGB channels and a height and width of H×W, so the image data may be expressed as N×H×W×3.
202. Sequentially inputting each frame of point clouds in the laser point cloud data into a laser point cloud neural network according to the sequence of acquisition time, and detecting a 3D object frame through the laser point cloud neural network to obtain 3D information corresponding to an object;
In this step, each frame of laser point cloud may be represented as an M×4 tensor, where M is the number of points in the laser point cloud and 4 represents the features of each laser point (position information x, y, z; intensity).
After each frame of laser point cloud is input into the laser point cloud neural network, the network performs panoramic recognition on the laser point cloud, segments the point cloud of each single object in the laser point cloud, and constructs the information of a 3D detection frame, specifically the center point, size, rotation angle and category; it also recognizes the category information of the object's point cloud.
In this embodiment, a depth map of each frame of point cloud is obtained through the laser point cloud neural network, and the point clouds belonging to the same object are segmented based on the depth map, so as to obtain a plurality of point cloud sets;
Constructing an object detection frame based on the point cloud set, and matching the object detection frame with a plurality of preset object frames to obtain object category information of the point cloud set;
and fusing the object type information into the object detection frame to obtain 3D information corresponding to the object.
In practical application, the specific steps are as follows:
1. Model inference: the laser point cloud data are input into a laser point cloud model (a laser point cloud neural network), which performs tasks such as 3D obstacle frame detection (each 3D obstacle frame comprises information such as the center point, size, rotation angle and category) and laser radar point cloud segmentation (a category is output for each laser radar point).
2. Post-processing: for each lidar point, if it lies in a 3D obstacle detection frame, its category is replaced by the category of that 3D detection frame; if it does not lie in any 3D obstacle detection frame, its category is the output of the point cloud segmentation task.
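This post-processing rule can be expressed compactly; the following is an illustrative sketch (not the patent's implementation), assuming axis-aligned boxes for brevity, whereas the 3D detection frames described above also carry a rotation angle:

```python
# Sketch of the post-processing rule: a lidar point inside a detected 3D box
# takes the box's class; otherwise it keeps the point cloud segmentation class.
import numpy as np

def refine_point_classes(points_xyz, seg_classes, boxes):
    """points_xyz: (M, 3); seg_classes: (M,) int array;
    boxes: list of dicts with 'center' (3,), 'size' (3,), 'cls' (int)."""
    classes = seg_classes.copy()
    for box in boxes:
        lo = np.asarray(box["center"]) - np.asarray(box["size"]) / 2.0
        hi = np.asarray(box["center"]) + np.asarray(box["size"]) / 2.0
        inside = np.all((points_xyz >= lo) & (points_xyz <= hi), axis=1)
        classes[inside] = box["cls"]          # box class overrides segmentation class
    return classes
```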
203. Acquiring a parameter relation between a camera and a laser radar, and projecting 3D information corresponding to each object onto the image data based on the parameter relation to obtain supervision image data;
in this embodiment, first coordinate information of each point cloud is determined based on the 3D information;
Acquiring internal parameters of a camera and internal parameters of a laser radar, and calculating a parameter relation between the camera and the laser radar based on the internal parameters of the camera and the laser radar;
calculating second coordinate information of the point cloud in a coordinate system of the camera based on the parameter relation and the first coordinate information;
and projecting corresponding 3D information onto the image according to the second coordinate information to obtain the supervision image data.
In practical applications, the parameter relationship can be understood as the extrinsic parameters between the two. Using the extrinsic parameters between the laser point cloud and the camera together with the intrinsic parameters of the camera, the laser point cloud can be projected onto the image, with the projection effect shown in figure one. Each laser point cloud point is matched with its corresponding pixel, and the model output information of the point cloud itself can be used here as a supervision signal for the image pixels.
Specifically, for any point cloud point P with 3D coordinates (x, y, z) in the point cloud coordinate system, its 3D coordinates (x_c, y_c, z_c) in the camera coordinate system can be obtained from the extrinsic parameters between the laser radar and the camera (the rotation matrix R and the translation vector t):

[x_c, y_c, z_c]^T = R·[x, y, z]^T + t

From the camera's intrinsic matrix K we can then obtain the position of the point on the image:

z_c·[u, v, 1]^T = K·[x_c, y_c, z_c]^T
Where v denotes the rows of the image and u denotes the columns of the image.
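For illustration only, these two projection steps can be written as a short sketch; the calibration values used in the example are placeholders and are not taken from the patent:

```python
# Sketch of the projection above: lidar points are transformed into the camera
# frame with the extrinsics (R, t) and then projected with the intrinsic matrix K;
# u indexes columns and v indexes rows. Matrix values below are placeholders.
import numpy as np

def project_points(points_xyz, R, t, K):
    """points_xyz: (M, 3) lidar coordinates -> (M, 2) pixel coordinates (u, v)."""
    cam = points_xyz @ R.T + t          # (x_c, y_c, z_c) for every point
    z = cam[:, 2:3]
    uv1 = (cam @ K.T) / z               # homogeneous pixel coordinates
    return uv1[:, :2]

# Example with placeholder calibration values:
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 0.0])
uv = project_points(np.array([[2.0, 1.0, 10.0]]), R, t, K)   # -> [[840., 460.]]
```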
Specifically, the laser radar and the camera are jointly calibrated in advance to obtain a projection transformation matrix between the three-dimensional point cloud and the image pixels;
when the camera acquires image data, the projection transformation matrix is used for projecting the three-dimensional boundary of the target to an image plane to obtain a two-dimensional boundary frame of the target in the image;
and extracting the characteristics of the image in the two-dimensional boundary box to obtain the image characteristics of the target.
Assuming that (x, y, z) and (u, v) are the coordinates in the laser radar coordinate system and the image pixel coordinate system, respectively, the conversion relationship between the two after association is:

z_c·[u, v, 1]^T = K·[R, T]·[x, y, z, 1]^T = M·[x, y, z, 1]^T

where K is the camera's intrinsic matrix (the camera intrinsics are fixed when the camera leaves the factory and are usually provided by the manufacturer or obtained by a calibration algorithm), and [R, T] is the camera's extrinsic matrix. Solving the above formula requires the 3D-point-to-2D-point projective transformation matrix M, which can be solved by the classical PnP (Perspective-n-Point) algorithm; at least 3 point pairs need to be selected when the PnP algorithm is used.
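As an illustrative possibility (not prescribed by the patent), the Perspective-n-Point step mentioned above can be carried out with OpenCV's solvePnP; the 3D-2D correspondences and the camera matrix below are placeholder values, and more correspondences give a more stable solution in practice:

```python
# Illustrative use of a PnP solver to recover the lidar-to-camera pose from
# matched 3D points and their image pixels; all values are placeholders.
import numpy as np
import cv2

object_points = np.array([[2.0, 1.0, 10.0], [3.0, -1.0, 12.0], [-2.0, 0.5, 8.0],
                          [1.0, 2.0, 15.0], [-1.5, -2.0, 9.0], [0.5, 0.0, 11.0]])
image_points = np.array([[840.0, 460.0], [890.0, 277.0], [390.0, 422.0],
                         [707.0, 493.0], [473.0, 138.0], [685.0, 360.0]])
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)          # rotation matrix R and translation vector tvec
```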
204. Performing a first convolution operation on each channel picture in the image by using a residual convolution neural network to obtain a plurality of characteristic tensors;
205. Fusing the plurality of feature tensors by using an FPN network to obtain a final feature map;
In this embodiment, a convolutional neural network or a vision transformer backbone is used to convolve each picture, and features of different scales are extracted using a feature pyramid. For example, the images may be processed with a ResNet (residual convolutional neural network). The input tensor of the neural network is N×H×W×3, where N represents the number of images, H/W represent the height/width of the images, and 3 denotes the three RGB channels. The ResNet applies convolution and other operations to the input tensor to obtain output tensors at different stages. Finally, a feature pyramid network (FPN) fuses the feature tensors of all scales to obtain a feature map of the original image size, of size N×H×W×C, where C is the number of channels of the final feature map, e.g., 256.
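A minimal sketch of this feature extraction stage is given below, assuming a ResNet-18 backbone and a simplified FPN-style fusion written in PyTorch; the channel counts and upsampling choices are illustrative assumptions, and PyTorch uses the channels-first N×C×H×W layout rather than the N×H×W×C notation used in the text:

```python
# Sketch: ResNet backbone producing multi-scale features, fused FPN-style by
# lateral 1x1 convolutions and upsampling into one C-channel feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class ResNetFPN(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        net = resnet18()
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in (64, 128, 256, 512)])

    def forward(self, x):                                  # x: N x 3 x H x W
        size = x.shape[-2:]
        feats, out = [], None
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                                # multi-scale feature tensors
        for feat, lateral in zip(feats[::-1], list(self.laterals)[::-1]):
            lat = lateral(feat)                            # top-down fusion
            out = lat if out is None else lat + F.interpolate(
                out, size=lat.shape[-2:], mode="nearest")
        # final feature map brought back to the original image resolution
        return F.interpolate(out, size=size, mode="bilinear", align_corners=False)

features = ResNetFPN()(torch.randn(1, 3, 224, 224))       # -> 1 x 256 x 224 x 224
```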
206. Constructing a unit convolution kernel according to the final feature map;
207. Performing a second convolution operation on the final feature map based on the unit convolution kernel to obtain a multidimensional tensor;
208. predicting based on the multidimensional tensor to obtain object information;
In this embodiment, each pixel predicts the following information about an obstacle: depth (1), object class (K, e.g., 10), center position (3), size (3), and angle (2), where the number of variables is given in parentheses, 19 numbers in total. A 1×1 convolution can therefore output an N×H×W×19-dimensional variable. After steps 204-205, a feature tensor of size N×H×W×C is obtained; a convolution operation with a 1×1×C×19 convolution kernel converts the N×H×W×C tensor into an N×H×W×19 tensor. This final tensor is the result of the pixel-by-pixel image information prediction.
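The 1×1 convolution head described above can be sketched as follows; the channel split (1 depth + 10 classes + 3 center + 3 size + 2 angle = 19) follows the text, while the variable names are illustrative:

```python
# Sketch of the per-pixel prediction head: a 1x1 convolution maps the
# C-channel feature map to 19 channels per pixel.
import torch
import torch.nn as nn

num_classes = 10
head = nn.Conv2d(256, 1 + num_classes + 3 + 3 + 2, kernel_size=1)

feature_map = torch.randn(1, 256, 224, 224)       # output of the ResNet/FPN stage
pred = head(feature_map)                          # 1 x 19 x 224 x 224
depth, cls_logits, center, size, angle = torch.split(
    pred, [1, num_classes, 3, 3, 2], dim=1)       # per-pixel predictions
```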
209. And performing depth training of three-dimensional perception capability on a preset visual neural network according to the supervision image data and the corresponding object information to obtain a supervision image model.
In this step, when the three-dimensional perception capability of the visual neural network is trained, the 3D information in the supervision image data is taken as the learning target: the visual neural network learns the recognition relationship between the 3D information and the image pixels, thereby acquiring three-dimensional perception recognition (that is, the recognition process of the laser point cloud neural network), and a supervision image model is obtained. The supervision image model is then evaluated using the object information, and the parameters of the model are adjusted based on the evaluation result to obtain the final supervision image model.
In this embodiment, according to a preset optical neural network and parameter variables in the 3D information, a loss function of model training is defined;
calculating a loss value between the supervisory image data and corresponding object information based on the loss function;
according to the loss value, utilizing a deep neural network training optimizer to carry out optimization adjustment on parameters of the visual neural network to obtain an optimization model;
and performing depth training of three-dimensional perceptibility on the optimization model by using the supervision image data and the object information to obtain a supervision image model.
In practical applications, the defined loss functions include the following:
1. a depth regression loss function;
2. a class cross-entropy loss function;
3. a center position regression loss function;
4. a size regression loss function;
5. an angle regression loss function;
where d represents depth; y represents a category; I(pixel ∈ box) represents an indicator function, which is 1 for pixels belonging to a 3D box and 0 otherwise; c represents the center position, s represents the size, and sin/cos represent the sine and cosine values of the angle; the hat symbol (^) denotes the corresponding ground-truth value.
The weighted sum of the above five loss functions is finally used as the total loss function to supervise the network training. Using the loss value calculated by the loss function, the deep neural network training optimizer can optimize the parameters of the model and carry out model training.
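The exact formulas of the five losses are given as figures in the original publication; the sketch below therefore only illustrates representative forms, assuming plain L1 regression terms gated by the I(pixel ∈ box) indicator and a per-pixel cross-entropy term, combined as a weighted sum:

```python
# Representative sketch of the five pixel-wise losses and their weighted sum.
# The L1 / cross-entropy forms are assumptions for illustration; the mask
# implements the I(pixel in box) indicator described above.
import torch
import torch.nn.functional as F

def supervised_losses(pred, target, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """pred/target: dicts of per-pixel tensors; target['mask'] is the
    indicator (N x 1 x H x W); target['cls'] holds class indices (N x H x W)."""
    m = target["mask"]
    n = m.sum().clamp(min=1)
    l_depth  = (m * (pred["depth"]  - target["depth"]).abs()).sum() / n
    l_cls    = F.cross_entropy(pred["cls"], target["cls"])       # per-pixel classes
    l_center = (m * (pred["center"] - target["center"]).abs()).sum() / n
    l_size   = (m * (pred["size"]   - target["size"]).abs()).sum() / n
    l_angle  = (m * (pred["angle"]  - target["angle"]).abs()).sum() / n
    terms = (l_depth, l_cls, l_center, l_size, l_angle)
    return sum(w * t for w, t in zip(weights, terms))

# The total loss then drives a standard optimizer step, e.g.:
# loss = supervised_losses(pred, target); loss.backward(); optimizer.step()
```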
In this embodiment, building on the previous embodiment, the output of the existing point cloud neural network is used to provide depth information for the image neural network: the depth information is projected into the image data, which is then used to train the visual neural network and obtain a supervision image model. Object recognition is performed based on this model on images acquired while the unmanned vehicle is driving, improving recognition efficiency and accuracy; supervision information can be provided for subsequent driving, and objects can be detected, recognized and tracked rapidly and in real time based on the supervision signals.
Referring to fig. 3, in an embodiment of an image recognition method according to the present invention, the method includes the following steps:
301. the camera image is input into a convolutional neural network for convolution to obtain an image characteristic tensor;
In this step, the image acquired by the camera generally has three RGB channels; for N images each of height and width H×W, the image input may be represented as N×H×W×3.
A convolution operation is performed on each picture through the convolutional neural network (or a vision transformer backbone), and features of different scales are extracted using a feature pyramid. For example, the images may be processed with a ResNet (residual convolutional neural network). The input tensor of the neural network is N×H×W×3, where N represents the number of images, H/W represent the height/width of the images, and 3 denotes the three RGB channels. The ResNet applies convolution and other operations to the input tensor to obtain output tensors at different stages.
302. Predicting object information therein based on the image feature tensor;
In this step, a unit convolution kernel is defined and used to perform a convolution calculation on the image feature tensor to obtain a single-dimension tensor, and the object information is predicted based on the single-dimension tensor. For example, each pixel predicts the following information about an obstacle: depth (1), object class (K, e.g., 10), center position (3), size (3), and angle (2), where the number of variables is given in parentheses, 19 numbers in total. A 1×1 convolution can therefore output an N×H×W×19-dimensional variable. After step 301, a feature tensor of size N×H×W×C is obtained; a convolution operation with a 1×1×C×19 convolution kernel converts the N×H×W×C tensor into an N×H×W×19 tensor. This final tensor is the result of the pixel-by-pixel image information prediction.
303. Laser point cloud input;
Each frame of laser point cloud can be represented as an M×4 tensor, where M is the number of points in the laser point cloud and 4 represents the features of each laser point (position information x, y, z; intensity).
304. Outputting a laser point cloud model;
In this step, a trained laser point cloud neural network is used to obtain the panoptic segmentation result of each frame of point cloud, namely the information of the 3D detection frames (center point, size, rotation angle and category) and the category information of the remaining points.
The method comprises the following specific steps:
1. Model inference: the laser point cloud data are input into the laser point cloud model, and tasks such as 3D obstacle frame detection (each 3D obstacle frame comprises information such as the center point, size, rotation angle and category) and laser radar point cloud segmentation (a category is output for each laser radar point) are performed.
2. Post-processing: for each lidar point, if it lies in a 3D obstacle detection frame, its category is replaced by the category of that 3D detection frame; if it does not lie in any 3D obstacle detection frame, its category is the output of the point cloud segmentation task.
305. The laser point cloud results are projected to an image plane;
the method comprises the steps of obtaining a parameter relation between a camera and a laser radar, and projecting 3D information corresponding to each object onto the image data based on the parameter relation to obtain monitoring image data.
306. Calculating a loss function, and supervising image model training;
For supervision image model training, loss functions are first defined, specifically including:
1. a depth regression loss function;
2. a class cross-entropy loss function;
3. a center position regression loss function;
4. a size regression loss function;
5. an angle regression loss function;
where d represents depth; y represents a category; I(pixel ∈ box) represents an indicator function, which is 1 for pixels belonging to a 3D box and 0 otherwise; c represents the center position, s represents the size, and sin/cos represent the sine and cosine values of the angle; the hat symbol (^) denotes the corresponding ground-truth value.
307. Collecting an image to be identified;
308. inputting an image to be identified into a supervision image model to identify 3D information, and generating supervision signals of all objects in the image to be identified based on the identified 3D information;
In this embodiment, after the image to be identified is input into the supervision image model, the model first identifies the obstacles or target objects to be identified in the image, performs recognition on them, extracts the three-dimensional information, generates a 3D detection frame and marks the three-dimensional information in the image; at the same time, it recognizes each pixel point in the image to obtain pixel features, and applies the supervision signal based on the pixel features and the 3D detection frame.
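An illustrative inference sketch is given below; the model interface follows the 19-channel head sketched earlier, and the confidence threshold used to form the per-object supervision signal is an assumption, not part of the patent:

```python
# Illustrative inference sketch: the trained supervision image model predicts
# per-pixel 3D information, and pixels whose class confidence exceeds a
# threshold are treated here as carrying the supervision signal.
import torch

@torch.no_grad()
def generate_supervision_signals(model, image, score_threshold=0.5):
    """image: 1 x 3 x H x W tensor -> dict of per-pixel predictions and a mask
    of pixels regarded as belonging to detected objects."""
    pred = model(image)                               # 1 x 19 x H x W, as in the head sketch
    depth, cls_logits, center, size, angle = torch.split(pred, [1, 10, 3, 3, 2], dim=1)
    scores, labels = cls_logits.softmax(dim=1).max(dim=1, keepdim=True)
    mask = scores > score_threshold                   # pixels carrying a supervision signal
    return {"depth": depth, "labels": labels, "center": center,
            "size": size, "angle": angle, "mask": mask}
```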
309. And performing supervision and identification on the corresponding object based on the supervision signal to obtain an identification result.
In this embodiment, supervision and recognition may be understood as tracking of a target: the supervision signal is used to recognize the object from images acquired in real time and to monitor its motion path in order to determine whether the vehicle needs to take avoiding action, thereby improving the controllability of automatic driving of the vehicle.
According to this embodiment, the laser point cloud data are recognized by the laser point cloud neural network and the corresponding 3D information is output; the convolutional neural network predicts the object information of the image corresponding to the laser point cloud data to obtain a prediction result; finally, deep learning of three-dimensional perception capability is performed on the visual neural network based on the prediction result and the corresponding 3D information, so that a supervision image model with both image recognition perception capability and point cloud recognition perception capability is obtained. Objects in images are supervised and recognized based on the supervision image model, which realizes image-based automatic object labeling and supervision: objects can be supervised and recognized without manual labeling, and the scene recognition capability, recognition accuracy and recognition efficiency of the unmanned vehicle are improved. Meanwhile, the experience of automatic driving of the vehicle is also improved.
The method for training a model in the embodiment of the present invention is described above, and the following describes a model training device in the embodiment of the present invention, referring to fig. 4, an embodiment of the model training device in the embodiment of the present invention includes:
an acquisition module 401, configured to acquire image data and laser point cloud data corresponding to the image data;
the point cloud identification module 402 is configured to identify 3D information of each frame of point cloud in the laser point cloud data, and project the identified 3D information into the image data to obtain supervised image data;
an image recognition module 403, configured to extract pixel-by-pixel information in the image data by using a neural network, and predict corresponding object information based on the pixel-by-pixel information;
the model training module 404 is configured to perform depth training of three-dimensional perception capability on a preset visual neural network according to the supervised image data and the corresponding object information, so as to obtain a supervised image model, so that the visual neural network acquires the perception capability of both the image data and the laser point cloud.
In the embodiment of the invention, the model training device runs the above model training method: it extracts the 3D information of the object through the laser point cloud, projects the 3D information into the image data, extracts pixels based on the image data and predicts the corresponding object information, and trains the visual neural network with the image data containing the 3D information and the object information obtained from the image data to obtain the supervised image model. The model thus has both point cloud perception and image perception capabilities, and objects are recognized based on the model during driving, improving recognition efficiency and accuracy.
Referring to fig. 5, a second embodiment of the model training apparatus according to the present invention includes:
an acquisition module 401, configured to acquire image data and laser point cloud data corresponding to the image data;
the point cloud identification module 402 is configured to identify 3D information of each frame of point cloud in the laser point cloud data, and project the identified 3D information into the image data to obtain supervised image data;
an image recognition module 403, configured to extract pixel-by-pixel information in the image data by using a neural network, and predict corresponding object information based on the pixel-by-pixel information;
the model training module 404 is configured to perform depth training of three-dimensional perception capability on a preset visual neural network according to the supervised image data and the corresponding object information, so as to obtain a supervised image model, so that the visual neural network acquires the perception capability of both the image data and the laser point cloud.
Wherein, the point cloud identification module 402 includes:
the point cloud detection unit 4021 is configured to sequentially input each frame of point cloud in the laser point cloud data into a laser point cloud neural network according to the sequence of the acquisition time, and perform 3D object frame detection through the laser point cloud neural network to obtain 3D information corresponding to an object;
The projection unit 4022 is configured to obtain a parameter relationship between the camera and the lidar, and project 3D information corresponding to each object onto the image data based on the parameter relationship, so as to obtain the surveillance image data.
In this embodiment, the point cloud detection unit 4021 is specifically configured to:
acquiring a depth map of each frame of point cloud through the laser point cloud neural network, and segmenting the point clouds belonging to the same object based on the depth map to obtain a plurality of point cloud sets;
constructing an object detection frame based on the point cloud set, and matching the object detection frame with a plurality of preset object frames to obtain object category information of the point cloud set;
and fusing the object type information into the object detection frame to obtain 3D information corresponding to the object.
In this embodiment, the projection unit 4022 is specifically configured to:
determining first coordinate information of each point cloud based on the 3D information;
acquiring internal parameters of a camera and internal parameters of a laser radar, and calculating a parameter relation between the camera and the laser radar based on the internal parameters of the camera and the laser radar;
calculating second coordinate information of the point cloud in a coordinate system of the camera based on the parameter relation and the first coordinate information;
And projecting corresponding 3D information onto the image according to the second coordinate information to obtain the supervision image data.
Wherein, the image recognition module 403 includes:
a first convolution unit 4031, configured to perform a first convolution operation on each channel picture in the image by using a residual convolutional neural network, so as to obtain a plurality of feature tensors;
a fusion unit 4032, configured to fuse a plurality of the feature tensors by using an FPN network, so as to obtain a final feature map;
a construction unit 4033 for constructing a unit convolution kernel from the final feature map;
a second convolution unit 4034, configured to perform a second convolution operation on the final feature map based on the unit convolution kernel, to obtain a multidimensional tensor;
and a prediction unit 4035, configured to predict based on the multi-dimensional tensor to obtain object information.
Wherein the model training module 404 comprises:
a function definition unit 4041, configured to define a loss function of model training according to a preset visual neural network and parameter variables in the 3D information;
a calculation unit 4042 for calculating a loss value between the supervisory image data and the corresponding object information based on the loss function;
The optimizing unit 4043 is configured to perform optimization adjustment on parameters of the visual neural network by using a deep neural network training optimizer according to the loss value, so as to obtain an optimization model;
and a training unit 4044, configured to perform depth training of the three-dimensional perceptibility on the optimization model by using the surveillance image data and the object information, so as to obtain a surveillance image model.
In this embodiment, building on the previous embodiment, the three-dimensional perception and recognition capability of the laser point cloud is learned on the basis of visual image recognition, so that the model can perceive objects both from images and from laser point clouds; the information of objects in images acquired by the automatic driving system is recognized and marked directly through the model, which facilitates recognition by subsequent systems.
The image recognition method in the embodiment of the present invention is described above, and the image recognition apparatus in the embodiment of the present invention is described below, referring to fig. 6, where an embodiment of the image recognition apparatus in the embodiment of the present invention includes:
The shooting module 601 is used for acquiring an image to be identified;
the identifying module 602 is configured to input the image to be identified to a supervised image model for identifying 3D information, and generate a supervisory signal of each object in the image to be identified based on the identified 3D information, where the supervised image model is obtained by training using the model training method provided above;
and the supervision module 603 is configured to perform supervision and identification on the corresponding object based on the supervision signal, so as to obtain an identification result (an illustrative sketch of this inference flow is given below).
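The following minimal sketch illustrates, under the same assumptions as the earlier examples, how the shooting, identifying, and supervision modules could interact at inference time: a captured image is passed through the trained supervision image model, and locations whose score exceeds a threshold are kept as supervision signals. The thresholding rule and the meaning of the output channels are assumptions made only to keep the example concrete.

# Hedged inference sketch; model, channel semantics, and threshold are illustrative.
import torch

def generate_supervision_signals(model, image_tensor, score_threshold=0.5):
    """Run the supervision image model on one image and return the pixel
    locations treated here as supervision signals for downstream recognition."""
    model.eval()
    with torch.no_grad():
        outputs = model(image_tensor.unsqueeze(0))   # add a batch dimension
    # Assumption: channel 0 of the multidimensional output behaves like an
    # objectness score map; confident locations become supervision signals.
    scores = torch.sigmoid(outputs[0, 0])
    return (scores > score_threshold).nonzero()

# Example usage with the RecognitionSketch network from the earlier sketch
# signals = generate_supervision_signals(RecognitionSketch(), torch.randn(3, 128, 128))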
In the embodiment of the invention, after the 3D information of objects is extracted from the laser point cloud, the 3D information is projected into the image data; pixel extraction is then performed on the image data and the corresponding object information is predicted. The image data containing the 3D information, together with the object information obtained from the image data, is used to train a visual neural network to obtain a supervision image model, so that the model has both point cloud perception and image perception capabilities. Object recognition is then performed, based on this model, on images to be recognized that are acquired during driving, so as to obtain supervision signals, and objects in subsequent images are supervised and recognized based on the supervision signals, thereby improving recognition efficiency and the accuracy of supervised recognition.
Fig. 4 to 6 above describe the model training device and the image recognition device in the embodiment of the present invention in detail from the perspective of modularized functional entities; the computer device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 7 is a schematic diagram of a computer device according to an embodiment of the present invention. The computer device 700 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing application programs 733 or data 732. The memory 720 and the storage medium 730 may be transitory or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations in the computer device 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and execute, on the computer device 700, the series of instruction operations in the storage medium 730 to implement the steps of the model training method or of the image recognition method described above.
The computer device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the computer device structure shown in Fig. 7 does not limit the computer device provided by the present invention, which may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, or a volatile computer readable storage medium, having stored therein instructions that, when executed on a computer, cause the computer to perform the steps of the model training method, or the steps of the image recognition method provided above.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system or apparatus and unit described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, or all or part of it, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A model training method, characterized in that the model training method comprises:
acquiring image data and laser point cloud data corresponding to the image data;
identifying 3D information for each frame of point cloud in the laser point cloud data, and projecting the identified 3D information into the image data to obtain supervision image data, wherein the 3D information is a 3D detection frame, and the supervision image data is combined data comprising point cloud labeling information and an image;
dividing the image data according to time sequence by utilizing a neural network, performing dimension reduction processing to obtain a plurality of single-dimensional tensors, combining the plurality of single-dimensional tensors to construct a pixel boundary, identifying the pixel boundary by adopting a target identification model, and finding out matched object information from a preset object class library;
and learning and classifying point cloud labeling information in the supervision image data by using a deep learning neural network, inputting the classified information into a preset visual neural network, learning and classifying the information by the visual neural network to obtain the characteristic recognition capability of the laser point cloud and outputting a corresponding training result, comparing the training result with corresponding object information, and adjusting model parameters of the visual neural network based on the comparison result to obtain a supervision image model, so as to realize perception of both the image data and the laser point cloud by the visual neural network.
2. The model training method according to claim 1, wherein the identifying 3D information for each frame of point cloud in the laser point cloud data and projecting the identified 3D information into the image data to obtain the supervision image data includes:
sequentially inputting each frame of point clouds in the laser point cloud data into a laser point cloud neural network according to the sequence of the acquisition time, and detecting a 3D object frame through the laser point cloud neural network to obtain 3D information corresponding to an object;
and acquiring a parameter relation between the camera and the laser radar, and projecting 3D information corresponding to each object onto the image data based on the parameter relation to obtain the supervision image data.
3. The model training method according to claim 2, wherein the performing 3D object frame detection by the laser point cloud neural network to obtain 3D information corresponding to an object includes:
acquiring a depth map of each frame of point cloud through the laser point cloud neural network, and segmenting the point clouds belonging to the same object based on the depth map to obtain a plurality of point cloud sets;
constructing an object detection frame based on the point cloud set, and matching the object detection frame with a plurality of preset object frames to obtain object category information of the point cloud set;
and fusing the object category information into the object detection frame to obtain the 3D information corresponding to the object.
4. The model training method according to claim 3, wherein the acquiring the parameter relation between the camera and the laser radar, and projecting the 3D information corresponding to each object onto the image data based on the parameter relation, to obtain the supervision image data, comprises:
determining first coordinate information of each point cloud based on the 3D information;
acquiring internal parameters of a camera and internal parameters of a laser radar, and calculating a parameter relation between the camera and the laser radar based on the internal parameters of the camera and the laser radar;
calculating second coordinate information of the point cloud in a coordinate system of the camera based on the parameter relation and the first coordinate information;
and projecting corresponding 3D information onto the image according to the second coordinate information to obtain the supervision image data.
5. The model training method according to claim 1, wherein the dividing the image data according to time sequence by utilizing the neural network and performing dimension reduction processing to obtain a plurality of single-dimensional tensors, combining the plurality of single-dimensional tensors to construct a pixel boundary, identifying the pixel boundary by adopting a target identification model, and finding out matched object information from a preset object class library, includes:
performing a first convolution operation on each channel picture in the image by using a residual convolutional neural network to obtain a plurality of feature tensors;
fusing the plurality of feature tensors by using an FPC network to obtain a final feature map;
constructing a unit convolution kernel according to the final feature map;
performing a second convolution operation on the final feature map based on the unit convolution kernel to obtain a multidimensional tensor;
and constructing a pixel boundary based on the multidimensional tensor, identifying the pixel boundary by adopting a target identification model, and finding out matched object information from a preset object class library.
6. The model training method according to any one of claims 1 to 5, wherein the comparing the training result with corresponding object information and adjusting the model parameters of the visual neural network based on the comparison result to obtain the supervision image model comprises:
defining a loss function of model training according to parameter variables in the classified information;
calculating a loss value between the categorized information and the corresponding object information based on the loss function;
and according to the training result and the loss value, utilizing a deep neural network training optimizer to carry out optimization adjustment on parameters of the visual neural network, and obtaining a supervision image model.
7. An image recognition method, characterized in that the image recognition method comprises:
collecting an image to be identified;
inputting the image to be identified into a supervision image model for 3D information identification, and generating supervision signals of all objects in the image to be identified based on the identified 3D information, wherein the supervision image model is trained according to the model training method of any one of claims 1-6;
and performing supervision and identification on the corresponding object based on the supervision signal to obtain an identification result.
8. A model training apparatus, characterized in that the model training apparatus comprises:
the acquisition module is used for acquiring image data and laser point cloud data corresponding to the image data;
the point cloud identification module is used for identifying 3D information of each frame of point cloud in the laser point cloud data, and projecting the identified 3D information into the image data to obtain supervision image data, wherein the 3D information is a 3D detection frame, and the supervision image data is combined data comprising point cloud labeling information and an image;
the image recognition module is used for dividing the image data according to time sequence by utilizing a neural network and performing dimension reduction processing to obtain a plurality of single-dimensional tensors, combining the plurality of single-dimensional tensors to construct a pixel boundary, identifying the pixel boundary by adopting a target recognition model, and finding out matched object information from a preset object class library;
The model training module is used for learning and classifying point cloud labeling information in the supervision image data by using a deep learning neural network, inputting the classified information into a preset visual neural network, learning the classified information by the visual neural network to obtain the characteristic recognition capability of the laser point cloud and outputting a corresponding training result, comparing the training result with corresponding object information, and adjusting model parameters of the visual neural network based on the comparison result to obtain a supervision image model, so as to realize perception of both the image data and the laser point cloud by the visual neural network.
9. An image recognition apparatus, characterized in that the image recognition apparatus comprises:
the shooting module is used for acquiring an image to be identified;
the recognition module is used for inputting the image to be recognized into a supervision image model to recognize 3D information and generating supervision signals of all objects in the image to be recognized based on the recognized 3D information, wherein the supervision image model is trained by the model training method according to any one of claims 1-6;
and the supervision module is used for carrying out supervision and identification on the corresponding object based on the supervision signal to obtain an identification result.
10. A computer device, the computer device comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the computer device to perform the steps of the model training method of any one of claims 1-6, or the steps of the image recognition method of claim 7.
11. A computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the model training method according to any of claims 1-6 or the steps of the image recognition method according to claim 7.
CN202210171695.4A 2022-02-24 2022-02-24 Model training method, image recognition method, device, equipment and storage medium Active CN114792417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210171695.4A CN114792417B (en) 2022-02-24 2022-02-24 Model training method, image recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210171695.4A CN114792417B (en) 2022-02-24 2022-02-24 Model training method, image recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114792417A CN114792417A (en) 2022-07-26
CN114792417B true CN114792417B (en) 2023-06-16

Family

ID=82459865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210171695.4A Active CN114792417B (en) 2022-02-24 2022-02-24 Model training method, image recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114792417B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107025642B (en) * 2016-01-27 2018-06-22 百度在线网络技术(北京)有限公司 Vehicle's contour detection method and device based on point cloud data
US11250240B1 (en) * 2020-07-27 2022-02-15 Pony Ai Inc. Instance segmentation using sensor data having different dimensionalities
CN112861653B (en) * 2021-01-20 2024-01-23 上海西井科技股份有限公司 Method, system, equipment and storage medium for detecting fused image and point cloud information
CN113052066B (en) * 2021-03-24 2022-09-02 中国科学技术大学 Multi-mode fusion method based on multi-view and image segmentation in three-dimensional target detection
CN113936198B (en) * 2021-11-22 2024-03-22 桂林电子科技大学 Low-beam laser radar and camera fusion method, storage medium and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147706A (en) * 2018-10-24 2019-08-20 腾讯科技(深圳)有限公司 The recognition methods of barrier and device, storage medium, electronic device
CN109978955A (en) * 2019-03-11 2019-07-05 武汉环宇智行科技有限公司 A kind of efficient mask method for combining laser point cloud and image
CN110738121A (en) * 2019-09-17 2020-01-31 北京科技大学 front vehicle detection method and detection system
CN111242041A (en) * 2020-01-15 2020-06-05 江苏大学 Laser radar three-dimensional target rapid detection method based on pseudo-image technology
CN111291676A (en) * 2020-02-05 2020-06-16 清华大学 Lane line detection method and device based on laser radar point cloud and camera image fusion and chip
CN111709343A (en) * 2020-06-09 2020-09-25 广州文远知行科技有限公司 Point cloud detection method and device, computer equipment and storage medium
CN113326826A (en) * 2021-08-03 2021-08-31 新石器慧通(北京)科技有限公司 Network model training method and device, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DeepI2P: Image-to-Point Cloud Registration via Deep Classification; Jiaxin Li et al.; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 15960-15969 *
Edge extraction based on fusion of LiDAR 3D point cloud and monocular image; Jiang Jingjing et al.; 2017 Chinese Automation Congress (CAC 2017) and International Intelligent Manufacturing Innovation Conference (CIMIC 2017); 439-445 *
Ground segmentation method for complex scenes based on three-dimensional lidar; Mei Shengming et al.; Laser & Optoelectronics Progress; 1-8 *
Application of simulated lidar point clouds in roadside perception algorithms; Zou Kai et al.; Computer Systems & Applications; Vol. 30, No. 6; 246-254 *

Also Published As

Publication number Publication date
CN114792417A (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN111201451B (en) Method and device for detecting object in scene based on laser data and radar data of scene
CN110163904B (en) Object labeling method, movement control method, device, equipment and storage medium
Zhou et al. Self‐supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain
US20210326624A1 (en) Method, system and device for difference automatic calibration in cross modal target detection
Levinson et al. Traffic light mapping, localization, and state detection for autonomous vehicles
WO2020167581A1 (en) Method and apparatus for processing video stream
Feng et al. A simple and efficient multi-task network for 3d object detection and road understanding
CN113033385A (en) Deep learning-based violation building remote sensing identification method and system
CN114399882A (en) Fire source detection, identification and early warning method for fire-fighting robot
Ullah et al. Rotation invariant person tracker using top view
CN112132136A (en) Target tracking method and device
Gal Automatic obstacle detection for USV’s navigation using vision sensors
CN115063762A (en) Method, device and equipment for detecting lane line and storage medium
CN113269147B (en) Three-dimensional detection method and system based on space and shape, and storage and processing device
US20230260259A1 (en) Method and device for training a neural network
CN114792417B (en) Model training method, image recognition method, device, equipment and storage medium
Mount et al. Automatic coverage selection for surface-based visual localization
CN114707604A (en) Twin network tracking system and method based on space-time attention mechanism
CN112766100A (en) 3D target detection method based on key points
Kovács Single image visual obstacle avoidance for low power mobile sensing
Wang et al. Human detection with occlusion handling by over-segmentation and clustering on foreground regions
McKinnon et al. A region-based approach to stereo matching for USAR
CN115496977B (en) Target detection method and device based on multi-mode sequence data fusion
Sun et al. Dark channel based illumination invariant feature detection
US20230267749A1 (en) System and method of segmenting free space based on electromagnetic waves

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant