CN114494158A - Image processing method, lane line detection method and related equipment - Google Patents

Image processing method, lane line detection method and related equipment

Info

Publication number
CN114494158A
Authority
CN
China
Prior art keywords
feature
image
output
detected
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210018538.XA
Other languages
Chinese (zh)
Inventor
韩建华
徐航
许春景
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202210018538.XA priority Critical patent/CN114494158A/en
Publication of CN114494158A publication Critical patent/CN114494158A/en
Priority to PCT/CN2022/143779 priority patent/WO2023131065A1/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • G06T2207/30256Lane; Road marking


Abstract

The embodiment of the application discloses an image processing method which can be applied to scenarios involving lane line detection, such as adaptive cruise, lane departure warning, lane keeping assistance and the like. The method comprises the following steps: extracting features of an image to be detected to obtain a first feature; processing detection frame information of the image to be detected to obtain a second feature, wherein the detection frame information comprises the position of a detection frame of at least one object in the image to be detected; and inputting the first feature and the second feature into a first neural network based on a transformer structure to obtain a lane line in the image to be detected. By applying a neural network with a transformer structure to the lane line detection task, global information of the image to be detected can be obtained, so that the long-range relations between lane lines are effectively modeled. In addition, adding the detection frame information improves the perception of the image scene.

Description

Image processing method, lane line detection method and related equipment
Technical Field
The embodiment of the application relates to the field of artificial intelligence, in particular to an image processing method, a lane line detection method and related equipment.
Background
Intelligent driving technologies (such as autonomous driving and assisted driving) rely on the cooperation of artificial intelligence, computer vision, radar, monitoring devices and the global positioning system, so that a vehicle can drive automatically without active human operation. Lane line detection is one of the most important technologies in intelligent driving and is of great significance to other technologies applied in an intelligent driving system (such as adaptive cruise control, lane departure warning and road condition understanding). The goal of lane line detection is to predict each lane line in an image captured by a camera, so as to assist the vehicle in driving in the correct lane.
With the development of the deep learning technology, lane line detection based on image segmentation starts to appear, a lane line detection model based on image segmentation firstly predicts the segmentation result of the whole image, and then outputs the lane line detection result after clustering.
However, most lane line detection methods based on deep learning rely on a convolutional neural network, such as a spatial convolutional neural network (SCNN). Because a convolutional neural network is limited by its receptive field, it cannot perceive the global information of a picture well, so the position of a lane line cannot be accurately predicted; in particular, in scenes where lane lines are occluded by many vehicles, the model is prone to false detections.
Disclosure of Invention
The embodiment of the application provides an image processing method, a lane line detection method and related equipment, which can improve the accuracy of detecting lane lines in an image.
A first aspect of an embodiment of the present application provides an image processing method, which may be applied to an intelligent driving scenario, for example, scenarios involving lane line detection such as adaptive cruise, lane departure warning (LDW) and lane keeping assist (LKA). The method may be performed by an image processing device (e.g., a terminal device or a server) or by a component of an image processing device (e.g., a processor, a chip, or a system of chips). The method is realized by a target neural network containing a transformer structure, and comprises the following steps: extracting features of an image to be detected to obtain a first feature; processing detection frame information of the image to be detected to obtain a second feature, wherein the detection frame information comprises the position of a detection frame of an object in the image to be detected; and inputting the first feature and the second feature into a first neural network based on a transformer structure to obtain a lane line in the image to be detected.
In the embodiment of the application, on one hand, applying the transformer structure to the lane line detection task makes it possible to acquire global information of the image to be detected, so that the long-range relations between lane lines are effectively modeled. On the other hand, adding the detection frame information of objects in the image to the lane line detection process improves the perception of the image scene and reduces misjudgment in scenes where lane lines are occluded by vehicles.
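For illustration only, a minimal PyTorch-style sketch of this overall flow (feature extraction, detection-frame encoding, transformer-based lane prediction) is given below. The module names, interfaces and tensor shapes are assumptions made for this sketch, not the claimed implementation.

```python
import torch
import torch.nn as nn

class LaneDetector(nn.Module):
    """Conceptual pipeline: image features + detection-frame features -> transformer -> lane point sets."""

    def __init__(self, backbone: nn.Module, box_encoder: nn.Module, lane_transformer: nn.Module):
        super().__init__()
        self.backbone = backbone                  # extracts the "first feature" from the image
        self.box_encoder = box_encoder            # turns detection-frame info into the "second feature"
        self.lane_transformer = lane_transformer  # transformer-based "first neural network"

    def forward(self, image: torch.Tensor, boxes, labels, scores):
        first_feature = self.backbone(image)                                # e.g. (B, C, H, W)
        second_feature = self.box_encoder(boxes, labels, scores)            # e.g. (B, N_box, C)
        point_sets = self.lane_transformer(first_feature, second_feature)   # (B, N_lane, N_pts, 2)
        return point_sets                                                   # each point set represents one lane line
```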
Optionally, in a possible implementation manner of the first aspect, the step of: processing the detection frame information of the image to be detected to obtain a second characteristic, comprising: and processing the at least one third feature and the detection frame information to obtain a second feature, wherein the at least one third feature is an intermediate feature obtained in the process of obtaining the first feature.
In this possible implementation manner, the acquired second feature not only contains the detection frame information, but also contains the feature of the image. More details are provided for subsequent lane line determinations.
Optionally, in a possible implementation manner of the first aspect, the second feature includes a position feature and a semantic feature of a detection frame corresponding to an object in the image to be detected, and the detection frame information further includes the category and the confidence of the detection frame; processing the at least one third feature and the detection frame information to obtain the second feature comprises: obtaining the semantic feature based on the at least one third feature, the position and the confidence; and obtaining the position feature based on the position and the category.
In this possible implementation manner, the second feature considers not only the position of the detection frame but also the category and the confidence of the detection frame, so that the lane line determined subsequently is more accurate.
Optionally, in a possible implementation manner of the first aspect, obtaining the semantic feature based on the at least one third feature, the position and the confidence includes: extracting a region of interest (ROI) feature from the at least one third feature based on the position; and multiplying the ROI feature by the confidence and inputting the obtained feature into a fully connected layer to obtain the semantic feature. Obtaining the position feature based on the position and the category includes: obtaining a vector of the category, splicing it with the vector corresponding to the position, and inputting the obtained feature into a fully connected layer to obtain the position feature.
In the possible implementation mode, the semantic features related to the detection frame in the image features are determined, and the position features containing the position information of the detection frame are introduced, so that the information of the second features is more comprehensive, and the accuracy of lane line prediction is improved.
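A hedged sketch of how the semantic and position features of the detection frames might be formed is shown below, assuming torchvision's roi_align for ROI extraction and an embedding table for the category vector; the layer sizes, ROI resolution and channel counts are illustrative assumptions, not the patented design.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class BoxFeatureEncoder(nn.Module):
    """Builds the 'second feature': a semantic part (ROI feature x confidence -> FC) and a
    position part (box coordinates spliced with a category vector -> FC)."""

    def __init__(self, in_channels: int = 256, num_classes: int = 80, dim: int = 256):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, dim)
        self.semantic_fc = nn.Linear(in_channels * 7 * 7, dim)  # after 7x7 ROI pooling
        self.position_fc = nn.Linear(4 + dim, dim)              # (x1, y1, x2, y2) + category vector

    def forward(self, third_feature, boxes, labels, scores, spatial_scale: float):
        # third_feature: (B, C, H, W) intermediate feature; boxes: list of (N_i, 4) per image
        rois = roi_align(third_feature, boxes, output_size=(7, 7),
                         spatial_scale=spatial_scale)            # (N, C, 7, 7)
        rois = rois.flatten(1) * scores.unsqueeze(1)             # weight ROI features by confidence
        semantic = self.semantic_fc(rois)                        # semantic feature of each detection frame

        cls_vec = self.class_embed(labels)                       # vector of the category
        pos_in = torch.cat([torch.cat(boxes, dim=0), cls_vec], dim=1)  # splice position with category
        position = self.position_fc(pos_in)                      # position feature of each detection frame
        return semantic, position
```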
Optionally, in a possible implementation manner of the first aspect, the first neural network based on a transformer structure includes an encoder, a decoder and a feedforward neural network; inputting the first feature and the second feature into the first neural network based on a transformer structure to obtain a lane line in the image to be detected comprises: acquiring a fourth feature based on the first feature, the second feature and the encoder; inputting the fourth feature, the second feature and a query feature into the decoder to obtain a fifth feature; and inputting the fifth feature into the feedforward neural network to obtain a plurality of point sets.
In this possible implementation manner, on one hand, applying the transformer structure to the lane line detection task makes it possible to acquire global information of the image to be detected, so that the long-range relations between lane lines are effectively modeled. On the other hand, the second feature containing the detection frame information is added in the process of determining the point sets, so that the lane lines determined based on the point sets are more accurate.
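For illustration, the sketch below uses PyTorch's standard transformer encoder/decoder layers as stand-ins for the encoder, decoder and feedforward network described above; the number of lanes, points per lane, dimensions and the way the second feature enters the decoder are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LaneTransformerHead(nn.Module):
    """Illustrative encoder -> decoder -> feedforward head that outputs lane point sets."""

    def __init__(self, dim: int = 256, num_lanes: int = 10, points_per_lane: int = 72):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
        self.query = nn.Embedding(num_lanes, dim)        # learnable "query feature"
        self.ffn = nn.Linear(dim, points_per_lane * 2)   # each lane -> a set of (x, y) points

    def forward(self, first_feature: torch.Tensor, second_feature: torch.Tensor):
        # first_feature: (B, HW, C) flattened image feature; second_feature: (B, N_box, C)
        fourth = self.encoder(torch.cat([first_feature, second_feature], dim=1))
        queries = self.query.weight.unsqueeze(0).expand(first_feature.size(0), -1, -1)
        fifth = self.decoder(queries, fourth)            # queries cross-attend to the encoder output
        points = self.ffn(fifth)                         # (B, num_lanes, points_per_lane * 2)
        return points.view(points.size(0), points.size(1), -1, 2)
```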
Optionally, in a possible implementation manner of the first aspect, the method further includes: acquiring a first row feature and a first column feature based on the first feature, wherein the first row feature is obtained by flattening a matrix corresponding to the first feature along the row direction, and the first column feature is obtained by flattening the matrix along the column direction; and inputting the first feature and the second feature into the encoder to obtain the fourth feature comprises: inputting the first feature, the second feature, the first row feature and the first column feature into the encoder to obtain the fourth feature.
In this possible implementation manner, the first row feature and the first column feature, which follow the shape of a lane line to mine context information, are introduced, so that the ability to model the features of elongated lane lines can be improved and a better lane line detection effect achieved.
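One possible reading of the row and column flattening is sketched below: the feature map is read out as a token sequence in row-major order for the first row feature and in column-major order for the first column feature. This interpretation is an assumption; the layout used in the embodiment may differ.

```python
import torch

def row_column_features(first_feature: torch.Tensor):
    """first_feature: (B, C, H, W). Returns row-major and column-major token sequences."""
    b, c, h, w = first_feature.shape
    row_feature = first_feature.flatten(2).transpose(1, 2)                  # (B, H*W, C), row by row
    col_feature = first_feature.transpose(2, 3).flatten(2).transpose(1, 2)  # (B, W*H, C), column by column
    return row_feature, col_feature
```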
Optionally, in a possible implementation manner of the first aspect, the step of: inputting the first feature, the second feature, the first row feature and the first column feature into an encoder to obtain a fourth feature, including: performing self-attention calculation on the first characteristic to obtain a first output; performing cross attention calculation on the first characteristic and the second characteristic to obtain a second output; performing self-attention calculation and splicing processing on the first row characteristics and the first column characteristics to obtain row and column output; a fourth feature is obtained based on the first output, the second output, and the row-column output.
In this possible implementation manner, the row-column output is also considered in the process of acquiring the fourth feature. By introducing a row-column output that follows the shape of a lane line to mine context information, the ability to model the features of elongated lane lines can be improved and a better lane line detection effect achieved.
Optionally, in a possible implementation manner of the first aspect, the step of: obtaining a fourth feature based on the first output, the second output, and the rank output, including: adding the first output and the second output to obtain a fifth output; and splicing the fifth output and the row-column output to obtain a fourth characteristic.
This possible implementation manner refines the specific process of obtaining the fourth feature: the result of adding the first output and the second output is spliced with the row-column output to obtain the fourth feature. By introducing a row-column output that follows the shape of a lane line to mine context information, the ability to model the features of elongated lane lines can be improved and a better lane line detection effect achieved.
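A hedged sketch of this encoder step is given below, using nn.MultiheadAttention as a stand-in: self-attention on the first feature gives the first output, cross-attention between the first and second features gives the second output, self-attention on the row and column features gives the row-column output, and the results are added and spliced into the fourth feature. All dimensions and the absence of normalisation/feedforward sublayers are simplifications.

```python
import torch
import torch.nn as nn

class HybridEncoderLayer(nn.Module):
    """Sketch of one encoder step: self-attention + cross-attention + row/column attention."""

    def __init__(self, dim: int = 256, nhead: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.row_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, first, second, rows, cols):
        # first: (B, HW, C); second: (B, N_box, C); rows/cols: row- and column-ordered tokens
        first_out, _ = self.self_attn(first, first, first)       # self-attention on the image feature
        second_out, _ = self.cross_attn(first, second, second)   # cross-attention with the box feature
        row_out, _ = self.row_attn(rows, rows, rows)             # self-attention on the row feature
        col_out, _ = self.col_attn(cols, cols, cols)             # self-attention on the column feature

        fifth_out = first_out + second_out                       # add the first and second outputs
        row_col = torch.cat([row_out, col_out], dim=1)           # splice into the row-column output
        return torch.cat([fifth_out, row_col], dim=1)            # splice to obtain the fourth feature
```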
Optionally, in a possible implementation manner of the first aspect, the step of: inputting the first feature and the second feature into an encoder to obtain a fourth feature, comprising: performing self-attention calculation on the first characteristic to obtain a first output; performing cross attention calculation on the first characteristic and the second characteristic to obtain a second output; and adding the first output and the second output to obtain a fourth characteristic.
In this possible implementation manner, the fourth feature not only includes a first output calculated by the self-attention mechanism based on the first feature, but also includes a second output calculated by the cross-attention mechanism based on the first feature and the second feature, so that the expression capability of the fourth feature is improved.
Optionally, in a possible implementation manner of the first aspect, the step of: inputting the fourth feature, the second feature and the query feature into a decoder to obtain a fifth feature, including: performing cross attention calculation on the query feature and the fourth feature to obtain a third output; processing the query feature and the second feature to obtain a fourth output; and adding the third output and the fourth output to obtain a fifth characteristic.
In this possible implementation manner, through the cross-attention calculation, the acquired fifth feature incorporates more information about the image to be predicted, which improves the expressive capability of the fifth feature and makes the lane lines determined based on the point sets more accurate.
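The decoder step described above could look roughly like the following, again with nn.MultiheadAttention as a stand-in and the handling of the second feature simplified to a cross-attention; this is an assumption-laden sketch, not the claimed structure.

```python
import torch.nn as nn

class LaneDecoderLayer(nn.Module):
    """Sketch of one decoder step: the query attends to the encoder output and to the box feature."""

    def __init__(self, dim: int = 256, nhead: int = 8):
        super().__init__()
        self.image_cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.box_cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, query, fourth, second):
        # query: (B, N_lane, C); fourth: encoder output; second: detection-frame feature
        third_out, _ = self.image_cross_attn(query, fourth, fourth)  # third output
        fourth_out, _ = self.box_cross_attn(query, second, second)   # fourth output
        return third_out + fourth_out                                # add to obtain the fifth feature
```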
Optionally, in a possible implementation manner of the first aspect, extracting the features of the image to be detected to obtain the first feature comprises: performing feature fusion and dimension reduction on features output by different layers of a backbone network to obtain the first feature, wherein the input of the backbone network is the image to be detected.
In this possible implementation manner, the features of the individual layers are spliced because features extracted by different layers of a neural network behave differently: lower-layer features have higher resolution and contain more position and detail information, but because they pass through fewer convolutions they are less semantic and noisier, whereas higher-layer features carry stronger semantic information but have low resolution and poor perception of detail. Therefore, fusing the features extracted from different layers of the neural network gives the obtained first feature multi-level characteristics.
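A minimal sketch of this multi-level fusion and dimension reduction, assuming three backbone stages with ResNet-like channel counts; the channel numbers, interpolation mode and use of a 1x1 convolution for dimension reduction are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelFusion(nn.Module):
    """Fuses features from several backbone layers and reduces the channel dimension."""

    def __init__(self, in_channels=(512, 1024, 2048), out_channels: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, feats):
        # feats: list of feature maps from shallow (high resolution) to deep (low resolution)
        target_size = feats[0].shape[-2:]
        upsampled = [feats[0]] + [
            F.interpolate(f, size=target_size, mode="bilinear", align_corners=False)
            for f in feats[1:]
        ]
        fused = torch.cat(upsampled, dim=1)   # series (concat) feature fusion across layers
        return self.reduce(fused)             # dimension reduction to obtain the first feature
```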
The second aspect of the embodiments of the present application provides a lane line detection method, which may be applied to an intelligent driving scenario, for example adaptive cruise, lane departure warning, lane keeping assistance and the like. The method may be performed by a detection device (e.g., a vehicle or a device in a vehicle) or by a component of a detection device (e.g., a processor, a chip, or a system of chips). The method comprises the following steps: acquiring an image to be detected; and processing the image to be detected to obtain a plurality of point sets, wherein each point set in the plurality of point sets represents a lane line in the image to be detected. The processing is performed by a first neural network based on a transformer structure, which predicts the point sets of the lane lines in the image based on detection frame information, wherein the detection frame information comprises the position, in the image to be detected, of a detection frame of at least one object in the image to be detected.
In the embodiment of the application, on one hand, applying the transformer structure to the lane line detection task makes it possible to acquire global information of the image to be detected, so that the long-range relations between lane lines are effectively modeled. On the other hand, adding the detection frame information of objects in the image to the lane line detection process improves the target neural network's perception of the image scene and reduces misjudgment of lane lines in scenes where they are occluded by vehicles.
Optionally, in a possible implementation manner of the second aspect, the detection frame information further includes: the category and confidence of the detection box.
In this possible implementation manner, by introducing the category and the confidence of the detection frame, more detection frame information is available for the subsequent lane line prediction, so that the lane lines subsequently determined based on the point sets are more accurate.
Optionally, in a possible implementation manner of the second aspect, the step further includes: and displaying the lane line.
In this possible implementation manner, displaying the lane line allows the user to pay attention to the lane line condition of the current road, especially in scenes where the lane line is occluded, which helps the user accurately determine the lane line and reduces the risk caused by unclear lane lines.
Optionally, in a possible implementation manner of the second aspect, the step further includes: modeling at least one object to obtain a virtual object; performing fusion processing on the plurality of point sets and the virtual object based on the position to obtain a target image; and displaying the target image.
In this possible implementation, the target image is obtained by modeling the virtual object and fusing the virtual object with the plurality of point sets based on the position. The user can know surrounding objects and lane lines through the target image, help the user accurately determine the surrounding objects and the lane lines, and reduce risks caused by fuzzy lane lines.
A third aspect of the embodiments of the present application provides an image processing method, which may be applied to an intelligent driving scenario. For example: the method comprises the following steps of self-adaptive cruise, lane departure early warning, lane keeping assistance and the like. The method may be performed by an image processing device (e.g., a terminal device or a server) or may be performed by a component of an image processing device (e.g., a processor, a chip, or a system of chips, etc.). The method comprises the following steps: acquiring a training image; inputting the training image into a target neural network to obtain a first point set of the training image, wherein the first point set represents a predicted lane line in the training image; the target neural network is operable to: extracting features of the training image to obtain first features; processing detection frame information of the training image to obtain a second characteristic, wherein the detection frame information comprises the position of a detection frame of an object in the training image; acquiring a first point set based on the first characteristic and the second characteristic, wherein the target neural network is used for predicting the point set of the lane line in the image based on the transformer structure; and training the target neural network according to the first point set and the real point set of the actual lane line in the training image to obtain the trained target neural network.
In the embodiment of the application, on one hand, applying the transformer structure to the lane line detection task makes it possible to acquire global information of the image to be detected, so that the long-range relations between lane lines are effectively modeled. On the other hand, adding the detection frame information of objects in the image to the lane line detection process improves the target neural network's perception of the image scene and reduces misjudgment of lane lines in scenes where they are occluded by vehicles.
A fourth aspect of embodiments of the present application provides an image processing apparatus that may be applied to an intelligent driving scenario, for example adaptive cruise, lane departure warning, lane keeping assistance and the like. The image processing apparatus includes: an extraction unit, configured to extract features of an image to be detected to obtain a first feature; a processing unit, configured to process detection frame information of the image to be detected to obtain a second feature, wherein the detection frame information comprises the position of a detection frame of at least one object in the image to be detected; and a determining unit, configured to input the first feature and the second feature into a first neural network based on a transformer structure to obtain a lane line in the image to be detected.
Optionally, in a possible implementation manner of the fourth aspect, the processing unit is specifically configured to process at least one third feature and the detection frame information to obtain the second feature, where the at least one third feature is an intermediate feature obtained in a process of obtaining the first feature.
Optionally, in a possible implementation manner of the fourth aspect, the second feature includes a position feature and a semantic feature of a detection frame corresponding to an object in the image to be detected, and the detection frame information further includes the category and the confidence of the detection frame; the processing unit is specifically configured to obtain the semantic feature based on the at least one third feature, the position and the confidence; and the processing unit is specifically configured to obtain the position feature based on the position and the category.
Optionally, in a possible implementation manner of the fourth aspect, the processing unit is specifically configured to extract a region of interest (ROI) feature from the at least one third feature based on the position; the processing unit is specifically configured to multiply the ROI feature by the confidence and input the obtained feature into a fully connected layer to obtain the semantic feature; and the processing unit is specifically configured to obtain a vector of the category, splice it with the vector corresponding to the position, and input the obtained feature into a fully connected layer to obtain the position feature.
Optionally, in a possible implementation manner of the fourth aspect, the first neural network based on a transformer structure includes an encoder, a decoder and a feedforward neural network; the determining unit is specifically configured to input the first feature and the second feature into the encoder to obtain a fourth feature; the determining unit is specifically configured to input the fourth feature, the second feature and a query feature into the decoder to obtain a fifth feature; and the determining unit is specifically configured to input the fifth feature into the feedforward neural network to obtain a plurality of point sets, wherein each point set in the plurality of point sets represents a lane line in the image to be detected.
Optionally, in a possible implementation manner of the fourth aspect, the image processing apparatus further includes an acquisition unit, configured to acquire a first row feature and a first column feature based on the first feature, wherein the first row feature is obtained by flattening a matrix corresponding to the first feature along the row direction, and the first column feature is obtained by flattening the matrix along the column direction; and the determining unit is specifically configured to input the first feature, the second feature, the first row feature and the first column feature into the encoder to obtain the fourth feature.
Optionally, in a possible implementation manner of the fourth aspect, the determining unit is specifically configured to perform self-attention calculation on the first feature to obtain a first output; the determining unit is specifically used for performing cross attention calculation on the first feature and the second feature to obtain a second output; the determining unit is specifically used for performing self-attention calculation and splicing processing on the first row characteristics and the first column characteristics to obtain row and column output; and the determining unit is specifically configured to obtain the fourth feature based on the first output, the second output, and the row-column output.
Optionally, in a possible implementation manner of the fourth aspect, the determining unit is specifically configured to add the first output and the second output to obtain a fifth output; and the determining unit is specifically configured to perform splicing processing on the fifth output and the row-column output to obtain a fourth characteristic.
Optionally, in a possible implementation manner of the fourth aspect, the determining unit is specifically configured to perform self-attention calculation on the first feature to obtain a first output; the determining unit is specifically used for performing cross attention calculation on the first feature and the second feature to obtain a second output; and the determining unit is specifically configured to perform addition processing on the first output and the second output to obtain a fourth characteristic.
Optionally, in a possible implementation manner of the fourth aspect, the determining unit is specifically configured to perform cross attention calculation on the query feature and the fourth feature to obtain a third output; the determining unit is specifically used for processing the query feature and the second feature to obtain a fourth output; and the determining unit is specifically configured to add the third output and the fourth output to obtain a fifth characteristic.
Optionally, in a possible implementation manner of the fourth aspect, the extracting unit is specifically configured to perform feature fusion and dimension reduction on features output by different layers in a backbone network to obtain the first feature, where an input of the backbone network is an image to be detected.
The fifth aspect of the embodiments of the present application provides a detection device, which may be applied to an intelligent driving scenario, for example adaptive cruise, lane departure warning, lane keeping assistance and the like. The detection device is applied to a vehicle and includes: an acquisition unit, configured to acquire an image to be detected; and a processing unit, configured to process the image to be detected to obtain a plurality of point sets, wherein each point set in the plurality of point sets represents a lane line in the image to be detected. The processing is performed by a first neural network based on a transformer structure, which predicts the point sets of the lane lines in the image based on detection frame information, wherein the detection frame information comprises the position, in the image to be detected, of a detection frame of at least one object in the image to be detected.
Optionally, in a possible implementation manner of the fifth aspect, the detection frame information further includes: the category and confidence of the detection box.
Optionally, in a possible implementation manner of the fifth aspect, the detection device further includes: and the display unit is used for displaying the lane line.
Optionally, in a possible implementation manner of the fifth aspect, the processing unit is further configured to model at least one object to obtain a virtual object; the processing unit is also used for carrying out fusion processing on the plurality of point sets and the virtual object based on the position to obtain a target image; and the display unit is also used for displaying the target image.
A sixth aspect of embodiments of the present application provides an image processing apparatus, which may be applied to a smart driving scenario. For example: the method comprises the following steps of self-adaptive cruise, lane departure early warning, lane keeping assistance and the like. The image processing apparatus includes: an acquisition unit configured to acquire a training image; the processing unit is used for inputting the training image into the target neural network to obtain a first point set of the training image, and the first point set represents a predicted lane line in the training image; the target neural network is operable to: extracting features of the training image to obtain first features; processing detection frame information of the training image to obtain a second characteristic, wherein the detection frame information comprises the position of a detection frame of an object in the training image; acquiring a first point set based on the first characteristic and the second characteristic, wherein the target neural network is used for predicting the point set of the lane line in the image based on the transformer structure; and the training unit is used for training the target neural network according to the first point set and the real point set of the actual lane line in the training image to obtain the trained target neural network.
A seventh aspect of the present application provides an image processing apparatus comprising: a processor coupled to a memory for storing a program or instructions which, when executed by the processor, cause the image processing apparatus to carry out the method of the aforementioned first aspect or any possible implementation of the first aspect, or the method of any possible implementation of the aforementioned third aspect or the third aspect.
An eighth aspect of the present application provides a detection apparatus, including: a processor coupled to a memory for storing a program or instructions which, when executed by the processor, cause the detection apparatus to carry out the method of the second aspect described above or any possible implementation of the second aspect.
A ninth aspect of the present application provides a computer readable medium having stored thereon a computer program or instructions which, when run on a computer, cause the computer to perform the method of the aforementioned first aspect or any possible implementation of the first aspect, or cause the computer to perform the method of the aforementioned second aspect or any possible implementation of the second aspect, or cause the computer to perform the method of the aforementioned third aspect or any possible implementation of the third aspect.
A tenth aspect of the present application provides a computer program product which, when executed on a computer, causes the computer to perform the method of the first aspect or any possible implementation of the first aspect, or causes the computer to perform the method of the second aspect or any possible implementation of the second aspect, or causes the computer to perform the method of the third aspect or any possible implementation of the third aspect.
For technical effects brought by the fourth, seventh, eighth, ninth, and tenth aspects or any one of possible implementation manners, reference may be made to technical effects brought by the first aspect or different possible implementation manners of the first aspect, and details are not described here.
For example, the technical effect brought by the fifth, seventh, eighth, ninth, and tenth aspects or any one of the possible implementation manners of the fifth aspect may refer to the technical effect brought by the second aspect or the different possible implementation manners of the second aspect, and details are not described here.
For technical effects brought by the sixth, seventh, eighth, ninth, and tenth aspects or any one of possible implementation manners, reference may be made to technical effects brought by different possible implementation manners of the third aspect or the third aspect, and details are not described here again.
According to the technical scheme, the embodiment of the application has the following advantages: on one hand, applying the transformer structure to the lane line detection task makes it possible to acquire global information of the image to be detected, so that the long-range relations between lane lines are effectively modeled; on the other hand, adding the detection frame information of objects in the image to the lane line detection process improves the perception of the image scene and reduces misjudgment in scenes where lane lines are occluded by vehicles.
Drawings
Fig. 1 is a schematic structural diagram of a system architecture according to an embodiment of the present application;
fig. 2 is a schematic diagram of a chip hardware structure according to an embodiment of the present disclosure;
FIG. 3a is a schematic structural diagram of an image processing system according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram of another embodiment of an image processing system according to the present disclosure;
FIG. 4 is a schematic structural diagram of a vehicle according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 6 is a schematic flow chart illustrating a process for obtaining a second feature according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a first neural network provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a transformer structure provided in an embodiment of the present application;
FIG. 9 is a schematic flow chart illustrating a fourth feature obtained in the embodiment of the present application;
FIG. 10 is a schematic flow chart illustrating a fourth output obtained in the embodiment of the present application;
fig. 11 is another schematic structural diagram of a first neural network provided in an embodiment of the present application;
FIG. 12 is another schematic structural diagram of a transformer structure provided in an embodiment of the present application;
FIG. 13 is a schematic diagram of a row-column attention module according to an embodiment of the present disclosure;
FIG. 14a is an exemplary diagram including a process for determining a plurality of point sets provided by an embodiment of the present application;
FIG. 14b is a diagram illustrating an example of a plurality of point sets provided by an embodiment of the present application;
FIG. 14c is a diagram of an example of an image to be detected including a plurality of point sets according to an embodiment of the present application;
fig. 14d is an exemplary diagram corresponding to lane line detection provided in the embodiment of the present application;
fig. 15 is another schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a target neural network provided in an embodiment of the present application;
FIG. 17 is a schematic diagram of another structure of a target neural network provided in an embodiment of the present application;
fig. 18 is a schematic flowchart of a lane line detection method according to an embodiment of the present application;
FIG. 19 is an exemplary diagram of a target image provided by an embodiment of the present application;
FIG. 20 is a schematic flow chart diagram illustrating a model training method according to an embodiment of the present disclosure;
fig. 21 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 22 is a schematic structural diagram of a detection apparatus provided in an embodiment of the present application;
fig. 23 is another schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 24 is another schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 25 is another schematic structural diagram of a detection apparatus according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides an image processing method, a lane line detection method and related equipment, which can improve the accuracy of detecting lane lines in an image.
The terms "first," "second," and the like in the description and claims of this application and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The first step of intelligent driving is the collection and processing of environmental information, and lane lines are one of the most main indicating information of a road surface, and can effectively guide intelligent vehicles to run in a restricted road area. Therefore, how to accurately detect the lane lines on the road surface in real time is an important link in the design of the related system of the intelligent vehicle, can be beneficial to assisting the functions of path planning, carrying out road deviation early warning and the like, and can provide reference for accurate navigation. The lane line detection technology aims to accurately identify the lane lines on the road surface by analyzing the pictures acquired by the vehicle-mounted camera in the driving process so as to assist the automobile to drive on the correct lane.
With the development of deep learning techniques, lane line detection based on image segmentation and lane line detection based on object detection have begun to appear. A lane line detection model based on image segmentation first predicts a segmentation result for the whole image and then outputs the lane line detection result after clustering. Detection-based lane line detection predicts a large number of lane line candidates by generating a plurality of anchors and predicting the offsets of lane lines with respect to the anchors, and then applies non-maximum suppression as post-processing to obtain the final lane line detection result.
Most lane line detection methods based on deep learning rely on a convolutional neural network, for example a spatial convolutional neural network (SCNN). SCNN is a lane line detection scheme based on image segmentation: it uses a convolutional neural network to segment the picture to be detected and predicts a category for each pixel. The scheme generalizes the conventional deep convolution structure into a slice-by-slice convolution structure and performs convolution along different directions, so that information can be propagated between rows and between columns of the picture. Specifically, whereas conventional convolution performs the convolution operation on a feature of dimension H×W×C, this scheme first splits H×W×C into H slices of size W×C along the vertical direction and convolves the slices from bottom to top and from top to bottom, then splits H×W×C into W slices of size H×C along the horizontal direction and convolves the slices from left to right and from right to left, and finally splices the convolution results obtained along the four directions and outputs a segmentation map of the image through a fully connected layer, thereby realizing lane line detection.
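To make the slice-by-slice idea concrete, the sketch below shows one direction (top to bottom) of an SCNN-style pass in PyTorch; the kernel size, channel count and use of ReLU are assumptions, and the real SCNN repeats this in all four directions with details that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scnn_downward_pass(x: torch.Tensor, conv: nn.Conv2d) -> torch.Tensor:
    """Top-to-bottom slice-by-slice convolution: each row receives a message from the row above."""
    rows = list(x.split(1, dim=2))                      # H slices of shape (B, C, 1, W)
    for i in range(1, len(rows)):
        rows[i] = rows[i] + F.relu(conv(rows[i - 1]))   # propagate information downward
    return torch.cat(rows, dim=2)

# Example usage with a width-preserving (1 x 9) kernel:
conv = nn.Conv2d(64, 64, kernel_size=(1, 9), padding=(0, 4))
out = scnn_downward_pass(torch.randn(1, 64, 36, 100), conv)  # analogous passes exist for the other three directions
```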
However, a convolutional neural network is limited by its receptive field and cannot perceive the global information of a picture well. On the one hand, this is unfavourable for predicting objects that are elongated and have long-range dependencies, such as lane lines. On the other hand, especially in scenes where lane lines are occluded by many vehicles, the position of a lane line cannot be accurately predicted and the model is prone to false detections.
In order to solve the foregoing technical problems, embodiments of the present application provide an image processing method, a lane line detection method and related devices. On the one hand, by applying a transformer structure to the lane line detection task, the long-range relations between lane lines can be effectively modeled. On the other hand, adding the position information of the detection frames of objects in the image to the lane line detection process improves the perception of the scene and reduces misjudgment caused by lane lines being occluded by vehicles. The image processing method and the related apparatus according to the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
For ease of understanding, the relevant terms and concepts to which the embodiments of the present application relate generally will be described below.
1. Neural network
A neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and the output of the unit may be:

h_{W,b}(x) = f(W^T x) = f\left( \sum_{s=1}^{n} W_s x_s + b \right)

where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a ReLU function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of that local receptive field, and the local receptive field may be a region composed of several neural units.
The operation of each layer in a neural network can be described by the mathematical expression y = a(Wx + b). Physically, the work of each layer can be understood as completing a transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are performed by Wx, operation 4 is performed by +b, and operation 5 is performed by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of such things. W is a weight vector, and each value in the vector represents the weight of a neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training the neural network is ultimately to obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
2. Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter and the convolution process may be viewed as convolving the same trainable filter with an input image or convolved feature plane (feature map). The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The underlying principle is: the statistics of a certain part of the image are the same as the other parts. Meaning that image information learned in one part can also be used in another part. The same learned image information can be used for all positions on the image. In the same convolution layer, a plurality of convolution kernels can be used to extract different image information, and generally, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
3、transformer
The transformer structure is a feature extraction network (distinct from a convolutional neural network) that includes an encoder and a decoder.
An encoder: feature learning is performed in a global receptive field in a self-attention manner, such as features of pixel points.
A decoder: the characteristics of the desired module, such as the characteristics of the output box, are learned through self-attention and cross-attention.
Attention (which may also be referred to as an attention mechanism) is described below:
The attention mechanism can quickly extract important features of sparse data. Attention occurs between the encoder and the decoder, or, it may be said, between the input sentence and the generated sentence. The self-attention mechanism in a self-attention model occurs within an input sequence or an output sequence, and can extract relations between words that are far apart in the same sentence, such as syntactic features (phrase structures). The self-attention QKV mechanism provides an efficient modelling way to capture global context information. Assume that the input is a query Q, and that the context is stored in the form of key-value pairs (K, V). The nature of the attention function can then be described as a mapping from a query to a series of key-value pairs. Attention essentially assigns a weight coefficient to each element in the sequence, which can also be understood as soft addressing: if each element in the sequence is stored in (K, V) form, addressing is performed by computing the similarity between Q and K. The similarity computed from Q and K reflects the importance, i.e. the weight, of the V value to be extracted, and the final feature value is then obtained by weighted summation.
The calculation of attention is mainly divided into three steps. The first step is to compute the similarity between the query and each key to obtain the weights; common similarity functions include the dot product, concatenation, a perceptron and the like. In the second step, the weights are generally normalised with a softmax function (on the one hand, normalisation yields a probability distribution in which all weight coefficients sum to 1; on the other hand, the characteristics of the softmax function can be used to highlight the weights of important elements). Finally, the weights and the corresponding values are weighted and summed to obtain the final feature value. The specific calculation formula may be as follows:
Attention(Q, K, V) = softmax( QK^T / \sqrt{d} ) V

where d is the dimension of the Q and K matrices.
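The three steps above correspond directly to the following short implementation of scaled dot-product attention (tensor shapes are assumed to be (..., sequence length, d)).

```python
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, matching the formula above."""
    d = q.size(-1)                                # dimension of the query/key vectors
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # step 1: similarity of the query with each key
    weights = torch.softmax(scores, dim=-1)       # step 2: normalise the weights with softmax
    return weights @ v                            # step 3: weighted sum over the values
```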
In addition, attention includes self-attention and cross-attention. Self-attention can be understood as a special form of attention in which the inputs Q, K and V are identical, whereas in cross-attention the inputs Q, K and V are not identical. Attention uses the degree of similarity between features (e.g., the inner product) as a weight to aggregate the queried features into the updated value of the current feature; self-attention is attention computed on the feature map itself.
For convolution, the setting of the convolution kernel limits the size of the receptive field, so that the network often needs to stack multiple layers to focus on the whole feature map. The self-attention has the advantage that the attention is global, and the global spatial information of the feature map can be acquired through simple query and assignment. A special point in the query, key, value (QKV) model from attention is that the corresponding inputs of QKV are consistent. The QKV model will be described later.
4. Feedforward neural network
Feed Forward Neural Networks (FNNs) were the first simple artificial neural networks invented. In the feedforward neural network, each neuron belongs to a different layer. The neurons of each layer may receive signals from neurons of a previous layer and generate signals for output to a next layer. Layer 0 is referred to as the input layer, the last layer as the output layer, and the other intermediate layers as the hidden layers. No feedback exists in the whole network, and signals are transmitted from an input layer to an output layer in a single direction.
5. Multilayer perceptron (MLP)
A multilayer perceptron is a feed-forward artificial neural network model that maps inputs onto a single output.
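As an illustration, a minimal feedforward network / multilayer perceptron with one hidden layer can be written as follows (the layer widths are arbitrary).

```python
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(256, 512),   # input layer -> hidden layer
    nn.ReLU(),             # nonlinear activation; signals flow forward only
    nn.Linear(512, 1),     # hidden layer -> output layer
)
```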
6. Loss function
In the process of training a deep neural network, because the output of the network is expected to be as close as possible to the value that is really desired, the weight vector of each layer can be updated according to the difference between the predicted value of the current network and the truly desired target value (of course, there is usually an initialisation process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make it lower, and the adjustment continues until the neural network can predict the truly desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
7. Feature fusion
Different features extracted by a neural network are combined by some method into a new feature that is more effective for classification, recognition, detection, and the like. Feature fusion generally has two modes: concat and add. Concat is a series (concatenation) fusion mode in which the two features are directly connected: if the dimensions of the two input features x and y are p and q, the dimension of the output feature z is p + q. Add is a parallel fusion strategy that combines the two input features x and y into a new feature z whose channel number is unchanged. In other words, add increases the amount of information under each dimension describing the image without increasing the number of dimensions themselves, while concat combines the numbers of channels, i.e., it increases the number of features describing the image without increasing the information under each feature.
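For illustration only, the following sketch (channel counts are arbitrary assumptions) contrasts the two fusion modes: concat increases the number of channels from p and q to p + q, while add keeps the channel number unchanged:

```python
import torch

x = torch.randn(1, 64, 32, 32)        # input feature x with p = 64 channels
y = torch.randn(1, 64, 32, 32)        # input feature y with q = 64 channels

z_concat = torch.cat([x, y], dim=1)   # concat: output feature z has p + q = 128 channels
z_add = x + y                         # add: channel number stays 64; information per channel increases

print(z_concat.shape, z_add.shape)    # torch.Size([1, 128, 32, 32]) torch.Size([1, 64, 32, 32])
```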
8. Dimensionality reduction processing
The dimensionality reduction process is an operation of converting high-dimensional data into low-dimensional data. In this embodiment, the dimension reduction processing is mainly performed on the feature matrix. Specifically, the feature matrix may be reduced in dimension by a linear transformation layer. The dimension reduction process for the feature matrix can also be understood as reducing the dimension of the vector space corresponding to the feature matrix.
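For illustration only, a minimal sketch of dimensionality reduction of a feature matrix through a linear transformation layer (the dimensions 512 and 64 are assumptions):

```python
import torch
import torch.nn as nn

features = torch.randn(100, 512)   # feature matrix with dimension d = 512
reduce = nn.Linear(512, 64)        # linear transformation layer
reduced = reduce(features)         # low-dimensional feature matrix with d' = 64
```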
9. Region of interest
Region of interest (ROI): in machine vision and image processing, a region to be processed is outlined from a processed image in the form of a square, a circle, an ellipse, an irregular polygon, or the like.
The system architecture provided by the embodiments of the present application is described below.
Referring to fig. 1, a system architecture 100 is provided in accordance with an embodiment of the present invention. As shown in the system architecture 100, the data collection device 160 is configured to collect training data, which in this embodiment of the present application includes training images. Optionally, the training data may further include first features of the training images and detection frame information corresponding to objects in the training images. The training data is stored in the database 130, and the training device 120 trains a target model/rule 101 based on the training data maintained in the database 130. How the training device 120 derives the target model/rule 101 based on the training data will be described in more detail below; the target model/rule 101 can be used to implement the image processing method provided by the embodiment of the present application. The target model/rule 101 in the embodiment of the present application may specifically be a target neural network. It should be noted that, in practical applications, the training data maintained in the database 130 does not necessarily all come from the data collection device 160 and may also be received from other devices. It should also be noted that the training device 120 does not necessarily train the target model/rule 101 based on the training data maintained in the database 130, and may also obtain training data from the cloud or elsewhere for model training.
The target model/rule 101 obtained by training with the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 1. The execution device 110 may be a terminal, such as a mobile phone, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal; of course, the execution device 110 may also be a server or a cloud. In fig. 1, the execution device 110 is configured with an I/O interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 140. In this embodiment of the present application, the input data may include an image to be detected. In addition, the input data may be input by a user, uploaded by the user through a shooting device, or obtained from a database, which is not limited here.
The preprocessing module 113 is configured to perform preprocessing according to input data received by the I/O interface 112, and in this embodiment, the preprocessing module 113 may be configured to obtain features of an image to be detected. Optionally, the preprocessing module 113 may be further configured to obtain detection frame information corresponding to an object in the image to be detected.
In the process that the execution device 110 preprocesses the input data or in the process that the calculation module 111 of the execution device 110 executes the calculation or other related processes, the execution device 110 may call the data, the code, and the like in the data storage system 150 for corresponding processes, and may store the data, the instruction, and the like obtained by corresponding processes in the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the point set obtained as described above or an image including the point set, to the client device 140 to be provided to the user.
It should be noted that the training device 120 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 1, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the position relationship between the devices, modules, etc. shown in the diagram does not constitute any limitation, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 1, a target model/rule 101 is obtained according to training of the training device 120, and the target model/rule 101 in this embodiment may specifically be a target neural network.
A hardware structure of a chip provided in an embodiment of the present application is described below.
Fig. 2 is a hardware structure of a chip according to an embodiment of the present invention, where the chip includes a neural network processor 20. The chip may be provided in the execution device 110 as shown in fig. 1 to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 1 to complete the training work of the training apparatus 120 and output the target model/rule 101.
The neural network processor 20 may be any processor suitable for large-scale exclusive or operation processing, such as a neural-Network Processing Unit (NPU), a Tensor Processing Unit (TPU), or a Graphics Processing Unit (GPU). Taking NPU as an example: the neural network processor 20 is mounted as a coprocessor on a main Central Processing Unit (CPU) (host CPU), and tasks are allocated by the main CPU. The core portion of the NPU is an arithmetic circuit 203, and a controller 204 controls the arithmetic circuit 203 to extract data in a memory (weight memory or input memory) and perform an operation.
In some implementations, the arithmetic circuitry 203 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 203 is a two-dimensional systolic array. The arithmetic circuitry 203 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 203 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 203 fetches the data corresponding to the matrix B from the weight memory 202 and buffers it in each PE in the arithmetic circuit. The arithmetic circuit takes the matrix a data from the input memory 201 and performs matrix operation with the matrix B, and partial or final results of the obtained matrix are stored in the accumulator 208.
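For illustration only, the following toy sketch (not a model of the actual circuit) mimics how partial products of A and B are accumulated before the final output matrix C is available:

```python
import numpy as np

A = np.random.rand(4, 6)   # input matrix A
B = np.random.rand(6, 3)   # weight matrix B
C = np.zeros((4, 3))       # output matrix C, built up in an accumulator
for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])   # accumulate one partial product at a time
assert np.allclose(C, A @ B)          # the accumulated result equals the full matrix product
```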
The vector calculation unit 207 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 207 may be used for network calculations of non-convolution/non-FC layers in a neural network, such as Pooling (Pooling), Batch Normalization (Batch Normalization), Local Response Normalization (Local Response Normalization), and the like.
In some implementations, the vector calculation unit 207 can store the processed output vector to the unified buffer 206. For example, the vector calculation unit 207 may apply a non-linear function to the output of the arithmetic circuit 203, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 207 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 203, for example for use in subsequent layers in a neural network.
The unified memory 206 is used to store input data as well as output data.
The weight data is transferred directly to the weight memory 202 through the storage unit access controller (direct memory access controller, DMAC) 205, which is used to carry input data in the external memory to the input memory 201 and/or the unified memory 206, store the weight data in the external memory into the weight memory 202, and store data in the unified memory 206 into the external memory.
A Bus Interface Unit (BIU) 210, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 209 through a bus.
An instruction fetch buffer 209 coupled to the controller 204 is used to store instructions used by the controller 204.
The controller 204 is configured to call the instructions cached in the instruction fetch memory 209 to control the working process of the operation accelerator.
Generally, the unified memory 206, the input memory 201, the weight memory 202, and the instruction fetch memory 209 are On-Chip memories (On-Chip) and the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
Several application scenarios of the present application are presented next.
Fig. 3a is a schematic structural diagram of an image processing system according to an embodiment of the present application, where the image processing system includes a user device (a vehicle is taken as an example in fig. 3 a) and an image processing device. It can be understood that the user equipment may be a mobile phone, a vehicle-mounted terminal, an airplane terminal, a VR/AR device, an intelligent robot, and other intelligent terminals besides a vehicle. The user equipment is an initiating end of image processing, and as an initiator of an image processing request, a request is generally initiated by a user through the user equipment.
The image processing device may be a device or a server having an image processing function, such as a cloud server, a web server, an application server, or a management server. The image processing device receives the image processing request from the intelligent terminal through an interactive interface, and then performs image processing by means of machine learning, deep learning, searching, reasoning, decision making, and the like, using a memory that stores data and a processor that processes the images. The memory in the image processing device may be a general term that includes local storage and a database storing historical data; the database may be located on the image processing device or on another network server.
In the image processing system shown in fig. 3a, the user device may receive an instruction of a user, for example, the user device may obtain an image input/selected by the user (or an image captured by the user device through a camera), and then initiate a request to the image processing device, so that the image processing device executes an image processing application (for example, lane detection in the image, etc.) on the image obtained by the user device, thereby obtaining a corresponding processing result for the image. For example, the user device may obtain an image input by the user, initiate an image detection request to the image processing device, enable the image processing device to detect the image, obtain a detection result of the image (i.e., a point set of the lane line), and display the detection result of the image for the user to view and use.
In fig. 3a, the image processing apparatus may perform the image processing method of the embodiment of the present application.
Fig. 3b is another schematic structural diagram of the image processing system according to the embodiment of the present application, in fig. 3b, a user equipment (a vehicle is taken as an example in fig. 3 b) directly serves as the image processing equipment, and the user equipment can directly acquire an image and directly perform processing by hardware of the user equipment itself, and a specific process is similar to that in fig. 3a, and reference may be made to the above description, which is not repeated herein.
Alternatively, in the image processing system shown in fig. 3b, the user device may receive an instruction from the user, for example, the user device may obtain an image selected by the user in the user device, and then execute an image processing application (for example, lane line detection in the image, etc.) for the image by the user device itself, so as to obtain a corresponding processing result for the image, and display the processing result for the user to view and use.
Optionally, in the image processing system shown in fig. 3b, the user equipment may acquire an image of a road where the user equipment is located in real time or periodically, and then the user equipment itself executes an image processing application (for example, lane line detection in the image, and the like) on the image, so as to obtain a corresponding processing result for the image, and implement an intelligent driving function according to the processing result, for example: adaptive cruise, Lane Departure Warning (LDW), Lane Keeping Assist (LKA), and the like.
In fig. 3b, the user equipment itself can execute the image processing method according to the embodiment of the present application.
The user device in fig. 3a and fig. 3b may specifically be the client device 140 or the execution device 110 in fig. 1, and the image processing device in fig. 3a may specifically be the execution device 110 in fig. 1. The data storage system 150 may store data to be processed by the execution device 110; the data storage system 150 may be integrated on the execution device 110, or may be disposed on a cloud or another network server.
The processors in fig. 3a and 3b may perform data training/machine learning/deep learning through a neural network model or other models (e.g., models based on a support vector machine), and perform image processing application on the image using the model finally trained or learned by the data, thereby obtaining corresponding processing results.
The following describes a vehicle architecture in the above scenario. Referring to fig. 4, fig. 4 is a schematic structural diagram of a vehicle according to an embodiment of the present disclosure.
The vehicle may include various subsystems such as a travel system 402, a sensor system 404, a control system 406, one or more peripherals 408, as well as a power source 410 and a user interface 416. Alternatively, the vehicle may include more or fewer subsystems, and each subsystem may include multiple components. In addition, each of the sub-systems and components of the vehicle may be interconnected by wire or wirelessly (e.g., bluetooth).
The travel system 402 may include components that provide powered motion to the vehicle. In one embodiment, the travel system 402 may include an engine 418, an energy source 419, a transmission 420, and wheels 421.
The engine 418 may be an internal combustion engine, an electric motor, an air compression engine, or other types of engine combinations, such as a hybrid engine composed of a gasoline engine and an electric motor, and a hybrid engine composed of an internal combustion engine and an air compression engine. The engine 418 converts the energy source 419 to mechanical energy. Examples of energy sources 419 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electrical power. The energy source 419 may also provide energy to other systems of the vehicle. The transmission 420 may transmit mechanical power from the engine 418 to the wheels 421. The transmission 420 may include a gearbox, a differential, and a drive shaft. In one embodiment, the transmission 420 may also include other components, such as a clutch. Wherein the drive shaft may comprise one or more shafts that may be coupled to the wheels 421.
The sensor system 404 may include several sensors that sense information about the location of the vehicle. For example, sensor system 404 may include a positioning system 422 (e.g., a global positioning system, a Beidou system, or other positioning system), an Inertial Measurement Unit (IMU) 424, a radar 426, a laser range finder 428, and a camera 430. The sensor system 404 may also include sensors of internal systems of the monitored vehicle (e.g., an in-vehicle air quality monitor, a fuel gauge, an oil temperature gauge, etc.). The sensory data from one or more of these sensors may be used to detect the object and its corresponding characteristics (e.g., position, shape, orientation, velocity, etc.). Such detection and identification is a critical function of the safe operation of the autonomous vehicle.
The location system 422 may be used, among other things, to estimate the geographic location of the vehicle, such as latitude and longitude information of the location of the vehicle. The IMU 424 is used to sense position and orientation changes of the vehicle based on inertial acceleration rates. In one embodiment, IMU 424 may be a combination of an accelerometer and a gyroscope. The radar 426 may utilize radio signals to sense objects within the surrounding environment of the vehicle, which may be embodied as millimeter wave radar or lidar. In some embodiments, in addition to sensing objects, the radar 426 may also be used to sense the speed and/or heading of objects. The laser rangefinder 428 may use a laser to sense objects in the environment in which the vehicle is located. In some embodiments, the laser rangefinder 428 may include one or more laser sources, laser scanners, and one or more detectors, among other system components. The camera 430 may be used to capture multiple images of the surroundings of the vehicle. The camera 430 may be a still camera or a video camera.
The control system 406 is for controlling the operation of the vehicle and its components. The control system 406 may include various components including a steering system 432, a throttle 434, a brake unit 436, an electronic control unit 438 (ECU), and a vehicle control unit 440 (BCM).
The steering system 432 is operable to adjust the heading of the vehicle; for example, in one embodiment it may be a steering wheel system. The throttle 434 is used to control the operating rate of the engine 418 and thus the speed of the vehicle. The brake unit 436 is used to control deceleration of the vehicle; it may use friction to slow the wheels 421, in other embodiments it may convert the kinetic energy of the wheels 421 into an electric current, and it may take other forms to slow the rotational speed of the wheels 421 so as to control the speed of the vehicle. The vehicle electronic control unit 438 may be implemented as a single ECU or multiple ECUs on the vehicle that are configured to communicate with the peripheral devices 408 and the sensor system 404. The vehicle ECU 438 may include at least one processor 4381 and a memory 4382 (ROM). The at least one processor may be implemented or performed with one or more general-purpose processors, content-addressable memories, digital signal processors, application-specific integrated circuits, field-programmable gate arrays, any suitable programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. In particular, the at least one processor may be implemented as one or more microprocessors, controllers, microcontroller units (MCUs), or state machines. Further, the at least one processor may be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other such configuration. The ROM may provide storage of data, including storage of addresses, routes, and directions of travel, as described herein.
The BCM 440 may provide the ECU 438 with information about the vehicle engine state, speed, gear, steering wheel angle, and the like.
The vehicle interacts with external sensors, other vehicles, other computer systems, or users through peripherals 408. Peripheral devices 408 may include a wireless communication system 446, a navigation system 448, a microphone 450, and/or a speaker 452. In some embodiments, the peripheral devices 408 provide a means for a user of the vehicle to interact with the user interface 416. For example, the navigation system 448 may be implemented as part of an in-vehicle entertainment system, an in-vehicle display system, a cluster of in-vehicle instruments, and so forth. In a practical embodiment, the navigation system 448 is implemented to include or cooperate with the sensor system 404, which sensor system 404 derives the current geographic location of the vehicle in real time or substantially in real time. The navigation system 448 is configured to provide navigation data to a driver of the vehicle. The navigation data may include position data for the vehicle, proposed route planning driving instructions, and visible map information to the vehicle operator. The navigation system 448 may present the location data to the driver of the vehicle via a display element or other presentation device. The current position of the vehicle may be described by one or more of the following information: a triangulated position, a latitude/longitude position, x and y coordinates, or any other symbol or any manner of measurement that indicates the geographic location of the vehicle.
The user interface 416 may also operate the navigation system 448 to receive user inputs. The navigation system 448 may operate via a touch screen and provides route planning and navigation capabilities when a user enters geographic location values for a start point and an end point. In other cases, the peripheral devices 408 may provide a means for the vehicle to communicate with other devices located within the vehicle. For example, the microphone 450 may receive audio (e.g., voice commands or other audio input) from a user of the vehicle; similarly, the speaker 452 may output audio to a user of the vehicle. The wireless communication system 446 may communicate wirelessly with one or more devices, directly or via a communication network. For example, the wireless communication system 446 may use 3G cellular communication, such as code division multiple access (CDMA), EVDO, or global system for mobile communications (GSM)/general packet radio service (GPRS), 4G cellular communication such as long term evolution (LTE), or 5G cellular communication. The wireless communication system 446 may communicate with a wireless local area network (WLAN) using WiFi. In some embodiments, the wireless communication system 446 may communicate directly with devices using an infrared link, Bluetooth, or ZigBee, or use other wireless protocols such as various vehicular communication systems; for example, the wireless communication system 446 may include one or more dedicated short range communications (DSRC) devices, which may support public and/or private data communications between vehicles and/or roadside stations.
The power supply 410 may provide power to various components of the vehicle. In one embodiment, power source 410 may be a rechargeable lithium ion or lead acid battery. One or more battery packs of such batteries may be configured as a power source to provide power to various components of the vehicle. In some embodiments, the power source 410 and the energy source 419 may be implemented together, such as in some all-electric vehicles.
Optionally, one or more of these components described above may be mounted or associated separately from the vehicle. For example, memory 4382 may exist partially or completely separate from the vehicle. The above components may be communicatively coupled together in a wired and/or wireless manner.
Optionally, the above components are only an example, in an actual application, components in the above modules may be added or deleted according to an actual need, and fig. 4 should not be construed as limiting the embodiment of the present application.
The vehicle may be a car, a truck, a motorcycle, a bus, a boat, a lawn mower, an amusement ride, a playground vehicle, construction equipment, a trolley, a golf cart, or the like, and the embodiment of the present invention is not particularly limited.
The following describes an image processing method provided in an embodiment of the present application. The method may be performed by an image processing apparatus, or may be performed by a component of an image processing apparatus (e.g., a processor, a chip, or a system of chips, etc.). The image processing device may be a cloud device (as shown in fig. 3 a), or may be a vehicle (for example, the vehicle shown in fig. 4) or a terminal device (for example, an in-vehicle terminal, an airplane terminal, etc.) (as shown in fig. 3 b). Of course, the method may also be performed by a system composed of the cloud device and the vehicle (as shown in fig. 3 a). Optionally, the method may be processed by a CPU in the image processing device, or may be processed by both the CPU and the GPU, or may use other processors suitable for neural network computing instead of the GPU, which is not limited in this application.
An application scenario of the method (which can also be understood as an application scenario of the first neural network or the target neural network) is the smart driving scenario, for example scenarios that involve lane line detection, such as adaptive cruise, Lane Departure Warning (LDW), Lane Keeping Assist (LKA), and the like. In a smart driving scenario, the image processing method provided by the embodiment of the application can acquire an image to be detected through a sensor (such as a camera) on a vehicle, obtain the lane lines in the image to be detected, and further realize adaptive cruise, LDW, LKA, and the like.
In this embodiment of the present application, according to whether the image processing device is a cloud device or a user device, the image processing method provided in this embodiment of the present application may include two cases, which are described below respectively.
In the first case, the image processing device is a user device, and here, only the user device is a vehicle (as in the scenario of fig. 3b described above). It can be understood that the user equipment may be a vehicle, and may also be an intelligent terminal such as a mobile phone, a vehicle-mounted terminal, an airplane terminal, a VR/AR device, and an intelligent robot, which is not limited herein.
Referring to fig. 5, a flowchart of an image processing method implemented by a target neural network according to an embodiment of the present disclosure may include steps 501 to 504. The following describes steps 501 to 504 in detail.
Step 501, acquiring an image to be detected.
In this embodiment of the application, the image processing device may acquire the image to be detected in multiple ways: the image may be captured by the image processing device itself, transmitted from another device, selected from a database (for example, as training data), and so on; the specific manner is not limited here.
Optionally, the image to be detected comprises at least one object of a car, a person, an object, a tree, a logo, etc.
For example, in the field of smart driving, the image processing device may be a vehicle, and a sensor (e.g., a camera) on the vehicle captures the image. It will be appreciated that the sensor on the vehicle may capture images in real time or periodically, for example acquiring an image every 0.5 seconds, which is not limited here.
Step 502, performing feature extraction on an image to be detected to obtain a first feature.
After the image processing device acquires the image to be detected, the first characteristic of the image to be detected can be acquired. Specifically, feature extraction is performed on an image to be detected to obtain a first feature. It is understood that the features mentioned in the embodiments of the present application can be expressed in the form of matrix or vector.
Optionally, the image processing device may perform feature extraction on the image to be detected through a backbone network to obtain the first feature. The backbone network may be a convolutional neural network, a graph convolutional network (GCN), a recurrent neural network, or another network having the function of extracting image features, which is not limited here.
Further, in order to obtain multi-level features of the image to be detected, the image processing device may perform feature fusion and dimension reduction processing on features output by different layers in the backbone network to obtain the first feature. Here, the feature output by different layers may also be understood as an intermediate feature (which may also be referred to as at least one third feature) in the process of calculating the first feature, and the number of the third features is related to the number of layers of the backbone network, for example: the number of the third features is the same as the number of layers of the backbone network, or the number of the third features is the number of layers of the backbone network minus 1.
Because different layers of a neural network extract features with different characteristics, lower-layer features have higher resolution and contain more position and detail information, but, having passed through fewer convolutions, they carry weaker semantics and more noise, whereas higher-layer features have stronger semantic information but lower resolution and poorer perception of detail. Therefore, feature fusion is performed on the features extracted from different layers of the backbone network to obtain a fused feature (denoted as H_f) that has multi-level characteristics. Furthermore, dimensionality reduction is performed on the fused feature to obtain the first feature (denoted as H'_f), so the first feature also has multi-level characteristics. Here, H_f ∈ R^{h×w×d}, where h is the height of H_f, w is the width of H_f, and d is the dimension of H_f. For example, d of H_f is reduced to d' through a linear transformation layer, i.e., H'_f ∈ R^{h×w×d'}.
Illustratively, the above-mentioned backbone network is a residual convolutional neural network (ResNet50) with 50 layers.
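For illustration only, the following sketch outlines step 502 under assumed shapes: multi-level (third) features are taken from a ResNet-50 backbone, fused, and then reduced from dimension d to d'. The fusion by interpolation and concatenation and the 1×1 convolution used for dimensionality reduction are illustrative choices, not necessarily the exact design of the embodiment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

backbone = torchvision.models.resnet50(weights=None)
image = torch.randn(1, 3, 224, 224)                    # image to be detected

x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(image))))
c2 = backbone.layer1(x)                                # intermediate (third) features from different layers
c3 = backbone.layer2(c2)
c4 = backbone.layer3(c3)
c5 = backbone.layer4(c4)

# Feature fusion: resize the deeper features to the c3 resolution and concatenate them.
size = c3.shape[-2:]
fused = torch.cat([F.interpolate(c, size=size) for c in (c3, c4, c5)], dim=1)   # H_f with dimension d

# Dimensionality reduction: a 1x1 convolution acting as a linear transformation layer, d -> d'.
reduce = nn.Conv2d(fused.shape[1], 256, kernel_size=1)
first_feature = reduce(fused)                          # H'_f with d' = 256
```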
Step 503, processing the detection frame information of the image to be detected to obtain a second feature.
After the image to be detected is obtained, the image processing device may obtain the detection frame information of the image to be detected based on a human-vehicle detection model. Specifically, the image to be detected is input into the human-vehicle detection model to obtain the detection frame information, where the detection frame information includes the position of the detection frame of at least one object in the image to be detected. The human-vehicle detection model may be a regional convolutional neural network (R-CNN), a fast regional convolutional neural network (Fast R-CNN), a faster regional convolutional neural network (Faster R-CNN), or the like, which is not limited here. The above-mentioned object may include at least one of a vehicle, a person, an object, a tree, a sign, and the like in the image to be detected, which is not limited here. It can be understood that the position of the detection frame may be a normalized position.
It can be understood that, if the detection frame information of more objects in the image to be detected is acquired, the expression capability of the acquired second feature is stronger.
Optionally, the detection box information may further include a category and a confidence of the detection box.
After the image processing device acquires the detection frame information, the image processing device may process the detection frame information to obtain a second feature, where the second feature may also be understood as a detection frame feature of the image to be detected, and the second feature includes a position feature and a semantic feature of a detection frame corresponding to an object in the image to be detected. Wherein the position feature can be noted as ZbSemantic features can be denoted as Zr
Optionally, the at least one third feature and the detection frame information are processed to obtain a second feature. The at least one third feature is an intermediate feature obtained during the process of obtaining the first feature (such as the intermediate feature in step 502 described above). Specifically, the detection frame information and the intermediate features are input into a preprocessing module to obtain the position features and the semantic features.
Alternatively, if the backbone network adopts a Feature Pyramid Network (FPN) structure, the second feature may be obtained based on processing the at least one third feature and the detection box information. If the backbone network does not adopt the FPN structure, the first feature before dimensionality reduction and the detection frame information can be used for acquiring the second feature.
In the embodiment of the present application, based on the difference of the detection frame information, the process of specifically acquiring the second feature (which may also be understood as the function of the preprocessing module) is different, and the following description is respectively given:
1. the detection frame information includes only the position of the detection frame.
The process of obtaining semantic features may include: scaling the detection frame according to the position of the detection frame and the sampling rates between different layers in the backbone network; using the scaled detection frame to extract the ROI feature from the feature layer of the corresponding sampling rate among the intermediate features; and processing the ROI feature (for example, through a fully connected layer, or through a single-layer perceptron and an activation layer) to obtain the semantic features of the detection frames: Z_r ∈ R^{M×d'}, where M is the number of detection frames in the image to be detected.
The process of acquiring the position features may include: processing the vector corresponding to the position of the detection frame (for example, through a fully connected layer, or through a single-layer perceptron and an activation layer) to obtain the position features of the detection frames: Z_b ∈ R^{M×d'}.
For example, assuming that the backbone network is a 5-layer neural network and the down-sampling rate of the third layer is 8, the original detection frame is scaled down by a factor of 8. In general, the larger the area of the detection frame, the smaller (i.e., later) the feature layer from which the ROI feature is extracted.
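For illustration only, the following sketch (feature shapes, the chosen feature layer, and d' are assumptions) corresponds to case 1: the detection frames are scaled to a feature layer with down-sampling rate 8, ROI features are extracted and passed through a fully connected layer to give Z_r, and the box positions are passed through another fully connected layer to give Z_b:

```python
import torch
import torch.nn as nn
import torchvision.ops as ops

feat_c3 = torch.randn(1, 256, 40, 100)                 # an intermediate feature layer, down-sampling rate 8
boxes = torch.tensor([[0., 80., 60., 240., 180.],      # (batch index, x1, y1, x2, y2) in image coordinates
                      [0., 300., 90., 420., 200.]])
M, d_prime = boxes.shape[0], 128                       # M detection frames, target dimension d'

# Scale the boxes to the feature layer (spatial_scale = 1/8) and extract the ROI features.
roi_feat = ops.roi_align(feat_c3, boxes, output_size=(7, 7), spatial_scale=1.0 / 8)

semantic_head = nn.Sequential(nn.Flatten(), nn.Linear(256 * 7 * 7, d_prime), nn.ReLU())
Z_r = semantic_head(roi_feat)                          # semantic features, Z_r with shape (M, d')

position_head = nn.Sequential(nn.Linear(4, d_prime), nn.ReLU())
Z_b = position_head(boxes[:, 1:])                      # position features, Z_b with shape (M, d')
```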
2. The detection frame information includes the position and confidence of the detection frame.
The process of obtaining semantic features may include: scaling the detection frame according to the position of the detection frame and the sampling rates between different layers in the backbone network; using the scaled detection frame to extract the ROI feature from the feature layer of the corresponding sampling rate among the intermediate features; taking the confidence of the detection frame as a coefficient and multiplying it by the extracted ROI feature; and processing the multiplied feature (for example, through a fully connected layer, or through a single-layer perceptron and an activation layer) to obtain the semantic features of the detection frames: Z_r ∈ R^{M×d'}, where M is the number of detection frames in the image to be detected.
The process of acquiring the position features may include: processing the vector corresponding to the position of the detection frame (for example, through a fully connected layer, or through a single-layer perceptron and an activation layer) to obtain the position features of the detection frames: Z_b ∈ R^{M×d'}.
3. The detection frame information includes the position, confidence and category of the detection frame.
The process of obtaining semantic features may include: scaling the detection frame according to the position of the detection frame and the sampling rates between different layers in the backbone network; using the scaled detection frame to extract the ROI feature from the feature layer of the corresponding sampling rate in the first feature; taking the confidence of the detection frame as a coefficient and multiplying it by the extracted ROI feature; and processing the multiplied feature (for example, through a fully connected layer, or through a single-layer perceptron and an activation layer) to obtain the semantic features of the detection frames: Z_r ∈ R^{M×d'}, where M is the number of detection frames in the image to be detected.
The process of acquiring the position features may include: transforming the category of the detection frame into a category vector, splicing the category vector with the vector corresponding to the position of the detection frame, and processing the spliced vector (for example, through a fully connected layer, or through a single-layer perceptron and an activation layer) to obtain the position features of the detection frames: Z_b ∈ R^{M×d'}. The category of the detection frame may be encoded using an encoding method such as one-hot coding to obtain the category vector.
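For illustration only, the following sketch corresponds to case 3 (all sizes and class labels are assumptions): the confidence is used as a coefficient on the ROI feature before the semantic head, and the one-hot category vector is spliced with the box position before the position head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, d_prime, num_classes = 2, 128, 3
roi_feat = torch.randn(M, 256 * 7 * 7)                 # ROI features extracted as in case 1, then flattened
confidence = torch.tensor([[0.9], [0.6]])              # confidence of each detection frame
classes = torch.tensor([0, 2])                         # assumed class indices of the detection frames
positions = torch.rand(M, 4)                           # normalized positions of the detection frames

semantic_head = nn.Sequential(nn.Linear(256 * 7 * 7, d_prime), nn.ReLU())
Z_r = semantic_head(confidence * roi_feat)             # confidence acts as a coefficient on the ROI feature

class_vec = F.one_hot(classes, num_classes).float()    # one-hot category vector
position_head = nn.Sequential(nn.Linear(4 + num_classes, d_prime), nn.ReLU())
Z_b = position_head(torch.cat([positions, class_vec], dim=1))   # spliced, then processed
```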
It should be understood that the above-mentioned several cases of the detection frame information and several specific processes for obtaining the second feature are only examples, and in practical applications, the detection frame may have other cases (for example, the detection frame information includes the position and the category of the detection frame), and the second feature may also be obtained in other manners, which is not limited herein.
For example, the process of obtaining the second feature may be as shown in fig. 6. The step executed by the detection preprocessing module refers to the above description of the process of obtaining the second feature, and is not described herein again.
Step 504, inputting the first feature and the second feature into a first neural network based on a transformer structure to obtain the lane lines in the image to be detected.
After the image processing device acquires the first feature and the second feature, the first feature and the second feature may be input to a first neural network based on a transform structure, so as to obtain a lane line in the image to be detected. Specifically, a plurality of point sets may be obtained first, and then the lane line may be determined based on the plurality of point sets. Each point set in the plurality of point sets represents a lane line in the image to be detected.
Optionally, the first neural network based on the transformer structure includes an encoder, a decoder, and a feedforward neural network. Acquiring the plurality of point sets may include the following steps: inputting the first feature and the second feature into the encoder to obtain a fourth feature; inputting the fourth feature, the second feature, and the query feature into the decoder to obtain a fifth feature; and inputting the fifth feature into the feedforward neural network to obtain the plurality of point sets. These steps are described below with reference to the accompanying drawings for the cases discussed later.
Optionally, the first feature and the second feature may be input into a trained first neural network, resulting in a plurality of point sets. The trained first neural network is obtained by taking training data as input of the first neural network and taking the value of a first loss function smaller than a first threshold value as a target to train the first neural network, wherein the training data comprises a first feature of a training image, and a position feature and a semantic feature of an object in the training image corresponding to a detection box, the first loss function is used for representing the difference between a point set output by the first neural network in a training process and the first point set, and the first point set is a real point set of an actual lane line in the training image.
Further, the first neural network comprises a transform structure and a feedforward neural network. The first feature and the second feature may be processed through a transformer structure to obtain a fifth feature. And inputting the fifth characteristic into a feedforward neural network to obtain a plurality of point sets. It is understood that the feedforward neural network may be replaced by a fully-connected layer, a convolutional neural network, and the like, and is not limited herein.
In this embodiment of the application, the transformer structure differs depending on the input of the first neural network; accordingly, the steps for obtaining the fifth feature also differ. The two cases are described separately below.
Mode 1: the first neural network is shown in fig. 7, and the transformer structure is shown in fig. 8.
In a possible implementation manner, in order to more intuitively see the process of acquiring the fifth feature based on the first feature and the second feature, reference may be made to fig. 7. The first neural network comprises a transform structure and a feedforward neural network. And inputting the first characteristic and the second characteristic into an encoder of a transform structure to obtain a fourth characteristic. Inputting the query feature, the second feature and the fourth feature into a decoder of a transform structure to obtain a fifth feature.
The transform structure in this case can be as shown in fig. 8, where the encoder of the transform structure includes a first self-attention module and a first attention module, and the decoder of the transform structure includes a second attention module and a third attention module.
Optionally, the decoder may further comprise a second self-attention module (not shown in fig. 8) for computing the query features. Specifically, the query vector is subjected to self-attention calculation to obtain query features. The query vector is initialized to a random value, and a fixed value is obtained in the training process. And the fixed value is used in the reasoning process, i.e. the query vector is a fixed value obtained by training the random value in the training process.
Under this configuration, the first self-attention module performs self-attention calculation on the first feature (H'_f) to obtain a first output (O_f). The first attention module performs cross-attention calculation on the first feature (H'_f) and the second feature (Z_r and Z_b) to obtain a second output (O_p2b). A fourth feature is obtained based on the first output (O_f) and the second output (O_p2b). The second attention module performs cross-attention calculation on the query feature (Q_q) and the fourth feature to obtain a third output. The query feature (Q_q) and the second feature (Z_r and Z_b) are processed to obtain a fourth output. The third output and the fourth output are added to obtain the fifth feature. The query feature is obtained by performing self-attention calculation on the query vector.
Optionally, the step of performing self-attention calculation on the first feature (H'_f) by the first self-attention module to obtain the first output (O_f) may specifically be as follows: since this is self-attention, the Q, K, and V inputs are consistent (i.e., all are H'_f); that is, Q, K, and V are obtained from the first feature (H'_f) through three linear transformations, and O_f is calculated based on Q, K, and V. For a description of self-attention, reference may be made to the foregoing description of the self-attention mechanism, which is not repeated here. In addition, it can be understood that a position matrix of the first feature may be introduced during the self-attention calculation, which is described in formula one below and not expanded here.
Optionally, the specific step of obtaining the fourth feature based on O_f and O_p2b may be: adding the first output and the second output to obtain the fourth feature.
Further, as shown in fig. 9, the step of obtaining the fourth feature based on the first output (O_f) and the second output (O_p2b) may specifically be: the first output and the second output are added, and the result of the addition is added to the first feature and normalized to obtain an output. On the one hand, this output is input into the feedforward neural network to obtain the output result of the feedforward neural network; the output obtained by the addition and normalization and the output result of the feedforward neural network are then added and normalized to obtain the fourth feature.
Optionally, the step of performing cross-attention calculation on H'_f, Z_r, and Z_b by the first attention module may specifically be: taking H'_f as Q, Z_b as K, and Z_r as V, and performing cross-attention calculation to obtain the second output (O_p2b).
Optionally, the step of performing cross-attention calculation on Q_q and the fourth feature by the second attention module may specifically be: taking Q_q as Q and the fourth feature as K and V, and performing cross-attention calculation to obtain the third output.
Further, as shown in fig. 10, the step of processing the query feature and the second feature to obtain the fourth output may specifically be: the third attention module performs cross-attention calculation on Q_q, Z_r, and Z_b to obtain a sixth output; specifically, Q_q may be taken as Q, Z_b as K, and Z_r as V for the cross-attention calculation. The query feature is added to the sixth output, and the result of the addition is added to the query vector and normalized to obtain an output. On the one hand, this output is input into the feedforward neural network to obtain the output result of the feedforward neural network; the output obtained by the addition and normalization and the output result of the feedforward neural network are then added and normalized to obtain the fourth output.
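For illustration only, the following simplified sketch reproduces the mode-1 data flow described above using standard multi-head attention layers; the add & normalization and feedforward sublayers of fig. 9 and fig. 10 are omitted, and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

d, n_queries = 128, 20
H_f = torch.randn(1, 40 * 100, d)        # first feature H'_f flattened into a token sequence
Z_b = torch.randn(1, 5, d)               # position features of 5 detection frames
Z_r = torch.randn(1, 5, d)               # semantic features of the same detection frames
query = torch.randn(1, n_queries, d)     # query vector (a fixed value after training)

make_attn = lambda: nn.MultiheadAttention(d, num_heads=8, batch_first=True)
first_self_attn, first_attn = make_attn(), make_attn()
second_self_attn, second_attn, third_attn = make_attn(), make_attn(), make_attn()

O_f, _ = first_self_attn(H_f, H_f, H_f)            # self-attention on the first feature
O_p2b, _ = first_attn(H_f, Z_b, Z_r)               # Q = H'_f, K = Z_b, V = Z_r
fourth = O_f + O_p2b                               # fourth feature (encoder output)

Q_q, _ = second_self_attn(query, query, query)     # query feature
third_out, _ = second_attn(Q_q, fourth, fourth)    # Q = Q_q, K = V = fourth feature
sixth_out, _ = third_attn(Q_q, Z_b, Z_r)           # Q = Q_q, K = Z_b, V = Z_r
fifth = third_out + sixth_out                      # fifth feature, fed to the feedforward head
```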
It should be noted that, in this embodiment, a feature used as Q in the attention calculation process (for example, Q_q) may have the position matrix of that feature added to it. The position matrix may be obtained by static position coding or dynamic position coding; for example, it may be obtained by calculating the absolute position of the first feature in the corresponding feature map, which is not limited here.
Illustratively, the calculation formulas of the first output (O_f) and the second output (O_p2b) are as follows:
Formula one:
O_f = softmax((H'_f + E_f) · H'_f^T / √d) · H'_f
Formula two:
O_p2b = softmax((H'_f + E_f) · Z_b^T / √d) · Z_r
where, taking formula one and formula two as examples, E_f is the position matrix of the first feature (H'_f). The position matrix is calculated using sine and cosine functions, taking the following formula three and formula four as an example:
Formula three:
E(i, 2j) = sin(i / 10000^(2j/d))
Formula four:
E(i, 2j+1) = cos(i / 10000^(2j/d))
Here, formula three is used for even column positions and formula four for odd column positions; i is the row position of an element in the position matrix, 2j or 2j+1 is its column position, and d is the dimension of the position matrix. For a more straightforward understanding of formula three and formula four, suppose an element is in row 2, column 3; the position vector of this element is then calculated by formula four, where i is 2, j is 1, and d is 3.
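For illustration only, the following sketch computes a position matrix with formulas three and four (the matrix size is an arbitrary assumption):

```python
import math
import torch

def position_matrix(rows, d):
    # Sine for even column indices (formula three), cosine for odd ones (formula four).
    E = torch.zeros(rows, d)
    for i in range(rows):
        for col in range(0, d, 2):                     # col is the even column index 2j
            E[i, col] = math.sin(i / 10000 ** (col / d))
            if col + 1 < d:
                E[i, col + 1] = math.cos(i / 10000 ** (col / d))
    return E

E_f = position_matrix(rows=16, d=8)   # e.g. a position matrix for 16 rows with dimension 8
```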
It should be understood that the above formula one, formula two, formula three, and formula four are only examples, and in practical applications, there may be other forms of formulas, and the specific details are not limited herein.
Mode 2: the first neural network is shown in fig. 11, and the transformer structure is shown in fig. 12.
In another possible implementation manner, please refer to fig. 11. Fig. 11 differs from fig. 7 in that the input of the encoder in fig. 7 includes the first feature and the second feature, whereas the input of the encoder in fig. 11 includes the first feature, the first row feature, the first column feature, and the second feature. That is, compared with the encoder in fig. 7, the encoder in fig. 11 additionally takes the first row feature and the first column feature as input.
The transformer structure in this case is shown in fig. 12; in addition to the structure shown in fig. 8, the encoder includes a row-column attention module. That is, the encoder of the transformer structure shown in fig. 12 includes a row-column attention module, a first self-attention module, and a first attention module, and the decoder includes a second self-attention module, a second attention module, and a third attention module. The row-column attention module comprises a row attention module and a column attention module.
Under this configuration, the row attention module performs self-attention calculation on the first row feature (H'_r) to obtain a row output, and the column attention module performs self-attention calculation on the first column feature (H'_c) to obtain a column output. A row-column output is obtained based on the row output and the column output. The first self-attention module performs self-attention calculation on the first feature (H'_f) to obtain a first output (O_f). The first attention module performs cross-attention calculation on the first feature (H'_f) and the second feature (Z_r and Z_b) to obtain a second output (O_p2b). A fourth feature is obtained based on the row-column output, the first output (O_f), and the second output (O_p2b). The second self-attention module performs self-attention calculation on the query vector to obtain the query feature (Q_q). The second attention module performs cross-attention calculation on the query feature (Q_q) and the fourth feature to obtain a third output. The third attention module processes the query feature (Q_q) and the second feature (Z_r and Z_b) to obtain a fourth output. The third output and the fourth output are added to obtain the fifth feature.
It is understood that, the above-mentioned partial steps and related structures may refer to the description similar to the embodiment shown in fig. 8, and are not described again here.
Optionally, fig. 13 shows the specific structure of the row-column attention module. The step of obtaining the row-column output based on the row output and the column output may specifically be as follows. The row output and the first row feature are added and normalized (add & norm for short) to obtain an output; this output is fed into a feedforward network, and the output obtained by the add & norm and the output result of the feedforward network are then added and normalized to obtain the output of the row branch. Similarly, the column output and the first column feature are added and normalized to obtain an output; this output is fed into a feedforward network, and the output obtained by the add & norm and the output result of the feedforward network are then added and normalized to obtain the output of the column branch. The output of the row branch and the output of the column branch are spliced to obtain the row-column output.
Optionally, the first row feature, the first column feature, the row output, and the column output are described here. After the first feature is obtained, the first feature may be flattened in the row dimension to obtain H_r ∈ R^{h×1×wd}, which is then processed (for example, by a fully connected layer plus dimensionality reduction, or by a single-layer perceptron, an activation layer, and dimensionality reduction) to obtain the first row feature H'_r ∈ R^{h×1×d'}. Flattening in the row dimension can also be understood as flattening or compressing the matrix corresponding to the first feature along the row direction to obtain H_r. Similarly, the first feature is flattened in the column dimension to obtain H_c ∈ R^{1×w×hd}, which is processed to obtain the first column feature H'_c ∈ R^{1×w×d'}.
Optionally, the step of obtaining the fourth feature based on the row-column output, the first output (O_f), and the second output (O_p2b) may specifically be: adding the first output and the second output to obtain a fifth output; and splicing the fifth output and the row-column output to obtain the fourth feature.
Illustratively, the row output (O_row) and the column output (O_column) are given by formula five and formula six, respectively (both formulas are provided as images in the original publication and are not reproduced here).
In formula five and formula six, E_r is the position matrix of the first row feature (H'_r), and E_c is the position matrix of the first column feature (H'_c). The position matrices may be obtained by static position coding or dynamic position coding, which is not limited here.
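One common form of static position coding is the sinusoidal encoding shown below, given as an assumed example of how a position matrix such as E_r or E_c could be built; the lengths and dimension are illustrative only.

```python
import math
import torch

def sinusoidal_positions(length: int, dim: int) -> torch.Tensor:
    """Return a (length, dim) position matrix with interleaved sine/cosine terms."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)            # (length, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                           # (dim/2,)
    enc = torch.zeros(length, dim)
    enc[:, 0::2] = torch.sin(pos * div)
    enc[:, 1::2] = torch.cos(pos * div)
    return enc

E_r = sinusoidal_positions(length=20, dim=256)   # position matrix for the first row feature
E_c = sinusoidal_positions(length=50, dim=256)   # position matrix for the first column feature
```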
It should be understood that the fifth formula and the sixth formula are only examples, and in practical applications, other formulas may be used, and are not limited herein.
It should be noted that the above cases of the transformer structure and the above manners of obtaining the fifth feature are only examples. In practical applications, the transformer structure may take other forms, and the fifth feature may be obtained in other manners, which are not limited here.
After the image processing device obtains the fifth feature in any of the manners described above, the fifth feature may be input into a feedforward neural network to obtain a plurality of point sets, and a lane line in the image to be detected is determined based on the plurality of point sets. It is understood that the feedforward neural network may be replaced by a fully connected layer, a convolutional neural network, or the like, which is not limited here.
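A sketch of such a feed-forward head is shown below. The output layout (72 x-coordinates plus a start and an end y coordinate per lane query, matching the (X, s, e) representation discussed next) and the layer sizes are assumptions, not the exact head of this application.

```python
import torch
import torch.nn as nn

d_model, num_queries, num_rows = 256, 10, 72

point_head = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, num_rows + 2))            # 72 x-coordinates + start y + end y

fifth_feature = torch.randn(1, num_queries, d_model)   # (B, num_queries, d_model)
point_sets = point_head(fifth_feature)                 # (B, num_queries, 74): one set per query
```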
To understand the point-set acquisition process more intuitively, take fig. 14a as an example. The lane line l shown in fig. 14a is represented as (X, s, e), where X is the set of x coordinates of the intersections between the lane line and equally spaced y-direction straight lines (e.g., 72 lines), s is the y coordinate of the start point, and e is the y coordinate of the end point. It is understood that the number of lane lines and the number of y-direction lines in fig. 14a are only examples and are not limited here.
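The following sketch decodes one predicted lane (X, s, e) into pixel points, assuming 72 equally spaced y-direction lines over an image of height 590; all sizes here are illustrative assumptions.

```python
import numpy as np

def decode_lane(x_coords, start_y, end_y, img_height=590, num_rows=72):
    """Turn a lane (X, s, e) into a list of (x, y) points between s and e."""
    ys = np.linspace(0, img_height - 1, num_rows)        # equally spaced y-direction lines
    return [(float(x), float(y))
            for x, y in zip(x_coords, ys)
            if start_y <= y <= end_y]                    # keep rows the lane actually covers

lane_points = decode_lane(np.linspace(300.0, 800.0, 72), start_y=250.0, end_y=589.0)
```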
In one possible implementation, the plurality of point sets may be presented as an array. In another possible implementation, the plurality of point sets may also be presented as an image, for example, the plurality of point sets shown in fig. 14b. The plurality of point sets may further be overlaid and fused with the image to be detected to obtain an image to be detected carrying the plurality of point sets, for example, as shown in fig. 14c. The present embodiment does not limit the presentation manner of the plurality of point sets.
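The following is a sketch of the overlay-and-fuse presentation mentioned above; the use of OpenCV and the drawing parameters are assumptions for illustration.

```python
import cv2
import numpy as np

def draw_point_sets(image: np.ndarray, point_sets, color=(0, 255, 0)) -> np.ndarray:
    """Draw each point set (one lane line) onto a copy of the image."""
    out = image.copy()
    for points in point_sets:
        for x, y in points:
            cv2.circle(out, (int(round(x)), int(round(y))), 3, color, -1)
    return out

# fused = draw_point_sets(image_to_detect, all_point_sets)   # image carrying the point sets
```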
To see more intuitively the contribution of the first row feature and the first column feature to lane line detection, refer to fig. 14d. It can be seen that introducing the first row feature and the first column feature, which mine context information in a manner that conforms to the elongated shape of a lane line, improves the network's ability to construct long-strip lane line features, thereby achieving a better lane line detection effect.
In the embodiment of the application, on the one hand, applying the transformer structure to the lane line detection task makes it possible to acquire the global information of the image to be detected and thus effectively model the long-distance relation between lane lines. On the other hand, adding the position information of the detection frames of objects in the image as an input to the lane line detection network improves the scene perception capability of the network and reduces misjudgment of the model in scenes where the lane line is occluded by vehicles. On another hand, introducing a row-column self-attention module that mines context information in a manner conforming to the shape of a lane line into the encoder of the transformer improves the network's ability to construct long-strip lane line features, thereby achieving a better lane line detection effect. On yet another hand, the modules in existing automatic driving systems are often independent of each other; for example, the lane line detection model and the person-vehicle detection model are independent and make predictions separately. In the image processing method provided by this embodiment, the target neural network predicts the lane line by using the detection frame information obtained from the person-vehicle detection model in the first neural network, which improves the accuracy of lane line detection.
In the second case, the image processing device is a cloud server (as in the scenario of fig. 3a described earlier). It is understood that, in this case, the image processing device may also be another device or server having an image processing function, such as a network server, an application server, or a management server, and taking the vehicle as the user device is only an example, which is not limited here.
Referring to fig. 15, a flowchart of an image processing method according to an embodiment of the present application may include steps 1501 to 1505. Step 1501 to step 1505 are explained in detail below.
Step 1501, the vehicle acquires an image to be detected.
Optionally, the vehicle may capture the image to be detected with a sensor (e.g., a camera) on the vehicle. Of course, the sensors on the vehicle may also capture images periodically.
It is understood that the vehicle may also be obtained by receiving an image to be detected sent by other devices, and is not limited herein.
Step 1502, the vehicle sends an image to be detected to a server. Correspondingly, the server receives the image to be detected sent by the vehicle.
After acquiring the image to be detected, the vehicle sends it to the server. Correspondingly, the server receives the image to be detected sent by the vehicle.
Step 1503, the server inputs the image to be detected into the trained target neural network to obtain a plurality of point sets.
After receiving the image to be detected sent by the vehicle, the server can input the image to be detected into the trained target neural network to obtain a plurality of point sets.
The trained target neural network is obtained by training the target neural network with training images as input, with the goal that the value of the target loss function becomes smaller than a target threshold. The target loss function is used to represent the difference between a point set output by the target neural network during training and a target point set, where the target point set is the point set of an actual lane line in the training image. The target loss function and the target threshold may be set according to actual needs, which is not limited here.
The target neural network in this embodiment may include the backbone network, the preprocessing module, and the first neural network in the embodiment shown in fig. 5. Since there are two cases in the structure of the first neural network in the embodiment shown in fig. 5, there are also two cases in the target neural network in the present embodiment, which are described below separately.
In one possible implementation, the structure of the target neural network may be as shown in fig. 16. The target neural network in this case is equivalent to the first neural network including the backbone network shown in fig. 6, the preprocessing module shown in fig. 6, and the corresponding nodes shown in fig. 7 to 10. For a detailed description and a related flow of the neural network, reference may be made to the foregoing description corresponding to fig. 6 to fig. 10, and details are not repeated here.
In another possible implementation, the structure of the target neural network may be as shown in fig. 17. The target neural network in this case is equivalent to the first neural network including the backbone network shown in fig. 6, the preprocessing module shown in fig. 6, and the corresponding nodes shown in fig. 11 to 13. For a detailed description and a related flow of the neural network, reference may be made to the descriptions corresponding to fig. 6, fig. 11 to fig. 13, which are not described herein again.
Step 1504, the server sends the plurality of point sets to the vehicle. Correspondingly, the vehicle receives the plurality of point sets sent by the server.
After the server obtains the plurality of point sets, the server sends the plurality of point sets to the vehicle.
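The following is a minimal sketch of the vehicle-to-server exchange in steps 1502 to 1504, using an assumed HTTP interface; the URL, field names, and response layout are hypothetical and not defined by this application.

```python
import requests

def request_lane_point_sets(image_path: str,
                            server_url: str = "http://server.example/lane-detection"):
    """Vehicle side: upload the image to be detected and receive the point sets."""
    with open(image_path, "rb") as f:
        resp = requests.post(server_url, files={"image": f}, timeout=5)
    resp.raise_for_status()
    return resp.json()["point_sets"]        # one point set per detected lane line
```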
Step 1505, the vehicle implements an intelligent driving function based on the plurality of point sets.
After the vehicle acquires the plurality of point sets, each point set in the plurality of point sets represents one lane line in the image to be detected. The vehicle can therefore determine the lane lines in the image to be detected and implement intelligent driving functions based on them, for example: adaptive cruise, lane departure warning, lane keeping assist, and the like.
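As one hedged illustration of using the point sets for such a function, the following sketch performs a simple lane-departure check based on the lateral distance between the image centre and the nearest lane line near the bottom of the image; the threshold and the implicit camera geometry are assumptions only.

```python
def lane_departure_warning(point_sets, image_width=1640, threshold_px=80) -> bool:
    """Return True if the vehicle appears too close to the nearest lane line."""
    centre_x = image_width / 2.0
    bottom_xs = []
    for points in point_sets:                        # one point set per lane line
        if points:
            x, _ = max(points, key=lambda p: p[1])   # the point closest to the image bottom
            bottom_xs.append(x)
    if not bottom_xs:
        return False
    nearest = min(bottom_xs, key=lambda x: abs(x - centre_x))
    return abs(nearest - centre_x) < threshold_px
```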
In addition, for the description of determining the lane line in the image to be detected by using the plurality of point sets, reference may be made to the description in step 504 of the embodiment shown in fig. 5, which is not repeated here.
It can be understood that the steps of this embodiment may be periodically executed, that is, the lane line of the road surface may be accurately identified according to the to-be-detected image acquired by the vehicle-mounted camera in the driving process, so as to implement the functions related to the lane line in the intelligent driving, for example: adaptive cruise, lane departure warning, lane keeping assist, and the like.
In this embodiment, on the one hand, applying the transformer structure to the lane line detection task makes it possible to acquire the global information of the image to be detected and thus effectively model the long-distance relation between lane lines. On the other hand, adding the position information of the detection frames of objects in the image as an input to the lane line detection network improves the scene perception capability of the network and reduces misjudgment of the model in scenes where the lane line is occluded by vehicles. On another hand, introducing a row-column self-attention module that mines context information in a manner conforming to the shape of a lane line into the encoder of the transformer improves the network's ability to construct long-strip lane line features, thereby achieving a better lane line detection effect. On another hand, deploying the target neural network in the cloud and predicting the point sets of the lane lines there saves computation cost on the vehicle. On yet another hand, the modules in existing automatic driving systems are often independent of each other; for example, the lane line detection model and the person-vehicle detection model are independent and make predictions separately. In the image processing method provided by this embodiment, the target neural network predicts the lane line by using the detection frame information obtained from the person-vehicle detection model in the first neural network, which improves the accuracy of lane line detection.
To see the performance of the target neural network provided by the embodiment of the present application more intuitively, performance tests are carried out on the CULane and TuSimple data sets for the target neural network (hereinafter referred to as "Laneformer") and other existing networks. The other existing networks include: Spatial Convolutional Neural Network (SCNN), EnetSAD, PointLaneNet, efficient residual factorized network (ERFNet), CurveLane-S, CurveLane-M, CurveLane-L, and LaneATT.
CULane is a large-scale lane line detection data set collected by a vehicle-mounted camera in Beijing, China; the collected pictures have a size of 1640 × 590. The data set was collected at diverse locations and contains samples of complex scenes in multiple cities. The CULane data set contains 88880 training pictures, 9675 validation pictures, and 34680 test pictures. The test set is further divided into nine categories: one is the conventional category, and the other eight are challenging special categories (including shadow scenes, highlight scenes, night scenes, curve scenes, scenes without lane lines, and the like). TuSimple is an autonomous driving data set collected by the TuSimple company. The data set focuses on highway scenes, so all pictures were taken on highways, with a size of 1280 × 720. The TuSimple data set contains 3626 pictures for training and 2782 pictures for testing.
The backbone network in LaneATT adopts three residual structures (ResNet18, ResNet34, ResNet122), recorded respectively as: LaneATT (ResNet18), LaneATT (ResNet34), LaneATT (ResNet122). The backbone network of the Laneformer provided in the embodiment of the present application adopts three residual structures (ResNet18, ResNet34, ResNet50), recorded respectively as: Laneformer (ResNet18), Laneformer (ResNet34), Laneformer (ResNet50). The network obtained by removing the detection attention module (i.e., the first attention module and the third attention module) from the Laneformer under the ResNet50 structure is recorded as: Laneformer (ResNet50)*.
The detection accuracy of different lane line detection methods on the CULane is shown in table 1:
TABLE 1
(Table 1 is provided as images in the original publication and is not reproduced here.)
As can be seen from table 1: with ResNet50 as the backbone network, the Laneformer model achieves the current best result on the full CULane test set, a score of 77.06%. Besides being optimal on the full test set, Laneformer also achieves the best results on several challenging scene categories such as night scenes (Night), highlight scenes (Dazzle), and intersection scenes (Cross) (only some of these are shown in table 1). The Laneformer model is particularly strong in the intersection scene category, where the number of mispredicted pictures is two orders of magnitude lower than that of other models. Since intersection scenes are not annotated with lane lines in the data set, FP is used as the metric for this category. The FP values of the other models in the intersection scene are in the thousands, whereas the Laneformer model proposed in this work reaches an FP value of 19. It can be inferred from table 1 that this improvement comes from the addition of the detection attention module: in the Laneformer (ResNet50)* model without the detection attention module, the FP of the intersection scene is lower but still in the thousands, and after the detection attention module is added this metric drops sharply to double digits. Therefore, in intersection scenes with more complicated person-and-vehicle conditions, the detection attention module can greatly reduce the misprediction rate of the model through its perception of the surrounding scene and objects.
The detection accuracy of different lane line detection methods on TuSimple is shown in table 2:
TABLE 2
Model Accuracy (%) False positive rate (%) False negative rate (%)
SCNN 96.53 6.17 1.8
LSTR 96.18 2.91 3.38
EnetSAD 96.64 6.02 2.05
LineCNN 96.87 4.41 3.36
PolyLaneNet 93.36 9.42 9.33
PointLaneNet 96.34 4.67 5.18
LaneATT(ResNet18) 95.57 3.56 3.01
LaneATT(ResNet34) 95.63 3.53 2.92
LaneATT(ResNet122) 96.1 5.64 2.17
Laneformer(ResNet50)* 96.72 3.46 2.52
Laneformer(ResNet18) 96.54 4.35 2.36
Laneformer(ResNet34) 96.56 5.39 3.37
Laneformer(ResNet50) 96.8 5.6 1.99
As can be seen from table 2: using ResNet50 as the backbone network, the Laneformer model obtains 96.8% accuracy, a 5.6% false positive rate, and a 1.99% false negative rate on the TuSimple data set. On the most important metric, accuracy, Laneformer is only 0.07% lower than the best-performing LineCNN and 0.6% higher than LSTR, a work that also uses a self-attention transformer network. Meanwhile, it can be observed that, unlike on the CULane data set, very competitive results can be obtained on the TuSimple data set with smaller backbone networks such as ResNet18 and ResNet34, and the performance differences caused by different backbone networks are almost negligible. In addition, on the TuSimple data set, very good results can be achieved even with the model that uses only the row-column attention module, namely Laneformer (ResNet50)*.
In addition, to see more intuitively the individual contribution of each module in the target neural network, performance tests are carried out on the CULane data set under different configurations of the target neural network. These cover the effect of using only the row-column attention module, and of gradually adding the different sub-modules of the detection attention, i.e., the influence on the overall result of whether the position information (bounding box), confidence (score), and category of the person-vehicle detection boxes are used as the input of the detection preprocessing module.
The test results are shown in table 3:
TABLE 3
Model F1 (%) Precision (%) Recall (%) Frames per second Parameters (million)
Baseline(ResNet50) 75.45 81.65 70.11 61 31.02
+ row-column attention 76.04 82.92 70.22 58 43.02
+ position information of detection frame 76.08 85.3 68.66 57 45.38
+ confidence of detection box 76.25 83.56 70.12 54 45.38
+ type of detection box 77.06 84.05 71.14 53 45.38
The first model (i.e., Baseline) can be understood as the target neural network shown in fig. 16 with the first attention module and the third attention module removed. The second model (+ row-column attention) can be understood as the first model plus the row-column attention module, the third model (+ position information of detection box) as the second model plus the position information of the detection boxes, the fourth model (+ confidence of detection box) as the third model plus the confidence of the detection boxes, and the fifth model (+ category of detection box) as the fourth model plus the category of the detection boxes. The fifth model can be regarded as the target neural network shown in fig. 17 described earlier.
The Laneformer model proposed herein adds a row-column attention module and a detection attention module (including the first attention module and the third attention module) on top of the Transformer, and the detection attention module is considered in three cases: adding only the detection frame information, further adding the detection frame confidence, and further adding the predicted category. This subsection therefore experimentally explores the effect of each module on the model. As can be seen from table 3, a plain Transformer model without the row-column attention module and the detection attention module (the baseline) reaches an F1 score of 75.45%. After the row-column attention module is added, the model improves to an F1 score of 76.04%. Meanwhile, simply adding the detection frame information produced by the person-vehicle detection module also improves the model. Furthermore, adding the confidence of the detection boxes to the detection information brings the model to an F1 score of 76.25%, and after the category information of the detection boxes is also added, the best model in table 3 is obtained, reaching an F1 score of 77.06%. This shows that both the row-column attention module and the detection attention module improve the performance of the model. In addition, it can be observed that adding the detection attention module significantly improves the precision of the model, while its influence on the recall is weak.
The image processing method provided by the embodiment of the present application is described above, and the lane line detection method provided by the embodiment of the present application is described below. The method may be performed by the detection device or by a component of the detection device (e.g., a processor, a chip, or a chip system). The detection device may be a terminal device (e.g., a vehicle-mounted terminal or an aircraft terminal) or the like (as shown in fig. 3b described above). Optionally, the method may be processed by a CPU in the detection device, or jointly by the CPU and a GPU, or other processors suitable for neural network computation may be used instead of a GPU, which is not limited in this application.
The application scenario of the method (or, equivalently, of the first neural network) is intelligent driving, for example scenarios involving lane line detection such as adaptive cruise, lane departure warning (LDW), and lane keeping assist (LKA). In an intelligent driving scenario, the lane line detection method provided by the embodiment of the present application can acquire an image to be detected through a sensor (such as a camera) on the vehicle and obtain the lane lines in the image to be detected, thereby realizing adaptive cruise, LDW, LKA, and the like.
Referring to fig. 18, a schematic flow chart of a lane line detection method provided in the embodiment of the present application, where the method is applied to a vehicle, and the method may include steps 1801 to 1806. The following describes steps 1801 to 1806 in detail.
Step 1801, an image to be detected is obtained.
This step is similar to step 501 in the embodiment shown in fig. 5, and is not repeated here.
Illustratively, continuing the above example, the image to be detected is the same as that shown in fig. 6.
Step 1802, processing an image to be detected to obtain a plurality of point sets.
After the detection device acquires the image to be detected, it can process the image to be detected to obtain a plurality of point sets. Each point set in the plurality of point sets represents one lane line in the image to be detected. The processing predicts the point sets of the lane lines in the image based on a first neural network with a transformer structure and on detection frame information, where the detection frame information includes the position, in the image to be detected, of the detection frame of at least one object in the image to be detected.
It is understood that, for the step of predicting the point sets of the lane lines in the image based on the neural network with the transformer structure and the detection frame information, reference may be made to the similar steps described in the foregoing embodiments shown in fig. 5 to 17, which are not repeated here.
Step 1803, displaying the lane line, which is optional.
Alternatively, after the detection device determines a plurality of point sets, the lane lines represented by the plurality of point sets may be displayed.
Illustratively, continuing the above example, the lane lines are shown in FIG. 14 b.
Step 1804, model the at least one object to obtain a virtual object. This step is optional.
Optionally, at least one object may be modeled to obtain a virtual object. The virtual object may be two-dimensional or multi-dimensional, and is not limited herein.
Step 1805, performing fusion processing on the plurality of point sets and the virtual object based on the position to obtain a target image, which is optional.
Optionally, after the plurality of point sets and the virtual object are acquired, the plurality of point sets and the virtual object may be fused based on the positions of the plurality of point sets in the image to be detected to obtain the target image.
Illustratively, the target image is shown in fig. 19, and it is understood that the virtual image in fig. 19 is a two-dimensional example only, and does not limit the virtual object.
Step 1806, display the target image, this step is optional.
Optionally, after the detection device acquires the target image, the target image may be displayed to a user, so that the user driving the vehicle may specify the surrounding vehicle and lane line, and improve the driving safety of the vehicle.
It can be understood that the above steps 1801 to 1806 may be performed periodically, that is, the target image may be displayed to the user in real time, so that the user may determine the surrounding objects and the lane line in real time, and the driving experience of the user is improved.
In a possible implementation manner, the lane line detection method provided in the embodiment of the present application includes steps 1801 and 1802. In another possible implementation manner, the lane line detection method provided in the embodiment of the present application includes steps 1801 to 1803. In another possible implementation manner, the lane line detection method provided in the embodiment of the present application includes steps 1801 to 1805.
In the embodiment of the application, on the one hand, applying the transformer structure to the lane line detection task makes it possible to acquire the global information of the image to be detected and thus effectively model the long-distance relation between lane lines. On the other hand, adding the detection frame information of objects in the image to the lane line detection process improves the target neural network's perception of the image scene and reduces misjudgment in scenes where the lane line is occluded by vehicles.
The image processing method and the lane line detection method provided by the embodiment of the present application are described above, and the training process of the target neural network provided by the embodiment of the present application is described below. The training method of the target neural network may be performed by a training device of the target neural network, which may be an image processing apparatus (e.g., a device with sufficient computational capability to perform the training method, such as a cloud service apparatus or a user apparatus), or a system composed of a cloud service apparatus and a user apparatus. Illustratively, the training method may be performed by the training apparatus 120 in fig. 1 or by the neural network processor 20 in fig. 2.
Optionally, the training method may be processed by a CPU, or may be processed by both the CPU and the GPU, or may use other processors suitable for neural network computation instead of the GPU, which is not limited in this application.
Referring to fig. 20, a model training method for a target neural network according to an embodiment of the present application is provided. The model training method includes steps 2001 to 2004.
Step 2001, training images are acquired.
The training device may acquire a training image through a sensor (e.g., a camera, a radar, etc.), may also acquire the training image from a database, and may also receive a training image sent by another device, and the manner of acquiring the training image is not limited here.
When the target neural network needs to be trained, the training device may obtain a batch of training samples, i.e., training images for training. Wherein the set of real points of the lane lines in the training image is known.
Step 2002, inputting the training image into a target neural network to obtain a first point set.
After the training image is obtained, the training image may be input into the target neural network, so as to implement the following steps by the target neural network: acquiring a first feature of a training image; acquiring a second feature based on the first feature, wherein the second feature comprises a position feature and a semantic feature of an object in the training image corresponding to the detection frame; a first set of points is obtained based on the first feature and the second feature, the first set of points representing a lane line in the training image.
Optionally, obtaining the first point set based on the first feature and the second feature specifically includes the following steps: performing self-attention calculation on the first feature to obtain a first output; performing cross attention calculation on the first feature and the second feature to obtain a second output; obtaining a fourth feature based on the first output and the second output; performing cross attention calculation on the query feature and the fourth feature to obtain a third output, where the query feature is calculated from a query vector based on a self-attention mechanism; processing the query feature and the second feature to obtain a fourth output; adding the third output and the fourth output to obtain a fifth feature; and obtaining the first point set based on the fifth feature.
For the process of obtaining the first feature, the second feature, the fourth feature, the fifth feature and the point set, reference may be made to the description of step 502 to step 504 in the embodiment shown in fig. 5, and details are not repeated here.
Step 2003, obtaining a target loss based on the first point set and a real point set of the actual lane line in the training image, wherein the target loss is used for indicating the difference between the first point set and the real point set.
After the first point set is obtained, the first point set and the real point set may be calculated through a preset target loss function to obtain a target loss, where the target loss is used to indicate a difference between the first point set and the real point set.
It should be noted that, if the number of lane lines corresponding to the first point set is greater than the number of lane lines corresponding to the real point set, the real point set may be expanded, and the category of the lane lines in the expanded entries is set to the non-lane-line category. The target loss in this case is then used to indicate the difference between the expanded real point set and the first point set.
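The following is a sketch of this padding step: when the network predicts more lane lines than the ground truth contains, the real point set is expanded with placeholder entries labelled with the non-lane-line category. The tensor layout and the class index are assumptions for illustration.

```python
import torch

NO_LANE_CLASS = 0   # assumed index of the "not a lane line" category

def pad_ground_truth(gt_points, gt_labels, num_predictions, point_dim):
    """gt_points: (num_gt, point_dim); gt_labels: (num_gt,). Pad up to num_predictions."""
    num_gt = gt_points.shape[0]
    if num_gt >= num_predictions:
        return gt_points, gt_labels
    pad_pts = torch.zeros(num_predictions - num_gt, point_dim)
    pad_lbl = torch.full((num_predictions - num_gt,), NO_LANE_CLASS, dtype=gt_labels.dtype)
    return torch.cat([gt_points, pad_pts]), torch.cat([gt_labels, pad_lbl])
```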
And step 2004, updating parameters of the target neural network based on the target loss until the training conditions are met, and obtaining the trained target neural network.
After the target loss is obtained, parameters of the target neural network may be updated based on the target loss, and the target neural network after the parameters are updated is trained by using a next batch of training samples (i.e., steps 2002 to 2004 are executed again) until model training conditions are met (e.g., the target loss converges, etc.), so as to obtain a trained target neural network.
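A minimal training-loop sketch for steps 2002 to 2004 is shown below. The optimizer, learning rate, and convergence test are assumptions, and target_neural_network, target_loss_fn, and data_loader stand in for the components described above.

```python
import torch

def train(target_neural_network, target_loss_fn, data_loader,
          target_threshold=1e-3, max_epochs=100):
    optimizer = torch.optim.AdamW(target_neural_network.parameters(), lr=1e-4)
    for _ in range(max_epochs):
        last_loss = None
        for images, real_point_sets in data_loader:
            first_point_sets = target_neural_network(images)            # step 2002
            loss = target_loss_fn(first_point_sets, real_point_sets)    # step 2003
            optimizer.zero_grad()
            loss.backward()                                             # step 2004
            optimizer.step()
            last_loss = loss.item()
        if last_loss is not None and last_loss < target_threshold:      # training condition met
            break
    return target_neural_network
```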
In addition, the query vectors involved in the training process are randomly initialized, and they are trained jointly while the parameters of the target neural network are continuously updated, so as to obtain target query vectors. The target query vectors can be understood as the query vectors used in the inference process, that is, the target query vectors are the query vectors in the embodiment shown in fig. 5.
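One common way to realize such randomly initialized, learnable query vectors is a learned embedding table, sketched below; num_queries and d_model are assumptions.

```python
import torch.nn as nn

num_queries, d_model = 10, 256
query_embed = nn.Embedding(num_queries, d_model)     # random at the start of training
# queries = query_embed.weight.unsqueeze(0)          # (1, num_queries, d_model), fed to the decoder
```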
The target neural network obtained by training in this embodiment has the capability of predicting lane lines from an image. On the one hand, applying the transformer structure to the lane line detection task makes it possible to acquire the global information of the image to be detected and thus effectively model the long-distance relation between lane lines. On the other hand, adding the position information of the detection frames of objects in the image as an input to the lane line detection network improves the scene perception capability of the target neural network and reduces misjudgment of the model in scenes where the lane line is occluded by vehicles. On another hand, introducing a row-column self-attention module that mines context information in a manner conforming to the shape of a lane line into the encoder of the transformer improves the network's ability to construct long-strip lane line features, thereby achieving a better lane line detection effect. On yet another hand, the modules in existing automatic driving systems are often independent of each other; for example, the lane line detection model and the person-vehicle detection model are independent and make predictions separately. In this embodiment, the target neural network is trained using the detection frame information obtained from the person-vehicle detection model in the first neural network, which improves the accuracy of lane line detection by the target neural network.
The image processing method in the embodiment of the present application is described above. With reference to fig. 21, an embodiment of the image processing apparatus in the embodiment of the present application is described below. The image processing apparatus includes:
the extraction unit 2101 is configured to perform feature extraction on an image to be detected to obtain a first feature;
a processing unit 2102, configured to process detection frame information of the image to be detected to obtain a second feature, where the detection frame information includes a position of a detection frame of an object in the image to be detected;
a determining unit 2103, configured to input the first feature and the second feature into a first neural network based on a transform structure, so as to obtain a lane line in the image to be detected.
Optionally, the image processing apparatus in this embodiment may further include: an obtaining unit 2104, configured to obtain, based on the first feature, a first row feature and a first column feature, where the first row feature is obtained by performing flattening (flatten) on a matrix corresponding to the first feature along a row direction, and the first column feature is obtained by performing flattening (flatten) on the matrix along a column direction.
In this embodiment, operations performed by each unit in the image processing apparatus are similar to those described in the embodiments shown in fig. 5 to 17, and are not described again here.
In this embodiment, on the one hand, applying the transformer structure to the lane line detection task makes it possible to acquire the global information of the image to be detected and thus effectively model the long-distance relation between lane lines. On the other hand, adding the detection frame information of objects in the image to the lane line detection process improves the perception of the image scene and reduces misjudgment in scenes where the lane line is occluded by vehicles.
Referring to fig. 22, an embodiment of a detection apparatus in an embodiment of the present application includes:
an acquisition unit 2201 for acquiring an image to be detected;
a processing unit 2202, configured to process the image to be detected to obtain a plurality of point sets, where each point set in the plurality of point sets represents one lane line in the image to be detected; the processing predicts the point sets of the lane lines in the image based on a first neural network with a transformer structure and on detection frame information, where the detection frame information includes the position, in the image to be detected, of the detection frame of at least one object in the image to be detected.
Optionally, the detection apparatus in this embodiment may further include: a display unit 2203 for displaying the lane line.
In this embodiment, operations performed by each unit in the detection device are similar to those described in the embodiment shown in fig. 18, and are not described again here.
In this embodiment, on the one hand, applying the transformer structure to the lane line detection task makes it possible to acquire the global information of the image to be detected and thus effectively model the long-distance relation between lane lines. On the other hand, adding the detection frame information of objects in the image to the lane line detection process improves the target neural network's perception of the image scene and reduces misjudgment in scenes where the lane line is occluded by vehicles.
Referring to fig. 23, another embodiment of an image processing apparatus according to an embodiment of the present application includes:
an acquisition unit 2301 for acquiring a training image;
the processing unit 2302 is configured to input the training image into a target neural network to obtain a first point set of the training image, where the first point set represents a predicted lane line in the training image; the target neural network is configured to: extract features of the training image to obtain a first feature; process detection frame information of the training image to obtain a second feature, where the detection frame information includes the position of a detection frame of an object in the training image; and obtain the first point set based on the first feature and the second feature, where the target neural network predicts the point set of the lane line in the image based on the transformer structure;
and a training unit 2303, configured to train the target neural network according to the first point set and a real point set of an actual lane line in the training image, so as to obtain a trained target neural network.
In this embodiment, operations performed by each unit in the image processing apparatus are similar to those described in the embodiment shown in fig. 20, and are not described again here.
In this embodiment, on the one hand, applying the transformer structure to the lane line detection task makes it possible to acquire the global information of the image to be detected and thus effectively model the long-distance relation between lane lines. On the other hand, adding the detection frame information of objects in the image to the lane line detection process improves the target neural network's perception of the image scene and reduces misjudgment in scenes where the lane line is occluded by vehicles.
Referring to fig. 24, a schematic structural diagram of another image processing apparatus provided in the present application is shown. The image processing device may include a processor 2401, memory 2402, and a communication interface 2403. The processor 2401, memory 2402, and communication interface 2403 are interconnected by wires. Wherein program instructions and data are stored in memory 2402.
The memory 2402 stores program instructions and data corresponding to the steps executed by the device in the corresponding embodiments shown in fig. 5 to 17 and 20.
A processor 2401, configured to perform the steps performed by the apparatus in any one of the embodiments shown in fig. 5 to 17 and fig. 20.
Communication interface 2403 may be used to receive and transmit data, and perform the steps related to obtaining, transmitting, and receiving in any of the embodiments shown in fig. 5 to 17 and 20.
In one implementation, the image processing device may include more or less components than those shown in fig. 24, which are merely exemplary and not limiting.
Referring to fig. 25, a schematic structural diagram of another detection apparatus provided in the present application is shown. The detection device may include a processor 2501, a memory 2502, and a communication interface 2503. The processor 2501, memory 2502, and communication interface 2503 are interconnected by wires. Among other things, the memory 2502 has stored therein program instructions and data.
The memory 2502 stores program instructions and data corresponding to the steps performed by the detection device in the corresponding embodiment shown in fig. 18.
A processor 2501, configured to perform the steps performed by the detection apparatus in any of the embodiments shown in fig. 18.
Communication interface 2503 may be used to receive and transmit data for performing the steps associated with acquiring, transmitting, and receiving in any of the embodiments illustrated in fig. 18 and described above.
In one implementation, the detection device may include more or less components than those shown in fig. 25, which are merely exemplary and not limiting.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is only a logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated units described above may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
When the integrated unit is implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Claims (34)

1. An image processing method, characterized in that the method comprises:
extracting the features of an image to be detected to obtain first features;
processing the detection frame information of the image to be detected to obtain a second characteristic, wherein the detection frame information comprises the position of a detection frame of at least one object in the image to be detected;
and inputting the first characteristic and the second characteristic into a first neural network based on a transformer structure to obtain a lane line in the image to be detected.
2. The method according to claim 1, wherein the processing the detection frame information of the image to be detected to obtain the second feature comprises:
and processing at least one third feature and the detection frame information to obtain the second feature, wherein the at least one third feature is an intermediate feature obtained in the process of obtaining the first feature.
3. The method of claim 2, wherein the second feature comprises a position feature and a semantic feature of the detection box, and wherein the detection box information further comprises: the category and confidence of the detection frame;
the processing the at least one third feature and the detection frame information to obtain the second feature includes:
obtaining the semantic features based on the at least one third feature, the location, and the confidence level;
and acquiring the position feature based on the position and the category.
4. The method of claim 3, wherein the obtaining the semantic features based on the at least one third feature, the location, and the confidence level comprises:
extracting region of interest, ROI, features from the at least one third feature based on the location;
multiplying the ROI features and the confidence coefficient, and inputting the obtained features into a full-connection layer to obtain the semantic features;
the obtaining the location feature based on the location and the category includes:
and acquiring the vector of the category, splicing the vector with the vector corresponding to the position, and inputting the spliced features into the full-connection layer to obtain the position features.
5. The method according to any one of claims 1 to 4, wherein the first neural network based on a transform structure comprises an encoder, a decoder and a feedforward neural network;
inputting the first feature and the second feature into a first neural network based on a transform structure to obtain a lane line in the image to be detected, wherein the method comprises the following steps:
inputting the first characteristic and the second characteristic into the encoder to obtain a fourth characteristic;
inputting the fourth feature, the second feature and the query feature into the decoder to obtain a fifth feature;
and inputting the fifth characteristic into the feedforward neural network to obtain a plurality of point sets, wherein each point set in the plurality of point sets represents a lane line in the image to be detected.
6. The method of claim 5, further comprising:
acquiring first row features and first column features based on the first features, wherein the first row features are obtained by performing flattening (flatten) on a matrix corresponding to the first features along the row direction, and the first column features are obtained by performing flattening (flatten) on the matrix along the column direction;
the inputting the first feature and the second feature into the encoder to obtain a fourth feature includes:
inputting the first feature, the second feature, the first row feature, and the first column feature into the encoder to obtain the fourth feature.
7. The method of claim 6, wherein inputting the first feature, the second feature, the first row feature, and the first column feature into the encoder to obtain the fourth feature comprises:
performing self-attention calculation on the first characteristic to obtain a first output;
performing cross attention calculation on the first feature and the second feature to obtain a second output;
performing self-attention calculation and splicing processing on the first row characteristics and the first column characteristics to obtain row and column output;
obtaining the fourth feature based on the first output, the second output, and the rank output.
8. The method of claim 7, wherein the obtaining the fourth feature based on the first output, the second output, and the rank output comprises:
adding the first output and the second output to obtain a fifth output;
and splicing the fifth output and the row-column output to obtain the fourth characteristic.
9. The method of claim 5, wherein inputting the first feature and the second feature into the encoder to obtain a fourth feature comprises:
performing self-attention calculation on the first characteristic to obtain a first output;
performing cross attention calculation on the first feature and the second feature to obtain a second output;
and adding the first output and the second output to obtain the fourth characteristic.
10. The method according to any one of claims 5 to 9, wherein the inputting the fourth feature, the second feature and the query feature into the decoder to obtain a fifth feature comprises:
performing cross attention calculation on the query feature and the fourth feature to obtain a third output;
processing the query feature and the second feature to obtain a fourth output;
and adding the third output and the fourth output to obtain the fifth characteristic.
11. The method according to any one of claims 1 to 10, wherein the extracting features of the image to be detected to obtain the first features comprises:
and carrying out feature fusion and dimension reduction processing on features output by different layers in a main network to obtain the first feature, wherein the input of the main network is the image to be detected.
12. A lane line detection method, applied to a vehicle, comprising:
acquiring an image to be detected;
processing the image to be detected to obtain a plurality of point sets, wherein each point set in the plurality of point sets represents a lane line in the image to be detected; the processing is based on a first neural network of a transform structure and detection frame information, and the detection frame information comprises the position of a detection frame of at least one object in the image to be detected.
13. The method of claim 12, wherein the detecting box information further comprises: the class and confidence of the detection box.
14. The method according to claim 12 or 13, characterized in that the method further comprises:
and displaying the lane line.
15. The method according to any one of claims 12 to 14, further comprising:
modeling the at least one object to obtain a virtual object;
performing fusion processing on the point sets and the virtual object based on the positions to obtain a target image;
and displaying the target image.
16. An image processing apparatus characterized by comprising:
the extraction unit is used for extracting the features of the image to be detected to obtain first features;
the processing unit is used for processing the detection frame information of the image to be detected to obtain a second characteristic, wherein the detection frame information comprises the position of a detection frame of at least one object in the image to be detected;
and the determining unit is used for inputting the first characteristic and the second characteristic into a first neural network based on a transformer structure to obtain a lane line in the image to be detected.
17. The apparatus according to claim 16, wherein the processing unit is specifically configured to process at least one third feature and the detection frame information to obtain the second feature, where the at least one third feature is an intermediate feature obtained in a process of obtaining the first feature.
18. The apparatus according to claim 17, wherein the second feature includes a position feature and a semantic feature of the detection frame, and the detection frame information further includes: the category and confidence of the detection frame;
the processing unit is specifically configured to obtain the semantic feature based on the at least one third feature, the location, and the confidence level;
the processing unit is specifically configured to obtain the location feature based on the location and the category.
19. The image processing device according to claim 18, wherein the processing unit is specifically configured to extract a region of interest, ROI, feature from the at least one third feature based on the position;
the processing unit is specifically configured to perform multiplication processing on the ROI feature and the confidence level, and input the obtained feature into a full connection layer to obtain the semantic feature;
the processing unit is specifically configured to obtain the vectors of the categories, splice the vectors with the vectors corresponding to the positions, and input the features obtained by splicing into the full-link layer to obtain the position features.
20. The image processing device according to any one of claims 16 to 19, wherein the first neural network based on a transform structure comprises an encoder, a decoder, and a feed-forward neural network;
the determining unit is specifically configured to input the first feature and the second feature into the encoder to obtain a fourth feature;
the determining unit is specifically configured to input the fourth feature, the second feature, and the query feature into the decoder to obtain the fifth feature;
the determining unit is specifically configured to input the fifth feature into the feedforward neural network to obtain a plurality of point sets, where each point set in the plurality of point sets represents one lane line in the image to be detected.
21. The image processing apparatus according to claim 20, characterized by further comprising:
an obtaining unit, configured to obtain a first row feature and a first column feature based on the first feature, where the first row feature is obtained by performing flattening (flattening) on a matrix corresponding to the first feature along a row direction, and the first column feature is obtained by performing flattening (flattening) on the matrix along a column direction;
the determining unit is specifically configured to input the first feature, the second feature, the first row feature, and the first column feature into the encoder to obtain the fourth feature.
22. The apparatus according to claim 21, wherein the determining unit is specifically configured to perform a self-attention calculation on the first feature to obtain a first output;
the determining unit is specifically configured to perform cross attention calculation on the first feature and the second feature to obtain a second output;
the determining unit is specifically configured to perform self-attention calculation and splicing processing on the first row features and the first column features to obtain row-column output;
the determining unit is specifically configured to obtain the fourth feature based on the first output, the second output, and the row-column output.
23. The apparatus according to claim 22, wherein the determining unit is specifically configured to add the first output and the second output to obtain a fifth output;
the determining unit is specifically configured to perform a splicing process on the fifth output and the row-column output to obtain the fourth feature.
24. The apparatus according to claim 20, wherein the determining unit is specifically configured to perform a self-attention calculation on the first feature to obtain a first output;
the determining unit is specifically configured to perform cross attention calculation on the first feature and the second feature to obtain a second output;
the determining unit is specifically configured to add the first output and the second output to obtain the fourth feature.
25. The image processing apparatus according to any one of claims 20 to 24, wherein the determining unit is specifically configured to perform cross attention calculation on the query feature and the fourth feature to obtain a third output;
the determining unit is specifically configured to process the query feature and the second feature to obtain a fourth output;
the determining unit is specifically configured to add the third output and the fourth output to obtain the fifth feature.
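Claim 25 specifies the decoder step only at the level of data flow. The sketch below assumes that the unspecified processing of the query feature with the second feature is also a cross-attention, and all dimensions are illustrative rather than taken from the disclosure.

```python
# Hedged sketch of the decoder step of claim 25.
import torch
import torch.nn as nn

d_model, n_queries = 256, 12
query_feature  = torch.randn(1, n_queries, d_model)   # lane-line queries
fourth_feature = torch.randn(1, 470, d_model)         # encoder output
second_feature = torch.randn(1, 10, d_model)          # detection-box feature

attn_img = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
attn_box = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

third_output, _  = attn_img(query_feature, fourth_feature, fourth_feature)   # cross attention with the fourth feature
fourth_output, _ = attn_box(query_feature, second_feature, second_feature)   # assumed cross attention with the second feature
fifth_feature = third_output + fourth_output                                 # addition gives the fifth feature
```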
26. The apparatus according to any one of claims 16 to 25, wherein the extracting unit is specifically configured to perform feature fusion and dimension reduction on features output from different layers in a backbone network, so as to obtain the first feature, where an input of the backbone network is the image to be detected.
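Claim 26 describes fusing features output from different backbone layers and reducing their dimension to obtain the first feature. A common way to realize this is a feature-pyramid-style top-down fusion with 1x1 convolutions, sketched below under assumed ResNet-like channel counts; the actual fusion used in the disclosure may differ.

```python
# Hedged sketch: multi-layer feature fusion and dimension reduction producing the first feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

c3 = torch.randn(1, 512, 40, 100)    # stage-3 output of an assumed backbone
c4 = torch.randn(1, 1024, 20, 50)    # stage-4 output
c5 = torch.randn(1, 2048, 10, 25)    # stage-5 output

reduce3 = nn.Conv2d(512, 256, kernel_size=1)    # 1x1 convolutions for dimension reduction
reduce4 = nn.Conv2d(1024, 256, kernel_size=1)
reduce5 = nn.Conv2d(2048, 256, kernel_size=1)

# Top-down fusion: upsample the deeper feature and add it to the shallower one.
p5 = reduce5(c5)
p4 = reduce4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
p3 = reduce3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")

first_feature = p3    # (1, 256, 40, 100); later flattened into tokens for the encoder
```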
27. A detection apparatus, characterized in that the detection apparatus is applied to a vehicle, the detection apparatus comprising:
an acquisition unit, configured to acquire an image to be detected;
a processing unit, configured to process the image to be detected to obtain a plurality of point sets, where each point set in the plurality of point sets represents one lane line in the image to be detected; the processing is based on a first neural network of a transformer structure and on detection frame information, and the detection frame information comprises the position of a detection frame of at least one object in the image to be detected.
28. The detection apparatus according to claim 27, wherein the detection frame information further includes: the class and confidence of the detection box.
29. The detection apparatus according to claim 27 or 28, characterized in that the detection apparatus further comprises:
a display unit, configured to display the lane line.
30. The detection apparatus according to any one of claims 27 to 29, wherein the processing unit is further configured to model the at least one object to obtain a virtual object;
the processing unit is further configured to perform fusion processing on the plurality of point sets and the virtual object based on the positions to obtain a target image;
the display unit is further configured to display the target image.
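As a purely illustrative rendering of claim 30, the detected point sets and the modelled virtual objects can be drawn into one target image according to their positions. OpenCV, the blank canvas, and the hard-coded coordinates below are assumptions made only for this sketch.

```python
# Hedged sketch: fuse lane-line point sets and virtual objects into a target image.
import numpy as np
import cv2

target_image = np.zeros((720, 1280, 3), dtype=np.uint8)   # blank canvas standing in for the camera frame

point_sets = [np.array([[200, 700], [300, 500], [380, 350]], dtype=np.int32),
              np.array([[900, 700], [820, 500], [760, 350]], dtype=np.int32)]
virtual_objects = [((500, 400), (640, 520))]               # boxes standing in for modelled vehicles

for pts in point_sets:                                      # each point set is one lane line
    cv2.polylines(target_image, [pts], isClosed=False, color=(0, 255, 0), thickness=3)
for top_left, bottom_right in virtual_objects:              # place each virtual object at its position
    cv2.rectangle(target_image, top_left, bottom_right, color=(255, 0, 0), thickness=2)
```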
31. An image processing apparatus, characterized by comprising: a processor coupled to a memory, the memory being configured to store a program or instructions that, when executed by the processor, cause the image processing apparatus to perform the method according to any one of claims 1 to 11.
32. A detection apparatus, characterized in that the detection apparatus is applied to a vehicle, the detection apparatus comprising: a processor coupled to a memory, the memory being configured to store a program or instructions that, when executed by the processor, cause the detection apparatus to perform the method according to any one of claims 12 to 15.
33. A computer storage medium, comprising computer instructions that, when executed on an electronic device, cause the electronic device to perform the method according to any one of claims 1 to 11, or cause the electronic device to perform the method according to any one of claims 12 to 15.
34. A computer program product, characterized in that, when the computer program product runs on a computer, the computer is caused to perform the method according to any one of claims 1 to 11, or the computer is caused to perform the method according to any one of claims 12 to 15.
CN202210018538.XA 2022-01-07 2022-01-07 Image processing method, lane line detection method and related equipment Pending CN114494158A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210018538.XA CN114494158A (en) 2022-01-07 2022-01-07 Image processing method, lane line detection method and related equipment
PCT/CN2022/143779 WO2023131065A1 (en) 2022-01-07 2022-12-30 Image processing method, lane line detection method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210018538.XA CN114494158A (en) 2022-01-07 2022-01-07 Image processing method, lane line detection method and related equipment

Publications (1)

Publication Number Publication Date
CN114494158A 2022-05-13

Family

ID=81509244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210018538.XA Pending CN114494158A (en) 2022-01-07 2022-01-07 Image processing method, lane line detection method and related equipment

Country Status (2)

Country Link
CN (1) CN114494158A (en)
WO (1) WO2023131065A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036788B (en) * 2023-07-21 2024-04-02 阿里巴巴达摩院(杭州)科技有限公司 Image classification method, method and device for training image classification model
CN116872961B (en) * 2023-09-07 2023-11-21 北京捷升通达信息技术有限公司 Control system for intelligent driving vehicle

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11068724B2 (en) * 2018-10-11 2021-07-20 Baidu Usa Llc Deep learning continuous lane lines detection system for autonomous vehicles
CN111191487A (en) * 2018-11-14 2020-05-22 北京市商汤科技开发有限公司 Lane line detection and driving control method and device and electronic equipment
CN111860155B (en) * 2020-06-12 2022-04-29 华为技术有限公司 Lane line detection method and related equipment
CN114494158A (en) * 2022-01-07 2022-05-13 华为技术有限公司 Image processing method, lane line detection method and related equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023131065A1 (en) * 2022-01-07 2023-07-13 华为技术有限公司 Image processing method, lane line detection method and related device
CN114896898A (en) * 2022-07-14 2022-08-12 深圳市森辉智能自控技术有限公司 Energy consumption optimization method and system for air compressor cluster system
CN115588177A (en) * 2022-11-23 2023-01-10 荣耀终端有限公司 Method of training lane line detection network, electronic device, program product, and medium
CN116385789A (en) * 2023-04-07 2023-07-04 北京百度网讯科技有限公司 Image processing method, training device, electronic equipment and storage medium
CN116385789B (en) * 2023-04-07 2024-01-23 北京百度网讯科技有限公司 Image processing method, training device, electronic equipment and storage medium
CN117245672A (en) * 2023-11-20 2023-12-19 南昌工控机器人有限公司 Intelligent motion control system and method for modularized assembly of camera support
CN117245672B (en) * 2023-11-20 2024-02-02 南昌工控机器人有限公司 Intelligent motion control system and method for modularized assembly of camera support

Also Published As

Publication number Publication date
WO2023131065A1 (en) 2023-07-13

Similar Documents

Publication Publication Date Title
WO2023131065A1 (en) Image processing method, lane line detection method and related device
JP7086111B2 (en) Feature extraction method based on deep learning used for LIDAR positioning of autonomous vehicles
EP3218890B1 (en) Hyper-class augmented and regularized deep learning for fine-grained image classification
CN111860155B (en) Lane line detection method and related equipment
CN111771135B (en) LIDAR positioning using RNN and LSTM for time smoothing in autonomous vehicles
JP2021515724A (en) LIDAR positioning to infer solutions using 3DCNN network in self-driving cars
CN110543814A (en) Traffic light identification method and device
CN110930323B (en) Method and device for removing reflection of image
US20230047094A1 (en) Image processing method, network training method, and related device
CN113591872A (en) Data processing system, object detection method and device
US11308324B2 (en) Object detecting system for detecting object by using hierarchical pyramid and object detecting method thereof
WO2022178858A1 (en) Vehicle driving intention prediction method and apparatus, terminal and storage medium
CN112258565A (en) Image processing method and device
CN115546781A (en) Point cloud data clustering method and device
CN114167404A (en) Target tracking method and device
CN114802261B (en) Parking control method, obstacle recognition model training method and device
CN113066124A (en) Neural network training method and related equipment
CN115214708A (en) Vehicle intention prediction method and related device thereof
WO2024093321A1 (en) Vehicle position acquiring method, model training method, and related device
CN113065637A (en) Perception network and data processing method
US20220261658A1 (en) Apparatus, system and method for translating sensor label data between sensor domains
Liu Application of Object Detection in Autonomous Driving
CN116244814A (en) Method and device for detecting compliance value of vehicle driving, electronic equipment and storage medium
CN114972136A (en) Training method of 3D target detection model, and 3D target detection method and device
CN116701586A (en) Data processing method and related device thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination