WO2023131065A1 - Image processing method, lane line detection method, and related devices - Google Patents
Image processing method, lane line detection method, and related devices
- Publication number
- WO2023131065A1 (PCT/CN2022/143779)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- image
- output
- detected
- detection frame
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30248—Vehicle exterior or interior
- G06T2207/30252—Vehicle exterior; Vicinity of vehicle
- G06T2207/30256—Lane; Road marking
Definitions
- the embodiments of the present application relate to the field of artificial intelligence, and in particular to an image processing method, a lane line detection method and related equipment.
- Intelligent driving includes automatic driving, assisted driving, and the like.
- Intelligent driving relies on the cooperation of artificial intelligence, visual computing, radar, monitoring devices, and global positioning systems, so that vehicles can drive automatically without active human operation.
- Lane line detection technology is one of the most important technologies in intelligent driving, and it is of great significance to other technologies applied in intelligent driving systems (such as adaptive cruise control, lane departure warning, road condition understanding, etc.).
- the goal of lane line detection technology is to predict each lane line in the picture through the image input obtained by the camera, so as to assist the car to drive in the correct lane.
- the lane line detection model based on image segmentation first predicts the segmentation results of the entire image, and then outputs the lane line detection results after clustering.
- In the related art, lane line detection commonly uses convolutional neural networks such as the spatial convolutional neural network (SCNN).
- Embodiments of the present application provide an image processing method, a lane line detection method and related equipment. It can improve the accuracy of detecting lane lines in images.
- the first aspect of the embodiments of the present application provides an image processing method, which can be applied to intelligent driving scenarios. For example: adaptive cruise, lane departure warning (lane departure warning, LDW), lane keeping assist (lane keeping assist, LKA) and other scenarios that include lane line detection.
- the method may be executed by an image processing device (such as a terminal device or a server), or may be executed by components of the image processing device (such as a processor, a chip, or a chip system, etc.).
- The method is implemented by a target neural network containing a transformer structure, and the method includes: extracting features of the image to be detected to obtain a first feature; processing detection frame information of the image to be detected to obtain a second feature, where the detection frame information includes the position, in the image to be detected, of the detection frame of an object in the image to be detected; and inputting the first feature and the second feature into a first neural network based on the transformer structure to obtain the lane lines in the image to be detected.
- By applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationships between lane lines can be effectively modeled.
- By adding the detection frame information of the objects in the image during lane line detection, the perception of the image scene can be improved, and misjudgments in scenes where lane lines are occluded by vehicles can be reduced.
- Processing the detection frame information of the image to be detected to obtain the second feature may include: processing at least one third feature and the detection frame information to obtain the second feature, where the at least one third feature is an intermediate feature obtained in the process of obtaining the first feature.
- In this way, the second feature contains not only the detection frame information but also features of the image, providing more detail for the subsequent determination of the lane lines.
- The above second feature includes the position feature and the semantic feature of the detection frame corresponding to the object in the image to be detected, and the detection frame information further includes the category and confidence of the detection frame; processing the at least one third feature and the detection frame information to obtain the second feature includes: acquiring the semantic feature based on the at least one third feature, the position, and the confidence; and acquiring the position feature based on the position and the category.
- the second feature not only considers the position of the detection frame, but also considers the type and confidence of the detection frame, so that the subsequent determined lane line is more accurate.
- The above step of acquiring the semantic feature based on the at least one third feature, the position, and the confidence includes: extracting region-of-interest (ROI) features from the at least one third feature based on the position, multiplying the ROI features by the confidence, and inputting the resulting features into a fully connected layer to obtain the semantic feature. Obtaining the position feature based on the position and the category includes: obtaining the vector of the category, splicing it with the vector corresponding to the position, and inputting the resulting features into a fully connected layer to obtain the position feature.
- the information of the second feature is more comprehensive, thereby improving the accuracy of lane line prediction.
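- As a purely illustrative sketch (not the implementation of this application), the second feature could be assembled roughly as follows; the 7×7 ROI size, the 256-dimensional embeddings, and the use of torchvision's roi_align are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class BoxFeatureEncoder(nn.Module):
    """Sketch: build the box 'second feature' (semantic + position) from an
    intermediate image feature map plus detection-frame position/category/confidence."""

    def __init__(self, feat_channels=256, num_classes=80, embed_dim=256):
        super().__init__()
        self.cls_embed = nn.Embedding(num_classes, embed_dim)
        self.sem_fc = nn.Linear(feat_channels * 7 * 7, embed_dim)   # semantic branch FC
        self.pos_fc = nn.Linear(embed_dim + 4, embed_dim)           # position branch FC

    def forward(self, third_feature, boxes, labels, scores, spatial_scale=0.125):
        # third_feature: (1, C, H, W); boxes: (N, 4) in image coords;
        # labels: (N,) long category ids; scores: (N,) confidences in [0, 1]
        rois = torch.cat([boxes.new_zeros(boxes.size(0), 1), boxes], dim=1)  # prepend batch index
        pooled = roi_align(third_feature, rois, output_size=(7, 7),
                           spatial_scale=spatial_scale)                      # (N, C, 7, 7)
        semantic = self.sem_fc(pooled.flatten(1) * scores.unsqueeze(1))      # ROI feature x confidence -> FC
        position = self.pos_fc(torch.cat([self.cls_embed(labels), boxes], dim=1))  # [category vector ; box] -> FC
        return semantic, position
```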
- The above first neural network based on the transformer structure includes an encoder, a decoder, and a feedforward neural network. Inputting the first feature and the second feature into the first neural network based on the transformer structure to obtain the lane lines in the image to be detected includes: obtaining a fourth feature based on the first feature, the second feature, and the encoder; inputting the fourth feature, the second feature, and a query feature into the decoder to obtain a fifth feature; and inputting the fifth feature into the feedforward neural network to obtain multiple point sets.
- By applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationships between lane lines can be effectively modeled.
- The lane lines subsequently determined based on the point sets are therefore more accurate.
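- A minimal sketch of the feedforward head at the end of this pipeline, regressing one point set per lane query from the fifth feature; the dimensions and the number of points per lane are illustrative assumptions, not values from this application:

```python
import torch.nn as nn


class LaneHeadSketch(nn.Module):
    """Sketch: the feedforward network regresses one point set
    (num_points 2-D points) per lane query from the fifth feature."""

    def __init__(self, dim=256, num_points=72):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, num_points * 2))

    def forward(self, fifth_feature):
        # fifth_feature: (B, N_lanes, dim) -> point sets: (B, N_lanes, num_points, 2)
        b, n, _ = fifth_feature.shape
        return self.ffn(fifth_feature).view(b, n, -1, 2)
```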
- The above steps further include: obtaining a first row feature and a first column feature based on the first feature, where the first row feature is obtained by flattening the matrix corresponding to the first feature along the row direction and the first column feature is obtained by flattening the matrix along the column direction. Inputting the first feature and the second feature into the encoder to obtain the fourth feature then includes: inputting the first feature, the second feature, the first row feature, and the first column feature into the encoder to obtain the fourth feature.
- In this way, the ability to construct features of long lane lines can be improved, so as to achieve a better lane line detection effect.
- The above step of inputting the first feature, the second feature, the first row feature, and the first column feature into the encoder to obtain the fourth feature includes: performing self-attention calculation on the first feature to obtain a first output; performing cross-attention calculation on the first feature and the second feature to obtain a second output; performing self-attention calculation and splicing processing on the first row feature and the first column feature to obtain a row-column output; and obtaining the fourth feature based on the first output, the second output, and the row-column output.
- The process of obtaining the fourth feature thus also takes the row-column output into account.
- This improves the ability to construct features of long lane lines, so as to achieve better lane line detection.
- The above step of obtaining the fourth feature based on the first output, the second output, and the row-column output includes: adding the first output and the second output to obtain a fifth output; and splicing the fifth output with the row-column output to obtain the fourth feature.
- This refines the specific process of obtaining the fourth feature: the fourth feature is obtained by concatenating the sum of the first output and the second output with the row-column output.
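- A hypothetical sketch of this encoder step built from standard multi-head attention modules; it assumes the row and column features have already been projected to the model dimension, and all names and sizes are illustrative:

```python
import torch
import torch.nn as nn


class EncoderBlockSketch(nn.Module):
    """Sketch: self-attention on the image (first) feature, cross-attention
    against the box (second) feature, addition of the two outputs, and
    concatenation with a row/column attention branch."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, first_feat, second_feat, row_feat, col_feat):
        # first_feat: (B, H*W, dim); second_feat: (B, N_boxes, dim)
        # row_feat: (B, H, dim); col_feat: (B, W, dim), already projected to dim
        out1, _ = self.self_attn(first_feat, first_feat, first_feat)      # first output
        out2, _ = self.cross_attn(first_feat, second_feat, second_feat)   # second output
        row_out, _ = self.row_attn(row_feat, row_feat, row_feat)
        col_out, _ = self.col_attn(col_feat, col_feat, col_feat)
        rowcol = torch.cat([row_out, col_out], dim=1)                     # spliced row-column output
        fifth_output = out1 + out2                                        # add first and second outputs
        return torch.cat([fifth_output, rowcol], dim=1)                   # fourth feature
```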
- Alternatively, the above step of inputting the first feature and the second feature into the encoder to obtain the fourth feature includes: performing self-attention calculation on the first feature to obtain the first output; performing cross-attention calculation on the first feature and the second feature to obtain the second output; and adding the first output and the second output to obtain the fourth feature.
- The fourth feature thus contains both the first output, calculated from the first feature through the self-attention mechanism, and the second output, calculated through cross-attention between the first feature and the second feature.
- This improves the expressive ability of the fourth feature.
- The above step of inputting the fourth feature, the second feature, and the query feature into the decoder to obtain the fifth feature includes: performing cross-attention calculation on the query feature and the fourth feature to obtain a third output; processing the query feature and the second feature to obtain a fourth output; and adding the third output and the fourth output to obtain the fifth feature.
- The cross-attention calculation enables the fifth feature to take more information about the predicted image into account, which improves the expressive ability of the fifth feature and makes the lane lines subsequently determined based on the point sets more accurate.
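- A hypothetical sketch of this decoder step; the learned lane queries, dimensions, and attention modules are illustrative assumptions:

```python
import torch.nn as nn


class DecoderBlockSketch(nn.Module):
    """Sketch: lane queries attend to the encoder output (fourth feature) and
    to the box feature (second feature); the two results are added."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_box = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, fourth_feat, second_feat):
        # query: (B, N_lanes, dim) learned lane queries
        out3, _ = self.attn_img(query, fourth_feat, fourth_feat)   # third output
        out4, _ = self.attn_box(query, second_feat, second_feat)   # fourth output
        return out3 + out4                                         # fifth feature
```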
- The above step of performing feature extraction on the image to be detected to obtain the first feature includes: performing feature fusion and dimensionality-reduction processing on the features output by different layers of a backbone network to obtain the first feature, where the input of the backbone network is the image to be detected.
- Low-level features have higher resolution and contain more position and detail information but, having passed through fewer convolutions, carry weaker semantics and more noise; high-level features have stronger semantic information but lower resolution and poorer perception of detail. By fusing features extracted from different layers of the neural network, the resulting first feature combines multiple levels of information.
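- One possible sketch of such multi-layer fusion and dimensionality reduction (an FPN-like lateral fusion); the channel sizes and the interpolation scheme are assumptions for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F


class SimpleFusionNeck(nn.Module):
    """Sketch: fuse features from several backbone stages and reduce the
    channel dimension to obtain the 'first feature'."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.reduce = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i), shallow to deep; fuse at the shallowest resolution
        target = feats[0].shape[-2:]
        fused = sum(F.interpolate(lat(f), size=target, mode='bilinear', align_corners=False)
                    for lat, f in zip(self.lateral, feats))
        return self.reduce(fused)   # (B, out_channels, H_0, W_0)
```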
- the second aspect of the embodiments of the present application provides a lane line detection method, which can be applied to intelligent driving scenarios. For example: adaptive cruise, lane departure warning, lane keeping assistance and other scenarios that include lane line detection.
- the method may be executed by a detection device (such as a vehicle or a device in a vehicle), or may be executed by components of a detection device (such as a processor, a chip, or a chip system, etc.).
- The method includes: acquiring an image to be detected; and processing the image to be detected to obtain multiple point sets, where each point set represents a lane line in the image to be detected. The processing predicts the point sets of the lane lines in the image based on a first neural network with a transformer structure and on detection frame information, where the detection frame information includes the position, in the image to be detected, of the detection frame of at least one object in the image to be detected.
- By applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationships between lane lines can be effectively modeled.
- By adding the detection frame information of the objects in the image during lane line detection, the target neural network's ability to perceive the image scene can be improved, and misjudgments in scenes where lane lines are occluded by vehicles can be reduced.
- the above detection frame information further includes: a category and a confidence level of the detection frame.
- In this way, richer detection frame information is available for subsequent lane line prediction, so that the lane lines subsequently determined based on the point sets are more accurate.
- the above steps further include: displaying lane lines.
- In this way, the user can pay attention to the lane lines of the current road, especially in scenes where the lane lines are occluded, which helps the user determine the lane lines accurately and reduces the risk caused by unclear lane lines.
- The above steps further include: modeling the at least one object to obtain a virtual object; performing fusion processing on the multiple point sets and the virtual object based on their positions to obtain a target image; and displaying the target image.
- The target image is obtained by modeling the virtual object and merging it with the multiple point sets based on position. Through the target image, users can understand the surrounding objects and lane lines, which helps them determine surrounding objects and lane lines accurately and reduces the risk caused by unclear lane lines.
- the third aspect of the embodiments of the present application provides an image processing method, which can be applied to intelligent driving scenarios. For example: adaptive cruise, lane departure warning, lane keeping assistance and other scenarios that include lane line detection.
- the method may be executed by an image processing device (such as a terminal device or a server), or may be executed by components of the image processing device (such as a processor, a chip, or a chip system, etc.).
- The method comprises: acquiring a training image; inputting the training image into a target neural network to obtain a first point set of the training image, the first point set representing a predicted lane line in the training image, where the target neural network is used to perform feature extraction on the training image to obtain a first feature, process detection frame information of the training image to obtain a second feature (the detection frame information including the position, in the training image, of the detection frame of an object in the training image), and obtain the first point set based on the first feature and the second feature, the target neural network predicting the point set of the lane line in the image based on the transformer structure; and training the target neural network according to the first point set and the real point set of the actual lane line in the training image, to obtain a trained target neural network.
- By applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationships between lane lines can be effectively modeled.
- By adding the detection frame information of the objects in the image during lane line detection, the target neural network's ability to perceive the image scene can be improved, and misjudgments in scenes where lane lines are occluded by vehicles can be reduced.
- the fourth aspect of the embodiments of the present application provides an image processing device, which can be applied to an intelligent driving scene. For example: adaptive cruise, lane departure warning, lane keeping assistance and other scenarios that include lane line detection.
- The image processing device includes: an extraction unit, configured to extract features of the image to be detected to obtain a first feature; a processing unit, configured to process detection frame information of the image to be detected to obtain a second feature, where the detection frame information includes the position, in the image to be detected, of the detection frame of at least one object in the image to be detected; and a determination unit, configured to input the first feature and the second feature into a first neural network based on the transformer structure to obtain the lane lines in the image to be detected.
- The above processing unit is specifically configured to process at least one third feature and the detection frame information to obtain the second feature, where the at least one third feature is an intermediate feature obtained in the process of obtaining the first feature.
- The above second feature includes the position feature and the semantic feature of the detection frame corresponding to the object in the image to be detected, and the detection frame information further includes the category and confidence of the detection frame; the processing unit is specifically configured to acquire the semantic feature based on the at least one third feature, the position, and the confidence, and to acquire the position feature based on the position and the category.
- The above processing unit is specifically configured to: extract region-of-interest (ROI) features from the at least one third feature based on the position; multiply the ROI features by the confidence and input the resulting features into a fully connected layer to obtain the semantic feature; and obtain the vector of the category, splice it with the vector corresponding to the position, and input the resulting features into a fully connected layer to obtain the position feature.
- The above first neural network based on the transformer structure includes an encoder, a decoder, and a feedforward neural network; the determination unit is specifically configured to input the first feature and the second feature into the encoder to obtain the fourth feature, to input the fourth feature, the second feature, and the query feature into the decoder to obtain the fifth feature, and to input the fifth feature into the feedforward neural network to obtain multiple point sets, each point set representing a lane line in the image to be detected.
- The above image processing device further includes an acquisition unit, configured to obtain a first row feature and a first column feature based on the first feature, where the first row feature is obtained by flattening the matrix corresponding to the first feature along the row direction and the first column feature is obtained by flattening the matrix along the column direction; the determination unit is specifically configured to input the first feature, the second feature, the first row feature, and the first column feature into the encoder to obtain the fourth feature.
- the above-mentioned determining unit is specifically configured to perform self-attention calculation on the first feature to obtain the first output; the determining unit is specifically configured to perform self-attention calculation on the first feature Perform cross-attention calculation with the second feature to obtain the second output; the determination unit is specifically used to perform self-attention calculation and splicing processing on the first row feature and the first column feature to obtain row and column outputs; the determination unit is specifically used for A fourth feature is obtained based on the first output, the second output, and the row and column outputs.
- The above determination unit is specifically configured to add the first output and the second output to obtain the fifth output, and to splice the fifth output with the row-column output to obtain the fourth feature.
- the above-mentioned determining unit is specifically configured to perform self-attention calculation on the first feature to obtain the first output; the determining unit is specifically configured to perform self-attention calculation on the first feature Perform cross-attention calculation with the second feature to obtain the second output; the determination unit is specifically configured to add the first output and the second output to obtain the fourth feature.
- The above determination unit is specifically configured to perform cross-attention calculation on the query feature and the fourth feature to obtain the third output, to process the query feature and the second feature to obtain the fourth output, and to add the third output and the fourth output to obtain the fifth feature.
- The above extraction unit is specifically configured to perform feature fusion and dimensionality-reduction processing on the features output by different layers of the backbone network to obtain the first feature, where the input of the backbone network is the image to be detected.
- the fifth aspect of the embodiment of the present application provides a detection device, which can be applied to an intelligent driving scene. For example: adaptive cruise, lane departure warning, lane keeping assistance and other scenarios that include lane line detection.
- The detection device is applied to a vehicle and includes: an acquisition unit, configured to obtain an image to be detected; and a processing unit, configured to process the image to be detected to obtain multiple point sets, each point set representing a lane line in the image to be detected, where the processing predicts the point sets of the lane lines in the image based on a first neural network with a transformer structure and on detection frame information, the detection frame information including the position, in the image to be detected, of the detection frame of at least one object in the image to be detected.
- the above detection frame information further includes: a category and a confidence level of the detection frame.
- the above detection device further includes: a display unit, configured to display lane lines.
- The above processing unit is further configured to model the at least one object to obtain a virtual object, and to perform fusion processing on the multiple point sets and the virtual object based on position to obtain a target image; the display unit is further configured to display the target image.
- a sixth aspect of the embodiments of the present application provides an image processing device, which can be applied to intelligent driving scenarios. For example: adaptive cruise, lane departure warning, lane keeping assistance and other scenarios that include lane line detection.
- The image processing device includes: an acquisition unit, configured to acquire a training image; a processing unit, configured to input the training image into a target neural network to obtain a first point set of the training image, the first point set representing a predicted lane line in the training image, where the target neural network is used to perform feature extraction on the training image to obtain a first feature, process detection frame information of the training image to obtain a second feature (the detection frame information including the position, in the training image, of the detection frame of an object in the training image), and obtain the first point set based on the first feature and the second feature, the target neural network predicting the point set of the lane line in the image based on the transformer structure; and a training unit, configured to train the target neural network according to the first point set and the real point set of the actual lane line in the training image, to obtain a trained target neural network.
- The seventh aspect of the present application provides an image processing device, including a processor coupled with a memory, the memory being used to store programs or instructions which, when executed by the processor, cause the image processing device to implement the method in the foregoing first aspect or any possible implementation of the first aspect, or the method in the foregoing third aspect or any possible implementation of the third aspect.
- The eighth aspect of the present application provides a detection device, including a processor coupled with a memory, the memory being used to store programs or instructions which, when executed by the processor, cause the detection device to implement the method in the foregoing second aspect or any possible implementation of the second aspect.
- The ninth aspect of the present application provides a computer-readable medium on which computer programs or instructions are stored; when the computer programs or instructions are run on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation of the first aspect, the method in the foregoing second aspect or any possible implementation of the second aspect, or the method in the foregoing third aspect or any possible implementation of the third aspect.
- The tenth aspect of the present application provides a computer program product; when the computer program product is executed on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
- For the technical effects brought by the fourth, seventh, eighth, ninth, or tenth aspect, or by any possible implementation thereof, reference may be made to the technical effects of the first aspect or its different possible implementations, which are not repeated here.
- For the technical effects brought by the fifth, seventh, eighth, ninth, or tenth aspect, or by any possible implementation thereof, reference may be made to the technical effects of the second aspect or its different possible implementations, which are not repeated here.
- For the technical effects brought by the sixth, seventh, eighth, ninth, or tenth aspect, or by any possible implementation thereof, reference may be made to the technical effects of the third aspect or its different possible implementations, which are not repeated here.
- The embodiments of the present application have the following advantages. On the one hand, by applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationships between lane lines can be effectively modeled. On the other hand, by adding the detection frame information of the objects in the image during lane line detection, the perception of the image scene can be improved, and misjudgments in scenes where lane lines are occluded by vehicles can be reduced.
- FIG. 1 is a schematic structural diagram of a system architecture provided in an embodiment of the present application.
- FIG. 2 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
- FIG. 3a is a schematic structural diagram of an image processing system provided by an embodiment of the present application.
- Fig. 3b is another schematic structural diagram of the image processing system provided by the embodiment of the present application.
- FIG. 4 is a schematic structural diagram of a vehicle provided in an embodiment of the present application.
- FIG. 5 is a schematic flowchart of an image processing method provided in an embodiment of the present application.
- FIG. 6 is a schematic flow diagram of obtaining the second feature in the embodiment of the present application.
- FIG. 7 is a schematic structural diagram of the first neural network provided by the embodiment of the present application.
- FIG. 8 is a schematic structural diagram of a transformer structure provided in an embodiment of the present application.
- FIG. 9 is a schematic flow diagram of obtaining the fourth feature in the embodiment of the present application.
- Fig. 10 is a schematic flow chart of obtaining the fourth output in the embodiment of the present application.
- FIG. 11 is another schematic structural diagram of the first neural network provided by the embodiment of the present application.
- FIG. 12 is another structural schematic diagram of the transformer structure provided by the embodiment of the present application.
- FIG. 13 is a schematic structural diagram of the row and column attention module provided by the embodiment of the present application.
- Figure 14a is an example diagram including the process of determining multiple point sets provided by the embodiment of the present application.
- Figure 14b is an example diagram of multiple point sets provided by the embodiment of the present application.
- Fig. 14c is an example diagram of an image to be detected including multiple point sets provided by the embodiment of the present application.
- Fig. 14d is an example diagram corresponding to lane line detection provided by the embodiment of the present application.
- FIG. 15 is another schematic flowchart of the image processing method provided by the embodiment of the present application.
- FIG. 16 is a schematic structural diagram of the target neural network provided by the embodiment of the present application.
- Fig. 17 is another structural schematic diagram of the target neural network provided by the embodiment of the present application.
- FIG. 18 is a schematic flowchart of a lane line detection method provided in an embodiment of the present application.
- FIG. 19 is an example diagram of a target image provided by an embodiment of the present application.
- Fig. 20 is a schematic flow chart of the model training method provided by the embodiment of the present application.
- FIG. 21 is a schematic structural diagram of an image processing device provided in an embodiment of the present application.
- Fig. 22 is a schematic structural diagram of the detection equipment provided by the embodiment of the present application.
- Fig. 23 is another schematic structural diagram of the image processing device provided by the embodiment of the present application.
- Fig. 24 is another schematic structural diagram of the image processing device provided by the embodiment of the present application.
- Fig. 25 is another schematic structural diagram of the detection device provided by the embodiment of the present application.
- Embodiments of the present application provide an image processing method, a lane line detection method and related equipment. It can improve the accuracy of detecting lane lines in images.
- the first step of intelligent driving is the collection and processing of environmental information.
- Lane lines, as one of the most important indication markings on the road surface, can effectively guide intelligent vehicles to drive within the permitted road area. Therefore, detecting lane lines on the road accurately and in real time is an important part of designing intelligent-vehicle systems: it supports path planning, road departure warning and other functions, and provides a reference for precise navigation.
- the purpose of lane line detection technology is to accurately identify the lane lines on the road surface by analyzing the pictures collected by the on-board camera during driving, so as to assist the car to drive in the correct lane.
- the lane line detection model based on image segmentation first predicts the segmentation results of the entire image, and then outputs the lane line detection results after clustering.
- the detection-based lane line detection predicts a large number of candidate lane lines by generating multiple anchor points and predicting the offset of the lane line relative to the anchor point, and then performs post-processing through non-maximum suppression to obtain the final lane line detection result.
- For example, one such scheme is the spatial convolutional neural network (SCNN), which modifies traditional convolution. Traditional convolution performs a convolution operation on a feature of dimension H×W×C. The SCNN scheme first divides the H×W×C feature vertically into H slices of size W×C and convolves these slices from bottom to top and from top to bottom; it then divides the feature horizontally into W slices of size H×C and convolves these slices from left to right and from right to left; finally, the convolution results from these four directions are spliced together, and a segmentation map of the image is output by a fully connected layer, so as to realize lane line detection.
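- A simplified sketch of one of the four directional passes described above (top to bottom), assuming a shared 1×k convolution; it illustrates the slice-wise message passing rather than the exact SCNN implementation:

```python
import torch
import torch.nn as nn


def scnn_top_down_pass(feat, conv):
    # feat: (B, C, H, W); one of the four SCNN directions: slice into H rows
    # and pass information downward, each row receiving the convolved row above it.
    rows = list(feat.split(1, dim=2))                    # H slices of shape (B, C, 1, W)
    for i in range(1, len(rows)):
        rows[i] = rows[i] + torch.relu(conv(rows[i - 1]))
    return torch.cat(rows, dim=2)


conv = nn.Conv2d(64, 64, kernel_size=(1, 9), padding=(0, 4))    # shared 1xk kernel
out = scnn_top_down_pass(torch.randn(2, 64, 36, 100), conv)     # (2, 64, 36, 100)
```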
- However, because a convolutional neural network is limited by its receptive field, it cannot perceive the global information of the picture well.
- This is unfavourable for predicting objects, such as lane lines, whose parts are related over long distances (which can also be understood as long, thin shapes).
- As a result, the position of the lane line cannot be predicted accurately, and the model is prone to false detections.
- the embodiment of the present application provides an image processing method, a lane line detection method and related equipment.
- By applying the transformer structure to the lane line detection task, the long-range relationships between lane lines can be effectively modeled.
- By adding the detection frame information of the objects in the image during lane line detection, the ability to perceive the scene can be improved, reducing misjudgments in scenarios where lane lines are occluded by vehicles.
- A neural network may be composed of neural units. A neural unit may refer to an operation unit that takes inputs $X_s$ and an intercept 1, and the output of the operation unit may be: $h_{W,b}(X)=f(W^{T}X)=f\left(\sum_{s=1}^{n} W_{s}X_{s}+b\right)$
- where $W_s$ is the weight of $X_s$ and $b$ is the bias of the neural unit.
- $f$ is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function can be used as the input of the next convolutional layer.
- The activation function can be a ReLU function.
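- As a concrete illustration, a minimal sketch of such a neural unit with ReLU as the activation f (all values are arbitrary):

```python
import torch


def neural_unit(x, w, b):
    # output = f( sum_s w_s * x_s + b ), here with ReLU as the activation f
    return torch.relu(torch.dot(w, x) + b)


y = neural_unit(torch.tensor([0.5, -1.0, 2.0]),
                torch.tensor([0.1, 0.4, -0.2]),
                0.3)   # tensor(0.), since 0.05 - 0.4 - 0.4 + 0.3 < 0
```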
- a neural network is a network formed by connecting many of the above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
- the local receptive field can be an area composed of several neural units.
- W is a weight vector, and each value in the vector represents the weight value of a neuron in this layer of neural network.
- the vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how to transform the space.
- the purpose of training the neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vector W of many layers). Therefore, the training process of the neural network is essentially to learn the way to control the spatial transformation, and more specifically, to learn the weight matrix.
- Convolutional neural network is a deep neural network with a convolutional structure.
- a convolutional neural network consists of a feature extractor consisting of a convolutional layer and a subsampling layer.
- the feature extractor can be seen as a filter, and the convolution process can be seen as convolving the same trainable filter with an input image or convolutional feature map.
- the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
- a neuron can only be connected to some adjacent neurons.
- a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
- Neural units of the same feature plane share weights, and the shared weights here are convolution kernels.
- Shared weights can be understood as a way to extract image information that is independent of location. The underlying principle is that the statistical information of a certain part of the image is the same as that of other parts. That means that the image information learned in one part can also be used in another part. So for all positions on the image, the same learned image information can be used.
- multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
- the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
- the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
- The transformer structure is a feature extraction network (a category of network alongside the convolutional neural network) that includes an encoder and a decoder.
- The encoder performs feature learning under a global receptive field through self-attention, for example learning pixel features.
- The decoder learns the features of the required modules, such as the features of the output boxes, through self-attention and cross-attention.
- Attention is also known as the attention mechanism.
- the attention mechanism can quickly extract important features of sparse data.
- the attention mechanism occurs between the encoder and the decoder, or between the input sentence and the generated sentence.
- the self-attention mechanism in the self-attention model occurs within the input sequence or within the output sequence, and can extract the connection between words that are far apart in the same sentence, such as syntactic features (phrase structure).
- The attention mechanism provides an effective way of modeling global context information through QKV. Assuming the input is a query Q, the context is stored in the form of key-value pairs (K, V); the attention function is then essentially a mapping from a query to a series of key-value pairs.
- Attention essentially assigns a weight coefficient to each element in the sequence, which can also be understood as soft addressing: if each element in the sequence is stored as a (K, V) pair, attention completes the addressing by calculating the similarity between Q and K. The similarity between Q and K reflects the importance, that is the weight, of the retrieved value V, and the weighted summation then gives the final feature value.
- The calculation of attention is mainly divided into three steps.
- The first step is to calculate the similarity between the query and each key to obtain the weights; commonly used similarity functions include the dot product, concatenation, and perceptrons.
- The second step is generally to use a softmax function to normalize these weights: on the one hand, normalization yields a probability distribution in which all weight coefficients sum to 1; on the other hand, the characteristics of the softmax function can be used to highlight the weights of important elements.
- Finally, the weights and the corresponding values are weighted and summed to obtain the final feature value.
- The specific calculation formula can be as follows: $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$
- where d is the dimension of the Q and K matrices.
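- A compact sketch of the three-step computation above (similarity, softmax normalization, weighted sum):

```python
import math
import torch


def attention(q, k, v):
    # step 1: similarity of Q and K; step 2: softmax-normalised weights; step 3: weighted sum of V
    d = q.size(-1)
    weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return weights @ v
```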
- attention includes self-attention and cross-attention.
- Self-attention can be understood as a special case of attention in which the Q, K, and V inputs are the same, whereas in cross-attention the Q, K, and V inputs are not all the same. Attention uses the similarity between features (such as their inner product) as a weight and aggregates the queried features as the update value of the current feature; self-attention is attention computed on the feature map itself.
- the setting of the convolution kernel limits the size of the receptive field, resulting in the network often requiring multiple layers of stacking to focus on the entire feature map.
- the advantage of self-attention is that its attention is global, and it can obtain the global spatial information of the feature map through simple query and assignment.
- the special point of self-attention in the query key value (QKV) model is that the corresponding input of QKV is consistent. The QKV model will be described later.
- The feedforward neural network was the earliest and simplest type of artificial neural network to be devised.
- In a feedforward neural network, each neuron belongs to a layer; the neurons in each layer receive signals from the neurons in the previous layer and generate signals that are output to the next layer.
- Layer 0 is called the input layer
- the last layer is called the output layer
- other intermediate layers are called hidden layers. There is no feedback in the entire network, and the signal propagates unidirectionally from the input layer to the output layer.
- Multilayer perceptron (MLP)
- A multilayer perceptron is a feed-forward artificial neural network model that maps an input onto a single output.
- Concat and add are feature fusion methods. Concat directly connects two features: if the dimensions of the two input features x and y are p and q, the dimension of the output feature z is p+q. Add is a parallel fusion strategy: for input features x and y, the two feature vectors are combined into a new feature z with the same number of channels.
- Add increases the amount of information under each dimension of the features describing the image, while the number of dimensions of the description itself does not increase; concat, by contrast, combines the numbers of channels, that is, it increases the number of features describing the image without increasing the information under each feature.
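- A small sketch illustrating the difference, assuming two 64-channel feature maps (shapes are illustrative):

```python
import torch

x = torch.randn(1, 64, 32, 32)        # feature x: 64 channels
y = torch.randn(1, 64, 32, 32)        # feature y: 64 channels

z_concat = torch.cat([x, y], dim=1)   # concat: channel counts combine -> (1, 128, 32, 32)
z_add = x + y                         # add: channel count unchanged   -> (1, 64, 32, 32)
```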
- Dimensionality reduction is the operation of transforming high-dimensional data into low-dimensional data.
- the dimensionality reduction process is mainly for the feature matrix.
- the feature matrix can be reduced in dimension through a linear transformation layer.
- the dimensionality reduction processing of the feature matrix can also be understood as reducing the dimensionality of the vector space corresponding to the feature matrix.
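- For example, a minimal sketch of reducing feature vectors through a linear transformation layer (the dimensions 1024 and 256 are illustrative assumptions):

```python
import torch
import torch.nn as nn

reduce = nn.Linear(1024, 256)       # linear transformation layer
high_dim = torch.randn(100, 1024)   # 100 feature vectors of dimension 1024
low_dim = reduce(high_dim)          # reduced to dimension 256: shape (100, 256)
```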
- Region of interest (ROI): in machine vision and image processing, the region to be processed, outlined from the processed image in the form of a box, circle, ellipse, irregular polygon, or the like.
- an embodiment of the present invention provides a system architecture 100 .
- the data collection device 160 is used to collect training data
- the training data in the embodiment of the present application includes: training images.
- the training data may also include the first feature of the training image and detection frame information corresponding to the object in the training image.
- the training data is stored in the database 130 , and the training device 120 obtains the target model/rule 101 based on training data maintained in the database 130 .
- the following will describe in more detail how the training device 120 obtains the target model/rule 101 based on the training data, and the target model/rule 101 can be used to implement the image processing method provided by the embodiment of the present application.
- the target model/rule 101 in the embodiment of the present application may specifically be a target neural network.
- the training data maintained in the database 130 may not all be collected by the data collection device 160, but may also be received from other devices.
- The training device 120 does not necessarily train the target model/rule 101 entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of the present application.
- The target model/rule 101 obtained by training with the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1 , which may be a laptop, an augmented reality (AR) or virtual reality (VR) device, a vehicle-mounted terminal, or the like.
- the execution device 110 may also be a server or a cloud.
- the execution device 110 is configured with an I/O interface 112 for data interaction with external devices. Users can input data to the I/O interface 112 through the client device 140.
- In the embodiment of this application, the input data may include the image to be detected.
- the input data may be input by the user, or uploaded by the user through the shooting device, and of course, may also come from a database, which is not limited here.
- the preprocessing module 113 is used to perform preprocessing according to the input data received by the I/O interface 112.
- the preprocessing module 113 may be used to acquire features of the image to be detected.
- the preprocessing module 113 may also be used to acquire detection frame information corresponding to the object in the image to be detected.
- When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculations or other related processing, the execution device 110 can call data, code, etc. in the data storage system 150 for the corresponding processing, and the data and instructions obtained by the corresponding processing may also be stored in the data storage system 150.
- the I/O interface 112 returns the processing result, such as the point set obtained above or the image including the point set, to the client device 140, so as to provide it to the user.
- the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above-mentioned goals or complete above tasks, thereby providing the desired result to the user.
- the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 112 .
- the client device 140 can automatically send the input data to the I/O interface 112 . If the client device 140 is required to automatically send the input data to obtain the user's authorization, the user can set the corresponding authority in the client device 140 .
- the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be specific ways such as display, sound, and action.
- the client device 140 can also be used as a data collection terminal, collecting the input data input to the I/O interface 112 as shown in the figure and the output results of the output I/O interface 112 as new sample data, and storing them in the database 130 .
- the I/O interface 112 directly uses the input data input to the I/O interface 112 as shown in the figure and the output result of the output I/O interface 112 as a new sample The data is stored in database 130 .
- accompanying drawing 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation; for example, in accompanying drawing 1, the data storage system 150 is an external memory relative to the execution device 110, while in other cases the data storage system 150 may also be placed in the execution device 110.
- a target model/rule 101 is obtained through training by a training device 120 , and the target model/rule 101 in the embodiment of the present application may specifically be a target neural network.
- a chip hardware structure provided by the embodiment of the present application is introduced below.
- FIG. 2 is a hardware structure of a chip provided by an embodiment of the present invention, and the chip includes a neural network processor 20 .
- the chip can be set in the execution device 110 shown in FIG. 1 to complete the computing work of the computing module 111 .
- the chip can also be set in the training device 120 shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101 .
- the neural network processor 20 can be a neural network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or another processor suitable for large-scale operation processing.
- Taking the NPU as an example: the neural network processor 20 is mounted on a main central processing unit (CPU) (host CPU) as a coprocessor, and the main CPU assigns tasks to it.
- the core part of the NPU is the operation circuit 203, and the controller 204 controls the operation circuit 203 to extract data in the memory (weight memory or input memory) and perform operations.
- the operation circuit 203 includes multiple processing units (process engine, PE).
- arithmetic circuit 203 is a two-dimensional systolic array.
- the arithmetic circuit 203 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
- arithmetic circuit 203 is a general-purpose matrix processor.
- the operation circuit 203 fetches the data corresponding to the matrix B from the weight memory 202, and caches it in each PE in the operation circuit.
- the operation circuit fetches the data of matrix A from the input memory 201 and performs matrix operation with matrix B, and the obtained partial results or final results of the matrix are stored in the accumulator 208 .
- the vector computing unit 207 can perform further processing on the output of the computing circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
- the vector calculation unit 207 can be used for network calculations of non-convolution/non-FC layers in a neural network, such as pooling, batch normalization, local response normalization, etc.
- the vector calculation unit 207 can store the processed output vector to the unified buffer 206.
- the vector computing unit 207 may apply a non-linear function to the output of the computing circuit 203, such as a vector of accumulated values, to generate activation values.
- vector computation unit 207 generates normalized values, binned values, or both.
- the vector of processed outputs can be used as an activation input to the arithmetic circuit 203, for example for use in a subsequent layer in a neural network.
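- As an illustrative sketch only (Python/PyTorch; the shapes and the ReLU/normalization choices are assumptions, not the claimed hardware behavior), the data flow described above can be mimicked as follows: the operation circuit multiplies the matrices, the accumulator holds the result, and the vector calculation unit applies a non-linear function and a normalization.

```python
import torch

# Hypothetical small shapes; a real PE array tiles much larger matrices.
A = torch.randn(4, 8)    # data fetched from the input memory 201
B = torch.randn(8, 16)   # weight data fetched from the weight memory 202

# Operation circuit 203: matrix multiplication; partial/final results land in the accumulator 208.
acc = A @ B

# Vector calculation unit 207: further processing of the accumulated output, e.g. applying a
# non-linear function to generate activation values and a simple normalization.
activations = torch.relu(acc)
normalized = (activations - activations.mean()) / (activations.std() + 1e-6)
```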
- the unified memory 206 is used to store input data and output data.
- a direct memory access controller (DMAC) 205 transfers the input data in the external memory to the input memory 201 and/or the unified memory 206, stores the weight data in the external memory into the weight memory 202, and stores the data in the unified memory 206 into the external memory.
- a bus interface unit (bus interface unit, BIU) 210 is configured to implement interaction between the main CPU, DMAC and instruction fetch memory 209 through the bus.
- An instruction fetch buffer 209 connected to the controller 204 is used to store instructions used by the controller 204.
- the controller 204 is configured to invoke the instructions cached in the instruction fetch memory 209 to control the working process of the operation accelerator.
- the unified memory 206, the input memory 201, the weight memory 202, and the instruction fetch memory 209 are all on-chip (On-Chip) memories
- the external memory is a memory outside the NPU
- the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
- Fig. 3a is a schematic structural diagram of an image processing system provided by an embodiment of the present application.
- the image processing system includes a user device (a vehicle is taken as an example in Fig. 3a) and an image processing device.
- the user equipment can also be a smart terminal such as a mobile phone, a vehicle terminal, an aircraft terminal, a VR/AR device, and an intelligent robot.
- the user equipment is the initiator of the image processing; as the initiator of the image processing request, it is usually the user who initiates the request through the user equipment.
- the above-mentioned image processing device may be a device or server having an image processing function such as a cloud server, a network server, an application server, and a management server.
- the image processing device receives the image processing request from the intelligent terminal through the interactive interface, and then performs image processing such as machine learning, deep learning, search, reasoning, and decision-making by means of the memory that stores data and the processor that performs the image processing.
- the storage in the image processing device may be a general term, including local storage and a database storing historical data, and the database may be on the image processing device or on other network servers.
- the user equipment can receive user instructions; for example, the user equipment can obtain an image input/selected by the user (or an image captured by the user equipment through its camera), and then initiate a request to the image processing device, so that the image processing device executes an image processing application (for example, lane line detection in the image) on the image obtained by the user equipment, so as to obtain a corresponding processing result for the image.
- the user device may obtain an image input by the user, and then initiate an image detection request to the image processing device, so that the image processing device detects the image, thereby obtaining a detection result of the image (that is, a point set of lane lines), and displays the detection result of the image for the user to view and use.
- the image processing device can execute the image processing method of the embodiment of the present application.
- Fig. 3b is another schematic structural diagram of the image processing system provided by the embodiment of the present application.
- in Fig. 3b, the user equipment (a vehicle is taken as an example) performs the processing by its own hardware, and the specific process is similar to that shown in Fig. 3a; reference may be made to the above description, and details are not repeated here.
- the user equipment may receive an instruction from the user; for example, the user equipment may acquire an image selected by the user in the user equipment, and then the user equipment itself executes the image processing application on the image (for example, lane line detection in the image), so as to obtain the corresponding processing result for the image, and displays the processing result for the user to view and use.
- the user equipment may collect an image of the road where the user equipment is located in real time or periodically, and then the user equipment itself executes an image processing application on the image (for example, lane line detection), so as to obtain the corresponding processing results for the image, and realizes intelligent driving functions according to the processing results, such as adaptive cruise, lane departure warning (LDW), lane keeping assist (LKA), and so on.
- the user equipment itself can execute the image processing method of the embodiment of the present application.
- the above-mentioned user equipment in FIG. 3a and FIG. 3b may specifically be the client device 140 or the execution device 110 in FIG. 1, and the image processing device in FIG. 3a may specifically be the execution device 110 in FIG. 1.
- the data storage system 250 may be integrated on the execution device 210, or set on the cloud or other network servers.
- the processors in Figure 3a and Figure 3b can perform data training/machine learning/deep learning through neural network models or other models (for example, models based on support vector machines), and use the finally trained or learned model to execute the image processing application on the image, so as to obtain the corresponding processing results.
- FIG. 4 is a schematic structural diagram of a vehicle provided in an embodiment of the present application.
- the vehicle may include various subsystems, such as a propulsion system 402, a sensor system 404, a control system 406, one or more peripheral devices 408, as well as a power source 410 and a user interface 416.
- a vehicle may include more or fewer subsystems, and each subsystem may include multiple components.
- each subsystem and component of the vehicle may be interconnected by wire or wirelessly (eg, Bluetooth).
- Propulsion system 402 may include components that provide powered motion for the vehicle.
- propulsion system 402 may include engine 418 , energy source 419 , transmission 420 , and wheels 421 .
- the engine 418 may be an internal combustion engine, an electric motor, an air compression engine or other types of engine combinations, for example, a hybrid engine composed of a gasoline engine and an electric motor, or a hybrid engine composed of an internal combustion engine and an air compression engine.
- Engine 418 converts energy source 419 into mechanical energy. Examples of energy sources 419 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electrical power. Energy source 419 may also provide energy to other systems of the vehicle.
- Transmission 420 may transmit mechanical power from engine 418 to wheels 421 .
- Transmission 420 may include a gearbox, a differential, and a drive shaft.
- the transmission device 420 may also include other devices, such as clutches.
- the drive shaft may include one or more shafts that may be coupled to the wheels 421 .
- Sensor system 404 may include a number of sensors that sense information about the vehicle's location.
- the sensor system 404 may include a positioning system 422 (eg, Global Positioning System, Beidou system, or other positioning systems), an inertial measurement unit (IMU) 424 , a radar 426 , a laser range finder 428 , and a camera 430 .
- Sensor system 404 may also include sensors of the interior systems of the monitored vehicle (eg, interior air quality monitor, fuel gauge, oil temperature gauge, etc.). Sensory data from one or more of these sensors can be used to detect objects and their corresponding properties (eg, position, shape, orientation, velocity, etc.). Such detection and identification are critical functions for the safe operation of autonomous vehicles.
- the positioning system 422 can be used to estimate the geographic location of the vehicle, such as the latitude and longitude information of the vehicle's location.
- the IMU 424 is used to sense the position and orientation changes of the vehicle based on the inertial acceleration rate.
- IMU 424 may be a combination accelerometer and gyroscope.
- the radar 426 can use radio signals to perceive objects in the surrounding environment of the vehicle, specifically, it can be expressed as a millimeter-wave radar or a laser radar. In some embodiments, in addition to sensing objects, radar 426 may be used to sense the velocity and/or heading of objects.
- Laser range finder 428 may utilize laser light to sense objects in the environment in which the vehicle is located.
- laser rangefinder 428 may include one or more laser sources, a laser scanner, and one or more detectors, among other system components.
- Camera 430 may be used to capture multiple images of the vehicle's surroundings. Camera 430 may be a still camera or a video camera.
- Control system 406 controls the operation of the vehicle and its components.
- the control system 406 may include various components, including a steering system 432, an accelerator 434, a braking unit 436, an electronic control unit (ECU) 438, and a vehicle controller 440 (body control module, BCM).
- the steering system 432 is operable to adjust the forward direction of the vehicle.
- the throttle 434 is used to control the rate of operation of the engine 418 and thus the speed of the vehicle.
- the braking unit 436 is used to control the deceleration of the vehicle.
- the braking unit 436 may use friction to slow the wheels 421 .
- the brake unit 436 can convert the kinetic energy of the wheel 421 into electric current.
- the braking unit 436 may also take other forms to slow down the wheels 421 to control the speed of the vehicle.
- Vehicle electronic control unit 438 may be implemented as a single ECU or multiple ECUs on the vehicle, configured to communicate with the peripheral devices 408 and the sensor system 404.
- the vehicle ECU 438 may include at least one processor 4381 and a memory 4382 (read-only memory, ROM).
- at least one processor may be implemented as one or more microprocessors, controllers, microcontroller units (microcontroller unit, MCU) or state machines.
- At least one processor may be implemented as a combination of computing devices, such as a digital signal processor or microprocessor, multiple microprocessors, one or more microprocessors in combination with a digital signal processor core, or any other combination of this configuration.
- ROM can provide data storage, including storage of addresses, routes, and driving directions in this application.
- the BCM 440 can provide the ECU 438 with vehicle engine status, speed, gear position, steering wheel angle, and other information.
- Peripherals 408 may include wireless communication system 446 , navigation system 448 , microphone 450 and/or speaker 452 .
- peripherals 408 provide a means for a user of the vehicle to interact with user interface 416 .
- navigation system 448 may be implemented as part of an in-vehicle entertainment system, an in-vehicle display system, an in-vehicle instrument cluster, and the like.
- navigation system 448 is implemented to include or cooperate with sensor system 404 that derives the current geographic location of the vehicle in real-time or substantially real-time.
- Navigation system 448 is configured to provide navigation data to the driver of the vehicle.
- Navigation data may include vehicle location data, suggested routes and planned driving instructions, and visual map information for the vehicle operator.
- Navigation system 448 may present this location data to the driver of the vehicle via a display element or other presentation device.
- the vehicle's current location can be described by one or more of the following information: triangulated position, latitude/longitude position, x and y coordinates, or any other symbol or any measure that indicates the vehicle's geographic location.
- User interface 416 may also operate navigation system 448 to receive user input.
- the navigation system 448 can be operated via a touch screen.
- the navigation system 448 provides route planning capability and navigation capability when the user inputs the geographic location values of the origin and destination.
- peripheral devices 408 may provide a means for the vehicle to communicate with other devices located within the vehicle.
- microphone 450 may receive audio (eg, voice commands or other audio input) from a user of the vehicle.
- speaker 452 may output audio to a user of the vehicle.
- Wireless communication system 446 may communicate wirelessly with one or more devices, either directly or via a communication network.
- wireless communication system 446 may use 3G cellular communication, such as code division multiple access (CDMA), EVDO, or global system for mobile communications (GSM)/general packet radio service (GPRS); 4G cellular communication, such as long term evolution (LTE); or 5G cellular communication.
- the wireless communication system 446 can use WiFi to communicate with a wireless local area network (wireless local area network, WLAN).
- wireless communication system 446 may communicate directly with devices using an infrared link, Bluetooth, or ZigBee.
- Other wireless protocols may also be used, such as various vehicle communication systems; for example, wireless communication system 446 may include one or more dedicated short range communications (DSRC) devices, which may include public and/or private data communications.
- Power supply 410 may provide power to various components of the vehicle.
- the power source 410 may be a rechargeable lithium ion or lead acid battery.
- One or more packs of such batteries may be configured as a power source to provide power to various components of the vehicle.
- power source 410 and energy source 419 may be implemented together, such as in some all-electric vehicles.
- one or more of these components described above may be mounted separately from or associated with the vehicle.
- memory 4382 may exist partially or completely separate from the vehicle.
- the components described above may be communicatively coupled together in a wired and/or wireless manner.
- FIG. 4 should not be construed as a limitation to the embodiment of the present application.
- the aforementioned vehicles may be cars, trucks, motorcycles, buses, boats, lawn mowers, recreational vehicles, playground vehicles, construction equipment, trams, golf carts, and trolleys, etc., which are not specifically limited in the embodiments of the present application.
- the image processing method provided by the embodiment of the present application is described below.
- the method may be executed by the image processing device, or may be executed by components of the image processing device (such as a processor, a chip, or a chip system, etc.).
- the image processing device can be a cloud device (as shown in the aforementioned Figure 3a), a vehicle (such as the vehicle shown in Figure 4), or a terminal device (such as a vehicle terminal, an aircraft terminal, etc.) (as shown in the aforementioned Figure 3b).
- this method can also be executed by a system composed of cloud devices and vehicles (as shown in the aforementioned FIG. 3 a ).
- the method may be processed by the CPU in the image processing device, or jointly processed by the CPU and the GPU, or other processors suitable for neural network calculations may be used instead of the GPU, which is not limited in this application.
- this method can be applied in intelligent driving scenarios, for example: adaptive cruise, lane departure warning (LDW), lane keeping assist (LKA), and other scenarios that include lane line detection.
- the image processing method provided by the embodiment of the present application can obtain the image to be detected through a sensor (such as a camera) on the vehicle, obtain the lane line in the image to be detected, and then realize the above-mentioned adaptive cruise, LDW, LKA, or the like.
- the image processing method provided in the embodiment of the present application may include two situations, which will be described respectively below.
- the image processing device is a user device, and here it is only taken as an example that the user device is a vehicle (such as the aforementioned scene in FIG. 3b ). It can be understood that, besides a vehicle, the user equipment may also be a smart terminal such as a mobile phone, a vehicle terminal, an aircraft terminal, a VR/AR device, or a smart robot, which is not specifically limited here.
- FIG. 5 is a schematic flowchart of an image processing method provided by an embodiment of the present application.
- the method is implemented through a target neural network, and the method may include steps 501 to 504 . Steps 501 to 504 will be described in detail below.
- Step 501 acquire an image to be detected.
- the image processing device may acquire the image to be detected by collecting the image itself, by receiving the image to be detected sent by another device, or by selecting an image from a database, etc., which is not specifically limited here.
- the image to be detected includes at least one object among vehicles, people, objects, trees, signs and the like.
- the image processing device may refer to a vehicle.
- Sensors on the vehicle, such as cameras or video cameras, capture images. It can be understood that the sensors on the vehicle may collect images in real time or periodically, for example, collecting an image every 0.5 seconds, which is not specifically limited here.
- Step 502 performing feature extraction on the image to be detected to obtain the first feature.
- after the image processing device acquires the image to be detected, it may acquire the first feature of the image to be detected; specifically, feature extraction is performed on the image to be detected to obtain the first feature. It can be understood that the features mentioned in the embodiments of the present application may be expressed as a matrix or a vector.
- the image processing device may perform feature extraction on the image to be detected through the backbone network to obtain the first feature.
- the backbone network may be a convolutional neural network, a graph convolutional network (graph convolutional networks, GCN), a recurrent neural network, etc., which have the function of extracting image features, and are not specifically limited here.
- the image processing device may perform feature fusion and dimensionality reduction processing on features output by different layers in the backbone network to obtain the first feature.
- the features output by different layers can also be understood as intermediate features in the process of calculating the first feature (also called at least one third feature), and the number of third features is related to the number of layers of the backbone network; for example, the number of third features is the same as the number of layers of the backbone network, or is the number of layers of the backbone network minus 1.
- the low-level features have higher resolution and contain more position and detail information, but because of fewer convolutions, they have lower semantics and more noise.
- High-level features have stronger semantic information, but have low resolution and poor perception of details. Therefore, feature fusion is performed on the features extracted from different layers of the backbone network to obtain the fused feature (denoted as H_f), and the fused feature has multi-level characteristics. Further, dimensionality reduction processing is performed on the fused feature to obtain the first feature (denoted as H'_f), so the first feature also has multi-level characteristics.
- H_f ∈ R^{h×w×d}, where h is the number of rows of H_f, w is the number of columns of H_f, and d is the dimension of H_f.
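- A minimal sketch of the fusion and dimensionality-reduction step described above (PyTorch; the channel sizes, the use of 1×1 convolutions, and bilinear upsampling are assumptions, not the claimed implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseAndReduce(nn.Module):
    """Fuse multi-level backbone features into H_f, then reduce its dimension to obtain H'_f."""
    def __init__(self, in_channels=(512, 1024, 2048), d=256, d_reduced=64):
        super().__init__()
        # Project each level (third feature) to a common dimension d before fusion.
        self.proj = nn.ModuleList([nn.Conv2d(c, d, kernel_size=1) for c in in_channels])
        # Dimensionality reduction producing the first feature H'_f.
        self.reduce = nn.Conv2d(d, d_reduced, kernel_size=1)

    def forward(self, feats):
        # feats: third features from different backbone layers, ordered from
        # high resolution (low level) to low resolution (high level).
        target_hw = feats[0].shape[-2:]
        fused = 0
        for f, proj in zip(feats, self.proj):
            f = proj(f)
            # Upsample lower-resolution (high-level) features to a common size, then sum.
            f = F.interpolate(f, size=target_hw, mode="bilinear", align_corners=False)
            fused = fused + f                 # fused feature H_f
        return self.reduce(fused)             # first feature H'_f
```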
- the above backbone network is a 50-layer residual convolutional neural network (residual neural network-50, ResNet50).
- Step 503 process the detection frame information of the image to be detected to obtain the second feature.
- after the image processing device acquires the image to be detected, it can first obtain the detection frame information of the image to be detected based on the human-vehicle detection model; specifically, the image to be detected is input into the human-vehicle detection model to obtain the detection frame information, which includes the position of the detection frame of at least one object in the image to be detected.
- the human-vehicle detection model can be a region convolutional neural network (R-CNN), a fast region convolutional neural network (fast R-CNN), a faster region convolutional neural network (faster R-CNN), or the like, which is not limited here.
- the aforementioned objects may include at least one of vehicles, people, objects, trees, signs, etc. in the image to be detected, which is not specifically limited here. It can be understood that the position of the detection frame may be a position after normalization processing.
- the acquired second feature has a stronger expressive ability.
- the detection frame information may also include the category and confidence level of the detection frame.
- after the image processing device obtains the detection frame information, it can process the detection frame information to obtain the second feature.
- the second feature can also be understood as the detection frame feature of the image to be detected.
- the second feature includes the positional features and semantic features of the detection frame corresponding to the object in the image to be detected, where the positional feature may be denoted as Z_b and the semantic feature may be denoted as Z_r.
- At least one third feature and the detection frame information are processed to obtain the second feature.
- the at least one third feature is an intermediate feature obtained during the process of acquiring the first feature (such as the intermediate feature in step 502 above).
- the detection frame information and intermediate features are input to the preprocessing module to obtain position features and semantic features.
- the second feature may be obtained based on processing at least one third feature and the detection frame information. If the backbone network does not adopt the feature pyramid network (FPN) structure, the second feature can be obtained by using the first feature before dimensionality reduction and the detection frame information.
- depending on the content of the detection frame information, the specific process of obtaining the second feature (which can also be understood as the function of the preprocessing module) differs, as described below:
- the detection frame information only includes the position of the detection frame.
- the above process of acquiring semantic features may include: scaling the detection frame according to the position of the detection frame and the sampling rate between different layers in the backbone network; using the scaled detection frame to extract ROI features from the feature layer of the intermediate features corresponding to that sampling rate; and processing the ROI features (for example, inputting them into a fully connected layer, or into a single-layer perceptron and an activation layer) to obtain the semantic features of the detection frame: Z_r ∈ R^{M×d'}, where M is the number of detection frames in the image to be detected.
- the above-mentioned process of obtaining positional features may include: processing the vector corresponding to the position of the detection frame (for example, inputting it into a fully connected layer, or into a single-layer perceptron and an activation layer) to obtain the positional feature of the detection frame: Z_b ∈ R^{M×d'}.
- the backbone network is a neural network with a 5-layer structure
- the downsampling rate of the third layer is 8
- the larger the detection frame area, the smaller (i.e., the later) the feature layer from which the ROI features are extracted.
- the detection frame information includes the position and confidence of the detection frame.
- the above process of acquiring semantic features may include: scaling the detection frame according to the position of the detection frame and the sampling rate between different layers in the backbone network.
- Use the scaled detection frame to extract ROI features from the feature layer corresponding to the sampling rate of the intermediate features.
- the confidence of the detection frame is used as a coefficient and multiplied with the extracted ROI features, and the multiplied features are then processed (for example, input into a fully connected layer, or into a single-layer perceptron and an activation layer) to obtain the semantic features of the detection frame: Z_r ∈ R^{M×d'}, where M is the number of detection frames in the image to be detected.
- the above-mentioned process of obtaining positional features may include: processing the vector corresponding to the position of the detection frame (for example, inputting it into a fully connected layer, or into a single-layer perceptron and an activation layer) to obtain the positional feature of the detection frame: Z_b ∈ R^{M×d'}.
- for example, one-hot encoding or other encoding methods may be used to encode the category of the detection frame to obtain a category vector.
- the detection frame information includes the position, confidence level and category of the detection frame.
- the above process of acquiring semantic features may include: scaling the detection frame according to the position of the detection frame and the sampling rate between different layers in the backbone network; using the scaled detection frame to extract ROI features from the feature layer corresponding to that sampling rate in the first feature; and using the confidence of the detection frame as a coefficient, multiplying it with the extracted ROI features, and processing the multiplied features (for example, input into a fully connected layer, or into a single-layer perceptron and an activation layer) to obtain the semantic features of the detection frame: Z_r ∈ R^{M×d'}, where M is the number of detection frames in the image to be detected.
- the above-mentioned process of obtaining positional features may include: transforming the category of the detection frame into a category vector, splicing the category vector with the vector corresponding to the position of the detection frame, and processing the result (for example, inputting it into a fully connected layer, or into a single-layer perceptron and an activation layer) to obtain the positional feature of the detection frame: Z_b ∈ R^{M×d'}.
- the detection frame information may also have other situations (for example, the detection frame information includes the position and the category of the detection frame), and there may be other ways to obtain the second feature, which are not limited here.
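- A minimal sketch of the preprocessing described above for the case where position, confidence, and category are all available (PyTorch/torchvision; the ROI size, hidden dimensions, and single fully connected layers are assumptions, not the claimed implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class DetectionPreprocess(nn.Module):
    """Compute the semantic feature Z_r and the positional feature Z_b from detection frames."""
    def __init__(self, feat_dim=64, num_classes=3, d_out=64, roi_size=7):
        super().__init__()
        self.roi_size = roi_size
        self.num_classes = num_classes
        self.sem_fc = nn.Linear(feat_dim * roi_size * roi_size, d_out)  # -> Z_r
        self.pos_fc = nn.Linear(4 + num_classes, d_out)                 # -> Z_b

    def forward(self, feat_map, boxes, scores, labels, spatial_scale):
        # feat_map: one intermediate (third) feature, shape (1, C, H, W)
        # boxes: (M, 4) detection frame positions in image coordinates (x1, y1, x2, y2)
        # scores: (M,) confidences; labels: (M,) integer categories
        rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)    # (M, 5): batch index + box
        roi_feat = roi_align(feat_map, rois, output_size=self.roi_size,
                             spatial_scale=spatial_scale)               # scale boxes to this feature layer
        roi_feat = roi_feat * scores.view(-1, 1, 1, 1)                  # weight ROI features by confidence
        z_r = self.sem_fc(roi_feat.flatten(1))                          # semantic feature Z_r, (M, d')

        cat_vec = F.one_hot(labels, self.num_classes).float()           # category vector (e.g. one-hot)
        z_b = self.pos_fc(torch.cat([boxes, cat_vec], dim=1))           # positional feature Z_b, (M, d')
        return z_r, z_b
```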
- the process of acquiring the second feature may be as shown in FIG. 6 .
- the steps performed by the detection pre-processing module refer to the above-mentioned description of the process of obtaining the second feature, which will not be repeated here.
- Step 504 input the first feature and the second feature into the first neural network based on the transformer structure to obtain the lane line in the image to be detected.
- the first feature and the second feature can be input into the first neural network based on the transformer structure to obtain the lane line in the image to be detected. Specifically, multiple point sets may be obtained first, and then lane lines may be determined based on the multiple point sets. Each point set in the plurality of point sets represents a lane line in the image to be detected.
- the first neural network based on the transformer structure includes an encoder, a decoder and a feedforward neural network.
- the above-mentioned acquisition of multiple point sets may include the following steps: inputting the first feature and the second feature into the encoder to obtain the fourth feature; inputting the fourth feature, the second feature, and the query feature into the decoder to obtain the fifth feature; and inputting the fifth feature into the feed-forward neural network to obtain the multiple point sets.
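- A minimal end-to-end sketch of these three steps (PyTorch; a batch size of 1, a single attention layer per stage, and the collapse of the encoder's self- and cross-attention into one call are simplifying assumptions):

```python
import torch
import torch.nn as nn

class LaneHead(nn.Module):
    """Encoder -> decoder -> feed-forward network producing one point set per query."""
    def __init__(self, d=64, n_queries=10, n_rows=72):
        super().__init__()
        self.encoder = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.decoder = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, n_queries, d))  # query vector, learned during training
        # Each point set: n_rows x-coordinates plus a start y and an end y (see the point set description below).
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, n_rows + 2))

    def forward(self, first_feat, second_feat):
        # first_feat: (1, h*w, d) flattened H'_f; second_feat: (1, M, d) detection-frame feature
        fourth, _ = self.encoder(first_feat, second_feat, second_feat)  # fourth feature
        fifth, _ = self.decoder(self.query, fourth, fourth)             # fifth feature
        return self.ffn(fifth)                                          # (1, n_queries, n_rows + 2) point sets
```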
- the first feature and the second feature may be input into the trained first neural network to obtain multiple point sets.
- the trained first neural network is obtained by training the first neural network with the training data as the input of the first neural network, with the goal that the value of the first loss function is less than the first threshold; the training data includes the first feature of the training image and the positional feature and semantic feature of the detection frame corresponding to the object in the training image; the first loss function is used to represent the difference between the point set output by the first neural network during training and the first point set, where the first point set is the real point set of the actual lane line in the training image.
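- A minimal training-loop sketch for the goal described above (PyTorch; the L1 loss, the fixed threshold value, and the absence of point-set matching are assumptions made only for illustration):

```python
import torch
import torch.nn.functional as F

def train_first_network(model, loader, first_threshold=0.05, lr=1e-4, max_epochs=100):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss = torch.tensor(float("inf"))
    for _ in range(max_epochs):
        for first_feat, second_feat, real_point_set in loader:
            pred_point_set = model(first_feat, second_feat)
            # First loss function: difference between the output point set and the real point set.
            loss = F.l1_loss(pred_point_set, real_point_set)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if loss.item() < first_threshold:   # training goal: first loss below the first threshold
            break
    return model
```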
- the first neural network includes a transformer structure and a feedforward neural network.
- the first feature and the second feature can be processed through the transformer structure to obtain the fifth feature.
- the feedforward neural network here can also be replaced by structures such as fully connected layers and convolutional neural networks, which are not specifically limited here.
- the structure of the transformer differs depending on the input of the first neural network; it can also be understood that the steps for obtaining the fifth feature differ, as described separately below.
- the first neural network is shown in Figure 7
- the transformer structure is shown in Figure 8.
- the first neural network includes a transformer structure and a feedforward neural network. Input the first feature and the second feature into the encoder of the transformer structure to obtain the fourth feature. Input the query feature, the second feature and the fourth feature into the decoder of the transformer structure to obtain the fifth feature.
- the transformer structure in this case can be shown in Figure 8.
- the encoder of the transformer structure includes the first self-attention module and the first attention module, and the decoder of the transformer structure includes the second attention module and the third attention module.
- the decoder may also include a second self-attention module (not shown in FIG. 8 ), which is used to calculate query features. Specifically, self-attention calculation is performed on the query vector to obtain query features.
- the query vector is initialized to a random value and trained to a fixed value during training. And use this fixed value in the reasoning process, that is, the query vector is a fixed value obtained by training the random value during the training process.
- the self-attention calculation is performed on the first feature (H'_f) through the first self-attention module to obtain the first output (O_f).
- Cross-attention calculation is performed on the first feature (H'_f) and the second feature (Z_r and Z_b) by the first attention module to obtain the second output (O_p2b).
- a fourth feature is obtained based on the first output (O_f) and the second output (O_p2b).
- Cross-attention calculation is performed on the query feature (Q_q) and the fourth feature through the second attention module to obtain the third output.
- the query feature (Q_q) and the second feature (Z_r and Z_b) are processed to obtain a fourth output. Addition processing is performed on the third output and the fourth output to obtain the fifth feature.
- the query feature is calculated by self-attention on the query vector.
- the above-mentioned step of performing self-attention calculation on the first feature (H'_f) through the first self-attention module to obtain the first output (O_f) may specifically be: in the self-attention calculation, the inputs used to obtain Q, K, and V are consistent (that is, all are H'_f); that is, Q, K, and V are obtained from the first feature (H'_f) through three linear transformations, and O_f is calculated based on Q, K, and V.
- for self-attention, please refer to the previous description of the self-attention mechanism; details are not repeated here.
- the position matrix of the first feature can be introduced, which is described in the following formula 1, and will not be expanded here.
- the specific step of obtaining the fourth feature based on O_f and O_p2b may be: adding the first output and the second output to obtain the fourth feature.
- alternatively, the step of obtaining the fourth feature based on the first output (O_f) and the second output (O_p2b) may specifically be: the first output and the second output are added, and the result of the addition is added to the first feature and normalized to obtain an output.
- the output is input into the feedforward neural network to obtain the output result of the feedforward neural network.
- the output obtained by the above addition and normalization is added and normalized to the output result of the feedforward neural network, so as to obtain the fourth feature.
- the above-mentioned step of performing cross-attention calculation on H'_f, Z_r, and Z_b through the first attention module may specifically be: taking H'_f as Q, Z_b as K, and Z_r as V, the cross-attention calculation is performed to obtain the second output (O_p2b).
- the above-mentioned step of performing cross-attention calculation on Q_q and the fourth feature through the second attention module may specifically be: taking Q_q as Q and the fourth feature as K and V, the cross-attention calculation is performed to obtain the third output.
- the above-mentioned step of processing the query feature and the second feature to obtain the fourth output may specifically be: performing cross-attention calculation on Q_q, Z_r, and Z_b through the third attention module to obtain the sixth output; specifically, Q_q can be used as Q, Z_b as K, and Z_r as V to perform the cross-attention calculation to obtain the sixth output.
- the query feature is added to the sixth output, and the result of the addition is added to the query vector and normalized to obtain an output.
- the output is input into the feedforward neural network to obtain the output result of the feedforward neural network.
- the output obtained by the above addition and normalization is added and normalized to the output result of the feedforward neural network, so as to obtain the fourth output.
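- A minimal sketch of the encoder computation just described (PyTorch; the head count, hidden sizes, and layer-norm placement are assumptions): self-attention over H'_f gives O_f, cross-attention with Q = H'_f, K = Z_b, V = Z_r gives O_p2b, and the sum is added to H'_f, normalized, and passed through a feed-forward sub-layer to give the fourth feature.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)  # first self-attention module
        self.det_attn = nn.MultiheadAttention(d, heads, batch_first=True)   # first attention module
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, h_f, z_b, z_r):
        o_f, _ = self.self_attn(h_f, h_f, h_f)       # first output O_f
        o_p2b, _ = self.det_attn(h_f, z_b, z_r)      # second output O_p2b (Q = H'_f, K = Z_b, V = Z_r)
        x = self.norm1(h_f + o_f + o_p2b)            # add the two outputs, then add & normalize with H'_f
        return self.norm2(x + self.ffn(x))           # feed-forward sub-layer -> fourth feature
```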
- the position matrix of the feature used as Q (such as Q_q) in the attention calculation process may be introduced.
- the position matrix can also be obtained by means of static position encoding or dynamic position encoding.
- the position matrix can be calculated according to the absolute position of the feature map corresponding to the first feature, which is not limited here.
- E_f is the position matrix of the first feature (H'_f); Formula 3 and Formula 4 below illustrate calculating the position matrix by means of sine and cosine:
- formula 3 is used for the calculation at even positions, and formula 4 is used for the calculation at odd positions.
- i is the row position of the element in the position matrix, 2j/2j+1 is the column position of the element in the position matrix, and d represents the dimension of the position matrix.
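- Formula 3 and Formula 4 themselves are not reproduced above; assuming the standard sinusoidal position encoding suggested by the surrounding definitions (row position i, column position 2j or 2j+1, dimension d), they would take the following form (this reconstruction is an assumption, not the literal formulas of the application):

  E_f(i, 2j) = \sin\left( i / 10000^{2j/d} \right)      (Formula 3, even positions)
  E_f(i, 2j+1) = \cos\left( i / 10000^{2j/d} \right)    (Formula 4, odd positions)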
- the first neural network is shown in Figure 11
- the transformer structure is shown in Figure 12.
- the difference between FIG. 11 and FIG. 7 is that the input of the encoder in FIG. 7 includes the first feature and the second feature, while the input of the encoder in FIG. 11 includes the first feature, the first row feature, the first column feature, and the second feature; that is, compared with FIG. 7, the encoder input in FIG. 11 additionally includes the first row feature and the first column feature.
- the encoder of the transformer structure also includes a row-column attention module; that is, the encoder of the transformer structure shown in Figure 12 includes a row-column attention module, a first self-attention module, and a first attention module, and the decoder includes a second self-attention module, a second attention module, and a third attention module. The row-column attention module includes a row attention module and a column attention module.
- the self-attention calculation is performed on the first row feature (H'_r) through the row attention module to obtain the row output.
- the column output is obtained by performing self-attention calculation on the first column feature (H'_c) through the column attention module. The row-column output is obtained based on the row output and the column output.
- Cross-attention calculation is performed on the first feature (H'_f) and the second feature (Z_r and Z_b) by the first attention module to obtain the second output (O_p2b).
- a fourth feature is obtained based on the row-column output, the first output (O_f), and the second output (O_p2b).
- Cross-attention calculation is performed on the query feature (Q_q) and the fourth feature through the second attention module to obtain the third output.
- the query feature (Q_q) and the second feature (Z_r and Z_b) are processed by the third attention module to obtain the fourth output.
- Addition processing is performed on the third output and the fourth output to obtain the fifth feature.
- the above step of obtaining the row and column outputs based on the row output and the column output may specifically include: adding and normalizing the row output and the feature of the first row (referred to as adding & normalizing) to obtain the output.
- the output is input into a feedforward neural network (feedforward network for short), and the output result of the feedforward network is obtained.
- the output obtained by the above addition and normalization is added and normalized to the output of the feedforward network to obtain the output of the row.
- the output is input into the feedforward network to obtain the output result of the feedforward network.
- the output obtained by the above addition and normalization is added and normalized to the output of the feedforward network to obtain the output of the column. Then splice the output of the row and the output of the column to obtain the output of the row and column.
- the first feature can be flattened in the row dimension to obtain H_r ∈ R^{h×1×wd}, which is then processed (for example, by a fully connected layer and dimensionality reduction, or by a single-layer perceptron and activation layer with dimensionality reduction) to obtain the first row feature: H'_r ∈ R^{h×1×d'}.
- the above flattening of the row dimension can also be understood as flattening or compressing the matrix corresponding to the first feature along the row direction to obtain H r .
- the first feature is flattened in the column dimension to obtain H_c ∈ R^{1×w×hd}, which is then processed (for example, by a fully connected layer and dimensionality reduction, or by a single-layer perceptron and activation layer with dimensionality reduction) to obtain the first column feature: H'_c ∈ R^{1×w×d'}.
- the above-mentioned step of obtaining the fourth feature based on the row-column output, the first output (O_f), and the second output (O_p2b) may specifically be: adding the first output and the second output to obtain the fifth output; and splicing the fifth output with the row-column output to obtain the fourth feature.
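- A minimal sketch of the row/column branch in this case (PyTorch; the feature-map size, projection by a single fully connected layer, and the simple concatenation are assumptions): H'_f is flattened along the row and column dimensions, projected to H'_r and H'_c, each passes through its own self-attention, and the row and column outputs are spliced together.

```python
import torch
import torch.nn as nn

class RowColumnAttention(nn.Module):
    def __init__(self, h=36, w=100, d=64, d_out=64, heads=4):
        super().__init__()
        self.row_proj = nn.Linear(w * d, d_out)   # H_r (h, 1, w*d) -> first row feature H'_r (h, 1, d')
        self.col_proj = nn.Linear(h * d, d_out)   # H_c (1, w, h*d) -> first column feature H'_c (1, w, d')
        self.row_attn = nn.MultiheadAttention(d_out, heads, batch_first=True)  # row attention module
        self.col_attn = nn.MultiheadAttention(d_out, heads, batch_first=True)  # column attention module

    def forward(self, h_f):
        # h_f: first feature of shape (h, w, d)
        h, w, d = h_f.shape
        h_r = self.row_proj(h_f.reshape(1, h, w * d))                    # flatten along the row dimension
        h_c = self.col_proj(h_f.permute(1, 0, 2).reshape(1, w, h * d))   # flatten along the column dimension
        row_out, _ = self.row_attn(h_r, h_r, h_r)                        # row output
        col_out, _ = self.col_attn(h_c, h_c, h_c)                        # column output
        # Splice the row output and the column output to obtain the row-column output.
        return torch.cat([row_out, col_out], dim=1)                      # (1, h + w, d')
```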
- E_r is the position matrix of the first row feature (H'_r), and E_c is the position matrix of the first column feature (H'_c).
- the location matrix may be obtained through static location coding or dynamic location coding, which is not specifically limited here.
- the above transformer structures and ways of obtaining the fifth feature are just examples; in practical applications, the transformer structure may take other forms, or there may be other ways to obtain the fifth feature, which are not limited here.
- the fifth feature may be input into the feedforward neural network to obtain multiple point sets. And determine the lane line in the image to be detected based on multiple point sets.
- the above-mentioned feedforward neural network can also be replaced by a fully connected layer, a convolutional neural network and other structures, which are not specifically limited here.
- each point set may be represented by the set of X coordinates of the intersection points between the lane line and a number of equidistant straight lines in the Y direction (for example, 72 lines), together with the starting point Y coordinate s and the ending point Y coordinate e.
- the number of lane lines and the number of straight lines in the Y direction in FIG. 14a are just examples, and are not specifically limited here.
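- As an illustrative sketch (plain Python; the row spacing and the inclusive start/end comparison are assumptions), one predicted point set can be turned into lane-line points by keeping the x-coordinates only for the rows whose y lies between the starting point coordinate s and the ending point coordinate e:

```python
def point_set_to_lane(xs, s, e, image_height, n_rows=72):
    """xs: predicted x coordinate at each of n_rows equidistant Y-direction lines;
    s, e: starting and ending Y coordinates of the lane line."""
    ys = [image_height * r / (n_rows - 1) for r in range(n_rows)]   # equidistant rows in the Y direction
    return [(x, y) for x, y in zip(xs, ys) if s <= y <= e]          # keep only points within [s, e]
```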
- multiple point sets may be presented in an array.
- multiple point sets may also be presented in the form of images. For example: multiple point sets as shown in Figure 14b. Multiple point sets are overlapped and fused with the image to be detected to obtain an image to be detected with multiple point sets, as shown in FIG. 14c for example. This embodiment does not limit the presentation manner of multiple point sets.
- by applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationship between lane lines can then be effectively modeled.
- the scene perception ability of the network can be improved, reducing the misjudgment of the model in scenes where the lane line is occluded by vehicles.
- the network's ability to construct long lane line features can be improved, thereby achieving better lane line detection results.
- the various modules in the existing automatic driving system are often independent of each other.
- the lane line detection model and the human-vehicle model are independent of each other and predicted separately.
- the target neural network in the image processing method provided by this embodiment introduces the detection frame information obtained by the human-vehicle detection model into the first neural network to predict lane lines, which can improve the accuracy of lane line detection.
- the image processing device is a cloud server (such as the aforementioned scenario in FIG. 3 a ). It can be understood that, in this case, the image processing device may also be a device or server with image processing functions such as a network server, an application server, and a management server.
- the user device being a vehicle is used here as an example for description, and is not limited thereto.
- FIG. 15 is a schematic flowchart of an image processing method provided by an embodiment of the present application.
- the method may include steps 1501 to 1505 . Steps 1501 to 1505 will be described in detail below.
- Step 1501 the vehicle acquires the image to be detected.
- the vehicle may collect the image to be detected based on sensors on the vehicle (such as cameras or video cameras).
- the sensors on the vehicle can also collect images periodically.
- the vehicle can also be obtained by receiving images to be detected sent by other devices, which is not limited here.
- Step 1502 the vehicle sends the image to be detected to the server.
- the server receives the image to be detected sent by the vehicle.
- after the vehicle obtains the image to be detected, it sends the image to be detected to the server.
- the server receives the image to be detected sent by the vehicle.
- Step 1503 the server inputs the image to be detected into the trained target neural network to obtain multiple point sets.
- after the server receives the image to be detected sent by the vehicle, it can input the image to be detected into the trained target neural network to obtain multiple point sets.
- the trained target neural network is obtained by using the training image as the input of the target neural network, and the target neural network is trained with the target loss function value being smaller than the target threshold.
- the target loss function is used to represent the difference between the point set output by the target neural network during training and the target point set, where the target point set is the point set of the actual lane line in the training image.
- the target loss function and target threshold can be set according to actual needs, which are not limited here.
- the target neural network in this embodiment may include the backbone network, the preprocessing module, and the first neural network in the foregoing embodiment shown in FIG. 5 . Since the structure of the first neural network in the embodiment shown in FIG. 5 has two situations, the target neural network in this embodiment also has two situations, which will be described separately below.
- the structure of the target neural network may be as shown in FIG. 16 .
- the target neural network is equivalent to including the aforementioned backbone network shown in FIG. 6 , the preprocessing module shown in FIG. 6 , and the first neural network corresponding to FIGS. 7 to 10 .
- the specific description and related processes of the neural network reference may be made to the descriptions corresponding to the aforementioned FIGS. 6 to 10 , and details are not repeated here.
- the structure of the target neural network may be as shown in FIG. 17 .
- the target neural network is equivalent to including the aforementioned backbone network shown in FIG. 6 , the preprocessing module shown in FIG. 6 , and the first neural network corresponding to FIGS. 11 to 13 .
- the specific description and related processes of the neural network reference may be made to the corresponding descriptions in FIG. 6 , FIG. 11 to FIG. 13 , and details will not be repeated here.
- Step 1504 the server sends multiple point sets to the vehicle.
- the vehicle receives multiple point sets sent by the server.
- after the server acquires the multiple point sets, the server sends the multiple point sets to the vehicle.
- Step 1505 the vehicle realizes an intelligent driving function based on the multiple point sets.
- each point set in the multiple point sets represents a lane line in the image to be detected.
- the vehicle can determine the lane line in the image to be detected, and implement intelligent driving functions based on the lane line, such as: adaptive cruise, lane departure warning, lane keeping assistance, etc.
- the description of determining the lane line in the image to be predicted by using multiple point sets may refer to the description in step 504 of the embodiment shown in FIG. 5 above, and details are not repeated here.
- the steps in this embodiment can be performed periodically; that is, the lane lines on the road surface can be accurately identified according to the images to be detected collected by the vehicle camera during driving, and then the functions related to lane lines in intelligent driving can be realized, such as adaptive cruise control, lane departure warning, lane keeping assist, etc.
- by applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationship between lane lines can then be effectively modeled.
- the scene perception ability of the network can be improved, reducing the misjudgment of the model in scenes where the lane line is occluded by vehicles.
- the network's ability to construct long lane line features can be improved, thereby achieving better lane line detection results.
- the target neural network in the image processing method provided by this embodiment introduces the detection frame information obtained by the human-vehicle detection model into the first neural network to predict lane lines, which can improve the accuracy of lane line detection.
- the performance of the target neural network (hereinafter referred to as Laneformer) and of other existing networks is tested on the CULane and TuSimple datasets.
- Other existing networks include: spatial convolutional neural network (SCNN), ENet-SAD, PointLane, efficient residual factorized network (ERFNet), CurveLane-S, CurveLane-M, CurveLane-L, and LaneATT.
- CULane is a large-scale lane line detection data set collected in Beijing, China through the vehicle camera, and the size of the collected pictures is 1640 ⁇ 590.
- the data set is collected in a variety of locations and contains samples of many urban complex scenes.
- the CULane dataset contains 88880 training images, 9675 validation images and 34680 test images.
- the test set is also divided into nine categories, one category is a regular picture, and the other eight categories are challenging special categories (including shadow scenes, highlight scenes, dark night scenes, curve scenes, scenes without lane lines, etc.).
- TuSimple is an autonomous driving dataset collected by Arlington. This dataset focuses on highway scenes, so all pictures are collected on the highway, and the size of the collected pictures is 1280 ⁇ 720.
- the TuSimple dataset contains 3626 images for training and 2782 images for testing.
- Three residual structures are used for the backbone network in LaneATT. They are respectively recorded as: LaneATT (ResNet18), LaneATT (ResNet34), LaneATT (ResNet122).
- the backbone network of the Laneformer provided in the embodiment of the present application adopts three residual structures (ResNet18, ResNet34, ResNet50), respectively denoted as: Laneformer (ResNet18), Laneformer (ResNet34), Laneformer (ResNet50).
- the network that does not add the detection attention module (i.e., the first attention module and the third attention module) is recorded as: Laneformer(ResNet50)*.
- the FP values of other models in the intersection scene are in the thousands, while the Laneformer model proposed in this work reaches an FP value of 19. It can be inferred from Table 1 that the improvement comes from the addition of the detection attention module.
- the FP of the intersection scene is lower, but still in the thousands, while after adding the detection attention module this indicator drops sharply to dozens. It can be seen that in the intersection scene, where the situation of people and vehicles is relatively complicated, the detection attention module can greatly reduce the misprediction rate of the model through the perception of surrounding scenes and objects.
- the ablation study evaluates the effect of using only the row-column attention module, and of adding the different sub-modules of the detection attention step by step, i.e., whether the position information (bounding box), confidence (score), and category of the human-vehicle detection frame are used as the input of the detection preprocessing module, on the overall result.
- Table 3
- Model | F1 (%) | Accuracy (%) | Recall rate (%) | Frames per second | Number of parameters (millions)
- Baseline (ResNet50) | 75.45 | 81.65 | 70.11 | 61 | 31.02
- + row-column attention | 76.04 | 82.92 | 70.22 | 58 | 43.02
- + position information of the detection frame | 76.08 | 85.3 | 68.66 | 57 | 45.38
- + confidence of the detection frame | 76.25 | 83.56 | 70.12 | 54 | 45.38
- + category of the detection frame | 77.06 | 84.05 | 71.14 | 53 | 45.38
- the first model (namely Baseline) can be understood as the target neural network shown in Figure 16 after removing the first attention module and the third attention module.
- the second model (+ row-column attention) can be understood as the first model plus the row-column attention module;
- the third model (+ position information of the detection frame) can be understood as the second model plus the position information of the detection frame;
- the fourth model (+ confidence of the detection frame) can be understood as the third model plus the confidence of the detection frame;
- the fifth model (+ category of the detection frame) can be understood as the fourth model plus the category of the detection frame.
- the fifth model can be regarded as the target neural network shown in Figure 17 above.
- the Laneformer model proposed in this application adds a row-column attention module and a detection attention module (including the first attention module and the third attention module) on the basis of the Transformer, and the detection attention module is examined step by step: adding only the detection frame position information, additionally adding the bounding box confidence, and additionally adding the predicted category. Therefore, this subsection experimentally explores the impact of each module on the model. It can be seen from Table 3 that the pure Transformer model without the row-column attention module and the detection attention module reaches a baseline F1 score of 75.45%. After adding the row-column attention module, the F1 score of the model improves to 76.04%. At the same time, it can be seen that simply adding the detection frame information from the human-vehicle detection module can also improve the effect of the model.
- adding the confidence of the detection frame to the detection information allows the model to achieve an F1 score of 76.25%, and after adding the category information of the detection frame, the best model in Table 3 is obtained, reaching an F1 score of 77.06%, which proves that both the row-column attention module and the detection attention module can improve the performance of the model.
- the addition of the detection attention module can significantly improve the accuracy of the model, while the impact on the recall rate is relatively weak.
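The following is a minimal sketch of how a detection preprocessing module of the kind ablated above could combine the box position, confidence and category into the semantic feature and position feature. It assumes a PyTorch-style implementation; the class name, tensor shapes and the choice of a single linear layer per branch are illustrative assumptions, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectionPreprocess(nn.Module):
    """Sketch: turn per-box ROI features, confidences, positions and categories
    into a semantic feature (Z_r) and a position feature (Z_b)."""

    def __init__(self, roi_dim: int, d_model: int, num_classes: int = 2):
        super().__init__()
        self.num_classes = num_classes
        self.sem_fc = nn.Linear(roi_dim, d_model)          # produces the semantic feature
        self.pos_fc = nn.Linear(4 + num_classes, d_model)  # produces the position feature

    def forward(self, roi_feats, boxes, scores, categories):
        # roi_feats: (M, roi_dim) pooled features of the M detected boxes
        # boxes: (M, 4) normalized box positions; scores: (M,); categories: (M,) long tensor
        sem = self.sem_fc(roi_feats * scores.unsqueeze(-1))        # confidence used as a weight
        one_hot = F.one_hot(categories, self.num_classes).float()  # encode the box category
        pos = self.pos_fc(torch.cat([boxes, one_hot], dim=-1))     # concatenate box + category
        return sem, pos
```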
- the image processing method provided by the embodiment of the present application is described above, and the lane line detection method provided by the embodiment of the present application is described below.
- the method may be executed by the detection device, or may be executed by components of the detection device (such as a processor, a chip, or a chip system, etc.).
- the detection device may be a terminal device (such as a vehicle terminal, an aircraft terminal, etc.) or the like (as shown in the aforementioned FIG. 3b ).
- the method may be processed by the CPU in the detection device, or jointly processed by the CPU and GPU, or other processors suitable for neural network calculations may be used instead of the GPU, which is not limited in this application.
- the application scenario of this method is intelligent driving, for example: adaptive cruise, lane departure warning (LDW), lane keeping assist (LKA) and other scenarios that include lane line detection.
- the lane line detection method provided by the embodiment of the present application can obtain the image to be detected through a sensor (such as a camera) on the vehicle, obtain the lane lines in the image to be detected, and then realize the above-mentioned adaptive cruise, LDW, LKA and the like.
- FIG. 18 is a schematic flowchart of a lane line detection method provided by an embodiment of the present application.
- the method is applied to a vehicle, and the method may include steps 1801 to 1806 . Steps 1801 to 1806 will be described in detail below.
- Step 1801 acquire an image to be detected.
- This step is similar to step 501 in the aforementioned embodiment shown in FIG. 5 , and will not be repeated here.
- the image to be detected is consistent with the image to be detected in FIG. 6 .
- Step 1802 process the image to be detected to obtain multiple point sets.
- after the detection device acquires the image to be detected, it can process the image to obtain multiple point sets. Each point set in the multiple point sets represents a lane line in the image to be detected; the processing predicts the point sets of the lane lines in the image based on the first neural network with the transformer structure and the detection frame information, and the detection frame information includes the position, in the image to be detected, of the detection frame of at least one object in the image to be detected.
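The sketch below illustrates how step 1802 could be composed from the parts described earlier: backbone feature extraction, human-vehicle detection, detection preprocessing, and the transformer-based first neural network. The function and parameter names (`backbone`, `detector`, `preprocess`, `lane_former`) and their interfaces are assumptions for illustration only.

```python
import torch

def detect_lanes(image, backbone, detector, preprocess, lane_former):
    """Sketch of step 1802: produce one point set per lane line from an image."""
    feats, intermediates = backbone(image)        # first feature + intermediate features
    boxes, scores, categories = detector(image)   # detection frame information
    sem, pos = preprocess(intermediates, boxes, scores, categories)  # second feature
    point_sets = lane_former(feats, sem, pos)     # e.g. (num_lanes, num_points, 2)
    return point_sets                             # each point set represents one lane line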
- Step 1803 display lane lines, this step is optional.
- the detection device may display the lane lines represented by the multiple point sets.
- the lane lines are as shown in FIG. 14b.
- Step 1804 model at least one object to obtain a virtual object, this step is optional.
- At least one object can be modeled to obtain a virtual object.
- the virtual object may be two-dimensional or multi-dimensional, which is not limited here.
- Step 1805 perform fusion processing on the multiple point sets and the virtual objects based on positions to obtain a target image, this step is optional.
- the multiple point sets and virtual objects may be fused based on the positions of the multiple point sets in the predicted image to obtain the target image.
- the target image is as shown in FIG. 19 . It can be understood that the virtual image in FIG. 19 is only a two-dimensional example and does not limit the virtual object.
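To make the position-based fusion of steps 1804 and 1805 more concrete, the following is a minimal sketch that overlays the predicted lane point sets and simple two-dimensional "virtual objects" (drawn here as box outlines) onto the image. The use of OpenCV drawing primitives and the specific colors are assumptions; a real system could render richer 2D or 3D models.

```python
import numpy as np
import cv2  # assumption: OpenCV is available for drawing

def render_target_image(image, point_sets, boxes):
    """Sketch: fuse lane point sets and 2D virtual objects into a target image."""
    target = image.copy()
    for pts in point_sets:                          # pts: (N, 2) array of (x, y) lane points
        cv2.polylines(target, [np.int32(pts)], False, (0, 255, 0), 2)
    for x1, y1, x2, y2 in np.asarray(boxes, dtype=int):  # each object as a 2D virtual box
        cv2.rectangle(target, (int(x1), int(y1)), (int(x2), int(y2)), (255, 0, 0), 2)
    return target
```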
- after the detection device acquires the target image, it can display the target image to the user, so that the user driving the vehicle can know the surrounding vehicles and lane lines, improving the driving safety of the vehicle.
- steps 1801 to 1806 can be performed periodically, that is, the target image can be displayed to the user in real time, so that the user can determine the surrounding objects and lane lines in real time, and improve the driving experience of the user.
- the lane line detection method provided in the embodiment of the present application includes step 1801 and step 1802 .
- the lane line detection method provided in the embodiment of the present application includes step 1801 to step 1803 .
- the lane line detection method provided in the embodiment of the present application includes step 1801 to step 1805 .
- by applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationship between lane lines can then be effectively modeled.
- by adding the detection frame information of the objects in the image during the lane line detection process, the target neural network's ability to perceive the image scene can be improved, and misjudgments in scenes where the lane line is blocked by vehicles can be reduced.
- the image processing method and the lane line detection method provided in the embodiment of the present application are described above, and the training process of the target neural network provided in the embodiment of the present application is described below.
- the training method of the target neural network can be executed by the training device of the target neural network, and the training device of the target neural network can be an image processing device (for example, a cloud service device or user equipment with sufficient computing power to execute the training method of the target neural network), or a system composed of a cloud service device and user equipment.
- the training method can be executed by the training device 120 in FIG. 1 and the neural network processor 20 in FIG. 2 .
- the training method can be processed by CPU, or jointly processed by CPU and GPU, or other processors suitable for neural network calculation can be used instead of GPU, which is not limited in this application.
- the model training method includes steps 2001 to 2004.
- Step 2001 acquire training images.
- the training device can collect training images through sensors (such as cameras, radars, etc.), can also obtain training images from a database, and can also receive training images sent by other devices.
- the method of obtaining training images is not limited here.
- the training device can acquire a batch of training samples, that is, training images used for training, where the real point sets of the lane lines in the training images are known.
- Step 2002 input the training image into the target neural network to obtain the first point set.
- the training image can be input into the target neural network to perform the following steps through the target neural network: obtain the first feature of the training image; obtain the second feature based on the first feature, where the second feature includes the position feature and the semantic feature of the detection frame corresponding to an object in the training image; and obtain the first point set based on the first feature and the second feature, where the first point set is used to represent the lane lines in the training image.
- the above-mentioned acquisition of the first point set based on the first feature and the second feature specifically includes the following steps: perform self-attention calculation on the first feature to obtain the first output; perform cross-attention calculation on the first feature and the second feature to obtain the second output; obtain the fourth feature based on the first output and the second output; perform cross-attention calculation on the query feature and the fourth feature to obtain the third output, where the query feature is obtained by performing self-attention calculation on the query vector; process the query feature and the second feature to obtain the fourth output; perform addition processing on the third output and the fourth output to obtain the fifth feature; and obtain the first point set based on the fifth feature.
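A minimal sketch of the attention steps listed above is given below, using standard multi-head attention. Layer norms, the feed-forward sub-blocks, multiple layers and the row-column attention branch of the full network are omitted; the class name, dimensions and `batch_first` layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LaneAttentionFusion(nn.Module):
    """Sketch of the first-to-fifth output computation described above."""

    def __init__(self, d_model: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.img_box_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.query_self = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.query_img = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.query_box = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, first_feat, sem_feat, pos_feat, queries):
        # first output: self-attention over the first (image) feature
        first_out, _ = self.self_attn(first_feat, first_feat, first_feat)
        # second output: cross-attention, image feature as Q, box position/semantic features as K/V
        second_out, _ = self.img_box_attn(first_feat, pos_feat, sem_feat)
        fourth_feat = first_out + second_out                         # fourth feature
        query_feat, _ = self.query_self(queries, queries, queries)   # query feature
        third_out, _ = self.query_img(query_feat, fourth_feat, fourth_feat)  # third output
        fourth_out, _ = self.query_box(query_feat, pos_feat, sem_feat)       # fourth output
        return third_out + fourth_out                                # fifth feature
```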
- Step 2003 obtain the target loss based on the first point set and the real point set of the actual lane lines in the training image; the target loss is used to indicate the difference between the first point set and the real point set.
- the preset target loss function can be used to calculate the first point set and the real point set to obtain the target loss, and the target loss is used to indicate the difference between the first point set and the real point set.
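The exact preset loss function is not fixed here; the following is one possible sketch, assuming the predicted lanes are represented by per-row x-coordinates plus a lane/non-lane classification score. The function name, the L1 coordinate term and the weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lane_point_loss(pred_xs, gt_xs, pred_logits, gt_labels, coord_weight=1.0, cls_weight=1.0):
    """Sketch of a target loss: coordinate difference plus category difference."""
    coord_loss = F.l1_loss(pred_xs, gt_xs)              # difference between the point sets
    cls_loss = F.cross_entropy(pred_logits, gt_labels)  # lane vs. non-lane category
    return coord_weight * coord_loss + cls_weight * cls_loss
```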
- the real point set can be extended, and the category of the lane lines added by the extension can be set to the non-lane-line category.
- in this case, the target loss is used to indicate the difference between the extended real point set and the first point set.
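A minimal sketch of this extension is shown below, under the assumption that the ground truth is padded up to a fixed number of predicted lanes (one slot per query) and that padded slots receive a non-lane label; the representation of the point sets as per-lane x-coordinate vectors is likewise an assumption.

```python
import torch

def pad_ground_truth(gt_xs, gt_labels, num_queries, num_points, non_lane_id=0):
    """Sketch: extend the real point set so every prediction slot has a target."""
    num_gt = gt_xs.shape[0]
    padded_xs = torch.zeros(num_queries, num_points)
    padded_labels = torch.full((num_queries,), non_lane_id, dtype=torch.long)
    padded_xs[:num_gt] = gt_xs          # keep the real lane coordinates
    padded_labels[:num_gt] = gt_labels  # keep the real lane categories
    return padded_xs, padded_labels
```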
- Step 2004 update the parameters of the target neural network based on the target loss until the training conditions are met, and obtain a trained target neural network.
- the parameters of the target neural network can be updated based on the target loss, and the next batch of training samples can be used to train the target neural network after updating the parameters (that is, steps 2002 to 2004 are re-executed) until the model training conditions are satisfied, for example, until the target loss converges.
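The overall loop of steps 2002 to 2004 can be sketched as follows. The optimizer choice, learning rate, and the use of a fixed number of epochs as the stopping condition are assumptions made only for illustration (the text above uses loss convergence as an example condition).

```python
import torch

def train_target_network(model, loss_fn, batches, epochs=10, lr=1e-4):
    """Sketch of steps 2002-2004: forward pass, target loss, parameter update."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, gt_point_sets in batches:               # one batch of training samples
            pred_point_sets = model(images)                  # step 2002: first point set
            loss = loss_fn(pred_point_sets, gt_point_sets)   # step 2003: target loss
            optimizer.zero_grad()
            loss.backward()                                  # step 2004: update parameters
            optimizer.step()
    return model
```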
- the query vector involved in the training process is initialized randomly; the query vector is also trained while the parameters of the target neural network are continuously updated, yielding the target query vector, which can be understood as the query vector used in the inference process.
- that is, the target query vector is the query vector in the embodiment shown in FIG. 5 .
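One common way to realize such a randomly initialized, learnable query vector is sketched below; the number of queries and the model width are illustrative assumptions.

```python
import torch.nn as nn

# Sketch: the query vectors start as random values, are updated together with
# the rest of the network, and after training become the fixed target query
# vectors used at inference time. num_queries and d_model are assumptions.
num_queries, d_model = 12, 256
lane_queries = nn.Embedding(num_queries, d_model)  # randomly initialized, learnable
target_query_vector = lane_queries.weight          # fixed values once training ends
```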
- the target neural network trained in this embodiment has the ability to predict lane lines using images.
- in the detection process, by applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationship between lane lines can then be effectively modeled.
- by adding the detection frame position information of the objects in the image as an input to the lane line detection network, the scene perception ability of the target neural network can be improved, reducing the misjudgment of the model in scenes where the lane line is occluded by vehicles.
- the network's ability to construct long lane line features can be improved, thereby achieving better lane line detection results.
- the various modules in the existing automatic driving system are often independent of each other.
- the lane line detection model and the human-vehicle model are independent of each other and predicted separately.
- the target neural network is trained by applying the detection frame information obtained from the human-vehicle detection model to the first neural network, which can improve the accuracy of the target neural network for lane line detection.
- referring to FIG. 21 , an embodiment of the image processing device in the embodiment of the present application includes:
- An extraction unit 2101 configured to perform feature extraction on the image to be detected to obtain a first feature
- the processing unit 2102 is configured to process the detection frame information of the image to be detected to obtain the second feature, the detection frame information includes the position of the detection frame of the object in the image to be detected in the image to be detected;
- the determining unit 2103 is configured to input the first feature and the second feature into the first neural network based on the transformer structure to obtain the lane line in the image to be detected.
- the image processing device in this embodiment may further include: an acquisition unit 2104, configured to acquire the first row feature and the first column feature based on the first feature, where the first row feature is obtained by flattening the matrix corresponding to the first feature along the row direction,
- and the first column feature is obtained by flattening the matrix along the column direction.
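The row/column flattening described above can be sketched as follows; the tensor layout (h, w, d) and the omission of the subsequent projection back to the model width are assumptions for illustration.

```python
import torch

def row_column_features(first_feature):
    """Sketch: flatten the (h, w, d) first feature along rows and columns."""
    h, w, d = first_feature.shape
    row_feature = first_feature.reshape(h, 1, w * d)                   # flatten along rows
    col_feature = first_feature.permute(1, 0, 2).reshape(1, w, h * d)  # flatten along columns
    return row_feature, col_feature
```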
- the operations performed by each unit in the image processing device are similar to those described in the foregoing embodiments shown in FIG. 5 to FIG. 17 , and will not be repeated here.
- by applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationship between lane lines can then be effectively modeled.
- by adding the detection frame information of the objects in the image in the process of lane line detection, the perception of the image scene can be improved, and misjudgments in scenes where the lane line is blocked by vehicles can be reduced.
- referring to FIG. 22 , an embodiment of the detection device in the embodiment of the present application includes:
- An acquisition unit 2201 configured to acquire an image to be detected
- the processing unit 2202 is configured to process the image to be detected to obtain multiple point sets, each point set in the multiple point sets representing a lane line in the image to be detected; the processing predicts the point sets of the lane lines in the image based on the first neural network with the transformer structure and the detection frame information, and the detection frame information includes the position, in the image to be detected, of the detection frame of at least one object in the image to be detected.
- the detection device in this embodiment may further include: a display unit 2203, configured to display lane lines.
- the operations performed by each unit in the detection device are similar to those described in the foregoing embodiment shown in FIG. 18 , and will not be repeated here.
- by applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationship between lane lines can then be effectively modeled.
- by adding the detection frame information of the objects in the image during the lane line detection process, the target neural network's ability to perceive the image scene can be improved, and misjudgments in scenes where the lane line is blocked by vehicles can be reduced.
- referring to FIG. 23 , another embodiment of the image processing device in the embodiment of the present application includes:
- An acquisition unit 2301 configured to acquire training images
- the processing unit 2302 is used to input the training image into the target neural network to obtain the first point set of the training image, and the first point set represents the predicted lane line in the training image;
- the target neural network is used to: perform feature extraction on the training image to obtain the first feature; process the detection frame information of the training image to obtain the second feature, where the detection frame information includes the position of the detection frame of an object in the training image within the training image; and obtain the first point set based on the first feature and the second feature, where the target neural network predicts the point sets of lane lines in the image based on the transformer structure;
- the training unit 2303 is configured to train the target neural network according to the first point set and the real point set of the actual lane line in the training image to obtain a trained target neural network.
- the operations performed by each unit in the image processing device are similar to those described in the foregoing embodiment shown in FIG. 20 , and will not be repeated here.
- by applying the transformer structure to the lane line detection task, the global information of the image to be detected can be obtained, and the long-range relationship between lane lines can then be effectively modeled.
- by adding the detection frame information of the objects in the image during the lane line detection process, the target neural network's ability to perceive the image scene can be improved, and misjudgments in scenes where the lane line is blocked by vehicles can be reduced.
- as shown in FIG. 24 , the image processing device may include a processor 2401 , a memory 2402 and a communication interface 2403 .
- the processor 2401, the memory 2402 and the communication interface 2403 are interconnected by wires.
- program instructions and data are stored in the memory 2402 .
- the memory 2402 stores program instructions and data corresponding to the steps executed by the device in the corresponding implementations shown in FIGS. 5 to 17 and 20 .
- the processor 2401 is configured to execute the steps performed by the device as shown in any one of the embodiments shown in the foregoing FIG. 5 to FIG. 17 and FIG. 20 .
- the communication interface 2403 may be used for receiving and sending data, and for performing steps related to acquisition, sending, and receiving in any of the embodiments shown in FIGS. 5 to 17 and 20 .
- the image processing device may include more or fewer components than those shown in FIG. 24 , which is only an example in the present application and not limited thereto.
- as shown in FIG. 25 , the detection device may include a processor 2501 , a memory 2502 and a communication interface 2503 .
- the processor 2501, memory 2502 and communication interface 2503 are interconnected by wires.
- program instructions and data are stored in the memory 2502 .
- the memory 2502 stores program instructions and data corresponding to the steps executed by the detection device in the corresponding embodiment shown in FIG. 18 .
- the processor 2501 is configured to execute the steps executed by the detection device shown in any one of the above-mentioned embodiments shown in FIG. 18 .
- the communication interface 2503 may be used for receiving and sending data, and for performing steps related to acquisition, sending, and receiving in any of the above-mentioned embodiments shown in FIG. 18 .
- the detection device may include more or fewer components than those shown in FIG. 25 , which is only an example in the present application and not limited thereto.
- the disclosed system, device and method can be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated units may be fully or partially realized by software, hardware, firmware or any combination thereof.
- when the integrated units are implemented using software, they may be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, all or part of the processes or functions according to the embodiments of the present invention will be generated.
- the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (such as infrared, radio, microwave, etc.).
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
- the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (solid state disk, SSD)), etc.
Abstract
本申请实施例公开了一种图像处理方法,该方法可以应用于自适应巡航、车道偏离预警、车道保持辅助等包含车道线检测的场景。该方法包括:对待检测图像进行特征提取,得到第一特征;对待检测图像的检测框信息进行处理,得到第二特征,检测框信息包括待检测图像中至少一个对象的检测框在待检测图像中的位置;将第一特征与第二特征输入基于transformer结构的第一神经网络,得到待检测图像中的车道线。通过将transformer结构的神经网络应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。并通过增加检测框信息,提升对图像场景的感知能力。
Description
本申请要求于2022年1月7日提交中国专利局、申请号为202210018538.X、发明名称为“一种图像处理方法、一种车道线检测方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请实施例涉及人工智能领域,尤其涉及一种图像处理方法、一种车道线检测方法及相关设备。
智能驾驶(例如自动驾驶、辅助驾驶等)技术依靠人工智能、视觉计算、雷达、监控装置和全球定位系统协同合作,让车辆可以在不需要人类主动操作下,实现自动驾驶。车道线检测技术是智能驾驶中最重要的技术之一,它对其他应用在智能驾驶系统上的技术(如自适应巡航控制、车道偏离警告、道路状况理解等)都有非常重要的意义。车道线检测技术的目标是通过摄像头获取的图片输入,预测出图片中的每一条车道线,以辅助汽车行驶在正确的车道上。
随着深度学习技术的发展,基于图像分割的车道线检测开始出现,基于图像分割的车道线检测模型首先预测出整张图的分割结果,然后通过聚类后输出车道线检测结果。
然而,基于深度学习技术的车道线检测方法大多是基于卷积神经网络,例如空间卷积神经网络(spatial convolutional neuron network,SCNN)等,由于卷积神经网络会受到感受野的限制,无法很好地感知图片的全局信息,从而无法准确地预测出车道线的位置,尤其在存在很多车辆遮挡的场景下,模型容易出现误测的情况。
发明内容
本申请实施例提供了一种图像处理方法、一种车道线检测方法及相关设备。可以提升检测图像中车道线的准确性。
本申请实施例第一方面提供了一种图像处理方法,该方法可以应用于智能驾驶场景。例如:自适应巡航、车道偏离预警(lane departure warning,LDW)、车道保持辅助(lane keeping assist,LKA)等包含车道线检测的场景。该方法可以由图像处理设备(例如终端设备或服务器)执行,也可以由图像处理设备的部件(例如处理器、芯片、或芯片系统等)执行。方法通过含有transformer结构的目标神经网络实现,方法包括:对待检测图像进行特征提取,得到第一特征;对待检测图像的检测框信息进行处理,得到第二特征,检测框信息包括待检测图像中对象的检测框在待检测图像中的位置;将第一特征与第二特征输入基于transformer结构的第一神经网络,得到待检测图像中的车道线。
本申请实施例中,一方面,通过将transformer结构应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。另一方面,通过在车道线检测的过程中增加图像中对象的检测框信息,可以提升对图像场景的感知能力,减少由于 车道线被车辆遮挡场景下的误判。
可选地,在第一方面的一种可能的实现方式中,上述步骤:对待检测图像的检测框信息进行处理,得到第二特征包括:对至少一个第三特征与检测框信息进行处理,得到第二特征,至少一个第三特征为获取第一特征的过程中所得到的中间特征。
该种可能的实现方式中,获取的第二特征不仅含有检测框信息,还含有图像的特征。为后续确定车道线提供更多的细节。
可选地,在第一方面的一种可能的实现方式中,上述的第二特征包括待检测图像中对象对应检测框的位置特征与语义特征,检测框信息还包括:检测框的类别与置信度;对至少一个第三特征与检测框信息进行处理,得到第二特征包括:基于至少一个第三特征、位置以及置信度获取语义特征;基于位置与类别获取位置特征。
该种可能的实现方式中,第二特征不仅考虑了检测框的位置,还考虑了检测框的类别与置信度,使得后续确定的车道线更加准确。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于至少一个第三特征、位置以及置信度获取语义特征,包括:基于位置从至少一个第三特征中提取出感兴趣区域ROI特征;对ROI特征与置信度进行乘法处理,并将得到的特征输入全连接层,得到语义特征;基于位置与类别获取位置特征,包括:获取类别的向量,并与位置对应的向量进行拼接,将得到的特征输入全连接层,得到位置特征。
该种可能的实现方式中,通过确定图像特征中与检测框相关的语义特征以及引入含有检测框位置信息的位置特征,使得第二特征具有的信息更加全面,进而提升车道线预测的准确性。
可选地,在第一方面的一种可能的实现方式中,上述基于transformer结构的第一神经网络包括编码器、解码器与前馈神经网络;将第一特征与第二特征输入基于transformer结构的第一神经网络,得到待检测图像中的车道线,包括:基于第一特征、第二特征以及编码器获取第四特征;将第四特征、第二特征以及查询特征输入解码器,得到第五特征;将第五特征输入前馈神经网络,得到多个点集。
该种可能的实现方式中,一方面,通过将transformer结构应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。另一方面,通过在确定点集的过程中增加含有检测框信息的第二特征,使得后续基于点集确定的车道线更加准确。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:基于第一特征获取第一行特征与第一列特征,第一行特征为由第一特征对应的矩阵沿着行的方向进行拉平(flatten)得到,第一列特征为由矩阵沿着列的方向进行拉平(flatten)得到;将第一特征与第二特征输入编码器得到第四特征,包括:将第一特征、第二特征、第一行特征以及第一列特征输入解码器,得到第四特征。
该种可能的实现方式中,通过引入能够顺应车道线形状挖掘上下文信息的第一行特征与第一列特征,可以提升对长条形车道线特征的构建能力,从而达到更好的车道线检测效果。
可选地,在第一方面的一种可能的实现方式中,上述步骤:将第一特征、第二特征、第一行特征以及第一列特征输入编码器,得到第四特征,包括:对第一特征进行自注意力计算, 得到第一输出;对第一特征与第二特征进行交叉注意力计算,得到第二输出;对第一行特征与第一列特征进行自注意力计算与拼接处理,得到行列输出;基于第一输出、第二输出以及行列输出获取第四特征。
该种可能的实现方式中,第四特征的获取过程中还考虑了行列输出,通过引入能够顺应车道线形状挖掘上下文信息的行列输出,可以提升对长条形车道线特征的构建能力,从而达到更好的车道线检测效果。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第一输出、第二输出以及行列输出获取第四特征,包括:对第一输出与第二输出进行相加处理,得到第五输出;对第五输出与行列输出进行拼接处理,得到第四特征。
该种可能的实现方式中,细化了第四特征的具体过程,第四特征是第一输出与第二输出相加得到的结果与行列输出拼接得到的。通过引入能够顺应车道线形状挖掘上下文信息的行列输出,可以提升对长条形车道线特征的构建能力,从而达到更好的车道线检测效果。
可选地,在第一方面的一种可能的实现方式中,上述步骤:将第一特征与第二特征输入编码器,得到第四特征,包括:对第一特征进行自注意力计算,得到第一输出;对第一特征与第二特征进行交叉注意力计算,得到第二输出;对第一输出与第二输出进行相加处理,得到第四特征。
该种可能的实现方式中,第四特征不仅含有基于第一特征通过自注意力机制计算得到的第一输出,还含有基于第一特征与第二特征交叉注意力计算得到的第二输出,提升第四特征的表达能力。
可选地,在第一方面的一种可能的实现方式中,上述步骤:将第四特征、第二特征以及查询特征输入解码器,得到第五特征,包括:对查询特征与第四特征进行交叉注意力计算,得到第三输出;对查询特征与第二特征进行处理,得到第四输出;对第三输出与第四输出进行相加处理,得到第五特征。
该种可能的实现方式中,通过交叉注意力计算使得获取的第五特征考虑了更多的带预测图像的信息,提升第五特征的表达能力,使得后续基于点集确定的车道线更加准确。
可选地,在第一方面的一种可能的实现方式中,上述步骤:对待检测图像进行特征提取,得到第一特征包括:对主干网络中不同层输出的特征进行特征融合与降维处理,得到第一特征,主干网络的输入为待检测图像。
该种可能的实现方式中,通过拼接各层的特征,由于神经网络不同层提取到特征性能不同,低层特征分辨率更高,包含更多位置、细节信息,但是由于经过的卷积更少,其语义性更低,噪声更多;高层特征具有更强的语义信息,但是分辨率低,对细节的感知能力较差。因此,针对神经网络不同层提取到的特征进行特征融合,得到的第一特征就具有多层次特征。
本申请实施例第二方面提供了一种车道线检测方法,该方法可以应用于智能驾驶场景。例如:自适应巡航、车道偏离预警、车道保持辅助等包含车道线检测的场景。该方法可以由检测设备(例如车辆或车辆中的设备)执行,也可以由检测设备的部件(例如处理器、芯片、或芯片系统等)执行。该方法包括:获取待检测图像;对待检测图像进行处理,得到多个点集,多个点集中的每个点集表示待检测图像中的一条车道线;其中,处理基于transformer结构的第一神经网络与检测框信息预测图像中车道线的点集,检测框信息包括待检测图像中 至少一个对象的检测框在待检测图像中的位置。
本申请实施例中,一方面,通过将transformer结构应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。另一方面,通过在车道线检测的过程中增加图像中对象的检测框信息,可以提升目标神经网络对图像场景的感知能力,减少由于车道线被车辆遮挡场景下的误判。
可选地,在第二方面的一种可能的实现方式中,上述的检测框信息还包括:检测框的类别与置信度。
该种可能的实现方式中,通过引入检测框的类别与置信度,可以使得后续预测的车道线参考的检测框信息增多,使得后续基于点集确定的车道线更加准确。
可选地,在第二方面的一种可能的实现方式中,上述步骤还包括:显示车道线。
该种可能的实现方式中,通过显示车道线,可以使得用户关注当前道路的车道线情况,尤其在车道线有遮挡等场景,帮助用户准确的确定车道线,减少由于车道线模糊带来的风险。
可选地,在第二方面的一种可能的实现方式中,上述步骤还包括:对至少一个对象进行建模得到虚拟对象;基于位置对多个点集与虚拟对象进行融合处理,得到目标图像;显示目标图像。
该种可能的实现方式中,通过建模虚拟对象,并基于位置将虚拟对象与多个点集进行融合,得到目标图像。用户可以通过目标图像了解周围的对象以及车道线,帮助用户准确的确定周边对象以及车道线,减少由于车道线模糊带来的风险。
本申请实施例第三方面提供了一种图像处理方法,该方法可以应用于智能驾驶场景。例如:自适应巡航、车道偏离预警、车道保持辅助等包含车道线检测的场景。该方法可以由图像处理设备(例如终端设备或服务器)执行,也可以由图像处理设备的部件(例如处理器、芯片、或芯片系统等)执行。该方法包括:获取训练图像;将训练图像输入目标神经网络,得到训练图像的第一点集,第一点集表示训练图像中的预测车道线;目标神经网络用于:对训练图像进行特征提取,得到第一特征;对训练图像的检测框信息进行处理,得到第二特征,检测框信息包括训练图像中对象的检测框在训练图像中的位置;基于第一特征和第二特征获取第一点集,目标神经网络用于基于transformer结构预测图像中车道线的点集;根据第一点集与训练图像中实际车道线的真实点集,对目标神经网络进行训练,得到训练好的目标神经网络。
本申请实施例中,一方面,通过将transformer结构应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。另一方面,通过在车道线检测的过程中增加图像中对象的检测框信息,可以提升目标神经网络对图像场景的感知能力,减少由于车道线被车辆遮挡场景下的误判。
本申请实施例第四方面提供了一种图像处理设备,该图像处理设备可以应用于智能驾驶场景。例如:自适应巡航、车道偏离预警、车道保持辅助等包含车道线检测的场景。图像处理设备包括:提取单元,用于对待检测图像进行特征提取,得到第一特征;处理单元,用于对待检测图像的检测框信息进行处理,得到第二特征,检测框信息包括待检测图像中至少一个对象的检测框在待检测图像中的位置;确定单元,用于将第一特征与第二特征输入基于transformer结构的第一神经网络,得到待检测图像中的车道线。
可选地,在第四方面的一种可能的实现方式中,上述的处理单元,具体用于对至少一个第三特征与检测框信息进行处理,得到第二特征,至少一个第三特征为获取第一特征的过程中所得到的中间特征。
可选地,在第四方面的一种可能的实现方式中,上述的第二特征包括待检测图像中对象对应检测框的位置特征与语义特征,检测框信息还包括:检测框的类别与置信度;处理单元,具体用于基于至少一个第三特征、位置以及置信度获取语义特征;处理单元,具体用于基于位置与类别获取位置特征。
可选地,在第四方面的一种可能的实现方式中,上述的处理单元,具体用于基于位置从至少一个第三特征中提取出感兴趣区域ROI特征;处理单元,具体用于对ROI特征与置信度进行乘法处理,并将得到的特征输入全连接层,得到语义特征;处理单元,具体用于获取类别的向量,并与位置对应的向量进行拼接,将得到的特征输入全连接层,得到位置特征。
可选地,在第四方面的一种可能的实现方式中,上述的基于transformer结构的第一神经网络包括编码器、解码器与前馈神经网络;确定单元,具体用于将第一特征与第二特征输入编码器,得到第四特征;确定单元,具体用于将第四特征、第二特征以及查询特征输入解码器,得到第五特征;确定单元,具体用于将第五特征输入前馈神经网络,得到多个点集,多个点集中的每个点集表示待检测图像中的一条车道线。
可选地,在第四方面的一种可能的实现方式中,上述的图像处理设备还包括:获取单元,用于基于第一特征获取第一行特征与第一列特征,第一行特征为由第一特征对应的矩阵沿着行的方向进行拉平(flatten)得到,第一列特征为由矩阵沿着列的方向进行拉平(flatten)得到;确定单元,具体用于将第一特征、第二特征、第一行特征以及第一列特征输入解码器,得到第四特征。
可选地,在第四方面的一种可能的实现方式中,上述的确定单元,具体用于对第一特征进行自注意力计算,得到第一输出;确定单元,具体用于对第一特征与第二特征进行交叉注意力计算,得到第二输出;确定单元,具体用于对第一行特征与第一列特征进行自注意力计算与拼接处理,得到行列输出;确定单元,具体用于基于第一输出、第二输出以及行列输出获取第四特征。
可选地,在第四方面的一种可能的实现方式中,上述的确定单元,具体用于对第一输出与第二输出进行相加处理,得到第五输出;确定单元,具体用于对第五输出与行列输出进行拼接处理,得到第四特征。
可选地,在第四方面的一种可能的实现方式中,上述的确定单元,具体用于对第一特征进行自注意力计算,得到第一输出;确定单元,具体用于对第一特征与第二特征进行交叉注意力计算,得到第二输出;确定单元,具体用于对第一输出与第二输出进行相加处理,得到第四特征。
可选地,在第四方面的一种可能的实现方式中,上述的确定单元,具体用于对查询特征与第四特征进行交叉注意力计算,得到第三输出;确定单元,具体用于对查询特征与第二特征进行处理,得到第四输出;确定单元,具体用于对第三输出与第四输出进行相加处理,得到第五特征。
可选地,在第四方面的一种可能的实现方式中,上述的提取单元,具体用于对主干网络 中不同层输出的特征进行特征融合与降维处理,得到第一特征,主干网络的输入为待检测图像。
本申请实施例第五方面提供了一种检测设备,该检测设备可以应用于智能驾驶场景。例如:自适应巡航、车道偏离预警、车道保持辅助等包含车道线检测的场景。该检测设备应用于车辆,检测设备包括:获取单元,用于获取待检测图像;处理单元,用于对待检测图像进行处理,得到多个点集,多个点集中的每个点集表示待检测图像中的一条车道线;其中,处理基于transformer结构的第一神经网络与检测框信息预测图像中车道线的点集,检测框信息包括待检测图像中至少一个对象的检测框在待检测图像中的位置。
可选地,在第五方面的一种可能的实现方式中,上述的检测框信息还包括:检测框的类别与置信度。
可选地,在第五方面的一种可能的实现方式中,上述的检测设备还包括:显示单元,用于显示车道线。
可选地,在第五方面的一种可能的实现方式中,上述的处理单元,还用于对至少一个对象进行建模得到虚拟对象;处理单元,还用于基于位置对多个点集与虚拟对象进行融合处理,得到目标图像;显示单元,还用于显示目标图像。
本申请实施例第六方面提供了一种图像处理设备,该图像处理设备可以应用于智能驾驶场景。例如:自适应巡航、车道偏离预警、车道保持辅助等包含车道线检测的场景。图像处理设备包括:获取单元,用于获取训练图像;处理单元,用于将训练图像输入目标神经网络,得到训练图像的第一点集,第一点集表示训练图像中的预测车道线;目标神经网络用于:对训练图像进行特征提取,得到第一特征;对训练图像的检测框信息进行处理,得到第二特征,检测框信息包括训练图像中对象的检测框在训练图像中的位置;基于第一特征和第二特征获取第一点集,目标神经网络用于基于transformer结构预测图像中车道线的点集;训练单元,用于根据第一点集与训练图像中实际车道线的真实点集,对目标神经网络进行训练,得到训练好的目标神经网络。
本申请第七方面提供了一种图像处理设备,包括:处理器,处理器与存储器耦合,存储器用于存储程序或指令,当程序或指令被处理器执行时,使得该图像处理设备实现前述第一方面或第一方面的任意可能的实现方式中的方法,或者实现前述第三方面或第三方面的任意可能的实现方式中的方法。
本申请第八方面提供了一种检测设备,包括:处理器,处理器与存储器耦合,存储器用于存储程序或指令,当程序或指令被处理器执行时,使得该检测设备实现上述第二方面或第二方面的任意可能的实现方式中的方法。
本申请第九方面提供了一种计算机可读介质,其上存储有计算机程序或指令,当计算机程序或指令在计算机上运行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法,或者使得计算机执行前述第二方面或第二方面的任意可能的实现方式中的方法,或者使得计算机执行前述第三方面或第三方面的任意可能的实现方式中的方法。
本申请第十方面提供了一种计算机程序产品,该计算机程序产品在计算机上执行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法,或者使得计算机执行前述第二方面或第二方面的任意可能的实现方式中的方法,或者使得计算机执行前述第三 方面或第三方面的任意可能的实现方式中的方法。
其中,第四、第七、第八、第九、第十方面或者其中任一种可能实现方式所带来的技术效果可参见第一方面或第一方面不同可能实现方式所带来的技术效果,此处不再赘述。
其中,第五、第七、第八、第九、第十方面或者其中任一种可能实现方式所带来的技术效果可参见第二方面或第二方面不同可能实现方式所带来的技术效果,此处不再赘述。
其中,第六、第七、第八、第九、第十方面或者其中任一种可能实现方式所带来的技术效果可参见第三方面或第三方面不同可能实现方式所带来的技术效果,此处不再赘述。
从以上技术方案可以看出,本申请实施例具有以下优点:一方面,通过将transformer结构应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。另一方面,通过在车道线检测的过程中增加图像中对象的检测框信息,可以提升对图像场景的感知能力,减少由于车道线被车辆遮挡场景下的误判。
图1为本申请实施例提供的系统架构的结构示意图;
图2为本申请实施例提供的一种芯片硬件结构示意图;
图3a为本申请实施例提供的图像处理系统的一个结构示意图;
图3b为本申请实施例提供的图像处理系统的另一结构示意图;
图4为本申请实施例提供的车辆的一种结构示意图;
图5为本申请实施例提供的图像处理方法的一个流程示意图;
图6为本申请实施例中获取第二特征的一个流程示意图;
图7为本申请实施例提供的第一神经网络的一个结构示意图;
图8为本申请实施例提供的transformer结构的一个结构示意图;
图9为本申请实施例中获取第四特征的一个流程示意图;
图10为本申请实施例中获取第四输出的一个流程示意图;
图11为本申请实施例提供的第一神经网络的另一个结构示意图;
图12为本申请实施例提供的transformer结构的另一个结构示意图;
图13为本申请实施例提供的行列注意力模块的一个结构示意图;
图14a为本申请实施例提供的包括确定多个点集过程的示例图;
图14b为本申请实施例提供的多个点集的一个示例图;
图14c为本申请实施例提供的包括多个点集的待检测图像的一个示例图;
图14d为本申请实施例提供的车道线检测对应的示例图;
图15为本申请实施例提供的图像处理方法的另一个流程示意图;
图16为本申请实施例提供的目标神经网络的一个结构示意图;
图17为本申请实施例提供的目标神经网络的另一个结构示意图;
图18为本申请实施例提供的车道线检测方法的一个流程示意图;
图19为本申请实施例提供的目标图像的示例图;
图20为本申请实施例提供的模型训练方法的一个流程示意图;
图21为本申请实施例提供的图像处理设备的一个结构示意图;
图22为本申请实施例提供的检测设备的一个结构示意图;
图23为本申请实施例提供的图像处理设备的另一个结构示意图;
图24为本申请实施例提供的图像处理设备的另一个结构示意图;
图25为本申请实施例提供的检测设备的另一个结构示意图。
本申请实施例提供了一种图像处理方法、一种车道线检测方法及相关设备。可以提升检测图像中车道线的准确性。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”并他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
智能驾驶的第一步是环境信息的采集与处理,而车道线作为路面最主要的指示信息之一,它可以有效地引导智能车辆在约束的道路区域内行驶。因此,如何实时、准确检测出路面的车道线是智能车辆相关系统设计中的重要环节,可有利于协助路径规划、进行道路偏移预警等功能,并且可为精确导航提供参照。车道线检测技术的目的是通过分析车载摄像头在行驶过程中采集的图片,准确地识别出路面的车道线,以辅助汽车行驶在正确的车道上。
随着深度学习技术的发展,基于图像分割的车道线检测和基于检测的车道线检测开始出现。基于图像分割的车道线检测模型首先预测出整张图的分割结果,然后通过聚类后输出车道线检测结果。而基于检测的车道线检测通过生成多个锚点并预测车道线相对于锚点的偏移量来预测出大量候选车道线,然后通过非极大值抑制进行后处理来得到最终的车道线检测结果。
基于深度学习技术的车道线检测方法大多是基于卷积神经网络,例如:空间卷积神经网络(spatial convolutional neuron network,SCNN)等。SCNN是基于图像分割来进行车道线检测的一种技术方案。该方案使用卷积神经网络对需检测的图片进行图像分割,给图片中的每个像素点预测一个类别。该方案将传统的深度卷积结构推广成逐片卷积结构,并且按照不同的方向来进行卷积,以使得图片中行和列之间的信息可以得到传递。具体来说,传统的卷积是对一个维度为HxWxC的特征进行卷积操作,而该方案首先把HxWxC按照纵向分成H个WxC的片,然后从下到上和从上到下分别对这些片进行卷积,然后把HxWxC按照水平方向分成W个HxC的片,然后从左到右和从右到左分别对这些片进行卷积,最后,把按照这四个方向得到卷积结果拼接起来,通过全连接层输出图像的分割图。从而实现车道线的检测。
但是,由于卷积神经网络会受到感受野的限制,无法很好地感知图片的全局信息。一方面,不利于车道线这种具有长尾关系(也可以理解为是形状细长)的对象的预测。另一方面,尤其在存在很多车辆遮挡的场景下,无法准确地预测出车道线的位置,模型容易出现误测的情况。
为了解决上述技术问题,本申请实施例提供一种图像处理方法、一种车道线检测方法及相关设备,一方面,通过将transformer结构应用于车道线检测任务上,可以有效地建模车道线之间的长程联系。另一方面,通过在车道线检测的过程中增加图像中对象的检测框位置信息,可以提升对场景的感知能力。减少由于车道线被车辆遮挡场景下的误判。下面将结合附图对本申请实施例的图像处理方法及相关设备进行详细的介绍。
为了便于理解,下面先对本申请实施例主要涉及的相关术语和概念进行介绍。
1、神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以X
s和截距1为输入的运算单元,该运算单元的输出可以为:
其中,s=1、2、……n,n为大于1的自然数,W
s为X
s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是Relu函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
神经网络中的每一层的工作可以用数学表达式y=a(Wx+b)来描述:从物理层面神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作,完成输入空间到输出空间的变换(即矩阵的行空间到列空间),这五种操作包括:1、升维/降维;2、放大/缩小;3、旋转;4、平移;5、“弯曲”。其中1、2、3的操作由Wx完成,4的操作由+b完成,5的操作则由a()来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物,而是一类事物,空间是指这类事物所有个体的集合。其中,W是权重向量,该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文所述的输入空间到输出空间的空间变换,即每一层的权重W控制着如何变换空间。训练神经网络的目的,也就是最终得到训练好的神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。因此,神经网络的训练过程本质上就是学习控制空间变换的方式,更具体的就是学习权重矩阵。
2、卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使同一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平 面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
3、transformer
transformer结构是一种包含编码器与解码器的特征提取网络(类别于卷积神经网络)。
编码器:通过自注意力的方式在全局感受野下进行特征学习,例如像素点的特征。
解码器:通过自注意力与交叉注意力来学习所需模块的特征,例如输出框的特征。
下面对注意力(也可以称为注意力机制)进行描述:
注意力机制可以快速提取稀疏数据的重要特征。注意力机制是发生在编码器和解码器之间,也可以说是发生在输入句子和生成句子之间。而自注意力模型中的自注意力机制则发生在输入序列内部,或者输出序列内部,可以抽取到同一个句子内间隔较远的单词之间的联系,比如句法特征(短语结构)。自注意力机制通过QKV提供了一种有效的捕捉全局上下文信息的建模方式。假定输入为Q(query),以键值对(K,V)形式存储上下文。那么注意力机制其实是query到一系列键值对(key,value)上的映射函数。attention函数的本质可以被描述为一个查询(query)到一系列(键key-值value)对的映射。attention本质上是为序列中每个元素都分配一个权重系数,这也可以理解为软寻址。如果序列中每一个元素都以(K,V)形式存储,那么attention则通过计算Q和K的相似度来完成寻址。Q和K计算出来的相似度反映了取出来的V值的重要程度,即权重,然后加权求和就得到最后的特征值。
注意力的计算主要分为三步,第一步是将query和每个key进行相似度计算得到权重,常用的相似度函数有点积,拼接,感知机等;然后第二步一般是使用一个softmax函数(一方面可以进行归一化,得到所有权重系数之和为1的概率分布。另一方面可以用softmax函数的特性突出重要元素的权重)对这些权重进行归一化;最后将权重和相应的键值value进行加权求和得到最后的特征值。具体计算公式可以如下:
Attention(Q,K,V)=softmax(QK^T/√d)V
其中,d为QK矩阵的维度。
另外,注意力包括自注意力与交叉注意力,自注意可以理解为是特殊的注意力,即QKV的输入一致。而交叉注意力中的QKV的输入不一致。注意力是利用特征之间的相似程度(例如内积)作为权重来集成被查询特征作为当前特征的更新值。自注意力是基于特征图本身的关注而提取的注意力。
对于卷积而言,卷积核的设置限制了感受野的大小,导致网络往往需要多层的堆叠才能 关注到整个特征图。而自注意的优势就是它的关注是全局的,它能通过简单的查询与赋值就能获取到特征图的全局空间信息。自注意力在查询、键、值(query key value,QKV)模型中的特殊点在于QKV对应的输入是一致的。后续会对QKV模型进行描述。
4、前馈神经网络
前馈神经网络(feedforward neural network,FNN)是最早发明的简单人工神经网络。在前馈神经网络中,各神经元分别属于不同的层。每一层的神经元可以接收前一层神经元的信号,并产生信号输出到下一层。第0层称为输入层,最后一层称为输出层,其它中间层称为隐藏层。整个网络中无反馈,信号从输入层向输出层单向传播。
5、多层感知器(multilayer perceptron,MLP)
多层感知器,也可以称为多层感知机,是一种前馈人工神经网络模型,其将输入映射到单一的输出的上。
6、损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
7、特征融合
将神经网络提取的不同特征通过某种方法生成新的特征,从而使新特征对分类、识别或检测等更有效,特征融合一般具有两种方式:concat和add。其中,concat是系列特征融合方式,即直接将两个特征进行连接,两个输入特征x和y的维数若为p和q,输出特征z的维数为p+q;add则是一种并行融合策略,是将两个特征向量进行组合,对于输入特征x和y,得到通道数不变的新的特征z。换句话说,add是描述图像的特征下的信息量增多了,但是描述图像的维度本身并没有增加,只是每一维下的信息量在增加;而concat是通道数的合并,也就是说描述图像本身的特征增加了,而每一特征下的信息是没有增加。
8、降维处理
降维处理是将高维度数据化为低维度数据的操作。本实施例中,降维处理主要是针对特征矩阵。具体的,可以通过一个线性变换层对特征矩阵进行降维。对于特征矩阵的降维处理也可以理解为是降低该特征矩阵对应的向量空间的维数。
9、感兴趣区域。
感兴趣区域(region of interest,ROI):机器视觉、图像处理中,从被处理的图像以方框、圆、椭圆、不规则多边形等方式勾勒出需要处理的区域。
下面介绍本申请实施例提供的系统架构。
参见附图1,本发明实施例提供了一种系统架构100。如所述系统架构100所示,数据采 集设备160用于采集训练数据,本申请实施例中训练数据包括:训练图像。可选地,训练数据还可以包括训练图像的第一特征、训练图像中对象对应的检测框信息。并将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。下面将更详细地描述训练设备120如何基于训练数据得到目标模型/规则101,该目标模型/规则101能够用于实现本申请实施例提供的图像处理方法。本申请实施例中的目标模型/规则101具体可以为目标神经网络。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)设备/虚拟现实(virtual reality,VR)设备,车载终端等。当然,执行设备110还可以是服务器或者云端等。在附图1中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:待检测图像。另外该输入数据可以是用户输入的,也可以是用户通过拍摄设备上传的,当然还可以来自数据库,具体此处不做限定。
预处理模块113用于根据I/O接口112接收到的输入数据进行预处理,在本申请实施例中,预处理模块113可以用于,获取待检测图像的特征。可选地,预处理模块113还可以用于,获取待检测图像中对象对应的检测框信息。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如上述得到的点集或者包括点集的图像返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在附图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,附图1仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图1中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图1所示,根据训练设备120训练得到目标模型/规则101,本申请实施例中的目标模型/规则101具体可以为目标神经网络。
下面介绍本申请实施例提供的一种芯片硬件结构。
图2为本发明实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器20。该芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。
神经网络处理器20可以是神经网络处理器(neural-network processing unit,NPU),张量处理器(tensor processing unit,TPU),或者图形处理器(graphics processing unit,GPU)等一切适合用于大规模异或运算处理的处理器。以NPU为例:神经网络处理器20作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路203,控制器204控制运算电路203提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路203内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路203是二维脉动阵列。运算电路203还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路203是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路203从权重存储器202中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器201中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器208中。
向量计算单元207可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元207可以用于神经网络中非卷积/非FC层的网络计算,如池化(Pooling),批归一化(Batch Normalization),局部响应归一化(Local Response Normalization)等。
在一些实现中,向量计算单元能207将经处理的输出的向量存储到统一缓存器206。例如,向量计算单元207可以将非线性函数应用到运算电路203的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元207生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路203的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器206用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器205(direct memory access controller,DMAC)将外部存储器中的输入数据搬运到输入存储器201和/或统一存储器206、将外部存储器中的 权重数据存入权重存储器202,以及将统一存储器206中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)210,用于通过总线实现主CPU、DMAC和取指存储器209之间进行交互。
与控制器204连接的取指存储器(instruction fetch buffer)209,用于存储控制器204使用的指令。
控制器204,用于调用指存储器209中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器206,输入存储器201,权重存储器202以及取指存储器209均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
接下来介绍几种本申请的应用场景。
图3a为本申请实施例提供的图像处理系统的一个结构示意图,该图像处理系统包括用户设备(图3a中以车辆为例)以及图像处理设备。可以理解的是,用户设备除了可以是车辆之外,还可以是手机、车载终端、飞机终端、VR/AR设备、智能机器人等智能终端。用户设备为图像处理的发起端,作为图像处理请求的发起方,通常由用户通过用户设备发起请求。
上述图像处理设备可以是云服务器、网络服务器、应用服务器以及管理服务器等具有图像处理功能的设备或服务器。图像处理设备通过交互接口接收来自智能终端的图像处理请求,再通过存储数据的存储器以及图像处理的处理器环节进行机器学习,深度学习,搜索,推理,决策等方式的图像处理。图像处理设备中的存储器可以是一个统称,包括本地存储以及存储历史数据的数据库,数据库可以在图像处理设备上,也可以在其它网络服务器上。
在图3a所示的图像处理系统中,用户设备可以接收用户的指令,例如用户设备可以获取用户输入/选择的一张图像(或者用户设备通过摄像头采集的图像),然后向图像处理设备发起请求,使得图像处理设备针对用户设备得到的该图像执行图像处理应用(例如,图像中的车道线检测等等),从而得到针对该图像的对应的处理结果。示例性的,用户设备可以获取用户输入的一张图像,然后向图像处理设备发起图像检测请求,使得图像处理设备对该图像进行检测,从而得到图像的检测结果(即车道线的点集),并显示图像的检测结果,以供用户观看和使用。
在图3a中,图像处理设备可以执行本申请实施例的图像处理方法。
图3b为本申请实施例提供的图像处理系统的另一结构示意图,在图3b中,用户设备(图3b中以车辆为例)直接作为图像处理设备,该用户设备能够直接获取图像,并直接由用户设备本身的硬件进行处理,具体过程与图3a相似,可参考上面的描述,在此不再赘述。
可选地,在图3b所示的图像处理系统中,用户设备可以接收用户的指令,例如用户设备可以获取用户在用户设备中所选择的一张图像,然后再由用户设备自身针对该图像执行图像处理应用(例如,图像中的车道线检测等等),从而得到针对该图像的对应的处理结果,并显示处理结果,以供用户观看和使用。
可选地,在图3b所示的图像处理系统中,用户设备可以实时或周期性的采集用户设备所在道路的图像,然后再由用户设备自身针对该图像执行图像处理应用(例如,图像中的车道线 检测等等),从而得到针对该图像的对应的处理结果,并根据处理结果实现智能驾驶功能,例如:自适应巡航、车道偏离预警(lane departure warning,LDW)、车道保持辅助(lane keeping assist,LKA)等。
在图3b中,用户设备自身就可以执行本申请实施例的图像处理方法。
上述图3a和图3b中的用户设备具体可以是图1中的客户设备140或执行设备110,图3a中的图像处理设备具体可以是图1中的执行设备110,其中,数据存储系统250可以存储执行设备210的待处理数据,数据存储系统250可以集成在执行设备210上,也可以设置在云上或其它网络服务器上。
图3a和图3b中的处理器可以通过神经网络模型或者其它模型(例如,基于支持向量机的模型)进行数据训练/机器学习/深度学习,并利用数据最终训练或者学习得到的模型针对图像执行图像处理应用,从而得到相应的处理结果。
下面对上述场景中的车辆架构进行描述。请先参阅图4,图4为本申请实施例提供的车辆的一种结构示意图。
车辆可包括各种子系统,例如行进系统402、传感器系统404、控制系统406、一个或多个外围设备408以及电源410和用户接口416。可选地,车辆可包括更多或更少的子系统,并且每个子系统可包括多个部件。另外,车辆的每个子系统和部件可以通过有线或者无线(例如,蓝牙)互连。
行进系统402可包括为车辆提供动力运动的组件。在一个实施例中,行进系统402可包括引擎418、能量源419、传动装置420和车轮421。
其中,引擎418可以是内燃引擎、电动机、空气压缩引擎或其他类型的引擎组合,例如,汽油发动机和电动机组成的混动引擎,内燃引擎和空气压缩引擎组成的混动引擎。引擎418将能量源419转换成机械能量。能量源419的示例包括汽油、柴油、其他基于石油的燃料、丙烷、其他基于压缩气体的燃料、乙醇、太阳能电池板、电池和其他电力来源。能量源419也可以为车辆的其他系统提供能量。传动装置420可以将来自引擎418的机械动力传送到车轮421。传动装置420可包括变速箱、差速器和驱动轴。在一个实施例中,传动装置420还可以包括其他器件,比如离合器。其中,驱动轴可包括一个或多个可耦合到车轮421的轴。
传感器系统404可包括感测关于车辆位置信息的若干个传感器。例如,传感器系统404可包括定位系统422(例如:全球定位系统、北斗系统或者其他定位系统)、惯性测量单元(inertial measurement unit,IMU)424、雷达426、激光测距仪428以及相机430。传感器系统404还可包括被监视车辆的内部系统的传感器(例如,车内空气质量监测器、燃油量表、机油温度表等)。来自这些传感器中的一个或多个的传感数据可用于检测对象及其相应特性(例如,位置、形状、方向、速率等)。这种检测和识别是自主车辆的安全操作的关键功能。
其中,定位系统422可用于估计车辆的地理位置,比如车辆所处位置的经纬度信息。IMU 424用于基于惯性加速率来感知车辆的位置和朝向变化。在一个实施例中,IMU 424可以是加速率计和陀螺仪的组合。雷达426可利用无线电信号来感知车辆的周边环境内的物体,具体可以表现为毫米波雷达或激光雷达。在一些实施例中,除了感知物体以外,雷达426还可用于感知物体的速率和/或前进方向。激光测距仪428可利用激光来感知车辆所位于的环境中的 物体。在一些实施例中,激光测距仪428可包括一个或多个激光源、激光扫描器以及一个或多个检测器,以及其他系统组件。相机430可用于捕捉车辆的周边环境的多个图像。相机430可以是静态相机或视频相机。
控制系统406为控制车辆及其组件的操作。控制系统406可包括各种部件,其中包括转向系统432、油门434、制动单元436、电子控制单元438(electronic control unit,ECU)以及整车控制器440(body control module,BCM)。
其中,转向系统432可操作来调整车辆的前进方向。例如在一个实施例中可以为方向盘系统。油门434用于控制引擎418的操作速率并进而控制车辆的速率。制动单元436用于控制车辆减速。制动单元436可使用摩擦力来减慢车轮421。在其他实施例中,制动单元436可将车轮421的动能转换为电流。制动单元436也可采取其他形式来减慢车轮421转速从而控制车辆的速率。车辆电子控制单元438可以被实现为车辆上的单个ECU或多个ECU,所述单个ECU或多个ECU被配置为与外围设备408、传感器系统404进行通信。车辆ECU438可包括至少一个处理器4381,存储器4382(read-only memory,ROM)。可以利用一个或多个通用处理器、内容可寻址存储器、数字信号处理器、专用集成电路、现场可编程门阵列、任何适当的可编程逻辑器件、离散门或晶体管逻辑、离散硬件部件或者被设计用于执行这里描述的功能的任何组合实现或执行至少一个处理器。特别地,至少一个处理器可以被实现为一个或多个微处理器、控制器、微控制器(microcontroller unit,MCU)或状态机。此外,至少一个处理器可以被实现为计算设备的组合,例如数字信号处理器或微处理器、多个微处理器、与数字信号处理器核心结合的一个或多个微处理器,或者任何其他这种配置的组合。ROM可以提供数据的存储,包含本申请中地址、路线、行驶方向的存储。
BCM140可以给ECU438提供车辆发动机状态,速率,档位,方向盘角度等信息。
车辆通过外围设备408与外部传感器、其他车辆、其他计算机系统或用户之间进行交互。外围设备408可包括无线通信系统446、导航系统448、麦克风450和/或扬声器452。在一些实施例中,外围设备408为车辆的用户提供与用户接口416交互的手段。例如,导航系统448可以被实现为车载娱乐系统的一部分、车载显示系统、车载仪器集群等。在一个实际实施例中,导航系统448被实现为包括或与传感器系统404协作,该传感器系统404实时或基本上实时推导出车辆的当前地理位置。导航系统448被配置为向车辆的驾驶员提供导航数据。导航数据可包括车辆的位置数据、建议路线规划行驶指示,以及给车辆操作者的可见地图信息。导航系统448可通过显示元件或其他呈现设备将该位置数据呈现给车辆的驾驶员。车辆的当前位置可以通过以下信息中的一种或者几种来描述:三角测量的位置、纬度/经度位置、x和y坐标,或者指示车辆的地理位置的任何其他符号或任何测量方式。
用户接口416还可操作导航系统448来接收用户的输入。导航系统448可以通过触摸屏进行操作。导航系统448在用户输入起点和终点的地理位置值时,提供路线规划的能力和导航的能力。在其他情况中,外围设备408可提供用于车辆与位于车内的其它设备通信的手段。例如,麦克风450可从车辆的用户接收音频(例如,语音命令或其他音频输入)。类似地,扬声器452可向车辆的用户输出音频。无线通信系统446可以直接地或者经由通信网络来与一个或多个设备无线通信。例如,无线通信系统446可使用3G蜂窝通信,例如如码分多址(code division multiple access,CDMA)、EVD0、全球移动通信系统(global system for mobile communications,GSM)/是通用分组无线服务技术(general packet radio service,GPRS),或者4G蜂窝通信,例如长期演进(long term evolution,LTE),或者5G蜂窝通信。无线通信系统446可利用WiFi与无线局域网(wireless local area network,WLAN)通信。在一些实施例中,无线通信系统446可利用红外链路、蓝牙或ZigBee与设备直接通信。其他无线协议,例如各种车辆通信系统,例如,无线通信系统446可包括一个或多个专用短程通信(dedicated short range communications,DSRC)设备,这些设备可包括车辆和/或路边台站之间的公共和/或私有数据通信。
电源410可向车辆的各种组件提供电力。在一个实施例中,电源410可以为可再充电锂离子或铅酸电池。这种电池的一个或多个电池组可被配置为电源为车辆的各种组件提供电力。在一些实施例中,电源410和能量源419可一起实现,例如一些全电动车中那样。
可选地,上述这些组件中的一个或多个可与车辆分开安装或关联。例如,存储器4382可以部分或完全地与车辆分开存在。上述组件可以按有线和/或无线方式来通信地耦合在一起。
可选地,上述组件只是一个示例,实际应用中,上述各个模块中的组件有可能根据实际需要增添或者删除,图4不应理解为对本申请实施例的限制。
上述车辆可以为轿车、卡车、摩托车、公共汽车、船、割草机、娱乐车、游乐场车辆、施工设备、电车、高尔夫球车、和手推车等,本申请实施例不做特别的限定。
下面对本申请实施例提供的图像处理方法进行描述。该方法可以由图像处理设备执行,也可以由图像处理设备的部件(例如处理器、芯片、或芯片系统等)执行。该图像处理设备可以是云端设备(如前述图3a所示),也可以是车辆(例如图4所示的车辆)或终端设备(例如车载终端、飞机终端等等)等(如前述图3b所示)。当然,该方法也可以是由云端设备和车辆构成的系统执行(如前述图3a所示)。可选地,该方法可以由图像处理设备中的CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。
该方法的应用场景(或者理解为是第一神经网络或目标神经网络的应用场景)可以用于智能驾驶场景。例如:自适应巡航、车道偏离预警(lane departure warning,LDW)、车道保持辅助(lane keeping assist,LKA)等包含车道线检测的场景。在智能驾驶场景,本申请实施例提供的图像处理方法可以通过车辆上的传感器(例如摄像头)获取待检测图像,并获取该待检测图像中的车道线,进而实现上述自适应巡航、LDW或LKA等。
本申请实施例中,根据图像处理设备为云端设备还是用户设备,本申请实施例提供的图像处理方法可以包括两种情况,下面分别描述。
第一种情况,图像处理设备为用户设备,这里仅以用户设备是车辆为例(如前述图3b的场景)。可以理解的是,用户设备除了可以是车辆之外,还可以是手机、车载终端、飞机终端、VR/AR设备、智能机器人等智能终端,具体此处不做限定。
请参阅图5,本申请实施例提供的图像处理方法的一个流程示意图,该方法通过目标神经网络实现,该方法可以包括步骤501至步骤504。下面对步骤501至步骤504进行详细说明。
步骤501,获取待检测图像。
本申请实施例中图像处理设备获取待检测图像的方式有多种方式,可以是通过图像处理设备采集待检测图像的方式,也可以是通过接收其他设备发送的待检测图像的方式,还可以是从数据库中选取训练数据的方式等,具体此处不做限定。
可选地,该待检测图像包括车、人、物体、树木、标识等中的至少一种对象。
示例性的,在智能驾驶领域,该图像处理设备可以是指车辆。车辆上的传感器(例如:摄像头或相机)采集图像。可以理解的是,车辆上的传感器可以实时采集图像,也可以是周期性的采集图像,例如:每隔0.5秒采集一次图像,具体此处不做限定。
步骤502,对待检测图像进行特征提取,得到第一特征。
图像处理设备获取待检测图像之后,可以获取待检测图像的第一特征。具体的,对待检测图像进行特征提取,得到第一特征。可以理解的是,本申请实施例所提的特征可以用矩阵或向量等方式进行表达。
可选地,图像处理设备可以通过主干网络对待检测图像进行特征提取,得到第一特征。该主干网络可以是卷积神经网络、图卷积网络(graph convolutional networks,GCN)、循环神经网络等具有提取图像特征功能的网络,具体此处不做限定。
进一步的,为了获取待检测图像的多层次特征,图像处理设备可以对主干网络中不同层输出的特征进行特征融合与降维处理,得到第一特征。其中,不同层输出的特征也可以理解为是计算第一特征过程中的中间特征(也可以称为至少一个第三特征),第三特征的数量与主干网络的层数相关,例如:第三特征的数量与主干网络的层数相同,或者第三特征的数量为主干网络的层数减1。
该种方式下,由于神经网络不同层提取到特征性能不同,低层特征分辨率更高,包含更多位置、细节信息,但是由于经过的卷积更少,其语义性更低,噪声更多。高层特征具有更强的语义信息,但是分辨率低,对细节的感知能力较差。因此,针对主干网络不同层提取到的特征进行特征融合,得到融合后的特征(记为H_f),融合后的特征就具有多层次特征。进一步的,对融合后的特征进行降维处理得到第一特征(记为H′_f)。因此,第一特征同样具有多层次特征。其中,上述的H_f∈R^(h×w×d),h为H_f的行数,w为H_f的列数,d为H_f的维度。例如:通过一个线性变换层将H_f的维度d降成d′,即H′_f∈R^(h×w×d′)。
示例性的,上述的主干网络为采用50层的残差卷积神经网络(residual neural network-50,ResNet50)。
步骤503,对待检测图像的检测框信息进行处理,得到第二特征。
图像处理设备在获取待检测图像之后,可以先基于人-车检测模型得到待检测图像的检测框信息。具体的,将待检测图像输入人-车检测模型中,得到检测框信息,该检测框信息包括待检测图像中至少一个对象的检测框在待检测图像中的位置。其中,该人-车检测模型可以是区域卷积神经网络(region convolutional neuron network,R-CNN)、快速区域卷积神经网 络(fast R-CNN)或更快速区域卷积神经网络(faster R-CNN)等,具体此处不做限定。上述所提的对象可以包括待检测图像中的车、人、物体、树木、标识等中的至少一项,具体此处不做限定。可以理解的是,该检测框的位置可以是经过归一化处理后的位置。
可以理解的是,若获取待检测图像中越多对象的检测框信息,则获取的第二特征的表达能力越强。
可选地,检测框信息还可以包括检测框的类别与置信度。
图像处理设备获取检测框信息之后,可以对检测框信息进行处理,得到第二特征,该第二特征也可以理解为是待检测图像的检测框特征,该第二特征包括待检测图像中对象对应检测框的位置特征与语义特征。其中,位置特征可以记为,语义特征可以记为。
可选地,对至少一个第三特征与检测框信息进行处理,得到第二特征。该至少一个第三特征为获取第一特征的过程中所得到的中间特征(如前述步骤502中的中间特征)。具体的,将检测框信息与中间特征输入到预处理模块,得到位置特征与语义特征。
可选地,若主干网络采用了特征金字塔网络(feature pyramid networks,FPN)结构,则可以基于对至少一个第三特征与检测框信息进行处理,得到第二特征。若主干网络未采用FPN结构,则可以使用未降维前的第一特征以及检测框信息获取第二特征。
本申请实施例中,基于检测框信息的不同,具体获取第二特征的过程(也可以理解为预处理模块的功能)有所不同,下面分别描述:
1、检测框信息只包括检测框的位置。
上述获取语义特征的过程可以包括:根据检测框的位置与主干网络中不同层之间的采样率对检测框进行缩放。使用缩放后的检测框从中间特征对应采样率的特征层中提取出ROI特征。将ROI特征进行通过处理(例如:输入全连接层的处理,或者输入单层感知机与激活层的处理)得到检测框的语义特征:Z
r∈R
M×d′,其中,M是待检测图像中检测框的个数。
上述获取位置特征的过程可以包括:将检测框位置对应的向量进行处理(例如:输入全连接层的处理,或者输入单层感知机与激活层的处理)得到检测框的位置特征:Z
b∈R
M×d′。
示例性的,假设主干网络为5层结构的神经网络,第三层下采样率为8,我们会将原检测框缩小8倍。一般情况下,检测框面积越大,去越小的特征层(越后面的层)去提取ROI特征。
2、检测框信息包括检测框的位置与置信度。
上述获取语义特征的过程可以包括:根据检测框的位置与主干网络中不同层之间的采样率对检测框进行缩放。使用缩放后的检测框从中间特征对应采样率的特征层中提取出ROI特征。将检测框的置信度当做系数,与提取出的ROI特征进行相乘处理,并将相乘后的特征通过处理(例如:输入全连接层的处理,或者输入单层感知机与激活层的处理)得到检测框的语义特征:Z
r∈R
M×d′,其中,M是待检测图像中检测框的个数。
上述获取位置特征的过程可以包括:将检测框位置对应的向量进行处理(例如:输入全连接层的处理,或者输入单层感知机与激活层的处理)得到检测框的位置特征:Z
b∈R
M×d′。其中,可以采用独热码(one-hot)等编码方式对检测框的类别进行编码处理,得到类别向量。
3、检测框信息包括检测框的位置、置信度与类别。
上述获取语义特征的过程可以包括:根据检测框的位置与主干网络中不同层之间的采样率对检测框进行缩放。使用缩放后的检测框从第一特征中对应采样率的特征层中提取出ROI特征。将检测框的置信度当做系数,与提取出的ROI特征进行相乘处理,并将相乘后的特征通过处理(例如:输入全连接层的处理,或者输入单层感知机与激活层的处理)得到检测框的语义特征:Z
r∈R
M×d′,其中,M是待检测图像中检测框的个数。
上述获取位置特征的过程可以包括:将检测框的类别变换为类别向量。并将该类别向量与上述检测框位置对应的向量进行拼接,并经过处理(例如:输入全连接层的处理,或者输入单层感知机与激活层的处理)得到检测框的位置特征:Z
b∈R
M×d′。其中,可以采用独热码(one-hot)等编码方式对检测框的类别进行编码处理,得到类别向量。
可以理解的是,上述检测框信息的几种情况以及几种获取第二特征的具体过程只是举例,在实际应用中,检测框还可以有其他情况(例如:检测框信息包括检测框的位置与类别),获取第二特征还可以有其他方式,具体此处不做限定。
示例性的,获取第二特征的过程可以如图6所示。其中,检测预处理模块执行的步骤参考上述获取第二特征的过程描述,此处不再赘述。
步骤504,将第一特征与第二特征输入基于transformer结构的第一神经网络,得到待检测图像中的车道线。
图像处理设备获取第一特征与第二特征之后,可以将第一特征与第二特征输入基于transformer结构的第一神经网络,得到待检测图像中的车道线。具体的,可以先获取多个点集,再基于多个点集确定车道线。该多个点集中的每个点集表示待检测图像中的一条车道线。
可选地,基于transformer结构的第一神经网络包括编码器、解码器与前馈神经网络。上述获取多个点集可以包括如下步骤:将第一特征与第二特征输入编码器,得到第四特征;将第四特征、第二特征以及查询特征输入解码器,得到第五特征;将第五特征输入前馈神经网络,得到多个点集。后续会结合附图并分情况进行描述,此处再展开。
可选地,可以将第一特征与第二特征输入训练好的第一神经网络,得到多个点集。其中,该训练好的第一神经网络是以训练数据作为第一神经网络的输入,以第一损失函数的值小于第一阈值为目标对第一神经网络进行训练得到,训练数据包括训练图像的第一特征、训练图像中对象对应检测框的位置特征与语义特征,第一损失函数用于表示训练过程中第一神经网络输出的点集与第一点集之间的差异,第一点集为训练图像中实际车道线的真实点集。
进一步的,第一神经网络包括transformer结构与前馈神经网络。可以先通过transformer结构对第一特征与第二特征进行处理,得到第五特征。再将第五特征输入前馈神经网络,得到多个点集。可以理解的是,这里的前馈神经网络也可以由全连接层、卷积神经网络等结构代替,具体此处不做限定。
本申请实施例中,基于第一神经网络输入的不同,transformer结构有所不同。也可以理解为是,获取第五特征的步骤不同,下面分别进行描述。
第1种,第一神经网络如图7所示,transformer结构如图8所示。
在一种可能实现的方式中,为了更直观的看出基于第一特征与第二特征获取第五特征的过程,可以参考图7。该第一神经网络包括transformer结构与前馈神经网络。将第一特征与第二特征输入transformer结构的编码器,得到第四特征。将查询特征、第二特征以及第四特征输入transformer结构的解码器,得到第五特征。
该情况下的transformer结构可以如图8所示,该transformer结构的编码器包括第一自注意力模块与第一注意力模块,该transformer结构的解码器包括第二注意力模块与第三注意力模块。
可选地,解码器还可以包括第二自注意力模块(图8未示出),用于计算查询特征。具体的,对查询向量进行自注意力计算,得到查询特征。该查询向量初始化为随机值,在训练过程中训练得到固定值。并在推理过程中使用该固定值,即查询向量是随机值在训练过程中通过训练得到的固定值。
该结构下,通过第一自注意力模块对第一特征(H′
f)进行自注意力计算,得到第一输出(O
f)。通过第一注意力模块对第一特征(H′
f)与第二特征(Z
r与Z
b)进行交叉注意力计算,得到第二输出(O
p2b)。基于第一输出(O
f)与第二输出(O
p2b)获取第四特征。通过第二注意力模块对查询特征(Q
q)与第四特征进行交叉注意力计算,得到第三输出。对查询特征(Q
q)与第二特征(Z
r与Z
b)进行处理,得到第四输出。对第三输出与第四输出进行相加处理,得到第五特征。其中,查询特征是对查询向量进行自注意力计算得到。
可选地,上述通过第一自注意力模块对第一特征(H′
f)进行自注意力计算,得到第一输出(O
f)的步骤具体可以是:由于是自注意力计算,QKV的输入一致(即都为H′
f)。即通过第一特征(H′
f)经过三种线性处理得到QKV,并基于QKV计算得到O
f。关于自注意力的描述可以参考前述对自注意力机制的描述,此处不再赘述。另外,可以理解的是,在计算自注意力过程中,可以引入第一特征的位置矩阵,后续公式一中有描述,此处不再展开。
可选地,上述基于O
f与O
p2b获取第四特征的具体步骤可以是:对第一输出与第二输出进行相加处理,得到第四特征。
进一步的,如图9所示,上述基于第一输出(O
f)与第二输出(O
p2b)获取第四特征的步骤具体可以是:对第一输出与第二输出进行相加处理,相加处理得到的结果与第一特征进行相加与归一化处理得到输出。一方面,将该输出输入前馈神经网络,得到前馈神经网络的输出结果。并将上述相加归一化得到的输出与前馈神经网络的输出结果进行相加与归一化处理,从而得到第四特征。
可选地,上述通过第一注意力模块对H′
f、Z
r以及Z
b进行交叉注意力计算的步骤具体可以是:将H′
f作为Q,将Z
b作为K,Z
r与作为V进行交叉注意力计算,得到第二输出(O
p2b)。
可选地,上述通过第二注意力模块对Q
q与第四特征进行交叉注意力计算的步骤具体可以是:将Q
q作为Q,将第四特征作为K与V进行交叉注意力计算,得到第三输出。
进一步的,如图10所示,上述对查询特征与第二特征进行处理,得到第四输出的步骤具体可以是:通过第三注意力模块对Q
q、Z
r以及Z
b进行交叉注意力计算,得到第六输出。具体可以是将Q
q作为Q,将Z
b作为K,Z
r与作为V进行交叉注意力计算,得到第六输出。对查询特征与第六输出进行相加处理,相加处理得到的结果与查询向量进行相加与归一化处理得到输出。一方面,将该输出输入前馈神经网络,得到前馈神经网络的输出结果。并将上述相加归一化 得到的输出与前馈神经网络的输出结果进行相加与归一化处理,从而得到第四输出。
需要说明的是,本实施例中对于注意力计算过程中用于当做Q的特征可以引入该特征的位置矩阵(Q
q)。位置矩阵也可以是通过静态位置编码或动态位置编码等方式获取,例如该位置矩阵可以是根据第一特征对应特征图的绝对位置计算得到,具体此处不做限定。
示例性的,上述的第一输出(O
f)、第二输出(O
p2b)的计算公式如下:
其中,以公式一与公式二中为例,E
f为第一特征(H′
f)的位置矩阵,下面通过公式三与公式四举例说明通过正弦余弦的方式计算位置矩阵:
其中,双数的计算用公式三,单数的计算用公式四。i是元素所在位置矩阵中行的位置,2j/2j+1是该元素所在位置矩阵中列的位置,d表示位置矩阵的维度。为了更直白了解上述公式三与公式四的运用,假设若某个元素在第2行第3列,则该元素的位置向量为的计算过程可以通过公式四进行计算,其中i=2,j=1,d=3。
可以理解的是,上述公式一、公式二、公式三以及公式四只是举例,在实际应用中,还可以有其他形式的公式,具体此处不做限定。
第2种,第一神经网络如图11所示,transformer结构如图12所示。
在另一种可能实现的方式中,请参考图11,其中,图11与图7不同之处在于,图7中编码器的输入包括第一特征与第二特征,图11中编码器的输入包括第一特征、第一行特征、第一列特征以及第二特征。即图11中编码器的输入比图7中编码器的输入多了第一行特征与第一列特征。
该情况下的transformer结构如图12所示,该transformer结构的编码器除了图8所示的结构之外,还包括行列注意力模块。即图12所示的transformer结构的编码器包括行列注意力模块、第一自注意力模块以及第一注意力模块,解码器包括第二自注意力模块、第二注意力模块以及第三注意力模块。其中,行列注意力模块包括行注意力模块与列注意力模块。
该结构下,通过行注意力模块对第一行特征(H′
r)进行自注意力计算,得到行输出。通过列注意力模块对第一列特征(H′
c)进行自注意力计算,得到列输出。基于行输出与列输出获取行列输出。通过第一自注意力模块对第一特征(H′
f)进行自注意力计算,得到第一输出 (O
f)。通过第一注意力模块对第一特征(H′
f)与第二特征(Z
r与Z
b)进行交叉注意力计算,得到第二输出(O
p2b)。基于行列输出、第一输出(O
f)以及第二输出(O
p2b)获取第四特征。通过第二自注意力模块对查询向量进行自注意力计算,得到查询特征(Q
q)。通过第二注意力模块对查询特征(Q
q)与第四特征进行交叉注意力计算,得到第三输出。通过第三注意力模块对查询特征(Q
q)与第二特征(Z
r与Z
b)进行处理,得到第四输出。对第三输出与第四输出进行相加处理,得到第五特征。
可以理解的是,上述部分步骤与相关结构可以参考前述图8所示实施例类似的描述,此处不再赘述。
可选地,如图13所示,为行列注意力模块的具体结构。上述基于行输出与列输出获取行列输出的步骤具体可以是:对行输出与第一行特征进行相加与归一化处理(简称相加&归一)得到输出。一方面,将该输出输入前馈神经网络(简称前馈网络),得到前馈网络的输出结果。并将上述相加归一化得到的输出与前馈网络的输出结果进行相加与归一化处理,得到行的输出。同理,对列输出与第一列特征进行相加与归一化处理得到输出。一方面,将该输出输入前馈网络,得到前馈网络的输出结果。并将上述相加归一化得到的输出与前馈网络的输出结果进行相加与归一化处理,得到列的输出。再对行的输出与列的输出进行拼接处理,从而得到行列输出。
可选地,对上述的第一行特征、第一列特征、行输出以及列输出进行描述。获取第一特征之后,可以将第一特征进行行维度的拉平,得到H
r∈R
h×1×wd,并经过处理(例如:输入全连接层的处理与降维处理,或者输入单层感知机与激活层的处理与降维处理)得到第一行特征:H′
r∈R
h×1×d′。上述行维度的拉平也可以理解为是对第一特征对应的矩阵沿着行的方向进行拉平或压缩,得到H
r。同理,将第一特征进行列维度的拉平,得到H
r∈R
h×1×wd,并经过处理(例如:输入全连接层的处理与降维处理,或者输入单层感知机与激活层的处理与降维处理)得到第一列特征:H′
c∈R
1×w×d′。
可选地,上述基于行列输出、第一输出(O
f)以及第二输出(O
p2b)获取第四特征的步骤具体可以是:对第一输出与第二输出进行相加处理,得到第五输出;对第五输出与行列输出进行拼接处理,得到第四特征。
示例性的,上述的行输出(O
row)、列输出(O
column)的计算公式如下:
其中,E
r为第一行特征(H′
r)的位置矩阵,E
c为第一列特征(H′
c)的位置矩阵。该位置矩阵可以是通过静态位置编码或动态位置编码的方式获取,具体此处不做限定。
可以理解的是,上述公式五与公式六只是举例,在实际应用中,还可以有其他形式的公式,具体此处不做限定。
需要说明的是,上述transformer结构的几种情况,或者理解为是获取第五特征的方式只是举例,在实际应用,transformer结构还可以是其他情况,或者获取第五特征还可以有其他方式,具体此处不做限定。
图像处理设备通过上述任一种方式获取第五特征之后,可以将第五特征输入前馈神经网络,得到多个点集。并基于多个点集确定待检测图像中的车道线。可以理解的是,上述的前馈神经网络也可以由全连接层、卷积神经网络等结构代替,具体此处不做限定。
为了更直观了解点集的获取过程,以图14a为例,如图14a中所示的车道线l=(X,s,e),其中,X为等间距Y方向直线(例如72条)与车道线的交点对应X坐标的集合,起始点Y坐标s,结束点Y坐标e。可以理解的是,图14a中车道线的数量、Y方向直线的数量只是举例,具体此处不做限定。
一种可能实现的方式中,多个点集可以通过数组的方式呈现。另一种可能实现的方式中,多个点集还可以通过图像的方式呈现。例如:图14b所示的多个点集。对多个点集与待检测图像进行重叠融合,得到带有多个点集的待检测图像,例如图14c所示。本实施例对多个点集的呈现方式不做限定。
为了更直观的看出第一行特征与第一列特征对检测车道线做的贡献,请参阅图14d,可以看出,通过引入能够顺应车道线形状挖掘上下文信息的第一行特征与第一列特征,可以提升网络对长条形车道线特征的构建能力,从而达到更好的车道线检测效果。
本申请实施例中,一方面,通过将transformer结构应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。另一方面,通过在车道线检测的网络中增加图像中对象的检测框位置信息作为输入,可以提升网络的场景感知能力。减少由于车道线被车辆遮挡场景下模型的误判。另一方面,通过在transformer的编码器中引入能够顺应车道线形状挖掘上下文信息的行列自注意力模块,可以提升网络对长条形车道线特征的构建能力,从而达到更好的车道线检测效果。另一方面,现有自动驾驶系统中各个模块之间往往是相互独立的,例如车道线检测模型与人车模型是相互独立,单独预测的。而本实施例提供的图像处理方法中的目标神经网络通过将基于人车检测模型获取的检测框信息利用到第一神经网络中来预测车道线,可以提升车道线检测的准确性。
第二种情况,图像处理设备为云服务器(如前述图3a的场景)。可以理解的是,该种情况下,图像处理设备还可以是网络服务器、应用服务器以及管理服务器等具有图像处理功能的设备或服务器,以用户设备是车辆为例进行示例性描述,具体此处不做限定。
请参阅图15,本申请实施例提供的图像处理方法的一个流程示意图,该方法可以包括步骤1501至步骤1505。下面对步骤1501至步骤1505进行详细说明。
步骤1501,车辆获取待检测图像。
可选地,车辆可以基于车辆上的传感器(例如摄像头或相机)采集待检测图像。当然车辆上的传感器也可以周期性的采集图像。
可以理解的是,车辆也可以是通过接收其他设备发送待检测图像的方式获取,具体此处不做限定。
步骤1502,车辆向服务器发送待检测图像。相应的,服务器接收车辆发送的待检测图像。
车辆获取待检测图像之后,向服务器发送待检测图像。相应的,服务器接收车辆发送的待检测图像。
步骤1503,服务器将待检测图像输入训练好的目标神经网络,得到多个点集。
服务器接收车辆发送的待检测图像之后,可以将待检测图像输入训练好的目标神经网络,得到多个点集。
其中,该训练好的目标神经网络是以训练图像作为目标神经网络的输入,以目标损失函数的值小于目标阈值为目标对目标神经网络进行训练得到。该目标函数用于表示训练过程中目标神经网络输出的点集与目标点集之间的差异,该目标点集为训练图像中实际车道线的点集。目标损失函数与目标阈值可以根据实际需要设置,具体此处不做限定。
本实施例中的目标神经网络可以包括前述图5所示实施例中的主干网络、预处理模块以及第一神经网络。由于图5所示实施例中的第一神经网络的结构有两种情况,因此,本实施例中的目标神经网络也有两种情况,下面分别描述。
在一种可能实现的方式中,目标神经网络的结构可以如图16所示。该种情况下的目标神经网络相当于包括前述图6所示的主干网络、图6所示的预处理模块、图7至图10对应的第一神经网络。神经网络的具体描述与相关流程可以参考前述图6至图10对应的描述,此处不再赘述。
在另一种可能实现的方式中,目标神经网络的结构可以如图17所示。该种情况下的目标神经网络相当于包括前述图6所示的主干网络、图6所示的预处理模块、图11至图13对应的第一神经网络。神经网络的具体描述与相关流程可以参考前述图6、图11至图13对应的描述,此处不再赘述。
步骤1504,服务器向车辆发送多个点集。相应的,车辆接收服务器发送的多个点集。
服务器获取多个点集之后,服务器向车辆发送多个点集。
步骤1505,基于多个点集实现智能驾驶功能。
车辆获取多个点集之后,由于该多个点集中的每个点集表示待检测图像中的一条车道线。车辆可以确定待检测图像中的车道线,并根据该车道线实现智能驾驶功能,例如:自适应巡航、车道偏离预警、车道保持辅助等。
另外,通过多个点集确定待预测图像中车道线的描述可以参考前述图5所示实施例步骤504中的描述类似,此处不在赘述。
可以理解的是，本实施例的步骤可以周期性地执行，即可以根据车载摄像头在行驶过程中采集的待检测图像，准确地识别出路面的车道线，进而实现智能驾驶中与车道线相关的功能，例如：自适应巡航、车道偏离预警、车道保持辅助等。
本实施例中，一方面，通过将transformer结构应用于车道线检测任务上，可以获取待检测图像的全局信息，进而有效地建模车道线之间的长程联系。另一方面，通过在车道线检测的网络中增加图像中对象的检测框位置信息作为输入，可以提升网络的场景感知能力，减少车道线被车辆遮挡场景下模型的误判。另一方面，通过在transformer的编码器中引入能够顺应车道线形状挖掘上下文信息的行列自注意力模块，可以提升网络对长条形车道线特征的构建能力，从而达到更好的车道线检测效果。另一方面，通过在云端部署目标神经网络并预测车道线的点集，可以节省车辆的算力开销。另一方面，现有自动驾驶系统中各个模块之间往往是相互独立的，例如车道线检测模型与人车检测模型相互独立、单独预测；而本实施例提供的图像处理方法中的目标神经网络将基于人车检测模型获取的检测框信息利用到第一神经网络中来预测车道线，可以提升车道线检测的准确性。
为了更直观地看出本申请实施例提供的目标神经网络的表现，下面将目标神经网络（以下简称Laneformer）与现有其他网络在CULane和TuSimple数据集上进行性能测试。现有其他网络包括：空间卷积神经网络（spatial convolutional neural network，SCNN）、ENetSAD、PointLane、高效残差分解网络（efficient residual factorized network，ERFNet）、CurveLaneS、CurveLaneM、CurveLaneL、LaneATT。
其中，CULane是一个通过车载摄像头在中国北京收集的大规模车道线检测数据集，采集图片的大小为1640×590。该数据集采集地点多样，包含了很多城市复杂场景的样本。CULane数据集包含了88880张训练图片、9675张验证图片以及34680张测试图片。其中，测试集还分为九个类别：一个类别是常规图片，另外八个类别是具有挑战性的特殊类别（包括阴影场景、高亮场景、黑夜场景、曲线场景、无车道线场景等）。TuSimple则是由图森公司采集的自动驾驶数据集。该数据集专注于高速公路场景，因此所有图片均在高速公路上采集，采集图片的大小为1280×720。TuSimple数据集包含了3626张图片用于训练，2782张图片用于测试。
LaneATT的主干网络采用三种残差结构（ResNet18、ResNet34、ResNet122），分别记为：LaneATT(ResNet18)、LaneATT(ResNet34)、LaneATT(ResNet122)。本申请实施例提供的Laneformer的主干网络采用三种残差结构（ResNet18、ResNet34、ResNet50），分别记为：Laneformer(ResNet18)、Laneformer(ResNet34)、Laneformer(ResNet50)。并将ResNet50结构下Laneformer中未加入检测注意力模块（即第一注意力模块与第三注意力模块）的网络记为：Laneformer(ResNet50)*。
不同车道线检测方法在CULane上的检测精度如表1所示:
表1
模型 | 常规图片(F1，%) | 十字路口场景(FP) | 全场景的平均值(F1，%) | 乘加累积操作数(G) |
---|---|---|---|---|
SCNN | 90.6 | 1990 | 71.6 | / |
ENetSAD | 90.1 | 1998 | 70.8 | / |
PointLane | 88 | 1640 | 70.2 | / |
ERFNetHESA | 92 | 2028 | 74.2 | / |
CurveLaneS | 88.3 | 2817 | 71.4 | 9 |
CurveLaneM | 90.2 | 2359 | 73.5 | 33.7 |
CurveLaneL | 90.7 | 1746 | 74.8 | 86.5 |
LaneATT(ResNet18) | 91.17 | 1020 | 75.13 | 9.3 |
LaneATT(ResNet34) | 92.14 | 1330 | 76.68 | 18 |
LaneATT(ResNet122) | 91.74 | 1264 | 77.02 | 70.5 |
Laneformer(ResNet50)* | 91.55 | 1104 | 76.04 | 26.2 |
Laneformer(ResNet18) | 88.6 | 25 | 71.71 | 13.8 |
Laneformer(ResNet34) | 90.74 | 26 | 74.7 | 23 |
Laneformer(ResNet50) | 91.77 | 19 | 77.06 | 26.2 |
由表1可以看出：Laneformer模型在使用ResNet50作为主干网络的设置下，在CULane的测试集全集上达到了当前最优的结果，F1分数为77.06%。除了在测试集全集上达到最优之外，Laneformer还在几个具有挑战性的场景类别如夜晚场景（Night）、强光场景（Dazzle）和十字路口场景（Cross）上均达到了最优结果（表1中只示出了部分）。其中，Laneformer模型在十字路口场景类别上的表现尤其突出，误测图片数比其他模型少了两个量级。由于十字路口场景在数据集中没有标注车道线，因此十字路口场景的衡量采用FP作为指标。其余模型在十字路口场景的FP值都是上千，而本申请提出的Laneformer模型则达到了19的FP值。由表1可以推测出该提升来自于检测注意力模块的加入：在未加入检测注意力模块的Laneformer(ResNet50)*模型中，十字路口场景的FP虽相对较低，但仍为上千的数值；而加入检测注意力模块后，该指标锐减到几十。可见在人车情况比较复杂的十字路口场景上，检测注意力模块通过对周边场景和物体的感知，可以大大降低模型的误预测率。
不同车道线检测方法在TuSimple上的检测精度如表2所示:
表2
模型 | 准确率(%) | 假正例率(%) | 假负例率(%) |
---|---|---|---|
SCNN | 96.53 | 6.17 | 1.8 |
LSTR | 96.18 | 2.91 | 3.38 |
EnetSAD | 96.64 | 6.02 | 2.05 |
LineCNN | 96.87 | 4.41 | 3.36 |
PolyLaneNet | 93.36 | 9.42 | 9.33 |
PointLaneNet | 96.34 | 4.67 | 5.18 |
LaneATT(ResNet18) | 95.57 | 3.56 | 3.01 |
LaneATT(ResNet34) | 95.63 | 3.53 | 2.92 |
LaneATT(ResNet122) | 96.1 | 5.64 | 2.17 |
Laneformer(ResNet50)* | 96.72 | 3.46 | 2.52 |
Laneformer(ResNet18) | 96.54 | 4.35 | 2.36 |
Laneformer(ResNet34) | 96.56 | 5.39 | 3.37 |
Laneformer(ResNet50) | 96.8 | 5.6 | 1.99 |
由表2可以看出：Laneformer模型在使用ResNet50作为主干网络的情况下，在TuSimple数据集上取得了96.8%的准确率、5.6%的假正例率和1.99%的假负例率。在最重要的指标准确率上，Laneformer仅比第一的LineCNN低0.07%，并且比同样使用自注意力变换网络的工作LSTR高0.6%。同时可以观察到，与CULane数据集上的表现不同，在TuSimple数据集上，使用更小的主干网络如ResNet18、ResNet34也能得到非常具有竞争力的结果，不同的主干网络导致的模型表现差异几乎可以忽略不计。除此以外，在TuSimple数据集上，仅使用了行列注意力模块的模型（即Laneformer(ResNet50)*）也能达到非常好的效果。
另外，为了更直观地看出目标神经网络中各个模块的单独作用，下面将目标神经网络的不同配置分别在CULane数据集上进行性能测试。其中，分别考察仅使用行列注意力模块的效果，以及逐级加入检测注意力中的不同子模块，包括是否使用人车检测框的位置信息（bounding box）、置信度（score）以及类别（category）作为检测预处理模块的输入，对整体结果的影响。
测试结果如表3所示:
表3
模型 | F1(%) | 精确率(%) | 召回率(%) | 每秒帧率 | 参数量(百万) |
---|---|---|---|---|---|
Baseline(ResNet50) | 75.45 | 81.65 | 70.11 | 61 | 31.02 |
+行列注意力 | 76.04 | 82.92 | 70.22 | 58 | 43.02 |
+检测框的位置信息 | 76.08 | 85.3 | 68.66 | 57 | 45.38 |
+检测框的置信度 | 76.25 | 83.56 | 70.12 | 54 | 45.38 |
+检测框的类别 | 77.06 | 84.05 | 71.14 | 53 | 45.38 |
其中，第一个模型（即Baseline）可以理解为是图16所示的目标神经网络去掉第一注意力模块与第三注意力模块后的网络。第二个模型（+行列注意力）可以理解为是在第一个模型的基础上加入行列注意力模块，第三个模型（+检测框的位置信息）是在第二个模型的基础上加入检测框的位置信息，第四个模型（+检测框的置信度）是在第三个模型的基础上加入检测框的置信度，第五个模型（+检测框的类别）是在第四个模型的基础上加入检测框的类别。第五个模型可以看作是前述图17所示的目标神经网络。
本申请提出的Laneformer模型在使用Transformer的基础上，加入了行列注意力模块与检测注意力模块（包括第一注意力模块以及第三注意力模块），而检测注意力模块又分为单纯加入检测框信息、附加检测框置信度和附加预测类别这三种情况。因此，此处对每个模块对模型的影响进行实验探究。由表3可以看出，在没有加行列注意力模块和检测注意力模块的单纯的Transformer模型中，基准的F1分数能够达到75.45%。加入行列注意力模块之后，模型的效果就能提升到76.04%的F1分数。同时可以看到，单纯地加入人车检测模块输出的检测框信息，就能让模型的效果有所提升。更进一步地，在检测信息中加入检测框的置信度，能够让模型达到76.25%的F1分数；而把检测框的类别信息也加进去之后，就得到了表3中的最优模型，达到了77.06%的F1分数。由此可以证明，行列注意力模块、检测注意力模块均可以提高模型性能。另外可以观察到，检测注意力模块的加入能够显著提高模型的精确率，而对召回率的影响则比较微弱。
上面对本申请实施例提供的图像处理方法进行了描述，下面对本申请实施例提供的车道线检测方法进行描述。该方法可以由检测设备执行，也可以由检测设备的部件（例如处理器、芯片、或芯片系统等）执行。该检测设备可以是终端设备（例如车载终端、飞机终端等，如前述图3b所示）。可选地，该方法可以由检测设备中的CPU处理，也可以由CPU和GPU共同处理，也可以不用GPU而使用其他适合用于神经网络计算的处理器，本申请不做限制。
该方法的应用场景（或者理解为是第一神经网络的应用场景）为智能驾驶场景，例如：自适应巡航、车道偏离预警（lane departure warning，LDW）、车道保持辅助（lane keeping assist，LKA）等包含车道线检测的场景。在智能驾驶场景中，本申请实施例提供的车道线检测方法可以通过车辆上的传感器（例如摄像头）获取待检测图像，并获取该待检测图像中的车道线，进而实现上述自适应巡航、LDW或LKA等。
请参阅图18，为本申请实施例提供的车道线检测方法的一个流程示意图，该方法应用于车辆，该方法可以包括步骤1801至步骤1806。下面对步骤1801至步骤1806进行详细说明。
步骤1801,获取待检测图像。
本步骤与前述图5所示实施例中的步骤501类似,此处不再赘述。
示例性的，延续上述举例，待检测图像与图6中的待检测图像一致。
步骤1802,对待检测图像进行处理,得到多个点集。
检测设备获取待检测图像之后，可以对待检测图像进行处理，得到多个点集，多个点集中的每个点集表示待检测图像中的一条车道线。其中，该处理是指基于transformer结构的第一神经网络与检测框信息预测图像中车道线的点集，检测框信息包括待检测图像中至少一个对象的检测框在待检测图像中的位置。
可以理解的是，基于transformer结构的神经网络与检测框信息预测图像中车道线点集的步骤与前述图5至图17所示实施例中的描述类似，此处不再赘述。
步骤1803，显示车道线。本步骤是可选的。
可选地,检测设备确定多个点集之后,可以显示多个点集表示的车道线。
示例性的,延续上述举例,车道线如图14b所示。
步骤1804，对至少一个对象进行建模得到虚拟对象。本步骤是可选的。
可选地,可以对至少一个对象进行建模得到虚拟对象。该虚拟对象可以是二维的,也可以是多维的,具体此处不做限定。
步骤1805，基于位置对多个点集与虚拟对象进行融合处理，得到目标图像。本步骤是可选的。
可选地，获取多个点集与虚拟对象之后，可以基于多个点集在待检测图像中的位置对多个点集与虚拟对象进行融合处理，得到目标图像。
示例性的，目标图像如图19所示。可以理解的是，图19中的虚拟对象只是二维的举例，并不对虚拟对象进行限定。
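下面给出该融合处理的一个示意实现（以OpenCV绘制为例，虚拟对象简化为二维矩形，颜色、线宽等均为编者假设）：

```python
import cv2
import numpy as np

def render_target_image(img_shape, point_sets, boxes):
    """示意：基于位置将多个点集（车道线）与虚拟对象（此处用矩形代替）融合为目标图像。"""
    canvas = np.zeros(img_shape, dtype=np.uint8)                  # 例如 img_shape=(590, 1640, 3)
    for pts in point_sets:                                        # 每个点集画成一条折线
        arr = np.array(pts, dtype=np.int32).reshape(-1, 1, 2)
        cv2.polylines(canvas, [arr], isClosed=False, color=(0, 255, 0), thickness=2)
    for x1, y1, x2, y2 in boxes:                                  # 按检测框位置绘制虚拟对象
        cv2.rectangle(canvas, (int(x1), int(y1)), (int(x2), int(y2)), (255, 0, 0), 2)
    return canvas
```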
步骤1806，显示目标图像。本步骤是可选的。
可选地,检测设备获取目标图像之后,可以向用户显示目标图像,以使得驾驶车辆的用户可以明确周围的车辆与车道线,提升车辆的驾驶安全。
可以理解的是，上述步骤1801至步骤1806可以周期性地执行，即可以向用户实时显示目标图像，使得用户可以实时确定周边对象以及车道线，提升用户驾驶体验。
一种可能实现的方式中,本申请实施例提供的车道线检测方法包括步骤1801与步骤1802。另一种可能实现的方式中,本申请实施例提供的车道线检测方法包括步骤1801至步骤1803。另一种可能实现的方式中,本申请实施例提供的车道线检测方法包括步骤1801至步骤1805。
本申请实施例中,一方面,通过将transformer结构应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。另一方面,通过在车道线检测的过程中增加图像中对象的检测框信息,可以提升目标神经网络对图像场景的感知能力,减少由于车道线被车辆遮挡场景下的误判。
上面对本申请实施例提供的图像处理方法以及车道线检测方法进行了描述,下面对本申请实施例提供的目标神经网络的训练过程进行描述。目标神经网络的训练方法可以由目标神经网络的训练装置来执行,该目标神经网络的训练装置可以是图像处理设备(例如云服务设备或用户设备等运算能力足以用来执行目标神经网络的训练方法的装置),也可以是由云服务设备和用户设备构成的系统。示例性地,训练方法可以由图1中的训练设备120、图2中的神经网络处理器20执行。
可选地,训练方法可以由CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。
请参阅图20，为本申请实施例提供的目标神经网络的一种模型训练方法的流程示意图。该模型训练方法包括步骤2001至步骤2004。
步骤2001,获取训练图像。
训练装置可以通过传感器(例如摄像头、雷达等)采集训练图像,也可以从数据库中获取训练图像,还可以接收其他设备发送的训练图像,对于获取训练图像的方式此处不做限定。
在需要对目标神经网络进行训练时,训练装置可以获取一批训练样本,即用于训练的训练图像。其中,训练图像中车道线的真实点集是已知的。
步骤2002,将训练图像输入目标神经网络,得到第一点集。
得到训练图像后,可以将训练图像输入目标神经网络,以通过目标神经网络实现以下步骤:获取训练图像的第一特征;基于第一特征获取第二特征,第二特征包括训练图像中对象对应检测框的位置特征与语义特征;基于第一特征与第二特征获取第一点集,第一点集用于表示训练图像中的车道线。
可选地，上述基于第一特征与第二特征获取第一点集具体包括如下步骤：对第一特征进行自注意力计算，得到第一输出；对第一特征与第二特征进行交叉注意力计算，得到第二输出；基于第一输出与第二输出获取第四特征；对查询特征与第四特征进行交叉注意力计算，得到第三输出，查询特征由查询向量基于自注意力机制计算得到；对查询特征与第二特征进行处理，得到第四输出；对第三输出与第四输出进行相加处理，得到第五特征；基于第五特征获取第一点集。
关于获取第一特征、第二特征、第四特征、第五特征以及点集的过程可以参考前述图5所示实施例中步骤502至步骤504的描述，此处不再赘述。
步骤2003,基于第一点集与训练图像中实际车道线的真实点集,获取目标损失,目标损失用于指示第一点集与真实点集之间的差异。
得到第一点集之后,可以通过预置的目标损失函数对第一点集与真实点集进行计算,以得到目标损失,目标损失用于指示第一点集与真实点集之间的差异。
需要说明的是，若第一点集对应的车道线数目大于真实点集对应的车道线数目，则可以对真实点集进行扩展，并将扩展的点集的车道线的类别设置为非车道线类别。该种情况下的目标损失则用于指示扩展后的真实点集与第一点集之间的差异。
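下面给出该损失计算的一个简化示意（假设预测车道线已与真实车道线按顺序对齐，实际训练中通常还需要匹配步骤；类别数、每条车道线的点数等均为编者假设）：

```python
import torch
import torch.nn.functional as F

def lane_target_loss(pred_cls, pred_pts, gt_pts, no_lane_idx=1):
    """pred_cls: (N_pred, 2) 车道线/非车道线分类logits；
    pred_pts: (N_pred, 72) 预测的X坐标；gt_pts: (N_gt, 72) 真实的X坐标。"""
    n_pred, n_gt = pred_pts.size(0), gt_pts.size(0)
    labels = torch.zeros(n_pred, dtype=torch.long)    # 0表示车道线类别
    labels[n_gt:] = no_lane_idx                       # 超出真实车道线数目的部分扩展为非车道线类别
    cls_loss = F.cross_entropy(pred_cls, labels)      # 分类损失
    reg_loss = F.l1_loss(pred_pts[:n_gt], gt_pts)     # 仅对真实车道线计算点坐标回归损失
    return cls_loss + reg_loss
```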
步骤2004,基于目标损失对目标神经网络的参数进行更新,直至满足训练条件,得到训练好的目标神经网络。
得到目标损失后,可基于目标损失对目标神经网络的参数进行更新,并利用下一批训练样本对更新参数后的目标神经网络进行训练(即重新执行步骤2002至步骤2004),直至满足模型训练条件(例如,目标损失达到收敛等等),可得到训练好的目标神经网络。
另外，训练过程中涉及的查询向量是随机初始化的，在不断更新目标神经网络参数的过程中也对查询向量进行训练，进而得到目标查询向量。该目标查询向量可以理解为是推理过程中所使用的查询向量，即图5所示实施例中的查询向量。
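查询向量作为可学习参数参与训练的一个常见写法如下（查询数量与维度为编者假设的示意值）：

```python
import torch.nn as nn

# 将查询向量实现为可学习的嵌入，随目标神经网络一起被梯度更新；
# 训练收敛后，query_embed.weight 即为推理时使用的目标查询向量。
query_embed = nn.Embedding(num_embeddings=20, embedding_dim=256)
queries = query_embed.weight   # 形状为 (20, 256)
```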
本实施例训练得到的目标神经网络，具备利用图像预测车道线的能力。在检测过程中，通过将transformer结构应用于车道线检测任务上，可以获取待检测图像的全局信息，进而有效地建模车道线之间的长程联系。另一方面，通过在车道线检测的网络中增加图像中对象的检测框位置信息作为输入，可以提升目标神经网络的场景感知能力，减少车道线被车辆遮挡场景下模型的误判。另一方面，通过在transformer的编码器中引入能够顺应车道线形状挖掘上下文信息的行列自注意力模块，可以提升网络对长条形车道线特征的构建能力，从而达到更好的车道线检测效果。另一方面，现有自动驾驶系统中各个模块之间往往是相互独立的，例如车道线检测模型与人车检测模型相互独立、单独预测；而本实施例中目标神经网络的训练是通过将基于人车检测模型获取的检测框信息利用到第一神经网络中实现的，可以提升目标神经网络对于车道线检测的准确性。
上面对本申请实施例中的图像处理方法进行了描述,下面对本申请实施例中的图像处理设备进行描述,请参阅图21,本申请实施例中图像处理设备的一个实施例包括:
提取单元2101,用于对待检测图像进行特征提取,得到第一特征;
处理单元2102,用于对待检测图像的检测框信息进行处理,得到第二特征,检测框信息包括待检测图像中对象的检测框在待检测图像中的位置;
确定单元2103,用于将第一特征与第二特征输入基于transformer结构的第一神经网络,得到待检测图像中的车道线。
可选地,本实施例中的图像处理设备还可以包括:获取单元2104,用于基于第一特征获取第一行特征与第一列特征,第一行特征为由第一特征对应的矩阵沿着行的方向进行拉平(flatten)得到,第一列特征为由矩阵沿着列的方向进行拉平(flatten)得到。
本实施例中，图像处理设备中各单元所执行的操作与前述图5至图17所示实施例中描述的类似，此处不再赘述。
本实施例中,一方面,通过将transformer结构应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。另一方面,通过在车道线检测的过程中增加图像中对象的检测框信息,可以提升对图像场景的感知能力,减少由于车道线被车辆遮挡场景下的误判。
请参阅图22,本申请实施例中检测设备的一个实施例包括:
获取单元2201,用于获取待检测图像;
处理单元2202,用于对待检测图像进行处理,得到多个点集,多个点集中的每个点集表示待检测图像中的一条车道线;其中,处理基于transformer结构的第一神经网络与检测框信息预测图像中车道线的点集,检测框信息包括待检测图像中至少一个对象的检测框在待检测图像中的位置。
可选地,本实施例中的检测设备还可以包括:显示单元2203,用于显示车道线。
本实施例中,检测设备中各单元所执行的操作与前述图18所示实施例中描述的类似,此处不再赘述。
本实施例中,一方面,通过将transformer结构应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。另一方面,通过在车道线检测的过程中增加图像中对象的检测框信息,可以提升目标神经网络对图像场景的感知能力,减少由于车道线被车辆遮挡场景下的误判。
请参阅图23,本申请实施例中图像处理设备的另一个实施例包括:
获取单元2301,用于获取训练图像;
处理单元2302,用于将训练图像输入目标神经网络,得到训练图像的第一点集,第一点集表示训练图像中的预测车道线;目标神经网络用于:对训练图像进行特征提取,得到第一特征;对训练图像的检测框信息进行处理,得到第二特征,检测框信息包括训练图像中对象的检测框在训练图像中的位置;基于第一特征和第二特征获取第一点集,目标神经网络用于基于transformer结构预测图像中车道线的点集;
训练单元2303,用于根据第一点集与训练图像中实际车道线的真实点集,对目标神经网络进行训练,得到训练好的目标神经网络。
本实施例中,图像处理设备中各单元所执行的操作与前述图20所示实施例中描述的类似,此处不再赘述。
本实施例中,一方面,通过将transformer结构应用于车道线检测任务上,可以获取待检测图像的全局信息,进而有效地建模车道线之间的长程联系。另一方面,通过在车道线检测的过程中增加图像中对象的检测框信息,可以提升目标神经网络对图像场景的感知能力,减少由于车道线被车辆遮挡场景下的误判。
参阅图24，本申请提供的另一种图像处理设备的结构示意图。该图像处理设备可以包括处理器2401、存储器2402和通信接口2403。该处理器2401、存储器2402和通信接口2403通过线路互联。其中，存储器2402中存储有程序指令和数据。
存储器2402中存储了前述图5至图17、图20所示对应的实施方式中,由设备执行的步骤对应的程序指令以及数据。
处理器2401,用于执行前述图5至图17、图20所示实施例中任一实施例所示的由设备执行的步骤。
通信接口2403可以用于进行数据的接收和发送,用于执行前述图5至图17、图20所示实施例中任一实施例中与获取、发送、接收相关的步骤。
一种实现方式中,图像处理设备可以包括相对于图24更多或更少的部件,本申请对此仅仅是示例性说明,并不作限定。
参阅图25,本申请提供的另一种检测设备的结构示意图。该检测设备可以包括处理器2501、存储器2502和通信接口2503。该处理器2501、存储器2502和通信接口2503通过线路互联。其中,存储器2502中存储有程序指令和数据。
存储器2502中存储了前述图18所示对应的实施方式中,由检测设备执行的步骤对应的程序指令以及数据。
处理器2501,用于执行前述图18所示实施例中任一实施例所示的由检测设备执行的步骤。
通信接口2503可以用于进行数据的接收和发送,用于执行前述图18所示实施例中任一实施例中与获取、发送、接收相关的步骤。
一种实现方式中,检测设备可以包括相对于图25更多或更少的部件,本申请对此仅仅是示例性说明,并不作限定。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。
当使用软件实现所述集成的单元时，可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时，全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中，或者从一个计算机可读存储介质向另一个计算机可读存储介质传输，例如，所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线（例如同轴电缆、光纤、数字用户线（digital subscriber line，DSL））或无线（例如红外、无线、微波等）方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质，或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质（例如软盘、硬盘、磁带）、光介质（例如DVD）、或者半导体介质（例如固态硬盘（solid state disk，SSD））等。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
Claims (34)
- 一种图像处理方法,其特征在于,所述方法包括:对待检测图像进行特征提取,得到第一特征;对所述待检测图像的检测框信息进行处理,得到第二特征,所述检测框信息包括所述待检测图像中至少一个对象的检测框在所述待检测图像中的位置;将所述第一特征与所述第二特征输入基于transformer结构的第一神经网络,得到所述待检测图像中的车道线。
- 根据权利要求1所述的方法,其特征在于,所述对所述待检测图像的检测框信息进行处理,得到第二特征包括:对至少一个第三特征与所述检测框信息进行处理,得到所述第二特征,所述至少一个第三特征为获取所述第一特征的过程中所得到的中间特征。
- 根据权利要求2所述的方法,其特征在于,所述第二特征包括所述检测框的位置特征与语义特征,所述检测框信息还包括:所述检测框的类别与置信度;所述对至少一个第三特征与所述检测框信息进行处理,得到所述第二特征包括:基于所述至少一个第三特征、所述位置以及所述置信度获取所述语义特征;基于所述位置与所述类别获取所述位置特征。
- 根据权利要求3所述的方法,其特征在于,所述基于所述至少一个第三特征、所述位置以及所述置信度获取所述语义特征,包括:基于所述位置从所述至少一个第三特征中提取出感兴趣区域ROI特征;对所述ROI特征与所述置信度进行乘法处理,并将得到的特征输入全连接层,得到所述语义特征;所述基于所述位置与所述类别获取所述位置特征,包括:获取所述类别的向量,并与所述位置对应的向量进行拼接,将拼接得到的特征输入全连接层,得到所述位置特征。
- 根据权利要求1至4中任一项所述的方法,其特征在于,所述基于transformer结构的第一神经网络包括编码器、解码器以及前馈神经网络;将所述第一特征与所述第二特征输入基于transformer结构的第一神经网络,得到所述待检测图像中的车道线,包括:将所述第一特征与所述第二特征输入所述编码器,得到第四特征;将所述第四特征、所述第二特征以及查询特征输入所述解码器,得到第五特征;将所述第五特征输入所述前馈神经网络,得到多个点集,所述多个点集中的每个点集表示所述待检测图像中的一条车道线。
- 根据权利要求5所述的方法,其特征在于,所述方法还包括:基于所述第一特征获取第一行特征与第一列特征,所述第一行特征为由所述第一特征对应的矩阵沿着行的方向进行拉平(flatten)得到,所述第一列特征为由所述矩阵沿着列的方向进行拉平(flatten)得到;所述将所述第一特征与所述第二特征输入所述编码器,得到第四特征,包括:将所述第一特征、所述第二特征、所述第一行特征以及所述第一列特征输入所述编码器, 得到所述第四特征。
- 根据权利要求6所述的方法,其特征在于,所述将所述第一特征、所述第二特征、所述第一行特征以及所述第一列特征输入所述编码器,得到所述第四特征,包括:对所述第一特征进行自注意力计算,得到第一输出;对所述第一特征与所述第二特征进行交叉注意力计算,得到第二输出;对所述第一行特征与所述第一列特征进行自注意力计算与拼接处理,得到行列输出;基于所述第一输出、所述第二输出以及所述行列输出获取所述第四特征。
- 根据权利要求7所述的方法,其特征在于,所述基于所述第一输出、所述第二输出以及所述行列输出获取所述第四特征,包括:对所述第一输出与所述第二输出进行相加处理,得到第五输出;对所述第五输出与所述行列输出进行拼接处理,得到所述第四特征。
- 根据权利要求5所述的方法,其特征在于,所述将所述第一特征与所述第二特征输入所述编码器,得到第四特征,包括:对所述第一特征进行自注意力计算,得到第一输出;对所述第一特征与所述第二特征进行交叉注意力计算,得到第二输出;对所述第一输出与所述第二输出进行相加处理,得到所述第四特征。
- 根据权利要求5至9中任一项所述的方法,其特征在于,所述将所述第四特征、所述第二特征以及查询特征输入所述解码器,得到第五特征,包括:对所述查询特征与所述第四特征进行交叉注意力计算,得到第三输出;对所述查询特征与所述第二特征进行处理,得到第四输出;对所述第三输出与所述第四输出进行相加处理,得到所述第五特征。
- 根据权利要求1至10中任一项所述的方法,其特征在于,所述对待检测图像进行特征提取,得到第一特征包括:对主干网络中不同层输出的特征进行特征融合与降维处理,得到所述第一特征,所述主干网络的输入为所述待检测图像。
- 一种车道线检测方法,其特征在于,所述方法应用于车辆,所述方法包括:获取待检测图像;对所述待检测图像进行处理,得到多个点集,所述多个点集中的每个点集表示所述待检测图像中的一条车道线;其中,所述处理基于transformer结构的第一神经网络与检测框信息预测图像中车道线的点集,所述检测框信息包括所述待检测图像中至少一个对象的检测框在所述待检测图像中的位置。
- 根据权利要求12所述的方法,其特征在于,所述检测框信息还包括:所述检测框的类别与置信度。
- 根据权利要求12或13所述的方法,其特征在于,所述方法还包括:显示所述车道线。
- 根据权利要求12至14中任一项所述的方法,其特征在于,所述方法还包括:对所述至少一个对象进行建模得到虚拟对象;基于所述位置对所述多个点集与所述虚拟对象进行融合处理,得到目标图像;显示所述目标图像。
- 一种图像处理设备,其特征在于,所述图像处理设备包括:提取单元,用于对待检测图像进行特征提取,得到第一特征;处理单元,用于对所述待检测图像的检测框信息进行处理,得到第二特征,所述检测框信息包括所述待检测图像中至少一个对象的检测框在所述待检测图像中的位置;确定单元,用于将所述第一特征与所述第二特征输入基于transformer结构的第一神经网络,得到所述待检测图像中的车道线。
- 根据权利要求16所述的图像处理设备,其特征在于,所述处理单元,具体用于对至少一个第三特征与所述检测框信息进行处理,得到所述第二特征,所述至少一个第三特征为获取所述第一特征的过程中所得到的中间特征。
- 根据权利要求17所述的图像处理设备,其特征在于,所述第二特征包括所述检测框的位置特征与语义特征,所述检测框信息还包括:所述检测框的类别与置信度;所述处理单元,具体用于基于所述至少一个第三特征、所述位置以及所述置信度获取所述语义特征;所述处理单元,具体用于基于所述位置与所述类别获取所述位置特征。
- 根据权利要求18所述的图像处理设备,其特征在于,所述处理单元,具体用于基于所述位置从所述至少一个第三特征中提取出感兴趣区域ROI特征;所述处理单元,具体用于对所述ROI特征与所述置信度进行乘法处理,并将得到的特征输入全连接层,得到所述语义特征;所述处理单元,具体用于获取所述类别的向量,并与所述位置对应的向量进行拼接,将拼接得到的特征输入全连接层,得到所述位置特征。
- 根据权利要求16至19中任一项所述的图像处理设备，其特征在于，所述基于transformer结构的第一神经网络包括编码器、解码器以及前馈神经网络；所述确定单元，具体用于将所述第一特征与所述第二特征输入所述编码器，得到第四特征；所述确定单元，具体用于将所述第四特征、所述第二特征以及查询特征输入所述解码器，得到第五特征；所述确定单元，具体用于将所述第五特征输入所述前馈神经网络，得到多个点集，所述多个点集中的每个点集表示所述待检测图像中的一条车道线。
- 根据权利要求20所述的图像处理设备,其特征在于,所述图像处理设备还包括:获取单元,用于基于所述第一特征获取第一行特征与第一列特征,所述第一行特征为由所述第一特征对应的矩阵沿着行的方向进行拉平(flatten)得到,所述第一列特征为由所述矩阵沿着列的方向进行拉平(flatten)得到;所述确定单元,具体用于将所述第一特征、所述第二特征、所述第一行特征以及所述第一列特征输入所述编码器,得到所述第四特征。
- 根据权利要求21所述的图像处理设备,其特征在于,所述确定单元,具体用于对所述第一特征进行自注意力计算,得到第一输出;所述确定单元,具体用于对所述第一特征与所述第二特征进行交叉注意力计算,得到第二输出;所述确定单元,具体用于对所述第一行特征与所述第一列特征进行自注意力计算与拼接处理,得到行列输出;所述确定单元,具体用于基于所述第一输出、所述第二输出以及所述行列输出获取所述第四特征。
- 根据权利要求22所述的图像处理设备,其特征在于,所述确定单元,具体用于对所述第一输出与所述第二输出进行相加处理,得到第五输出;所述确定单元,具体用于对所述第五输出与所述行列输出进行拼接处理,得到所述第四特征。
- 根据权利要求20所述的图像处理设备,其特征在于,所述确定单元,具体用于对所述第一特征进行自注意力计算,得到第一输出;所述确定单元,具体用于对所述第一特征与所述第二特征进行交叉注意力计算,得到第二输出;所述确定单元,具体用于对所述第一输出与所述第二输出进行相加处理,得到所述第四特征。
- 根据权利要求20至24中任一项所述的图像处理设备,其特征在于,所述确定单元,具体用于对所述查询特征与所述第四特征进行交叉注意力计算,得到第三输出;所述确定单元,具体用于对所述查询特征与所述第二特征进行处理,得到第四输出;所述确定单元,具体用于对所述第三输出与所述第四输出进行相加处理,得到所述第五特征。
- 根据权利要求16至25中任一项所述的图像处理设备,其特征在于,所述提取单元,具体用于对主干网络中不同层输出的特征进行特征融合与降维处理,得到所述第一特征,所述主干网络的输入为所述待检测图像。
- 一种检测设备,其特征在于,所述检测设备应用于车辆,所述检测设备包括:获取单元,用于获取待检测图像;处理单元,用于对所述待检测图像进行处理,得到多个点集,所述多个点集中的每个点集表示所述待检测图像中的一条车道线;其中,所述处理基于transformer结构的第一神经网络与检测框信息预测图像中车道线的点集,所述检测框信息包括所述待检测图像中至少一个对象的检测框在所述待检测图像中的位置。
- 根据权利要求27所述的检测设备,其特征在于,所述检测框信息还包括:所述检测框的类别与置信度。
- 根据权利要求27或28所述的检测设备,其特征在于,所述检测设备还包括:显示单元,用于显示所述车道线。
- 根据权利要求27至29中任一项所述的检测设备,其特征在于,所述处理单元,还用于对所述至少一个对象进行建模得到虚拟对象;所述处理单元,还用于基于所述位置对所述多个点集与所述虚拟对象进行融合处理,得到目标图像;所述显示单元,还用于显示所述目标图像。
- 一种图像处理设备,其特征在于,包括:处理器,所述处理器与存储器耦合,所述存 储器用于存储程序或指令,当所述程序或指令被所述处理器执行时,使得所述图像处理设备执行如权利要求1至11中任一项所述的方法。
- 一种检测设备，其特征在于，所述检测设备应用于车辆，所述检测设备包括：处理器，所述处理器与存储器耦合，所述存储器用于存储程序或指令，当所述程序或指令被所述处理器执行时，使得所述检测设备执行如权利要求12至15中任一项所述的方法。
- 一种计算机存储介质,其特征在于,包括计算机指令,当所述计算机指令在电子设备上运行时,使得所述电子设备执行如权利要求1至11中任一项所述的方法,或者使得所述电子设备执行如权利要求12至15中任一项所述的方法。
- 一种计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得所述计算机执行如权利要求1至11中任一项所述的方法,或者使得所述计算机执行如权利要求12至15中任一项所述的方法。
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210018538.X | 2022-01-07 | ||
CN202210018538.XA CN114494158A (zh) | 2022-01-07 | 2022-01-07 | 一种图像处理方法、一种车道线检测方法及相关设备 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023131065A1 true WO2023131065A1 (zh) | 2023-07-13 |
Family
ID=81509244
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/143779 WO2023131065A1 (zh) | 2022-01-07 | 2022-12-30 | 一种图像处理方法、一种车道线检测方法及相关设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114494158A (zh) |
WO (1) | WO2023131065A1 (zh) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114494158A (zh) * | 2022-01-07 | 2022-05-13 | 华为技术有限公司 | 一种图像处理方法、一种车道线检测方法及相关设备 |
CN114896898B (zh) * | 2022-07-14 | 2022-09-27 | 深圳市森辉智能自控技术有限公司 | 一种空压机集群系统能耗优化方法及系统 |
CN115588177B (zh) * | 2022-11-23 | 2023-05-12 | 荣耀终端有限公司 | 训练车道线检测网络的方法、电子设备、程序产品及介质 |
CN116385789B (zh) * | 2023-04-07 | 2024-01-23 | 北京百度网讯科技有限公司 | 图像处理方法、训练方法、装置、电子设备及存储介质 |
CN117245672B (zh) * | 2023-11-20 | 2024-02-02 | 南昌工控机器人有限公司 | 摄像头支架模块化装配的智能运动控制系统及其方法 |
2022
- 2022-01-07 CN CN202210018538.XA patent/CN114494158A/zh active Pending
- 2022-12-30 WO PCT/CN2022/143779 patent/WO2023131065A1/zh unknown
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200117916A1 (en) * | 2018-10-11 | 2020-04-16 | Baidu Usa Llc | Deep learning continuous lane lines detection system for autonomous vehicles |
CN111191487A (zh) * | 2018-11-14 | 2020-05-22 | 北京市商汤科技开发有限公司 | 车道线的检测及驾驶控制方法、装置和电子设备 |
CN111860155A (zh) * | 2020-06-12 | 2020-10-30 | 华为技术有限公司 | 一种车道线的检测方法及相关设备 |
CN114494158A (zh) * | 2022-01-07 | 2022-05-13 | 华为技术有限公司 | 一种图像处理方法、一种车道线检测方法及相关设备 |
Non-Patent Citations (1)
Title |
---|
LIU RUIJIN, YUAN ZEJIAN, LIU TIE, XIONG ZHILIANG: "End-to-end Lane Shape Prediction with Transformers", ARXIV.ORG, ITHACA, 28 November 2020 (2020-11-28), XP093077236 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117036788A (zh) * | 2023-07-21 | 2023-11-10 | 阿里巴巴达摩院(杭州)科技有限公司 | 图像分类方法、训练图像分类模型的方法及装置 |
CN117036788B (zh) * | 2023-07-21 | 2024-04-02 | 阿里巴巴达摩院(杭州)科技有限公司 | 图像分类方法、训练图像分类模型的方法及装置 |
CN116872961A (zh) * | 2023-09-07 | 2023-10-13 | 北京捷升通达信息技术有限公司 | 用于智能驾驶车辆的控制系统 |
CN116872961B (zh) * | 2023-09-07 | 2023-11-21 | 北京捷升通达信息技术有限公司 | 用于智能驾驶车辆的控制系统 |
CN117876669A (zh) * | 2024-01-22 | 2024-04-12 | 珠海市欧冶半导体有限公司 | 目标检测方法、装置、计算机设备和存储介质 |
CN118609081A (zh) * | 2024-08-07 | 2024-09-06 | 泉州职业技术大学 | 一种基于行嵌入聚类与特征交叉融合的车道线检测方法 |
Also Published As
Publication number | Publication date |
---|---|
CN114494158A (zh) | 2022-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023131065A1 (zh) | 一种图像处理方法、一种车道线检测方法及相关设备 | |
US11132780B2 (en) | Target detection method, training method, electronic device, and computer-readable medium | |
US12072442B2 (en) | Object detection and detection confidence suitable for autonomous driving | |
US11966838B2 (en) | Behavior-guided path planning in autonomous machine applications | |
US12039436B2 (en) | Stereo depth estimation using deep neural networks | |
US20240127062A1 (en) | Behavior-guided path planning in autonomous machine applications | |
CN113994390A (zh) | 针对自主驾驶应用的使用曲线拟合的地标检测 | |
WO2022104774A1 (zh) | 目标检测方法和装置 | |
JP2021515724A (ja) | 自動運転車において3dcnnネットワークを用いてソリューション推断を行うlidar測位 | |
JP2021515178A (ja) | 自動運転車両においてrnnとlstmを用いて時間平滑化を行うlidar測位 | |
JP2021514885A (ja) | 自動運転車のlidar測位に用いられるディープラーニングに基づく特徴抽出方法 | |
US11919545B2 (en) | Scenario identification for validation and training of machine learning based models for autonomous vehicles | |
CN113591872A (zh) | 一种数据处理系统、物体检测方法及其装置 | |
CN111368972A (zh) | 一种卷积层量化方法及其装置 | |
WO2022178858A1 (zh) | 一种车辆行驶意图预测方法、装置、终端及存储介质 | |
US20230047094A1 (en) | Image processing method, network training method, and related device | |
CN115273002A (zh) | 一种图像处理方法、装置、存储介质及计算机程序产品 | |
CN114802261B (zh) | 泊车控制方法、障碍物识别模型训练方法、装置 | |
CN115214708A (zh) | 一种车辆意图预测方法及其相关装置 | |
US11308324B2 (en) | Object detecting system for detecting object by using hierarchical pyramid and object detecting method thereof | |
CN115546781A (zh) | 一种点云数据的聚类方法以及装置 | |
WO2023207531A1 (zh) | 一种图像处理方法及相关设备 | |
CN116701586A (zh) | 一种数据处理方法及其相关装置 | |
CN113066124A (zh) | 一种神经网络的训练方法以及相关设备 | |
WO2024093321A1 (zh) | 车辆的位置获取方法、模型的训练方法以及相关设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22918506; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |