CN116052132A - Method and device for identifying pavement marker and electronic equipment - Google Patents

Method and device for identifying pavement marker and electronic equipment

Info

Publication number
CN116052132A
Authority
CN
China
Prior art keywords
module, image, decoding, pavement, layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310108308.7A
Other languages
Chinese (zh)
Inventor
李宁
朱磊
贾双成
郭杏荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhidao Network Technology Beijing Co Ltd
Original Assignee
Zhidao Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhidao Network Technology Beijing Co Ltd filed Critical Zhidao Network Technology Beijing Co Ltd
Priority to CN202310108308.7A
Publication of CN116052132A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 - Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582 - Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads, of traffic signs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The application relates to a method and device for identifying pavement markings and to an electronic device. The method of identifying a pavement marking comprises the following steps: obtaining an image to be identified; and processing the image to be identified by using a trained pavement marking recognition model to obtain the pavement marking. The pavement marking recognition model comprises an encoding module, a fusion module and a decoding module, wherein feature maps of at least two dimensions output by the encoding blocks in the encoding module are fused by the fusion module and then respectively input into the decoding blocks corresponding to the at least two dimensions in the decoding module. The method and device can improve the recognition of blurred pavement markings in images.

Description

Method and device for identifying pavement marker and electronic equipment
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a method and device for identifying pavement markings and to an electronic device.
Background
With the rapid development of computer technology and artificial intelligence, artificial intelligence is being applied in more and more scenarios, such as intelligent transportation and image recognition.
Accurate detection of the state of road surface markings plays an important role in realizing automatic driving. For example, an autonomous vehicle may merge, turn, or limit its speed according to road markings, as required by traffic regulations. The related art performs image recognition on acquired images to obtain pavement markings.
The applicant found that the related art's recognition of pavement markings leaves room for improvement.
In the related art, pavement markings captured from a moving vehicle are often blurred, which leads to poor recognition results.
Disclosure of Invention
In order to solve, or at least partially solve, the problems in the related art, the application provides a method, a device and an electronic device for identifying pavement markings, which can effectively improve the recognition of blurred pavement markings.
A first aspect of the present application provides a method of identifying a pavement marking, comprising: obtaining an image to be identified; and processing the image to be identified by using a trained pavement marking recognition model to obtain the pavement marking. The pavement marking recognition model comprises an encoding module, a fusion module and a decoding module, wherein feature maps of at least two dimensions output by the encoding blocks in the encoding module are fused by the fusion module and then respectively input into the decoding blocks corresponding to the at least two dimensions in the decoding module.
According to some embodiments of the present application, a plurality of encoding blocks in the encoding module each output a feature map of a different dimension; the fusion module fuses feature maps of at least two of these dimensions to obtain a fused feature map; and target decoding blocks in the decoding module respectively receive and process the fused feature map so as to output the pavement marking, where the target decoding blocks are the decoding blocks corresponding to the at least two dimensions in the decoding module.
According to some embodiments of the present application, an encoding block includes: at least one convolution layer for extracting feature maps from input data through convolution operations; and a pooling layer connected to the convolution layer for downsampling the feature maps.
According to certain embodiments of the present application, the encoding module comprises a plurality of cascaded encoding blocks, in which the pooling layer of the current-stage encoding block is connected to the first convolution layer of the next-stage encoding block.
According to some embodiments of the present application, the feature maps of at least two dimensions comprise: the feature maps respectively output by the last convolution layers of the first-stage and third-stage encoding blocks among four cascaded encoding blocks; or the feature maps respectively output by the last convolution layers of the second-stage and fourth-stage encoding blocks among the four cascaded encoding blocks.
According to some embodiments of the present application, the pooling layer takes the maximum or the average of the pixels covered by the pooling window.
According to certain embodiments of the present application, the pavement marking recognition model further comprises a transmission layer connected to both the encoding module and the decoding module. The transmission layer comprises a convolution layer and is used for adjusting the size of the feature map from the encoding module and transmitting the resized feature map to the decoding module.
According to certain embodiments of the present application, the structure of the encoding module and the structure of the decoding module are mirror symmetric with respect to the fusion module.
A second aspect of the present application provides an apparatus for identifying a pavement marking, comprising an image acquisition module and an image recognition module. The image acquisition module is used for obtaining an image to be identified; the image recognition module is used for processing the image to be identified by using a trained pavement marking recognition model to obtain the pavement marking. The pavement marking recognition model comprises an encoding module, a fusion module and a decoding module, wherein feature maps of at least two dimensions output by the encoding blocks in the encoding module are fused by the fusion module and then respectively input into the decoding blocks corresponding to the at least two dimensions in the decoding module.
A third aspect of the present application provides an electronic device, comprising: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the above method.
A fourth aspect of the present application also provides a computer readable storage medium having executable code stored thereon, which when executed by a processor of an electronic device, causes the processor to perform the above-described method.
A fifth aspect of the present application also provides a computer program product comprising executable code which when executed by a processor implements the above method.
According to the method, device and electronic device for identifying pavement markings of the present application, the fusion module of the pavement marking recognition model fuses feature maps of at least two dimensions from the encoding module to obtain a feature map carrying features of more dimensions. The decoding blocks corresponding to the at least two dimensions in the decoding module then process the fused feature map, improving the recognition accuracy of pavement markings in the image to be identified.
In some embodiments, the encoding blocks extract multi-dimensional feature maps through multiple convolution and pooling layers, which can improve the recognition accuracy for pavement markings of different sizes while reducing computation and improving response speed.
In some embodiments, using a max pooling layer helps extract the important features of the image to be identified, improving the recognition accuracy for blurred pavement markings.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
FIG. 1 illustrates an exemplary system architecture that may be applied to methods, apparatus, and electronic devices for identifying pavement markers according to embodiments of the present application;
FIG. 2 schematically illustrates an application scenario diagram for identifying pavement markers according to an embodiment of the present application;
FIG. 3 schematically illustrates a flow chart of a method of identifying a pavement marking according to an embodiment of the present application;
FIG. 4 schematically illustrates a topology of a pavement marking identification model according to an embodiment of the present application;
FIG. 5 schematically illustrates a topology of an encoding block according to an embodiment of the present application;
FIG. 6 schematically illustrates a topology of an encoding module according to an embodiment of the present application;
FIGS. 7A and 7B schematically illustrate a schematic of the pooling process results according to embodiments of the present application;
FIG. 8 schematically illustrates another topology of a pavement marking identification model according to an embodiment of the present application;
fig. 9 schematically shows a schematic diagram of an image processing procedure according to an embodiment of the present application;
FIG. 10 schematically illustrates an example diagram of an identified pavement marker according to an embodiment of the present application;
FIG. 11 schematically illustrates a block diagram of an apparatus for identifying pavement markings according to an embodiment of the present application;
fig. 12 schematically shows a block diagram of an electronic device according to an embodiment of the application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The related art recognizes clear pavement marking images in captured images fairly well, but its recognition of blurred pavement marking images needs improvement. For example, as distance increases, the camera's optical axis becomes nearly parallel to the ground, so distant pavement markings in the captured image are severely deformed and unclear. The related art cannot quickly and accurately identify pavement markings in images captured from a mobile platform. Yet obtaining the state of relevant pavement markings, including distant ones, as early as possible helps improve the driving experience.
To improve recognition, detection and segmentation networks in the related art can adopt a convolutional neural network, extracting target features layer by layer through progressive abstraction, and then fuse features of different dimensions through, for example, skip connections. This approach helps preserve the original features of the image, but it still has limitations. Experiments show that although the quality of the reconstructed image improves noticeably, only feature transfer and fusion within the same dimension are handled well; the degree of fusion between features of different dimensions is insufficient, so recognition of blurred objects still needs further improvement. In addition, the computation for fusing features of each dimension occupies excessive GPU memory, requiring multiple parallel pipelines to maintain processing speed.
According to the embodiments of the present application, the image to be identified is processed using a trained pavement marking recognition model. The fusion module in the model fuses feature maps of at least two dimensions from the encoding module across dimensions and then inputs the result into the decoding blocks corresponding to those dimensions in the decoding module, so that each decoding block obtains features of both its own dimension and adjacent dimensions, improving recognition of blurred objects.
A method, apparatus and electronic device for identifying a pavement marker according to embodiments of the present application will be described in detail below with reference to fig. 1 to 12.
FIG. 1 illustrates an exemplary system architecture that may be applied to methods, apparatus, and electronic devices for identifying pavement markers according to embodiments of the present application. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present application may be applied to help those skilled in the art understand the technical content of the present application, and does not mean that the embodiments of the present application may not be used in other devices, systems, environments, or scenarios.
Referring to fig. 1, a system architecture 100 according to this embodiment may include mobile platforms 101, 102, 103, a network 104, and a cloud 105. The network 104 is the medium used to provide communication links between the mobile platforms 101, 102, 103 and the cloud 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. Mobile terminals, such as cameras, etc., may be mounted on the mobile platforms 101, 102, 103 to perform functions such as capturing video and identifying pavement markers.
The user may interact with other mobile platforms and cloud 105 over network 104 using mobile platforms 101, 102, 103 to receive or send information, etc., such as sending model training requests, model parameter download requests, and receiving trained model parameters, etc. The mobile platforms 101, 102, 103 may be installed with various communication client applications, such as, for example, driving assistance applications, autopilot applications, vehicle applications, web browser applications, database class applications, search class applications, instant messaging tools, mailbox clients, social platform software, and the like.
Mobile platforms 101, 102, 103 include, but are not limited to, automobiles, robots, tablet computers, laptop portable computers, and the like, which may support internet surfing, video capturing, human interaction, and the like.
The cloud 105 may receive model training requests, model parameter download requests, and the like; adjust model parameters to perform model training; issue model topologies and trained model parameters; and may also send road weather information, real-time traffic information, and the like to the mobile platforms 101, 102, 103. For example, the cloud 105 may be a background management server, a server cluster, an Internet-of-Vehicles platform, or the like.
It should be noted that the numbers of mobile platforms, networks, and cloud servers shown are merely illustrative. There may be any number of mobile platforms, networks, and clouds, as required by the implementation.
Fig. 2 schematically illustrates an application scenario for identifying a pavement marker according to an embodiment of the present application.
Referring to fig. 2, part of a picture captured by a camera disposed on a mobile platform is shown. The related art can identify clear pavement markings in such an image fairly well. However, if the pavement marking is partially occluded, for example by a vehicle, or is at some distance from the vehicle, the accuracy of the recognition result cannot meet application requirements.
When the related art uses a recognition model with a U-net topology to recognize pavement markings, the model comprises an encoding (encoder) network, a decoding (decoder) network, and skip connections. Some encoder outputs correspond to attributes that do not change, and skipping these low-dimensional features into the decoding network facilitates reconstructing such regions. However, when a region should be changed, directly skipping that region's features to the decoder makes the decoder tend to reuse them, which leads to poor modification results. In the present application, based on observed effect and experience, at least some channels of different dimensions are skip-connected, which at least partially alleviates these problems. In addition, features of at least some dimensions in the encoding network are fused before being sent to the decoding network, which helps different decoding blocks decode according to the fused features and improves recognition of blurred objects.
Fig. 3 schematically shows a flow chart of a method of identifying a pavement marking according to an embodiment of the present application.
Referring to fig. 3, the embodiment provides a method of recognizing a pavement marker, which includes operations S310 to S320, as follows.
In operation S310, an image to be identified is obtained, the image to be identified including at least one road surface identification image.
In this embodiment, the image to be identified may be one frame of a video captured by a capturing device provided on the mobile platform. Mobile platforms include, but are not limited to, a vehicle, robot, vessel, or aircraft. For example, the image to be identified may be obtained by a capturing device provided on the vehicle, such as a dashboard camera.
The camera may be a monocular camera. Alternatively, a binocular camera or the like may be used; after the two images captured by the binocular camera are fused, pavement marking recognition can be performed on the fused image.
The image to be identified may include pavement marking images, such as turn markings. It may include at least part of an image of the mobile platform itself, or none at all. It may also include various man-made and natural objects, such as buildings, vehicles, pedestrians, and trees.
In operation S320, the image to be recognized is processed using the trained pavement marking recognition model to obtain a pavement marking.
In this embodiment, the pavement marking recognition model may be a model trained in advance, which can determine whether the input image contains a pavement marking image, or segment the pavement marking image from the image to be identified. The pavement marking recognition model may be any of several types of neural networks.
In some embodiments, the pavement marking recognition model may include an encoding module, a fusion module and a decoding module, where feature maps of at least two dimensions output by the encoding blocks in the encoding module are fused by the fusion module and then respectively input into the decoding blocks corresponding to the at least two dimensions in the decoding module. For example, the encoding module may comprise a plurality of cascaded encoding blocks; starting from the block that receives the image to be identified, the dimension of the features extracted by each encoding block goes from lower to higher layer by layer.
A higher-layer encoding block (of the network) has a larger receptive field and strong semantic characterization ability, but its feature map has low resolution and weak geometric characterization ability (it lacks spatial geometric detail). A lower-layer encoding block has a smaller receptive field; its feature map has high resolution and strong geometric-detail characterization ability, but weak semantic characterization ability. High-level semantic information helps accurately detect or segment the target. Therefore, fusing at least some features of different dimensions in deep learning can effectively improve detection and segmentation.
Fig. 4 schematically illustrates a topology diagram of a pavement marking identification model according to an embodiment of the present application.
Referring to fig. 4, the pavement marking recognition model includes an encoding module, a fusion module, and a decoding module. The encoding module transmits its output feature maps to the decoding module and the fusion module respectively, and the fusion module is also connected to the decoding module to transmit the fused feature map.
An exemplary description of the pavement marking identification model follows.
In some embodiments, a plurality of encoding blocks in an encoding module each output a feature map of a different dimension. For example, at least one convolution layer may be included in the encoding module to extract a feature map of at least one dimension from the image to be identified.
The fusion module fuses feature maps of at least two of the different dimensions to obtain a fused feature map. Feature fusion can be achieved by weighted summation, concatenation, or similar means. The encoding module may extract feature maps of multiple dimensions; when performing feature fusion, feature maps of all or only some of the dimensions may be fused, which is not limited here.
Target decoding blocks in the decoding module respectively receive and process the fused feature map to output the pavement marking, where the target decoding blocks are the decoding blocks corresponding to the at least two dimensions in the decoding module. Specifically, the high-dimensional feature map may be restored step by step, through upsampling, to the same size as the image to be identified, which may then contain the segmented pavement marking.
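As an illustrative sketch only (not the implementation of this application), fusing feature maps of two different dimensions can be done by resizing them to a common spatial size and splicing them along the channel dimension; the function name, tensor shapes, and bilinear resizing below are all assumptions:

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(feat_shallow: torch.Tensor, feat_deep: torch.Tensor) -> torch.Tensor:
    """Fuse two feature maps of different dimensions (hypothetical sketch).

    feat_shallow: e.g. (N, 64, 240, 400) from an early encoding block
    feat_deep:    e.g. (N, 256, 60, 100) from a later encoding block
    """
    # Upsample the deeper (smaller) map to the shallow map's spatial size.
    feat_deep_up = F.interpolate(feat_deep, size=feat_shallow.shape[2:],
                                 mode="bilinear", align_corners=False)
    # Splice along the channel dimension; a weighted sum would be the other
    # fusion option mentioned above.
    return torch.cat([feat_shallow, feat_deep_up], dim=1)
```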
Fig. 5 schematically shows a topology of a coding block according to an embodiment of the present application.
Referring to fig. 5, the encoding block includes convolution layers and a pooling layer. At least one convolution layer extracts feature maps from the input data through convolution operations, and the pooling layer, connected to the convolution layer, downsamples the feature maps.
For example, a convolution layer obtains feature maps from video frames in the image to be identified. Each encoding block may include several convolution pairs, with a pooling layer disposed between parts of adjacent encoding blocks. Each convolution pair includes an adjacently disposed convolution layer and activation layer; the convolutional kernel size of the convolution layer may be 3, the feature-map padding may be 1, and the stride may be 1. A normalization layer may additionally be placed between the convolution layer and the activation layer to normalize the extracted features. The pooling layer's kernel size may be 2, its padding 0, and its stride 2.
Furthermore, an encoding block may include more or fewer pooling layers. The encoding block can extract features from the video frame through its convolution pairs and pooling layer, and output a feature map.
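A minimal PyTorch sketch of one such encoding block, assuming the parameter values quoted above (convolution: kernel 3, padding 1, stride 1; pooling: kernel 2, padding 0, stride 2); the class name, channel counts, and the use of exactly two convolution pairs are illustrative:

```python
import torch.nn as nn

class EncodingBlock(nn.Module):
    """One encoding block: convolution pairs followed by a pooling layer (sketch)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            # Convolution pair 1: conv + normalization + activation.
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, stride=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            # Convolution pair 2.
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, stride=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        # Pooling layer: kernel 2, padding 0, stride 2 halves height and width.
        self.pool = nn.MaxPool2d(kernel_size=2, padding=0, stride=2)

    def forward(self, x):
        feat = self.features(x)   # feature map passed to the fusion / skip paths
        down = self.pool(feat)    # downsampled map fed to the next-stage block
        return feat, down
```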
In some embodiments, the encoding module comprises a plurality of cascaded encoding blocks, in which the pooling layer of the current-stage encoding block is connected to the first convolution layer of the next-stage encoding block.
Fig. 6 schematically shows a topology of an encoding module according to an embodiment of the present application.
Referring to fig. 6, the feature maps of at least two dimensions include the feature maps respectively output by the last convolution layers of the first-stage and third-stage encoding blocks among four cascaded encoding blocks.
In addition, the feature maps of at least two dimensions may further include the feature maps respectively output by the last convolution layers of the second-stage and fourth-stage encoding blocks among the four cascaded encoding blocks.
Specifically, video frames may be scaled to a fixed size X×Y, and the scaled images are then input into the encoding module. The encoding module may contain 4 convolution layers, 4 activation function layers (e.g., ReLU), and 4 pooling layers, and may additionally include normalization layers placed between the convolution and activation layers. The parameters of the convolution layers and pooling layers are as follows. Convolution layer: kernel size = 3, padding = 1, stride = 1. Pooling layer: kernel size = 2, padding = 0, stride = 2.
Here, padding = 1 makes the effective resolution of the video frame (X+2) × (Y+2); after convolution with a 3×3 kernel, the resolution of the output matrix is X×Y. This parameter setting keeps the convolution layer's input image and output matrix the same size.
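This size bookkeeping follows the standard convolution output-size formula; a quick sanity check under the stated parameters (the 480 input size is taken from the example below):

```python
def conv_out_size(x: int, kernel: int, padding: int, stride: int) -> int:
    # Standard convolution output-size formula.
    return (x + 2 * padding - kernel) // stride + 1

assert conv_out_size(480, kernel=3, padding=1, stride=1) == 480  # convolution keeps size
assert conv_out_size(480, kernel=2, padding=0, stride=2) == 240  # pooling halves it
```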
For example, if the encoder's input image is 480 × 800, each stage performs a downsampling operation that doubles the channel count while halving the image height and width.
A feature map with a small downsampling factor (generally a shallow layer) has a small receptive field and suits small targets, while a small-scale (deep) feature map lacks resolution information and is unsuitable for identifying small targets. For example, a 1/32-scale feature map (deep layer) has a large receptive field due to the high downsampling factor and suits detecting large targets, whereas a 1/8-scale feature map (shallow layer) has a smaller receptive field. For small targets, small-scale feature maps cannot provide the necessary resolution, so large-scale feature maps must be combined; that is, if a segmentation or detection network performs many downsampling operations, small targets are easily lost.
Pooling replaces the network's output at a location with an overall statistic of the neighboring outputs; its advantage is that most pooled outputs remain unchanged when the input data shifts slightly. Pooling also compresses the picture: the larger the picture, the slower the processing and the harder the recognition, and pooling reduces the picture size. It preserves most of the important information while reducing the dimensions of the individual feature maps.
For example, when determining whether an image includes a turn marking, it may suffice to detect that the image contains a polyline with a triangle at one end, without knowing the marking's exact location; pooling the pixels of a region then yields a useful overall statistic. Because the feature map shrinks after pooling, if a fully connected layer follows, the number of neurons is effectively reduced, saving storage space and improving computational efficiency.
Current pooling methods mainly include max pooling, average pooling, and sum pooling. Max pooling keeps the largest pixel in each grid of four and discards the other three. Pooling, i.e., spatial pooling, achieves lower dimensionality by aggregating statistics over different features while helping to avoid overfitting. Average pooling computes the mean of an image region and uses it as that region's pooled value; max pooling selects the maximum of the region and uses it as the pooled value. In other words, a spatial neighborhood is defined, and the largest element, or the average, is taken from the corrected feature map.
Fig. 7A and 7B schematically show pooling results according to embodiments of the present application. There are generally two ways to pool a 2×2 region into one pixel: average pooling and max pooling.
Referring to fig. 7A, the pooling layer averages the pixels covered by the pooling window. Fig. 7A shows 16 pixels with values 1 to 16. With a 2×2 pooling window and a stride of 2, the results after average pooling are shown in the right part of fig. 7A.
Referring to fig. 7B, the pooling layer takes the maximum of the pixels covered by the pooling window. Fig. 7B shows 16 pixels with values 1 to 16. With a 2×2 pooling window and a stride of 2, the results after max pooling are shown in the right part of fig. 7B.
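Both figures can be reproduced with a few lines; the 4×4 input with values 1 to 16 follows the description above:

```python
import torch
import torch.nn.functional as F

x = torch.arange(1.0, 17.0).reshape(1, 1, 4, 4)  # pixels valued 1..16 as in Figs. 7A/7B

avg = F.avg_pool2d(x, kernel_size=2, stride=2)  # Fig. 7A: average pooling
mx = F.max_pool2d(x, kernel_size=2, stride=2)   # Fig. 7B: max pooling

print(avg.squeeze())  # tensor([[ 3.5,  5.5], [11.5, 13.5]])
print(mx.squeeze())   # tensor([[ 6.,  8.], [14., 16.]])
```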
Pooling gradually reduces the spatial scale of the input representation and the feature dimensions, shrinking the number of parameters and computations in the network in a controllable way. It makes the network invariant to small variations, redundancies, and transformations in the input image, and helps obtain scale invariance of the image to the greatest extent.
Fig. 8 schematically illustrates another topology of a pavement marking identification model according to an embodiment of the present application.
Referring to fig. 8, the pavement marking recognition model may also include a transmission layer, such as the one at the lowest level in fig. 8. The transmission layer is connected to both the encoding module and the decoding module; it comprises a convolution layer and is used for adjusting the size of the feature map from the encoding module and transmitting the resized feature map to the decoding module.
In some embodiments, the structure of the encoding module and the structure of the decoding module are mirror symmetric with respect to the fusion module. Wherein the downsampling operation is employed in the encoding module and the upsampling operation is employed in the decoding module.
After an image is convolved and then deconvolved, the result does not match the original image. Deconvolution cannot restore the matrix values that existed before convolution, only the matrix size, because deconvolution is itself a kind of convolution. Deconvolution is in fact a special form of convolution that can enlarge the original through full convolution and increase its resolution, so deconvolving an image is also called "upsampling" it. Convolution and deconvolution of an image are therefore not a simple transform-and-restore process: convolving a picture and then deconvolving it with the same kernel does not recover the original picture, because deconvolution merely expands it.
Referring to fig. 4 and fig. 6 together, the encoding module implements feature extraction. It is a contracting network that reduces the picture size through four downsampling steps, extracting features from shallow information during downsampling. Specifically, the input picture passes through two convolution kernels (3×3, each followed by a ReLU activation). For example, with an input image of size 572×572, two 3×3 convolutions take the size from 572×572 to 570×570 and then to 568×568; a max pool (2×2) then reduces the picture size to 284×284, completing one downsampling. The next three encoding blocks work the same way. During downsampling, the channel count doubles, for example from 64 to 128 in the figure.
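The unpadded variant described here can be checked directly (the single input channel and random input are assumptions):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3), nn.ReLU(inplace=True),   # 572 -> 570
    nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(inplace=True),  # 570 -> 568
    nn.MaxPool2d(kernel_size=2),                              # 568 -> 284
)
out = block(torch.randn(1, 1, 572, 572))
print(out.shape)  # torch.Size([1, 64, 284, 284])
```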
This differs from the copy-and-crop splicing employed in the related art. In the related-art U-net there are four splicing operations (also called skip connections) whose purpose is to fuse deep and shallow feature information; when splicing, the picture sizes must match and the feature dimensions (channels) must be compatible. In this embodiment, the feature maps of multiple dimensions output by the encoding module are first fused and then transmitted to the decoding module: the feature maps of at least two dimensions output by the encoding blocks are fused by the fusion module and then respectively input into the decoding blocks corresponding to those dimensions. The decoding blocks in this embodiment can therefore obtain features of adjacent dimensions, further improving recognition of blurred objects.
The upsampling part (up-conv) of the decoding module, also called the expanding network, enlarges the picture size and extracts deep information through four upsampling steps. During upsampling, the channel count of the picture is halved, the opposite of the channel changes in the encoding module on the left. The upsampling process fuses in the shallow information from the left, i.e., splices in the fused features.
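A hedged sketch of one such decoding block, mirroring the encoding block sketched earlier: a transposed convolution doubles the spatial size and halves the channel count, after which the fused features from the encoder side are spliced in. Names and kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

class DecodingBlock(nn.Module):
    """One decoding block: upsample, splice fused encoder features, convolve (sketch)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Transposed convolution halves the channel count and doubles H and W.
        self.up = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_channels * 2, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, fused_feat):
        x = self.up(x)
        # Splice the fused feature map from the fusion module along the
        # channel dimension, as in the Concat operation described above.
        x = torch.cat([fused_feat, x], dim=1)
        return self.conv(x)
```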
Fig. 9 schematically shows a schematic diagram of an image processing procedure according to an embodiment of the present application.
Referring to fig. 9: a right arrow represents convolution plus an activation function; a grey dotted arrow represents feature transmission (feature maps or fused feature maps); a downward arrow represents downsampling (which may be implemented by a pooling operation); an upward arrow represents upsampling followed by convolution; and a thin solid arrow (conv 1×1) represents a convolution with a 1×1 kernel.
The Unet algorithm uses a concatenation (Concat) layer to fuse the feature maps at corresponding positions of the two paths, so that the decoder can obtain more high-resolution information during upsampling, recover detail from the original image more completely, and improve segmentation precision. For this task, spatial-domain information is very important: the encoding (encoder) part of the network reduces the feature-map resolution through each pooling layer until it is very small, which is unfavorable for generating an accurate segmentation mask. Through feature fusion and transmission, features of shallower multi-dimensional convolution layers can be introduced; these have higher resolution and carry relatively rich low-level information, which is more helpful for generating the segmentation mask. The method introduces feature reconstruction across adjacent layers and cross-layer feature fusion that enhances information transfer between layers, and further exploits the rich detail information in high-level convolutional feature layers, maximizing the utilization of feature information at every layer of the network. As the network gets deeper during propagation, the receptive field of the corresponding feature map grows but the preserved detail shrinks; for semantic segmentation tasks, the rich detail preserved by high-level convolutions is very valuable. Based on the symmetric encoder-decoder structure, the fusion module splices, along the channel dimension, the fused feature maps extracted by downsampling in the encoder with the new feature maps obtained by upsampling in the decoder, so that important feature information in high-level convolutions is better preserved and a finer segmentation effect can be achieved.
For example, the image is input into a neural network (e.g., a fully convolutional network, FCN), which convolves and downsamples the image, decreasing its size while increasing the receptive field. Since image segmentation predicts a pixel-wise output, downsampling lets each pixel's prediction see information from a larger receptive field, and upsampling then restores the smaller image to the original size.
It should be noted that the technical scheme of the application is well suited to video captured by a camera while the vehicle is moving: the vehicle may move at high speed, the position of the pavement marking changes quickly across video frames, and the pavement marking must be determined from each frame rapidly.
In embodiments of the present application, the photographing device may be a monocular, binocular, trinocular, or higher-order camera. For example, a multi-view camera includes multiple cameras with different shooting ranges; illustratively, a trinocular camera may include a first camera, a second camera, and a third camera.
When training models, a corresponding pavement marking recognition model can be trained for each camera of the trinocular camera. For example, a first pavement marking recognition model is trained for the first camera to recognize images acquired by the first camera; a second pavement marking recognition model is trained for the second camera to recognize images acquired by the second camera; and a third pavement marking recognition model is trained for the third camera to recognize images acquired by the third camera. The weighted results of the three models are taken as the final result. Alternatively, a general pavement marking recognition model may be trained for all cameras of the trinocular camera to recognize images acquired by each of them.
In the embodiments of the present application, sample images can be collected according to the pavement marking recognition model to be trained. For example, when the model to be trained is the first pavement marking recognition model corresponding to the first camera, the sample images are images collected by the first camera; when it is the second model corresponding to the second camera, the sample images are images collected by the second camera; and when it is the third model corresponding to the third camera, the sample images are images collected by the third camera. When the model to be trained is a pavement marking recognition model common to all cameras of the trinocular camera, the sample images are images collected by all three cameras.
In some embodiments, the acquired sample images may be synchronized in time sequence, i.e., ordered by acquisition time. When the sample images come from three cameras, the images acquired by the three cameras at the same moment can be treated as one group, and the groups are then ordered by acquisition time.
In some embodiments, to train the model with the acquired sample images, the pavement markings in the sample images need to be annotated in advance, e.g., the position and extent of each marking and its color information; the specific items can be chosen according to the purpose of the training.
The annotated sample images can be divided into a training sample set and a test sample set according to a preset ratio. The training sample set is used for model training, and the test sample set is used to test the trained model to ensure that it achieves the expected recognition performance.
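One common way to perform such a preset-ratio split, sketched with an assumed 8:2 ratio and a stand-in sample list:

```python
from sklearn.model_selection import train_test_split

labeled_samples = list(range(1000))  # stand-in for the annotated sample images
# test_size=0.2 is an assumed example of a "preset ratio" (8:2 split).
train_set, test_set = train_test_split(labeled_samples, test_size=0.2,
                                       shuffle=True, random_state=0)
```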
In some embodiments, the pavement marking recognition model may be trained by a back-propagation algorithm; standard neural network training methods can be referenced. During training of the base model, information such as the position of the pavement marking can be supplied externally.
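A non-authoritative sketch of one such back-propagation training step; the loss function, optimizer interface, and data-loader format are assumptions, not specified by this application:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Minimal supervised segmentation training loop (illustrative only)."""
    criterion = nn.CrossEntropyLoss()  # per-pixel classification loss (assumed)
    model.train()
    for images, masks in loader:  # masks: annotated pavement-marking labels
        images, masks = images.to(device), masks.to(device)
        optimizer.zero_grad()
        logits = model(images)           # (N, num_classes, H, W)
        loss = criterion(logits, masks)  # masks: (N, H, W) class indices
        loss.backward()                  # back propagation
        optimizer.step()
```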
It should be noted that the model training may be offline or online and may be performed in the cloud. The mobile platform's computing device may download the model topology and parameters of the trained pavement marking recognition model from the cloud, enabling recognition locally on the mobile platform. Alternatively, the mobile platform can send the video stream to the cloud, where the trained model processes the image to be identified to obtain pavement marking information, and the cloud sends (or broadcasts) the pavement marking position information back to the mobile platform.
In a specific training run, the weights may be determined as follows: initialize them, then gradually adjust according to the training results, and stop adjusting once the training accuracy reaches the target, fixing the weights. Determining the values in the convolution kernels of the pavement marking recognition model works the same way: the values in each convolution kernel (i.e., the weights) can be initialized with random numbers (e.g., Gaussian-distributed), then gradually adjusted according to the loss function, stopping when the training accuracy is satisfactory. Adjusting the convolution-kernel values is in fact the training process of the pavement marking recognition model; once the kernels' values are fixed, the model is trained.
The architecture of the pavement marking recognition model in fig. 9 is a repeating structure: each repetition has 2 convolution layers and a pooling layer; the convolution kernel size is 3×3 with ReLU activation, and the two convolution layers are followed by a 2×2 max pooling layer with stride 2. The number of feature channels doubles after each downsampling. Each step in the expanding path first applies a deconvolution (up-convolution), which each time halves the number of feature channels and doubles the feature-map size. After deconvolution, the result is spliced with the fused feature map; since the feature map on the connecting path is slightly larger, it is cropped before splicing. The spliced map then undergoes two 3×3 convolutions. The last layer's convolution kernel is 1×1, converting the 64-channel feature map into an output of the required depth.
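The final 1×1 convolution described above maps the 64-channel feature map to the required output depth; a one-line sketch with an assumed class count:

```python
import torch.nn as nn

num_classes = 3  # assumed, e.g. background plus pavement-marking types of interest
# 1x1 convolution: converts the 64-channel feature map into per-pixel class scores.
head = nn.Conv2d(in_channels=64, out_channels=num_classes, kernel_size=1)
```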
In this embodiment, multiple layers of semantic features are fused by splicing multiple channels; the fused feature map helps further improve the recognition of blurred objects.
Fig. 10 schematically illustrates an example diagram of an identified pavement marker according to an embodiment of the present application. Referring to fig. 10, an image taken by a vehicle in a high-speed driving state is somewhat blurred, and the present embodiment can quickly recognize and accurately segment a road surface identifier from the blurred image.
Another aspect of the present application also provides an apparatus for identifying a pavement marker.
Fig. 11 schematically shows a block diagram of an apparatus for identifying pavement markings according to an embodiment of the present application.
Referring to fig. 11, the apparatus 1100 for identifying a pavement marking may include an image acquisition module 1110 and an image recognition module 1120.
The image acquisition module 1110 is configured to obtain an image to be identified.
The image recognition module 1120 is configured to process the image to be identified using a trained pavement marking recognition model to obtain the pavement marking. The pavement marking recognition model comprises an encoding module, a fusion module and a decoding module, wherein feature maps of at least two dimensions output by the encoding blocks in the encoding module are fused by the fusion module and then respectively input into the decoding blocks corresponding to the at least two dimensions in the decoding module.
In some embodiments, a plurality of encoding blocks in the encoding module each output a feature map of a different dimension; the fusion module fuses feature maps of at least two of these dimensions to obtain a fused feature map; and target decoding blocks in the decoding module respectively receive and process the fused feature map so as to output the pavement marking, where the target decoding blocks are the decoding blocks corresponding to the at least two dimensions in the decoding module.
In some embodiments, an encoding block comprises: at least one convolution layer for extracting feature maps from input data through convolution operations; and a pooling layer connected to the convolution layer for downsampling the feature maps.
In some embodiments, the encoding module comprises a plurality of cascaded encoding blocks, in which the pooling layer of the current-stage encoding block is connected to the first convolution layer of the next-stage encoding block.
In some embodiments, the feature maps of at least two dimensions comprise: the feature maps respectively output by the last convolution layers of the first-stage and third-stage encoding blocks among four cascaded encoding blocks; and/or the feature maps respectively output by the last convolution layers of the second-stage and fourth-stage encoding blocks among the four cascaded encoding blocks.
In some embodiments, the pooling layer takes the maximum or the average of the pixels covered by the pooling window.
In certain embodiments, the pavement marking recognition model further comprises a transmission layer connected to both the encoding module and the decoding module; it comprises a convolution layer and is used for adjusting the size of the feature map from the encoding module and transmitting the resized feature map to the decoding module.
In some embodiments, the structure of the encoding module and the structure of the decoding module are mirror symmetric with respect to the fusion module.
The specific manner in which the respective modules and units perform the operations in the apparatus 1100 in the above embodiment has been described in detail in the embodiment related to the method, and will not be described in detail here.
Another aspect of the present application also provides an electronic device.
Fig. 12 schematically shows a block diagram of an electronic device according to an embodiment of the application.
Referring to fig. 12, an electronic device 1200 includes a memory 1210 and a processor 1220.
Processor 1220 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Memory 1210 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 1220 or other modules of the computer. The persistent storage may be a readable and writable storage device, and may be non-volatile so that stored instructions and data are not lost even after the computer is powered down. In some embodiments, the persistent storage is a mass storage device (e.g., a magnetic or optical disk, or flash memory); in other embodiments, it may be a removable storage device (e.g., a diskette or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data needed by some or all of the processors at runtime. Furthermore, memory 1210 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be employed. In some implementations, memory 1210 may include readable and/or writable removable storage devices, such as compact discs (CDs), digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), read-only Blu-ray discs, ultra-density discs, flash memory cards (e.g., SD cards, mini SD cards, micro-SD cards, etc.), magnetic floppy disks, and the like. The computer-readable storage media do not contain carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 1210 has stored thereon executable code which, when executed by the processor 1220, causes the processor 1220 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing part or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or a server, etc.), causes the processor to perform some or all of the steps of the above-described methods according to the present application.
The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method of identifying a pavement marking, comprising:
obtaining an image to be identified;
processing the image to be identified by using a trained pavement marking recognition model to obtain a pavement marking;
wherein the pavement marking recognition model comprises an encoding module, a fusion module and a decoding module, and feature maps of at least two dimensions output by coding blocks in the encoding module are fused by the fusion module and then respectively input into decoding blocks corresponding to the at least two dimensions in the decoding module.
2. The method according to claim 1, characterized in that:
a plurality of coding blocks in the encoding module respectively output feature maps of different dimensions;
the fusion module fuses feature maps of at least two dimensions among the feature maps of different dimensions to obtain a fused feature map; and
target decoding blocks in the decoding module respectively receive and process the fused feature map to output the pavement marking, wherein the target decoding blocks are the decoding blocks corresponding to the at least two dimensions in the decoding module.
3. The method of claim 1, wherein the coding block comprises:
at least one convolution layer, used for extracting feature maps from input data through convolution operations; and
a pooling layer, connected to the convolution layer and used for downsampling the feature maps.
4. The method according to claim 3, wherein the encoding module comprises a plurality of cascaded coding blocks, and the pooling layer of the current-stage coding block is connected to the first convolution layer of the next-stage coding block.
5. The method of claim 4, wherein the feature maps of the at least two dimensions comprise:
feature maps respectively output by the last convolution layer of the first-stage coding block and the third-stage coding block among four cascaded coding blocks; and/or
feature maps respectively output by the last convolution layer of the second-stage coding block and the fourth-stage coding block among the four cascaded coding blocks.
6. The method according to claim 3, wherein the pooling layer takes the maximum value or the average value of the pixels within the area covered by the pooling window.
7. The method of any one of claims 1 to 6, wherein the pavement marking recognition model further comprises:
a transmission layer, connected to both the encoding module and the decoding module, wherein the transmission layer comprises a convolution layer and is used for adjusting the size of the feature map from the encoding module and transmitting the resized feature map to the decoding module.
8. The method according to any one of claims 1 to 6, wherein the structure of the encoding module and the structure of the decoding module are mirror-symmetrical with respect to the fusion module.
9. An apparatus for identifying a pavement marking, comprising:
an image acquisition module, used for obtaining an image to be identified; and
an image recognition module, used for processing the image to be identified by using a trained pavement marking recognition model to obtain a pavement marking, wherein the pavement marking recognition model comprises an encoding module, a fusion module and a decoding module, and feature maps of at least two dimensions output by coding blocks in the encoding module are fused by the fusion module and then respectively input into decoding blocks corresponding to the at least two dimensions in the decoding module.
10. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method according to any of claims 1-8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310108308.7A CN116052132A (en) 2023-02-01 2023-02-01 Method and device for identifying pavement marker and electronic equipment

Publications (1)

Publication Number Publication Date
CN116052132A (en) 2023-05-02

Family

ID=86129483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310108308.7A Pending CN116052132A (en) 2023-02-01 2023-02-01 Method and device for identifying pavement marker and electronic equipment

Country Status (1)

Country Link
CN (1) CN116052132A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination