CN113850111A - Road condition identification method and system based on semantic segmentation and neural network technology - Google Patents

Road condition identification method and system based on semantic segmentation and neural network technology

Info

Publication number
CN113850111A
Authority
CN
China
Prior art keywords
sequence
road condition
frame sequence
frames
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110436451.XA
Other languages
Chinese (zh)
Inventor
张继东
曹靖城
史国杰
周帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Digital Life Technology Co Ltd
Original Assignee
Tianyi Smart Family Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Smart Family Technology Co Ltd filed Critical Tianyi Smart Family Technology Co Ltd
Priority to CN202110436451.XA
Publication of CN113850111A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Traffic Control Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a road condition identification method and system based on semantic segmentation and neural network technology. The method comprises the following steps: acquiring a video on which road condition identification is to be performed; extracting frames from the acquired video to obtain a frame sequence; semantically segmenting the frame sequence to obtain a mask frame sequence that distinguishes road condition objects from background objects, wherein the road condition objects comprise one or more of roads, pedestrians, and vehicles; masking the frame sequence using the mask frame sequence to remove the background objects; performing feature extraction on the masked frame sequence; and identifying the road condition based on the extracted features.

Description

Road condition identification method and system based on semantic segmentation and neural network technology
Technical Field
The invention relates to the field of artificial intelligence image processing, and in particular to a method and a system for reducing image noise based on semantic segmentation and neural network technology to improve the road condition identification accuracy of a driving recorder.
Background
The traffic condition state (i.e., whether a road is clear or congested) affects a user's judgment and choice of travel time, route, and even travel mode. Map applications widely used in China, such as Gaode Maps and Baidu Maps, provide massive positioning and route navigation planning services for users every day. The road condition information they provide is mainly acquired through road condition collection, where the main model is the floating car model: the speed and heading of floating cars are recorded by GPS, and the road condition is then calculated by matching the records to roads. Currently, most floating cars are provided by taxi companies. In addition, some higher-cost road condition acquisition methods exist, such as installing inductive loops and speed cameras, but such data generally belong to government departments.
In the traditional road condition detection method, inductive loops, speed-measuring radars, and video monitoring equipment are installed on main urban roads, and these devices mainly detect traditional road condition information such as road occupancy, traffic flow, and speed. Alternatively, elements such as driving users' track information and public information are uploaded via vehicle GPS and, after algorithmic processing, provided to users. GPS-based road condition collection and monitoring mainly relies on the GPS of traveling vehicles.
However, on roads with few users or abnormal driving behavior, roads without collection coverage, or rural lanes with little traffic, the map software's road condition identification depends heavily on whether the current road is covered by inductive loops and whether floating cars have collected its information. This cannot guarantee the accuracy of the map data users obtain while driving and seriously degrades the user's travel experience.
Vehicle-mounted video images contain richer information and offer another angle for solving the problem. Through video or pictures, the real state of the road can be observed, including the number of motor vehicles and the width and openness of the road. More accurate road condition states can be obtained from vehicle-mounted video images, providing higher-quality services for users' travel.
In recent years, research on road condition identification has focused on identifying moving vehicles, which reduces the problem to moving target identification. The current state of research on this problem is as follows:
(1) Inter-frame difference method: this method identifies a moving target by frame differencing, but the algorithm cannot extract the complete area of a moving object and only extracts its outline, and its performance depends heavily on the chosen inter-frame interval and segmentation threshold. Moreover, the algorithm cannot be applied to a moving camera, and it cannot identify a stationary or slowly moving target.
(2) Background subtraction method: the initial background subtraction method established a fixed background model and identified moving targets by differencing the current frame against the background frame. As research progressed, adaptive background updating is now used, which updates the background frame in real time to identify moving targets and solves the practical problem to a certain extent. However, the method cannot be applied to a moving camera, and updating the background frame in real time remains difficult.
(3) Optical flow method: what first enabled effective identification and tracking was the optical flow method based on computing displacement vectors, which uses the temporal change of pixels in an image sequence and the correlation between adjacent frames to find correspondences between the previous frame and the current frame, thereby computing an object's motion between adjacent frames. However, moving target tracking based on optical flow rests on two basic assumptions: 1) brightness constancy; and 2) temporal continuity or small motion, i.e., the target position does not change much between frames. These assumptions are difficult to satisfy in road condition recognition based on a driving recorder. A minimal sketch of these three classical methods is given below.
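The following sketch illustrates the three classical methods above using OpenCV's standard APIs; all parameter values are illustrative defaults rather than values from any of the cited work.

import cv2

def frame_difference(prev_bgr, curr_bgr, threshold=25):
    """(1) Inter-frame difference: thresholded pixel change between two
    frames; yields only outlines and is sensitive to the threshold."""
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    _, motion = cv2.threshold(cv2.absdiff(prev, curr),
                              threshold, 255, cv2.THRESH_BINARY)
    return motion

# (2) Background subtraction: MOG2 maintains an adaptive background
# model that is updated online with every call to apply().
mog2 = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

def foreground_mask(frame_bgr):
    return mog2.apply(frame_bgr)

def dense_flow(prev_bgr, curr_bgr):
    """(3) Optical flow: Farneback dense flow returns an H x W x 2 array
    of per-pixel (dx, dy) displacements, valid only under the
    brightness-constancy and small-motion assumptions."""
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15,
                                        3, 5, 1.2, 0)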
Chinese patent application CN105681763A, entitled "Real-time road condition live broadcast method and system based on video", installs an intelligent networked driving recorder on a vehicle; the recorder automatically establishes a communication connection with a platform server and periodically collects and sends the vehicle's state information. The platform server receives the vehicle state information, sets a sharable identifier at the corresponding position on a map, and shares the real-time video stream with a requesting user to display road condition information. This identification approach is limited by the network environment and the processing speed of the server, and is therefore not applicable in real scenarios.
Chinese patent application CN110889372A, entitled "Automatic tracking driving method based on video tracking and target radar information", provides a method of identifying road conditions by following a preceding vehicle during automatic driving. It mainly uses video tracking and radar detection to obtain the spatial tracking coordinates of a preceding target vehicle, judges whether the preceding vehicle's route planning information is consistent with the current vehicle's, and indirectly identifies the road condition by tracking a vehicle whose route planning is consistent. The method depends heavily on the driving behavior of the preceding vehicle and therefore cannot accurately reflect the real condition of the road ahead.
Chinese patent application CN107368890A, entitled "Vision-centered deep-learning-based road condition analysis method and system", proposes receiving visual input of the real-time traffic environment from a camera; identifying at least one initial region of interest from the visual input using a recurrent YOLO engine trained with a CNN training method; verifying whether a detected object within the at least one initial region of interest is a tracking candidate; and using LSTMs to track the detected object based on the real-time visual input and to predict its future state using a CNN training method.
The above techniques and three patent applications all have certain limitations, whether in the live video mode, the driving route matching mode, or the vehicle tracking method. They are limited by: 1) the network environment; 2) complex driving environments and drivers' individual driving behavior; and 3) the video data input to the algorithm, which assumes a fixed camera (i.e., an unchanging background) with small variation. These limitations in actual driving greatly reduce identification accuracy. In fact, when a vehicle travels on urban roads, town roads, or expressways, the background in front of the vehicle changes constantly and contains considerable noise; if this noise can be effectively reduced, using the denoised video frames can effectively improve the accuracy of road condition identification.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
To address the problems that existing road condition identification technology suffers from heavy background noise and fails to effectively use driving recorder video content and image processing technology to identify road conditions, the invention provides a method for image noise reduction based on semantic segmentation and a deep neural network to improve the accuracy of road condition identification. An actual road condition video is captured by a vehicle-mounted camera or a driving recorder; a certain number of images are first extracted from the captured video by frame extraction, and then the people, roads, vehicles, and other objects affecting driving in those images are identified by semantic segmentation. The semantically segmented image is compared with the original image, the objects identified in the segmentation stage are retained in the original image, and the other background information is discarded. The processed image sequence is then input into a convolutional neural network with a pyramid structure to extract image features; considering the continuity of the image sequence and the relations between frames, the image features are input into an LSTM to extract sequence feature information, and finally a fully connected layer classifies the road condition.
According to an aspect of the present invention, a road condition identification method is provided, comprising:
acquiring a video on which road condition identification is to be performed;
extracting frames from the acquired video to obtain a frame sequence;
semantically segmenting the frame sequence to obtain a mask frame sequence that distinguishes road condition objects from background objects, wherein the road condition objects comprise one or more of roads, pedestrians, and vehicles;
masking the frame sequence using the mask frame sequence to remove the background objects;
performing feature extraction on the masked frame sequence; and
identifying the road condition based on the extracted features.
According to another aspect of the present invention, a road condition identification system is provided, comprising:
a video capture module configured to capture real-time road condition video;
a road condition identification module configured to:
extract frames from the real-time road condition video captured by the video capture module to obtain a frame sequence;
semantically segment the frame sequence to obtain a mask frame sequence that distinguishes road condition objects from background objects, wherein the road condition objects comprise one or more of roads, pedestrians, and vehicles;
mask the frame sequence using the mask frame sequence to remove the background objects;
perform feature extraction on the masked frame sequence; and
identify the road condition based on the extracted features; and
a communication module configured to send the identification result of the road condition identification module to a road condition monitoring platform.
According to a further embodiment of the present invention, semantically segmenting the frame sequence to obtain a mask frame sequence that distinguishes the road condition objects from the background objects further comprises:
distinguishing the road condition objects from the background objects using a pre-trained road condition semantic segmentation model, which is a learning network with a residual pyramid structure.
According to a further embodiment of the present invention, masking the frame sequence using the mask frame sequence to remove the background objects further comprises:
performing a binarization operation on the mask frame sequence;
resizing the binarized mask frame sequence to the size of the original image; and
masking the original image with the resized mask frame sequence to replace background objects in the original image with black.
According to a further embodiment of the invention, performing feature extraction on the masked frame sequence further comprises:
extracting image features from the masked frame sequence through a convolutional neural network to obtain an image feature matrix;
extracting sequence semantic features from the image features extracted by the convolutional neural network through a recurrent neural network to obtain a sequence semantic feature matrix; and
fusing the obtained image feature matrix and sequence semantic feature matrix to obtain a fused sequence feature matrix.
According to a further embodiment of the present invention, identifying the road condition based on the extracted features further comprises:
inputting the extracted features into a pre-trained road condition classifier model to obtain a road condition identification result.
Compared with prior art solutions, the road condition identification method and system provided by the invention have at least the following advantages:
(1) extracting image sequence features with a convolutional neural network alone attends to the local semantic information of each image and ignores the sequence semantics among the images. The invention uses a deep recurrent neural network to extract, from a global perspective, the sequence semantic information contained in the image sequence;
(2) high recognition accuracy: in current road condition recognition, the vehicle-mounted camera moves and the background changes constantly, so background information seriously interferes with recognition. The RPNet model can effectively semantically segment video frames, removing the noise content from the images and effectively improving model accuracy;
(3) Inception can run under strict memory and computation constraints and uses fewer parameters than VGGNet, only 1/36 of the VGGNet parameter count. This makes the Inception model applicable to a variety of big data scenarios;
(4) compared with the prior art, the invention greatly saves network bandwidth and is no longer limited by the network environment; road conditions can be identified in real time simply by deploying a scene-specific model in a local driving recorder in embedded form with a certain amount of computing power.
These and other features and advantages will become apparent upon reading the following detailed description and upon reference to the accompanying drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of aspects as claimed.
Drawings
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical aspects of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective aspects.
Fig. 1 is a flowchart of a road condition identification method according to an embodiment of the present invention.
FIG. 2 is a model framework of a deep learning network that may be used to perform frame sequence feature extraction according to one embodiment of the invention.
Fig. 3 is a schematic structural diagram of a road condition recognition system according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the attached drawings, and the features of the present invention will be further apparent from the following detailed description.
Fig. 1 is a flowchart of a road condition identification method 100 according to an embodiment of the present invention. The method begins at step 102 by acquiring a video on which road condition identification is to be performed. As one example, the video may be real-time road condition video captured by a driving recorder, a camera mounted on a car, or another image capture device. Typically, such video captures the environment within a certain viewing angle directly in front of the vehicle, which generally includes road condition objects such as roads, pedestrians, and vehicles, as well as background objects such as buildings and trees on both sides of the road.
In step 104, frames are extracted from the acquired video. A video consists of many consecutive frames, and common frame rates are 24, 30, or 60 frames per second. For a road condition scenario, consecutive frames may not globally represent the current driving conditions. Thus, an extraction rate may be specified, such as extracting a certain number of frames every few seconds (e.g., every 3 or 5 seconds) or every certain number of frames (e.g., every 72 or 120 frames for 24 fps video). In one example, the duration of the video may be fixed, for example 30 or 60 seconds per segment; a longer video may first be split into segments. When the video duration is fixed, the number of frames extracted from it is also fixed. In another example, the video duration may vary, but the number of extracted frames may be capped, such as at 10 frames or some other number.
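As a concrete illustration, the frame extraction of step 104 can be sketched with OpenCV as follows; the 72-frame interval and 10-frame cap are just the example values above, and the function name is illustrative.

import cv2

def extract_frames(video_path, every_n_frames=72, max_frames=10):
    """Decimate a video into a short frame sequence (step 104)."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:                       # end of video reached
            break
        if index % every_n_frames == 0:  # keep one frame per interval
            frames.append(frame)
        index += 1
    cap.release()
    return frames                        # list of H x W x 3 BGR arrays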
In step 106, the extracted frame sequence is semantically segmented to obtain a mask frame sequence that distinguishes road condition objects from background objects. As mentioned previously, road condition objects may include, but are not limited to, roads, pedestrians, and vehicles, and background objects may include, but are not limited to, buildings, trees, and any other objects not belonging to the road condition objects. As one example, semantic segmentation may be performed frame by frame on the extracted frame sequence using a trained road condition semantic segmentation model. Given an input frame, the model outputs a two-class segmentation map that distinguishes road condition objects from background objects, where the road condition objects are labeled in one color and the background in another. Because this frame is subsequently used as a mask, it is referred to as a mask frame, and processing each frame of the extracted sequence yields the mask frame sequence. As a preferred example, to improve segmentation efficiency, the invention can build a learning network with a residual pyramid structure based on, for example, RPNet. Similar to the SSD model in object detection, the model approximates residuals at different levels using different layers of a single backbone network to realize single-shot segmentation, thereby improving segmentation efficiency. In addition, the RPNet model can be trained on vehicle-mounted images of different scenes, so that it adapts well to the scenes handled by this method. Those skilled in the art will appreciate that RPNet is only a preferred example, and the road condition semantic segmentation network of the invention can also be built on other suitable segmentation architectures (e.g., U-Net or SegNet).
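Because RPNet weights are not reproduced here, the per-frame segmentation of step 106 can be sketched with a publicly available stand-in; the DeepLabV3 model and the VOC-style class ids below are assumptions for illustration only (VOC has no "road" class, which the patent's driving-scene-trained RPNet would additionally segment).

import torch
import torchvision.transforms.functional as TF
from torchvision.models.segmentation import deeplabv3_resnet50

KEEP_CLASSES = (2, 6, 7, 14, 15)  # bicycle, bus, car, motorbike, person

model = deeplabv3_resnet50(weights="DEFAULT").eval()

@torch.no_grad()
def mask_frame(frame_rgb):
    """Return a binary mask frame: 1 = road condition object, 0 = background."""
    x = TF.to_tensor(frame_rgb).unsqueeze(0)              # 1 x 3 x H x W
    x = TF.normalize(x, mean=[0.485, 0.456, 0.406],
                     std=[0.229, 0.224, 0.225])
    labels = model(x)["out"].argmax(dim=1)[0]             # H x W class map
    keep = torch.zeros_like(labels, dtype=torch.bool)
    for cls in KEEP_CLASSES:
        keep |= labels == cls                             # mark object pixels
    return keep.byte().numpy()                            # uint8 0/1 mask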
Subsequently, at step 108, the frame sequence is masked using the mask frame sequence to remove background objects. As one example, the mask processing may further include binarizing the mask frame sequence (e.g., to values 0 and 1), resizing it to the original image size, and then masking the original image with the resized binary map so that background objects in the original image are replaced with black while road condition objects keep their original colors. It will be understood that replacing the background with black is only an example; any other color that helps distinguish background objects from road condition objects may be used. After this step, only road condition objects such as roads, people, and vehicles remain in the processed frame sequence, and the remaining background information (i.e., noise) is effectively filtered out by the masking.
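A minimal sketch of this masking step with OpenCV and NumPy, assuming (as in the embodiment above) that the binarized mask may need resizing back to the original resolution:

import cv2
import numpy as np

def apply_mask(original_bgr, mask_small):
    """Keep road condition pixels; paint background pixels black (step 108)."""
    h, w = original_bgr.shape[:2]
    mask = (mask_small > 0).astype(np.uint8)              # binarize to 0/1
    # Nearest-neighbour resizing keeps the mask strictly two-valued.
    mask = cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)
    return original_bgr * mask[:, :, None]                # broadcast over channels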
In step 110, feature extraction is performed on the masked frame sequence. As one example, various feature extraction algorithms and techniques may be applied; the sequence may be input into any existing or suitable feature extraction model, such as a convolutional or recurrent neural network model. On one hand, image features are extracted from each frame of the masked sequence by a convolutional neural network (CNN); on the other hand, to extract relational semantic information across the frame sequence, the extracted features can be input into a Long Short-Term Memory network (LSTM) to extract sequence semantic information. Finally, the image features extracted by the CNN and the sequence semantic information extracted by the LSTM are fused.
At step 112, the road condition is identified based on the extracted features. For example, the fused features are input into a fully connected neural network for road condition classification. In one example, road conditions can be classified as clear, general, or congested; those skilled in the art will appreciate that more or fewer classes may be provided as desired. The final output may be the predicted probability of each frame's road condition class.
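Steps 110 and 112 can be sketched together as one PyTorch module; ResNet-50 stands in for the pyramid-structured backbone, and the LSTM width and the three classes (clear, general, congested) follow the examples above, with all layer sizes being illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class RoadConditionNet(nn.Module):
    def __init__(self, num_classes=3, seq_dim=256):
        super().__init__()
        backbone = resnet50(weights="DEFAULT")
        backbone.fc = nn.Identity()               # keep 2048-d image features
        self.cnn = backbone
        self.lstm = nn.LSTM(2048, seq_dim, batch_first=True)
        self.head = nn.Linear(2048 + seq_dim, num_classes)

    def forward(self, frames):                    # frames: B x T x 3 x H x W
        b, t = frames.shape[:2]
        img_feats = self.cnn(frames.flatten(0, 1))         # (B*T) x 2048
        img_feats = img_feats.view(b, t, -1)               # B x T x 2048
        seq_feats, _ = self.lstm(img_feats)                # B x T x seq_dim
        fused = torch.cat([img_feats, seq_feats], dim=-1)  # concatenate fusion
        return self.head(fused).softmax(dim=-1)            # per-frame probabilities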
FIG. 2 shows a model framework of a deep learning network that may be used to perform frame sequence feature extraction according to one embodiment of the invention. As shown in FIG. 2, the feature extraction network combines convolutional and recurrent neural networks. The data at the input end is the image sequence from which background noise has been removed by semantic segmentation of the original images, for example a frame sequence in which the background has been replaced with black and only the road condition portion is kept. The input images first pass through an input layer composed of multiple convolutional and pooling layers for image feature extraction.
As one example, feature extraction may be implemented by several block modules. Because the features extracted by different convolutional layers contain different semantic and spatial information, each block module uses different convolutional layers to extract features at different levels, thereby preserving both the spatial and the semantic features of the images. As one example, block modules such as Inception v3, ResNet50, or EfficientNet may be employed here. Inception can run under strict memory and computation constraints and uses fewer parameters than VGGNet, only 1/36 of the VGGNet parameter count, which makes the Inception model applicable to a variety of big data scenarios. The output of the input layer is an image feature matrix for the input image frame sequence.
The road condition cannot be judged from the semantic and spatial features of individual images alone; only by also taking into account the sequence semantic information among the images can the real state of the road be reflected. Therefore, the image features extracted by the input layer are provided to a sequence feature extraction layer for sequence semantic information extraction. To extract the sequence semantic features efficiently, the sequence feature extraction layer may use a more expressive deep recurrent neural network (DRNN), in which each layer uses the LSTM variant of the RNN. The output of the DRNN is a sequence semantic feature matrix.
To effectively retain and use both the image semantic features extracted by the convolutional neural network and the sequence semantic information extracted by the recurrent neural network, the two can be fused by matrix concatenation, and the feature matrix of the current image sequence is finally output. For example, the image feature matrix of the input frame sequence extracted by the CNN and the sequence semantic feature matrix extracted by the DRNN may be joined by a concatenate operation. The fused feature matrix can then be input into a fully connected neural network for classification.
Fig. 3 is a schematic structural diagram of a road condition identification system 300 according to an embodiment of the present invention. As shown in FIG. 3, the system 300 may include a video capture module 302, a road condition identification module 304, and a communication module 306. As one example, the video capture module 302 may be a driving recorder, a camera mounted on a car, or another image capture device for capturing real-time road condition video. The road condition identification module 304 may be implemented as hardware or software integrated in the driving recorder or the vehicle for performing road condition identification on the real-time video captured by the video capture module 302, for example recognizing the current road condition by the method described above with reference to FIGS. 1 and 2 and outputting the predicted probability of each road condition class. The communication module 306 may also be integrated in the driving recorder or the vehicle and is configured to report the identification result and/or other related data output by the road condition identification module 304 to the cloud, for example to a road condition monitoring platform or another intelligent management platform (e.g., a smart city platform or city brain) under a traffic management department.
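A skeleton of how the three modules of system 300 might cooperate; the class and method names here are hypothetical and for illustration only.

class RoadConditionSystem:
    """Illustrative wiring of the modules of FIG. 3."""

    def __init__(self, capture, recognizer, uplink):
        self.capture = capture        # video capture module 302
        self.recognizer = recognizer  # road condition identification module 304
        self.uplink = uplink          # communication module 306

    def run_once(self):
        video = self.capture.record_clip()        # e.g. a 30 s segment
        result = self.recognizer.identify(video)  # per-frame class probabilities
        self.uplink.report(result)                # send to monitoring platform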
What has been described above includes examples of aspects of the claimed subject matter. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but one of ordinary skill in the art may recognize that many further combinations and permutations of the claimed subject matter are possible. Accordingly, the disclosed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims.

Claims (10)

1. A road condition identification method, comprising:
acquiring a video on which road condition identification is to be performed;
extracting frames from the acquired video to obtain a frame sequence;
semantically segmenting the frame sequence to obtain a mask frame sequence that distinguishes road condition objects from background objects, wherein the road condition objects comprise one or more of roads, pedestrians, and vehicles;
masking the frame sequence using the mask frame sequence to remove the background objects;
performing feature extraction on the masked frame sequence; and
identifying the road condition based on the extracted features.
2. The method of claim 1, wherein semantically segmenting the frame sequence to obtain a mask frame sequence that distinguishes road condition objects from background objects further comprises:
distinguishing the road condition objects from the background objects using a pre-trained road condition semantic segmentation model, which is a learning network with a residual pyramid structure.
3. The method of claim 1, wherein masking the frame sequence using the mask frame sequence to remove the background objects further comprises:
performing a binarization operation on the mask frame sequence;
resizing the binarized mask frame sequence to the size of the original image; and
masking the original image with the resized mask frame sequence to replace background objects in the original image with black.
4. The method of claim 1, wherein performing feature extraction on the masked frame sequence further comprises:
extracting image features from the masked frame sequence through a convolutional neural network to obtain an image feature matrix;
extracting sequence semantic features from the image features extracted by the convolutional neural network through a recurrent neural network to obtain a sequence semantic feature matrix; and
fusing the obtained image feature matrix and sequence semantic feature matrix to obtain a fused sequence feature matrix.
5. The method of claim 1, wherein identifying the road condition based on the extracted features further comprises:
inputting the extracted features into a pre-trained road condition classifier model to obtain a road condition identification result.
6. A road condition identification system, comprising:
a video capture module configured to capture real-time road condition video;
a road condition identification module configured to:
extract frames from the real-time road condition video captured by the video capture module to obtain a frame sequence;
semantically segment the frame sequence to obtain a mask frame sequence that distinguishes road condition objects from background objects, wherein the road condition objects comprise one or more of roads, pedestrians, and vehicles;
mask the frame sequence using the mask frame sequence to remove the background objects;
perform feature extraction on the masked frame sequence; and
identify the road condition based on the extracted features; and
a communication module configured to send the identification result of the road condition identification module to a road condition monitoring platform.
7. The system of claim 6, wherein semantically segmenting the frame sequence to obtain a mask frame sequence that distinguishes road condition objects from background objects further comprises:
distinguishing the road condition objects from the background objects using a pre-trained road condition semantic segmentation model, which is a learning network with a residual pyramid structure.
8. The system of claim 6, wherein masking the frame sequence using the mask frame sequence to remove the background objects further comprises:
performing a binarization operation on the mask frame sequence;
resizing the binarized mask frame sequence to the size of the original image; and
masking the original image with the resized mask frame sequence to replace background objects in the original image with black.
9. The system of claim 6, wherein performing feature extraction on the masked frame sequence further comprises:
extracting image features from the masked frame sequence through a convolutional neural network to obtain an image feature matrix;
extracting sequence semantic features from the image features extracted by the convolutional neural network through a recurrent neural network to obtain a sequence semantic feature matrix; and
fusing the obtained image feature matrix and sequence semantic feature matrix to obtain a fused sequence feature matrix.
10. The system of claim 6, wherein identifying the road condition based on the extracted features further comprises:
inputting the extracted features into a pre-trained road condition classifier model to obtain a road condition identification result.
CN202110436451.XA 2021-04-22 2021-04-22 Road condition identification method and system based on semantic segmentation and neural network technology Pending CN113850111A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110436451.XA CN113850111A (en) 2021-04-22 2021-04-22 Road condition identification method and system based on semantic segmentation and neural network technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110436451.XA CN113850111A (en) 2021-04-22 2021-04-22 Road condition identification method and system based on semantic segmentation and neural network technology

Publications (1)

Publication Number Publication Date
CN113850111A true CN113850111A (en) 2021-12-28

Family

ID=78973006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110436451.XA Pending CN113850111A (en) 2021-04-22 2021-04-22 Road condition identification method and system based on semantic segmentation and neural network technology

Country Status (1)

Country Link
CN (1) CN113850111A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332140A (en) * 2022-03-16 2022-04-12 北京文安智能技术股份有限公司 Method for processing traffic road scene image


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220126

Address after: Room 1423, No. 1256 and 1258, Wanrong Road, Jing'an District, Shanghai 200072

Applicant after: Tianyi Digital Life Technology Co.,Ltd.

Address before: 201702 3rd floor, 158 Shuanglian Road, Qingpu District, Shanghai

Applicant before: Tianyi Smart Family Technology Co.,Ltd.