CN115171034A - Road foreign matter detection method, and method and device for detecting foreign matters in scene - Google Patents


Info

Publication number
CN115171034A
Authority
CN
China
Prior art keywords
target
video frame
road
dynamic target
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210631102.8A
Other languages
Chinese (zh)
Inventor
毛泉涌
危春波
周橹楠
杨吉锐
吴婷
邓兵
梁桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210631102.8A
Publication of CN115171034A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G06V40/25 Recognition of walking or running movements, e.g. gait recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles

Abstract

The embodiments of the present application provide a method and an apparatus for detecting foreign matter on a road, including: acquiring a road video shot of a road scene and a long-term modeling background image of the road video, where the long-term modeling background image represents the road scene without dynamic targets, a dynamic target being a target whose position shifts within a period of time; acquiring a candidate region where a dynamic target is located in a video frame; and inputting the candidate region and its corresponding comparison region in the long-term modeling background image into a comparison detection model, obtaining a judgment result of whether the dynamic target in the video frame is a foreign object, and obtaining the position of the foreign object in the road scene. The image detection method provided by the application identifies foreign matter in a road scene based on the idea of contrastive learning. Because the application does not perform target detection on the full video frame, the amount of computation is smaller than in full-image target detection schemes; in addition, the recognition accuracy of the comparison detection model does not depend on training with a large number of labeled samples, so the training cost of the model is lower.

Description

Road foreign matter detection method, and method and device for detecting foreign matters in scene
Technical Field
The present disclosure relates to the field of image recognition technologies, and in particular, to a method and an apparatus for detecting a foreign object on a road, an electronic device, and a machine-readable medium.
Background
Roads carry important vehicle traffic capacity, but spilled foreign objects (e.g., debris discarded from vehicles, spilled cargo, overturned or deformed vehicles, discarded tires) appear on roads at random and seriously affect driving safety, so foreign objects on lanes need to be identified for early warning and disposal.
In the related art, a camera can be used to photograph the road surface, and an image recognition model based on deep learning performs target detection on the entire image of each captured video frame, so as to determine the position and type of every object in the image and then pick out objects belonging to foreign-object types as the foreign matter on the road (e.g., an expressway).
However, the inventors found through research that in the conventional scheme, where foreign objects are recognized by a deep-learning-based image recognition model, obtaining an accurate object type depends excessively on the model's performance, so the model training cost is high; in addition, full-image target detection requires a huge amount of computation, which increases the computing cost.
Disclosure of Invention
The embodiments of the present application provide a road foreign matter detection method and a method for detecting foreign matter in a scene, so as to solve the problem of high model training cost and high computation cost in the related art.
Correspondingly, the embodiments of the present application also provide a road foreign matter detection apparatus, an apparatus for detecting foreign matter in a scene, an electronic device, and a storage medium, so as to ensure the implementation and application of the above methods.
In order to solve the above problem, an embodiment of the present application discloses a method for detecting a foreign object on a road, including:
acquiring a road video shot of a road scene and a long-term modeling background image of the road video; the long-term modeling background image is used for representing the road scene without a dynamic target; the dynamic target is a target whose position shifts within a period of time;
acquiring a candidate area where the dynamic target is located in a video frame of the road video;
inputting the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model to obtain a judgment result of whether the dynamic target in the video frame is a foreign object, wherein the comparison detection model is a machine learning model;
and acquiring the position of the foreign matter in the road scene according to the judgment result of the dynamic target.
Optionally, training an initial model with the training data set to obtain the comparison detection model includes:
performing a data enhancement operation on the training images in the training data set to increase the number of training images and obtain a target training data set, wherein the data enhancement operation includes at least one of cropping, resizing, recoloring, and color distortion (an illustrative sketch of these operations follows this claim);
pre-training the initial model through the target training data set to obtain a first model;
performing fine tuning training on the first model through a target training image in the target training data set to obtain a second model;
and performing network distillation training with the training images in the target training data set that are not labeled with class labels, the first model, and the second model, to obtain the comparison detection model.
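The following is a minimal, non-authoritative sketch of the data enhancement operations named above (cropping, resizing, recoloring, color distortion), assuming a Python/torchvision pipeline; the specific transforms, parameters, and probabilities are illustrative assumptions rather than values specified by this application.

    # Hypothetical augmentation pipeline for contrastive pre-training;
    # all parameter values are illustrative assumptions.
    from torchvision import transforms

    contrastive_augment = transforms.Compose([
        transforms.RandomResizedCrop(224),                       # cropping + resizing
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)],
                               p=0.8),                           # color distortion
        transforms.RandomGrayscale(p=0.2),                       # recoloring
        transforms.ToTensor(),
    ])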
The embodiment of the application discloses a method for detecting foreign matters in a scene, which comprises the following steps:
acquiring a target video shot of a target scene and a long-term modeling background image of the target video; the long-term modeling background image is used for representing the target scene without a dynamic target; the dynamic target is a target whose position shifts within a period of time;
acquiring a candidate area where the dynamic target is located in a video frame of the target video;
inputting the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model to obtain a judgment result of whether the dynamic target in the video frame is a foreign object, wherein the comparison detection model is a machine learning model;
and acquiring the position of the foreign matter in the target scene according to the judgment result of the dynamic target.
An embodiment of the present application discloses a road foreign matter detection apparatus, the apparatus including:
a first acquisition module, configured to acquire a road video shot of a road scene and a long-term modeling background image of the road video; the long-term modeling background image is used for representing the road scene without a dynamic target; the dynamic target is a target whose position shifts within a period of time;
a second acquisition module, configured to acquire a candidate region where the dynamic target is located in a video frame of the road video;
a first comparison module, configured to input the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model to obtain a judgment result of whether the dynamic target in the video frame is a foreign object, wherein the comparison detection model is a machine learning model;
and a first identification module, configured to acquire the position of the foreign object in the road scene according to the judgment result of the dynamic target.
An embodiment of the present application discloses an apparatus for detecting foreign matter in a scene, the apparatus including:
a third acquisition module, configured to acquire a target video shot of a target scene and a long-term modeling background image of the target video; the long-term modeling background image is used for representing the target scene without a dynamic target; the dynamic target is a target whose position shifts within a period of time;
a fourth acquisition module, configured to acquire a candidate region where the dynamic target is located in a video frame of the target video;
a second comparison module, configured to input the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model to obtain a judgment result of whether the dynamic target in the video frame is a foreign object, wherein the comparison detection model is a machine learning model;
and a second identification module, configured to acquire the position of the foreign object in the target scene according to the judgment result of the dynamic target.
The embodiment of the application also discloses an electronic device, which comprises: a processor; and a memory having executable code stored thereon, which when executed, causes the processor to perform a method as described in one or more of the embodiments of the application.
Embodiments of the present application also disclose one or more machine-readable media having executable code stored thereon that, when executed, cause a processor to perform a method as described in one or more of the embodiments of the present application.
Compared with the related art, the embodiment of the application has the following advantages:
in the embodiments of the present application, a long-term modeling background image obtained by modeling the road video is acquired, dynamic targets in the road scene are extracted, and, based on the idea of contrastive learning, a comparison detection model compares the difference between the candidate region of a dynamic target and its corresponding comparison region in the long-term modeling background image, so as to identify foreign matter in the road scene.
Drawings
FIG. 1 is a system architecture diagram of an embodiment of the present application;
FIG. 2 is a schematic diagram of a road scene according to an embodiment of the present application;
FIG. 3 is a long-term modeling background diagram according to an embodiment of the present application;
FIG. 4 is a short-term modeling background diagram according to an embodiment of the present application;
fig. 5 is a schematic diagram of an implementation of foreign object detection in an indoor scene according to an embodiment of the present application;
fig. 6 is a schematic diagram of implementation of foreign object detection in a parking lot scene according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating steps of a method for detecting a foreign object on a road according to an embodiment of the present disclosure;
FIG. 8 is a flowchart illustrating specific steps of a method for detecting a foreign object on a road according to an embodiment of the present disclosure;
FIG. 9 is a flowchart illustrating steps of a method for detecting a foreign object in a scene according to an embodiment of the present application;
fig. 10 is a block diagram of a road foreign matter detection apparatus according to an embodiment of the present application;
FIG. 11 is a block diagram of a foreign object detection apparatus in a scene according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
To enable those skilled in the art to better understand the present application, the following description is provided to illustrate the concepts related to the present application:
road foreign matter: also referred to as road sprinklers, are objects that affect driving safety that are thrown onto the road, such as, for example, discarded garbage from vehicles on the road, thrown cargo, dropped stones, dumped or deformed vehicles, discarded tires, and the like.
Background: the set of objects that do not change or hardly change in a scene, such as a road scene, includes objects included in the background, such as the road itself, a tree beside the road, and traffic signs on both sides of the road.
And (3) prospect: objects outside the background in the scene can be used as interested objects for subsequent analysis and pressure application, and specifically can be moving objects, special objects and the like in the scene.
Modeling a background graph: the method for detecting the moving target in the video image has the basic idea that a background scene represented in the video is modeled and can be realized in a mixed Gaussian modeling mode.
Dynamic targeting: moving objects in a scene appear in a video as objects that appear in different consecutive frames and are offset in position, such as vehicles, people moving on a road, etc.
Comparative Learning (contrast Learning): the method focuses on learning common features among similar examples and distinguishing different techniques among non-similar examples, and the comparison learning does not pay attention to how to determine specific class characteristics of the features, so that the method can be particularly applied to unsupervised model training scenes with few samples and achieves a high-precision model effect through few sample labeling and unsupervised training modes.
A lane area: polygonal outlines of lanes on the road, and each polygon may have a corresponding lane number, either added manually or through machine learning.
And (3) mechanical and non-human detection: and performing target detection of motor vehicles, non-motor vehicles and pedestrians in the images by using a deep learning method.
In the embodiments of the present application, the road foreign matter detection method can be applied to the detection of foreign matter on a road. Specifically, in a road scene, a fixed road section is shot by a shooting device to obtain a road video, and background modeling is performed on the road video to obtain a long-term modeling background image representing the road scene without dynamic targets. Foreground objects can then be extracted based on the long-term modeling background image, non-foreign objects (such as motor vehicles, non-motor vehicles, and people) among the foreground objects are screened out using filtering rules related to the road scene, and finally the candidate regions where the remaining foreground objects are located are extracted from the video frame. Each candidate region, together with its corresponding comparison region in the long-term modeling background image, is then compared: the greater the difference between the candidate region and the comparison region of a foreground object, the less likely the foreground object is a background object of the scene, and the higher the probability that it is a foreign object.
Thus, foreign matter in the road scene is identified by comparing the differences between foreground objects and background objects, following the idea of contrastive learning. Because no target detection is performed on the full video frame, the amount of computation is smaller than in full-image target detection schemes; in addition, the recognition accuracy of contrastive learning does not depend on training with a large number of labeled samples, so the training cost of the model is low.
Referring to fig. 1, a system architecture diagram provided by an embodiment of the present application is shown, including a detection server and a client. The detection server includes: a background modeling module, a non-foreign-object target filtering module, a comparison detection module, and a training module.
Referring further to fig. 2, a schematic diagram of a road scene provided in an embodiment of the present application is shown, including: a photographing apparatus 10 disposed near the road, street lamps 21 on both sides of the road, traffic signs 22, a vehicle 31, a vehicle 32, goods 40 dropped from the vehicle 31, and the like. The photographing apparatus 10 can photograph the road section to obtain the road video.
Specifically, the background modeling module may establish a long-term modeling background image (obtained by modeling with a large learning rate) and a short-term modeling background image (obtained by modeling with a small learning rate) based on the road video, through a preset background modeling algorithm (e.g., a Gaussian mixture modeling algorithm) with different learning rates. The learning rate reflects the weight-update speed during modeling: the lower the learning rate, the slower the weights are updated and the slower the loss function changes during modeling. The long-term modeling background image is modeled with the larger learning rate and captures the background objects in the scene while ignoring the moving objects, so it is used to represent the road scene without dynamic targets. Based on the scene shown in fig. 2, the long-term modeling background image may take the form of fig. 3, which contains background objects such as the road, the street lamps 21 and the traffic signs 22, but does not contain moving objects such as the vehicle 31, the vehicle 32 and the dropped goods 40. The short-term modeling background image is modeled with the small learning rate and captures both background objects and moving objects in the scene; based on the scene shown in fig. 2, it may take the form of fig. 4. Both background images may be regenerated at a preset time interval (for example, every 10 seconds), which keeps them up to date.
It can be seen that, since the long-term modeling background image represents the road scene without dynamic targets while the short-term modeling background image represents the road scene including dynamic targets, the dynamic targets in the scene (also called foreground targets, i.e., targets whose position shifts within a period of time) can be extracted by differencing the long-term and short-term modeling background images, so as to determine the candidate regions where the dynamic targets are located in the video frames of the road video.
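As a minimal sketch of how the two background images and their difference could be maintained, assuming OpenCV's Gaussian mixture background subtractor; the learning-rate values and the use of two parallel models are illustrative assumptions, not the implementation specified by this application.

    # Sketch: maintain long-term and short-term Gaussian mixture background
    # models with different learning rates, then difference their backgrounds.
    import cv2

    bg_long = cv2.createBackgroundSubtractorMOG2(history=500)   # larger learning rate
    bg_short = cv2.createBackgroundSubtractorMOG2(history=500)  # smaller learning rate

    def update_backgrounds(frame, lr_long=0.05, lr_short=0.001):
        # apply() updates each model; learningRate controls how fast the
        # background adapts (first rate > second rate, as described above).
        bg_long.apply(frame, learningRate=lr_long)
        bg_short.apply(frame, learningRate=lr_short)
        long_map = bg_long.getBackgroundImage()    # scene without dynamic targets
        short_map = bg_short.getBackgroundImage()  # scene including dynamic targets
        diff = cv2.absdiff(long_map, short_map)    # pixel difference map
        return long_map, short_map, diff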
The non-foreign-object target filtering module can pre-filter the extracted dynamic targets to screen out non-foreign-object targets that do not affect driving safety. Several filtering modes are possible. In one mode, since motor vehicles, non-motor vehicles, and people on the road are normal road targets and do not belong to the foreign-object class, motor-vehicle targets, non-motor-vehicle targets, and person targets among the dynamic targets can be identified and screened out. In another mode, referring to fig. 2, since the embodiments of the present application focus on identifying foreign matter inside the lane area 50, dynamic targets outside the lane area 50 can be filtered out. In another mode, since foreign matter that endangers driving safety is an object that remains on the road surface for a long time, dynamic targets that exist only briefly can be filtered out. In yet another mode, a standard size range of foreign objects can be determined from the sizes of a large number of objects already judged to be foreign objects, and dynamic targets whose size falls outside this standard size range can be filtered out. Through this preliminary filtering, the non-foreign-object targets among the dynamic targets can be quickly removed with little computation; the remaining dynamic targets are then input into the comparison detection model for foreign-object judgment. The filtering operation reduces the computation required by the comparison detection model and improves recognition efficiency.
The comparison detection model can be obtained by the training module using partially labeled training data. Contrastive learning is a technique for learning the common features of similar instances and distinguishing the differences between dissimilar instances. Because contrastive learning does not concern itself with determining the specific class representation of features, only a small fraction of the training samples need labels (for example, 10% of the samples are labeled, i.e., class labels are added to 10% of the samples); the labeled samples assist model training in the fine-tuning stage so that the model matches the road foreign matter recognition scene. The trained comparison detection model takes two images as input and outputs their similarity. The comparison detection model can be a machine learning model; machine learning (ML) is a multi-disciplinary field that studies how computers simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance.
After the comparison detection model is obtained through training, the candidate region where the dynamic target is located in the video frame and the corresponding comparison region of the candidate region in the long-term modeling background image can be input into the comparison detection model. The model calculates the similarity between the candidate region and the comparison region: when the similarity is less than or equal to a first similarity threshold, the dynamic target in the video frame is judged to be a foreign object; when the similarity is greater than or equal to a second similarity threshold, the dynamic target is judged not to be a foreign object.
Specifically, since the comparison region reflects the background content corresponding to the dynamic target, the more similar the candidate region and the comparison region, the more likely the dynamic target is a background object rather than a foreign object; the greater the difference between them, the more likely the dynamic target is a foreign object rather than a background object. For example, in fig. 2, the moving vehicle 31 is filtered out before the comparison judgment; and even if the fixed street lamp 21 or traffic sign 22 is mistakenly identified as a dynamic target due to lighting changes, camera shake, or other causes, it will be found in the comparison judgment to be highly similar to its corresponding comparison region in the long-term modeling background image and will finally be judged a non-foreign object. This improves recognition accuracy and reduces the probability of misjudgment.
Based on the judgment result of whether a dynamic target in the road video is a foreign object, the dynamic target reported as a foreign object can subsequently be tracked in real time, and the dynamic target judged to be a foreign object is sent to the client so that road maintenance personnel using the client can go to the scene to investigate and handle the foreign object.
It should be noted that the embodiments of the present application further provide a method for detecting foreign matter in a scene; several other scenes in which the method can be implemented are as follows:
In one implementation, referring to fig. 5, an implementation schematic diagram of foreign object detection in an indoor scene provided by an embodiment of the present application is shown, including a client and a detection server. The indoor scene 50 includes: an indoor photographing apparatus 11, a table 51 (a background object), and a soccer ball 52 (a foreign object). The detection server may be configured as shown in fig. 1 with a built-in comparison detection model. Dynamic targets in the scene are determined based on the indoor video shot by the indoor photographing apparatus 11, the long-term modeling background image obtained from the indoor video, and the idea of contrastive learning. By comparing the features of a dynamic target in a video frame with its corresponding comparison region in the long-term modeling background image, the detection server detects the table 51, whose position is stable and unchanged, as a background target within the detection area 53 of the indoor scene 50, and detects the soccer ball 52 rolling into the detection area 53 as a foreign object, thereby realizing foreign object detection in the indoor scene 50. Based on the judgment of whether a dynamic target in the indoor video is a foreign object, the dynamic target reported as a foreign object can subsequently be tracked in real time, and the dynamic target judged to be a foreign object is sent to the client for subsequent handling.
In another implementation, referring to fig. 6, an implementation schematic diagram of foreign object detection in a parking lot scene provided by an embodiment of the present application is shown, including a client and a detection server. The parking lot scene 60 includes: a parking lot photographing apparatus 12, a vehicle 61 (a background object or non-foreign object), and trash 62 (a foreign object). The detection server may be configured as shown in fig. 1 with a built-in comparison detection model. Dynamic targets in the scene are determined based on the parking lot video shot by the parking lot photographing apparatus 12, the long-term modeling background image obtained from the parking lot video, and the idea of contrastive learning. By comparing a dynamic target in a video frame with its corresponding comparison region in the long-term modeling background image, the parked vehicle 61 in the parking area 63 of the parking lot scene 60 is detected as a background target (a non-foreign object), and the trash 62 thrown into the parking area 63 is detected as a foreign object, thereby realizing foreign object detection in the parking lot scene 60. Based on the judgment of whether a dynamic target in the parking lot video is a foreign object, the dynamic target reported as a foreign object can subsequently be tracked in real time and sent to the client for subsequent handling, keeping the parking lot clean and tidy and improving the parking experience.
It should be noted that, in the embodiments of the present application, the acquisition of the road video, the long-term modeling background image, and the other information, signals, or data used in the process is performed in compliance with the data protection laws and policies of the corresponding country and with authorization from the owner of the corresponding device.
In the embodiments of the present application, a long-term modeling background image obtained by modeling the road video is acquired, dynamic targets in the road scene are extracted, and, based on the idea of contrastive learning, a comparison detection model compares the difference between the candidate region of a dynamic target and its corresponding comparison region in the long-term modeling background image, so as to identify foreign matter in the road scene.
Referring to fig. 7, which shows a flowchart of steps of a road foreign object detection method provided in an embodiment of the present application, including:
step 101, acquiring a road video shot aiming at a road scene and a long-term modeling background image of the road video.
The long-term modeling background image is used for representing a road scene which does not contain a dynamic target; the dynamic target is a target with a position shifted in a period of time.
In the embodiments of the present application, to perform foreign matter detection on a road scene, a shooting device with a fixed shooting angle first needs to be installed in the road scene so that it can capture a road video of the scene. To facilitate the subsequent extraction of dynamic targets (foreground targets) in the scene, the road video can be modeled based on a Gaussian mixture modeling algorithm to obtain the long-term modeling background image (as shown in fig. 3).
Step 102, acquiring a candidate region where the dynamic target is located in a video frame of the road video.
In the embodiments of the present application, a short-term modeling background image (as shown in fig. 4) can be obtained based on a smaller learning rate while obtaining the long-term modeling background image. The short-term modeling background image captures both background targets and moving targets in the scene. By differencing the long-term and short-term modeling background images, a difference map containing the dynamic targets is obtained, and the dynamic targets in the scene (also called foreground targets, i.e., targets whose position shifts within a period of time) can be extracted from the difference map. Further, because the modeled background images, the difference map, and the video frames of the captured video all have the same size, the candidate region where a dynamic target is located can be determined in the video frame of the road video by mapping the region of the dynamic target from the difference map to the video frame. It should be noted that dynamic targets in a scene may also be identified in other ways, which is not limited in the embodiments of the present application.
Step 103, inputting the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model, and obtaining a determination result whether the dynamic target in the video frame is a foreign object.
Wherein, the comparison detection model is a machine learning model.
After the comparison detection model is obtained by training on a training data set containing a small number of labeled samples and a large number of unlabeled samples, the candidate region where the dynamic target is located in the video frame and the comparison region corresponding to the candidate region in the long-term modeling background image can be input into the comparison detection model. The model calculates the similarity between the candidate region and the comparison region; when the similarity is less than or equal to a first similarity threshold, the dynamic target in the video frame is judged to be a foreign object, and when the similarity is greater than or equal to a second similarity threshold, the dynamic target is judged not to be a foreign object.
Specifically, since the contrast area reflects the background object corresponding to the dynamic object, the more similar the candidate area and the contrast area are, the more likely the dynamic object is to be a background object that is not a foreign object, and the larger the difference between the candidate area and the contrast area is, the more likely the dynamic object is not a background object but a foreign object.
Further, the comparison detection model may take the two regions to be compared as input, extract the coding features of each region, calculate the cosine distance between the two coding features, and take this cosine distance as the similarity between the candidate region and the comparison region.
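A minimal sketch of this comparison step, assuming the model exposes an encoder that maps an image region to a feature vector; the function names and threshold values are illustrative assumptions:

    # Sketch: cosine similarity between the coding features of the candidate
    # region and the comparison region, with two decision thresholds.
    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def judge_foreign_object(candidate_feat, contrast_feat,
                             first_threshold=0.5, second_threshold=0.8):
        sim = cosine_similarity(candidate_feat, contrast_feat)
        if sim <= first_threshold:   # dissimilar to background -> foreign object
            return True
        if sim >= second_threshold:  # similar to background -> not a foreign object
            return False
        return None                  # between thresholds: undecided in this sketch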
Step 104, acquiring the position of the foreign object in the road scene according to the judgment result of the dynamic target.
In the embodiments of the present application, based on the judgment result of whether a dynamic target in the road video is a foreign object, the dynamic target reported as a foreign object can subsequently be tracked in real time, and the dynamic target judged to be a foreign object is sent to the client so that road maintenance personnel using the client can go to the scene to investigate and handle the foreign object according to its position. In addition, the dynamic target judged to be a foreign object can also be sent to a management server for foreign object notification, and the management server can assign personnel to handle it.
Recognizing and handling foreign matter on the road can effectively reduce the probability of traffic accidents caused by foreign matter; on the basis of maintaining driving safety, it keeps the road surface clean and unobstructed. The embodiments of the present application can also assist road managers in achieving efficient road management and reduce the workload of road patrol staff. In addition, because foreign matter is identified through the idea of contrastive learning rather than by classifying image features, even targets with inconspicuous features, or targets that are hard to recognize due to camera shake and lighting, can be judged by comparison. Compared with judging foreign matter by classifying image features as in the related art, this comparison-based judgment provides higher accuracy and reduces the probability of false detection.
In summary, in the embodiments of the present application, a long-term modeling background image obtained by modeling the road video is acquired and the dynamic targets in the road scene are extracted, so that, based on the idea of contrastive learning, a comparison detection model compares the difference between the candidate region of a dynamic target and its corresponding comparison region in the long-term modeling background image, thereby identifying foreign matter in the road scene.
Referring to fig. 8, which shows a flowchart of specific steps of a road foreign object detection method provided in an embodiment of the present application, including:
and step 201, acquiring the long-term modeling background image through a preset background modeling algorithm at preset time intervals based on the first learning rate and the road video.
In the embodiments of the present application, the duration of the preset time interval may be set according to actual requirements, for example, 10 seconds. The smaller the preset time interval, the more up to date the modeled background image and the higher the detection accuracy; the larger the preset time interval, the less up to date the modeled background image and the lower the detection accuracy.
Since the learning rate reflects the weight-update speed during modeling (the lower the learning rate, the slower the weights are updated and the slower the loss function changes), the long-term modeling background image is modeled with the larger first learning rate, which captures the background objects in the scene and ignores the moving objects. The long-term modeling background image therefore represents the road scene without dynamic targets. Based on the scene shown in fig. 2, it can be as shown in fig. 3, which includes background objects such as the road, the street lamps 21, and the traffic signs 22, but does not include moving objects such as the vehicle 31, the vehicle 32, and the dropped goods 40.
Step 202, acquiring a short-term modeling background image through the background modeling algorithm at every preset time interval based on a second learning rate and the road video, wherein the short-term modeling background image is used for representing the road scene containing the dynamic target; the first learning rate is greater than the second learning rate.
In this step, the short-term modeling background image is modeled with the smaller second learning rate and captures both the background objects and the moving objects in the scene; based on the scene shown in fig. 2, it may take the form of fig. 4.
Step 203, obtaining a pixel difference map according to the difference between the pixel values of the long-term modeling background image and the pixel values of the short-term modeling background image.
In the embodiments of the present application, since the long-term modeling background image represents the road scene without dynamic targets and the short-term modeling background image represents the road scene containing dynamic targets, differencing the two images cancels, by subtraction of pixel values, the background targets present in both, yielding a pixel difference map in which the background targets are eliminated and the dynamic targets (foreground targets) remain.
Step 204, obtaining the first region of the pixel difference map where the dynamic target is located.
In the embodiments of the present application, since the background targets have been eliminated from the pixel difference map, the first region where the dynamic target is located can be obtained from the pixel values of the pixel difference map or through image feature recognition.
Optionally, in an implementation manner, step 204 may specifically include:
substep 2041, performing binarization processing on the pixel difference image to obtain a binarization image; and the pixel value in the binary image is a first numerical value or a second numerical value, the first numerical value corresponds to the dynamic target, and the second numerical value corresponds to the scene background.
Substep 2042, obtaining a first region in the pixel difference map according to the pixel values of the binarized map.
In the embodiments of the present application, to identify the dynamic target in the pixel difference map more accurately, binarization processing may first be performed on the pixel difference map to obtain a binarized map. Binarization changes the pixel value of each point in the pixel difference map to 255 or 0; the principle is that pixel values in the region where the dynamic target is located are approximately close to 255, while pixel values outside the dynamic target cancel out in the subtraction and are close to 0.
In one implementation, a threshold at the median position of the range [0, 255] (e.g., 128) may be set; pixel values in the pixel difference map smaller than the threshold are changed to 0, and pixel values larger than the threshold are changed to 255, thereby obtaining the binarized map.
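A brief sketch of this binarization, assuming OpenCV; the median threshold of 128 follows the example above:

    # Sketch: threshold the pixel difference map into a 0/255 binarized map.
    import cv2

    def binarize(diff_map, threshold=128):
        if diff_map.ndim == 3:                                   # color -> grayscale
            diff_map = cv2.cvtColor(diff_map, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(diff_map, threshold, 255, cv2.THRESH_BINARY)
        return binary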
Optionally, in another implementation manner, step 204 may specifically include:
substep 2043, performing binarization processing on the pixel difference image to obtain a binarization image; and the pixel value in the binary image is a first numerical value or a second numerical value, the first numerical value corresponds to the dynamic target, and the second numerical value corresponds to the scene background.
And a substep 2044 of performing morphological processing on the binarized image, and completely filling the incomplete area in the area of the dynamic target in the binarized image to obtain a filled image.
Substep 2045, performing region communication processing on the filling map, communicating different regions belonging to the same dynamic target in the filling map, and separating connected regions containing different dynamic targets in the filling map to obtain a to-be-identified map.
Substep 2046, obtaining a first region in the pixel difference map according to the pixel values of the image to be identified.
In practice, the obtained binarized map has defects: for example, the area where a dynamic target is located may be incomplete, fragment areas that belong to the same dynamic target may be isolated from each other, and fragment areas of different dynamic targets may lie within a single connected region. These defects impair the integrity of the dynamic targets in the binarized map; eliminating them greatly improves the recognition accuracy of dynamic targets.
Specifically, in the embodiments of the present application, morphological processing is first performed on the binarized map to completely fill incomplete patch areas within the regions of dynamic targets, yielding a filled map; region connectivity processing is then performed on the filled map, connecting the separate patch areas that belong to the same dynamic target and separating connected patch areas that contain different dynamic targets, yielding the map to be identified.
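An illustrative sketch of sub-steps 2044-2045 using OpenCV morphology and connected-component analysis; the kernel size and minimum area are assumptions:

    # Sketch: fill incomplete patches (morphological closing), then use
    # connected components to merge fragments of one target and separate
    # different targets, returning candidate bounding boxes.
    import cv2

    def extract_target_regions(binary, min_area=100):
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
        filled = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # filled map
        n, labels, stats, _ = cv2.connectedComponentsWithStats(filled)
        boxes = []
        for i in range(1, n):                     # label 0 is the scene background
            x, y, w, h, area = stats[i]
            if area >= min_area:                  # drop tiny noise fragments
                boxes.append((x, y, w, h))
        return boxes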
Step 205, mapping the first region in the pixel difference map to the video frame to obtain a candidate region in the video frame.
In the embodiments of the present application, since the pixel difference map has the same size as the video frame, the candidate region of the dynamic target in the video frame can be obtained by mapping the first region of the dynamic target from the pixel difference map to the video frame.
It should be noted that, after the candidate regions of the dynamic targets have been determined in all video frames, target tracking based on the overlap degree (IoU, intersection over union) may be performed to associate the candidate regions of the same dynamic target across the multiple video frames of the road video; the aim is to identify each dynamic target at the level of the whole road video.
Specifically, the overlap degree between candidate regions of dynamic targets in different video frames can be calculated: the larger the overlap degree, the more likely the dynamic targets in the different video frames are the same dynamic target; the smaller the overlap degree, the more likely they are different dynamic targets.
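A small sketch of the overlap computation used for this association; boxes are (x, y, w, h) in pixels, and the association threshold mentioned below is an assumption, since no value is given here:

    # Sketch: IoU (intersection over union) between two boxes.
    def iou(box_a, box_b):
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        ih = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = iw * ih
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0

Candidate regions in consecutive frames could then be treated as the same dynamic target when their IoU exceeds a chosen threshold.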
Step 206, inputting the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model, and obtaining a determination result whether the dynamic target in the video frame is a foreign object.
This step may specifically refer to step 103, which is not described herein.
Optionally, step 206 may specifically include:
substep 2061, inputting the candidate region and the comparison region into a comparison detection model, and calculating the similarity between the candidate region and the comparison region.
Substep 2062, when the similarity is less than or equal to a first similarity threshold, obtaining from the comparison detection model the judgment result that the dynamic target in the video frame is a foreign object.
Substep 2063, when the similarity is greater than or equal to a second similarity threshold, obtaining from the comparison detection model the judgment result that the dynamic target in the video frame is not a foreign object.
Wherein, the comparison detection model is a machine learning model.
In the embodiment of the present application, since the contrast area reflects the background object corresponding to the dynamic object, the more similar the candidate area and the contrast area are, the more likely the dynamic object is to be a background object that is not a foreign object, and the more different the candidate area and the contrast area is, the more likely the dynamic object is not a background object but a foreign object.
Specifically, the comparison detection model may calculate a similarity between the candidate region and the comparison region, and obtain a determination result that the dynamic object in the video frame is a foreign object when the similarity is less than or equal to a first similarity threshold. And when the similarity is greater than or equal to the second similarity threshold, judging that the dynamic target in the video frame is not the foreign object.
Step 207, acquiring the position of the foreign object in the road scene according to the judgment result of the dynamic target.
This step may specifically refer to step 104, which is not described herein.
Optionally, a lane area to which a lane belongs is marked in the video frame of the road video, and the method may further include:
and 208, screening a first candidate region where a first dynamic target except the lane region is located from the candidate regions contained in the video frame marked with the lane region.
In the embodiment of the application, the obtained dynamic target can be subjected to preliminary filtering, and non-foreign object targets which do not influence driving safety are screened out. Since the lane foreign matter identification scene needs to pay attention to the foreign matter falling from the lane, the lane area 50 can be determined in the lane scene shown in fig. 2, and for the road video obtained by shooting by the shooting device, the lane area can be marked in the video frame of the road video, and the marking can be manually completed, or can be automatically identified and marked by using a deep learning technology.
Aiming at the video frame of the road video marked with the lane area, the embodiment of the application can screen out the first candidate area where the first dynamic target is located from the candidate areas contained in the video frame, so that the dynamic target in the non-attention area is screened out in advance through preliminary filtering, and the calculation amount of subsequent comparison judgment is reduced.
Optionally, after step 208, the method may further include:
step 209, obtaining a second candidate region of a second dynamic target in the lane region and a generation time of the second dynamic target, wherein the generation time is determined by the number of video frames containing the second dynamic target in the road video.
And 210, screening out a third candidate region of a third dynamic target with the generation time being greater than or equal to a preset time threshold from the second candidate region of the second dynamic target in the lane region.
In the embodiment of the present application, in relation to steps 209 to 210, the foreign object that endangers driving safety is an object that exists on the road surface for a long time, so that a dynamic object with a short generation time can be filtered from the dynamic objects, specifically, the generation time of the target object is determined by counting the number of video frames containing the dynamic object in the road video, and the larger the number of video frames containing the dynamic object in the road video is, the longer the generation time of the dynamic object is. And screening out a third candidate region of a third dynamic target with shorter generation time, thereby screening out a dynamic target appearing in a short time in advance through preliminary filtering, and reducing the calculation amount of subsequent comparison judgment.
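A minimal sketch of this generation-time filter, assuming each dynamic target already carries a tracking identifier; the frame threshold is an illustrative assumption:

    # Sketch: count the frames in which each tracked target appears and
    # keep only targets seen for at least `min_frames` frames.
    from collections import defaultdict

    frame_counts = defaultdict(int)   # track id -> number of frames observed

    def update_and_filter(track_ids, min_frames=50):
        for tid in track_ids:         # track ids present in the current frame
            frame_counts[tid] += 1
        return [tid for tid in track_ids if frame_counts[tid] >= min_frames]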
Optionally, the method may further include:
step 211, acquiring a fourth dynamic target belonging to a preset target category in the video frame through a deep learning model; the object classes include: at least one of a pedestrian category, a motor vehicle category, a non-motor vehicle category.
Step 212, a fourth candidate area where the fourth dynamic target is located is screened out from the candidate areas contained in the video frame.
In the embodiments of the present application, regarding steps 211-212: since motor vehicles, non-motor vehicles, and people on the road are normal road targets and do not belong to the foreign-object class, the motor-vehicle targets, non-motor-vehicle targets, and person targets among the dynamic targets can be identified and screened out.
Specifically, the fourth dynamic targets belonging to the preset target categories are identified using deep learning, that is, image category recognition is used to identify the fourth dynamic targets in the video frame that belong to the preset target categories.
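The application only says "a deep learning model"; as one hedged stand-in, a pretrained torchvision detector could supply the category filter (COCO label ids: 1 person, 2 bicycle, 3 car, 4 motorcycle, 6 bus, 8 truck):

    # Illustrative only: detect pedestrian/motor/non-motor targets so their
    # candidate regions can be screened out before the comparison judgment.
    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    NON_FOREIGN_CLASSES = {1, 2, 3, 4, 6, 8}    # person / non-motor / motor vehicles

    model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

    @torch.no_grad()
    def non_foreign_boxes(frame_tensor, score_thresh=0.5):
        # frame_tensor: float CHW image tensor scaled to [0, 1]
        out = model([frame_tensor])[0]
        return [box.tolist() for box, lbl, score in
                zip(out["boxes"], out["labels"], out["scores"])
                if score >= score_thresh and int(lbl) in NON_FOREIGN_CLASSES]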
Optionally, the method may further include:
and step 213, acquiring the corresponding relation between the image depth and the foreign matter size range in the video frame.
And step 214, obtaining the image depth of the candidate region in the video frame, and obtaining the size range of the target foreign object corresponding to the candidate region in the video frame from the corresponding relationship according to the image depth of the candidate region in the video frame.
Step 215, removing a fifth candidate region where a fifth dynamic target with a size not within the target foreign object size range is located from the candidate regions included in the video frame.
In the embodiments of the present application, regarding steps 213-215: analysis of the sizes of a large number of foreign objects on roads shows that foreign matter falls within a certain size range. In images captured from a fixed shooting angle, because of perspective (distant objects appear small and near objects appear large), the size range that foreign matter can occupy differs at different image depths: the greater the image depth, the smaller the corresponding foreign matter size range; the smaller the image depth, the larger the range. Based on this characteristic, the embodiments of the present application can establish the correspondence between image depth and foreign matter size range in the video frame.
For a dynamic target identified in the current video frame, the target foreign matter size range corresponding to the image depth at which it is located can be determined; if the size of the dynamic target (the size of its candidate region) is not within that range, it can be judged not to be foreign matter and screened out.
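A sketch of the size-range check in steps 213-215, approximating image depth by vertical image bands (an assumption; this application does not say how depth is obtained), with illustrative numbers:

    # Sketch: per-depth-band plausible pixel-size ranges for foreign matter.
    SIZE_RANGE_BY_BAND = {
        "far": (4, 40),     # greater image depth -> smaller apparent size
        "near": (20, 300),  # smaller image depth -> larger apparent size
    }

    def plausible_foreign_size(box, frame_height):
        x, y, w, h = box
        band = "far" if y < frame_height // 2 else "near"
        lo, hi = SIZE_RANGE_BY_BAND[band]
        size = max(w, h)
        return lo <= size <= hi   # False -> screen the candidate region out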
Optionally, the video frame is divided into a plurality of different regions; after step 206, the method may further include:
Step 216, acquiring the historical judgment result of whether a dynamic target in a historical video frame is a foreign object; the historical video frame is a video frame on which foreign object identification was performed before the current video frame.
Step 217, if it is determined according to the judgment result of the current video frame that a foreign object exists in a region of the video frame, and it is also determined according to the historical judgment result of the historical video frame that a foreign object exists in that region, determining that no new foreign object exists in that region of the video frame, so that the same foreign object is not reported again.
Step 218, if it is determined according to the judgment result of the current video frame that a foreign object exists in a region of the video frame, and it is determined according to the historical judgment result of the historical video frame that no foreign object exists in that region, determining that a foreign object exists in that region of the video frame.
As to steps 216 to 218: because the embodiment of the application performs foreign-object detection on the video frames of the road video stream in playing order, a foreign object detected in one video frame will, through target tracking, be judged a foreign object again in subsequent video frames, so the same dynamic target would be repeatedly determined and reported as a foreign object. Conversely, if a historical video frame judged the dynamic target at some position to be a foreign object but that target disappears in subsequent video frames, the comparison logic of the application could misjudge that a foreign object still exists at that position.
To solve this, in the embodiment of the application the video frame may be divided into different regions, for example into regions laid out as a 3 × 3 matrix. Specifically, the historical determination result of whether a dynamic target in a historical video frame is a foreign object is obtained first; after the determination result of whether a foreign object exists in the current video frame is obtained, the current result for a region is compared with the historical result for the same region, and the comparison decides whether a foreign object would otherwise be repeatedly determined and reported, and whether a misjudgment of foreign-object presence has occurred.
The specific logic is: if both the determination result of the current video frame and the historical determination result indicate that a foreign object exists in a region, it is determined that no foreign object exists in that region of the current video frame, so it is not reported again; if the determination result of the current video frame indicates a foreign object in a region and the historical determination result does not, it is determined that a foreign object exists in that region of the current video frame. This reduces both the probability of repeated foreign-object reports and the probability of misjudging foreign-object presence. A minimal sketch of the region comparison follows.
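The sketch below shows one way to realize this region comparison in Python; the 3 × 3 grid size and the set-based encoding of the judgment results are illustrative assumptions.

```python
GRID = 3  # the video frame is divided into GRID x GRID regions

def region_index(box, frame_w, frame_h):
    """Grid region containing the center of a candidate box (x1, y1, x2, y2)."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    col = min(int(cx * GRID / frame_w), GRID - 1)
    row = min(int(cy * GRID / frame_h), GRID - 1)
    return row * GRID + col

def regions_to_report(current, history):
    """current / history: sets of region indices judged to contain a foreign
    object in the current frame / in historical frames. A region is reported
    only when the current frame says "foreign object" and the history does
    not, which suppresses repeated reports of the same object."""
    return current - history
```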
Optionally, the method may further include:
Step 219, acquiring a training data set, where the training data set includes a plurality of training images, a part of which (the target training images) are labeled with corresponding category labels.
Step 220, training an initial model with the training data set to obtain the comparison detection model.
In the embodiment of the application, for steps 219 to 220: contrastive learning is a technique that learns the common features of similar examples and the differences between dissimilar examples. Because contrastive learning does not concern itself with specific class representations of the features, only a small fraction of the training data needs to be labeled (for example, class labels are added to 10% of the samples); the labeled samples assist training in the fine-tuning stage so that the model fits the road foreign-object recognition scene. After training, the comparison detection model takes two images as input and outputs their similarity. This design means the recognition accuracy of contrastive learning does not depend on training with a large number of labeled samples, keeping the training cost of the model low, as sketched below.
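The PyTorch sketch below shows one minimal form such a two-input similarity model could take; the encoder architecture and embedding size are assumptions, since the patent does not fix a network structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComparisonDetector(nn.Module):
    """A shared encoder applied to the candidate and comparison regions;
    the output is their cosine similarity in [-1, 1]."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(  # tiny CNN, for illustration only
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, candidate, contrast):
        a = F.normalize(self.encoder(candidate), dim=-1)
        b = F.normalize(self.encoder(contrast), dim=-1)
        return (a * b).sum(dim=-1)

# Usage: similar regions (no foreign object) score high, dissimilar low.
model = ComparisonDetector()
sim = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```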
Optionally, step 220 may specifically include:
substep 2201, performing a data enhancement operation on the training images in the training data set to increase the number of the training images and obtain a target training data set, wherein the data enhancement operation comprises: at least one of cropping, resizing, recoloring, color distortion.
During training, more training images generally yield a better training effect; data enhancement addresses how to further increase the number of training images when the number of captured training images is limited.
Specifically, a training image may be processed by one of, or any combination of, cropping, resizing, recoloring and color distortion, and the processed image used as a new training image; for example, the grayscale image obtained by gray-scale processing of a training image can serve as a new one. One plausible realization is sketched below.
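The following sketch realizes these enhancement operations with torchvision; the library choice and parameter values are assumptions, not taken from the patent.

```python
from torchvision import transforms

# Each pass of `augment` over one source image yields a new training image,
# enlarging the data set without capturing new road footage.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),           # cropping + resizing
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),  # color distortion
    transforms.RandomGrayscale(p=0.2),           # recoloring to grayscale
    transforms.ToTensor(),
])
```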
Substep 2202, pre-training the initial model with the target training data set to obtain a first model.
In the embodiment of the application, the initial model can be pre-trained on the full set of samples in the target training data set to obtain the first model. Pre-training lets the model quickly learn, from a large number of samples, the common features of similar examples and the differences between dissimilar ones; the pre-training process may run for about 800 training iterations.
Substep 2203, performing fine-tuning training on the first model with the target training images in the target training data set to obtain a second model.
After the first model is obtained by pre-training, fine tuning can be performed using the labeled training images in the target training data set and the characteristics of the road foreign-object recognition scene, so that the resulting second model better meets the recognition requirements of that scene.
Specifically, during fine tuning, 10% of the labeled training images may first be used for a first-stage fine tuning of about 60 iterations, and then all labeled training images are used for a second-stage fine tuning of about 30 iterations, as the schematic below shows.
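Schematically, the two-stage schedule can be written as below; `train_one_round` stands for one round of whatever contrastive training step is used and is a placeholder, not a function defined by the patent.

```python
def fine_tune(model, train_one_round, labeled_10pct, labeled_all):
    """Two-stage fine tuning of substep 2203; round counts from the text."""
    for _ in range(60):                        # stage 1: ~60 rounds on 10%
        train_one_round(model, labeled_10pct)  # of the labeled images
    for _ in range(30):                        # stage 2: ~30 rounds on all
        train_one_round(model, labeled_all)    # labeled images
    return model
```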
Substep 2204, performing network distillation training with the training images not labeled with class labels in the target training data set, the first model and the second model, to obtain the comparison detection model.
In the embodiment of the application, the second model obtained by fine tuning can serve as the teacher model and the pre-trained first model as the student model. Network distillation training then lets the teacher further train the student so that the student approximates the teacher's behavior, achieving knowledge migration between different model networks and yielding a comparison detection model with better performance.
Specifically, network distillation may use the second model to recognize the training images without class labels in the target training data set and mark them with pseudo labels; the first model is then trained on the pseudo-labeled images using the same learning-rate mechanism, weight decay and batch size as pre-training. The distillation process may iterate for about 400 rounds; a sketch follows.
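The sketch below illustrates this pseudo-label distillation. For simplicity the models are treated as single-image classifiers, and the optimizer settings are placeholders; the patent only requires that the schedule, weight decay and batch size match pre-training.

```python
import torch

@torch.no_grad()
def pseudo_label(teacher, images):
    """Use the fine-tuned second model (teacher) to tag unlabeled images."""
    teacher.eval()
    return teacher(images).argmax(dim=-1)  # pseudo class labels

def distill(student, teacher, unlabeled_loader, rounds=400, lr=1e-3):
    # Same learning-rate family / weight decay / batch size as pre-training.
    opt = torch.optim.SGD(student.parameters(), lr=lr, weight_decay=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(rounds):
        for images in unlabeled_loader:
            targets = pseudo_label(teacher, images)
            opt.zero_grad()
            loss = loss_fn(student(images), targets)
            loss.backward()
            opt.step()
    return student
```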
In summary, in the embodiment of the present application, by obtaining the long-term modeling background map obtained by road video modeling and extracting the dynamic target in the road scene, the difference between the candidate region of the dynamic target and the corresponding comparison region of the candidate region in the long-term modeling background map can be compared by comparing the detection model based on the thought of comparison learning, so as to identify the foreign object in the road scene.
Referring to fig. 9, which shows a flowchart of steps of a method for detecting a foreign object in a scene according to an embodiment of the present application, including:
step 301, acquiring a target video shot for a target scene and a long-term modeling background image of the target video.
The long-term modeling background image is used for representing a target scene which does not contain a dynamic target; the dynamic target is a target with a position shifted in a period of time.
Step 302, obtaining a candidate region where the dynamic target is located in a video frame of the target video.
Step 303, inputting the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model, and obtaining a determination result of whether the dynamic target in the video frame is a foreign object.
Wherein, the comparison detection model is a machine learning model.
Step 304, acquiring the position of the foreign object in the target scene according to the judgment result of the dynamic target.
For this embodiment, reference may be made to the description of fig. 1 to 6, which is not described herein again.
In summary, in the embodiment of the present application, by obtaining the long-term modeling background map produced by road video modeling and extracting the dynamic targets in the road scene, the comparison detection model, built on the idea of contrastive learning, can compare the candidate region of a dynamic target with its corresponding comparison region in the long-term modeling background map, thereby identifying foreign objects in the road scene.
Referring to fig. 10, which shows a block diagram of a road foreign object detection apparatus provided in an embodiment of the present application, including:
a first obtaining module 401, configured to obtain a road video shot for a road scene and a long-term modeling background map of the road video; the long-time modeling background image is used for representing a road scene without a dynamic target; the dynamic target is a target with a position offset within a period of time;
a second obtaining module 402, configured to obtain a candidate region where the dynamic target is located in a video frame of the road video;
a first comparison module 403, configured to input the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model, and obtain a determination result of whether a dynamic target in the video frame is a foreign object; wherein, the comparison detection model is a machine learning model.
And a first identification module 404, configured to obtain a position of a foreign object in the road scene according to a determination result of the dynamic target.
Optionally, the first comparing module 403 includes:
the similarity submodule is used for inputting the candidate region and the comparison region into a comparison detection model and calculating the similarity of the candidate region and the comparison region;
the first judgment submodule is used for obtaining, when the similarity is less than or equal to a first similarity threshold, a judgment result output by the comparison detection model that the dynamic target in the video frame is a foreign object;
and the second judgment submodule is used for obtaining, when the similarity is greater than or equal to a second similarity threshold, a judgment result output by the comparison detection model that the dynamic target in the video frame is not a foreign object. A sketch of this two-threshold decision follows.
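Written out directly, the decision rule looks like the sketch below; the threshold values, and the treatment of similarities falling between the two thresholds, are assumptions the patent leaves open.

```python
T1 = 0.35  # first similarity threshold: at or below -> foreign object
T2 = 0.65  # second similarity threshold: at or above -> not a foreign object

def judge(similarity):
    if similarity <= T1:
        return "foreign object"
    if similarity >= T2:
        return "not a foreign object"
    return "undecided"  # behavior between the thresholds is unspecified
```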
Optionally, the apparatus further comprises:
the first generation submodule is used for acquiring the long-term modeling background image at every preset time interval through a preset background modeling algorithm, based on a first learning rate and the road video;
and the second generation submodule is used for acquiring a short-term modeling background image at every preset time interval through the background modeling algorithm, based on a second learning rate and the road video, the short-term modeling background image being used for representing the road scene containing the dynamic target; the first learning rate is greater than the second learning rate. One possible realization with a standard background subtractor is sketched below.
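One way to realize the two background models is with OpenCV's MOG2 background subtractor, as sketched below. The library choice and the specific rate values are assumptions; only the ordering (first learning rate greater than the second) is taken from the text.

```python
import cv2

long_model = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
short_model = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

RATE_LONG, RATE_SHORT = 0.05, 0.005  # first learning rate > second

def update(frame):
    """Feed one road-video frame to both models, e.g. at each preset interval."""
    long_model.apply(frame, learningRate=RATE_LONG)
    short_model.apply(frame, learningRate=RATE_SHORT)

def backgrounds():
    """Current long-term and short-term modeled background images."""
    return long_model.getBackgroundImage(), short_model.getBackgroundImage()
```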
Optionally, the second obtaining module 402 includes:
the difference submodule is used for acquiring a pixel difference image according to the difference between the pixel value of the long-time modeling background image and the pixel value of the short-time modeling background image;
the first identification submodule is used for acquiring a first region in the pixel difference map where the dynamic target is located;
and the second identification submodule is used for mapping the first region in the pixel difference value image to the video frame to obtain a candidate region in the video frame.
Optionally, the first identification submodule includes:
the binarization submodule is used for carrying out binarization processing on the pixel difference image to obtain a binarization image; the pixel value in the binary image is a first numerical value or a second numerical value, the first numerical value corresponds to the dynamic target, and the second numerical value corresponds to the scene background;
and the third identification submodule is used for obtaining a first area in the pixel difference image according to the pixel value of the binarization image.
Optionally, the first identification sub-module further includes:
the morphological unit is used for performing morphological processing on the binary image, filling in the incomplete areas within the regions of dynamic targets in the binary image to obtain a filled image;
the region-connectivity unit is used for performing connected-region processing on the filled image, merging separate areas that belong to the same dynamic target and splitting connected areas that contain different dynamic targets, to obtain an image to be identified;
the first identification submodule includes:
and the identification unit is used for obtaining the first region in the pixel difference map according to the pixel values of the image to be identified; the full extraction chain is sketched below.
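A minimal OpenCV sketch of this difference, binarization, morphology and connected-region chain follows; the threshold and kernel size are illustrative assumptions.

```python
import cv2
import numpy as np

def candidate_boxes(long_bg, short_bg, diff_thresh=30):
    """Pixel difference -> binarization -> morphology -> connected regions."""
    diff = cv2.absdiff(long_bg, short_bg)          # pixel difference map
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    # First value (255) marks dynamic targets, second value (0) the background.
    _, binary = cv2.threshold(gray, diff_thresh, 255, cv2.THRESH_BINARY)
    # Morphological closing fills incomplete areas inside a target region.
    filled = cv2.morphologyEx(binary, cv2.MORPH_CLOSE,
                              np.ones((5, 5), np.uint8))
    # Connected-component analysis separates different dynamic targets.
    n, _, stats, _ = cv2.connectedComponentsWithStats(filled)
    boxes = []
    for i in range(1, n):                          # label 0 is the background
        x, y, w, h, _area = stats[i]
        boxes.append((x, y, x + w, y + h))
    return boxes  # mapped onto the video frame as candidate regions
```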
Optionally, a lane area to which a lane belongs is marked in the video frame of the road video; the device further comprises:
and the first filtering module is used for screening out a first candidate region where a first dynamic target outside the lane region is located from candidate regions contained in the video frame marked with the lane region.
Optionally, the apparatus further comprises:
a third obtaining module, configured to obtain a second candidate region of a second dynamic target within the lane region and the generation time of the second dynamic target, the generation time being determined by the number of video frames in the road video that contain the second dynamic target;
and the second filtering module, configured to screen out, from the second candidate regions of second dynamic targets within the lane region, a third candidate region of a third dynamic target whose generation time is greater than or equal to a preset time threshold, as sketched below.
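A sketch of this dwell-time screening follows. The frame rate, the threshold value, and the reading of the step as keeping candidates that have persisted for at least the preset time are assumptions for illustration.

```python
FPS = 25                     # assumed video frame rate
MIN_FRAMES = int(FPS * 2.0)  # preset time threshold of ~2 s, illustrative

def screen_by_generation_time(frame_counts):
    """frame_counts: dict mapping a tracked dynamic-target id to the number
    of video frames containing it. Keeps targets whose generation time meets
    the threshold (the third dynamic targets)."""
    return {tid for tid, n in frame_counts.items() if n >= MIN_FRAMES}
```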
Optionally, the apparatus further comprises:
the deep learning module is used for acquiring, through a deep learning model, a fourth dynamic target belonging to a preset target category in the video frame; the target categories include at least one of a pedestrian category, a motor vehicle category, and a non-motor vehicle category;
and the third filtering module is used for screening out a fourth candidate area where the fourth dynamic target is located from the candidate areas contained in the video frame.
Optionally, a lane area to which a lane belongs is marked in the video frame of the road video; the device further comprises:
the fourth acquisition module is used for acquiring the corresponding relation between the image depth and the foreign matter size range in the video frame;
a fifth obtaining module, configured to obtain an image depth of the candidate region in the video frame, and obtain, according to the image depth of the candidate region in the video frame, a size range of the target foreign object corresponding to the candidate region in the video frame from the correspondence;
and the fourth filtering module is used for screening out a fifth candidate area where a fifth dynamic target with the size not in the target foreign matter size range is located from the candidate areas contained in the video frame.
Optionally, the video frame is divided into a plurality of different regions;
the device also comprises
A sixth obtaining module, configured to obtain a history determination result of whether a dynamic target in the history video frame is a foreign object; the historical video frame is a video frame in which foreign objects are identified before the video frame;
the first judging module is used for determining that foreign matters exist in the areas of the video frames according to the judging result of the video frames and determining that foreign matters do not exist in the areas of the video frames under the condition that the foreign matters exist in the areas according to the historical judging result of the historical video frames;
and the second judging module is used for determining that foreign matters exist in the areas of the video frames according to the judging result of the video frames and determining that foreign matters do not exist in the areas according to the historical judging result of the historical video frames.
Optionally, the apparatus further comprises:
a seventh obtaining module, configured to obtain a training data set, where the training data set includes a plurality of training images, and some target training images in the plurality of training images are labeled with corresponding category labels;
and the training module is used for training the initial model by utilizing the training data set to obtain the comparison detection model.
Optionally, the training module includes:
a data enhancement sub-module, configured to perform a data enhancement operation on the training images in the training data set to increase the number of training images and obtain a target training data set, where the data enhancement operation includes: at least one of cropping, resizing, recoloring, color distortion;
the first training submodule is used for pre-training the initial model through the target training data set to obtain a first model;
the second training submodule is used for performing fine tuning training on the first model through a target training image in the target training data set to obtain a second model;
and the third training submodule is used for carrying out network distillation training according to the training image which is not marked with the class label in the target training data set, the first model and the second model to obtain the comparison detection model.
In summary, in the embodiment of the present application, by obtaining the long-term modeling background map produced by road video modeling and extracting the dynamic targets in the road scene, the comparison detection model, built on the idea of contrastive learning, can compare the candidate region of a dynamic target with its corresponding comparison region in the long-term modeling background map, thereby identifying foreign objects in the road scene.
Referring to fig. 11, which shows a block diagram of a device for detecting a foreign object in a scene according to an embodiment of the present application, the device includes:
a third obtaining module 501, configured to obtain a target video shot for a target scene and a long-term modeling background map of the target video; the long-time modeling background image is used for representing a target scene without a dynamic target; the dynamic target is a target with a position shifted within a period of time;
a fourth obtaining module 502, configured to obtain a candidate region where the dynamic target is located in a video frame of the target video;
a second comparison module 503, configured to input the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model, so as to obtain a determination result of whether a dynamic target in the video frame is a foreign object; wherein, the comparison detection model is a machine learning model.
A second identifying module 504, configured to obtain a position of a foreign object in the target scene according to a determination result of the dynamic target.
In summary, in the embodiment of the present application, by obtaining the long-term modeling background map produced by road video modeling and extracting the dynamic targets in the road scene, the comparison detection model, built on the idea of contrastive learning, can compare the candidate region of a dynamic target with its corresponding comparison region in the long-term modeling background map, thereby identifying foreign objects in the road scene.
The present application further provides a non-transitory readable storage medium storing one or more modules (programs); when the one or more modules are applied to a device, the device may execute the instructions of the method steps in this application.
Embodiments of the present application provide one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an electronic device to perform a method as described in one or more of the above embodiments. In the embodiment of the present application, the electronic device includes various types of devices such as a terminal device and a server (cluster).
Embodiments of the present disclosure may be implemented as an apparatus, which may include electronic devices such as a terminal device, a server (cluster), and the like, using any suitable hardware, firmware, software, or any combination thereof, for a desired configuration. Fig. 12 schematically illustrates an example apparatus 1000 that may be used to implement various embodiments described in embodiments of the present application.
For one embodiment, fig. 12 illustrates an example apparatus 1000 having one or more processors 1002, a control module (chipset) 1004 coupled to at least one of the processor(s) 1002, memory 1006 coupled to the control module 1004, non-volatile memory (NVM)/storage 1008 coupled to the control module 1004, one or more input/output devices 1010 coupled to the control module 1004, and a network interface 1012 coupled to the control module 1004.
The processor 1002 may include one or more single-core or multi-core processors, and the processor 1002 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 1000 can be used as a terminal device, a server (cluster), and other devices described in this embodiment.
In some embodiments, the apparatus 1000 may include one or more computer-readable media (e.g., the memory 1006 or the NVM/storage 1008) having instructions 1014 and one or more processors 1002 that, in conjunction with the one or more computer-readable media, are configured to execute the instructions 1014 to implement modules to perform the actions described in this disclosure.
For one embodiment, control module 1004 may include any suitable interface controllers to provide any suitable interface to at least one of the processor(s) 1002 and/or any suitable device or component in communication with control module 1004.
The control module 1004 may include a memory controller module to provide an interface to the memory 1006. The memory controller module may be a hardware module, a software module, and/or a firmware module.
Memory 1006 may be used, for example, to load and store data and/or instructions 1014 for device 1000. For one embodiment, memory 1006 may comprise any suitable volatile memory, such as suitable DRAM. In some embodiments, the memory 1006 may comprise double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, the control module 1004 may include one or more input/output controllers to provide an interface to the NVM/storage 1008 and input/output device(s) 1010.
For example, NVM/storage 1008 may be used to store data and/or instructions 1014. NVM/storage 1008 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more hard disk drive(s) (HDD (s)), one or more Compact Disc (CD) drive(s), and/or one or more Digital Versatile Disc (DVD) drive (s)).
The NVM/storage 1008 may include storage resources that are physically part of the device on which the apparatus 1000 is installed, or it may be accessible by the device and need not be part of the device. For example, NVM/storage 1008 may be accessed over a network via input/output device(s) 1010.
Input/output device(s) 1010 may provide an interface for the apparatus 1000 to communicate with any other suitable device; the input/output devices 1010 may include communication components, audio components, sensor components, and so forth. The network interface 1012 may provide an interface for the device 1000 to communicate over one or more networks; the device 1000 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, for example by accessing a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, 5G, or a combination thereof.
For one embodiment, at least one of the processor(s) 1002 may be packaged together with logic for one or more controllers of control module 1004 (e.g., memory controller module). For one embodiment, at least one of the processor(s) 1002 may be packaged together with logic for one or more controller(s) of control module 1004 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 1002 may be integrated on the same die with logic for one or more controller(s) of control module 1004. For one embodiment, at least one of the processor(s) 1002 may be integrated on the same die with logic for one or more controller(s) of control module 1004 to form a system on chip (SoC).
In various embodiments, the apparatus 1000 may be, but is not limited to: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, the apparatus 1000 may have more or fewer components and/or different architectures. For example, in some embodiments, device 1000 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
The detection device may adopt a main control chip as the processor or control module; sensor data, position information and the like are stored in the memory or the NVM/storage; a sensor group may serve as the input/output device; and the communication interface may include the network interface.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that relational terms such as first and second are used herein only to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or terminal. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises the element.
The road foreign-object detection method, the method and device for detecting foreign objects in a scene, the electronic device and the storage medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application; the descriptions of the embodiments are only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, the specific implementation and application scope may vary according to the idea of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (14)

1. A method for detecting a foreign object on a road, comprising:
acquiring a road video shot aiming at a road scene and a long-term modeling background image of the road video; the long-time modeling background image is used for representing a road scene without a dynamic target; the dynamic target is a target with a position offset within a period of time;
acquiring a candidate area where the dynamic target is located in a video frame of the road video;
inputting the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model to obtain a judgment result of whether the dynamic target in the video frame is a foreign object, wherein the comparison detection model is a machine learning model;
and acquiring the position of the foreign matter in the road scene according to the judgment result of the dynamic target.
2. The method according to claim 1, wherein the inputting the candidate region and the comparison region corresponding to the candidate region in the long-term modeling background map into a comparison detection model to obtain a determination result of whether the dynamic target in the video frame is a foreign object comprises:
inputting the candidate region and the comparison region into the comparison detection model, and calculating the similarity of the candidate region and the comparison region;
under the condition that the similarity is smaller than or equal to a first similarity threshold, obtaining a judgment result, output by the comparison detection model, that the dynamic target in the video frame is a foreign object;
and under the condition that the similarity is greater than or equal to a second similarity threshold, obtaining a judgment result, output by the comparison detection model, that the dynamic target in the video frame is not a foreign object.
3. The method of claim 1, further comprising:
acquiring the long-term modeling background image at every preset time interval through a preset background modeling algorithm, based on a first learning rate and the road video;
and acquiring a short-term modeling background image at every preset time interval through the background modeling algorithm, based on a second learning rate and the road video, wherein the short-term modeling background image is used for representing a road scene containing the dynamic target; the first learning rate is greater than the second learning rate.
4. The method of claim 3, wherein the obtaining the candidate region of the video frame of the road video where the dynamic object is located comprises:
obtaining a pixel difference image according to the difference value between the pixel value of the long-time modeling background image and the pixel value of the short-time modeling background image;
acquiring a first region in the pixel difference map where the dynamic target is located;
and mapping the first region in the pixel difference image to the video frame to obtain a candidate region in the video frame.
5. The method according to claim 4, wherein the obtaining a first region of the pixel difference map where the dynamic target is located comprises:
carrying out binarization processing on the pixel difference image to obtain a binarization image; the pixel value in the binary image is a first numerical value or a second numerical value, the first numerical value corresponds to the dynamic target, and the second numerical value corresponds to the scene background;
and obtaining a first region in the pixel difference image according to the pixel value of the binarization image.
6. The method of claim 5, further comprising:
performing morphological processing on the binary image, and filling in the incomplete areas within the regions of dynamic targets in the binary image to obtain a filled image;
and performing connected-region processing on the filled image, merging separate areas that belong to the same dynamic target in the filled image and splitting connected areas that contain different dynamic targets in the filled image, to obtain an image to be identified;
the obtaining a first region in the pixel difference map according to the pixel values of the binarized map includes:
and obtaining a first region in the pixel difference image according to the pixel value of the image to be identified.
7. The method according to claim 1, characterized in that a lane area to which a lane belongs is marked in a video frame of the road video; the method further comprises the following steps:
and screening a first candidate region where a first dynamic target outside the lane region is located from the candidate regions contained in the video frame marked with the lane region.
8. The method of claim 7, wherein after screening out the first candidate region in which the first dynamic target outside the lane region is located, the method further comprises:
acquiring a second candidate region of a second dynamic target in the lane region and the generation time of the second dynamic target, wherein the generation time is determined by the number of video frames containing the second dynamic target in the road video;
and screening out a third candidate region of a third dynamic target with the generation time larger than or equal to a preset time threshold from the second candidate region of the second dynamic target in the lane region.
9. The method of claim 1, further comprising:
acquiring, through a deep learning model, a fourth dynamic target belonging to a preset target category in the video frame; the target categories include at least one of a pedestrian category, a motor vehicle category, and a non-motor vehicle category;
and screening out a fourth candidate area in which the fourth dynamic target is positioned from the candidate areas contained in the video frame.
10. The method according to claim 1, characterized in that a lane area to which a lane belongs is marked in a video frame of the road video; the method further comprises the following steps:
acquiring the corresponding relation between the image depth and the foreign matter size range in the video frame;
acquiring the image depth of the candidate region in the video frame, and acquiring the size range of the target foreign matter corresponding to the candidate region in the video frame from the corresponding relation according to the image depth of the candidate region in the video frame;
and screening out a fifth candidate area where a fifth dynamic target with the size not in the target foreign matter size range is located from the candidate areas contained in the video frame.
11. The method of claim 1, wherein the video frame is divided into a plurality of different regions;
after obtaining the determination result of whether the dynamic target in the video frame is a foreign object, the method further comprises:
acquiring a historical determination result of whether a dynamic target in a historical video frame is a foreign object; the historical video frame is a video frame on which foreign-object identification was performed before the current video frame;
in the case that the determination result of the video frame indicates that a foreign object exists in a region and the historical determination result of the historical video frame also indicates that a foreign object exists in that region, determining that no foreign object exists in that region of the video frame;
and in the case that the determination result of the video frame indicates that a foreign object exists in a region and the historical determination result of the historical video frame indicates that no foreign object exists in that region, determining that a foreign object exists in that region of the video frame.
12. The method of claim 1, further comprising:
acquiring a training data set, wherein the training data set comprises a plurality of training images, and part of target training images in the plurality of training images are marked with corresponding class labels;
and training an initial model by using the training data set to obtain the comparison detection model.
13. A method for detecting a foreign object in a scene, comprising:
acquiring a target video shot aiming at a target scene and a long-term modeling background image of the target video; the long-time modeling background image is used for representing a target scene without a dynamic target; the dynamic target is a target with a position shifted within a period of time;
acquiring a candidate area where the dynamic target is located in a video frame of the target video;
inputting the candidate region and a comparison region corresponding to the candidate region in the long-term modeling background image into a comparison detection model to obtain a judgment result of whether the dynamic target in the video frame is a foreign object, wherein the comparison detection model is a machine learning model;
and acquiring the position of the foreign matter in the target scene according to the judgment result of the dynamic target.
14. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon that, when executed, causes the processor to perform the method of any of claims 1 to 13.
CN202210631102.8A 2022-06-06 2022-06-06 Road foreign matter detection method, and method and device for detecting foreign matters in scene Pending CN115171034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210631102.8A CN115171034A (en) 2022-06-06 2022-06-06 Road foreign matter detection method, and method and device for detecting foreign matters in scene

Publications (1)

Publication Number Publication Date
CN115171034A true CN115171034A (en) 2022-10-11

Family

ID=83486129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210631102.8A Pending CN115171034A (en) 2022-06-06 2022-06-06 Road foreign matter detection method, and method and device for detecting foreign matters in scene

Country Status (1)

Country Link
CN (1) CN115171034A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611592A (en) * 2024-01-24 2024-02-27 长沙隼眼软件科技有限公司 Foreign matter detection method, device, electronic equipment and storage medium
CN117611592B (en) * 2024-01-24 2024-04-05 长沙隼眼软件科技有限公司 Foreign matter detection method, device, electronic equipment and storage medium
CN117746028A (en) * 2024-02-08 2024-03-22 暗物智能科技(广州)有限公司 Visual detection method, device, equipment and medium for unlabeled articles


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination