CN113515968A - Method, device, equipment and medium for detecting street abnormal event


Info

Publication number
CN113515968A
Authority
CN
China
Prior art keywords
image
semantic difference
target image
detection
street
Prior art date
Legal status
Pending
Application number
CN202010273415.1A
Other languages
Chinese (zh)
Inventor
谢奕
胡鹏
陆瑞智
喻晓源
陈普
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority claimed from CN202010273415.1A
Publication of CN113515968A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The application provides a street abnormal event detection method comprising the following steps: obtaining a target image and a reference image; inputting the target image and the reference image into a semantic difference extraction network to obtain a semantic difference region of the target image relative to the reference image; and obtaining a detection result according to the semantic difference region, the detection result characterizing whether the street scene recorded by the target image includes an abnormal event. The method improves detection accuracy, reduces the false alarm rate, and thereby improves the user experience.

Description

Method, device, equipment and medium for detecting street abnormal event
Technical Field
The present disclosure relates to the field of artificial intelligence (AI), and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for detecting street abnormal events.
Background
With the continuous development of artificial intelligence technology, and especially the progress of deep learning in image processing tasks, more and more scenarios capture images or videos and then detect specific events in them, thereby realizing real-time automatic monitoring. For example, in city management, abnormal events such as road occupation are detected in street surveillance video through artificial intelligence technology, realizing intelligent street patrol.
Currently, the industry mainly uses background modeling technology to detect street abnormal events in images or videos. Specifically, each pixel point in the image or video stream is modeled, the probability distribution of each pixel point in a stable background state is obtained through learning, and the moving foreground region and the stable background region in the current image frame are determined according to how well each pixel fits its probability distribution. When a large region within a manually delineated street area is identified as a moving foreground region, the street is considered to have an abnormal event such as road occupation. However, the accuracy of this detection method is not high, and a large number of false alarms are often generated, which degrades the user experience. It is therefore highly desirable to provide a street abnormal event detection method with high accuracy.
Disclosure of Invention
The application provides a street abnormal event detection method, which solves the problems in the related art of low detection accuracy and a large number of false alarms that degrade the user experience. Corresponding apparatus, devices, computer-readable storage media, and computer program products are also provided.
In a first aspect, the present application provides a street abnormal event detection method. According to the method, a semantic difference region is extracted from a monitoring image of a street, and street abnormal event detection is performed based on the semantic difference region, so that noise differences caused by light, weather, and the like are prevented from interfering with abnormal event detection; this improves detection accuracy, reduces the false alarm rate, and improves the user experience.
Specifically, a target image and a reference image are obtained, where the target image is the image to be detected and the reference image is the image referred to when detecting the target image; the street scene recorded by the reference image does not include an abnormal event. The target image and the reference image are input into a semantic difference extraction network, which extracts the semantic features of each image and determines, based on these features, the semantic difference region of the target image relative to the reference image. Street abnormal event detection is then performed according to the semantic difference region to obtain a detection result, which characterizes whether an abnormal event is included in the street scene recorded by the target image.
In some possible implementations, an abnormal event refers to an event in the street that does not conform to the normal state or to relevant regulations. As an example, an abnormal event may include any one or more of a violation event, a safety accident, and a potential safety hazard event.
Violation events may include abnormal events that violate road traffic safety regulations, such as illegal parking, and abnormal events that violate city appearance and environmental sanitation regulations, such as road-occupying operation, scattered materials, itinerant vendors, or littering. A safety accident may be an abnormal event such as road surface collapse or a traffic accident. Potential safety hazard events may be abnormal events such as missing safety warning signs, unclear safety warning signs, or faulty traffic lights.
By configuring the abnormal events to detect, intelligent street patrol can be realized in real time, which improves patrol efficiency on the one hand and reduces patrol cost on the other. In addition, the method can perform comprehensive patrol covering safety, sanitation, and other aspects, and thus has high usability.
In some possible implementations, when detecting an abnormal event according to the semantic difference region, the detection result may be determined using the similarity between the image of the semantic difference region and images of known abnormal event types. Specifically, at least one detection image is obtained according to the semantic difference region, where each detection image is obtained by segmenting the semantic difference region from the target image; the detection result is then determined according to the similarity between the at least one detection image and the small sample support images in a small sample support set.
The small sample support set comprises small sample support images representing different abnormal events. The type of abnormal event corresponding to a detection image is determined according to the similarity between the detection image and at least one class of small sample support images in the set, thereby obtaining the detection result.
This implementation is suitable for any stage of a monitoring application (video monitoring or image monitoring). In particular, in the initial application stage when the number of samples is small, performing street abnormal event detection according to the similarity between the detection images and the small sample support images achieves high accuracy.
In some possible implementations, detecting abnormal events according to the semantic difference region may also be realized through an event classification network. Specifically, at least one detection image is input into the event classification network, where each detection image is obtained by segmenting the semantic difference region from the target image. The event classification network is obtained by training on a number of images and type labels in a knowledge base, where a type label identifies the type of event corresponding to an image; that is, the network is trained in a supervised learning manner using the images and type labels as sample data. The detection result is obtained by detecting the detection images with the event classification network.
The images in the knowledge base may be detection images, images obtained by other means, or a combination of the two. When an image in the knowledge base is a detection image, its type label may be determined according to the similarity between the detection image and at least one class of small sample support images in the small sample support set.
In the later application stage, when the number of samples is large, performing street abnormal event detection with an event classification network obtained through supervised learning further improves detection accuracy, reduces the false alarm rate, and improves the user experience.
In some possible implementations, the detection result may also be provided to the user, and the user's feedback on the detection result is then obtained; the feedback includes corrections to the event types of the detection images, so detection accuracy is further improved by incorporating user feedback.
Considering the accuracy of the subsequent detection process, the small sample support set may also be updated according to the above feedback. Specifically, updating the small sample support set according to feedback falls into the following cases (a minimal code sketch of these cases appears below):
In the first case, a first small sample support image is added to the small sample support set based on the feedback, the first small sample support image recording a street scene that includes an abnormal event of a first specified type.
In the second case, a second small sample support image is deleted from the small sample support set based on the feedback, the second small sample support image recording a street scene that includes an abnormal event of a second specified type.
In the third case, a third small sample support image in the small sample support set is modified based on the feedback, the third small sample support image recording a street scene that includes an abnormal event of a third specified type.
In this implementation, when the knowledge base is constructed based on detection images, the type label of a detection image may be determined according to the user's feedback on the detection result, in order to improve sample accuracy.
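For illustration, a minimal Python sketch of these three update cases is given below. The SupportSet container and its method names are illustrative assumptions, not part of the patent text; images are assumed to be stored as file paths.

```python
# A minimal sketch of support-set maintenance driven by user feedback.
# SupportSet and its methods are hypothetical names for illustration.
from collections import defaultdict

class SupportSet:
    def __init__(self):
        # maps abnormal-event type -> list of support images (file paths here)
        self.images = defaultdict(list)

    def add(self, event_type, image):
        """Case 1: add a support image recording the given abnormal-event type."""
        self.images[event_type].append(image)

    def delete(self, event_type, image):
        """Case 2: delete a support image of the given type."""
        self.images[event_type].remove(image)

    def modify(self, event_type, old_image, new_image):
        """Case 3: replace a support image of the given type."""
        idx = self.images[event_type].index(old_image)
        self.images[event_type][idx] = new_image
```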
In some possible implementations, the detection result may be displayed visually so that the user can quickly grasp it. Specifically, a visualization result map is generated from the target image and the detection result. The visualization result map can show the abnormal events included in the street scene recorded by the target image.
Generating the visualization result map has multiple implementations:
In one implementation, a detection box, which may be rectangular or another shape, is added on the target image to identify the semantic difference region, and the type of the abnormal event is added at the corresponding position of the target image, thereby obtaining the visualization result map.
In another implementation, when the detection result is obtained, a detection result mark is added at the corresponding position of the target image (the position corresponding to the semantic difference region) to form the visualization result map. The detection result mark may be a flag or a star, identifying that an abnormal event exists at that position in the street.
The visualization result map can also show the types of the abnormal events existing at these positions, either directly near each detection result mark, or as detailed information shown when the user triggers an operation such as clicking the detection result mark.
In some possible implementations, the reference image is an image frame in a video stream, and the reference image may be updated based on the moving foreground pixel proportion of the image frames, so that each target image is detected against a high-quality reference image, improving detection accuracy.
Specifically, the moving foreground pixel proportion of the current image frame in the video stream is obtained, and when it is smaller than the moving foreground pixel proportion of the historical image frame in the video stream, the reference image is updated with the current image frame.
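A minimal sketch of this update rule follows, assuming a binary moving-foreground mask is available for each frame; the function and variable names are illustrative, not from the patent.

```python
# Sketch: keep the frame with the lowest moving-foreground pixel
# proportion as the reference image. Names are illustrative.
import numpy as np

def maybe_update_reference(reference, ref_fg_ratio, frame, fg_mask):
    """fg_mask: binary mask of `frame` where nonzero marks moving foreground."""
    fg_ratio = np.count_nonzero(fg_mask) / fg_mask.size
    if fg_ratio < ref_fg_ratio:
        # the current frame is more static, so it is a better reference
        return frame.copy(), fg_ratio
    return reference, ref_fg_ratio
```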
In some possible implementations, normally walking pedestrians or normally driving vehicles may also cause semantic differences. To avoid interference from such moving objects, a moving foreground region, that is, the region where the moving objects in the target image are located, may be eliminated from the semantic difference region, and the detection result is then obtained according to the semantic difference region with the moving foreground region eliminated. This further improves detection accuracy.
In some possible implementations, the semantic difference extraction network is a trained neural network model comprising a feature extraction layer, a semantic difference fusion layer, and a semantic difference segmentation layer. Based on this, the target image and the reference image are input into the feature extraction layer to obtain the basic feature map of the target image and the basic feature map of the reference image; the two basic feature maps are input into the semantic difference fusion layer to obtain a fusion feature map, which includes the basic feature map of the target image, the basic feature map of the reference image, and the difference feature map of the two; the fusion feature map is then input into the semantic difference segmentation layer to obtain the semantic difference region of the target image relative to the reference image. Because the fusion feature map retains the information of the original basic feature maps, performing semantic difference segmentation on it segments the semantic difference region more accurately.
In a second aspect, the present application provides an apparatus for detecting street abnormal events. The device includes:
the communication module is used for acquiring a target image and a reference image, and the street scene recorded by the reference image does not include the abnormal event;
the semantic difference extraction module is used for inputting the target image and the reference image to a semantic difference extraction network to obtain a semantic difference area of the target image relative to the reference image;
and the detection module is used for obtaining a detection result according to the semantic difference region, and the detection result is used for representing whether the street scene recorded by the target image comprises the abnormal event or not.
In some possible implementations, the street abnormal event includes: a violation event, a safety accident, and/or a potential safety hazard event.
In some possible implementations, the detection module is specifically configured to:
obtaining at least one detection image according to the semantic difference area, wherein each detection image is obtained by segmenting the semantic difference area in the target image;
determining the detection result according to the similarity of the at least one detection image and small sample support images in a small sample support set, wherein the small sample support set comprises small sample support images representing different abnormal events.
In some possible implementations, the detection module is specifically configured to:
inputting at least one detection image to an event classification network to obtain a detection result, wherein each detection image is obtained by segmenting a semantic difference region in the target image, the event classification network is obtained by training a plurality of images and type labels in a knowledge base, and the type labels are used for identifying the types of events corresponding to the images.
In some possible implementations, the communication module is further configured to:
providing the detection result to a user;
and obtaining feedback of the user on the detection result, wherein the feedback comprises correction on the type of the event corresponding to the detected image.
In some possible implementations, the apparatus further includes:
a first updating module for updating the small sample support set according to the feedback.
In some possible implementations, the detection module is further configured to:
and generating a visual result graph according to the target image and the detection result.
In some possible implementations, the reference image is an image frame in a video stream, and the communication module is further configured to:
obtaining the moving foreground pixel proportion of the current image frame in the video stream;
the device further comprises:
and the second updating module is used for updating the reference image by using the current image frame when the moving foreground pixel proportion of the current image frame is smaller than that of the historical image frame in the video stream.
In some possible implementations, the apparatus further includes:
the elimination module is used for eliminating a motion foreground area from the semantic difference area, wherein the motion foreground area is an area where a motion object in the target image is located;
the detection module is specifically configured to:
and obtaining a detection result according to the semantic difference region for eliminating the motion foreground region.
In some possible implementations, the semantic difference extraction network is a trained neural network model, and the semantic difference extraction network includes a feature extraction layer, a semantic difference fusion layer, and a semantic difference segmentation layer;
the semantic difference extraction module is specifically configured to:
inputting the target image and the reference image to the feature extraction layer to obtain a basic feature map of the target image and a basic feature map of the reference image;
inputting a basic feature map of the target image and a basic feature map of the reference image to the semantic difference fusion layer to obtain a fusion feature map, wherein the fusion feature map comprises the basic feature map of the target image, the basic feature map of the reference image and a difference feature map of the target image and the reference image;
and inputting the fusion feature map to the semantic difference segmentation layer to obtain a semantic difference area of the target image relative to the reference image.
In a third aspect, the present application provides an apparatus comprising a processor and a memory. The processor and the memory are in communication with each other. The processor is configured to execute the instructions stored in the memory to cause the apparatus to perform the method according to the first aspect or any implementation manner of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having instructions stored therein for instructing a device to perform the method according to the first aspect or any implementation manner of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on a device, cause the device to perform the method of the first aspect or any of the implementations of the first aspect.
The present application can further combine to provide more implementations on the basis of the implementations provided by the above aspects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments are briefly described below.
Fig. 1 is a system architecture diagram of a street abnormal event detection method according to an embodiment of the present application;
FIG. 2 is a flowchart of a street abnormal event detection method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a semantic difference extraction network according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating elimination of a moving foreground region from a semantic difference region according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of a visualization result graph provided in an embodiment of the present application;
fig. 6 is a schematic diagram illustrating abnormal event detection based on a small sample support set according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a deep nearest neighbor neural network according to an embodiment of the present application;
fig. 8 is a flowchart of a method for performing abnormal event detection based on supervised learning according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
The scheme in the embodiments provided in the present application will be described below with reference to the drawings in the present application.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished.
Events of interest in video monitoring or image monitoring applications, relating to objects in a scene, mainly include presence-type events and motion-type events. A presence-type event is that an object of a set type enters (appears in) or leaves (disappears from) a set area within the geographic area identified by the image, or enters/leaves the set area along a set direction, or that the presence time of an object of a set type in the set area meets a set condition. A motion-type event refers to a set motion pattern, such as fighting, of an object of a set type (such as a person, a vehicle, or an animal) in the physical area identified by the image.
For the street patrol task in city management, videos or images can be captured by cameras and then analyzed by a video or image monitoring application, realizing intelligent street patrol. One key to intelligent street patrol is detecting street abnormal events from the videos or images.
A street abnormal event refers to an event in the street that does not conform to the normal state or to relevant regulations. As an example, an abnormal event may include any one or more of a violation event, a safety accident, and a potential safety hazard event. Violation events may include abnormal events that violate road traffic safety regulations, such as illegal parking, and abnormal events that violate city appearance and environmental sanitation regulations, such as road-occupying operation, scattered materials, itinerant vendors, or littering. A safety accident may be an abnormal event such as road surface collapse or a traffic accident. Potential safety hazard events may be abnormal events such as missing safety warning signs, unclear safety warning signs, or faulty traffic lights.
The types of street abnormal events detected from video or images can be set according to business needs. For example, abnormal events such as road occupation, scattered materials, or itinerant vendors may be detected from a video or image. In some possible implementations, abnormal events such as illegal parking and littering can also be detected.
Currently, a background modeling technique has been proposed in the industry for detecting street abnormal events. Specifically, background modeling is performed on each pixel point in the street monitoring video stream, the probability distribution of each pixel point in a stable background state is obtained through learning, and the moving foreground region and the stable background region in the current picture are determined according to how well each pixel fits its probability distribution. Here, the moving foreground region refers to the region where a moving object of interest is located, and the stable background region refers to the background region, outside the regions where moving objects of interest are located, whose probability distribution is relatively stable. When a large moving foreground region appears in the street area, it is determined that an abnormal event such as road occupation has occurred, and an alarm can be triggered.
However, whether a Gaussian mixture background modeling algorithm, a Visual Background Extractor (ViBe) algorithm, or a multi-layer background subtraction (multi-layer BGS) algorithm is used, background modeling is affected by factors such as illumination and weather: a shadow region caused by illumination or weather is easily identified as a moving foreground region, and an event such as road occupation is then wrongly determined to occur there. That is, background-modeling-based street abnormal event detection cannot effectively distinguish noise difference regions caused by factors such as light and weather from semantic difference regions caused by the appearance, disappearance, and change of objects; the detection accuracy is therefore not high, a large number of false alarms are generated, and the user experience suffers.
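For illustration, a minimal OpenCV sketch of this prior-art background modeling approach is given below. The patent does not prescribe this exact API; the MOG2 subtractor, its parameters, and the input file name are illustrative assumptions.

```python
# Sketch of per-pixel background modeling with a Gaussian-mixture
# subtractor, the prior-art approach criticized above.
import cv2

cap = cv2.VideoCapture("street.mp4")  # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)  # 255 marks moving foreground
    # A shadow cast by changing light or weather is easily labeled as
    # foreground here, which is exactly the false-alarm source described above.
```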
In view of this, the present application provides a street abnormal event detection method. The method uses a semantic difference extraction network to directly extract a semantic difference region, and uses that region, instead of the moving foreground region extracted by a background modeling algorithm, to detect street abnormal events, thereby solving the problem that noise difference regions caused by factors such as illumination and weather degrade detection accuracy. Specifically, a target image and a reference image are acquired; the street scene recorded by the reference image does not include an abnormal event, so the reference image can serve as a comparison template for the target image. The target image and the reference image are input into the semantic difference extraction network for comparison to obtain the semantic difference region of the target image relative to the reference image. The semantic difference region marks difference regions caused by the appearance, disappearance, or change of objects; performing street abnormal event detection according to it avoids interference from noise difference regions caused by illumination, weather, and the like, improves detection accuracy, reduces false alarms, and improves the user experience. In some embodiments, the semantic difference extraction network is a trained neural network model.
The street abnormal event detection method provided by the embodiment of the application can be applied to application scenarios including but not limited to the scenario shown in fig. 1.
As shown in fig. 1, the scenario includes a device 102 and a device 104. The device 102 is an image capturing device, such as a camera. The device 104 is a processing device having a central processing unit (CPU) and/or a graphics processing unit (GPU) for processing images acquired by the image capturing device, thereby implementing street abnormal event detection. The device 104 may be a physical device or a physical device cluster, such as a terminal, a server, or a server cluster. Of course, the device 104 may also be a virtualized cloud device, such as at least one cloud computing device in a cloud computing cluster.
In a particular implementation, image capture device 102 captures an image of a geographic area, such as a street, to obtain a target image. Image capture device 102 then sends the target image to device 104. A street abnormal event detection device 1040 is disposed in the device 104, and it includes a communication module 1042, a semantic difference extraction module 1044, and a detection module 1046. The communication module 1042 acquires the target image and the reference image; the street recorded by the reference image does not include an abnormal event. For example, in a street patrol scenario, the reference image does not include abnormal events such as road occupation, scattered materials, or itinerant vendors.
The semantic difference extraction module 1044 includes a semantic difference extraction network. After the target image and the reference image are input into the semantic difference extraction network, the network performs semantic identification on the two images and extracts the semantic difference region of the target image relative to the reference image. The detection module 1046 may then perform classification based on the semantic difference region to obtain a detection result, which characterizes whether an abnormal event is included in the street scene recorded by the target image.
In order to make the technical solution of the present application clearer and easier to understand, the street abnormal event detection method provided in the embodiment of the present application will be described below from the perspective of the street abnormal event detection device 1040.
S202: the street anomaly detection device 1040 acquires a target image and a reference image.
A geographic area to be monitored or patrolled, such as a street area, is typically deployed with cameras. A camera shoots the geographic area to obtain the target image, and the street abnormal event detection device 1040 may obtain the target image from the camera. Note that the camera may capture still images or a video stream of the geographic area. When a video stream is captured, the device 1040 may, after obtaining the video stream, decode it into multiple frames of images and then obtain the target image from those frames.
The reference image is the image referred to when detecting the target image. The street scene recorded by the reference image does not include an abnormal event. The types of abnormal events can be set according to actual business requirements. For example, in a street patrol scenario, the types may include road occupation, scattered materials, and itinerant vendors. Of course, in some implementations, the types may also include littering and illegal parking. Note that the street recorded by the reference image is the same street recorded by the target image; target images shot of different streets correspond to different reference images.
In a specific implementation, the street abnormal event detection device 1040 may screen, from a plurality of images obtained by monitoring a street, an image that does not include an abnormal event as the reference image, using techniques such as image recognition. Of course, in some possible implementations, the device 1040 may also receive, as the reference image, a user-provided image whose recorded street scene does not include an abnormal event.
S204: the street abnormal event detecting device 1040 inputs the target image and the reference image to a semantic difference extraction network, and obtains a semantic difference area of the target image relative to the reference image.
By utilizing the expression capability of a neural network for high-level semantic information, the semantic difference extraction network can effectively distinguish noise differences caused by disturbances such as light, viewing angle, shadow, and leaves from semantic differences caused by the appearance, disappearance, and change of objects, and extract the semantic difference region of the target image relative to the reference image.
In a particular implementation, the semantic difference extraction network may be a fully convolutional Siamese metric network (FCSMN), also known as CoSimNet. Given a pair of images (i.e., a target image and a reference image), the semantic difference extraction network aims to identify semantic changes (i.e., semantic differences) between different times. CoSimNet uses a contrastive loss to reduce the distance between unchanged feature pairs and expand the distance between changed feature pairs, and adopts a strategy of penalizing noise changes, thereby distinguishing noise differences from semantic differences and extracting the semantic difference region.
After extracting the basic feature map Feat_I of the target image and the basic feature map Feat_B of the reference image, the above CoSimNet generally directly calculates the distance between the two basic feature maps, for example the Euclidean distance (i.e., the L2 norm), to obtain the difference feature map Feat_sub of the target image and the reference image, and then performs pixel-by-pixel semantic difference evaluation based on Feat_sub.
In order to improve semantic difference recognition accuracy, in some possible implementations the difference feature map Feat_sub may further be fused with the basic feature map Feat_I of the target image I_t and the basic feature map Feat_B of the reference image B_t. Performing semantic difference evaluation based on the fused feature map effectively combines the difference between the two basic feature maps while losing no information from the original basic feature maps.
Whether the semantic difference evaluation is performed on the difference feature map Feat_sub or on the feature map fusing Feat_I and Feat_B, it can be realized through a semantic difference segmentation layer, which may specifically comprise convolutional layers. Through the convolutional layers, a channel reduction operation is performed on the difference feature map (or the fused feature map) to obtain a two-channel feature map; the values of the two channels are then normalized point by point with softmax to obtain, for each pixel point, the probability of being a semantic difference or a non-semantic difference. Based on these probability values, a semantic difference comparison result map S_t can be obtained, in which the semantic difference region is identified. When performing the channel reduction operation, a double convolutional layer can be used to introduce more nonlinear operations and enhance the expression of semantic differences, thereby improving the accuracy of the semantic difference region.
For ease of understanding, the present application also provides an example of a semantic difference extraction network. As shown in fig. 3, the semantic difference extraction network includes a feature extraction layer, a semantic difference fusion layer, and a semantic difference segmentation layer. The feature extraction layer is a weight-sharing Siamese backbone network, which may adopt any network structure capable of expressing image region features for segmentation, for example a fully convolutional network (FCN), a U-shaped network (U-Net), DeepLab v2, or DeepLab v3. Through the Siamese backbone network, the feature extraction layer extracts feature maps from the target image I_t and the reference image B_t respectively, obtaining the basic feature maps Feat_I and Feat_B.
The semantic difference fusion layer first performs a point-by-point difference calculation on the two basic feature maps Feat_I and Feat_B to obtain the difference feature map Feat_sub, where Feat_sub = Feat_I - Feat_B. Channel concatenation can then be performed in the form [Feat_B, Feat_sub, Feat_I] to obtain the fused feature map. Note that when performing feature fusion, the semantic difference fusion layer may instead concatenate channels in the form [Feat_I, Feat_sub, Feat_B]; the concatenation order is not limited in the embodiments of the present application.
The semantic difference segmentation layer comprises two convolutional layers. The first convolutional layer performs nonlinear processing on the fused feature map. The second convolutional layer has a convolution kernel size of 1 × 1 and performs the channel reduction operation while preserving the resolution of the feature map. After the second convolutional layer, a feature map of size 2 × H_feat × W_feat is output, where H_feat and W_feat respectively denote the height and width of the feature map, and 2 is the number of channels (one channel corresponds to the semantic difference probability, the other to the non-semantic difference probability). The values of the two channels can be normalized into probability values of semantic difference and non-semantic difference by a point-by-point multi-channel softmax operation. Based on the probability values of each pixel point, it can be determined whether the pixel point belongs to the semantic difference region or the non-semantic difference region, yielding a semantic difference comparison result map S_t of size H_feat × W_feat. In some implementations, this map may also be scaled to the size of the target image, i.e., to H_input × W_input, to obtain the final semantic difference comparison result map S_t.
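For illustration, a minimal PyTorch sketch of such a network follows, assuming a generic weight-sharing CNN backbone in place of FCN/U-Net/DeepLab; the channel counts are illustrative assumptions.

```python
# Sketch of the semantic difference extraction network: Siamese feature
# extraction, [Feat_B, Feat_sub, Feat_I] fusion, and a two-layer
# segmentation head with 1x1 channel reduction and per-pixel softmax.
import torch
import torch.nn as nn

class SemanticDiffNet(nn.Module):
    def __init__(self, backbone, feat_channels=64):
        super().__init__()
        self.backbone = backbone  # weight-sharing Siamese backbone (assumed)
        self.conv1 = nn.Sequential(  # nonlinear processing of the fused map
            nn.Conv2d(3 * feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.conv2 = nn.Conv2d(feat_channels, 2, 1)  # 1x1 channel reduction

    def forward(self, target, reference):
        feat_i = self.backbone(target)     # Feat_I
        feat_b = self.backbone(reference)  # Feat_B
        feat_sub = feat_i - feat_b         # point-by-point difference Feat_sub
        fused = torch.cat([feat_b, feat_sub, feat_i], dim=1)
        logits = self.conv2(self.conv1(fused))
        # channel 0 / 1: non-semantic-difference / semantic-difference probability
        return torch.softmax(logits, dim=1)
```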
S206: the street abnormal event detecting device 1040 obtains a detection result according to the semantic difference region.
The appearance, disappearance, or change of objects in a street causes semantic changes (i.e., semantic differences). For example, a merchant stacking goods on the street outside a storefront (road-occupying operation), a construction unit depositing materials such as steel on the street (scattered materials), or goods being sold from a motor vehicle or non-motor vehicle (itinerant vending) all cause semantic changes. Therefore, the street abnormal event detection device 1040 may classify the semantic difference region based on the correspondence between semantics and events, thereby detecting abnormal events in the semantic difference region and obtaining the detection result. The detection result characterizes whether the street scene recorded by the target image includes an abnormal event, such as road occupation, scattered materials, or itinerant vendors.
Considering that moving objects such as normally walking pedestrians and/or normally driving vehicles may also appear in geographic areas such as streets, and to avoid such moving objects interfering with street abnormal event detection (for example, a pedestrian or vehicle being mistakenly identified as an itinerant vendor), the street abnormal event detection device 1040 may further eliminate the moving foreground region from the semantic difference region and then perform street abnormal event detection according to the semantic difference region without the moving foreground region to obtain the detection result.
The moving foreground region is the region where the moving objects in the target image are located, and it can be obtained through background modeling and moving foreground extraction. Take the target image I_t in the video stream {I_1, I_2, ..., I_{t-1}, I_t} as an example. The street abnormal event detection device 1040 may perform background modeling with the video sequence {I_1, I_2, ..., I_{t-1}} before time t, then compare the feature value of each pixel point at time t with the background model distribution or feature value of that pixel point to determine whether it belongs to the moving foreground or the stable background. The device 1040 may then remove fine-grained noise regions in the moving foreground using binary image dilation and erosion operations, and fill holes in the connected regions to obtain the moving foreground image M_t at the current time t, in which the moving foreground region is identified.
The semantic difference comparison result map S_t and the moving foreground image M_t are binarized images. When the two binarized images use the same value, for example "0", to identify the semantic difference region and the moving foreground region, the device 1040 may invert one of the images, for example the moving foreground image, and then perform an AND operation between the inverted image and the other image, i.e., AND the inverted moving foreground image with the semantic difference comparison result map S_t, thereby eliminating the moving foreground region from the semantic difference region. Considering that there may be noise interference at the edge of the semantic difference region, in some implementations the device 1040 may also perform a morphological opening operation after the AND operation.
When the target image is an image frame in a video stream, such as the frame I_t at time t, the device 1040 may also determine the semantic difference comparison result map S_t at the current time based on the semantic difference comparison result maps {S_1, S_2, ..., S_{t-1}} of the previous t-1 frames, to improve the accuracy of the semantic difference region. Specifically, as shown in fig. 4, the device 1040 may AND the maps {S_1, S_2, ..., S_{t-1}} in sequence, then AND the result with the semantic difference comparison result map of the current time, and take the result of that operation as the final semantic difference comparison result map S_t. The device 1040 may then invert the moving foreground image M_t, AND the inverted image with the final S_t, and perform an opening operation to obtain a semantic difference comparison result map from which the moving foreground region has been eliminated. This map, also referred to as a static region mask, identifies the semantic difference region with the moving foreground region removed.
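A minimal OpenCV sketch of this mask combination follows; the convention that 255 marks the region of interest and the 5 × 5 opening kernel are illustrative assumptions.

```python
# Sketch: eliminate the moving foreground region M_t from the semantic
# difference map S_t, optionally ANDing in previous maps, then open.
import cv2
import numpy as np

def static_region_mask(s_t, m_t, history=None):
    """s_t: semantic difference map; m_t: moving foreground image;
    history: optional previous maps [S_1, ..., S_{t-1}]."""
    if history:
        for s_prev in history:               # AND previous maps in sequence
            s_t = cv2.bitwise_and(s_t, s_prev)
    not_moving = cv2.bitwise_not(m_t)        # invert the moving foreground image
    mask = cv2.bitwise_and(s_t, not_moving)  # drop the moving foreground region
    kernel = np.ones((5, 5), np.uint8)
    # opening suppresses noise at the edges of the semantic difference region
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```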
It should be noted that the detection result obtained by the street abnormal event detecting device 1040 may also be presented through a visualization result graph. Specifically, as shown in fig. 5, the street abnormal event detecting apparatus 1040 may generate a visualization result map based on the target image and the detection result when obtaining the detection result.
Specifically, when the street abnormal event detection device 1040 identifies the semantic difference region of the target image relative to the reference image, a detection box 502 is added to the target image; the detection box 502 may be rectangular or another shape and identifies the semantic difference region in the target image. When the device 1040 determines the type of the abnormal event corresponding to each semantic difference region, the type 504 of the abnormal event is added at the corresponding position of the target image, thereby obtaining the visualization result map. The visualization result map shows the abnormal events in the street scene and their types, making it convenient for users to quickly obtain the corresponding information.
In some possible implementations, the device 1040 may also add a detection result mark at the corresponding position of the target image (corresponding to the semantic difference region) when obtaining the detection result, forming the visualization result map. The detection result mark may be a flag or a star, identifying that an abnormal event exists at that position in the street. The visualization result map can also show the types of the abnormal events at these positions, either directly near the detection result mark, or as detailed information shown when the user triggers an operation such as clicking the detection result mark.
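A minimal OpenCV sketch of generating such a visualization result map follows; the box color, font, and the boxes_and_types input format are illustrative assumptions.

```python
# Sketch: draw a detection box (cf. box 502) and the abnormal-event
# type (cf. label 504) on a copy of the target image.
import cv2

def draw_result(target_img, boxes_and_types):
    """boxes_and_types: list of ((x, y, w, h), event_type) pairs."""
    vis = target_img.copy()
    for (x, y, w, h), event_type in boxes_and_types:
        cv2.rectangle(vis, (x, y), (x + w, y + h), (0, 0, 255), 2)
        cv2.putText(vis, event_type, (x, max(12, y - 8)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
    return vis
```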
Based on the above description, the street abnormal event detection method provided by the embodiments of the application directly extracts the semantic difference region with a semantic difference extraction network and uses it, instead of the moving foreground region, for street abnormal event detection. This avoids interference from noise difference regions caused by factors such as illumination and weather, improves detection accuracy, reduces false alarms, and improves the user experience.
Considering that the visual features of abnormal events of the same class in scenarios such as street patrol have large intra-class differences (i.e., strong diversity) and that the number of samples is relatively small, the embodiments of the present application also provide a method for event detection based on a small sample support set.
Referring to fig. 6, a flowchart of a method for abnormal event detection based on a small sample support set is shown. The method comprises:
S2062: The street abnormal event detection device 1040 obtains at least one detection image according to the semantic difference region.
Specifically, after obtaining the semantic difference comparison result map S_t identifying the semantic difference region, the street abnormal event detection device 1040 may extract contours from S_t and exclude contours whose circumscribed region has a pixel area smaller than a preset area (for example, 1000); the remaining contours form the semantic difference region list corresponding to time t. According to the semantic difference region list, the device 1040 segments the images of the n semantic difference regions from the target image to obtain n detection images, which can form a detection set, where n is a positive integer.
In a specific implementation, the device 1040 may extract contours from the semantic difference comparison result map S_t using the findContours function in the Open Source Computer Vision Library (OpenCV), and determine the minimum circumscribed rectangle R of each contour using the boundingRect function in OpenCV. The circumscribed region may also have other shapes, such as a circumscribed circle, which is not limited in the embodiments of the present application.
It should be noted that, when the target image has a moving foreground region, the device 1040 may first eliminate the moving foreground region from the semantic difference region in the semantic difference comparison result map S_t to obtain a static region mask. The device 1040 may then extract contours from the static region mask, determine the circumscribed regions of the contours, determine n semantic difference regions based on the areas of the circumscribed regions, and segment n images from the target image according to these regions to form the detection set.
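A minimal OpenCV sketch of this contour extraction and cropping follows. The area threshold of 1000 and the findContours/boundingRect calls come from the text above; the function layout is an illustrative assumption.

```python
# Sketch of S2062: extract contours from the (static-region) mask and
# crop the corresponding detection images from the target image.
import cv2

def crop_detection_images(mask, target_img, min_area=1000):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    detections = []
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)  # minimum circumscribed rectangle R
        if w * h < min_area:                # exclude small circumscribed regions
            continue
        detections.append(target_img[y:y + h, x:x + w])
    return detections                       # the detection set of n images
```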
S2064: the street abnormal event detecting device 1040 determines the detection result according to the similarity between the at least one detection image and the small sample support image in the small sample support set.
Considering that the amount of sample data is small in the initial application stage, event classification can be performed using few-shot learning. Here, "small sample" means the sample capacity is small, that is, the number of samples is small; few-shot learning is therefore also called small sample learning. The street abnormal event detection device 1040 may use a small-sample-based classification model to determine the type of event corresponding to each detection image and obtain the detection result. The small-sample-based classification model is provided with a small sample support set comprising small sample support images representing different abnormal events; the model determines the type of event corresponding to a detection image by calculating the similarity between the detection image and at least one class of small sample support images.
Small-sample-based classification models include various models such as the nearest neighbor neural network and the K-means neural network. For ease of understanding, this application illustrates with a deep nearest neighbor neural network. In this example, the small sample support set includes small sample support images whose type labels are preset types; considering sample balance, the set may further include small sample support images whose type labels are non-preset types.
For a road-occupation patrol scenario, a non-preset type label covers two cases: the type label is an irrelevant type, or the type label indicates that no road occupation exists. Based on this, the small sample support set is S = {s_0, s_1, ..., s_C, s_{C+1}, ..., s_{C+K_other}}, where s_0 is the set of small sample support images of the no-occupation class, {s_1, ..., s_C} are the small sample support images of occupation categories 1 to C, and {s_{C+1}, ..., s_{C+K_other}} are the small sample support images of the irrelevant classes 1 to K_other (such as people, cars, and the like). The number of small sample support images of each class is at least five.
As shown in fig. 7, the deep nearest neighbor neural network includes a feature extractor that adopts a Conv-64F network structure. The detection images in the detection set and the small sample support images in the small sample support set are input into the feature extractor, which performs feature extraction on each input image and outputs m d-dimensional feature vectors. The street abnormal event detection device 1040 calculates the similarity between the features f_q of a detection image I_q and each class in the support set S; the class with the highest similarity is the category of the event corresponding to I_q.
The similarity between the features f_q of I_q and class c in the small sample support set is:

$$\mathrm{Sim}(I_q, c) = \sum_{i=1}^{m} \sum_{k=1}^{K} \cos\!\left(f_q^{\,i}, \hat{f}_c^{\,i,k}\right)$$

where f_q^i is the i-th feature of f_q, and \hat{f}_c^{i,k} is the k-th feature with the highest similarity to f_q^i among the features of the class-c support set images s_c.
It should be noted that if the event corresponding to I_q is of a class C+1 to C+K_other (i.e., an irrelevant class), the street abnormal event detection device 1040 may discard the semantic difference region corresponding to this I_q without performing region tracking or result output. If the event corresponding to I_q is of a class 1 to C, the device 1040 may perform region tracking and result output on the semantic difference region corresponding to I_q.
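A minimal PyTorch sketch of this deep nearest-neighbor similarity follows, in the spirit of the formula above; the K = 3 default, the pre-extracted feature layout, and the function names are illustrative assumptions.

```python
# Sketch: for each local feature of the query image, sum the cosine
# similarities of its K most similar features in each class's support
# images; the class with the highest total similarity wins.
import torch
import torch.nn.functional as F

def class_similarity(f_q, f_support, k=3):
    """f_q: (m, d) query features; f_support: (n, d) features pooled from
    one class's support images; returns a scalar similarity."""
    f_q = F.normalize(f_q, dim=1)
    f_support = F.normalize(f_support, dim=1)
    cos = f_q @ f_support.t()         # (m, n) pairwise cosine similarities
    topk = cos.topk(k, dim=1).values  # K most similar support features
    return topk.sum()

def classify(f_q, support_features, k=3):
    """support_features: dict mapping class label -> (n, d) feature tensor."""
    sims = {c: class_similarity(f_q, feats, k)
            for c, feats in support_features.items()}
    return max(sims, key=lambda c: sims[c])
```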
Further, the street abnormal event detection device 1040 may use the detection results of the small-sample-based classification model to construct a knowledge base X_event = {(x_i, y_i) | i = 1, ..., N}, where x_i is the i-th sample image (obtained from the detection images) and y_i is the type label corresponding to x_i. Based on this knowledge base, the device 1040 trains an event classification network with supervised learning and predicts the types of abnormal events with the event classification network.
Specifically, referring to the flowchart of the method for detecting abnormal events of streets based on supervised learning shown in fig. 8, on the basis of the embodiment shown in fig. 6, the method includes:
s2066: the street abnormal event detecting device 1040 constructs a knowledge base according to the detected image and the type of the event corresponding to the detected image. When the number of sample data in the knowledge base, the type tag of which is a preset type, reaches a first preset number, S2068 is executed.
The knowledge base comprises sample data, the sample data comprises the detected image and a type label thereof, and the type label is used for identifying the type of an event corresponding to the detected image. The type label can be obtained by classifying the detection image based on a classification model of the small sample.
To improve accuracy, the street abnormal event detecting device 1040 may further provide the detection result to the user, so that the user can confirm whether the detection result is correct and correct an incorrect detection result. For example, when the type of the abnormal event in the detection result differs from the actual type, the user can correct the type of the abnormal event, thereby forming feedback on the detection result. The street abnormal event detecting device 1040 obtains this feedback and uses it to generate sample data, improving the accuracy of the sample data.
When the number of sample data whose type label is a preset type (i.e., a type of abnormal event to be detected) in the knowledge base reaches the first preset number, supervised learning on the sample data in the knowledge base can achieve higher accuracy, and the street abnormal event detecting device 1040 may perform S2068. The first preset number may be set according to an empirical value; as one example, it may be 5000.
The embodiment shown in fig. 8 is but one implementation of building the knowledge base. The images in the knowledge base may also be obtained by other means; that is, the street abnormal event detecting device 1040 may construct the knowledge base from images obtained by other means and their corresponding type labels. In some possible implementations, the images in the knowledge base may also be a combination of the detection images and images obtained by other means: the street abnormal event detecting device 1040 acquires the detection images, acquires images in other ways, and acquires the type labels corresponding to these images, thereby constructing the knowledge base.
S2068: the street anomaly detection device 1040 trains an event classification network using sample data in the knowledge base.
The event classification network may adopt neural networks of various structures. For ease of understanding, the embodiment of the present application is illustrated with an 18-layer Residual Network (ResNet-18). Specifically, 17 convolutional layers are stacked, with shortcut (residual) connections added every few convolutional layers (e.g., every 2 convolutional layers) to increase the depth of the network, and a fully-connected layer at the end of the network maps the high-dimensional features to low-dimensional features for output.
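A sketch of such an event classification network, built on the torchvision ResNet-18 implementation with its final fully-connected layer resized to C + K_other + 1 outputs, could look as follows; the helper name and the class layout (index 0 for no occupancy, 1..C for occupancy, C+1..C+K_other for irrelevant classes) are assumptions for illustration:

```python
import torch.nn as nn
from torchvision.models import resnet18

def build_event_classifier(num_occupancy: int, num_irrelevant: int) -> nn.Module:
    """ResNet-18 backbone with a fully-connected head producing
    C + K_other + 1 logits: occupancy classes, irrelevant classes,
    and the no-occupancy class."""
    num_classes = num_occupancy + num_irrelevant + 1
    model = resnet18(weights=None)      # 17 conv layers + 1 fc layer, trained from scratch
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```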
After the network structure of the event classification network is built, the street abnormal event detecting device 1040 trains the event classification network using the sample data in the knowledge base. Specifically, the device 1040 inputs sample data into the event classification network and updates the parameters of the network based on its prediction results and the type labels in the sample data, thereby training the network. Training stops when the event classification network meets a training end condition, for example, when the loss function of the network converges or falls below a preset value. The street abnormal event detecting device 1040 may then verify the accuracy of the event classification network; when the required accuracy is satisfied, the event classification network can be used to classify the detection images.
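A correspondingly simple supervised training loop over knowledge-base samples, stopping early when the average loss falls below a preset value (one of the end conditions mentioned above), might look like this; the hyperparameters are placeholders:

```python
import torch
import torch.nn.functional as F

def train_event_classifier(model, loader, epochs=10, lr=1e-3, loss_eps=1e-3):
    """Supervised training on knowledge-base samples."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        total, n = 0.0, 0
        for images, labels in loader:   # loader yields (image batch, type labels)
            logits = model(images)
            loss = F.cross_entropy(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item() * labels.size(0)
            n += labels.size(0)
        if total / n < loss_eps:        # training end condition: loss below preset value
            break
    return model
```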
S2069: the street abnormal event detecting device 1040 inputs the detected image obtained by segmenting the semantic difference region from the target image to the event classification network, and obtains a detection result.
Specifically, the street abnormal event detecting device 1040 inputs at least one detection image into the trained event classification network. The event classification network based on supervised learning outputs a (C + K_other + 1)-dimensional feature vector, which represents the probabilities that the detection image belongs to each of the C + K_other + 1 categories. The dimension holding the maximum value in this vector is the predicted type of the detection image.
Similar to the classification model based on small samples, when the event corresponding to the detection image belongs to one of the classes C+1 to C+K_other (i.e., an irrelevant class), the street abnormal event detecting device 1040 may discard the semantic difference region corresponding to the detection image, excluding it from region tracking and result output.
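Continuing the same assumed class layout (0 = no occupancy, 1..C = occupancy, C+1..C+K_other = irrelevant), prediction and the discarding of irrelevant classes could be sketched as:

```python
import torch

@torch.no_grad()
def predict_event(model, image, C: int, k_other: int):
    """Returns the predicted class index, or None when the prediction
    falls in the irrelevant range C+1 .. C+K_other, in which case the
    corresponding semantic difference region is discarded."""
    logits = model(image.unsqueeze(0))      # (1, C + K_other + 1)
    probs = torch.softmax(logits, dim=1)
    pred = int(probs.argmax(dim=1))         # dimension with maximum probability
    if C + 1 <= pred <= C + k_other:        # irrelevant class: discard
        return None
    return pred
```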
In order to optimize the classification result and improve the classification accuracy, the street abnormal event detecting device 1040 may further obtain the user's feedback on the event detection result, where the feedback includes a correction of the type of the event corresponding to the detection image, and may update the small sample support set according to the feedback. Updating the small sample support set according to the feedback can be divided into three cases, described in detail below.
In the first case, the street anomaly detection device 1040 adds a first small sample support image to the small sample support set according to the feedback, where the first small sample support image records a street scene containing a first specified type of anomaly.
Specifically, the feedback indicates that the user corrects the type of the event corresponding to the detection image to a first specified type, which is a newly added preset type. Based on this feedback, the street abnormal event detecting device 1040 adds to the small sample support set the small sample support images whose type label is the first specified type, i.e., the first small sample support images.
For example, suppose the preset types include C types in the early stage of application. When, for a plurality of detection images, the user corrects the corresponding event type from one of the C types to a new type, the street abnormal event detecting device 1040 may select a first number (e.g., 5) of images from these detection images as the first small sample support images. The type label of the first small sample support images is the first specified type, and the number of preset types in the updated small sample support set becomes C + 1.
In the second case, the street anomaly detection device 1040 removes a second small sample support image from the small sample support set according to the feedback, where the second small sample support image records a street scene containing a second specified type of abnormal event.
Specifically, when the feedback indicates that the user corrects the type of the event corresponding to the detection image from a second specified type among the preset types to the non-occupancy type, the street abnormal event detecting device 1040 may delete from the small sample support set the small sample support images whose type label is the second specified type, i.e., the second small sample support images. The second specified type is one of the preset C types.
In the third case, the street anomaly detection device 1040 modifies a third small sample support image in the small sample support set according to the feedback, where the third small sample support image records a street scene containing a third specified type of abnormal event. The third specified type is one of the preset C types.
Specifically, the street abnormal event detecting device 1040 may, according to the feedback, extract a preset proportion, for example 10%, of the misclassified detection images to form a check set, and update the small sample support set using the check set. Given a sample $\{x_i, y_i\}$ in the knowledge base, the update strategy for the small sample support set is as follows:
(1) To ensure inference efficiency, the number of sample data of each category in the small sample support set does not exceed a second number, which is set according to an empirical value, for example 20. If the number of samples of class $y_i$ in the support set is less than 20, add $\{x_i, y_i\}$ directly to the small sample support set;
(2) if the number of samples of class $y_i$ in the small sample support set is already equal to 20, perform step (3);
(3) iterate over each sample of class $y_i$ in the small sample support set: for the $j$-th sample, tentatively remove it from the support set while adding $x_i$ in its place, and calculate the classification accuracy $P_j$ of the replaced support set on the updated check set;
(4) calculate the classification accuracy $B$ of the original small sample support set on the updated check set;
(5) if $B$ is greater than every $P_j$, the small sample support set does not need to be updated; otherwise, find the combination with the highest accuracy among the $P_j$, remove the corresponding $j$-th sample from the small sample support set, and add $x_i$ to complete the update.
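The update strategy in steps (1)-(5) can be sketched as follows, assuming the support set is a dict from class label to sample list and `evaluate` returns classification accuracy on the check set; this is an illustration under those assumptions, not the application's implementation:

```python
def update_support_set(support, x_i, y_i, check_set, evaluate, max_per_class=20):
    """Insert (x_i, y_i), replacing an old sample of the same class only
    if that improves check-set accuracy.

    support:  dict class label -> list of support samples
    evaluate: callable(support, check_set) -> classification accuracy
    """
    cls_samples = support.setdefault(y_i, [])
    if len(cls_samples) < max_per_class:          # step (1): add directly
        cls_samples.append(x_i)
        return
    baseline = evaluate(support, check_set)       # step (4): accuracy B
    best_acc, best_j = baseline, None
    for j in range(len(cls_samples)):             # step (3): try each replacement
        support[y_i] = cls_samples[:j] + cls_samples[j + 1:] + [x_i]
        p_j = evaluate(support, check_set)        # accuracy P_j
        if p_j > best_acc:
            best_acc, best_j = p_j, j
    if best_j is None:                            # step (5): B >= every P_j
        support[y_i] = cls_samples                # keep the original set
    else:
        support[y_i] = cls_samples[:best_j] + cls_samples[best_j + 1:] + [x_i]
```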
In order to accurately identify the preset types of abnormal events in the above embodiments, a high-quality reference image needs to be provided. The street scene recorded by the reference image should contain few or even no abnormal events, and likewise few or even no moving objects. Therefore, the present application provides a method for automatically acquiring and updating the reference image, which avoids requiring the user to manually select the reference image from images captured by the camera, reduces the user's workload, and improves user experience.
Specifically, the street abnormal event detecting device 1040 may take a frame of the video stream, such as the first frame I_1, as the reference image, and then update the reference image according to the moving foreground pixel proportion of subsequent image frames in the video stream. In some possible implementations, the street abnormal event detecting device 1040 obtains the moving foreground pixel proportion of the current image frame in the video stream, and when this proportion is smaller than the moving foreground pixel proportion of a historical image frame in the video stream, updates the reference image with the current image frame, i.e., determines the current image frame as the reference image.
It should be noted that the condition that the moving foreground pixel proportion of the current image frame is smaller than that of a historical image frame covers two cases: the proportion of the current image frame is smaller than that of the previous image frame, or it is smaller than that of all image frames before the current image frame. In some possible implementations, the street abnormal event detecting device 1040 may use the second comparison mode in the early stage, i.e., compare the current image frame's proportion with those of all preceding image frames, and, after the application has run for a period of time, switch to the first mode, i.e., compare with the previous image frame only.
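As one possible illustration, the moving foreground pixel proportion could be estimated with a background subtractor (here OpenCV's MOG2, an assumption; the application does not prescribe a specific method), and the "compare with all previous frames" mode then reduces to tracking the running minimum; the video path is a hypothetical example:

```python
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def moving_foreground_ratio(frame) -> float:
    """Fraction of pixels marked as moving foreground in this frame."""
    mask = subtractor.apply(frame)          # 0 = background, 255 = moving foreground
    return float((mask > 0).mean())

best_ratio, reference = 1.0, None
cap = cv2.VideoCapture("street_camera.mp4")  # assumed sample video path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    ratio = moving_foreground_ratio(frame)
    if ratio < best_ratio:                   # lower ratio -> cleaner candidate reference
        best_ratio, reference = ratio, frame.copy()
```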
For ease of understanding, the following description is made in conjunction with a specific example. The specific updating process is as follows:
(1) For the first ts frames of the video stream $\{I_1, I_2, \ldots, I_{ts}\}$, extract the motion foreground information $\{M_1, M_2, \ldots, M_{ts}\}$ and calculate the moving foreground pixel proportions $\{P_1, P_2, \ldots, P_{ts}\}$. If the moving foreground pixel proportion $P_i$ of frame $I_i$ is lower than the lowest moving foreground pixel proportion among frames $I_1$ to $I_{i-1}$, select the current video frame $I_i$ as the new reference image $B_i$;
(2) For the real-time video stream after frame ts, $\{I_{ts+1}, \ldots\}$, calculate the moving foreground pixel proportions $\{P_{ts+1}, \ldots\}$. If the moving foreground pixel proportion $P_i$ of frame $I_i$ is lower than that of frame $I_{i-1}$, update $I_i$ as the reference (template) image $B_i$.
Further, when the moving foreground pixel proportion $P_i$ of frame $I_i$ is greater than or equal to that of frame $I_{i-1}$, the street abnormal event detecting device 1040 may further perform the following steps:
(3) If the moving foreground pixel proportion $P_i$ of frame $I_i$ is above a set threshold $T_{back}$, the current motion foreground is considered cluttered; skip frame $I_i$ and do not update the reference image. Otherwise, generate the semantic difference segmentation result map $S_i$ of frame $I_i$ against the reference image $B_i$, obtaining the list of semantic difference regions $\{r_i^1, \ldots, r_i^n\}$ and the corresponding difference region images $\{d_i^1, \ldots, d_i^n\}$ of $I_i$ and $B_i$;
(4) Classify the difference region images and, for $I_i$ (and likewise for the reference image), calculate a weighted score for category $i$:

$$I_{score} = \sum_{j=1}^{n} \alpha_{ij} \, A_j$$

where $n$ is the number of difference region images, $\alpha_{ij}$ is the confidence with which difference region image $j$ is determined to belong to occupancy event class $i$, and $A_j$ is the total number of pixels in difference region image $j$;
(5) If and only if the weighted score $I_{score}$ of $I_i$ and the weighted score $B_{score}$ of the reference image, accumulated over the occupancy categories, satisfy

$$I_{score} < B_{score}$$

is this recorded as an effective update; when the number of effective updates exceeds a set threshold, the latest frame $I_i$ is updated as the reference image $B_i$. Here $H$ denotes the positive category set of street occupancy events, including out-of-store operation, street vendors, objects stacked on the road, and the like, and $I$ denotes the negative category set, including people, vehicles, roads, and the like.
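A hedged sketch of the weighted score and the effective-update condition follows; the formula details above were reconstructed from the surrounding definitions, so the category names and the exact scoring are illustrative assumptions:

```python
def weighted_score(regions, positive_set):
    """Score of a frame from its classified difference regions: sum of
    (confidence alpha_ij x pixel area A_j) over regions whose predicted
    class lies in the positive occupancy set H (assumed scoring)."""
    return sum(conf * area for cls, conf, area in regions if cls in positive_set)

# Illustrative category sets; the application lists examples only.
H = {"out_of_store_operation", "street_vendor", "objects_stacked_on_road"}

def is_effective_update(frame_regions, ref_regions):
    """Reconstructed condition: the candidate frame scores lower than the
    current reference image (I_score < B_score)."""
    return weighted_score(frame_regions, H) < weighted_score(ref_regions, H)
```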
The method for detecting an abnormal event of a street provided by the present application is described in detail above with reference to fig. 1 to 8, and the apparatus 1040 and the device 104 for detecting an abnormal event of a street provided by the present application are described below with reference to the accompanying drawings.
Referring to the schematic diagram of the street abnormal event detecting apparatus 1040 in the system architecture diagram of fig. 1, the apparatus 1040 includes:
the communication module 1042 is used for acquiring a target image and a reference image, wherein the street scene recorded by the reference image does not include the abnormal event;
a semantic difference extraction module 1044, configured to input the target image and the reference image to a semantic difference extraction network, and obtain a semantic difference region of the target image relative to the reference image;
a detecting module 1046, configured to obtain a detection result according to the semantic difference region, where the detection result is used to characterize whether the street scene recorded by the target image includes the abnormal event.
For a specific implementation of the communication module 1042, reference may be made to the description of the relevant content of S202 in the embodiment shown in fig. 2, for a specific implementation of the semantic difference extraction module 1044, reference may be made to the description of the relevant content of S204 in the embodiment shown in fig. 2, and for a specific implementation of the detection module 1046, reference may be made to the description of the relevant content of S206 in the embodiment shown in fig. 2, which is not described herein again.
In some possible implementations, the street exception event includes: violation events, security incidents, and/or potential safety hazard events.
In some possible implementations, the detecting module 1046 is specifically configured to:
obtaining at least one detection image according to the semantic difference area, wherein each detection image is obtained by segmenting the semantic difference area in the target image;
determining the detection result according to the similarity of the at least one detection image and small sample support images in a small sample support set, wherein the small sample support set comprises small sample support images representing different abnormal events.
For specific implementation of the detection module 1046, reference may be made to related content description in the embodiment shown in fig. 6, which is not described herein again.
In some possible implementations, the detecting module 1046 is specifically configured to:
inputting at least one detection image to an event classification network to obtain a detection result, wherein each detection image is obtained by segmenting a semantic difference region in the target image, the event classification network is obtained by training a plurality of images and type labels in a knowledge base, and the type labels are used for identifying the types of events corresponding to the images.
The specific implementation of the detection module 1046 may refer to the description of the related content in the embodiment shown in fig. 8, and is not described herein again.
In some possible implementations, the communication module 1042 is further configured to:
providing the detection result to a user;
obtaining feedback of a user on the detection result, wherein the feedback comprises correction on the type of an event corresponding to the detected image;
in some possible implementations, the apparatus 1040 further includes:
a first updating module for updating the small sample support set according to the feedback.
In some possible implementations, the reference image is an image frame in a video stream, and the communication module 1042 is further configured to:
obtaining the moving foreground pixel proportion of the current image frame in the video stream;
the apparatus 1040 further comprises:
and the second updating module is used for updating the reference image by using the current image frame when the moving foreground pixel proportion of the current image frame is smaller than that of the historical image frame in the video stream.
In some possible implementations, the detecting module 1046 is further configured to:
and generating a visual result graph according to the target image and the detection result.
The specific implementation of the detection module 1046 may refer to the description of the relevant content of S206 in the embodiment shown in fig. 2, and is not described herein again.
In some possible implementations, the apparatus 1040 further includes:
the elimination module is used for eliminating a motion foreground area from the semantic difference area, wherein the motion foreground area is an area where a motion object in the target image is located;
the detection module 1046 is specifically configured to:
and obtaining a detection result according to the semantic difference region for eliminating the motion foreground region.
The specific implementation of the detection module 1046 may refer to the description of the relevant content of S206 in the embodiment shown in fig. 2, and is not described herein again.
In some possible implementations, the semantic difference extraction network is a trained neural network model, and the semantic difference extraction network includes a feature extraction layer, a semantic difference fusion layer, and a semantic difference segmentation layer;
the semantic difference extraction module 1044 is specifically configured to:
inputting the target image and the reference image to the feature extraction layer to obtain a basic feature map of the target image and a basic feature map of the reference image;
inputting the basic feature map of the target image and the basic feature map of the reference image to the semantic difference fusion layer to obtain a fusion feature map, wherein the fusion feature map comprises the basic feature map of the target image, the basic feature map of the reference image and a difference feature map of the target image and the reference image;
and inputting the fusion feature map to the semantic difference segmentation layer to obtain a semantic difference area of the target image relative to the reference image.
The specific implementation of the semantic difference extraction module 1044 can refer to the description of the relevant content of S204 in the embodiment shown in fig. 2, which is not described herein again.
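As an illustration of the three layers named above, a minimal semantic difference extraction network might share one feature extractor between the two images, fuse the two basic feature maps together with their difference, and predict a per-pixel difference map; the layer sizes here are arbitrary assumptions, not the application's architecture:

```python
import torch
import torch.nn as nn

class SemanticDifferenceNet(nn.Module):
    """Sketch: shared feature extraction layer, a fusion layer that
    concatenates both basic feature maps with their difference, and a
    segmentation layer that predicts the semantic difference region."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(              # feature extraction layer
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(64 * 3, 64, 1)        # semantic difference fusion layer
        self.segment = nn.Conv2d(64, 1, 1)          # semantic difference segmentation layer

    def forward(self, target, reference):
        ft = self.features(target)                  # basic feature map of target image
        fr = self.features(reference)               # basic feature map of reference image
        fused = self.fuse(torch.cat([ft, fr, ft - fr], dim=1))  # fusion feature map
        return torch.sigmoid(self.segment(fused))   # per-pixel difference probability
```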
The street abnormal event detecting device 1040 according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each module in the street abnormal event detecting device 1040 are respectively to implement corresponding flows of each method in fig. 2, fig. 6, and fig. 8, and are not described herein again for brevity.
It should be noted that the above-described embodiments are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
The embodiment of the present application further provides a device 104 for implementing the functions of the street abnormal event detecting apparatus 1040 in the system architecture diagram shown in fig. 1. The device 104 may be a physical device or a cluster of physical devices, or may be a virtualized cloud device, such as at least one cloud computing device in a cloud computing cluster. For ease of understanding, the present application illustrates the structure of the device 104 by taking a separate physical device as an example.
Fig. 9 provides a schematic diagram of the structure of a device 104. As shown in fig. 9, the device 104 includes a bus 601, a processor 602, a communication interface 603, and a memory 604. The processor 602, the memory 604, and the communication interface 603 communicate over the bus 601. The bus 601 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 9, but this does not indicate only one bus or one type of bus. The communication interface 603 is used for communication with the outside, for example, acquiring the target image and the reference image.
The processor 602 may be a Central Processing Unit (CPU). The memory 604 may include a volatile memory (volatile memory), such as a Random Access Memory (RAM). The memory 604 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory, an HDD, or an SSD.
The memory 604 stores executable code that the processor 602 executes to perform the street anomaly detection method described above.
Specifically, in the case of implementing the embodiment shown in fig. 1, and in the case that the modules of the street abnormal event detecting apparatus 1040 described in the embodiment of fig. 1 are implemented by software, the software or program codes required for executing the functions of the semantic difference extraction module 1044 and the detection module 1046 in fig. 1 are stored in the memory 604. The processor 602 executes the program codes corresponding to the modules stored in the memory 604, such as the program codes corresponding to the semantic difference extraction module 1044 and the detection module 1046, to extract the semantic difference region of the target image relative to the reference image and obtain the detection result according to the semantic difference region. In this way, street abnormal events are detected, thereby realizing intelligent street patrol.
Of course, the code needed to perform the functions of the first updating module, the second updating module, and/or the elimination module may also be stored in the memory 604. The communication interface 603 may further provide the detection result to the user and obtain the user's feedback on the detection result, and the processor 602 may further execute the program code corresponding to the first updating module to update the small sample support set according to the feedback. The communication interface 603 may further obtain the moving foreground pixel proportion of the current image frame in the video stream, and the processor 602 may further execute the program code corresponding to the second updating module to update the reference image with the current image frame when the moving foreground pixel proportion of the current image frame is smaller than that of a historical image frame. The processor 602 may further execute the program code corresponding to the elimination module to eliminate the moving foreground region from the semantic difference region before performing the step of obtaining the detection result according to the semantic difference region.
The embodiment of the present application further provides a computer-readable storage medium, which includes instructions for instructing the device 104 to execute the above-mentioned street abnormal event detection method applied to the street abnormal event detection apparatus 1040.
The embodiment of the application also provides a computer program product, and when the computer program product is executed by a computer, the computer executes any one of the above street abnormal event detection methods. The computer program product may be a software installation package which may be downloaded and executed on a computer in the event that any of the aforementioned street anomaly detection methods needs to be used.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and the specific hardware structures for implementing the same function may vary, such as analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is usually preferable. Based on such understanding, the technical solutions of the present application may be substantially embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods according to the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application occur in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can store, or a data storage device such as a training device or a data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

Claims (20)

1. A method for detecting street abnormal events, the method comprising:
acquiring a target image and a reference image, wherein the street scene recorded by the reference image does not comprise the abnormal event;
inputting the target image and the reference image to a semantic difference extraction network to obtain a semantic difference area of the target image relative to the reference image;
and obtaining a detection result according to the semantic difference region, wherein the detection result is used for representing whether the street scene recorded by the target image comprises the abnormal event or not.
2. The method of claim 1, wherein the street exception event comprises: violation events, security incidents, and/or potential safety hazard events.
3. The method according to claim 1 or 2, wherein the obtaining a detection result according to the semantic difference region comprises:
obtaining at least one detection image according to the semantic difference area, wherein each detection image is obtained by segmenting the semantic difference area in the target image;
determining the detection result according to the similarity of the at least one detection image and small sample support images in a small sample support set, wherein the small sample support set comprises small sample support images representing different abnormal events.
4. The method according to claim 1 or 2, wherein the obtaining a detection result according to the semantic difference region comprises:
inputting at least one detection image to an event classification network to obtain a detection result, wherein each detection image is obtained by segmenting a semantic difference region in the target image, the event classification network is obtained by training a plurality of images and type labels in a knowledge base, and the type labels are used for identifying the types of events corresponding to the images.
5. The method according to claim 3 or 4, characterized in that the method further comprises:
providing the detection result to a user;
and obtaining feedback of the user on the detection result, wherein the feedback comprises correction on the type of the event corresponding to the detected image.
6. The method according to any one of claims 1 to 5, further comprising:
and generating a visual result graph according to the target image and the detection result.
7. The method of any of claims 1 to 6, wherein the reference image is an image frame in a video stream, the method further comprising:
obtaining the moving foreground pixel proportion of the current image frame in the video stream;
and when the moving foreground pixel proportion of the current image frame is smaller than that of the historical image frame in the video stream, updating the reference image by using the current image frame.
8. The method according to any one of claims 1 to 7, further comprising:
eliminating a moving foreground region from the semantic difference region, wherein the moving foreground region is a region where a moving object in the target image is located;
the obtaining a detection result according to the semantic difference region includes:
and obtaining a detection result according to the semantic difference region for eliminating the motion foreground region.
9. The method according to any one of claims 1 to 8, wherein the semantic difference extraction network is a trained neural network model, and the semantic difference extraction network comprises a feature extraction layer, a semantic difference fusion layer and a semantic difference segmentation layer;
the inputting the target image and the reference image to a semantic difference extraction network to obtain a semantic difference area of the target image relative to the reference image comprises:
inputting the target image and the reference image to the feature extraction layer to obtain a basic feature map of the target image and a basic feature map of the reference image;
inputting a basic feature map of the target image and a basic feature map of the reference image to the semantic difference fusion layer to obtain a fusion feature map, wherein the fusion feature map comprises the basic feature map of the target image, the basic feature map of the reference image and a difference feature map of the target image and the reference image;
and inputting the fusion feature map to the semantic difference segmentation layer to obtain a semantic difference area of the target image relative to the reference image.
10. An apparatus for detecting an abnormal event in a street, the apparatus comprising:
the communication module is used for acquiring a target image and a reference image, and the street scene recorded by the reference image does not include the abnormal event;
the semantic difference extraction module is used for inputting the target image and the reference image to a semantic difference extraction network to obtain a semantic difference area of the target image relative to the reference image;
and the detection module is used for obtaining a detection result according to the semantic difference region, and the detection result is used for representing whether the street scene recorded by the target image comprises the abnormal event or not.
11. The apparatus of claim 10, wherein the street exception event comprises: violation events, security incidents, and/or potential safety hazard events.
12. The apparatus according to claim 10 or 11, wherein the detection module is specifically configured to:
obtaining at least one detection image according to the semantic difference area, wherein each detection image is obtained by segmenting the semantic difference area in the target image;
determining the detection result according to the similarity of the at least one detection image and small sample support images in a small sample support set, wherein the small sample support set comprises small sample support images representing different abnormal events.
13. The apparatus according to claim 10 or 11, wherein the detection module is specifically configured to:
inputting at least one detection image to an event classification network to obtain a detection result, wherein each detection image is obtained by segmenting a semantic difference region in the target image, the event classification network is obtained by training a plurality of images and type labels in a knowledge base, and the type labels are used for identifying the types of events corresponding to the images.
14. The apparatus of claim 12 or 13, wherein the communication module is further configured to:
providing the detection result to a user;
and obtaining feedback of the user on the detection result, wherein the feedback comprises correction on the type of the event corresponding to the detected image.
15. The apparatus of any one of claims 11 to 14, wherein the detection module is further configured to:
and generating a visual result graph according to the target image and the detection result.
16. The apparatus of any of claims 11 to 15, wherein the reference image is an image frame in a video stream, and wherein the communication module is further configured to:
obtaining the moving foreground pixel proportion of the current image frame in the video stream;
the device further comprises:
and the second updating module is used for updating the reference image by using the current image frame when the moving foreground pixel proportion of the current image frame is smaller than that of the historical image frame in the video stream.
17. The apparatus of any one of claims 11 to 16, further comprising:
the elimination module is used for eliminating a motion foreground area from the semantic difference area, wherein the motion foreground area is an area where a motion object in the target image is located;
the detection module is specifically configured to:
and obtaining a detection result according to the semantic difference region for eliminating the motion foreground region.
18. The apparatus according to any one of claims 11 to 17, wherein the semantic difference extraction network is a trained neural network model, and the semantic difference extraction network comprises a feature extraction layer, a semantic difference fusion layer and a semantic difference segmentation layer;
the semantic difference extraction module is specifically configured to:
inputting the target image and the reference image to the feature extraction layer to obtain a basic feature map of the target image and a basic feature map of the reference image;
inputting a basic feature map of the target image and a basic feature map of the reference image to the semantic difference fusion layer to obtain a fusion feature map, wherein the fusion feature map comprises the basic feature map of the target image, the basic feature map of the reference image and a difference feature map of the target image and the reference image;
and inputting the fusion feature map to the semantic difference segmentation layer to obtain a semantic difference area of the target image relative to the reference image.
19. An apparatus, comprising a processor and a memory;
the processor is to execute instructions stored in the memory to cause the device to perform the method of any of claims 1 to 9.
20. A computer-readable storage medium comprising instructions that, when executed on a device, cause the device to perform the method of any of claims 1 to 9.
CN202010273415.1A 2020-04-09 2020-04-09 Method, device, equipment and medium for detecting street abnormal event Pending CN113515968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010273415.1A CN113515968A (en) 2020-04-09 2020-04-09 Method, device, equipment and medium for detecting street abnormal event

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010273415.1A CN113515968A (en) 2020-04-09 2020-04-09 Method, device, equipment and medium for detecting street abnormal event

Publications (1)

Publication Number Publication Date
CN113515968A true CN113515968A (en) 2021-10-19

Family

ID=78060073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010273415.1A Pending CN113515968A (en) 2020-04-09 2020-04-09 Method, device, equipment and medium for detecting street abnormal event

Country Status (1)

Country Link
CN (1) CN113515968A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022290A (en) * 2016-05-30 2016-10-12 广东欧珀移动通信有限公司 Fingerprint template updating method and terminal device
CN109886994A (en) * 2019-01-11 2019-06-14 上海交通大学 Adaptive sheltering detection system and method in video tracking
CN109961089A (en) * 2019-02-26 2019-07-02 中山大学 Small sample and zero sample image classification method based on metric learning and meta learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ENQIANG GUO ET AL.: "Learning to Measure Changes: Fully Convolutional Siamese Metric Networks for Scene Change Detection", arXiv, pages 1-10 *
SUN Shuifa et al.: "Video Foreground Detection and Its Application in Hydropower Engineering Monitoring", National Defense Industry Press, December 31, 2014, pages 32-34 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114170301A (en) * 2022-02-09 2022-03-11 城云科技(中国)有限公司 Abnormal municipal facility positioning method and device and application thereof
CN114782885A (en) * 2022-03-10 2022-07-22 广东律诚工程咨询有限公司 Method, device, medium and equipment for detecting personnel entering condition in tower footing range
CN115484456A (en) * 2022-09-15 2022-12-16 重庆邮电大学 Video anomaly prediction method and device based on semantic clustering
CN115527134A (en) * 2022-10-27 2022-12-27 浙江九烁光电工程技术有限公司 Urban garden landscape lighting monitoring system and method based on big data

Similar Documents

Publication Publication Date Title
Li et al. Coda: A real-world road corner case dataset for object detection in autonomous driving
JP6180482B2 (en) Methods, systems, products, and computer programs for multi-queue object detection and analysis (multi-queue object detection and analysis)
US9652863B2 (en) Multi-mode video event indexing
CN113515968A (en) Method, device, equipment and medium for detecting street abnormal event
CN103914702B (en) System and method for improving the object detection performance in video
Kalsotra et al. Background subtraction for moving object detection: explorations of recent developments and challenges
CN109829382B (en) Abnormal target early warning tracking system and method based on intelligent behavior characteristic analysis
Uy et al. Automated traffic violation apprehension system using genetic algorithm and artificial neural network
Chang et al. Video analytics in smart transportation for the AIC'18 challenge
Makhmutova et al. Object tracking method for videomonitoring in intelligent transport systems
CN114708555A (en) Forest fire prevention monitoring method based on data processing and electronic equipment
CN113111838A (en) Behavior recognition method and device, equipment and storage medium
Llopis-Castelló et al. Automatic classification and quantification of basic distresses on urban flexible pavement through convolutional neural networks
Kejriwal et al. Vehicle detection and counting using deep learning basedYOLO and deep SORT algorithm for urban traffic management system
Jiao et al. Traffic behavior recognition from traffic videos under occlusion condition: a Kalman filter approach
Andersson et al. Activity recognition and localization on a truck parking lot
CN116311166A (en) Traffic obstacle recognition method and device and electronic equipment
Delavarian et al. Multi‐camera multiple vehicle tracking in urban intersections based on multilayer graphs
CN114049771A (en) Bimodal-based traffic anomaly detection method and system and storage medium
CN114937248A (en) Vehicle tracking method and device for cross-camera, electronic equipment and storage medium
Oh et al. Development of an integrated system based vehicle tracking algorithm with shadow removal and occlusion handling methods
Chen et al. A multimedia data mining framework: Mining information from traffic video sequences
Lavanya et al. Survey on Abnormal Event Detection and Signalling in Multiple Video Surveillance Scenes Using CNN
Sankaranarayanan et al. Improved Vehicle Detection Accuracy and Processing Time for Video Based ITS Applications
CN109145715B (en) Air-based pedestrian boundary-crossing detection method, device and system for rail transit

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220215

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Applicant after: Huawei Cloud Computing Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.

TA01 Transfer of patent application right