CN112464898A - Event detection method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112464898A
CN112464898A (application CN202011480676.7A)
Authority
CN
China
Prior art keywords
video
feature
target object
video stream
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011480676.7A
Other languages
Chinese (zh)
Inventor
郭宇 (Guo Yu)
于志鹏 (Yu Zhipeng)
梁鼎 (Liang Ding)
曾星宇 (Zeng Xingyu)
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202011480676.7A
Publication of CN112464898A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The present disclosure relates to an event detection method and apparatus, an electronic device, and a storage medium, wherein the method includes: performing human body detection on a video stream, and determining a first region where a target object in the video stream is located; sampling the video stream to obtain K video frames of the video stream; determining a region image group of the target object according to the first region of the target object in the K video frames; and performing event detection on the region image group to obtain a detection result of the target object. Embodiments of the present disclosure can improve the accuracy of event detection.

Description

Event detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to an event detection method and apparatus, an electronic device, and a storage medium.
Background
With the development of the economy and the continuous improvement of infrastructure, escalators are used ever more widely in shopping malls, office buildings, public transportation, and other scenes. While escalators bring convenience, they also raise concern about accidents caused by abnormal actions of pedestrians on the escalator, such as falling, walking against the direction of travel, or squatting. If abnormal actions on the escalator cannot be detected in time and an alarm raised to stop the escalator promptly, the life and property safety of pedestrians on the escalator may suffer great loss.
The detection schemes in the related art start and stop the escalator based on infrared sensing; they are prone to false recognition, cannot raise a timely alarm for abnormal behavior, and therefore have a poor detection effect.
Disclosure of Invention
The present disclosure provides an event detection technical solution.
According to an aspect of the present disclosure, there is provided an event detection method including: carrying out human body detection on a video stream, and determining a first area where a target object in the video stream is located; sampling the video stream to obtain K video frames of the video stream, wherein K is an integer greater than 1; determining a region image group of the target object according to a first region of the target object in the K video frames; and carrying out event detection on the regional image group to obtain a detection result of the target object.
In a possible implementation manner, performing event detection on the region image group to obtain a detection result of the target object includes: performing feature extraction on the regional image group to obtain first feature information of the target object; according to the time information of the K video frames, performing characteristic offset on the first characteristic information to obtain offset second characteristic information; and processing the second characteristic information to obtain a detection result of the target object.
In a possible implementation manner, performing feature offset on the first feature information according to the time information of the K video frames to obtain the offset second feature information includes: rearranging the first feature information according to the time information of the K video frames to obtain a first feature matrix, wherein the first feature matrix comprises a time dimension and a space dimension; performing feature offset on the features in the first feature matrix in the time dimension to obtain an offset second feature matrix; and rearranging the second feature matrix to obtain the second feature information.
In a possible implementation manner, performing feature offset on the features in the first feature matrix in the time dimension to obtain the offset second feature matrix includes: for the first feature matrix, keeping the positions of the features of a first space length unchanged, shifting the features of a second space length forwards in the time dimension, and shifting the features of a third space length backwards in the time dimension to obtain the second feature matrix, wherein the space length of the space dimension is the sum of the first space length, the second space length and the third space length.
In a possible implementation manner, the time length of the time dimension is K, the space length of the space dimension is M, M being an integer greater than 1, and performing feature offset on the features in the first feature matrix in the time dimension to obtain the offset second feature matrix includes: keeping the positions of the first M/2 features of the space length of the first feature matrix unchanged; shifting the features from M/2 to 3M/4 of the space length of the first feature matrix forwards by k time lengths in the time dimension, where k is greater than or equal to 1 and less than K; shifting the features from 3M/4 to M of the space length of the first feature matrix backwards by k time lengths in the time dimension; and deleting the features exceeding the time length and filling the vacant positions with zeros to obtain the second feature matrix.
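The shift scheme just described (static first half of the space length, forward-shifted third quarter, backward-shifted last quarter, with deletion and zero filling at the time boundaries) can be sketched in a few lines of numpy; the function name and the (K, M) array layout are illustrative, not taken from the disclosure:

```python
import numpy as np

def temporal_shift(x, k=1):
    """Shift a (K, M) feature matrix along the time dimension.

    Rows index the K sampled frames (time); columns index the space
    dimension of length M. The first M/2 columns keep their positions,
    columns M/2..3M/4 shift forwards by k time steps, and columns
    3M/4..M shift backwards by k time steps; features pushed past the
    time boundary are deleted and the vacant positions are zero-filled.
    """
    K, M = x.shape
    out = np.zeros_like(x)
    out[:, : M // 2] = x[:, : M // 2]                            # unchanged
    out[k:, M // 2 : 3 * M // 4] = x[:-k, M // 2 : 3 * M // 4]   # forward
    out[:-k, 3 * M // 4 :] = x[k:, 3 * M // 4 :]                 # backward
    return out
```

Shifting only part of the features along time lets purely spatial operations that follow mix information from neighbouring frames, which is the point of the offset step.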
In a possible implementation manner, the method implements feature offset by using N-level network blocks, where N is an integer greater than 1, and the performing feature offset on the first feature information to obtain offset second feature information includes:
performing spatio-temporal feature extraction on the (i-1)-th level third feature information through the i-th level network block of the N levels of network blocks to obtain i-th level fourth feature information, where 1 ≤ i ≤ N; and performing spatio-temporal feature offset on the i-th level fourth feature information through the i-th level network block to obtain i-th level third feature information, wherein the 0-th level third feature information is the first feature information, and the N-th level third feature information is the second feature information.
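The N-level cascade can be read as a simple loop in which each block first extracts spatio-temporal features and then shifts them; `extract` and `shift` below are stand-ins for the real network operations, so this is only a structural sketch under those assumptions:

```python
import numpy as np

def extract(features):
    # stand-in for the spatio-temporal feature extraction inside one block
    return np.tanh(features)

def shift(features):
    # stand-in for the temporal feature offset inside one block
    return np.roll(features, 1, axis=0)

def run_blocks(first_feature_info, n=3):
    """Level-0 third feature information is the input (the first feature
    information); block i maps level-(i-1) third info to level-i fourth
    info (extraction), then to level-i third info (shift); the level-N
    third info is returned as the second feature information."""
    third = first_feature_info
    for _ in range(1, n + 1):
        fourth = extract(third)
        third = shift(fourth)
    return third
```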
In a possible implementation manner, the determining a region image group of the target object according to the first region of the target object in the K video frames includes: expanding each first area according to the position of the first area in the K video frames to obtain an expanded second area; and respectively intercepting the images corresponding to the second area from the K video frames to obtain the area image group.
In one possible implementation, the sampling the video stream to obtain K video frames of the video stream includes: determining a video clip with a preset time window length from the video stream according to a preset sliding step length, wherein the sliding step length is smaller than the time window length; and sampling the video clips to obtain the K video frames.
In one possible implementation, the video stream includes one or more channels of video stream, the target object in each channel includes one or more objects, and the sampling the video stream to obtain K video frames of the video stream includes: sampling each channel of video stream separately to obtain K video frames of each channel; the determining a region image group of the target object according to the first region of the target object in the K video frames includes: determining a region image group of each object separately according to the first region of each object of the target object in the corresponding K video frames; and the performing event detection on the region image group to obtain a detection result of the target object includes: performing event detection on the region image group of each object separately through a plurality of processing processes to obtain a detection result of each object.
In one possible implementation, the method implements event detection through an event detection network, and the method further includes: training the event detection network according to a preset training set, where the training set includes a plurality of sample video clips and labeling information of the plurality of sample video clips, and the labeling information includes occurrence or non-occurrence of an abnormal event, where training the event detection network according to the preset training set includes:
sampling the sample video clips to obtain K sample video frames; determining a sample region image group of a sample object in the K sample video frames according to the labeled human body region in the K sample video frames; performing event detection on the sample region image group through the event detection network to obtain a sample detection result of the sample object; and training the event detection network according to the sample detection result and the labeling information of the sample video clip.
In one possible implementation, the method further includes: and sending alarm information when the detection result is that an abnormal event occurs.
In one possible implementation, the video stream comprises a video stream of the area in which an escalator is located, and the abnormal event comprises a fall event.
According to an aspect of the present disclosure, there is provided an event detection apparatus including: the video detection module is used for carrying out human body detection on a video stream and determining a first area where a target object in the video stream is located; the video sampling module is used for sampling the video stream to obtain K video frames of the video stream, wherein K is an integer greater than 1; the image group determining module is used for determining a region image group of the target object according to a first region of the target object in the K video frames; and the event detection module is used for carrying out event detection on the regional image group to obtain a detection result of the target object.
In one possible implementation, the event detection module includes: the feature extraction submodule is used for performing feature extraction on the region image group to obtain first feature information of the target object; the offset submodule is used for carrying out characteristic offset on the first characteristic information according to the time information of the K video frames to obtain second characteristic information after offset; and the classification submodule is used for processing the second characteristic information to obtain a detection result of the target object.
In one possible implementation, the offset submodule is configured to: rearrange the first feature information according to the time information of the K video frames to obtain a first feature matrix, wherein the first feature matrix comprises a time dimension and a space dimension; perform feature offset on the features in the first feature matrix in the time dimension to obtain an offset second feature matrix; and rearrange the second feature matrix to obtain the second feature information.
In a possible implementation manner, performing feature offset on the features in the first feature matrix in the time dimension to obtain the offset second feature matrix includes: for the first feature matrix, keeping the positions of the features of a first space length unchanged, shifting the features of a second space length forwards in the time dimension, and shifting the features of a third space length backwards in the time dimension to obtain the second feature matrix, wherein the space length of the space dimension is the sum of the first space length, the second space length and the third space length.
In a possible implementation manner, the time length of the time dimension is K, the space length of the space dimension is M, M being an integer greater than 1, and performing feature offset on the features in the first feature matrix in the time dimension to obtain the offset second feature matrix includes: keeping the positions of the first M/2 features of the space length of the first feature matrix unchanged; shifting the features from M/2 to 3M/4 of the space length of the first feature matrix forwards by k time lengths in the time dimension, where k is greater than or equal to 1 and less than K; shifting the features from 3M/4 to M of the space length of the first feature matrix backwards by k time lengths in the time dimension; and deleting the features exceeding the time length and filling the vacant positions with zeros to obtain the second feature matrix.
In one possible implementation, the apparatus implements the feature offset by an N-level network block, where N is an integer greater than 1, and the offset submodule is configured to: performing space-time feature extraction on the third feature information of the i-1 level through the ith level network block of the N level network blocks to obtain the fourth feature information of the i level, wherein i is more than or equal to 1 and less than or equal to N; and performing space-time characteristic deviation on the ith-level fourth characteristic information through the ith-level network block to obtain ith-level third characteristic information, wherein the 0 th-level third characteristic information is the first characteristic information, and the Nth-level third characteristic information is the second characteristic information.
In one possible implementation, the image group determining module is configured to: expanding each first area according to the position of the first area in the K video frames to obtain an expanded second area; and respectively intercepting the images corresponding to the second area from the K video frames to obtain the area image group.
In one possible implementation, the video sampling module is configured to: determining a video clip with a preset time window length from the video stream according to a preset sliding step length, wherein the sliding step length is smaller than the time window length; and sampling the video clips to obtain the K video frames.
In one possible implementation, the video stream includes one or more channels of video stream, the target object in each channel includes one or more objects, and the video sampling module is configured to: sample each channel of video stream separately to obtain K video frames of each channel; the image group determination module is configured to: determine a region image group of each object separately according to the first region of each object of the target object in the corresponding K video frames; and the event detection module is configured to: perform event detection on the region image group of each object separately through a plurality of processing processes to obtain a detection result of each object.
In one possible implementation, the apparatus implements event detection through an event detection network, and the apparatus further includes: a training module, configured to train the event detection network according to a preset training set, where the training set includes a plurality of sample video clips and labeling information of the plurality of sample video clips, the labeling information including occurrence or non-occurrence of an abnormal event, and the training module is configured to:
sampling the sample video clips to obtain K sample video frames; determining a sample region image group of a sample object in the K sample video frames according to the labeled human body region in the K sample video frames; performing event detection on the sample region image group through the event detection network to obtain a sample detection result of the sample object; and training the event detection network according to the sample detection result and the labeling information of the sample video clip.
In one possible implementation, the apparatus further includes: an alarm module, configured to send alarm information when the detection result indicates that an abnormal event has occurred.
In one possible implementation, the video stream comprises a video stream of the area in which an escalator is located, and the abnormal event comprises a fall event.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
According to embodiments of the present disclosure, a plurality of video frames can be sampled from a video stream; and carrying out event detection according to the regional image group of the object in the plurality of video frames to obtain a detection result. The event detection is realized through a plurality of images, and the accuracy of the event detection can be improved, so that the probability of accidents is reduced, and the cost of manpower detection is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of an event detection method according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of feature shifts for an event detection method according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of an event detection network according to an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of an event detection process according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of a training process of an event detection network according to an embodiment of the present disclosure.
Fig. 6 shows a block diagram of an event detection device according to an embodiment of the present disclosure.
Fig. 7 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
FIG. 8 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein merely describes an association between associated objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B and C may mean including any one or more elements selected from the set consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
In fall-warning schemes for escalators in the related art, starting and stopping the escalator is mostly based on infrared sensing, and some computer-vision-based schemes make judgments from single images. These schemes are prone to false recognition, can only recognize the fallen state after a fall has already occurred, and cannot raise a timely alarm for the fall event.
The event detection method of the embodiments of the present disclosure can be applied to scenes such as shopping malls, office buildings, and public transportation. Based on a deep learning method, the video stream of the area where the escalator is located is processed and analyzed, so that abnormal events of an object (e.g. a pedestrian), such as falling, walking against the direction of travel, or squatting, can be detected in real time, the area on the escalator where the abnormal event occurs can be located, and an alarm can be raised in time to stop the escalator, thereby reducing the risk of safety accidents.
Fig. 1 shows a flowchart of an event detection method according to an embodiment of the present disclosure, as shown in fig. 1, the event detection method includes:
in step S11, performing human body detection on a video stream, and determining a first area where a target object is located in the video stream;
in step S12, sampling the video stream to obtain K video frames of the video stream, where K is an integer greater than 1;
in step S13, determining a region image group of the target object according to a first region of the target object in the K video frames;
in step S14, event detection is performed on the region image group to obtain a detection result of the target object.
In a possible implementation manner, the event detection method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a terminal, and the like, and the method may be implemented by a processor calling a computer readable instruction stored in a memory. Alternatively, the method may be performed by a server.
For example, a video stream of an area where an escalator is located can be acquired through an image acquisition device (such as a camera), so that an object (such as a pedestrian) riding on the escalator in a video stream picture can be detected, and the behavior of the object can be identified. The present disclosure does not limit the collection mode of the video stream and the specific area corresponding to the video stream.
In one possible implementation, the captured video stream may be decoded to obtain a decoded video stream (which may also be referred to as a picture stream). In step S11, human body detection may be performed on the decoded video stream, and human body frames in each video frame of the video stream are determined; and tracking the human body frames in the video frames to determine the human body frames of the pedestrians belonging to the same identity, namely determining a first area where the target object in the video stream is located.
The human body detection mode can be, for example, human body key point identification, human body contour detection and the like; the human body tracking mode may be, for example, determining objects belonging to the same identity according to the intersection ratio of human body frames in adjacent video frames. It will be appreciated by those skilled in the art that human detection and tracking can be accomplished in any manner known in the relevant art, and the present disclosure is not limited thereto.
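Tracking by the intersection ratio (IoU) of human body frames in adjacent video frames, mentioned above, can be sketched as follows; the (x1, y1, x2, y2) box format, the 0.5 threshold, and the function names are assumptions for illustration:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_identity(prev_boxes, box, threshold=0.5):
    """Return the index of the previous-frame box most likely to belong to
    the same pedestrian, or None if no box overlaps enough."""
    best = max(range(len(prev_boxes)),
               key=lambda i: iou(prev_boxes[i], box), default=None)
    if best is None or iou(prev_boxes[best], box) < threshold:
        return None
    return best
```

Chaining such matches frame by frame yields, for each identity, the per-frame first regions used in the later steps.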
In one possible implementation, the decoded video stream may be sampled in step S12 to obtain K video frames. A video segment of a preset time window length can be determined from the video stream, for example, with a set sliding step; and uniformly sampling the video clip to obtain K video frames. Wherein the sliding step can be set to 2s, for example; the time window length may be set to 3s, for example; k may for example take the value 8. The sliding step length, the time window length and the value of K are not limited in the disclosure.
In one possible implementation manner, after K video frames are obtained by sampling, in step S13, a region image group of the target object may be determined according to a first region of the target object in the K video frames. For example, according to the position of each first region, each first region may be expanded to the same size, and the expanded K region images may be captured, so as to obtain a region image group of the target object.
In one possible implementation manner, in step S14, event detection may be performed on the region image group of the target object, and a detection result is obtained. The detection result may include occurrence or non-occurrence of an abnormal event, and the abnormal event may include, for example, a fall event, a retrograde motion event, a squat event, etc., which is not limited by the present disclosure.
In a possible implementation manner, event detection may be implemented through a convolutional neural network, that is, the region image group is input into the convolutional neural network, and a detection result is output.
In one possible implementation, if the detection result is that an abnormal event occurs, warning information may be generated and sent. The alarm information may include a reminder of the abnormal event and may also include an area where the target object in which the abnormal event occurs is located, so that the relevant person can perform positioning. The present disclosure is not so limited.
In one possible implementation, an alarm message can be sent, for example, to the control device of the escalator, so that the control device stops the operation of the escalator; the warning information can also be sent to the related personnel in charge of the escalator operation, so that the related personnel stop the escalator operation and go to the escalator for rescue and the like. The present disclosure does not limit the content of the warning information.
According to embodiments of the present disclosure, a plurality of video frames can be sampled from a video stream; and carrying out event detection according to the regional image group of the object in the plurality of video frames to obtain a detection result. The event detection is realized through a plurality of images, and the accuracy of the event detection can be improved, so that the probability of accidents is reduced, and the cost of manpower detection is reduced.
The following provides an explanation of the event detection method of the embodiments of the present disclosure.
As described above, the video stream of the area where the escalator is located may be collected by the camera, and the collected video stream may be transmitted to the local electronic device such as the front server or the cloud server. The electronic device may decode the video stream to obtain a decoded video stream.
In a possible implementation manner, in step S11, the human body detection and tracking may be performed on the decoded video stream through a detection and tracking network, so as to detect a human body frame in each video frame of the video stream, and track a human body frame of a pedestrian belonging to the same identity, so as to obtain a first area where the target object in the video stream is located. The detection tracking network may be a convolutional neural network, and the network structure of the detection tracking network is not limited by the present disclosure.
In one possible implementation, the video stream may be sampled in step S12. Wherein, the step S12 may include:
determining a video clip with a preset time window length from the video stream according to a preset sliding step length, wherein the sliding step length is smaller than the time window length;
and sampling the video clips to obtain the K video frames.
For example, the time window length and the sliding step size may be preset so as to sample the video stream in the time window, thereby realizing continuous processing of the video stream. Wherein, the sliding step length can be less than the time window length, thereby avoiding missing the possible events. For example, the sliding step is set to 2s, and the time window length is set to 3 s; or for example the sliding step is set to 3s and the time window length is set to 5s, which the present disclosure does not limit.
In a possible implementation manner, for any one time of processing, the time window can be made to slide according to the sliding step length, and the video segment corresponding to the time window is determined; and the video clip is uniformly sampled for K times to obtain K video frames. The value of K can be set by those skilled in the art according to practical situations, for example, the value is 4, 6, 8, 12, 16, etc., which is not limited by the present disclosure.
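The windowing and uniform sampling described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation; the function and variable names are invented for this example.

```python
def window_segments(stream_len_s, window_len_s, step_s):
    """Yield (start, end) times of successive sliding windows.

    A step smaller than the window length makes consecutive windows
    overlap, so an event straddling a window boundary is not missed.
    """
    start = 0.0
    while start + window_len_s <= stream_len_s:
        yield (start, start + window_len_s)
        start += step_s


def uniform_sample(n_frames, k):
    """Pick K roughly evenly spaced frame indices from a clip."""
    return [int(i * n_frames / k) for i in range(k)]


# 10 s of video, 3 s windows sliding by 2 s; sample K=4 frames from a
# 3 s clip recorded at 25 fps (75 frames).
windows = list(window_segments(10, 3, 2))
frame_ids = uniform_sample(75, 4)
```

A production system would read frames at the chosen indices from the decoded stream; only the index computation is shown here.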
Since the human body frame in each video frame of the video stream has been detected and tracked in step S11, the first region of the target object in the K video frames can be directly determined, and the region image group of the target object is further determined in step S13.
In one possible implementation, step S13 may include:
expanding each first area according to the position of the first area in the K video frames to obtain an expanded second area;
and respectively intercepting the images corresponding to the second area from the K video frames to obtain the area image group.
For example, the position of the first region of the target object may differ across the K video frames. In this case, each first region may be expanded according to its position, so that the expanded second region contains all of the first regions. For example, the union region of the first regions is determined from their position coordinates, and the union region is expanded into a rectangular frame of a preset size as the second region.
In a possible implementation manner, after the second region is obtained, images corresponding to the second region may be respectively cut out from the K video frames, so as to obtain K region images, which are used as a region image group of the target object, for subsequent analysis processing.
In a possible implementation manner, if the target object does not appear in some of the K video frames, that is, the first region does not exist in those video frames, the handling may be determined according to the actual situation. When only a few frames are missing, for example only 1-2 frames lack the first region, the expansion can be carried out according to the video frames in which the first region exists to determine the second region; K region images are then cut out from the K video frames, and normal processing continues. When many frames are missing, for example 3 or more frames lack the first region, event detection for the target object may be abandoned. The present disclosure does not limit the manner of handling this case.
By the method, the size of the area image to be processed can be unified, the position change of the target object can be reflected, and the convenience and the accuracy of subsequent event detection are improved.
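The union-and-expand step above can be sketched as follows. Boxes are (x1, y1, x2, y2) pixel coordinates; the helper names are invented, and a real implementation would additionally clip the result to the frame bounds.

```python
def union_box(boxes):
    """Smallest rectangle containing every per-frame first region."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))


def expand_to_size(box, out_w, out_h):
    """Grow the union region into a preset out_w x out_h rectangle
    centred on it, giving the second region."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return (cx - out_w / 2, cy - out_h / 2, cx + out_w / 2, cy + out_h / 2)


# First regions of one pedestrian in K=2 sampled frames, expanded to a
# fixed 64 x 96 crop.
second_region = expand_to_size(union_box([(10, 10, 50, 90), (14, 12, 54, 92)]), 64, 96)
```

Cropping the same second region from each of the K frames then yields region images of identical size, as the text notes.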
In one possible implementation, event detection may be performed on the region image group in step S14. Wherein, the step S14 may include:
performing feature extraction on the regional image group to obtain first feature information of the target object;
according to the time information of the K video frames, performing characteristic offset on the first characteristic information to obtain offset second characteristic information;
and processing the second characteristic information to obtain a detection result of the target object.
For example, event detection may be performed on the region image group through a trained event detection network, and the event detection network may include a feature extraction sub-network, a feature shift sub-network, and a classification sub-network.
In one possible implementation manner, the region image group may be input into a feature extraction sub-network for feature extraction, and first feature information of the target object may be output. Wherein, the feature extraction sub-network may for example comprise a convolutional layer, a pooling layer, etc., the disclosure does not limit the network structure of the feature extraction sub-network.
In one possible implementation manner, the first feature information may be input into the feature shift sub-network, the feature shift may be performed on the first feature information according to the time information of the K video frames, and the shifted second feature information may be output.
Wherein, according to the time information of the K video frames, performing feature offset on the first feature information to obtain offset second feature information may include:
rearranging the first feature information according to the time information of the K video frames to obtain a first feature matrix, wherein the first feature matrix comprises a time dimension and a space dimension;
performing characteristic offset on the characteristics in the first characteristic matrix in a time dimension to obtain a second characteristic matrix after offset;
and rearranging the second feature matrix to obtain the second feature information.
For example, the first feature information may be a four-dimensional tensor of size C × H × W × K, where C, H and W respectively denote the number of channels, the height, and the width of the first feature information, representing its spatial information, and K is the number of video frames, representing its temporal information.
In one possible implementation manner, the first feature information may be rearranged according to the time information K to obtain a first feature matrix, where the first feature matrix includes a time dimension and a space dimension, the time length in the time dimension is K, and the space length in the space dimension is M, where M = C × H × W. That is, the first feature matrix is a K × M matrix.
In one possible implementation, the features in the first feature matrix may be subjected to feature shifting in a time dimension, so as to obtain a shifted second feature matrix.
Fig. 2 shows a schematic diagram of the feature shift of an event detection method according to an embodiment of the present disclosure. As shown in fig. 2, the horizontal axis may represent the time dimension, with length K; the vertical axis may represent the spatial dimension, with length M = C × H × W. That is, the first feature matrix before the shift is a K × M matrix. In fig. 2, a spatial dimension of 4 grids is used for illustration; the actual length M may be set by those skilled in the art according to the actual situation.
In one possible implementation, the step of shifting the features in the first feature matrix in the time dimension may include:
keeping the positions of the features with the first space length unchanged aiming at the first feature matrix, shifting the features with the second space length forwards in the time dimension, shifting the features with the third space length backwards in the time dimension to obtain the second feature matrix,
wherein the spatial length of the spatial dimension is a sum of the first spatial length, the second spatial length, and the third spatial length.
That is, the positions of the partial features of the first spatial length may be kept unchanged, the partial features of the second spatial length shifted forward in the time dimension, and the partial features of the third spatial length shifted backward in the time dimension. As shown in fig. 2, the positions of the first two rows of features of the first feature matrix may be kept unchanged, the third row of features shifted forward by one grid in the time dimension, and the fourth row of features shifted backward by one grid, i.e., by the time distance of one sampled video frame.
The sum of the first space length, the second space length, and the third space length is the space length M of the space dimension, and those skilled in the art can set the first space length, the second space length, and the third space length according to the actual situation, which is not limited in this disclosure.
In one possible implementation, features that exceed the time length may be deleted, for example the first grid of the third row of features and the last grid of the fourth row of features in fig. 2; the vacated positions are then filled with zeros, for example after the last grid of the third row of features and before the first grid of the fourth row of features in fig. 2. Thus, after the feature shift, a second feature matrix is obtained, which is still a K × M matrix of unchanged size.
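The shift pattern of fig. 2 can be sketched in NumPy as follows. This is a toy illustration, not the disclosed implementation; the matrix is laid out here as (K, M), so the spatial "rows" of fig. 2 appear as columns.

```python
import numpy as np

def temporal_shift(feat, fwd_cols, bwd_cols):
    """Shift the given spatial columns of a (K, M) feature matrix one
    step forward/backward in time; features pushed past the clip
    boundary are deleted and the vacated slots zero-filled."""
    out = feat.copy()
    out[:-1, fwd_cols] = feat[1:, fwd_cols]   # forward: toward earlier time
    out[-1, fwd_cols] = 0
    out[1:, bwd_cols] = feat[:-1, bwd_cols]   # backward: toward later time
    out[0, bwd_cols] = 0
    return out


# K=4 frames, M=4 spatial features, mirroring fig. 2: keep the first
# two columns, shift the third forward and the fourth backward by one
# sampled frame.
x = np.arange(16, dtype=float).reshape(4, 4)
y = temporal_shift(x, fwd_cols=[2], bwd_cols=[3])
```

As in the text, the output matrix has the same K × M size as the input.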
By the method, the characteristic deviation process can be realized, and the accuracy of event detection is improved.
In one possible implementation, the step of shifting the features in the first feature matrix in the time dimension may include:
keeping the positions of the first M/2 features of the spatial length of the first feature matrix unchanged;
forward shifting features M/2 to 3M/4 of the first feature matrix spatial length by k time lengths in the time dimension, where k is greater than or equal to 1 and less than K;
shifting 3M/4 to M features of the first feature matrix spatial length backward by k time lengths in the time dimension;
and deleting the features exceeding the time length, and filling the vacant positions with zeros to obtain the second feature matrix.
In one example, when shifting the features in the first feature matrix, the positions of the first M/2 features of the spatial length of the first feature matrix may be kept unchanged; features M/2 to 3M/4 of the spatial length are shifted forward by k time lengths in the time dimension, where k is greater than or equal to 1 and less than K; and features 3M/4 to M of the spatial length are shifted backward by k time lengths in the time dimension, i.e., by the time distance of k sampled video frames. When K is 8, k may be, for example, 1, 2, or 3, thereby enabling time offsets of various lengths. The value of k is not limited by this disclosure.
In one example, features that exceed the length of time may be deleted and the vacant locations filled with zeros, resulting in a second feature matrix that is unchanged in size.
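The partitioned variant described above might look like this sketch under the same (K, M) layout; the function name is invented, and `k` is the shift distance in sampled frames.

```python
import numpy as np

def partitioned_shift(feat, k=1):
    """Keep the first M/2 spatial columns in place, shift columns
    M/2..3M/4 forward by k frames and columns 3M/4..M backward by k
    frames, zero-filling the vacated positions."""
    K, M = feat.shape
    a, b = M // 2, 3 * M // 4
    out = feat.copy()
    out[:K - k, a:b] = feat[k:, a:b]   # forward by k time lengths
    out[K - k:, a:b] = 0
    out[k:, b:] = feat[:K - k, b:]     # backward by k time lengths
    out[:k, b:] = 0
    return out


# K=8 frames, M=4 spatial features, k=2.
x = np.arange(32, dtype=float).reshape(8, 4)
y = partitioned_shift(x, k=2)
```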
It should be understood that the above is only a schematic illustration of the feature shift; those skilled in the art can configure it according to the actual situation, as long as the positions of some features are kept unchanged while other features are shifted forward or backward in the time dimension. For example, the positions of the last M/2 features of the spatial length may be kept unchanged, while the first M/4 features and the features from M/4 to M/2 are shifted forward and backward, respectively. The present disclosure is not limited to a particular manner of feature shifting.
By the method, the characteristic deviation process can be realized, and the accuracy of event detection is improved.
In one possible implementation, the second feature matrix may be rearranged to obtain the second feature information. That is, the second feature matrix is restored to a four-dimensional tensor of C × H × W × K, thereby completing the entire feature shift process.
In a possible implementation manner, after the second feature information is obtained, it may be input into the classification sub-network for processing, and the detection result is output. That is, the classification sub-network determines the probability that an abnormal event (such as a fall) occurs to the target object; when the probability exceeds a preset threshold, a detection result indicating that an abnormal event occurred is output; otherwise, a detection result indicating that no abnormal event occurred is output.
Wherein, the classification sub-network may include a full connection layer, an activation layer, etc., and the disclosure does not limit the network structure of the classification sub-network; the preset threshold of the probability of the occurrence of the abnormal event may be set to 0.75, for example, and the value of the preset threshold is not limited by the present disclosure.
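The thresholding step can be sketched as follows. The 0.75 threshold follows the text; the sigmoid head and all names are illustrative assumptions, not from the disclosure.

```python
import math

def detect(logit, threshold=0.75):
    """Turn the classification sub-network's output logit into a
    detection result via a sigmoid probability and a preset threshold."""
    p_abnormal = 1.0 / (1.0 + math.exp(-logit))  # probability of a fall
    if p_abnormal > threshold:
        return "abnormal event", p_abnormal
    return "no abnormal event", p_abnormal
```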
By the method, the deviation of time and space characteristics can be realized, so that the network can more effectively capture the characteristic difference of abnormal actions and normal actions of the object in the time dimension, and the accuracy of network prediction is improved.
In one possible implementation, the event detection method of the embodiment of the present disclosure may implement the feature offset through N-level network blocks, where N is an integer greater than 1. That is, the feature shifting sub-network includes N-level network blocks.
Fig. 3 shows a schematic diagram of an event detection network according to an embodiment of the present disclosure. As shown in fig. 3, the event detection network may include a feature extraction subnetwork 31, a feature migration subnetwork 32, and a classification subnetwork 33. The feature extraction sub-network 31 may include a convolution layer 311 and a maximum value pooling layer 312; feature shifting sub-network 32 comprises N-level network blocks; the classification subnetwork 33 includes a fully connected layer 331.
During processing, the region image group (not shown) may be input into the feature extraction sub-network 31, which outputs the first feature information (not shown); the first feature information is input into the feature shift sub-network 32, which outputs the second feature information (not shown); and the second feature information is input into the classification sub-network 33, which outputs the detection result (not shown), thereby implementing the entire event detection process.
In a possible implementation manner, the step of performing feature offset on the first feature information to obtain offset second feature information may include:
performing spatio-temporal feature extraction on the (i-1)-th-level third feature information through the i-th-level network block of the N-level network blocks to obtain i-th-level fourth feature information, wherein i is greater than or equal to 1 and less than or equal to N;
performing a spatio-temporal feature shift on the i-th-level fourth feature information through the i-th-level network block to obtain i-th-level third feature information,
wherein the 0-th-level third feature information is the first feature information, and the N-th-level third feature information is the second feature information.
That is, the feature shift may be performed multiple times through the N-level network blocks, each of which includes a spatio-temporal feature extraction layer 321 and a spatio-temporal feature shift layer 322. Each network block may be, for example, a residual block; the present disclosure does not limit the network structure of each network block or the value of N.
For the i-th-level network block (1 ≤ i ≤ N) of the N-level network blocks, spatio-temporal feature extraction is performed on the feature information output by the previous level (referred to as the (i-1)-th-level third feature information) through the spatio-temporal feature extraction layer 321 to obtain the i-th-level fourth feature information; the spatio-temporal feature shift layer 322 then performs a spatio-temporal feature shift on the i-th-level fourth feature information and outputs the i-th-level third feature information.
When i is equal to 1, the 0-th-level third feature information input to the 1st-level network block is the first feature information; when i is equal to N, the N-th-level third feature information output by the N-th-level network block is the second feature information.
The process of the spatio-temporal feature shift is similar to the feature shift process described above: the input feature information is rearranged to obtain a feature matrix; the features in the feature matrix are shifted in the time dimension to obtain a shifted feature matrix; and the shifted feature matrix is rearranged to obtain the output feature information. The specific processing is not repeated here.
By the method, multiple time-space characteristic deviations can be realized, so that the characteristic difference on the time dimension is further extracted, and the accuracy of network prediction is further improved.
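The chaining of the N network blocks can be sketched abstractly as follows; toy callables stand in for the spatio-temporal extraction and shift layers, and the names are invented for illustration.

```python
def feature_shift_subnetwork(first_feature, blocks):
    """Run N network blocks in sequence.  Each block extracts
    spatio-temporal features from the previous level's third feature
    (the level-0 third feature being the first feature), then shifts
    them; the level-N third feature is returned as the second feature."""
    third = first_feature
    for extract, shift in blocks:
        fourth = extract(third)   # i-th-level fourth feature information
        third = shift(fourth)     # i-th-level third feature information
    return third


# Two toy blocks: "extraction" adds 1, "shifting" doubles.
blocks = [(lambda f: f + 1, lambda f: f * 2)] * 2
second_feature = feature_shift_subnetwork(1, blocks)
```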
In one possible implementation manner, the event detection method according to the embodiment of the present disclosure may further include: and sending alarm information when the detection result is that an abnormal event occurs.
That is, if the detection result obtained at step S14 is that an abnormal event occurs, warning information may be generated and transmitted at this step. The alarm information may include a reminder of the abnormal event and may also include an area where the target object in which the abnormal event occurs is located, so that the relevant person can perform positioning. The present disclosure is not so limited.
In one possible implementation, an alarm message can be sent, for example, to the control device of the escalator, so that the control device stops the operation of the escalator; the warning information can also be sent to the related personnel in charge of the escalator operation, so that the related personnel stop the escalator operation and go to the escalator for rescue and the like. The present disclosure does not limit the content of the warning information.
By the method, the real-time performance of alarming can be improved, and the probability of accidents is reduced.
In an actual scene, video streams of areas where a plurality of escalators are located can be collected through a plurality of cameras respectively; a plurality of cameras can also be arranged for some escalators (such as a longer escalator), and the cameras respectively collect video streams of all areas of the escalator, so that a plurality of video streams are obtained.
In the picture of each video stream, one or more objects riding on the escalator may appear, that is, the target object to be detected in each video stream may include one or more objects.
In this case, event detection can be performed on each object in each video stream simultaneously by means of parallel processing.
In one possible implementation, the video stream includes one or more video streams, and the target object in each video stream includes one or more objects. Wherein, step S12 includes: respectively sampling each path of video stream to obtain K video frames of each path of video stream;
step S13 includes: respectively determining a region image group of each object according to a first region of each object of the target object in corresponding K video frames;
step S14 includes: and respectively carrying out event detection on the regional image group of each object through a plurality of processing processes to obtain the detection result of each object.
For example, when the electronic device receives a single-channel or multiple-channel video stream, each channel of video stream may be decoded to obtain a decoded video stream. In step S11, human body detection and tracking may be performed on each decoded video stream through the detection and tracking network, so as to obtain a first area where the target object is located in each video stream.
In one possible implementation, in step S12, each video stream may be sampled separately. Namely, determining a video clip with a preset time window length from each path of video stream according to a set sliding step length; and uniformly sampling the video clips to obtain K video frames of each path of video stream.
In one possible implementation manner, in step S13, the regional image group of each object may be determined according to the first region of each object of the target object in the corresponding K video frames. That is, for any one object, a union region of each first region of the object is determined according to the position of the first region of the object in the corresponding K video frames, and the union region is expanded into a rectangular frame with a preset size as a second region. And respectively cutting out images corresponding to the second areas from the K video frames to obtain K area images which are used as the area image group of the object.
In a possible implementation manner, a plurality of processing processes may be provided in the electronic device, and each processing process may perform event detection on the region image group, so as to implement parallel processing. In step S14, event detection may be performed on the region image group of each object by a plurality of processing procedures, and the detection result of each object may be obtained.
Fig. 4 shows a schematic diagram of an event detection process according to an embodiment of the present disclosure. As shown in fig. 4, the single/multiple video streams are transmitted to an electronic device (not shown) for decoding; the decoded video stream is input into a detection and tracking network (not shown) for human body detection and tracking, and a human body frame where the target object is located in each path of video stream is obtained.
In an example, each video stream may be sampled separately, obtaining K video frames of each video stream. According to the human body frames of each object of the target object in the corresponding K video frames, the region image group of each object can be determined. The target object includes n objects (n is an integer greater than 1), and the region image groups of the n objects can be recorded as human body frame 1, human body frame 2, …, human body frame n.
In an example, event detection may be performed on human body frame 1, human body frame 2, …, human body frame n through a plurality of processing processes, to obtain the detection result of each object, i.e., result 1, result 2, …, result n in fig. 4. In the case of a falling event, the output result is alarm information, which may include reminding information and the position of the human body where the falling event occurred.
By means of parallel processing, abnormal events which may occur in a plurality of video pictures can be detected and alarmed at the same time, and the efficiency of event detection is improved.
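The per-object fan-out can be sketched as follows. A thread pool stands in here for the multiple processing processes described in the text, and `detect_fn` is any per-object detector; the names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_all(region_image_groups, detect_fn, workers=4):
    """Run event detection on each object's region image group in
    parallel and return one result per object, in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(detect_fn, region_image_groups))


# Toy detector: flag a "fall" when a group's score exceeds 0.5.
results = detect_all([0.9, 0.2, 0.7], lambda s: s > 0.5)
```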
The event detection network described above may be trained prior to deployment. In one possible implementation manner, the event detection method according to the embodiment of the present disclosure may further include:
training the event detection network according to a preset training set, wherein the training set comprises a plurality of sample video clips and annotation information of the sample video clips, and the annotation information indicates that an abnormal event occurred or that no abnormal event occurred.
For example, a training set may be preset, the training set comprising R sample video segments, which may be represented as {V_1, V_2, …, V_R}, where V_r (1 ≤ r ≤ R) denotes the r-th sample video segment and R is an integer greater than 1. {P_1, P_2, …, P_R} denotes the annotation information of the R sample video segments, where P_r is the annotation information of the sample video segment V_r. The annotation information is an action (fall/normal) label of an object in the sample video segment, indicating that an abnormal event occurred or that no abnormal event occurred.
In one possible implementation, the step of training the event detection network according to the training set may include:
sampling the sample video clips to obtain K sample video frames;
determining a sample region image group of a sample object in the K sample video frames according to the labeled human body region in the K sample video frames;
performing event detection on the sample region image group through the event detection network to obtain a sample detection result of the sample object;
and training the event detection network according to the sample detection result and the labeling information of the sample video clip.
Fig. 5 shows a schematic diagram of the training process of the event detection network according to an embodiment of the present disclosure. As shown in fig. 5, for any sample video segment, the sample video segment may be decoded and sampled to obtain K sample video frames; and the sample video segment or the K sample video frames may be labeled to obtain the labeled human body regions in the K sample video frames. The present disclosure does not limit the manner of labeling.
In one possible implementation manner, according to the labeled human body regions in the K sample video frames, the sample region image group of the sample object in the K sample video frames can be determined.
Namely, the human body area is expanded into a rectangular frame with a preset size according to the position of the human body area in each sample video frame; and (3) intercepting images corresponding to the rectangular frame from the K sample video frames to obtain K area images which are used as sample area image groups of the sample object for subsequent analysis processing.
In one possible implementation, the sample region image group may be input into the event detection network for event detection, to obtain the sample detection result of the sample object.
According to the difference between the sample detection result and the labeling information, the learning process of the event detection network can be supervised by using a classification loss function, and the learning of the event detection network is guided in a back propagation mode. The Loss function Loss of the event detection network can be expressed as:
Loss = Σ_{r=1}^{R} L_cls(Q_r, P_r)  (1)
wherein Q_r denotes the sample detection result of the sample video segment V_r, and L_cls denotes the classification loss.
the loss function in equation (1) may be, for example, a cross-entropy loss function, and the present disclosure does not limit the specific type of loss function.
After multiple times of training, under the condition of meeting preset training conditions (such as network convergence), the trained event detection network can be obtained, and therefore the whole training process is completed.
By the method, the training of the event detection network can be realized, and the high-precision event detection network is obtained.
According to the event detection method disclosed by the embodiment of the disclosure, the video stream of the region where the escalator is located can be detected and tracked, and the human body frame of the same object is determined; expanding the human body frame of the same object in a plurality of sampling video frames, inputting the human body frame into a trained event detection network for behavior analysis, and judging whether a falling event occurs or not; and finally, returning the area judged as the falling event and the corresponding human body frame for alarming, so that the accuracy of event detection can be improved, the real-time performance of alarming can be improved, the risk of safety accidents is reduced, and the cost of manual detection is reduced.
According to the event detection method, the abnormal event detection of the video is realized by utilizing the deep learning algorithm, the feature migration sub-network is arranged in the event detection network, the features of the space dimension are migrated in the time dimension, the space-time feature migration is realized, the falling action of the human body is represented in a more accurate form, the network can effectively capture the feature difference of the abnormal action and the normal action of the object in the time dimension, and the accuracy of the network in judging the falling action is effectively improved.
According to the event detection method disclosed by the embodiment of the disclosure, a multi-path parallel processing mode is adopted, the falling event of multiple persons under one or more video pictures can be judged at the same time, and the event detection efficiency is improved.
The event detection method of the embodiments of the present disclosure can be applied to the field of security monitoring, deployed in self-service escalator monitoring systems in application scenarios such as large shopping malls, subway stations and office buildings, to realize automatic alarming and escalator stopping upon falling events, reducing the cost of manual monitoring.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from their principle and logic; due to space limitations, such combinations are not detailed in the present disclosure. Those skilled in the art will appreciate that in the above methods of the specific embodiments, the specific execution order of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides an event detection apparatus, an electronic device, a computer-readable storage medium, and a program, which can be used to implement any event detection method provided by the present disclosure, and the corresponding technical solutions and descriptions and corresponding descriptions in the method section are not repeated.
Fig. 6 shows a block diagram of an event detection apparatus according to an embodiment of the present disclosure, as shown in fig. 6, the apparatus including:
the video detection module 61 is configured to perform human body detection on a video stream and determine a first area where a target object in the video stream is located;
a video sampling module 62, configured to sample the video stream to obtain K video frames of the video stream, where K is an integer greater than 1;
an image group determining module 63, configured to determine a region image group of the target object according to a first region of the target object in the K video frames;
and the event detection module 64 is configured to perform event detection on the region image group to obtain a detection result of the target object.
In one possible implementation, the event detection module includes: the feature extraction submodule is used for performing feature extraction on the region image group to obtain first feature information of the target object; the offset submodule is used for carrying out characteristic offset on the first characteristic information according to the time information of the K video frames to obtain second characteristic information after offset; and the classification submodule is used for processing the second characteristic information to obtain a detection result of the target object.
In one possible implementation, the offset submodule is configured to: rearrange the first feature information according to the time information of the K video frames to obtain a first feature matrix, wherein the first feature matrix comprises a time dimension and a space dimension; perform a feature shift on the features in the first feature matrix in the time dimension to obtain a shifted second feature matrix; and rearrange the second feature matrix to obtain the second feature information.
In a possible implementation manner, the feature shift is performed on the features in the first feature matrix in the time dimension to obtain the shifted second feature matrix, including: for the first feature matrix, keeping the positions of the features of the first spatial length unchanged, shifting the features of the second spatial length forward in the time dimension, and shifting the features of the third spatial length backward in the time dimension, to obtain the second feature matrix, wherein the spatial length of the spatial dimension is the sum of the first spatial length, the second spatial length, and the third spatial length.
In a possible implementation manner, the time length of the time dimension is K, the spatial length of the space dimension is M, and M is an integer greater than 1; performing the feature offset on the features in the first feature matrix in the time dimension to obtain the shifted second feature matrix includes: keeping the positions of the first M/2 features along the spatial length of the first feature matrix unchanged; shifting the features from M/2 to 3M/4 of the spatial length forward by k time steps in the time dimension, where k is an integer greater than or equal to 1 and less than K; shifting the features from 3M/4 to M of the spatial length backward by k time steps in the time dimension; and discarding features shifted beyond the time length and zero-filling the vacated positions, to obtain the second feature matrix.
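The shift described above can be sketched with a small NumPy routine (an illustrative sketch only, not the patent's implementation; the (K, M) array layout and the shift amount k are assumptions for the example):

```python
import numpy as np

def temporal_shift(features: np.ndarray, k: int = 1) -> np.ndarray:
    """Shift feature channels along the time axis as described above.

    features: (K, M) matrix -- K video frames (time), M channels (space).
    The first M/2 channels stay in place, channels M/2..3M/4 move forward
    by k time steps, channels 3M/4..M move backward by k.  Positions
    vacated by the shift are zero-filled; features pushed past either end
    of the time axis are discarded.
    """
    K, M = features.shape
    shifted = np.zeros_like(features)
    shifted[:, : M // 2] = features[:, : M // 2]                           # static part
    shifted[k:, M // 2 : 3 * M // 4] = features[:-k, M // 2 : 3 * M // 4]  # forward shift
    shifted[:-k, 3 * M // 4 :] = features[k:, 3 * M // 4 :]                # backward shift
    return shifted
```

Shifting part of the channels in time lets a purely spatial convolution applied afterwards mix information across neighbouring frames at no extra compute.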
In one possible implementation, the apparatus implements the feature offset through N levels of network blocks, where N is an integer greater than 1, and the offset submodule is configured to: perform spatio-temporal feature extraction on the third feature information of the (i-1)-th level through the i-th level network block of the N levels of network blocks to obtain fourth feature information of the i-th level, where 1 ≤ i ≤ N; and perform a spatio-temporal feature offset on the fourth feature information of the i-th level through the i-th level network block to obtain third feature information of the i-th level, where the third feature information of the 0-th level is the first feature information and the third feature information of the N-th level is the second feature information.
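The alternation of extraction and shift across the N levels can be illustrated schematically (a hypothetical sketch: `blocks` and `shift_fn` are placeholder callables standing in for the network blocks and the feature-offset operation, which the patent does not specify as code):

```python
def n_level_offset(x, blocks, shift_fn):
    """Alternate feature extraction and feature shift across N levels.

    Level i: extract level-i "fourth" features from the level-(i-1)
    "third" features, then shift them to obtain the level-i "third"
    features.  The level-0 input is the first feature information; the
    level-N output is the second feature information.
    """
    for block in blocks:          # i-th level network block: extraction
        x = shift_fn(block(x))    # spatio-temporal feature shift
    return x
```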
In one possible implementation, the image group determining module is configured to: expand each first region according to its position in the K video frames to obtain an expanded second region; and crop the images corresponding to the second regions from the K video frames, respectively, to obtain the region image group.
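As a sketch of the expand-and-crop step (the `scale` factor is an assumed example value; the patent only states that the first region is expanded to a second region, without fixing the ratio):

```python
import numpy as np

def expand_and_crop(frame: np.ndarray, box, scale: float = 1.5) -> np.ndarray:
    """Expand a detected body box about its centre and crop the frame.

    frame: H x W x C image; box: (x1, y1, x2, y2) in pixels.
    The expanded (second) region is clamped to the frame borders.
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0        # box centre
    bw, bh = (x2 - x1) * scale, (y2 - y1) * scale    # expanded size
    nx1, ny1 = max(0, int(cx - bw / 2)), max(0, int(cy - bh / 2))
    nx2, ny2 = min(w, int(cx + bw / 2)), min(h, int(cy + bh / 2))
    return frame[ny1:ny2, nx1:nx2]
```

Expanding the box keeps limbs that extend beyond the tight detection box (important for recognising a fall posture) inside the cropped region.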
In one possible implementation, the video sampling module is configured to: determine, from the video stream, a video clip of a preset time-window length according to a preset sliding step, where the sliding step is smaller than the time-window length; and sample the video clip to obtain the K video frames.
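The sliding-window sampling can be sketched as follows (window and step are expressed in frame counts here for simplicity, whereas the patent expresses them as time lengths; the uniform-sampling rule inside each window is an assumption for the example):

```python
def sliding_windows(num_frames: int, window: int, step: int, k: int):
    """Yield lists of k frame indices: one uniformly sampled set per
    sliding window of `window` frames, advanced by `step` frames.

    Because step < window, consecutive clips overlap, so an event that
    straddles a window boundary is still fully covered by some clip.
    """
    assert step < window, "sliding step must be smaller than the window"
    for start in range(0, num_frames - window + 1, step):
        yield [start + round(i * (window - 1) / (k - 1)) for i in range(k)]
```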
In one possible implementation, the video stream includes one or more video streams, and the target object in each video stream includes one or more objects. The video sampling module is configured to: sample each video stream separately to obtain K video frames of that video stream. The image group determining module is configured to: determine a region image group of each object according to the first region of that object in the corresponding K video frames. The event detection module is configured to: perform event detection on the region image group of each object through a plurality of processing processes, respectively, to obtain a detection result of each object.
In one possible implementation, the apparatus implements event detection through an event detection network, and the apparatus further includes: a training module configured to train the event detection network according to a preset training set, where the training set includes a plurality of sample video clips and labeling information of the plurality of sample video clips, the labeling information indicating whether an abnormal event occurs. The training module is configured to:
sample the sample video clips to obtain K sample video frames; determine a sample region image group of a sample object in the K sample video frames according to the labeled human body regions in the K sample video frames; perform event detection on the sample region image group through the event detection network to obtain a sample detection result of the sample object; and train the event detection network according to the sample detection result and the labeling information of the sample video clips.
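The training procedure above can be sketched generically (all callables here — the sampler, cropper, network, and loss — are hypothetical placeholders, not the patent's components):

```python
def training_step(clip, label, sample_fn, crop_fn, network, loss_fn, k=8):
    """One supervised step of event-detection training: sample K frames,
    build the sample region image group from the labelled human-body
    regions, run the detection network, and compute the loss against the
    clip's annotation (abnormal event / no abnormal event)."""
    frames = sample_fn(clip, k)    # K sample video frames
    regions = crop_fn(frames)      # sample region image group
    pred = network(regions)        # sample detection result
    return loss_fn(pred, label)    # this loss drives the network update
```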
In one possible implementation, the apparatus further includes: an alarm module configured to send alarm information when the detection result indicates that an abnormal event has occurred.
In one possible implementation, the video stream includes a video stream of an area in which an escalator is located, and the abnormal event includes a fall event.
In some embodiments, the functions or modules of the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the method embodiments above. For specific implementations, reference may be made to the descriptions of those method embodiments, which are not repeated here for brevity.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The embodiments of the present disclosure also provide a computer program product, which includes computer readable code, and when the computer readable code runs on a device, a processor in the device executes instructions for implementing the event detection method provided in any one of the above embodiments.
The embodiments of the present disclosure also provide another computer program product for storing computer readable instructions, which when executed cause a computer to perform the operations of the event detection method provided in any of the above embodiments.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 7 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or another such terminal.
Referring to fig. 7, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing state assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in the position of the electronic device 800 or one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in its temperature. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a complementary metal oxide semiconductor (CMOS) or charge coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 8 illustrates a block diagram of an electronic device 1900 in accordance with an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 8, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a software development kit (SDK).
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

1. An event detection method, comprising:
performing human body detection on a video stream, and determining a first region where a target object in the video stream is located;
sampling the video stream to obtain K video frames of the video stream, wherein K is an integer greater than 1;
determining a region image group of the target object according to a first region of the target object in the K video frames;
and performing event detection on the region image group to obtain a detection result of the target object.
2. The method according to claim 1, wherein performing event detection on the region image group to obtain a detection result of the target object comprises:
performing feature extraction on the region image group to obtain first feature information of the target object;
performing a feature offset on the first feature information according to the time information of the K video frames to obtain shifted second feature information;
and processing the second feature information to obtain a detection result of the target object.
3. The method according to claim 2, wherein performing the feature offset on the first feature information according to the time information of the K video frames to obtain the shifted second feature information comprises:
rearranging the first feature information according to the time information of the K video frames to obtain a first feature matrix, wherein the first feature matrix comprises a time dimension and a space dimension;
performing a feature offset, in the time dimension, on the features in the first feature matrix to obtain a shifted second feature matrix;
and rearranging the second feature matrix to obtain the second feature information.
4. The method of claim 3, wherein performing the feature offset on the features in the first feature matrix in the time dimension to obtain the shifted second feature matrix comprises:
for the first feature matrix, keeping the positions of features of a first spatial length unchanged, shifting features of a second spatial length forward in the time dimension, and shifting features of a third spatial length backward in the time dimension, to obtain the second feature matrix,
wherein the spatial length of the spatial dimension is a sum of the first spatial length, the second spatial length, and the third spatial length.
5. The method of claim 3, wherein the temporal dimension has a temporal length of K, the spatial dimension has a spatial length of M, M being an integer greater than 1,
wherein performing the feature offset on the features in the first feature matrix in the time dimension to obtain the shifted second feature matrix comprises:
keeping the positions of the first M/2 features along the spatial length of the first feature matrix unchanged;
shifting the features from M/2 to 3M/4 of the spatial length of the first feature matrix forward by k time steps in the time dimension, wherein k is an integer greater than or equal to 1 and less than K;
shifting the features from 3M/4 to M of the spatial length of the first feature matrix backward by k time steps in the time dimension;
and discarding features shifted beyond the time length and zero-filling the vacated positions, to obtain the second feature matrix.
6. The method according to any of claims 2-5, wherein the method implements the feature offset by means of N stages of network blocks, N being an integer greater than 1,
the performing the feature offset on the first feature information to obtain the shifted second feature information comprises:
performing spatio-temporal feature extraction on the third feature information of the (i-1)-th level through the i-th level network block of the N levels of network blocks to obtain fourth feature information of the i-th level, wherein 1 ≤ i ≤ N;
and performing a spatio-temporal feature offset on the fourth feature information of the i-th level through the i-th level network block to obtain third feature information of the i-th level,
wherein the third feature information of the 0-th level is the first feature information, and the third feature information of the N-th level is the second feature information.
7. The method according to any one of claims 1 to 6, wherein the determining the region image group of the target object according to the first region of the target object in the K video frames comprises:
expanding each first area according to the position of the first area in the K video frames to obtain an expanded second area;
and cropping the images corresponding to the second regions from the K video frames, respectively, to obtain the region image group.
8. The method according to any one of claims 1-7, wherein the sampling the video stream to obtain K video frames of the video stream comprises:
determining a video clip with a preset time window length from the video stream according to a preset sliding step length, wherein the sliding step length is smaller than the time window length;
and sampling the video clips to obtain the K video frames.
9. The method according to any of claims 1-8, wherein the video stream comprises one or more video streams, wherein the target object in each video stream comprises one or more objects,
the sampling the video stream to obtain K video frames of the video stream comprises: sampling each video stream separately to obtain K video frames of that video stream;
the determining a region image group of the target object according to the first region of the target object in the K video frames comprises: determining a region image group of each object according to the first region of that object in the corresponding K video frames; and
the performing event detection on the region image group to obtain the detection result of the target object comprises: performing event detection on the region image group of each object through a plurality of processing processes, respectively, to obtain a detection result of each object.
10. The method according to any of claims 1-9, wherein the method implements event detection over an event detection network, the method further comprising:
training the event detection network according to a preset training set, wherein the training set comprises a plurality of sample video clips and labeling information of the plurality of sample video clips, the labeling information indicating whether an abnormal event occurs,
wherein the training the event detection network according to a preset training set comprises:
sampling the sample video clips to obtain K sample video frames;
determining a sample region image group of a sample object in the K sample video frames according to the labeled human body region in the K sample video frames;
performing event detection on the sample region image group through the event detection network to obtain a sample detection result of the sample object;
and training the event detection network according to the sample detection result and the labeling information of the sample video clip.
11. The method according to any one of claims 1-10, further comprising:
and sending alarm information when the detection result indicates that an abnormal event occurs.
12. The method of claim 10 or 11, wherein the video stream comprises a video stream of an area in which an escalator is located, and the abnormal event comprises a fall event.
13. An event detection device, comprising:
a video detection module configured to perform human body detection on a video stream and determine a first region where a target object in the video stream is located;
a video sampling module configured to sample the video stream to obtain K video frames of the video stream, wherein K is an integer greater than 1;
an image group determining module configured to determine a region image group of the target object according to a first region of the target object in the K video frames;
and an event detection module configured to perform event detection on the region image group to obtain a detection result of the target object.
14. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any of claims 1 to 12.
15. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 12.
CN202011480676.7A 2020-12-15 2020-12-15 Event detection method and device, electronic equipment and storage medium Pending CN112464898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011480676.7A CN112464898A (en) 2020-12-15 2020-12-15 Event detection method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112464898A true CN112464898A (en) 2021-03-09

Family

ID=74804342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011480676.7A Pending CN112464898A (en) 2020-12-15 2020-12-15 Event detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112464898A (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229280A (en) * 2017-04-20 2018-06-29 北京市商汤科技开发有限公司 Time domain motion detection method and system, electronic equipment, computer storage media
CN109325967A (en) * 2018-09-14 2019-02-12 腾讯科技(深圳)有限公司 Method for tracking target, device, medium and equipment
CN110390226A (en) * 2018-04-16 2019-10-29 杭州海康威视数字技术股份有限公司 Crowd's event recognition method, device, electronic equipment and system
CN110738154A (en) * 2019-10-08 2020-01-31 南京熊猫电子股份有限公司 pedestrian falling detection method based on human body posture estimation
CN110942011A (en) * 2019-11-18 2020-03-31 上海极链网络科技有限公司 Video event identification method, system, electronic equipment and medium
CN111222509A (en) * 2020-01-17 2020-06-02 北京字节跳动网络技术有限公司 Target detection method and device and electronic equipment
CN111291631A (en) * 2020-01-17 2020-06-16 北京市商汤科技开发有限公司 Video analysis method and related model training method, device and apparatus
CN111310665A (en) * 2020-02-18 2020-06-19 深圳市商汤科技有限公司 Violation event detection method and device, electronic equipment and storage medium
CN112069937A (en) * 2020-08-21 2020-12-11 深圳市商汤科技有限公司 Event detection method and device, electronic equipment and storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360526A (en) * 2022-03-16 2022-04-15 杭州研极微电子有限公司 Audio detection device, method, apparatus and storage medium
CN114913470A (en) * 2022-07-11 2022-08-16 浙江大华技术股份有限公司 Event detection method and device
CN114913470B (en) * 2022-07-11 2022-10-28 浙江大华技术股份有限公司 Event detection method and device

Similar Documents

Publication Publication Date Title
WO2022183661A1 (en) Event detection method and apparatus, electronic device, storage medium, and program product
CN112241673B (en) Video processing method and device, electronic equipment and storage medium
CN109801270B (en) Anchor point determining method and device, electronic equipment and storage medium
US20210166040A1 (en) Method and system for detecting companions, electronic device and storage medium
CN108600656B (en) Method and device for adding face label in video
CN110633700B (en) Video processing method and device, electronic equipment and storage medium
CN110543849B (en) Detector configuration method and device, electronic equipment and storage medium
CN112101238A (en) Clustering method and device, electronic equipment and storage medium
CN111274426A (en) Category labeling method and device, electronic equipment and storage medium
CN112464898A (en) Event detection method and device, electronic equipment and storage medium
CN111104920A (en) Video processing method and device, electronic equipment and storage medium
CN112818844A (en) Security check abnormal event detection method and device, electronic equipment and storage medium
CN113556485A (en) Video generation method and device, electronic equipment and storage medium
CN112633184A (en) Alarm method and device, electronic equipment and storage medium
CN109344703B (en) Object detection method and device, electronic equipment and storage medium
CN112560986B (en) Image detection method and device, electronic equipment and storage medium
CN114882681A (en) Work order processing method and device, electronic equipment and storage medium
CN114187498A (en) Occlusion detection method and device, electronic equipment and storage medium
CN111435422B (en) Action recognition method, control method and device, electronic equipment and storage medium
WO2022183663A1 (en) Event detection method and apparatus, and electronic device, storage medium and program product
CN110543928B (en) Method and device for detecting number of people on trackless rubber-tyred vehicle
CN111651627A (en) Data processing method and device, electronic equipment and storage medium
CN111832338A (en) Object detection method and device, electronic equipment and storage medium
CN113269129A (en) Identity recognition method and device, electronic equipment and storage medium
CN112906484A (en) Video frame processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210309