CN113569825B - Video monitoring method and device, electronic equipment and computer readable medium

Info

Publication number: CN113569825B
Application number: CN202111126083.5A
Authority: CN (China)
Prior art keywords: layer, information, target, video, real
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN113569825A
Inventors: 杨宇, 陈银伟
Current Assignee: Beijing Guodiantong Network Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Guodiantong Network Technology Co Ltd
Application filed by Beijing Guodiantong Network Technology Co Ltd; priority to CN202111126083.5A
Publication of CN113569825A (application published); publication of CN113569825B (application granted)

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features

Abstract

The embodiments of the disclosure disclose a video monitoring method and device, an electronic device and a computer readable medium. One embodiment of the method comprises: in response to receiving a video monitoring request, acquiring a target camera information set; sending, to at least one target server, control information for controlling the target cameras corresponding to the target camera information in the target camera information set to perform real-time video acquisition; in response to receiving a monitoring video address information set sent by the at least one target server, pulling, from the at least one target server and according to the monitoring video address information set, the real-time video streams corresponding to the monitoring video address information in the set, so as to obtain a real-time video stream set; and displaying the real-time video streams in the real-time video stream set on a target display interface. This embodiment improves the display efficiency of monitoring videos.

Description

Video monitoring method and device, electronic equipment and computer readable medium
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a video monitoring method, a video monitoring device, electronic equipment and a computer readable medium.
Background
Video monitoring refers to the technology of monitoring a target shooting area in real time through cameras. At present, video monitoring is generally performed as follows: the monitoring picture is displayed in real time on the display interface of the server corresponding to each camera.
However, this manner often suffers from the following technical problems:
first, when the monitoring videos collected by several of the cameras need to be viewed, the display interface usually has to be switched, so the display efficiency of the monitoring videos is low;
second, dangerous behaviors in the monitoring pictures shown on the display interface usually have to be recognized by a person; when the display interface contains multiple monitoring pictures, such manual recognition of dangerous behaviors is inefficient.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose video monitoring methods, apparatuses, electronic devices and computer readable media to solve one or more of the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a video monitoring method, comprising: in response to receiving a video monitoring request, acquiring a target camera information set; sending, to at least one target server, control information for controlling the target cameras corresponding to the target camera information in the target camera information set to perform real-time video acquisition; in response to receiving a monitoring video address information set sent by the at least one target server, pulling, from the at least one target server and according to the monitoring video address information set, the real-time video streams corresponding to the monitoring video address information in the set to obtain a real-time video stream set; and displaying the real-time video streams in the real-time video stream set on a target display interface.
In a second aspect, some embodiments of the present disclosure provide a video monitoring apparatus, the apparatus comprising: an acquisition unit configured to acquire a target camera information set in response to receiving a video monitoring request; the sending unit is configured to send control information for controlling a target camera corresponding to the target camera information in the target camera information set to perform real-time video acquisition to at least one target server; the pulling unit is configured to respond to the received monitoring video address information set sent by the at least one target server, and pull a real-time video stream corresponding to the monitoring video address information in the monitoring video address information set from the at least one target server according to the monitoring video address information set to obtain a real-time video stream set; and the display unit is configured to display the real-time video streams in the real-time video stream set on a target display interface.
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method described in any of the implementations of the first aspect.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect.
The above embodiments of the present disclosure have the following beneficial effects: the video monitoring method of some embodiments of the present disclosure improves the display efficiency of monitoring videos. Specifically, the display efficiency of monitoring videos has been low for the following reason: because a single camera is limited in shooting angle and shooting distance, multiple cameras usually need to be deployed to monitor a large area, and these cameras often differ in brand, model and type. As a result, one often has to switch to the display interface corresponding to each camera to view its monitoring video. Based on this, the video monitoring method of some embodiments of the present disclosure first acquires a target camera information set in response to receiving a video monitoring request. Acquiring the target camera information corresponding to all cameras monitoring the target shooting area facilitates the subsequent pulling of the monitoring videos. Secondly, control information for controlling the target cameras corresponding to the target camera information in the target camera information set to perform real-time video acquisition is sent to at least one target server. Since cameras often differ in brand, model and type, sending the control information to at least one target server allows the monitoring videos to be collected in parallel. Then, in response to receiving a monitoring video address information set sent by the at least one target server, the real-time video streams corresponding to the monitoring video address information in the set are pulled from the at least one target server according to the monitoring video address information set, so as to obtain a real-time video stream set. Videos collected by cameras of different brands, models and types are often stored on different servers; acquiring the monitoring video address information set therefore allows the real-time videos shot by different cameras to be pulled in parallel. Finally, the real-time video streams in the real-time video stream set are displayed on a target display interface, so that real-time monitoring videos collected by different cameras are shown on a single display interface. This manner requires no switching of display interfaces, which greatly improves the display efficiency of the monitoring videos.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements are not necessarily drawn to scale.
Fig. 1 is a schematic diagram of an application scenario of a video surveillance method of some embodiments of the present disclosure;
FIG. 2 is a flow diagram of some embodiments of a video surveillance method according to the present disclosure;
FIG. 3 is a schematic diagram of a target presentation interface;
FIG. 4 is a flow diagram of further embodiments of a video surveillance method according to the present disclosure;
FIG. 5 is another schematic view of a target presentation interface;
FIG. 6 is a schematic block diagram of some embodiments of a video surveillance apparatus according to the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an" and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of an application scenario of a video monitoring method according to some embodiments of the present disclosure.
In the application scenario of fig. 1, first, the computing device 101 may obtain a target set of camera information 103 in response to receiving the video surveillance request 102; secondly, the computing device 101 may send control information 104 for controlling a target camera corresponding to the target camera information in the target camera information set 103 to perform real-time video acquisition to at least one target server 105; then, in response to receiving the monitoring video address information set 106 sent by the at least one target server 105, the computing device 101 may pull a real-time video stream corresponding to the monitoring video address information in the monitoring video address information set 106 from the at least one target server 105 according to the monitoring video address information set 106, so as to obtain a real-time video stream set 107; finally, the computing device 101 may present the real-time video streams in the set of real-time video streams 107 at the target presentation interface 108.
The computing device 101 may be hardware or software. When the computing device is hardware, it may be implemented as a distributed cluster composed of multiple servers or terminal devices, or may be implemented as a single server or a single terminal device. When the computing device is embodied as software, it may be installed in the hardware devices enumerated above. It may be implemented, for example, as multiple software or software modules to provide distributed services, or as a single software or software module. And is not particularly limited herein.
It should be understood that the number of computing devices in FIG. 1 is merely illustrative. There may be any number of computing devices, as implementation needs dictate.
With continued reference to fig. 2, a flow 200 of some embodiments of a video surveillance method according to the present disclosure is shown. The video monitoring method comprises the following steps:
step 201, in response to receiving a video monitoring request, acquiring a target camera information set.
In some embodiments, an executing entity (e.g., the computing device 101 shown in fig. 1) of the video monitoring method may acquire the target camera information set in response to receiving the video monitoring request through a wired connection or a wireless connection. The video monitoring request may be a request for acquiring target camera information corresponding to a camera for monitoring a target area. The video monitoring request may include: a request initiating terminal identification, monitoring area information and request initiating terminal address information. The identifier of the request initiating end may be an identifier corresponding to a terminal that initiates the video monitoring request. The monitoring area information may be position information corresponding to the target area. For example, the area corresponding to the monitoring area information may be an area framed by a geo-fence. The address information of the request initiator may represent the address of the request initiator. For example, the request originator address information may be represented by a URL (Uniform Resource Locator). The target camera information in the target camera information set may be camera information corresponding to a camera that monitors and records a picture in the target area in real time. The target camera information in the target camera information set may include: camera position information and camera video pull addresses. The camera position information can represent the position of the camera. The camera video pull address may represent an address of a server storing a video stream recorded by the camera in real time. The execution main body can determine whether the camera is located in the area corresponding to the monitoring area information by inquiring the camera position information corresponding to the camera stored in the target database, and when the camera is located in the area corresponding to the monitoring area information, the execution main body determines the camera information corresponding to the camera as the target camera information. The target database may be a database for storing camera information corresponding to the cameras. The target database may be a distributed database. For example, the target database may be an Hbase database.
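As an illustrative sketch, the lookup described above might be realized as follows in Python. This is a minimal sketch under stated assumptions: the geofence is a polygon of (longitude, latitude) vertices, the rows read from the target database carry "position" and "pull_address" fields, and all names (TargetCameraInfo, acquire_target_camera_info, and so on) are hypothetical rather than the patent's actual implementation.

from dataclasses import dataclass

@dataclass
class TargetCameraInfo:
    camera_position: tuple   # (longitude, latitude) of the camera
    video_pull_address: str  # server address storing the camera's stream

def point_in_polygon(point, polygon):
    # Ray-casting test: is `point` inside the geofenced monitoring area?
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def acquire_target_camera_info(monitoring_area, camera_rows):
    # Keep only the cameras whose stored position falls inside the area.
    return [
        TargetCameraInfo(row["position"], row["pull_address"])
        for row in camera_rows
        if point_in_polygon(row["position"], monitoring_area)
    ]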
Step 202, sending control information for controlling a target camera corresponding to the target camera information in the target camera information set to perform real-time video acquisition to at least one target server.
In some embodiments, the execution body may send, to the at least one target server, control information for controlling the target cameras corresponding to the target camera information in the target camera information set to perform real-time video acquisition. A target server in the at least one target server may be a server storing the real-time monitoring video acquired by a camera. The control information may be information for controlling a target camera corresponding to target camera information in the target camera information set to perform real-time video acquisition. For example, the control information may include: a camera video pull address, a camera address and a video acquisition identifier. The camera video pull address may represent the address of the target server storing the video stream recorded by the camera in real time. The camera address may represent the network address corresponding to the target camera. The video acquisition identifier may indicate whether the target camera is controlled to perform video acquisition. For example, the video acquisition identifier may be "1" or "2". When the video acquisition identifier is "1", it indicates that the target camera is controlled to perform video acquisition. When the video acquisition identifier is "2", it indicates that the target camera is not controlled to perform video acquisition.
As an example, the control information may be: { [ camera video pull address: 192.168.1.2, camera address: 192.168.2.0, video acquisition identifier: 1 ], [ camera video pull address: 159.144.1.0, camera address: 153.168.2.0, video acquisition identifier: 1 ], [ camera video pull address: 159.144.1.0, camera address: 153.168.2.1, video acquisition identifier: 1 ] }.
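A hedged sketch of this sending step follows: grouping the control entries by target server and posting one batch per server matches the parallel-collection idea above. The JSON field names and the "/control" HTTP endpoint are assumptions introduced for illustration, not part of the patent.

import json
from collections import defaultdict
from urllib import request

def send_control_information(camera_infos):
    # Group the per-camera control entries by the server that stores the
    # camera's stream, then POST one batch per server so that the servers
    # start the real-time captures in parallel.
    batches = defaultdict(list)
    for info in camera_infos:
        batches[info["camera_video_pull_address"]].append({
            "camera_video_pull_address": info["camera_video_pull_address"],
            "camera_address": info["camera_address"],
            "video_acquisition_identifier": "1",  # "1" = capture, "2" = do not capture
        })
    for server, entries in batches.items():
        req = request.Request(
            url=f"http://{server}/control",  # endpoint path is an assumption
            data=json.dumps(entries).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        request.urlopen(req)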
Step 203, in response to receiving the monitoring video address information set sent by the at least one target server, pulling a real-time video stream corresponding to the monitoring video address information in the monitoring video address information set from the at least one target server according to the monitoring video address information set, so as to obtain a real-time video stream set.
In some embodiments, the executing body may, in response to receiving a monitoring video address information set sent by the at least one target server, pull a real-time video stream corresponding to monitoring video address information in the monitoring video address information set from the at least one target server according to the monitoring video address information set, so as to obtain the real-time video stream set. The monitoring video address information in the monitoring video address information set can represent the addresses of videos collected by a target camera stored in a target server in real time. The execution main body may pull a Real-Time video stream corresponding to the monitoring video address information in the monitoring video address information set from the at least one target server through an RTSP (Real Time Streaming Protocol), so as to obtain a Real-Time video stream set.
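The pulling step might look like the following sketch. The patent only specifies that RTSP is used; the choice of OpenCV (whose VideoCapture accepts RTSP URLs) and of one worker thread per stream is an assumed realization.

import threading
import cv2  # OpenCV; its VideoCapture accepts RTSP URLs

def pull_stream(address, frames_out):
    # Pull one real-time stream over RTSP, decoding frames as they arrive.
    capture = cv2.VideoCapture(address)  # e.g. "rtsp://159.144.1.0/cam01"
    while capture.isOpened():
        ok, frame = capture.read()
        if not ok:
            break
        frames_out.append(frame)
    capture.release()

def pull_real_time_video_stream_set(address_set):
    # One worker thread per monitoring-video address, so the streams in the
    # set are pulled in parallel.
    streams = {address: [] for address in address_set}
    for address, buffer in streams.items():
        threading.Thread(target=pull_stream, args=(address, buffer),
                         daemon=True).start()
    return streams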
And step 204, displaying the real-time video stream in the real-time video stream set on the target display interface.
In some embodiments, the execution subject may present the real-time video streams in the set of real-time video streams on the target presentation interface. The target display interface can be used for displaying monitoring videos collected by cameras of different brands, models or types pulled from a plurality of target servers in real time.
Optionally, the target display interface may further display a real-time bullet screen.
As an example, the target presentation interface may be as shown in fig. 3. The target presentation interface may include 9 sub-target presentation interfaces 301, so as to simultaneously present the videos corresponding to multiple real-time video streams in the real-time video stream set. The target presentation interface also includes a video presentation control component 302. The video presentation control component 302 may include: a "single screen" control component, a "four split screens" control component, a "nine split screens" control component, a "sixteen split screens" control component and a "thirty-two split screens" control component. The "single screen" control component may be a component for controlling the target presentation interface to present the video corresponding to a single real-time video stream. The "four split screens" control component may be a component for controlling the target presentation interface to present the videos corresponding to four real-time video streams. The "nine split screens" control component may be a component for controlling the target presentation interface to present the videos corresponding to nine real-time video streams. The "sixteen split screens" control component may be a component for controlling the target presentation interface to present the videos corresponding to sixteen real-time video streams. The "thirty-two split screens" control component may be a component for controlling the target presentation interface to present the videos corresponding to thirty-two real-time video streams.
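Each of these split-screen components maps a stream count to a grid of sub-interfaces. A small illustrative helper follows; the near-square layout rule is an assumption, since the patent does not prescribe a layout algorithm.

import math

def grid_layout(num_streams):
    # Map a split-screen choice (1, 4, 9, 16 or 32 streams) to (rows, cols).
    cols = math.ceil(math.sqrt(num_streams))
    return math.ceil(num_streams / cols), cols

for choice in (1, 4, 9, 16, 32):
    print(choice, "streams ->", grid_layout(choice))  # e.g. 9 -> (3, 3)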
The above embodiments of the present disclosure have the following beneficial effects: by the video monitoring method of some embodiments of the present disclosure, the display efficiency of the monitored video is improved. Specifically, the reason why the display efficiency of the surveillance video is low is that: due to the limitation of the shooting angle and the shooting distance of a single camera, when video monitoring is performed on a large-area, a plurality of cameras are often required to be arranged. But there are often differences due to the brand, model and kind of cameras. The result is that switching to the display interface corresponding to the camera is often required to view the monitoring video. Based on this, the video monitoring method of some embodiments of the present disclosure first obtains a target camera information set in response to receiving a video monitoring request. By acquiring target camera information corresponding to all monitoring cameras in the monitoring target shooting area, subsequent monitoring videos are conveniently pulled. And secondly, sending control information for controlling a target camera corresponding to the target camera information in the target camera information set to perform real-time video acquisition to at least one target server. There are often differences in the make, model and variety of cameras. Therefore, the control information is sent to at least one target server, so that the monitoring videos are collected in parallel. And then, in response to receiving a monitoring video address information set sent by the at least one target server, pulling a real-time video stream corresponding to the monitoring video address information in the monitoring video address information set from the at least one target server according to the monitoring video address information set to obtain a real-time video stream set. Videos collected by cameras of different brands, models and varieties are often stored in different servers. Therefore, the real-time videos shot by different cameras are pulled in parallel by acquiring the monitoring video address information set. And finally, displaying the real-time video stream in the real-time video stream set on a target display interface. Therefore, real-time monitoring videos collected by different cameras are displayed on one display interface. The mode does not need to switch the display interface, and the display efficiency of the monitoring video is greatly improved.
With further reference to fig. 4, a flow 400 of further embodiments of a video surveillance method is shown. The process 400 of the video monitoring method includes the following steps:
step 401, in response to receiving a video monitoring request, acquiring a target camera information set.
Step 402, sending control information for controlling a target camera corresponding to target camera information in a target camera information set to perform real-time video acquisition to at least one target server.
Step 403, in response to receiving the monitoring video address information set sent by the at least one target server, pulling a real-time video stream corresponding to the monitoring video address information in the monitoring video address information set from the at least one target server according to the monitoring video address information set, so as to obtain a real-time video stream set.
And step 404, displaying the real-time video stream in the real-time video stream set on the target display interface.
In some embodiments, the specific implementation of steps 401 to 404 and the technical effects thereof can refer to steps 201 to 204 in the embodiments corresponding to fig. 2, and are not described herein again.
Step 405, performing video picture analysis on each real-time video stream in the real-time video stream set to generate a video picture analysis result, so as to obtain a video picture analysis result set.
In some embodiments, the execution subject may perform video picture analysis on each real-time video stream in the set of real-time video streams to generate a video picture analysis result, so as to obtain the set of video picture analysis results. A video picture analysis result in the video picture analysis result set can be used to represent the behavior categories of the pedestrians contained in the real-time video stream. The execution subject can determine the behavior categories of the pedestrians contained in the real-time video stream through a behavior recognition model. For example, the behavior recognition model may be, but is not limited to, any of the following: an ST-NBNN (Spatial-Temporal Naive-Bayes Nearest Neighbor) model or an ST-GCN (Spatial Temporal Graph Convolutional Networks) model.
As an example, the video picture analysis result may be: { [ frame number: 0023, pedestrian information: { (pedestrian position: (12, 23), action type: "smoking"), (pedestrian position: (22, 123), action type: "waving") } ], [ frame number: 0034, pedestrian information: { (pedestrian position: (111, 27), action type: "walking") } ] }.
Optionally, the behavior recognition model may include a feature extraction network, a feature filtering network, a feature fusion network, and a full connection layer. Wherein the feature extraction network comprises: an image feature extraction network and an optical flow feature extraction network. The above feature filter network includes: a first convolution block attention module and a second convolution block attention module.
The image feature extraction network is a ResNet-152 model. The optical flow feature extraction network is a VGG-16 model. The first convolution block attention module comprises: a first pooling layer, a multi-layer perceptron, and a second pooling layer. The second convolution block attention module comprises: a third pooling layer, a multi-layer perceptron, and a fourth pooling layer. Wherein the multi-layer perceptron included in the first convolution block attention module and the multi-layer perceptron included in the second convolution block attention module share weights.
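The pooling/shared-perceptron/pooling pattern described here matches the channel-attention branch of the convolutional block attention module (CBAM); the sketch below realizes it under that assumption, with the channel count and reduction ratio chosen arbitrarily. Passing the first module's perceptron into the second reproduces the weight sharing stated above.

import torch
import torch.nn as nn

class ConvBlockAttention(nn.Module):
    def __init__(self, channels, reduction=16, shared_mlp=None):
        super().__init__()
        # Passing the same MLP instance to both modules reproduces the
        # weight sharing described in the text.
        self.mlp = shared_mlp if shared_mlp is not None else nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # e.g. the "first pooling layer"
        self.max_pool = nn.AdaptiveMaxPool2d(1)  # e.g. the "second pooling layer"

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(self.avg_pool(x).view(b, c))
        mx = self.mlp(self.max_pool(x).view(b, c))
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weights  # re-weight the feature channels

# The two attention modules share one multi-layer perceptron:
first_module = ConvBlockAttention(channels=256)
second_module = ConvBlockAttention(channels=256, shared_mlp=first_module.mlp)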
The feature fusion network comprises: a first convolutional layer, a second convolutional layer, a fifth pooling layer, a third convolutional layer, a fourth convolutional layer, a sixth pooling layer, a seventh convolutional layer, an eighth convolutional layer, a seventh pooling layer, a first fully connected layer, a second fully connected layer, a third fully connected layer, a feature splicing layer, a fourth fully connected layer, a first random deactivation (dropout) layer, a fifth fully connected layer, a second random deactivation layer and a classification layer. Wherein the input of the first fully connected layer is the output of the fifth pooling layer, the input of the second fully connected layer is the output of the seventh pooling layer, and the input of the third fully connected layer is the output of the sixth pooling layer. The inputs of the feature splicing layer are the outputs of the first fully connected layer, the second fully connected layer and the third fully connected layer.
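The wiring of the fusion network can be sketched as follows. Only the topology follows the text (three convolution/pooling stages, three fully connected branches spliced together, then two dropout-separated fully connected layers and a classifier); the channel counts, kernel sizes and dropout rate are assumptions.

import torch
import torch.nn as nn

class FeatureFusionNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        def stage(cin, cout):
            # two convolutional layers followed by one pooling layer
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.stage1 = stage(3, 64)     # first/second conv layers + fifth pooling layer
        self.stage2 = stage(64, 128)   # third/fourth conv layers + sixth pooling layer
        self.stage3 = stage(128, 256)  # seventh/eighth conv layers + seventh pooling layer
        self.fc1 = nn.LazyLinear(256)  # fed by the fifth pooling layer
        self.fc3 = nn.LazyLinear(256)  # fed by the sixth pooling layer
        self.fc2 = nn.LazyLinear(256)  # fed by the seventh pooling layer
        self.head = nn.Sequential(     # fourth/fifth fully connected layers + dropout + classifier
            nn.Linear(3 * 256, 512), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, 512), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        fused = torch.cat([            # the feature splicing layer
            self.fc1(f1.flatten(1)),
            self.fc2(f3.flatten(1)),
            self.fc3(f2.flatten(1)),
        ], dim=1)
        return self.head(fused)

logits = FeatureFusionNet()(torch.randn(1, 3, 64, 64))  # one 64x64 input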
The behavior recognition model, as an inventive point of the present disclosure, addresses the second technical problem mentioned in the background: dangerous behaviors in the monitoring pictures shown on the display interface usually have to be recognized manually, and when the interface contains multiple monitoring pictures, manual recognition of dangerous behaviors is inefficient. First, the image feature extraction network and the optical flow feature extraction network are used separately to extract features from the real-time video stream, increasing the richness of the features; since the feature hierarchies of optical flow features and image features tend to differ, different feature extraction networks are adopted for each. Secondly, considering the feature correlation between adjacent frame images in continuous video, the first convolution block attention module and the second convolution block attention module are added to extract features along the temporal, spatial and channel dimensions. Then, to improve the feature fusion network's perception of global features and avoid the feature loss caused by attending only to local features, the present disclosure feeds the output of the fifth pooling layer into the first fully connected layer, the output of the sixth pooling layer into the third fully connected layer, and the output of the seventh pooling layer into the second fully connected layer, and splices the outputs of the three fully connected layers, thereby improving the network's perception of global features. Finally, adding the first random deactivation layer and the second random deactivation layer avoids over-fitting. Compared with manual identification, the behavior recognition model accurately recognizes pedestrian actions in the real-time video streams in real time, greatly improving the recognition efficiency of dangerous behaviors.
In some optional implementations of some embodiments, the performing subject performing video picture analysis on each real-time video stream in the set of real-time video streams to generate a video picture analysis result may include:
firstly, pedestrian detection is carried out on the real-time video stream to generate a target pedestrian information set.
The target pedestrian information in the target pedestrian information set may be information corresponding to the pedestrians contained in the real-time video stream. For example, a piece of target pedestrian information in the target pedestrian information set may include the position information of the same pedestrian in different frame images of the real-time video stream. The execution subject may determine the pedestrians contained in the real-time video stream through a target detection model to generate the target pedestrian information set. For example, the target detection model may be, but is not limited to, any of the following: an AlexNet model or an OverFeat (Integrated Recognition, Localization and Detection using Convolutional Networks) model.
As an example, the target pedestrian information may be: { pedestrian number: 0001, [ (frame number: 002, position coordinates: (22, 12)), (frame number: 005, position coordinates: (45, 33)), (frame number: 033, position coordinates: (123, 56)) ] }. The position coordinates of the pedestrian can also be formed by a set of key points corresponding to the behaviors.
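A minimal sketch of assembling such per-pedestrian records follows: per-frame detections are linked into trajectories keyed by pedestrian number. The detector interface detect_pedestrians, assumed here to return (pedestrian_id, position) pairs, is a stand-in for a model such as those named above combined with an identity-association step.

from collections import defaultdict

def build_target_pedestrian_info(frames, detect_pedestrians):
    # detect_pedestrians(frame) -> [(pedestrian_id, (x, y)), ...]
    tracks = defaultdict(list)
    for frame_number, frame in enumerate(frames):
        for pedestrian_id, position in detect_pedestrians(frame):
            tracks[pedestrian_id].append((frame_number, position))
    return [
        {"pedestrian_number": pid, "trajectory": trajectory}
        for pid, trajectory in tracks.items()
    ]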
And secondly, performing behavior analysis on the target pedestrian information in the target pedestrian information set to generate a video picture analysis result.
The execution subject can input at least one video frame image of the pedestrian corresponding to each piece of target pedestrian information in the target pedestrian information set into a behavior detection model to generate the video picture analysis result. The behavior detection model may be, but is not limited to, any of the following: a CNN (Convolutional Neural Networks) model, an FCN (Fully Convolutional Networks) model, an SVM (Support Vector Machine) model, or an RNN (Recurrent Neural Networks) model.
Optionally, the execution body performing behavior analysis on the target pedestrian information in the target pedestrian information set to generate a video picture analysis result may include the following steps:
firstly, performing motion recognition on a pedestrian corresponding to each piece of target pedestrian information in the target pedestrian information set to generate a motion recognition result, and obtaining a motion recognition result set.
The execution body can perform action recognition on the pedestrian corresponding to the target pedestrian information through an action recognition model to generate an action recognition result. The action recognition result can represent the action category of the pedestrian corresponding to the target pedestrian information. For example, the motion recognition model may be a TPN (Temporal Pyramid Network) model.
And secondly, determining behavior information according to the position information of the pedestrian corresponding to each piece of target pedestrian information in the target pedestrian information set and the action recognition result corresponding to the target pedestrian information to obtain a behavior information set.
The behavior information in the behavior information set may include position information of the pedestrian in different video frames and corresponding motion recognition results.
As an example, the behavior information may be { pedestrian number: 001, [ (frame number 00012, pedestrian position (123, 121), action type: walking), (frame number 00345, pedestrian position (13, 221), action type: smoking), (frame number 01022, pedestrian position (223, 12), action type: garbage throwing) ] }.
And thirdly, generating the video picture analysis result according to the behavior information set and the action recognition result set.
The execution body may splice each piece of behavior information in the behavior information set with the action recognition result of the corresponding pedestrian in the last frame image containing that pedestrian, so as to generate the video picture analysis result.
And 406, generating adjustment control information corresponding to each video picture analysis result in the video picture analysis result set to obtain an adjustment control information set.
In some embodiments, the execution subject may generate adjustment control information corresponding to each video picture analysis result in the video picture analysis result set, to obtain the adjustment control information set. The adjustment control information in the adjustment control information set may be information for controlling the camera to adjust the shooting angle.
As an example, when the number of pedestrians included in a video picture analysis result is greater than a preset threshold, the execution body may adjust the focal length of the camera that captured the pedestrians included in that video picture analysis result. For example, the preset threshold may be "20".
In some optional implementations of some embodiments, the execution body generating the adjustment control information corresponding to each video picture analysis result in the video picture analysis result set may include:
the method comprises the first step of responding to target behavior information in the video picture analysis result, and determining the position of a pedestrian corresponding to the target behavior information in an image included in a corresponding real-time video stream to generate candidate position information.
The target behavior information is the behavior information included in the video picture analysis result that represents a dangerous behavior. First, the execution subject may determine whether target behavior information exists according to the action types included in the video picture analysis result. For example, the target behavior information may be "not wearing a safety helmet". Then, the execution subject may determine, as the candidate position information, the pedestrian position whose action type included in the target pedestrian information is the same as the target behavior information.
And secondly, determining the state information of the target camera corresponding to the video picture analysis result.
The state information may represent a current working state of the target camera. The target camera may be a camera that records a real-time video stream corresponding to the video picture analysis result.
And thirdly, generating adjustment control information corresponding to the video picture analysis result according to the candidate position information and the state information.
The adjustment control information may be camera control information for adjusting the target camera. For example, the adjustment control information may be information for adjusting the focal length of the target camera. For another example, the adjustment control information may be information for adjusting the target camera pitch angle. The execution body may generate adjustment control information for controlling the target camera according to the position corresponding to the candidate position information and the state information.
Optionally, the status information may include: initial focus information and initial angle information. The initial focal length information may represent a current focal length of the target camera. The initial angle information may include: an initial pitch rotation angle and an initial horizontal rotation angle. The initial pitch rotation angle may represent a current pitch angle of the target camera. The initial horizontal rotation angle may represent a current horizontal angle of the target camera.
Optionally, the execution body generating the adjustment control information corresponding to the video picture analysis result according to the candidate position information and the state information may include:
first, according to the candidate position information and the angle information included in the state information, angle adjustment information is determined.
The execution body may control the target camera to rotate so that the pedestrian at the position corresponding to the candidate position information is located at the shooting center of the target camera. The execution body may determine a rotation angle of the target camera in a horizontal direction and a rotation angle of the target camera in a vertical direction as the angle adjustment information.
And secondly, determining focal length adjusting information according to the candidate position information and the focal length information included in the state information.
The execution main body can control the target camera to zoom so that the proportion of the pedestrian corresponding to the candidate position information in the picture shot by the target camera meets a preset proportion value. The execution body may determine a variation amount of the focal length as the focal length adjustment information. For example, the preset ratio value may be 25%.
And thirdly, generating adjustment control information corresponding to the video picture analysis result according to the angle adjustment information and the focal length adjustment information.
The execution body may determine the angle adjustment information and the focus adjustment information as adjustment control information corresponding to the video picture analysis result.
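Worked as a small example, the two determinations above reduce to centring the flagged pedestrian and zooming until the pedestrian occupies the preset proportion of the frame. The pixel-offset-to-rotation mapping and the assumption that the pedestrian's area grows with the square of the focal length are illustrative simplifications; the patent only states the two goals.

import math

def compute_adjustment(candidate_xy, frame_size, person_area,
                       initial_focal, preset_ratio=0.25,
                       degrees_per_pixel=0.05):
    px, py = candidate_xy
    width, height = frame_size
    # Angle adjustment: offset from the image centre mapped to pan/tilt.
    pan = (px - width / 2) * degrees_per_pixel
    tilt = (py - height / 2) * degrees_per_pixel
    # Focal adjustment: scale the focal length until the pedestrian's share
    # of the frame reaches the preset ratio (area ~ focal length squared).
    current_ratio = person_area / (width * height)
    zoom = math.sqrt(preset_ratio / current_ratio)
    return {"pan_deg": pan, "tilt_deg": tilt,
            "target_focal": initial_focal * zoom}

# A pedestrian covering 1% of a 1920x1080 frame needs a 5x zoom to reach 25%:
print(compute_adjustment((900, 300), (1920, 1080), person_area=20736,
                         initial_focal=4.8))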
Optionally, the execution main body may further control, according to the adjustment control information, a target camera corresponding to the adjustment control information to perform object tracking on a pedestrian corresponding to the target behavior information.
The execution body can control the target camera corresponding to the adjustment control information to rotate and adjust its focal length through a target tracking algorithm, so that the target camera corresponding to the adjustment control information tracks the pedestrian corresponding to the target behavior information. The target tracking algorithm may be an R-CNN (Region-based Convolutional Neural Networks) model, an SSD (Single Shot MultiBox Detector) model, or a YOLO-V3 (You Only Look Once, Version 3) model.
Optionally, the target display interface may further include a video stream display interface and a target camera control interface.
As an example, the target presentation interface may be as shown in fig. 5. The target presentation interface may include: a video stream presentation interface 501 and a target camera control interface. The target camera control interface may include: a picture control component, a focal length adjustment component, a snapshot component 506, a video recording component 507, an intercom component 508 and a target camera control component 509. The focal length adjustment component may include: a far-focus component 504 and a near-focus component 505. The picture control component may include: a picture enlargement component 502 and a picture reduction component 503. The snapshot component 506 may control the target camera to capture a still image. The video recording component 507 may control the target camera to record video. The intercom component 508 may control two-way talk. The target camera control component 509 may control the shooting angle of the target camera. The far-focus component 504 and the near-focus component 505 may control the focal length of the target camera. The picture enlargement component 502 and the picture reduction component 503 may control the scaling of the video stream picture presented in the video stream presentation interface 501. The positional relationship of the components shown in fig. 5 is only for explanation; in practical applications, the components may be arranged and laid out as required, which is not limited here.
As can be seen from fig. 4, compared with the description of some embodiments corresponding to fig. 2, the present disclosure first uses the image feature extraction network and the optical flow feature extraction network separately to extract features from the real-time video stream, increasing the richness of the features; since the feature hierarchies of optical flow features and image features tend to differ, different feature extraction networks are adopted for each. Secondly, considering the feature correlation between adjacent frame images in continuous video, the first convolution block attention module and the second convolution block attention module are added to extract features along the temporal, spatial and channel dimensions. Then, to improve the feature fusion network's perception of global features and avoid the feature loss caused by attending only to local features, the present disclosure feeds the output of the fifth pooling layer into the first fully connected layer, the output of the sixth pooling layer into the third fully connected layer, and the output of the seventh pooling layer into the second fully connected layer, and splices the outputs of the three fully connected layers, thereby improving the network's perception of global features. Finally, adding the first random deactivation layer and the second random deactivation layer avoids over-fitting. Compared with manual identification, the behavior recognition model accurately recognizes pedestrian actions in the real-time video streams in real time, greatly improving the recognition efficiency of dangerous behaviors.
With further reference to fig. 6, as an implementation of the methods illustrated in the above figures, the present disclosure provides some embodiments of a video monitoring apparatus, which correspond to the method embodiments illustrated in fig. 2, and which may be specifically applied to various electronic devices.
As shown in fig. 6, the video surveillance apparatus 600 of some embodiments includes: an obtaining unit 601 configured to obtain a target camera information set in response to receiving a video monitoring request; a sending unit 602, configured to send, to at least one target server, control information for controlling a target camera corresponding to target camera information in the target camera information set to perform real-time video acquisition; a pulling unit 603 configured to, in response to receiving a monitoring video address information set sent by the at least one target server, pull a real-time video stream corresponding to monitoring video address information in the monitoring video address information set from the at least one target server according to the monitoring video address information set, so as to obtain a real-time video stream set; and a presentation unit 604 configured to present the real-time video streams in the real-time video stream set on the target presentation interface.
It will be understood that the elements described in the apparatus 600 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 600 and the units included therein, and are not described herein again.
Referring now to FIG. 7, a block diagram of an electronic device (such as computing device 101 shown in FIG. 1) 700 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, electronic device 700 may include a processing means (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from storage 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network via communications means 709, or may be installed from storage 708, or may be installed from ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet) and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: responding to the received video monitoring request, and acquiring a target camera information set; sending control information for controlling a target camera corresponding to the target camera information in the target camera information set to perform real-time video acquisition to at least one target server; in response to receiving a monitoring video address information set sent by the at least one target server, pulling a real-time video stream corresponding to monitoring video address information in the monitoring video address information set from the at least one target server according to the monitoring video address information set to obtain a real-time video stream set; and displaying the real-time video stream in the real-time video stream set on a target display interface.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages or any combination thereof, including object oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software, and may also be implemented by hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a sending unit, a pulling unit, and a presentation unit. The names of these units do not in some cases constitute a limitation on the unit itself, and for example, the acquisition unit may also be described as a "unit that acquires a target camera information set in response to receiving a video monitoring request".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description is merely a description of preferred embodiments of the present disclosure and of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (9)

1. A video surveillance method, comprising:
responding to the received video monitoring request, and acquiring a target camera information set;
sending control information for controlling a target camera corresponding to the target camera information in the target camera information set to perform real-time video acquisition to at least one target server;
in response to receiving a monitoring video address information set sent by the at least one target server, pulling a real-time video stream corresponding to monitoring video address information in the monitoring video address information set from the at least one target server according to the monitoring video address information set to obtain a real-time video stream set;
displaying the real-time video stream in the real-time video stream set on a target display interface;
performing video picture analysis on each real-time video stream in the real-time video stream set to generate a video picture analysis result and obtain a video picture analysis result set, wherein the video picture analysis result in the video picture analysis result set is used for representing the behavior category of the pedestrian contained in the real-time video stream, the behavior category of the pedestrian contained in the real-time video stream is determined by a behavior recognition model, and the behavior recognition model comprises: a feature extraction network, a feature filtering network, a feature fusion network and a fully connected layer, wherein the feature extraction network comprises: an image feature extraction network and an optical flow feature extraction network, the feature filtering network comprises: a first convolutional block attention module and a second convolutional block attention module, the image feature extraction network is a ResNet-152 model, the optical flow feature extraction network is a VGG-16 model, the first convolutional block attention module comprises: a first pooling layer, a multi-layer perceptron, and a second pooling layer, the second convolutional block attention module comprises: a third pooling layer, a multi-layer perceptron, and a fourth pooling layer, the multi-layer perceptron included in the first convolutional block attention module and the multi-layer perceptron included in the second convolutional block attention module share weights, and the feature fusion network comprises: a first convolutional layer, a second convolutional layer, a fifth pooling layer, a third convolutional layer, a fourth convolutional layer, a sixth pooling layer, a seventh convolutional layer, an eighth convolutional layer, a seventh pooling layer, a first fully connected layer, a second fully connected layer, a third fully connected layer, a feature concatenation layer, a fourth fully connected layer, a first dropout layer, a fifth fully connected layer, a second dropout layer and a classification layer, wherein the input of the first fully connected layer is the output of the fifth pooling layer, the input of the second fully connected layer is the output of the seventh pooling layer, the input of the third fully connected layer is the output of the sixth pooling layer, and the input of the feature concatenation layer is the outputs of the first fully connected layer, the second fully connected layer and the third fully connected layer (an illustrative sketch of this architecture is given after the claims);
and generating adjustment control information corresponding to each video picture analysis result in the video picture analysis result set to obtain an adjustment control information set.
2. The method of claim 1, wherein the performing video picture analysis on each real-time video stream in the set of real-time video streams to generate video picture analysis results comprises:
performing pedestrian detection on the real-time video stream to generate a target pedestrian information set;
and performing behavior analysis on the target pedestrian information in the target pedestrian information set to generate a video picture analysis result.
3. The method of claim 2, wherein the performing behavior analysis on the target pedestrian information in the set of target pedestrian information to generate video picture analysis results comprises:
performing action recognition on the pedestrian corresponding to each piece of target pedestrian information in the target pedestrian information set to generate an action recognition result, and obtaining an action recognition result set;
determining behavior information according to the position information of the pedestrian corresponding to each piece of target pedestrian information in the target pedestrian information set and the action recognition result corresponding to the target pedestrian information to obtain a behavior information set;
and generating the video picture analysis result according to the behavior information set and the action recognition result set.
4. The method of claim 3, wherein the generating adjustment control information corresponding to each video picture analysis result in the set of video picture analysis results comprises:
in response to target behavior information being included in the video picture analysis result, determining the position of the pedestrian corresponding to the target behavior information in an image included in the corresponding real-time video stream to generate candidate position information, wherein the target behavior information is behavior information, included in the video picture analysis result, representing dangerous behaviors;
determining state information of a target camera corresponding to the video picture analysis result;
and generating adjustment control information corresponding to the video picture analysis result according to the candidate position information and the state information.
5. The method of claim 4, wherein the state information comprises: initial focal length information and initial angle information; and
the generating of the adjustment control information corresponding to the video picture analysis result according to the candidate position information and the state information includes:
determining angle adjustment information according to the candidate position information and the initial angle information included in the state information;
determining focal length adjustment information according to the candidate position information and the initial focal length information included in the state information;
and generating adjustment control information corresponding to the video picture analysis result according to the angle adjustment information and the focal length adjustment information (an illustrative geometry sketch is given after the claims).
6. The method of claim 5, wherein the method further comprises:
and controlling a target camera corresponding to the adjustment control information to track the pedestrian corresponding to the target behavior information according to the adjustment control information.
7. A video surveillance apparatus comprising:
an acquisition unit configured to acquire a target camera information set in response to receiving a video monitoring request;
the sending unit is configured to send control information for controlling a target camera corresponding to the target camera information in the target camera information set to perform real-time video acquisition to at least one target server;
the pulling unit is configured to respond to the receiving of a monitoring video address information set sent by the at least one target server, and pull a real-time video stream corresponding to the monitoring video address information in the monitoring video address information set from the at least one target server according to the monitoring video address information set to obtain a real-time video stream set;
a presentation unit configured to present the real-time video streams in the set of real-time video streams on a target presentation interface;
performing video picture analysis on each real-time video stream in the real-time video stream set to generate a video picture analysis result and obtain a video picture analysis result set, wherein the video picture analysis result in the video picture analysis result set is used for representing the behavior category of the pedestrian contained in the real-time video stream, the behavior category of the pedestrian contained in the real-time video stream is determined by a behavior recognition model, and the behavior recognition model comprises: a feature extraction network, a feature filtering network, a feature fusion network and a fully connected layer, wherein the feature extraction network comprises: an image feature extraction network and an optical flow feature extraction network, the feature filtering network comprises: a first convolutional block attention module and a second convolutional block attention module, the image feature extraction network is a ResNet-152 model, the optical flow feature extraction network is a VGG-16 model, the first convolutional block attention module comprises: a first pooling layer, a multi-layer perceptron, and a second pooling layer, the second convolutional block attention module comprises: a third pooling layer, a multi-layer perceptron, and a fourth pooling layer, the multi-layer perceptron included in the first convolutional block attention module and the multi-layer perceptron included in the second convolutional block attention module share weights, and the feature fusion network comprises: a first convolutional layer, a second convolutional layer, a fifth pooling layer, a third convolutional layer, a fourth convolutional layer, a sixth pooling layer, a seventh convolutional layer, an eighth convolutional layer, a seventh pooling layer, a first fully connected layer, a second fully connected layer, a third fully connected layer, a feature concatenation layer, a fourth fully connected layer, a first dropout layer, a fifth fully connected layer, a second dropout layer and a classification layer, wherein the input of the first fully connected layer is the output of the fifth pooling layer, the input of the second fully connected layer is the output of the seventh pooling layer, the input of the third fully connected layer is the output of the sixth pooling layer, and the input of the feature concatenation layer is the outputs of the first fully connected layer, the second fully connected layer and the third fully connected layer;
and generating adjustment control information corresponding to each video picture analysis result in the video picture analysis result set to obtain an adjustment control information set.
8. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
9. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 6.
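The behavior recognition model recited above can be pictured more concretely in code. The following PyTorch sketch is illustrative only: the claims fix the layer inventory (two backbone streams, two convolutional block attention modules with a weight-shared multi-layer perceptron, and the wiring of the fully connected and concatenation layers), while all channel counts, kernel sizes, dropout rates, and the small stand-in backbones below are assumptions; a real implementation would substitute torchvision's ResNet-152 and VGG-16.

import torch
import torch.nn as nn


class SharedMLPChannelAttention(nn.Module):
    """CBAM-style channel attention: average- and max-pooled descriptors pass
    through one MLP, so the MLP can be shared between the first and second
    convolutional block attention modules as the claims require."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # average-pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # max-pooling branch
        weight = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weight


class FusionNetwork(nn.Module):
    """Three conv branches -> pooling -> fully connected layers -> concat,
    following the claimed wiring (fc1<-pool5, fc2<-pool7, fc3<-pool6)."""

    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),
            )
        self.branch_a = branch()  # first/second conv layers + fifth pooling
        self.branch_b = branch()  # third/fourth conv layers + sixth pooling
        self.branch_c = branch()  # seventh/eighth conv layers + seventh pooling
        self.fc1 = nn.Linear(64, 128)  # input: fifth pooling layer output
        self.fc2 = nn.Linear(64, 128)  # input: seventh pooling layer output
        self.fc3 = nn.Linear(64, 128)  # input: sixth pooling layer output
        self.head = nn.Sequential(     # fc4 -> dropout -> fc5 -> dropout -> classifier
            nn.Linear(3 * 128, 256), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(256, 128), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.branch_a(x).flatten(1)
        b = self.branch_b(x).flatten(1)
        c = self.branch_c(x).flatten(1)
        fused = torch.cat([self.fc1(a), self.fc2(c), self.fc3(b)], dim=1)
        return self.head(fused)


class BehaviorRecognitionModel(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Tiny stand-ins for the ResNet-152 (image) and VGG-16 (optical flow)
        # backbones, so the sketch runs without pretrained weights.
        self.image_backbone = nn.Conv2d(3, 32, 3, padding=1)
        self.flow_backbone = nn.Conv2d(2, 32, 3, padding=1)
        self.image_attention = SharedMLPChannelAttention(32)  # first CBAM
        self.flow_attention = SharedMLPChannelAttention(32)   # second CBAM
        # Claimed weight sharing between the two attention MLPs:
        self.flow_attention.mlp = self.image_attention.mlp
        self.fusion = FusionNetwork(64, num_classes)

    def forward(self, image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        fi = self.image_attention(self.image_backbone(image))
        ff = self.flow_attention(self.flow_backbone(flow))
        return self.fusion(torch.cat([fi, ff], dim=1))


if __name__ == "__main__":
    model = BehaviorRecognitionModel(num_classes=5)
    logits = model(torch.randn(2, 3, 56, 56), torch.randn(2, 2, 56, 56))
    print(logits.shape)  # torch.Size([2, 5])

Sharing one nn.Sequential between the two attention modules is the simplest way to honor the claimed weight sharing; training both streams end-to-end is then unchanged from standard two-stream practice.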
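The claims on adjustment control information recite generating it from candidate position information plus the camera's initial angle and focal length information, without fixing the geometry. The sketch below fills that gap with an assumed pinhole-camera model; the sensor dimensions, fill ratios, and every function and field name are illustrative inventions, not taken from the patent.

import math
from dataclasses import dataclass


@dataclass
class CameraState:
    pan_deg: float           # initial angle information (horizontal)
    tilt_deg: float          # initial angle information (vertical)
    focal_length_mm: float   # initial focal length information


@dataclass
class AdjustmentControl:
    target_pan_deg: float
    target_tilt_deg: float
    target_focal_mm: float


def make_adjustment(candidate_xy, frame_wh, state: CameraState,
                    sensor_w_mm: float = 6.4, sensor_h_mm: float = 4.8,
                    subject_fill: float = 0.1,
                    target_fill: float = 0.3) -> AdjustmentControl:
    """Turn a pedestrian's pixel position (candidate position information) and
    the camera state into pan/tilt/zoom targets that re-centre and enlarge the
    subject. Sensor size and fill ratios are assumed values."""
    (x, y), (w, h) = candidate_xy, frame_wh
    # Fields of view implied by the current (initial) focal length.
    hfov = 2 * math.degrees(math.atan(sensor_w_mm / (2 * state.focal_length_mm)))
    vfov = 2 * math.degrees(math.atan(sensor_h_mm / (2 * state.focal_length_mm)))
    # Angle adjustment information: offset of the candidate position from the
    # frame centre, scaled into the field of view, added to the initial angles.
    pan = state.pan_deg + (x / w - 0.5) * hfov
    tilt = state.tilt_deg + (0.5 - y / h) * vfov
    # Focal length adjustment information: zoom so the subject's apparent size
    # grows from subject_fill to target_fill of the frame width.
    focal = state.focal_length_mm * (target_fill / subject_fill)
    return AdjustmentControl(pan, tilt, focal)


if __name__ == "__main__":
    ctrl = make_adjustment((1200, 300), (1920, 1080),
                           CameraState(pan_deg=10.0, tilt_deg=-5.0,
                                       focal_length_mm=8.0))
    print(ctrl)

With such targets in hand, the pedestrian-tracking step amounts to re-running detection on new frames and issuing fresh AdjustmentControl values as the subject moves.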
CN202111126083.5A 2021-09-26 2021-09-26 Video monitoring method and device, electronic equipment and computer readable medium Active CN113569825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111126083.5A CN113569825B (en) 2021-09-26 2021-09-26 Video monitoring method and device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN113569825A (en) 2021-10-29
CN113569825B (en) 2021-12-10

Family

ID=78174494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111126083.5A Active CN113569825B (en) 2021-09-26 2021-09-26 Video monitoring method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN113569825B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114227934B (en) * 2021-11-25 2023-05-26 三一专用汽车有限责任公司 Collaborative operation system, mixing plant and mixer truck
CN114245070B (en) * 2021-11-30 2022-08-19 慧之安信息技术股份有限公司 Method and system for centralized viewing of regional monitoring content
CN113992860B (en) * 2021-12-28 2022-04-19 北京国电通网络技术有限公司 Behavior recognition method and device based on cloud edge cooperation, electronic equipment and medium
CN115063872A (en) * 2022-08-15 2022-09-16 北京师范大学 Expression and limb recognition combined customer satisfaction detection method and system
CN115866211A (en) * 2023-02-27 2023-03-28 常州海图信息科技股份有限公司 Equipment position tracking method and device, electronic equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9019372B2 (en) * 2011-02-18 2015-04-28 Videolink Llc Remote controlled studio camera system
CN109086797B (en) * 2018-06-29 2021-12-28 中国地质大学(武汉) Abnormal event detection method and system based on attention mechanism
CN110533724B (en) * 2019-09-06 2021-10-22 电子科技大学 Computing method of monocular vision odometer based on deep learning and attention mechanism
US11636609B2 (en) * 2019-12-16 2023-04-25 Nvidia Corporation Gaze determination machine learning system having adaptive weighting of inputs
CN111639544B (en) * 2020-05-07 2022-08-09 齐齐哈尔大学 Expression recognition method based on multi-branch cross-connection convolutional neural network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004030149A (en) * 2002-06-25 2004-01-29 Marudai Wako Kk Management system for elderly person living alone
CN102025978A (en) * 2010-12-03 2011-04-20 中国联合网络通信集团有限公司 Video monitoring system and method
CN103873816A (en) * 2012-12-10 2014-06-18 中兴通讯股份有限公司 Video surveillance method and device
CN103561249A (en) * 2013-11-21 2014-02-05 国家电网公司 Monitoring system of electric power capital construction field
CN105828053A (en) * 2016-06-03 2016-08-03 京东方科技集团股份有限公司 Video monitoring method and device
CN106851224A (en) * 2017-03-29 2017-06-13 宁夏凯速德科技有限公司 Intelligent video frequency monitoring method and system based on user behavior recognition
CN109525805A (en) * 2018-08-30 2019-03-26 新我科技(广州)有限公司 A kind of Dressing Room box intelligence control system, control method and storage medium
CN111428083A (en) * 2020-03-19 2020-07-17 平安国际智慧城市科技股份有限公司 Video monitoring warning method, device, equipment and storage medium
CN112148245A (en) * 2020-11-26 2020-12-29 深圳乐播科技有限公司 Method and device for monitoring, adjusting and projecting screen, computer equipment, readable storage medium and monitoring, adjusting and projecting screen interaction system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Evaluation of an Intelligent Video Surveillance Model Based on Human Behaviour Detection and Analysis; Ashraf Abbas M. AL-Modwahi et al.; IJCSI International Journal of Computer Science Issues; 2014-05-31; Vol. 11, No. 3; 33-40 *
Human Behaviour Classification for Video Surveillance Using CNN; Alice Anjali Tiriya et al.; 2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN); 2020-12-31; 769-774 *
Fast detection of abnormal pedestrians in surveillance video based on multi-task CNN; 李俊杰 et al.; Computer Systems & Applications; 2018-11-30; Vol. 27, No. 11; 78-83 *
Design and implementation of a pedestrian detection platform for a fixed-point video surveillance system; 杨姝 et al.; Journal of Shenyang Normal University (Natural Science Edition); 2021-08-31; Vol. 39, No. 4; 348-352 *

Similar Documents

Publication Publication Date Title
CN113569825B (en) Video monitoring method and device, electronic equipment and computer readable medium
TWI435279B (en) Monitoring system, image capturing apparatus, analysis apparatus, and monitoring method
Gargees et al. Incident-supporting visual cloud computing utilizing software-defined networking
KR101530255B1 (en) Cctv system having auto tracking function of moving target
CN110536074B (en) Intelligent inspection system and inspection method
US11087140B2 (en) Information generating method and apparatus applied to terminal device
WO2021180004A1 (en) Video analysis method, video analysis management method, and related device
CN106791703B (en) Method and system for monitoring a scene based on panoramic views
CN113255619B (en) Lane line recognition and positioning method, electronic device, and computer-readable medium
US20210158490A1 (en) Joint rolling shutter correction and image deblurring
CN113887547A (en) Key point detection method and device and electronic equipment
JP2015228564A (en) Monitoring camera system
CN113824629A (en) House display method, device, equipment and medium
CN113992860B (en) Behavior recognition method and device based on cloud edge cooperation, electronic equipment and medium
CN111310595B (en) Method and device for generating information
CN112351221B (en) Image special effect processing method, device, electronic equipment and computer readable storage medium
CN113703704B (en) Interface display method, head-mounted display device, and computer-readable medium
CN103716528A (en) PTZ camera remote control system and method based on touch terminal
CN112492230B (en) Video processing method and device, readable medium and electronic equipment
CN111898529B (en) Face detection method and device, electronic equipment and computer readable medium
CN110809166B (en) Video data processing method and device and electronic equipment
JP7052225B2 (en) Information processing equipment, information processing system and information processing method
JP2022551671A (en) OBJECT DISPLAY METHOD, APPARATUS, ELECTRONIC DEVICE, AND COMPUTER-READABLE STORAGE MEDIUM
CN110991312A (en) Method, apparatus, electronic device, and medium for generating detection information
KR102506581B1 (en) Method and apparatus for processing wide angle image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant