CN116152723B - Intelligent video monitoring method and system based on big data - Google Patents

Intelligent video monitoring method and system based on big data

Info

Publication number
CN116152723B
Authority
CN
China
Prior art keywords
target object
behavior
tendency
array
descriptors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310421265.8A
Other languages
Chinese (zh)
Other versions
CN116152723A (en)
Inventor
周付有 (Zhou Fuyou)
桂锦舒 (Gui Jinshu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Guochen Intelligent System Co., Ltd.
Original Assignee
Shenzhen Guochen Intelligent System Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Guochen Intelligent System Co., Ltd.
Priority to CN202310421265.8A
Publication of CN116152723A
Application granted
Publication of CN116152723B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

In the intelligent video monitoring method and system based on big data provided herein, reference object images are introduced during extraction of the behavior tendency descriptor of a first target object, enriching the behavior tendencies represented in that descriptor. The resulting integrated behavior tendency descriptor more accurately reflects the behavior tendency of the first target object in the target video frame, which improves the accuracy and reliability of the behavior recognition result obtained from it.

Description

Intelligent video monitoring method and system based on big data
Technical Field
The application relates to the field of artificial intelligence and image processing, in particular to an intelligent video monitoring method and system based on big data.
Background
With the arrival of the big data era and the wide penetration of Internet of Things technology, daily life is connected to the Internet of Things almost all the time, and an important component of the Internet of Things is recognition and analysis of video monitoring information to achieve intelligent video monitoring. For example, in smart-campus or smart-firefighting applications, target recognition analysis is performed on monitoring big data (video information) to determine the identity of a target object and to analyze its behavior, supporting tasks such as person searching and abnormal-behavior early warning, thereby truly achieving intelligent surveillance coverage, helping authorities and other legitimate parties with management, and maintaining social order. At present, there is still room to improve the accuracy of recognition analysis for intelligent monitoring videos.
Disclosure of Invention
The invention aims to provide an intelligent video monitoring method and system based on big data so as to solve the problems.
The implementation manner of the embodiment of the application is as follows:
in a first aspect, an embodiment of the present application provides an intelligent video monitoring method based on big data, applied to an intelligent video monitoring server, where the method includes:
acquiring a monitoring video and extracting a target video frame of the monitoring video;
acquiring a plurality of reference object images of a first target object in a target video frame, wherein the plurality of reference object images are used for reflecting different behavioral tendencies corresponding to the first target object;
acquiring a behavior tendency descriptor of the first target object through the first target object and other objects except the first target object in the target video frame;
performing feature integration on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images through tendency commonality coefficients between the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images, so as to obtain an integrated behavior tendency descriptor of the first target object;
and acquiring a behavior recognition result corresponding to the target video frame through the integrated behavior tendency descriptor.
As one embodiment, the obtaining the behavior tendency descriptor of the first target object through the first target object and the rest objects except the first target object in the target video frame includes:
loading the first target object and the rest objects into a description array extraction network;
and using the description array extraction network to map focusing information of the first target object and the rest objects to obtain a behavior tendency descriptor of the first target object.
As an implementation manner, the performing focusing information mapping on the first target object and the rest objects to obtain a behavior tendency descriptor of the first target object includes:
acquiring a first search array, a first anchor array and a first result array of the first target object;
acquiring a second anchoring array and a second result array of the rest objects;
performing a standardization operation on the product result of the first search array and the first anchor array and on the product result of the first search array and the second anchor array, to obtain a first focusing eccentric coefficient of the first target object and a second focusing eccentric coefficient of the rest objects with respect to the first target object;
and summing the product result of the first focusing eccentric coefficient and the first result array and the product result of the second focusing eccentric coefficient and the second result array, to obtain the behavior tendency descriptor of the first target object.
As one embodiment, the feature integrating the behavior tendency descriptor of the first target object with the behavior tendency descriptors of the plurality of reference object images by using the tendency commonality coefficient between the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images, to obtain an integrated behavior tendency descriptor of the first target object includes:
the description array extraction network is adopted to execute the following steps:
determining a plurality of first involving eccentric coefficients between the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images by means of the tendency commonality coefficients between them, the first involving eccentric coefficients characterizing the closeness of association of the corresponding reference object images with the first target object;
and performing feature integration on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images through the plurality of first involving eccentric coefficients, to obtain the integrated behavior tendency descriptor of the first target object.
As one embodiment, the performing feature integration on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images through the plurality of first involving eccentric coefficients, to obtain the integrated behavior tendency descriptor of the first target object, includes:
performing weighted summation on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the reference object images through the plurality of first involving eccentric coefficients, to obtain the integrated behavior tendency descriptor of the first target object;
performing joint focusing information mapping on the integrated behavior tendency descriptors to obtain a plurality of focusing information mapping arrays of the first target object;
and obtaining the integrated behavior tendency descriptor of the first target object through the plurality of focusing information mapping arrays.
As an implementation manner, the obtaining, by the plurality of focusing information mapping arrays, the integrated behavior tendency descriptor of the first target object includes:
Fusing the plurality of focusing information mapping arrays to obtain a focusing information mapping matrix;
compressing the focusing information mapping matrix to obtain an integrated behavior tendency descriptor of the first target object;
the description array extraction network is debugged through the following steps:
obtaining a debugging template, wherein the debugging template comprises a monitoring video frame template, a behavior video frame template and a template tendency commonality coefficient between the monitoring video frame template and the behavior video frame template;
loading the monitoring video frame template and the behavior video frame template into the description array extraction network;
extracting an integrated behavior tendency descriptor of a template target object in the monitoring video frame template by adopting the description array extraction network;
optimizing the network parameter quantity of the description array extraction network through the loss between the tendency commonality coefficient between the integrated behavior tendency descriptors of the template target object and the integrated behavior tendency descriptors of the behavior video frame template and the template tendency commonality coefficient.
As one embodiment, the acquiring the plurality of reference object images of the first target object in the target video frame includes:
Searching a selected object image matched with the first target object in a reference object image set, wherein the reference object image set comprises a plurality of objects and a plurality of reference object images corresponding to each object;
determining a plurality of reference object images corresponding to the selected object as a plurality of reference object images of the first target object;
the first target object is obtained through the following steps: performing semantic segmentation on the target video frame to obtain a plurality of candidate recognition objects of the target video frame; and if any one of the plurality of candidate recognition objects accords with a matching condition with any one of a reference object image set, determining the any one candidate recognition object as the first target object, wherein the reference object image set comprises a plurality of objects and a plurality of reference object images corresponding to each object.
As one embodiment, the behavioral trend descriptors of the plurality of reference object images are obtained by:
for any reference object image, loading the any reference object image into a description array extraction network;
and performing focusing information mapping on a plurality of objects in any reference object image by using the description array extraction network to obtain a behavior tendency descriptor of the any reference object image.
As an implementation manner, the mapping the focusing information of the plurality of objects in the arbitrary reference object image to obtain the behavior tendency descriptor of the arbitrary reference object image includes:
acquiring a third search array, a third anchor array and a third result array of any object in the plurality of objects in any reference object image;
acquiring a fourth anchor array and a fourth result array of the rest objects except any object in a plurality of objects in any reference object image;
performing a standardization operation on the product result of the third search array and the third anchor array and on the product result of the third search array and the fourth anchor array, to obtain a third focusing eccentric coefficient of the any object and a fourth focusing eccentric coefficient of the rest objects with respect to the any object;
summing the product result of the third focusing eccentric coefficient and the third result array and the product result of the fourth focusing eccentric coefficient and the fourth result array, to obtain a behavior tendency descriptor of the any object;
and carrying out feature integration on the behavior tendency descriptors of a plurality of objects in any reference object image to obtain the behavior tendency descriptors of any reference object image.
On the other hand, an embodiment of the present application provides an intelligent video monitoring system, which comprises an intelligent video monitoring server and a video monitoring device communicatively connected to each other, wherein the intelligent video monitoring server comprises a memory and a processor, the memory stores a computer program, and when the processor runs the computer program, the above method is realized.
In summary, according to the intelligent video monitoring method and system based on big data provided by the embodiments of the present application, reference object images are added during extraction of the behavior tendency descriptor of the first target object to perfect the behavior tendencies represented in that descriptor, and the obtained integrated behavior tendency descriptor can more accurately reflect the behavior tendency of the first target object in the target video frame, thereby improving the accuracy and reliability of the behavior recognition result obtained through the integrated behavior tendency descriptor.
Other features will be set forth in part in the description that follows. Upon review of the ensuing disclosure and the accompanying figures, those skilled in the art will in part discover these features or be able to ascertain them through production or use. The features of the present application may be implemented and obtained by practicing or using the various aspects of the methods, tools, and combinations set forth in the detailed examples described below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be regarded as limiting its scope; other related drawings may be obtained from these drawings without inventive effort by a person skilled in the art.
The methods, systems, and/or programs in the accompanying drawings will be described further in terms of exemplary embodiments. These exemplary embodiments will be described in detail with reference to the drawings. These exemplary embodiments are non-limiting exemplary embodiments, wherein reference numerals represent similar mechanisms throughout the several views of the drawings.
Fig. 1 is a schematic diagram of the intelligent video monitoring system according to some embodiments of the present application.
Fig. 2 is a schematic diagram of hardware and software components in an intelligent video surveillance server, according to some embodiments of the present application.
Fig. 3 is a flow chart of a big data based intelligent video monitoring method according to some embodiments of the present application.
Fig. 4 is a schematic architecture diagram of an intelligent video monitoring device according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions described above, the technical solutions of the present application are described in detail below through the accompanying drawings and specific embodiments. It should be understood that the specific features of the embodiments of the present application are detailed descriptions of the technical solutions, not limitations of them, and the technical features of the embodiments may be combined with each other when no conflict arises.
In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant teachings. However, it will be apparent to one skilled in the art that the present application may be practiced without these details. In other instances, well-known methods, procedures, systems, components, and/or circuits have been described at a relatively high-level, without detail, in order to avoid unnecessarily obscuring aspects of the present application.
These and other features, together with the functions, acts, and combinations of parts and economies of manufacture of the related elements of structure, all of which form part of this application, may become more apparent upon consideration of the following description with reference to the accompanying drawings. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the application. It should be understood that the drawings are not to scale.
Flowcharts are used in this application to describe operations performed by systems according to embodiments of the present application. It should be understood that these operations need not be performed in the order shown; they may be performed in reverse order or concurrently, and operations may be added to or removed from the flowcharts.
Fig. 1 is a schematic diagram of an intelligent video monitoring system 400 according to some embodiments of the present application, the intelligent video monitoring system 400 including an intelligent video monitoring server 100 and a video monitoring device 300 communicatively connected to each other via a network 200.
In some embodiments, please refer to fig. 2, which is a schematic architecture diagram of the intelligent video monitoring server 100. The intelligent video monitoring server 100 includes an intelligent video monitoring device 110, a memory 120, a processor 130, and a communication unit 140. The memory 120, the processor 130, and the communication unit 140 are electrically connected to each other, directly or indirectly, to realize data transmission or interaction. For example, these components may be electrically connected to each other via one or more communication buses or signal lines. The intelligent video monitoring device 110 includes at least one software function module that may be stored in the memory 120 in the form of software or firmware, or embedded in the operating system (OS) of the intelligent video monitoring server 100. The processor 130 is configured to execute executable modules stored in the memory 120, such as the software function modules and computer programs included in the intelligent video monitoring device 110.
The memory 120 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or the like. The memory 120 is used for storing a program, and the processor 130 executes the program after receiving an execution instruction. The communication unit 140 is used for establishing a communication connection between the intelligent video monitoring server 100 and the video monitoring device 300 through the network 200, and for transceiving data through the network.
The processor may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It is to be understood that the architecture shown in fig. 2 is illustrative only and that intelligent video surveillance server 100 may also include more or fewer components than those shown in fig. 2 or have a different configuration than that shown in fig. 2. The components shown in fig. 2 may be implemented in hardware, software, or a combination thereof.
The video monitoring device 300 may be any image capturing device, for example, a camera disposed in a public place, which captures surveillance footage and generates a monitoring video; the type of the video monitoring device 300 is not limited in this application. It can be appreciated that, in the embodiments of the present application, the video information acquired by the video monitoring device 300 is acquired in compliance with applicable laws and regulations.
Fig. 3 is a flowchart of a big data based intelligent video monitoring method according to some embodiments of the present application. The method is applied to the intelligent video monitoring server 100 in fig. 1 and may specifically include the following steps 100 to 500. Alternative embodiments are described on the basis of these steps; they should be understood as examples and not as features essential to implementing the present solution.
Step 100: and acquiring a monitoring video, and extracting a target video frame of the monitoring video.
In this embodiment of the present application, the monitoring video may be video data captured by any image capturing device and may include a plurality of video frames, among them a target video frame, i.e., a video frame containing a target object (for example, a person) whose behavior needs to be analyzed and identified. It will be appreciated that the target video frame includes at least one target object or, in other embodiments, a plurality of target objects. As an embodiment, when a plurality of video frames each include the target object, the video frame with the best image quality may be selected as the target video frame.
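The frame-selection step can be illustrated with a minimal sketch. The patent does not fix a particular image-quality metric; the variance of the Laplacian, a common sharpness proxy, is assumed here purely for illustration, and the function names are hypothetical.

```python
# Minimal sketch of target-video-frame selection, assuming variance of the
# Laplacian as the image-quality measure (an assumption; the patent does not
# specify a metric).
import cv2

def select_target_frame(candidate_frames):
    """Return the candidate frame (BGR image) with the highest sharpness."""
    def sharpness(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        return cv2.Laplacian(gray, cv2.CV_64F).var()
    return max(candidate_frames, key=sharpness)
```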
Step 200: and acquiring a plurality of reference object images of a first target object in the target video frame, wherein the plurality of reference object images are used for reflecting different behavioral trends corresponding to the first target object.
The first target object in the target video frame is an object requiring behavior detection. The current behavior of the first target object may correspond to different behavior results; for example, the same human action may be identified as exercising or as fighting. A reference object image is a representative image of the behavior recognition result corresponding to a particular scene: one reference object image may represent one behavior tendency of the first target object, and the plurality of reference object images may represent different behavior tendencies of the first target object.
Step 300: and acquiring the behavior tendency descriptors of the first target object through the first target object and the rest objects except the first target object in the target video frame.
Because the first target object may correspond to different behavior tendencies in different behavior scenes, and a behavior scene is constructed jointly by the first target object and the rest objects in the target video frame (which may include, for example, persons and things), the behavior tendency descriptor obtained from the first target object and the rest objects can reflect the behavior tendency of the first target object in the target video frame.
Step 400: and carrying out feature integration on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images through tendency commonality coefficients between the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images to obtain an integrated behavior tendency descriptor of the first target object.
In this embodiment of the present application, since the plurality of reference object images may represent a plurality of behavior tendencies of the first target object, feature integration is performed on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images through the tendency commonality coefficients between them (a tendency commonality coefficient is, for example, a commonality measurement between feature values, i.e., a similarity between feature values). This adds external information to the behavior tendency descriptor of the first target object, so that the obtained integrated behavior tendency descriptor can more accurately represent the behavior tendency of the first target object in the target video frame.
Step 500: and acquiring a behavior recognition result corresponding to the target video frame through integrating the behavior tendency descriptors.
The behavior recognition result obtained by integrating the behavior tendency descriptors can be more in line with the actual scene, and the accuracy and reliability of behavior recognition are improved based on the behavior recognition result.
In this embodiment of the present application, because the first target object may match different behavior tendencies in different scenes, reference object images are added when extracting the behavior tendency descriptor of the first target object, and different reference object images may represent different behavior tendencies of the first target object. Based on the tendency commonality coefficient between the behavior tendency descriptor of a reference object image and that of the first target object, the two descriptors are feature-integrated; in other words, the behavior tendency information represented by the behavior tendency descriptor of the first target object is filled in. The integrated behavior tendency descriptor can therefore more accurately represent the behavior tendency of the first target object in the target video frame, which improves the accuracy and reliability of the behavior recognition result obtained through it.
The following describes, by means of specific embodiments, the big data based intelligent video monitoring method provided in the embodiments of the present application, which may include the following steps:
step 10: a first target object is acquired in a target video frame.
As one embodiment, semantic segmentation is performed on a target video frame to obtain a plurality of candidate recognition objects of the target video frame. And if any one of the plurality of candidate recognition objects accords with the matching condition with any one of the reference object image sets, determining the candidate recognition object as a first target object, wherein the reference object image set comprises a plurality of objects and a plurality of reference object images corresponding to each object. The reference object image set stores a plurality of objects and a plurality of reference object images corresponding to each object by a structuring method, in other words, the plurality of objects are taken as anchoring targets (keys), the reference object images corresponding to each object are stored as a result, and the corresponding reference object images can be searched through the objects. Based on the method, after the semantic segmentation of the target video frame is carried out, searching is carried out on a plurality of candidate identification objects in the reference object image set so as to determine a first target object in the target video frame, and the acquisition speed of the first target object is improved.
Step 20: and acquiring a plurality of reference object images of a first target object in the target video frame, wherein the plurality of reference object images are used for reflecting different behavioral trends corresponding to the first target object.
As one embodiment, a set of reference object images is searched for a selected object image that matches the first target object. A plurality of reference object images corresponding to the selected object are determined as a plurality of reference object images of the first target object. Based on the method, a plurality of reference object images of the first target object can be efficiently acquired in the reference object image set, and the integrated behavior tendency descriptor of the first target object can be acquired through the plurality of reference object images, so that the behavior tendency of the first target object in the target video frame is more accurately represented.
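Steps 10 and 20 describe the reference object image set as a key-value structure: objects act as anchoring targets (keys), and each object's reference images are stored as the result. The following is a minimal sketch of that structure with illustrative names; the matching criterion itself is left abstract, as the patent does not specify it.

```python
# Sketch of the reference object image set: objects serve as anchoring
# targets (keys), and the reference object images corresponding to each
# object are stored as the result, so they can be searched by object.
class ReferenceObjectImageSet:
    def __init__(self):
        self._store = {}  # object identifier -> list of reference object images

    def add(self, object_id, reference_images):
        self._store[object_id] = list(reference_images)

    def match(self, candidate_id):
        """If a candidate recognition object matches a stored object, return
        that object's reference images (steps 10 and 20); otherwise None."""
        return self._store.get(candidate_id)
```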
Step 30: and acquiring the behavior tendency descriptors of the first target object through the first target object and the rest objects except the first target object in the target video frame.
As one implementation, a first target object and other objects are loaded into a description array extraction network, and focusing information mapping is carried out on the first target object and the other objects by using the description array extraction network, so that a behavior tendency descriptor of the first target object is obtained. Based on the above, since the behavior tendency of the first target object in the target video frame can be reflected only by the joint analysis of the first target object and the rest objects, the first target object and the rest objects can be subjected to focusing information mapping (for example, the focusing information mapping is embedded based on an attention mechanism), and the behavior tendency descriptor of the first target object is obtained through the focusing information mapping, so that the used behavior tendency descriptor of the first target object can comprehensively and accurately represent the behavior tendency of the first target object.
For example, the first target object and the rest objects are respectively loaded into a description array extraction network, and the network is adopted to obtain a first search array, a first anchor array, and a first result array of the first target object, as well as a second anchor array and a second result array of the rest objects. The product result of the first search array and the first anchor array and the product result of the first search array and the second anchor array are normalized to obtain a first focusing eccentric coefficient of the first target object (in an attention mechanism, for example, a focusing eccentric coefficient corresponds to biased weight information) and a second focusing eccentric coefficient of the rest objects with respect to the first target object. The product result of the first focusing eccentric coefficient and the first result array is summed with the product result of the second focusing eccentric coefficient and the second result array, to obtain the behavior tendency descriptor of the first target object. As one embodiment, the first search array and the first anchor array are used to obtain the first focusing eccentric coefficient of the first target object, the first result array is used to characterize the first target object, and the first focusing eccentric coefficient and the first result array are used to obtain the behavior tendency descriptor of the first target object. The search array, anchor array, and result array may be one-dimensional, i.e., vectors (of unrestricted dimension), although in other possible embodiments they may be two-dimensional matrices.
For example, a first target object S is projected as a first projection array, which is a vector, e.g., [2,4,6], and the rest objects are projected as second projection arrays, e.g., [3,5,7] and [1,4,3]. The first projection array [2,4,6] and the two second projection arrays [3,5,7] and [1,4,3] are loaded into the description array extraction network. The network holds three different convolution matrices A, B, and C (the composition of each convolution matrix is not limited): convolution matrix A is used to obtain the search array, convolution matrix B the anchor array, and convolution matrix C the result array. Integrating the first projection array [2,4,6] with convolution matrices A, B, and C respectively yields the first search array R1, the first anchor array K1, and the first result array V1 of the first target object. The network likewise obtains the second anchor arrays and second result arrays of the two rest objects: integrating [3,5,7] with convolution matrices B and C yields a second anchor array K2 and a second result array V2, and integrating [1,4,3] with B and C yields a second anchor array K2' and a second result array V2'. The product results R1·K1, R1·K2, and R1·K2' are then obtained and normalized through a normalized exponential function, giving the first focusing eccentric coefficient x of the first target object and the second focusing eccentric coefficients y and z of the rest objects. Integrating the first focusing eccentric coefficient x with the first result array V1 gives the array x·V1, and the rest objects are similarly integrated with their corresponding focusing eccentric coefficients (y·V2 and z·V2'); summing these integration results yields the behavior tendency descriptor of the first target object. A runnable sketch of this example follows below.
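The sketch below reproduces the worked example, assuming that "integration" of a projection array with a convolution matrix is a matrix-vector product and that the normalized exponential function is a softmax; the 3x3 matrices A, B, and C are random stand-ins for the network's learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.standard_normal((3, 3, 3))   # search, anchor, result projections

p1 = np.array([2.0, 4.0, 6.0])             # first projection array (first target object)
rest = [np.array([3.0, 5.0, 7.0]),         # second projection arrays (rest objects)
        np.array([1.0, 4.0, 3.0])]

R1 = A @ p1                                 # first search array
K = [B @ p1] + [B @ p for p in rest]        # K1, K2, K2'
V = [C @ p1] + [C @ p for p in rest]        # V1, V2, V2'

scores = np.array([R1 @ k for k in K])      # R1·K1, R1·K2, R1·K2'
w = np.exp(scores - scores.max())
w /= w.sum()                                # focusing eccentric coefficients x, y, z

# Behavior tendency descriptor of the first target object: x·V1 + y·V2 + z·V2'
descriptor = sum(wi * vi for wi, vi in zip(w, V))
```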
Step 40: performing feature integration on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images through the tendency commonality coefficients between them, to obtain an integrated behavior tendency descriptor of the first target object.
As one embodiment, a description array extraction network is used to determine a plurality of first involving eccentric coefficients between the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images by means of the tendency commonality coefficients between them; a first involving eccentric coefficient characterizes the closeness of association of the corresponding reference object image with the first target object. Feature integration is then performed on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images through the plurality of first involving eccentric coefficients, to obtain the integrated behavior tendency descriptor of the first target object. Based on this, the network determines each first involving eccentric coefficient through the tendency commonality coefficient between the behavior tendency descriptor of the first target object and the behavior tendency descriptor of the corresponding reference object image: if that tendency commonality coefficient is higher, the reference object image better matches the behavior tendency of the first target object in the target video frame, and the first involving eccentric coefficient between the two descriptors may be configured to be higher, so that the obtained integrated behavior tendency descriptor accurately represents the behavior tendency of the first target object in the target video frame.
The acquisition of the first involving eccentric coefficients and of the integrated behavior tendency descriptor is described below.
For the first involving eccentric coefficients, as an implementation manner, the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images are loaded into a description array extraction network, and the network obtains the tendency commonality coefficient between them through a tanh function, for example:

S_p = tanh(T_1·n_1 + T_2·n_2);

where S_p is the tendency commonality coefficient, T_1 is the behavior tendency descriptor of the first target object, T_2 is the behavior tendency descriptor of a reference object image, and n_1 and n_2 are parameter matrices. The plurality of first involving eccentric coefficients between the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images may then be obtained as:

W = exp(S_p) / Σ_q exp(S_q);
because the behavior tendency descriptor of the first target object comprises the context information of the target video frame, the contribution of each reference object image in the current behavior scene can be better acquired based on the calculation process.
As an embodiment, the first involving eccentric coefficient is positively correlated with the tendency commonality coefficient between the behavior tendency descriptor of the first target object and the behavior tendency descriptor of a reference object image: the higher that tendency commonality coefficient, the greater the first involving eccentric coefficient between the two descriptors. Conversely, because the coefficients are normalized, the higher the tendency commonality coefficients of the other reference object images, the smaller the first involving eccentric coefficient of this reference object image.
For the integrated behavior tendency descriptors, as an implementation manner, a description array extraction network is adopted, and the integrated behavior tendency descriptors of the first target object are obtained by weighted summation of the behavior tendency descriptors of the first target object and the behavior tendency descriptors of the plurality of reference object images through a plurality of first involving eccentric coefficients. And performing joint focusing information mapping on the integrated behavior tendency descriptors by using a description array extraction network to obtain a plurality of focusing information mapping arrays of the first target object. And adopting a description array extraction network, and obtaining the integrated behavior tendency descriptor of the first target object through a plurality of focusing information mapping arrays. The joint focusing information mapping is a process of embedding and mapping the integrated behavior tendency descriptors based on different convolution matrixes, and the joint focusing information mapping (for example, based on multi-head attention mapping) can extract information of a lower layer of the integrated behavior tendency descriptors, so that the characteristic characterization effect of the integrated behavior tendency descriptors is improved.
As an embodiment, the integrated behavior tendency descriptor of the first target object may be obtained based on the following formula:
F = T_1 + Σ W·T_2;

where F is the integrated behavior tendency descriptor of the first target object.
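Putting the two formulas together, the sketch below computes S_p, W, and F. It treats n_1 and n_2 as learned weight vectors so that each S_p is a scalar, which is one reading of the formula above; the patent leaves the exact shapes of the parameter matrices open.

```python
import numpy as np

def integrated_descriptor(t1, refs, n1, n2):
    """t1: (d,) behavior tendency descriptor of the first target object;
    refs: (k, d) behavior tendency descriptors of the k reference object
    images; n1, n2: (d,) parameter vectors (learned in practice)."""
    # Tendency commonality coefficient per reference image: S_p = tanh(T1·n1 + T2·n2)
    s = np.tanh(t1 @ n1 + refs @ n2)             # shape (k,)
    # First involving eccentric coefficients: W = exp(S_p) / sum_q exp(S_q)
    w = np.exp(s - s.max())
    w /= w.sum()
    # Integrated behavior tendency descriptor: F = T1 + sum W·T2
    return t1 + w @ refs
```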
Specifically, the first target object and the plurality of reference object images are loaded to an embedding mapping module of the description array extraction network, which embeds and encodes them to obtain the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images. These descriptors are loaded to a first involving eccentric coefficient determining module of the network, which obtains the plurality of first involving eccentric coefficients in the manner described above. As one implementation, the network integrates the first involving eccentric coefficients with the corresponding behavior tendency descriptors, loads the integrated result (for example, a vector) to a classification mapping module (for example, a fully connected module), and obtains the integrated behavior tendency descriptor F of the first target object through the calculation described above.
As an embodiment, the joint focusing information map is a three-party joint focusing information map, with focusing ends X, Y, and Z. Each focusing end corresponds to three convolution kernels Cq, Ck, and Cv: focusing end X corresponds to Cq_1, Ck_1, and Cv_1; focusing end Y to Cq_2, Ck_2, and Cv_2; and focusing end Z to Cq_3, Ck_3, and Cv_3.
For focusing end X, the description array extraction network integrates the integrated behavior tendency descriptor of the first target object with the three convolution kernels Cq_1, Ck_1, and Cv_1 to obtain the search array, the anchor array, and the result array of the behavior tendency descriptor. The network then integrates the search array with the transposed anchor array to obtain a calculation result, performs a standardization operation on that result to obtain the focusing eccentric coefficient of the first target object, and integrates the focusing eccentric coefficient with the result array of the first target object to obtain a focusing information mapping array of the first target object. Focusing ends Y and Z proceed in the same way with convolution kernels Cq_2, Ck_2, Cv_2 and Cq_3, Ck_3, Cv_3, respectively, each yielding its own focusing information mapping array of the first target object.
As an implementation manner, a description array extraction network is adopted to fuse a plurality of focusing information mapping arrays to obtain a focusing information mapping matrix, and the description array extraction network is used to compress (reduce the dimension of) the focusing information mapping matrix to obtain an integrated behavior tendency descriptor of the first target object.
For the fusion, for example, the description array extraction network splices the three focusing information mapping arrays of the first target object together to obtain the focusing information mapping matrix, performs full-connection processing on this matrix, integrates it with the transposed full-connection eccentric array, and then sums the result to obtain the integrated behavior tendency descriptor of the first target object.
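The three focusing ends, the splicing, and the compression can be sketched as follows. This is a minimal sketch, assuming the standardization operation is a softmax over scaled dot products (the 1/sqrt(d) scaling is an assumption) and treating the integrated descriptor as a set of row vectors; all weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_focus_map(F, heads, w_out):
    """F: (n, d) integrated behavior tendency descriptors (one row per
    element); heads: three (Cq, Ck, Cv) triples, each of shape (d, d);
    w_out: (3*d, d) full-connection compression matrix."""
    maps = []
    for Cq, Ck, Cv in heads:
        R, K, V = F @ Cq, F @ Ck, F @ Cv                 # search/anchor/result arrays
        coeff = softmax(R @ K.T / np.sqrt(F.shape[1]))   # focusing eccentric coefficients
        maps.append(coeff @ V)                           # focusing information mapping array
    fused = np.concatenate(maps, axis=-1)                # splice the three arrays
    return fused @ w_out                                 # compress back to d dimensions

d, n = 8, 4
rng = np.random.default_rng(1)
heads = [tuple(rng.standard_normal((3, d, d))) for _ in range(3)]
out = joint_focus_map(rng.standard_normal((n, d)), heads,
                      rng.standard_normal((3 * d, d)))
```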
In this embodiment of the present application, the process of acquiring the behavioral tendency descriptors of the multiple reference object images may specifically include:
as one embodiment, for any reference object image, any reference object image is loaded into the description array decimation network. And performing focusing information mapping on a plurality of objects in the reference object image by using a description array extraction network to obtain a behavior tendency descriptor of the reference object image.
For example, for any one of the plurality of objects in any reference object image, the description array extraction network is employed to obtain a third search array, a third anchor array, and a third result array of that object, as well as a fourth anchor array and a fourth result array of the rest objects other than that object. The network performs a standardization operation on the product result of the third search array and the third anchor array and on the product result of the third search array and the fourth anchor array, to obtain a third focusing eccentric coefficient of the object and a fourth focusing eccentric coefficient of the rest objects with respect to the object. The network then sums the product result of the third focusing eccentric coefficient and the third result array with the product result of the fourth focusing eccentric coefficient and the fourth result array, to obtain the behavior tendency descriptor of the object. Finally, the network integrates the behavior tendency descriptors of the plurality of objects in the reference object image to obtain the behavior tendency descriptor of the reference object image.
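The per-object descriptors are computed exactly as in the sketch after the worked example above; only the final feature integration across objects is new here. The patent does not fix the integration operator, so mean pooling is assumed purely for illustration.

```python
import numpy as np

def reference_image_descriptor(object_descriptors):
    """object_descriptors: (m, d) behavior tendency descriptors of the m
    objects in one reference object image. Mean pooling is an assumed
    stand-in for the feature-integration operator."""
    return np.asarray(object_descriptors).mean(axis=0)
```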
Step 50: acquiring a behavior recognition result corresponding to the target video frame through the integrated behavior tendency descriptor.
As one embodiment, if the tendency commonality coefficient between the integrated behavior tendency descriptor of any one of the behavior recognition results and the integrated behavior tendency descriptor of the first target object meets the target tendency commonality coefficient condition, that behavior recognition result is determined as the behavior recognition result corresponding to the target video frame. The condition is met, for example, when the tendency commonality coefficient is not less than a tendency commonality coefficient threshold.
Based on the above, the behavior recognition result corresponding to the target video frame can be determined by the tendency commonality coefficient between the integrated behavior tendency descriptor of the first target object and the integrated behavior tendency descriptor of the behavior recognition result, and the recognition speed of the behavior recognition result is high.
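A minimal matching sketch for step 50 follows. Cosine similarity stands in for the tendency commonality coefficient and the threshold value is illustrative; the patent only requires that the coefficient be not less than a threshold.

```python
import numpy as np

def recognize(target_desc, result_descs, threshold=0.8):
    """result_descs: mapping from behavior label to the integrated behavior
    tendency descriptor of that behavior recognition result."""
    def commonality(a, b):
        # Cosine similarity as an assumed commonality measure.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    for label, desc in result_descs.items():
        if commonality(target_desc, desc) >= threshold:
            return label
    return None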
The process of obtaining an integrated behavioral trend descriptor of a behavioral recognition result is described as follows:
as one embodiment, the image of the behavior recognition result is semantically segmented to obtain a plurality of candidate recognition objects of the behavior recognition result. And if any one of the plurality of candidate recognition objects of the behavior recognition result accords with the matching condition with any one of the reference object image set, determining any one of the candidate recognition objects as a second target object, wherein the reference object image set comprises a plurality of objects and a plurality of reference object images corresponding to the plurality of objects. And acquiring a plurality of reference object images corresponding to the second target object from the reference object image set. And acquiring the behavior tendency descriptors of the second target object through the second target object and the rest objects except the second target object in the behavior recognition result. And carrying out feature integration on the behavior tendency descriptors of the second target object and the behavior tendency descriptors of the plurality of reference object images corresponding to the second target object through tendency commonality coefficients between the behavior tendency descriptors of the second target object and the behavior tendency descriptors of the plurality of reference object images corresponding to the second target object, so as to obtain an integrated behavior tendency descriptor of the behavior recognition result.
The first target object may correspond to different behavior trends in different scenes, and when the behavior trend descriptors of the first target object are extracted, reference object images are added, and different reference object images can represent different behavior trends of the first target object. By referring to the relevance of the behavior tendency descriptor of the object image and the behavior tendency descriptor of the first target object, the behavior tendency descriptor of the reference object image and the behavior tendency descriptor of the first target object are subjected to feature integration, namely the behavior tendency information represented by the behavior tendency descriptor of the first target object is perfected, so that the integrated behavior tendency descriptor can more accurately represent the behavior tendency of the first target object in the target video frame, and the accuracy and the reliability of a behavior recognition result obtained by the integrated behavior tendency descriptor are improved based on the integrated behavior tendency descriptor.
The debugging process of the description array extraction network provided in the embodiment of the present application is described as follows:
step 21: obtaining a debugging template, wherein the debugging template comprises a monitoring video frame template, a behavior video frame template and a template tendency commonality coefficient between the monitoring video frame template and the behavior video frame template.
As an implementation manner, the template tendency commonality coefficient between the monitoring video frame template and the behavior video frame template can be indicated by 1 or 0. For example, 1 represents that the template tendency commonality coefficient of the monitoring video frame template and the behavior video frame template is high, that is, the monitoring video frame template matches the behavior video frame template; 0 represents that the coefficient is low, that is, the monitoring video frame template does not match the behavior video frame template.
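One plausible in-memory representation of such a debugging template (field names here are illustrative, not taken from the patent):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DebugTemplate:
    monitoring_frame: np.ndarray   # monitoring video frame template
    behavior_frame: np.ndarray     # behavior video frame template
    commonality: int               # template tendency commonality coefficient, 1 or 0
```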
Step 22: and loading the monitoring video frame template and the behavior video frame template into a description array extraction network.
Step 23: and extracting an integrated behavior tendency descriptor of the template target object in the monitoring video frame template by adopting the description array extraction network.
As one implementation, the description array extraction network is adopted to obtain the template target object in the monitoring video frame template, and likewise to obtain the template target object of the behavior video frame template. The description array extraction network is adopted to acquire a plurality of reference object images for each of the two template target objects. The description array extraction network is then used to obtain the integrated behavior tendency descriptor of the template target object of the monitoring video frame template and the integrated behavior tendency descriptor of the behavior video frame template.
For example, the semantic segmentation module of the description array extraction network is adopted to obtain the template target object from the monitoring video frame template, a plurality of reference object images corresponding to this template target object are obtained from the reference object image set, and the behavior tendency descriptor of this template target object and the behavior tendency descriptors of the plurality of reference object images are extracted. The same semantic segmentation module is adopted to obtain the template target object from the behavior video frame template, its corresponding plurality of reference object images, and the associated behavior tendency descriptors. For each of the two template target objects, the behavior tendency descriptor of the template target object and the behavior tendency descriptors of the corresponding plurality of reference object images are loaded to the focusing information integration module of the description array extraction network, and the focusing information integration module performs feature integration on them to obtain the integrated behavior tendency descriptor of that template target object. Each of the two integrated behavior tendency descriptors is then loaded to the joint focusing information mapping module of the description array extraction network, which performs joint focusing information mapping to obtain a plurality of focusing information mapping arrays corresponding to each integrated behavior tendency descriptor.
The plurality of focusing information mapping arrays corresponding to each integrated behavior tendency descriptor are loaded to the classification mapping module (such as a full connection unit) of the description array extraction network, and the classification mapping module performs a full connection operation on them. In this way the final integrated behavior tendency descriptor of the template target object in the monitoring video frame template is obtained, and the integrated behavior tendency descriptor of the behavior video frame template is obtained in the same manner through the template target object of the behavior video frame template. A sketch of this forward pass is given below.
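Reading "focusing information mapping" as a query/key/value attention step, an interpretation suggested by the search/anchor/result-array scheme of claim 1 rather than wording the patent itself uses, a minimal sketch of one such step follows; the projection matrices are random placeholders standing in for learned network parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def focusing_information_mapping(obj_feat, other_feats, d=64):
    # Search (query), anchor (key) and result (value) projections; random
    # matrices stand in for learned network parameters.
    Wq = rng.standard_normal((obj_feat.size, d))
    Wk = rng.standard_normal((obj_feat.size, d))
    Wv = rng.standard_normal((obj_feat.size, d))
    q = obj_feat @ Wq                                          # first search array
    keys = [obj_feat @ Wk] + [f @ Wk for f in other_feats]     # anchor arrays
    vals = [obj_feat @ Wv] + [f @ Wv for f in other_feats]     # result arrays
    # Standardization of the search/anchor products: the normalized weights
    # play the role of the focusing eccentric coefficients.
    coeffs = softmax(np.array([q @ k / np.sqrt(d) for k in keys]))
    # Summing the coefficient-weighted result arrays yields the behavior
    # tendency descriptor of the target object.
    return sum(c * v for c, v in zip(coeffs, vals))

# Example: one target object and two remaining objects with 128-dim features.
obj = rng.standard_normal(128)
others = [rng.standard_normal(128) for _ in range(2)]
descriptor = focusing_information_mapping(obj, others)   # shape (64,)
```

The "standardization operation" of claim 1 corresponds here to the softmax over scaled search/anchor products; the scaling by the square root of the dimension is an assumption.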
24: and optimizing the network parameter quantity of the description array extraction network through loss between the tendency commonality coefficient between the integrated behavior tendency descriptors of the template target object and the integrated behavior tendency descriptors of the behavior video frame template and the template tendency commonality coefficient.
The loss is obtained, for example, based on a contrastive loss function.
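For example, a standard contrastive loss over the two integrated behavior tendency descriptors might be sketched as follows; the margin value and the Euclidean distance are assumptions, since the patent does not fix the loss form:

```python
import numpy as np

def contrastive_loss(desc_a, desc_b, label, margin=1.0):
    # desc_a / desc_b: the two integrated behavior tendency descriptors;
    # label: the template tendency commonality coefficient (1 matched, 0 not).
    dist = np.linalg.norm(desc_a - desc_b)
    return label * dist ** 2 + (1 - label) * max(0.0, margin - dist) ** 2
```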
Because a target object may have different behavioral tendencies in different scenes, the present application adds reference object images when debugging the description array extraction network, and different reference object images can characterize those different behavioral tendencies. The two integrated behavior tendency descriptors, obtained by perfecting the behavior tendency characterized by each template target object's descriptor with the reference object images, can more accurately reflect the behavior tendency of the template target object in the monitoring video frame template and in the behavior video frame template, and on this basis the extraction performance of the description array extraction network for behavior tendency descriptors is improved.
Referring to fig. 4, a schematic diagram of a functional module architecture of an intelligent video monitoring apparatus 110 according to an embodiment of the present invention is provided. The intelligent video monitoring apparatus 110 may be used to execute the intelligent video monitoring method based on big data, and the intelligent video monitoring apparatus 110 includes:
The video acquisition module 111 is configured to acquire a surveillance video, and extract a target video frame of the surveillance video;
a reference retrieving module 112, configured to obtain a plurality of reference object images of a first target object in a target video frame, where the plurality of reference object images are used to reflect different behavioral tendencies corresponding to the first target object;
the feature extraction module 113 is configured to obtain a behavior tendency descriptor of the first target object through the first target object and other objects in the target video frame except for the first target object;
the feature integration module 114 is configured to perform feature integration on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images through tendency commonality coefficients between the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images, so as to obtain an integrated behavior tendency descriptor of the first target object;
the behavior determination module 115 is configured to obtain, by using the integrated behavior tendency descriptor, a behavior recognition result corresponding to the target video frame.
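For illustration, the wiring of these modules might be skeletonized as below; every method body is a placeholder stub (all names are illustrative, not from the patent), and a real implementation would carry out the corresponding method steps:

```python
import numpy as np

class IntelligentVideoMonitoringApparatus:
    # Skeleton mirroring the functional modules of Fig. 4.

    def video_acquisition(self, video):             # module 111
        return video[0]                             # stub: first frame as target frame

    def reference_retrieval(self, frame):           # module 112
        return [np.zeros(64)]                       # stub reference-image descriptors

    def feature_extraction(self, frame):            # module 113
        return np.zeros(64)                         # stub behavior tendency descriptor

    def feature_integration(self, desc, refs):      # module 114
        return desc + sum(refs)                     # stub integration

    def behavior_determination(self, integrated):   # module 115
        return "behavior recognition result"        # stub result

    def run(self, surveillance_video):
        frame = self.video_acquisition(surveillance_video)
        refs = self.reference_retrieval(frame)
        desc = self.feature_extraction(frame)
        integrated = self.feature_integration(desc, refs)
        return self.behavior_determination(integrated)
```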
Since the big data based intelligent video monitoring method provided in the embodiment of the present invention has been described in detail in the above embodiments, and the principle of the intelligent video monitoring apparatus 110 is the same as that of the method, the execution principle of each module of the intelligent video monitoring apparatus 110 will not be described again here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, an intelligent video monitoring server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
It is to be understood that terms not expressly defined in the above description should not be regarded as undefined; a person skilled in the art can unambiguously ascertain their meaning from the above disclosure. The foregoing disclosure of the embodiments of the present application is apparent and complete to those skilled in the art, and the process by which a skilled person derives and interprets technical terms not explicitly explained is based on what is described in the present application and therefore does not require inventive judgment of the overall scheme.
While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly stated here, various modifications, improvements, and adaptations may occur to one skilled in the art. Such modifications, improvements, and adaptations are suggested within this application, and are therefore within the spirit and scope of the exemplary embodiments of this application.
It should also be appreciated that in the foregoing description of the embodiments of the present application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of at least one of the embodiments of the invention. This method of disclosure, however, is not to be interpreted as implying that the claimed subject matter requires more features than are presented in the claims. Indeed, claimed subject matter may lie in less than all features of a single disclosed embodiment.

Claims (8)

1. An intelligent video monitoring method based on big data is characterized by being applied to an intelligent video monitoring server, and comprises the following steps:
acquiring a monitoring video and extracting a target video frame of the monitoring video;
Acquiring a plurality of reference object images of a first target object in a target video frame, wherein the plurality of reference object images are used for reflecting different behavioral tendencies corresponding to the first target object;
acquiring a behavior tendency descriptor of the first target object through the first target object and other objects except the first target object in the target video frame;
feature integration is carried out on the behavior tendency descriptors of the first target object and the behavior tendency descriptors of the plurality of reference object images through tendency commonality coefficients between the behavior tendency descriptors of the first target object and the behavior tendency descriptors of the plurality of reference object images, so that an integrated behavior tendency descriptor of the first target object is obtained;
acquiring a behavior recognition result corresponding to the target video frame through the integrated behavior tendency descriptor;
wherein, the obtaining the behavior tendency descriptor of the first target object through the first target object and the rest objects except the first target object in the target video frame includes:
loading the first target object and the rest objects into a description array extraction network;
Using the description array extraction network to map focusing information of the first target object and the rest objects to obtain a behavior tendency descriptor of the first target object;
the performing focusing information mapping on the first target object and the rest objects to obtain a behavior tendency descriptor of the first target object includes:
acquiring a first search array, a first anchor array and a first result array of the first target object;
acquiring a second anchoring array and a second result array of the rest objects;
performing standardization operation on the product result of the first search array and the first anchor array and the product result of the first search array and the second anchor array to obtain a first focusing eccentric coefficient of the first target object and a second focusing eccentric coefficient of the rest objects to the first target object;
and summing the product result of the first focusing eccentric coefficient and the first result array and the product result of the second focusing eccentric coefficient and the second result array to obtain the behavior tendency descriptor of the first target object.
2. The method of claim 1, wherein the feature integration of the behavior tendency descriptor of the first target object with the behavior tendency descriptors of the plurality of reference object images through the tendency commonality coefficients between the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images, to obtain an integrated behavior tendency descriptor of the first target object, comprises:
The description array extraction network is adopted to execute the following steps:
determining a plurality of first reference eccentric coefficients between the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images by means of the tendency commonality coefficients between the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images, the first reference eccentric coefficients characterizing the closeness of association of the corresponding reference object images with the first target object;
and performing feature integration on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images through the plurality of first reference eccentric coefficients to obtain the integrated behavior tendency descriptor of the first target object.
3. The method of claim 2, wherein the performing feature integration on the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the plurality of reference object images through the plurality of first reference eccentric coefficients to obtain the integrated behavior tendency descriptor of the first target object comprises:
weighting and summing the behavior tendency descriptor of the first target object and the behavior tendency descriptors of the reference object images through the plurality of first reference eccentric coefficients to obtain an initial integrated behavior tendency descriptor of the first target object;
performing joint focusing information mapping on the initial integrated behavior tendency descriptor to obtain a plurality of focusing information mapping arrays of the first target object;
and obtaining the integrated behavior tendency descriptor of the first target object through the plurality of focusing information mapping arrays.
4. The method of claim 3, wherein the obtaining the integrated behavior tendency descriptor of the first target object through the plurality of focusing information mapping arrays comprises:
fusing the plurality of focusing information mapping arrays to obtain a focusing information mapping matrix;
compressing the focusing information mapping matrix to obtain an integrated behavior tendency descriptor of the first target object;
the description array extraction network is obtained by debugging the following steps:
obtaining a debugging template, wherein the debugging template comprises a monitoring video frame template, a behavior video frame template and a template tendency commonality coefficient between the monitoring video frame template and the behavior video frame template;
loading the monitoring video frame template and the behavior video frame template into the description array extraction network;
extracting an integrated behavior tendency descriptor of a template target object in the monitoring video frame template by adopting the description array extraction network;
Optimizing the network parameter quantity of the description array extraction network through the loss between the tendency commonality coefficient between the integrated behavior tendency descriptors of the template target object and the integrated behavior tendency descriptors of the behavior video frame template and the template tendency commonality coefficient.
5. The method of claim 1, wherein the acquiring the plurality of reference object images of the first target object in the target video frame comprises:
searching a selected object image matched with the first target object in a reference object image set, wherein the reference object image set comprises a plurality of objects and a plurality of reference object images corresponding to each object;
determining a plurality of reference object images corresponding to the selected object as a plurality of reference object images of the first target object;
the first target object is obtained through the following steps: performing semantic segmentation on the target video frame to obtain a plurality of candidate recognition objects of the target video frame; and if any one of the plurality of candidate recognition objects accords with a matching condition with any one of a reference object image set, determining the any one candidate recognition object as the first target object, wherein the reference object image set comprises a plurality of objects and a plurality of reference object images corresponding to each object.
6. The method according to any one of claims 1 to 5, wherein the behavior tendency descriptors of the plurality of reference object images are obtained by:
for any reference object image, loading the any reference object image into a description array extraction network;
and performing focusing information mapping on a plurality of objects in any reference object image by using the description array extraction network to obtain a behavior tendency descriptor of the any reference object image.
7. The method of claim 6, wherein the mapping the focusing information for the plurality of objects in the arbitrary reference object image to obtain the behavior tendency descriptor of the arbitrary reference object image includes:
acquiring a third search array, a third anchor array and a third result array of any object in the plurality of objects in any reference object image;
acquiring a fourth anchor array and a fourth result array of the rest objects except any object in a plurality of objects in any reference object image;
performing standardization operation on the product result of the third search array and the third anchor array and the product result of the third search array and the fourth anchor array to obtain a third focusing eccentric coefficient of the any object and a fourth focusing eccentric coefficient of the rest objects to the any object;
summing the product result of the third focusing eccentric coefficient and the third result array and the product result of the fourth focusing eccentric coefficient and the fourth result array to obtain the behavior tendency descriptor of the any object;
and carrying out feature integration on the behavior tendency descriptors of a plurality of objects in any reference object image to obtain the behavior tendency descriptors of any reference object image.
8. An intelligent video monitoring system, comprising an intelligent video monitoring server and a video monitoring device in communication connection with each other, wherein the intelligent video monitoring server comprises a memory and a processor, the memory storing a computer program, and when the processor runs the computer program, the method according to any one of claims 1 to 7 is implemented.
CN202310421265.8A 2023-04-19 2023-04-19 Intelligent video monitoring method and system based on big data Active CN116152723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310421265.8A CN116152723B (en) 2023-04-19 2023-04-19 Intelligent video monitoring method and system based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310421265.8A CN116152723B (en) 2023-04-19 2023-04-19 Intelligent video monitoring method and system based on big data

Publications (2)

Publication Number Publication Date
CN116152723A (en) 2023-05-23
CN116152723B (en) 2023-06-27

Family

ID=86350986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310421265.8A Active CN116152723B (en) 2023-04-19 2023-04-19 Intelligent video monitoring method and system based on big data

Country Status (1)

Country Link
CN (1) CN116152723B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688804A (en) * 2021-10-25 2021-11-23 腾讯科技(深圳)有限公司 Multi-angle video-based action identification method and related equipment
CN115471824A (en) * 2022-03-29 2022-12-13 北京罗克维尔斯科技有限公司 Eye state detection method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10965975B2 (en) * 2015-08-31 2021-03-30 Orcam Technologies Ltd. Systems and methods for recognizing faces using non-facial information
CN113269013B (en) * 2020-02-17 2024-06-07 京东方科技集团股份有限公司 Object behavior analysis method, information display method and electronic equipment
CN114743262A (en) * 2022-03-29 2022-07-12 深圳云天励飞技术股份有限公司 Behavior detection method and device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688804A (en) * 2021-10-25 2021-11-23 腾讯科技(深圳)有限公司 Multi-angle video-based action identification method and related equipment
CN115471824A (en) * 2022-03-29 2022-12-13 北京罗克维尔斯科技有限公司 Eye state detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116152723A (en) 2023-05-23

Similar Documents

Publication Publication Date Title
Zampoglou et al. Large-scale evaluation of splicing localization algorithms for web images
CN108269254B (en) Image quality evaluation method and device
US11017215B2 (en) Two-stage person searching method combining face and appearance features
CN109871490B (en) Media resource matching method and device, storage medium and computer equipment
KR101417498B1 (en) Video processing apparatus and method using the image from uav
CN113255685B (en) Image processing method and device, computer equipment and storage medium
CN111931548B (en) Face recognition system, method for establishing face recognition data and face recognition method
CN116681957B (en) Image recognition method based on artificial intelligence and computer equipment
CN110414335A (en) Video frequency identifying method, device and computer readable storage medium
CN114898273A (en) Video monitoring abnormity detection method, device and equipment
CN112580581A (en) Target detection method and device and electronic equipment
Zheng et al. Exif as language: Learning cross-modal associations between images and camera metadata
CN116152723B (en) Intelligent video monitoring method and system based on big data
CN116778189A (en) RPA flow processing analysis method and computer equipment
CN112040325A (en) Video playing method and device, electronic equipment and storage medium
CN114332716B (en) Clustering method and device for scenes in video, electronic equipment and storage medium
Gengembre et al. A probabilistic framework for fusing frame-based searches within a video copy detection system
US20240221426A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN115905608A (en) Image feature acquisition method and device, computer equipment and storage medium
CN116189109A (en) Model training method, road event detection method, device and related equipment
US20220122341A1 (en) Target detection method and apparatus, electronic device, and computer storage medium
CN111723626A (en) Method, device and electronic equipment for living body detection
CN115147756A (en) Video stream processing method and device, electronic equipment and storage medium
CN114219938A (en) Region-of-interest acquisition method
Júnior et al. A prnu-based method to expose video device compositions in open-set setups

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant