CN113469142B

CN113469142B - Classification method, device and terminal for monitoring video time-space information fusion

Info

Publication number: CN113469142B
Application number: CN202110932947.6A
Authority: CN
Inventors: 张煇; 剌昊跃; 柳世豪; 陈宏涛
Original assignee: Shanxi Changhe Technology Co ltd
Current assignee: Changhe Information Co ltd
Priority date: 2021-03-12
Filing date: 2021-09-01
Publication date: 2022-01-14
Anticipated expiration: 2041-09-01
Also published as: CN113469142A

Abstract

The embodiment of the invention discloses a classification method, a device and a terminal for monitoring video space-time information fusion, wherein the method comprises the following steps: acquiring a sample video set; randomly and uniformly selecting videos with various types in a sample video set; inputting the selected video into a preset classifier to carry out deep network weight training to obtain a trained classifier; importing the video to be identified into a trained classifier for prediction to obtain a plurality of preliminary prediction results corresponding to time points; calculating the chessboard distance of the preliminary prediction result of each time point according to the behavior quantization value as a distance value to obtain a category label of the time point; and performing time-dimension behavior category fusion operation on the category labels of all time points to obtain a final behavior category prediction result of the video to be identified. According to the scheme, behavior category quantization is utilized to realize unification of behavior category fusion on the space-time dimension, and data pertinence prediction is achieved on the characteristics of the service window video stream.

Description

Classification method, device and terminal for monitoring video time-space information fusion

Technical Field

The invention relates to the technical field of video analysis and deep learning, in particular to a classification method, a device and a terminal for monitoring video temporal-spatial information fusion.

Background

The system can effectively improve the working efficiency of the service window, protect the legal rights and interests of the working personnel and people, and has obvious significance for improving the service level and image of the government and service enterprise departments for people.

At present, the monitoring video station information is analyzed and processed by means of an automatic behavior detection and classification technology based on the monitoring video, so that people can be helped to release from complex and time-consuming work such as behavior retrieval, behavior analysis and the like of manual video examination, and the existing automatic behavior detection and classification technology based on the monitoring video comprises the following steps:

one is that the behavior detection method based on optical flow realizes the behavior detection depending on the optical flow difference generated by the human behavior change. Specifically, according to the trend or the track of the optical flow change, the video image is correspondingly subjected to feature extraction or coding to obtain classification data of different types of behaviors, and the classification data is used as a classification algorithm such as an SVM (support vector machine) for training to realize a behavior classification process. The method has the problems that the feature extraction and selection are based on general image methods such as histogram analysis, gradient analysis, optical flow tracking and the like, the analysis and modeling are not based on behavior difference, the analysis process is easily interfered by other factors such as object motion and the like, and the detection precision is insufficient.

The method utilizes the human skeleton characteristic information to make classification and identification of human behaviors more targeted, forms a specific key point structure by manually defining and automatically detecting key parts such as the head, the shoulder, the hand and the like of the human body, can well complete human detection and identification tasks under the conditions of no shielding and normal visual angle, and provides powerful guarantee for behavior classification. However, the posture data of the personnel acquired by the service window monitoring video cannot always meet the above conditions, and the service window data generally has the problems that the key points are shielded, the human body monitoring is difficult, the condition of serious personnel missing detection is caused, and the automatic posture recognition and classification task cannot be met.

The other method is a behavior classification method of a double video stream deep neural network by referring to the processing and abstraction processes of human visual cells on visual signals, and the method depends on the learning and training of a multilayer complex network; however, the accurate behavior marking and classification are still difficult due to the characteristics of long-time fixed posture, large working activity range and the like of the workers under video monitoring, and the problems of multi-person behavior category information, space behavior category conflict and the like contained in a video segment are still outstanding; meanwhile, the service window is different from single individual behavior classification, and interaction processes often exist among workers and between the workers and the public, so that the space range, the class marking and classification of the behavior classification are more complicated.

Thus, there is a need for a better solution to the problems of the prior art.

Disclosure of Invention

In view of this, the invention provides a classification method, a device and a terminal for monitoring video temporal-spatial information fusion, which are used for solving the problems in the prior art.

Specifically, the present invention proposes the following specific examples:

the embodiment of the invention provides a classification method for monitoring video space-time information fusion, which comprises the following steps:

acquiring a sample video set; each video in the sample video set is marked with a behavior category and an influence value influencing the work efficiency level;

randomly and uniformly selecting videos with various behavior categories in the sample video set;

inputting the selected video into a preset classifier to carry out deep network weight training to obtain a trained classifier;

importing the video to be identified into a trained classifier for prediction to obtain a plurality of preliminary prediction results corresponding to time points;

performing chessboard distance calculation on the preliminary prediction result of each time point according to a behavior quantization value as a distance value to obtain a category label of the time point;

and performing time-dimension behavior category fusion operation on the category labels at all time points to obtain a final behavior category prediction result of the video to be identified.

In a specific embodiment, the behavior categories include: one or more of working on duty, not working on duty, playing mobile phone on duty, sleeping on duty and off duty;

the influence values corresponding to different behavior categories are related to the application scene.

In a specific embodiment, the acquiring a sample video set includes:

performing time sampling and space sampling for a certain number of times on the video clips with the segmented duration by using a behavior classification method of the double video streams; each video clip is marked with a behavior category and an influence value;

and taking the video clips subjected to time sampling and space sampling as a sample video set.

In a specific embodiment, the video input to the preset classifier is a plurality of video segments corresponding to different spatial distributions of the same window.

In a specific embodiment, the step of introducing a video to be identified into a trained classifier for prediction to obtain a plurality of preliminary prediction results corresponding to time points includes:

importing the video to be identified into a trained classifier;

and carrying out multiple times of spatial sampling and behavior classification on the video clips of the same window in the video to be identified through the trained shunt to obtain a plurality of preliminary prediction results of the same time point.

In a specific embodiment, the category label is predicted based on the following formula:

wherein the content of the first and second substances,

is a category label;

the result is a preliminary prediction result; when the impact value is a scalar quantity,

is a scalar value; when the impact value is a vector of values,

in the form of a vector of values,

by vector 2n norm (n)>0) Carrying out quantitative measurement; t is t_uIs time of dayPoint; c. C_sThe number of spatial samples.

In a specific embodiment, the final behavior category prediction result is obtained by performing a behavior category fusion operation in a time dimension based on the following formula:

wherein L is_s，tPredicting results for the final behavior categories; c. C_tIs the number of time samples taken.

The embodiment of the invention also provides a classification device for monitoring video temporal-spatial information fusion, which comprises:

the acquisition module is used for acquiring a sample video set; each video in the sample video set is marked with a behavior category and an influence value influencing the work efficiency level;

the selection module is used for randomly and uniformly selecting videos with various behavior categories in the sample video set;

the training module is used for inputting the selected video into a preset classifier to carry out deep network weight training to obtain a trained classifier;

the preliminary prediction module is used for importing the video to be recognized into the trained classifier for prediction to obtain a plurality of preliminary prediction results corresponding to time points;

the label module is used for calculating the chessboard distance of the preliminary prediction result of each time point according to the behavior quantization value as a distance value to obtain a category label of the time point;

and the fusion module is used for performing time-dimension behavior category fusion operation on the category labels at all time points to obtain a final behavior category prediction result of the video to be identified.

The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor runs the computer program to enable the processor to execute the classification method for the monitoring video spatiotemporal information fusion.

The embodiment of the invention also provides a storage medium, wherein a computer program is stored on the storage medium, and when being executed by a processor, the computer program realizes the classification method for the monitoring video time-space information fusion.

Therefore, the embodiment of the invention provides a classification method, a device and a terminal for monitoring video space-time information fusion, wherein the method comprises the following steps: acquiring a sample video set; each video in the sample video set is marked with a behavior category and an influence value influencing the work efficiency level; randomly and uniformly selecting videos with various behavior categories in the sample video set; inputting the selected video into a preset classifier to carry out deep network weight training to obtain a trained classifier; importing the video to be identified into a trained classifier for prediction to obtain a plurality of preliminary prediction results corresponding to time points; performing chessboard distance calculation on the preliminary prediction result of each time point according to a behavior quantization value as a distance value to obtain a category label of the time point; and performing time-dimension behavior category fusion operation on the category labels at all time points to obtain a final behavior category prediction result of the video to be identified. In the scheme, the behavior categories are quantized, so that decision fusion of the behavior categories can be obtained in a distance calculation mode; the vector space concept is adopted in the chemical operation process, so that category expansion and category aggregation are possible; behavior category quantification is utilized to realize unification of behavior category fusion on the space-time dimension; the behavior category prediction method for adjusting the characteristics of the video stream of the service window has data pertinence.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention. Like components are numbered similarly in the various figures.

FIG. 1 is a flow chart illustrating a classification method for temporal-spatial information fusion of surveillance videos according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a classification apparatus for monitoring video temporal-spatial information fusion according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention;

fig. 4 shows a schematic structural diagram of a storage medium according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Hereinafter, the terms "including", "having", and their derivatives, which may be used in various embodiments of the present invention, are only intended to indicate specific features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as first excluding the existence of, or adding to, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.

Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.

Example 1

The embodiment 1 of the invention discloses a classification method for monitoring video spatio-temporal information fusion, which comprises the following steps as shown in figure 1:

s101, acquiring a sample video set; each video in the sample video set is marked with a behavior category and an influence value influencing the work efficiency level;

specifically, the behavior categories include: one or more of working on duty, not working on duty, playing mobile phone on duty, sleeping on duty and off duty;

Thus, the "acquiring a sample video set" in step S101 includes:

Specifically, for example, the behavior or posture of the service window staff mainly includes working on duty, not working on duty (drinking, chatting, inattention, etc.), playing a mobile phone, sleeping, lacking duty, etc., and a certain amount of operation definition classification labels can be performed on the behavior of the window staff according to the level of the influence on the working efficiency, such as:

off duty ═ x₆：x₆∈R₆}；

Sleep ═ x₅：x₅∈R₅}；

Playing mobile phone ═ x₄：x₄∈R₄}；

On Shift not working ═ x₃：x₃∈R₃}；

On Shift working conversation ═ x₂：x₂∈R₂}；

On Shift work ═ x₁：x₁∈R₁}；

Wherein R is₁∪R₂∪R₃∪R₄∪R₅∪R₆＝R^kAnd k is a positive integer. When k is 1, x_iIs a scalar value, then there is x₁＜x₂＜x₃＜x₄＜x₅＜x₆(ii) a When k > 1, x_iAs a vector value, then

When two adjacent behaviors affect the level of working efficiency to be constant, x can be made_i＝x_i+1(ii) a Each behavior category can also be further decomposed into a plurality of subclasses, and the same quantization labeling process is adopted; behavioral impact performance levels may also be ranked differently in different applications.

S102, randomly and uniformly selecting videos with various behavior categories in the sample video set;

s103, inputting the selected video into a preset classifier to carry out deep network weight training to obtain a trained classifier;

specifically, the video input into the preset classifier is a plurality of video segments which correspond to the same window and are distributed in different spaces.

Specifically, C is carried out on the video clip with the well-segmented duration T by utilizing a behavior classification method of the double video streams_tSub time sum C_sSub-space sampling; combining the quantization behavior labels corresponding to the video clips to obtain a plurality of fast and slow video samples of the same window video clip in different spatial distributions; and then, inputting the video sample into a classifier to carry out deep network weight training to obtain the classifier.

Step S104, importing the video to be identified into a trained classifier for prediction to obtain a plurality of preliminary prediction results corresponding to time points;

specifically, the step of importing a video to be identified into a trained classifier for prediction to obtain a plurality of preliminary prediction results corresponding to time points includes:

importing the video to be identified into a trained classifier;

Specifically, after obtaining the classifier, C for a window video stream segment_sThe C can be obtained after the sub-space sampling is carried out and the behavior classification is carried out_sAnd the classification category label result related to the behavior space at the same time sampling moment is the preliminary prediction result.

Step S105, carrying out chessboard distance calculation on the preliminary prediction result of each time point according to a behavior quantization value as a distance value to obtain a category label of the time point;

in addition, specifically, after the preliminary prediction result is obtained, the chessboard distance calculation is performed on the spatial dimension behavior prediction result (that is, the preliminary prediction result) of the video stream segment according to the behavior quantization value as the distance value to obtain the time sampling point t_uFinal class label of (1):

the class label is predicted based on the following formula:

wherein the content of the first and second substances,

is a category label;

is a scalar value; when the influence value isWhen the vector is used as the vector, the vector is obtained,

in the form of a vector of values,

by vector 2n norm (n)>0) Carrying out quantitative measurement; t is t_uIs a time point; c. C_sThe number of spatial samples.

And S106, performing time-dimension behavior category fusion operation on the category labels at all time points to obtain a final behavior category prediction result of the video to be recognized.

Specifically, all time sampling points t are determined by considering the condition that the time variation amplitude of the station personnel behavior category is not significant_uAfter the behavior category prediction result is obtained, a behavior category fusion operation similar to the time dimension of the space dimension mode can be performed to obtain a final behavior category prediction result of the video segment.

And performing behavior category fusion operation of a time dimension based on the following formula to obtain the final behavior category prediction result:

The invention provides a behavior classification method framework of temporal-spatial information fusion aiming at the problem of human behavior classification of the current service window monitoring video, which is used for realizing the following steps: classifying the behavior of the staff in a multi-scene and multi-service window area; quantifying behavior category information according to a preset level, such as influence on working efficiency and the like; by means of a behavior classification method of double video stream deep learning and chessboard distance and norm concepts, time-space domain decision fusion of quantitative behavior category prediction results is achieved.

Example 2

For further explanation of the present invention, embodiment 2 of the present invention further discloses a classification apparatus for monitoring video temporal-spatial information fusion, which includes:

an obtaining module 201, configured to obtain a sample video set; each video in the sample video set is marked with a behavior category and an influence value influencing the work efficiency level;

a selecting module 202, configured to randomly and uniformly select videos of different behavior categories in the sample video set;

the training module 203 is used for inputting the selected video into a preset classifier to perform deep network weight training to obtain a trained classifier;

the preliminary prediction module 204 is configured to import a video to be identified into a trained classifier for prediction, so as to obtain multiple preliminary prediction results at corresponding time points;

a label module 205, configured to perform chessboard distance calculation on the preliminary prediction result at each time point according to a behavior quantization value as a distance value to obtain a category label of the time point;

and the fusion module 206 is configured to perform a time-dimension behavior category fusion operation on the category labels at all time points to obtain a final behavior category prediction result of the video to be identified.

In a specific embodiment, the behavior classification includes: one or more of working on duty, not working on duty, playing mobile phone on duty, sleeping on duty and off duty;

the influence values corresponding to different behavior classifications are related to the application scene.

In a specific embodiment, the obtaining module 201 is configured to:

In a specific embodiment, the training module 203 is configured to:

importing the video to be identified into a trained classifier;

In a specific embodiment, the preliminary prediction result is predicted by the trained classifier based on the following formula:

wherein the content of the first and second substances,

is a scalar value; when the impact value is a vector of values,

in the form of a vector of values,

by vector 2n norm (n)>0) Carrying out quantitative measurement; t is t_uIs a time point; c. C_sThe number of results is preliminarily predicted for the same time point.

wherein L is_s，tFor final behavior classesMeasuring a result; c. C_tIs the number of time samples taken.

Example 3

Embodiment 3 of the present invention further discloses a terminal, as shown in fig. 3, which includes a memory and a processor, where the memory stores a computer program, and the processor runs the computer program to enable the processor to execute the classification method for monitoring video spatiotemporal information fusion according to embodiment 1.

Example 4

Embodiment 4 of the present invention further discloses a storage medium, as shown in fig. 4, where the storage medium stores a computer program, and the computer program is executed by a processor to implement the classification method for monitoring video temporal-spatial information fusion described in embodiment 1.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part of the technical solution that contributes to the prior art in essence can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims

1. A classification method for monitoring video space-time information fusion is characterized by comprising the following steps:

importing the video to be identified into a trained classifier for prediction to obtain a plurality of preliminary prediction results corresponding to time points; the step of importing the video to be identified into the trained classifier for prediction to obtain a plurality of preliminary prediction results corresponding to the time points includes: importing the video to be identified into a trained classifier; carrying out multiple times of spatial sampling and behavior classification on video segments of the same window in the video to be recognized through a trained classifier to obtain multiple initial prediction results of the same time point;

taking the behavior quantization value corresponding to the preliminary prediction result of each time point as a distance value to calculate chessboard distance to obtain a category label of the time point;

performing time-dimension behavior category fusion operation on the category labels at all time points to obtain a final behavior category prediction result of the video to be identified;

the class label is predicted based on the following formula:

wherein the content of the first and second substances,

is a category label;

is a scalar value; when the impact value is a vector of values,

in the form of a vector of values,

2. The method of claim 1, wherein the behavior categories include: one or more of working on duty, not working on duty, playing mobile phone on duty, sleeping on duty and off duty;

3. The method of claim 1, wherein said obtaining a sample video set comprises:

4. A method as claimed in claim 1 or 3, wherein the video input to the preset classifier is a plurality of video segments corresponding to different spatial distributions of the same window.

5. The method of claim 1, wherein the final behavior class prediction result is obtained by performing a behavior class fusion operation in a time dimension based on the following formula:

6. A classification device for monitoring video temporal-spatial information fusion is characterized by comprising:

the preliminary prediction module is used for importing the video to be recognized into the trained classifier for prediction to obtain a plurality of preliminary prediction results corresponding to time points; the preliminary prediction module is used for importing the video to be identified into a trained classifier; carrying out multiple times of spatial sampling and behavior classification on video segments of the same window in the video to be recognized through a trained classifier to obtain multiple initial prediction results of the same time point;

the label module is used for calculating the chessboard distance by taking the behavior quantization value corresponding to the preliminary prediction result of each time point as a distance value to obtain a category label of the time point;

the fusion module is used for performing time-dimension behavior category fusion operation on the category labels at all time points to obtain a final behavior category prediction result of the video to be identified;

the class label is predicted based on the following formula:

wherein the content of the first and second substances,

is a category label;

is a scalar value; when the impact value is a vector of values,

in the form of a vector of values,

7. A terminal comprising a memory storing a computer program and a processor executing the computer program to cause the processor to perform the classification method for surveillance video spatiotemporal information fusion according to any one of claims 1-5.

8. A storage medium having stored thereon a computer program which, when executed by a processor, implements a surveillance video spatiotemporal information fusion classification method according to any one of claims 1-5.