CN113743221A - Multi-view pedestrian behavior identification method and system under edge computing architecture - Google Patents

Multi-view pedestrian behavior identification method and system under edge computing architecture

Info

Publication number
CN113743221A
CN113743221A (application CN202110891098.4A); granted publication CN113743221B
Authority
CN
China
Prior art keywords
video data
human behavior
behavior
visual angles
different visual
Prior art date
Legal status
Granted
Application number
CN202110891098.4A
Other languages
Chinese (zh)
Other versions
CN113743221B (en)
Inventor
王雪
游伟
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110891098.4A
Publication of CN113743221A
Application granted
Publication of CN113743221B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a multi-view human behavior recognition method and system under an edge computing architecture, belonging to the technical field of human behavior recognition. The method comprises the following steps: a camera group shoots the same scene from different viewing angles to obtain human behavior video data of the different viewing angles and transmits the data to edge computing nodes connected to the camera group; the edge computing nodes collect, store, and preprocess the to-be-recognized human behavior video data of the different viewing angles within the same time period, and input the preprocessed data into a human behavior feature encoder to obtain multi-view human behavior feature vectors; a cloud server receives the multi-view human behavior feature vectors uploaded by the edge computing nodes and inputs them into a human behavior recognition model to obtain human behavior recognition results. Because human behavior features are extracted on the edge computing nodes and the cloud server only performs human behavior classification, the computing load of the cloud server is reduced and the real-time performance of recognition is improved; by collecting and exploiting multi-view human behavior information, the expressive power of the features is enhanced and the accuracy of human behavior recognition is improved.

Description

Multi-view pedestrian behavior identification method and system under edge computing architecture
Technical Field
The present application relates to the field of human behavior recognition technologies, and in particular to a multi-view human behavior recognition method and system under an edge computing architecture.
Background
Human behavior recognition technology can judge the behaviors and intentions of people from the image data of a camera group, and is of great significance for improving the automation and intelligence of security monitoring systems and for ensuring the stability and order of social production and life. In existing human behavior recognition methods, the image data collected by cameras must be uploaded to a cloud server, a large amount of video data is stored on the cloud server, and data labels are annotated by manually reviewing the videos.
In the related art, a self-supervised learning approach is adopted to reduce the workload of manual annotation. However, on the one hand, when the human body is occluded by an object or by itself, the recognition accuracy of the self-supervised learning method is low. On the other hand, the self-supervised learning method runs on the cloud server, occupies a large amount of its computing resources, and causes high latency in human behavior recognition tasks.
Disclosure of Invention
The application discloses a multi-view human behavior recognition method and system under an edge computing architecture, intended to solve, or at least partially solve, the above problems.
In a first aspect, an embodiment of the present invention discloses a multi-view human behavior recognition method under an edge computing architecture, the method comprising:

a camera group shoots the same scene from different viewing angles to obtain to-be-recognized human behavior video data of the different viewing angles, and transmits the to-be-recognized human behavior video data of the different viewing angles to edge computing nodes connected to the camera group;

the edge computing nodes collect and store the to-be-recognized human behavior video data of the different viewing angles within the same time period, perform data preprocessing on it, input the preprocessed data into a human behavior feature encoder to obtain multi-view human behavior feature vectors, and transmit the multi-view human behavior feature vectors to a cloud server;

and the cloud server receives the multi-view human behavior feature vectors uploaded by the edge computing nodes and inputs them into a human behavior recognition model to obtain the human behavior recognition results of the to-be-recognized human behavior video data of the different viewing angles.
Optionally, the method further comprises:
the camera group shoots the same scene from different viewing angles to obtain first sample human behavior video data of the different viewing angles, and transmits the first sample human behavior video data of the different viewing angles to the edge computing nodes connected to the camera group;

the edge computing node collects and stores the first sample human behavior video data of the different viewing angles within the same time period, performs data preprocessing on it, and trains a preset human behavior self-supervised feature learning model based on the preprocessed first sample human behavior video data of the different viewing angles to obtain the human behavior feature encoder.
Optionally, the method further comprises:
the camera group shoots the same scene from different viewing angles to obtain second sample human behavior video data of the different viewing angles, and transmits the second sample human behavior video data of the different viewing angles to the edge computing node connected to the camera group;

the edge computing node uploads a preset number of second sample human behavior video data of the different viewing angles, collects and stores the second sample human behavior video data of the different viewing angles within the same time period, performs data preprocessing on it, inputs the preprocessed data into the human behavior feature encoder to obtain multi-view human behavior feature vectors, and transmits the multi-view human behavior feature vectors to the cloud server;

the cloud server receives the multi-view human behavior feature vectors uploaded by the edge computing node and the preset number of second sample human behavior video data of the different viewing angles, and trains a preset model according to the behavior category labels annotated on the preset number of second sample human behavior video data and the multi-view human behavior feature vectors to obtain the human behavior recognition model.
Optionally, the data preprocessing of the human behavior video data of different viewing angles includes:

determining skeleton data of the human behavior video data of the different viewing angles from the video data;

preprocessing the skeleton data of the human behavior video data of the different viewing angles to obtain skeleton sequences of the human behavior video data of the different viewing angles;

and fusing the skeleton sequences of the human behavior video data of the different viewing angles to obtain a fused skeleton segment sequence.
Optionally, the method further comprises:
reordering the fused skeleton segment sequence obtained after preprocessing according to a plurality of ordering modes, and annotating each with an ordering-mode label;

the training of the preset human behavior self-supervised feature learning model based on the preprocessed first sample human behavior video data of the different viewing angles comprises:

inputting the reordered fused skeleton segment sequences and their ordering-mode labels into the human behavior self-supervised feature learning model for training.
Optionally, the determining of the skeleton data of the human behavior video data of different viewing angles from the video data includes:

calculating the positions of the human pose keypoints in each frame of the human behavior video data of the different viewing angles, the positions of the human pose keypoints being the skeleton data of the human behavior video data of the different viewing angles;

the skeleton data are computed as

$$P_i^t = \left[\, (x_j, y_j) \,\right]_{j=1}^{N}$$

where $P_i^t$ is the skeleton data extracted from $I_i^t$, the t-th frame image of the camera numbered i; x and y are the horizontal and vertical coordinates of a human pose keypoint in the image, j is the index of a keypoint, and N is the total number of human pose keypoints.
Optionally, the preprocessing of the skeleton data of the human behavior video data of the different viewing angles to obtain the skeleton sequences of the human behavior video data of the different viewing angles includes:

subtracting, from the coordinates of each human pose keypoint in a frame image, the mean of the coordinates of all human pose keypoints in that frame image:

$$\tilde{x}_j = x_j - \frac{1}{N}\sum_{k=1}^{N} x_k, \qquad \tilde{y}_j = y_j - \frac{1}{N}\sum_{k=1}^{N} y_k$$

determining the skeleton feature of each frame image:

$$s_i^t = \left[\, \tilde{x}_1, \tilde{y}_1, \tilde{x}_2, \tilde{y}_2, \ldots, \tilde{x}_N, \tilde{y}_N \,\right]$$

determining the skeleton sequence of the human behavior video data of each viewing angle:

$$S_i = \left[\, s_i^1, s_i^2, \ldots, s_i^T \,\right]$$

normalizing the skeleton sequences of the human behavior video data of the different viewing angles:

$$\tilde{S}_i = \frac{S_i}{\max \left| S_i \right|}$$

where $s_i^t$ is the skeleton feature of the t-th frame image of the camera numbered i, $S_i$ is the skeleton sequence of the human behavior video data of one viewing angle, and $\tilde{S}_i$ is the normalized skeleton sequence of the human behavior video data of that viewing angle.
Optionally, the fusing of the skeleton sequences of the human behavior video data of the different viewing angles to obtain a fused skeleton segment sequence includes:

dividing the skeleton sequence of each viewing angle equally along time nodes to obtain a plurality of skeleton segments;

randomly extracting one of the skeleton segments corresponding to each time node, and fusing the skeleton segments extracted at the multiple time nodes to obtain the fused skeleton segment sequence of the multi-view human behavior video data.
Optionally, the human behavior recognition model outputs the human behavior recognition prediction result according to the following formula:

$$M = \underset{1 \le i \le K}{\arg\max}\; f_{\mathrm{fusion}}\!\left(\left[\, g(X_1), g(X_2), g(X_3) \,\right]\right)(i)$$

where $f_{\mathrm{fusion}}$ is the human behavior classifier, $\left[\, g(X_1), g(X_2), g(X_3) \,\right]$ is the multi-view human behavior feature vector, M is the human behavior recognition prediction result, (i) denotes the i-th element of a vector, and K is the number of behavior classes to be recognized.
In a second aspect, an embodiment of the present invention discloses a multi-view human behavior recognition system under an edge computing architecture, the system comprising:

a camera group, configured to shoot the same scene from different viewing angles to obtain to-be-recognized human behavior video data of the different viewing angles, and to transmit the to-be-recognized human behavior video data of the different viewing angles to the edge computing nodes connected to it;

edge computing nodes, configured to receive the human behavior video data of the different viewing angles transmitted by the camera group, perform data preprocessing on it, and train a preset human behavior self-supervised feature learning model based on the preprocessed human behavior video data of the different viewing angles to obtain the human behavior feature encoder; and further configured to transmit the human behavior video data of the different viewing angles to the cloud server, preprocess the human behavior video data of the different viewing angles, input the preprocessed data into the human behavior feature encoder to obtain multi-view human behavior feature vectors, and transmit the feature vectors to the cloud server;

a cloud server, configured to receive the multi-view human behavior feature vectors and the human behavior video data of the different viewing angles uploaded by the edge computing nodes, determine manually annotated behavior category labels for the human behavior video data of the different viewing angles, and train a preset model according to the manually annotated behavior category labels and the multi-view human behavior feature vectors to obtain the human behavior recognition model; and further configured to receive the multi-view human behavior feature vectors uploaded by the edge computing nodes and input them into the human behavior recognition model to obtain the human behavior recognition results of the human behavior video data of the different viewing angles.
Compared with the prior art, the present application has the following advantages:

The present application provides a multi-view human behavior recognition method and system under an edge computing architecture, comprising the following steps: a camera group shoots the same scene from different viewing angles to obtain human behavior video data of the different viewing angles and transmits it to the edge computing nodes connected to the camera group; the edge computing nodes collect, store, and preprocess the to-be-recognized human behavior video data of the different viewing angles within the same time period and input the preprocessed data into a human behavior feature encoder to obtain multi-view human behavior feature vectors; a cloud server receives the multi-view human behavior feature vectors uploaded by the edge computing nodes and inputs them into a human behavior recognition model to obtain the human behavior recognition results of the human behavior video data of the different viewing angles.

Because human behavior feature extraction is performed on the edge computing nodes, the cloud server only needs to perform behavior classification, which reduces the computing load of the cloud server and improves the real-time performance of recognition. By collecting to-be-recognized human behavior video data of different viewing angles, the discriminability of the features is increased, their expressive power is further improved, and the accuracy of human behavior recognition is improved. Moreover, the multi-view human behavior features can be learned in a self-supervised manner using the computing resources of the edge computing nodes, without manual annotation of human behavior data.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a multi-view human behavior recognition system under an edge computing architecture according to an embodiment of the present invention;

FIG. 2 is a flowchart of the steps of a multi-view human behavior recognition method under an edge computing architecture according to an embodiment of the present invention;

FIG. 3 is a flowchart of the training steps of the human behavior feature encoder in an embodiment of the present invention;

FIG. 4 is a flowchart of the data preprocessing steps for human behavior video data of different viewing angles in an embodiment of the present invention;

FIG. 5 is a diagram of the keypoint numbering of the skeleton data in an embodiment of the present invention;

FIG. 6 is a schematic diagram of multi-view skeleton segment fusion in an embodiment of the present invention;

FIG. 7 is a diagram of the human behavior self-supervised feature learning model in an embodiment of the present invention;

FIG. 8 is a schematic diagram of the human behavior feature encoder in an embodiment of the present invention;

FIG. 9 is a flowchart of the training steps of the human behavior recognition model in an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of the human behavior classifier in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
In the related art, a self-supervised learning approach is adopted: a deep neural network is first trained on a pretext task; the network obtained from the pretext-task training is then used as a feature encoder; finally, the feature encoder extracts human behavior features, and a classifier of simple structure (a fully connected layer, a nearest-neighbor classifier, a support vector machine, etc.) is trained with the features and labels of a small number of samples.

However, self-supervised learning methods for skeleton data are still few, and they have the following shortcomings:

1. They target only single-view cameras; when the human body is occluded by an object or by itself, the human behavior recognition accuracy drops sharply. A large number of cameras are now deployed in public areas, and multiple cameras often shoot the same scene from different viewing angles. A single-view self-supervised learning method cannot exploit the advantages of multi-view cameras; single-view features are not highly discriminative, which increases the difficulty of the subsequent classification task and limits the improvement of human behavior recognition accuracy.

2. The self-supervised learning method runs on the cloud server, occupies a large amount of its computing resources, and causes high latency in human behavior recognition tasks.
Hence, the technical idea of the present invention: a camera group shoots the same scene from different viewing angles to obtain human behavior video data of the different viewing angles and transmits it to the edge computing nodes connected to the camera group; the edge computing nodes collect, store, and preprocess the to-be-recognized human behavior video data of the different viewing angles within the same time period and input the preprocessed data into a human behavior feature encoder to obtain multi-view human behavior feature vectors; a cloud server receives the multi-view human behavior feature vectors uploaded by the edge computing nodes and inputs them into a human behavior recognition model to obtain the human behavior recognition results of the human behavior video data of the different viewing angles. Because human behavior feature extraction is performed on the edge computing nodes, the cloud server only needs to perform behavior classification, which reduces its computing load and improves the real-time performance of recognition. Collecting to-be-recognized human behavior video data of different viewing angles increases the discriminability and expressive power of the features and improves the accuracy of human behavior recognition. And the multi-view human behavior features can be learned in a self-supervised manner using the computing resources of the edge computing nodes, without manual annotation of human behavior data.
Referring to fig. 1, the present invention provides a multi-view human behavior recognition system under an edge computing architecture, the system comprising:

a camera group, configured to shoot the same scene from different viewing angles to obtain to-be-recognized human behavior video data of the different viewing angles, and to transmit the to-be-recognized human behavior video data of the different viewing angles to the edge computing nodes connected to it;

edge computing nodes, configured to receive the human behavior video data of the different viewing angles transmitted by the camera group, perform data preprocessing on it, and train a preset human behavior self-supervised feature learning model based on the preprocessed human behavior video data of the different viewing angles to obtain the human behavior feature encoder; and further configured to transmit the human behavior video data of the different viewing angles to the cloud server, preprocess the human behavior video data of the different viewing angles, input the preprocessed data into the human behavior feature encoder to obtain multi-view human behavior feature vectors, and transmit the feature vectors to the cloud server;

a cloud server, configured to receive the multi-view human behavior feature vectors and the human behavior video data of the different viewing angles uploaded by the edge computing nodes, determine manually annotated behavior category labels for the human behavior video data of the different viewing angles, and train a preset model according to the manually annotated behavior category labels and the multi-view human behavior feature vectors to obtain the human behavior recognition model; and further configured to receive the multi-view human behavior feature vectors uploaded by the edge computing nodes and input them into the human behavior recognition model to obtain the human behavior recognition results of the human behavior video data of the different viewing angles.
In this embodiment, the camera group is composed of a plurality of cameras and serves as the sensing end. The cameras are network high-definition CCD cameras with a resolution of 1280 × 720 and a frame rate of 25 frames/second, and should support the RTSP real-time streaming protocol. The edge computing node is a high-performance workstation equipped with an Intel Xeon E5-2640 v4 @ 2.4 GHz processor, 64 GB of memory, and an NVIDIA RTX 3090 GPU. The software platform of the edge computing node uses Anaconda to build the Python runtime environment, with the NVIDIA CUDA runtime library, the cuDNN acceleration software, and the PyTorch deep learning library installed; the OpenPose pose-estimation toolkit is installed and its Python interface program is compiled. The cloud uses a rack-mounted cloud server equipped with an Intel Xeon Silver 4114 @ 2.2 GHz processor, 128 GB of memory, and an NVIDIA TESLA V100 GPU; its software platform likewise uses Anaconda to build the Python runtime environment, with the NVIDIA CUDA runtime library, the cuDNN acceleration software, and the PyTorch deep learning library installed. The cameras and the edge computing nodes, and the edge computing nodes and the cloud server, are connected by network cables and communicate using fixed IP addresses.
Based on the same inventive concept, an embodiment of the present invention provides a multi-view human behavior recognition method under an edge computing architecture; the implementation environment of the method may be the multi-view human behavior recognition system under an edge computing architecture shown in fig. 1. Referring to fig. 2, fig. 2 is a flowchart of the steps of a multi-view human behavior recognition method under an edge computing architecture according to an embodiment of the present application; the method includes the following steps:

Step S201: the camera group shoots the same scene from different viewing angles to obtain to-be-recognized human behavior video data of the different viewing angles, and transmits the to-be-recognized human behavior video data of the different viewing angles to the edge computing nodes connected to it.

When shooting the same scene, several cameras are erected around the scene at different angles or at different heights, so that video data of the same scene at different viewing angles, that is, to-be-recognized human behavior video data of different viewing angles, can be collected; the data is transmitted through network cables and gateway devices to the edge computing nodes deployed near the camera group.

Step S202: the edge computing node collects and stores the to-be-recognized human behavior video data of the different viewing angles within the same time period, performs data preprocessing on it, inputs the preprocessed data into the human behavior feature encoder to obtain multi-view human behavior feature vectors, and transmits the multi-view human behavior feature vectors to the cloud server.
After the edge computing node receives the video streams of the to-be-recognized human behavior video data of the different viewing angles, a synchronized multi-view image acquisition task is executed on the node. Specifically, human behavior video clips of the different viewing angles within the same time period are continuously collected and stored on the edge node:

$$V_i = \left[\, I_i^1, I_i^2, \ldots, I_i^T \,\right], \qquad i = 1, 2, \ldots, C$$

where C is the number of cameras, i is the camera number, T is the number of image frames per video clip, and $I_i^t$ is the t-th frame image of the i-th camera. The collected human behavior video data of the different viewing angles then undergo multi-view skeleton extraction, skeleton data preprocessing, and multi-view skeleton sequence fusion, and are input into the human behavior feature encoder, which outputs the multi-view human behavior feature vectors corresponding to the to-be-recognized human behavior video data of the different viewing angles; all the multi-view human behavior features are concatenated into one feature vector, which is uploaded to the cloud server. Using human behavior video data that shoot the same scene from different viewing angles solves the problem of insufficient discriminability of single-view features, improves the expressive power of the features, lowers the difficulty of the classification task, and improves the accuracy of human behavior classification.
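As an illustration of the synchronized acquisition task, the following is a minimal sketch assuming OpenCV and RTSP streams; the camera URLs, the clip length, and the lock-step read loop are illustrative assumptions, not details taken from the patent.

```python
# Sketch of synchronized multi-view clip collection on an edge node.
import cv2
import numpy as np

CAMERA_URLS = [                      # hypothetical fixed-IP RTSP endpoints
    "rtsp://192.168.1.101/stream",
    "rtsp://192.168.1.102/stream",
    "rtsp://192.168.1.103/stream",
]
T = 96  # frames per clip, e.g. just under 4 s at 25 frames/second

def collect_clips(urls=CAMERA_URLS, t=T):
    """Return V_i = [I_i^1, ..., I_i^T] for each camera i = 1..C."""
    caps = [cv2.VideoCapture(u) for u in urls]
    clips = [[] for _ in urls]
    for _ in range(t):
        for i, cap in enumerate(caps):
            ok, frame = cap.read()   # read in lock-step for rough synchrony
            if not ok:
                raise RuntimeError(f"camera {i} dropped its stream")
            clips[i].append(frame)
    for cap in caps:
        cap.release()
    return [np.stack(c) for c in clips]
```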
Step S203: the cloud server receives the multi-view human behavior feature vectors uploaded by the edge computing nodes, inputs them into the human behavior recognition model, and obtains the human behavior recognition results of the to-be-recognized human behavior video data of the different viewing angles.

The cloud server receives the human behavior feature vector uploaded by the edge computing node, inputs the feature vector into the human behavior recognition model on the cloud server, and completes the final behavior recognition task. Because the feature extraction task is completed by the edge computing node, only the human behavior recognition task is executed in the cloud; the feature extraction task for the to-be-recognized human behavior video data of the different viewing angles is not. This reduces the computing load of the cloud server and improves the real-time performance of behavior recognition.

In this embodiment, through the above steps, the human behavior feature encoder obtained by self-supervised learning of multi-view human behavior features on the edge computing nodes allows human behavior feature extraction to be performed on the edge nodes, so that the cloud server only needs to run a simple human behavior classification model; this reduces the computing load of the cloud server and improves the real-time performance of human behavior recognition. At the same time, using the multi-view human behavior information improves the expressive power of the features and the accuracy of human behavior recognition.

The processing flow for the to-be-recognized data in multi-view human behavior recognition is similar to the processing flow for the sample data; the only differences are the target object and the subsequent operations.
As shown in fig. 3, training the preset human behavior self-supervised feature learning model includes the following steps:

Step S200-1: the camera group shoots the same scene from different viewing angles to obtain first sample human behavior video data of the different viewing angles, and transmits the first sample human behavior video data of the different viewing angles to the edge computing nodes connected to it;

Step S200-2: the edge computing node collects and stores the first sample human behavior video data of the different viewing angles within the same time period, performs data preprocessing on it, and trains the preset human behavior self-supervised feature learning model based on the preprocessed first sample human behavior video data of the different viewing angles to obtain the human behavior feature encoder.
In this embodiment, after the cameras finish collecting the first sample human behavior video data of the different viewing angles, they transmit the collected data as video streams to the edge computing node directly connected to them. After the edge computing node receives the video streams, a synchronized multi-view image acquisition task is executed on the node: human behavior video clips of the different viewing angles within the same time period are continuously collected and stored on the edge node as

$$V_i = \left[\, I_i^1, I_i^2, \ldots, I_i^T \,\right], \qquad i = 1, 2, \ldots, C$$

where C is the number of cameras, i is the camera number, T is the number of image frames per video clip, and $I_i^t$ is the t-th frame image of the i-th camera. The collected human behavior video data of the different viewing angles serve as source data for multi-view skeleton extraction, skeleton data preprocessing, multi-view skeleton sequence fusion, and training of the human behavior self-supervised feature learning model, yielding the human behavior feature encoder. In this way, the multi-view human behavior features are learned in a self-supervised manner using the computing resources of the edge computing nodes, without manual annotation of human behavior data.
As shown in fig. 4, the data preprocessing of the human behavior video data of the different viewing angles includes the steps of:

Step S200-2-1: determining skeleton data of the human behavior video data of the different viewing angles from the video data;

Step S200-2-2: preprocessing the skeleton data of the human behavior video data of the different viewing angles to obtain skeleton sequences of the human behavior video data of the different viewing angles;

Step S200-2-3: fusing the skeleton sequences of the human behavior video data of the different viewing angles to obtain a fused skeleton segment sequence.
Further, in step S200-2-1, determining the skeleton data of the human behavior video data of the different viewing angles from the video data includes:

calculating the positions of the human pose keypoints in each frame of the human behavior video data of the different viewing angles, the positions of the human pose keypoints being the skeleton data of the human behavior video data of the different viewing angles;

the skeleton data are computed as

$$P_i^t = \left[\, (x_j, y_j) \,\right]_{j=1}^{N}$$

where $P_i^t$ is the skeleton data extracted from $I_i^t$, the t-th frame image of the camera numbered i; x and y are the horizontal and vertical coordinates of a human pose keypoint in the image, j is the index of a keypoint, and N is the total number of human pose keypoints.

In this embodiment, this calculation is applied either to the to-be-recognized human behavior video data of different viewing angles or to the sample human behavior video data of different viewing angles. The skeleton comprises 18 human pose keypoints in total, numbered as shown in fig. 5.
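For concreteness, the per-frame skeleton extraction can be sketched as follows; the `estimate_pose` helper is a hypothetical stand-in for the pose estimator installed on the edge node (e.g. the OpenPose Python interface) and is not an API defined by the patent.

```python
# Sketch of per-frame skeleton extraction, assuming N = 18 pose keypoints.
import numpy as np

N_KEYPOINTS = 18  # keypoint numbering as shown in fig. 5

def estimate_pose(frame):
    """Hypothetical wrapper: return an (N, 2) array of (x_j, y_j) positions."""
    raise NotImplementedError("bind to the pose estimator on the edge node")

def skeleton_data(clip):
    """clip: iterable of frames I_i^t -> (T, N, 2) skeleton data for camera i."""
    return np.stack([estimate_pose(frame) for frame in clip])
```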
Further, in step S200-2-2, preprocessing the skeleton data of the human behavior video data of the different viewing angles to obtain the skeleton sequences of the human behavior video data of the different viewing angles includes the following operations.

The mean of the coordinates of all human pose keypoints in a frame image is subtracted from the coordinates of each human pose keypoint in that frame image:

$$\tilde{x}_j = x_j - \frac{1}{N}\sum_{k=1}^{N} x_k, \qquad \tilde{y}_j = y_j - \frac{1}{N}\sum_{k=1}^{N} y_k$$

The skeleton feature of each frame image is determined:

$$s_i^t = \left[\, \tilde{x}_1, \tilde{y}_1, \tilde{x}_2, \tilde{y}_2, \ldots, \tilde{x}_N, \tilde{y}_N \,\right]$$

The skeleton sequence of the human behavior video data of each viewing angle is determined:

$$S_i = \left[\, s_i^1, s_i^2, \ldots, s_i^T \,\right]$$

The skeleton sequences of the human behavior video data of the different viewing angles are normalized:

$$\tilde{S}_i = \frac{S_i}{\max \left| S_i \right|}$$

where $s_i^t$ is the skeleton feature of the t-th frame image of the camera numbered i, $S_i$ is the skeleton sequence of the human behavior video data of one viewing angle, and $\tilde{S}_i$ is the normalized skeleton sequence of the human behavior video data of that viewing angle.
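Under the formulas above, this preprocessing reduces to a few NumPy operations; note that the max-absolute-value normalization in this sketch is an assumption, since the patent gives the normalization formula only as an image.

```python
# Sketch of skeleton preprocessing: per-frame centering, flattening into
# per-frame skeleton features, and sequence normalization.
import numpy as np

def preprocess_skeleton(P):
    """P: (T, N, 2) keypoints for one camera -> normalized sequence (T, 2N)."""
    centered = P - P.mean(axis=1, keepdims=True)  # subtract per-frame mean
    S = centered.reshape(P.shape[0], -1)          # s_i^t = [x~1, y~1, x~2, ...]
    return S / np.abs(S).max()                    # assumed normalization
```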
Further, in step S200-2-3, fusing the skeleton sequences of the human behavior video data of the different viewing angles to obtain the fused skeleton segment sequence includes:

dividing the skeleton sequence of each viewing angle equally along time nodes to obtain a plurality of skeleton segments;

randomly extracting one of the skeleton segments corresponding to each time node, and fusing the skeleton segments extracted at the multiple time nodes to obtain the fused skeleton segment sequence of the multi-view human behavior video data.

In this embodiment, 3 cameras are used as an example (C = 3); the procedure is shown in fig. 6. First, each skeleton sequence $\tilde{S}_i$ is divided in temporal order into 3 skeleton segments $\tilde{S}_i^{(1)}, \tilde{S}_i^{(2)}, \tilde{S}_i^{(3)}$. Then the set of first skeleton segments of the 3 cameras, $\{\tilde{S}_1^{(1)}, \tilde{S}_2^{(1)}, \tilde{S}_3^{(1)}\}$, is constructed, and one segment $X_1$ is randomly extracted from it. In the same way, random segments $X_2$ and $X_3$ are extracted from $\{\tilde{S}_1^{(2)}, \tilde{S}_2^{(2)}, \tilde{S}_3^{(2)}\}$ and $\{\tilde{S}_1^{(3)}, \tilde{S}_2^{(3)}, \tilde{S}_3^{(3)}\}$ respectively, yielding a skeleton segment sequence $[X_1, X_2, X_3]$ that contains multi-view human behavior information.
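A short sketch of this fusion step, assuming C = 3 cameras whose sequences have already been preprocessed into (T, 2N) arrays:

```python
# Sketch of multi-view skeleton segment fusion: split each camera's sequence
# into temporal segments, then draw one segment per time slot from a
# randomly chosen camera.
import numpy as np

def fuse_segments(sequences, n_segments=3, rng=None):
    """sequences: C arrays of shape (T, 2N) -> fused segments [X_1, ..., X_n]."""
    if rng is None:
        rng = np.random.default_rng()
    per_camera = [np.array_split(S, n_segments) for S in sequences]
    fused = []
    for slot in range(n_segments):
        cam = rng.integers(len(sequences))   # random viewing angle per slot
        fused.append(per_camera[cam][slot])  # segment X_{slot+1}
    return fused
```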
It should be noted that the to-be-recognized human behavior video data, the second sample human behavior video data, and the first sample human behavior video data all undergo data preprocessing, and the preprocessing processes are identical; only the operations executed afterwards differ. The to-be-recognized human behavior video data and the second sample human behavior video data are input into the human behavior feature encoder to extract multi-view human behavior feature vectors, whereas the first sample human behavior video data is input into the human behavior self-supervised feature learning model for training.

The fused skeleton segment sequence obtained after data preprocessing can be used as input to train the human behavior self-supervised feature learning model, which further improves the extraction accuracy of the human behavior feature encoder.
The fused skeleton segment sequence obtained after preprocessing can be reordered according to a plurality of ordering modes and annotated with an ordering-mode label;

the training of the preset human behavior self-supervised feature learning model based on the preprocessed first sample human behavior video data of the different viewing angles then comprises:

inputting the reordered fused skeleton segment sequences and their ordering-mode labels into the human behavior self-supervised feature learning model for training.

In this embodiment, the fused skeleton segment sequences obtained after preprocessing are randomly reordered to obtain reordered skeleton segment sequences $[X_{\sigma(1)}, X_{\sigma(2)}, X_{\sigma(3)}]$, where σ is a permutation of {1, 2, 3}. There are 6 ordering modes in total; the ordering modes and their labels are listed in Table 1 (a sketch of the reordering follows the table). The reordered skeleton segment sequences and their ordering-mode labels are input into the self-supervised feature learning model for training.
TABLE 1

Ordering mode      Label
[X1, X2, X3]       0
[X1, X3, X2]       1
[X2, X1, X3]       2
[X2, X3, X1]       3
[X3, X1, X2]       4
[X3, X2, X1]       5
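The reordering pretext task can be sketched directly from Table 1; the enumeration order of `itertools.permutations` happens to reproduce the label assignment of the table.

```python
# Sketch of the reordering pretext task: shuffle the fused segments
# [X1, X2, X3] and return the Table 1 label of the ordering used.
import itertools
import random

ORDERINGS = list(itertools.permutations(range(3)))  # label k -> row k of Table 1

def reorder(fused, rng=random):
    """fused: [X1, X2, X3] -> (reordered segment sequence, ordering-mode label)."""
    label = rng.randrange(len(ORDERINGS))
    reordered = [fused[j] for j in ORDERINGS[label]]
    return reordered, label
```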
As shown in fig. 7, after the reordered skeleton segment sequence and the ordering-mode label are input into the human behavior self-supervised feature learning model, the model must determine the ordering mode from the input skeleton segment sequence. Here g denotes the human behavior feature encoder and h denotes the ordering-mode classifier. The training process of the human behavior self-supervised feature learning model is as follows: the 3 skeleton segments obtained by multi-view skeleton segment fusion are input separately into the human behavior feature encoder g, which encodes each segment into a 128-dimensional feature; the 3 128-dimensional features are concatenated into a 384-dimensional human behavior feature vector. This vector is input into the ordering-mode classifier h, which outputs a prediction of the ordering mode. Let $\hat{y}$ denote the probability distribution of the predicted ordering mode:

$$\hat{y} = h\!\left(\left[\, g(X_1), g(X_2), g(X_3) \,\right]\right)$$

The ground-truth ordering mode y is generated from the ordering-mode label by one-hot encoding, and the model loss is computed with the following cross-entropy loss function:

$$L = -\sum_{i=1}^{6} y(i) \log \hat{y}(i)$$

The human behavior self-supervised feature learning model is trained by stochastic gradient descent, yielding the human behavior feature encoder g and the ordering-mode classifier h.
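A minimal PyTorch sketch of this training loop follows. The `Encoder` module is assumed to match fig. 8 (a concrete sketch of it appears after the next paragraph), and the two-layer structure of the ordering-mode classifier h is an assumption, as the patent does not specify it.

```python
# Sketch of the self-supervised training step: g encodes each segment to
# 128-D, h predicts one of the 6 orderings, cross-entropy loss, SGD.
import torch
import torch.nn as nn

g = Encoder()  # assumed fig. 8 encoder, sketched after the next paragraph
h = nn.Sequential(nn.Linear(384, 128), nn.ReLU(),
                  nn.Linear(128, 6))       # ordering-mode classifier (assumed)
opt = torch.optim.SGD(list(g.parameters()) + list(h.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()            # cross-entropy over the 6 orderings

def train_step(x1, x2, x3, labels):
    """x1..x3: (B, 36, 32) segment batches; labels: (B,) ordering labels."""
    feat = torch.cat([g(x1), g(x2), g(x3)], dim=1)  # 3 x 128-D -> 384-D
    loss = loss_fn(h(feat), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```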
In a possible embodiment, the structure of the human behavior feature encoder is shown in fig. 8; its input is a skeleton segment and its output is a feature vector of fixed length. Taking a skeleton sequence of length T = 96 as an example, after the sequence is divided into 3 equal segments, each segment has length 32; each frame in a segment contains the x and y coordinates of 18 keypoints, so the data dimension input to the feature encoder is (32, 36). In fig. 8, "Conv1D" denotes a one-dimensional convolutional layer, and the 3 numbers in parentheses after "Conv1D" are the number of output channels, the convolution kernel size, and the convolution stride of that layer. "BN" is a batch-normalization layer, used to prevent vanishing or exploding gradients and to speed up training; ReLU (rectified linear unit) is the activation function used by the feature encoder. The numbers outside the boxes are the dimensions of the output feature maps of the encoder layers. The encoder processes a skeleton segment as follows. The segment is passed through 6 convolutional layers; all 6 have kernel size 6, the 3rd and 6th layers have stride 2, and the remaining layers have stride 1. The first 3 convolutional layers each output 64 channels, and the last 3 each output 128 channels. A residual connection is used across every group of 3 convolutional layers; the residual branch is also a one-dimensional convolution, with kernel size 1 and stride 2. After the 6 convolutional layers, the output feature map has dimension (8, 128). This feature map is input into a max-pooling layer with stride 4, whose output has dimension (2, 128), and then into a flattening layer that arranges it into a one-dimensional vector; the vector passes through a fully connected layer with output dimension (256) and then a fully connected layer with output dimension (128), producing the 128-dimensional feature vector output by the encoder.
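A PyTorch sketch of the fig. 8 encoder under the stated layer parameters; the padding values and the activation between the two fully connected layers are assumptions chosen to reproduce the stated feature-map dimensions.

```python
# Sketch of the fig. 8 encoder: two residual groups of three Conv1D(kernel 6)
# layers (64 then 128 channels, stride 2 on the 3rd layer of each group),
# max pooling with stride 4, then FC-256 and FC-128.
import torch
import torch.nn as nn

def conv_bn(c_in, c_out, stride):
    pad = "same" if stride == 1 else 2   # padding is an assumed choice
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=6, stride=stride, padding=pad),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
    )

class Encoder(nn.Module):
    """Skeleton segment (B, 36, 32) -> 128-dimensional feature vector."""
    def __init__(self, c_in=36, feat_dim=128):
        super().__init__()
        self.block1 = nn.Sequential(conv_bn(c_in, 64, 1),
                                    conv_bn(64, 64, 1),
                                    conv_bn(64, 64, 2))
        self.skip1 = nn.Conv1d(c_in, 64, kernel_size=1, stride=2)
        self.block2 = nn.Sequential(conv_bn(64, 128, 1),
                                    conv_bn(128, 128, 1),
                                    conv_bn(128, 128, 2))
        self.skip2 = nn.Conv1d(64, 128, kernel_size=1, stride=2)
        self.head = nn.Sequential(
            nn.MaxPool1d(4),                 # (B, 128, 8) -> (B, 128, 2)
            nn.Flatten(),                    # -> (B, 256)
            nn.Linear(256, 256), nn.ReLU(),  # FC with output dimension 256
            nn.Linear(256, feat_dim),        # FC with output dimension 128
        )

    def forward(self, x):
        x = self.block1(x) + self.skip1(x)   # residual across 3 conv layers
        x = self.block2(x) + self.skip2(x)
        return self.head(x)
```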
As shown in fig. 9, training the human behavior recognition model includes the following steps:

Step S200-3: the camera group shoots the same scene from different viewing angles to obtain second sample human behavior video data of the different viewing angles, and transmits the second sample human behavior video data of the different viewing angles to the edge computing node connected to it;

Step S200-4: the edge computing node uploads a preset number of second sample human behavior video data of the different viewing angles, collects and stores the second sample human behavior video data of the different viewing angles within the same time period, performs data preprocessing on it, inputs the preprocessed data into the human behavior feature encoder to obtain multi-view human behavior feature vectors, and transmits the multi-view human behavior feature vectors to the cloud server;

Step S200-5: the cloud server receives the multi-view human behavior feature vectors uploaded by the edge computing node and the preset number of second sample human behavior video data of the different viewing angles, and trains a preset model according to the behavior category labels annotated on the preset number of second sample human behavior video data and the multi-view human behavior feature vectors to obtain the human behavior recognition model.
In this embodiment, the human behavior feature encoder obtained by self-supervised training can extract human behavior features that carry sufficient information and are well discriminable. Therefore, the edge nodes upload only a small number of human behavior video clips to the cloud server; human behavior category labels are annotated for these clips manually, and a human behavior classifier of simple structure is constructed. After training, a human behavior recognition model with a high recognition rate and a simple structure is obtained without manually annotating a large number of labels, which saves annotation effort. The human behavior classifier $f_{\mathrm{action}}$ is trained using the human behavior feature vectors uploaded by the edge nodes and the manually annotated behavior category labels; with the trained classifier, the cloud server only needs to perform human behavior classification, which reduces its computing load and improves the real-time performance of recognition.
In a possible implementation, the structure of the human behavior classifier $f_{\mathrm{action}}$, trained with the behavior category labels, is shown in fig. 10. The input of the classifier is a 384-dimensional human behavior feature vector, and it comprises 2 fully connected layers. The output of the first fully connected layer fc1 is 256-dimensional, with a ReLU activation function. The output dimension of the second fully connected layer fc2 equals the number K of behavior classes to be recognized, with a Softmax activation function. The model is trained with a cross-entropy loss function using stochastic gradient descent. With the trained classifier, the human behavior recognition task is completed by the following human behavior recognition model:

$$M = \underset{1 \le i \le K}{\arg\max}\; f_{\mathrm{fusion}}\!\left(\left[\, g(X_1), g(X_2), g(X_3) \,\right]\right)(i)$$

where $f_{\mathrm{fusion}}$ is the human behavior classifier, $\left[\, g(X_1), g(X_2), g(X_3) \,\right]$ is the multi-view human behavior feature vector, M is the human behavior recognition prediction result, (i) denotes the i-th element of a vector, and K is the number of behavior classes to be recognized.
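A sketch of the fig. 10 classifier and the prediction rule above; K is application-specific, and Softmax is folded into the training loss, since argmax over the logits gives the same prediction M at inference.

```python
# Sketch of the fig. 10 classifier: fc1 (384 -> 256, ReLU), fc2 (256 -> K),
# and argmax over the class scores as the prediction M.
import torch
import torch.nn as nn

K = 10  # number of behavior classes to recognize (application-specific)

f_fusion = nn.Sequential(
    nn.Linear(384, 256), nn.ReLU(),  # fc1: 384-D feature -> 256-D, ReLU
    nn.Linear(256, K),               # fc2: logits over the K behavior classes
)

def recognize(feat):
    """feat: (B, 384) multi-view feature vectors -> predicted class indices M."""
    with torch.no_grad():
        return f_fusion(feat).argmax(dim=1)
```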
Based on the same inventive concept, an embodiment of the present application provides a readable storage medium storing a multi-view human behavior recognition program under an edge computing architecture; when executed by a processor, the program implements the steps of the multi-view human behavior recognition method under an edge computing architecture according to the first aspect of the embodiments of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, those skilled in the art may make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The multi-view human behavior recognition method and system under an edge computing architecture provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A multi-view human behavior recognition method under an edge computing architecture, the method comprising:

a camera group shoots the same scene from different viewing angles to obtain to-be-recognized human behavior video data of the different viewing angles, and transmits the to-be-recognized human behavior video data of the different viewing angles to edge computing nodes connected to the camera group;

the edge computing nodes collect and store the to-be-recognized human behavior video data of the different viewing angles within the same time period, perform data preprocessing on it, input the preprocessed data into a human behavior feature encoder to obtain multi-view human behavior feature vectors, and transmit the multi-view human behavior feature vectors to a cloud server;

and the cloud server receives the multi-view human behavior feature vectors uploaded by the edge computing nodes and inputs them into a human behavior recognition model to obtain the human behavior recognition results of the to-be-recognized human behavior video data of the different viewing angles.
2. The method of claim 1, further comprising:
the camera group captures the same scene from different visual angles to obtain first-sample human behavior video data of the different visual angles, and transmits the first-sample human behavior video data of the different visual angles to the edge computing node connected with the camera group; and
the edge computing node collects and stores the first-sample human behavior video data of the different visual angles within the same time period, performs data preprocessing on the first-sample human behavior video data of the different visual angles within the same time period, and trains a preset human behavior self-supervised feature learning model on the preprocessed first-sample human behavior video data of the different visual angles to obtain the human behavior feature encoder.
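A hedged sketch of this edge-side self-supervised training, assuming a PyTorch encoder and an ordering-prediction head (the framework, architecture, and hyperparameters are illustrative assumptions; the patent fixes none of them):

```python
# A minimal PyTorch sketch (assumed framework) of edge-side self-supervised
# training: the encoder learns by predicting which ordering was applied to a
# fused skeleton segment sequence (see claim 5).
import torch
import torch.nn as nn

def train_self_supervised(encoder, order_head, loader, epochs=10, lr=1e-3):
    """loader yields (reordered_sequence, ordering_label) pairs."""
    params = list(encoder.parameters()) + list(order_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for seq, order_label in loader:
            logits = order_head(encoder(seq))   # predict the ordering class
            loss = loss_fn(logits, order_label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder  # serves as the trained human behavior feature encoder
```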
3. The method of claim 2, further comprising:
the camera group captures the same scene from different visual angles to obtain second-sample human behavior video data of the different visual angles, and transmits the second-sample human behavior video data of the different visual angles to the edge computing node connected with the camera group;
the edge computing node uploads a preset number of the second-sample human behavior video data of the different visual angles to the cloud server, collects and stores the second-sample human behavior video data of the different visual angles within the same time period, performs data preprocessing on the second-sample human behavior video data of the different visual angles within the same time period, inputs the preprocessed data into the human behavior feature encoder to obtain multi-view human behavior feature vectors, and transmits the multi-view human behavior feature vectors to the cloud server; and
the cloud server receives the multi-view human behavior feature vectors uploaded by the edge computing node and the preset number of second-sample human behavior video data of the different visual angles, and trains a preset model according to behavior category labels annotated on the preset number of second-sample human behavior video data and the multi-view human behavior feature vectors, to obtain the human behavior recognition model.
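The cloud-side supervised training of claim 3 might look as follows; the linear classification head, optimizer, and hyperparameters are assumptions for illustration, with PyTorch again used only as an example framework:

```python
# A hedged sketch of cloud-side supervised training: a simple linear head (an
# assumption; the patent leaves the model unspecified) is fitted on uploaded
# multi-view feature vectors and their manually annotated behavior labels.
import torch
import torch.nn as nn

def train_recognition_model(feat_dim, num_classes, loader, epochs=20, lr=1e-3):
    """loader yields (multi_view_feature_vector, behavior_label) pairs."""
    model = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, label in loader:
            loss = loss_fn(model(features), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model  # the human behavior recognition model
```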
4. The method according to claim 1 or 2, wherein the data preprocessing of the human behavior video data of the different visual angles comprises:
determining skeleton data of the human behavior video data of the different visual angles from the human behavior video data of the different visual angles;
preprocessing the skeleton data of the human behavior video data of the different visual angles to obtain skeleton sequences of the human behavior video data of the different visual angles; and
fusing the skeleton sequences of the human behavior video data of the different visual angles to obtain a fused skeleton segment sequence.
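As a sketch under assumed helper names, the three preprocessing steps of claim 4 chain together as follows; `extract_skeletons`, `to_sequences`, and `fuse_segments` are hypothetical stand-ins for the operations detailed in claims 6 to 8:

```python
# The three claim-4 steps chained in order; the three helper callables are
# hypothetical stand-ins for the operations of claims 6, 7, and 8 respectively.
def preprocess(videos_per_view, extract_skeletons, to_sequences, fuse_segments):
    skeletons = [extract_skeletons(video) for video in videos_per_view]  # claim 6
    sequences = [to_sequences(skel) for skel in skeletons]               # claim 7
    return fuse_segments(sequences)                                      # claim 8
```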
5. The method of claim 4, further comprising:
reordering the fused skeleton segment sequence obtained after the preprocessing according to a plurality of ordering modes, and attaching a corresponding ordering-mode label; and
wherein training the preset human behavior self-supervised feature learning model on the preprocessed first-sample human behavior video data of the different visual angles comprises:
inputting the reordered fused skeleton segment sequences and their ordering-mode labels into the human behavior self-supervised feature learning model for training.
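A possible rendering of this reordering step: the fused skeleton segments are shuffled under a fixed dictionary of permutations, and the permutation index becomes the self-supervision label. The number of orderings (4) and the fixed seed are arbitrary assumptions; the patent does not specify them:

```python
# Hypothetical rendering of the claim-5 reordering; num_orderings must not
# exceed len(segments)! (the number of available permutations).
import itertools
import random

def make_ordering_samples(segments, num_orderings=4, seed=0):
    """Return (reordered_segments, ordering_label) training pairs."""
    rng = random.Random(seed)
    perms = list(itertools.permutations(range(len(segments))))
    chosen = rng.sample(perms, num_orderings)  # fixed dictionary of orderings
    return [([segments[i] for i in perm], label)
            for label, perm in enumerate(chosen)]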
6. The method of claim 4, wherein determining the skeleton data of the human behavior video data of the different visual angles from the human behavior video data of the different visual angles comprises:
calculating the positions of the human body posture key points in each frame image of the human behavior video data of the different visual angles, the positions of the human body posture key points being the skeleton data of the human behavior video data of the different visual angles;
the skeleton data being calculated as

$$P_t^i = \left\{ \left( x_j^{t,i},\; y_j^{t,i} \right) \,\middle|\, j = 1, 2, \dots, N \right\}$$

wherein $P_t^i$ denotes the skeleton data of the $t$-th frame image from the $i$-th camera, $x$ and $y$ respectively denote the horizontal and vertical coordinates of a human body posture key point in the image, $j$ is the index of the human body posture key point, and $N$ is the total number of human body posture key points.
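A sketch of this keypoint extraction, assuming a hypothetical `estimate_pose(frame)` that returns the N (x, y) key points of one frame; the patent does not name a specific pose estimator:

```python
# A sketch of the claim-6 extraction; `estimate_pose(frame)` is a hypothetical
# pose estimator returning the N (x, y) key points of a single frame.
import numpy as np

def skeleton_data(frames, estimate_pose, num_keypoints):
    """P[t, j] = (x_j, y_j) of key point j in frame t, per the claim-6 formula."""
    P = np.zeros((len(frames), num_keypoints, 2))
    for t, frame in enumerate(frames):
        P[t] = estimate_pose(frame)  # shape (N, 2)
    return P
```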
7. The method of claim 6, wherein preprocessing the skeleton data of the human behavior video data of the different visual angles to obtain the skeleton sequences of the human behavior video data of the different visual angles comprises:
subtracting, from the coordinate position of each human body posture key point in each frame image, the mean value of the coordinate positions of all human body posture key points in that frame image:

$$\hat{x}_j^{t,i} = x_j^{t,i} - \frac{1}{N} \sum_{k=1}^{N} x_k^{t,i}, \qquad \hat{y}_j^{t,i} = y_j^{t,i} - \frac{1}{N} \sum_{k=1}^{N} y_k^{t,i}$$

determining the skeleton feature of each frame image:

$$s_t^i = \left[ \hat{x}_1^{t,i}, \hat{y}_1^{t,i}, \hat{x}_2^{t,i}, \hat{y}_2^{t,i}, \dots, \hat{x}_N^{t,i}, \hat{y}_N^{t,i} \right]$$

determining the skeleton sequence of the human behavior video data of each visual angle:

$$S^i = \left[ s_1^i, s_2^i, \dots, s_T^i \right]$$

and normalizing the skeleton sequences of the human behavior video data of the different visual angles:

$$\tilde{S}^i = \operatorname{Norm}\left( S^i \right)$$

wherein $s_t^i$ is the skeleton feature of the $t$-th frame image from the $i$-th camera, $T$ is the number of frames, $S^i$ is the skeleton sequence of the human behavior video data of the $i$-th visual angle, $\operatorname{Norm}(\cdot)$ denotes the normalization operation, and $\tilde{S}^i$ is the normalized skeleton sequence of the human behavior video data of the $i$-th visual angle.
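A numerical rendering of these steps for one camera view: per-frame mean-centering, flattening into per-frame skeleton features, and stacking into a sequence. The original normalization formula is rendered only as an image in the source, so the max-absolute scaling below is an assumption:

```python
# Numerical rendering of the claim-7 pipeline for one camera view; the
# max-absolute scaling in the last line is an assumed normalization.
import numpy as np

def skeleton_sequence(P):
    """P: (T, N, 2) key points -> normalized (T, 2N) skeleton sequence."""
    centered = P - P.mean(axis=1, keepdims=True)  # subtract per-frame key-point mean
    S = centered.reshape(P.shape[0], -1)          # s_t = [x^_1, y^_1, ..., x^_N, y^_N]
    return S / (np.abs(S).max() + 1e-8)           # assumed max-abs normalization
```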
8. The method of claim 4, wherein fusing the skeleton sequences of the human behavior video data of the different visual angles to obtain the fused skeleton segment sequence comprises:
dividing the skeleton sequence of the human behavior video data of each visual angle equally along time nodes to obtain a plurality of skeleton segments; and
for each time node, randomly extracting any one of the skeleton segments corresponding to that time node across the visual angles, and fusing the skeleton segments corresponding to the plurality of time nodes to obtain the fused skeleton segment sequence of the human behavior video data of the different visual angles.
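A sketch of this fusion, under the assumption that all views share the same frame count T and that T is divisible by the number of segments:

```python
# Sketch of the claim-8 fusion: split each view's (T, D) sequence into equal
# time segments, then pick each time slot's segment from a randomly chosen view.
import random

def fuse_segments(sequences, num_segments, seed=None):
    """sequences: per-view (T, D) arrays, equal T divisible by num_segments."""
    rng = random.Random(seed)
    step = len(sequences[0]) // num_segments
    fused = []
    for k in range(num_segments):
        view = rng.randrange(len(sequences))       # random view for this slot
        fused.append(sequences[view][k * step:(k + 1) * step])
    return fused  # the fused skeleton segment sequence
```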
9. The method of claim 3, wherein the human behavior recognition model outputs the human behavior recognition prediction result according to the following formula:

$$m = \arg\max_{i \in \{1, \dots, K\}} f_{\mathrm{fusion}}\left( \left[ g(X_1), g(X_2), g(X_3) \right] \right)(i)$$

wherein $f_{\mathrm{fusion}}$ is the human behavior classifier, $g(X_1), g(X_2), g(X_3)$ are the multi-view human behavior feature vectors, $m$ is the human behavior recognition prediction result, $(i)$ denotes taking the $i$-th element of the vector, and $K$ is the number of behavior categories to be recognized.
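A worked version of this prediction, assuming `f_fusion` maps the concatenated multi-view feature vectors to a K-dimensional class-score vector:

```python
# Worked version of the claim-9 prediction; `f_fusion` is assumed to return a
# K-dimensional class-score vector for the concatenated multi-view features.
import numpy as np

def predict_behavior(f_fusion, g1, g2, g3):
    """m = argmax_i f_fusion([g(X1), g(X2), g(X3)])(i)."""
    scores = f_fusion(np.concatenate([g1, g2, g3]))  # K behavior-class scores
    return int(np.argmax(scores))                    # prediction result m
```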
10. A multi-view human behavior identification system under an edge computing architecture, the system comprising:
a camera group, configured to capture the same scene from different visual angles to obtain to-be-identified human behavior video data of the different visual angles, and to transmit the to-be-identified human behavior video data of the different visual angles to an edge computing node connected with the camera group;
the edge computing node, configured to receive the human behavior video data of the different visual angles transmitted by the camera group, perform data preprocessing on the human behavior video data of the different visual angles, and train a preset human behavior self-supervised feature learning model on the data-preprocessed human behavior video data of the different visual angles to obtain a human behavior feature encoder; and further configured to transmit human behavior video data of the different visual angles to a cloud server, perform data preprocessing on the human behavior video data of the different visual angles, input the preprocessed data into the human behavior feature encoder to obtain multi-view human behavior feature vectors, and transmit the multi-view human behavior feature vectors to the cloud server; and
the cloud server, configured to receive the multi-view human behavior feature vectors and the human behavior video data of the different visual angles uploaded by the edge computing node, determine manually annotated behavior category labels for the human behavior video data of the different visual angles, and train a preset model according to the manually annotated behavior category labels and the multi-view human behavior feature vectors to obtain a human behavior recognition model; and further configured to receive the multi-view human behavior feature vectors uploaded by the edge computing node, and input the human behavior feature vectors into the human behavior recognition model to obtain human behavior recognition results for the human behavior video data of the different visual angles.
CN202110891098.4A 2021-08-04 2021-08-04 Multi-view pedestrian behavior identification method and system under edge computing architecture Active CN113743221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110891098.4A CN113743221B (en) 2021-08-04 2021-08-04 Multi-view pedestrian behavior identification method and system under edge computing architecture


Publications (2)

Publication Number Publication Date
CN113743221A true CN113743221A (en) 2021-12-03
CN113743221B CN113743221B (en) 2022-05-20

Family

ID=78730161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110891098.4A Active CN113743221B (en) 2021-08-04 2021-08-04 Multi-view pedestrian behavior identification method and system under edge computing architecture

Country Status (1)

Country Link
CN (1) CN113743221B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038420A (en) * 2017-11-21 2018-05-15 华中科技大学 Human body behavior recognition method based on depth video
CN109558781A (en) * 2018-08-02 2019-04-02 北京市商汤科技开发有限公司 Multi-view video recognition method and apparatus, device, and storage medium
CN109785322A (en) * 2019-01-31 2019-05-21 北京市商汤科技开发有限公司 Monocular human body pose estimation network training method, image processing method, and apparatus
CN109886172A (en) * 2019-02-01 2019-06-14 深圳市商汤科技有限公司 Video behavior recognition method and apparatus, electronic device, storage medium, and product
CN110458944A (en) * 2019-08-08 2019-11-15 西安工业大学 Human skeleton reconstruction method based on dual-view Kinect joint point fusion
CN111275583A (en) * 2020-01-20 2020-06-12 上海大学 Service method based on face recognition and database
CN111405241A (en) * 2020-02-21 2020-07-10 中国电子技术标准化研究院 Edge computing method and system for video surveillance
CN112347875A (en) * 2020-10-26 2021-02-09 清华大学 Edge-collaborative target detection method and device based on region division
CN112884822A (en) * 2021-02-09 2021-06-01 北京工业大学 Skeleton extraction method for multi-view human body image sequences based on the RepNet model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HAORAN WANG et al.: "Skeleton edge motion networks for human action recognition", Neurocomputing *
YOU Wei, WANG Xue: "Research on an edge computing method for human behavior skeleton feature recognition", Chinese Journal of Scientific Instrument *
WANG Jiaqiang: "Design and implementation of an elderly care system based on fall detection", China Masters' Theses Full-text Database, Information Science and Technology Series *
PEI Xiaomin et al.: "Human behavior recognition method using a deep learning network with spatio-temporal feature fusion", Infrared and Laser Engineering *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565970A (en) * 2022-01-27 2022-05-31 内蒙古工业大学 High-precision multi-angle behavior recognition method based on deep learning

Also Published As

Publication number Publication date
CN113743221B (en) 2022-05-20

Similar Documents

Publication Publication Date Title
CN110235138B (en) System and method for appearance search
US9251425B2 (en) Object retrieval in video data using complementary detectors
CN108846365B (en) Detection method and device for fighting behavior in video, storage medium and processor
CN111784685A (en) Power transmission line defect image identification method based on cloud edge cooperative detection
CN110991283A (en) Re-recognition and training data acquisition method and device, electronic equipment and storage medium
CN111798456A (en) Instance segmentation model training method and device and instance segmentation method
CN109871821B (en) Pedestrian re-identification method, device, equipment and storage medium of self-adaptive network
CN110781964A (en) Human body target detection method and system based on video image
CN113298789A (en) Insulator defect detection method and system, electronic device and readable storage medium
CN112668410B (en) Sorting behavior detection method, system, electronic device and storage medium
CN112381132A (en) Target object tracking method and system based on fusion of multiple cameras
CN111177469A (en) Face retrieval method and face retrieval device
CN112070071B (en) Method and device for labeling objects in video, computer equipment and storage medium
CN112132130B (en) Real-time license plate detection method and system for whole scene
CN111626090A (en) Moving target detection method based on depth frame difference convolutional neural network
CN113743221B (en) Multi-view pedestrian behavior identification method and system under edge computing architecture
CN113516102A (en) Deep learning parabolic behavior detection method based on video
CN113033523B (en) Method and system for constructing falling judgment model and falling judgment method and system
Jeon et al. Leveraging future trajectory prediction for multi-camera people tracking
CN113936175A (en) Method and system for identifying events in video
CN112288702A (en) Road image detection method based on Internet of vehicles
CN115937492A (en) Transformer equipment infrared image identification method based on feature identification
CN115115713A (en) Unified space-time fusion all-around aerial view perception method
CN113869122A (en) Distribution network engineering reinforced control method
CN113824989A (en) Video processing method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant