CN110210378B - Embedded video image analysis method and device based on edge calculation

Info

Publication number
CN110210378B
CN110210378B
Authority
CN
China
Prior art keywords
convolution
current
target
neural network
convolution kernel
Prior art date
Legal status
Active
Application number
CN201910461504.6A
Other languages
Chinese (zh)
Other versions
CN110210378A (en)
Inventor
张江辉
马敏
田西兰
赵洪立
蔡红军
王曙光
夏勇
夏鹏
王斌
刘丽莎
吴昭
吴颖
李江涛
孙龙
吴涛
姜欢欢
刘海飞
常沛
张玉营
Current Assignee
CETC 38 Research Institute
Original Assignee
CETC 38 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 38 Research Institute filed Critical CETC 38 Research Institute
Priority to CN201910461504.6A
Publication of CN110210378A
Application granted
Publication of CN110210378B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an embedded video image analysis method and device based on edge computing, applied to analyzing camera images in a video monitoring network, where the video monitoring network comprises a plurality of cameras connected to a monitoring center. The method comprises the following steps: recognizing a preset target in the video captured by a camera; for a preset target recognized in the video, acquiring attribute features of the preset target and/or scene attribute features of the preset target, where the attribute features of the preset target include its type, position in the image, quantity and the like, and the scene attribute features of the preset target include one or a combination of the shooting time, shooting place and shooting angle of the original image; and uploading the acquired attribute features and scene attribute features of the preset target to the monitoring center corresponding to the camera, for constructing a video big-data analysis application system. Applying the embodiments of the invention saves cost.

Description

Embedded video image analysis method and device based on edge calculation
Technical Field
The invention relates to an image identification method and device, in particular to an embedded video image analysis method and device based on edge calculation.
Background
Images, and video images in particular, carry a richness of information that other information acquisition means can hardly match, making them the most intuitive and reliable source of information for humans and a long-standing focus of attention and reliance. With the development of social security technology, monitoring networks based on surveillance video images play an important role in fields such as security and traffic. At present, large numbers of video surveillance cameras are deployed on urban roads, on expressways, in shopping malls, at stations and in many other places, forming surveillance video networks with wide coverage. The vast number of video images these networks generate every day accumulates into large-scale surveillance video data resources. However, because an image itself is unstructured information, it cannot be mined directly with big-data technology, so a large volume of information-rich surveillance video images cannot be processed in real time and effectively utilized. In most cases, the analysis and interpretation of surveillance video images still relies mainly on manual work, which is inefficient and cannot meet scenarios with high timeliness requirements for security monitoring, such as real-time monitoring of the overall road traffic situation during urban rush hours, or the rapid tracking and detection of vehicles and persons involved in violent crimes. In such scenarios, surveillance video images over a wide area must be analyzed and processed in real time to form a more complete and clear situation picture, providing accurate and powerful information support for traffic control and case-investigation decisions.
To address the above problems, the conventional approach is to build a "cloud" computing center in the background and analyze the video images with the strong computing power of the "cloud" center. However, a video camera can generate tens of megabits of video data per second, and when the raw video images produced by thousands or tens of thousands of deployed cameras come together, the resulting data volume not only poses a huge challenge to the transmission capability of the video monitoring network but also easily overwhelms the computing capability of the cloud center. This approach therefore requires a major overhaul and upgrade of the existing video data transmission network and a substantial increase in the computing capability of the cloud computing center, which in turn leads to high cost.
Therefore, the prior art suffers from the technical problem that upgrading a traditional surveillance video system is costly.
Disclosure of Invention
The invention provides an embedded video image analysis method and device based on edge computing, the device being an edge computing device, to solve the prior-art technical problem that upgrading a traditional surveillance video system is costly.
The invention solves the technical problems through the following technical scheme:
the embodiment of the invention provides an embedded video image analysis method based on edge calculation, which is applied to a camera in a video monitoring network, wherein the video monitoring network comprises a plurality of cameras in communication connection with a monitoring center, and the method comprises the following steps:
identifying a preset target from a video shot by a camera, wherein the preset target comprises: one or a combination of a person, a vehicle, a building;
the method comprises the steps of acquiring attribute features of a preset target and/or scene attribute features of the preset target aiming at the preset target identified from a video, wherein the attribute features of the preset target comprise: the method comprises the steps that a preset target is a vehicle, and the type, body color, license plate, vehicle position and the like of the vehicle are identified; the preset target is a person, and the gender, age, clothing, position and the like of the person are identified; the preset target is one of buildings, and the type, the position and the like of the preset target are identified; the scene attribute characteristics of the preset target include: one or a combination of shooting time, shooting place and shooting angle of the original image;
and uploading the acquired attribute characteristics of the preset target and the scene attribute characteristics of the preset target to a monitoring center corresponding to the camera.
Optionally, before the preset target is identified from the video captured by the camera, the method further includes:
the method comprises the steps of acquiring an original image in video stream data shot by a camera, and taking the original image as a video shot by the camera.
Optionally, the acquiring an original image in video stream data captured by a camera includes:
acquiring model data of a camera, and searching a video coding format of the camera from a pre-stored model data-video coding format list according to the model data of the camera;
and decoding the video stream data shot by the camera by using a decoding method corresponding to the video coding format, and restoring an original image shot by the camera.
Optionally, the identifying a preset target from a video shot by a camera includes:
the method comprises the following steps that an ARM is used as a main control unit, an FPGA is used as a core acceleration unit to construct a hardware computing framework for identifying a preset target; based on the hardware architecture, a preset target contained in each original image China is identified by utilizing a pre-constructed convolutional neural network model, wherein the preset target comprises: one or a combination of a person, a vehicle, a building.
Optionally, the process of constructing the pre-constructed target convolutional neural network is as follows:
constructing an initial convolutional neural network having an input layer, a convolutional layer, a pooling layer, a fully-connected layer and an output layer, and training it;
acquiring a conversion matrix aiming at pruning operation according to the number of convolution kernels in a target convolution neural network obtained after preset pruning and the number of convolution kernels in the constructed initial convolution neural network;
acquiring the minimum reconstruction error of each convolution kernel in the initial convolution neural network according to the conversion matrix and the weight of each convolution kernel;
and eliminating the convolution kernels of which the corresponding minimum reconstruction errors exceed a preset numerical range to obtain the constructed target convolution neural network.
Optionally, the obtaining a conversion matrix for the pruning operation according to the number of convolution kernels in the target convolution neural network obtained after the preset pruning and the number of convolution kernels in the constructed initial convolution neural network includes:
according to the number of convolution kernels in the target convolutional neural network obtained after the preset pruning and the number of convolution kernels in the constructed initial convolutional neural network, a transformation matrix for the pruning operation is obtained using the formula
Y = (N × c × k_h × k_w)^{-1} · n × c × k_h × k_w
wherein Y is the transformation matrix for the pruning operation; N is the number of convolution kernels in the initial convolutional neural network; c is the number of channels of the corresponding feature map; k_h × k_w is the size of the convolution kernel; and n is the number of convolution kernels in the target convolutional neural network obtained after pruning.
Optionally, the obtaining a minimum reconstruction error of each convolution kernel in the initial convolutional neural network according to the transformation matrix and the weight of each convolution kernel includes:
based on the transformation matrix and the weights of the respective convolution kernels, obtaining the minimized reconstruction error of each convolution kernel in the initial convolutional neural network using the formula
min_{β,W} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2,  subject to ||β||_0 ≤ c′
wherein min denotes minimization; β is the selection vector coefficient corresponding to the channels, of length c; β_i is the indicator for the i-th channel; W is the weight matrix of the convolution kernel; N is the number of convolution kernels in the initial convolutional neural network; ||·||_F is the Frobenius norm; Y is the transformation matrix for the pruning operation; Σ denotes summation; X_i is the slice matrix of the i-th channel; W_i^T is the transpose of the weight matrix slice of the i-th channel; c′ is the number of channels retained after pruning; c is the number of channels of the corresponding feature map; and ||·||_0 is the zero norm.
Optionally, the obtaining a minimum reconstruction error of each convolution kernel in the initial convolutional neural network according to the transformation matrix and the weight of each convolution kernel includes:
for each convolution kernel, based on the transformation matrix and the weights of the respective convolution kernels, obtaining the reconstruction error of each convolution kernel in the initial convolutional neural network using the formula
min_{β} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2 + λ · ||β||_1,  subject to ||β||_0 ≤ c′ and, for any i, ||W_i||_F = 1
wherein β is the selection vector coefficient corresponding to the channels, of length c; β_i is the indicator for the i-th channel; W is the weight matrix of the convolution kernel; N is the number of convolution kernels in the initial convolutional neural network; ||·||_F is the Frobenius norm; Y is the transformation matrix for the pruning operation; Σ denotes summation; X_i is the slice matrix of the i-th channel; W_i^T is the transpose of the weight matrix slice of the i-th channel; λ is the penalty coefficient; ||·||_1 is the L1 norm; i denotes any channel index; c′ is the number of channels retained after pruning; c is the number of channels of the corresponding feature map; and ||·||_0 is the zero norm.
Optionally, the removing the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range, to obtain the constructed target convolutional neural network, includes:
taking the initial convolutional neural network as the current network model, and, for each convolution kernel in the current convolutional layer of the current network model, removing the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range;
for each convolution kernel remaining after the removal, keeping the weight matrix of the convolution kernel unchanged, and obtaining the current value of the selection vector coefficient corresponding to the channels of length c as
argmin_{β} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2 + λ · ||β||_1,  subject to ||β||_0 ≤ c′
wherein argmin returns the value of β that minimizes the expression;
determining whether ||β||_0 converges;
if so, obtaining the weights of the convolution kernels corresponding to the minimized reconstruction error as
argmin_{W} || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2
taking the current value of the selection vector coefficient corresponding to the channels of length c and the weights of the convolution kernels corresponding to the minimized reconstruction error as the target selection vector coefficient and the target convolution kernel weights of the convolution kernel, and updating the current network model according to the target selection vector coefficient and the target convolution kernel weights;
if not, updating the penalty coefficient by a preset step and returning to the step of obtaining the current value of the selection vector coefficient corresponding to the channels of length c, until ||β||_0 converges;
and taking the updated current network model as the current network model, taking the next convolutional layer after the current convolutional layer as the current convolutional layer, returning to the step of removing, for each convolution kernel in the current convolutional layer of the current network model, the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range, until every convolutional layer of the current network model has been pruned, and taking the pruned current network model as the target convolutional neural network model.
Optionally, the removing the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range, to obtain the constructed target convolutional neural network, includes:
taking the initial convolutional neural network as the current network model, and, for each convolution kernel in the current convolutional layer of the current network model, removing the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range;
for each convolution kernel remaining after the removal, obtaining the current value of the selection vector coefficient corresponding to the channels of length c as
argmin_{β} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2 + λ · ||β||_1,  subject to ||β||_0 ≤ c′
wherein argmin returns the value of β that minimizes the expression;
obtaining the current weights of the convolution kernels corresponding to the reconstruction error as
argmin_{W} || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2
judging whether the reconstruction error corresponding to the current value of the selection vector coefficient and the current weights of the convolution kernels converges;
if so, taking the current value of the selection vector coefficient corresponding to the channels of length c and the weights of the convolution kernels corresponding to the minimized reconstruction error as the target selection vector coefficient and the target convolution kernel weights of the convolution kernel, and updating the current network model according to the target selection vector coefficient and the target convolution kernel weights;
if not, updating the penalty coefficient by a preset step and returning to the step of obtaining the current value of the selection vector coefficient corresponding to the channels of length c, until the reconstruction error corresponding to the current value of the selection vector coefficient and the current weights of the convolution kernels converges;
and taking the updated current network model as the current network model, taking the next convolutional layer after the current convolutional layer as the current convolutional layer, returning to the step of removing, for each convolution kernel in the current convolutional layer of the current network model, the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range, until every convolutional layer of the current network model has been pruned, and taking the pruned current network model as the target convolutional neural network model.
Optionally, the step of using the pruned current network model as a target convolutional neural network model includes:
quantizing the model parameters in the pruned current network model by using a linear quantization algorithm, and converting 32-bit floating point numbers into 8-bit integers;
coding the current network model after the model parameters are quantized by using a Huffman coding algorithm;
and taking the coded current network model as a target convolutional neural network model.
Optionally, when using the pre-trained convolutional neural network model for identification, the n × m convolutional kernel operation is split into n × m multiplication operations and n × m-1 addition operations, and,
when n x m is an odd number, taking n x m-1 times of addition operation as current operation, summing every two operations in the current operation to obtain a summed operation result, taking the summed operation result as current operation, and returning to execute the step of summing every two operations in the current operation to obtain the summed operation result until the summation of the n x m-1 times of addition operation is completed to obtain an operation result of an n x m convolution kernel;
when n x m is an even number, taking n x m-2 times of addition operation as current operation, summing every two operations in the current operation to obtain a summed operation result, taking the summed operation result as current operation, and returning to execute the step of summing every two operations in the current operation to obtain the summed operation result until the summation of the n x m-2 times of addition operation is completed; and summing the sum of the n x m-2 times of addition operation and the addition operation which does not participate in the operation to obtain an operation result of the n x m convolution kernel.
The embodiment of the invention provides an embedded video image analysis device based on edge calculation, which is applied to a camera in a video monitoring network, wherein the video monitoring network comprises a plurality of cameras in communication connection with a monitoring center, and the device comprises:
the identification module is used for identifying a preset target from a video shot by a camera, wherein the preset target comprises: one or a combination of a person, a vehicle, a building;
the first obtaining module is configured to, for a preset target identified from a video, obtain an attribute feature of the preset target and/or a scene attribute feature of the preset target, where the attribute feature of the preset target includes: when the preset target is a vehicle, one or a combination of the type, the body color, the license plate and the position of the vehicle; when the preset target is a person, one or a combination of the sex, the age, the clothing and the position of the person; when the preset target is a building, one or a combination of the position and the type of the building; the scene attribute characteristics of the preset target include: one or a combination of shooting time, shooting place and shooting angle of the original image;
and the uploading module is used for uploading the acquired attribute characteristics of the preset target and the scene attribute characteristics of the preset target to a monitoring center corresponding to the camera.
Optionally, the embodiment of the present invention further includes: and the second acquisition module is used for acquiring an original image in the video stream data shot by the camera and taking the original image as a video shot by the camera.
Optionally, the second obtaining module is configured to:
acquiring the model data of a camera, and searching the video coding format of the camera from a pre-stored model data-video coding format list according to the model data of the camera;
and decoding the video stream data shot by the camera by using a decoding method corresponding to the video coding format, and restoring an original image shot by the camera.
Optionally, the identification module is configured to:
the method comprises the following steps that an ARM is used as a main control unit, an FPGA is used as a core acceleration unit to construct a hardware computing architecture for recognizing a preset target; based on the hardware architecture, a preset target contained in each original image China is identified by utilizing a pre-constructed convolutional neural network model, wherein the preset target comprises: one or a combination of a person, a vehicle, a building.
Optionally, the process of constructing the pre-constructed target convolutional neural network is as follows:
constructing an initial convolutional neural network having an input layer, a convolutional layer, a pooling layer, a fully-connected layer and an output layer, and training it;
acquiring a conversion matrix aiming at pruning operation according to the number of convolution kernels in a target convolution neural network obtained after preset pruning and the number of convolution kernels in the constructed initial convolution neural network;
acquiring the minimum reconstruction error of each convolution kernel in the initial convolution neural network according to the conversion matrix and the weight of each convolution kernel;
and eliminating the convolution kernels of which the corresponding minimum reconstruction errors exceed the preset numerical range to obtain the constructed target convolution neural network.
Optionally, the obtaining a conversion matrix for the pruning operation according to the number of convolution kernels in the target convolution neural network obtained after the preset pruning and the number of convolution kernels in the constructed initial convolution neural network includes:
according to the number of convolution kernels in the target convolutional neural network obtained after the preset pruning and the number of convolution kernels in the constructed initial convolutional neural network, a transformation matrix for the pruning operation is obtained using the formula
Y = (N × c × k_h × k_w)^{-1} · n × c × k_h × k_w
wherein Y is the transformation matrix for the pruning operation; N is the number of convolution kernels in the initial convolutional neural network; c is the number of channels of the corresponding feature map; k_h × k_w is the size of the convolution kernel; and n is the number of convolution kernels in the target convolutional neural network obtained after pruning.
Optionally, the obtaining a minimum reconstruction error of each convolution kernel in the initial convolutional neural network according to the transformation matrix and the weight of each convolution kernel includes:
based on the transformation matrix and the weights of the respective convolution kernels, obtaining the minimized reconstruction error of each convolution kernel in the initial convolutional neural network using the formula
min_{β,W} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2,  subject to ||β||_0 ≤ c′
wherein min denotes minimization; β is the selection vector coefficient corresponding to the channels, of length c; β_i is the indicator for the i-th channel; W is the weight matrix of the convolution kernel; N is the number of convolution kernels in the initial convolutional neural network; ||·||_F is the Frobenius norm; Y is the transformation matrix for the pruning operation; Σ denotes summation; X_i is the slice matrix of the i-th channel; W_i^T is the transpose of the weight matrix slice of the i-th channel; c′ is the number of channels retained after pruning; c is the number of channels of the corresponding feature map; and ||·||_0 is the zero norm.
Optionally, the obtaining, according to the transformation matrix and the weight of each convolution kernel, a minimized reconstruction error of each convolution kernel in the initial convolutional neural network includes:
for each convolution kernel, based on the transformation matrix and the weights of the respective convolution kernels, obtaining the reconstruction error of each convolution kernel in the initial convolutional neural network using the formula
min_{β} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2 + λ · ||β||_1,  subject to ||β||_0 ≤ c′ and, for any i, ||W_i||_F = 1
wherein β is the selection vector coefficient corresponding to the channels, of length c; β_i is the indicator for the i-th channel; W is the weight matrix of the convolution kernel; N is the number of convolution kernels in the initial convolutional neural network; ||·||_F is the Frobenius norm; Y is the transformation matrix for the pruning operation; Σ denotes summation; X_i is the slice matrix of the i-th channel; W_i^T is the transpose of the weight matrix slice of the i-th channel; λ is the penalty coefficient; ||·||_1 is the L1 norm; i denotes any channel index; c′ is the number of channels retained after pruning; c is the number of channels of the corresponding feature map; and ||·||_0 is the zero norm.
Optionally, the removing the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range, to obtain the constructed target convolutional neural network, includes:
taking the initial convolutional neural network as the current network model, and, for each convolution kernel in the current convolutional layer of the current network model, removing the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range;
for each convolution kernel remaining after the removal, keeping the weight matrix of the convolution kernel unchanged, and obtaining the current value of the selection vector coefficient corresponding to the channels of length c as
argmin_{β} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2 + λ · ||β||_1,  subject to ||β||_0 ≤ c′
wherein argmin returns the value of β that minimizes the expression;
determining whether ||β||_0 converges;
if so, obtaining the weights of the convolution kernels corresponding to the minimized reconstruction error as
argmin_{W} || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2
taking the current value of the selection vector coefficient corresponding to the channels of length c and the weights of the convolution kernels corresponding to the minimized reconstruction error as the target selection vector coefficient and the target convolution kernel weights of the convolution kernel, and updating the current network model according to the target selection vector coefficient and the target convolution kernel weights;
if not, updating the penalty coefficient by a preset step and returning to the step of obtaining the current value of the selection vector coefficient corresponding to the channels of length c, until ||β||_0 converges;
and taking the updated current network model as the current network model, taking the next convolutional layer after the current convolutional layer as the current convolutional layer, returning to the step of removing, for each convolution kernel in the current convolutional layer of the current network model, the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range, until every convolutional layer of the current network model has been pruned, and taking the pruned current network model as the target convolutional neural network model.
Optionally, the removing the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range, to obtain the constructed target convolutional neural network, includes:
taking the initial convolutional neural network as the current network model, and, for each convolution kernel in the current convolutional layer of the current network model, removing the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range;
for each convolution kernel remaining after the removal, obtaining the current value of the selection vector coefficient corresponding to the channels of length c as
argmin_{β} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2 + λ · ||β||_1,  subject to ||β||_0 ≤ c′
wherein argmin returns the value of β that minimizes the expression;
obtaining the current weights of the convolution kernels corresponding to the reconstruction error as
argmin_{W} || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2
judging whether the reconstruction error corresponding to the current value of the selection vector coefficient and the current weights of the convolution kernels converges;
if so, taking the current value of the selection vector coefficient corresponding to the channels of length c and the weights of the convolution kernels corresponding to the minimized reconstruction error as the target selection vector coefficient and the target convolution kernel weights of the convolution kernel, and updating the current network model according to the target selection vector coefficient and the target convolution kernel weights;
if not, updating the penalty coefficient by a preset step and returning to the step of obtaining the current value of the selection vector coefficient corresponding to the channels of length c, until the reconstruction error corresponding to the current value of the selection vector coefficient and the current weights of the convolution kernels converges;
and taking the updated current network model as the current network model, taking the next convolutional layer after the current convolutional layer as the current convolutional layer, returning to the step of removing, for each convolution kernel in the current convolutional layer of the current network model, the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range, until every convolutional layer of the current network model has been pruned, and taking the pruned current network model as the target convolutional neural network model.
Optionally, the step of using the pruned current network model as a target convolutional neural network model includes:
quantizing the model parameters in the pruned current network model by using a linear quantization algorithm;
coding the current network model after the model parameters are quantized by using a Huffman coding algorithm;
and taking the coded current network model as a target convolutional neural network model.
Optionally, when using the pre-trained convolutional neural network model for identification, the n × m convolutional kernel operation is split into n × m multiplication operations and n × m-1 addition operations, and,
when n x m is an odd number, taking the n x m-1 addition operations as the current operations, summing every two operations in the current operations to obtain summed operation results, taking the summed operation results as the current operations, and returning to the step of summing every two operations in the current operations, until the n x m-1 addition operations have all been summed, to obtain the operation result of the n x m convolution kernel;
when n x m is an even number, taking n x m-2 of the addition operations as the current operations, summing every two operations in the current operations to obtain summed operation results, taking the summed operation results as the current operations, and returning to the step of summing every two operations in the current operations, until the n x m-2 addition operations have all been summed; and summing the sum of the n x m-2 addition operations with the addition operation that did not participate, to obtain the operation result of the n x m convolution kernel.
Compared with the prior art, the invention has the following advantages:
by applying the embodiments of the invention, the target attribute features and scene attribute features are extracted from the captured video stream data at the camera end, which avoids the need for the cloud computing center to simultaneously receive, store, analyze and compute thousands of video channels, reduces the requirements for upgrading the transmission bandwidth and the computing capability of the cloud computing center, and thus saves cost.
Drawings
Fig. 1 is a schematic flowchart of an embedded video image analysis method based on edge computing according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating the principle of an embedded video image analysis method based on edge computing according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating convolutional neural network compression in an embedded video image analysis method based on edge computing according to an embodiment of the present invention;
fig. 4 is a schematic diagram illustrating a pruning flow of the initial convolutional neural network in an embedded video image analysis method based on edge computing according to an embodiment of the present invention;
fig. 5 is another schematic diagram of the initial convolutional neural network pruning flow in an embedded video image analysis method based on edge computing according to an embodiment of the present invention;
fig. 6 is a schematic diagram of the data flow before and after network parameter quantization in an embedded video image analysis method based on edge computing according to an embodiment of the present invention;
fig. 7 is a schematic diagram of the FPGA implementation flow in an embedded video image analysis method based on edge computing according to an embodiment of the present invention;
fig. 8 is a schematic diagram illustrating the convolution acceleration operation in an embedded video image analysis method based on edge computing according to an embodiment of the present invention;
fig. 9 is a schematic diagram illustrating the pooling acceleration operation in an embedded video image analysis method based on edge computing according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an embedded video image analysis device based on edge computing according to an embodiment of the present invention.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
The embodiments of the invention provide an embedded video image analysis method and device based on edge computing. The embedded video image analysis method based on edge computing provided by the embodiments of the invention is introduced first.
It should be noted that the embodiments of the present invention are preferably applied to analyzing the image content of cameras in existing large video monitoring networks for security, transportation and the like, where such a network generally includes a plurality of cameras in communication connection with a monitoring center.
Example 1
Fig. 1 is a schematic flowchart of an embedded video image analysis method based on edge computing according to an embodiment of the present invention, and fig. 2 is a schematic diagram of the principle of the embedded video image analysis method based on edge computing according to an embodiment of the present invention; as shown in figs. 1 and 2, the method includes:
s101: identifying a preset target from a video shot by a camera, wherein the preset target comprises: one or a combination of a person, a vehicle, a building.
Fig. 3 is a schematic diagram illustrating the compression flow of the target convolutional neural network in an embedded video image analysis method based on edge computing according to an embodiment of the present invention; fig. 4 is a schematic diagram illustrating a pruning flow of the initial convolutional neural network in an embedded video image analysis method based on edge computing according to an embodiment of the present invention; fig. 5 is another schematic diagram of the initial convolutional neural network pruning flow in an embedded video image analysis method based on edge computing according to an embodiment of the present invention. As shown in figs. 3 and 4,
specifically, the step may include the following steps:
a: an initial convolutional neural network with an input layer, a convolutional layer, a pooling layer, a fully-connected layer, and an output layer is constructed and trained.
It can be understood that a convolutional neural network with a common structure can be used; a training set composed of monitoring images is then used to train the constructed convolutional neural network, which automatically adjusts its weight parameters according to the training set, while the hyper-parameters of the convolutional neural network are adjusted manually. Training of the convolutional neural network is thereby completed and the initial convolutional neural network is obtained.
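As an illustration of step A, the following Python sketch (assuming PyTorch, with layer sizes, class count and training settings chosen purely as assumptions rather than values taken from the patent) shows one way an initial convolutional neural network with input, convolutional, pooling, fully-connected and output layers could be constructed and trained on labelled monitoring images.

```python
# Illustrative sketch only: a minimal CNN with input, convolution, pooling,
# fully-connected and output layers, trained on labelled surveillance images.
# Layer sizes, class count and optimiser settings are assumptions.
import torch
import torch.nn as nn

class InitialConvNet(nn.Module):
    def __init__(self, num_classes=3):          # e.g. person / vehicle / building
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # pooling layer
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(         # fully-connected + output layer
            nn.Flatten(),
            nn.Linear(64 * 56 * 56, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):                        # x: (batch, 3, 224, 224)
        return self.classifier(self.features(x))

def train_initial_network(model, loader, epochs=10, lr=1e-3):
    """Adjust the weight parameters from the training set; hyper-parameters
    (epochs, lr) are tuned manually, as described in the text."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimiser.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimiser.step()
    return model
```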
B: and acquiring a conversion matrix aiming at pruning operation according to the number of convolution kernels in the target convolution neural network obtained after the preset pruning and the number of convolution kernels in the constructed initial convolution neural network.
Specifically, according to the number n of convolution kernels in the target convolutional neural network obtained after the preset pruning and the number N of convolution kernels in the constructed initial convolutional neural network, the transformation matrix for the pruning operation is obtained using the formula
Y = (N × c × k_h × k_w)^{-1} · n × c × k_h × k_w
wherein Y is the transformation matrix for the pruning operation; N is the number of convolution kernels in the initial convolutional neural network; c is the number of channels of the corresponding feature map; k_h × k_w is the size of the convolution kernel; and n is the number of convolution kernels in the target convolutional neural network obtained after pruning.
C: and acquiring the minimum reconstruction error of each convolution kernel in the initial convolution neural network according to the conversion matrix and the weight of each convolution kernel.
Specifically, the C step may be a C1 step or a C2 step.
C1: based on the transformation matrix Y and the weights of the respective convolution kernels, the minimized reconstruction error of each convolution kernel in the initial convolutional neural network is obtained using the formula
min_{β,W} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2,  subject to ||β||_0 ≤ c′
wherein min denotes minimization; β is the selection vector coefficient corresponding to the channels, of length c; β_i is the indicator for the i-th channel; W is the weight matrix of the convolution kernel; N is the number of convolution kernels in the initial convolutional neural network; ||·||_F is the Frobenius norm; Y is the transformation matrix for the pruning operation; Σ denotes summation; X_i is the slice matrix of the i-th channel; W_i^T is the transpose of the weight matrix slice of the i-th channel; c′ is the number of channels retained after pruning; c is the number of channels of the corresponding feature map; and ||·||_0 is the zero norm.
C2: for each convolution kernel, based on the transformation matrix Y and the weights of the respective convolution kernels, the reconstruction error of each convolution kernel in the initial convolutional neural network is obtained using the formula
min_{β} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2 + λ · ||β||_1,  subject to ||β||_0 ≤ c′ and, for any i, ||W_i||_F = 1
wherein β is the selection vector coefficient corresponding to the channels, of length c; β_i is the indicator for the i-th channel; W is the weight matrix of the convolution kernel; N is the number of convolution kernels in the initial convolutional neural network; ||·||_F is the Frobenius norm; Y is the transformation matrix for the pruning operation; Σ denotes summation; X_i is the slice matrix of the i-th channel; W_i^T is the transpose of the weight matrix slice of the i-th channel; λ is the penalty coefficient; ||·||_1 is the L1 norm; i denotes any channel index; c′ is the number of channels retained after pruning; c is the number of channels of the corresponding feature map; and ||·||_0 is the zero norm.
In practical applications, the method of obtaining the minimized reconstruction error is also called the LASSO regression method.
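As a hedged illustration of this LASSO-style step, the following Python sketch solves the relaxed channel-selection problem with scikit-learn's Lasso solver. The variable names, array shapes and the use of scikit-learn are assumptions made for illustration and are not prescribed by the patent.

```python
# Illustrative sketch of the LASSO-style channel selection described above.
# X_slices[i] is the slice matrix X_i of the i-th channel (N samples x k_h*k_w),
# W[i] is the matching weight slice (n kernels x k_h*k_w), and Y is the
# response matrix (N x n) that the pruned layer should reproduce.
import numpy as np
from sklearn.linear_model import Lasso

def solve_beta(X_slices, W, Y, lam):
    """Solve the relaxed problem: min_beta ||Y - sum_i beta_i X_i W_i^T||^2 + lam*||beta||_1."""
    c = len(X_slices)
    # Column i holds (X_i @ W_i^T) flattened, so Y ~ Z @ beta is exactly the
    # channel-wise sum inside the reconstruction-error formula above.
    Z = np.stack([(X_slices[i] @ W[i].T).ravel() for i in range(c)], axis=1)
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    lasso.fit(Z, Y.ravel())
    return lasso.coef_          # beta: one coefficient per channel
```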
D: and eliminating the convolution kernels of which the corresponding minimum reconstruction errors exceed a preset numerical range to obtain the constructed target convolution neural network.
Specifically, the D step may be a D1 step or a D2 step.
D1: taking the initial convolutional neural network as the current network model, and, for each convolution kernel in the current convolutional layer of the current network model, removing the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range;
for each convolution kernel remaining after the removal, keeping the weight matrix of the convolution kernel unchanged, and obtaining the current value of the selection vector coefficient corresponding to the channels of length c as
argmin_{β} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2 + λ · ||β||_1,  subject to ||β||_0 ≤ c′
wherein argmin returns the value of β that minimizes the expression;
determining whether ||β||_0 converges;
if so, obtaining the weights of the convolution kernels corresponding to the minimized reconstruction error as
argmin_{W} || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2
taking the current value of the selection vector coefficient corresponding to the channels of length c and the weights of the convolution kernels corresponding to the minimized reconstruction error as the target selection vector coefficient and the target convolution kernel weights of the convolution kernel, and updating the current network model according to the target selection vector coefficient and the target convolution kernel weights;
if not, updating the penalty coefficient by a preset step and returning to the step of obtaining the current value of the selection vector coefficient corresponding to the channels of length c, until ||β||_0 converges;
D2: taking the initial convolutional neural network as the current network model, and, for each convolution kernel in the current convolutional layer of the current network model, removing the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range;
for each convolution kernel remaining after the removal, obtaining the current value of the selection vector coefficient corresponding to the channels of length c as
argmin_{β} (1/(2N)) · || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2 + λ · ||β||_1,  subject to ||β||_0 ≤ c′
wherein argmin returns the value of β that minimizes the expression;
obtaining the current weights of the convolution kernels corresponding to the reconstruction error as
argmin_{W} || Y − Σ_{i=1}^{c} β_i · X_i · W_i^T ||_F^2
judging whether the reconstruction error corresponding to the current value of the selection vector coefficient and the current weights of the convolution kernels converges;
if so, taking the current value of the selection vector coefficient corresponding to the channels of length c and the weights of the convolution kernels corresponding to the minimized reconstruction error as the target selection vector coefficient and the target convolution kernel weights of the convolution kernel, and updating the current network model according to the target selection vector coefficient and the target convolution kernel weights;
if not, updating the penalty coefficient by a preset step and returning to the step of obtaining the current value of the selection vector coefficient corresponding to the channels of length c, until the reconstruction error corresponding to the current value of the selection vector coefficient and the current weights of the convolution kernels converges;
and taking the updated network model as the current network model, taking the next convolutional layer after the current convolutional layer as the current convolutional layer, and returning to the step of removing, for each convolution kernel in the current convolutional layer of the current network model, the convolution kernels whose corresponding minimized reconstruction error exceeds the preset value range, until every convolutional layer of the current network model has been pruned, thereby simplifying the structure of the network model and effectively reducing the amount of computation.
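The alternating procedure of steps D1 and D2 could, under the same assumptions as the sketch above, look roughly like the following Python sketch: β is re-solved with the penalty raised by a preset step until ||β||_0 meets the channel budget, after which the retained weights are refitted by least squares. solve_beta() is the hypothetical helper sketched earlier; the step size and convergence test are illustrative assumptions.

```python
# Illustrative sketch of the alternating pruning loop for one convolutional
# layer: W fixed -> solve beta (LASSO); penalty raised until ||beta||_0 fits
# the budget c'; then beta fixed -> refit the surviving weights.
import numpy as np

def prune_conv_layer(X_slices, W, Y, c_keep, lam=1e-4, step=2.0, max_iter=50):
    beta = np.zeros(len(X_slices))
    for _ in range(max_iter):
        beta = solve_beta(X_slices, W, Y, lam)          # W fixed, solve beta
        if np.count_nonzero(beta) <= c_keep:            # ||beta||_0 has converged
            break
        lam *= step                                     # update penalty by a preset step
    kept = np.flatnonzero(beta)
    # beta fixed: refit the retained weights by least squares,
    # min_W ||Y - X' W'^T||_F^2 with X' built from the kept, beta-scaled slices.
    X_prime = np.concatenate([beta[i] * X_slices[i] for i in kept], axis=1)
    W_prime, *_ = np.linalg.lstsq(X_prime, Y, rcond=None)
    return kept, W_prime.T      # retained channel indices and refitted weights
```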
Fig. 6 is a schematic diagram of the data flow before and after quantization in an embedded video image analysis method based on edge computing according to an embodiment of the present invention. As shown in fig. 6, a linear quantization algorithm is used to quantize the weight parameters of the pruned current network model; the quantized model parameters are then coded with a Huffman coding algorithm, and the coded current network model is taken as the final target convolutional neural network model. The storage size of the network model after quantization and coding is smaller, which reduces the storage requirement.
In practical applications, the parameters of a trained model are stored as 32-bit floating-point numbers, and a model trained on a large CNN occupies hundreds of megabits of storage space; the model can therefore be compressed further by changing the parameter storage mode, namely by a parameter quantization compression algorithm. In practical use of the algorithm, the quantization parameters need to be set according to the characteristics of the network structure. The quantization method can be as follows:
counting the maximum and minimum values of the parameters, dividing all parameters by the difference between the maximum and the minimum, multiplying the resulting quotients by 256 so as to map the values onto the interval 0-255, and obtaining 8-bit parameters after quantization, thereby converting 32-bit floating-point numbers into 8-bit integers. In practical applications, the quantization algorithm may also be a non-linear quantization algorithm, etc.
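A minimal Python sketch of such a linear quantization step is given below; it follows a common variant that also subtracts the minimum value before scaling so that negative weights land in the 0-255 range, which is an assumption rather than the exact mapping recited above.

```python
# Illustrative sketch: 32-bit floating-point parameters are mapped onto the
# 0-255 interval and stored as 8-bit integers, together with the scale and
# offset needed to recover approximate values at inference time.
import numpy as np

def quantize_linear(weights: np.ndarray):
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    q = np.round((weights - w_min) / scale).astype(np.uint8)   # 8-bit parameters
    return q, scale, w_min

def dequantize_linear(q: np.ndarray, scale: float, w_min: float):
    return q.astype(np.float32) * scale + w_min
```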
Embodiments of the present invention exploit the non-uniform distribution of the effective weights by quantizing the weights and applying variable-length coding, i.e., Huffman coding, so that the weights are represented with variable-length codes without loss of training accuracy.
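For illustration only, the following Python sketch builds a Huffman code table over the quantized 8-bit weights (for example the array q from the previous sketch); the table-building routine is a generic textbook construction, not code taken from the patent.

```python
# Illustrative sketch of Huffman-coding the quantised 8-bit weights: frequent
# values receive short codes, exploiting the non-uniform weight distribution.
import heapq
from collections import Counter

def huffman_code_table(symbols):
    """Build {symbol: bitstring} from an iterable of quantised weight values."""
    heap = [[freq, idx, [sym, ""]] for idx, (sym, freq)
            in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    idx = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]      # left branch
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]      # right branch
        heapq.heappush(heap, [lo[0] + hi[0], idx] + lo[2:] + hi[2:])
        idx += 1
    return {sym: code for sym, code in heap[0][2:]}

# Usage sketch: table = huffman_code_table(q.ravel())
#               encoded = "".join(table[s] for s in q.ravel())
```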
E: deploying the target convolutional neural network on an embedded hardware platform; the embedded computing platform then identifies the preset target according to the content contained in each image of the video images captured by the camera.
In practical applications, the preset target to be identified in this step may be a target in a preset target list set manually, and the preset target list may be updated manually by an operator, or may be automatically identified by a system, and then automatically added.
In general, important targets are, for example, a person whose range of activity exceeds a set range, a person who enters a restricted area, a vehicle, a person wearing special clothing, or the like.
As shown in fig. 4, after a certain convolution kernel is culled, redundant neurons corresponding to the convolution kernel should be removed to simplify the network structure.
By applying the embodiments of the invention, the method compresses the weights of the convolutional neural network, thereby reducing and simplifying the structure of the network, lowering the storage required for the weight parameters, and enabling the target convolutional neural network to achieve the same operating speed and an equivalent target detection and recognition effect in an embedded environment with fewer computing resources and less storage.
S102: for a preset target identified from the video, obtaining attribute features of the preset target and/or scene attribute features of the preset target to form a description of the scene image content, wherein the attribute features of the preset target include: when the preset target is a vehicle, the recognized type, body color, license plate, position in the image of the vehicle, and the like; when the preset target is a person, the recognized gender, age, clothing, position in the image of the person, and the like; when the preset target is a building, the recognized type, position in the image of the building, and the like; and the scene attribute features of the preset target include: one or a combination of the shooting time, shooting place and shooting angle of the original image.
In practical applications, the recognition result can be used to describe the extracted target attribute features and/or scene attribute features of the preset target in a structured way; a complete information frame combining the image with its corresponding description information is generated, and a formatted information code stream conforming to the TCP/IP protocol is formed and uploaded to the network.
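One possible, purely illustrative way to package such a structured description into a formatted code stream over TCP/IP is sketched below; the field names, JSON encoding, port and camera identifier are assumptions, since the patent only requires a formatted information code stream conforming to the TCP/IP protocol.

```python
# Illustrative sketch: pack a recognition result and its scene attributes
# into a length-prefixed TCP frame and send it to the monitoring centre.
import json
import socket
import struct

def upload_result(host: str, port: int, record: dict) -> None:
    payload = json.dumps(record).encode("utf-8")
    frame = struct.pack("!I", len(payload)) + payload      # 4-byte length prefix
    with socket.create_connection((host, port)) as sock:
        sock.sendall(frame)

record = {
    "camera_id": "cam-042",                 # assumed identifier
    "capture_time": "2019-05-29T08:15:00",  # scene attribute: shooting time
    "location": "gate-3",                   # scene attribute: shooting place
    "targets": [
        {"type": "vehicle", "color": "white", "plate": "A12345", "bbox": [120, 80, 360, 240]},
    ],
}
# upload_result("monitoring.center.example", 9000, record)
```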
S103: and uploading the acquired attribute characteristics of the preset target and the scene attribute characteristics of the preset target to a monitoring center corresponding to the camera.
The obtained data is transmitted to a monitoring center through a video transmission network, so that the monitoring center can process the data received based on a big data technology, complete the tasks of storage, analysis, retrieval, statistics and the like of video images and contents thereof, and meet the requirement of a city or an area on rapid analysis processing and application of massive video contents.
By applying the embodiment shown in fig. 1 of the invention, the target attribute features and scene attribute features are extracted from the captured video stream data at the camera end, which avoids the need for the cloud computing center to simultaneously receive, store, analyze and compute thousands of video channels, reduces the requirements for upgrading the transmission bandwidth and the computing capability of the cloud computing center, and thus saves cost.
In addition, the device to which the method of the embodiment shown in fig. 1 of the present invention is applied can be in data connection with a plurality of cameras, and one device analyzes and uploads images shot by the plurality of cameras, so that the number of deployed devices is reduced, and the cost is saved.
Example 2
On the basis of embodiment 1 of the present invention, before the step S101, the method further includes:
s104: an original image in video stream data captured by a camera is acquired.
Specifically, model data of a camera can be acquired, and a video coding format of the camera is searched from a pre-stored model data-video coding format list according to the model data of the camera; and decoding the video stream data shot by the camera by using a decoding method corresponding to the video coding format, and restoring the original image shot by the camera.
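For illustration, the following Python sketch shows the shape of such a lookup-and-decode step; the camera model names, the contents of the model data to video coding format list, and the use of OpenCV (backed by FFmpeg) as the decoder are assumptions made for the sketch.

```python
# Illustrative sketch of the decoding step: look up the video coding format
# for the camera model in a pre-stored table, then decode the stream to
# recover the original frames.
import cv2

MODEL_TO_CODEC = {            # pre-stored "model data -> video coding format" list (example entries)
    "HIK-DS-2CD3T25": "h264",
    "DAHUA-IPC-HFW4": "h265",
}

def decode_frames(camera_model: str, stream_url: str):
    codec = MODEL_TO_CODEC.get(camera_model)
    if codec is None:
        raise ValueError(f"unknown camera model: {camera_model}")
    # OpenCV delegates to FFmpeg, which applies the H.264/H.265 decoder
    # matching the stream; the looked-up codec could equally be used to pick
    # a hardware decode path on the embedded platform.
    capture = cv2.VideoCapture(stream_url)
    while True:
        ok, frame = capture.read()          # 'frame' is the restored original image
        if not ok:
            break
        yield frame
    capture.release()
```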
The inventors also found that, in the security field, camera manufacturers are numerous, construction time spans are large, and monitoring video terminal devices differ in specification, image compression format and coding format. Therefore, in the embodiment of the invention, different decoding strategies are adopted according to the coding formats of monitoring cameras of different specifications: the original images, image parameters and other information are decoded and recovered from the code stream output by the camera, and the preset target is then identified, so that compatibility with existing cameras of different models is achieved.
Example 3
On the basis of embodiment 1 of the present invention, further, when embodiment 1 is executed, an ARM (Advanced RISC Machine) processor may be used as the main control unit and an FPGA (Field Programmable Gate Array) as the acceleration unit to construct the hardware core platform architecture for identifying the preset target. Based on this hardware architecture, the preset target contained in each original image is identified by using the pre-constructed convolutional neural network model, wherein the preset target includes: one or a combination of a person, a vehicle and a building.
Specifically, when identification is performed with the pre-trained convolutional neural network model, the n × m convolution kernel operation may be split into n × m multiplication operations and n × m-1 addition operations, and,
when n x m is an odd number, taking n x m-1 times of addition operation as current operation, summing every two operations in the current operation to obtain a summed operation result, taking the summed operation result as current operation, and returning to execute the step of summing every two operations in the current operation to obtain the summed operation result until the summation of the n x m-1 times of addition operation is completed to obtain an operation result of an n x m convolution kernel;
when n x m is an even number, taking n x m-2 times of addition operation as current operation, summing every two operations in the current operation to obtain a summed operation result, taking the summed operation result as current operation, and returning to execute the step of summing every two operations in the current operation to obtain the summed operation result until the summation of the n x m-2 times of addition operation is completed; and summing the sum of the n x m-2 times of addition operation and the addition operation which does not participate in the operation to obtain an operation result of the n x m convolution kernel.
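The following sketch mirrors this split in software: an n × m kernel window is reduced with n × m multiplications followed by a pairwise (binary-tree) summation, which is the structure that maps onto parallel adders in the FPGA and covers both the odd and the even case distinguished above (an odd leftover term is simply carried to the next level). This is an illustrative model, not the Verilog implementation.

```python
def conv_kernel_tree(window, kernel):
    """window, kernel: flat lists of n*m numbers.
    Returns the n x m convolution-kernel result using n*m multiplications
    and a pairwise addition tree (n*m - 1 additions in total)."""
    products = [w * k for w, k in zip(window, kernel)]  # n*m multiplications
    current = products
    while len(current) > 1:
        nxt = [current[i] + current[i + 1] for i in range(0, len(current) - 1, 2)]
        if len(current) % 2 == 1:          # odd count: carry the leftover term
            nxt.append(current[-1])
        current = nxt                      # each pass sums pairs in parallel
    return current[0]

# Example: a 3 x 3 kernel -> 9 multiplications and 8 additions.
window = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, -1, 1, 0, -1, 1, 0, -1]
print(conv_kernel_tree(window, kernel))    # -6
```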
Fig. 7 is a schematic diagram illustrating an FPGA implementation flow in an embedded video image parsing method based on edge calculation according to an embodiment of the present invention; fig. 8 is a schematic diagram illustrating a convolution acceleration operation in an embedded video image parsing method based on edge calculation according to an embodiment of the present invention; fig. 9 is a schematic diagram illustrating a pooling acceleration operation in an embedded video image parsing method based on edge calculation according to an embodiment of the present invention. As shown in fig. 7 to fig. 9, the processing method according to the embodiment of the present invention may be deployed on an FPGA so as to perform the above-described operations of the embodiment of the present invention.
An ARM + FPGA heterogeneous processing architecture is adopted: the ARM serves as the control unit and mainly completes algorithm scheduling and task management, while the FPGA serves as the core acceleration unit and accelerates the main operations of the neural network such as convolution and pooling, improving algorithm operation efficiency and meeting real-time processing requirements.
Currently, in the first aspect, deep learning algorithms are generally implemented in the Python language with deep learning frameworks such as Caffe, TensorFlow and PyTorch. Although such frameworks greatly reduce the difficulty of algorithm development, installing a deep learning framework on an embedded platform occupies most of its resources, so the algorithm cannot meet real-time processing requirements. Therefore, the embodiment of the invention implements the convolutional network in the C/C++ language, avoiding the use of a deep learning framework, saving embedded platform resources and effectively improving the processing speed of the algorithm.
In the second aspect, no mature acceleration chip is currently available for embedded applications. The invention adopts an ARM + FPGA hardware processing architecture that mimics the CPU (Central Processing Unit) + GPU (Graphics Processing Unit) architecture: the parallel processing capability of the FPGA is used to accelerate convolutional neural network computation, implemented in the Verilog HDL (Hardware Description Language), achieving a GPU-like acceleration effect and meeting real-time processing requirements.
Because the convolutional neural network contains a large number of convolution and pooling operations, which consume considerable DSP (digital signal processing) resources and RAM (random access memory) storage resources, the ZYNQ7100 FPGA, which has relatively abundant computing and storage resources, is selected as the core of system processing in the embodiment of the invention. As shown in fig. 4, the chip provides 2020 DSP slices, and each multiplication or addition consumes 2 DSP slices, so more than 1000 multiplication or addition operations can be performed in parallel in one clock cycle; the internal storage capacity is 26.5 Mb, which meets the data caching requirements of optical images, convolution templates, feature maps and the like; and the internal logic unit provides 444K logic cells, offering ample logic resources for complex logic operations and control. Table 1 lists the FPGA models suitable for use in embodiments of the present invention.
TABLE 1
[Table 1, listing suitable FPGA models, is rendered as an image in the original publication.]
In practical application, each multiplication operation in the deep learning network model occupies 2 DSP slices, and each addition also occupies 2 DSP slices. Taking the representative target detection and identification algorithm SSD (Single Shot MultiBox Detector) as an example, the input of the 1st convolution layer in the network is 300 (image length) × 300 (image width) × 3 (image channels), and there are 64 convolution kernels of size 3 × 3. The number of multiplications required is 300 × 300 × 3 × 3 × 3 × 64 = 155,520,000.
If all of this were executed in a single pass in the FPGA, more than 300 million DSP slices would be needed, which clearly cannot be satisfied in practice. Therefore, a convolution acceleration architecture is designed in the FPGA. Taking the 3 × 3 convolution as an example, as shown in fig. 8, one kernel operation consists of 9 multiplications and 8 additions, so a single operation occupies 9 × 2 + 8 × 2 = 34 DSP slices. Since the ZYNQ7100 FPGA provides 2020 DSP slices in total, about 59 such operations can be computed in parallel in one clock. Taking the first convolution layer as an example, the input is 300 × 300 × 3 and there are 64 convolution kernels of size 3 × 3, so the 3 × 3 convolution architecture in the FPGA must be invoked 300 × 300 × 3 × 64 = 17,280,000 times. With 59 parallel computations per clock, this operation requires 17,280,000 / 59 ≈ 292,881 clocks. The clock frequency of the ZYNQ7100 is 250 MHz, so the theoretical processing time is about 1.17 ms. This figure does not account for latency or data-read time and is the upper bound achievable by theoretical calculation.
The deep learning network is mainly composed of convolutional layers and pooling layers, and the pooling operation can also be accelerated in the FPGA; the acceleration architecture of the maximum pooling operation in the FPGA is shown in fig. 9. The algorithm adopts 2 × 2 maximum pooling, and each pooling step occupies 3 DSP slices. Since the ZYNQ7100 provides 2020 DSP slices in total, about 670 pooling structures can be executed in parallel. Taking pooling layer 1 as an example, its input is 300 × 300 × 64, requiring 150 × 150 × 64 = 1,440,000 pooling operations; executed in parallel, this takes about 1,440,000 / 670 ≈ 2150 clocks, i.e. roughly 0.0085 ms. Most of the computation time therefore lies in the convolution operations.
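The resource and timing estimates above can be reproduced with the following back-of-the-envelope calculation; the figures (2 DSP slices per multiply or add, 3 per pooling step, 2020 slices in total, 250 MHz clock) are taken directly from the analysis above, and the script is purely illustrative.

```python
DSP_TOTAL = 2020            # DSP slices available in the ZYNQ7100
CLOCK_HZ = 250e6            # FPGA clock frequency

# 3x3 convolution, SSD layer 1: input 300x300x3, 64 kernels of size 3x3.
dsp_per_conv = 9 * 2 + 8 * 2                  # 9 multiplies + 8 adds, 2 DSP slices each = 34
conv_parallel = DSP_TOTAL // dsp_per_conv     # ~59 kernel windows per clock
conv_ops = 300 * 300 * 3 * 64                 # 17,280,000 kernel applications
conv_ms = conv_ops / conv_parallel / CLOCK_HZ * 1e3
print(f"conv: {conv_ops} ops, {conv_parallel} in parallel -> {conv_ms:.2f} ms")   # ~1.17 ms

# 2x2 max pooling, pooling layer 1: input 300x300x64.
pool_parallel = 670                           # ~2020 / 3 DSP slices per pooling step
pool_ops = 150 * 150 * 64                     # 1,440,000 pooling steps
pool_ms = pool_ops / pool_parallel / CLOCK_HZ * 1e3
print(f"pool: {pool_ops} ops, {pool_parallel} in parallel -> {pool_ms:.4f} ms")   # ~0.0086 ms
```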
Based on this comprehensive analysis, the SSD target detection and identification algorithm takes about 230 milliseconds per frame on this platform, i.e. a processing speed above 4 frames/second, which meets the requirement of near-real-time analysis and processing. As technology advances, the neural network algorithm can be conveniently ported to a more advanced FPGA hardware platform or an AI chip to further increase the processing speed of the algorithm.
By applying embodiment 3 of the invention, the operation of the convolutional neural network can be accelerated in the FPGA.
In practical application, when AI (Artificial Intelligence) technology is used to analyze the images generated by a video monitoring system, an AI smart camera can replace the traditional camera, so that the camera not only captures images but also understands their content, realizing the conversion from image to information and greatly simplifying the load of back-end big data processing. The smart camera is therefore a future development trend. However, AI smart cameras are expensive: even a low-end model costs upwards of ten thousand yuan, and a mid-range model costs over one hundred thousand or even several hundred thousand yuan. For the large number of widely deployed traditional monitoring video terminals already in place, replacing them all with AI smart cameras would be enormously costly and would cause huge resource waste and duplicated construction, so such replacement is not worthwhile.
Therefore, in embodiment 3 of the present invention, by using the embedded edge computing platform, content analysis can be performed simultaneously on the video images of multiple existing cameras so as to identify the preset target, and the attribute information of the preset target is then sent to the monitoring center through the video monitoring network. Distributed processing thus relieves the pressure of analyzing a large number of video images at the same time and saves a great deal of cost.
Corresponding to the embodiment of the invention shown in fig. 1, the embodiment of the invention also provides an embedded video image analysis device based on edge calculation.
Fig. 10 is a schematic structural diagram of an embedded video image parsing apparatus based on edge computation according to an embodiment of the present invention. As shown in fig. 10, the apparatus is applied to parsing the image content of cameras in a video monitoring network and can simultaneously process the images of a plurality of cameras communicatively connected to a monitoring center, and the apparatus includes:
an identifying module 1001, configured to identify a preset target from a video captured by a camera, where the preset target includes: one or a combination of a person, a vehicle, a building;
a first obtaining module 1002, configured to obtain, for a preset target identified from the video, attribute features of the preset target and/or scene attribute features of the preset target, where the attribute features of the preset target include: when the preset target is a vehicle, one or a combination of the type, body color, license plate and position of the vehicle; when the preset target is a person, one or a combination of the sex, age, clothing and position of the person; when the preset target is a building, one or a combination of the type and position of the building; and the scene attribute features of the preset target include: one or a combination of the shooting time, shooting place and shooting angle of the original image;
an uploading module 1003, configured to upload the acquired attribute features of the preset target and the scene attribute features of the preset target to a monitoring center corresponding to the camera.
By applying the embodiment shown in fig. 10 of the invention, the target attribute characteristics and the scene attribute characteristics in the shot video stream data are extracted and transmitted at the position close to the monitoring camera, the original system architecture is not changed, the problem that a background simultaneously analyzes a large number of video images is solved, and the cost of upgrading and transforming the monitoring video network is saved.
In a specific implementation manner of the embodiment of the present invention, on the basis of the embodiment shown in fig. 10 of the present invention, the apparatus further includes:
and the second acquisition module is used for acquiring an original image in the video stream data shot by the camera and taking the original image as a video shot by the camera.
In a specific implementation manner of the embodiment of the present invention, the second obtaining module is configured to:
acquiring model data of a camera, and searching a video coding format of the camera from a pre-stored model data-video coding format list according to the model data of the camera;
and decoding the video stream data shot by the camera by using a decoding method corresponding to the video coding format, and restoring an original image shot by the camera.
In a specific implementation manner of the embodiment of the present invention, the identification module 1001 is configured to:
an ARM is used as the main control unit and an FPGA is used as the core acceleration unit to construct a hardware computing architecture for identifying the preset target; based on the hardware architecture, the preset target contained in each original image is identified by using the pre-constructed convolutional neural network model, wherein the preset target includes: one or a combination of a person, a vehicle and a building.
In a specific implementation manner of the embodiment of the present invention, the identification module 1001 includes a construction unit, configured to:
constructing an initial convolutional neural network with an input layer, convolutional layers, pooling layers, fully-connected layers and an output layer, and training it;
acquiring a conversion matrix aiming at pruning operation according to the number of convolution kernels in a target convolution neural network obtained after preset pruning and the number of convolution kernels in the constructed initial convolution neural network;
acquiring the minimum reconstruction error of each convolution kernel in the initial convolution neural network according to the conversion matrix and the weight of each convolution kernel;
and eliminating the convolution kernels of which the corresponding minimum reconstruction errors exceed a preset numerical range to obtain the constructed target convolution neural network.
In a specific implementation manner of the embodiment of the present invention, the building unit is configured to:
according to the number of convolution kernels in the target convolutional neural network obtained after the preset pruning and the number of convolution kernels in the constructed initial convolutional neural network, a transformation matrix for the pruning operation is obtained by using the formula Y = (N × c × k_h × k_w)^(-1) · n × c × k_h × k_w, wherein,
Y is the transformation matrix for the pruning operation; N is the number of convolution kernels in the initial convolutional neural network; c is the number of channels of the corresponding feature map; k_h × k_w is the size of the convolution kernel; and n is the number of convolution kernels in the target convolutional neural network obtained after pruning.
In a specific implementation manner of the embodiment of the present invention, the building unit is configured to:
based on the transformation matrix and the weights of the respective convolution kernels, the minimized reconstruction error of each convolution kernel in the initial convolutional neural network is obtained by using the formula

min_{β,W} (1/(2N)) ‖Y − Σ_{i=1}^{c} β_i · X_i · W_i^T‖_F^2, subject to ‖β‖_0 ≤ c′,

wherein min is the minimum value evaluation function; β is the selection vector coefficient corresponding to the channels, of length c; β_i is the selection marker of the i-th channel; W is the weight matrix of the convolution kernel; N is the number of convolution kernels in the initial convolutional neural network; ‖·‖_F is the Frobenius norm; Y is the transformation matrix for the pruning operation; Σ is the summation function; X_i is the slice matrix of the i-th channel; W^T is the transpose of the convolution kernel weight matrix; c′ is the number of channels reserved after pruning; c is the number of channels of the corresponding feature map; and ‖·‖_0 is the zero norm.
In a specific implementation manner of the embodiment of the present invention, the building unit is configured to:
for each convolution kernel, a transformation matrix is generated based on the transformation matrix and the weights of the respective convolution kernels, using a formula,
Figure BDA0002078200930000211
and acquiring the reconstruction error of each convolution kernel in the initial convolution neural network, wherein,
beta is a selection vector coefficient corresponding to the channel with the length of c; beta is a i Marking the batch of the ith channel; w is a weight matrix of the convolution kernel; n is the number of convolution kernels in the initial convolution neural network; | | non-woven hair F Is a norm function; y is a conversion matrix for pruning operation; sigma is a summation function; x i A slice matrix for the ith channel; w T A transpose matrix that is a weight matrix of the convolution kernel; λ is a penalty coefficient; | | non-woven hair 1 Is a norm function;
Figure BDA0002078200930000213
is any one of i; c' is the number of channels reserved after pruning; c is the number of channels corresponding to the characteristic diagram; | | non-woven hair 0 Is a zero norm function.
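A small NumPy sketch of evaluating this penalized reconstruction error for a given channel-selection vector β is shown below; the tensor shapes and the reading of Y as a response matrix follow the usual channel-pruning formulation and are assumptions for illustration, not the patented implementation.

```python
import numpy as np

def reconstruction_error(Y, X, W, beta, lam):
    """Y: (N_samples, n) response matrix; X: list of c slice matrices, each (N_samples, k);
    W: (c, n, k) per-channel kernel weights; beta: (c,) selection coefficients;
    lam: penalty coefficient. Returns the penalized reconstruction error."""
    c = len(X)
    approx = sum(beta[i] * X[i] @ W[i].T for i in range(c))   # sum_i beta_i X_i W_i^T
    return np.linalg.norm(Y - approx, "fro") ** 2 + lam * np.abs(beta).sum()

# Toy example with c = 4 channels.
rng = np.random.default_rng(0)
N_s, n, k, c = 32, 8, 9, 4
X = [rng.normal(size=(N_s, k)) for _ in range(c)]
W = rng.normal(size=(c, n, k))
Y = sum(X[i] @ W[i].T for i in range(c))       # unpruned response
beta = np.array([1.0, 1.0, 0.0, 1.0])          # drop channel 2
print(reconstruction_error(Y, X, W, beta, lam=0.1))
```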
In a specific implementation manner of the embodiment of the present invention, the building unit is configured to:
taking the initial convolutional neural network as a current network model, and removing a convolutional kernel of which the corresponding minimum reconstruction error exceeds a preset numerical range aiming at each convolutional kernel in a current convolutional layer in the current network model;
for each convolution kernel left after the elimination, the weight matrix of the convolution kernel is kept unchanged, and the current value of the selection vector coefficient corresponding to the channel with the length of c is obtained by using the formula

β̂ = argmin_β (1/(2N)) ‖Y − Σ_{i=1}^{c} β_i · X_i · W_i^T‖_F^2 + λ‖β‖_1, subject to ‖β‖_0 ≤ c′,

wherein β̂ is the current value of the selection vector coefficient corresponding to the channel with the length of c, and argmin is the function returning the argument that minimizes the expression;
determining whether ‖β‖_0 converges;
if so, the weight of the convolution kernel corresponding to the minimized reconstruction error is obtained by using the formula

Ŵ = argmin_W ‖Y − Σ_{i=1}^{c} β_i · X_i · W_i^T‖_F^2;

the current value of the selection vector coefficient corresponding to the channel with the length of c and the weight of the convolution kernel corresponding to the minimized reconstruction error are taken as the target selection vector coefficient and the target convolution kernel weight of the convolution kernel, and the current network model is updated according to the target selection vector coefficient and the target convolution kernel weight;
if not, updating the penalty coefficient according to a preset step length, and returning to the step of obtaining the current value of the selection vector coefficient corresponding to the channel with the length of c, until ‖β‖_0 converges;
and taking the updated current network model as the current network model, taking the next convolutional layer of the current convolutional layer as the current convolutional layer, and returning to the step of removing, for each convolution kernel in the current convolutional layer of the current network model, the convolution kernels whose corresponding minimized reconstruction errors exceed the preset numerical range, until every convolutional layer of the current network model has been pruned; the pruned current network model is taken as the target convolutional neural network model.
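The alternating procedure described above (fix W and solve for β under an L1 penalty, grow the penalty until ‖β‖_0 converges to the target number of channels, then re-solve the retained weights by least squares) can be sketched as follows for a single layer. The simple iterative soft-thresholding used for the β-step is a stand-in for any LASSO solver, and all shapes and parameters are illustrative assumptions rather than the patented implementation.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prune_layer(Y, X, W, c_keep, lam=1e-3, lam_step=2.0, ista_iters=200):
    """One-layer channel-pruning sketch: X is a list of c slice matrices (N_s, k),
    W is (c, n, k), Y is the (N_s, n) response; c_keep = c' channels to retain.
    Assumes at least one channel survives the selection."""
    c = len(X)
    A = np.stack([X[i] @ W[i].T for i in range(c)], axis=-1)   # (N_s, n, c)
    A2 = A.reshape(-1, c)                                      # flatten samples x outputs
    y2 = Y.reshape(-1)
    L = np.linalg.norm(A2, 2) ** 2                             # Lipschitz constant for ISTA
    while True:
        beta = np.zeros(c)
        for _ in range(ista_iters):                            # ISTA for the LASSO beta-step
            grad = A2.T @ (A2 @ beta - y2)
            beta = soft_threshold(beta - grad / L, lam / L)
        if np.count_nonzero(beta) <= c_keep:                   # ||beta||_0 has converged
            break
        lam *= lam_step                                        # otherwise increase the penalty
    keep = np.flatnonzero(beta)
    # Least-squares re-fit of the retained weights with beta fixed (W-step).
    Xk = np.concatenate([beta[i] * X[i] for i in keep], axis=1)   # (N_s, c'*k)
    Wk, *_ = np.linalg.lstsq(Xk, Y, rcond=None)                   # (c'*k, n)
    return keep, Wk.T.reshape(Y.shape[1], len(keep), -1)          # (n, c', k)
```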
In a specific implementation manner of the embodiment of the present invention, the building unit is configured to:
taking the initial convolutional neural network as a current network model, and removing a convolutional kernel of which the corresponding minimum reconstruction error exceeds a preset numerical range aiming at each convolutional kernel in a current convolutional layer in the current network model;
for each convolution kernel remaining after the elimination, the current value of the selection vector coefficient corresponding to the channel with the length of c is obtained by using the formula

β̂ = argmin_β (1/(2N)) ‖Y − Σ_{i=1}^{c} β_i · X_i · W_i^T‖_F^2 + λ‖β‖_1, subject to ‖β‖_0 ≤ c′,

wherein β̂ is the current value of the selection vector coefficient corresponding to the channel with the length of c, and argmin is the function returning the argument that minimizes the expression;
the current weight of the convolution kernel corresponding to the reconstruction error is obtained by using the formula

Ŵ = argmin_W ‖Y − Σ_{i=1}^{c} β_i · X_i · W_i^T‖_F^2;
judging whether the reconstruction error corresponding to the current value of the selected vector coefficient and the current weight of the convolution kernel is converged;
if yes, taking the current value of the selection vector coefficient corresponding to the channel with the length of c and the weight of the convolution kernel corresponding to the minimized reconstruction error as the target selection vector coefficient and the target convolution kernel weight of the convolution kernel, and updating the current network model according to the target selection vector coefficient and the target convolution kernel weight;
if not, updating the penalty coefficient according to a preset step length, and returning to the step of acquiring the current value of the selection vector coefficient corresponding to the channel with the length of c until the reconstruction error corresponding to the current value of the selection vector coefficient and the current weight of the convolution kernel is converged;
and taking the updated current network model as the current network model, taking the next convolutional layer of the current convolutional layer as the current convolutional layer, returning to execute the step of removing the convolutional cores of which the corresponding minimum reconstruction errors exceed the preset numerical range aiming at each convolutional core in the current convolutional layer in the current network model until each convolutional layer of the current network model is pruned, and taking the pruned current network model as the target convolutional neural network model.
In a specific implementation manner of the embodiment of the present invention, the building unit is configured to:
quantizing the model parameters in the current network model after pruning by using a quantization algorithm;
then, coding the current network model after the model parameters are quantized by using a Huffman coding algorithm;
and taking the coded current network model as a target convolutional neural network model.
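A compact sketch of this post-pruning compression step (uniform weight quantization followed by Huffman coding of the quantized indices) is given below; the 8-bit uniform quantizer and the frequency-based code construction are illustrative assumptions, not the specific quantization and coding algorithms of the embodiment.

```python
import heapq
from collections import Counter
import numpy as np

def quantize(weights, bits=8):
    """Uniform quantization of a weight array to 2**bits levels."""
    lo, hi = weights.min(), weights.max()
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0:                             # constant array: avoid division by zero
        scale = 1.0
    idx = np.round((weights - lo) / scale).astype(np.uint8)
    return idx, lo, scale                      # indices plus parameters to dequantize

def huffman_codebook(symbols):
    """Build a Huffman code (symbol -> bit string) from symbol frequencies."""
    heap = [[cnt, i, {s: ""}] for i, (s, cnt) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    if len(heap) == 1:                         # degenerate case: a single symbol
        return {s: "0" for s in heap[0][2]}
    while len(heap) > 1:
        lo_node = heapq.heappop(heap)
        hi_node = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in lo_node[2].items()}
        merged.update({s: "1" + code for s, code in hi_node[2].items()})
        heapq.heappush(heap, [lo_node[0] + hi_node[0], lo_node[1], merged])
    return heap[0][2]

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
idx, lo, scale = quantize(w)
book = huffman_codebook(idx.tolist())
encoded_bits = sum(len(book[s]) for s in idx.tolist())
print(f"{w.nbytes * 8} bits raw -> {encoded_bits} bits after quantize + Huffman")
```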
In a specific implementation manner of the embodiment of the present invention, the identification module 1001 is configured to:
when using a previously trained convolutional neural network model for identification, the n x m convolutional kernel operation is divided into n x m multiplication operations and n x m-1 addition operations, and,
when n x m is an odd number, taking n x m-1 times of addition operation as current operation, summing every two operations in the current operation to obtain a summed operation result, taking the summed operation result as current operation, and returning to execute the step of summing every two operations in the current operation to obtain the summed operation result until the summation of the n x m-1 times of addition operation is completed to obtain an operation result of an n x m convolution kernel;
when n x m is an even number, taking n x m-2 times of addition operation as current operation, summing every two operations in the current operation to obtain a summed operation result, taking the summed operation result as current operation, and returning to execute the step of summing every two operations in the current operation to obtain the summed operation result until the summation of the n x m-2 times of addition operation is completed; and summing the sum of the n x m-2 times of addition operation and the addition operation which does not participate in the operation to obtain an operation result of the n x m convolution kernel.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (9)

1. An embedded video image analysis method based on edge calculation is characterized in that the method is applied to cameras in a video monitoring network, the video monitoring network comprises a plurality of cameras which are in communication connection with a monitoring center, and the method comprises the following steps:
the method for recognizing the preset target from the video shot by the camera comprises the following steps:
the method comprises the following steps that an ARM is used as a main control unit, and an FPGA is used as a core acceleration unit to construct a hardware computing framework for identifying a preset target; based on the hardware computing architecture, recognizing preset targets contained in each original image by using a pre-constructed convolutional neural network model, wherein the preset targets comprise: one or a combination of a person, a vehicle, a building;
the method comprises the steps of acquiring attribute features of a preset target and/or scene attribute features of the preset target aiming at the preset target identified from a video, wherein the attribute features of the preset target comprise: when the preset target is a vehicle, one or a combination of the type, the body color, the license plate and the position of the vehicle; when the preset target is a person, one or a combination of the sex, the age, the clothing and the position of the person; when the preset target is a building, one or a combination of the position and the type of the building; the scene attribute characteristics of the preset target include: one or a combination of shooting time, shooting place and shooting angle of the original image;
and uploading the acquired attribute characteristics of the preset target and the scene attribute characteristics of the preset target to a monitoring center corresponding to the camera.
2. The embedded video image parsing method based on edge calculation as claimed in claim 1, wherein before the preset target is identified from the video captured by the camera, the method further comprises:
acquiring the model data of a camera, and searching the video coding format of the camera from a pre-stored model data-video coding format list according to the model data of the camera;
decoding video stream data shot by the camera by using a decoding method corresponding to the video coding format, and restoring an original image shot by the camera;
and taking the original image as a video shot by a camera.
3. The embedded video image parsing method based on edge computation of claim 1, wherein the pre-constructed target convolutional neural network is constructed by the following process:
constructing an initial convolutional neural network with an input layer, a convolutional layer, a pooling layer, a fully-connected layer and an output layer, and training;
acquiring a conversion matrix aiming at pruning operation according to the number of convolution kernels in a target convolution neural network obtained after preset pruning and the number of convolution kernels in the constructed initial convolution neural network;
acquiring the minimum reconstruction error of each convolution kernel in the initial convolution neural network according to the conversion matrix and the weight of each convolution kernel;
and eliminating the convolution kernels of which the corresponding minimum reconstruction errors exceed the preset numerical range to obtain the constructed target convolution neural network.
4. The method according to claim 3, wherein the obtaining of the transformation matrix for the pruning operation according to the number of convolution kernels in the target convolution neural network obtained after the preset pruning and the number of convolution kernels in the constructed initial convolution neural network comprises:
according to the number of convolution kernels in the target convolution neural network obtained after the preset pruning and the number of convolution kernels in the constructed initial convolution neural network, using the formula Y = (N × c × k_h × k_w)^(-1) · n × c × k_h × k_w, a transformation matrix for the pruning operation is obtained, wherein,
Y is the transformation matrix for the pruning operation; N is the number of convolution kernels in the initial convolution neural network; c is the number of channels corresponding to the feature map; k_h × k_w is the size of the convolution kernel; and n is the number of convolution kernels in the target convolution neural network obtained after pruning.
5. The embedded video image parsing method based on edge computation of claim 3, wherein the obtaining the minimized reconstruction error of each convolution kernel in the initial convolutional neural network according to the transformation matrix and the weight of each convolution kernel comprises:
based on the transformation matrix and the weights of the individual convolution kernels, using the formula

min_{β,W} (1/(2N)) ‖Y − Σ_{i=1}^{c} β_i · X_i · W_i^T‖_F^2, subject to ‖β‖_0 ≤ c′,

obtaining a minimized reconstruction error of each convolution kernel in the initial convolutional neural network, wherein,
min is the minimum value evaluation function; β is the selection vector coefficient corresponding to the channel with length c; β_i is the selection marker of the i-th channel; W is the weight matrix of the convolution kernel; N is the number of convolution kernels in the initial convolution neural network; ‖·‖_F is the Frobenius norm; Y is the transformation matrix for the pruning operation; Σ is the summation function; X_i is the slice matrix of the i-th channel; W^T is the transpose of the weight matrix of the convolution kernel; c′ is the number of channels reserved after pruning; c is the number of channels corresponding to the feature map; and ‖·‖_0 is the zero norm.
6. The embedded video image parsing method based on edge computation of claim 3, wherein the obtaining the minimized reconstruction error of each convolution kernel in the initial convolutional neural network according to the transformation matrix and the weight of each convolution kernel comprises:
for each convolution kernel, based on the transformation matrix and the weights of the respective convolution kernels, using the formula

min_β (1/(2N)) ‖Y − Σ_{i=1}^{c} β_i · X_i · W_i^T‖_F^2 + λ‖β‖_1, subject to ‖β‖_0 ≤ c′ and ∀i ‖W_i‖_F = 1,

obtaining the reconstruction error of each convolution kernel in the initial convolutional neural network, wherein,
β is the selection vector coefficient corresponding to the channel with length c; β_i is the selection marker of the i-th channel; W is the weight matrix of the convolution kernel; N is the number of convolution kernels in the initial convolution neural network; ‖·‖_F is the Frobenius norm; Y is the transformation matrix for the pruning operation; Σ is the summation function; X_i is the slice matrix of the i-th channel; W^T is the transpose of the weight matrix of the convolution kernel; λ is the penalty coefficient; ‖·‖_1 is the L1 norm; ∀i denotes any i; c′ is the number of channels reserved after pruning; c is the number of channels corresponding to the feature map; and ‖·‖_0 is the zero norm.
7. The embedded video image analysis method based on edge computing as claimed in claim 6, wherein the removing the convolution kernel whose corresponding minimum reconstruction error exceeds the preset value range to obtain the constructed target convolution neural network comprises:
taking the initial convolutional neural network as a current network model, and removing a convolutional kernel of which the corresponding minimum reconstruction error exceeds a preset numerical range aiming at each convolutional kernel in a current convolutional layer in the current network model;
for each convolution kernel left after the elimination, the weight matrix of the convolution kernel is kept unchanged, and by using a formula,
Figure FDA0003957172620000041
obtain the current value of the select vector coefficient corresponding to the channel of length c, wherein->
Figure FDA0003957172620000042
Selecting vector coefficients for a channel of length cA previous value; argmin is a function minimum variable evaluation function;
determining whether ‖β‖_0 converges;
if so, using the formula

Ŵ = argmin_W ‖Y − Σ_{i=1}^{c} β_i · X_i · W_i^T‖_F^2,

obtaining the weight of the convolution kernel corresponding to the minimized reconstruction error; taking the current value of the selection vector coefficient corresponding to the channel with the length of c and the weight of the convolution kernel corresponding to the minimized reconstruction error as the target selection vector coefficient and the target convolution kernel weight of the convolution kernel, and updating the current network model according to the target selection vector coefficient and the target convolution kernel weight;
if not, updating the penalty coefficient according to a preset step length, and returning to the step of obtaining the current value of the selection vector coefficient corresponding to the channel with the length of c, until ‖β‖_0 converges;
and taking the updated current network model as the current network model, taking the next convolutional layer of the current convolutional layer as the current convolutional layer, returning to execute the step of removing, for each convolution kernel in the current convolutional layer in the current network model, the convolution kernels whose corresponding minimized reconstruction errors exceed the preset numerical range, until each convolutional layer of the current network model is pruned, and taking the pruned current network model as the target convolutional neural network model.
8. The embedded video image analysis method based on edge computing according to claim 6, wherein the removing the convolution kernel whose corresponding minimized reconstruction error exceeds the preset numerical range to obtain the constructed target convolution neural network comprises:
taking the initial convolutional neural network as a current network model, and removing a convolutional kernel of which the corresponding minimum reconstruction error exceeds a preset numerical range aiming at each convolutional kernel in a current convolutional layer in the current network model;
for each convolution kernel remaining after the elimination, using the formula

β̂ = argmin_β (1/(2N)) ‖Y − Σ_{i=1}^{c} β_i · X_i · W_i^T‖_F^2 + λ‖β‖_1, subject to ‖β‖_0 ≤ c′,

obtaining a current value of the selection vector coefficient corresponding to the channel with the length of c, wherein,
β̂ is the current value of the selection vector coefficient corresponding to the channel with the length of c; argmin is the function returning the argument that minimizes the expression;
using the formula

Ŵ = argmin_W ‖Y − Σ_{i=1}^{c} β_i · X_i · W_i^T‖_F^2,

obtaining the current weight of the convolution kernel corresponding to the reconstruction error;
judging whether the reconstruction error corresponding to the current value of the selected vector coefficient and the current weight of the convolution kernel is converged;
if yes, taking the current value of the selection vector coefficient corresponding to the channel with the length of c and the weight of the convolution kernel corresponding to the minimized reconstruction error as the target selection vector coefficient and the target convolution kernel weight of the convolution kernel, and updating the current network model according to the target selection vector coefficient and the target convolution kernel weight;
if not, updating the penalty coefficient according to a preset step length, and returning to the step of acquiring the current value of the selection vector coefficient corresponding to the channel with the length of c until the reconstruction error corresponding to the current value of the selection vector coefficient and the current weight of the convolution kernel is converged;
and taking the updated current network model as the current network model, taking the next convolutional layer of the current convolutional layer as the current convolutional layer, returning to execute the step of removing, for each convolution kernel in the current convolutional layer in the current network model, the convolution kernels whose corresponding minimized reconstruction errors exceed the preset value range, until each convolutional layer of the current network model is pruned, and taking the pruned current network model as the target convolutional neural network model.
9. The embedded video image parsing method based on edge calculation of claim 1, wherein the n x m convolution kernel operation is split into n x m multiplication operations and n x m-1 addition operations when using the pre-trained convolution neural network model for identification, and,
when n x m is an odd number, taking n x m-1 times of addition operation as current operation, summing every two operations in the current operation to obtain a summed operation result, taking the summed operation result as current operation, and returning to execute the step of summing every two operations in the current operation to obtain the summed operation result until the summation of the n x m-1 times of addition operation is completed to obtain an operation result of an n x m convolution kernel;
when n x m is an even number, taking n x m-2 times of addition operation as current operation, summing every two operations in the current operation to obtain a summed operation result, taking the summed operation result as current operation, and returning to execute the step of summing every two operations in the current operation to obtain the summed operation result until the summation of the n x m-2 times of addition operation is completed; and summing the sum of the n x m-2 times of addition operation and the addition operation which does not participate in the operation to obtain an operation result of the n x m convolution kernel.
CN201910461504.6A 2019-05-30 2019-05-30 Embedded video image analysis method and device based on edge calculation Active CN110210378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910461504.6A CN110210378B (en) 2019-05-30 2019-05-30 Embedded video image analysis method and device based on edge calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910461504.6A CN110210378B (en) 2019-05-30 2019-05-30 Embedded video image analysis method and device based on edge calculation

Publications (2)

Publication Number Publication Date
CN110210378A CN110210378A (en) 2019-09-06
CN110210378B true CN110210378B (en) 2023-04-07

Family

ID=67789601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910461504.6A Active CN110210378B (en) 2019-05-30 2019-05-30 Embedded video image analysis method and device based on edge calculation

Country Status (1)

Country Link
CN (1) CN110210378B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796245B (en) * 2019-10-25 2022-03-22 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN112991192B (en) * 2019-12-18 2023-07-25 杭州海康威视数字技术股份有限公司 Image processing method, device, equipment and system thereof
CN112541096B (en) * 2020-07-27 2023-01-24 中咨数据有限公司 Video monitoring method for smart city
CN112486677B (en) * 2020-11-25 2024-01-12 深圳市中博科创信息技术有限公司 Data graph transmission method and device
CN113542600B (en) * 2021-07-09 2023-05-12 Oppo广东移动通信有限公司 Image generation method, device, chip, terminal and storage medium
CN114170619B (en) * 2021-10-18 2022-08-19 中标慧安信息技术股份有限公司 Data checking method and system based on edge calculation
CN114566052B (en) * 2022-04-27 2022-08-12 华南理工大学 Method for judging rotation of highway traffic flow monitoring equipment based on traffic flow direction
CN116664872A (en) * 2023-07-26 2023-08-29 成都实时技术股份有限公司 Embedded image recognition method, medium and system based on edge calculation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012151777A1 (en) * 2011-05-09 2012-11-15 上海芯启电子科技有限公司 Multi-target tracking close-up shooting video monitoring system
CN107506695A (en) * 2017-07-28 2017-12-22 武汉理工大学 Video monitoring equipment failure automatic detection method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012151777A1 (en) * 2011-05-09 2012-11-15 上海芯启电子科技有限公司 Multi-target tracking close-up shooting video monitoring system
CN107506695A (en) * 2017-07-28 2017-12-22 武汉理工大学 Video monitoring equipment failure automatic detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improvement of a person target detection algorithm in complex scene images; 郝叶林 et al.; Journal of Wuyi University (Natural Science Edition); 2018-02-15 (Issue 01); full text *

Also Published As

Publication number Publication date
CN110210378A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
CN110210378B (en) Embedded video image analysis method and device based on edge calculation
CN110084281B (en) Image generation method, neural network compression method, related device and equipment
CN111445026A (en) Deep neural network multi-path reasoning acceleration method for edge intelligent application
CN113595993A (en) Vehicle-mounted sensing equipment joint learning method for model structure optimization under edge calculation
CN112819252A (en) Convolutional neural network model construction method
CN114071141A (en) Image processing method and equipment
Chakraborty et al. MAGIC: Machine-learning-guided image compression for vision applications in Internet of Things
CN114169506A (en) Deep learning edge computing system framework based on industrial Internet of things platform
CN114331837A (en) Method for processing and storing panoramic monitoring image of protection system of extra-high voltage converter station
CN117354467A (en) Intelligent optimized transmission system for image data
CN112399177A (en) Video coding method and device, computer equipment and storage medium
CN111314707A (en) Data mapping identification method, device and equipment and readable storage medium
WO2023029559A1 (en) Data processing method and apparatus
CN108675071B (en) Cloud cooperative intelligent chip based on artificial neural network processor
Gao et al. Triple-partition network: collaborative neural network based on the ‘end device-edge-cloud’
CN112884118A (en) Neural network searching method, device and equipment
CN116644783A (en) Model training method, object processing method and device, electronic equipment and medium
CN116095183A (en) Data compression method and related equipment
CN112926517B (en) Artificial intelligence monitoring method
CN115189474A (en) Power distribution station electric energy meter identification method and system based on raspberry group 4B
CN113919479B (en) Method for extracting data features and related device
CN112446859A (en) Satellite-borne thermal infrared camera image cloud detection method based on deep learning
CN115409150A (en) Data compression method, data decompression method and related equipment
CN114170560B (en) Multi-device edge video analysis system based on deep reinforcement learning
CN109919203A (en) A kind of data classification method and device based on Discrete Dynamic mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant