CN112927127A - Video privacy data fuzzification method running on edge device - Google Patents


Info

Publication number
CN112927127A
CN112927127A
Authority
CN
China
Prior art keywords
video
algorithm
model
mask
edge device
Prior art date
Legal status
Pending
Application number
CN202110265858.0A
Other languages
Chinese (zh)
Inventor
张泽华
李向阳
高焕丽
罗家祥
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110265858.0A priority Critical patent/CN112927127A/en
Publication of CN112927127A publication Critical patent/CN112927127A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/04 Context-preserving transformations, e.g. by using an importance map
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/215 Motion-based segmentation
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/277 Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20076 Probabilistic image processing
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30232 Surveillance
    • G06T2207/30241 Trajectory
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for blurring video privacy data that runs on edge devices, comprising the steps of designing and building the algorithm model, optimizing the model, quantizing and accelerating the model, and migrating the model to a mobile terminal device for operation. The beneficial effects of the invention are: the automatic privacy-blurring system has low computing-resource requirements and a high computing speed, can run on a mobile terminal without depending on other server resources, and protects the privacy of other people in published video; a multi-object tracking algorithm tracks each object, so that a specified object can be blurred or the motion trajectory of a specified object can be recorded; TensorRT is used to quantize, optimize and accelerate the algorithm model, reducing deployment difficulty, and at runtime three threads handle the preprocessing, model inference and post-processing parts respectively, improving operating efficiency.

Description

Video privacy data fuzzification method running on edge device
Technical Field
The invention belongs to the technical field of computer vision, and in particular relates to a method for blurring video privacy data that runs on edge devices.
Background
Cameras such as surveillance devices are used ever more widely: community or kitchen monitoring feeds are made public, and videos are captured and uploaded to the Internet for publication. At the same time, people pay increasing attention to privacy protection. Public kitchen monitoring helps people supervise kitchen hygiene, but it also leaks the privacy of the cooks; publishing video on the Internet spreads it efficiently, yet that same efficiency of network transmission and search makes it easier for the privacy of other people in the video to be exposed.
Existing approaches rely mainly on after-the-fact desensitization, which consumes manual labor and is unsuitable for online monitoring equipment; moreover, most such processing is done frame by frame and is inefficient.
Segmentation techniques in computer vision can quickly obtain masks of all people in a video, and are divided into semantic segmentation and instance segmentation. Semantic segmentation is pixel-level classification: each pixel in the image is classified, for example as part of a person or not. Instance segmentation goes further, distinguishing not only people from the background but also one person from another. Semantic segmentation is fast but not flexible enough, and its accuracy drops when the target occupies a small proportion of the image; instance segmentation is flexible but computationally heavy, and needs extra processing before it can run on low-compute devices. At the same time, object tracking can conveniently be added on top of instance-level processing.
Convolutional neural networks can achieve high-accuracy instance segmentation, but the computation involved is enormous; most such models can only run on a server and are hard to deploy directly to mobile devices. With the rapid development of society, the number of camera devices has also grown explosively, and it is difficult for servers alone to run instance segmentation at that scale; in addition, server-side processing risks privacy leakage while the video data is being uploaded. A method that processes data in real time through edge computing on edge devices is therefore needed, where edge computing means running the application on edge devices close to the object or data source, producing faster service responses and satisfying real-time requirements.
To reduce the demand for computing resources, protect the privacy of other people in published video without depending on other server resources, and make it possible to blur a specified object or record the motion trajectory of a specified target, a video privacy data blurring method running on edge devices is provided.
Disclosure of Invention
The invention aims to provide a method for blurring video privacy data that runs on edge devices, which reduces the demand for computing resources, does not depend on other server resources, protects the privacy of other people in published video, and makes it possible to blur specified objects or record the motion trajectory of specified objects.
To achieve this purpose, the invention provides the following technical scheme: a video privacy data blurring method running on edge devices, comprising algorithm model building and model operation, with the following steps:
Step one: model initialization, including building the model according to configuration files, optimizing and accelerating the trained model, and initializing the tracker;
Step two: acquire a video sequence and import it into the operating platform;
Step three: perform feature extraction using a lightweight network and an FPN structure;
Step four: obtain the detection results and instance masks of the image through an instance segmentation algorithm;
Step five: allocate an ID to each detected object through a multi-object tracking algorithm;
Step six: control through the ID whether an object is blurred.
As a preferred technical solution of the present invention, in step one the optimization and acceleration method is: the PyTorch model is first converted to the intermediate ONNX format and then optimized and accelerated by TensorRT.
As a preferred technical solution of the present invention, in step two the operating platform is a platform supporting C++ or Python, or an NVIDIA Jetson device.
As a preferred technical solution of the present invention, in step three the lightweight feature extraction network ShuffleNetV2 or MobileNet is selected as the feature extraction network.
As a preferred technical solution of the present invention, in step four the instance segmentation algorithm comprises two subtasks, used respectively for object detection and mask generation.
As a preferred embodiment of the present invention, object detection means finding all people in an input image with the algorithm; the result is represented by a bounding box framing each person, and the output of object detection comprises classification and regression.
As a preferred technical solution of the present invention, the mask generation manner combines original (prototype) masks with mask coefficients, where the original masks are independent of any specific person in the image, the mask coefficients are associated with a specific person, and each person generates a set of mask coefficients.
As a preferred technical solution of the present invention, in step five the multi-object tracking algorithm is the SORT algorithm.
As a preferred technical solution of the present invention, in step six specific persons are blurred according to their IDs, and the blurred object can be changed by changing the ID in different time periods, which guards against the tracking algorithm following the wrong object.
As a preferred technical solution of the present invention, the video input may be a video read from a file or a camera stream read directly for processing, and the video-reading library is OpenCV.
As a preferred technical solution of the present invention, TensorRT is used to quantize, optimize and accelerate the algorithm model, reducing deployment difficulty; at the same time, three threads handle the preprocessing, model inference and post-processing parts respectively when the algorithm runs.
Compared with the prior art, the invention has the beneficial effects that:
(1) The automatic privacy-blurring system has low computing-resource requirements, can run on a mobile terminal, does not depend on other server resources, and protects the privacy of other people in published video; a tracking algorithm tracks each object, so a specified object can be blurred or the motion trajectory of a specified target recorded;
(2) TensorRT is used to quantize, optimize and accelerate the algorithm model, reducing deployment difficulty; at runtime three threads handle the preprocessing, model inference and post-processing parts respectively, improving operating efficiency.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a flow chart of the model operation of the present invention;
FIG. 3 is a diagram of the instance segmentation algorithm model of the present invention;
FIG. 4 is a flow chart of model transformation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 2, fig. 3 and fig. 4, the present invention provides a technical solution: a method for blurring video privacy data running on edge devices, comprising algorithm model building and model operation, with the following steps:
Step one: model initialization, including building the model according to configuration files, optimizing and accelerating the trained model through neural network acceleration, and initializing the tracker. Neural network acceleration further improves the running speed of the model; TensorRT, a high-performance deep learning inference optimizer for NVIDIA GPUs, is used for inference acceleration. A model implemented in the deep learning framework PyTorch cannot be optimized by TensorRT directly, so the PyTorch model is first converted to the intermediate ONNX format and then optimized and accelerated by TensorRT;
Step two: import the acquired video sequence into the operating platform, which is a platform supporting C++ or Python or an NVIDIA Jetson device;
Step three: perform feature extraction on the input image through the feature extraction network, which consists of a convolutional neural network. The input is the preprocessed image and the output is a series of feature maps used by the subsequent instance segmentation algorithm. The lightweight feature extraction network ShuffleNetV2 is selected, but it is not mandatory: other feature extraction networks such as MobileNet can be substituted, or a larger feature extraction network can be used when computing power is sufficient, to handle higher resolutions or obtain higher accuracy;
Step four: identify the pixels where people are located in the input image through the instance segmentation algorithm, and generate an independent mask for each person. The instance segmentation algorithm is YOLACT, which is based on deep learning and comprises two subtasks, used respectively for object detection and mask generation. The input of the algorithm is a preprocessed video frame; after processing, the output comprises human masks and human detection results in one-to-one correspondence. The detection results feed the subsequent tracking algorithm, which allocates an identity (ID) to each object, and the masks are used for the subsequent blurring. Specifically:
Object detection means finding all people in the input image with the algorithm; the result is a bounding box framing each person, so the output comprises classification and regression, where classification judges whether an object is a person and, when it is, the regression task regresses the boundary of the bounding box. The detection mode used is anchor-based: preset detection boxes are tiled densely over the image, so classification in detection judges whether a person is present in a preset detection box, while regression predicts the offset of the person's exact position relative to that detection box.
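For illustration, anchor-offset decoding can be sketched with the common SSD/Faster R-CNN (dx, dy, dw, dh) parameterization; the patent text does not fix an exact formula, so this particular parameterization is an assumption:

```python
import math

# Illustrative anchor decoding: recover an absolute box from a preset
# anchor and the regressed offsets, under the assumed (dx, dy, dw, dh)
# parameterization. Not the patent's own formula.

def decode_box(anchor, offsets):
    """anchor = (cx, cy, w, h) in centre format; offsets = (dx, dy, dw, dh)."""
    acx, acy, aw, ah = anchor
    dx, dy, dw, dh = offsets
    cx = acx + dx * aw          # shift the anchor centre by a fraction of its size
    cy = acy + dy * ah
    w = aw * math.exp(dw)       # scale the anchor width/height exponentially
    h = ah * math.exp(dh)
    # return corner coordinates (x1, y1, x2, y2)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

With zero offsets the anchor itself is returned in corner format, which is why a dense tiling of anchors lets the classifier ask "is there a person in this preset box" while regression only has to model small corrections.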
The mask generation manner combines original (prototype) masks with mask coefficients, where the original masks are independent of any specific person in the image, the mask coefficients are associated with a specific person, and each person generates a set of mask coefficients;
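A minimal sketch of this combination, assuming YOLACT-style behavior (linear combination of prototypes, then sigmoid and threshold); two tiny 2x2 prototypes stand in for the 32 full-resolution ones the text describes:

```python
import math

# Toy YOLACT-style mask assembly: an instance mask is the thresholded
# sigmoid of a linear combination of shared prototype masks, weighted
# by that instance's coefficients. Sizes here are illustrative only.

def assemble_mask(prototypes, coeffs, threshold=0.5):
    h, w = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            # linear combination of prototype responses at this pixel
            s = sum(c * p[y][x] for c, p in zip(coeffs, prototypes))
            # sigmoid maps to (0, 1); threshold splits person vs background
            row.append(1 if 1 / (1 + math.exp(-s)) > threshold else 0)
        mask.append(row)
    return mask

protos = [
    [[4, 4], [-4, -4]],   # prototype responding to the top half
    [[-4, 4], [-4, 4]],   # prototype responding to the right half
]
# A person whose coefficients select only the first prototype:
person_mask = assemble_mask(protos, coeffs=[1.0, 0.0])
```

Because the prototypes are shared across all instances, each additional person only costs one small coefficient vector, which is the source of the speed advantage claimed later over two-stage methods.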
Step five: allocate an ID number to each person in the video sequence through the multi-object tracking algorithm, so that specific persons can be blurred by ID number. The multi-object tracking algorithm is SORT, a high-speed detection-based multi-object tracker which, from the detection results, allocates an identity (ID) number to each person in the video.
Step six: according to the ID number, specific persons can be blurred in a targeted manner, which satisfies requirements such as keeping the main subjects unblurred while blurring everyone in the background when a video is published on the Internet.
It should be added that the video input can be read from a file or taken directly from a camera for processing.
It should be added that, for an input video sequence, all people in the image can be detected and a mask generated for each person. The mask is at the instance level, so different people can be distinguished through their masks, unlike the mask in semantic segmentation, which can only distinguish people from the background. The algorithm model can also be trained on different data according to the task requirements, yielding different kinds of instance segmentation models, so the method does not act only on people.
It should be added that the blur operation can be applied to a specific region according to the generated mask, and the blur algorithm can be any of Gaussian blur, mean blur or median blur. Generating the mask by instance segmentation before blurring yields a clean blur boundary, and after the target is blurred the other information in the video is retained to the greatest extent.
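A toy sketch of mask-restricted blurring on a tiny grayscale grid, using a 3x3 box blur as a simple stand-in for the Gaussian, mean or median blur mentioned above:

```python
# Hypothetical masked blur: replace only the masked pixels with the mean
# of their 3x3 neighbourhood; pixels outside the mask are untouched,
# which is how the rest of the video keeps its information.

def masked_box_blur(img, mask):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue  # outside the person mask: keep the original pixel
            vals = [img[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) // len(vals)  # integer mean of the window
    return out
```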
It should be added that, to speed up the model and reduce deployment difficulty, the model trained on the server has Float32 precision; it can be quantized to Float16 or INT8 precision depending on device support, or mixed precision can be used, which reduces the size of the algorithm model and increases its running speed.
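The idea behind INT8 quantization can be illustrated with a simple per-tensor symmetric scheme; TensorRT's actual calibration is considerably more sophisticated, so this is only a sketch of the principle:

```python
# Sketch of symmetric INT8 quantization: pick one scale from the maximum
# absolute value so every float maps into the signed 8-bit range; the
# 8-bit values are what a quantized engine would actually store/compute.

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # recover approximate floats; the error is the quantization cost
    return [x * scale for x in q]
```

Halving (Float16) or quartering (INT8) the bytes per weight is what shrinks the model and speeds up memory-bound inference on the edge device.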
It should be added that, to make further use of the operating platform's resources and improve running speed, three threads are used: the first reads the video data and performs the preprocessing operations; the second invokes the instance segmentation model to obtain its output; and the third converts and post-processes the output, runs the tracking algorithm, and then displays or stores the result. Concurrent multi-thread processing makes fuller use of CPU resources and helps the algorithm run in real time.
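The three-thread division of labor can be sketched with standard queues; the stage bodies below are trivial arithmetic stand-ins for preprocessing, inference and post-processing:

```python
import queue
import threading

# Sketch of the three-stage pipeline: preprocessing, inference and
# post-processing run in separate threads connected by FIFO queues, so a
# new frame can be preprocessed while the previous one is still inferring.

def run_pipeline(frames):
    pre_q, post_q, results = queue.Queue(), queue.Queue(), []

    def preprocess():
        for f in frames:
            pre_q.put(f * 2)          # stand-in for resize/normalise
        pre_q.put(None)               # sentinel: end of stream

    def infer():
        while (item := pre_q.get()) is not None:
            post_q.put(item + 1)      # stand-in for model inference
        post_q.put(None)              # propagate the sentinel downstream

    def postprocess():
        while (item := post_q.get()) is not None:
            results.append(item)      # stand-in for NMS/tracking/blur

    threads = [threading.Thread(target=t)
               for t in (preprocess, infer, postprocess)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The FIFO queues keep frame order while letting the three stages overlap, which is the efficiency gain the text attributes to this design.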
In this embodiment, preferably, TensorRT is used to quantize, optimize and accelerate the algorithm model, reducing deployment difficulty; at runtime, three threads handle the preprocessing, model inference and post-processing parts respectively, improving operating efficiency.
The algorithm model is built as follows:
the main network part selects a light-weight feature extraction network ShuffleNet V2 to extract image features, the ShuffleNet V2 structure can reduce the memory access time and has higher operation speed, the ShuffleNet V2 is a feature extraction network, and the number of parameters and the amount of operation are greatly reduced by adopting packet convolution and Channel shuffle to replace standard convolution; because the complexity of the model and the time cost of memory access are reduced, the precision and the speed of the model are better balanced; the lightweight feature extraction network ShuffleNet V2 is optional, and other feature extraction networks, such as MobileNet, can be used instead, or a larger feature extraction network can be used instead when the computational power is sufficient to meet higher resolution processing or to achieve higher accuracy.
The feature pyramid with multi-scale feature output means the FPN structure is adopted to strengthen the model's multi-scale expressive power. A convolutional neural network relies on stacked convolutional layers, so a single neuron's receptive field is limited; downsampling layers enlarge the receptive field rapidly, but the downsampled feature maps are correspondingly small, and low-resolution feature maps make it hard for the subsequent instance segmentation algorithm to recover mask boundaries accurately. FPN multi-scale fusion is therefore used: the downsampled feature map is upsampled back to the original size and added directly to the original feature map, which effectively enlarges the receptive field and captures deep semantic information while preserving the localization ability of the original feature map, helping the subsequent segmentation algorithm locate segmentation boundaries;
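This upsample-and-add fusion can be sketched minimally on tiny feature maps (nearest-neighbour upsampling; the 1x1 and 3x3 convolutions a real FPN applies around the addition are omitted):

```python
# Minimal FPN-style merge: nearest-neighbour upsample the coarse,
# semantically rich map and add it elementwise to the finer map, so the
# result keeps fine localization plus deep semantics.

def upsample2x(fm):
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(wide[:])                        # repeat each row
    return out

def fpn_merge(fine, coarse):
    up = upsample2x(coarse)
    return [[f + u for f, u in zip(frow, urow)]
            for frow, urow in zip(fine, up)]
```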
The instance segmentation head is YOLACT, and the object detection algorithm used is anchor-based: anchors tile the image with preset detection boxes, and the output determines whether a detected object is present in each corresponding box. The network input is a single picture; after the convolutional network, the output divides into four parts: (1) the classification result, i.e. the probability that a target is present in the region corresponding to each anchor; (2) the regression result, representing the offset of the target's true position relative to the preset detection box, from which the exact position is recovered; (3) and (4) together generate a mask for each object, where (3) is 32 original (prototype) masks covering the whole image, independent of any specific object, and (4) generates 32 coefficients for each object. Linearly combining the 32 original masks of (3) with the 32 corresponding coefficients of (4) yields the mask of each person in the image. In (4), tanh is selected as the activation function, ensuring the coefficients can be positive or negative and strengthening the expressive power of the combination. Because each object's mask is generated by combining original masks with mask coefficients, only a 32-dimensional coefficient vector needs to be produced per object (the original masks resemble semantic segmentation and are independent of specific objects), so the running speed is much higher than that of two-stage algorithms. After the linear summation with the mask coefficients, the output is mapped into the range (0, 1) by a Sigmoid activation function and then divided by a threshold into two classes, person mask and background. Because the original masks span the whole image, noise points whose values also exceed the threshold easily appear in other regions; the detection result is therefore used to limit the range of the linearly summed result, and everything outside the detection box is forcibly set to background, reducing noise interference from other areas;
adam is adopted as an optimizer for algorithm training, is a gradient descent optimizer, and compared with random gradient descent, a momentum mechanism is added, the gradients in all directions are balanced, and the optimization speed is higher; the tracking algorithm adopts SORT (simple Online and real tracking), the SORT is a high-speed associated matching algorithm, and the SORT only adopts Kalman filtering and Hungarian algorithm to match the recognition result without feature extraction through a neural network.
The whole network is trained on a server: after the backbone network is pre-trained on the ImageNet classification dataset, the whole model is trained on instance segmentation data.
Most models trained on a server are difficult to run directly on edge devices, largely because resources such as memory and computing power fall far short of a server's, and because of architectural differences some programs cannot be compiled and run on edge devices directly. The neural network acceleration engine TensorRT is therefore selected to raise inference speed, helping the algorithm run in real time and be applied on edge devices; model quantization and model migration are required to apply the algorithm model to small devices. The PyTorch model is accordingly converted into a TensorRT engine that optimizes subsequent inference. TensorRT cannot act on PyTorch directly: the algorithm model trained with PyTorch must first be converted to the ONNX format supported by TensorRT, and onnx2trt then generates the TensorRT engine from the ONNX model, removing redundant operation nodes and fusing some of them. Because TensorRT does not fully support PyTorch's operation nodes, during conversion the algorithm model may need to be split into several parts or the unsupported operation nodes rewritten.
The details of the model operation are as follows:
the method comprises the steps of firstly loading TensorRT into TensorRT engine for example segmentation reasoning, simultaneously constructing a multi-target tracker, sequentially carrying out image preprocessing (enabling the distribution and the size of images to be consistent with those of an example segmentation model during training) on each frame of image in an input video sequence, carrying out example segmentation model reasoning, carrying out post-processing, finally obtaining detection frames and masks of all targets, carrying out association matching according to the positions and the sizes of the detection frames, the aspect ratio and the like by using a multi-target tracking algorithm, obtaining a tracking result and updating the tracker.
To make better use of system resources, three threads each handle one stage of data processing: video reading and preprocessing, model inference, and post-processing. Preprocessing comprises scaling and normalizing the image, and transferring the data from main memory to GPU memory so that the subsequent GPU stages can read it directly. Model inference means running the forward pass of the instance segmentation algorithm to obtain the original masks, mask coefficients, and the classification and regression outputs of detection. Post-processing restores the output of the instance segmentation algorithm into instance masks and detection results: the detection results are obtained from the classification and regression outputs by recovering coordinates, filtering low-confidence results and applying non-maximum suppression, and the instance masks are obtained by the linear combination described above. Post-processing also includes the tracking task, which allocates a specific ID to each detection result and associates the same person across frames through that ID.
In thread 1 (preprocessing), OpenCV reads video frames from a local video or an external camera (CSI or USB interface) and preprocesses each image so that the mean and variance of the image data distribution match those used when training the model. The image is also rescaled; the resolution directly affects the algorithm's running speed and final precision and is adjusted according to the device's computing power. The image data is then transferred from main memory to GPU memory so that the subsequent algorithm can read it directly when running on the GPU.
In thread 2 (model inference), the instance segmentation algorithm performs forward inference on the preprocessed data and outputs the original masks, the mask coefficients, and the classification and regression outputs of detection. Detection results and masks correspond one to one: the detection results feed the subsequent tracking algorithm, which assigns an identity ID to each object, and the masks are used for the subsequent blurring.
Thread 3 (post-processing) post-processes the results of the instance segmentation algorithm and assigns an ID to each person with a multi-target tracking algorithm. The post-processing flow comprises: ① coordinate conversion, which converts the output position results into absolute coordinates; ② threshold comparison, which filters out low-confidence results; ③ non-maximum suppression, which removes the overlapping duplicates among multiple detections of the same target; and ④ linear combination of the original masks and mask coefficients to obtain the instance masks. Post-processing yields the final detection results, from which the multi-target tracking algorithm assigns an ID to each person and updates the tracker.
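The non-maximum suppression and mask combination parts of this flow can be illustrated as follows. These are plain NumPy sketches of the standard techniques named above (greedy NMS and a YOLACT-style linear combination of original masks and coefficients), not the patented code.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes: keep the
    highest-scoring box, discard the rest that overlap it above iou_thr."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of the kept box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]
    return keep

def assemble_masks(prototypes, coeffs, thr=0.5):
    """Combine the original (prototype) masks, shape (H, W, k), with one
    k-vector of coefficients per instance, then sigmoid and binarize."""
    logits = prototypes @ coeffs.T          # (H, W, k) @ (k, n) -> (H, W, n)
    return 1.0 / (1.0 + np.exp(-logits)) > thr
```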
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (11)

1. A method for fuzzifying video privacy data running on an edge device, characterized in that the method comprises algorithm model construction and model operation, with the following steps:
step one: model initialization, comprising building the model according to configuration files, optimizing and accelerating the trained model, and initializing the tracker;
step two: acquiring a video sequence and importing it into the operation platform;
step three: performing feature extraction with a lightweight network and an FPN structure;
step four: obtaining the detection results and instance masks of the image through an instance segmentation algorithm;
step five: assigning an ID to each detected object through a multi-target tracking algorithm;
step six: controlling, through the ID, whether an object is blurred.
2. The method of claim 1 for obfuscating video privacy data running on an edge device, wherein: in step one, the optimization and acceleration method is as follows: the PyTorch model is first converted into the intermediate format file ONNX, and then optimized and accelerated with TensorRT.
3. The method of claim 1 for obfuscating video privacy data running on an edge device, wherein: in step two, the operation platform is one supporting C++ or Python, or one deployed on NVIDIA Jetson devices.
4. The method of claim 1 for obfuscating video privacy data running on an edge device, wherein: in step three, the lightweight network ShuffleNetV2 or MobileNet is selected as the feature extraction network.
5. The method of claim 1 for obfuscating video privacy data running on an edge device, wherein: in step four, the instance segmentation algorithm comprises two subtasks, used for target detection and mask generation respectively.
6. The method of claim 5 for obfuscating video privacy data running on an edge device, wherein: target detection means finding all persons in the input image through the algorithm; each detected person is framed by a bounding box, and the target detection output comprises classification and regression.
7. The method of claim 5 for obfuscating video privacy data running on an edge device, wherein: the masks are generated by combining original masks with mask coefficients, the original masks being independent of any specific person in the image while the mask coefficients are associated with specific persons, each person generating one set of mask coefficients.
8. The method of claim 1 for obfuscating video privacy data running on an edge device, wherein: in step five, the multi-target tracking algorithm is the SORT algorithm.
9. The method of claim 1 for obfuscating video privacy data running on an edge device, wherein: in step six, selected persons are blurred in a targeted manner through their IDs, and the IDs can be changed in different time periods to change the blurred objects, so as to prevent the tracking algorithm from following the wrong object.
10. The method of claim 1 for obfuscating video privacy data running on an edge device, wherein: the video input can be read from a file or directly from a camera, and the video reading library is OpenCV.
11. The method of claim 1 for obfuscating video privacy data running on an edge device, wherein: TensorRT is used to quantize, optimize, and accelerate the algorithm model, reducing deployment difficulty; when the algorithm runs, three threads are adopted to handle preprocessing, model inference, and post-processing respectively.
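As an illustration of claim 9 (targeted blurring controlled through tracking IDs), the following sketch pixelates the mask region of selected instances only. Pixelation (block averaging), the function name, and its parameters are illustrative assumptions; the claims do not prescribe a particular obfuscation.

```python
import numpy as np

def blur_selected(frame, masks, ids, blur_ids, ksize=16):
    """Pixelate (block-average) the mask region of every instance whose
    tracking ID is in blur_ids; all other persons are left untouched."""
    out = frame.copy()
    h, w = frame.shape[:2]
    for mask, oid in zip(masks, ids):
        if oid not in blur_ids:
            continue
        for y in range(0, h, ksize):
            for x in range(0, w, ksize):
                block = (slice(y, min(y + ksize, h)), slice(x, min(x + ksize, w)))
                sel = mask[block]
                if sel.any():
                    # replace masked pixels with the block's mean colour
                    out[block][sel] = frame[block][sel].mean(axis=0).astype(frame.dtype)
    return out
```

Because the selection is keyed on the tracker's IDs rather than on pixel positions, changing the ID set between time periods changes which persons are obfuscated, as claim 9 describes.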
CN202110265858.0A 2021-03-11 2021-03-11 Video privacy data fuzzification method running on edge device Pending CN112927127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110265858.0A CN112927127A (en) 2021-03-11 2021-03-11 Video privacy data fuzzification method running on edge device

Publications (1)

Publication Number Publication Date
CN112927127A true CN112927127A (en) 2021-06-08

Family

ID=76172669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110265858.0A Pending CN112927127A (en) 2021-03-11 2021-03-11 Video privacy data fuzzification method running on edge device

Country Status (1)

Country Link
CN (1) CN112927127A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436232A (en) * 2021-06-29 2021-09-24 上海律信信息科技有限公司 Hardware acceleration method based on tracking algorithm
WO2023231704A1 (en) * 2022-05-31 2023-12-07 京东方科技集团股份有限公司 Algorithm running method, apparatus and device, and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130108105A1 (en) * 2011-10-31 2013-05-02 Electronics And Telecommunications Research Institute Apparatus and method for masking privacy region based on monitored video image
CN109993207A (en) * 2019-03-01 2019-07-09 华南理工大学 A kind of image method for secret protection and system based on target detection
CN110298296A (en) * 2019-06-26 2019-10-01 北京澎思智能科技有限公司 Face identification method applied to edge calculations equipment
US20200098096A1 (en) * 2018-09-24 2020-03-26 Movidius Ltd. Methods and apparatus to generate masked images based on selective privacy and/or location tracking
CN111209013A (en) * 2020-01-15 2020-05-29 深圳市守行智能科技有限公司 Efficient deep learning rear-end model deployment framework
CN111460926A (en) * 2020-03-16 2020-07-28 华中科技大学 Video pedestrian detection method fusing multi-target tracking clues
CN111968155A (en) * 2020-07-23 2020-11-20 天津大学 Target tracking method based on segmented target mask updating template
CN112184757A (en) * 2020-09-28 2021-01-05 浙江大华技术股份有限公司 Method and device for determining motion trail, storage medium and electronic device
CN112364744A (en) * 2020-11-03 2021-02-12 珠海市卓轩科技有限公司 TensorRT-based accelerated deep learning image recognition method, device and medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
RUBEN PANERO MARTINEZ ET AL: "Real-Time Instance Segmentation of Traffic Videos for Embedded Devices", SENSORS, vol. 21, no. 275, 3 January 2021 (2021-01-03), pages 1 - 19 *
LIU JIAMIN: "Research and Implementation of a Road Section Monitoring System Based on YOLCAT Instance Segmentation", CHINA MASTERS' THESES FULL-TEXT DATABASE (ENGINEERING SCIENCE & TECHNOLOGY II), no. 02, 15 February 2021 (2021-02-15), pages 034 - 662 *
NIU DEJIAO: "Research and Implementation of Video-Based Target Tracking and Privacy Protection Technology", CHINA MASTERS' THESES FULL-TEXT DATABASE (INFORMATION SCIENCE & TECHNOLOGY), no. 01, 15 March 2004 (2004-03-15), pages 138 - 550 *
GAO TAO ET AL: "Multiple Moving Target Tracking Algorithm Based on Traffic Video Sequences", JOURNAL OF CENTRAL SOUTH UNIVERSITY (SCIENCE AND TECHNOLOGY), vol. 41, no. 03, 30 June 2010 (2010-06-30), pages 1028 - 1036 *

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
WO2021043112A1 (en) Image classification method and apparatus
US20180114071A1 (en) Method for analysing media content
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN112837344B (en) Target tracking method for generating twin network based on condition countermeasure
CN109753878B (en) Imaging identification method and system under severe weather
Chen et al. Corse-to-fine road extraction based on local Dirichlet mixture models and multiscale-high-order deep learning
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
Wang et al. Removing background interference for crowd counting via de-background detail convolutional network
Pavel et al. Recurrent convolutional neural networks for object-class segmentation of RGB-D video
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN112927127A (en) Video privacy data fuzzification method running on edge device
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
Teimouri et al. A real-time ball detection approach using convolutional neural networks
DE102022100360A1 (en) MACHINE LEARNING FRAMEWORK APPLIED IN A SEMI-SUPERVISED SETTING TO PERFORM INSTANCE TRACKING IN A SEQUENCE OF IMAGE FRAMES
CN114627269A (en) Virtual reality security protection monitoring platform based on degree of depth learning target detection
CN115829915A (en) Image quality detection method, electronic device, storage medium, and program product
Wang et al. Intrusion detection for high-speed railways based on unsupervised anomaly detection models
Sureshkumar et al. Deep learning framework for component identification
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment
CN114283087A (en) Image denoising method and related equipment
WO2023069085A1 (en) Systems and methods for hand image synthesis
Muhamad et al. A comparative study using improved LSTM/GRU for human action recognition
CN109643390A (en) The method of object detection is carried out in digital picture and video using spike neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination