CN112702623A - Video processing method, device, equipment and storage medium


Info

Publication number: CN112702623A
Authority: CN (China)
Prior art keywords: video, model, detection, detection result, detected
Legal status: Pending (assumed; not a legal conclusion)
Application number: CN202011506996.5A
Other languages: Chinese (zh)
Inventors: 陈建蓉, 黄启军, 唐兴兴, 陈瑞钦, 杨强
Current and original assignee: WeBank Co Ltd
Application filed by WeBank Co Ltd; priority to CN202011506996.5A; publication of CN112702623A

Classifications

    • H04N21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/233: Processing of audio elementary streams
    • H04N21/2743: Video hosting of uploaded data from client

(All under H04N21/00, Selective content distribution, e.g. interactive television or video on demand [VOD].)

Abstract

The present disclosure provides a video processing method, apparatus, device and storage medium. In the method, a first device extracts feature information of a video to be detected, the feature information indicating at least one of image features and audio features of the video; the first device then inputs the feature information into a video detection model to obtain a detection result indicating whether the video is compliant. The detection result of the video to be detected is determined from the sub-detection result corresponding to each feature. The video detection model of the first device is determined from encrypted model parameters transmitted by at least one second device, and each second device adopts the same model structure as the first device. Under this scheme, the video detection model of the first device learns from the model parameters of other devices, acquiring their video detection capability without obtaining their local video resources and thereby improving the video detection capability of the first device.

Description

Video processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a video processing method, apparatus, device, and storage medium.
Background
With the continuous development of internet technology, in addition to the massive volume of video provided by large video websites, users can also upload videos themselves through short-video applications or stream live video through live-broadcast applications. Every large video website or platform must review its video content to prevent illegal videos, including those involving terrorism, violence, pornography and politically sensitive material, from spreading on the network.
Faced with such huge volumes of video data, traditional manual video review is no longer practical. Its main defects are high cost, low review efficiency, and detection errors, caused by differences in human understanding, reviewer fatigue and the like, that cannot be eliminated. With the development of deep learning technology, recognizing video data intelligently with a deep learning model can improve video recognition accuracy.
Deep learning relies on data sets; for example, improvements in image recognition accuracy rely on large-scale still-image data sets. For video review, however, the objects to be recognized cannot be obtained publicly, because illegal videos may not be published. A public large-scale data set therefore cannot be constructed, which leads to small data volumes and poor data quality, so the recognition accuracy for illegal videos cannot be improved.
Disclosure of Invention
The disclosure provides a video processing method, a video processing apparatus, video processing equipment and a storage medium, which are used to improve the recognition accuracy for illegal videos.
In a first aspect, the present disclosure provides a video processing method, including:
extracting feature information of a video to be detected, wherein the feature information is used for indicating at least one of image features and audio features in the video to be detected;
inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected, wherein the detection result is used for indicating whether the video to be detected is in compliance;
the detection result of the video to be detected is determined according to the sub-detection result corresponding to each feature; the model parameters of the video detection model are determined according to the model parameters transmitted by at least one second device, the model parameters transmitted by each second device are parameters subjected to encryption processing, and the model structure adopted by each second device is the same as that adopted by the first device.
In an embodiment of the present disclosure, if there is a sub-detection result corresponding to at least one feature indicating non-compliance, the detection result of the video to be detected is non-compliance.
In an embodiment of the present disclosure, the feature information includes audio features in the video to be detected, and the video detection model includes an audio detection submodel, where the audio detection submodel is used to detect whether audio data in the video is compliant;
the inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected comprises:
inputting the audio features of the target object into the audio detection submodel to obtain a first sub detection result corresponding to the audio features;
and determining the detection result of the video to be detected according to the first sub-detection result.
In one embodiment of the present disclosure, the feature information includes a pose feature of the target object in the video image, and the video detection model includes a pose detection sub-model, which is used to detect whether the pose of the target object in the video is compliant;
the inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected comprises:
inputting the attitude characteristics of the target object into the attitude detection submodel to obtain a second sub detection result corresponding to the attitude characteristics;
and determining the detection result of the video to be detected according to the second sub-detection result.
In an embodiment of the present disclosure, the feature information includes gait features of the target object in the video image, the video detection model includes a gait detection sub-model, and the gait detection sub-model is used to detect whether the gait of the target object in the video is normal;
the inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected comprises:
inputting the gait characteristics of the target object into the gait detection submodel to obtain a third sub detection result corresponding to the gait characteristics;
and determining the detection result of the video to be detected according to the third sub-detection result.
In one embodiment of the present disclosure, the feature information includes facial features of the target object in the video image, and the video detection model includes a face detection sub-model for detecting whether a facial expression of the target object in the video is normal;
the inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected comprises:
inputting the facial features of the target object into the facial detection submodel to obtain a fourth sub-detection result corresponding to the facial features;
and determining the detection result of the video to be detected according to the fourth sub-detection result.
In one embodiment of the present disclosure, the feature information includes text features in a video image, and the video detection model includes a text detection submodel, and the text detection submodel is used to detect whether illegal text content exists in the video image;
the inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected comprises:
inputting the text features into the text detection sub-model to obtain a fifth sub-detection result corresponding to the text features;
and determining the detection result of the video to be detected according to the fifth sub-detection result.
In one embodiment of the present disclosure, the method further comprises:
receiving model parameters from the at least one second device, wherein the model parameters of each second device are parameters subjected to fully homomorphic or partially homomorphic encryption processing;
determining model parameters of the first device according to the model parameters of the at least one second device.
In one embodiment of the present disclosure, the determining the model parameters of the first device according to the model parameters of the at least one second device includes:
and updating the model parameters of the first equipment according to the model parameters of the at least one second equipment and the current model parameters of the first equipment.
In one embodiment of the present disclosure, the method further comprises:
sending the model parameters of the first device to the at least one second device.
In a second aspect, the present disclosure provides a video processing method applied to a second device, the method including:
training a video detection model of the second device according to a training video sample and the labeling result of the training video sample;
obtaining model parameters of the video detection model;
carrying out encryption processing on the model parameters;
sending the encrypted model parameters to the first device; the first device and the second device adopt the same model structure.
In an embodiment of the present disclosure, the encrypting the model parameter includes:
and encrypting the model parameters by adopting a fully homomorphic or partially homomorphic encryption algorithm.
In a third aspect, the present disclosure provides a video processing apparatus comprising:
the characteristic extraction module is used for extracting characteristic information of the video to be detected, and the characteristic information is used for indicating at least one of image characteristics and audio characteristics in the video to be detected;
the processing module is used for inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected, and the detection result is used for indicating whether the video to be detected is in compliance;
the detection result of the video to be detected is determined according to the sub-detection result corresponding to each feature; the model parameters of the video detection model are determined according to the model parameters transmitted by at least one second device, the model parameters transmitted by each second device are parameters subjected to encryption processing, and the model structure adopted by each second device is the same as that adopted by the first device.
In a fourth aspect, the present disclosure provides a video processing apparatus comprising:
the processing module is used for training a video detection model of the second device according to a training video sample and the labeling result of the training video sample;
the acquisition module is used for acquiring the model parameters of the video detection model;
the processing module is further used for carrying out encryption processing on the model parameters;
the sending module is used for sending the encrypted model parameters to the first device; the first device and the second device adopt the same model structure.
In a fifth aspect, the present disclosure provides an electronic device comprising:
memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the method according to any one of the first aspects of the disclosure.
In a sixth aspect, the present disclosure provides an electronic device comprising:
memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the method according to any one of the second aspects of the disclosure.
In a seventh aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any one of the first aspects of the present disclosure.
In an eighth aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any one of the second aspects of the present disclosure.
In a ninth aspect, the present disclosure provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the first aspects of the present disclosure.
In a tenth aspect, the present disclosure provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to the second aspect of the present disclosure.
The embodiments of the disclosure provide a video processing method, apparatus, device and storage medium. In the method, the first device extracts feature information of a video to be detected, the feature information indicating at least one of image features and audio features of the video; the first device then inputs the feature information into a video detection model to obtain a detection result indicating whether the video is compliant. The detection result is determined from the sub-detection result corresponding to each feature. The video detection model of the first device is determined from model parameters transmitted by at least one second device; each second device's transmitted parameters are encrypted, and each second device adopts the same model structure as the first device. Under this scheme, the video detection model of the first device learns from the model parameters of other devices, acquiring their video detection capability without obtaining their local video resources and thereby improving the video detection capability of the first device.
Drawings
Fig. 1 is a schematic view of a scene of a video processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic view of another scene of a video processing method according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a video processing method according to an embodiment of the disclosure;
FIG. 4 is a schematic diagram of video processing of a multi-modal video detection model provided by an embodiment of the present disclosure;
fig. 5 is an interaction diagram of a video processing method according to an embodiment of the present disclosure;
fig. 6 is a first block diagram illustrating a structure of a video processing apparatus according to an embodiment of the disclosure;
fig. 7 is a block diagram of a second structure of a video processing apparatus according to an embodiment of the disclosure;
fig. 8 is a first block diagram of an electronic device according to an embodiment of the present disclosure;
fig. 9 is a block diagram of a second structure of an electronic device according to an embodiment of the present disclosure.
The objects, features, and advantages of the present disclosure will be further explained with reference to the accompanying drawings.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
With the continuous development of internet technology, in addition to the massive volume of video provided by large video websites, users can upload videos themselves through short-video applications or stream live video through live-broadcast applications. At present, large video websites and social platforms generally review video content in a combined human-machine mode in which manual review accounts for a large proportion; this mode has low review efficiency and low accuracy.
With the continuous development of deep learning technology, each large video website and social platform builds a video detection model on the video data of its own platform and achieves intelligent video detection by training that model. During modeling, manually reviewed data can serve as the basic data (or training data) of the model. Because this basic data may include illegal content involving terrorism, violence, pornography and politically sensitive material, it cannot be disclosed, so the illegal video data accumulated by each large video website and social platform forms an isolated data island. Moreover, retaining such data itself carries a compliance risk, so each large video website and social platform must delete illegal videos promptly to prevent them from spreading again.
It follows from the above that, because no single large video website or social platform can construct a large-scale video data set, the detection accuracy of each platform's preset video detection model is low and the legality of user-uploaded videos cannot be identified accurately, allowing illegal videos to spread rapidly on the network.
In order to solve the above problem, the present disclosure provides a video processing method that can improve the detection accuracy for illegal videos. The main idea is as follows: given the independence of the video data held by each large video website and social platform, each party trains a video detection model on its own platform; because the training data sets differ, the model parameters of each party's video detection model also differ. The inventors drew on federated learning, which allows participants to model jointly using other parties' data while no party can directly obtain the others' video data resources. By having each party exchange encrypted model parameters, the scheme lets the participants in the federated cooperation obtain one another's model parameter data while guaranteeing that video data never leaves its platform; the model parameters of each party's video detection model are updated through computation on the encrypted values, improving the video detection accuracy of every party's model.
Before introducing the technical solution provided by the embodiment of the present disclosure, an application scenario of video processing provided by the embodiment of the present disclosure is briefly described.
Fig. 1 is a schematic view of a scene of a video processing method according to an embodiment of the present disclosure. As shown in fig. 1, the scene includes a plurality of servers, for example servers 11 to 14. Server 14 is connected to servers 11, 12 and 13; servers 11, 12 and 13 represent servers of large video websites or social platforms, and server 14 represents a server of a public security system. The authority of server 14 is higher than that of servers 11, 12 and 13, which is reflected as follows: server 14 can obtain illegal video resources from servers 11, 12 and 13, which send the encrypted illegal videos to server 14; to prevent the illegal videos from spreading again, servers 11, 12 and 13 delete their local copies after sending them to server 14 and do not retain them.
In this embodiment, the servers 11, 12, and 13 can use the video resources on their respective platforms to train the video detection models, and the video resources of the servers are different, so the detection effects of the trained video detection models are also different. The server 14 can obtain the encrypted model parameters of each server from the servers 11, 12, and 13, perform data processing on the model parameters of the servers, and then return the model parameters after data processing to each server, so that each server can learn the detection capability of other servers, thereby improving the video detection accuracy of each server.
Fig. 2 is a schematic view of another scenario of a video processing method provided by an embodiment of the present disclosure. As shown in fig. 2, the scenario includes a plurality of servers, for example servers 11 to 14, which are connected to each other; they represent servers of large video websites or social platforms, and their permissions are equal. Unlike the scenario shown in fig. 1, the servers in this embodiment do not share video resources; that is, a video resource never leaves the server that holds it. Any server in fig. 2 extracts the various features of an illegal video and performs model training on them; to prevent the illegal video from spreading again, the server deletes its local copy after extracting the feature information and does not retain it.
Similar to the above embodiment, the servers 11, 12, 13 and 14 may all train video detection models using the video resources on their respective platforms; since their video resources differ, the detection effects of the trained models also differ. The servers exchange encrypted model parameters with one another: for example, servers 11, 12 and 13 each broadcast their encrypted model parameters to the other servers; server 14 receives the encrypted model parameters sent by servers 11, 12 and 13 and can process them together with its own model parameters to update its model parameters, thereby learning the detection capability of the other servers and improving its own video detection accuracy.
The technical solution of the present disclosure is explained in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 3 is a schematic flowchart of a video processing method according to an embodiment of the present disclosure, and the video processing method according to this embodiment may be applied to any device that executes the video processing method, for example, any server shown in fig. 1 or fig. 2.
For convenience of description, the scheme will be described below with the first device as the execution subject. As shown in fig. 3, the video processing method proposed in this embodiment includes the following steps:
step 101, the first device extracts feature information of a video to be detected, where the feature information indicates at least one of image features and audio features of the video to be detected.
In this embodiment, the video to be detected includes both image data and audio data. Accordingly, the first device first needs to extract the image features and/or audio features of the video to be detected; that is, the feature information of the video consists of its image features and/or audio features.
Which image features are available differs from video to video according to what the images display. For example, if the video to be detected records a person's movement, the first device may extract the person's facial features, posture features, gait features and the like. Illustratively, if the video includes text information such as subtitles or bullet-screen comments, the first device may further extract text features from the video images. These features are determined from the image data in the video and belong to the image features of the video.
For the audio content of the video, the first device may extract features of the background sound, or features of sounds made by people or animals in the video; these are determined from the audio data in the video.
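For illustration only, the following is a minimal Python sketch of such feature extraction. It assumes OpenCV and librosa are available and that the audio track has already been demuxed to a separate file; the frame-sampling interval and the MFCC settings are illustrative assumptions, not part of the method itself.

```python
# Minimal sketch of step 101's feature extraction. Assumptions: OpenCV and
# librosa are installed, and the audio track has been demuxed to a .wav file
# beforehand; sampling interval and MFCC settings are illustrative only.
import cv2
import librosa

def extract_image_frames(video_path, every_n=30):
    """Sample one frame out of every `every_n` as raw image-feature input."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def extract_audio_features(audio_path):
    """MFCCs as a simple stand-in for the audio features of the video."""
    y, sr = librosa.load(audio_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
```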
Step 102, the first device inputs the feature information into a video detection model to obtain a detection result of the video to be detected, where the detection result indicates whether the video is compliant.
The detection result of the video to be detected is determined from the sub-detection result corresponding to each feature. The model parameters of the first device's video detection model are determined from model parameters transmitted by at least one second device, and each second device has the same model structure as the first device. That the model structures of the video detection models in the first device and the second devices are the same means that the models have the same layers, the same number of parameters in each layer, and the same processing procedure. Each device's video detection model corresponds to its model parameters: once the model parameters are determined, the model is determined accordingly.
The second device may send its model parameters to the first device, which may also send its model parameters to the second device. Optionally, in order to ensure the security of the data, the data may be encrypted during the interaction between the devices. For example, the second device may encrypt the model parameters and send the encrypted model parameters to the first device, so as to ensure secure transmission of data.
The model parameters refer to any parameters for determining the model, including but not limited to the number of model layers, parameters corresponding to each layer, and the like.
In one example, the model parameters may include direct parameters of the model. Taking a neural network model as the video detection model, the network may include several kinds of layers such as convolutional layers, normalization layers and fully connected layers, and the model parameters may be the parameters corresponding to each layer. Suppose the neural network model includes three convolutional layers, each with a corresponding convolution matrix: input data is operated on by a convolutional layer and its convolution matrix, and the resulting output is passed to the next layer for further computation. In this case, the model parameters of the neural network model may include the parameters of the three convolutional layers, that is, the convolution matrices corresponding to each of them.
In another example, the model parameters may include any other quantities from which the direct parameters of the model can be determined. For example, when the neural network model is trained with a gradient descent algorithm, the model parameters exchanged between devices may include the gradient information obtained during training, from which direct parameters such as the corresponding convolution matrices can be determined.
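As a minimal sketch of the distinction between direct parameters and gradient information, assuming PyTorch; the three-convolutional-layer model mirrors the example above and the layer sizes are illustrative:

```python
# Sketch of "direct parameters" versus gradient information. Assumption:
# PyTorch; the three-convolutional-layer model mirrors the example above.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.ReLU(),
    nn.Conv2d(16, 32, 3), nn.ReLU(),
    nn.Conv2d(32, 64, 3),
)

# Direct parameters: the convolution matrices (and biases) of the three layers.
direct_params = {name: p.detach().clone() for name, p in model.named_parameters()}

# Gradient information produced during training (one dummy step shown here);
# the corresponding direct parameters can be recovered by applying the update.
loss = model(torch.randn(1, 3, 32, 32)).sum()
loss.backward()
gradients = {name: p.grad.clone() for name, p in model.named_parameters()}
```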
In this embodiment, the first device is in communication connection with at least one second device, and usually, video resources between the devices are not shared, that is, the video resources are not transmitted between the devices. It should be noted that, if the authority of the first device is higher than the authority of each second device, each second device may send the encrypted video resource to the first device, for example, the first device is a server of a public security system, and the second device is a server of each large video website or a social platform.
In this embodiment, the video detection model may be a single-mode model or a multi-mode model. The single-modality model is used for processing and understanding single-modality information, and the multi-modality model is used for processing and understanding multi-source modality information.
In one example, if the video detection model of the first device is a single-mode model, the video detection model of the second device is also a single-mode model; if the video detection model of the first device is a multi-modal model, the video detection model of the second device may be a single-modal model or a multi-modal model. If the video detection model of the first device is a multi-modal model, and the video detection model of the second device is also a multi-modal model, the number of the modal models of the first device and the second device may be the same or different.
For example, if the permissions of the first device and the second device are equal, the video detection models of the first device and the second device are both multi-modal models, the video detection model of the first device includes three modalities, and the video detection model of the second device also includes three modalities. Illustratively, if the authority of the first device is higher than the authority of each second device, the video detection model of the first device includes four modalities, and the video detection model of the second device may contain only three modalities.
In one example, the second device adopts the same model structure as the first device, and the two models also perform the same function. For example, the models of the second device and the first device both include a convolutional layer, a normalization layer and a fully connected layer, and both are used to detect the audio data in a video, or both to detect the image data in a video. A sketch of such a shared structure is given below.
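The following is a minimal sketch of a shared model structure, assuming PyTorch; the layer sizes are illustrative. Both the first and second devices would instantiate the identical structure so that their parameters correspond layer by layer:

```python
# Sketch of a shared model structure that both the first and second devices
# instantiate identically, so their parameters correspond layer by layer.
# Assumption: PyTorch; the layer sizes are illustrative.
import torch.nn as nn

def build_shared_detector(num_classes=2):
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3),   # convolutional layer
        nn.BatchNorm2d(16),                # normalization layer
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(16, num_classes),        # fully connected layer
    )
```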
In one embodiment of the present disclosure, the video detection model of the first device may be one of an audio detection model, a posture detection model, a gait detection model, a face detection model, and a text detection model, that is, the video detection model of the first device may be a single-modality-based detection model for detecting whether a certain specific feature in the video is compliant.
If the video detection model is an audio detection model, the feature information extracted in step 101 includes the audio features of the video to be detected. The first device inputs the audio features into the audio detection model to obtain a detection result corresponding to the audio features, the detection result indicating whether the audio data in the video is compliant. For example, if the audio data is detected to contain politically sensitive, pornographic or violent speech, the first device may determine that the video contains illegal content.
If the video detection model is a pose detection model, the feature information extracted in step 101 includes pose features of target objects in the video, where the target objects include people, animals, and other objects. And the first equipment inputs the attitude characteristics into the attitude detection model to obtain a detection result corresponding to the attitude characteristics, and the detection result is used for indicating whether the attitude of the target object in the video is in compliance or not. For example, if the detection result indicates that the pose of the target person is not compliant, the first device may determine that the video has illegal contents related to pornography, violence, and the like.
If the video detection model is a gait detection model, the feature information extracted in step 101 includes gait features of target objects in the video, where the target objects include people, animals and the like. The first device inputs the gait features into the gait detection model to obtain a detection result corresponding to the gait features, the detection result indicating whether the gait of the target object in the video is normal. For example, if the detection result indicates that a target animal's gait is abnormal, for example it falls or flinches, the first device may determine that the video contains content such as animal abuse.
If the video detection model is a face detection model, the feature information extracted in step 101 includes the facial features of the target object in the video, where the target object mainly refers to a person. The first device inputs the facial features into the face detection model to obtain a detection result corresponding to the facial features, the detection result indicating whether the facial expression of the target object in the video is normal. For example, if the detection result indicates that the facial expression of a target person such as a child or an elderly person is abnormal, for example an expression of pain, the first device may determine that the video contains violent content.
If the video detection model is a text detection model, the feature information extracted in step 101 includes text features of the video images. The first device inputs the text features into the text detection model to obtain a detection result corresponding to the text features, the detection result indicating whether illegal text content exists in the video. For example, if the detection result indicates that a text bullet screen or text overlay in the video involves politically sensitive, pornographic or violent content, the first device may determine that the video spreads illegal content through text.
In another embodiment of the disclosure, the video detection model of the first device is a multi-modal based detection model for detecting whether at least two features in the video are compliant. Namely, the video detection model of the first device comprises at least two of an audio detection submodel, a posture detection submodel, a gait detection submodel, a face detection submodel and a character detection submodel. In this embodiment, the input of the video detection model is feature information, which may include image features and audio features, where the image features specifically include at least one of pose features, gait features, and facial features of a target object in a video image, and text features in the video image.
Illustratively, fig. 4 is a schematic diagram of video processing by a multi-modal video detection model provided by an embodiment of the present disclosure. As shown in fig. 4, the multi-modal video detection model includes an audio detection sub-model, a posture detection sub-model, a gait detection sub-model, a face detection sub-model and a text detection sub-model. The audio detection sub-model detects whether the audio data in a video is compliant, the posture detection sub-model detects whether the posture of the target object in the video is compliant, the gait detection sub-model detects whether the gait of the target object is normal, the face detection sub-model detects whether the facial expression of the target object is normal, and the text detection sub-model detects whether illegal text content exists in the video images.
Specifically, the first device inputs the audio features of the target object into the audio detection sub-model to obtain a first sub-detection result corresponding to the audio features; inputs the posture features of the target object into the posture detection sub-model to obtain a second sub-detection result; inputs the gait features of the target object into the gait detection sub-model to obtain a third sub-detection result; inputs the facial features of the target object into the face detection sub-model to obtain a fourth sub-detection result; and inputs the text features into the text detection sub-model to obtain a fifth sub-detection result. The final video detection result is then determined from these five sub-detection results and serves as the output of the multi-modal video detection model.
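A minimal sketch of this combination step follows. It assumes each sub-model is a callable that returns True for "compliant" and False for "non-compliant", and that features and sub-models are passed as dicts keyed by modality name; this interface is an illustrative assumption.

```python
# Sketch of the multi-modal decision of fig. 4. Assumptions: each sub-model
# is a callable returning True ("compliant") or False ("non-compliant"), and
# both dicts are keyed by modality name ("audio", "pose", "gait", ...).
def detect_video(features, submodels):
    """Run each modality's sub-model, then combine the sub-detection results."""
    sub_results = {
        modality: submodel(features[modality])
        for modality, submodel in submodels.items()
        if modality in features
    }
    # Per this embodiment, the video is non-compliant as soon as any single
    # sub-detection result indicates non-compliance.
    compliant = all(sub_results.values())
    return compliant, sub_results
```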
It should be noted that the sub-detection model provided in this embodiment does not exhaust all possible sub-models, and other sub-detection models or more sub-detection models may be set according to the actual application requirements, so that the video detection model has other detection capabilities or more detection capabilities.
In addition, it should be noted that the execution sequence of the steps in the above embodiments is not limited to the sequence defined above, and those skilled in the art may make any configuration according to the specific application requirement and design requirement.
In an embodiment of the present disclosure, if the video detection model is a single-mode model, a detection result corresponding to the single-mode feature is used as a detection result of the video to be detected.
In an embodiment of the present disclosure, if the video detection model is a multi-modal model, the detection result of the video to be detected is determined according to the sub-detection result corresponding to each modal feature. In one example, if there is a sub-detection result corresponding to at least one modal feature indicating non-compliance, the detection result of the video to be detected is determined to be non-compliance.
In the video processing method provided by this embodiment, the video detection model in the first device is used to detect the video to be detected; this model may be a single-modal or a multi-modal model. The model parameters of the first device's video detection model are not obtained from training on local video resources alone: they also draw on the model parameters of at least one second device connected to the first device, where the second device's parameters are trained on its own local video resources and video resources are never shared between devices. In other words, the first device's video detection model also learns from the model parameters of other devices, which improves the video detection capability of the first device. Through this video detection model, the final detection result of the video is determined from the detection results corresponding to one or more features of the video to be detected, improving both the efficiency and the accuracy of video detection.
The above embodiments disclose that the first device determines the model parameters of the first device from the model parameters of the at least one second device, i.e. the first device updates the model parameters of the first device by learning the model parameters of the at least one second device. The following describes the model parameter learning process of the first device in detail with reference to a specific embodiment.
Fig. 5 is an interaction schematic diagram of a video processing method provided in the embodiment of the present disclosure, and as shown in fig. 5, the video processing method provided in the embodiment includes the following steps:
step 201, the second device trains a video detection model according to the training video sample and the labeling result of the training video sample.
In practice there may be multiple second devices, each of which may transmit its model parameters to the first device following the steps of this embodiment. The first device may likewise send its own model parameters to the second devices, so that the devices learn model parameters from one another. Note that this learning process involves only the model parameters; the video resources on the individual devices are never shared.
In this embodiment, the training video samples of the second device all come from the second device itself. Illustratively, the second device is the server of a video website, and its training video samples come from short videos, live videos, recorded videos, film and television works and the like uploaded to the website by users. Labeling of the training video samples is done by the website's labeling personnel, and according to the labeling result the samples fall into two categories: positive video samples, which contain illegal content, and negative video samples, which do not.
Illustratively, the training process of the second device's video detection model includes the following steps (a minimal code sketch follows the list):
1. constructing an initial video detection model;
2. acquiring positive and negative video samples and labeling results of the positive and negative video samples;
3. taking positive and negative video samples as the input of a video detection model, taking the labeling result of the positive and negative video samples as the output of the video detection model, and training the initial video detection model;
4. and stopping the model training if the video detection model meets the convergence condition.
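The following is a minimal training sketch for steps 1 to 4, assuming PyTorch. The `dataset` argument is a hypothetical object yielding (feature tensor, label) pairs with label 1 for positive (violating) samples and 0 for negative samples, and the convergence test on the change in epoch loss is an illustrative choice:

```python
# Minimal training sketch for steps 1-4. Assumptions: PyTorch; `dataset`
# yields (feature_tensor, label) pairs where label 1 marks a positive
# (violating) sample and 0 a negative one; the convergence test on the
# change in epoch loss is an illustrative choice.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_detection_model(model, dataset, epochs=10, lr=1e-3, eps=1e-4):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()   # binary: compliant vs. non-compliant
    prev_loss = float("inf")
    for _ in range(epochs):
        total = 0.0
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x).squeeze(-1), y.float())
            loss.backward()
            opt.step()
            total += loss.item()
        # Step 4: stop once the model meets the convergence condition.
        if abs(prev_loss - total) < eps:
            break
        prev_loss = total
    return model
```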
It should be noted that the training process of the first device's video detection model is similar to that of the second device; the only difference is the training samples, since the positive and negative samples of the first device come from the first device itself. For the rest, refer to the above training process. It should be understood that training the model with different training samples yields model parameters that differ.
Step 202, the second device obtains model parameters of the video detection model and encrypts the model parameters.
And step 203, the second device sends the encrypted model parameters to the first device.
In one embodiment of the present disclosure, the second device may employ a fully homomorphic or partially homomorphic encryption algorithm to encrypt its model parameters.
A fully homomorphic encryption algorithm can perform any homomorphic operation (including homomorphic addition and homomorphic multiplication) on data an unlimited number of times; in other words, it can homomorphically evaluate any function. A partially homomorphic (also called semi-homomorphic) encryption algorithm can only perform an unlimited number of homomorphic additions, or only an unlimited number of homomorphic multiplications.
A fully or partially homomorphic encryption scheme can compute on encrypted data without decrypting it, so the data can be processed without exposing the sensitive source data. With such a scheme, the model parameters of each device can be learned from while their confidentiality is guaranteed.
Of course, this embodiment may also adopt other encryption algorithms for the model parameters, for example the Paillier encryption algorithm, which satisfies additive homomorphism; the embodiment is not limited in this respect. A sketch using Paillier encryption follows.
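As an illustrative sketch of additively homomorphic encryption of model parameters, assuming the open-source `phe` (python-paillier) package; the key length and the parameter values are arbitrary:

```python
# Sketch of additively homomorphic encryption of model parameters with the
# Paillier scheme. Assumption: the open-source `phe` (python-paillier)
# package; key length and parameter values are arbitrary.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

params = [0.12, -0.53, 0.07]                      # flattened model parameters
enc_params = [public_key.encrypt(v) for v in params]

# Additive homomorphism: ciphertexts may be summed (and scaled by plaintext
# constants) without decryption, which is enough to average parameters.
enc_sum = enc_params[0] + enc_params[1] + enc_params[2]
assert abs(private_key.decrypt(enc_sum) - sum(params)) < 1e-9
```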
And step 204, the first device determines the model parameters of the first device according to the model parameters sent by the second device.
In one embodiment of the present disclosure, if the first device is a management device for a plurality of second devices, such as the server 14 shown in fig. 1, the first device may determine its model parameters from the model parameters of at least two second devices. For example, after receiving the model parameters sent by two second devices, the first device may average the corresponding parameters and use the averaged values as its own model parameters. The averaging may cover all of the model parameters or only some of them.
In an embodiment of the present disclosure, if the first device and the second devices are peer devices, that is, there is no managing/managed relationship between them, such as the servers shown in fig. 2, the first device may update its model parameters from at least one second device's model parameters together with its own current model parameters. For example, after the first device receives the model parameters sent by a certain second device, it may average the corresponding parameters with its current parameters and use the averaged values as its latest model parameters. A sketch of this averaging step, carried out directly on encrypted parameters, is given below.
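The following sketch continues the `phe` example above and averages corresponding encrypted parameters across devices; it applies equally to the central aggregation of fig. 1 and the peer update of fig. 2. Key distribution and which party holds the private key are assumptions outside this sketch:

```python
# Sketch of the parameter-averaging step under Paillier encryption,
# continuing the `phe` example above. Assumptions: each device contributes
# a flat list of encrypted parameters of equal length; key distribution is
# outside the scope of this sketch.
def average_encrypted_params(encrypted_param_lists):
    """Average corresponding encrypted parameters across devices.

    The result stays encrypted; only the private-key holder can decrypt,
    so the aggregating device never sees the plain parameter values.
    """
    n = len(encrypted_param_lists)
    averaged = []
    for per_device in zip(*encrypted_param_lists):
        acc = per_device[0]
        for enc in per_device[1:]:
            acc = acc + enc                # homomorphic addition
        averaged.append(acc * (1.0 / n))   # scaling by a plaintext constant
    return averaged
```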
Optionally, based on any of the above embodiments, after the model parameter of the first device is updated, the first device may further return the updated model parameter to each of the second devices, so that each of the second devices optimizes its own model according to the updated model parameter.
The video processing method provided by this embodiment of the application involves a process in which devices learn model parameters from one another. All model parameters exchanged between devices are encrypted, and a device processes the encrypted parameters directly, without decrypting them; the transmission of model parameters between devices therefore remains confidential, and no device can directly obtain the specific model parameters provided by another. This learning process may be called federated learning. Federated learning is a machine learning method that satisfies privacy protection and data security, making artificial intelligence systems more efficient and more accurately adapted to each party's data. At present, applications of federated learning focus mainly on financial and medical data; based on its ideas, this embodiment applies federated learning to video detection scenarios, enabling devices to learn from one another and improving the video detection capability of the model on each device.
Fig. 6 is a block diagram of a video processing apparatus according to an embodiment of the present disclosure. For ease of illustration, only portions that are relevant to embodiments of the present disclosure are shown. As shown in fig. 6, the video processing apparatus 300 according to the present embodiment includes:
the feature extraction module 301 is configured to extract feature information of a video to be detected, where the feature information is used to indicate at least one of an image feature and an audio feature in the video to be detected;
the processing module 302 is configured to input the feature information into a video detection model to obtain a detection result of the video to be detected, where the detection result is used to indicate whether the video to be detected is compliant;
the detection result of the video to be detected is determined according to the sub-detection result corresponding to each feature; the model parameters of the video detection model are determined according to the model parameters transmitted by at least one second device, the model parameters transmitted by each second device are parameters subjected to encryption processing, and the model structure adopted by each second device is the same as that adopted by the first device.
In an embodiment of the present disclosure, the processing module 302 is specifically configured to:
and if the sub-detection result corresponding to the at least one feature indicates that the video to be detected is not compliant, determining that the detection result of the video to be detected is not compliant.
In an embodiment of the present disclosure, the feature information includes audio features in the video to be detected, and the video detection model includes an audio detection submodel, and the audio detection submodel is used to detect whether audio data in the video is compliant.
In one embodiment of the present disclosure, the audio feature of the target object is input into the audio detection submodel, and a first sub-detection result corresponding to the audio feature is obtained; and determining the detection result of the video to be detected according to the first sub-detection result.
In one embodiment of the present disclosure, the feature information includes a pose feature of the target object in the video image, and the video detection model includes a pose detection sub-model, which is used to detect whether a pose of the target object in the video is compliant.
In one embodiment of the present disclosure, the posture feature of the target object is input into the posture detection sub-model, and a second sub-detection result corresponding to the posture feature is obtained; and determining the detection result of the video to be detected according to the second sub-detection result.
In an embodiment of the present disclosure, the feature information includes a gait feature of the target object in the video image, and the video detection model includes a gait detection sub-model, and the gait detection sub-model is used to detect whether a gait of the target object in the video is normal.
In one embodiment of the present disclosure, the gait feature of the target object is input into the gait detection sub-model, and a third sub-detection result corresponding to the gait feature is obtained; and determining the detection result of the video to be detected according to the third sub-detection result.
In one embodiment of the present disclosure, the feature information includes facial features of the target object in the video image, and the video detection model includes a face detection sub-model for detecting whether a facial expression of the target object in the video is normal.
In an embodiment of the present disclosure, the facial features of the target object are input into the face detection sub-model, and a fourth sub-detection result corresponding to the facial features is obtained; and determining the detection result of the video to be detected according to the fourth sub-detection result.
In an embodiment of the present disclosure, the feature information includes text features in a video image, and the video detection model includes a text detection sub-model, and the text detection sub-model is used to detect whether illegal text content exists in the video image.
In an embodiment of the present disclosure, the text feature is input into the text detection submodel, and a fifth sub-detection result corresponding to the text feature is obtained; and determining the detection result of the video to be detected according to the fifth sub-detection result.
In one embodiment of the present disclosure, the video processing apparatus further includes: a receiving module 303 and a sending module 304.
A receiving module 303, configured to receive model parameters from the at least one second device, where the model parameters of each second device are parameters obtained through fully homomorphic or partially homomorphic encryption processing;
the processing module 302 is further configured to determine a model parameter of the first device according to the model parameter of the at least one second device.
In an embodiment of the present disclosure, the processing module 302 is specifically configured to update the model parameter of the first device according to the model parameter of the at least one second device and the current model parameter of the first device.
In an embodiment of the present disclosure, the sending module 304 is configured to send the model parameters of the first device to the at least one second device.
The video processing apparatus provided in the embodiment of the present disclosure is configured to execute the technical solution of the first device in any one of the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 7 is a block diagram of a second structure of a video processing apparatus according to an embodiment of the disclosure. For ease of illustration, only portions that are relevant to embodiments of the present disclosure are shown. As shown in fig. 7, the video processing apparatus 400 provided in the present embodiment includes: a processing module 401, an obtaining module 402 and a sending module 403.
The processing module 401 is configured to train a video detection model of the second device according to a training video sample and an annotation result of the training video sample;
an obtaining module 402, configured to obtain model parameters of the video detection model;
the processing module 401 is further configured to perform encryption processing on the model parameters;
a sending module 403, configured to send the encrypted model parameters to the first device, wherein the first device and the second device adopt the same model structure.
In an embodiment of the present disclosure, the processing module 401 is specifically configured to perform encryption processing on the model parameters by using a fully homomorphic or partially homomorphic encryption algorithm.
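Exemplarily, a matching Python sketch of the second device's side under the same assumptions: train_video_detection_model is a hypothetical stub for local training, and the learned parameters are Paillier-encrypted into the payload that the sending module would transmit. The disclosure does not fix a model family beyond requiring that both devices adopt the same model structure.

from phe import paillier

def train_video_detection_model(samples, labels):
    # Hypothetical stand-in for local training on annotated video samples;
    # returns the learned model parameters as a flat list of floats.
    return [0.12, -0.25, 0.48]

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

params = train_video_detection_model(samples=[], labels=[])
encrypted_params = [public_key.encrypt(w) for w in params]

# A Paillier ciphertext is characterized by its raw ciphertext integer and
# a scaling exponent; this pair is what would be serialized and sent to
# the first device.
payload = [(c.ciphertext(), c.exponent) for c in encrypted_params]
print(len(payload))  # 3 encrypted parameters ready for transmission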
The video processing apparatus provided in the embodiment of the present disclosure is configured to execute the technical solution of the second device in any one of the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 8 is a block diagram of a first structure of an electronic device according to an embodiment of the present disclosure. As shown in fig. 8, the electronic device 500 of the present embodiment may include:
at least one processor 501 (only one processor is shown in FIG. 8); and
a memory 502 communicatively coupled to the at least one processor 501; wherein:
the memory 502 stores a computer program executable by the at least one processor 501; when executed by the at least one processor 501, the computer program enables the electronic device 500 to perform the technical solution of the first device in any of the foregoing method embodiments.
Optionally, the memory 502 may be separate from, or integrated with, the processor 501.
When the memory 502 is a separate device from the processor 501, the electronic device 500 further comprises: a bus 503 for connecting the memory 502 and the processor 501.
The electronic device provided in the embodiment of the present disclosure may execute the technical solution of the first device in any one of the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 9 is a block diagram of a second structure of an electronic device according to an embodiment of the present disclosure. As shown in fig. 9, the electronic device 600 of the present embodiment may include:
at least one processor 601 (only one processor is shown in FIG. 9); and
a memory 602 communicatively coupled to the at least one processor 601; wherein:
the memory 602 stores a computer program executable by the at least one processor 601; when executed by the at least one processor 601, the computer program enables the electronic device 600 to perform the technical solution of the second device in any of the foregoing method embodiments.
Optionally, the memory 602 may be separate from, or integrated with, the processor 601.
When the memory 602 is a separate device from the processor 601, the electronic device 600 further comprises: a bus 603 for connecting the memory 602 and the processor 601.
The electronic device provided in the embodiment of the present disclosure may execute the technical solution of the second device in any of the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the technical solution of the first device in any of the foregoing method embodiments.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the technical solution of the second device in any of the foregoing method embodiments.
An embodiment of the present disclosure further provides a computer program product, which includes a computer program; when executed by a processor, the computer program implements the technical solution of the first device in any of the foregoing method embodiments.
An embodiment of the present disclosure further provides a computer program product, which includes a computer program; when executed by a processor, the computer program implements the technical solution of the second device in any of the foregoing method embodiments.
An embodiment of the present disclosure further provides a chip, including a processing module and a communication interface, wherein the processing module is capable of performing the technical solution of the first device in any of the foregoing method embodiments. Further, the chip may also include a storage module (e.g., a memory) configured to store instructions; the processing module is configured to execute the instructions stored in the storage module, and execution of those instructions causes the processing module to perform the technical solution of the first device in any of the foregoing method embodiments.
An embodiment of the present disclosure further provides a chip, including a processing module and a communication interface, wherein the processing module is capable of performing the technical solution of the second device in any of the foregoing method embodiments. Further, the chip may also include a storage module (e.g., a memory) configured to store instructions; the processing module is configured to execute the instructions stored in the storage module, and execution of those instructions causes the processing module to perform the technical solution of the second device in any of the foregoing method embodiments.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in the present disclosure may be embodied directly in a hardware processor for execution, or executed by a combination of hardware and software modules within the processor.
The memory may comprise a high-speed RAM and may further comprise a non-volatile memory (NVM), such as at least one magnetic disk memory; the memory may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, or the like.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus in the figures of the present disclosure is not limited to only one bus or one type of bus.
The storage medium may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application-Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.

Claims (20)

1. A video processing method, applied to a first device, the method comprising:
extracting feature information of a video to be detected, wherein the feature information is used for indicating at least one of image features and audio features in the video to be detected;
inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected, wherein the detection result is used for indicating whether the video to be detected is in compliance;
the detection result of the video to be detected is determined according to the sub-detection result corresponding to each feature; the model parameters of the video detection model are determined according to the model parameters transmitted by at least one second device, the model parameters transmitted by each second device are parameters subjected to encryption processing, and the model structure adopted by each second device is the same as that adopted by the first device.
2. The method of claim 1, wherein if a sub-detection result corresponding to at least one feature indicates that the video to be detected is not compliant, the detection result of the video to be detected is determined to be non-compliant.
3. The method according to claim 1 or 2, wherein the feature information comprises audio features in the video to be detected, the video detection model comprises an audio detection submodel, and the audio detection submodel is used for detecting whether audio data in the video is in compliance;
the inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected comprises:
inputting the audio features into the audio detection submodel to obtain a first sub-detection result corresponding to the audio features;
and determining the detection result of the video to be detected according to the first sub-detection result.
4. The method according to claim 1 or 2, wherein the feature information comprises a pose feature of a target object in the video image, and the video detection model comprises a pose detection sub-model for detecting whether the pose of the target object in the video is compliant;
the inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected comprises:
inputting the pose feature of the target object into the pose detection sub-model to obtain a second sub-detection result corresponding to the pose feature;
and determining the detection result of the video to be detected according to the second sub-detection result.
5. The method according to claim 1 or 2, wherein the feature information comprises gait features of a target object in a video image, the video detection model comprises a gait detection sub-model, and the gait detection sub-model is used for detecting whether the gait of the target object in the video is normal or not;
the inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected comprises:
inputting the gait features of the target object into the gait detection sub-model to obtain a third sub-detection result corresponding to the gait features;
and determining the detection result of the video to be detected according to the third sub-detection result.
6. The method according to claim 1 or 2, wherein the feature information comprises facial features of a target object in a video image, and the video detection model comprises a face detection sub-model for detecting whether a facial expression of the target object in the video is normal;
the inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected comprises:
inputting the facial features of the target object into the face detection sub-model to obtain a fourth sub-detection result corresponding to the facial features;
and determining the detection result of the video to be detected according to the fourth sub-detection result.
7. The method according to claim 1 or 2, wherein the feature information comprises text features in a video image, and the video detection model comprises a text detection sub-model for detecting whether illegal text content exists in the video image;
the inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected comprises:
inputting the text features into the text detection sub-model to obtain a fifth sub-detection result corresponding to the text features;
and determining the detection result of the video to be detected according to the fifth sub-detection result.
8. The method of claim 1, further comprising:
receiving model parameters from the at least one second device, wherein the model parameters of each second device are parameters subjected to fully homomorphic or partially homomorphic encryption processing;
determining model parameters of the first device according to the model parameters of the at least one second device.
9. The method of claim 8, wherein determining the model parameters of the first device from the model parameters of the at least one second device comprises:
updating the model parameters of the first device according to the model parameters of the at least one second device and the current model parameters of the first device.
10. The method of claim 8, further comprising:
sending the model parameters of the first device to the at least one second device.
11. A video processing method, applied to a second device, the method comprising:
training a video detection model of the second device according to a training video sample and an annotation result of the training video sample;
obtaining model parameters of the video detection model;
carrying out encryption processing on the model parameters;
sending the encrypted model parameters to a first device, wherein the first device and the second device adopt the same model structure.
12. The method of claim 11, wherein the performing encryption processing on the model parameters comprises:
encrypting the model parameters by using a fully homomorphic or partially homomorphic encryption algorithm.
13. A video processing apparatus, comprising:
the device comprises a characteristic extraction module, a video detection module and a video processing module, wherein the characteristic extraction module is used for extracting characteristic information of a video to be detected, and the characteristic information is used for indicating at least one of image characteristics and audio characteristics in the video to be detected;
the processing module is used for inputting the characteristic information into a video detection model to obtain a detection result of the video to be detected, and the detection result is used for indicating whether the video to be detected is in compliance;
the detection result of the video to be detected is determined according to the sub-detection result corresponding to each feature; the model parameters of the video detection model are determined according to the model parameters transmitted by at least one second device, the model parameters transmitted by each second device are parameters subjected to encryption processing, and the model structure adopted by each second device is the same as that adopted by the first device.
14. A video processing apparatus, comprising:
a processing module, configured to train a video detection model of a second device according to a training video sample and an annotation result of the training video sample;
an obtaining module, configured to obtain model parameters of the video detection model;
the processing module is further configured to perform encryption processing on the model parameters;
a sending module, configured to send the encrypted model parameters to a first device, wherein the first device and the second device adopt the same model structure.
15. An electronic device, characterized in that the electronic device comprises: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method of any one of claims 1-10.
16. An electronic device, characterized in that the electronic device comprises: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method of claim 11 or 12.
17. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the method of any one of claims 1-10.
18. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the method of claim 11 or 12.
19. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the method of any one of claims 1-10.
20. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the method of claim 11 or 12.
CN202011506996.5A 2020-12-18 2020-12-18 Video processing method, device, equipment and storage medium Pending CN112702623A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011506996.5A CN112702623A (en) 2020-12-18 2020-12-18 Video processing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112702623A true CN112702623A (en) 2021-04-23

Family

ID=75507524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011506996.5A Pending CN112702623A (en) 2020-12-18 2020-12-18 Video processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112702623A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113518007A (en) * 2021-07-06 2021-10-19 华东师范大学 Multi-internet-of-things equipment heterogeneous model efficient mutual learning method based on federal learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263908A (en) * 2019-06-20 2019-09-20 深圳前海微众银行股份有限公司 Federal learning model training method, equipment, system and storage medium
CN110797124A (en) * 2019-10-30 2020-02-14 腾讯科技(深圳)有限公司 Model multi-terminal collaborative training method, medical risk prediction method and device
CN110798703A (en) * 2019-11-04 2020-02-14 云目未来科技(北京)有限公司 Method and device for detecting illegal video content and storage medium
CN111131771A (en) * 2019-12-12 2020-05-08 中国科学院深圳先进技术研究院 Video monitoring system
CN111899076A (en) * 2020-08-12 2020-11-06 科技谷(厦门)信息技术有限公司 Aviation service customization system and method based on federal learning technology platform


Similar Documents

Publication Publication Date Title
CN110555357A (en) data security sensor system
US11599669B2 (en) Image distribution using composite re-encrypted images
US11341351B2 (en) Methods and apparatus for facial recognition on a user device
CN110826420A (en) Training method and device of face recognition model
CN112802138A (en) Image processing method and device, storage medium and electronic equipment
EP3725079B1 (en) Securing digital data transmission in a communication network
Luo et al. Anonymous subject identification and privacy information management in video surveillance
JP7236042B2 (en) Face Recognition Application Using Homomorphic Encryption
US20220004652A1 (en) Providing images with privacy label
CN112702623A (en) Video processing method, device, equipment and storage medium
US20210099772A1 (en) System and method for verification of video integrity based on blockchain
CN115134080B (en) Data transmission method and device based on security encryption chip
CN115114667A (en) Privacy information processing and classifying method and device for security chip
CN116383793A (en) Face data processing method, device, electronic equipment and computer readable medium
US20150358318A1 (en) Biometric authentication of content for social networks
US20160308872A1 (en) Image-based group profiles
CN103390121A (en) Digital work ownership authentication method and digital work ownership authentication system
WO2022089220A1 (en) Image data processing method and apparatus, device, storage medium, and product
Yadav et al. Mobile Forensics challenges and admissibility of electronic evidences in India
CN113011254A (en) Video data processing method, computer equipment and readable storage medium
CN115277083B (en) Data transmission control method, device, system and computer equipment
Krishna et al. Security of Image for Internet of Things Device (Camera)
CN115242548B (en) Private data directional encryption method and device, electronic equipment and storage medium
CN110505285B (en) Park session method and related device
CN117041641A (en) Remote interrogation picture control analysis method and system based on blockchain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination