CN112990273A - Compressed domain-oriented video sensitive character recognition method, system and equipment - Google Patents

Compressed domain-oriented video sensitive character recognition method, system and equipment

Info

Publication number
CN112990273A
CN112990273A (application CN202110190037.5A)
Authority
CN
China
Prior art keywords
face
video
modal
compressed domain
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110190037.5A
Other languages
Chinese (zh)
Other versions
CN112990273B (en)
Inventor
刘雨帆
李兵
胡卫明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110190037.5A priority Critical patent/CN112990273B/en
Publication of CN112990273A publication Critical patent/CN112990273A/en
Application granted granted Critical
Publication of CN112990273B publication Critical patent/CN112990273B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of image recognition, and particularly relates to a compressed domain-oriented method, system and equipment for recognizing sensitive persons in video, aiming at solving the inefficiency and resource waste of existing sensitive-person recognition methods. The invention comprises the following steps: partially decoding the video to be detected to obtain compressed-domain multi-modal information; performing detection and calibration on the compressed-domain multi-modal information; passing the calibrated compressed-domain face multi-modal information through a trained multi-modal face recognition network to obtain multi-modal face features; and comparing the multi-modal face features against a sensitive-face feature library to determine whether a sensitive face is present. The compressed-domain face multi-modal information is processed by an I branch, an MV branch and a Res branch, which extract different features that are then fused into a single multi-modal face feature. Because the invention completes feature extraction with only partial decoding, it solves the inefficiency and resource waste of the prior art while retaining high recognition accuracy.

Description

Compressed domain-oriented video sensitive character recognition method, system and equipment
Technical Field
The invention belongs to the field of image recognition, and particularly relates to a compressed domain-oriented method, system and equipment for recognizing sensitive persons in video.
Background
In the field of video security, sensitive-person identification is a critical piece of work. The existing approach fully decodes the target video into video frames, performs face detection and face feature extraction on every frame, and finally compares the extracted features one by one against the face features of sensitive persons, judging whether the video contains a sensitive person by means of a preset threshold. This kind of process has two distinct disadvantages. First, full video decoding places heavy demands on computing resources and computing time, so the method can hardly run on mobile terminal devices, and its running time is long even on a cloud server. Second, the method extracts features independently for each face in the video, ignoring the fact that faces in adjacent frames usually belong to the same identity; a large amount of computation is therefore repeated during feature extraction, further wasting computing resources and consuming computing time.
Disclosure of Invention
In order to solve the above problems in the prior art, namely the inefficiency and resource waste of existing sensitive-person identification methods, the present invention provides a compressed domain-oriented video sensitive-person recognition method, comprising:
Step S100, partially decoding the video to be detected using FFmpeg and C++, and extracting the compressed-domain multi-modal information of the video to be detected; the compressed-domain multi-modal information comprises: I frames, motion vector images, residual images, DCT coefficients and partition depth;
Step S200, performing face detection and face calibration on the compressed-domain multi-modal information to obtain calibrated compressed-domain face multi-modal information;
Step S300, inputting the calibrated compressed-domain face multi-modal information into a trained multi-modal face recognition network to acquire the multi-modal face features of each face in the video to be detected;
Step S400, matching the multi-modal face features against a preset sensitive-person feature library to determine whether the video to be detected contains a sensitive person.
Further, the multi-modal face recognition network comprises an I branch, an MV branch, a Res branch and a multi-modal fusion module:
The I branch is constructed from one of ResNet, InceptionNet or DenseNet; its input is the calibrated I frame, a 3-channel RGB image, and its output is the feature map of the I frame;
The MV branch is constructed from one of ResNet, InceptionNet or DenseNet; its input is the calibrated motion vector image, a 2-channel vector image, and its output is the feature map of the motion vector image;
The Res branch is constructed from one of ResNet, InceptionNet or DenseNet; its input is the calibrated residual image, a 2-channel image, and its output is the feature map of the residual image;
The multi-modal fusion module comprises 3 residual modules connected in parallel, each containing two convolution layers with 3 × 3 kernels; it takes as input the feature maps of the I frame, the motion vector image and the residual image, and outputs the multi-modal face feature vector.
Further, the training method of the multi-modal face recognition network comprises:
Step A100, acquiring a training data set through an offline sample collection method;
Step A200, compressing the training data set to obtain training video compressed-domain information, comprising I frames, motion vector images, residual images, DCT coefficients and partition depth;
Step A300, randomly selecting the training video compressed-domain information of any training datum in the training data set, and acquiring the corresponding training multi-modal face feature vectors through the multi-modal face recognition network;
Step A400, calculating a contrastive loss L based on the training multi-modal face feature vectors;
Step A500, repeating steps A300 to A400 and reducing the contrastive loss L through back-propagation training until the network converges, to obtain the trained multi-modal face recognition network.
Further, the contrastive loss L is:
$$L=\frac{1}{2N}\sum_{n=1}^{N}\Big[\,Y d^{2}+(1-Y)\max(m-d,\,0)^{2}\,\Big]+R(X_{1})+R(X_{2})$$

where

$$d=\left\lVert X_{1}-X_{2}\right\rVert_{2}=\sqrt{\sum_{j=1}^{P}\left(X_{1j}-X_{2j}\right)^{2}}$$

is the Euclidean distance between the sample features X1 and X2, P is the feature dimension of a sample, Y is a label indicating whether the two samples match (Y = 1 means the two samples have the same identity, Y = 0 means they have different identities), m is a preset threshold, N is the number of sample pairs, and R(X1) and R(X2) denote sparse regularization terms.
Further, step S300 includes:
Step S310, performing face detection and face calibration on the compressed-domain multi-modal information to obtain calibrated I frame information;
Step S320, performing face calibration on the residual image and the motion vector image based on the calibrated I frame information, to obtain the calibrated residual image and the calibrated motion vector;
Step S330, taking the calibrated I frame information, the calibrated residual image and the calibrated motion vector together as the calibrated face multi-modal information.
Further, step S400 includes:
Step S410, acquiring the face feature vectors of sensitive persons from preset sensitive-person video data by the method of steps A200-A400;
Step S420, constructing a sensitive-person face feature library from the face feature vectors of the sensitive persons;
Step S430, calculating the cosine similarity between the extracted face features and the sensitive-person face features in the sensitive-person face feature library; when the cosine similarity is greater than a preset threshold T, judging that the video to be detected contains a sensitive person.
Further, the offline sample collection method includes:
Step B100, crawling celebrity videos from the Internet;
Step B200, extracting the multi-modal information of each celebrity video;
Step B300, extracting all celebrity face features from the multi-modal information of the celebrity video through a face recognition algorithm;
Step B400, clustering the celebrity face features with a clustering algorithm, and taking the class containing the most faces as the ID of the celebrity video;
Step B500, repeating steps B100-B400 until the number of processed celebrity videos reaches a preset number, obtaining the training data set.
In another aspect of the present invention, a compressed domain-oriented video sensitive-person recognition system is provided, comprising an information extraction module, a face positioning and calibration module, a feature extraction module and a sensitive-person matching module;
The information extraction module is configured to partially decode the video to be detected using FFmpeg and C++, and extract the compressed-domain multi-modal information of the video to be detected; the compressed-domain multi-modal information comprises: I frames, motion vector images, residual images, DCT coefficients and partition depth;
The face positioning and calibration module is configured to perform face detection and face calibration on the compressed-domain multi-modal information to obtain calibrated compressed-domain face multi-modal information;
The feature extraction module is configured to input the calibrated compressed-domain face multi-modal information into a trained multi-modal face recognition network and acquire the multi-modal face features of each face in the video to be detected;
The sensitive-person matching module is configured to match the multi-modal face features against a preset sensitive-person feature library to determine whether the video to be detected contains a sensitive person.
In a third aspect of the present invention, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the processor to implement the compressed domain-oriented video sensitive-person recognition method described above.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, storing computer instructions for execution by a computer to implement the compressed domain-oriented video sensitive-person recognition method described above.
The invention has the following beneficial effects:
(1) The compressed domain-oriented video sensitive-person recognition method makes full use of the compressed-domain information of the video, ensuring that video face features are extracted quickly and accurately; by recognizing from compressed-domain information instead of fully decoded frames, it greatly reduces the computation of the video sensitive-person detection task.
(2) The method obtains a high-level semantic representation of the face that combines facial spatio-temporal information by fusing the multi-modal compressed-domain face information, so that the multi-modal face recognition network makes effective use of the various kinds of compressed-domain face information, improving the accuracy of sensitive-face detection.
(3) The method expands the magnitude of the training samples by crawling Internet face data, and reduces the proportion of dirty samples in the data set through a sample-cleaning technique, thereby improving the recognition performance of the face recognition network.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of the compressed domain-oriented video sensitive-person recognition method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a multi-modal face recognition network architecture in an embodiment of the invention;
FIG. 3 is a block diagram of a computer system of a server for implementing embodiments of the method, system, and apparatus of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The invention provides a compressed domain-oriented video sensitive-person recognition method that acquires features from compressed-domain information in place of fully decoded video, greatly reducing the computation of the video sensitive-person task.
The compressed domain-oriented video sensitive-person recognition method of the invention comprises steps S100-S400, detailed as follows:
Step S100, partially decoding the video to be detected using FFmpeg and C++, and extracting the compressed-domain multi-modal information of the video to be detected; the compressed-domain multi-modal information comprises: I frames, motion vector images, residual images, DCT coefficients and partition depth;
Step S200, performing face detection and face calibration on the compressed-domain multi-modal information to obtain calibrated compressed-domain face multi-modal information;
Step S300, inputting the calibrated compressed-domain face multi-modal information into a trained multi-modal face recognition network to acquire the multi-modal face features of each face in the video to be detected;
Step S400, matching the multi-modal face features against a preset sensitive-person feature library to determine whether the video to be detected contains a sensitive person.
To solve the problems of existing video sensitive-person recognition, the method provides an efficient, compressed domain-oriented video sensitive-person recognition technique. On the one hand, when processing a video, the method only partially decodes the target video, obtaining the multi-modal information {I frame, motion vector, residual}. Since partial decoding takes only about one tenth of the time of full decoding, the time-consumption problem is greatly mitigated. On the other hand, the method slices the video at I frames and designs a face recognition network for the multi-modal information. Face feature extraction is therefore performed only once per slice (as many times as there are faces detected in the I frame of the current slice); compared with frame-by-frame extraction, this further reduces time consumption and saves operating cost.
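As an illustration of the per-slice saving, the sketch below groups frames into I-frame-anchored slices; the `is_iframe` wrapper attribute is a hypothetical helper, not an API from the patent. Detection and feature extraction then run once per yielded slice rather than once per frame.

```python
def iter_gop_slices(frames):
    """Group partially decoded frames into GOP slices, each led by an I frame.

    `frames` is any iterable of frame objects exposing a boolean `is_iframe`
    attribute (a hypothetical wrapper over the compressed-domain stream).
    Face detection and feature extraction then run once per yielded slice
    instead of once per frame.
    """
    gop = []
    for frame in frames:
        if getattr(frame, "is_iframe", False) and gop:
            yield gop          # a new I frame closes the previous slice
            gop = []
        gop.append(frame)
    if gop:
        yield gop              # flush the final slice
```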
In order to more clearly describe the method for identifying a video sensitive person in a compressed domain according to the present invention, the following will describe each step in the embodiment of the present invention in detail with reference to fig. 1.
The compressed domain-oriented video sensitive-person recognition method of the first embodiment of the invention comprises the following steps S100-S400, described in detail below:
In this embodiment, take a video to be detected in mp4 format as an example. The usual processing means is to decode the video into frame-by-frame images, an operation that consumes a large amount of computing resources. For videos of the same format, the invention accomplishes the person recognition task with only partial decoding, processing the video content in the compressed domain.
Step S100, partially decoding the video to be detected using FFmpeg and C++, and extracting the compressed-domain multi-modal information of the video to be detected; the compressed-domain multi-modal information comprises: I frames, motion vector images, residual images, DCT coefficients and partition depth.
In this embodiment, the open-source tool FFmpeg is used for video decoding, and the compressed-domain information generated during decompression is identified by analyzing the FFmpeg source code.
The I frame carries the key RGB spatial information of the video; the motion vector image carries the video's main motion information; and the residual image carries the boundary information of the moving subjects. The DCT coefficients are produced by applying the discrete cosine transform (DCT) to video frames during encoding; since the DCT filters out high-frequency information and reduces redundancy, the coefficients reflect the texture information of a video frame in the compressed domain. The partition depth arises from video coding: each picture is first divided into macroblocks of different sizes, with the H.264 standard specifying 16 × 16-pixel macroblocks that can be further divided into 16 × 8, 8 × 16 and 8 × 8 sub-macroblocks, and an 8 × 8 sub-macroblock can be divided further into 8 × 4, 4 × 8 and 4 × 4 sub-macroblocks. These partitioning rules constitute the partition depth: the smaller the sub-macroblock, the more drastic the pixel change at that location, i.e. the richer the temporal information carried by the local texture.
The temporal-domain information comprises the motion vector sequence and the residual sequence; the spatial-domain information comprises the I frame information together with the motion vector information and residual information at the current time.
In this embodiment, on the basis of the FFmpeg video coding framework, the encoding and decoding flow of the H.265 code stream was studied and analyzed in depth, covering I frame decoding, entropy decoding of the code stream, inverse quantization and inverse DCT. For a stream whose macroblock prediction carries motion vectors, the prediction mode or motion vector MV of a macroblock and the coded block pattern CBP must be determined before entropy decoding, after which luminance and chrominance are entropy-decoded separately. By studying the FFmpeg source code and writing C++ code for the key decoding steps, unnecessary decoding information and stages are skipped, so that compressed-domain information is extracted efficiently. In addition, since the whole network is trained end to end, mixed compilation of C++ and Python is completed at the engineering level, so that the compressed-domain information extracted from FFmpeg by the C++ code can be consumed directly during training under the PyTorch framework.
While compressed (e.g. in common containers such as AVI or MP4), a video exists as a binary stream. Conventional methods must decode it into frame-by-frame images with a decoder before analysis and processing, a process that consumes much time, especially when the volume of video to be processed is large.
The invention adopts partial decoding in place of full decoding: the binary code stream is entropy-decoded into a readable compressed-domain form, and the compressed-domain information still contains the complete information of the video. Extracting data from the compressed domain for processing reduces decoding time, and the recognition task can be accomplished with a smaller data volume, improving the real-time performance of model processing.
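As a rough Python-level illustration, the sketch below uses PyAV (FFmpeg bindings) to pull I-frame pixels and motion-vector side data from a stream. The `+export_mvs` flag and the side-data API are PyAV features assumed here; note that PyAV still runs the full decoder internally, whereas the patent's C++ hook into FFmpeg's decoding stages skips unnecessary steps altogether.

```python
import av  # PyAV, Python bindings for FFmpeg

def iter_compressed_domain(path):
    """Yield I-frame pixels and motion-vector fields from a video.

    A sketch under the assumption that PyAV's '+export_mvs' decoder flag
    and motion-vector side-data API are available; the patent's pipeline
    instead hooks FFmpeg's C internals from C++ to avoid full decoding.
    """
    container = av.open(path)
    stream = container.streams.video[0]
    stream.codec_context.options = {"flags2": "+export_mvs"}
    for frame in container.decode(stream):
        if frame.key_frame:  # treat key frames as I frames
            yield "I", frame.to_ndarray(format="rgb24")
        else:
            mvs = frame.side_data.get(av.sidedata.sidedata.Type.MOTION_VECTORS)
            if mvs is not None:
                yield "MV", mvs.to_ndarray()  # structured array of motion vectors
    container.close()
```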
The prior art achieves person recognition only by fully decoding the video; no technical scheme has yet been proposed that performs person recognition directly on compressed-domain data as the present invention does, which supports the inventive step of the invention.
The present invention can be implemented in any programming language, without limitation on the means of implementation; any technical solution that achieves the same technical effect using the principles disclosed herein should be considered within the scope of the present application.
Preferably, in this embodiment the invention is implemented on a computer with a 3.2 GHz central processing unit and 8 GB of memory; the training process of the network is implemented under the PyTorch framework, training and testing of the whole network are parallelized on Tesla V100 GPUs, and the working program for compressed-video information extraction is written in the C++ language.
Step S200, carrying out face detection and face calibration on the compressed domain multi-modal information to obtain calibrated compressed domain face multi-modal information;
step S300, inputting the calibrated multi-modal information of the compressed domain face into a trained multi-modal face recognition network, and acquiring multi-modal face features of each face of the video to be detected;
and training a multi-mode face recognition model. Different from the traditional RGB face recognition, the compressed domain information contains three kinds of information, and is a multi-modal face recognition technology. Therefore, aiming at the characteristics of compressed domain information, the invention designs a multi-mode face recognition network structure (which consists of three independent branches and a mode fusion module, namely an I branch corresponding to I frame information processing, an MV branch corresponding to motion vector processing, a Res branch corresponding to residual error processing and a { I, M, Res } fusion module). The invention also designs a multi-modal face recognition model training method based on the network structure, which comprises the steps of establishing a data set and extracting characteristics to train specifically.
In this embodiment, the multi-modal face recognition network includes an I branch, an MV branch, a Res branch, and a multi-modal fusion module, as shown in fig. 2:
the I branch is constructed based on one of ResNet, InceptionNet or DenseNet, and is input as a calibrated I frame and output as a feature map of the I frame; the I frame is a 3-channel RGB image;
the MV branch is constructed based on one of ResNet, InceptionNet or DenseNet, is input into a calibrated motion vector image, and is output as a feature map of the motion vector image; the motion vector image is a 2-channel vector image;
the Res branch is constructed based on one of ResNet, InceptionNet or DenseNet, is input into a calibrated residual image, and is output as a feature map corresponding to the residual image; the residual image is a 2-channel image;
the multi-mode fusion module comprises 3 residual modules connected in parallel, and each residual module comprises two convolution layers with convolution kernels of 3 x 3; the multi-mode fusion module inputs the feature map of the frame I, the feature map corresponding to the motion vector image and the feature map corresponding to the residual image and outputs the feature maps into multi-mode face feature vectors;
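A minimal PyTorch sketch of this architecture follows, using ResNet-18 backbones for the three branches. The exact way the three parallel residual modules combine the branch feature maps is not fully specified in the patent, so the fusion below (one residual module per modality, then pooling, concatenation and a linear projection) is an illustrative assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_branch(in_channels):
    # ResNet-18 backbone with the stem adapted to the modality's channel
    # count; the classifier is dropped so the branch outputs a feature map.
    net = resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    return nn.Sequential(*list(net.children())[:-2])  # keep the conv stages only

class ResidualModule(nn.Module):
    # Residual module with two 3 x 3 convolution layers, as described above.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class MultiModalFaceNet(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.i_branch = make_branch(3)    # calibrated I frame, 3-channel RGB
        self.mv_branch = make_branch(2)   # calibrated motion vector image, 2-channel
        self.res_branch = make_branch(2)  # calibrated residual image, 2-channel
        self.fusion = nn.ModuleList([ResidualModule(512) for _ in range(3)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(3 * 512, feat_dim)  # projection head (assumption)

    def forward(self, i_frame, mv, res):
        maps = (self.i_branch(i_frame), self.mv_branch(mv), self.res_branch(res))
        pooled = [self.pool(blk(m)).flatten(1) for blk, m in zip(self.fusion, maps)]
        return self.proj(torch.cat(pooled, dim=1))  # multi-modal face feature vector
```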
in this embodiment, the training method of the multi-modal face recognition network includes:
because the structure of the multi-modal neural network of the invention needs a large number of training samples, the invention designs an off-line sample acquisition method, which can effectively acquire a large number of samples required by training.
A100, acquiring a training data set by an off-line sample acquisition method;
in this embodiment, the offline sample collection method includes:
Step B100, crawling celebrity videos from the Internet;
Step B200, extracting the multi-modal information of each celebrity video;
Step B300, extracting all celebrity face features from the multi-modal information of the celebrity video through a face recognition algorithm;
Step B400, clustering the celebrity face features with a clustering algorithm, and taking the class containing the most faces as the ID of the celebrity video;
Step B500, repeating steps B100-B400 until the number of processed celebrity videos reaches a preset number, obtaining the training data set. In this embodiment, the preset number is preferably reached by crawling videos of 10,000 celebrity identities. Because celebrity videos crawled from the web may contain a large amount of dirty data (videos of someone else, or videos containing multiple faces), this method effectively eliminates dirty data and avoids the time and labor of manual annotation. Preferably, the KMeans clustering algorithm is used to cluster the features.
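A minimal sketch of the cleaning in steps B300-B400 follows, assuming precomputed face embeddings and using scikit-learn's KMeans; the cluster count is an illustrative choice, as the patent only specifies clustering and keeping the largest class.

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_identity(face_features, n_clusters=5):
    """Keep only the dominant identity among the faces of one crawled video.

    face_features: (num_faces, dim) array of embeddings from a face
    recognition algorithm (step B300). n_clusters = 5 is an illustrative
    choice, not a value from the patent.
    """
    n_clusters = min(n_clusters, len(face_features))
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(face_features)
    keep = labels == np.bincount(labels).argmax()  # largest cluster = video's ID
    return face_features[keep]                     # dirty samples are dropped
```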
Step A200, compressing the training data set to obtain training video compressed-domain information, comprising I frames, motion vector images, residual images, DCT coefficients and partition depth;
Step A300, randomly selecting the training video compressed-domain information of any training datum in the training data set, and acquiring the corresponding training multi-modal face feature vectors through the multi-modal face recognition network;
Step A400, calculating a contrastive loss L based on the training multi-modal face feature vectors;
Step A500, repeating steps A300 to A400 and reducing the contrastive loss L through back-propagation training until the network converges, to obtain the trained multi-modal face recognition network.
In this embodiment, the contrastive loss L is:
$$L=\frac{1}{2N}\sum_{n=1}^{N}\Big[\,Y d^{2}+(1-Y)\max(m-d,\,0)^{2}\,\Big]+R(X_{1})+R(X_{2})$$

where

$$d=\left\lVert X_{1}-X_{2}\right\rVert_{2}=\sqrt{\sum_{j=1}^{P}\left(X_{1j}-X_{2j}\right)^{2}}$$

is the Euclidean distance between the sample features X1 and X2, P is the feature dimension of a sample, Y is a label indicating whether the two samples match (Y = 1 means the two samples have the same identity, Y = 0 means they have different identities), m is a preset threshold, N is the number of sample pairs, and R(X1) and R(X2) denote sparse regularization terms. The specific form of the sparse regularization term may be the L1 norm, the L2 norm, or another form. Adding a sparse regular constraint to the features X1 and X2 makes the features more separable, reduces redundancy, and improves the generalization ability and processing efficiency of the model, which is a particular advantage when the data volume is small.
In this embodiment, step S300 includes:
Step S310, performing face detection and face calibration on the compressed-domain multi-modal information to obtain calibrated I frame information;
Step S320, performing face calibration on the residual image and the motion vector image based on the calibrated I frame information, to obtain the calibrated residual image and the calibrated motion vector;
Step S330, taking the calibrated I frame information, the calibrated residual image and the calibrated motion vector together as the calibrated face multi-modal information (the cropping involved is illustrated in the sketch below).
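The sketch below illustrates steps S310-S330 under the assumption that the motion-vector and residual maps have been resized to the I frame's spatial resolution; the patent does not spell out the scaling, so this is a hedged simplification.

```python
def calibrate_multimodal(i_frame, mv, res, box):
    """Crop the same face region from the three modalities (steps S310-S330).

    box = (x1, y1, x2, y2) from face detection on the I frame. Assumes the
    motion-vector and residual maps are stored at the I frame's resolution;
    if they are kept at macroblock resolution, scale the box accordingly.
    """
    x1, y1, x2, y2 = box
    return (
        i_frame[y1:y2, x1:x2],  # 3-channel RGB face crop
        mv[y1:y2, x1:x2],       # 2-channel motion-vector crop
        res[y1:y2, x1:x2],      # 2-channel residual crop
    )
```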
Step S400, matching the multi-modal face features against a preset sensitive-person feature library to determine whether the video to be detected contains a sensitive person.
In this embodiment, step S400 includes:
Step S410, acquiring the face feature vectors of sensitive persons from preset sensitive-person video data by the method of steps A200-A400;
Step S420, constructing a sensitive-person face feature library from the face feature vectors of the sensitive persons;
Step S430, calculating the cosine similarity between the extracted face features and the sensitive-person face features in the sensitive-person face feature library; when the cosine similarity is greater than a preset threshold T, judging that the video to be detected contains a sensitive person.
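A minimal NumPy sketch of steps S410-S430 follows; T = 0.6 is an illustrative value, as the patent leaves the threshold as a preset.

```python
import numpy as np

def contains_sensitive_person(feature, library, T=0.6):
    """Match one multi-modal face feature against the sensitive-face library.

    feature: (dim,) query vector; library: (num_persons, dim) matrix built in
    step S420. T = 0.6 is an illustrative threshold, not a value from the patent.
    """
    q = feature / np.linalg.norm(feature)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q                      # cosine similarity to every library entry
    best = int(sims.argmax())
    return bool(sims[best] > T), best   # (sensitive person present?, best match)
```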
The compressed domain-oriented video sensitive-person recognition system of the second embodiment of the invention comprises an information extraction module, a face positioning and calibration module, a feature extraction module and a sensitive-person matching module;
The information extraction module is configured to partially decode the video to be detected using FFmpeg and C++, and extract the compressed-domain multi-modal information of the video to be detected; the compressed-domain multi-modal information comprises: I frames, motion vector images, residual images, DCT coefficients and partition depth;
The face positioning and calibration module is configured to perform face detection and face calibration on the compressed-domain multi-modal information to obtain calibrated compressed-domain face multi-modal information;
The feature extraction module is configured to input the calibrated compressed-domain face multi-modal information into a trained multi-modal face recognition network and acquire the multi-modal face features of each face in the video to be detected;
The sensitive-person matching module is configured to match the multi-modal face features against a preset sensitive-person feature library to determine whether the video to be detected contains a sensitive person.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the video sensitive person identification system for a compressed domain provided in the foregoing embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
An electronic device of the third embodiment of the present invention comprises at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the processor to implement the compressed domain-oriented video sensitive-person recognition method described above.
A computer-readable storage medium of the fourth embodiment of the present invention stores computer instructions for execution by a computer to implement the compressed domain-oriented video sensitive-person recognition method described above.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Reference is now made to FIG. 3, which illustrates a block diagram of a computer system of a server for implementing embodiments of the method, system, and apparatus of the present application. The server shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 3, the computer system includes a Central Processing Unit (CPU)301 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for system operation are also stored. The CPU 301, ROM 302, and RAM 303 are connected to each other via a bus 304. An Input/Output (I/O) interface 305 is also connected to bus 304.
The following components are connected to the I/O interface 305: an input portion 306 including a keyboard, a mouse, and the like; an output section 307 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 308 including a hard disk and the like; and a communication section 309 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 309 performs communication processing via a network such as the internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 310 as necessary, so that a computer program read out therefrom is mounted into the storage section 308 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 309, and/or installed from the removable medium 311. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 301. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A compressed domain-oriented video sensitive-person recognition method, characterized in that the method comprises:
Step S100, partially decoding the video to be detected using FFmpeg and C++, and extracting the compressed-domain multi-modal information of the video to be detected; the compressed-domain multi-modal information comprises: I frames, motion vector images, residual images, DCT coefficients and partition depth;
Step S200, performing face detection and face calibration on the compressed-domain multi-modal information to obtain calibrated compressed-domain face multi-modal information;
Step S300, inputting the calibrated compressed-domain face multi-modal information into a trained multi-modal face recognition network to acquire the multi-modal face features of each face in the video to be detected;
Step S400, matching the multi-modal face features against a preset sensitive-person feature library to determine whether the video to be detected contains a sensitive person.
2. The compressed domain-oriented video sensitive-person recognition method of claim 1, characterized in that the multi-modal face recognition network comprises an I branch, an MV branch, a Res branch and a multi-modal fusion module:
The I branch is constructed from one of ResNet, InceptionNet or DenseNet; its input is the calibrated I frame, a 3-channel RGB image, and its output is the feature map of the I frame;
The MV branch is constructed from one of ResNet, InceptionNet or DenseNet; its input is the calibrated motion vector image, a 2-channel vector image, and its output is the feature map of the motion vector image;
The Res branch is constructed from one of ResNet, InceptionNet or DenseNet; its input is the calibrated residual image, a 2-channel image, and its output is the feature map of the residual image;
The multi-modal fusion module comprises 3 residual modules connected in parallel, each containing two convolution layers with 3 × 3 kernels; it takes as input the feature maps of the I frame, the motion vector image and the residual image, and outputs the multi-modal face feature vector.
3. The compressed domain-oriented video sensitive-person recognition method of claim 2, characterized in that the training method of the multi-modal face recognition network comprises:
Step A100, acquiring a training data set through an offline sample collection method;
Step A200, compressing the training data set to obtain training video compressed-domain information, comprising I frames, motion vector images, residual images, DCT coefficients and partition depth;
Step A300, randomly selecting the training video compressed-domain information of any training datum in the training data set, and acquiring the corresponding training multi-modal face feature vectors through the multi-modal face recognition network;
Step A400, calculating a contrastive loss L based on the training multi-modal face feature vectors;
Step A500, repeating steps A300 to A400 and reducing the contrastive loss L through back-propagation training until the network converges, to obtain the trained multi-modal face recognition network.
4. The method of claim 3, wherein the contrastive loss L is:
$$L=\frac{1}{2N}\sum_{n=1}^{N}\Big[\,Y d^{2}+(1-Y)\max(m-d,\,0)^{2}\,\Big]+R(X_{1})+R(X_{2})$$

where

$$d=\left\lVert X_{1}-X_{2}\right\rVert_{2}=\sqrt{\sum_{j=1}^{P}\left(X_{1j}-X_{2j}\right)^{2}}$$

is the Euclidean distance between the sample features X1 and X2, P is the feature dimension of a sample, Y is a label indicating whether the two samples match (Y = 1 means the two samples have the same identity, Y = 0 means they have different identities), m is a preset threshold, N is the number of sample pairs, and R(X1) and R(X2) denote sparse regularization terms.
5. The compressed domain-oriented video sensitive-person recognition method of claim 1, characterized in that step S300 comprises:
Step S310, performing face detection and face calibration on the compressed-domain multi-modal information to obtain calibrated I frame information;
Step S320, performing face calibration on the residual image and the motion vector image based on the calibrated I frame information, to obtain the calibrated residual image and the calibrated motion vector;
Step S330, taking the calibrated I frame information, the calibrated residual image and the calibrated motion vector together as the calibrated face multi-modal information.
6. The compressed domain-oriented video sensitive-person recognition method of claim 3, characterized in that step S400 comprises:
Step S410, acquiring the face feature vectors of sensitive persons from preset sensitive-person video data by the method of steps A200-A400;
Step S420, constructing a sensitive-person face feature library from the face feature vectors of the sensitive persons;
Step S430, calculating the cosine similarity between the extracted face features and the sensitive-person face features in the sensitive-person face feature library; when the cosine similarity is greater than a preset threshold T, judging that the video to be detected contains a sensitive person.
7. The compressed domain-oriented video sensitive-person recognition method of claim 3, characterized in that the offline sample collection method comprises:
Step B100, crawling celebrity videos from the Internet;
Step B200, extracting the multi-modal information of each celebrity video;
Step B300, extracting all celebrity face features from the multi-modal information of the celebrity video through a face recognition algorithm;
Step B400, clustering the celebrity face features with a clustering algorithm, and taking the class containing the most faces as the ID of the celebrity video;
Step B500, repeating steps B100-B400 until the number of processed celebrity videos reaches a preset number, obtaining the training data set.
8. A compressed domain-oriented video sensitive-person recognition system, characterized in that the system comprises: an information extraction module, a face positioning and calibration module, a feature extraction module and a sensitive-person matching module;
The information extraction module is configured to partially decode the video to be detected using FFmpeg and C++, and extract the compressed-domain multi-modal information of the video to be detected; the compressed-domain multi-modal information comprises: I frames, motion vector images, residual images, DCT coefficients and partition depth;
The face positioning and calibration module is configured to perform face detection and face calibration on the compressed-domain multi-modal information to obtain calibrated compressed-domain face multi-modal information;
The feature extraction module is configured to input the calibrated compressed-domain face multi-modal information into a trained multi-modal face recognition network and acquire the multi-modal face features of each face in the video to be detected;
The sensitive-person matching module is configured to match the multi-modal face features against a preset sensitive-person feature library to determine whether the video to be detected contains a sensitive person.
9. An electronic device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the processor to implement the compressed domain-oriented video sensitive-person recognition method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for execution by a computer to implement the compressed domain-oriented video sensitive-person recognition method of any one of claims 1-7.
CN202110190037.5A 2021-02-18 2021-02-18 Compressed domain-oriented video sensitive character recognition method, system and equipment Active CN112990273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110190037.5A CN112990273B (en) 2021-02-18 2021-02-18 Compressed domain-oriented video sensitive character recognition method, system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110190037.5A CN112990273B (en) 2021-02-18 2021-02-18 Compressed domain-oriented video sensitive character recognition method, system and equipment

Publications (2)

Publication Number Publication Date
CN112990273A true CN112990273A (en) 2021-06-18
CN112990273B CN112990273B (en) 2021-12-21

Family

ID=76394051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110190037.5A Active CN112990273B (en) 2021-02-18 2021-02-18 Compressed domain-oriented video sensitive character recognition method, system and equipment

Country Status (1)

Country Link
CN (1) CN112990273B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445918A (en) * 2022-02-21 2022-05-06 支付宝(杭州)信息技术有限公司 Living body detection method, device and equipment
CN114666571A (en) * 2022-03-07 2022-06-24 中国科学院自动化研究所 Video sensitive content detection method and system
CN115391751A (en) * 2022-10-31 2022-11-25 知安视娱(北京)科技有限公司 Infringement determination method
CN116778376A (en) * 2023-05-11 2023-09-19 中国科学院自动化研究所 Content security detection model training method, detection method and device


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1960491A (en) * 2006-09-21 2007-05-09 上海大学 Real time method for segmenting motion object based on H.264 compression domain
CN102014295A (en) * 2010-11-19 2011-04-13 嘉兴学院 Network sensitive video detection method
CN103826125A (en) * 2014-01-20 2014-05-28 北京创鑫汇智科技发展有限责任公司 Concentrated analysis method of compressed surveillance video and device
US20200033112A1 (en) * 2015-11-06 2020-01-30 Ap Robotics, Llc Interferometric distance measurement based on compression of chirped interferogram from cross-chirped interference
CN106850515A (en) * 2015-12-07 2017-06-13 中国移动通信集团公司 A kind of data processing method and video acquisition device, decoding apparatus
CN105825176A (en) * 2016-03-11 2016-08-03 东华大学 Identification method based on multi-mode non-contact identity characteristics
CN108319938A (en) * 2017-12-31 2018-07-24 奥瞳系统科技有限公司 High quality training data preparation system for high-performance face identification system
CN109858467A (en) * 2019-03-01 2019-06-07 北京视甄智能科技有限公司 A kind of face identification method and device based on the fusion of key point provincial characteristics
CN110796662A (en) * 2019-09-11 2020-02-14 浙江大学 Real-time semantic video segmentation method
CN111507311A (en) * 2020-05-22 2020-08-07 南京大学 Video character recognition method based on multi-mode feature fusion depth network
CN111860291A (en) * 2020-07-16 2020-10-30 上海交通大学 Multi-mode pedestrian identity recognition method and system based on pedestrian appearance and gait information
CN111914742A (en) * 2020-07-31 2020-11-10 辽宁工业大学 Attendance checking method, system, terminal equipment and medium based on multi-mode biological characteristics
CN112215908A (en) * 2020-10-12 2021-01-12 国家计算机网络与信息安全管理中心 Compressed domain-oriented video content comparison system, optimization method and comparison method
CN112241704A (en) * 2020-10-16 2021-01-19 百度(中国)有限公司 Method and device for judging portrait infringement, electronic equipment and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
C. SOLANA-CIPRES et al.: "Real-time moving object segmentation in H.264 compressed domain based on approximate reasoning", 《INTERNATIONAL JOURNAL OF APPROXIMATE REASONING》 *
LI XIAOGUANG et al.: "Face detection and tracking techniques in the compressed domain", 《MEASUREMENT & CONTROL TECHNOLOGY》 *
TIAN WEI et al.: "A face detection algorithm for the DCT compressed domain", 《MEASUREMENT & CONTROL TECHNOLOGY》 *
JIANG YIWEI: "Research on video/image compressed-domain editing techniques", 《CHINA DOCTORAL DISSERTATIONS FULL-TEXT DATABASE, INFORMATION SCIENCE & TECHNOLOGY》 *
ZHAO JIASHU: "Research and implementation of sensitive face recognition based on an optimized PCA dimensionality-reduction algorithm", 《CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE & TECHNOLOGY》 *
CHEN GUICAI et al.: "Design and implementation of a video surveillance system using AVS", 《JOURNAL OF IMAGE AND GRAPHICS》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445918A (en) * 2022-02-21 2022-05-06 支付宝(杭州)信息技术有限公司 Living body detection method, device and equipment
CN114666571A (en) * 2022-03-07 2022-06-24 中国科学院自动化研究所 Video sensitive content detection method and system
CN114666571B (en) * 2022-03-07 2024-06-14 中国科学院自动化研究所 Video sensitive content detection method and system
CN115391751A (en) * 2022-10-31 2022-11-25 知安视娱(北京)科技有限公司 Infringement determination method
CN116778376A (en) * 2023-05-11 2023-09-19 中国科学院自动化研究所 Content security detection model training method, detection method and device
CN116778376B (en) * 2023-05-11 2024-03-22 中国科学院自动化研究所 Content security detection model training method, detection method and device

Also Published As

Publication number Publication date
CN112990273B (en) 2021-12-21

Similar Documents

Publication Publication Date Title
CN112990273B (en) Compressed domain-oriented video sensitive character recognition method, system and equipment
Cai et al. End-to-end optimized ROI image compression
US11847816B2 (en) Resource optimization based on video frame analysis
US11074791B2 (en) Automatic threat detection based on video frame delta information in compressed video streams
CN112673625A (en) Hybrid video and feature encoding and decoding
Sun et al. Semantic structured image coding framework for multiple intelligent applications
Zhang et al. A joint compression scheme of video feature descriptors and visual content
CN112235569B (en) Quick video classification method, system and device based on H264 compressed domain
CN116233445B (en) Video encoding and decoding processing method and device, computer equipment and storage medium
US9712828B2 (en) Foreground motion detection in compressed video data
Chakraborty et al. MAGIC: Machine-learning-guided image compression for vision applications in Internet of Things
CN112215908A (en) Compressed domain-oriented video content comparison system, optimization method and comparison method
Beratoğlu et al. Vehicle license plate detector in compressed domain
Uddin et al. Double compression detection in HEVC-coded video with the same coding parameters using picture partitioning information
CN111368593A (en) Mosaic processing method and device, electronic equipment and storage medium
CN115474058A (en) Point cloud encoding processing method, point cloud decoding processing method and related equipment
CN113810654A (en) Image video uploading method and device, storage medium and electronic equipment
US11164328B2 (en) Object region detection method, object region detection apparatus, and non-transitory computer-readable medium thereof
CN114501031B (en) Compression coding and decompression method and device
CN112714336B (en) Video segmentation method and device, electronic equipment and computer readable storage medium
Lyu et al. Apron surveillance video coding based on compositing virtual reference frame with object library
Kuang et al. Fast HEVC to SCC transcoding based on decision trees
WO2013160040A1 (en) Methods and devices for object detection in coded video data
CN113051415B (en) Image storage method, device, equipment and storage medium
WO2024078512A1 (en) Pre-analysis based image compression methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant