CN109918538B - Video information processing method and device, storage medium and computing equipment - Google Patents


Info

Publication number
CN109918538B
Authority
CN
China
Prior art keywords
video
training
processed
noise
data set
Prior art date
Legal status
Active
Application number
CN201910075369.1A
Other languages
Chinese (zh)
Other versions
CN109918538A (en)
Inventor
朱军
韦星星
苏航
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201910075369.1A priority Critical patent/CN109918538B/en
Publication of CN109918538A publication Critical patent/CN109918538A/en
Application granted granted Critical
Publication of CN109918538B publication Critical patent/CN109918538B/en

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a video information processing method and apparatus, a storage medium and a computing device. The video information processing method comprises the following steps: obtaining a training data set, wherein the training data set comprises a plurality of training videos and each training video has a corresponding real category label; obtaining a video classification model to be processed; constructing an objective function with the adversarial noise as the unknown quantity, satisfying the following conditions: the adversarial noise is added to as few video frames of the training video as possible, the amplitude of the adversarial noise is as small as possible, and the video classification model to be processed misclassifies the training video with the adversarial noise added as far as possible; solving the objective function based on the training data set to obtain the adversarial noise; and generating an adversarial video sample corresponding to an original video by using the obtained adversarial noise. The technique of the invention can attack existing video classification algorithms with sparse adversarial noise and, compared with the prior art, greatly improves the adversarial effect.

Description

Video information processing method and device, storage medium and computing equipment
Technical Field
The embodiment of the invention relates to the field of artificial intelligence security, in particular to a video information processing method and device, a storage medium and a computing device.
Background
An adversarial sample refers to an input sample formed by deliberately adding a subtle perturbation to data in a data set, causing the model to give an erroneous output with high confidence. In a regularization context, adversarial training, i.e. training the network on samples of an adversarially perturbed training set, reduces the error rate on the original independently and identically distributed test set.
Research on adversarial samples has so far focused mainly on the image classification problem, and as this research has advanced, various adversarial sample generation methods have been proposed, including methods for generating adversarial samples against video classification. However, the prior art still has the problem that the adversarial samples generated for video classification models have a poor adversarial effect.
Disclosure of Invention
For this reason, an improved method for generating adversarial samples is highly needed, one that attacks existing video classification algorithms with sparse adversarial noise and thereby improves the adversarial effect.
In this context, embodiments of the present invention are intended to provide a video information processing method and apparatus, a storage medium, and a computing device.
According to an aspect of the present invention, there is provided a video information processing method comprising: obtaining a training data set, the training data set comprising a plurality of training videos, wherein each training video has a corresponding real category label; obtaining a video classification model to be processed; constructing an objective function with the adversarial noise as an unknown quantity so as to satisfy the following conditions: the adversarial noise is added to as few video frames of the training video as possible, the amplitude of the adversarial noise is as small as possible, and the classification result produced by the video classification model to be processed for the training video with the adversarial noise added is as wrong as possible; solving the objective function based on the training data set to obtain the adversarial noise; and determining an original video to be processed, and generating an adversarial video sample corresponding to the original video by using the obtained adversarial noise.
Further, the step of obtaining the video classification model to be processed comprises: obtaining the model network architecture of the video classification model to be processed and the corresponding model parameters.
Further, the video classification model to be processed comprises a video classification model constructed based on a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).
Further, the objective function comprises a first expression term constructed based on a norm of the point-wise product of a video sequence mask and the adversarial noise.
Further, for each training video of the training data set, at least one video frame is selected from all video frames of the training video as a candidate frame, wherein the elements of the video sequence mask corresponding to that training video are 1 at the positions of the candidate frames and 0 elsewhere.
Further, for each training video of the training data set, the number of video frames to be selected from all video frames of the training video is determined according to a preset sparsity.
Further, for each training video of the training data set, the number of video frames to be selected from all video frames of the training video is equal to the product of the total number of video frames of the training video and the sparsity.
Further, the objective function comprises a second expression term constructed based on a loss function of the video classification model to be processed, wherein the loss function measures the distance between the predicted category label of a training video with the adversarial noise added and the corresponding real category label.
Further, the loss function comprises a cross-entropy loss function.
Further, the objective function is solved by an Adam optimization method.
Further, in the process of solving the objective function, the adversarial noise is initialized to a preset value, the preset value being between 0 and 0.1.
According to another aspect of the present invention, there is also provided a video information processing apparatus comprising: a training data obtaining unit adapted to obtain a training data set comprising a plurality of training videos, wherein each training video has a corresponding real category label; a model obtaining unit adapted to obtain a video classification model to be processed; a construction unit adapted to construct an objective function with the adversarial noise as an unknown quantity so as to satisfy the following conditions: the adversarial noise is added to as few video frames of the training video as possible, the amplitude of the adversarial noise is as small as possible, and the classification result produced by the video classification model to be processed for the training video with the adversarial noise added is as wrong as possible; a calculation unit adapted to solve the objective function based on the training data set to obtain the adversarial noise; and a sample generation unit adapted to determine an original video to be processed and to generate an adversarial video sample corresponding to the original video by using the obtained adversarial noise.
Further, the model obtaining unit is adapted to obtain the model network architecture of the video classification model to be processed and the corresponding model parameters.
Further, the video classification model to be processed comprises a video classification model constructed based on a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).
Further, the objective function comprises a first expression term constructed based on a norm of the point-wise product of a video sequence mask and the adversarial noise, the value of the first expression term being made as small as possible.
Further, the construction unit is adapted to: for each training video of the training data set, select at least one video frame from all video frames of the training video as a candidate frame, wherein the elements of the video sequence mask corresponding to that training video are 1 at the positions of the candidate frames and 0 elsewhere.
Further, the construction unit is adapted to: for each training video of the training data set, determine the number of video frames to be selected from all video frames of the training video according to a preset sparsity.
Further, the construction unit is adapted such that, for each training video of the training data set, the number of video frames to be selected from all video frames of the training video is equal to the product of the total number of video frames of the training video and the sparsity.
Further, the objective function comprises a second expression term constructed based on a loss function of the video classification model to be processed, wherein the loss function measures the distance between the predicted category label of a training video with the adversarial noise added and the corresponding real category label, the value of the second expression term being made as small as possible.
Further, the loss function comprises a cross-entropy loss function.
Further, the calculation unit is adapted to solve the objective function using an Adam optimization method.
Further, the calculation unit is adapted to initialize the adversarial noise to a preset value in the process of solving the objective function, the preset value being between 0 and 0.1.
According to still another aspect of the present invention, there is also provided a storage medium storing a program which, when executed by a processor, implements the video information processing method as described above.
According to still another aspect of the present invention, there is also provided a computing device including the storage medium as described above.
According to the video information processing method and apparatus, the storage medium and the computing device described above, an objective function satisfying the stated conditions is constructed: the adversarial noise is added to as few video frames of the training videos as possible, the amplitude of the adversarial noise is kept as small as possible, and the video classification model to be processed is made to misclassify the training videos with the adversarial noise added as far as possible; the obtained adversarial noise is then used to generate an adversarial video sample corresponding to an original video. Existing video classification algorithms can therefore be attacked with sparse adversarial noise, the adversarial effect can be greatly improved compared with the prior art, security problems existing in the field of artificial intelligence can be revealed, and more robust artificial intelligence algorithms can be promoted.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 is a flowchart schematically showing an exemplary process of a video information processing method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating exemplary results of adversarial videos and adversarial noise generated using the video information processing method of the present invention;
fig. 3 is a block diagram schematically showing the configuration of one example of a video information processing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram that schematically illustrates the architecture of a computer, in accordance with an embodiment of the present invention;
fig. 5 is a schematic diagram schematically illustrating a computer-readable storage medium according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Exemplary method
A video information processing method according to an exemplary embodiment of the present invention is described below with reference to fig. 1.
According to an embodiment of the present invention, there is provided a video information processing method including: obtaining a training data set, wherein the training data set comprises a plurality of training videos and each training video has a corresponding real category label; obtaining a video classification model to be processed; constructing an objective function with the adversarial noise as an unknown quantity so as to satisfy the following conditions: the adversarial noise is added to as few video frames of the training video as possible, the amplitude of the adversarial noise is as small as possible, and the classification result produced by the video classification model to be processed for the training video with the adversarial noise added is as wrong as possible; solving the objective function based on the training data set to obtain the adversarial noise; and determining an original video to be processed, and generating an adversarial video sample corresponding to the original video by using the obtained adversarial noise.
Fig. 1 schematically shows an exemplary process flow 100 of a video information processing method according to an embodiment of the present disclosure.
As shown in fig. 1, after the process flow 100 is started, step S110 is first executed.
In step S110, a training data set is obtained, the training data set including a plurality of training videos, wherein each training video has a corresponding real category label.
As an example, the training data set may be expressed as Φ = {X_i, y_i}, i = 1, 2, …, N, where X_i represents the i-th training video in the training data set, y_i represents the real category label of the i-th training video X_i, and N is the number of training videos contained in the training data set.
As an example, X_i ∈ R^(T×W×H×C), i = 1, 2, …, N, where T denotes the number of video frames contained in each training video (T is a positive integer), and W, H and C denote the width, height and number of channels of each training video, respectively. The width and height of a training video may be expressed, for example, in pixels or in other units, and the number of channels is, for example, 1 (a single channel) or a preset value (e.g. 3, such as the three RGB channels). That is, the i-th training video X_i (i = 1, 2, …, N) is a video containing T frames of width W, height H and C channels.
In one example, the number of video frames contained in each training video in the training data set may be the same, as well as the size (width and height) and number of channels of each training video.
In another example, the number of video frames, the size (width and/or height), and the number of channels contained in each training video in the training data set may also be at least partially different.
For example, videos can be downloaded from existing video websites by writing a crawler program; the downloaded videos are then processed, video segments containing human actions are extracted from them as training videos, and the segments are labeled (the labels serving as the real category labels).
For example, a segment length may be preset, human-action clips may be cut from the downloaded videos at certain intervals to obtain video segments of a fixed length (the fixed length being equal to the preset segment length), and the size of each frame of a video segment may be adjusted, for example by resizing the width and height of each frame to a preset width and a preset height. The preset width and preset height may be set according to empirical values, determined through experiments, chosen according to the actual situation, and so on. A minimal sketch of such preprocessing is given below.
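As an illustration of this preprocessing, the sketch below cuts fixed-length clips from a downloaded video file and resizes every frame; the clip length, target width and height, and sampling stride are hypothetical values chosen for the example, not values prescribed by the invention.

```python
# Minimal preprocessing sketch (assumed parameters): cut fixed-length clips
# from a video file and resize each frame to a preset width and height.
import cv2
import numpy as np

def extract_clips(video_path, clip_len=40, width=160, height=120, stride=200):
    """Return a list of clips, each of shape (clip_len, height, width, 3), in [0, 1]."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (width, height)))
    cap.release()

    clips = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        clip = np.stack(frames[start:start + clip_len]).astype(np.float32) / 255.0
        clips.append(clip)
    return clips
```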
It should be understood that the content contained in the training videos is not limited to the content described above and may also include other content, for example animal actions, mechanical operations, and the like.
In step S120, a video classification model to be processed is obtained. Then, step S130 is performed.
As an example, in step S120, the video classification model J_θ to be processed is obtained, i.e. its model network architecture and the corresponding model parameters θ.
The video classification model to be processed comprises, for example, a video classification model constructed based on a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). The CNN is used, for example, to extract features of each frame of the video; these features are then fed into the RNN, which encodes adjacent frames so that the label information of the current frame is predicted by combining the frames before and after it. Finally, the label probability values of all frames of the video are pooled (Pooling), yielding the label information of the whole video as the predicted category label of the video (e.g. of a training video). A minimal sketch of such a CNN+RNN classifier is given below.
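The following is a minimal sketch of a CNN+RNN video classifier of the kind referred to above; the specific backbone, hidden size and pooling choice are assumptions made for illustration and do not reflect the exact model attacked by the invention.

```python
# Illustrative CNN+RNN video classifier (assumed architecture, PyTorch).
import torch
import torch.nn as nn

class CNNRNNClassifier(nn.Module):
    def __init__(self, num_classes, hidden_size=256):
        super().__init__()
        # Per-frame CNN feature extractor (a small assumed backbone).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # RNN encodes the per-frame features along the time axis.
        self.rnn = nn.GRU(input_size=64, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        feats = self.cnn(x.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.rnn(feats)                              # per-frame encodings
        frame_probs = torch.softmax(self.head(out), dim=-1)   # per-frame label probabilities
        return frame_probs.mean(dim=1)                        # temporal (average) pooling

# Usage: probs = CNNRNNClassifier(num_classes=101)(torch.rand(2, 40, 3, 120, 160))
```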
In addition, the video classification model to be processed may also be other existing video classification models, for example, other video classification models based on a deep neural network.
In step S130, an objective function with the adversarial noise as the unknown quantity is constructed so as to satisfy the following conditions: the adversarial noise is added to as few video frames of the training video as possible and the amplitude of the adversarial noise is as small as possible (hereinafter referred to as the first condition); and the classification result produced by the video classification model to be processed for the training video with the adversarial noise added is as wrong as possible (hereinafter referred to as the second condition).
As an example, the objective function comprises, for example, a first expression term constructed based on a norm of the point-wise product of a video sequence mask and the adversarial noise.
For example, for each training video of the training data set, at least one video frame is selected from all video frames of the training video as a candidate frame, wherein the elements of the video sequence mask corresponding to that training video are 1 at the positions of the candidate frames and 0 elsewhere.
For each training video of the training data set, the number of video frames to be selected from all video frames of the training video may be determined, for example, according to a preset sparsity.
As an example, for each training video of the training data set, the number of video frames to be selected from all video frames of the training video may be equal, for example, to the product of the total number of video frames of the training video and the sparsity.
As an example, the objective function may, for example, comprise a second expression term constructed based on a loss function of the video classification model to be processed, wherein the loss function measures the distance between the predicted category label of a training video with the adversarial noise added and the corresponding real category label.
The loss function comprises, for example, a cross-entropy loss function; a minimal sketch of such a cross-entropy term is given below.
Furthermore, in other examples, the objective function may, for example, include both the first expression term and the second expression term described above.
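Purely as an illustration of how such a loss-based second term could be computed when a cross-entropy loss is chosen, the following sketch assumes the classifier outputs class probabilities; the function name and the numerical epsilon are assumptions made for the example.

```python
# Cross-entropy between predicted class probabilities and the real category labels.
import torch
import torch.nn.functional as F

def second_term_cross_entropy(probs, labels):
    # probs: (N, num_classes) predicted probabilities for the noised training videos
    # labels: (N,) integer real category labels
    return F.nll_loss(torch.log(probs + 1e-12), labels)  # mean cross-entropy over the batch
```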
In one example, the objective function may be expressed, for example, in the form shown in Formula I below.
Formula I:
argmin_E  λ||M·E||2,1 + (1/N) Σ_{i=1,…,N} ℓ(1_{y_i}, J_θ(X_i + M·E))
where the term λ||M·E||2,1 may be taken as an example of the first expression term described above, and the term (1/N) Σ_{i=1,…,N} ℓ(1_{y_i}, J_θ(X_i + M·E)) may be taken as an example of the second expression term described above. "argmin" in the above objective function means that the sum of the first expression term and the second expression term is minimized over E.
In the above formula, E represents the generated adversarial noise.
M ∈ {0,1}^(T×W×H×C) represents the video sequence mask. For example, for a training video, K frames are selected out of the total T frames of the training video, and the adversarial noise is then added only to the selected K frames. For the selected K frames, the elements at the corresponding positions of the video sequence mask are 1; for the frames of the training video other than the K selected frames, the elements at the corresponding positions of the mask are all 0.
For example, denoting the preset sparsity described above by S, how many frames should be selected from the video, i.e. the value of K, can be determined according to Formula II below.
Formula II:
S = K / T.
K can be obtained, for example, by rounding or truncating the value K = S·T computed from the above formula. A small sketch of this mask construction is given below.
ℓ(·,·) is a loss function; for example, ℓ(u, v) = log(1 − u·v) can be adopted, where u and v respectively refer to the two quantities to be compared by the loss function. In the above expression of the objective function, u is 1_{y_i} and v is J_θ(X_i + M·E).
Here, 1_{y_i} is the binary (one-hot) representation of the real label of a video (e.g. a training video), and J_θ(X_i + M·E) represents the predicted category label of the training video X_i with the adversarial noise added.
||M·E||2,1 = Σ_{t=1,…,K} ||E_t||_2 is the ℓ2,1 norm of the matrix E, used to measure the amplitude of the adversarial noise, where t indexes the t-th frame of the selected K frames, i.e. t = 1, 2, …, K, and E_t represents the adversarial noise added to the t-th selected frame. λ is a coefficient used to balance the two terms above.
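A minimal sketch of this ℓ2,1 amplitude measure, assuming the noise tensor is laid out frame-first as above:

```python
# l_{2,1} norm of the masked noise: sum over frames of the per-frame l_2 norm.
import numpy as np

def l21_norm(E, M):
    masked = M * E                                   # point-wise product, shape (T, W, H, C)
    per_frame = np.sqrt((masked ** 2).reshape(masked.shape[0], -1).sum(axis=1))
    return per_frame.sum()                           # only the selected frames contribute
```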
Next, in step S140, the objective function is solved based on the training data set to obtain the adversarial noise. Then, step S150 is performed.
As an example, the objective function may be solved using the Adam optimization method.
In the process of solving the objective function, the adversarial noise may be initialized entirely to a preset value between 0 and 0.1, for example 0.01, which prevents the ℓ2,1 norm from producing NaN values during the solution of the objective function; λ = 1 may be set during the solution process, for example. One way such a solution step could look is sketched below.
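The following sketch shows one possible shape of this solution step in PyTorch. The use of Adam, the 0.01 initialization and λ = 1 follow the text above; the learning rate, iteration count and channel-first tensor layout are assumptions, and the loss ℓ(u, v) = log(1 − u·v) is taken directly from Formula I as stated here, so the block should be read as an illustrative assumption rather than the definitive implementation.

```python
# Illustrative solution of Formula I with Adam (PyTorch). `model` is any differentiable
# video classifier J_theta returning class probabilities for input of shape (B, T, C, H, W).
import torch

def solve_adversarial_noise(model, videos, labels, mask, lam=1.0, steps=500, lr=1e-2):
    # videos: (N, T, C, H, W); labels: (N,) long tensor of real category labels;
    # mask: (T, C, H, W) with entries in {0, 1}.
    E = torch.full_like(videos[0], 0.01, requires_grad=True)   # initialize the noise to 0.01
    opt = torch.optim.Adam([E], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        masked = mask * E
        # First expression term: lambda * l_{2,1} norm of the masked noise.
        l21 = masked.reshape(masked.shape[0], -1).norm(dim=1).sum()
        # Second expression term: loss l(u, v) = log(1 - u.v), following Formula I as stated.
        probs = model(videos + masked.unsqueeze(0))             # J_theta(X_i + M*E)
        p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
        loss = torch.log(1.0 - p_true + 1e-12).mean()
        (lam * l21 + loss).backward()
        opt.step()
    return E.detach()
```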
In step S150, an original video to be processed is determined, and an adversarial video sample corresponding to the original video is generated using the obtained adversarial noise.
For example, by solving the objective function shown in Formula I above, the final adversarial noise E may be obtained, and this noise may be added as a universal adversarial noise to the original video to be processed. Denoting the original video to be processed by X, the adversarial noise E finally obtained in step S140 is added to the original video X to obtain the final adversarial sample X̂, that is, X̂ = X + E.
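A short usage sketch of this last step, under the assumption that the noise is simply added to the raw pixel values and the result clipped back to the valid range (the clipping is an assumption, not stated in the text):

```python
# Apply the universal adversarial noise E to a new original video X.
import numpy as np

def make_adversarial_sample(X, E):
    # X, E: arrays of shape (T, W, H, C); pixel values assumed to lie in [0, 1].
    return np.clip(X + E, 0.0, 1.0)
```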
in the example given in fig. 2, the image sequence of the top row is a video frame in one of the generated competing videos, the image sequence of the middle row is a video frame in the other of the generated competing videos, and the image sequence of the bottom row is a general competing noise generated. Thus, the generated confrontation sample is input into the video classification model JθTesting is carried out, and the result verifies that the countermeasure sample can cause the video classification model to classify the generated countermeasure sample wrongly.
Exemplary devices
Having described the video information processing method according to the exemplary embodiment of the present invention, a video information processing apparatus according to an exemplary embodiment of the present invention will be described next with reference to fig. 3.
Referring to fig. 3, a schematic diagram of a video information processing apparatus according to an embodiment of the present invention is shown. The apparatus may be disposed in a terminal device, for example in an intelligent electronic device such as a desktop computer, a notebook computer, a smart phone or a tablet computer; of course, the apparatus according to the embodiment of the present invention may also be disposed in a server. The apparatus 300 of the embodiment of the present invention may include the following constituent units: a training data obtaining unit 310, a model obtaining unit 320, a construction unit 330, a calculation unit 340, and a sample generation unit 350.
As shown in fig. 3, the training data obtaining unit 310 is adapted to obtain a training data set comprising a plurality of training videos, wherein each training video has a corresponding real category label.
The model obtaining unit 320 is adapted to obtain a video classification model to be processed.
The construction unit 330 is adapted to construct an objective function with the adversarial noise as an unknown quantity so as to satisfy the following conditions: the adversarial noise is added to as few video frames of the training video as possible, the amplitude of the adversarial noise is as small as possible, and the classification result produced by the video classification model to be processed for the training video with the adversarial noise added is as wrong as possible.
The calculation unit 340 is adapted to solve the objective function based on the training data set to obtain the adversarial noise.
The sample generation unit 350 is adapted to determine an original video to be processed and to generate an adversarial video sample corresponding to the original video by using the obtained adversarial noise.
As an example, the model obtaining unit may be adapted to: and obtaining a model network architecture of the video classification model to be processed and corresponding model parameters.
As an example, the video classification model to be processed includes, for example, a video classification model constructed based on a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).
As an example, the objective function comprises a first expression constructed based on a norm of a point-by-point product of a mask of the video sequence and the countering noise, for example, such that the value of the first expression is as small as possible.
As an example, the building unit may be adapted to: for each training video of the training data set, at least one video frame is selected from all video frames of the training video as a candidate frame, wherein the element value of the video sequence mask corresponding to the training video corresponding to the candidate frame of the training video is 1, and the rest are 0.
As an example, the building unit may be adapted to: for each training video of the training data set, the number of video frames to be selected among all video frames of the training video is determined according to a preset sparsity.
As an example, the construction unit may be adapted to, for each training video of the training data set, the number of video frames to be selected among all video frames of the training video being equal to the following value: the product of the number of all video frames of the training video and the sparsity.
As an example, the objective function includes, for example, a second expression term constructed based on a loss function of the video classification model to be processed, where the loss function is used to measure a distance between a prediction class label and a corresponding real class label of the training video added against noise, so as to minimize a value of the second expression term.
It should be noted that, in other examples, the objective function may also include both the first expression term and the second expression term.
By way of example, the loss function includes, for example, a cross-entropy loss function.
As an example, the calculation unit may be adapted to solve the objective function using Adam optimization method.
For example, the calculation unit may be adapted to initialize the adversarial noise to a preset value, the preset value being between 0 and 0.1, in the process of solving the objective function.
It should be noted that the constituent units in the video information processing apparatus according to the exemplary embodiment of the present invention can respectively perform the processing in the corresponding steps in the video information processing method according to the exemplary embodiment of the present invention described above with reference to fig. 1, and can achieve similar functions and effects, which are not described again here.
FIG. 4 illustrates a block diagram of an exemplary computer system/server 40 suitable for use in implementing embodiments of the present invention. The computer system/server 40 shown in FIG. 4 is only an example and should not be taken to limit the scope of use and functionality of embodiments of the present invention in any way.
As shown in FIG. 4, computer system/server 40 is in the form of a general purpose computing device. The components of computer system/server 40 may include, but are not limited to: one or more processors or processing units 401, a system memory 402, and a bus 403 that couples the various system components (including the system memory 402 and the processing unit 401).
Computer system/server 40 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 40 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 402 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 4021 and/or cache memory 4022. The computer system/server 40 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system 4023 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g. a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk (e.g. a CD-ROM, DVD-ROM or other optical media) may be provided. In these cases, each drive may be connected to the bus 403 by one or more data media interfaces. The system memory 402 may include at least one program product having a set (e.g. at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 4025 having a set (at least one) of program modules 4024 may be stored, for example, in system memory 402, and such program modules 4024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment. The program modules 4024 generally perform the functions and/or methods of the embodiments described herein.
The computer system/server 40 may also communicate with one or more external devices 404, such as a keyboard, pointing device, display, etc. Such communication may be through an input/output (I/O) interface 405. Also, the computer system/server 40 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) through a network adapter 406. As shown in FIG. 4, network adapter 406 communicates with other modules of computer system/server 40 (e.g., processing unit 401, etc.) via bus 403. It should be appreciated that although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with computer system/server 40.
The processing unit 401 executes various functional applications and data processing by running programs stored in the system memory 402, for example executing and implementing the steps of the video information processing method described above, such as obtaining a training data set, obtaining a video classification model to be processed, constructing and solving the objective function to obtain the adversarial noise, and generating adversarial video samples corresponding to an original video.
A specific example of a computer-readable storage medium embodying the present invention is shown in fig. 5.
The computer-readable storage medium of fig. 5 is an optical disc 500 on which a computer program (i.e. a program product) is stored; when executed by a processor, the program performs the steps recited in the above-described method embodiments, for example obtaining a training data set, obtaining a video classification model to be processed, constructing and solving the objective function to obtain the adversarial noise, and generating an adversarial video sample corresponding to an original video; the specific implementation of each step is not repeated here.
It should be noted that although in the above detailed description several units, modules or sub-modules of the video information processing apparatus are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the modules described above may be embodied in one module according to embodiments of the invention. Conversely, the features and functions of one module described above may be further divided into embodiments by a plurality of modules.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments; nor does the division into aspects imply that features in these aspects cannot be combined to advantage, this division being made for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
In summary, in the embodiments according to the present disclosure, the present disclosure provides the following solutions, but is not limited thereto:
the scheme 1. a video information processing method is characterized by comprising the following steps:
obtaining a training data set, the training data set comprising a plurality of training videos, wherein each training video has a corresponding true category label;
obtaining a video classification model to be processed;
constructing an objective function with the countermeasure noise as an unknown quantity so as to satisfy the following conditions:
adding the counteracting noise to as few video frames as possible in the training video, and making the amplitude of the counteracting noise as small as possible, and
Enabling the to-be-processed video classification model to be wrong as much as possible for the result of the training video classification after the anti-noise is added;
solving the objective function based on the training data set to obtain a countering noise; and
determining an original video to be processed, and generating a confrontation video sample corresponding to the original video by using the acquired confrontation noise.
Scheme 2. the video information processing method according to scheme 1, wherein the step of obtaining the video classification model to be processed includes: and obtaining a model network architecture of the video classification model to be processed and corresponding model parameters.
Scheme 3. the video information processing method according to scheme 1 or 2, wherein the video classification model to be processed includes a video classification model constructed based on a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).
Scheme 4. the video information processing method according to scheme 1 or 2, wherein the objective function includes a first expression constructed based on a norm of a point-product of a video sequence mask and counternoise.
Scheme 5. the video information processing method according to scheme 4, wherein for each training video of the training data set, at least one video frame is selected as a candidate frame from all video frames of the training video, wherein the element value at the position corresponding to the candidate frame of the training video in the video sequence mask corresponding to the training video is 1, and the rest are 0.
Scheme 6. the video information processing method according to scheme 5, wherein for each training video of the training data set, the number of video frames to be selected among all video frames of the training video is determined according to a preset sparsity.
Scheme 7. the video information processing method according to scheme 6, wherein for each training video of the training data set, the number of video frames to be selected among all video frames of the training video is equal to the following value: the product of the number of all video frames of the training video and the sparsity.
Scheme 8. the video information processing method according to scheme 1, wherein the objective function includes a second expression term constructed based on a loss function of the to-be-processed video classification model, wherein the loss function is used to measure a distance between a prediction class label and a corresponding true class label of a training video added with anti-noise.
Scheme 9. the video information processing method according to scheme 8, wherein the loss function comprises a cross-entropy loss function.
Scheme 10. the video information processing method according to scheme 1 or 2, characterized in that the objective function is solved using Adam optimization method.
Scheme 11. the method of processing video information according to scheme 10, wherein all pairs of anti-noise are initialized to a preset value in the process of solving the objective function, and the preset value is between 0 and 0.1.
Scheme 12. a video information processing apparatus, comprising:
a training data obtaining unit adapted to obtain a training data set comprising a plurality of training videos, wherein each training video has a corresponding real category label;
the model obtaining unit is suitable for obtaining a video classification model to be processed;
a construction unit adapted to construct an objective function with the countering noise as an unknown quantity to satisfy the following conditions:
adding the counteracting noise to as few video frames as possible in the training video, and making the amplitude of the counteracting noise as small as possible, and
Enabling the to-be-processed video classification model to be wrong as much as possible for the result of the training video classification after the anti-noise is added;
a calculation unit adapted to solve the objective function based on the training data set to obtain a countering noise; and
and the sample generation unit is suitable for determining an original video to be processed and generating a confrontation video sample corresponding to the original video by using the acquired confrontation noise.
Scheme 13. the video information processing apparatus according to scheme 12, wherein the model obtaining unit is adapted to: and obtaining a model network architecture of the video classification model to be processed and corresponding model parameters.
Scheme 14. the video information processing apparatus according to scheme 12 or 13, wherein the video classification model to be processed includes a video classification model constructed based on a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).
Scheme 15. the video information processing apparatus according to scheme 12 or 13, wherein the objective function includes a first expression term constructed based on a norm of a point-product of a video sequence mask and the countering noise, so that a value of the first expression term is as small as possible.
Scheme 16. the video information processing apparatus according to scheme 15, wherein said construction unit is adapted to: for each training video of the training data set, select at least one video frame from all video frames of the training video as a candidate frame, wherein the element value of the video sequence mask corresponding to the training video at the positions corresponding to the candidate frames of the training video is 1, and the rest are 0.
Scheme 17. the video information processing apparatus according to scheme 16, wherein said construction unit is adapted to: for each training video of the training data set, determining the number of video frames to be selected in all video frames of the training video according to a preset sparsity.
Scheme 18. the video-information processing apparatus according to scheme 17, wherein said construction unit is adapted to, for each training video of said training data set, number of video frames to be selected among all video frames of the training video equal to the following value: the product of the number of all video frames of the training video and the sparsity.
Scheme 19. the video information processing apparatus according to scheme 12, wherein the objective function includes a second expression term constructed based on a loss function of the to-be-processed video classification model, wherein the loss function is used to measure a distance between a predicted class label and a corresponding true class label of a training video added with anti-noise, so as to minimize a value of the second expression term.
Scheme 20. the video information processing apparatus of scheme 19, wherein the loss function comprises a cross-entropy loss function.
Scheme 21. the video information processing apparatus according to scheme 12 or 13, wherein said computing unit is adapted to solve said objective function using Adam optimization.
Scheme 22. the video information processing apparatus according to scheme 21, wherein the computing unit is adapted to initialize all pairs of anti-noise signals to a preset value in the process of solving the objective function, and the preset value is between 0 and 0.1.
Scheme 23. a storage medium storing a program which, when executed by a processor, implements the video information processing method according to any one of schemes 1 to 11.
Scheme 24. a computing device comprising the storage medium of scheme 23.

Claims (20)

1. A video information processing method, characterized by comprising:
obtaining a training data set, the training data set comprising a plurality of training videos, wherein each training video has a corresponding true category label;
obtaining a video classification model to be processed;
constructing an objective function with the countermeasure noise as an unknown quantity so as to satisfy the following conditions:
adding the counternoise to a certain amount of video frames in the training video, and initializing the counternoise to a preset value, wherein the preset value is between 0 and 0.1, and
enabling the objective function to comprise a second expression term constructed based on a loss function of the video classification model to be processed, wherein the loss function is used for measuring the distance between a prediction class label of the training video added with the anti-noise and the corresponding real class label;
solving the objective function based on the training data set to obtain a countering noise; and
determining an original video to be processed, and generating a confrontation video sample corresponding to the original video by using the acquired confrontation noise.
2. The method according to claim 1, wherein the step of obtaining the classification model of the video to be processed comprises: and obtaining a model network architecture of the video classification model to be processed and corresponding model parameters.
3. The method according to claim 1 or 2, wherein the video classification model to be processed comprises a video classification model constructed based on a convolutional neural network and a cyclic neural network.
4. The method according to claim 1 or 2, wherein the objective function comprises a first expression constructed based on a norm of a dot product of a video sequence mask and a competing noise.
5. The method of claim 4, wherein for each training video of the training data set, at least one video frame is selected as a candidate frame from all video frames of the training video, wherein the element value of the video sequence mask corresponding to the training video corresponding to the candidate frame is 1, and the rest are 0.
6. The method according to claim 5, wherein for each training video of the training data set, the number of video frames to be selected among all video frames of the training video is determined according to a preset sparsity.
7. The method according to claim 6, wherein for each training video of the training data set, the number of video frames to be selected among all video frames of the training video is equal to: the product of the number of all video frames of the training video and the sparsity.
8. The method of claim 1, wherein the loss function comprises a cross-entropy loss function.
9. A method for processing video information according to claim 1 or 2, characterized in that said objective function is solved using Adam optimization.
10. A video information processing apparatus characterized by comprising:
a training data obtaining unit adapted to obtain a training data set comprising a plurality of training videos, wherein each training video has a corresponding real category label;
the model obtaining unit is suitable for obtaining a video classification model to be processed;
a construction unit adapted to construct an objective function with the countering noise as an unknown quantity to satisfy the following conditions:
adding the counternoise to a certain amount of video frames in the training video, and initializing the counternoise to a preset value, wherein the preset value is between 0 and 0.1, and
enabling the objective function to comprise a second expression term constructed based on a loss function of the video classification model to be processed, wherein the loss function is used for measuring the distance between a prediction class label of the training video added with the anti-noise and the corresponding real class label;
a calculation unit adapted to solve the objective function based on the training data set to obtain a countering noise; and
and the sample generation unit is suitable for determining an original video to be processed and generating a confrontation video sample corresponding to the original video by using the acquired confrontation noise.
11. The video-information processing apparatus according to claim 10, wherein said model-obtaining unit is adapted to: and obtaining a model network architecture of the video classification model to be processed and corresponding model parameters.
12. The apparatus according to claim 10 or 11, wherein the video classification model to be processed comprises a video classification model constructed based on a convolutional neural network and a cyclic neural network.
13. The apparatus according to claim 10 or 11, wherein said objective function includes a first expression term constructed based on a norm of a dot product of a video sequence mask and a competing noise so that a value of said first expression term is small.
14. The video-information processing apparatus of claim 13, wherein the construction unit is adapted to: and for each training video of the training data set, selecting at least one video frame from all video frames of the training video as a candidate frame, wherein the element value of the video sequence mask corresponding to the training video corresponding to the candidate frame of the training video is 1, and the rest are 0.
15. The video-information processing apparatus of claim 14, wherein the construction unit is adapted to: for each training video of the training data set, determining the number of video frames to be selected in all video frames of the training video according to a preset sparsity.
16. The video-information processing apparatus according to claim 15, wherein said construction unit is adapted to, for each training video of said training data set, number of video frames to be selected among all video frames of the training video equal to a value: the product of the number of all video frames of the training video and the sparsity.
17. The video-information processing apparatus of claim 10, wherein the loss function comprises a cross-entropy loss function.
18. Video information processing apparatus according to claim 10 or 11, wherein the calculation unit is adapted to solve the objective function using Adam optimization.
19. A storage medium storing a program which, when executed by a processor, implements the video information processing method according to any one of claims 1 to 9.
20. A computing device comprising the storage medium of claim 19.
CN201910075369.1A 2019-01-25 2019-01-25 Video information processing method and device, storage medium and computing equipment Active CN109918538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910075369.1A CN109918538B (en) 2019-01-25 2019-01-25 Video information processing method and device, storage medium and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910075369.1A CN109918538B (en) 2019-01-25 2019-01-25 Video information processing method and device, storage medium and computing equipment

Publications (2)

Publication Number Publication Date
CN109918538A CN109918538A (en) 2019-06-21
CN109918538B true CN109918538B (en) 2021-04-16

Family

ID=66960808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910075369.1A Active CN109918538B (en) 2019-01-25 2019-01-25 Video information processing method and device, storage medium and computing equipment

Country Status (1)

Country Link
CN (1) CN109918538B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529161A (en) * 2020-12-10 2021-03-19 北京百度网讯科技有限公司 Training method for generating countermeasure network, and method and device for translating human face image
CN112990357B (en) * 2021-04-16 2021-07-27 中国工程物理研究院计算机应用研究所 Black box video countermeasure sample generation method based on sparse disturbance
CN114359144A (en) * 2021-12-01 2022-04-15 阿里巴巴(中国)有限公司 Image detection method and method for obtaining image detection model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101500073A (en) * 2008-01-28 2009-08-05 联发科技股份有限公司 Video processing methods and related apparatus
CN102668548A (en) * 2009-12-17 2012-09-12 佳能株式会社 Video information processing method and video information processing apparatus
CN107563274A (en) * 2017-07-10 2018-01-09 安徽四创电子股份有限公司 A kind of vehicle checking method and method of counting of the video based on confrontation e-learning
EP3364343A1 (en) * 2017-02-17 2018-08-22 Cogisen SRL Method for image processing for object detection

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170323163A1 (en) * 2016-05-06 2017-11-09 City Of Long Beach Sewer pipe inspection and diagnostic system and method
CN107169993A (en) * 2017-05-12 2017-09-15 甘肃政法学院 Detection recognition method is carried out to object using public security video monitoring blurred picture
CN108960080B (en) * 2018-06-14 2020-07-17 浙江工业大学 Face recognition method based on active defense image anti-attack

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101500073A (en) * 2008-01-28 2009-08-05 联发科技股份有限公司 Video processing methods and related apparatus
CN102668548A (en) * 2009-12-17 2012-09-12 佳能株式会社 Video information processing method and video information processing apparatus
EP3364343A1 (en) * 2017-02-17 2018-08-22 Cogisen SRL Method for image processing for object detection
CN107563274A (en) * 2017-07-10 2018-01-09 安徽四创电子股份有限公司 A kind of vehicle checking method and method of counting of the video based on confrontation e-learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Change detection algorithm robust to illumination, shadows and reflections and its implementation; 明英 et al.; 《计算机工程与应用》 (Computer Engineering and Applications); 2003-08-21; pp. 23-26 *

Also Published As

Publication number Publication date
CN109918538A (en) 2019-06-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant