CN112966723B - Video data augmentation method, video data augmentation device, electronic device and readable storage medium


Info

Publication number
CN112966723B
Authority
CN
China
Prior art keywords
video data
video
frames
mixed
mixing
Prior art date
Legal status
Active
Application number
CN202110172710.2A
Other languages
Chinese (zh)
Other versions
CN112966723A (en)
Inventor
王世鹏
黄军
程军
胡晓光
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110172710.2A
Publication of CN112966723A
Application granted
Publication of CN112966723B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure provides a video data augmentation method, a video data augmentation device, an electronic device and a readable storage medium, and relates to fields of artificial intelligence such as deep learning and computer vision. The method comprises: in the model training process, for the M video data in any training batch, where M is a positive integer greater than one, performing the following processing: composing a first video sequence using the M video data in their original order; randomly reordering the M video data in the first video sequence to obtain a second video sequence; and mixing each video data in the first video sequence with the corresponding video data in the second video sequence, respectively, to obtain M mixed video data for model training. By applying the disclosed scheme, the model training effect, the model performance and the like can be improved.

Description

Video data augmentation method, video data augmentation device, electronic device and readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for video data augmentation in the fields of deep learning and computer vision, an electronic device, and a readable storage medium.
Background
Currently, video classification is usually performed in combination with deep learning techniques; for example, a trained video classification model may be used to classify videos.
In model training, research has mainly focused on how to extract the spatio-temporal features of videos, for example, improving the utilization of video spatio-temporal information, especially temporal information, by optimizing the design of the network structure; network training strategies, especially video data augmentation, have received little reference or study.
Disclosure of Invention
The disclosure provides a video data augmentation method, a video data augmentation device, an electronic device and a readable storage medium.
A method of video data augmentation, comprising:
in the model training process, for the M video data in any training batch, where M is a positive integer greater than one, respectively performing the following processing:
composing a first video sequence using the M video data in the original order;
randomly ordering the M video data in the first video sequence to obtain a second video sequence;
and mixing each video data in the first video sequence with the corresponding video data in the second video sequence, respectively, to obtain M mixed video data for model training.
A video data augmentation apparatus, comprising: a first acquisition module, a second acquisition module and a mixing module;
the first obtaining module is configured to, in a model training process, for M video data in any training batch, where M is a positive integer greater than one, form a first video sequence using the M video data in an original order;
the second obtaining module is configured to randomly sort M video data in the first video sequence to obtain a second video sequence;
and the mixing module is used for mixing each video data in the first video sequence with the corresponding video data in the second video sequence respectively to obtain M mixed video data for model training.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.
A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described above.
A computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
One embodiment in the above disclosure has the following advantages or benefits: video data augmentation is achieved through operations such as reordering and mixing video data, thereby increasing the diversity of the data, effectively avoiding problems such as overfitting when the training data set is small, and improving the model training effect, the model performance and the like.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart illustrating an embodiment of a video data augmentation method according to the present disclosure;
FIG. 2 is a schematic diagram of an implementation of the mixing operation of the present disclosure;
fig. 3 is a schematic diagram of an overall implementation process of the video data augmentation method according to the present disclosure;
fig. 4 is a schematic diagram illustrating an exemplary embodiment 400 of a video data augmentation apparatus according to the present disclosure;
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
Fig. 1 is a flowchart illustrating an embodiment of a video data augmentation method according to the present disclosure. As shown in fig. 1, the following detailed implementation is included.
In step 101, in the model training process, M video data in any training batch (batch), where M is a positive integer greater than one, are processed in the manner shown in steps 102 to 104, respectively.
In step 102, a first video sequence is composed using the M video data in their original order.
In step 103, the M video data in the first video sequence are randomly ordered to obtain a second video sequence.
In step 104, each video data in the first video sequence and the corresponding video data in the second video sequence are mixed to obtain M mixed video data for model training.
It can be seen that, in the scheme of this method embodiment, video data are augmented through operations such as reordering and mixing, which increases the diversity of the data, effectively avoids problems such as overfitting when the training data set is small, and further improves the model training effect, the model performance and the like. The model may be a video classification model.
Data augmentation is one of the common techniques in deep learning; it refers to expanding the scale of a training data set by performing a series of operations on the training data to generate new training data that are similar but different. The present disclosure provides a video data augmentation method mainly directed at video data.
Each batch usually includes a plurality of video data, i.e. a plurality of video segments/video files, and the specific number can be determined according to actual needs.
As described above, for M video data, the first video sequence may be composed using M video data in the original order. And, the M video data in the first video sequence can be randomly ordered to obtain a second video sequence.
For example, assume that the first video sequence includes 8 video data, referred to for convenience as video data 1, video data 2, video data 3, video data 4, video data 5, video data 6, video data 7 and video data 8. The M video data in the first video sequence may be randomly ordered; assuming the randomly ordered 8 video data are, in sequence, video data 2, video data 3, video data 5, video data 1, video data 6, video data 8, video data 4 and video data 7, then these 8 video data constitute the second video sequence.
Then, each video data in the first video sequence and the corresponding video data in the second video sequence can be mixed respectively, so as to obtain M mixed video data for model training.
For any video data in the first video sequence, the manner of determining its corresponding video data is not limited; for example, any video data in the second video sequence may be used as its corresponding video data, or the video data in the second video sequence at the same position (e.g., the second position in the ordering) may be used as its corresponding video data, and so on.
For example, assuming that the first video sequence includes 8 video data, which are, in sequence, video data 1, video data 2, video data 3, video data 4, video data 5, video data 6, video data 7 and video data 8, and the 8 video data in the second video sequence are, in sequence, video data 2, video data 3, video data 5, video data 1, video data 6, video data 8, video data 4 and video data 7, then video data 1 in the first video sequence may be mixed with video data 2 in the second video sequence, video data 2 in the first video sequence with video data 3 in the second video sequence, video data 3 in the first video sequence with video data 5 in the second video sequence, and so on.
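As an illustration of the batch-level pairing described above, the following is a minimal PyTorch-style sketch; the tensor layout (M, T, C, H, W) and all variable names are assumptions for illustration and are not specified by the present disclosure:

    import torch

    # One training batch of M video data, stacked as a tensor of shape
    # (M, T, C, H, W): M clips of T frames each (an assumed layout).
    videos = torch.randn(8, 16, 3, 224, 224)

    # First video sequence: the M video data in their original order.
    # Second video sequence: the same M video data, randomly reordered.
    perm = torch.randperm(videos.size(0))
    permutated_videos = videos[perm]

    # Pairing by position: the i-th video data in the first sequence is
    # mixed with the i-th video data in the second sequence.
    pairs = [(videos[i], permutated_videos[i]) for i in range(videos.size(0))]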
When any video data in the first video sequence is mixed with any video data in the second video sequence, the mixing may be performed in a mix (mixup) manner, a crop mix (cutmix) manner, or the like. These manners are merely illustrative and do not limit the technical scheme of the present disclosure.
The mixup and cutmix manners are data augmentation manners commonly used in the image field; in the present disclosure they are migrated to the video field and extended along the time dimension. Taking the mixup manner as an example, how to mix video data is described below.
For any video data in the first video sequence and any video data in the second video sequence, mixing may be performed in the following manner:
mix_inputs=lambda×video+(1-lambda)×permutated_video; (1)
wherein video represents any video data in the first video sequence, permutated_video represents any video data in the second video sequence, and mix_inputs represents the resulting mixed video data.
lambda is a hyper-parameter, and the calculation formula can be:
lambda=Beta(α,α); (2)
the probability density function of the Beta distribution (i.e., the Beta distribution) is as follows:
f(x; α, β) = x^(α-1) (1-x)^(β-1) / B(α, β); (3)
wherein α and β are shape parameters and B(α, β) = Γ(α)Γ(β)/Γ(α+β) is the Beta function.
When mixing is performed in the manner shown in formula (1), as a possible implementation, the two video data may be mixed frame by frame; that is, each frame in one of the video data is multiplied by the weight lambda, the corresponding frame in the other video data is multiplied by the weight (1-lambda), and the two products are added. The specific implementation is available in the prior art.
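A minimal sketch of this frame-by-frame mixup, assuming a PyTorch batch tensor of shape (M, T, C, H, W) and an assumed value α = 0.2 (the disclosure does not fix a value):

    import torch

    def video_mixup(videos, alpha=0.2):
        # lambda = Beta(alpha, alpha), per formula (2)
        lam = float(torch.distributions.Beta(alpha, alpha).sample())
        # Randomly reorder the batch to obtain the second video sequence
        perm = torch.randperm(videos.size(0))
        # Formula (1), applied frame by frame: broadcasting multiplies every
        # frame of one video by lambda and the corresponding frame of the
        # paired video by (1 - lambda), then adds the products.
        mix_inputs = lam * videos + (1.0 - lam) * videos[perm]
        return mix_inputs, perm, lam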
The implementation process of the cutmix mode is similar to that of the mixup mode, except that lambda is replaced by the corresponding hyper-parameter in the cutmix mode.
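The disclosure does not spell out the video cutmix in detail; one possible reading, sketched below under the assumption that the same spatial patch is cut and pasted across all frames, recomputes lambda from the actual pasted-area ratio:

    import torch

    def video_cutmix(videos, alpha=1.0):
        lam = float(torch.distributions.Beta(alpha, alpha).sample())
        perm = torch.randperm(videos.size(0))
        M, T, C, H, W = videos.shape
        # Patch sized so the pasted region covers roughly (1 - lam) of each frame
        cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
        cy, cx = int(torch.randint(H, (1,))), int(torch.randint(W, (1,)))
        y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
        x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
        mixed = videos.clone()
        # Paste the same patch from the paired video into every frame
        mixed[:, :, :, y1:y2, x1:x2] = videos[perm][:, :, :, y1:y2, x1:x2]
        # Recompute lambda from the actual pasted area, for the loss mixing
        lam = 1.0 - (y2 - y1) * (x2 - x1) / (H * W)
        return mixed, perm, lam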
In the above manner, M mixed video data can be obtained. For any mixed video data, the following processing may be performed: obtaining a prediction output corresponding to the mixed video data; determining a first loss according to the label of the first video data corresponding to the mixed video data and the prediction output; determining a second loss according to the label of the second video data corresponding to the mixed video data and the prediction output; and determining a third loss corresponding to the mixed video data according to the first loss and the second loss, so as to perform model training using the third loss. The first video data and the second video data are the two video data mixed to obtain the mixed video data. For example, the first loss and the second loss may each be multiplied by a corresponding weight, the two products added, and the sum used as the third loss corresponding to the mixed video data.
That is, in addition to mixing video data, labels need to be mixed. The labels of the first video data and the second video data are both available; for example, in an animal classification scenario, a label may be "cat" or "dog". The mixed video data may be fed into a classification network, such as a video classification model, to obtain a prediction output, i.e., a prediction result. A first loss may be determined according to the label of the first video data corresponding to the mixed video data and the prediction output, and a second loss may be determined according to the label of the second video data corresponding to the mixed video data and the prediction output; how to determine these losses is known in the prior art. Further, the first loss and the second loss may be weighted and summed, that is, each multiplied by its corresponding weight, the two products added, and the sum used as the third loss corresponding to the mixed video data, i.e., the final required loss.
For example, there may be: loss = lambda × loss_a + (1-lambda) × loss_b; (4)
wherein loss represents the third loss, loss_a represents the first loss, and loss_b represents the second loss.
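A sketch of this loss mixing, assuming a classification model trained with cross-entropy (the loss function itself is not fixed by the disclosure); labels holds the labels of the first video sequence, so labels[perm] are those of the second:

    import torch.nn.functional as F

    def mixed_loss(model, mix_inputs, labels, perm, lam):
        outputs = model(mix_inputs)                      # prediction output
        loss_a = F.cross_entropy(outputs, labels)        # first loss
        loss_b = F.cross_entropy(outputs, labels[perm])  # second loss
        # Formula (4): weighted sum gives the third loss used for training
        return lam * loss_a + (1.0 - lam) * loss_b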
Through the above processing, video data are augmented from the batch perspective, the spatio-temporal relationships between frames of the video data are preserved, and different video data and different labels are mixed, which increases the diversity of the data, effectively avoids problems such as overfitting when the training data set is small, and improves the model training effect, the model performance and the like.
In addition, video data differs from image data in that it extends in the time dimension, and different video data may have different video lengths.
Therefore, the solution of the present disclosure further provides that, before any two video data are mixed, if the video lengths of the two video data are determined to be inconsistent, the video lengths of the two video data may also be adjusted to be consistent.
Specifically, when the video lengths of the two video data are adjusted to be consistent, the following manner may be adopted:
1) Extracting the first N frames from the first video data of the two video data, where N is a positive integer greater than one, and using the extracted N frames to form the adjusted first video data; and extracting the first N frames from the second video data of the two video data, and using the extracted N frames to form the adjusted second video data.
2) Extracting P frames from the first video data according to a predetermined sampling frequency, where P is a positive integer greater than one, and using the extracted P frames to form the adjusted first video data; and extracting P frames from the second video data according to the predetermined sampling frequency, and using the extracted P frames to form the adjusted second video data.
Which of the above manners to adopt can be determined according to actual needs.
In manner 1), the first N frames are extracted from the first video data and the second video data respectively, and the extracted N frames are used to form the adjusted first video data and the adjusted second video data, so that the adjusted first video data and second video data have the same video length; the specific value of N can be determined according to actual needs.
In manner 2), P frames are extracted from the first video data and the second video data respectively according to a predetermined sampling frequency, and the extracted P frames are used to form the adjusted first video data and the adjusted second video data, so that the adjusted first video data and second video data have the same video length; the specific value of P can be determined according to actual needs.
Through the processing, the video lengths of the two video data to be mixed are ensured to be consistent, so that the video data mixing effect is improved.
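The two length-alignment manners could be realized, for example, as follows (a sketch operating on per-video frame sequences; the helper names are illustrative only):

    def align_first_n(frames_a, frames_b, n):
        # Manner 1): keep the first N frames of each video
        return frames_a[:n], frames_b[:n]

    def align_by_sampling(frames_a, frames_b, p):
        # Manner 2): take P frames at a uniform stride, one way to realize
        # extraction "according to a predetermined sampling frequency";
        # assumes each video has at least P frames.
        def sample(frames, p):
            stride = max(len(frames) // p, 1)
            return frames[::stride][:p]
        return sample(frames_a, p), sample(frames_b, p)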
In addition, in practical applications, some other processing may be performed on the video data, and accordingly, the mixing operation for the first video data and the second video data may be further divided into an early mixing mode and a late mixing mode.
Fig. 2 is a schematic diagram of an implementation of the mixing operation described in the present disclosure. As shown in fig. 2, the early mixing mode may further include manner a) and manner b). In manner a), the first N frames may be extracted from the first video data and the second video data respectively to obtain the adjusted first video data and second video data, which may then be mixed, and some other processing may further be performed on the resulting mixed video data. In manner b), P frames may be extracted from the first video data and the second video data respectively according to a predetermined sampling frequency to obtain the adjusted first video data and second video data, which may then be mixed, and some other processing may further be performed on the resulting mixed video data. The late mixing mode may further include manner c) and manner d). In manner c), the first N frames may be extracted from the first video data and the second video data respectively to obtain the adjusted first video data and second video data, some other processing may then be performed on them, and the processed first video data and second video data may further be mixed. In manner d), P frames may be extracted from the first video data and the second video data respectively according to a predetermined sampling frequency to obtain the adjusted first video data and second video data, some other processing may then be performed on them, and the processed first video data and second video data may further be mixed. What the other processing specifically includes may be determined according to actual needs, and may include, for example, cropping, flipping and the like of frames in the video data.
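The difference between the two modes is only the order of the mixing and the other processing; a schematic sketch (the function arguments are placeholders, not part of the original disclosure):

    def augment_early(video_a, video_b, align, mix, other_processing):
        # Early mixing (manners a)/b)): align lengths, mix, then apply the
        # other processing (e.g., cropping, flipping) to the mixed video
        video_a, video_b = align(video_a, video_b)
        return other_processing(mix(video_a, video_b))

    def augment_late(video_a, video_b, align, mix, other_processing):
        # Late mixing (manners c)/d)): align lengths, process each video
        # separately, then mix the processed videos
        video_a, video_b = align(video_a, video_b)
        return mix(other_processing(video_a), other_processing(video_b))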
On the basis of the above introduction, fig. 3 is a schematic diagram of the overall implementation process of the video data augmentation method according to the present disclosure. As shown in fig. 3, the hyper-parameter is denoted lambda. For the specific implementation of the process shown in fig. 3, reference may be made to the foregoing related descriptions, which are not repeated here.
In addition, the method of the present disclosure can be applied to video data in various formats and thus has wide applicability; for example, the video data may be in the MP4 (MPEG-4 Part 14) format, the Audio Video Interleaved (AVI) format, or other formats. Generally, the video data in one batch share the same format; if the formats of the video data in a batch differ, they can first be converted into a uniform format.
It is noted that while for simplicity of explanation, the foregoing method embodiments are described as a series of acts, those skilled in the art will appreciate that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required for the disclosure.
The above is a description of embodiments of the method, and the embodiments of the apparatus are further described below.
Fig. 4 is a schematic block diagram illustrating the structure of an embodiment 400 of the video data augmentation apparatus according to the present disclosure. As shown in fig. 4, it includes: a first acquisition module 401, a second acquisition module 402, and a mixing module 403.
The first obtaining module 401 is configured to, in a model training process, form a first video sequence by using M video data in an original order for M video data in any batch, where M is a positive integer greater than one.
A second obtaining module 402, configured to randomly order the M video data in the first video sequence to obtain a second video sequence.
A mixing module 403, configured to mix each video data in the first video sequence with a corresponding video data in the second video sequence, respectively, to obtain M pieces of mixed video data, where the M pieces of mixed video data are used for model training.
When any video data in the first video sequence and any video data in the second video sequence are mixed, the mixing module 403 may mix according to a mixup mode or a cutmix mode.
For any resulting mixed video data, the mixing module 403 may further perform the following processing: obtaining a prediction output corresponding to the mixed video data; determining a first loss according to the label of the first video data corresponding to the mixed video data and the prediction output; determining a second loss according to the label of the second video data corresponding to the mixed video data and the prediction output; and determining a third loss corresponding to the mixed video data according to the first loss and the second loss, so as to perform model training using the third loss. The first video data and the second video data are the two video data mixed to obtain the mixed video data. For example, the first loss and the second loss may each be multiplied by a corresponding weight, the two products added, and the sum used as the third loss corresponding to the mixed video data.
In addition, before mixing any two video data, if it is determined that the video lengths of the two video data are not the same, the mixing module 403 may also adjust the video lengths of the two video data to be the same.
Specifically, when the video lengths of the two video data are adjusted to be consistent, the mixing module 403 may adopt the following manner:
1) Extracting the first N frames from the first video data of the two video data, where N is a positive integer greater than one, and using the extracted N frames to form the adjusted first video data; and extracting the first N frames from the second video data of the two video data, and using the extracted N frames to form the adjusted second video data.
2) Extracting P frames from the first video data according to a predetermined sampling frequency, where P is a positive integer greater than one, and using the extracted P frames to form the adjusted first video data; and extracting P frames from the second video data according to the predetermined sampling frequency, and using the extracted P frames to form the adjusted second video data.
Which of the above manners to adopt can be determined according to actual needs.
For a specific work flow of the apparatus embodiment shown in fig. 4, reference is made to the related description in the foregoing method embodiment, and details are not repeated.
In a word, with the scheme of the apparatus embodiment of the present disclosure, video data are augmented through operations such as reordering and mixing, which increases the diversity of the data, effectively avoids problems such as overfitting when the training data set is small, and further improves the model training effect, the model performance and the like.
The scheme of the present disclosure can be applied to the field of artificial intelligence, in particular to fields such as deep learning and computer vision. Artificial intelligence is the discipline of studying how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and covers both hardware and software technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies and the like.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 includes a computing unit 501, which may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or a computer program loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 may also store various programs and data required for the operation of the device 500. The computing unit 501, the ROM 502 and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as the methods described in the present disclosure. For example, in some embodiments, the methods described in the present disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the methods described in the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured by any other suitable means (e.g., by means of firmware) to perform the methods described in the present disclosure.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the defects of high management difficulty and weak service scalability of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain. Cloud computing refers to accessing an elastically scalable shared pool of physical or virtual resources through a network, where the resources may include servers, operating systems, networks, software, applications, storage devices and the like, and can be deployed and managed on demand in a self-service manner; cloud computing technology can provide efficient and powerful data processing capability for technical applications and model training in artificial intelligence, blockchain and other fields.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (10)

1. A method of video data augmentation, comprising:
in the model training process, aiming at M video data in any training batch, wherein M is a positive integer greater than one, the following processing is respectively carried out:
composing a first video sequence using the M video data in the original order;
randomly ordering the M video data in the first video sequence to obtain a second video sequence;
respectively mixing each video data in the first video sequence with the corresponding video data in the second video sequence to obtain M mixed video data for model training;
for any mixed video data, the following processing is respectively carried out:
obtaining a prediction output corresponding to the mixed video data;
determining a first loss according to a label of first video data corresponding to the mixed video data and the prediction output;
determining a second loss according to a label of second video data corresponding to the mixed video data and the prediction output;
determining a third loss corresponding to the mixed video data according to the first loss and the second loss so as to perform model training by using the third loss;
the first video data and the second video data are two video data obtained by mixing the mixed video data.
2. The method of claim 1, wherein the mixing comprises:
mixing in a mix (mixup) manner;
or mixing in a crop mix (cutmix) manner.
3. The method of claim 1, further comprising:
before any two video data are mixed, if the video lengths of the two video data are determined to be inconsistent, the video lengths of the two video data are adjusted to be consistent.
4. The method of claim 3, wherein the adjusting the video lengths of the two video data to be consistent comprises:
extracting the first N frames from the first video data of the two video data, wherein N is a positive integer greater than one, and forming the adjusted first video data by using the extracted N frames; extracting the first N frames from the second video data of the two video data, and forming adjusted second video data by utilizing the extracted N frames;
or extracting P frames from the first video data according to a preset sampling frequency, wherein P is a positive integer greater than one, and the extracted P frames are used for forming the adjusted first video data; and extracting P frames from the second video data according to a preset sampling frequency, and forming the adjusted second video data by utilizing the extracted P frames.
5. A video data augmentation apparatus, comprising: a first acquisition module, a second acquisition module and a mixing module;
the first obtaining module is configured to, in a model training process, for M video data in any training batch, where M is a positive integer greater than one, form a first video sequence using the M video data in an original order;
the second obtaining module is configured to randomly sort M video data in the first video sequence to obtain a second video sequence;
the mixing module is used for mixing each video data in the first video sequence with the corresponding video data in the second video sequence respectively to obtain M mixed video data for model training;
the mixing module is further configured to, for any one of the mixed video data, perform the following processing respectively: obtaining a prediction output corresponding to the mixed video data; determining a first loss according to a label of first video data corresponding to the mixed video data and the prediction output; determining a second loss according to a label of second video data corresponding to the mixed video data and the prediction output; determining a third loss corresponding to the mixed video data according to the first loss and the second loss so as to perform model training by using the third loss; the first video data and the second video data are two video data obtained by mixing the mixed video data.
6. The apparatus of claim 5, wherein,
and the mixing module is used for mixing in a mix (mixup) manner, or mixing in a crop mix (cutmix) manner.
7. The apparatus of claim 5, wherein,
the mixing module is further configured to, before mixing any two video data, adjust the video lengths of the two video data to be consistent if it is determined that the video lengths of the two video data are not consistent.
8. The apparatus of claim 7, wherein,
the mixing module extracts the first N frames from the first video data of the two video data, wherein N is a positive integer greater than one, the extracted N frames are used for forming the adjusted first video data, the first N frames are extracted from the second video data of the two video data, and the extracted N frames are used for forming the adjusted second video data;
or, the mixing module extracts P frames from the first video data according to a predetermined sampling frequency, where P is a positive integer greater than one, uses the extracted P frames to form adjusted first video data, extracts P frames from the second video data according to the predetermined sampling frequency, and uses the extracted P frames to form adjusted second video data.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-4.
CN202110172710.2A 2021-02-08 2021-02-08 Video data augmentation method, video data augmentation device, electronic device and readable storage medium Active CN112966723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110172710.2A CN112966723B (en) 2021-02-08 2021-02-08 Video data augmentation method, video data augmentation device, electronic device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110172710.2A CN112966723B (en) 2021-02-08 2021-02-08 Video data augmentation method, video data augmentation device, electronic device and readable storage medium

Publications (2)

Publication Number Publication Date
CN112966723A (en) 2021-06-15
CN112966723B (en) 2022-05-03

Family

ID=76275526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110172710.2A Active CN112966723B (en) 2021-02-08 2021-02-08 Video data augmentation method, video data augmentation device, electronic device and readable storage medium

Country Status (1)

Country Link
CN (1) CN112966723B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205739B (en) * 2022-07-06 2023-11-28 中山大学·深圳 Low-light video behavior recognition method and system based on semi-supervised learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104581380B (en) * 2014-12-30 2018-08-31 联想(北京)有限公司 A kind of method and mobile terminal of information processing
CN110120011B (en) * 2019-05-07 2022-05-31 电子科技大学 Video super-resolution method based on convolutional neural network and mixed resolution
CN110163887B (en) * 2019-05-07 2023-10-20 国网江西省电力有限公司检修分公司 Video target tracking method based on combination of motion interpolation estimation and foreground segmentation
CN111259782B (en) * 2020-01-14 2022-02-11 北京大学 Video behavior identification method based on mixed multi-scale time sequence separable convolution operation
CN111860386B (en) * 2020-07-27 2022-04-08 山东大学 Video semantic segmentation method based on ConvLSTM convolutional neural network

Also Published As

Publication number Publication date
CN112966723A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN113326764B (en) Method and device for training image recognition model and image recognition
CN113033537B (en) Method, apparatus, device, medium and program product for training a model
CN112561079A (en) Distributed model training apparatus, method and computer program product
CN115063875B (en) Model training method, image processing method and device and electronic equipment
CN113901904A (en) Image processing method, face recognition model training method, device and equipment
CN113379627A (en) Training method of image enhancement model and method for enhancing image
CN114723966B (en) Multi-task recognition method, training method, device, electronic equipment and storage medium
EP4287074A1 (en) Mixture-of-experts model implementation method and system, electronic device, and storage medium
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN113792742A (en) Semantic segmentation method of remote sensing image and training method of semantic segmentation model
CN114693934A (en) Training method of semantic segmentation model, video semantic segmentation method and device
CN113627536A (en) Model training method, video classification method, device, equipment and storage medium
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN112966723B (en) Video data augmentation method, video data augmentation device, electronic device and readable storage medium
CN112561061A (en) Neural network thinning method, apparatus, device, storage medium, and program product
CN114049516A (en) Training method, image processing method, device, electronic device and storage medium
CN113344213A (en) Knowledge distillation method, knowledge distillation device, electronic equipment and computer readable storage medium
CN113627354B (en) A model training and video processing method, which comprises the following steps, apparatus, device, and storage medium
CN115496916A (en) Training method of image recognition model, image recognition method and related device
CN113361519B (en) Target processing method, training method of target processing model and device thereof
CN113139463B (en) Method, apparatus, device, medium and program product for training a model
CN112560987B (en) Image sample processing method, apparatus, device, storage medium, and program product
CN113920404A (en) Training method, image processing method, device, electronic device and storage medium
CN113361621A (en) Method and apparatus for training a model
CN114282664A (en) Self-feedback model training method and device, road side equipment and cloud control platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant