CN113591655A - Video contrast loss calculation method, system, storage medium and electronic device - Google Patents

Video contrast loss calculation method, system, storage medium and electronic device

Info

Publication number
CN113591655A
Authority
CN
China
Prior art keywords
video
sound
contrast loss
encoder
network
Prior art date
Legal status
Pending
Application number
CN202110835232.9A
Other languages
Chinese (zh)
Inventor
胡郡郡
唐大闰
Current Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202110835232.9A
Publication of CN113591655A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video contrast loss calculation method, system, storage medium and electronic device. The contrast loss calculation method comprises a sampling step: continuously sampling a plurality of equal video segments from each video; a processing step: processing the video frames and the audio spectrum of each video segment through an Encoder network and a co-attention module to obtain visual and audio modal features; and a contrast loss calculation step: calculating the contrast loss according to the visual and audio modal features. The processing step comprises an input step: respectively feeding the video frames and the audio spectrum of each video segment to the Encoder network; and an Encoder network processing step: processing the video frames and the audio spectrum through the Encoder network to obtain visual features and audio features. The invention achieves cross-modal information fusion: the audio information can guide the learning of the visual model, and the visual information can guide the learning of the audio model.

Description

Video contrast loss calculation method, system, storage medium and electronic device
Technical Field
The invention belongs to the field of video contrast loss calculation, and in particular relates to a video contrast loss calculation method and system, a storage medium, and an electronic device.
Background
1) Representation based on direct classification: for example, a video is directly assigned a class, and the encoder part of the model is trained with this classification objective. Direct classification is a supervised method and requires labeled data, whereas contrastive learning is a self-supervised method: it needs no labels and can learn the semantically abstract information of the image directly from the characteristics of the data.
2) Representation based on generative learning: such methods attend to pixel-level details of the image. Contrastive learning, by contrast, only needs to discriminate in a feature space; it ignores pixel-level detail and focuses on abstract semantic information.
The prior art is as follows:
Existing video contrast loss calculation methods extract features from a video; once good features are available, downstream tasks such as action recognition, scene segmentation and scene classification can be performed better.
The prior-art Transformer realizes self-attention, and this attention is not limited by distance.
Disclosure of Invention
The embodiments of the present application provide a video contrast loss calculation method, system, storage medium and electronic device, so as to at least solve the problem that conventional video contrast loss calculation methods involve a complicated procedure.
The invention provides a video contrast loss calculation method, which comprises the following steps:
a sampling step: continuously sampling a plurality of equal video segments from each video;
a processing step: processing the video frames and the audio spectrum of each video segment through an Encoder network and a co-attention module to obtain visual and audio modal features;
a contrast loss calculation step: calculating the contrast loss according to the visual and audio modal features.
In the above contrast loss calculation method, the processing step includes:
an input step: respectively feeding the video frames and the audio spectrum of each video segment to the Encoder network;
an Encoder network processing step: processing the video frames and the audio spectrum through the Encoder network to obtain visual features and audio features;
a co-attention module processing step: the visual features and the audio features are input together into the co-attention module, and after the co-attention module finishes processing, the result is fed into a multi-layer perceptron (MLP) layer to obtain the visual and audio modal features.
In the above contrast loss calculation method, the Encoder network comprises an Encoder_v network and an Encoder_a network: the video frames are input into the Encoder_v network and the audio spectrum into the Encoder_a network; the weights of the two networks are not shared, and the video frames and the audio spectrum of the same video segment are kept temporally aligned.
In the above contrast loss calculation method, the contrast loss calculation step includes: calculating a contrast loss for each video segment from the visual and audio modal features, and averaging the plurality of contrast losses to obtain the total loss.
The invention also provides a video contrast loss calculation system, which comprises:
a sampling module, which continuously samples a plurality of equal video segments from each video;
a processing module, which processes the video frames and the audio spectrum of each video segment through an Encoder network and a co-attention module to obtain visual and audio modal features;
a contrast loss calculation module, which calculates the contrast loss according to the visual and audio modal features.
In the above contrast loss calculation system, the processing module includes:
an input unit, which respectively feeds the video frames and the audio spectrum of each video segment to the Encoder network;
an Encoder network processing unit, which processes the video frames and the audio spectrum to obtain visual features and audio features;
a co-attention module unit, which inputs the visual features and the audio features into the co-attention module; after the co-attention module finishes processing, the result is fed into a multi-layer perceptron layer to obtain the visual and audio modal features.
In the above contrast loss calculation system, the Encoder network comprises an Encoder_v network and an Encoder_a network: the video frames are input into the Encoder_v network and the audio spectrum into the Encoder_a network; the weights of the two networks are not shared, and the video frames and the audio spectrum of the same video segment are kept temporally aligned.
In the above contrast loss calculation system, the contrast loss calculation module calculates a contrast loss for each video segment from the visual and audio modal features, and averages the plurality of contrast losses to obtain the total loss.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the contrast loss calculation method as described in any one of the above when executing the computer program.
A storage medium having stored thereon a computer program, wherein the program when executed by a processor implements a contrast loss calculation method as described in any one of the above.
The invention has the beneficial effects that:
the invention belongs to the field of computer vision in the deep learning technology. The invention uses cross-model attention mode to do comparison study; and performing information interaction between the sequences in a sequence clip mode. 1) The invention can realize cross-modal information fusion, and the sound information of the invention can guide the learning of the visual model and the visual information can guide the learning of the sound model. 2) The invention uses a self-supervision method without marking data. 3) The invention carries out contrast learning on the continuous clip sequences and increases the information interaction between the sequences.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application.
In the drawings:
FIG. 1 is a flow chart of the video contrast loss calculation method of the present invention;
FIG. 2 is a flow chart of the processing step S2 of the present invention;
FIG. 3 is a model diagram of the present invention;
FIG. 4a is a diagram of the cross-modal attention module of the present invention;
FIG. 4b is a diagram of the self-attention module of the present invention;
FIG. 5 is a loss calculation graph of the present invention;
FIG. 6 is a schematic diagram of a system for calculating contrast loss of a video according to the present invention;
fig. 7 is a frame diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Before describing in detail the various embodiments of the present invention, the core inventive concepts of the present invention are summarized and described in detail by the following several embodiments.
The first embodiment is as follows:
referring to fig. 1, fig. 1 is a flowchart of a video contrast loss calculation method. As shown in fig. 1, the method for calculating the contrast loss of a video according to the present invention includes:
sampling step S1: continuously sampling a plurality of video equal parts for each video;
a processing step S2, in which the video frame and the sound frequency spectrum of each video equal part are processed by an Encoder network and a co-attention module to obtain the modal characteristics of vision and sound;
contrast loss calculation step S3: and calculating the contrast loss according to the modal characteristics of vision and sound.
Referring to fig. 2, fig. 2 is a flowchart of the processing step S2. As shown in fig. 2, the processing step S2 includes:
input step S21: respectively feeding the video frames and the audio spectrum of each video segment to the Encoder network;
Encoder network processing step S22: processing the video frames and the audio spectrum through the Encoder network to obtain visual features and audio features;
co-attention module processing step S23: inputting the visual features and the audio features together into the co-attention module, and after the co-attention module finishes processing, feeding the result into a multi-layer perceptron (MLP) layer to obtain the visual and audio modal features.
Wherein the Encoder network comprises an Encoder_v network and an Encoder_a network: the video frames are input into the Encoder_v network and the audio spectrum into the Encoder_a network; the weights of the two networks are not shared, and the video frames and the audio spectrum of the same video segment are kept temporally aligned.
Wherein the contrast loss calculation step includes: calculating a contrast loss for each video segment from the visual and audio modal features, and averaging the plurality of contrast losses to obtain the total loss.
Specifically, the invention uses a deep learning method to represent the visual part and the audio part of a video, so that the feature space maps the original visual and audio information well. Once a video has good visual and audio features, downstream tasks can be performed better.
Further, as shown in fig. 3, fig. 4 and fig. 5, the video contrast loss calculation method of the present invention includes:
step 1, continuously sampling n clips for each video, wherein the sampling form inside each clip is as follows: sampling every 4 frames for a total of 8 frames.
Step 2: feed the visual frames and the audio spectrum of each clip to Encoder_v and Encoder_a respectively; the network weights of Encoder_v and Encoder_a are not shared. The video frames and the audio spectrum of the same clip remain temporally aligned.
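The patent does not name the encoder backbones, only that Encoder_v and Encoder_a do not share weights. A minimal PyTorch sketch under that assumption follows, with tiny convolutional stand-ins in place of the real backbones:

```python
import torch.nn as nn

class EncoderV(nn.Module):
    """Visual encoder stand-in: takes a clip of frames (B, 3, T, H, W)."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, x):
        return self.net(x)

class EncoderA(nn.Module):
    """Audio encoder stand-in: takes a spectrogram (B, 1, F, T)."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, x):
        return self.net(x)

# Two separate module instances: the weights are not shared.
encoder_v, encoder_a = EncoderV(), EncoderA()
```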
Step 3: after passing through the Encoders, the visual and audio modalities each yield n×D features, where n is the clip sequence length and D is the feature dimension.
Step 4: feed the visual features (n×D) and the audio features (n×D) together into the co-attention module. Each co-attention module consists of a cross-modal attention module and a self-attention module; the cross-modal attention module is shown in fig. 4(a) and the self-attention module in fig. 4(b). A minimal sketch of one such block follows.
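This sketch assumes cross-modal attention (each modality queries the other, fig. 4a) followed by self-attention within each modality (fig. 4b); the head count, residual connections, and normalization placement are assumptions not specified by the patent.

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Cross-modal attention then self-attention, on (n, B, D) clip sequences."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_v = nn.MultiheadAttention(dim, heads)  # visual queries attend to audio
        self.cross_a = nn.MultiheadAttention(dim, heads)  # audio queries attend to visual
        self.self_v = nn.MultiheadAttention(dim, heads)
        self.self_a = nn.MultiheadAttention(dim, heads)
        self.norm_v1, self.norm_v2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_a1, self.norm_a2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, v, a):
        # Cross-modal attention: audio guides vision and vision guides audio.
        v2 = self.norm_v1(v + self.cross_v(v, a, a)[0])
        a2 = self.norm_a1(a + self.cross_a(a, v, v)[0])
        # Self-attention within each modality.
        v2 = self.norm_v2(v2 + self.self_v(v2, v2, v2)[0])
        a2 = self.norm_a2(a2 + self.self_a(a2, a2, a2)[0])
        return v2, a2
```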
Step 5: there may be several co-attention modules stacked in step 4. After all co-attention modules have been applied, the features enter a multi-layer perceptron (MLP) layer.
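The MLP layer that maps the co-attention output to the n×256 features used in step 6 might look like the sketch below; only the 256-dimensional output is stated in the patent, so the input width, hidden width, and activation are assumptions.

```python
import torch.nn as nn

# Hypothetical projection head: per-clip features from D=512 to 256 dims.
mlp = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))
```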
Step 6: step 5 yields the visual and audio modal features with dimension n×256, after which the contrast loss is calculated in the manner shown in fig. 5: each pair of clip features yields one contrast loss, and the total loss is the mean of the n pairwise losses.
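The patent states only that each pair of clip features yields one contrast loss and that the total loss is the mean of the n pairwise losses; fig. 5 is not reproduced here. The sketch below uses the common InfoNCE formulation, where the matching visual/audio pair within a batch is the positive and the other batch entries are negatives; the temperature and the symmetric two-direction form are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_pair_loss(v, a, temperature=0.07):
    """Contrast loss for one clip position. v, a: (B, 256) projected features."""
    v, a = F.normalize(v, dim=-1), F.normalize(a, dim=-1)
    logits = v @ a.t() / temperature                 # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric: visual -> audio and audio -> visual retrieval.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def total_loss(v_feats, a_feats):
    """v_feats, a_feats: (n, B, 256). Mean of the n pairwise clip losses."""
    return torch.stack([clip_pair_loss(v, a)
                        for v, a in zip(v_feats, a_feats)]).mean()
```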
Still further, the present invention uses cross-modal attention for contrastive learning.
Still further, the invention performs information interaction between sequences by sampling sequences of clips.
Example two:
referring to fig. 6, fig. 6 is a schematic structural diagram of a video contrast loss calculation system according to the present invention. Fig. 6 shows a system for calculating contrast loss of a video according to the present invention, which includes:
the sampling module is used for continuously sampling a plurality of video equal parts for each video;
the processing module is used for processing the video frame and the sound frequency spectrum of each video equal part through an Encoder network and a co-attention module to obtain modal characteristics of vision and sound;
a contrast loss calculation module that performs a contrast loss calculation according to the modal characteristics of vision and sound.
Wherein the processing module comprises:
the input unit is used for respectively transmitting the video frame and the sound frequency spectrum of each video equal part to an Encoder network;
the Encoder network processing unit is used for processing the video frame and the sound spectrum to obtain visual characteristics and sound characteristics;
and the co-attention module unit inputs the video features and the sound features into the co-attention module, and the co-attention module inputs the processing result into the multi-layer perceptron layer after the processing is finished so as to obtain the modal features of vision and sound.
Wherein the Encoder network includes: the video frame is input into the Encoder _ v network, the sound spectrum is input into the Encoder _ a network, the weights of the Encoder _ v network and the Encoder _ a network are not shared, and the video frame and the sound spectrum of the same video equal part are kept consistent in time.
Wherein the contrast loss calculation module comprises: and calculating according to the modal characteristics of vision and sound to obtain the contrast loss of each video equal part, and performing mean operation according to a plurality of contrast losses to obtain the total loss.
Example three:
referring to fig. 7, this embodiment discloses an embodiment of an electronic device. The electronic device may include a processor 81 and a memory 82 storing computer program instructions.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 82 may include mass storage for data or instructions. By way of example and not limitation, the memory 82 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 82 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus. In a particular embodiment, the memory 82 is Non-Volatile memory. In particular embodiments, the memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPM DRAM), Extended Data Output DRAM (EDO DRAM), Synchronous DRAM (SDRAM), and the like.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 reads and executes the computer program instructions stored in the memory 82 to implement the contrast loss calculation method for video in any of the above embodiments.
In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in fig. 7, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 is used to implement communication between the modules, apparatuses, units and/or devices in the embodiments of the present application. The communication interface 83 may also carry out data communication with external components, such as external devices, image/data acquisition equipment, databases, external storage, an image/data processing workstation, and the like.
The bus 80 includes hardware, software, or both, coupling the components of the electronic device to one another. The bus 80 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example and not limitation, the bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of these. The bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated.
The electronic device may implement the video contrast loss calculation methods described in connection with figs. 1 and 2.
In addition, in combination with the video contrast loss calculation method in the above embodiments, the embodiments of the present application may provide a computer-readable storage medium having computer program instructions stored thereon; when executed by a processor, the computer program instructions implement the video contrast loss calculation method of any of the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
In summary, the beneficial effects of the invention are that the scheme realizes the video contrast loss calculation: 1) the invention achieves cross-modal information fusion, where the audio information can guide the learning of the visual model and the visual information can guide the learning of the audio model; 2) the invention uses a self-supervised method and requires no labeled data; 3) the invention performs contrastive learning on continuous clip sequences, increasing the information interaction between sequences.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A video contrast loss calculation method, comprising:
a sampling step: continuously sampling a plurality of equal video segments from each video;
a processing step: processing the video frames and the audio spectrum of each video segment through an Encoder network and a co-attention module to obtain visual and audio modal features;
a contrast loss calculation step: calculating the contrast loss according to the visual and audio modal features.
2. The video contrast loss calculation method of claim 1, wherein the processing step comprises:
an input step: respectively feeding the video frames and the audio spectrum of each video segment to the Encoder network;
an Encoder network processing step: processing the video frames and the audio spectrum through the Encoder network to obtain visual features and audio features;
a co-attention module processing step: inputting the visual features and the audio features together into the co-attention module, and after the co-attention module finishes processing, feeding the result into a multi-layer perceptron layer to obtain the visual and audio modal features.
3. The video contrast loss calculation method of claim 2, wherein the Encoder network comprises an Encoder_v network and an Encoder_a network: the video frames are input into the Encoder_v network and the audio spectrum into the Encoder_a network; the weights of the Encoder_v network and the Encoder_a network are not shared, and the video frames and the audio spectrum of the same video segment are kept temporally aligned.
4. The video contrast loss calculation method of claim 1, wherein the contrast loss calculation step comprises: calculating a contrast loss for each video segment from the visual and audio modal features, and averaging the plurality of contrast losses to obtain the total loss.
5. A video contrast loss calculation system, comprising:
a sampling module, which continuously samples a plurality of equal video segments from each video;
a processing module, which processes the video frames and the audio spectrum of each video segment through an Encoder network and a co-attention module to obtain visual and audio modal features;
a contrast loss calculation module, which calculates the contrast loss according to the visual and audio modal features.
6. The video contrast loss calculation system of claim 5, wherein the processing module comprises:
an input unit, which respectively feeds the video frames and the audio spectrum of each video segment to the Encoder network;
an Encoder network processing unit, which processes the video frames and the audio spectrum to obtain visual features and audio features;
a co-attention module unit, which inputs the visual features and the audio features into the co-attention module; after the co-attention module finishes processing, the result is fed into a multi-layer perceptron layer to obtain the visual and audio modal features.
7. The video contrast loss calculation system of claim 6, wherein the Encoder network comprises an Encoder_v network and an Encoder_a network: the video frames are input into the Encoder_v network and the audio spectrum into the Encoder_a network; the weights of the Encoder_v network and the Encoder_a network are not shared, and the video frames and the audio spectrum of the same video segment are kept temporally aligned.
8. The video contrast loss calculation system of claim 5, wherein the contrast loss calculation module calculates a contrast loss for each video segment from the visual and audio modal features, and averages the plurality of contrast losses to obtain the total loss.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the contrast loss calculation method according to any one of claims 1 to 4 when executing the computer program.
10. A storage medium on which a computer program is stored, which program, when being executed by a processor, carries out the contrast loss calculation method according to any one of claims 1 to 4.
CN202110835232.9A 2021-07-23 2021-07-23 Video contrast loss calculation method, system, storage medium and electronic device Pending CN113591655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110835232.9A CN113591655A (en) 2021-07-23 2021-07-23 Video contrast loss calculation method, system, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110835232.9A CN113591655A (en) 2021-07-23 2021-07-23 Video contrast loss calculation method, system, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN113591655A (en) 2021-11-02

Family

ID=78249202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110835232.9A Pending CN113591655A (en) 2021-07-23 2021-07-23 Video contrast loss calculation method, system, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN113591655A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021115180A1 (en) * 2019-12-13 2021-06-17 北京金山云网络技术有限公司 Sample image processing method and apparatus, electronic device, and medium
CN112203122A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Artificial intelligence-based similar video processing method and device and electronic equipment
CN112820320A (en) * 2020-12-31 2021-05-18 中国科学技术大学 Cross-modal attention consistency network self-supervision learning method
CN112926379A (en) * 2021-01-07 2021-06-08 上海明略人工智能(集团)有限公司 Method and device for constructing face recognition model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YING CHENG et al.: "Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning", Proceedings of the 28th ACM International Conference on Multimedia *

Similar Documents

Publication Publication Date Title
CN111666960B (en) Image recognition method, device, electronic equipment and readable storage medium
CN107545889A (en) Suitable for the optimization method, device and terminal device of the model of pattern-recognition
CN110189260B (en) Image noise reduction method based on multi-scale parallel gated neural network
CN113569705A (en) Scene segmentation point judgment method and system, storage medium and electronic device
CN112784572A (en) Marketing scene conversational analysis method and system
JP2023535108A (en) Video tag recommendation model training method, video tag determination method, device, electronic device, storage medium and computer program therefor
CN113743277A (en) Method, system, equipment and storage medium for short video frequency classification
CN113012689B (en) Electronic equipment and deep learning hardware acceleration method
CN114048288A (en) Fine-grained emotion analysis method and system, computer equipment and storage medium
CN113902636A (en) Image deblurring method and device, computer readable medium and electronic equipment
CN113591655A (en) Video contrast loss calculation method, system, storage medium and electronic device
CN113569703B (en) Real division point judging method, system, storage medium and electronic equipment
CN113569704B (en) Segmentation point judging method, system, storage medium and electronic equipment
CN110414527A (en) Character identifying method, device, storage medium and electronic equipment
CN111784567B (en) Method, apparatus, electronic device, and computer-readable medium for converting image
CN114254563A (en) Data processing method and device, electronic equipment and storage medium
CN112560970A (en) Abnormal picture detection method, system, equipment and storage medium based on self-coding
CN113742525A (en) Self-supervision video hash learning method, system, electronic equipment and storage medium
CN113570417A (en) Social digital marketing method and system, storage medium and electronic equipment
CN112257726A (en) Target detection training method, system, electronic device and computer readable storage medium
CN113569706B (en) Video scene segmentation point judging method, system, storage medium and electronic equipment
CN113343669B (en) Word vector learning method, system, electronic equipment and storage medium
CN112863497B (en) Method and device for speech recognition, electronic equipment and computer readable storage medium
CN113821661B (en) Image retrieval method, system, storage medium and electronic device
CN116596043B (en) Convolutional neural network calculation method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination