CN111930992A - Neural network training method and device and electronic equipment - Google Patents

Neural network training method and device and electronic equipment

Info

Publication number
CN111930992A
Authority
CN
China
Prior art keywords: sample, neural network, information, modal, feature
Prior art date
Legal status: Granted
Application number
CN202010819997.9A
Other languages
Chinese (zh)
Other versions
CN111930992B (en)
Inventor
徐世坚
杨田雨
姜文浩
刘威
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010819997.9A
Publication of CN111930992A
Application granted
Publication of CN111930992B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60: Information retrieval of audio data; Database structures therefor; File system structures therefor
    • G06F 16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683: Retrieval characterised by using metadata automatically derived from the content
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Abstract

The application belongs to the technical field of artificial intelligence, and particularly relates to a neural network training method, a neural network training device, a computer-readable medium, and an electronic device. The method comprises the following steps: sampling at least two sample segments from a video sample in video time order; adjusting the arrangement order of the at least two sample segments and acquiring the adjusted segment order information; performing feature extraction on the sample segments through neural networks corresponding to different modality types to obtain at least two modal features of the sample segments; and training the neural networks according to the feature similarity among the modal features and the segment order information so as to update the network parameters of the neural networks. The method requires no manual labeling of the video data, which reduces data processing cost and improves data processing efficiency.

Description

Neural network training method and device and electronic equipment
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a neural network training method, a neural network training device, a computer readable medium and an electronic device.
Background
With the development of computer and network technologies, producing, sharing, and watching online videos on various computing devices such as mobile phones and computers has become a common entertainment activity in daily life. For the massive video data stored and distributed on network platforms, classification of the video data is generally required in order to provide users with accurate and efficient services such as video search and video recommendation; in addition, labels such as "sports", "movies", "entertainment", and "fun" can be added to videos according to the classification results.
With the continuous progress of deep learning technology and the continuous improvement of computing power, video classification has advanced greatly. However, traditional video classification relies on a large amount of manually labeled data, which not only incurs high labor cost but is also inefficient, making it difficult to meet ever-growing video service demands. On some streaming media platforms in particular, the amount of video uploaded by users every day is enormous, and manually labeling the video data is impractical.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
The present application aims to provide a neural network training method, a neural network training device, a computer-readable medium, and an electronic device, which overcome, at least to some extent, the technical problems of high data processing cost and low efficiency in related technologies such as video data processing.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided a neural network training method, including:
sampling at least two sample segments from a video sample according to a video time sequence;
adjusting the arrangement sequence of the at least two sample fragments, and acquiring the adjusted fragment sequence information;
performing feature extraction on the sample fragment through neural networks corresponding to different modality types to obtain at least two modality features of the sample fragment;
and training the neural network according to the feature similarity of each modal feature and the fragment sequence information so as to update the network parameters of the neural network.
According to an aspect of an embodiment of the present application, there is provided a neural network training apparatus, including:
the video sampling module is configured to sample at least two sample segments from video samples according to video time sequence;
the sequence adjusting module is configured to adjust the arrangement sequence of the at least two sample fragments and acquire adjusted fragment sequence information;
a feature extraction module configured to perform feature extraction on the sample fragment through neural networks corresponding to different modality types to obtain at least two modality features of the sample fragment;
a parameter updating module configured to train the neural network according to the feature similarity of each modal feature and the segment sequence information to update network parameters of the neural network.
In some embodiments of the present application, based on the above technical solutions, the video sampling module includes:
the first extraction unit is configured to perform multi-modal information extraction on the video samples to obtain modal information samples corresponding to different modal types;
the first sampling unit is configured to synchronously sample the mode information samples respectively according to a video time sequence so as to obtain at least two sample segments corresponding to different mode types.
In some embodiments of the present application, based on the above technical solutions, the video sampling module includes:
a second sampling unit configured to sample the video samples in a video time order to obtain at least two video segments;
and the second extraction unit is configured to perform multi-modal information extraction on the video clips, so as to obtain at least two sample clips corresponding to different modal types.
In some embodiments of the present application, based on the above technical solution, a sampling interval of the sample segment is greater than or equal to a sampling length of the sample segment.
In some embodiments of the present application, based on the above technical solutions, the modality types include at least two of an image modality, an audio modality, and a text modality; the feature extraction module includes:
an image feature extraction unit configured to perform feature extraction on the image sample through an image processing neural network to obtain an image feature of the sample fragment if the sample fragment includes an image sample corresponding to the image modality;
an audio feature extraction unit configured to perform feature extraction on the audio sample through an audio processing neural network to obtain an audio feature of the sample segment if the sample segment includes an audio sample corresponding to the audio modality;
and a text feature extraction unit configured to, if the sample fragment includes a text sample corresponding to the text modality, perform feature extraction on the text sample through a text processing neural network to obtain a text feature of the sample fragment.
In some embodiments of the present application, based on the above technical solution, the image processing neural network includes a plurality of sequentially connected three-dimensional convolution processing units, where the three-dimensional convolution processing units include a two-dimensional space convolution layer and a one-dimensional time convolution layer that are sequentially connected; the image feature extraction unit includes:
the two-dimensional space convolution subunit is configured to perform convolution processing on the image sample through the two-dimensional space convolution layer to obtain an intermediate feature map carrying spatial features;
and the one-dimensional time convolution subunit is configured to perform convolution processing on the intermediate feature map through the one-dimensional time convolution layer to obtain the image features of the sample segment carrying the spatial features and the temporal features.
In some embodiments of the present application, based on the above technical solution, the audio processing neural network includes a plurality of two-dimensional convolution processing units connected in sequence, and the audio feature extraction unit includes:
an audio filtering subunit, configured to perform filtering processing on the audio sample to obtain a two-dimensional mel frequency spectrum diagram;
a logarithm operation subunit, configured to perform a logarithm operation on the Mel frequency spectrogram to obtain two-dimensional frequency spectrum information for quantizing the sound intensity;
and the two-dimensional convolution subunit is configured to perform convolution processing on the two-dimensional spectrum information through the two-dimensional convolution processing unit to obtain the audio features of the sample fragment.
In some embodiments of the present application, based on the above technical solution, the two-dimensional convolution processing unit includes a residual connecting branch and a convolution connecting branch; the two-dimensional convolution subunit includes:
a residual mapping subunit, configured to perform mapping processing on the two-dimensional spectrum information through the residual connection branch to obtain residual mapping information;
a convolution mapping subunit, configured to perform convolution processing on the two-dimensional spectrum information through the convolution connection branch to obtain audio convolution information;
and the mapping and superposing subunit is configured to superpose the residual mapping information and the audio convolution information to obtain the audio features of the sample segments.
In some embodiments of the present application, based on the above technical solutions, the parameter updating module includes:
the contrast error determining unit is configured to respectively acquire feature similarity among the modal features and determine contrast error information of the modal features according to the feature similarity;
a sequence error determination unit configured to perform mapping processing on the modal characteristics to obtain sequence prediction information, and determine sequence error information of the sample segment according to the segment sequence information and the sequence prediction information;
and the error superposition unit is configured to carry out superposition processing on the comparison error information and the sequence error information to obtain an overall loss error, and update the network parameters of the neural network according to the overall loss error.
In some embodiments of the present application, based on the above technical solutions, the comparison error determination unit includes:
a positive and negative sample determination subunit configured to take the modal features corresponding to the same sample segment as positive samples and the modal features corresponding to different sample segments as negative samples;
a contrast error calculation subunit configured to perform error calculation on the feature similarity of the positive sample and the negative sample by using a contrast loss function to obtain contrast error information of the modal feature.
In some embodiments of the present application, based on the above technical solutions, the order error determination unit includes:
the feature selection subunit is configured to randomly select, from the modal features corresponding to each sample fragment, a sample feature corresponding to that sample fragment;
the local splicing subunit is configured to respectively perform pairwise splicing processing on the sample features corresponding to the sample segments to obtain local splicing features;
the local mapping subunit is configured to perform mapping processing on each local splicing feature to obtain a local mapping feature of the local splicing feature;
the integral splicing subunit is configured to splice the local mapping characteristics to obtain integral splicing characteristics;
the integral mapping subunit is configured to perform mapping processing on the integral splicing characteristic to obtain an integral mapping characteristic of the integral splicing characteristic;
a normalization mapping subunit configured to perform normalization mapping on each feature element in the overall mapping feature to obtain sequential prediction information of the modal feature.
In some embodiments of the present application, based on the above technical solutions, the order error determination unit further includes:
a prediction target determination subunit configured to take the piece order information as a target value and the order prediction information as a prediction value;
a sequential error calculation subunit configured to perform error calculation on the target value and the predicted value by a cross entropy loss function to obtain sequential error information of the sample segment.
In some embodiments of the present application, based on the above technical solutions, the neural network training device further includes:
and the common mapping module is configured to map each modal feature through a common mapping head neural network to obtain the modal feature corresponding to the common embedding space.
According to an aspect of the embodiments of the present application, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements a neural network training method as in the above technical solution.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the neural network training method as in the above solution via execution of the executable instructions.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the neural network training method as in the above technical scheme.
In the technical solution provided by the embodiments of the present application, sampling segments from video samples and adjusting their order allows video timing information to be incorporated into the neural network training process, so that self-supervised learning is realized based on the video timing information; no manual labeling of the video data is needed, which reduces data processing cost and improves data processing efficiency. In addition, multi-modal feature extraction allows the networks to focus directly on the semantic features of the video and reduces redundant information, and contrastive learning based on cross-modal features yields multiple neural networks for different modality types, improving the accuracy and representation capability of feature extraction.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
fig. 1 schematically shows a block diagram of an exemplary system architecture to which the solution of the present application applies.
Figure 2 schematically illustrates a flow chart of the steps of a neural network training method in some embodiments of the present application.
Fig. 3 schematically shows a schematic diagram of the neural network training method in an application scenario in the embodiment of the present application.
FIG. 4 schematically illustrates a model structure diagram of a three-dimensional convolution processing unit in some embodiments of the present application.
Fig. 5 schematically shows the effect of the conversion from an audio waveform diagram to a logarithmic mel spectrum diagram.
FIG. 6 schematically illustrates a flow chart of method steps for updating parameters of a neural network in some embodiments of the present application.
FIG. 7 schematically illustrates a schematic diagram of feature extraction optimization based on contrast learning in some embodiments of the present application.
Fig. 8 schematically shows a block diagram of a neural network training device provided in an embodiment of the present application.
FIG. 9 schematically illustrates a block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Before explaining the technical scheme of the present application, first, a brief description is made of the artificial intelligence technology involved in the technical scheme of the present application.
Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision technology (CV) is the science of how to make machines "see": using cameras and computers instead of human eyes to identify, track, and measure targets, and performing further image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and demonstration-based learning.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
Unsupervised learning: Unsupervised learning is a learning strategy in machine learning. Conventional supervised learning approaches rely on large amounts of manually labeled data, and labeling is often very time-consuming and impractical. Unsupervised learning can be trained with unlabeled data. A representative unsupervised learning method is, for example, the generative adversarial network.
Self-supervised learning: Self-supervised learning is a special supervised learning strategy. Like unsupervised learning, self-supervised learning does not rely on manually labeled data; instead, the labels or supervision information come from the data itself. Auxiliary learning tasks can be designed according to certain characteristics of the data, so that the data's own supervision information can be used for learning. For example, rotating an image and then predicting the rotation angle can serve as an auxiliary learning task to learn good feature representations for image classification.
Multi-modal learning: the information received by people usually contains different modalities. For example, video mainly contains two modalities, image and audio. In general, different modalities characterize different components of information. Multimodal learning can integrate information between different modalities to obtain a more robust, more comprehensive representation of features.
Contrastive learning: In deep learning, contrastive learning can be viewed as training a feature encoder to perform feature extraction and then performing a dictionary lookup on the extracted features. Specifically, consider the encoded feature of a given query sample q and the encoded features of a set of dictionary samples {k_0, k_1, k_2, …}. Suppose there is a sample k+ in the dictionary that matches q; the similarity between sample pairs can then be calculated using a contrastive loss function in order to find the matching sample k+ in the dictionary.
Contrastive loss function: The contrastive loss function is typically used in contrastive learning. Commonly used contrastive loss functions include the margin-based loss and the noise contrastive estimation loss (NCE loss). One example of an NCE loss is the InfoNCE loss, which is expressed as follows:
L_q = -log( exp(q·k+ / τ) / Σ_{i=0}^{K} exp(q·k_i / τ) )
where τ is the temperature hyperparameter. The summation in the denominator involves one positive sample and K negative samples. The loss function can therefore be viewed as a (K+1)-way classifier whose goal is to classify the query sample q into the class of k+.
Convolutional Neural Networks (CNNs) are a class of feed-forward neural networks that involve convolution operations, and generally consist of convolutional layers, pooling layers, fully connected layers, activation layers, and the like. They can be used for feature extraction from high-dimensional data such as images and audio.
Pre-training Pre-train and Fine-tune Finetune are two concepts that are combined together. In deep learning, for a variety of reasons (e.g., the data set is too small, training from scratch is too time consuming or difficult to converge), the network may be pre-trained before training on a particular data set for a particular task. The network is then fine-tuned to the task-specific data set. Typically, pre-training is performed on a larger data set, and the training task may be the same as or different from the particular task that follows. The trained network parameters can be used as initialization parameters of subsequent training, so that better feature extraction or faster convergence speed is obtained in a fine tuning stage, and the network performance is improved.
The technical solution of the present application can be implemented based on cloud computing technology. Cloud computing is a computing model that distributes computing tasks over a resource pool formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". To users, the resources in the "cloud" appear to be infinitely expandable, available at any time, obtainable on demand, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (cloud platform for short) is established, and multiple types of virtual resources are deployed in the resource pool and are used by external customers selectively. The cloud computing resource pool mainly comprises: computing devices (virtualized machines, including operating systems), storage devices, network devices.
Cloud computing is a product of the development and fusion of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization, and load balancing. With the increasing diversity of the Internet, real-time data streams, and connected devices, and the growing demand for search services, social networks, mobile commerce, open collaboration, and the like, cloud computing has developed rapidly. Unlike earlier parallel and distributed computing, the emergence of cloud computing will conceptually drive revolutionary changes in the entire Internet model and in enterprise management models.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the solution of the present application applies.
As shown in fig. 1, system architecture 100 may include a terminal device 110, a network 120, and a server 130. The terminal device 110 may include various electronic devices such as a smart phone, a tablet computer, a notebook computer, and a desktop computer. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, such as a wired communication link or a wireless communication link.
The system architecture in the embodiments of the present application may have any number of terminal devices, networks, and servers, according to implementation needs. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by both the terminal device 110 and the server 130, which is not particularly limited in this application.
For example, the neural network model trained based on the technical solution of the present application may be applied to an application scenario involving video retrieval, such as video classification of a streaming media platform and search of a public account or an applet, and the neural network model may be configured on the terminal device 110 shown in fig. 1, or may also be configured on the server 130. Through the trained neural network model, automatic labeling of the video uploaded by the user can be achieved, or semantic features corresponding to the video are extracted for video retrieval, so that the experience quality of the user can be effectively improved.
In a video search application scenario, the method and apparatus of the present application can be used to perform feature learning and classification on videos stored on a back-end server, and to automatically add labels that match the semantic information of the videos. In actual use, searching with the keywords input by the user then allows accurate label matching, and videos with a high matching degree are recommended to the user.
The following describes technical solutions of the neural network training method, the neural network training device, the computer-readable medium, the electronic device, and the like provided in the present application in detail with reference to specific embodiments.
Figure 2 schematically illustrates a flow chart of the steps of a neural network training method in some embodiments of the present application. The neural network training method can be applied to a terminal device or a server, and can also be executed by the terminal device and the server together. As shown in fig. 2, the neural network training method may mainly include the following steps S210 to S240.
Step S210: at least two sample segments are sampled from the video samples in video time order.
Step S220: and adjusting the arrangement sequence of at least two sample fragments, and acquiring the adjusted fragment sequence information.
Step S230: and performing feature extraction on the sample fragment through the neural networks corresponding to different modality types to obtain at least two modality features of the sample fragment.
Step S240: and training the neural network according to the feature similarity of each modal feature and the fragment sequence information so as to update the network parameters of the neural network.
In the neural network training method provided by the embodiments of the present application, sampling segments from video samples and adjusting their order allows video timing information to be incorporated into the neural network training process, so that self-supervised learning is realized based on the video timing information; no manual labeling of the video data is needed, which reduces data processing cost and improves data processing efficiency. In addition, multi-modal feature extraction allows the networks to focus directly on the semantic features of the video and reduces redundant information, and contrastive learning based on cross-modal features yields multiple neural networks for different modality types, improving the accuracy and representation capability of feature extraction.
Fig. 3 schematically shows a schematic diagram of the neural network training method in an application scenario in the embodiment of the present application. As shown in fig. 3, the method for training the neural network in the application scenario may mainly include the following steps S301 to S306.
Step S301: sampling a video sample to obtain three groups of sample fragments. Each set of sample segments includes an image segment corresponding to an image modality and an audio segment corresponding to an audio modality. Three groups of sample segments can be sequentially determined as segment 1, segment 2 and segment 3 according to the video time sequence.
Step S302: the sample fragments of each group were randomly ordered (shuffle) to shuffle their order. The shuffled arrangement as shown in the figure is segment 2, segment 3, and segment 1.
Step S303: for each group after the sequence is disorderly arrangedThe sample segments are compared and studied, and the image characteristic phi of each group of sample segments is obtained by extractionvAnd audio characteristics phia
Step S304: from the image characteristics of each set of sample segmentsvAnd audio characteristics phiaOne of the samples is randomly selected as a sample feature.
Step S305: and combining the sample characteristics pairwise and obtaining fusion characteristics after fusion processing.
Step S306: and predicting the arrangement sequence of the sample fragments of each group. The prediction target is the arrangement order of the sample fragments after random sorting in step S302, i.e., the fragment arrangement order of fragment 2, fragment 3, and fragment 1.
And (3) performing back propagation on the error between the prediction result and the prediction target in a network frame to obtain the error gradient of each network parameter in the neural network, so as to update and optimize each network parameter according to the error gradient. And continuously optimizing the feature extraction capability of the neural network through iterative training.
The following describes each method step of the neural network training method in the above embodiment in detail.
In step S210, at least two sample segments are sampled from the video samples in video time order.
In some optional embodiments, the method may first perform multi-modal information extraction on a video sample to obtain modal information samples corresponding to different modal types; then, each mode information sample is synchronously sampled according to the video time sequence, so as to obtain at least two sample segments corresponding to different mode types.
The modality types associated with video data may include, for example, an image modality, an audio modality, and a text modality, and the corresponding image information sample, audio information sample, and text information sample can be obtained through modal information extraction. The image information sample is image data representing each video frame in the video sample; the audio information sample is audio data representing the various sound signals in the video sample, such as background sounds and character dialogue; the text information sample may include text data such as the video synopsis and captions extracted directly from the video data, text data such as subtitles obtained by text recognition on the image data in the video, and text data obtained by speech-to-text conversion of sound signals such as character dialogue in the video.
In other alternative embodiments, the embodiment of the present application may also sample the video samples in video time order to obtain at least two video segments, and then perform multi-modal information extraction on the video segments to obtain at least two sample segments corresponding to different modality types. Performing multi-modal information extraction after unified video sampling reduces the number of sampling operations and improves data sampling efficiency.
Optionally, when data sampling is performed in the embodiment of the present application, the sampling interval of the sample segments may be controlled to be greater than or equal to the sampling length of the sample segments. In this way, data overlap between two adjacent sample segments can be avoided, preventing the neural network from relying on low-level features of overlapping information between the sample segments for the subsequent order prediction, which would sacrifice its ability to learn other high-level semantic information.
Taking the application scenario shown in fig. 3 as an example, for a given video sample, n fixed-length image sample segments may be sampled from the video sample. To limit computation, the length of an image sample segment is generally 15-30 frames. The audio sample segment is synchronized with the image sample segment, that is, sampling starts at the same time, and an audio sample segment of 1-2 s is generally taken. These n audio-video segments have n! possible arrangements. Since the factorial grows very quickly, a value of n between 3 and 5 is generally preferred. During sampling, the original video sample can be sampled uniformly, with every two adjacent sample segments separated by the same number of frames.
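As a concrete illustration of this sampling strategy, the following is a minimal sketch only; the decoded frame array, waveform representation, and parameter values are assumptions chosen for illustration and are not specified by the patent.

```python
import numpy as np

def sample_segments(frames, waveform, sr, fps, n=3, clip_frames=16, audio_sec=1.0):
    """Sample n evenly spaced video/audio segment pairs in video time order.

    frames: (T, H, W, C) decoded video frames; waveform: 1-D audio samples.
    For sufficiently long videos the spacing between segment starts exceeds
    clip_frames, so adjacent segments do not overlap.
    """
    total = frames.shape[0]
    starts = np.linspace(0, total - clip_frames, num=n, dtype=int)  # uniform spacing
    segments = []
    for s in starts:
        clip = frames[s:s + clip_frames]                    # image sample segment
        a0 = int(s / fps * sr)                              # synchronized audio start
        audio = waveform[a0:a0 + int(audio_sec * sr)]       # audio sample segment
        segments.append((clip, audio))
    return segments                                         # ordered by video time
```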
In step S220, the arrangement order of at least two sample fragments is adjusted, and the adjusted fragment order information is obtained.
In step S210, data sampling is performed in video time order, so the sampled segments are arranged according to video playback time. In order to introduce the timing information of the video into the neural network training, this step adjusts the order of the sampled segments and obtains the adjusted segment order information. For example, the segment order information obtained after the shuffling in fig. 3 can be represented as the number sequence 2, 3, 1. In the embodiment of the present application, if the number of sampled segments is n, random shuffling has n! possible arrangements, so the subsequent order prediction performed by the neural network can be viewed as an n!-class classification problem.
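A minimal sketch of this shuffling step, continuing the sampler above; encoding the target label as the index of the permutation in a lexicographic enumeration of all n! permutations is an illustrative choice rather than a detail given in the patent.

```python
import itertools
import random

def shuffle_segments(segments):
    """Shuffle the time-ordered segments and return the permutation label,
    i.e. the target class of the n!-way order prediction task."""
    n = len(segments)
    perms = list(itertools.permutations(range(n)))   # all n! possible arrangements
    order = random.choice(perms)                      # e.g. (1, 2, 0) for "2, 3, 1"
    shuffled = [segments[i] for i in order]
    return shuffled, perms.index(order)               # data and permutation class label
```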
In step S230, feature extraction is performed on the sample segment through neural networks corresponding to different modality types to obtain at least two modality features of the sample segment.
For sample segments of different modality types, this step may employ a dedicated neural network to perform feature extraction and obtain the corresponding modal features. For example, the modality types in the embodiment of the present application may include at least two of an image modality, an audio modality, and a text modality; the sampled sample segments may correspondingly include image samples, audio samples, or text samples; after feature extraction is performed on the image samples, audio samples, or text samples, the corresponding image features, audio features, or text features can be obtained. The feature extraction methods for the three different modality types are described separately below.
And if the sample fragment comprises an image sample corresponding to the image modality, performing feature extraction on the image sample through an image processing neural network to obtain the image features of the sample fragment.
The image processing neural network used in the embodiments of the present application may be an R(2+1)D network, which includes a plurality of sequentially connected three-dimensional convolution processing units. Fig. 4 schematically shows a model structure diagram of the three-dimensional convolution processing unit in some embodiments of the present application. As shown in fig. 4, the three-dimensional convolution processing unit may include a two-dimensional spatial convolution layer and a one-dimensional temporal convolution layer that are sequentially connected. The two-dimensional spatial convolution layer uses a two-dimensional convolution kernel of size 1×d, the one-dimensional temporal convolution layer uses a one-dimensional convolution kernel of size t×1, and the combination of the two is equivalent to a three-dimensional convolution layer with a t×d convolution kernel. In other words, the three-dimensional convolution processing unit used in the embodiment of the present application explicitly decomposes the three-dimensional convolution into two separate and consecutive operations, namely a two-dimensional spatial convolution and a one-dimensional temporal convolution. Compared with a conventional three-dimensional convolution network, an additional non-linear rectification layer is added between the two convolution operations after the decomposition, which strengthens the non-linear expressive capability of the model so that it can represent more complex functions. In addition, the decomposition makes the network easier to optimize, yielding a better feature extraction effect.
On the basis, the method for extracting the sample features based on the three-dimensional convolution processing unit can comprise the following steps: carrying out convolution processing on the image sample through a two-dimensional space convolution layer to obtain an intermediate characteristic diagram carrying spatial characteristics; and carrying out convolution processing on the intermediate feature map through a one-dimensional time convolution layer to obtain the image features of the sample segment carrying the spatial features and the time features.
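The decomposition described above can be sketched as follows in PyTorch; the channel counts and the use of ReLU as the intermediate non-linearity are illustrative assumptions.

```python
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """One (2+1)D convolution processing unit: a 2-D spatial convolution
    followed by an extra non-linearity and a 1-D temporal convolution."""

    def __init__(self, in_ch, mid_ch, out_ch, d=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, d, d),
                                 padding=(0, d // 2, d // 2))
        self.relu = nn.ReLU(inplace=True)          # added non-linear rectification layer
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0))

    def forward(self, x):                          # x: (batch, channels, T, H, W)
        x = self.spatial(x)                        # intermediate feature map (spatial features)
        x = self.relu(x)
        return self.temporal(x)                    # adds temporal features
```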
And if the sample fragment comprises the audio sample corresponding to the audio modality, performing feature extraction on the audio sample through an audio processing neural network to obtain the audio feature of the sample fragment.
In some alternative embodiments, the audio processing neural network used in the embodiments of the present application may include a plurality of two-dimensional convolution processing units connected in sequence. When audio feature extraction is performed, filtering processing may be performed on an audio sample to obtain a two-dimensional Mel-frequency Spectrogram (Mel spectrum), and specifically, a one-dimensional audio data Spectrogram may be converted into a two-dimensional Mel-frequency Spectrogram through a Mel-scale filter bank (Mel-scale filter banks); then carrying out logarithmic operation on the Mel frequency spectrogram to obtain two-dimensional frequency spectrum information for quantizing the sound intensity (the unit is decibel dB); and performing convolution processing on the two-dimensional frequency spectrum information through a two-dimensional convolution processing unit to obtain the audio characteristics of the sample fragment.
Fig. 5 schematically shows the effect of the conversion from an audio waveform diagram to a logarithmic mel spectrum diagram. As shown in fig. 5, the mel-frequency spectrogram can depict more features about the audio sample than a simple waveform chart, and the processed audio sample can be treated as two-dimensional image data by a two-dimensional convolution network.
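A possible sketch of this conversion using torchaudio; the sample rate, FFT size, hop length, and number of Mel bands are assumptions chosen for illustration.

```python
import torchaudio

def log_mel_spectrogram(waveform, sample_rate=16000, n_mels=64):
    """Convert a 1-D audio waveform into a 2-D log-Mel spectrogram (in dB),
    which can then be treated as an image by the 2-D convolutional audio network."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=n_mels)
    to_db = torchaudio.transforms.AmplitudeToDB()   # logarithmic (decibel) scaling
    return to_db(mel(waveform))                     # shape: (n_mels, time_frames)
```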
The audio processing neural network used in the embodiment of the present application may be, for example, a ResNet18 network, in which the two-dimensional convolution processing unit comprises a residual connection branch and a convolution branch. Mapping the two-dimensional spectrum information through the residual connection branch yields residual mapping information; convolving the two-dimensional spectrum information through the convolution branch yields audio convolution information; the residual mapping information and the audio convolution information are then superimposed to obtain the audio features of the sample segment. The data operation of the two-dimensional convolution processing unit can be expressed as x_m = h(x_i) + F(x_i, W) and x_o = f(x_m), where x_i and x_o are the input and output data of the two-dimensional convolution processing unit, F(·) is the residual function representing the learned residual, h(·) is the identity mapping, and f(·) is the final activation function. Residual learning reduces the training difficulty of the neural network and improves training efficiency.
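A minimal sketch of such a residual unit, assuming an identity shortcut, batch normalization, and ReLU activations (standard ResNet choices that the patent does not spell out).

```python
import torch.nn as nn

class ResidualBlock2D(nn.Module):
    """Two-dimensional convolution processing unit with an identity (residual)
    branch h(x) and a convolutional branch F(x, W), summed and then activated."""

    def __init__(self, channels):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)           # final activation f(.)

    def forward(self, x):
        return self.relu(x + self.conv_branch(x))   # x_m = h(x) + F(x, W); x_o = f(x_m)
```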
And if the sample fragment comprises a text sample corresponding to the text mode, performing feature extraction on the text sample through a text processing neural network to obtain the text features of the sample fragment.
The text processing Neural Network may adopt various Network models suitable for text processing, such as a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM).
Different types of modal features, such as image features, audio features, and text features, can be extracted by the neural networks for the different modality types in the above embodiments. After the global pooling layer of each neural network outputs the corresponding modal features, the embodiment of the present application may further map each modal feature through a shared projection head neural network to obtain modal features in a common embedding space. Using a unified embedding space ensures the consistency and stability of the results in the subsequent feature similarity calculation and comparison. The projection head neural network may employ, for example, a multi-layer perceptron or any other network model suitable for feature mapping.
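A minimal sketch of such a projection head as a two-layer perceptron; the layer sizes and the final L2 normalization are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a pooled modal feature (image, audio, or text) into the common
    embedding space used for feature similarity comparison."""

    def __init__(self, in_dim=512, hidden_dim=512, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-norm embeddings for cosine similarity
```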
In step S240, the neural network is trained according to the feature similarity of each modal feature and the segment sequence information, so as to update the network parameters of the neural network.
Cross-modal contrastive learning among the modality types can be carried out according to the feature similarity of the modal features, and self-supervised learning can be carried out according to the segment order information by exploiting the timing information of the video samples. Training each feature extraction neural network based on the contrastive learning and the order prediction, and updating the network parameters accordingly, improves the networks' feature extraction capability on video data and optimizes the representation of the extracted data features.
FIG. 6 schematically illustrates a flow chart of method steps for updating parameters of a neural network in some embodiments of the present application. As shown in fig. 6, the training of the neural network according to the feature similarity of each modal feature and the segment sequence information in step S240 to update the network parameters of the neural network may include steps S610 to S630 as follows.
Step S610: respectively obtaining the feature similarity among the modal features, and determining the contrast error information of the modal features according to the feature similarity.
The basic idea of contrastive learning is to learn the feature representation of a sample through the contrast between positive and negative sample pairs: feature extraction is improved by maximizing the similarity between the anchor sample and its positive sample while minimizing the similarity between the anchor sample and all other negative samples. FIG. 7 schematically illustrates feature extraction optimization based on contrastive learning in some embodiments of the present application. As shown in fig. 7, each sample segment Video 1, Video 2, …, Video N includes a corresponding image sample and audio sample. The image sample is passed through the R(2+1)D neural network to extract the image feature φv, and the audio sample is passed through the ResNet18 neural network to extract the audio feature φa.
The modal characteristics corresponding to the same sample segment are used as positive samples, and the modal characteristics corresponding to different sample segments are used as negative samples. For example, given a pair of sample segments (a, v), for an image sample v, its positive sample is defined as the audio sample a corresponding to it. Similarly, for the audio sample a, its positive sample is the image sample v corresponding to it. Positive examples are defined as input data of different modalities corresponding to the same sample segment, and corresponding negative examples may be defined as input data from different sample segments. For example, given a pair of sample segments (a, v) and another pair of sample segments (a ', v'), a 'and v' are their negative samples for audio sample a or image sample v, and vice versa.
And performing error calculation on the feature similarity of the positive sample and the negative sample through a contrast loss function to obtain contrast error information of the modal features.
For example, consider a given batch containing the features of k pairs of input sample segments {(φa_1, φv_1), (φa_2, φv_2), …, (φa_k, φv_k)}. The contrastive loss function consists of the following two parts:

L_av = -(1/k) Σ_{i=1}^{k} log[ exp(sim(φa_i, φv_i)/τ) / Σ_{j=1}^{k} exp(sim(φa_i, φv_j)/τ) ]

L_va = -(1/k) Σ_{i=1}^{k} log[ exp(sim(φv_i, φa_i)/τ) / Σ_{j=1}^{k} exp(sim(φv_i, φa_j)/τ) ]
where sim(φa, φv) computes the feature similarity between φa and φv:

sim(φa, φv) = (φa · φv) / (‖φa‖ ‖φv‖)

and τ is a temperature hyperparameter.
The final contrastive learning loss function is the sum of the two loss functions:

L_contrast = L_av + L_va

Through the contrastive learning loss function L_contrast, the contrast error information of the modal features can be calculated.
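A minimal PyTorch sketch of this symmetric cross-modal contrastive loss, assuming the features have already passed through the shared projection head; the default temperature value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(phi_a, phi_v, tau=0.07):
    """Symmetric audio/visual contrastive loss (L_av + L_va).

    phi_a, phi_v: (k, d) features of the same k sample segments; matching rows
    form positive pairs, all other rows in the batch serve as negatives.
    """
    phi_a = F.normalize(phi_a, dim=1)               # cosine similarity via dot products
    phi_v = F.normalize(phi_v, dim=1)
    logits = phi_a @ phi_v.t() / tau                # (k, k) similarity matrix
    targets = torch.arange(phi_a.size(0), device=phi_a.device)
    loss_av = F.cross_entropy(logits, targets)      # audio-to-video direction
    loss_va = F.cross_entropy(logits.t(), targets)  # video-to-audio direction
    return loss_av + loss_va
```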
Step S620: and mapping the modal characteristics to obtain sequence prediction information, and determining sequence error information of the sample fragment according to the fragment sequence information and the sequence prediction information.
Through contrastive learning, the neural networks can learn the feature representation of each sample segment, including its image sample and audio sample. For each video sample, a set of image samples and audio samples can be obtained as input data. Assuming that n sample segments are obtained from each video sample, after contrastive learning there are n modal feature combinations composed of image features and audio features, i.e., {(φa_1, φv_1), (φa_2, φv_2), …, (φa_n, φv_n)}.
In some optional embodiments, the modality feature combinations corresponding to the sample segments may be processed according to the following steps S621 to S625 to obtain the order error information of the sample segments.
Step S621: and randomly selecting the modal characteristics corresponding to each sample fragment to obtain the sample characteristics corresponding to the sample fragments.
The embodiment of the application can define a random selection function \sigma(\cdot), which randomly selects one sample feature from the same pair of modality features for subsequent order prediction:

f_i = \sigma(\phi_{a_i}, \phi_{v_i})

For the same video sample, n sample features \{f_1, f_2, \ldots, f_n\} can be obtained after this random selection.
Since the order of the individual sample fragments has been shuffled when the data is input, the n sample features are also out of order. The ordering of these sample features can be viewed as a classification problem: considering each possible permutation as a class, the order prediction problem translates into an n!-class classification problem, whose input data is the feature set of the n sample features and whose output data is a probability distribution over all possible permutations.
Step S622: and respectively carrying out pairwise splicing treatment on the sample characteristics corresponding to the sample fragments to obtain local splicing characteristics, and respectively carrying out mapping treatment on the local splicing characteristics to obtain local mapping characteristics of the local splicing characteristics.
Each of the n sample features is spliced pairwise with all other sample features to obtain M local splicing features, where M is the number of all possible pairwise combinations. Each local splicing feature is then mapped to obtain M local mapping features h_k:
h_k = g(f_i \oplus f_j), \quad k = 1, 2, \ldots, M

where \oplus represents a vector splicing (concatenation) operation and g(\cdot) is a non-linear mapping function.
Step S623: and splicing the local mapping characteristics to obtain integral splicing characteristics, and mapping the integral splicing characteristics to obtain integral mapping characteristics of the integral splicing characteristics.
The M local mapping features h_k are spliced according to the following formula to obtain the overall splicing feature, and the overall splicing feature is then mapped to obtain its overall mapping feature a:

a = f(h_1 \oplus h_2 \oplus \ldots \oplus h_M)

where f(\cdot) is a non-linear mapping function and a is a logits vector of dimension n!.
Step S624: and carrying out normalized mapping on each feature element in the overall mapping feature to obtain sequential prediction information of the modal feature.
Each feature element a_i in the overall mapping feature a corresponds to one arrangement of the sample fragments. Normalized mapping is performed on the feature elements of the overall mapping feature a to obtain the sequential prediction information p_i of the modal features:

p_i = \frac{\exp(a_i)}{\sum_{j=1}^{n!} \exp(a_j)}

The sequential prediction information p_i represents the predicted probability of the i-th arrangement class.
Step S625: and taking the fragment sequence information as a target value and the sequence prediction information as a prediction value, and performing error calculation on the target value and the prediction value through a cross entropy loss function to obtain sequence error information of the sample fragment.
For example, three sample segments (a_1, v_1), (a_2, v_2), (a_3, v_3) are obtained from the same video sample by video time-sequential sampling, and the arrangement order of the sample fragments is randomly shuffled to obtain the order-adjusted sequence \{(a_2, v_2), (a_3, v_3), (a_1, v_1)\}; then (2, 3, 1) is taken as the prediction target. The order error information can be calculated based on this prediction target using a cross entropy loss function as follows to measure the prediction result:

L_{entropy} = -\sum_{i=1}^{n!} y_i \log(p_i)

where y_i represents the actual probability of the i-th order class, taking the value 1 or 0.
The sequence error information of the sample fragment can be obtained through the above steps S621 to S625.
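The following is a minimal sketch of this order-prediction branch, assuming PyTorch; the use of simple MLPs for g(·) and f(·), the hidden sizes, and the choice of ordered pairs for M are illustrative assumptions.

```python
import itertools
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderPredictionHead(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, n_segments=3):
        super().__init__()
        self.n = n_segments
        self.perms = list(itertools.permutations(range(n_segments)))      # n! permutation classes
        self.g = nn.Sequential(nn.Linear(2 * feat_dim, hidden_dim), nn.ReLU())  # local mapping g(.)
        n_pairs = n_segments * (n_segments - 1)                           # M, assumed as ordered pairs
        self.f = nn.Linear(n_pairs * hidden_dim, len(self.perms))         # overall mapping f(.)

    def forward(self, phi_a, phi_v):
        # phi_a, phi_v: (n, d) features of the shuffled segments of one video sample
        feats = [random.choice([phi_a[i], phi_v[i]]) for i in range(self.n)]   # sigma(.): pick one modality
        pairs = [self.g(torch.cat([feats[i], feats[j]], dim=-1))               # pairwise splicing + mapping
                 for i in range(self.n) for j in range(self.n) if i != j]
        return self.f(torch.cat(pairs, dim=-1))                                # n!-dimensional logits a

def order_loss(logits, true_perm, perms):
    # cross entropy against the index of the ground-truth permutation,
    # e.g. the shuffled order (2, 3, 1) corresponds to the 0-indexed tuple (1, 2, 0)
    target = torch.tensor([perms.index(tuple(true_perm))], device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)
```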
Step S630: and superposing the comparison error information and the sequence error information to obtain an overall loss error, and updating the network parameters of the neural network according to the overall loss error.
With the technical scheme in the above embodiments, the contrastive learning loss function L_{contrast} of the contrastive learning process and the cross entropy loss function L_{entropy} of the sequential prediction process can be obtained, and the two are superposed to obtain the overall loss function L:

L = L_{contrast} + L_{entropy}
the training goal for the neural network is to minimize the overall loss function L.
Based on the overall loss function L, the overall loss error obtained by superposing the comparison error information and the sequence error information in the current training round can be calculated. The overall loss error is back-propagated through the neural network, the error gradient of each network parameter is calculated, and the network parameters of the neural network are updated according to these error gradients. After the parameter update is completed, the training data set can continue to be input for the next round of training.
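A minimal sketch of one such training step, continuing the sketches above and assuming an optimizer such as torch.optim.Adam over all network parameters, could look as follows.

```python
# phi_a, phi_v: features of the shuffled segments of one video sample
l_contrast = contrastive_loss(phi_a, phi_v)                 # contrast error information
order_logits = head(phi_a, phi_v)
l_entropy = order_loss(order_logits, true_perm, head.perms) # order error information
total_loss = l_contrast + l_entropy                         # L = L_contrast + L_entropy

optimizer.zero_grad()
total_loss.backward()    # back-propagate the overall loss error
optimizer.step()         # update network parameters along the error gradients
```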
According to the embodiment of the application, large-scale data sets such as YouTube-8M, HowTo100M, Kinetics-400, Kinetics-600 and Moments in Time can be used as training data sets to pre-train the neural network. For example, after training by the neural network training method described in the above embodiments, a pre-trained image processing neural network and audio processing neural network can be obtained, where the image processing neural network is used to extract image features from video data, and the audio processing neural network is used to extract audio features from video data.
In order to further realize specific business applications, the embodiment of the application can perform fine-tuning and testing on the pre-trained neural network. The fine-tuning phase can use a labeled data set for supervised learning, and the testing phase can use small-scale data sets such as UCF101 and HMDB51. In the testing stage after fine-tuning is completed, for example, 10 sample segments may be sampled from each test video for classification, and the final test result is the mode of the classification results of the 10 sample segments.
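A minimal sketch of this test-time protocol, assuming PyTorch and a fine-tuned classification model, could look as follows.

```python
import torch

def classify_test_video(model, clips):
    """clips: (10, 3, T, H, W) sample segments drawn from one test video."""
    with torch.no_grad():
        logits = model(clips)                     # (10, num_classes) clip-level predictions
        clip_preds = logits.argmax(dim=1)
    return torch.mode(clip_preds).values.item()   # mode of the 10 clip-level results
```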
In practical application, taking video retrieval as an example, pre-training can be performed with the neural network training method provided by the technical scheme of the application, and the pre-trained neural network is then fine-tuned with a business data set of smaller scale. The fine-tuned neural network is then used to make predictions on the large-scale video data set on the server, and the prediction results are used as pseudo labels for the corresponding videos. In the retrieval stage, keywords input by a user are used as tags, and the best-matched videos are retrieved from the server and returned to the user.
In addition, for tasks such as video retrieval, besides the most direct search method based on keywords input by the user, sample-based retrieval may be performed, that is, retrieving other videos similar to a video given by the user. In this process, the semantic features of the video given by the user can be extracted with the pre-trained neural network, the extracted semantic features are then matched against the semantic features of other videos on the server, the videos with the most similar semantics are found, and the found videos are returned to the user.
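A minimal sketch of such sample-based retrieval, assuming the semantic features have already been extracted with the pre-trained neural network, could look as follows; cosine similarity is an assumed matching criterion.

```python
import torch
import torch.nn.functional as F

def retrieve_similar(query_feat, gallery_feats, top_k=10):
    """query_feat: (d,) feature of the user's video; gallery_feats: (N, d) features of server videos."""
    query = F.normalize(query_feat.unsqueeze(0), dim=1)
    gallery = F.normalize(gallery_feats, dim=1)
    scores = (query @ gallery.t()).squeeze(0)      # cosine similarity to every server video
    return torch.topk(scores, k=top_k).indices     # indices of the most similar videos
```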
In some optional implementations, the neural network corresponding to a single modality type may be fine-tuned and applied according to actual needs. For example, in an application scenario based on video motion classification, only the image processing neural network for image feature extraction may be fine-tuned; after fine-tuning is completed, the image processing neural network may be used alone to perform prediction and classification on the large-scale video data set on the server, and corresponding classification labels are added to the corresponding videos.
In other alternative embodiments, the pre-trained neural networks corresponding to different modality types may be jointly fine-tuned and applied by the embodiments of the present application. For example, the image processing neural network for image feature extraction and the audio processing neural network for audio feature extraction may be fine-tuned at the same time, and after the fine tuning is completed, the extracted image features and audio features may be fused by a multi-layer perceptron or other feature fusion methods to obtain fusion features, and then video classification may be performed based on the fusion features.
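A minimal sketch of such joint fusion with a multi-layer perceptron, with illustrative layer sizes and class count, could look as follows.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, num_classes=101):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, phi_v, phi_a):
        fused = torch.cat([phi_v, phi_a], dim=-1)   # fused image + audio feature
        return self.mlp(fused)                       # classification logits
```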
By combining the technical schemes in the embodiments, the method realizes a new end-to-end self-supervision scheme for video feature learning and classification, avoids the need of marking video categories by a large amount of manpower, and can be conveniently applied to various real service scenes.
Unlike traditional self-supervised schemes based on context-based auxiliary learning tasks, the present application uses contrastive learning for feature extraction. Because contrastive learning focuses on the contrast between positive and negative samples, it is better suited to learning discriminative, high-level features rich in semantic information, which benefits downstream classification tasks.
The method and the device make full use of the distinctive properties of video data, namely its multi-modality and its temporal order. Traditional schemes based on frame ordering or video fragment ordering consider only the temporal characteristics of videos and ignore multi-modal information; conventional multi-modal feature learning schemes consider only the multi-modal features of the video and ignore the temporal order. The technical scheme of the present application fully combines the image and audio features of the video (and may additionally include other modal features such as text), so that a better feature representation can be obtained. In particular, in the technical solution of the present application, feature ordering and contrastive learning are two complementary tasks. Good contrastive learning ensures that cross-modal features of the same video are highly consistent and therefore easy to order; in turn, feature ordering requires the learned audio features and image features to be consistent and discriminative, which forces contrastive learning to learn a better feature representation.
In real application scenarios, the technical scheme of the application is scalable. Large-scale, diverse data sets can be used for pre-training. When applied to a particular application, a specific small-scale data set can be used for fine-tuning, thereby achieving a targeted performance improvement.
It should be noted that although the various steps of the methods in this application are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the shown steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
The following describes embodiments of the apparatus of the present application, which can be used to perform the neural network training method in the above embodiments of the present application. Fig. 8 schematically shows a block diagram of a neural network training device provided in an embodiment of the present application. As shown in fig. 8, the neural network training device 800 may mainly include:
a video sampling module 810 configured to sample at least two sample segments from a video sample in a video time order;
an order adjusting module 820 configured to adjust an arrangement order of at least two sample fragments and obtain adjusted fragment order information;
a feature extraction module 830 configured to perform feature extraction on the sample segment through neural networks corresponding to different modality types to obtain at least two modality features of the sample segment;
a parameter updating module 840 configured to train the neural network according to the feature similarity of each modal feature and the segment sequence information to update the network parameters of the neural network.
In some embodiments of the present application, based on the above embodiments, the video sampling module includes:
the first extraction unit is configured to perform multi-modal information extraction on the video samples to obtain modal information samples corresponding to different modal types;
the first sampling unit is configured to synchronously sample each modal information sample according to a video time sequence to obtain at least two sample segments corresponding to different modal types.
In some embodiments of the present application, based on the above embodiments, the video sampling module includes:
a second sampling unit configured to sample the video samples in a video time order to obtain at least two video segments;
and the second extraction unit is configured to perform multi-modal information extraction on the video clips to obtain at least two sample clips corresponding to different modal types.
In some embodiments of the present application, based on the above embodiments, the sampling interval of the sample segment is greater than or equal to the sampling length of the sample segment.
In some embodiments of the present application, based on the above embodiments, the modality types include at least two of an image modality, an audio modality, and a text modality; the feature extraction module includes:
the image characteristic extraction unit is configured to extract the characteristics of the image sample through an image processing neural network to obtain the image characteristics of the sample fragment if the sample fragment comprises the image sample corresponding to the image modality;
the audio characteristic extraction unit is configured to perform characteristic extraction on the audio sample through an audio processing neural network to obtain an audio characteristic of the sample fragment if the sample fragment comprises an audio sample corresponding to an audio modality;
and the text feature extraction unit is configured to perform feature extraction on the text sample through a text processing neural network to obtain the text features of the sample fragment if the sample fragment comprises the text sample corresponding to the text modality.
In some embodiments of the present application, based on the above embodiments, the image processing neural network includes a plurality of sequentially connected three-dimensional convolution processing units, and the three-dimensional convolution processing units include sequentially connected two-dimensional space convolution layers and one-dimensional time convolution layers; the image feature extraction unit includes:
the two-dimensional space convolution subunit is configured to perform convolution processing on the image sample through the two-dimensional space convolution layer to obtain an intermediate feature map carrying spatial features;
and the one-dimensional time convolution subunit is configured to perform convolution processing on the intermediate feature map through the one-dimensional time convolution layer to obtain the image features of the sample segment carrying the spatial features and the temporal features.
In some embodiments of the present application, based on the above embodiments, the audio processing neural network includes a plurality of two-dimensional convolution processing units connected in sequence, and the audio feature extraction unit includes:
an audio filtering subunit, configured to perform filtering processing on the audio sample to obtain a two-dimensional mel frequency spectrogram;
a logarithm operation subunit configured to perform a logarithm operation on the Mel frequency spectrogram to obtain two-dimensional frequency spectrum information for quantizing the sound intensity;
and the two-dimensional convolution subunit is configured to perform convolution processing on the two-dimensional spectrum information through the two-dimensional convolution processing unit to obtain the audio features of the sample fragment.
In some embodiments of the present application, based on the above embodiments, the two-dimensional convolution processing unit includes a residual connecting branch and a convolution connecting branch; the two-dimensional convolution subunit includes:
a residual mapping subunit, configured to map the two-dimensional spectrum information through a residual connection branch to obtain residual mapping information;
the convolution mapping subunit is configured to perform convolution processing on the two-dimensional spectrum information through the convolution connection branch to obtain audio convolution information;
and the mapping and superposing subunit is configured to superpose the residual mapping information and the audio convolution information to obtain the audio characteristics of the sample fragment.
In some embodiments of the present application, based on the above embodiments, the parameter updating module includes:
the contrast error determining unit is configured to respectively acquire feature similarity among the modal features and determine contrast error information of the modal features according to the feature similarity;
the sequence error determining unit is configured to perform mapping processing on the mode characteristics to obtain sequence prediction information and determine sequence error information of the sample fragments according to the fragment sequence information and the sequence prediction information;
and the error superposition unit is configured to carry out superposition processing on the comparison error information and the sequence error information to obtain an overall loss error, and update the network parameters of the neural network according to the overall loss error.
In some embodiments of the present application, based on the above embodiments, the comparison error determination unit includes:
a positive and negative sample determination subunit configured to take the modal features corresponding to the same sample segment as positive samples and the modal features corresponding to different sample segments as negative samples;
and the contrast error calculation subunit is configured to perform error calculation on the feature similarity of the positive sample and the negative sample through a contrast loss function to obtain contrast error information of the modal features.
In some embodiments of the present application, based on the above embodiments, the order error determination unit includes:
the characteristic selection subunit is configured to randomly select and obtain sample characteristics corresponding to the sample fragments from the modal characteristics corresponding to each sample fragment;
the local splicing subunit is configured to respectively perform pairwise splicing treatment on the sample characteristics corresponding to the sample segments to obtain local splicing characteristics;
the local mapping subunit is configured to perform mapping processing on each local splicing feature respectively to obtain local mapping features of the local splicing features;
the integral splicing subunit is configured to splice the local mapping characteristics to obtain integral splicing characteristics;
the integral mapping subunit is configured to perform mapping processing on the integral splicing characteristic to obtain an integral mapping characteristic of the integral splicing characteristic;
and the normalization mapping subunit is configured to perform normalization mapping on each feature element in the overall mapping feature to obtain sequential prediction information of the modal feature.
In some embodiments of the present application, based on the above embodiments, the order error determination unit further includes:
a prediction target determination subunit configured to take the piece order information as a target value and the order prediction information as a prediction value;
a sequential error calculation subunit configured to perform error calculation on the target value and the predicted value by a cross entropy loss function to obtain sequential error information of the sample segment.
In some embodiments of the present application, based on the above embodiments, the neural network training device further includes:
and the common mapping module is configured to map each modal feature through a common mapping head neural network to obtain the modal feature corresponding to the common embedding space.
The specific details of the neural network training device provided in each embodiment of the present application have been described in detail in the corresponding method embodiment, and are not described herein again.
Fig. 9 schematically shows a structural block diagram of a computer system of an electronic device for implementing the embodiment of the present application.
It should be noted that the computer system 900 of the electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit 901 (CPU) that can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory 902 (ROM) or a program loaded from a storage section 908 into a Random Access Memory 903 (RAM). In the random access memory 903, various programs and data necessary for system operation are also stored. The CPU 901, the ROM 902 and the RAM 903 are connected to each other via a bus 904. An input/output interface 905 (I/O interface) is also connected to the bus 904.
The following components are connected to the input/output interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a local area network card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The driver 910 is also connected to the input/output interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The computer program, when executed by the central processor 901, performs various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A neural network training method, comprising:
sampling at least two sample segments from a video sample according to a video time sequence;
adjusting the arrangement sequence of the at least two sample fragments, and acquiring the adjusted fragment sequence information;
performing feature extraction on the sample fragment through neural networks corresponding to different modality types to obtain at least two modality features of the sample fragment;
and training the neural network according to the feature similarity of each modal feature and the fragment sequence information so as to update the network parameters of the neural network.
2. The neural network training method of claim 1, wherein the obtaining at least two sample segments from the video samples according to the video time sequence comprises:
performing multi-modal information extraction on the video sample to obtain modal information samples corresponding to different modal types;
and synchronously sampling each modal information sample according to the video time sequence to obtain at least two sample segments corresponding to different modal types.
3. The neural network training method of claim 1, wherein the obtaining at least two sample segments from the video samples according to the video time sequence comprises:
sampling video samples according to a video time sequence to obtain at least two video segments;
and performing multi-modal information extraction on the video clips to obtain at least two sample clips corresponding to different modal types.
4. The neural network training method of claim 1, wherein the sampling interval of the sample segment is greater than or equal to the sampling length of the sample segment.
5. The neural network training method of claim 1, wherein the modality types include at least two of an image modality, an audio modality, and a text modality; the extracting features of the sample segment through neural networks corresponding to different modality types to obtain at least two modality features of the sample segment includes:
if the sample fragment comprises an image sample corresponding to the image modality, performing feature extraction on the image sample through an image processing neural network to obtain an image feature of the sample fragment;
if the sample fragment comprises an audio sample corresponding to the audio modality, performing feature extraction on the audio sample through an audio processing neural network to obtain an audio feature of the sample fragment;
and if the sample fragment comprises a text sample corresponding to the text mode, performing feature extraction on the text sample through a text processing neural network to obtain the text feature of the sample fragment.
6. The neural network training method of claim 5, wherein the image processing neural network comprises a plurality of sequentially connected three-dimensional convolution processing units, the three-dimensional convolution processing units comprising sequentially connected two-dimensional space convolution layers and one-dimensional time convolution layers; the performing feature extraction on the image sample through an image processing neural network to obtain the image features of the sample segment includes:
performing convolution processing on the image sample through the two-dimensional space convolution layer to obtain an intermediate feature map carrying spatial features;
and carrying out convolution processing on the intermediate characteristic graph through the one-dimensional time convolution layer to obtain the image characteristics of the sample segment carrying the spatial characteristics and the time characteristics.
7. The neural network training method of claim 5, wherein the audio processing neural network comprises a plurality of two-dimensional convolution processing units connected in sequence, and the performing feature extraction on the audio sample through the audio processing neural network to obtain the audio features of the sample segments comprises:
filtering the audio sample to obtain a two-dimensional Mel spectrogram;
carrying out logarithm operation on the Mel frequency spectrogram to obtain two-dimensional frequency spectrum information for quantizing the sound intensity;
and performing convolution processing on the two-dimensional frequency spectrum information through the two-dimensional convolution processing unit to obtain the audio characteristics of the sample fragment.
8. The neural network training method of claim 7, wherein the two-dimensional convolution processing unit includes a residual connecting branch and a convolution connecting branch; the convolving the two-dimensional spectrum information by the two-dimensional convolution processing unit to obtain the audio features of the sample segment includes:
mapping the two-dimensional frequency spectrum information through the residual connecting branch to obtain residual mapping information;
performing convolution processing on the two-dimensional frequency spectrum information through the convolution connection branch to obtain audio convolution information;
and superposing the residual mapping information and the audio convolution information to obtain the audio characteristics of the sample fragment.
9. The neural network training method according to claim 1, wherein the training the neural network according to the feature similarity of each modal feature and the segment sequence information to update the network parameters of the neural network comprises:
respectively obtaining feature similarity among the modal features, and determining contrast error information of the modal features according to the feature similarity;
mapping the modal characteristics to obtain sequence prediction information, and determining sequence error information of the sample fragment according to the fragment sequence information and the sequence prediction information;
and superposing the comparison error information and the sequence error information to obtain an overall loss error, and updating the network parameters of the neural network according to the overall loss error.
10. The neural network training method of claim 9, wherein the determining contrast error information of the modal features according to the feature similarity comprises:
taking the modal characteristics corresponding to the same sample fragment as a positive sample, and taking the modal characteristics corresponding to different sample fragments as a negative sample;
and carrying out error calculation on the feature similarity of the positive sample and the negative sample through a contrast loss function to obtain contrast error information of the modal features.
11. The neural network training method of claim 9, wherein the mapping the modal features to obtain sequential prediction information comprises:
respectively randomly selecting the modal characteristics corresponding to each sample fragment to obtain the sample characteristics corresponding to the sample fragment;
respectively splicing the sample characteristics corresponding to each sample fragment in pairs to obtain local splicing characteristics;
respectively carrying out mapping processing on each local splicing feature to obtain a local mapping feature of the local splicing feature;
splicing the local mapping characteristics to obtain integral splicing characteristics;
mapping the integral splicing characteristic to obtain an integral mapping characteristic of the integral splicing characteristic;
and carrying out normalized mapping on each feature element in the overall mapping feature to obtain the sequential prediction information of the modal feature.
12. The neural network training method of claim 9, wherein determining the order error information for the sample segments from the segment order information and the order prediction information comprises:
taking the segment sequence information as a target value and taking the sequence prediction information as a prediction value;
and carrying out error calculation on the target value and the predicted value through a cross entropy loss function to obtain sequence error information of the sample segment.
13. The neural network training method of claim 1, wherein after obtaining at least two modal features of the sample segment, the method further comprises:
and mapping each modal characteristic through a common mapping head neural network to obtain the modal characteristic corresponding to the common embedding space.
14. A neural network training device, comprising:
the video sampling module is configured to sample at least two sample segments from video samples according to video time sequence;
the sequence adjusting module is configured to adjust the arrangement sequence of the at least two sample fragments and acquire adjusted fragment sequence information;
a feature extraction module configured to perform feature extraction on the sample fragment through neural networks corresponding to different modality types to obtain at least two modality features of the sample fragment;
a parameter updating module configured to train the neural network according to the feature similarity of each modal feature and the segment sequence information to update network parameters of the neural network.
15. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the neural network training method of any one of claims 1-13 via execution of the executable instructions.
CN202010819997.9A 2020-08-14 2020-08-14 Neural network training method and device and electronic equipment Active CN111930992B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010819997.9A CN111930992B (en) 2020-08-14 2020-08-14 Neural network training method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010819997.9A CN111930992B (en) 2020-08-14 2020-08-14 Neural network training method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111930992A true CN111930992A (en) 2020-11-13
CN111930992B CN111930992B (en) 2022-10-28

Family

ID=73310377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010819997.9A Active CN111930992B (en) 2020-08-14 2020-08-14 Neural network training method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111930992B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method
CN113140222A (en) * 2021-05-10 2021-07-20 科大讯飞股份有限公司 Voiceprint vector extraction method, device, equipment and storage medium
CN113486833A (en) * 2021-07-15 2021-10-08 北京达佳互联信息技术有限公司 Multi-modal feature extraction model training method and device and electronic equipment
CN113747385A (en) * 2021-08-27 2021-12-03 中移(杭州)信息技术有限公司 Indoor positioning method, device, equipment and computer readable storage medium
CN113743277A (en) * 2021-08-30 2021-12-03 上海明略人工智能(集团)有限公司 Method, system, equipment and storage medium for short video frequency classification
CN113792594A (en) * 2021-08-10 2021-12-14 南京大学 Method and device for positioning language segments in video based on contrast learning
CN114020887A (en) * 2021-10-29 2022-02-08 北京有竹居网络技术有限公司 Method, apparatus, device and medium for determining response statement
CN114049502A (en) * 2021-12-22 2022-02-15 贝壳找房网(北京)信息技术有限公司 Neural network training, feature extraction and data processing method and device
CN114332637A (en) * 2022-03-17 2022-04-12 北京航空航天大学杭州创新研究院 Remote sensing image water body extraction method and interaction method for remote sensing image water body extraction
CN114494930A (en) * 2021-09-09 2022-05-13 马上消费金融股份有限公司 Training method and device for voice and image synchronism measurement model
WO2022105713A1 (en) * 2020-11-23 2022-05-27 京东城市(北京)数字科技有限公司 Model training method and apparatus, data enhancement method and apparatus, and electronic device and storage medium
CN114821119A (en) * 2022-06-22 2022-07-29 中国科学技术大学 Method and device for training graph neural network model aiming at graph data invariant features
CN115278299A (en) * 2022-07-27 2022-11-01 腾讯科技(深圳)有限公司 Unsupervised training data generation method, unsupervised training data generation device, unsupervised training data generation medium, and unsupervised training data generation equipment
WO2022247562A1 (en) * 2021-05-25 2022-12-01 北京有竹居网络技术有限公司 Multi-modal data retrieval method and apparatus, and medium and electronic device
CN115618891A (en) * 2022-12-19 2023-01-17 湖南大学 Multimodal machine translation method and system based on contrast learning
WO2023036157A1 (en) * 2021-09-07 2023-03-16 Huawei Technologies Co., Ltd. Self-supervised spatiotemporal representation learning by exploring video continuity
WO2023134549A1 (en) * 2022-01-14 2023-07-20 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN117011924A (en) * 2023-10-07 2023-11-07 之江实验室 Method and system for estimating number of speakers based on voice and image

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324852A (en) * 2013-06-25 2013-09-25 上海交通大学 Four-modal medical imaging diagnosis system based on feature matching
CN103838835A (en) * 2014-02-25 2014-06-04 中国科学院自动化研究所 Network sensitive video detection method
WO2017210949A1 (en) * 2016-06-06 2017-12-14 北京大学深圳研究生院 Cross-media retrieval method
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN107562812A (en) * 2017-08-11 2018-01-09 北京大学 A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space
CN108694200A (en) * 2017-04-10 2018-10-23 北京大学深圳研究生院 A kind of cross-media retrieval method based on deep semantic space
CN108921150A (en) * 2018-09-18 2018-11-30 深圳市华百安智能技术有限公司 Face identification system based on network hard disk video recorder
CN108986050A (en) * 2018-07-20 2018-12-11 北京航空航天大学 A kind of image and video enhancement method based on multiple-limb convolutional neural networks
CN109635676A (en) * 2018-11-23 2019-04-16 清华大学 A method of positioning source of sound from video
CN110070023A (en) * 2019-04-16 2019-07-30 上海极链网络科技有限公司 A kind of self-supervisory learning method and device based on sequence of motion recurrence
CN110175266A (en) * 2019-05-28 2019-08-27 复旦大学 A method of it is retrieved for multistage video cross-module state
CN110188209A (en) * 2019-05-13 2019-08-30 山东大学 Cross-module state Hash model building method, searching method and device based on level label
US20190341058A1 (en) * 2018-05-06 2019-11-07 Microsoft Technology Licensing, Llc Joint neural network for speaker recognition
CN111177419A (en) * 2018-11-10 2020-05-19 北外在线(北京)教育科技有限公司 Method for marking and retrieving digital learning content
CN111294646A (en) * 2020-02-17 2020-06-16 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
WO2020151083A1 (en) * 2019-01-24 2020-07-30 北京明略软件系统有限公司 Region determination method and device, storage medium and processor
CN111488932A (en) * 2020-04-10 2020-08-04 中国科学院大学 Self-supervision video time-space characterization learning method based on frame rate perception

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324852A (en) * 2013-06-25 2013-09-25 上海交通大学 Four-modal medical imaging diagnosis system based on feature matching
CN103838835A (en) * 2014-02-25 2014-06-04 中国科学院自动化研究所 Network sensitive video detection method
WO2017210949A1 (en) * 2016-06-06 2017-12-14 北京大学深圳研究生院 Cross-media retrieval method
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN108694200A (en) * 2017-04-10 2018-10-23 北京大学深圳研究生院 A kind of cross-media retrieval method based on deep semantic space
CN107562812A (en) * 2017-08-11 2018-01-09 北京大学 A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space
US20190341058A1 (en) * 2018-05-06 2019-11-07 Microsoft Technology Licensing, Llc Joint neural network for speaker recognition
CN108986050A (en) * 2018-07-20 2018-12-11 北京航空航天大学 A kind of image and video enhancement method based on multiple-limb convolutional neural networks
CN108921150A (en) * 2018-09-18 2018-11-30 深圳市华百安智能技术有限公司 Face identification system based on network hard disk video recorder
CN111177419A (en) * 2018-11-10 2020-05-19 北外在线(北京)教育科技有限公司 Method for marking and retrieving digital learning content
CN109635676A (en) * 2018-11-23 2019-04-16 清华大学 A method of positioning source of sound from video
WO2020151083A1 (en) * 2019-01-24 2020-07-30 北京明略软件系统有限公司 Region determination method and device, storage medium and processor
CN110070023A (en) * 2019-04-16 2019-07-30 上海极链网络科技有限公司 A kind of self-supervisory learning method and device based on sequence of motion recurrence
CN110188209A (en) * 2019-05-13 2019-08-30 山东大学 Cross-module state Hash model building method, searching method and device based on level label
CN110175266A (en) * 2019-05-28 2019-08-27 复旦大学 A method of it is retrieved for multistage video cross-module state
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
CN111294646A (en) * 2020-02-17 2020-06-16 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN111488932A (en) * 2020-04-10 2020-08-04 中国科学院大学 Self-supervision video time-space characterization learning method based on frame rate perception

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Sun Liang: "Research on Video Content Description Based on Deep Learning", China Masters' Theses Full-text Database, Information Science and Technology *
Li Mei: "Research on Video Semantic Extraction Based on the ST-Simfusion Algorithm and Ontology", China Masters' Theses Full-text Database, Information Science and Technology *
Chi Zhizhen: "Object Tracking Based on Deep Convolutional Neural Networks", China Masters' Theses Full-text Database, Information Science and Technology *
Guo Mao: "Research on Speech-Image Cross-modal Retrieval Based on Deep Neural Networks", China Excellent Doctoral and Masters' Theses Full-text Database (Masters), Information Science and Technology *
Guo Mao: "Research on Speech-Image Cross-modal Retrieval Based on Deep Neural Networks", China Masters' Theses Full-text Database, Information Science and Technology *
Gao Xiang et al.: "A Person Semantic Recognition Model Based on Deep Learning of Video Scenes", Computer Technology and Development *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022105713A1 (en) * 2020-11-23 2022-05-27 京东城市(北京)数字科技有限公司 Model training method and apparatus, data enhancement method and apparatus, and electronic device and storage medium
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method
CN113140222A (en) * 2021-05-10 2021-07-20 科大讯飞股份有限公司 Voiceprint vector extraction method, device, equipment and storage medium
CN113140222B (en) * 2021-05-10 2023-08-01 科大讯飞股份有限公司 Voiceprint vector extraction method, voiceprint vector extraction device, voiceprint vector extraction equipment and storage medium
WO2022247562A1 (en) * 2021-05-25 2022-12-01 北京有竹居网络技术有限公司 Multi-modal data retrieval method and apparatus, and medium and electronic device
CN113486833A (en) * 2021-07-15 2021-10-08 北京达佳互联信息技术有限公司 Multi-modal feature extraction model training method and device and electronic equipment
CN113792594A (en) * 2021-08-10 2021-12-14 南京大学 Method and device for positioning language segments in video based on contrast learning
CN113792594B (en) * 2021-08-10 2024-04-12 南京大学 Method and device for locating language fragments in video based on contrast learning
CN113747385A (en) * 2021-08-27 2021-12-03 中移(杭州)信息技术有限公司 Indoor positioning method, device, equipment and computer readable storage medium
CN113747385B (en) * 2021-08-27 2022-12-16 中移(杭州)信息技术有限公司 Indoor positioning method, device, equipment and computer readable storage medium
CN113743277A (en) * 2021-08-30 2021-12-03 上海明略人工智能(集团)有限公司 Method, system, equipment and storage medium for short video frequency classification
WO2023036157A1 (en) * 2021-09-07 2023-03-16 Huawei Technologies Co., Ltd. Self-supervised spatiotemporal representation learning by exploring video continuity
CN114494930B (en) * 2021-09-09 2023-09-22 马上消费金融股份有限公司 Training method and device for voice and image synchronism measurement model
CN114494930A (en) * 2021-09-09 2022-05-13 马上消费金融股份有限公司 Training method and device for voice and image synchronism measurement model
CN114020887A (en) * 2021-10-29 2022-02-08 北京有竹居网络技术有限公司 Method, apparatus, device and medium for determining response statement
CN114020887B (en) * 2021-10-29 2023-11-07 北京有竹居网络技术有限公司 Method, apparatus, device and medium for determining response statement
CN114049502A (en) * 2021-12-22 2022-02-15 贝壳找房网(北京)信息技术有限公司 Neural network training, feature extraction and data processing method and device
CN114049502B (en) * 2021-12-22 2023-04-07 贝壳找房(北京)科技有限公司 Neural network training, feature extraction and data processing method and device
WO2023134549A1 (en) * 2022-01-14 2023-07-20 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN114332637A (en) * 2022-03-17 2022-04-12 北京航空航天大学杭州创新研究院 Remote sensing image water body extraction method and interaction method for remote sensing image water body extraction
CN114821119A (en) * 2022-06-22 2022-07-29 中国科学技术大学 Method and device for training graph neural network model aiming at graph data invariant features
CN115278299A (en) * 2022-07-27 2022-11-01 腾讯科技(深圳)有限公司 Unsupervised training data generation method, unsupervised training data generation device, unsupervised training data generation medium, and unsupervised training data generation equipment
CN115278299B (en) * 2022-07-27 2024-03-19 腾讯科技(深圳)有限公司 Unsupervised training data generation method, device, medium and equipment
CN115618891B (en) * 2022-12-19 2023-04-07 湖南大学 Multimodal machine translation method and system based on contrast learning
CN115618891A (en) * 2022-12-19 2023-01-17 湖南大学 Multimodal machine translation method and system based on contrast learning
CN117011924A (en) * 2023-10-07 2023-11-07 之江实验室 Method and system for estimating number of speakers based on voice and image
CN117011924B (en) * 2023-10-07 2024-02-13 之江实验室 Method and system for estimating number of speakers based on voice and image

Also Published As

Publication number Publication date
CN111930992B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
CN111930992B (en) Neural network training method and device and electronic equipment
CN112487182A (en) Training method of text processing model, and text processing method and device
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111400601B (en) Video recommendation method and related equipment
CN113762322A (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN114676234A (en) Model training method and related equipment
CN113761153A (en) Question and answer processing method and device based on picture, readable medium and electronic equipment
CN113555032B (en) Multi-speaker scene recognition and network training method and device
CN111666416A (en) Method and apparatus for generating semantic matching model
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN116541492A (en) Data processing method and related equipment
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN113822125A (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN113656563A (en) Neural network searching method and related equipment
WO2023197749A1 (en) Background music insertion time point determining method and apparatus, device, and storage medium
CN112580343A (en) Model generation method, question and answer quality judgment method, device, equipment and medium
Akinpelu et al. Lightweight Deep Learning Framework for Speech Emotion Recognition
CN116958852A (en) Video and text matching method and device, electronic equipment and storage medium
CN115130650A (en) Model training method and related device
CN115223214A (en) Identification method of synthetic mouth-shaped face, model acquisition method, device and equipment
CN116152938A (en) Method, device and equipment for training identity recognition model and transferring electronic resources
CN116542292B (en) Training method, device, equipment and storage medium of image generation model
CN117011650B (en) Method and related device for determining image encoder

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant