CN110234018A - Multimedia content description generation method, training method, device, equipment and medium - Google Patents

Info

Publication number
CN110234018A
Authority
CN
China
Prior art keywords: frame, modal feature, feature sequence, multimedia, description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910616904.XA
Other languages
Chinese (zh)
Other versions
CN110234018B (en)
Inventor
王柏瑞
马林
刘威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910616904.XA
Publication of CN110234018A
Application granted
Publication of CN110234018B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Abstract

This application discloses a description generation method, training method, device, equipment, and medium for multimedia content, belonging to the field of artificial intelligence. The method comprises: invoking a description generation model to perform multi-modal feature extraction on multimedia content, obtaining frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two multimedia frames; invoking the description generation model to fuse the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, obtaining a high-level frame feature sequence, the high-level frame feature sequence comprising the fused modal features corresponding to the at least two multimedia frames; and invoking the description generation model to decode the high-level frame feature sequence, obtaining a natural language description of the multimedia content. This solves the problem that fusing modal features by simple direct concatenation ignores the correlation between different modal features, causing the generated natural language description to miss part of the semantic information.

Description

Multimedia content description generation method, training method, device, equipment and medium
Technical field
This application relates to the field of artificial intelligence, and in particular to a description generation method, training method, device, equipment, and medium for multimedia content.
Background
Video captioning is the task of automatically generating a natural language description from video content, where a natural language description is descriptive text in natural language form.
The related art provides a description generation method for multimedia content in which a computer device uses convolutional neural networks to extract features from the video to be described, obtaining at least two modal features. The computer device directly concatenates the at least two modal features to obtain fused modal features, transforms the fused modal features into video-level features using an attention mechanism, and finally decodes the video-level features to obtain the natural language description of the video.
In the related art, the at least two modal features are fused simply by direct concatenation. This relatively crude fusion scheme ignores the correlation between the different modal features, so the generated natural language description misses part of the semantic information.
Summary of the invention
The embodiments of the present application provide a description generation method, training method, device, equipment, and medium for multimedia content. The technical solutions are as follows:
According to one aspect of the embodiments of the present application, a description generation method for multimedia content is provided, the method comprising:
invoking a description generation model to perform multi-modal feature extraction on multimedia content, obtaining frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two multimedia frames;
invoking the description generation model to fuse the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, obtaining a high-level frame feature sequence, the high-level frame feature sequence comprising the fused modal features corresponding to the at least two multimedia frames; and
invoking the description generation model to decode the high-level frame feature sequence, obtaining a natural language description of the multimedia content.
According to another aspect of the embodiments of the present application, a training method for a description generation model of multimedia content is provided, the method comprising:
obtaining a training sample, the training sample comprising sample multimedia content and a sample description corresponding to the sample multimedia content;
invoking the description generation model to perform multi-modal feature extraction on the sample multimedia content, obtaining frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two multimedia frames;
invoking the description generation model to fuse the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, obtaining a high-level frame feature sequence, the high-level frame feature sequence comprising the fused modal features corresponding to the at least two multimedia frames;
invoking the description generation model to decode the high-level frame feature sequence, obtaining a natural language description of the sample multimedia content;
computing an error loss from the natural language description and the sample description; and
training the description generation model end to end according to the error loss using a back-propagation algorithm.
According to another aspect of the embodiments of the present application, a description generation apparatus for multimedia content is provided, the apparatus comprising:
an encoding module, configured to perform multi-modal feature extraction on multimedia content, obtaining frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two multimedia frames;
a fusion module, configured to invoke a feature crossing module to fuse the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, obtaining a high-level frame feature sequence, the high-level frame feature sequence comprising the fused modal features corresponding to the at least two multimedia frames; and
a decoding module, configured to decode the high-level frame feature sequence, obtaining a natural language description of the multimedia content.
According to another aspect of the embodiments of the present application, a training apparatus for a description generation model of multimedia content is provided, the apparatus comprising:
an obtaining module, configured to obtain a training sample, the training sample comprising sample multimedia content and a sample description corresponding to the sample multimedia content;
an encoding module, configured to perform multi-modal feature extraction on the sample multimedia content, obtaining frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two multimedia frames;
a fusion module, configured to fuse the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, obtaining a high-level frame feature sequence, the high-level frame feature sequence comprising the fused modal features corresponding to the at least two multimedia frames;
a decoding module, configured to decode the high-level frame feature sequence, obtaining a natural language description of the sample multimedia content;
a computing module, configured to compute an error loss from the natural language description and the sample description; and
a training module, configured to train the encoding module, the fusion module, and the decoding module end to end according to the error loss using a back-propagation algorithm.
According to another aspect of the embodiments of the present application, a computer device is provided, the computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the description generation method for multimedia content of the foregoing embodiments, or the training method for a description generation model of multimedia content of the foregoing embodiments.
According to another aspect of the embodiments of the present application, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the description generation method for multimedia content of the foregoing embodiments, or the training method for a description generation model of multimedia content of the foregoing embodiments.
The beneficial effects brought by the technical solutions provided by the embodiments of the present application include at least the following:
An encoder performs multi-modal feature extraction on the multimedia content to obtain frame feature sequences of at least two modal features, and a feature crossing module fuses the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, yielding the result of the mutual influence between the frame feature sequences of the different modalities at the same frame. This effectively mines and reinforces the internal connection between the different modal features, thereby improving the accuracy of the generated multimedia description.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings required for describing the embodiments. Apparently, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a structural block diagram of a description generation system for multimedia content provided by an exemplary embodiment of the present application;
Fig. 2 is a flowchart of a description generation method for multimedia content provided by an exemplary embodiment of the present application;
Fig. 3 is a schematic diagram of a description generation method for multimedia content provided by an exemplary embodiment of the present application;
Fig. 4 is a flowchart of a description generation method for multimedia content provided by another exemplary embodiment of the present application;
Fig. 5 is a schematic diagram of a description generation method for multimedia content provided by another exemplary embodiment of the present application;
Fig. 6 is a flowchart of a training method for a description generation model of multimedia content provided by another exemplary embodiment of the present application;
Fig. 7 is a structural diagram of a description generation method for multimedia content provided by another exemplary embodiment of the present application;
Fig. 8 is a structural diagram of a description generation method for multimedia content provided by yet another exemplary embodiment of the present application;
Fig. 9 is a structural schematic diagram of a description generation apparatus for multimedia content provided by an exemplary embodiment of the present application;
Fig. 10 is a structural schematic diagram of a training apparatus for a description generation model of multimedia content provided by an exemplary embodiment of the present application;
Fig. 11 is a structural block diagram of a server provided by an embodiment of the present application.
Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the following further describes the embodiments of the present application in detail with reference to the drawings.
First, several terms involved in the embodiments of the present application are briefly introduced:
Neural network model: a complex network system formed by a large number of widely interconnected processing units (called neurons). A neural network model simulates and reflects many essential characteristics of human brain function and is a highly complex nonlinear dynamic learning system.
Encoder: a neural network model used to extract the different modalities of a video and obtain the modal features of the video frames in each modality.
Residual cross gate: a structure that performs feature extraction on the modal features of the same frame across the different modalities of the video content, obtaining, for each frame, the modal features of the video in the different modalities at that frame, so as to capture the mutual influence between the different modal features.
Attention mechanism (Attention Mechanism): a means of quickly filtering high-value information out of a large amount of information using limited attention resources. The visual attention mechanism is a signal-processing mechanism specific to human vision: by quickly scanning the global image, human vision locates the target region that needs attention, namely the focus of attention, devotes more attention resources to that region to obtain more detail about the target, and suppresses other useless information. Attention mechanisms are widely used in deep learning tasks of various types, such as natural language processing, image recognition, and speech recognition, and are among the core technologies in deep learning that most deserve attention and in-depth understanding.
Decoder: a neural network that, after the residual cross fusion of the multiple modal features of each frame, generates natural language from the fused modal features, obtaining the natural language that describes the video corresponding to those modal features.
Fig. 1 shows a structural block diagram of a description generation system for multimedia content provided by an exemplary embodiment of the present application. The system comprises a terminal 10 and a server 20.
The terminal 10 is connected to the server 20 through a wireless or wired network. The terminal 10 may be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP4 player, and a laptop portable computer. An application program supporting multimedia is installed and running on the terminal 10, where the multimedia is information containing video content, for example sound video or silent video; this application describes the solution using video as an example. The application program may be any of a video social application, an instant messaging application, a team video application, a social application that groups people by topic, channel, or circle, a shopping-based social application, a browser program, or a video program. Schematically, the terminal 10 is the terminal used by a user.
The server 20 comprises at least one of a server, multiple servers, a cloud computing platform, or a virtualization center. The server 20 provides background services for the application program supporting content recommendation. Optionally, the server 20 undertakes the primary computing work and the terminal 10 the secondary computing work; or the server 20 undertakes the secondary computing work and the terminal 10 the primary computing work; or the server 20 and the terminal 10 perform collaborative computing using a distributed computing architecture.
Optionally, the server 20 comprises an encoder 201, a residual cross gate 202, a fusion part 203, and a decoder 204. The encoder 201 receives the video sent by the terminal 10, divides the video into multiple modalities, and extracts the modal features of particular video frames in each modality. Illustratively, the video is divided into four modalities, each modality containing modal features composed of a frame feature sequence, where a frame feature sequence is formed by the features of the video at certain frame positions in that modality. For example, a frame feature sequence comprises frame feature 1, frame feature 2, frame feature 3, frame feature 4, ..., frame feature n; the n frame features arranged in order form the frame feature sequence, which constitutes the modal features of the corresponding modality.
The encoder 201 comprises n convolutional neural networks, each convolutional neural network extracting one kind of modal feature. For example, the encoder 201 comprises two convolutional neural networks, a first convolutional neural network and a second convolutional neural network, where the first extracts the dynamic features of the video and the second extracts the static features of the video.
The residual cross gate 202 crosses the modal features of the different modalities to obtain the mutual-influence results between the different modal features. The fusion part 203 in the residual cross gate 202 fuses the crossed modal features to generate the high-level frame feature sequence.
The decoder 204 generates the natural language describing the video from the high-level frame feature sequence.
The terminal 10 may refer generally to one of multiple terminals; this embodiment is illustrated with the terminal 10 only. The terminal type includes at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP4 player, and a laptop portable computer.
A person skilled in the art will appreciate that the number of terminals may be larger or smaller; for example, there may be only one terminal, or dozens, hundreds, or more. The embodiments of the present application do not limit the number of terminals or the device types.
Fig. 2 shows a flowchart of a description generation method for multimedia content provided by an exemplary embodiment. The method is applied to a computer device provided with the encoder 201, the residual cross gate 202, and the decoder 204, and comprises the following steps:
Step 101: the server invokes the encoder to perform multi-modal feature extraction on the video, obtaining frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two video frames.
The encoder 201 contains convolutional neural networks for extracting the different modal features of the video: when n kinds of modal features need to be extracted, n convolutional neural networks are provided, each extracting the modal features of one modality of the video.
The server 20 uses the multiple convolutional neural networks to extract the modal features of the video at the positions of all frames, or of some frames, of the video, obtaining the modal features of the video in the different modalities; the modal features in each modality form a frame feature sequence.
Illustratively, consider the case where the convolutional neural networks extract modal features at the positions of all frames. Referring to Fig. 3, the video contains 90 frames, and the encoder 201 contains a first convolutional neural network, which extracts the dynamic features of the video, and a second convolutional neural network, which extracts its static features. The encoder 201 uses the two convolutional neural networks to extract features from the image content of the 1st frame, obtaining the dynamic features and static features at the 1st frame. This step is repeated for the remaining 89 frames in turn, yielding the dynamic frame feature sequence and the static frame feature sequence of the video.
In another possible embodiment, the two convolutional neural networks extract modal features at the positions of only a part of the frames of the video, and both networks extract at the same positions. For example, when the first convolutional neural network extracts modal features at the 1st/3rd/4th/5th frames of the video, the second convolutional neural network also extracts modal features at the 1st/3rd/4th/5th frames; that is, at each of those frame positions, the first and the second convolutional neural network are both applied to the video to extract modal features.
It should be noted that extracting modal features with two convolutional neural networks is only an exemplary description, and this embodiment is not limited thereto: the number of convolutional neural networks may also be larger, for example 3, 4, or 5, and is set by a technician according to actual needs.
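To make step 101 concrete, the following minimal sketch (PyTorch is assumed; ResNet-18 is an illustrative stand-in backbone, since the specific networks are not fixed here) extracts a dynamic and a static frame feature sequence from a 90-frame video:

```python
# Illustrative sketch of step 101: two CNN backbones, one frame feature sequence each.
# ResNet-18 is an assumed stand-in; any per-frame CNN could play either role.
import torch
from torchvision import models

def make_backbone():
    net = models.resnet18(weights=None)
    net.fc = torch.nn.Identity()        # keep the 512-d pooled feature per frame
    return net.eval()

dynamic_cnn = make_backbone()           # first convolutional neural network
static_cnn = make_backbone()            # second convolutional neural network

frames = torch.randn(90, 3, 224, 224)   # the 90-frame video from the example above
with torch.no_grad():
    dyn_seq = dynamic_cnn(frames)       # (90, 512): dynamic frame feature sequence
    sta_seq = static_cnn(frames)        # (90, 512): static frame feature sequence
```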
Step 102: the server invokes the residual cross gate to fuse the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, obtaining a high-level frame feature sequence; the high-level frame feature sequence comprises the fused modal features corresponding to the at least two video frames.
The residual cross gate 202 is a neural network for extracting the mutual-influence results between different features. In this embodiment, the residual cross-gate fusion module is built with a nonlinear neural network. Since the modal features extracted for the different modalities all lie at the same frame positions of the video, the residual cross-gate fusion module fuses the at least two modal features at the positions where the modal features were extracted, obtaining the fused high-level frame feature sequence, which comprises the fused modal features corresponding to the at least two video frames.
Step 103: the decoder is invoked to decode the high-level frame feature sequence, obtaining the natural language description of the video.
The high-level frame feature sequence is input to the decoder 204, and each fused modal feature in the high-level frame feature sequence is decoded, obtaining the natural language description corresponding to the video.
In summary, in the method provided by the embodiments of the present application, an encoder performs multi-modal feature extraction on the video to obtain frame feature sequences of at least two modal features, and a residual cross gate fuses the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, yielding the mutual influence between the frame feature sequences of the different modalities at the same frame. The internal connection between the different modal features can thus be discovered and reinforced, improving the accuracy of the generated video description.
Referring to Fig. 4, Fig. 4 is a flowchart of a description generation method for multimedia content provided by another exemplary embodiment of the present application. The method comprises the following steps:
Step 201: n convolutional neural networks are respectively invoked to perform feature extraction on the video frames in the video, obtaining the modal features of the video frames.
Each convolutional neural network extracts the one modality of the video that corresponds to it. For example, when extracting the modal features of the video in the dynamic modality and the static modality, a first convolutional neural network for extracting dynamic modal features and a second convolutional neural network for extracting static modal features are respectively used. In this embodiment, the convolutional neural network used by the encoder 201 to extract modal features is inception_resnet_V2, which extracts dynamic features and static features for each frame of the video. Illustratively, the feature is a 1536-dimensional feature vector, but this embodiment is not limited thereto.
Step 202: the modal features belonging to the same type are combined in the chronological order of the video frames in the video, obtaining the frame feature sequence of each kind of modal feature.
Step 201 is repeated to perform feature extraction on multiple frames of the video, obtaining multiple frame features of the video.
In an alternative embodiment, the encoder 201 further comprises a first recurrent neural network. After the frame feature sequence of each kind of modal feature is obtained, the first recurrent neural network is invoked to extract the temporal features in the frame feature sequences of the modal features, obtaining frame feature sequences containing temporal features.
Illustratively, referring to Fig. 3, the temporal features are obtained by using a recurrent neural network based on long short-term memory units (Long Short-Term Memory, LSTM) as the sequence encoder, extracting temporal information from the multi-modal convolutional neural network feature sequences. For modality 1, the process can be expressed as:

$$h_i^{(1)},\, c_i^{(1)} = \mathrm{LSTM}^{(1)}\big(v_i^{(1)},\, h_{i-1}^{(1)},\, c_{i-1}^{(1)}\big)$$

where $\mathrm{LSTM}^{(1)}$ denotes the general computation of the long short-term memory unit over the convolutional neural network feature sequence of modality 1, $h_i^{(1)}$ denotes the hidden state of $\mathrm{LSTM}^{(1)}$ after the i-th frame's convolutional neural network feature $v_i^{(1)}$ is input, and $c_i^{(1)}$ denotes the memory cell state after that input; $h_i^{(1)}$ serves as the modal feature of the i-th frame image of modality 1 with temporal information embedded. This finally yields the modality-1 feature sequence embedded with temporal information, $V^{(1)} = \{h_1^{(1)}, h_2^{(1)}, \dots, h_m^{(1)}\}$. For modality 2, the modality-2 feature sequence $V^{(2)} = \{h_1^{(2)}, h_2^{(2)}, \dots, h_m^{(2)}\}$ is obtained after a similar operation.
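A minimal sketch of this sequence encoder (shapes are assumed; one LSTM per modality, matching $\mathrm{LSTM}^{(1)}$ and $\mathrm{LSTM}^{(2)}$ above):

```python
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Embeds temporal information into one modality's frame feature sequence."""
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_seq):             # frame_seq: (batch, m, feat_dim)
        hidden_seq, _ = self.lstm(frame_seq)  # hidden state h_i for every frame i
        return hidden_seq                     # (batch, m, hidden_dim)

encoder1 = TemporalEncoder()  # LSTM^(1) for modality 1
encoder2 = TemporalEncoder()  # LSTM^(2) for modality 2
```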
Step 203: for the i-th video frame in the video, the cross-gate processing part is invoked to compute the influence results between any two modal features in the i-th video frame.
Referring to Fig. 3 and Fig. 5, the convolutional neural networks extract the modal features of modality 1 and modality 2 of the video, where the modal features CNN1 of modality 1 comprise the modal features of the (i-1)-th frame and of the i-th frame, and the modal features CNN2 of modality 2 likewise comprise the modal features of the (i-1)-th frame and of the i-th frame. Invoking the cross-gate processing part to compute the influence results between any two modal features in the i-th video frame comprises: for the i-th video frame in the video, invoking the cross-gate processing part to compute the influence result of the first modal feature on the second modal feature in the i-th video frame, and computing the influence result of the second modal feature on the first modal feature in the i-th video frame.
For the multiple modal features of each frame, the residual cross gate computes the results of the mutual influence between the multiple modal features.
The computation of the residual cross gate can be expressed as:

$$\mathrm{Gating}\langle x, y\rangle = \sigma(w x + b) \odot y + y$$

where x and y denote the two input variables of the residual cross-gate operation, σ denotes the nonlinear activation function ReLU, ⊙ denotes element-wise multiplication, and w and b are the learnable parameters of the module; the gate computed from x rescales y, and the residual term preserves y.
Illustratively, the interaction between the multiple modal features of the i-th frame is computed as:

$$g_i^{(1)} = \mathrm{Gating}\big\langle h_i^{(2)},\, h_i^{(1)}\big\rangle, \qquad g_i^{(2)} = \mathrm{Gating}\big\langle h_i^{(1)},\, h_i^{(2)}\big\rangle$$

By computing the mutual influence of the two modal features, the association and mutual influence between the multiple modal features corresponding to the same frame can be obtained, improving the accuracy of the frame's modal features when describing the video.
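A minimal sketch of the residual cross gate as defined by the formula above (PyTorch is assumed; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ResidualCrossGate(nn.Module):
    """Gating<x, y> = ReLU(w x + b) * y + y, per the formula above."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)   # learnable w and b

    def forward(self, x, y):
        return torch.relu(self.proj(x)) * y + y   # gate from x, residual on y

gate_1 = ResidualCrossGate()   # parameters for the gated result of modality 1
gate_2 = ResidualCrossGate()   # parameters for the gated result of modality 2
h1_i, h2_i = torch.randn(1, 512), torch.randn(1, 512)  # frame i, modalities 1 and 2
g1_i = gate_1(h2_i, h1_i)      # g_i^(1): modality 2 acting on modality 1
g2_i = gate_2(h1_i, h2_i)      # g_i^(2): modality 1 acting on modality 2
```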
Step 204: the fusion part is invoked to fuse, according to the influence results, the features of the at least two modalities corresponding to the i-th video frame, obtaining the fused modal features.
After the mutual-influence result of each modal feature is obtained, the fusion part of the cross gate fuses the mutual-influence results $g_i^{(1)}$ and $g_i^{(2)}$ of the multiple modal features obtained in the previous step:

$$x_i = \sigma\big(w_f\,[\,g_i^{(1)};\, g_i^{(2)}\,] + b_f\big)$$

where $w_f$ and $b_f$ are the learnable parameters of the fusion part, $[\cdot\,;\cdot]$ denotes concatenation, and $x_i$ is the feature vector of the i-th frame image after fusion.
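Continuing the sketch, the fusion part can be written as a learnable projection (the concatenate-then-project form mirrors the reconstructed formula above and is an assumption, not the patent's verbatim code):

```python
import torch
import torch.nn as nn

class FusionPart(nn.Module):
    """Fuses the two mutual-influence results g_i^(1), g_i^(2) into x_i."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)   # learnable w_f and b_f

    def forward(self, g1, g2):
        return torch.relu(self.proj(torch.cat([g1, g2], dim=-1)))  # x_i
```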
Step 205: the fused modal features are combined according to the chronological order of the video frames in the video, obtaining the high-level frame feature sequence.
Step 204 is repeated: the modal features corresponding to all frames are fused and combined in the chronological order of the video frames in the video, obtaining the high-level frame feature sequence $X = \{x_1, x_2, \dots, x_m\}$. Illustratively, the server orders the fused modal features according to the temporal features.
Step 206: the temporal attention module is invoked to perform attention computation on the high-level frame feature sequence, obtaining the weight of the fused modal feature corresponding to each video frame at the t-th decoding moment.
After the high-level frame feature sequence is obtained, the temporal attention module based on the temporal attention mechanism integrates the high-level frame feature sequence $X = \{x_1, x_2, \dots, x_m\}$:

$$\phi_t(X) = \sum_{i=1}^{m} \alpha_i^{(t)}\, x_i$$

where $\alpha_i^{(t)}$ denotes the dynamic weight of the frame feature $x_i$ at the t-th decoding moment, satisfying the condition $\sum_{i=1}^{m} \alpha_i^{(t)} = 1$, with 0 < t < m.
The temporal attention module based on the temporal attention mechanism enables the decoder to dynamically select key frame features, determining, from the multiple modal features in the high-level frame feature sequence, the one or several modal features that have a critical impact on the video description.
Step 207: the temporal attention module is invoked to integrate the weights of the fused modal features corresponding to each video frame, obtaining the weighted sum at the t-th decoding moment.
Since different modal features differ in importance when expressing the video content, within a segment of video the modal features that carry greater influence on the video are assigned higher weights.
Step 208: the second recurrent neural network is invoked to decode the weighted sum of the fused modal features corresponding to each video frame together with the hidden state of the (t-1)-th decoding moment, obtaining a probability distribution over the words in the dictionary; the word with the highest probability is output as the result of this decoding step.
The dictionary is a lexicon that provides the candidate natural language words.
The decoder's word-prediction process can be expressed as:

$$h_t,\, c_t = \mathrm{LSTM}\big([E(s_{t-1}),\, \phi_t(X)],\, h_{t-1}\big)$$

$$P(s_t \mid s_{<t}, V; \theta) = \mathrm{Softmax}(W_s h_t + b_s)$$

where $h_t$ denotes the hidden state of the decoder at the current moment, $c_t$ denotes the memory cell state of the decoder at the current moment, $h_{t-1}$ denotes the hidden state of the decoder at the previous moment, and $E(s_{t-1})$ denotes the mapping of the previous moment's word $s_{t-1}$ into the vector space. $W_s$ and $b_s$ are learnable parameters, V denotes the input video, and θ denotes the parameters of the whole network. The Softmax function converts the decoder's current hidden state into a probability distribution over the words, from which the most likely word is predicted.
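A minimal sketch of one decoding step, combining the temporal attention of steps 206 and 207 with the recurrent decoder of step 208 (PyTorch is assumed; the attention scoring function and all dimensions are illustrative choices, not fixed by this disclosure):

```python
import torch
import torch.nn as nn

class AttentiveDecoderStep(nn.Module):
    """One decoding step: phi_t(X) by temporal attention, then LSTM + Softmax."""
    def __init__(self, dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # E(.)
        self.score = nn.Linear(2 * dim, 1)          # attention scorer (assumed form)
        self.cell = nn.LSTMCell(2 * dim, dim)       # consumes [E(s_{t-1}), phi_t(X)]
        self.out = nn.Linear(dim, vocab_size)       # W_s, b_s

    def forward(self, X, prev_word, h, c):
        # X: (m, dim) high-level frame features; prev_word: (1,); h, c: (1, dim)
        scores = self.score(torch.cat([X, h.expand(X.size(0), -1)], dim=-1))
        alpha = torch.softmax(scores, dim=0)            # alpha_i^(t), sums to 1
        phi = (alpha * X).sum(dim=0, keepdim=True)      # phi_t(X)
        inp = torch.cat([self.embed(prev_word), phi], dim=-1)
        h, c = self.cell(inp, (h, c))                   # h_t, c_t
        probs = torch.softmax(self.out(h), dim=-1)      # P(s_t | s_<t, V; theta)
        return probs.argmax(dim=-1), h, c               # most likely word s_t
```

Calling this step repeatedly, feeding each predicted word back in as `prev_word`, produces the word sequence until the termination condition of step 209 is met.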
Step 209: when the decoding termination condition is met, the words output at each decoding moment are output in sequence, obtaining the natural language description of the video.
The termination condition of encoding and decoding includes: the video playback ends, or a stop instruction is received.
In summary, in the method provided by this embodiment, a residual cross gate is arranged between the encoder and the decoder; the residual cross gate receives the modal features of the multiple modalities output by the encoder, computes the association between the multiple different modal features corresponding to the same frame, and generates the natural language describing the video from the inter-associated modal features, which can effectively improve the accuracy of the natural language description.
Meanwhile, by arranging the temporal attention module based on the temporal attention mechanism, the modal features in the high-level frame feature sequence that play a key role are computed and assigned higher weights, so that the natural language obtained by the decoder can accurately describe the video content.
The present application also provides a training method for a description generation model of multimedia content. The method is used to train the above model capable of generating a description of video content; the description generation model comprises the encoder, the residual cross gate, and the decoder. Referring to Fig. 6, the training method comprises at least the following steps:
Step 301: a training sample is obtained, the training sample comprising a sample video and a sample description corresponding to the sample video.
The training sample comprises at least one sample video, and each sample video comes with a sample description in one-to-one correspondence with the video.
Step 302: the encoder is invoked to perform multi-modal feature extraction on the sample video, obtaining frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two video frames.
Step 303: the residual cross gate is invoked to fuse the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, obtaining a high-level frame feature sequence; the high-level frame feature sequence comprises the fused modal features corresponding to the at least two video frames.
Step 304: the decoder is invoked to decode the high-level frame feature sequence, obtaining the natural language description of the sample video.
The content of steps 302 to 304 is the same as that of the embodiment of Fig. 2 above and is not repeated here.
Step 305: an error loss is computed from the natural language description and the sample description.
The obtained natural language description is compared with the sample description to obtain the error loss.
Illustratively, the error loss can be computed by minimizing the model loss function; the computation can be expressed as:

$$\mathcal{L}(\theta) = -\sum_{k=1}^{N} \log P\big(S^{k} \mid V^{k}; \theta\big)$$

where $\mathcal{L}(\theta)$ denotes the loss function of the model, N denotes the number of training examples, and $V^k$ and $S^k$ denote the k-th video and its corresponding natural language description. $P(S^k \mid V^k; \theta)$ denotes the probability of generating the natural language description for the k-th video, which can be expressed as:

$$P\big(S^{k} \mid V^{k}; \theta\big) = \prod_{t} P\big(s_t^{k} \mid s_{<t}^{k}, V^{k}; \theta\big)$$

where $V^k$ denotes the k-th video, $S^k$ denotes the corresponding natural language description of the k-th video, $s_t^k$ denotes the word predicted at the current moment t while generating the natural language description $S^k$ for $V^k$, $s_{<t}^k$ denotes the words predicted before the current moment, and θ denotes the network parameters.
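Under these definitions, the per-video term of the loss reduces to a cross-entropy summed over the decoding steps; a minimal sketch (variable names are illustrative):

```python
import torch.nn.functional as F

def caption_loss(step_logits, target_word_ids):
    """-log P(S^k | V^k; theta) for one video: cross-entropy summed over time.
    step_logits: (T, vocab) pre-softmax decoder outputs; target_word_ids: (T,)."""
    return F.cross_entropy(step_logits, target_word_ids, reduction="sum")
```

Summing this quantity over the N training pairs gives $\mathcal{L}(\theta)$ above, which back-propagation then minimizes end to end.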
Step 306: the encoder, the residual cross gate, and the decoder are trained end to end according to the error loss using a back-propagation algorithm.
Steps 301 to 306 are repeated to optimize the model.
In summary, in the training method provided by this embodiment, modal features are extracted from the sample video, natural language is generated from the extracted modal features, the generated natural language is compared with the sample description, and the weights of the modal features in the model are adjusted. End-to-end training can effectively improve the accuracy with which the model generates natural language.
The description generation method for multimedia content in the above embodiments can be applied at least to video search and video classification scenarios.
The description generation method for multimedia content provided by the embodiments of the present application can be applied to a terminal installed with a target application program, the target application program being an application with a video receiving or sending function; the terminal may be a smartphone, a tablet computer, a personal computer, a portable computer, or the like. For example, the target application program is a game, a social application, an instant messaging application, a video playback application, or the like, which is not limited by the embodiments of the present application.
The application of the method to a video search scenario is described below with reference to Fig. 7, and the application of the method to a video classification scenario with reference to Fig. 8.
1. Video search scenario
The terminal 10 searches the server 20 for relevant video content through natural language descriptions. For the video content in the server 20, a natural language description of each video is obtained by the description generation method for multimedia content of the foregoing embodiments; the terminal 10 searches the natural language descriptions in the server 20 by text, and the video content corresponding to a matching natural language description is pushed to the terminal 10.
For example, after the terminal 10 is connected to the server 20 through a wired or wireless network, the search keyword is "tiger eats chicken". The server 20 stores multiple video contents and the natural language descriptions corresponding to them. Illustratively, the videos in the server include: video A21, whose corresponding natural language description 1 is: Cai X and Zhang X play basketball on a basketball court; video B22, whose corresponding natural language description 2 is: Mr. Li presents awards on a stage covered with a red carpet; and video C23, whose natural language description 3 is: two Siberian tigers in a zoo chase a chicken, and the chicken is caught under a tree.
After receiving the keyword "tiger eats chicken", the server 20 compares it by search with the natural language descriptions corresponding to the multiple videos, ranks them by relevance, and pushes the video content to the terminal 10 in order of relevance from high to low; video C23 is ultimately pushed to the terminal 10.
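As a toy illustration of this relevance ranking (plain word overlap; a real system would use a proper retrieval model, and the function below is not part of this disclosure):

```python
def rank_videos(query, descriptions):
    """Rank stored natural language descriptions by word overlap with the query."""
    query_words = set(query.lower().split())
    return sorted(
        descriptions,
        key=lambda vid: len(query_words & set(descriptions[vid].lower().split())),
        reverse=True,
    )

descriptions = {
    "video A21": "Cai X and Zhang X play basketball on a basketball court",
    "video C23": "two Siberian tigers in a zoo chase a chicken under a tree",
}
print(rank_videos("tiger eats chicken", descriptions))  # video C23 ranks first
```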
2. Video classification scenario
Referring to Fig. 8, in the video classification scenario, the description generation method for multimedia content provided by the embodiments of the present application can be implemented as a description generation model 30 for multimedia content within an application program. For example, when the terminal uploads video A21, video B22, and video C23 to the server 20, the description generation model 30 performs feature extraction on each video, thereby obtaining the natural language description of the video, classifies the videos according to the natural language descriptions, and stores the video content under the corresponding category. After classification, videos A and B are entertainment-related and are therefore stored in the storage section corresponding to type 1; video C is of documentary type and is therefore stored in the storage section corresponding to type 2. Illustratively, the classification criterion may be the names of the characters in the video, the video type (animation, film, TV series), or the video content (science, education, entertainment).
It should be noted that the content of the above natural language descriptions of the videos is only exemplary; in a real scenario, the natural language descriptions would include more detailed features.
The above is only a schematic illustration using several possible application scenarios as examples; the method provided by the embodiments of the present application can also be applied to other application scenarios requiring the generation of descriptions of video content, and the embodiments of the present application do not limit the specific application scenarios.
The present application also provides a description generation apparatus for video content. Referring to Fig. 9, the apparatus comprises:
an encoding module 501, configured to invoke the encoder to perform multi-modal feature extraction on the video, obtaining frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two video frames; a fusion module 502, configured to invoke the feature crossing module to fuse the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, obtaining a high-level frame feature sequence, the high-level frame feature sequence comprising the fused modal features corresponding to the at least two video frames; and a decoding module 503, configured to invoke the decoder to decode the high-level frame feature sequence, obtaining the natural language description of the video.
The fusion module 502 is configured, for the i-th video frame in the video, to invoke the cross-gate processing part to compute the influence result of the first modal feature on the second modal feature in the i-th video frame and the influence result of the second modal feature on the first modal feature in the i-th video frame.
The encoding module 501 is further configured to respectively invoke the n convolutional neural networks to perform feature extraction on the video frames in the video, obtaining the modal features of the video frames; to combine the modal features belonging to the same type according to the chronological order of the video frames in the video, obtaining the frame feature sequence of each kind of modal feature; and to invoke the recurrent neural network to extract the temporal features in the frame feature sequences of the modal features, obtaining frame feature sequences containing temporal features.
The decoding module 503 is further configured to invoke the temporal attention module to perform attention computation on the high-level frame feature sequence, obtaining the weight of the fused modal feature corresponding to each video frame at the t-th decoding moment; to invoke the temporal attention module to integrate the weights of the fused modal features corresponding to each video frame, obtaining the weighted sum at the t-th decoding moment; to invoke the second recurrent neural network to decode the weighted sum together with the fused modal features corresponding to each video frame and the hidden state of the (t-1)-th decoding moment, obtaining a probability distribution over the words in the dictionary, and to output the word with the highest probability as the result of this decoding step; and, when the decoding termination condition is met, to output the words from each decoding moment in sequence, obtaining the natural language description of the video.
Disclosed herein as well is a training apparatus for a description generation model of multimedia content, the apparatus comprising:
an obtaining module 601, configured to obtain a training sample, the training sample comprising a sample video and a sample description corresponding to the sample video; an encoding module 602, configured to invoke the encoder to perform multi-modal feature extraction on the sample video, obtaining frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two video frames; a fusion module 603, configured to invoke the feature crossing module to fuse the modal features belonging to the same frame across the frame feature sequences of the at least two modal features, obtaining a high-level frame feature sequence, the high-level frame feature sequence comprising the fused modal features corresponding to the at least two video frames; a decoding module 604, configured to invoke the decoder to decode the high-level frame feature sequence, obtaining the natural language description of the sample video; a computing module 605, configured to compute the error loss from the natural language description and the sample description; and a training module 606, configured to train the encoder, the feature crossing module, and the decoder end to end according to the error loss using a back-propagation algorithm.
The present application also provides a computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the description generation method for multimedia content provided by the foregoing embodiments, or the training method for a description generation model of multimedia content provided by the foregoing embodiments.
The present application also provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the description generation method for multimedia content provided by the foregoing embodiments, or the training method for a description generation model of multimedia content provided by the foregoing embodiments.
Figure 11 shows a structural schematic diagram of a server provided by an embodiment of the present application. The server is used to implement the description generation method for multimedia content provided in the above embodiments. Specifically:
The server 800 comprises a central processing unit (CPU) 801, a system memory 804 comprising a random access memory (RAM) 802 and a read-only memory (ROM) 803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The server 800 further comprises a basic input/output system (I/O system) 806 that helps transmit information between the devices in the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 comprises a display 808 for displaying information and an input device 809, such as a mouse or a keyboard, for the user to input information, the display 808 and the input device 809 both being connected to the central processing unit 801 through an input/output controller 810 connected to the system bus 805. The basic input/output system 806 may further comprise the input/output controller 810 for receiving and processing input from multiple other devices such as a keyboard, a mouse, or an electronic stylus. Similarly, the input/output controller 810 also provides output to a display screen, a printer, or other types of output devices.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable medium provide non-volatile storage for the server 800. That is, the mass storage device 807 may comprise a computer-readable medium (not shown) such as a hard disk or a CD-ROM drive.
Without loss of generality, the computer-readable medium may comprise computer storage media and communication media. Computer storage media comprise volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media comprise RAM, ROM, EPROM, EEPROM, flash memory or other solid-state storage technologies, CD-ROM, DVD or other optical storage, cassettes, magnetic tape, disk storage, or other magnetic storage devices. Certainly, a person skilled in the art will know that computer storage media are not limited to the above. The system memory 804 and the mass storage device 807 above may be collectively referred to as memory.
According to various embodiments of the present application, the server 800 may also be operated by a remote computer connected through a network such as the Internet. That is, the server 800 may be connected to the network 812 through a network interface unit 811 connected to the system bus 805; in other words, the network interface unit 811 may also be used to connect to other types of networks or remote computer systems (not shown).
Those skilled in the art, after considering the specification and practicing the invention disclosed herein, will readily conceive of other embodiments of the present application. This application is intended to cover any variations, uses, or adaptive changes of the present application that follow the general principles of the present application and include common knowledge or conventional technical means in the art not disclosed in the present application. The specification and embodiments are to be considered exemplary only, and the true scope and spirit of the present application are indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.
It should be understood that "multiple" herein refers to two or more. "And/or" describes an association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate the three cases of A existing alone, A and B existing simultaneously, and B existing alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (15)

1. A method for generating a description of multimedia content, the method comprising:
invoking a description generation model to perform multi-modal feature extraction on the multimedia content to obtain frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two multimedia frames;
invoking the description generation model to fuse the modal features that belong to the same frame in the frame feature sequences of the at least two modal features, to obtain an advanced frame feature sequence, the advanced frame feature sequence comprising the fused modal features corresponding to the at least two multimedia frames; and
invoking the description generation model to decode the advanced frame feature sequence to obtain a natural language description of the multimedia content.
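Read as a data-flow recipe, claim 1 specifies three stages: per-modality feature extraction, per-frame fusion, and sequence decoding. The following minimal PyTorch sketch wires those stages together end to end; the choice of two modalities (appearance and motion), the linear stand-ins for the real encoders, and all dimensions are illustrative assumptions, not the patented implementation. The fusion gate and the decoder are elaborated in the sketches after claims 3 and 6 below.

```python
import torch
import torch.nn as nn

class TinyDescriptionModel(nn.Module):
    """Toy extract -> fuse -> decode pipeline for two modalities."""

    def __init__(self, in_dim=2048, feat_dim=256, vocab_size=1000, max_len=20):
        super().__init__()
        self.enc_a = nn.Linear(in_dim, feat_dim)      # stand-in appearance encoder
        self.enc_b = nn.Linear(in_dim, feat_dim)      # stand-in motion encoder
        self.gate_ab = nn.Linear(feat_dim, feat_dim)  # influence of A on B
        self.gate_ba = nn.Linear(feat_dim, feat_dim)  # influence of B on A
        self.decoder = nn.GRUCell(feat_dim, feat_dim)
        self.out = nn.Linear(feat_dim, vocab_size)
        self.max_len = max_len

    def forward(self, feats_a, feats_b):
        # feats_a, feats_b: (batch, num_frames, in_dim), one row per frame
        a, b = self.enc_a(feats_a), self.enc_b(feats_b)
        # per-frame fusion: each modality is gated by the other, then summed
        fused = torch.sigmoid(self.gate_ba(b)) * a + torch.sigmoid(self.gate_ab(a)) * b
        # decode: the mean-pooled fused feature drives a GRU, one word per step
        h = fused.mean(dim=1)
        ctx = h
        logits = []
        for _ in range(self.max_len):
            h = self.decoder(ctx, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (batch, max_len, vocab_size)

model = TinyDescriptionModel()
logits = model(torch.randn(2, 8, 2048), torch.randn(2, 8, 2048))
print(logits.shape)  # torch.Size([2, 20, 1000])
```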
2. The method according to claim 1, wherein the description generation model comprises a fusion module, and the fusion module comprises a cross-gate processing part and a fusion part; and
the invoking the description generation model to fuse the modal features that belong to the same frame in the frame feature sequences of the at least two modal features, to obtain an advanced frame feature sequence comprises:
for an i-th multimedia frame in the multimedia content, invoking the cross-gate processing part to calculate an influence result between every two modal features in the i-th multimedia frame;
invoking the fusion part to fuse, according to the influence result, the at least two modal features corresponding to the i-th multimedia frame, to obtain fused modal features; and
combining the fused modal features according to the chronological order of the multimedia frames in the multimedia content, to obtain the advanced frame feature sequence.
3. The method according to claim 2, wherein the modal features comprise a first modal feature and a second modal feature; and
the invoking the cross-gate processing part to calculate an influence result between every two modal features in the i-th multimedia frame comprises:
for the i-th multimedia frame in the multimedia content, invoking the cross-gate processing part to calculate an influence result of the first modal feature on the second modal feature in the i-th multimedia frame, and to calculate an influence result of the second modal feature on the first modal feature in the i-th multimedia frame.
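Claims 2 and 3 pin down the fusion step: for every frame, the influence of each modality on the other is computed, and the fusion part combines the reweighted features in chronological order. Below is a hedged sketch of one plausible reading; the sigmoid-of-a-linear-map gate and the concatenation used for the final combination are assumptions, since the claims only require that the pairwise influence results be computed and fused.

```python
import torch
import torch.nn as nn

class CrossGateFusion(nn.Module):
    """Cross-gate sketch: each modality is reweighted by a sigmoid gate
    computed from the other modality, then the two streams are combined."""

    def __init__(self, dim):
        super().__init__()
        self.a_to_b = nn.Linear(dim, dim)  # influence of modality A on B
        self.b_to_a = nn.Linear(dim, dim)  # influence of modality B on A

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, num_frames, dim), aligned frame by frame
        gate_for_a = torch.sigmoid(self.b_to_a(feat_b))  # B's influence result on A
        gate_for_b = torch.sigmoid(self.a_to_b(feat_a))  # A's influence result on B
        fused_a = gate_for_a * feat_a
        fused_b = gate_for_b * feat_b
        # frames stay in chronological order along dim 1, as claim 2 requires
        return torch.cat([fused_a, fused_b], dim=-1)     # advanced frame features

fusion = CrossGateFusion(dim=256)
advanced = fusion(torch.randn(2, 8, 256), torch.randn(2, 8, 256))
print(advanced.shape)  # torch.Size([2, 8, 512])
```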
4. The method according to any one of claims 1 to 3, wherein the description generation model comprises an encoding module, the encoding module comprises n convolutional neural networks, and each convolutional neural network is used to extract one type of modal feature; and
the invoking the description generation model to perform multi-modal feature extraction on the multimedia content to obtain frame feature sequences of at least two modal features comprises:
invoking the n convolutional neural networks to separately perform feature extraction on the multimedia frames in the multimedia content, to obtain the modal features of the multimedia frames; and
combining the modal features that belong to the same type according to the chronological order of the multimedia frames in the multimedia content, to obtain the frame feature sequence of each type of modal feature.
5. The method according to claim 4, wherein the encoding module further comprises a first recurrent neural network; and
after the combining the modal features that belong to the same type according to the chronological order of the multimedia frames in the multimedia content, to obtain the frame feature sequence of each type of modal feature, the method further comprises:
invoking the first recurrent neural network to extract temporal features from the frame feature sequences of the modal features, to obtain frame feature sequences containing the temporal features.
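Claims 4 and 5 describe the encoder: one CNN per modality yields a per-frame feature, the per-frame features are stacked in chronological order, and a first recurrent network re-encodes the sequence with temporal context. The sketch below shows one such branch; using ResNet-50 as the appearance CNN and a GRU as the recurrent network are assumptions, and a second branch (for example, a 3D CNN over stacked frames for motion) would be built analogously to reach the n encoders of claim 4.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ModalEncoder(nn.Module):
    """One encoder branch: per-frame CNN features + a temporal GRU."""

    def __init__(self, feat_dim=256):
        super().__init__()
        cnn = models.resnet50()      # stand-in appearance CNN (untrained here)
        cnn.fc = nn.Identity()       # expose the 2048-d pooled feature
        self.cnn = cnn
        self.temporal_rnn = nn.GRU(2048, feat_dim, batch_first=True)

    def forward(self, frames):
        # frames: (batch, num_frames, 3, 224, 224)
        b, t = frames.shape[:2]
        per_frame = self.cnn(frames.flatten(0, 1))  # (batch*num_frames, 2048)
        seq = per_frame.view(b, t, -1)              # chronological order preserved
        seq_with_time, _ = self.temporal_rnn(seq)   # frame features with temporal context
        return seq_with_time                        # (batch, num_frames, feat_dim)

enc = ModalEncoder()
seq = enc(torch.randn(2, 8, 3, 224, 224))
print(seq.shape)  # torch.Size([2, 8, 256])
```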
6. The method according to any one of claims 1 to 3, wherein the description generation model comprises a decoding module, and the decoding module comprises a temporal attention module and a second recurrent neural network; and
the invoking the description generation model to decode the advanced frame feature sequence to obtain a natural language description of the multimedia content comprises:
invoking the temporal attention module to perform attention calculation on the advanced frame feature sequence, to obtain a weight of the fused modal features corresponding to each multimedia frame at a t-th decoding moment;
invoking the temporal attention module to integrate the weights of the fused modal features corresponding to the multimedia frames, to obtain a weighted sum at the t-th decoding moment;
invoking the second recurrent neural network to decode the weighted sum, the fused modal features corresponding to each multimedia frame, and a hidden state at a (t-1)-th decoding moment, to obtain a probability distribution over the words in a dictionary, and outputting the word with the highest probability as the output of this decoding step; and
when a decoding termination condition is met, outputting the words produced at the decoding moments in sequence, to obtain the natural language description of the multimedia content.
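Claim 6 is a conventional attend-then-decode loop: at each decoding moment the attention module scores every frame's fused feature against the previous hidden state, normalizes the scores into weights, integrates them into a weighted sum, and the recurrent cell consumes that context together with the previous state to emit the highest-probability word, stopping at a termination condition. A hedged sketch follows; the additive scoring function, the greedy argmax decoding, and the BOS/EOS token ids are assumptions the claim does not fix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionDecoder(nn.Module):
    """Temporal-attention decoder sketch: weight frames per step, decode greedily."""

    def __init__(self, feat_dim, hid_dim, vocab_size, eos_id=2, max_len=20):
        super().__init__()
        self.score = nn.Linear(feat_dim + hid_dim, 1)  # additive attention scorer
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.cell = nn.GRUCell(feat_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)
        self.eos_id, self.max_len = eos_id, max_len

    def forward(self, frame_feats, bos_id=1):
        # frame_feats: (batch, num_frames, feat_dim), the advanced frame sequence
        b, n, _ = frame_feats.shape
        h = frame_feats.new_zeros(b, self.cell.hidden_size)
        word = torch.full((b,), bos_id, dtype=torch.long, device=frame_feats.device)
        words = []
        for _ in range(self.max_len):
            # weight of each frame's fused features at this decoding moment
            h_rep = h.unsqueeze(1).expand(b, n, -1)
            scores = self.score(torch.cat([frame_feats, h_rep], dim=-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=1)                      # (batch, num_frames)
            ctx = (alpha.unsqueeze(-1) * frame_feats).sum(dim=1)  # weighted sum
            # previous hidden state + context + previous word -> next distribution
            h = self.cell(torch.cat([ctx, self.embed(word)], dim=-1), h)
            word = self.out(h).argmax(dim=-1)  # highest-probability dictionary word
            words.append(word)
            if (word == self.eos_id).all():    # decoding termination condition
                break
        return torch.stack(words, dim=1)       # word ids, output in sequence

dec = TemporalAttentionDecoder(feat_dim=512, hid_dim=256, vocab_size=1000)
caption_ids = dec(torch.randn(2, 8, 512))
print(caption_ids.shape)  # (2, <=20)
```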
7. The method according to any one of claims 1 to 3, wherein the description generation model is obtained through end-to-end training.
8. A method for training a description generation model for multimedia content, the method comprising:
obtaining a training sample, the training sample comprising sample multimedia content and a sample description corresponding to the sample multimedia content;
invoking a description generation model to perform multi-modal feature extraction on the sample multimedia content to obtain frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two multimedia frames;
invoking the description generation model to fuse the modal features that belong to the same frame in the frame feature sequences of the at least two modal features, to obtain an advanced frame feature sequence, the advanced frame feature sequence comprising the fused modal features corresponding to the at least two multimedia frames;
invoking the description generation model to decode the advanced frame feature sequence to obtain a natural language description of the sample multimedia content;
calculating an error loss according to the natural language description and the sample description; and
performing end-to-end training on the description generation model according to the error loss by using a back-propagation algorithm.
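Claim 8 reduces to an ordinary supervised loop: decode the sample content, compare the predicted word distributions against the sample description, and back-propagate the error through encoder, fusion, and decoder at once. The sketch below reuses the TinyDescriptionModel from the sketch after claim 1; the Adam optimizer, the cross-entropy loss, and the padding id are illustrative choices, as the claim only requires an error loss and a back-propagation algorithm.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, feats_a, feats_b, caption_ids, pad_id=0):
    """One end-to-end step: forward, error loss, back-propagate, update."""
    logits = model(feats_a, feats_b)  # (batch, steps, vocab) word logits
    steps = min(logits.shape[1], caption_ids.shape[1])
    loss = nn.functional.cross_entropy(          # error between the generated
        logits[:, :steps].reshape(-1, logits.shape[-1]),  # description and the
        caption_ids[:, :steps].reshape(-1),               # sample description
        ignore_index=pad_id,
    )
    optimizer.zero_grad()
    loss.backward()    # gradients flow through all three stages at once
    optimizer.step()
    return loss.item()

model = TinyDescriptionModel()  # defined in the sketch after claim 1
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = train_step(model, opt,
                  torch.randn(2, 8, 2048), torch.randn(2, 8, 2048),
                  torch.randint(1, 1000, (2, 20)))
print(loss)
```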
9. An apparatus for generating a description of multimedia content, the apparatus comprising:
an encoding module, configured to perform multi-modal feature extraction on multimedia content to obtain frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two multimedia frames;
a fusion module, configured to fuse the modal features that belong to the same frame in the frame feature sequences of the at least two modal features, to obtain an advanced frame feature sequence, the advanced frame feature sequence comprising the fused modal features corresponding to the at least two multimedia frames; and
a decoding module, configured to decode the advanced frame feature sequence to obtain a natural language description of the multimedia content.
10. The apparatus according to claim 9, wherein
the fusion module is configured to: for an i-th multimedia frame in the multimedia content, invoke a cross-gate processing part to calculate an influence result of a first modal feature on a second modal feature in the i-th multimedia frame, and calculate an influence result of the second modal feature on the first modal feature in the i-th multimedia frame.
11. The apparatus according to claim 10, wherein
the encoding module is further configured to: invoke n convolutional neural networks to separately perform feature extraction on the multimedia frames in the multimedia content, to obtain the modal features of the multimedia frames; and combine the modal features that belong to the same type according to the chronological order of the multimedia frames in the multimedia content, to obtain the frame feature sequence of each type of modal feature.
12. The apparatus according to any one of claims 9 to 11, wherein
the decoding module is further configured to: invoke a temporal attention module to perform attention calculation on the advanced frame feature sequence, to obtain a weight of the fused modal features corresponding to each multimedia frame at a t-th decoding moment; invoke the temporal attention module to integrate the weights of the fused modal features corresponding to the multimedia frames, to obtain a weighted sum at the t-th decoding moment; invoke a second recurrent neural network to decode the weighted sum, the fused modal features corresponding to each multimedia frame, and a hidden state at a (t-1)-th decoding moment, to obtain a probability distribution over the words in a dictionary, and output the word with the highest probability as the output of this decoding step; and when a decoding termination condition is met, output the words produced at the decoding moments in sequence, to obtain the natural language description of the multimedia content.
13. An apparatus for training a description generation model for multimedia content, the apparatus comprising:
an obtaining module, configured to obtain a training sample, the training sample comprising sample multimedia content and a sample description corresponding to the sample multimedia content;
an encoding module, configured to perform multi-modal feature extraction on the sample multimedia content to obtain frame feature sequences of at least two modal features, each frame feature sequence comprising the modal features corresponding to at least two multimedia frames;
a fusion module, configured to fuse the modal features that belong to the same frame in the frame feature sequences of the at least two modal features, to obtain an advanced frame feature sequence, the advanced frame feature sequence comprising the fused modal features corresponding to the at least two multimedia frames;
a decoding module, configured to decode the advanced frame feature sequence to obtain a natural language description of the sample multimedia content;
a calculation module, configured to calculate an error loss according to the natural language description and the sample description; and
a training module, configured to perform end-to-end training on the encoding module, the fusion module, and the decoding module according to the error loss by using a back-propagation algorithm.
14. A computer device, comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, the instruction, the program, the code set, or the instruction set being loaded and executed by the processor to implement the method for generating a description of multimedia content according to any one of claims 1 to 7, or the method for training a description generation model for multimedia content according to claim 8.
15. A computer-readable storage medium, storing at least one instruction, at least one program, a code set, or an instruction set, the instruction, the program, the code set, or the instruction set being loaded and executed by a processor to implement the method for generating a description of multimedia content according to any one of claims 1 to 7, or the method for training a description generation model for multimedia content according to claim 8.
CN201910616904.XA 2019-07-09 2019-07-09 Multimedia content description generation method, training method, device, equipment and medium Active CN110234018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910616904.XA CN110234018B (en) 2019-07-09 2019-07-09 Multimedia content description generation method, training method, device, equipment and medium


Publications (2)

Publication Number Publication Date
CN110234018A true CN110234018A (en) 2019-09-13
CN110234018B CN110234018B (en) 2022-05-31

Family

ID=67856673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910616904.XA Active CN110234018B (en) 2019-07-09 2019-07-09 Multimedia content description generation method, training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110234018B (en)



Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060120609A1 (en) * 2004-12-06 2006-06-08 Yuri Ivanov Confidence weighted classifier combination for multi-modal identification
CN101626512A (en) * 2009-08-11 2010-01-13 北京交通大学 Method and device of multiple description video coding based on relevance optimization rule
CN102419816A (en) * 2011-11-18 2012-04-18 山东大学 Video fingerprint method for same content video retrieval
CN103294696A (en) * 2012-02-27 2013-09-11 盛乐信息技术(上海)有限公司 Audio and video content retrieval method and system
US20140133310A1 (en) * 2012-11-14 2014-05-15 Stmicroelectronics S.R.L. Method for extracting features from a flow of digital video frames, and corresponding system and computer program product
CN102968643A (en) * 2012-11-16 2013-03-13 华中科技大学 Multi-mode emotion recognition method based on Lie group theory
CN103123619A (en) * 2012-12-04 2013-05-29 江苏大学 Visual speech multi-mode collaborative analysis method based on emotion context and system
CN103914702A (en) * 2013-01-02 2014-07-09 国际商业机器公司 System and method for boosting object detection performance in videos
CN103473530A (en) * 2013-08-30 2013-12-25 天津理工大学 Adaptive action recognition method based on multi-view and multi-mode characteristics
CN104767997A (en) * 2015-03-25 2015-07-08 北京大学 Video-oriented visual feature encoding method and device
CN105427333A (en) * 2015-12-22 2016-03-23 厦门美图之家科技有限公司 Real-time registration method of video sequence image, system and shooting terminal
US20170220854A1 (en) * 2016-01-29 2017-08-03 Conduent Business Services, Llc Temporal fusion of multimodal data from multiple data acquisition systems to automatically recognize and classify an action
US20170293837A1 (en) * 2016-04-06 2017-10-12 Nec Laboratories America, Inc. Multi-Modal Driving Danger Prediction System for Automobiles
US20180089513A1 (en) * 2016-09-27 2018-03-29 Politecnico Di Milano Enhanced content-based multimedia recommendation method
CN107247942A (en) * 2017-06-23 2017-10-13 华中科技大学 A kind of tennis Video Events detection method for merging multi-modal feature
US20190082236A1 (en) * 2017-09-11 2019-03-14 The Provost, Fellows, Foundation Scholars, and the other Members of Board, of the College of the Determining Representative Content to be Used in Representing a Video
CN107808146A (en) * 2017-11-17 2018-03-16 北京师范大学 A kind of multi-modal emotion recognition sorting technique
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features
CN109472232A (en) * 2018-10-31 2019-03-15 山东师范大学 Video semanteme characterizing method, system and medium based on multi-modal fusion mechanism
CN109712108A (en) * 2018-11-05 2019-05-03 杭州电子科技大学 It is a kind of that vision positioning method is directed to based on various distinctive candidate frame generation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
L.J. Lee, "A multimodal variational approach to learning and inference in switching state space models [speech processing application]", 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. *
Qu Wen, "Research on multi-view video recommendation technology based on multimodal content analysis", China Doctoral Dissertations Full-text Database. *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110719436A (en) * 2019-10-17 2020-01-21 浙江同花顺智能科技有限公司 Conference document information acquisition method and device and related equipment
CN111709366A (en) * 2020-06-17 2020-09-25 北京字节跳动网络技术有限公司 Method, apparatus, electronic device, and medium for generating classification information
CN114065014A (en) * 2020-07-31 2022-02-18 北京达佳互联信息技术有限公司 Information matching method, device, equipment and storage medium
CN112069361A (en) * 2020-08-27 2020-12-11 新华智云科技有限公司 Video description text generation method based on multi-mode fusion
WO2022156468A1 (en) * 2021-01-21 2022-07-28 北京沃东天骏信息技术有限公司 Method and apparatus for processing model data, electronic device, and computer-readable medium
CN112749300A (en) * 2021-01-22 2021-05-04 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for video classification
CN112749300B (en) * 2021-01-22 2024-03-01 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for video classification
CN113490057A (en) * 2021-06-30 2021-10-08 海信电子科技(武汉)有限公司 Display device and media asset recommendation method
CN113490057B (en) * 2021-06-30 2023-03-24 海信电子科技(武汉)有限公司 Display device and media asset recommendation method
CN113239184A (en) * 2021-07-09 2021-08-10 腾讯科技(深圳)有限公司 Knowledge base acquisition method and device, computer equipment and storage medium
CN113239184B (en) * 2021-07-09 2021-11-02 腾讯科技(深圳)有限公司 Knowledge base acquisition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110234018B (en) 2022-05-31

Similar Documents

Publication Publication Date Title
CN110234018A (en) Multimedia content description generation method, training method, device, equipment and medium
CN111209440B (en) Video playing method, device and storage medium
CN108920622A (en) A kind of training method of intention assessment, training device and identification device
CN110532912B (en) Sign language translation implementation method and device
CN111104512B (en) Game comment processing method and related equipment
CN106844442A (en) Multi-modal Recognition with Recurrent Neural Network Image Description Methods based on FCN feature extractions
CN110990631A (en) Video screening method and device, electronic equipment and storage medium
CN110446063A (en) Generation method, device and the electronic equipment of video cover
CN110166802A (en) Barrage processing method, device and storage medium
Zhou et al. ICRC-HIT: A deep learning based comment sequence labeling system for answer selection challenge
CN107274903A (en) Text handling method and device, the device for text-processing
CN113870395A (en) Animation video generation method, device, equipment and storage medium
Wu et al. Hierarchical memory decoder for visual narrating
CN113657272B (en) Micro video classification method and system based on missing data completion
CN111062019A (en) User attack detection method and device and electronic equipment
Song et al. Hierarchical LSTMs with adaptive attention for visual captioning
CN114385817A (en) Entity relationship identification method and device and readable storage medium
CN116861258B (en) Model processing method, device, equipment and storage medium
Rastgoo et al. A survey on recent advances in Sign Language Production
Rastgoo et al. All You Need In Sign Language Production
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN117271745A (en) Information processing method and device, computing equipment and storage medium
CN117349402A (en) Emotion cause pair identification method and system based on machine reading understanding
CN116740237A (en) Bad behavior filtering method, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant