CN110225368B - Video positioning method and device and electronic equipment - Google Patents

Video positioning method and device and electronic equipment

Info

Publication number
CN110225368B
CN110225368B (application CN201910570609.5A)
Authority
CN
China
Prior art keywords
feature
video
sentence
time sequence
features
Prior art date
Legal status
Active
Application number
CN201910570609.5A
Other languages
Chinese (zh)
Other versions
CN110225368A (en)
Inventor
袁艺天
马林
王景文
刘威
朱文武
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN201910570609.5A
Publication of CN110225368A
Application granted
Publication of CN110225368B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Television Signal Processing For Recording (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video positioning method, a video positioning device and an electronic device. The method comprises the following steps: acquiring a video and a sentence; respectively extracting features from the video and the sentence to obtain corresponding video segment features and word features; fusing the video segment features and the word features to obtain fusion features; aggregating and associating the fusion features layer by layer based on time sequence through a time sequence convolutional neural network to obtain a multilayer time sequence convolution feature map; performing semantic modulation on each layer of feature map in the multilayer time sequence convolution feature map to obtain modulated feature maps; and performing a time sequence convolution operation on the modulated feature maps to obtain a target video segment related to the semantics of the sentence. With the method and device, the target video segment related to the semantics of the input sentence can be positioned quickly and accurately.

Description

Video positioning method and device and electronic equipment
Technical Field
The present invention relates to computer vision technology in the field of Artificial Intelligence (AI), and in particular, to a video positioning method, apparatus and electronic device.
Background
Computer vision technology is a branch of the AI field. Its purpose is to have a machine learn from prior knowledge, so that it acquires the logical ability to classify and judge media information such as video and images.
Video positioning technology is a typical application of computer vision technology: for a given sentence, it is able to find, from a video, a video segment that is semantically related to the sentence. For example, for the sentence "I want to go swimming", the video positioning technology can find a video clip that includes a swimming scene from the video. As another example, the video positioning technology may be applied to application scenarios such as viewing and editing of online videos, helping a user quickly locate a video segment of interest in a video for viewing or for corresponding editing operations.
With the rapid increase in the number of videos, video positioning technology can improve the efficiency of video processing in various application scenarios and therefore has increasingly important application value; however, the related art lacks an effective video positioning solution.
Disclosure of Invention
The embodiment of the invention provides a video positioning method, a video positioning device and electronic equipment, which can quickly and accurately position a target video segment related to the semantics of an input sentence.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a video positioning method, which comprises the following steps:
acquiring a video and a sentence;
respectively extracting the features of the video and the sentence to obtain corresponding video segment features and word features;
fusing the video segment characteristics and the word characteristics to obtain fused characteristics;
aggregating and associating the fusion features layer by layer based on time sequence through a time sequence convolution neural network to obtain a multilayer time sequence convolution feature graph;
semantically modulating each layer of feature map in the multilayer time sequence convolution feature map to obtain a modulated feature map, and
and performing time sequence convolution operation on the modulated characteristic diagram to obtain a target video segment related to the semantics of the statement.
An embodiment of the present invention further provides a video positioning apparatus, including:
an acquisition unit for acquiring a video and a sentence;
the feature extraction unit is used for respectively extracting features of the video and the sentences to obtain corresponding video segment features and word features;
the feature fusion unit is used for fusing the video segment features and the word features to obtain fusion features;
the aggregation association unit is used for aggregating and associating the fusion features layer by layer based on time sequence through a time sequence convolution neural network to obtain a multilayer time sequence convolution feature graph;
the semantic modulation unit is used for carrying out semantic modulation on each layer of feature map in the multilayer time sequence convolution feature map to obtain a modulated feature map;
and the time sequence convolution unit is used for carrying out time sequence convolution operation on the modulated characteristic diagram to obtain a target video segment related to the semantics of the statement.
In the foregoing solution, the feature fusion unit is specifically configured to:
performing feature integration on the word features to obtain sentence features, wherein the sentence features comprise context information of the sentences;
averaging the word features corresponding to each word in the sentence features to obtain the average features of each word in the sentence;
and respectively fusing the video segment characteristics with the average characteristics of the words in the sentences to obtain fused characteristics.
In the foregoing solution, the feature fusion unit is specifically configured to:
respectively splicing the average characteristics of all words in the sentence with the characteristics of all the video segments through an activation function to obtain corresponding sub-characteristics;
and fusing all the obtained sub-features to form fused features corresponding to the videos and the sentences.
In the foregoing solution, the semantic modulation unit includes:
a generating unit, configured to generate a modulation parameter based on the feature unit included in each layer of feature map in the multilayer time series convolution feature map and the statement;
the normalization modulation unit is used for carrying out normalization modulation on the feature units contained in each layer of feature diagram in the multilayer time sequence convolution feature diagram based on the modulation parameters to obtain updated feature units;
the generating unit is further configured to form a modulated feature map based on the updated feature unit.
In the foregoing solution, the generating unit is specifically configured to:
distributing corresponding attention weight to each word feature based on the feature unit contained in each layer of feature map in the multilayer time sequence convolution feature map;
based on the attention weight, carrying out weighted summation processing on each word feature to obtain a corresponding attention weighted sentence feature;
and inputting the attention weighted sentence characteristics into two fully-connected networks in the time sequence convolutional neural network, and respectively connecting the attention weighted sentence characteristics through the two fully-connected networks to obtain modulation parameters output by the two fully-connected networks.
In the foregoing solution, the time-series convolution unit is specifically configured to:
performing time sequence convolution operation on each layer of feature map in the modulated feature map to obtain candidate video clips and time overlapping scores corresponding to the candidate video clips and the target video clips;
determining a set number of top ranked candidate video segments as the target video segment based on a descending ranking of the temporal overlap scores.
In the above solution, the video positioning apparatus further includes:
the loss function constructing unit is used for constructing a time overlapping loss function and a time sequence position prediction loss function; constructing a joint loss function of the time sequence convolutional neural network based on the time overlapping loss function and the time sequence position prediction loss function;
a training unit, configured to update the time-series convolutional neural network based on the joint loss function, so that the joint loss function converges.
In the foregoing solution, the loss function constructing unit is specifically configured to:
determining the time overlapping rate of the candidate video clips and the real target video clips;
constructing the temporal overlap loss function based on the temporal overlap rate and a temporal overlap score, the temporal overlap score corresponding to the candidate video segment and the predicted target video segment;
determining the central position and the length of the real target video clip;
and constructing the time sequence position prediction loss function based on the difference of the central positions corresponding to the predicted target video segment and the real target video segment and the difference of the corresponding lengths.
An embodiment of the present invention further provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the video positioning method provided by the embodiment of the invention when executing the executable instructions stored in the memory.
The embodiment of the invention also provides a storage medium, which stores executable instructions, and the executable instructions are used for realizing the video positioning method provided by the embodiment of the invention when being executed.
The application of the embodiment of the invention has the following beneficial effects:
by applying the video positioning method provided by the embodiment of the invention, the mode of carrying out video positioning according to the whole statement can be optimized in a combined manner and executed efficiently by combining a semantic-based modulation mechanism and a layered time sequence convolution neural network; semantic modulation is carried out on each layer of feature map based on statement information and multilayer time sequence convolution feature map information, so that correlation and aggregation of target video segments related to the semantics of statements on time sequences are closer, and accuracy of time sequence position prediction of the target video segments is enhanced. Therefore, according to the embodiment of the invention, the target video segment related to the semantic meaning of the sentence can be quickly and accurately positioned according to the given sentence, so that the video watching efficiency and browsing experience of a user are improved.
Drawings
Fig. 1 is a schematic view of an alternative application scenario of a video positioning system 10 according to an embodiment of the present invention;
fig. 2A is an alternative schematic structural diagram of an electronic device according to an embodiment of the present invention;
fig. 2B is a schematic diagram of an alternative structure of the video positioning apparatus 30 according to the embodiment of the present invention;
fig. 3 is a schematic flowchart of an alternative implementation of a video positioning method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative structure of a basic time-series convolutional neural network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of video temporal position prediction according to an embodiment of the present invention;
fig. 6 is a schematic flowchart of another alternative implementation of the video positioning method according to the embodiment of the present invention;
fig. 7 is a schematic structural diagram of an alternative principle of the video positioning method according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the present invention is further described in detail with reference to the accompanying drawings, the described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and that the various solutions described in the embodiments of the present invention may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of the present invention belong. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.
1) Multimodal: the source or form of each type of information can be referred to as a modality. For example, the media of information include voice, video, text, etc., and each media form can be referred to as a modality; the combination of two or more media forms can be referred to as multimodal.
2) A Convolutional Neural Network (CNN) is a feedforward neural network that starts directly from the bottom-layer pixel features of an image and extracts features from the image layer by layer. It is the most common implementation model of an encoder and is responsible for encoding an image into vectors.
3) A time sequence convolutional neural network (Sequential Convolutional Neural Network, SCNN) is a network model, based on CNN, for classifying and predicting time sequence signals.
4) A three-dimensional convolution (C3D, Convolutional 3D) neural network (that is, a network whose convolution kernels are three-dimensional) was proposed because a two-dimensional convolutional neural network (that is, one whose convolution kernels are two-dimensional) cannot capture information along the video time sequence well; it is a network model that learns spatio-temporal features to capture feature information along the video time sequence.
5) The Glove model is an unsupervised learning algorithm for obtaining word vector representations; that is, the Glove model can be used to represent the words in a sentence as vectors, so that the vectors capture as much semantic and syntactic information as possible.
The video positioning technology has very wide application scenes, for example, the video positioning technology can be applied to service scenes of online videos to help audiences position interesting contents in the videos, can meet the personalized requirements of users on video watching, and provides powerful technical support for video positioning requirements of various video websites and applications in practical application; for another example, the method can be applied to a scene of video post-processing production, help video editors to find video segments needing to be edited, and no longer depend on a traditional manual fast-forward or backward mode, so that the efficiency of video editing is remarkably improved.
The following analyzes the scheme provided by the related art with respect to video localization.
In some schemes of the related art, a sentence positioning method based on multi-mode matching is adopted, firstly, video content is traversed for many times in a sliding window mode to obtain candidate video segments with various lengths; then, performing multi-mode fusion matching on the sentence information and each candidate video clip to obtain a matching score; and finally, determining the video segment with the highest matching score as a time sequence positioning result, namely determining the video segment with the highest matching score as a target video segment. Therefore, the above video positioning method needs to traverse the video content for multiple times in a sliding window manner to acquire the candidate video segments with different lengths, so that the time complexity of the method is high, the calculation cost is high, and the positioning efficiency is reduced; in addition, the video positioning mode is to acquire candidate video segments first and then perform multi-mode matching, so that the positioning mode cannot be optimized in a combined manner, and the positioning precision is influenced.
In other schemes of the related technology, a sentence positioning method based on end-to-end time sequence aggregation is adopted, and firstly, a video is averagely divided into a plurality of video fragment units; then, the sentence information and each video segment unit are fused to obtain fusion characteristics, the video segment units are gradually aggregated in time sequence through a long-time neural network or a long-time convolutional neural network on the basis of the fusion characteristics to obtain aggregation characteristics corresponding to different video contents, and the target video segment is predicted based on the aggregation characteristics. Therefore, although the above video positioning mode can realize joint optimization, only the matching relationship between sentences and video contents is considered, and the important guidance function of sentence information on aggregating and associating video contents is ignored, so that the accuracy of video positioning is influenced.
It can be seen that the related art lacks an effective solution for how to quickly and highly accurately locate a target video segment related to the semantics of an input sentence.
In order to at least solve the above technical problems of the related art, the video positioning method, the video positioning device and the electronic device provided by the embodiments of the present invention can quickly and highly accurately position the target video segment related to the semantic meaning of the sentence according to the given sentence, thereby improving the video watching efficiency and the browsing experience of the user.
An exemplary application of the video positioning system of the embodiment of the present invention will be described below with reference to the accompanying drawings. Fig. 1 is a schematic view of an optional application scenario of a video positioning system 10 according to an embodiment of the present invention, and referring to fig. 1, an electronic device 100 (an electronic device 100-1 and an electronic device 100-2 are exemplarily shown in fig. 1) provided in an embodiment of the present invention may be various types of mobile terminals such as a smart phone, a tablet computer, a notebook computer, a portable multimedia player, a vehicle-mounted terminal (e.g., a vehicle-mounted navigation terminal), and may also be various types of fixed terminals such as a digital television, a desktop computer (which are collectively referred to as terminal devices and have a function of playing a video), and the terminal devices may predict a target video segment related to a semantic of a sentence according to the sentence input by a user. In practical application, the terminal device can also play the predicted target video clip according to the user requirement.
Of course, fig. 1 is only an example, and the electronic device provided in the embodiment of the present invention may also be a control device that is connected to the terminal device through various wireless communication manners or a wired communication manner and controls video positioning, for example, may be the server 300. Taking a video as an online played video and an electronic device as an example of a server, after obtaining the video and a sentence (a user inputs the sentence through a terminal device and the sentence is sent to the server by the terminal device), the server 300 obtains a target video segment related to the semantic of the sentence based on analysis processing of the obtained sentence and the video, and sends the predicted target video segment to the terminal device for playing. The server and the terminal device are connected through a network 200, the network 200 may be a wide area network or a local area network, or a combination of the two, and the data transmission is realized by using a wireless link.
In some embodiments, the electronic device 100 (e.g., a server) is configured to, after obtaining a video and a sentence input by a user, perform feature extraction on the video and the sentence to obtain corresponding video segment features and word features, and then fuse the video segment features and the word features to obtain fusion features; secondly, aggregating and associating the fusion features layer by layer based on time sequence through a time sequence convolution neural network to obtain a multilayer time sequence convolution feature map, and performing semantic modulation on each layer of feature map in the multilayer time sequence convolution feature map to obtain a modulated feature map; and finally, carrying out time sequence convolution operation on the modulated characteristic diagram so as to predict and obtain a target video segment related to the semantics of the sentence. Of course, the server may also send the predicted target video segment to the terminal device, so as to display and play the predicted target video segment through the graphical interface 110 (the graphical interface 110-1 and the graphical interface 110-2 are exemplarily shown in fig. 1) in the terminal device.
An electronic device implementing the embodiment of the present invention will now be described with reference to the drawings, and fig. 2A is an alternative structural schematic diagram of the electronic device provided in the embodiment of the present invention, it can be understood that fig. 2A only shows an exemplary structure of the electronic device, and not a whole structure, and a part of the structure or a whole structure shown in fig. 2A may be implemented as required, and should not bring any limitation to the function and the use range of the embodiment of the present invention.
Referring to fig. 2A, an electronic device 20 provided in an embodiment of the present invention includes: at least one processor 201, memory 202, user interface 203, and at least one network interface 204. The various components in the electronic device 20 are coupled together by a bus system 205. It will be appreciated that the bus system 205 is used to enable communications among the components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 205 in fig. 2A.
The user interface 203 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.
It will be appreciated that the memory 202 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory.
The memory 202 in the embodiments of the present invention is used to store various types of data to support the operation of the electronic device 20. Examples of such data include any executable instructions for operating on the electronic device 20, such as computer programs; the executable instructions may include executable programs and an operating system, as well as programs implementing the video positioning method of the embodiments of the present invention.
The processor 201 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the video positioning method provided by the embodiment of the present invention may be implemented by an integrated logic circuit of hardware in the processor 201. The integrated logic circuit described above may be a general purpose processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 201 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention.
The steps of the video positioning method provided by the embodiment of the present invention may be completed by software modules, the software modules may be located in a storage medium, the storage medium is located in the memory 202, and the processor 201 executes the software modules in the memory 202, and completes the steps of the video positioning method provided by the embodiment of the present invention by combining hardware thereof.
For example, as an example of a software module, the memory 202 may include the video positioning apparatus 30 provided in the embodiment of the present invention, which includes a series of software modules, such as the obtaining unit 31, the feature extracting unit 32, the feature fusing unit 33, the aggregation associating unit 34, the semantic modulating unit 35, and the time-series convolution unit 36, and referring to an alternative structural schematic diagram of the video positioning apparatus 30 provided in the embodiment of the present invention shown in fig. 2B, functions of the above units will be described below.
So far, the structure of the electronic device provided by the embodiment of the invention and an exemplary application scenario of the video positioning system have been described according to the functions thereof. Next, an implementation of the video positioning method provided in the embodiment of the present invention is described.
Referring to fig. 3, fig. 3 is a schematic diagram of an optional implementation flow of the video positioning method provided in the embodiment of the present invention, where the video positioning method in the embodiment of the present invention may be applied to various types of terminal devices such as a smart phone, a tablet computer, a digital television, and a desktop computer, that is, the terminal device may predict a target video segment related to the semantic of a sentence by executing the video positioning method in the embodiment of the present invention; the video positioning method in the embodiment of the invention can also be applied to a server, the server performs video positioning to obtain the target video clip related to the semantic meaning of the sentence, and at the moment, the terminal equipment is in a controlled mode, namely, the terminal equipment receives and plays the target video clip sent by the server. The following describes the steps shown in fig. 3 by taking the video positioning method of the embodiment of the present invention executed by the server as an example.
Step 301: and acquiring a video and a sentence.
Here, the video acquired by the server may be a video in the cloud database, or a video from the terminal device, for example, a video played online in the terminal device, that is, the terminal device uploads the video played online to the server. The video acquired by the server may be generally composed of one or more video clips, and when the acquired video is composed of a plurality of video clips, the lengths of the plurality of video clips may be the same or different.
For the statements obtained by the server, the statements may be input by the user through the client of the terminal device, and then the input statements may be uploaded to the server by the terminal device. The sentence obtained by the server may be generally composed of one word or a group of words which are syntactically related.
Step 302: and respectively extracting the features of the video and the sentence to obtain corresponding video segment features and word features.
In the embodiment of the invention, a method based on a deep neural network can be adopted to extract features of videos and sentences, specifically, a C3D neural network can be used as a video encoder to extract features of the videos so as to obtain corresponding video segment features in the videos, and the C3D neural network can learn the spatio-temporal features, so that the video segment features obtained by extracting the features by using the C3D neural network are video segment features in a video time sequence; the Glove model can be used as a sentence encoder to perform feature extraction on the sentences acquired by the server so as to obtain corresponding word features in the sentences.
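As an illustration of this two-branch feature extraction, the following PyTorch-style sketch encodes the video clips with a pretrained C3D network and the words with a GloVe embedding table. The function name, tensor shapes and clip size are assumptions made for illustration and are not taken from the embodiment.

```python
import torch

def extract_features(c3d_encoder, glove_embedding, video_clips, word_ids):
    """Hypothetical encoders: c3d_encoder maps each clip of 16 frames to one
    feature vector; glove_embedding is a GloVe word-embedding lookup table."""
    with torch.no_grad():
        # Video branch: one feature vector per video segment -> (T, d_v).
        V = c3d_encoder(video_clips)      # video_clips: (T, 3, 16, 112, 112)
        # Sentence branch: one feature vector per word -> (N, d_w).
        W = glove_embedding(word_ids)     # word_ids: (N,) integer indices
    return V, W
```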
It should be noted that, in practical applications, the features of each video segment extracted by the server and the features of each word in the sentence can be generally expressed in the form of a feature vector.
Step 303: and fusing the video segment characteristics and the word characteristics to obtain fused characteristics.
In some embodiments, the server may fuse the video segment features and the word features to obtain fused features by: performing feature integration on the word features to obtain sentence features, wherein the sentence features comprise context information of a sentence; averaging the word characteristics of each word in the corresponding sentence in the sentence characteristics to obtain the average characteristics of each word in the sentence; and respectively fusing the characteristics of each video segment with the average characteristics of each word in the sentence to obtain fused characteristics.
In some embodiments, for the case that the server fuses each video segment feature with the average feature of each word in the sentence to obtain the fused feature, the following method may be adopted: respectively splicing the average characteristics of all words in the sentence with the characteristics of all video segments through an activation function to obtain corresponding sub-characteristics; and then, fusing all the obtained sub-features to form fused features corresponding to the videos and the sentences.
Specifically, in the case of obtaining a sentence characteristic by integrating the word characteristic, context information of the sentence is actually integrated into the word characteristic to obtain the sentence characteristic including the word characteristic and the context information of the sentence. Illustratively, feature integration can be performed on the extracted word features by using a Bi-directional Gated Recurrent neural network (Bi-GRU) to obtain sentence features including context information of the sentence. The server can input the characteristics of each video segment and the average characteristics of each word in the sentence into a full-connection network, and connect (i.e. splice) the characteristics of each video segment with the average characteristics of each word in the sentence by using full-connection operation, so as to obtain the multi-modal fusion characteristics corresponding to the videos and the sentences.
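As a minimal sketch of this integration step (module and dimension names are illustrative assumptions, not taken from the embodiment), a Bi-GRU over the word feature sequence W can produce the context-aware sentence feature sequence S:

```python
import torch.nn as nn

class SentenceFeatureIntegration(nn.Module):
    """Bi-GRU that integrates sentence context into the word feature sequence W."""
    def __init__(self, d_w, d_s):
        super().__init__()
        # The two directions of size d_s // 2 concatenate into a d_s-dimensional feature.
        self.bigru = nn.GRU(d_w, d_s // 2, bidirectional=True, batch_first=True)

    def forward(self, W):            # W: (1, N, d_w) word feature sequence
        S, _ = self.bigru(W)         # S: (1, N, d_s) sentence feature sequence
        return S
```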
The above feature fusion process is described below by taking an example in which the acquired video includes T video segments and the acquired sentence includes N words.
It is assumed that, after feature extraction is performed on the acquired video and sentence, the obtained video segment features and word features are all represented in the form of feature vectors, that is, a video segment feature sequence and a word feature sequence are obtained. For example, the video segment feature sequence obtained through the feature extraction operation is V = [v_1, …, v_T] and the word feature sequence is W = [w_1, …, w_N]. The context information of the sentence is then integrated into the word feature sequence W to obtain sentence features including the context information of the sentence, that is, the sentence feature sequence S = [s_1, …, s_N]. The word features s_1, …, s_N of the words of the sentence in the sentence feature sequence S are then averaged to obtain the average feature s̄ of the words in the sentence.
Illustratively, the average feature of the words in the sentence may be calculated by the following formula:
s̄ = (1/N) · Σ_{n=1}^{N} s_n
After the average feature s̄ of the words in the sentence is obtained, s̄ and each video segment feature v_t in the video segment feature sequence V are input into a fully connected network, and s̄ and v_t are spliced by the fully connected operation to obtain the corresponding sub-feature f_t; finally, all the obtained sub-features are fused to form the multi-modal fusion feature F corresponding to the video and the sentence.
Illustratively, the sub-feature f_t obtained by splicing s̄ and v_t can be calculated based on the following formula (1):
f_t = ReLU(W_f · [s̄ ‖ v_t] + b_f)    (1)
where f_t represents any sub-feature in the multi-modal fusion feature; ReLU represents the non-linear activation function (Rectified Linear Unit), which may also be referred to as the linear rectification function; W_f and b_f respectively represent parameters learned by the fully connected network during training; and ‖ means that s̄ and v_t are concatenated. The meaning of the parameters not detailed in formula (1) can be understood with reference to the above.
In the embodiment of the present invention, all the sub-features f_1, …, f_T are fused to form the multi-modal fusion feature F:
F = [f_1, …, f_T]
where f_1 represents the sub-feature obtained by splicing s̄ and v_1, and f_T represents the sub-feature obtained by splicing s̄ and v_T.
Step 304: and aggregating and associating the fusion features layer by layer based on the time sequence through a time sequence convolution neural network to obtain a multilayer time sequence convolution feature graph.
Step 305: and performing semantic modulation on each layer of feature map in the multilayer time sequence convolution feature map to obtain a modulated feature map.
The time sequence convolution neural network in the embodiment of the invention has a multilayer structure, and the obtained multi-modal fusion features are input into the multilayer time sequence convolution neural network, so that the multi-modal fusion features are aggregated and associated layer by layer based on time sequence to obtain multilayer time sequence convolution feature graphs corresponding to video segments with different scales in a video.
In the embodiment of the present invention, combining step 304 and step 305, a semantically modulated time sequence convolutional neural network can be obtained by coupling a semantic-based modulation mechanism with a basic hierarchical time sequence convolutional neural network. The basic hierarchical time sequence convolutional neural network is explained below.
Referring to fig. 4, fig. 4 is an optional structural schematic diagram of the basic time sequence convolutional neural network provided in the embodiment of the present invention. By way of example, the basic time sequence convolutional neural network in the embodiment of the present invention may include an input layer, a plurality of intermediate layers (also referred to as convolution layers; only 3 convolution layers, convolution layer 1 to convolution layer 3, are illustrated here), and an output layer. The input layer is configured to receive the obtained multi-modal fusion feature F = [f_1, …, f_T], and the convolution layers are mainly used to perform convolution operations on F. The basic time sequence convolution operation is defined as Conv(θ_k, θ_s, d_h), where θ_k is the convolution kernel size, θ_s is the stride, and d_h is the number of filters, that is, the dimension of the hidden layer; meanwhile, in the embodiment of the present invention, ReLU is uniformly used as the activation function of the time sequence convolution. The multi-modal fusion feature undergoes one basic convolution operation in each convolution layer; each basic convolution operation halves the time dimension of its input, while the receptive field of the output feature map is doubled relative to its input. In this way, by stacking multiple layers of basic time sequence convolution operations, a multilayer time sequence convolution feature map can be obtained and output through the output layer. The feature units contained in each layer of the multilayer time sequence convolution feature map respectively correspond to video segments of different scales in the video.
Here, for simplicity of subsequent expression, the k-th layer feature map in the multilayer time sequence convolution feature map is defined as A^k = {a_i^k}, i = 1, …, T_k, where T_k = T_{k-1}/2 represents the time dimension of the k-th layer feature map, and a_i^k represents the ith feature unit in the k-th layer feature map. It should be noted that, because the multilayer time sequence convolution feature map covers video segments at different positions and of different scales in the video, the embodiment of the present invention can meet the requirement of diversified positions and lengths of the video segments described by sentences.
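A minimal PyTorch-style sketch of such a stack of basic time sequence convolutions is shown below; the kernel size of 3, stride of 2 and number of layers are assumptions chosen so that each layer halves the time dimension (T_k = T_{k-1}/2), as described above.

```python
import torch
import torch.nn as nn

class TemporalConvStack(nn.Module):
    """Stacked basic convolutions Conv(theta_k=3, theta_s=2, d_h) with ReLU activations."""
    def __init__(self, d_f, d_h, num_layers=3):
        super().__init__()
        dims = [d_f] + [d_h] * num_layers
        self.layers = nn.ModuleList([
            nn.Conv1d(dims[k], dims[k + 1], kernel_size=3, stride=2, padding=1)
            for k in range(num_layers)
        ])

    def forward(self, F):                    # F: (batch, d_f, T) multi-modal fusion feature
        feature_maps = []
        x = F
        for conv in self.layers:
            x = torch.relu(conv(x))          # (batch, d_h, T_k) with T_k = T_{k-1} // 2
            feature_maps.append(x)
        return feature_maps                  # one feature map A^k per layer
```

Each entry of the returned list plays the role of one layer feature map A^k, whose T_k feature units correspond to video segments of one particular scale.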
In order to enable target video segments related to semantics of statements to be more closely combined in a time sequence convolution process, the embodiment of the invention provides a semantic-based modulation mechanism, namely, in the embodiment of the invention, a server performs semantic modulation on feature units contained in each layer of feature map in a multilayer time sequence convolution feature map through the semantic-based modulation mechanism to obtain a modulated feature map. The following is a detailed description of the process of obtaining a modulated feature map by a semantic-based modulator.
In some embodiments, the server may perform semantic modulation on each layer of feature map in the multilayer time-series convolution feature map by the following method to obtain a modulated feature map: generating modulation parameters based on feature units and statements contained in each layer of feature map in the multilayer time sequence convolution feature map; based on modulation parameters, carrying out normalized modulation on feature units contained in each layer of feature diagram in the multilayer time sequence convolution feature diagram to obtain updated feature units; and forming a modulated characteristic diagram based on the updated characteristic units.
In some embodiments, for generating the modulation parameter based on the feature units and statements included in each layer of the multi-layer time-series convolution feature map, the following method may be specifically adopted: distributing corresponding attention weight for each word feature based on feature units contained in each layer of feature map in the multilayer time sequence convolution feature map; based on the attention weight, carrying out weighted summation processing on each word feature to obtain a corresponding attention weighted sentence feature; and inputting the attention weighted sentence characteristics into two fully-connected networks in the time sequence convolution neural network, and respectively connecting the attention weighted sentence characteristics through the two fully-connected networks to obtain modulation parameters output by the two fully-connected networks.
It should be noted that, for inputting the feature of the attention weighted sentence into two fully connected networks, the feature may be input at the same time, or the two inputs may have a certain time difference, that is, the two inputs are not input at the same time.
The following describes the process of generating the modulation parameters. Assume that the word feature sequence obtained by feature extraction of the acquired sentence is W = [w_1, …, w_N], and that the context information of the sentence is then integrated into the word feature sequence W to obtain the sentence feature sequence S = [s_1, …, s_N]. The explanation below takes the sentence feature sequence S = [s_1, …, s_N] and a multilayer time sequence convolution feature map A = {a_i} obtained by the time sequence convolution operation as an example (for convenience of description, the layer index k of the multilayer time sequence convolution feature map is omitted here).
Specifically, for each feature unit a_i in the multilayer time sequence convolution feature map A = {a_i} given above, the attention weight assigned to each word feature in the sentence feature sequence S can be calculated by the following formula (2):
α_n^i = softmax(w^T · tanh(W_s · s_n + W_a · a_i + b))    (2)
where α_n^i represents the attention weight assigned to the nth word feature; s_n represents the nth word feature in the sentence feature sequence S; w, W_s, W_a and b are parameters learned by the time sequence convolutional neural network during training; tanh is the hyperbolic tangent activation function and softmax is the normalized exponential function. The meaning of the parameters not detailed in formula (2) can be understood with reference to the above.
After the attention weights assigned to the respective word features in the sentence feature sequence S are obtained, the corresponding attention-weighted sentence feature can be calculated by the following formula (3):
c_i = Σ_{n=1}^{N} α_n^i · s_n    (3)
where c_i represents the attention-weighted sentence feature. The meaning of the parameters not detailed in formula (3) can be understood with reference to the above.
Then, the obtained attention-weighted sentence feature c_i is input into the two fully connected networks in the time sequence convolutional neural network, and the modulation parameters respectively output by the two fully connected networks are calculated by the following formulas (4) and (5):
γ_i = tanh(W_γ · c_i + b_γ)    (4)
β_i = tanh(W_β · c_i + b_β)    (5)
where γ_i represents the modulation parameter output by one of the fully connected networks; β_i represents the modulation parameter output by the other fully connected network; and W_γ, b_γ, W_β and b_β are parameters learned by the corresponding fully connected networks during training. The meaning of the parameters not detailed in formulas (4) and (5) can be understood with reference to the above.
In the embodiment of the present invention, after the modulation parameters γ_i and β_i are obtained, the feature units contained in each layer of feature map in the multilayer time sequence convolution feature map, for example the feature unit a_i in a layer feature map, are normalized and modulated based on the modulation parameters γ_i and β_i, that is, a_i is updated to a new feature unit ã_i, and the modulated feature map can be formed based on the updated feature units ã_i.
Illustratively, the updated feature unit ã_i can be calculated by the following formula (6):
ã_i = γ_i · (a_i − μ(A)) / σ(A) + β_i    (6)
where μ(A) represents the mean of the feature units contained in the layer feature map, and σ(A) represents the standard deviation of the feature units contained in the layer feature map. The meaning of the parameters not detailed in formula (6) can be understood with reference to the above.
It should be noted that, in the process of performing normalized modulation on the feature units included in each layer of feature map in the multilayer time series convolution feature map based on the generated modulation parameters, each feature unit can be scaled and moved in the feature space under the guidance of the statement semantic information, so that the relevance between the video segment corresponding to the feature unit and the semantic of the statement is stronger.
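The modulation mechanism of formulas (2) to (6) can be sketched as a single PyTorch-style module as follows. The parameter shapes, the tanh activations in the two fully connected networks, and the way the layer statistics μ(A) and σ(A) are computed follow the reconstruction above and are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SemanticModulation(nn.Module):
    """Word attention (2)-(3), modulation parameters (4)-(5), normalized modulation (6)."""
    def __init__(self, d_s, d_a):
        super().__init__()
        self.W_s = nn.Linear(d_s, d_a, bias=False)
        self.W_a = nn.Linear(d_a, d_a)                 # its bias plays the role of b
        self.w = nn.Linear(d_a, 1, bias=False)
        self.fc_gamma = nn.Linear(d_s, d_a)            # fully connected network for gamma
        self.fc_beta = nn.Linear(d_s, d_a)             # fully connected network for beta

    def forward(self, A, S):
        # A: (T_k, d_a) feature units of one layer, S: (N, d_s) sentence feature sequence
        scores = self.w(torch.tanh(self.W_s(S).unsqueeze(0)
                                   + self.W_a(A).unsqueeze(1)))   # (T_k, N, 1)
        alpha = torch.softmax(scores.squeeze(-1), dim=1)          # formula (2)
        c = alpha @ S                                             # formula (3): (T_k, d_s)
        gamma = torch.tanh(self.fc_gamma(c))                      # formula (4)
        beta = torch.tanh(self.fc_beta(c))                        # formula (5)
        a_norm = (A - A.mean()) / (A.std() + 1e-6)                # mu(A), sigma(A) over the layer
        return gamma * a_norm + beta                              # formula (6): modulated feature map
```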
Step 306: and performing time sequence convolution operation on the modulated characteristic diagram to obtain a target video segment related to the semantics of the statement.
In some embodiments, the server may perform a time-series convolution operation on the modulated feature map to obtain a target video segment related to the semantics of the sentence by: performing time sequence convolution operation on each layer of feature map in the modulated feature map to obtain candidate video clips and time overlapping scores of the corresponding candidate video clips and the target video clip; and determining the video clips of the set number of the candidate sorted at the top as the target video clips based on the descending sorting of the time overlapping scores.
In the embodiment of the present invention, the candidate video segments may be video segments of different scales. For the feature units contained in each layer of feature map in the modulated feature map, corresponding time scaling ratios r ∈ R = {0.25, 0.5, 0.75, 1.0} are respectively configured, and based on the time scaling ratios, a time sequence convolution operation is performed on each layer of the modulated feature map to obtain the candidate video segments.
Referring to fig. 5, fig. 5 is a schematic diagram of video time sequence position prediction provided by an embodiment of the present invention. For a modulated feature map of time dimension T_k, assuming that the length of the entire acquired video is regarded as 1, the length of the video segment corresponding to each feature unit in the modulated feature map is 1/T_k; based on the configured time scaling ratio r, the video segment of length 1/T_k is scaled to obtain a video segment of length r/T_k. Specifically, the ith feature unit in the modulated feature map of time dimension T_k corresponds to |R| video segments of different lengths, and the centers of these video segments are all located at (i + 0.5)/T_k. The entire modulated feature map therefore contains T_k × |R| video segments with different lengths and different positions, all of which are candidate video segments for time sequence position prediction; collecting them over all K layers (where K represents the number of layers of the time sequence convolutional neural network) gives the complete set of candidate video segments.
Here, each candidate video segment corresponds to a set of prediction vectors p = (p_over, Δc, Δw), where p_over represents the temporal overlap score of the corresponding candidate video segment and the target video segment (the larger p_over is, the closer the candidate video segment is to the target video segment); Δc represents the center offset of the candidate video segment from the target video segment; and Δw represents the length offset of the candidate video segment from the target video segment. In practical application, the temporal overlap scores p_over of the candidate video segments and the target video segment are sorted in descending order to obtain a sorting result, and all the predicted candidate video segments are screened based on the sorting result, that is, the top set number of candidate video segments are determined as the target video segments.
The following continues with the description of obtaining the predicted target video segment. Suppose a candidate video segment has center position μ_c and length μ_w, and the prediction vector corresponding to the candidate video segment is p = (p_over, Δc, Δw). The center position φ_c of the target video segment predicted from the candidate video segment can be calculated by the following formula (7):
φ_c = μ_c + α_c · Δc · μ_w    (7)
and the length φ_w of the target video segment predicted from the candidate video segment can be calculated by the following formula (8):
φ_w = μ_w · exp(α_w · Δw)    (8)
where α_c in formula (7) and α_w in formula (8) are coefficients for making the position prediction more stable and usually take the value 0.1. The meaning of the parameters not detailed in formulas (7) and (8) can be understood with reference to the above.
Thus, each feature map of time dimension T_k corresponds to a set of predicted candidate video segments; by collecting the prediction results of all feature maps in the time sequence convolutional neural network, all candidate video segments for time sequence position prediction can be obtained.
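A minimal sketch of decoding the candidates of one modulated feature map with formulas (7) and (8) is given below (PyTorch-style; the tensor layout and function name are illustrative assumptions):

```python
import torch

SCALE_RATIOS = (0.25, 0.5, 0.75, 1.0)     # the set R of time scaling ratios
ALPHA_C, ALPHA_W = 0.1, 0.1               # stabilising coefficients alpha_c, alpha_w

def decode_candidates(p_over, delta_c, delta_w):
    """Decode one modulated feature map of time dimension T_k.

    p_over, delta_c, delta_w: tensors of shape (T_k, |R|), i.e. one prediction
    vector p = (p_over, delta_c, delta_w) per candidate video segment.
    Returns the predicted centers, lengths and overlap scores of the T_k * |R| candidates.
    """
    T_k = p_over.size(0)
    i = torch.arange(T_k, dtype=torch.float32).unsqueeze(1)    # (T_k, 1)
    r = torch.tensor(SCALE_RATIOS).unsqueeze(0)                # (1, |R|)
    mu_c = (i + 0.5) / T_k                 # candidate centers, one per feature unit
    mu_w = r / T_k                         # candidate lengths, one per scale ratio
    phi_c = mu_c + ALPHA_C * delta_c * mu_w            # formula (7)
    phi_w = mu_w * torch.exp(ALPHA_W * delta_w)        # formula (8)
    return phi_c, phi_w, p_over
```

Candidates decoded from all layers would then be pooled and ranked by p_over in descending order, and the top-ranked segments kept as the target video segments, as described above.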
In some embodiments, the video localization method may further include: constructing a time overlapping loss function and a time sequence position prediction loss function; constructing a joint loss function of the time sequence convolution neural network based on the time overlapping loss function and the time sequence position prediction loss function; the time series convolutional neural network is updated based on the joint loss function to converge the joint loss function.
Here, as for the construction of the time overlap loss function, it can be realized in the following manner: determining the time overlapping rate of the candidate video clips and the real target video clips; and constructing a time overlap loss function based on the time overlap rate and the time overlap fraction, wherein the time overlap fraction corresponds to the candidate video segment and the predicted target video segment. In terms of constructing the time-series position prediction loss function, the following method can be adopted: determining the central position and the length of a real target video clip; and constructing a time sequence position prediction loss function based on the difference of the central positions corresponding to the predicted target video segment and the real target video segment and the difference of the corresponding lengths.
The construction process of the joint loss function described above is described below by way of example.
The combined loss function in the embodiment of the invention comprises two parts, namely a time overlapping loss function and a time sequence position prediction loss function; that is to say, the joint loss function in the embodiment of the present invention combines the time overlap loss function and the temporal position prediction loss function to optimize the video positioning method according to the embodiment of the present invention.
Illustratively, the joint loss function may be determined by the following equation (9):
L=αLover+βLloc(9)
l, representing the joint loss function of the model for updating the time series convolutional neural network LoverFor representing a temporal overlap loss function LlocFor representing the time sequence position prediction loss function, α and β are respectively used for representing the balanced two-term loss function LoverAnd Llocα generally has a value of 100 and β generally has a value of 10.
The following continues with the description of the determination of the time overlap penalty function and the time series position prediction penalty function.
Illustratively, the temporal overlap loss function L_over may be determined by equation (10), which is rendered as an image in the original publication. In equation (10), the temporal overlap rate between a candidate video segment and the real target video segment is calculated according to equation (11) (also rendered as an image); when this temporal overlap rate is greater than 0.5, the candidate video segment is treated as a positive example, otherwise it is treated as a negative example. Equation (10) also involves the temporal overlap score of the candidate video segment and the predicted target video segment.
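Because equations (10) and (11) are rendered as images in the original publication, the following Python sketch is only an assumption: it takes (11) to be the usual temporal overlap rate (IoU) between segments and (10) to be a binary cross-entropy between the predicted temporal overlap scores and the positive/negative labels obtained with the 0.5 threshold described above. Names and values are illustrative.

import torch
import torch.nn.functional as F

def temporal_iou(candidates: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """candidates: (M, 2) [start, end]; gt: (2,) [start, end] of the real target segment."""
    inter = (torch.min(candidates[:, 1], gt[1]) -
             torch.max(candidates[:, 0], gt[0])).clamp(min=0.0)
    union = (candidates[:, 1] - candidates[:, 0]) + (gt[1] - gt[0]) - inter
    return inter / union.clamp(min=1e-6)

def overlap_loss(pred_scores: torch.Tensor, candidates: torch.Tensor,
                 gt: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    iou = temporal_iou(candidates, gt)
    labels = (iou > threshold).float()      # positive / negative examples per the text
    return F.binary_cross_entropy(pred_scores, labels)

# Example: three candidates scored against a ground-truth segment [10, 20].
cands = torch.tensor([[9.0, 21.0], [0.0, 5.0], [12.0, 30.0]])
scores = torch.tensor([0.9, 0.1, 0.4])
print(overlap_loss(scores, cands, torch.tensor([10.0, 20.0])))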
Illustratively, the time sequence position prediction loss function L_loc may be determined by equation (12), which is rendered as an image in the original publication. Equation (12) involves the center position of the real target video segment and the length of the real target video segment; the meaning of the remaining parameters in formula (12) is the same as that of the identically named parameters in the formulas above and can therefore be understood with reference to the description above.
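Equation (12) is likewise rendered as an image in the original publication; the following Python sketch assumes a Smooth-L1 regression on the differences between the predicted and real center positions and lengths, consistent with the description above. Names and example values are illustrative.

import torch
import torch.nn.functional as F

def location_loss(pred_center, pred_length, gt_center, gt_length):
    # Regression on the center-position difference and the length difference.
    return (F.smooth_l1_loss(pred_center, gt_center) +
            F.smooth_l1_loss(pred_length, gt_length))

# Example with two positive candidates and a real target segment of
# center 15.0 and length 10.0.
pred_c = torch.tensor([14.5, 16.0])
pred_l = torch.tensor([9.0, 12.0])
gt_c = torch.full_like(pred_c, 15.0)
gt_l = torch.full_like(pred_l, 10.0)
print(location_loss(pred_c, pred_l, gt_c, gt_l))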
By adopting the technical scheme provided by the embodiment of the present invention, the semantic relevance between the video and the sentence can be effectively captured. Because the semantic-based modulation mechanism is coupled with the hierarchical time sequence convolutional neural network, the video positioning approach of the embodiment of the present invention can be jointly optimized: each layer of feature map is semantically modulated based on the sentence information and the multilayer time sequence convolution feature map information, so that the target video segments related to the semantics of the sentence are more tightly correlated and aggregated along the time sequence, which enhances the accuracy of the temporal position prediction of the target video segment. As a result, the target video segment related to the semantics of the sentence can be located quickly and accurately, further improving the video watching efficiency and browsing experience of the user. In addition, the embodiment of the present invention can be combined with a video search technology to improve the efficiency of searching videos according to a given sentence.
Next, the implementation of the video positioning method provided by the embodiment of the present invention is described by taking the video to be a video played online in a terminal device and the input sentence to be a sentence A.
Referring to fig. 6, fig. 6 is a schematic view of another optional implementation flow of the video positioning method provided in the embodiment of the present invention, where the video positioning method in the embodiment of the present invention may be applied to various types of terminal devices such as a smart phone, a tablet computer, a digital television, and a desktop computer, that is, the terminal device may predict a target video segment related to the semantic of a sentence by executing the video positioning method in the embodiment of the present invention; the video positioning method in the embodiment of the invention can also be applied to a server, the server performs video positioning to obtain the target video clip related to the semantic meaning of the sentence, and at the moment, the terminal equipment is in a controlled mode, namely, the terminal equipment receives and plays the target video clip sent by the server. The steps shown in fig. 6 will be described below by taking as an example that the server executes the video positioning method according to the embodiment of the present invention. For details which are not exhaustive in the following description of the steps, reference is made to the above for an understanding.
Step 601: and acquiring the online played video and the sentence A.
Step 602: and respectively carrying out feature extraction on the online played video and the sentence A to obtain corresponding video segment features and word features.
In the embodiment of the present invention, a C3D neural network can be used to perform feature extraction on the online played video to obtain the features of each video segment in the online played video, where the video segment features are ordered in time sequence; a Glove model is used to perform feature extraction on the sentence A to obtain the corresponding word features in the sentence A.
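A schematic Python sketch of step 602 follows; the C3D network and the Glove model are replaced by placeholder stubs, since loading the real pretrained models is outside the scope of this illustration, and all dimensions are assumptions made for illustration only.

import torch
import torch.nn as nn

T, N = 32, 12                    # number of video segments / number of words in sentence A

def c3d_stub(segments: torch.Tensor) -> torch.Tensor:
    # Placeholder for a pretrained C3D network: one 4096-d feature per segment,
    # kept in time-sequence order.
    return torch.randn(segments.shape[0], 4096)

glove_stub = nn.Embedding(num_embeddings=10000, embedding_dim=300)  # stand-in for GloVe

video = torch.randn(T, 16, 3, 112, 112)      # T segments of 16 RGB frames each
word_ids = torch.randint(0, 10000, (N,))     # token ids of sentence A

video_feats = c3d_stub(video)       # (T, 4096) video segment features
word_feats = glove_stub(word_ids)   # (N, 300) word features
print(video_feats.shape, word_feats.shape)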
Step 603: and fusing the video segment characteristics and the word characteristics to obtain multi-modal fusion characteristics.
In the embodiment of the present invention, the context information of the sentence A can be integrated into the word features to obtain sentence features; here, feature integration may be performed on the extracted word features through a Bi-GRU to obtain sentence features containing the context information of the sentence A. The word features of each word in the sentence A corresponding to the sentence features are then averaged to obtain the average features of the words in the sentence A; finally, each video segment feature is fused with the average features of the words in the sentence A to obtain the multi-modal fusion features.
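The sentence-side processing described above can be sketched as follows in Python; the Bi-GRU hidden size (256 per direction) and the GloVe dimension (300) are illustrative assumptions.

import torch
import torch.nn as nn

bi_gru = nn.GRU(input_size=300, hidden_size=256, bidirectional=True, batch_first=True)

word_feats = torch.randn(1, 12, 300)          # (batch, N words, GloVe dim)
sentence_feats, _ = bi_gru(word_feats)        # (1, 12, 512) context-aware word features
avg_word_feat = sentence_feats.mean(dim=1)    # (1, 512) average feature of the words
print(sentence_feats.shape, avg_word_feat.shape)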
Here, for the server to fuse each video segment feature with the average feature of each word in the sentence a to obtain the multi-modal fusion feature, the following method may be specifically adopted: inputting the average characteristics of all words in the sentence A and the characteristics of all video segments into a full-connection network, and splicing the average characteristics of all words in the sentence with the characteristics of all video segments by using full-connection operation to obtain corresponding sub-characteristics; and then, fusing all the obtained sub-features to form multi-modal fused features corresponding to the videos and the sentences.
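A minimal Python sketch of the fusion described above: the average feature of the words is spliced (concatenated) with each video segment feature and passed through a fully connected layer to obtain one sub-feature per segment, and stacking the sub-features forms the multi-modal fusion feature F. All dimensions are illustrative.

import torch
import torch.nn as nn

T, d_v, d_s, d_f = 32, 4096, 512, 512
fuse_fc = nn.Sequential(nn.Linear(d_v + d_s, d_f), nn.ReLU())

video_feats = torch.randn(T, d_v)                 # per-segment video features
avg_word_feat = torch.randn(d_s)                  # average feature of the words in sentence A

tiled_sentence = avg_word_feat.expand(T, d_s)     # repeat the sentence feature per segment
sub_feats = fuse_fc(torch.cat([video_feats, tiled_sentence], dim=-1))  # (T, d_f) sub-features
F_multimodal = sub_feats                          # multi-modal fusion feature F
print(F_multimodal.shape)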
Step 604: and aggregating and associating the multi-modal fusion features layer by layer based on the time sequence through a time sequence convolution neural network to obtain a multi-layer time sequence convolution feature map.
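Step 604 can be sketched as a stack of strided one-dimensional temporal convolutions, each layer producing a feature map with a coarser temporal resolution; the number of layers and the channel size below are illustrative assumptions.

import torch
import torch.nn as nn

class TemporalPyramid(nn.Module):
    def __init__(self, dim=512, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
             for _ in range(num_layers)]
        )

    def forward(self, fusion):                  # fusion: (batch, dim, T)
        feature_maps = []
        x = fusion
        for conv in self.layers:
            x = torch.relu(conv(x))             # halve the temporal length at each layer
            feature_maps.append(x)
        return feature_maps                     # multilayer feature maps, shapes (batch, dim, T_k)

F_multimodal = torch.randn(1, 512, 32)          # fusion feature as (batch, channels, T)
maps = TemporalPyramid()(F_multimodal)
print([m.shape[-1] for m in maps])              # e.g. [16, 8, 4, 2]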
Step 605: and generating modulation parameters based on the feature units contained in each layer of feature map in the multilayer time series convolution feature map and the statement A.
Step 606: and carrying out normalized modulation on the feature units contained in each layer of feature diagram in the multilayer time sequence convolution feature diagram based on the modulation parameters to obtain updated feature units, and forming a modulated feature diagram based on the updated feature units.
In some embodiments, to generate the modulation parameters based on the feature units contained in each layer of the multilayer time sequence convolution feature map and the statement A in step 605, the following method may be adopted: distributing a corresponding attention weight to each word feature based on the feature units contained in each layer of feature map in the multilayer time sequence convolution feature map; based on the attention weights, carrying out weighted summation processing on the word features to obtain a corresponding attention-weighted sentence feature; and inputting the attention-weighted sentence feature into two fully connected networks in the time sequence convolutional neural network, and respectively connecting the attention-weighted sentence feature through the two fully connected networks to obtain the modulation parameters output by the two fully connected networks.
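A minimal Python sketch of steps 605 and 606 for a single feature unit follows: word-level attention yields an attention-weighted sentence feature, two fully connected networks map it to the modulation parameters (denoted gamma and beta here, an assumed naming), and the feature unit is normalized and modulated. Dimensions and names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticModulation(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.Linear(2 * dim, 1)    # scores a (feature unit, word) pair
        self.fc_gamma = nn.Linear(dim, dim)  # first fully connected network
        self.fc_beta = nn.Linear(dim, dim)   # second fully connected network

    def forward(self, unit, word_feats):              # unit: (dim,), word_feats: (N, dim)
        pairs = torch.cat([unit.expand_as(word_feats), word_feats], dim=-1)
        weights = F.softmax(self.attn(pairs).squeeze(-1), dim=0)   # attention weights per word
        attended = (weights.unsqueeze(-1) * word_feats).sum(dim=0) # attention-weighted sentence feature
        gamma, beta = self.fc_gamma(attended), self.fc_beta(attended)
        normalized = (unit - unit.mean()) / (unit.std() + 1e-6)    # normalize the feature unit
        return gamma * normalized + beta                           # updated feature unit

mod = SemanticModulation()
updated = mod(torch.randn(512), torch.randn(12, 512))
print(updated.shape)   # torch.Size([512])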
Step 607: and performing time sequence convolution operation on each layer of feature map in the modulated feature map to obtain candidate video clips and time overlapping scores of the corresponding candidate video clips and the target video clip.
Step 608: and determining the video clips of the set number of the candidate video clips ranked in the front as target video clips based on descending order of the time overlapping scores, wherein the target video clips are video clips related to the semantics of the sentence A.
In some embodiments, the video localization method may further include: constructing a time overlapping loss function and a time sequence position prediction loss function; constructing a joint loss function of the time sequence convolution neural network based on the time overlapping loss function and the time sequence position prediction loss function; the time series convolutional neural network is updated based on the joint loss function to converge the joint loss function.
In the embodiment of the present invention, as for constructing the time overlap loss function, the following method may be adopted: determining the time overlapping rate of the candidate video clips and the real target video clips; and constructing a time overlap loss function based on the time overlap rate and the time overlap fraction, wherein the time overlap fraction corresponds to the candidate video segment and the predicted target video segment. In terms of constructing the time-series position prediction loss function, the following method can be adopted: determining the central position and the length of a real target video clip; and constructing a time sequence position prediction loss function based on the difference of the central positions corresponding to the predicted target video segment and the real target video segment and the difference of the corresponding lengths.
Referring to fig. 7, fig. 7 is a schematic diagram of an optional principle structure of the video positioning method according to the embodiment of the present invention. Assuming that a video containing T video segments and a sentence containing N words are given, a C3D neural network is used to perform feature extraction on the obtained video to obtain a corresponding video segment feature sequence V = [v_1, …, v_T], and a Glove model is used to perform feature extraction on the sentence to obtain a corresponding word feature sequence W = [w_1, …, w_N]. The context information of the sentence is integrated into the word feature sequence W through a bidirectional GRU (Bi-GRU) to obtain a sentence feature sequence S = [s_1, …, s_N], and the word features s_1, …, s_N in S corresponding to the words of the sentence are averaged to obtain the average feature of the words in the sentence. The average feature of the words in the sentence is then fully connected with each video segment feature in the video segment feature sequence V to obtain a multi-modal fusion feature F. Next, the multi-modal fusion feature F is passed through the time sequence convolutional neural network based on semantic modulation to obtain a modulated feature map: specifically, the multi-modal fusion feature F is aggregated and associated layer by layer along the time sequence to obtain a multilayer time sequence convolution feature map, and each layer of feature map in the multilayer time sequence convolution feature map is then semantically modulated based on the semantic modulation mechanism to obtain the modulated feature map, which is formed by the updated feature units obtained by performing normalized modulation on the feature units a_i of the multilayer time sequence convolution feature map.
Finally, the modulated characteristic diagram is subjected to time sequence convolution operation so as to predict a target video segment related to the semantics of the given sentence. For the semantic-based modulation mechanism, the specific implementation of semantic modulation on each layer of the multilayer time-series convolution feature map can be understood with reference to the above.
By adopting the technical scheme provided by the embodiment of the present invention, the semantic relevance between the video and the sentence can be effectively captured. Because the semantic-based modulation mechanism is coupled with the hierarchical time sequence convolutional neural network, the video positioning approach of the embodiment of the present invention can be jointly optimized: each layer of feature map is semantically modulated based on the sentence information and the multilayer time sequence convolution feature map information, so that the target video segments related to the semantics of the sentence are more tightly correlated and aggregated along the time sequence, which enhances the accuracy of the temporal position prediction of the target video segment. As a result, the target video segment related to the semantics of the sentence can be located quickly and accurately, further improving the video watching efficiency and browsing experience of the user. In addition, the embodiment of the present invention can be combined with a video search technology to accelerate video searching according to a given sentence.
Next, a software implementation of the video positioning apparatus 30 according to the embodiment of the present invention will be described. Taking the software module included in the memory 202 of the electronic device 20 as an example, the details that are not described in the following description of the function of the module can be understood by referring to the above description.
An acquisition unit 31 for acquiring a video and a sentence; a feature extraction unit 32, configured to perform feature extraction on the video and the sentence respectively to obtain corresponding video segment features and word features; a feature fusion unit 33, configured to fuse the video segment feature and the word feature to obtain a fusion feature; the aggregation association unit 34 is configured to aggregate and associate the fusion features layer by layer based on a time sequence through a time sequence convolutional neural network to obtain a multilayer time sequence convolutional feature map; the semantic modulation unit 35 is configured to perform semantic modulation on each layer of feature map in the multilayer time series convolution feature map to obtain a modulated feature map; and a time sequence convolution unit 36, configured to perform a time sequence convolution operation on the modulated feature map to obtain a target video segment related to the semantic meaning of the statement.
In some embodiments, to the extent that the feature fusion unit fuses the video segment feature and the word feature to obtain a fusion feature, the following method may be adopted: performing feature integration on the word features to obtain sentence features, wherein the sentence features comprise context information of the sentences; averaging the word features corresponding to each word in the sentence features to obtain the average features of each word in the sentence; and respectively fusing the video segment characteristics with the average characteristics of the words in the sentences to obtain fused characteristics.
In some embodiments, for the feature fusion unit to fuse each of the video segment features with the average feature of each word in the sentence, to obtain the fusion feature, the following method may be adopted: respectively splicing the average characteristics of all words in the sentence with the characteristics of all the video segments through an activation function to obtain corresponding sub-characteristics; and fusing all the obtained sub-features to form fused features corresponding to the videos and the sentences.
In some embodiments, the semantic modulation unit comprises:
a generating unit, configured to generate a modulation parameter based on the feature unit included in each layer of feature map in the multilayer time series convolution feature map and the statement;
the normalization modulation unit is used for carrying out normalization modulation on the feature units contained in each layer of feature diagram in the multilayer time sequence convolution feature diagram based on the modulation parameters to obtain updated feature units;
the generating unit is further configured to form a modulated feature map based on the updated feature unit.
In some embodiments, the generation unit may generate the modulation parameter based on the feature unit included in each layer of the multi-layer time-series convolution feature map and the statement, in the following manner: distributing corresponding attention weight to each word feature based on the feature unit contained in each layer of feature map in the multilayer time sequence convolution feature map; based on the attention weight, carrying out weighted summation processing on each word feature to obtain a corresponding attention weighted sentence feature; and inputting the attention weighted sentence characteristics into two fully-connected networks in the time sequence convolutional neural network, and respectively connecting the attention weighted sentence characteristics through the two fully-connected networks to obtain modulation parameters output by the two fully-connected networks.
In some embodiments, to the extent that the time-series convolution unit performs the time-series convolution operation on the modulated feature map to obtain the target video segment related to the semantics of the sentence, the following method may be adopted: performing time sequence convolution operation on each layer of feature map in the modulated feature map to obtain candidate video clips and time overlapping scores corresponding to the candidate video clips and the target video clips; determining a set number of top ranked candidate video segments as the target video segment based on a descending ranking of the temporal overlap scores.
In some embodiments, the video positioning apparatus further comprises:
the loss function constructing unit is used for constructing a time overlapping loss function and a time sequence position prediction loss function; constructing a joint loss function of the time sequence convolutional neural network based on the time overlapping loss function and the time sequence position prediction loss function;
a training unit, configured to update the time-series convolutional neural network based on the joint loss function, so that the joint loss function converges.
In some embodiments, in the case that the loss function constructing unit constructs the time overlap loss function, the following may be adopted: determining the time overlapping rate of the candidate video clips and the real target video clips; constructing the temporal overlap loss function based on the temporal overlap rate and a temporal overlap score, the temporal overlap score corresponding to the candidate video segment and the predicted target video segment.
In some embodiments, in the case that the loss function constructing unit constructs the time-series position prediction loss function, the following may be adopted: determining the central position and the length of the real target video clip; and constructing the time sequence position prediction loss function based on the difference of the central positions corresponding to the predicted target video segment and the real target video segment and the difference of the corresponding lengths.
The embodiment of the present invention further provides a storage medium storing executable instructions which, when executed, are used to implement the video positioning method provided by the embodiment of the present invention. The storage medium may be a computer-readable storage medium, for example, a Memory such as a Ferroelectric Random Access Memory (FRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Flash Memory, a magnetic surface Memory, an optical disk, or a Compact Disc Read Only Memory (CD-ROM).
In summary, the technical scheme of the embodiment of the invention has the following beneficial effects:
1. by combining a semantic-based modulation mechanism with a layered time sequence convolutional neural network, the mode of performing video positioning according to the whole statement can be optimized in a combined manner and executed efficiently;
2. semantic modulation is carried out on each layer of feature map based on statement information and multilayer time sequence convolution feature map information, so that correlation and aggregation of target video segments related to the semantics of statements on time sequences are closer, and the accuracy of time sequence position prediction of the target video segments is enhanced;
3. according to the given sentence, the target video clip related to the semantic meaning of the sentence can be quickly and accurately positioned, so that the video watching efficiency and browsing experience of a user are improved, and the video searching efficiency can be improved according to the given sentence by combining with a video searching technology.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A method for video localization, comprising:
acquiring a video and a sentence;
respectively extracting the features of the video and the sentence to obtain corresponding video segment features and word features;
fusing the video segment characteristics and the word characteristics to obtain fused characteristics;
aggregating and associating the fusion features layer by layer based on time sequence through a time sequence convolution neural network to obtain a multilayer time sequence convolution feature graph;
semantically modulating each layer of feature map in the multilayer time sequence convolution feature map to obtain a modulated feature map, and
performing time sequence convolution operation on the modulated feature map to obtain a target video segment related to the semantics of the sentence;
the semantic modulation is performed on each layer of feature map in the multilayer time sequence convolution feature map to obtain a modulated feature map, and the method comprises the following steps:
generating modulation parameters based on the feature units contained in each layer of feature map in the multilayer time series convolution feature map and the statements; based on the modulation parameters, carrying out normalized modulation on the feature units contained in each layer of feature diagram in the multilayer time sequence convolution feature diagram to obtain updated feature units; and forming a modulated characteristic diagram based on the updated characteristic unit.
2. The method as claimed in claim 1, wherein said fusing said video segment feature and said word feature to obtain a fused feature comprises:
performing feature integration on the word features to obtain sentence features, wherein the sentence features comprise context information of the sentences;
averaging the word features corresponding to each word in the sentence features to obtain the average features of each word in the sentence;
and respectively fusing the video segment characteristics with the average characteristics of the words in the sentences to obtain fused characteristics.
3. The method of claim 2, wherein said fusing each of said video segment features with an average feature of each word in said sentence to obtain a fused feature comprises:
respectively splicing the average characteristics of all words in the sentence with the characteristics of all the video segments through an activation function to obtain corresponding sub-characteristics;
and fusing all the obtained sub-features to form fused features corresponding to the videos and the sentences.
4. The method of claim 1, wherein generating modulation parameters based on the feature units included in each layer of the multi-layer time-series convolution feature map and the statements comprises:
distributing corresponding attention weight to each word feature based on the feature unit contained in each layer of feature map in the multilayer time sequence convolution feature map;
based on the attention weight, carrying out weighted summation processing on each word feature to obtain a corresponding attention weighted sentence feature;
and inputting the attention weighted sentence characteristics into two fully-connected networks in the time sequence convolutional neural network, and respectively connecting the attention weighted sentence characteristics through the two fully-connected networks to obtain modulation parameters output by the two fully-connected networks.
5. The method of claim 1, wherein the performing a time-series convolution operation on the modulated feature map to obtain a target video segment related to the semantics of the sentence comprises:
performing time sequence convolution operation on each layer of feature map in the modulated feature map to obtain candidate video clips and time overlapping scores corresponding to the candidate video clips and the target video clips;
determining a set number of top ranked candidate video segments as the target video segment based on a descending ranking of the temporal overlap scores.
6. The method of claim 1, wherein the method further comprises:
constructing a time overlapping loss function and a time sequence position prediction loss function;
constructing a joint loss function of the time sequence convolutional neural network based on the time overlapping loss function and the time sequence position prediction loss function;
updating the time-series convolutional neural network based on the joint loss function to converge the joint loss function.
7. The method of claim 6, wherein constructing the temporal overlap penalty function and the temporal position prediction penalty function comprises:
determining the time overlapping rate of the candidate video clips and the real target video clips;
constructing the temporal overlap loss function based on the temporal overlap rate and a temporal overlap score, the temporal overlap score corresponding to the candidate video segment and the predicted target video segment;
determining the central position and the length of the real target video clip;
and constructing the time sequence position prediction loss function based on the difference of the central positions corresponding to the predicted target video segment and the real target video segment and the difference of the corresponding lengths.
8. A video positioning apparatus, comprising:
an acquisition unit for acquiring a video and a sentence;
the feature extraction unit is used for respectively extracting features of the video and the sentences to obtain corresponding video segment features and word features;
the feature fusion unit is used for fusing the video segment features and the word features to obtain fusion features;
the aggregation association unit is used for aggregating and associating the fusion features layer by layer based on time sequence through a time sequence convolution neural network to obtain a multilayer time sequence convolution feature graph;
the semantic modulation unit is used for carrying out semantic modulation on each layer of feature map in the multilayer time sequence convolution feature map to obtain a modulated feature map;
the time sequence convolution unit is used for carrying out time sequence convolution operation on the modulated characteristic diagram to obtain a target video segment related to the semantics of the statement;
the semantic modulation unit comprises:
a generating unit, configured to generate a modulation parameter based on the feature unit included in each layer of feature map in the multilayer time series convolution feature map and the statement;
the normalization modulation unit is used for carrying out normalization modulation on the feature units contained in each layer of feature diagram in the multilayer time sequence convolution feature diagram based on the modulation parameters to obtain updated feature units;
the generating unit is further configured to form a modulated feature map based on the updated feature unit.
9. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the video positioning method of any of claims 1 to 7 when executing executable instructions stored in the memory.
CN201910570609.5A 2019-06-27 2019-06-27 Video positioning method and device and electronic equipment Active CN110225368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910570609.5A CN110225368B (en) 2019-06-27 2019-06-27 Video positioning method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910570609.5A CN110225368B (en) 2019-06-27 2019-06-27 Video positioning method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110225368A CN110225368A (en) 2019-09-10
CN110225368B true CN110225368B (en) 2020-07-10

Family

ID=67815259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910570609.5A Active CN110225368B (en) 2019-06-27 2019-06-27 Video positioning method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110225368B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128285A (en) * 2019-12-31 2021-07-16 华为技术有限公司 Method and device for processing video
CN111582170B (en) * 2020-05-08 2023-05-23 浙江大学 Method and system for positioning specified object in video based on multi-branch relation network
CN111866607B (en) * 2020-07-30 2022-03-11 腾讯科技(深圳)有限公司 Video clip positioning method and device, computer equipment and storage medium
CN114078223A (en) * 2020-08-17 2022-02-22 华为技术有限公司 Video semantic recognition method and device
CN112488063B (en) * 2020-12-18 2022-06-14 贵州大学 Video statement positioning method based on multi-stage aggregation Transformer model
CN113128431B (en) * 2021-04-25 2022-08-05 北京亮亮视野科技有限公司 Video clip retrieval method, device, medium and electronic equipment
CN113868519B (en) * 2021-09-18 2023-11-14 北京百度网讯科技有限公司 Information searching method, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649542A (en) * 2015-11-03 2017-05-10 百度(美国)有限责任公司 Systems and methods for visual question answering
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN108647255A (en) * 2018-04-23 2018-10-12 清华大学 The video sequential sentence localization method and device returned based on attention
CN108960337A (en) * 2018-07-18 2018-12-07 浙江大学 A kind of multi-modal complicated activity recognition method based on deep learning model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9183466B2 (en) * 2013-06-15 2015-11-10 Purdue Research Foundation Correlating videos and sentences
CN104113789B (en) * 2014-07-10 2017-04-12 杭州电子科技大学 On-line video abstraction generation method based on depth learning
CN105228033B (en) * 2015-08-27 2018-11-09 联想(北京)有限公司 A kind of method for processing video frequency and electronic equipment
CN107391646B (en) * 2017-07-13 2020-04-10 清华大学 Semantic information extraction method and device for video image
CN109543519B (en) * 2018-10-15 2022-04-15 天津大学 Depth segmentation guide network for object detection
CN109871736B (en) * 2018-11-23 2023-01-31 腾讯科技(深圳)有限公司 Method and device for generating natural language description information

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649542A (en) * 2015-11-03 2017-05-10 百度(美国)有限责任公司 Systems and methods for visual question answering
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN108647255A (en) * 2018-04-23 2018-10-12 清华大学 The video sequential sentence localization method and device returned based on attention
CN108960337A (en) * 2018-07-18 2018-12-07 浙江大学 A kind of multi-modal complicated activity recognition method based on deep learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression; Yuan, Yitian et al.; 《https://arxiv.org/abs/1804.07014》; 2018-04-19; full text *

Also Published As

Publication number Publication date
CN110225368A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN110225368B (en) Video positioning method and device and electronic equipment
CN112084331B (en) Text processing and model training method and device, computer equipment and storage medium
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
EP3796189A1 (en) Video retrieval method, and method and apparatus for generating video retrieval mapping relationship
CN112287170B (en) Short video classification method and device based on multi-mode joint learning
CN116720004B (en) Recommendation reason generation method, device, equipment and storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN113254711B (en) Interactive image display method and device, computer equipment and storage medium
CN113705299A (en) Video identification method and device and storage medium
CN115114395B (en) Content retrieval and model training method and device, electronic equipment and storage medium
CN111831813A (en) Dialog generation method, dialog generation device, electronic equipment and medium
Li et al. A deep reinforcement learning framework for Identifying funny scenes in movies
CN116977701A (en) Video classification model training method, video classification method and device
CN114998777A (en) Training method and device for cross-modal video retrieval model
CN111324773A (en) Background music construction method and device, electronic equipment and storage medium
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN115221369A (en) Visual question-answer implementation method and visual question-answer inspection model-based method
KR102492774B1 (en) Method for providing music contents licensing platform service based on artificial intelligence
CN115238126A (en) Method, device and equipment for reordering search results and computer storage medium
CN111626058A (en) Based on CR2Method and system for realizing image-text double coding of neural network
Gao et al. Generalized pyramid co-attention with learnable aggregation net for video question answering
CN116186197A (en) Topic recommendation method, device, electronic equipment and storage medium
CN117312630A (en) Recommendation information acquisition method, model training method, device, electronic equipment and storage medium
CN116975016A (en) Data processing method, device, equipment and readable storage medium
CN117216536A (en) Model training method, device and equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant