GB2558582A - Method and apparatus for automatic video summarisation - Google Patents
Method and apparatus for automatic video summarisation
- Publication number
- GB2558582A GB2558582A GB1700265.0A GB201700265A GB2558582A GB 2558582 A GB2558582 A GB 2558582A GB 201700265 A GB201700265 A GB 201700265A GB 2558582 A GB2558582 A GB 2558582A
- Authority
- GB
- United Kingdom
- Prior art keywords
- video
- attention
- locations
- temporal
- text description
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/738—Presentation of query results
- G06F16/739—Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
Abstract
A method of creating a video summary, comprising: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value. The attention map may be a temporal attention map in that the locations correspond to temporal locations of the map having the highest attention value, a spatial map where the locations correspond to spatial locations with the highest attention value, or a combination thereof. A summary video may then be output with video portions corresponding to the locations with the highest attention values. The text description summary and input questions may be converted into vectors which can be input into the neural network.
Description
(54) Title of the Invention: Method and apparatus for automatic video summarisation
Abstract Title: Method and Apparatus for Automatic Video Summarisation
[Drawing pages 1/6 to 6/6: flow charts of operations S1000–S1700, S2000–S2700 and S3000–S3800; a block diagram of the automatic video summariser showing the video-to-text module 20, AI attention module 30, user interface 40 and output 50; and a spatio-temporal attention map with angular sections, processing circuitry and output.]
Method and Apparatus for Automatic Video Summarisation
Field
This specification generally relates to automatic video summarisation.
Background
Video summarisation includes producing a video which is smaller in size than the original. Temporal video summarisation includes producing a shorter video. Spatial video summarisation includes producing a video which has less spatial extent than the original. Video summarisation may include detecting events in the video which are relatively more interesting than other events in the video.
Summary
According to a first aspect, the specification describes a method comprising: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
The attention map may be a temporal attention map, wherein the locations correspond to temporal locations of the attention map having the highest attention value.
The attention map may be a spatial attention map, wherein the locations correspond to spatial locations of the attention map having the highest attention value.
The attention map may be a spatio-temporal attention map, wherein the locations correspond to spatial and temporal locations of the attention map having the highest attention value.
The method may further comprise outputting a video summary having video portions corresponding to the locations of the attention map having the highest attention value.
The method may further comprise selecting a video portion based on the temporal location of the attention map having the highest attention value and surrounding temporal locations having an attention value above a threshold attention value.
The method may further comprise converting the input video to the text description.
The method may further comprise converting the text description and input question respectively to a text description summary vector and a question summary vector.
The method may further comprise providing the text description summary vector and the question summary vector to the neural network.
According to a second aspect, the specification describes a computer program comprising machine readable instructions that, when executed by computing apparatus, causes it to perform any method as described with reference to the first aspect.
According to a third aspect, the specification describes an apparatus configured to perform any method as described with reference to the first aspect.
According to a fourth aspect, the specification describes an apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform a method comprising: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
The attention map may be a temporal attention map, wherein the locations correspond to temporal locations of the attention map having the highest attention value.
The attention map may be a spatial attention map, wherein the locations correspond to spatial locations of the attention map having the highest attention value.
The attention map may be a spatio-temporal attention map, wherein the locations correspond to spatial and temporal locations of the attention map having the highest attention value.
The computer program code, when executed, may cause the apparatus to perform: outputting a video summary having video portions corresponding to the locations of the attention map having the highest attention value.
The computer program code, when executed, may cause the apparatus to perform: selecting a video portion based on the temporal location of the attention map having the highest attention value and surrounding temporal locations having an attention value above a threshold attention value.
The computer program code, when executed, may cause the apparatus to perform: converting the input video to the text description.
The computer program code, when executed, may cause the apparatus to perform: converting the text description and input question respectively to a text description summary vector and a question summary vector.
The computer program code, when executed, may cause the apparatus to perform: providing the text description summary vector and the question summary vector to the neural network.
According to a fifth aspect, the specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causes performance of at least: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
According to a sixth aspect, there is provided an apparatus comprising means for: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
Brief Description of the Figures
For a more complete understanding of the methods, apparatuses and computer-readable instructions described herein, reference is now made to the following descriptions taken in connection with the accompanying drawings, in which:
Figure 1 is a schematic illustration of an automatic video summariser, according to embodiments of this specification;
Figure 2 is a schematic illustration of temporal video summarisation according to embodiments of this specification;
Figure 3 is a schematic illustration of spatial video summarisation according to embodiments of this specification;
Figure 4 is a flow chart illustrating various operations which may be performed by the automatic video summariser in order to convert video to a text description according to embodiments of this specification;
Figure 5 is a flow chart illustrating various operations which may be performed by the automatic video summariser in order to produce a video summary based on a user’s question according to embodiments of this specification;
Figure 6 is a flow chart illustrating operations which may be performed by the automatic video summariser in order to produce a spatio-temporal attention map according to embodiments of this specification;
Figure 7 illustrates an example of a spatio-temporal attention map produced by the automatic video summariser according to embodiments of this specification;
Figure 8 is a schematic illustration of an example configuration of the automatic video summariser according to embodiments of this specification;
Figure 9 illustrates a computer-readable memory medium upon which computer-readable code may be stored, according to embodiments of this specification.
Detailed Description
In the description and drawings, like reference numerals may refer to like elements throughout.
Figure 1 is a schematic illustration of an automatic video summariser 10. The automatic video summariser 10 described herein makes use of neural networks in order to produce spatio-temporal summaries including visual information relevant to a user's question or request. In this way, the events in the video which are considered to be relevant to the user's question are determined, and video portions showing these events can be output as a spatio-temporal summary for the user.
The automatic video summariser 10 comprises a video-to-text module 20, an artificial intelligence (AI) attention module 30, a user interface 40 for receiving a user input, and an output 50, which may be a display, for example. The AI attention module may use deep learning methods such as attention mechanisms, neural attention mechanisms, or one or more neural networks outputting attention weights.
Figure 2 is a schematic illustration of temporal video summarisation. In temporal summarisation, the size of an input video 100 made up of video frames 100a–100i is reduced in terms of content by producing a video summary with a shorter time duration. A number of frames may be extracted from the video 100. For example, frames 100a, 100b, 100e, 100f, 100h and 100i may be extracted and joined temporally one after the other, keeping the temporal order intact. The output video summary would comprise video portion 101 made up of frames 100a–b, video portion 102 made up of frames 100e–f, and video portion 103 made up of frames 100h–i. Accordingly, the summary will be a video having fewer frames than the input video. Each portion may be made up of any number of frames, and different portions may contain different numbers of frames. The temporal portions may be determined based on events occurring in the video. For example, a temporal portion may relate to one specific event occurring in the video. Selection of the temporal portions of the video may be performed as described in more detail with reference to Figures 4 to 7.
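As a minimal illustration of the extract-and-join scheme described above (a hypothetical sketch using placeholder frame labels, not the patent's implementation), temporal summarisation reduces to selecting frame indices and concatenating the corresponding frames in their original order:

```python
def temporal_summary(frames, keep_indices):
    """Return the selected frames joined one after the other,
    keeping the temporal order of the original video intact."""
    return [frames[i] for i in sorted(keep_indices)]

# Stand-ins for frames 100a-100i of the example in the text
video = [f"frame_{c}" for c in "abcdefghi"]

# Keep frames a, b (portion 101), e, f (portion 102), h, i (portion 103)
summary = temporal_summary(video, [0, 1, 4, 5, 7, 8])
```

The summary video then simply has fewer frames than the input, with each kept run of frames forming one temporal portion.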
The video 100 may be a virtual reality video, for example a 360 degree video shot by a camera having a 360 degree field of view, such as the Nokia OZO camera. An example of a frame 110 from a virtual reality video can be seen in Figure 3. The video may include multiple events in different spatial sectors of the video and may therefore be spatially summarised. A spatial video summary is a video comprising video crops, i.e. spatial video portions extracted from the original video by cropping spatially. Figure 3 illustrates spatial crops 111, 112 and 113. In spatial summarisation, the size of the video crops may be the same for all crops. In embodiments where the crops are not the same size, a resizing step may be applied to increase the resolution of at least one video crop. Increasing the resolution may be performed, for example, by up-sampling with or without interpolation, or by using neural super-resolution methods. Alternatively, the resizing step may involve decreasing the resolution of at least one video crop, for example by down-sampling of the video crop.
Selection of the spatial portions of the video may be performed as described in more detail with reference to Figures 4 to 7.
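The cropping and resizing steps can be sketched as follows (a hypothetical illustration on a toy frame; the crop range and down-sampling factor are made-up values, and real systems would use proper interpolation rather than simple decimation):

```python
import numpy as np

def spatial_crop(frame, x0, x1):
    """Extract a spatial video portion (here, a horizontal pixel
    range standing in for an angular sector) from a frame."""
    return frame[:, x0:x1]

def downsample(crop, factor):
    """Decrease the resolution of a crop by keeping every
    `factor`-th pixel in each dimension (naive down-sampling)."""
    return crop[::factor, ::factor]

frame = np.arange(64).reshape(8, 8)   # toy 8x8 frame
crop = spatial_crop(frame, 2, 6)      # 8x4 spatial crop
small = downsample(crop, 2)           # 4x2 crop at reduced resolution
```

Up-sampling (with interpolation or neural super-resolution) would be the inverse operation used when a crop needs its resolution increased to match the others.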
By performing both temporal and spatial summarisation, a spatio-temporal video summary can be produced. For example, the video 100 may be a full length 360 degree movie. The movie may include multiple events temporally and multiple events spatially.
Figure 4 is a flow chart illustrating various operations which may be performed by the automatic video summariser in order to convert video to a text description. In some embodiments, not all of the illustrated operations need to be performed. Operations may also be performed in a different order compared to the order presented in Figure 4.
In operation S1000, the automatic video summariser may receive an input video from a video source. The video may be a video extract, or it may be a full length movie. The video may be provided from any suitable video source. For example, the video may be stored on a storage medium such as a DVD, Blu-ray disc, hard drive, or any other suitable storage medium. Alternatively, the video may be obtained via streaming or download from an external server.
In operation S1100, the input video is analysed by a feature extraction module. The feature extraction module may comprise a Convolutional Neural Network (CNN). A CNN is an artificial neural network which currently represents the state of the art for performing feature extraction from images and videos. A CNN consists of a sequence of computation layers, where the input is the data (a video frame or an image) and the output is a feature vector, i.e. a vector describing the input image. There may be different types of computation layers in a CNN, but the most important is the convolutional layer. A convolutional layer performs a convolution operation on its input using a set of convolution kernels. Other types of computation layers present in a CNN may be pooling layers, non-linear activation function layers, batch-normalisation layers, etc. However, the present invention is not limited to a CNN, and other feature extraction methodologies may be utilised.
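To make the convolution operation concrete, here is a from-scratch sketch of a single "valid" 2-D convolution over one image with one kernel (the averaging kernel is an arbitrary example; a real convolutional layer applies many learned kernels and is followed by a non-linear activation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: slide the kernel over the image
    with no padding, producing a smaller feature map."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.ones((5, 5))               # toy single-channel frame
kernel = np.full((3, 3), 1.0 / 9)     # example 3x3 averaging kernel
features = conv2d_valid(image, kernel)  # 3x3 feature map
```

Stacking many such layers (with pooling and activations between them) is what turns a frame into the compact feature vector described above.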
In operation S1200, the features extracted in operation S1100 may be input to a temporal neural network. The temporal neural network may comprise a Recurrent Neural Network (RNN). A suitable RNN may be, for example, a Long Short-Term Memory network (LSTM).
In operation S1300, the temporal network outputs a "frame-description" vector for each input video frame. The frame-description vector corresponds to a description of the video frame, represented by a vector of real numbers, and may be used for generating a sentence or phrase describing the video frame.
In operation S1400, the frame-description vectors may be analysed by a second RNN. The second RNN may also be an LSTM network, or any other suitable temporal neural network.
The second RNN generates a set of characters, or words, describing the input video frame. As such, a vector comprising a set of sentences describing the whole video is output.
In operation S1500, a softmax function is applied to the vector output by the second RNN in operation S1400. This indicates the distribution of the words corresponding to the extracted features throughout the video. The vector which is output may be referred to as a "text description vector".
In operation S1600, an index synchronisation is performed. In order to determine the temporal locations of the features within the video, the text description is synchronised with the video. This includes associating each word or character with a certain video frame. A word or character may be associated with several adjacent video frames.
The association of the words or characters with corresponding video frames can be achieved by outputting a video-frame index for each word or character, corresponding to the index of the frame which is described by those words or characters. For example, in one case, one word may be associated with multiple adjacent frames.
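The index synchronisation of operation S1600 can be modelled as words paired with the indices of the frames they describe (the words and indices below are made-up illustrative values, not from the patent):

```python
# Each entry pairs a word with the indices of the (possibly several
# adjacent) video frames it describes.
description = [
    ("car",     [3, 4]),
    ("crashes", [4, 5]),
    ("into",    [5]),
    ("wall",    [5, 6]),
]

def frames_for_word(description, word):
    """Return the video-frame indices associated with a given word."""
    return [idx for w, idxs in description for idx in idxs if w == word]
```

With this association in place, any word selected later by the attention mechanism can be traced directly back to a temporal location in the video.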
In operation S1700, the automatic video summariser outputs a text description of the video associated with corresponding time indexes.
However, it will be recognised that any suitable implementation of the video-to-text module 20 can be utilised.

Figure 5 is a flow chart illustrating various operations which may be performed by the automatic video summariser in order to produce a spatio-temporal summary of an input video. In some embodiments, not all of the illustrated operations need to be performed. Operations may also be performed in a different order compared to the order presented in Figure 5.
In operation S2000 an input video is received.
In operation S2100, the video is converted to text, for example as described with reference to Figure 4. However, it will be understood that any suitable video to text conversion may be used.
In operation S2200, the automatic video summariser outputs text descriptions of the video.
In operation S2300, the automatic video summariser receives a user question or request. The question or request is input in, or converted into, a text format. The question or request may relate to information the user would like to know about the input video. For example, the user may wish to find out whether there are any car crashes in the video. Therefore, the user may input a question such as "were there any car crashes in this movie?", or a request such as "would you summarise all the romantic scenes from the movie". The interface may be configured such that the user can input the question or request through the user interface 40, for example by typing on a keyboard or on a touchscreen device connected to the automatic video summariser. Alternatively, the question or request may be spoken by the user and converted into text by voice recognition software.
In operation S2400, the text question (or request) and the text descriptions of the video are input into an artificial intelligence (AI) attention module 30, which may comprise one or more neural networks, for example attention neural networks, and/or other operations which produce an "attention vector". The text question and text description are analysed by the AI attention module. The question may be analysed before being input into the AI attention module 30. An example of how the question may be analysed is described in more detail with reference to Figure 6.
In operation S2500, the AI attention module 30 produces a spatio-temporal attention map representing the attention intensity that a neural network has put at each point in time and spatial region when trying to answer the user's question.
In step S2600, the automatic video summariser retrieves the spatial and temporal portions of the input video corresponding to the temporal and spatial locations of the spatio-temporal attention map having the highest attention-intensity values.
In step S2700, the automatic video summariser outputs the selected video portions as a spatio-temporal video summary.
Figure 6 is a flow chart illustrating in more detail the steps involved in producing the spatio-temporal attention map used in order to produce the spatio-temporal video summarisation. In some embodiments, not all of the illustrated operations need to be performed. Operations may also be performed in a different order compared to the order presented in Figure 6.
In operation S3000, the text descriptions output as a result of operation S1700 of Figure 4 are input to a word-embedding module.
In operation S3100, the word-embedding module converts the text descriptions to a set of dense vectors. Each of the dense vectors may represent a single word with a plurality of real numbers. The words in the text description are each converted from a vocabulary representation to a vector of real numbers. The vector of real numbers may be of lower dimensionality than the input vector of vocabulary entries, i.e. a vector with fewer dimensions or axes. The new representation is a point in an "embedding space", in which words with similar semantics are nearby. The word-embedding module may be implemented by a multi-layer perceptron network or, alternatively, a single fully-connected layer. In general, the word-embedding module transforms an input into a more convenient output representation, for example one in which similar words lie close to each other in the new representation space.
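A word-embedding lookup can be sketched as a table mapping vocabulary entries to dense vectors (the vocabulary, the 4-dimensional embedding size, and the random initialisation below are all illustrative assumptions; in practice the embedding matrix is learned during training):

```python
import numpy as np

vocab = {"car": 0, "crash": 1, "scene": 2}   # toy vocabulary
rng = np.random.default_rng(0)

# One row of real numbers per vocabulary entry; 4-dim embedding space
embedding = rng.standard_normal((len(vocab), 4))

def embed(words):
    """Convert a sequence of words from their vocabulary
    representation to dense vectors of real numbers."""
    return np.stack([embedding[vocab[w]] for w in words])

dense = embed(["car", "crash"])   # shape (2, 4)
```

The same module is reused for the question in operations S3300 and S3400, so descriptions and questions end up in the same embedding space.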
In operation S3200, the text description vectors are input to an RNN where the vectors are analysed. The RNN outputs a single output vector, which will be referred to herein as a text description summary vector. The RNN may be an LSTM.
In operation S3300, the question is input to a word-embedding module.
In operation S3400, the word-embedding module converts the question to a set of dense vectors. The words in the question are each converted from a vocabulary representation to a vector of real numbers with lower dimensionality, in a similar way to the text descriptions in operation S3100.
In operation S3500, the question vectors are input into an RNN where the vectors are analysed. The RNN outputs a single output vector which summarises the question, which will be referred to herein as a question summary vector. The RNN may be an LSTM.
In operation S3600, the text description summary vector and question summary vector are combined. The combination operation may be a concatenation along one of the dimensions of the input vectors, or an element-wise addition (if the input vectors have the same dimensionality). However, any suitable combination operation may be used at this step.
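The two combination operations mentioned above can be shown directly (the 3-dimensional vectors are toy values; real summary vectors would be much larger):

```python
import numpy as np

desc_vec = np.array([0.1, 0.2, 0.3])  # text description summary vector
q_vec = np.array([0.4, 0.5, 0.6])     # question summary vector

# Option 1: concatenation along the vector dimension (length doubles)
combined_cat = np.concatenate([desc_vec, q_vec])

# Option 2: element-wise addition (requires the same dimensionality)
combined_add = desc_vec + q_vec
```

The concatenated variant preserves both inputs separately at the cost of a larger vector, whereas element-wise addition keeps the dimensionality fixed but merges the two signals.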
In operation S3700, the concatenated summary vectors are provided to a multi-layer perceptron (MLP) neural network. The MLP neural network may be referred to as an “attention neural network”. The MLP is a neural network comprising a set of dense (i.e. fully connected) layers, followed by a softmax layer.
The dense layers of the MLP learn how to map the concatenated word-embedded text descriptions and user questions to an attention vector. The mapping is learned from data via a training process which happens offline, end-to-end for the whole model proposed in this invention. The input data is videos and a set of questions for each video, and the ground-truth output is the video segments which form the target video summary. The attention vector is in practice a set of attention weights (i.e. real numbers) summing to 1, where each attention weight is associated with a certain temporal location of the video.
The softmax layer will output a probability distribution over “temporal attention weights” w.
The size of the output vector (i.e. the number of weights w) is the number of temporal locations, which is the number of words in the text describing the input video. In an alternative implementation, the size of the output vector is less than the number of words in the video description, and thus an attention weight can refer to more than one word. This would be a case where the attention is "quantised".
The weights represent a 1-dimensional "temporal attention map" (TAM), having bins which each correspond to a temporal location and have a value equal to the attention weight associated with that temporal location. The TAM value at location t, TAM[t], represents the attention intensity that the attention neural network has put at that point in time when trying to answer the user's question.
The temporal locations associated with each bin correspond to temporal locations of the input video. The attention weights output by the MLP form a vector of N bins, where N is the total number of temporal locations of the video. The attention weights therefore correspond to words of the text description and are arranged in the same temporal order as the words of the text description of the video. Accordingly, temporal synchronisation is achieved based on the temporal locations of the attention weights and the corresponding words of the text description. The dimensionality of the vector output by the MLP is determined automatically based on the number of words of the text description created by the video-to-text module 20.
In operation S3800, the attention neural network outputs the probability distribution over attention weights which can be represented as the temporal attention map. A temporal location t* of the TAM corresponding to the highest attention value in the TAM indicates the temporal location of the video which answers the user’s question.
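The softmax output and the selection of t* can be sketched as follows (the logits are made-up example values standing in for the MLP's final-layer outputs, one per temporal location):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: a probability distribution
    over temporal attention weights, summing to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical MLP outputs, one logit per temporal location
logits = np.array([0.2, 1.5, 3.0, 0.1, 0.4])

tam = softmax(logits)           # 1-dimensional temporal attention map
t_star = int(np.argmax(tam))    # temporal location answering the question
```

Here the third temporal location receives the highest attention weight, so t* = 2 identifies where in the video the answer lies.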
The temporal extent of the video portion is determined based on the temporal extent of the attention values around t*. For example, a threshold value of the attention weights may determine the temporal boundaries of the video portion to extract. That is, the video portion is selected based on the temporal locations of attention weights above a given threshold. However, the temporal extent of the video portion may be selected in any other suitable way.
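One way to realise this thresholding (a sketch of the idea, not the patent's prescribed method; the TAM values and threshold are illustrative) is to expand outwards from t* while the neighbouring attention weights stay above the threshold:

```python
import numpy as np

def temporal_extent(tam, t_star, threshold):
    """Grow the video portion from t* in both directions while the
    surrounding attention weights exceed the threshold; return the
    inclusive (start, end) temporal boundaries."""
    start = end = t_star
    while start > 0 and tam[start - 1] > threshold:
        start -= 1
    while end < len(tam) - 1 and tam[end + 1] > threshold:
        end += 1
    return start, end

tam = np.array([0.02, 0.10, 0.40, 0.30, 0.05])
start, end = temporal_extent(tam, 2, 0.08)
```

With these values the portion spans locations 1 to 3: locations 1 and 3 clear the 0.08 threshold, while locations 0 and 4 fall below it and are excluded.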
Figure 7 illustrates an example of a spatio-temporal attention map (STAM) produced by the automatic video summariser. The STAM represents the attention weights corresponding to each temporal location and spatial region of the video.
The TAM is extended to the spatial domain by analysing the video separately in the spatial dimension. For example, the video may be divided into a given number of angular sectors. Each sector is analysed separately by several attention networks. The joint output of the attention networks is a 2-dimensional attention map, or "spatio-temporal attention map" (STAM).
The STAM is output as a matrix indexed using two indices, one for time (t) and one for space (the angular sector s). In Figure 7, time runs along the x-axis of the map and space runs along the y-axis. In order to answer the user's question, the video portion (i.e. the particular temporal location and extent, and the spatial crop) is determined by the highest attention value within the STAM matrix. The video portion is based on the temporal location t* and angular sector s* having the highest attention value in the STAM matrix.
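Locating (t*, s*) is then a 2-dimensional argmax over the STAM matrix (the matrix below is a made-up example with three angular sectors and three temporal locations):

```python
import numpy as np

# Rows index angular sectors s, columns index temporal locations t;
# entries are attention weights.
stam = np.array([
    [0.01, 0.05, 0.02],
    [0.03, 0.60, 0.04],
    [0.02, 0.08, 0.15],
])

# Highest attention value determines both the sector and the time
s_star, t_star = np.unravel_index(np.argmax(stam), stam.shape)
```

Here the peak attention of 0.60 sits in sector 1 at temporal location 1, so the summary would crop that angular sector around that time.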
The video may be divided into a number of predetermined sectors. Alternatively, the division of the video may be dynamic. For example, the automatic video summariser may divide the video by means of “spatial scene cut detection”. Spatial scene cut detection may be achieved by analysing the video with deep learning or multimedia analysis techniques to detect objects, actions and activities, and then virtually cutting the scene to include the object, action or activity spatially. Therefore, the amount of data needed to analyse and summarise a spatial virtual reality video may be reduced.
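Division into a fixed number of angular sectors can be sketched by slicing each equirectangular frame along its horizontal (longitude) axis; the helper below is a hypothetical illustration under that assumption, not the patented method:

```python
import numpy as np

def split_into_sectors(frame, n_sectors):
    """Split an equirectangular 360-degree frame (H x W x C) into
    n_sectors equal angular sectors along the horizontal axis."""
    h, w, c = frame.shape
    bounds = np.linspace(0, w, n_sectors + 1, dtype=int)
    return [frame[:, bounds[i]:bounds[i + 1]] for i in range(n_sectors)]

frame = np.zeros((90, 360, 3), dtype=np.uint8)   # dummy 360-degree frame
sectors = split_into_sectors(frame, 4)
print([s.shape for s in sectors])   # four crops, each spanning 90 degrees
```

Each sector would then be analysed by its own attention network, and the per-sector outputs stacked to form the rows of the STAM.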
Spatial summary may be applicable to 360 degree videos in order to convert a 360 degree video to a standard size video. Spatial summary may also be performed without any temporal summarisation, if this is desired.
In Figure 7, the temporal and spatial locations determined by the automatic video summariser as answering the user’s question are indicated by the indices t1, s1 and t2, s2.
Therefore, the indices may be used to extract the corresponding temporal and spatial portions of the video for output to a user as a spatio-temporal summary of the video,
based on the user’s question. In this case, the user is provided with two video portions. The first video portion corresponds to the temporal location of the video indicated by indices t1, s1. The temporal extent of the video portion may be determined as described above, for example by setting a threshold attention value for the values temporally adjacent to t1. The second video portion corresponds to the temporal location of the video indicated by indices t2, s2.
Accordingly, by determining the highest attention values, the automatic video summariser is able to output a video summary which is determined to be the most relevant to the user’s question or request. The video portions may be output through the output 50. The output may be a display which forms part of the automatic video summariser 10. Alternatively, the automatic video summariser 10 may be configured to output the video portions to a display which does not form part of the automatic video summariser 10, such as a display of a TV or PC, etc. For example, the automatic video summariser may be located on a server which is separate from the display through which the video portions are output. The automatic video summariser may be configured to output indicators of temporal locations of a video to be played in a video summary.
Figure 8 is a schematic block diagram of an example configuration of an automatic video summariser such as that described with reference to Figures 1 to 7. The video summariser may comprise memory and processing circuitry. The memory 11 may comprise any combination of different types of memory. In the example of Figure 8, the memory comprises one or more read-only memory (ROM) media 13 and one or more random access memory (RAM) memory media 12. The processing circuitry 14 may be configured to process an input video and user question as described with reference to Figures 1 to 7.
The memory described with reference to Figure 8 may have computer readable instructions stored thereon 13A, which when executed by the processing circuitry 14 causes the processing circuitry 14 to cause performance of various ones of the operations described above. The processing circuitry 14 described above with reference to Figure 8 may be of any suitable composition and may include one or more processors 14A of any suitable type or suitable combination of types. For example, the processing circuitry 14 may be a programmable processor that interprets computer program instructions and processes data. The processing circuitry 14 may include plural programmable processors. Alternatively, the processing circuitry 14 may be, for
example, programmable hardware with embedded firmware. The processing circuitry 14 may be termed processing means. The processing circuitry 14 may alternatively or additionally include one or more Application Specific Integrated Circuits (ASICs). In some instances, processing circuitry 14 may be referred to as computing apparatus.
The processing circuitry 14 described with reference to Figure 8 is coupled to the memory 11 (or one or more storage devices) and is operable to read/write data to/from the memory. The memory may comprise a single memory unit or a plurality of memory units 13 upon which the computer readable instructions 13A (or code) are stored. For example, the memory 11 may comprise both volatile memory 12 and non-volatile memory 13. For example, the computer readable instructions 13A may be stored in the non-volatile memory 13 and may be executed by the processing circuitry 14 using the volatile memory 12 for temporary storage of data or data and instructions. Examples of volatile memory include RAM, DRAM, SDRAM etc. Examples of non-volatile memory include ROM, PROM, EEPROM, flash memory, optical storage, magnetic storage, etc. The memories 11 in general may be referred to as non-transitory computer readable memory media.
The term ‘memory’, in addition to covering memory comprising both non-volatile memory and volatile memory, may also cover one or more volatile memories only, one or more non-volatile memories only, or one or more volatile memories and one or more non-volatile memories.
The computer readable instructions 13A described herein with reference to Figure 8 may be pre-programmed into the automatic video summariser. Alternatively, the computer readable instructions 13A may arrive at the automatic video summariser via an electromagnetic carrier signal or may be copied from a physical entity such as a computer program product, a memory device or a record medium such as a CD-ROM or DVD. The computer readable instructions 13A may provide the logic and routines that enable the automatic video summariser to perform the functionalities described above. For example, the video-to-text module 20, AI attention module 30, the feature extraction module and the word-embedding module may be implemented as computer readable instructions stored on one or more memories, which, when executed by the processor circuitry, cause processing of input data according to embodiments of the invention. The combination of computer-readable instructions stored on memory (of
any of the types described above) may be referred to as a computer program or a computer program product.
Figure 9 illustrates an example of a computer-readable medium 16 with computer-readable instructions (code) stored thereon. The computer-readable instructions (code), when executed by a processor, may cause any one of or any combination of the operations described above to be performed.
As will be appreciated, the automatic video summariser described herein may include various hardware components which may not have been shown in the Figures since they may not have direct interaction with the shown features.
Embodiments may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a “memory” or “computer-readable medium” may be any non-transitory media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
Reference to, where relevant, “computer-readable storage medium”, “computer program product”, “tangibly embodied computer program” etc., or a “processor” or “processing circuitry” etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor, or firmware such as the programmable content of a hardware device, whether instructions for a processor or configuration settings for a fixed-function device, gate array, programmable logic device, etc.
As used in this application, the term ‘circuitry’ refers to all of the following: (a) hardware-only circuit implementations (such as implementations in only analogue
and/or digital circuitry) and (b) to combinations of circuits and software (and/or firmware), such as (as applicable): (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile device or server, to perform various functions, and (c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
This definition of ‘circuitry’ applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term “circuitry” would also cover an implementation of merely a processor (or multiple processors) or a portion of a processor and its (or their) accompanying software and/or firmware. The term “circuitry” would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone, or a similar integrated circuit in a server, a cellular network device, or other network device.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagrams of Figures 4 to 6 are examples only and that various operations depicted therein may be omitted, reordered and/or combined.
Although various aspects are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the appended claims.
Claims (14)
1. A method comprising:
analysing, using a neural network, a text description of an input video and an input question;
causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
2. A method according to claim 1, wherein the attention map is a temporal attention map, and wherein the locations correspond to temporal locations of the attention map having the highest attention value.
3. A method according to claim 1, wherein the attention map is a spatial attention map, and wherein the locations correspond to spatial locations of the attention map having the highest attention value.
4. A method according to claim 1, wherein the attention map is a spatio-temporal attention map, wherein the locations correspond to spatial and temporal locations of the attention map having the highest attention value.
5. A method according to any preceding claim, comprising outputting a video summary having video portions corresponding to the locations of the attention map having the highest attention value.
6. A method according to claim 5, comprising selecting a video portion based on the temporal location of the attention map having the highest attention value and surrounding temporal locations having an attention value above a threshold attention value.
7. A method according to any preceding claim, further comprising converting the input video to the text description.
8. A method according to any preceding claim, further comprising converting the text description and input question respectively to a text description summary vector and a question summary vector.
9. A method according to claim 8, further comprising providing the text description summary vector and the question summary vector to the neural network.
10. A computer program comprising machine readable instructions that, when executed by computing apparatus, causes it to perform the method of any preceding claim.
11. Apparatus configured to perform the method of any of claims 1 to 9.
12. Apparatus comprising:
at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform a method comprising:
analysing, using a neural network, a text description of an input video and an input question;
causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
13. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least:
analysing, using a neural network, a text description of an input video and an input question;
causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
14. Apparatus comprising means for:
analysing, using a neural network, a text description of an input video and an input question;
causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1700265.0A GB2558582A (en) | 2017-01-06 | 2017-01-06 | Method and apparatus for automatic video summarisation |
PCT/FI2018/050001 WO2018127627A1 (en) | 2017-01-06 | 2018-01-02 | Method and apparatus for automatic video summarisation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1700265.0A GB2558582A (en) | 2017-01-06 | 2017-01-06 | Method and apparatus for automatic video summarisation |
Publications (2)
Publication Number | Publication Date |
---|---|
GB201700265D0 GB201700265D0 (en) | 2017-02-22 |
GB2558582A true GB2558582A (en) | 2018-07-18 |
Family
ID=58463740
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1700265.0A Withdrawn GB2558582A (en) | 2017-01-06 | 2017-01-06 | Method and apparatus for automatic video summarisation |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB2558582A (en) |
WO (1) | WO2018127627A1 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8051446B1 (en) * | 1999-12-06 | 2011-11-01 | Sharp Laboratories Of America, Inc. | Method of creating a semantic video summary using information from secondary sources |
US20130081082A1 (en) * | 2011-09-28 | 2013-03-28 | Juan Carlos Riveiro Insua | Producing video bits for space time video summary |
US20150127626A1 (en) * | 2013-11-07 | 2015-05-07 | Samsung Techwin Co., Ltd. | Video search system and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9244924B2 (en) * | 2012-04-23 | 2016-01-26 | Sri International | Classification, search, and retrieval of complex video events |
Also Published As
Publication number | Publication date |
---|---|
GB201700265D0 (en) | 2017-02-22 |
WO2018127627A1 (en) | 2018-07-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |