GB2558582A - Method and apparatus for automatic video summarisation - Google Patents

Method and apparatus for automatic video summarisation

Info

Publication number
GB2558582A
GB2558582A GB1700265.0A GB201700265A GB2558582A GB 2558582 A GB2558582 A GB 2558582A GB 201700265 A GB201700265 A GB 201700265A GB 2558582 A GB2558582 A GB 2558582A
Authority
GB
United Kingdom
Prior art keywords
video
attention
locations
temporal
text description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1700265.0A
Other versions
GB201700265D0 (en)
Inventor
Cricri Francesco
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1700265.0A priority Critical patent/GB2558582A/en
Publication of GB201700265D0 publication Critical patent/GB201700265D0/en
Priority to PCT/FI2018/050001 priority patent/WO2018127627A1/en
Publication of GB2558582A publication Critical patent/GB2558582A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method of creating a video summary, comprising: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value. The attention map may be a temporal attention map in that the locations correspond to temporal locations of the map having the highest attention value, a spatial map where the locations correspond to spatial locations with the highest attention value, or a combination thereof. A summary video may then be output with video portions corresponding to the locations with the highest attention values. The text description summary and input questions may be converted into vectors which can be input into the neural network.

Description

(54) Title of the Invention: Method and apparatus for automatic video summarisation Abstract Title: Method and Apparatus for Automatic Video Summarisation (57) A method of creating a video summary, comprising: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value. The attention map may be a temporal attention map in that the locations correspond to temporal locations of the map having the highest attention value, a spatial map where the locations correspond to spatial locations with the highest attention value, or a combination thereof. A summary video may then be output with video portions corresponding to the locations with the highest attention values. The text description summary and input questions may be converted into vectors which can be input into the neural network.
[Drawing sheets 1/6 to 6/6 (Figures GB2558582A_D0001 to GB2558582A_D0010): flow-chart operations S1000 to S1700, S2100 to S2700 and S3000 to S3800; the automatic video summariser comprising video-text module 20, AI attention module 30, user interface 40 and output 50; angular sections of a spatio-temporal attention map; processing circuitry and output.]
Method and Apparatus for Automatic Video Summarisation
Field
This specification generally relates to automatic video summarisation.
Background
Video summarisation includes producing a video which is smaller in size. Temporal video summarisation includes producing a shorter video. Spatial video summarisation includes producing a video which has less spatial extent than the original. Video summarisation may include detecting events in the video which are relatively more interesting than other events in the video.
Summary
According to a first aspect, the specification describes a method comprising: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
The attention map may be a temporal attention map, wherein the locations correspond to temporal locations of the attention map having the highest attention value.
The attention map may be a spatial attention map, wherein the locations correspond to spatial locations of the attention map having the highest attention value.
The attention map may be a spatio-temporal attention map, wherein the locations correspond to spatial and temporal locations of the attention map having the highest attention value.
The method may further comprise outputting a video summary having video portions corresponding to the locations of the attention map having the highest attention value.
The method may further comprise selecting a video portion based on the temporal location of the attention map having the highest attention value and surrounding temporal locations having an attention value above a threshold attention value.
The method may further comprise converting the input video to the text description.
The method may further comprise converting the text description and input question respectively to a text description summary vector and a question summary vector.
The method may further comprise providing the text description summary vector and the question summary vector to the neural network.
According to a second aspect, the specification describes a computer program comprising machine readable instructions that, when executed by computing apparatus, causes it to perform any method as described with reference to the first aspect.
According to a third aspect, the specification describes an apparatus configured to perform any method as described with reference to the first aspect.
According to a fourth aspect, the specification describes an apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform a method comprising: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
The attention map may be a temporal attention map, wherein the locations correspond to temporal locations of the attention map having the highest attention value.
The attention map may be a spatial attention map, wherein the locations correspond to spatial locations of the attention map having the highest attention value.
The attention map may be a spatio-temporal attention map, wherein the locations correspond to spatial and temporal locations of the attention map having the highest attention value.
The computer program code, when executed, may cause the apparatus to perform: outputting a video summary having video portions corresponding to the locations of the attention map having the highest attention value.
The computer program code, when executed, may cause the apparatus to perform: selecting a video portion based on the temporal location of the attention map having the highest attention value and surrounding temporal locations having an attention value above a threshold attention value.
The computer program code, when executed, may cause the apparatus to perform: converting the input video to the text description.
The computer program code, when executed, may cause the apparatus to perform: converting the text description and input question respectively to a text description summary vector and a question summary vector.
The computer program code, when executed, may cause the apparatus to perform: providing the text description summary vector and the question summary vector to the neural network.
According to a fifth aspect, the specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causes performance of at least: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
According to a sixth aspect, there is provided an apparatus comprising means for: analysing, using a neural network, a text description of an input video and an input question; causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
Brief Description of the Figures
For a more complete understanding of the methods, apparatuses and computer-readable instructions described herein, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
Figure 1 is a schematic illustration of an automatic video summariser, according to embodiments of this specification;
Figure 2 is a schematic illustration of temporal video summarisation according to embodiments of this specification;
Figure 3 is a schematic illustration of spatial video summarisation according to embodiments of this specification;
Figure 4 is a flow chart illustrating various operations which may be performed by the automatic video summariser in order to convert video to a text description according to embodiments of this specification;
Figure 5 is a flow chart illustrating various operations which may be performed by the automatic video summariser in order to produce a video summary based on a user’s question according to embodiments of this specification;
Figure 6 is a flow chart illustrating operations which may be performed by the automatic video summariser in order to produce a spatio-temporal attention map according to embodiments of this specification;
Figure 7 illustrates an example of a spatio-temporal attention map produced by the automatic video summariser according to embodiments of this specification;
Figure 8 is a schematic illustration of an example configuration of the automatic video summariser according to embodiments of this specification;
Figure 9 is a computer-readable memory medium upon which computer-readable code may be stored, according to embodiments of this specification.
Detailed Description
In the description and drawings, like reference numerals may refer to like elements throughout.
Figure 1 is a schematic illustration of an automatic video summariser 10. The automatic video summariser 10 described herein makes use of neural networks in order to produce spatio-temporal summaries including visual information relevant to a user's question or request. In this way, the events in the video which are considered to be relevant to the user's question are determined and video portions showing these events can be output as a spatio-temporal summary for the user.
The automatic video summariser 10 comprises a video-to-text module 20, an artificial intelligence (AI) attention module 30, a user interface 40 for receiving a user input, and an output 50, which may be a display, for example. The AI attention module may use deep learning methods such as attention mechanisms, neural attention mechanisms, or one or more neural networks outputting attention weights.
Figure 2 is a schematic illustration of temporal video summarisation. In temporal summarisation, the size of an input video 100 made up of video frames 100a-100i is reduced by producing a video summary with a shorter time duration. A number of frames may be extracted from a video 100 formed of frames 100a-i. For example, frames 100a, 100b, 100e, 100f, 100h and 100i may be extracted and joined temporally one after the other, maintaining the temporal order intact. The output video summary would comprise video portion 101 made up of frames 100a-b, video portion 102 made up of frames 100e-f, and video portion 103 made up of frames 100h-i. Accordingly, the summary will be a video having fewer frames than the input video. The portions may be made up of any number of frames. A portion may contain a different number of frames to the other portions. The temporal portions may be determined based on events occurring in the video. For example, a temporal portion may relate to one specific event occurring in the video. Selection of the temporal portions of the video may be performed as described in more detail with reference to Figures 4 to 7.
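By way of illustration only, the temporal assembly described above can be sketched as follows (a minimal Python example; the frame array, frame indices and portion boundaries are hypothetical and chosen to mirror the frames 100a-100i of Figure 2):

    import numpy as np

    # Hypothetical input video 100: nine frames (100a..100i) of 240x320 RGB.
    video = np.random.randint(0, 256, size=(9, 240, 320, 3), dtype=np.uint8)

    # Temporal portions 101, 102, 103 expressed as (start, end) frame indices.
    portions = [(0, 1), (4, 5), (7, 8)]    # 100a-b, 100e-f, 100h-i

    # Join the selected frames one after the other, keeping the temporal order intact.
    summary = np.concatenate([video[start:end + 1] for start, end in portions], axis=0)

    assert summary.shape[0] == 6           # the summary has fewer frames than the input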
The video 100 may be a virtual reality video, for example a 360 degree video shot by a camera having a 360 degree field of view, such as the Nokia OZO camera. An example of a frame 110 from a virtual reality video can be seen in Figure 3. The video may include multiple events in different spatial sectors of the video. The video may therefore be spatially summarised. A spatial video summary is a video comprising video crops, i.e. spatial video portions extracted from the original video by cropping spatially. Figure 3 illustrates spatial crops 111, 112, and 113. In spatial summarisation, the size of the video crops may be the same for all crops. In embodiments where the crops are not the same size, a resizing step may be applied to increase the resolution of at least one video crop. Increasing the resolution may be performed, for example, by upsampling with or without interpolation. Increasing the resolution may also be performed by using neural super-resolution methods. Alternatively, the resizing step may involve decreasing the resolution of at least one video crop. Decreasing the resolution may be performed, for example, by down-sampling of the video crop.
Selection of the spatial portions of the video may be performed as described in more detail with reference to Figures 4 to 7.
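As a minimal illustration of spatial cropping and resizing of a video crop (Python with NumPy and PyTorch assumed available; the frame size, crop coordinates and output size are hypothetical):

    import numpy as np
    import torch
    import torch.nn.functional as F

    # Hypothetical 360 degree frame 110: 480 x 1920 pixels, 3 channels.
    frame = np.random.rand(480, 1920, 3).astype(np.float32)

    # Extract a video crop (e.g. crop 112) by slicing one horizontal angular sector.
    crop = frame[:, 640:1280, :]

    # Resize the crop to a common output size, here by bilinear interpolation.
    t = torch.from_numpy(crop).permute(2, 0, 1).unsqueeze(0)      # (1, C, H, W)
    resized = F.interpolate(t, size=(480, 854), mode="bilinear", align_corners=False)
    print(resized.shape)                                          # torch.Size([1, 3, 480, 854])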
By performing both temporal and spatial summarisation, a spatio-temporal video summary can be produced. For example, the video 100 may be a full length 360 degree movie. The movie may include multiple events temporally and multiple events spatially.
Figure 4 is a flow chart illustrating various operations which may be performed by the automatic video summariser in order to convert video to a text description. In some embodiments, not all of the illustrated operations need to be performed. Operations may also be performed in a different order compared to the order presented in Figure 4.
In operation S1000 the automatic video summariser may receive an input video from a video source. The video may be a video extract, or it may be a full length movie. The video may be provided from any suitable video source. For example, the video may be stored on a storage medium such as a DVD, Blu-ray disc, hard drive, or any other suitable storage medium. Alternatively, the video may be obtained via streaming or download from an external server.
In operation S1100 an input video is analysed by a feature extraction module. The feature extraction module may comprise a Convolutional Neural Network (CNN). A CNN is an artificial neural network which currently represents the state of the art for performing feature extraction from images and videos. A CNN consists of a sequence of computation layers, where the input is the data (a video frame or an image) and the output is a feature vector, i.e. a vector describing the input image. There may be different types of computation layers in a CNN, but the most important is the convolutional layer. A convolutional layer performs a convolution operation on its input using a set of convolution kernels. Other types of computation layers present in a CNN may be pooling layers, non-linear activation function layers, batch-normalization layers, etc. However, the present invention is not limited to a CNN and other feature extraction methodologies may be utilized.
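Purely as an illustrative sketch of such a feature extraction module (PyTorch; the layer sizes are arbitrary assumptions, and in practice a pre-trained CNN would typically be used):

    import torch
    import torch.nn as nn

    class FrameFeatureExtractor(nn.Module):
        """Maps a batch of video frames to one feature vector per frame."""
        def __init__(self, feature_dim=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),   # convolutional layer
                nn.BatchNorm2d(32),                                     # batch-normalization layer
                nn.ReLU(),                                              # non-linear activation
                nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),                                # pooling layer
            )
            self.fc = nn.Linear(64, feature_dim)

        def forward(self, frames):            # frames: (batch, 3, H, W)
            x = self.conv(frames).flatten(1)
            return self.fc(x)                 # (batch, feature_dim)

    features = FrameFeatureExtractor()(torch.rand(9, 3, 224, 224))   # one feature vector per frame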
In operation S1200, the features extracted in operation S1100 may be input to a temporal neural network. The temporal neural network may comprise a Recurrent Neural Network (RNN). A suitable RNN may be, for example, a Long Short-Term Memory network (LSTM).
In operation S1300, the temporal network outputs a “frame-description” vector for each input video-frame. The frame-description vector corresponds to a description of the video-frame. The frame-description vector may be used for generating a sentence or phrase describing the video frame, represented by a vector of real numbers.
In operation S1400, the frame-description vectors may be analysed by a second RNN. The second RNN may also be an LSTM network, or any other suitable temporal neural network.
The second RNN generates a set of characters, or words, describing the input video-frame. As such, a vector comprising a set of sentences describing the whole video is output.
In operation S1500, a softmax function is applied to the vector output by the second RNN as a result of operation S1400. This indicates the distribution of the words corresponding to the extracted features throughout the video. The vector which is output may be referred to as a “text description vector”.
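A highly simplified sketch of operations S1200 to S1500 (PyTorch; the feature, hidden and vocabulary sizes are assumptions, and for brevity only one word is produced per frame by greedy selection):

    import torch
    import torch.nn as nn

    feature_dim, hidden, vocab_size = 256, 512, 10000
    frame_features = torch.rand(1, 9, feature_dim)            # (batch, frames, feature_dim)

    # S1200/S1300: temporal RNN producing one frame-description vector per video-frame.
    temporal_rnn = nn.LSTM(feature_dim, hidden, batch_first=True)
    frame_descriptions, _ = temporal_rnn(frame_features)       # (1, 9, hidden)

    # S1400/S1500: second RNN followed by a softmax over the vocabulary.
    second_rnn = nn.LSTM(hidden, hidden, batch_first=True)
    decoded, _ = second_rnn(frame_descriptions)
    to_vocab = nn.Linear(hidden, vocab_size)
    word_probs = torch.softmax(to_vocab(decoded), dim=-1)      # distribution over words
    word_indices = word_probs.argmax(dim=-1)                   # (1, 9): one word index per frame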
In operation S1600, an index synchronisation is performed. In order to determine the temporal locations of the features within the video, the text description is synchronised with the video. This includes associating each word or character with a certain video frame. A word or character may be associated with several adjacent video frames.
The association of the words or characters with corresponding video frames can be achieved by outputting a video-frame index for each word or character, corresponding to the index of the frame which is described by those words or characters. For example, in one case, one word may be associated with multiple adjacent frames.
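As a minimal illustration of the index synchronisation of operation S1600 (plain Python; the words and frame indices are hypothetical):

    # Hypothetical output: each word of the text description paired with the index of
    # the video frame it describes; one word may map to several adjacent frames.
    words = ["a", "car", "drives", "away", "a", "car", "crashes"]
    frame_index_per_word = [0, 0, 1, 2, 6, 6, 7]

    # Words describing frame 6:
    words_for_frame_6 = [w for w, i in zip(words, frame_index_per_word) if i == 6]
    print(words_for_frame_6)        # ['a', 'car']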
In operation S1700, the automatic video summariser outputs a text description of the video associated with corresponding time indexes.
However, it will be recognised that any suitable implementation of a video to text module 20 can be utilised.
Figure 5 is a flow chart illustrating various operations which may be performed by the automatic video summariser in order to produce a spatio-temporal summary of an input video. In some embodiments, not all of the illustrated operations need to be performed. Operations may also be performed in a different order compared to the order presented in Figure 5.
In operation S2000 an input video is received.
In operation S2100, the video is converted to text, for example as described with reference to Figure 4. However, it will be understood that any suitable video to text conversion may be used.
In operation S2200, the automatic video summariser outputs text descriptions of the video.
In operation S2300 the automatic video summariser receives a user question or request. The question or request is input in, or converted into, a text format. The question or request may relate to information the user would like to know about the input video. For example, the user may wish to find out whether there are any car crashes in the video. Therefore, the user may input a question such as “were there any car crashes in this movie?”, or a request such as “would you summarise all the romantic scenes from the movie”. The interface may be configured such that the user can input the question or request through user interface 40, for example by typing on a keyboard or on a touchscreen device connected to the automatic video summariser. Alternatively, the question or request may be spoken by the user and converted into text by voice recognition software.
In operation S2400, the text question (or request) and the text descriptions of the video are input into an artificial intelligence (AI) attention module 30, which may comprise one or more neural networks, for example attention neural networks, and/or other operations which produce an “attention vector”. The text question and text description are analysed by the AI attention module. The question may be analysed before being input into the AI attention module 30. An example of how the question may be analysed is described in more detail with reference to Figure 6.
In operation S2500, the AI attention module 30 produces a spatio-temporal attention map representing the attention-intensity that a neural network has put at each point in time and spatial region when trying to answer the user's question.
In step S2600, the automatic video summariser retrieves the spatial and temporal portions of the input video corresponding to the temporal and spatial locations of the spatio-temporal attention map having the highest attention-intensity values.
In step S2700, the automatic video summariser outputs the selected video portions as a spatio-temporal video summary.
Figure 6 is a flow chart illustrating in more detail the steps involved in producing the spatio-temporal attention map used to produce the spatio-temporal video summary. In some embodiments, not all of the illustrated operations need to be performed. Operations may also be performed in a different order compared to the order presented in Figure 6.
In operation S3000, the text descriptions output as a result of operation S1700 of Figure 4 are input to a word-embedding module.
In operation S3100, the word-embedding module converts the text descriptions to a set of dense vectors. Each of the dense vectors may represent a single word with a plurality of real numbers. The words in the text description are each converted from a vocabulary representation to a vector of real numbers. The vector of real numbers may be of lower dimensionality than the input vector of vocabulary entries, for example a vector with fewer dimensions or axes. The new representation is a point in an “embedding space”, where words with similar semantics are nearby. The word-embedding module may be implemented by a multi-layer perceptron network or alternatively a single fully-connected layer. In general, the word-embedding module may transform an input into a more convenient output representation. For example, words may be transformed into a new representation for which similar words lie close to each other in the new representation space.
In operation S3200, the text description vectors are input to an RNN where the vectors are analysed. The RNN outputs a single output vector, which will be referred to herein as a text description summary vector. The RNN may be an LSTM.
In operation S3300, the question is input to a word-embedding module.
In operation S3400, the word-embedding module converts the question to a set of dense vectors. The words in the question are each converted from a vocabulary representation to a vector of real numbers with lower dimensionality, in a similar way to the text descriptions in operation S3100.
In operation S3500, the question vectors are input into an RNN where the vectors are analysed. The RNN outputs a single output vector which summarises the question, which will be referred to herein as a question summary vector. The RNN may be an LSTM.
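An illustrative sketch of operations S3000 to S3500 (PyTorch; the vocabulary, embedding and hidden sizes are assumptions, and an embedding lookup stands in for the word-embedding module): the same embedding-plus-LSTM summariser can be applied to both the text description and the question.

    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden = 10000, 128, 256
    embedding = nn.Embedding(vocab_size, embed_dim)     # word-embedding module
    summariser = nn.LSTM(embed_dim, hidden, batch_first=True)

    def summarise(token_ids):
        """Return a single summary vector for a sequence of word indices."""
        dense = embedding(token_ids)                    # (1, words, embed_dim) dense vectors
        _, (h_n, _) = summariser(dense)                 # final hidden state summarises the sequence
        return h_n[-1]                                  # (1, hidden)

    description_summary = summarise(torch.randint(0, vocab_size, (1, 40)))
    question_summary = summarise(torch.randint(0, vocab_size, (1, 8)))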
In operation S3600, the text description summary vector and question summary vector are combined. The combination operation may be a concatenation in one of the dimensions of the input vectors, or an element-wise addition (if the input vectors have the same dimensionality). However, any suitable combination operation may be used at this step.
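For illustration only (PyTorch; the 512-dimensional summary vectors are hypothetical), the two combination options mentioned above are:

    import torch

    description_summary = torch.rand(1, 512)
    question_summary = torch.rand(1, 512)

    # Concatenation in one of the dimensions of the input vectors.
    combined_concat = torch.cat([description_summary, question_summary], dim=1)   # (1, 1024)

    # Element-wise addition (possible because both vectors have the same dimensionality).
    combined_sum = description_summary + question_summary                          # (1, 512)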
In operation S3700, the concatenated summary vectors are provided to a multi-layer perceptron (MLP) neural network. The MLP neural network may be referred to as an “attention neural network”. The MLP is a neural network comprising a set of dense (i.e. fully connected) layers, followed by a softmax layer.
The dense layers of the MLP learn how to map the concatenated word-embedded text descriptions and user questions to an attention vector. The mapping is learned from data via a training process which is performed offline, and end-to-end for the whole model proposed in this invention. The input data comprises videos and a set of questions for each video, and the ground-truth output is the video segments which form the target video summary. The attention vector is in practice a set of attention weights (i.e. real numbers), summing up to 1, where each attention weight is associated with a certain temporal location of the video.
The softmax layer will output a probability distribution over “temporal attention weights” w.
The size of the output vector (i.e. the number of weights w) is the number of temporal locations, which is the number of words in the text describing the input video. In an alternative implementation, the size of the output vector is less than the number of words in the video description, and thus an attention weight can refer to more than one word. This would be a case where the attention is “quantized”.
The weights represent a 1-dimensional “temporal attention map” (TAM), having bins which each correspond to a temporal location, each bin having the value of the attention weight associated with that temporal location. The TAM value at location t, TAM[t], represents the attention-intensity that the attention neural network has put at that point in time when trying to answer the user's question.
The temporal locations associated with each bin correspond to temporal locations of the input video. The attention weights output by the MLP are a vector of N bins, where N is the number of total temporal locations of the video. Therefore, the attention weights correspond to words of the text description and are arranged in the same temporal order as the temporal order of the words of the text description of the video. Accordingly, temporal synchronisation is achieved based on the temporal location of the attention weights and the corresponding words of the text description. The dimensionality of the vector output by the MLP is determined automatically based on the number of words of the text description created by the video-to-text module 20.
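A minimal sketch of such an attention neural network (PyTorch; the layer widths and the number N of temporal locations are arbitrary assumptions, and in a real system the weights would be learned end-to-end as described above):

    import torch
    import torch.nn as nn

    combined_dim, n_temporal_locations = 1024, 40        # N = number of words in the description
    attention_mlp = nn.Sequential(
        nn.Linear(combined_dim, 512),                    # dense (fully connected) layers
        nn.ReLU(),
        nn.Linear(512, n_temporal_locations),
        nn.Softmax(dim=-1),                              # attention weights sum to 1
    )

    combined = torch.rand(1, combined_dim)               # concatenated summary vectors
    temporal_attention_map = attention_mlp(combined)     # (1, N) temporal attention weights
    assert torch.allclose(temporal_attention_map.sum(), torch.tensor(1.0))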
In operation S3800, the attention neural network outputs the probability distribution over attention weights which can be represented as the temporal attention map. A temporal location t* of the TAM corresponding to the highest attention value in the TAM indicates the temporal location of the video which answers the user’s question.
The temporal extent of the video portion is determined based on the temporal extent of the attention values around t*. For example, a threshold on the attention weight values may determine the temporal boundaries of the video portion to extract. That is, the video portion is selected based on temporal locations of attention weights above a given threshold. However, the temporal extent of the video portion may be selected in any other suitable way.
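As an illustration of selecting the temporal location t* and its extent (NumPy; the attention values and the threshold are hypothetical):

    import numpy as np

    tam = np.array([0.02, 0.03, 0.05, 0.20, 0.35, 0.25, 0.06, 0.04])  # hypothetical TAM
    threshold = 0.1

    t_star = int(np.argmax(tam))                 # temporal location with the highest attention
    start = end = t_star
    while start > 0 and tam[start - 1] > threshold:          # grow the segment backwards
        start -= 1
    while end < len(tam) - 1 and tam[end + 1] > threshold:   # grow the segment forwards
        end += 1
    print(start, t_star, end)                    # 3 4 5 -> locations 3..5 form the video portion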
Figure 7 illustrates an example of a spatio-temporal attention map (STAM) produced by the automatic video summariser. The STAM represents the attention weights corresponding to each temporal location and spatial region of the video.
The TAM is extended to the spatial domain by analysing the video separately in the spatial dimension. For example, the video may be divided into a given number of angular sectors. Each sector is analysed separately by several attention networks. The joint output of the attention networks is a 2-dimensional attention map, or “spatio-temporal attention map” (STAM).
The STAM is output as a matrix indexed using two indices, one for the time (t), and one for the space (the angular sector s). In Figure 7, the time runs along the x-axis of the map, and the space runs along the y-axis. In order to answer the user's question, the video portion (i.e. the particular temporal location and extent, and the spatial crop) will be determined by the highest value of attention within the STAM matrix. The video portion is based on the temporal location t* and angular sector s* having the highest attention value in the STAM matrix.
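As a minimal illustration (NumPy; the map dimensions and values are hypothetical), the temporal location t* and angular sector s* can be located in the STAM matrix as:

    import numpy as np

    n_sectors, n_temporal = 8, 40
    stam = np.random.rand(n_sectors, n_temporal)     # rows: angular sectors, columns: time
    stam /= stam.sum()                               # attention values over the whole map

    s_star, t_star = np.unravel_index(np.argmax(stam), stam.shape)
    print(t_star, s_star)                            # indices of the highest attention value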
The video may be divided into a number of predetermined sectors. Alternatively, the division of the video may be dynamic. For example, the automatic video summariser may divide the video by means of “spatial scene cut detection”. Spatial scene cut detection may be achieved by analysing the video with deep learning or multimedia analysis techniques to detect objects, actions and activities, and then virtually cutting the scene to include the object, action and activity spatially. Therefore, the amount of data needed to analyse and summarise a spatial virtual reality video may be reduced.
Spatial summarisation may be applied to 360 degree videos in order to convert a 360 degree video to a standard-size video. Spatial summarisation may also be performed without any temporal summarisation, if this is desired.
In Figure 7, the temporal and spatial locations determined by the automatic video summariser as answering the user's question are indicated by the indices t1, s1, and t2, s2.
Therefore, the indices may be used to extract the corresponding temporal and spatial portions of the video for output to a user as a spatio-temporal summary of the video, based on the user's question. In this case, the user is provided with two video portions. The first video portion corresponds to the temporal location of the video indicated by indices t1, s1. The temporal extent of the video portion may be determined as described above, for example by setting a threshold attention value for the values temporally adjacent to t1. The second video portion corresponds to the temporal location of the video indicated by indices t2, s2.
Accordingly, by determining the highest attention values, the automatic video summariser is able to output a video summary which is determined to be the most relevant to the user's question or request. The video portions may be output through the output 50. The output may be a display which forms part of the automatic video summariser 10. Alternatively, the automatic video summariser 10 may be configured to output the video portions to a display which does not form part of the automatic video summariser 10, such as a display of a TV or PC, etc. For example, the automatic video summariser may be located on a server which is separate to the display through which the video portions are output. The automatic video summariser may be configured to output indicators of temporal locations of a video to be played in a video summary.
Figure 8 is a schematic block diagram of an example configuration of an automatic video summariser such as that described with reference to Figures 1 to 7. The video summariser may comprise memory and processing circuitry. The memory 11 may comprise any combination of different types of memory. In the example of Figure 8, the memory comprises one or more read-only memory (ROM) media 13 and one or more random access memory (RAM) memory media 12. The processing circuitry 14 may be configured to process an input video and user question as described with reference to Figures 1 to 7.
The memory described with reference to Figure 8 may have computer readable instructions 13A stored thereon, which when executed by the processing circuitry 14 cause the processing circuitry 14 to cause performance of various ones of the operations described above. The processing circuitry 14 described above with reference to Figure 8 may be of any suitable composition and may include one or more processors 14A of any suitable type or suitable combination of types. For example, the processing circuitry 14 may be a programmable processor that interprets computer program instructions and processes data. The processing circuitry 14 may include plural programmable processors. Alternatively, the processing circuitry 14 may be, for example, programmable hardware with embedded firmware. The processing circuitry 14 may be termed processing means. The processing circuitry 14 may alternatively or additionally include one or more Application Specific Integrated Circuits (ASICs). In some instances, processing circuitry 14 may be referred to as computing apparatus.
The processing circuitry 14 described with reference to Figure 8 is coupled to the memory 11 (or one or more storage devices) and is operable to read/write data to/from the memory. The memory may comprise a single memory unit or a plurality of memory units 13 upon which the computer readable instructions 13A (or code) is stored. For example, the memory 11 may comprise both volatile memory 12 and non-volatile memory 13. For example, the computer readable instructions 13A may be stored in the non-volatile memory 13 and may be executed by the processing circuitry 14 using the volatile memory 12 for temporary storage of data or data and instructions. Examples of volatile memory include RAM, DRAM, and SDRAM etc. Examples of non-volatile memory include ROM, PROM, EEPROM, flash memory, optical storage, magnetic storage, etc. The memories 11 in general may be referred to as non-transitory computer readable memory media.
The term ‘memory’, in addition to covering memory comprising both non-volatile memory and volatile memory, may also cover one or more volatile memories only, one or more non-volatile memories only, or one or more volatile memories and one or more non-volatile memories.
The computer readable instructions 13A described herein with reference to Figure 8 may be pre-programmed into the automatic video summariser. Alternatively, the computer readable instructions 13A may arrive at the automatic video summariser via an electromagnetic carrier signal or may be copied from a physical entity such as a computer program product, a memory device or a record medium such as a CD-ROM or DVD. The computer readable instructions 13A may provide the logic and routines that enable the automatic video summariser to perform the functionalities described above. For example, the video-text module 20, AI attention module 30, the feature extraction module, and the word-embedding module may be implemented as computer readable instructions stored on one or more memories, which, when executed by the processing circuitry, cause processing of input data according to embodiments of the invention. The combination of computer-readable instructions stored on memory (of any of the types described above) may be referred to as a computer program or a computer program product.
Figure 9 illustrates an example of a computer-readable medium 16 with computer-readable instructions (code) stored thereon. The computer-readable instructions (code), when executed by a processor, may cause any one of or any combination of the operations described above to be performed.
As will be appreciated, the automatic video summariser described herein may include various hardware components which may not have been shown in the Figures since they may not have direct interaction with the shown features.
Embodiments may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a “memory” or “computer-readable medium” may be any non-transitory media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
Reference to, where relevant, “computer-readable storage medium”, “computer program product”, “tangibly embodied computer program” etc., or a “processor” or “processing circuitry” etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware, such as the programmable content of a hardware device, whether as instructions for a processor or as configuration settings for a fixed function device, gate array, programmable logic device, etc.
As used in this application, the term ‘circuitry’ refers to all of the following: (a) hardware-only circuit implementations (such as implementations in only analogue
and/or digital circuitry) and (b) to combinations of circuits and software (and/or firmware), such as (as applicable): (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile device or server, to perform various functions) and (c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
This definition of “circuitry” applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term “circuitry” would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware. The term “circuitry” would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or other network device.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagrams of Figures 4 to 6 are examples only and that various operations depicted therein may be omitted, reordered and/or combined.
Although various aspects are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the appended claims.

Claims (14)

Claims
1. A method comprising:
analysing, using a neural network, a text description of an input video and an input question;
causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
2. A method according to claim 1, wherein the attention map is a temporal attention map, and wherein the locations correspond to temporal locations of the attention map having the highest attention value.
3. A method according to claim 1, wherein the attention map is a spatial attention map, and wherein the locations correspond to spatial locations of the attention map having the highest attention value.
4. A method according to claim 1, wherein the attention map is a spatio-temporal attention map, wherein the locations correspond to spatial and temporal locations of the attention map having the highest attention value.
5. A method according to any preceding claim, comprising outputting a video summary having video portions corresponding to the locations of the attention map having the highest attention value.
6. A method according to claim 5, comprising selecting a video portion based on the temporal location of the attention map having the highest attention value and surrounding temporal locations having an attention value above a threshold attention value.
7. A method according to any preceding claim, further comprising converting the input video to the text description.
8. A method according to any preceding claim, further comprising converting the text description and input question respectively to a text description summary vector and a question summary vector.
9. A method according to claim 8, further comprising providing the text description summary vector and the question summary vector to the neural network.
10. A computer program comprising machine readable instructions that, when executed by computing apparatus, causes it to perform the method of any preceding claim.
11. Apparatus configured to perform the method of any of claims 1 to 9.
12. Apparatus comprising:
at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform a method comprising:
analysing, using a neural network, a text description of an input video and an input question;
causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
13. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by the at least one processor, causes performance of at least:
analysing, using a neural network, a text description of an input video and an input question;
causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
14. Apparatus comprising means for:
analysing, using a neural network, a text description of an input video and an input question;
causing production of an attention map based on the text description and the input question; and determining locations of the attention map having the highest attention value.
GB1700265.0A 2017-01-06 2017-01-06 Method and apparatus for automatic video summarisation Withdrawn GB2558582A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1700265.0A GB2558582A (en) 2017-01-06 2017-01-06 Method and apparatus for automatic video summarisation
PCT/FI2018/050001 WO2018127627A1 (en) 2017-01-06 2018-01-02 Method and apparatus for automatic video summarisation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1700265.0A GB2558582A (en) 2017-01-06 2017-01-06 Method and apparatus for automatic video summarisation

Publications (2)

Publication Number Publication Date
GB201700265D0 GB201700265D0 (en) 2017-02-22
GB2558582A true GB2558582A (en) 2018-07-18

Family

ID=58463740

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1700265.0A Withdrawn GB2558582A (en) 2017-01-06 2017-01-06 Method and apparatus for automatic video summarisation

Country Status (2)

Country Link
GB (1) GB2558582A (en)
WO (1) WO2018127627A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2019239454B2 (en) 2018-03-22 2021-12-16 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method and system for retrieving video temporal segments
CN109413448A (en) * 2018-11-05 2019-03-01 中山大学 Mobile device panoramic video play system based on deeply study
CN109871124B (en) * 2019-01-25 2020-10-27 华南理工大学 Emotion virtual reality scene evaluation method based on deep learning
CN109889923B (en) * 2019-02-28 2021-03-26 杭州一知智能科技有限公司 Method for summarizing videos by utilizing layered self-attention network combined with video description
CN109919114A (en) * 2019-03-14 2019-06-21 浙江大学 One kind is based on the decoded video presentation method of complementary attention mechanism cyclic convolution
US11568247B2 (en) * 2019-03-22 2023-01-31 Nec Corporation Efficient and fine-grained video retrieval
CN110267051B (en) * 2019-05-16 2021-09-14 北京奇艺世纪科技有限公司 Data processing method and device
CN110414377B (en) * 2019-07-09 2020-11-13 武汉科技大学 Remote sensing image scene classification method based on scale attention network
CN110933518B (en) * 2019-12-11 2020-10-02 浙江大学 Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism
CN111241410B (en) * 2020-01-22 2023-08-22 深圳司南数据服务有限公司 Industry news recommendation method and terminal
CN112016493B (en) * 2020-09-03 2024-08-23 科大讯飞股份有限公司 Image description method, device, electronic equipment and storage medium
CN112261491B (en) * 2020-12-22 2021-04-16 北京达佳互联信息技术有限公司 Video time sequence marking method and device, electronic equipment and storage medium
CN113343821B (en) * 2021-05-31 2022-08-30 合肥工业大学 Non-contact heart rate measurement method based on space-time attention network and input optimization
CN115334367B (en) * 2022-07-11 2023-10-17 北京达佳互联信息技术有限公司 Method, device, server and storage medium for generating abstract information of video
CN116089654B (en) * 2023-04-07 2023-07-07 杭州东上智能科技有限公司 Audio supervision-based transferable audio-visual text generation method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8051446B1 (en) * 1999-12-06 2011-11-01 Sharp Laboratories Of America, Inc. Method of creating a semantic video summary using information from secondary sources
US20130081082A1 (en) * 2011-09-28 2013-03-28 Juan Carlos Riveiro Insua Producing video bits for space time video summary
US20150127626A1 (en) * 2013-11-07 2015-05-07 Samsung Techwin Co., Ltd. Video search system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9244924B2 (en) * 2012-04-23 2016-01-26 Sri International Classification, search, and retrieval of complex video events

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8051446B1 (en) * 1999-12-06 2011-11-01 Sharp Laboratories Of America, Inc. Method of creating a semantic video summary using information from secondary sources
US20130081082A1 (en) * 2011-09-28 2013-03-28 Juan Carlos Riveiro Insua Producing video bits for space time video summary
US20150127626A1 (en) * 2013-11-07 2015-05-07 Samsung Techwin Co., Ltd. Video search system and method

Also Published As

Publication number Publication date
GB201700265D0 (en) 2017-02-22
WO2018127627A1 (en) 2018-07-12

Similar Documents

Publication Publication Date Title
GB2558582A (en) Method and apparatus for automatic video summarisation
CN108763325B (en) A kind of network object processing method and processing device
CN109740670B (en) Video classification method and device
CN108307229B (en) Video and audio data processing method and device
JP7537060B2 (en) Information generation method, device, computer device, storage medium, and computer program
US9646227B2 (en) Computerized machine learning of interesting video sections
CN109218629B (en) Video generation method, storage medium and device
EP3992924A1 (en) Machine learning based media content annotation
US20170065889A1 (en) Identifying And Extracting Video Game Highlights Based On Audio Analysis
CN109819338A (en) A kind of automatic editing method, apparatus of video and portable terminal
US20170300752A1 (en) Method and system for summarizing multimedia content
US10665267B2 (en) Correlation of recorded video presentations and associated slides
US10768887B2 (en) Electronic apparatus, document displaying method thereof and non-transitory computer readable recording medium
CN114390217B (en) Video synthesis method, device, computer equipment and storage medium
CN111813998B (en) Video data processing method, device, equipment and storage medium
US20200380690A1 (en) Image processing method, apparatus, and storage medium
CN115496820A (en) Method and device for generating image and file and computer storage medium
CN113906437A (en) Improved face quality of captured images
CN115119014A (en) Video processing method, and training method and device of frame insertion quantity model
CN110418148A (en) Video generation method, video generation device and readable storage medium
US20150111189A1 (en) System and method for browsing multimedia file
US11823433B1 (en) Shadow removal for local feature detector and descriptor learning using a camera sensor sensitivity model
CN113255423A (en) Method and device for extracting color scheme from video
CN116389849A (en) Video generation method, device, equipment and storage medium
CN116528015A (en) Digital human video generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)