WO2020077999A1 - Video summary generation method and apparatus, electronic device, and computer storage medium - Google Patents

Video summary generation method and apparatus, electronic device, and computer storage medium

Info

Publication number
WO2020077999A1
WO2020077999A1 · PCT/CN2019/088020 · CN2019088020W
Authority
WO
WIPO (PCT)
Prior art keywords
lens
feature
global
video
shot
Prior art date
Application number
PCT/CN2019/088020
Other languages
English (en)
French (fr)
Inventor
冯俐铜
肖达
旷章辉
张伟
Original Assignee
深圳市商汤科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市商汤科技有限公司
Priority to SG11202003999QA (SG11202003999QA/en)
Priority to JP2020524009A (JP7150840B2/ja)
Publication of WO2020077999A1 (WO2020077999A1/zh)
Priority to US16/884,177 (US20200285859A1/en)

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • The present application relates to, but is not limited to, computer vision technology, and in particular to a video summary generation method and apparatus, an electronic device, and a computer storage medium.
  • Video summarization is an emerging video understanding technology: it extracts shots from a long video to synthesize a shorter new video that contains the story line or highlight shots of the original video.
  • the embodiments of the present application provide a method and an apparatus for generating a video summary, an electronic device, and a computer storage medium.
  • A method for generating a video summary includes: performing feature extraction on the shots in a shot sequence of a video stream to be processed to obtain image features of each shot, where each shot includes at least one frame of video image; obtaining global features of the shots according to the image features of all the shots; determining the weight of each shot according to its image features and the global features; and obtaining a video summary of the to-be-processed video stream based on the weights of the shots.
  • An apparatus for generating a video summary includes: a feature extraction unit configured to perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each of the shots, where each of the shots includes at least one frame of video image;
  • a global feature unit configured to obtain global features of the shots based on the image features of all the shots;
  • a weight acquisition unit configured to determine the weight of each shot according to the image features of the shot and the global features;
  • a summary generating unit configured to obtain a video summary of the to-be-processed video stream based on the weights of the shots.
  • An electronic device includes a processor, where the processor includes the video summary generating apparatus according to any one of the above.
  • Another electronic device includes: a memory for storing executable instructions;
  • and a processor configured to communicate with the memory to execute the executable instructions so as to complete the operations of any one of the video summary generation methods described above.
  • A computer program product includes computer readable code, where, when the computer readable code runs on a device, a processor in the device executes instructions for implementing the video summary generation method described in any one of the above.
  • Feature extraction is performed on the shots in the shot sequence of a video stream to be processed to obtain image features of each shot, where each shot includes at least one frame of video image; the global features of the shots are obtained according to the image features of all the shots; the weight of each shot is determined according to its image features and the global features; and a video summary of the video stream to be processed is obtained based on the weights of the shots.
  • Determining the weight of each shot by combining its image features with the global features realizes understanding of the video from the perspective of the video as a whole, so the video summary determined from the shot weights in this embodiment can express the video content as a whole, reducing the problem of a one-sided video summary.
  • FIG. 1 is a schematic flowchart of one embodiment of the video summary generation method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of another embodiment of the video summary generation method provided by an embodiment of the present application.
  • FIG. 3 is a partial flowchart of an optional example of the video summary generation method provided by an embodiment of the present application.
  • FIG. 4 is a partial flowchart of another optional example of the video summary generation method provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of still another embodiment of the video summary generation method provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of some optional examples of the video summary generation method provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of yet another embodiment of the video summary generation method provided by an embodiment of the present application.
  • FIG. 8 is a partial flowchart of another optional example of the video summary generation method provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of one embodiment of the video summary generation apparatus provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of an electronic device suitable for implementing a terminal device or a server according to an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of one embodiment of the video summary generation method provided by an embodiment of the present application. This method can be performed by any video summary extraction device, such as a terminal device, a server, or a mobile device. As shown in FIG. 1, the method in this embodiment includes:
  • Step 110 Perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each shot.
  • A video summary is obtained by extracting key information or subject information from the original video stream; the resulting summary is smaller than the original video data stream while still covering the subject content or key content of the original video stream, and can be used for subsequent retrieval of the original video stream.
  • For example, a video summary representing the movement trajectory of the same target in the video stream may be generated. This is only an example, and specific implementations are not limited to it.
  • The video stream to be processed is a video stream from which a video summary is to be obtained, and it includes at least one frame of video image.
  • The embodiments of the present application use the shot as the constituent unit of the video summary, and each shot includes at least one frame of video image.
  • Feature extraction in the embodiments of the present application may be implemented based on any feature extraction network: feature extraction is performed for each shot separately based on the feature extraction network to obtain at least two image features. The present application does not limit the specific feature extraction process.
  • Step 120: Obtain the global features of each shot according to the image features of all the shots.
  • All image features corresponding to the video stream are processed (for example, by mapping or embedding) to obtain a transformed feature sequence corresponding to the video stream as a whole; the transformed feature sequence is then combined with each image feature to obtain a global feature (global attention) corresponding to each shot. The global feature reflects the association between each shot and the other shots in the video stream.
  • The global features include, but are not limited to, image features that characterize the correspondence or positional relationship of the same image element across multiple video images in a shot. It should be noted that the above association relationship is not limited to correspondence and/or positional relationships.
  • Step 130: Determine the weight of each shot according to its image features and the global features.
  • The weight of a shot is determined jointly by its own image features and its global features, so the weight obtained is based not only on the shot itself but also on the correlation between the shot and the other shots in the entire video stream, evaluating the importance of the shot from the perspective of the video as a whole.
  • Step 140: Obtain a video summary of the video stream to be processed based on the weights of the shots.
  • The importance of the shots in the shot sequence is indicated by their weights, but the video summary is not determined by shot importance alone: the length of the video summary must also be controlled, that is, the video summary needs to be determined by combining the weight of each shot with its length (number of frames).
  • In some embodiments, the weight is positively related to the importance of the shot and/or the length of the video summary.
  • For example, the knapsack algorithm may be used to determine the video summary; other algorithms may also be used, which are not listed here one by one.
  • The video summary generation method performs feature extraction on the shots in the shot sequence of the video stream to be processed to obtain the image features of each shot, where each shot includes at least one frame of video image; obtains the global features of the shots based on the image features of all the shots; determines the weight of each shot according to its image features and the global features; and obtains a video summary of the video stream to be processed based on the weights of the shots.
  • By combining image features and global features to determine the weight of each shot, the global association between each shot and the entire video stream is exploited, so the video summary determined in this embodiment can express the video content as a whole, reducing the problem of a one-sided video summary.
  • FIG. 2 is a schematic flowchart of another embodiment of a video digest generation method provided by an embodiment of this application. As shown in FIG. 2, the method in this embodiment includes:
  • Step 210 Perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each shot.
  • Step 210 in the embodiment of the present application is similar to step 110 in the above-mentioned embodiment, and the step can be understood by referring to the above-mentioned embodiment, which will not be repeated here.
  • Step 220: Process the image features of all the shots based on a memory neural network to obtain the global features of each shot.
  • The memory neural network may include at least two embedding matrices: the image features of all shots of the video stream are input into the at least two embedding matrices, and the global features of each shot are obtained from the outputs of the embedding matrices.
  • The global features of a shot express the association between that shot and the other shots in the video stream. In terms of shot weight, the larger the weight, the greater the correlation between the shot and the other shots, and the more likely the shot is to be included in the video summary.
  • Step 230: Determine the weight of each shot according to its image features and the global features. Step 230 in the embodiment of the present application is similar to step 130 in the foregoing embodiment and can be understood with reference to that embodiment; details are not repeated here.
  • Step 240 Obtain a video summary of the video stream to be processed based on the weight of the shot.
  • Step 240 in the embodiment of the present application is similar to step 140 in the foregoing embodiment, and this step can be understood by referring to the foregoing embodiment, and details are not described herein again.
  • The embodiments of the present application imitate, by means of the memory neural network, the way humans create video summaries: the video is understood from the perspective of the whole, the memory neural network stores the information of the entire video stream, the global relationship between each shot and the video decides the shot's importance, and shots are then chosen for the video summary accordingly.
  • FIG. 3 is a partial flowchart of an optional example of a video digest generation method provided by an embodiment of the present application. As shown in FIG. 3, step 220 in the above embodiment includes:
  • Step 310: Map the image features of all the shots to a first embedding matrix and a second embedding matrix, respectively, to obtain an input memory and an output memory.
  • The input memory and the output memory in this embodiment correspond to all the shots of the video stream, and each embedding matrix corresponds to one memory (the input memory or the output memory).
  • Step 320: Obtain the global features of each shot according to its image features, the input memory, and the output memory.
  • In this way, the global features of each shot can be obtained. The global features reflect the association between the shot and all the shots in the video stream, so the shot weight obtained from the global features is related to the video stream as a whole, which in turn leads to a more comprehensive video summary.
  • In some embodiments, each shot may correspond to at least two global features, which may be obtained through at least two embedding matrix groups. The structure of each embedding matrix group is similar to that of the first embedding matrix and the second embedding matrix in the above embodiments.
  • Each embedding matrix group includes two embedding matrices, and each corresponding memory group includes an input memory and an output memory.
  • At least two memory groups are obtained from the image features of the shots, at least two global features are obtained through the at least two memory groups, and the weight of the shot is obtained by combining the multiple global features. The embedding matrices included in each group may be different or the same; when the embedding matrix groups are different, the obtained global features better reflect the association between the shot and the video as a whole.
  • FIG. 4 is a partial flowchart of another optional example of the video digest generation method provided by the embodiment of the present application. As shown in FIG. 4, step 320 in the above embodiment includes:
  • Step 402: Map the image features of the shot to a third embedding matrix to obtain a feature vector of the shot.
  • The third embedding matrix transposes the image features, that is, the image features of the shot are transposed to obtain the feature vector of the shot. For example, the image feature u_i corresponding to the i-th shot in the shot sequence is transposed to obtain the feature vector û_i.
  • Step 404: Perform an inner product operation between the feature vector and the input memory to obtain a weight vector of the shot.
  • The input memory corresponds to the shot sequence and therefore includes at least two vectors (their number corresponds to the number of shots). When the inner product operation is performed between the feature vector and the input memory, the inner products of the feature vector with the multiple vectors in the input memory are mapped into the (0, 1) interval, yielding multiple values in probability form, and these values are taken as the weight vector of the shot. For example, the weight vector may be obtained by formula (1): p_i = Softmax(û_i a).
  • Here u_i represents the image feature of the i-th shot, that is, the image feature of the shot whose weight is currently being calculated, û_i represents its feature vector, a represents the input memory, and p_i represents the weight vector of the correlation between the i-th image feature and the input memory. The Softmax activation function is used in multi-class classification to map the outputs of multiple neurons into the (0, 1) interval, which can be understood as probabilities; i takes values from 1 to the number of shots in the shot sequence. Through formula (1), a weight vector expressing the correlation between the i-th image feature and the shot sequence is obtained.
  • Step 406: Perform a weighted superposition operation on the weight vector and the output memory to obtain a global vector, and use the global vector as the global feature.
  • In some embodiments, the global vector is obtained by the following formula (2): o_i = p_i b, where b represents the output memory obtained based on the second embedding matrix, and o_i represents the global vector obtained from the i-th image feature and the output memory; that is, the global vector is the weighted sum of the vectors in the output memory, with the entries of the weight vector as the weights.
  • The inner product operation between the image feature and the input memory yields the correlation between that image feature and every shot; the image feature may first be transposed to ensure that the inner product with the vectors in the input memory can be performed. The weight vector obtained in this way contains multiple probability values, each indicating the correlation between the current shot and one shot in the shot sequence: the greater the probability value, the stronger the correlation. Each probability value is then applied to the corresponding vector in the output memory and the results are superposed, yielding the global vector of the shot, which serves as its global feature.
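  • The following is a minimal NumPy sketch of the single-memory-group computation described above (formulas (1) and (2)). The matrix names (emb_A, emb_B, emb_C), the dimensions, and the random data are illustrative assumptions, not values from the present application.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative shapes: n_shots image features of dimension d, embedded into dimension e.
n_shots, d, e = 5, 128, 64
rng = np.random.default_rng(0)

U = rng.normal(size=(n_shots, d))      # image features u_1..u_n, one row per shot
emb_A = rng.normal(size=(d, e))        # first embedding matrix  -> input memory
emb_B = rng.normal(size=(d, e))        # second embedding matrix -> output memory
emb_C = rng.normal(size=(d, e))        # third embedding matrix  -> per-shot feature vector

a = U @ emb_A                          # input memory, one vector per shot
b = U @ emb_B                          # output memory, one vector per shot

def global_feature(i):
    u_hat = U[i] @ emb_C               # feature vector of the i-th shot
    p_i = softmax(a @ u_hat)           # formula (1): correlation of shot i with every shot
    o_i = p_i @ b                      # formula (2): weighted superposition of the output memory
    return o_i                         # global vector of shot i, used as its global feature

print(global_feature(0).shape)         # (64,)
```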
  • In some embodiments, obtaining at least two global features of the shot according to at least two memory groups includes the following. The weight vectors are obtained by adapting formula (1) into formula (5): p_i^k = Softmax(û_i a^k).
  • Here u_i represents the image feature of the i-th shot, that is, the image feature of the shot whose weight is currently being calculated, û_i represents the feature vector of the i-th shot, a^k represents the input memory in the k-th memory group, and p_i^k represents the weight vector of the correlation between the i-th image feature and the input memory in the k-th memory group. The Softmax activation function is used in multi-class classification to map the outputs of multiple neurons into the (0, 1) interval, which can be understood as probabilities; k takes values from 1 to N. Through formula (5), at least two weight vectors expressing the correlation between the i-th image feature and the shot sequence are obtained.
  • The at least two global vectors in this embodiment are obtained by adapting the above formula (2) into formula (6): o_i^k = p_i^k b^k, where b^k represents the output memory of the k-th memory group and o_i^k represents the global vector obtained from the i-th image feature and the output memory of the k-th memory group. Based on formula (6), at least two global vectors of the shot are obtained.
  • FIG. 5 is a schematic flowchart of still another embodiment of the video summary generation method provided by an embodiment of the present application. As shown in FIG. 5, the method includes:
  • Step 510 Perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each shot.
  • step 510 is similar to step 110 in the foregoing embodiment, and the step can be understood by referring to the foregoing embodiment, and details are not described herein again.
  • Step 520: Obtain the global features of each shot according to the image features of all the shots.
  • step 520 is similar to step 120 in the foregoing embodiment, and this step can be understood by referring to any of the foregoing embodiments, and details are not described herein again.
  • Step 530: Perform an inner product operation on the image features of the shot and the global features of the shot to obtain a weight feature.
  • The obtained weight feature not only reflects the importance of the shot within the overall video but also depends on the information of the shot itself. In some embodiments, the weight feature may be obtained by the following formula (3): m_i = u_i ⊙ o_i, where u_i is the image feature of the i-th shot, o_i is its global feature, ⊙ denotes the inner (element-wise) product, and m_i denotes the resulting weight feature of the i-th shot.
  • Step 540: Pass the weight feature through a fully connected neural network to obtain the weight of the shot.
  • The weight is used to reflect the importance of the shot and therefore needs to be expressed in numerical form. The dimension of the weight feature is transformed through a fully connected neural network to obtain the weight of the shot expressed as a one-dimensional value.
  • In some embodiments, the weight of the shot can be obtained based on the following formula (4): s_i = W_D m_i + b_D, where s_i represents the weight of the i-th shot, and W_D and b_D represent the weight and bias of the fully connected network through which the weight feature passes.
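  • A small NumPy sketch of formulas (3) and (4): the weight feature is the element-wise product of the image feature and the global feature, and one fully connected mapping produces the scalar weight. The dimension and the randomly initialized W_D and b_D are purely illustrative assumptions (in practice they are learned).

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64

u_i = rng.normal(size=dim)             # image feature of the i-th shot (assumed same dim as o_i)
o_i = rng.normal(size=dim)             # global feature of the i-th shot

m_i = u_i * o_i                        # formula (3): weight feature (element-wise product)

W_D = rng.normal(size=(1, dim))        # fully connected weight (illustrative, normally learned)
b_D = rng.normal(size=1)               # fully connected bias

s_i = float(W_D @ m_i + b_D)           # formula (4): weight of the i-th shot
print(s_i)
```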
  • Step 550 Obtain a video summary of the video stream to be processed based on the weight of the shot.
  • This embodiment combines the image features of the shot with its global features to determine the weight of the shot: while embodying the information of the shot itself, it also takes into account the association between the shot and the entire video, realizing an understanding of the video from both the local and the global perspective and making the obtained video summary better match human habits.
  • In some embodiments, determining the weight of the shot according to its image features and at least two global features includes: performing an inner product operation on the image feature of the shot and the first of the at least two global features to obtain a first weight feature; then taking the first weight feature as the image feature and a second global feature (a global feature other than the first one among the at least two global features) as the new first global feature, and repeating the inner product operation; when no second global feature remains among the at least two global features of the shot, taking the current first weight feature as the weight feature of the shot; and passing the weight feature through a fully connected neural network to obtain the weight of the shot, as sketched in the example below.
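  • A minimal sketch of this iterative combination, assuming the shot has a list of global features (one per memory group) with the same dimension as its image feature; all names, shapes, and random values are illustrative.

```python
import numpy as np

def shot_weight(u_i, global_feats, W_D, b_D):
    """Fold the shot's global features into a weight feature one by one, then map to a scalar."""
    w = u_i
    for o in global_feats:             # first global feature, then the second, and so on
        w = w * o                      # the weight feature becomes the new "image feature"
    return float(W_D @ w + b_D)        # final weight feature through the fully connected layer

rng = np.random.default_rng(2)
dim, n_groups = 64, 3
u_i = rng.normal(size=dim)                                        # image feature of shot i
global_feats = [rng.normal(size=dim) for _ in range(n_groups)]    # global features from n memory groups
W_D, b_D = rng.normal(size=(1, dim)), rng.normal(size=1)
print(shot_weight(u_i, global_feats, W_D, b_D))
```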
  • FIG. 6 is a schematic diagram of some optional examples of the video summary generation method provided by the embodiments of the present application.
  • This example includes multiple memory groups, the number of memory groups being n; multiple matrices are obtained from the segmented video stream, and through the calculations of the above formulas (5), (6), (7) and (4), the weight s_i of the i-th shot is obtained.
  • FIG. 7 is a schematic flowchart of yet another embodiment of the video summary generation method provided by an embodiment of the present application. As shown in FIG. 7, the method includes:
  • Step 710: Perform shot segmentation on the video stream to be processed to obtain a shot sequence.
  • In some embodiments, shot segmentation is performed based on the similarity between at least two frames of video images in the video stream to be processed, so as to obtain the shot sequence.
  • The similarity between two frames of video images may be determined by the distance (such as the Euclidean distance or the cosine distance) between the features corresponding to the two frames: the higher the similarity, the greater the possibility that the two video images belong to the same shot.
  • The similarity between video images can thus be used to divide video images with obvious differences into different shots, achieving accurate shot segmentation; a minimal example of such a similarity check is sketched below.
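  • The following sketch compares two frame feature vectors with cosine similarity against a threshold; the feature dimension and the threshold value are assumptions for illustration.

```python
import numpy as np

def same_shot(feat_a, feat_b, threshold=0.85):
    """Return True when two frame features are similar enough to belong to the same shot."""
    cos = float(feat_a @ feat_b / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b) + 1e-8))
    return cos >= threshold

rng = np.random.default_rng(3)
f1, f2 = rng.normal(size=128), rng.normal(size=128)
print(same_shot(f1, f1), same_shot(f1, f2))   # True for identical frames, likely False otherwise
```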
  • Step 720 Perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each shot.
  • step 720 is similar to step 110 in the foregoing embodiment, and this step can be understood by referring to any of the foregoing embodiments, and details are not described herein again.
  • Step 730: Obtain the global features of each shot according to the image features of all the shots.
  • Step 730 in the embodiment of the present application is similar to step 120 in the foregoing embodiment, and this step can be understood by referring to any of the foregoing embodiments, and details are not described herein again.
  • Step 740: Determine the weight of each shot according to its image features and the global features.
  • step 740 is similar to step 130 in the foregoing embodiment, and this step can be understood by referring to any of the foregoing embodiments, and details are not described herein again.
  • Step 750 Obtain a video summary of the video stream to be processed based on the weight of the shot.
  • step 750 is similar to step 140 in the foregoing embodiment, and this step can be understood by referring to any of the foregoing embodiments, and details are not described herein again.
  • In the embodiments of the present application, the shot is used as the unit for extracting the summary.
  • Shot segmentation may be performed by a neural network, based on known shooting shots, or by human judgment; the embodiments do not limit the specific means of shot segmentation.
  • FIG. 8 is a partial flowchart of another optional example of the video digest generation method provided by the embodiment of the present application. As shown in FIG. 8, step 710 in the above embodiment includes:
  • Step 802 Segment the video images in the video stream based on at least two segmentation intervals of different sizes to obtain at least two video segment groups.
  • each video clip group includes at least two video clips, and the division interval is greater than or equal to 1 frame.
  • The video stream is divided using multiple division intervals of different sizes, for example 1 frame, 4 frames, 6 frames, 8 frames, and so on; with one division interval, the video stream is divided into multiple video clips of a fixed size (for example, 6 frames).
  • Step 804: Determine whether the segmentation is correct based on the similarity between at least two broken frames in each video clip group, where a broken frame is the first frame of a video clip. Optionally, in response to the similarity between at least two broken frames being less than or equal to a set value, it is determined that the segmentation is correct.
  • The association between two frames of video images may be determined based on the similarity between their features: the greater the similarity, the greater the likelihood that they belong to the same shot. The embodiments of the present application mainly use scene changes as the basis for shot segmentation; that is, even for video clips shot within one long take, when the correlation between a certain frame and the first frame of the long take is less than or equal to the set value, the shot is also segmented.
  • Step 806: In response to the segmentation being correct, determine the video clip as a shot, and obtain the shot sequence.
  • The video stream is divided using a plurality of division intervals of different sizes, and the similarity between the broken frames of two consecutive video clips is then examined to determine whether the division at that position is correct. If the similarity between the broken frames exceeds the set value, the segmentation at that position is incorrect, that is, the two video clips belong to one shot; the shot sequence is obtained from the correct segmentations.
  • In some embodiments, when a broken frame coincides with division points of at least two division intervals, the video clips obtained with the smaller division interval are used as the shots to obtain the shot sequence.
  • For example, for a video stream of 8 frames, with 2 frames and 4 frames used as the first division interval and the second division interval respectively, the first division interval yields 4 video clips whose broken frames are the 1st, 3rd, 5th and 7th frames, and the second division interval yields 2 video clips whose broken frames are the 1st and 5th frames. If the segmentation corresponding to the broken frames at the 5th and 7th frames is determined to be correct, then since the 5th frame is a broken frame of both the first division interval and the second division interval, the first division interval prevails: the video stream is divided into three shots, namely frames 1 to 4 as one shot, frames 5 and 6 as one shot, and frames 7 and 8 as one shot, instead of taking frames 5 to 8 as one shot according to the second division interval. A simplified sketch of this segmentation procedure is given below.
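  • The following is a simplified sketch of steps 802 to 806 for a single division interval: candidate boundaries are placed at a fixed interval and kept only when the broken frames on either side are dissimilar enough; handling several intervals and preferring the smaller one, as in the example above, is left out for brevity. The similarity measure, threshold, and toy data are assumptions.

```python
import numpy as np

def frame_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def segment_shots(frame_feats, interval=6, set_value=0.85):
    """Split a sequence of per-frame features into shots.

    Candidate boundaries are placed every `interval` frames (step 802); a boundary is kept only
    when the broken frames on either side are dissimilar enough (step 804), and the resulting
    clips are returned as shots (step 806)."""
    n = len(frame_feats)
    boundaries = [0]
    for c in range(interval, n, interval):                 # candidate broken-frame positions
        if frame_similarity(frame_feats[boundaries[-1]], frame_feats[c]) <= set_value:
            boundaries.append(c)                           # segmentation at c is correct
    boundaries.append(n)
    return [list(range(boundaries[i], boundaries[i + 1])) for i in range(len(boundaries) - 1)]

rng = np.random.default_rng(4)
feats = rng.normal(size=(24, 128))                         # toy per-frame features
print([len(s) for s in segment_shots(feats)])              # e.g. [6, 6, 6, 6]
```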
  • In some embodiments, feature extraction is performed separately for each frame of video image in the shot through a feature extraction network, so as to obtain at least one image feature.
  • When the shot includes only one frame of video image, that frame's feature is used as the image feature of the shot.
  • When the shot includes multiple frames, the average of the multiple image features is calculated, and the average feature is used as the image feature of the shot.
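  • A one-function sketch of this per-shot feature: a feature extractor (here a trivial stand-in for a real network) is applied to every frame of the shot and the frame features are averaged. The stand-in extractor and the toy frames are assumptions.

```python
import numpy as np

def extract_frame_feature(frame):
    # Stand-in for a real feature extraction network (e.g. a CNN backbone).
    return frame.reshape(-1).astype(np.float32)[:128]

def shot_feature(frames):
    """Image feature of a shot: the average of its per-frame features."""
    feats = np.stack([extract_frame_feature(f) for f in frames])
    return feats.mean(axis=0)

frames = [np.random.rand(16, 16, 3) for _ in range(4)]   # toy "video images" of one shot
print(shot_feature(frames).shape)                         # (128,)
```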
  • In some embodiments, step 140 includes: obtaining a limited duration of the video summary; and obtaining the video summary of the video stream to be processed according to the weights of the shots and the limited duration of the video summary.
  • A video summary, also known as video condensation, is a brief summary of the video content: it expresses the main content of the video in a relatively short time. While expressing the main content, the duration of the summary must be limited, otherwise the purpose of brevity is not achieved and watching it is no different from watching the full video.
  • The embodiments of the present application constrain the video summary through a limited duration, that is, the duration of the video summary to be obtained is required to be less than or equal to the limited duration; the specific value of the limited duration may be set according to actual conditions.
  • In some embodiments, the 0/1 knapsack algorithm is used to extract the video summary. Applied to this embodiment, the 0/1 knapsack problem can be described as follows: the shot sequence includes multiple shots, each shot has a corresponding (usually different) length and a corresponding (usually different) weight, and a video summary of limited duration is required; the goal is to maximize the sum of the weights of the shots selected into the summary within the limited duration. Therefore, the embodiments of the present application can obtain the video summary with the best content through the knapsack algorithm, as in the sketch below.
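  • A standard 0/1 knapsack sketch applied to this shot selection: the capacity is the limited duration in frames, an item's cost is the shot length, and its value is the shot weight. The lengths, weights, and limit below are illustrative.

```python
def select_shots(lengths, weights, limit):
    """0/1 knapsack: choose shot indices maximizing total weight with total length <= limit."""
    dp = [(0.0, [])] * (limit + 1)              # dp[c] = (best total weight, chosen shots) at capacity c
    for i in range(len(lengths)):
        new_dp = dp[:]
        for c in range(lengths[i], limit + 1):
            cand = dp[c - lengths[i]][0] + weights[i]
            if cand > new_dp[c][0]:
                new_dp[c] = (cand, dp[c - lengths[i]][1] + [i])
        dp = new_dp
    return max(dp, key=lambda t: t[0])[1]

lengths = [30, 45, 20, 60, 15]                  # shot lengths in frames
weights = [0.9, 0.4, 0.7, 0.8, 0.3]             # predicted shot weights
print(select_shots(lengths, weights, limit=90)) # -> [0, 2, 4] for these illustrative numbers
```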
  • In some embodiments, before step 110 is performed, the method further includes:
  • the feature extraction network and the memory neural network are jointly trained based on the sample video stream.
  • the sample video stream includes at least two sample shots, and each sample shot includes a label weight.
  • To obtain more accurate weights, the feature extraction network and the memory neural network need to be trained before the weights are obtained. Training the feature extraction network and the memory neural network separately can also achieve the purpose of the embodiments of the present application, but the parameters obtained by jointly training the two networks are better suited to the embodiments of the present application and provide more accurate predicted weights.
  • The training process assumes that the sample video stream has already been divided into at least two sample shots; the segmentation may be based on a trained segmentation neural network or on other means, which the embodiments of the present application do not limit.
  • the process of joint training may include:
  • Predicted weights of the sample shots are obtained through the feature extraction network and the memory neural network, a loss is determined based on the predicted weights and the labeled weights, and the parameters of the feature extraction network and the memory neural network are adjusted based on the loss; a minimal sketch of one such training step follows.
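  • A minimal PyTorch sketch of one joint training step, with a toy feature extractor and a single-memory-group attention standing in for the networks described above; the module shapes, the MSE loss, and the optimizer choice are assumptions for illustration, not the training setup of the present application.

```python
import torch
import torch.nn as nn

class ShotWeightModel(nn.Module):
    """Toy feature extractor + single-memory-group attention, trained jointly."""
    def __init__(self, in_dim=512, feat_dim=64):
        super().__init__()
        self.extractor = nn.Linear(in_dim, feat_dim)              # stand-in feature extraction network
        self.emb_a = nn.Linear(feat_dim, feat_dim, bias=False)    # input-memory embedding
        self.emb_b = nn.Linear(feat_dim, feat_dim, bias=False)    # output-memory embedding
        self.emb_c = nn.Linear(feat_dim, feat_dim, bias=False)    # feature-vector embedding
        self.fc = nn.Linear(feat_dim, 1)                          # fully connected layer for the weight

    def forward(self, shots):                                     # shots: (n_shots, in_dim)
        u = self.extractor(shots)                                 # image features
        a, b, u_hat = self.emb_a(u), self.emb_b(u), self.emb_c(u)
        p = torch.softmax(u_hat @ a.t(), dim=-1)                  # formula (1) for every shot at once
        o = p @ b                                                 # formula (2): global features
        return self.fc(u * o).squeeze(-1)                         # formulas (3)-(4): predicted weights

model = ShotWeightModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

sample_shots = torch.randn(10, 512)      # pre-segmented sample shots (toy per-shot inputs)
label_weights = torch.rand(10)           # labeled weights of the sample shots

pred = model(sample_shots)               # predicted weights
loss = loss_fn(pred, label_weights)      # loss from predicted vs labeled weights
opt.zero_grad()
loss.backward()
opt.step()                               # jointly updates the extractor and the memory parts
```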
  • FIG. 9 is a schematic structural diagram of an embodiment of an apparatus for generating a video summary provided by an embodiment of the present application.
  • the apparatus of this embodiment may be used to implement the above method embodiments of the present application.
  • the device of this embodiment includes:
  • the feature extraction unit 91 is configured to perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain the image features of each shot.
  • The video stream to be processed is a video stream from which a video summary is to be obtained, and it includes at least one frame of video image.
  • The embodiments of the present application use the shot as the constituent unit of the video summary, and each shot includes at least one frame of video image.
  • The feature extraction in the embodiments of the present application may be implemented based on any feature extraction network: feature extraction is performed for each shot separately based on the feature extraction network to obtain at least two image features. The present application does not limit the specific feature extraction process.
  • The global feature unit 92 is configured to obtain the global features of the shots according to the image features of all the shots.
  • In some embodiments, all image features corresponding to the video stream are processed (for example, by mapping or embedding) to obtain a transformed feature sequence corresponding to the video stream as a whole; the transformed feature sequence is then combined with each image feature to obtain a global feature (global attention) corresponding to each shot. The global feature reflects the association between each shot and the other shots in the video stream.
  • The weight acquisition unit 93 is configured to determine the weight of each shot according to its image features and the global features. The weight of a shot is determined jointly by its image features and its global features, so the obtained weight is based not only on the shot itself but also on the correlation between the shot and the other shots in the entire video stream, evaluating the importance of the shot from the perspective of the video as a whole.
  • the summary generating unit 94 is configured to obtain a video summary of the to-be-processed video stream based on the weight of the shot.
  • The embodiments of the present application reflect the importance of each shot through its weight and can thereby identify the more important shots in the shot sequence; however, the video summary is not determined by shot importance alone, and the length of the summary must also be controlled, that is, the video summary needs to be determined by combining the weight and duration (number of frames) of the shots.
  • In some embodiments, a knapsack algorithm can be used to obtain the video summary.
  • the video summary generating device provided in the above embodiment combines the image features and global features to determine the weight of each shot, and realizes the understanding of the video from the perspective of the entire video.
  • the global association relationship between each shot and the entire video stream is used.
  • the video summary determined in the embodiment can express the video content as a whole, avoiding the problem of one-sidedness of the video summary.
  • In some embodiments, the global feature unit 92 is configured to process the image features of all the shots based on a memory neural network to obtain the global features of the shots.
  • The memory neural network may include at least two embedding matrices: the image features of all shots of the video stream are input into the at least two embedding matrices, and the global features of each shot are obtained from the outputs of the embedding matrices.
  • The global features of a shot express the association between that shot and the other shots in the video stream. In terms of shot weight, the larger the weight, the greater the correlation between the shot and the other shots, and the more likely the shot is to be included in the video summary.
  • In some embodiments, the global feature unit 92 is configured to map the image features of all the shots to a first embedding matrix and a second embedding matrix, respectively, to obtain an input memory and an output memory, and to obtain the global features of each shot according to its image features, the input memory, and the output memory.
  • When obtaining the global features of a shot according to its image features, the input memory, and the output memory, the global feature unit 92 is configured to map the image features of the shot to a third embedding matrix to obtain the feature vector of the shot; perform an inner product operation between the feature vector and the input memory to obtain the weight vector of the shot; and perform a weighted superposition operation on the weight vector and the output memory to obtain the global vector, which is used as the global feature.
  • In some embodiments, the weight acquisition unit 93 is configured to perform an inner product operation on the image features of the shot and the global features of the shot to obtain a weight feature, and to pass the weight feature through a fully connected neural network to obtain the weight of the shot.
  • This embodiment combines the image features of the shot with its global features to determine the weight of the shot: while embodying the information of the shot itself, it also takes into account the association between the shot and the entire video, realizing an understanding of the video from both the local and the global perspective and making the obtained video summary better match human habits.
  • In some embodiments, the global feature unit 92 is configured to process the image features of the shot based on the memory neural network to obtain at least two global features of the shot.
  • At least two global features are obtained through at least two memory groups, and the weight of the shot is obtained by combining the multiple global features. The embedding matrices included in each group may be different or the same; when the embedding matrix groups are different, the obtained global features better reflect the association between the shot and the video as a whole.
  • In some embodiments, the global feature unit 92 is configured to map the image features of the shot to at least two embedding matrix groups to obtain at least two memory groups, where each embedding matrix group includes two embedding matrices and each memory group includes an input memory and an output memory, and to obtain at least two global features of the shot according to the at least two memory groups and the image features of the shot.
  • When obtaining at least two global features of the shot based on the at least two memory groups and the image features of the shot, the global feature unit 92 is configured to map the image features of the shot to a third embedding matrix to obtain the feature vector of the shot; perform inner product operations between the feature vector and the at least two input memories to obtain at least two weight vectors of the shot; and perform weighted superposition operations on the weight vectors and the at least two output memories to obtain at least two global vectors, which serve as the at least two global features.
  • In some embodiments, the weight acquisition unit 93 is configured to perform an inner product operation on the image feature of the shot and the first of the at least two global features of the shot to obtain a first weight feature; then take the first weight feature as the image feature and a second global feature (a global feature other than the first one among the at least two global features) as the new first global feature and repeat the inner product operation; when no second global feature remains among the at least two global features of the shot, take the current first weight feature as the weight feature of the shot; and pass the weight feature through a fully connected neural network to obtain the weight of the shot.
  • the device further includes:
  • In some embodiments, shot segmentation is performed based on the similarity between at least two frames of video images in the video stream to be processed, so as to obtain the shot sequence.
  • The similarity between two frames of video images may be determined by the distance (such as the Euclidean distance or the cosine distance) between the features corresponding to the two frames: the higher the similarity, the greater the possibility that the two video images belong to the same shot.
  • The similarity between video images can thus be used to divide video images with obvious differences into different shots, achieving accurate shot segmentation.
  • the shot segmentation unit is configured to perform shot segmentation based on the similarity between at least two frames of video images in the video stream to be processed to obtain a shot sequence.
  • In some embodiments, the shot segmentation unit is configured to segment the video images in the video stream based on at least two division intervals of different sizes to obtain at least two video clip groups, where each video clip group includes at least two video clips and the division interval is greater than or equal to 1 frame; determine whether the segmentation is correct based on the similarity between at least two broken frames in each video clip group, where a broken frame is the first frame of a video clip; and, in response to the segmentation being correct, determine the video clip as a shot, so as to obtain the shot sequence.
  • When determining whether the segmentation is correct based on the similarity between at least two broken frames in each video clip group, the shot segmentation unit is configured to determine that the segmentation is correct in response to the similarity between the at least two broken frames being less than or equal to a set value, and to determine that the segmentation is incorrect in response to the similarity being greater than the set value.
  • When determining the video clip as a shot in response to the segmentation being correct to obtain the shot sequence, the shot segmentation unit is configured to, in response to a broken frame corresponding to at least two division intervals, use the video clips obtained with the smaller division interval as the shots to obtain the shot sequence.
  • In some embodiments, the feature extraction unit 91 is configured to perform feature extraction on at least one frame of video image in the shot to obtain at least one image feature, obtain the average feature of all the image features, and use the average feature as the image feature of the shot.
  • In some embodiments, feature extraction is performed separately for each frame of video image in the shot through a feature extraction network.
  • When the shot includes only one frame of video image, that frame's feature is used as the image feature of the shot; when the shot includes multiple frames, the average of the multiple image features is calculated, and the average feature is used as the image feature of the shot.
  • the summary generating unit is configured to obtain a limited duration of the video summary; according to the weight of the shot and the limited duration of the video summary, the video summary of the video stream to be processed is obtained.
  • A video summary, also known as video condensation, is a brief summary of the video content: it expresses the main content of the video in a relatively short time. While expressing the main content, the duration of the summary must be limited, otherwise the purpose of brevity is not achieved and watching it is no different from watching the full video.
  • The embodiments of the present application constrain the video summary through a limited duration, that is, the duration of the video summary to be obtained is required to be less than or equal to the limited duration.
  • The specific value of the limited duration can be set according to the actual situation.
  • the device of the embodiments of the present application further includes:
  • the joint training unit is configured to perform joint training on the feature extraction network and the memory neural network based on the sample video stream.
  • the sample video stream includes at least two sample shots, and each sample shot includes a labeling weight.
  • To obtain more accurate weights, the feature extraction network and the memory neural network need to be trained before the weights are obtained. Training the feature extraction network and the memory neural network separately can also achieve the purpose of the embodiments of the present application, but the parameters obtained by jointly training the two networks are better suited to the embodiments of the present application and provide more accurate predicted weights.
  • The training process assumes that the sample video stream has already been divided into at least two sample shots; the segmentation may be based on a trained segmentation neural network or on other means, which the embodiments of the present application do not limit.
  • an electronic device which includes a processor, and the processor includes the video digest generating apparatus provided in any one of the foregoing embodiments.
  • an electronic device including: a memory configured to store executable instructions;
  • a processor configured to communicate with the memory to execute the executable instructions to complete the operation of the video digest generation method provided by any one of the foregoing embodiments.
  • a computer storage medium configured to store computer-readable instructions, and when the instructions are executed, the operations of the video digest generation method provided in any of the foregoing embodiments are performed.
  • a computer program product including computer readable code, and when the computer readable code runs on a device, a processor in the device executes to implement any of the above The instruction of the video digest generation method provided in the embodiment.
  • One or more special-purpose processors may serve as the acceleration unit 1013, which may include but is not limited to a graphics processing unit (GPU), an FPGA, a DSP, and other dedicated processors such as ASIC chips.
  • The processor can perform various appropriate actions and processes according to executable instructions stored in the read-only memory (ROM) 1002 or executable instructions loaded from the storage section 1008 into the random access memory (RAM) 1003.
  • the communication unit 1012 may include but is not limited to a network card, and the network card may include but not limited to an IB (Infiniband) network card.
  • The processor may communicate with the read-only memory 1002 and/or the random access memory 1003 to execute executable instructions, connect to the communication unit 1012 through the bus 1004, and communicate with other target devices via the communication unit 1012, thereby completing the operations corresponding to any of the methods provided by the embodiments of the present application, for example: performing feature extraction on the shots in the shot sequence of the video stream to be processed to obtain the image features of each shot, where each shot includes at least one frame of video image; obtaining the global features of the shots according to the image features of all the shots; determining the weight of each shot according to its image features and the global features; and obtaining a video summary of the video stream to be processed based on the weights of the shots.
  • In the RAM 1003, various programs and data necessary for device operation can also be stored.
  • the CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004.
  • The ROM 1002 is an optional module.
  • The RAM 1003 stores executable instructions, or executable instructions are written into the ROM 1002 at runtime, and the executable instructions cause the central processing unit 1001 to perform the operations corresponding to the above method.
  • An input / output (I / O) interface 1005 is also connected to the bus 1004.
  • The communication unit 1012 may be provided in an integrated manner, or may be provided with multiple sub-modules (for example, multiple IB network cards) linked on the bus.
  • The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card or a modem. The communication section 1009 performs communication processing via a network such as the Internet.
  • the driver 1010 is also connected to the I / O interface 1005 as needed.
  • a removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed on the drive 1010 as necessary, so that the computer program read out therefrom is installed into the storage section 1008 as necessary.
  • It should be noted that the architecture shown in FIG. 10 is only an optional implementation.
  • In specific practice, the number and types of the components in FIG. 10 may be selected, reduced, increased, or replaced according to actual needs, and different functional components may be provided separately or in an integrated manner.
  • For example, the acceleration unit 1013 and the CPU 1001 may be provided separately, or the acceleration unit 1013 may be integrated on the CPU 1001, and the communication unit may be provided separately or integrated on the CPU 1001 or the acceleration unit 1013, and so on.
  • Embodiments of the present application include a computer program product, which includes a computer program tangibly contained on a machine-readable medium; the computer program contains program code for performing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided in the embodiments of the present application, for example: performing feature extraction on the shots in the shot sequence of the video stream to be processed to obtain the image features of each shot, where each shot includes at least one frame of video image; obtaining the global features of the shots according to the image features of all the shots; determining the weight of each shot according to its image features and the global features; and obtaining a video summary of the video stream to be processed based on the weights of the shots.
  • the computer program may be downloaded and installed from the network through the communication section 1009, and / or installed from the removable medium 1011.
  • When the computer program is executed by the central processing unit (CPU) 1001, the functions defined above in the method of the present application are performed.
  • The method and apparatus of the present application may be implemented in many ways.
  • For example, the method and apparatus of the present application may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware.
  • The above order of the steps of the method is for illustration only, and the steps of the method of the present application are not limited to the order specifically described above unless otherwise specifically stated.
  • In addition, the present application may also be implemented as programs recorded in a recording medium, these programs including machine-readable instructions for implementing the method according to the present application.
  • Accordingly, the present application also covers a recording medium storing a program for executing the method according to the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Security & Cryptography (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Studio Devices (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiments of the present application disclose a video summary generation method and apparatus, an electronic device, and a computer storage medium. The method includes: performing feature extraction on the shots in a shot sequence of a video stream to be processed to obtain an image feature of each shot, each shot including at least one frame of video image; obtaining a global feature of the shots according to the image features of all the shots; determining a weight of each shot according to the image feature and the global feature of the shot; and obtaining a video summary of the video stream to be processed based on the weights of the shots.

Description

视频摘要生成方法和装置、电子设备、计算机存储介质
相关申请的交叉引用
本申请基于申请号为201811224169.X、申请日为2018年10月19日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及计算机视觉技术但不限于计算机视觉技术,尤其是一种视频摘要生成方法和装置、电子设备、计算机存储介质。
背景技术
随着视频数据的快速增加,为了在短时间内快速浏览这些视频,视频摘要开始扮演着越来越重要的角色。视频摘要是一种新兴的视频理解技术。视频摘要是从一段较长的视频中提取一些镜头,来合成一段较短的,包含着原视频中故事线或者精彩镜头的新视频。
人工智能技术针对许多计算机视觉问题已经得到了很好的解决方案,比如图像分类,人工智能的表现甚至已经超越了人类,但是这仅限于一些有着明确目标的方面。相较于其他计算机视觉任务,视频摘要更加抽象,更加强调对于整个视频全局的理解。视频摘要中镜头的取舍,不仅依赖于这个镜头本身的信息,更加依赖于视频整体所表达的信息。
发明内容
本申请实施例提供了一种视频摘要生成方法和装置、电子设备、计算机存储介质。
根据本申请实施例的一个方面,提供的一种视频摘要生成方法,包括:
对待处理视频流的镜头序列中的镜头进行特征提取,获得每个所述镜头的图像特征,每个所述镜头包括至少一帧视频图像;
根据所有所述镜头的图像特征,获取所述镜头的全局特征;
根据所述镜头的图像特征和所述全局特征确定所述镜头的权重;
基于所述镜头的权重获得所述待处理视频流的视频摘要。
根据本申请实施例的另一个方面,提供的一种视频摘要生成装置,包括:
特征提取单元,配置为对待处理视频流的镜头序列中的镜头进行特征提取,获得每个所述镜头的图像特征,每个所述镜头包括至少一帧视频图像;
全局特征单元,配置为根据所有所述镜头的图像特征,获取所述镜头的全局特征;
权重获取单元,配置为根据所述镜头的图像特征和所述全局特征确定所述镜头的权重;
摘要生成单元,配置为基于所述镜头的权重获得所述待处理视频流的视频摘要。
根据本申请实施例的又一个方面,提供的一种电子设备,包括处理器,所述处理器包括如上任意一项所述的视频摘要生成装置。
根据本申请实施例的还一个方面,提供的一种电子设备,包括:存储器,用于存储可执行指令;
以及处理器,用于与所述存储器通信以执行所述可执行指令从而完成如上任意一项所述视频摘要生成方法的操作。
根据本申请实施例的再一个方面,提供的一种计算机存储介质,用于存储计算机可读取的指令,其中,所述指令被执行时执行如上任意一项所述视频摘要生成方法的操作。
根据本申请实施例的另一个方面,提供的一种计算机程序产品,包括计算机可读代码,其中,当所述计算机可读代码在设备上运行时,所述设备中的处理器执行用于实现如上任意一项所述视频摘要生成方法的指令。
基于本申请上述实施例提供的一种视频摘要生成方法和装置、电子设备、计算机存储介质,对待处理视频流的镜头序列中的镜头进行特征提取,获得每个镜头的图像特征。每个镜头包括至少一帧视频图像;根据所有镜头的图像特征,获取镜头的全局特征;根据镜头的图像特征和全局特征确定镜头的权重;基于镜头的权重获得待处理视频流的视频摘要,结合图像特征和全局特征确定每个镜头的权重,实现了从视频整体的角度来理解视频,利用了每个镜头与视频全局的关系,基于本实施例的镜头的权重确定的视频摘要,可以在整体上对视频内容进行表达,减少了视频摘要较为片面的问题。
下面通过附图和实施例,对本申请的技术方案做进一步的详细描述。
附图说明
构成说明书的一部分的附图描述了本申请的实施例,并且连同描述一起用于解释本申请的原理。
参照附图,根据下面的详细描述,可以更加清楚地理解本申请,其中:
图1为本申请实施例提供的视频摘要生成方法的一个实施例的流程示意图。
图2为本申请实施例提供的视频摘要生成方法的另一个实施例的流程示意图。
图3为本申请实施例提供的视频摘要生成方法的一个可选示例的部分流程示意图。
图4为本申请实施例提供的视频摘要生成方法的另一可选示例的部分流程示意图。
图5为本申请实施例提供的视频摘要生成方法的又一实施例的流程示意图。
图6为本申请实施例提供的视频摘要生成方法的一些可选示例的示意图。
图7为本申请实施例提供的视频摘要生成方法的又一实施例的流程示意图。
图8为本申请实施例提供的视频摘要生成方法的又一可选示例的部分流程示意图。
图9为本申请实施例提供的视频摘要生成装置的一个实施例的结构示意图。
图10为适于用来实现本申请实施例的终端设备或服务器的电子设备的结构示意图。
具体实施方式
现在将参照附图来详细描述本申请的各种示例性实施例。应注意到:除非另外具体说明,否则在这些实施例中阐述的部件和步骤的相对布置、数字表达式和数值不限制本申请的范围。
同时,应当明白,为了便于描述,附图中所示出的各个部分的尺寸并不是按照实际的比例关系绘制的。
以下对至少一个示例性实施例的描述实际上仅仅是说明性的,决不作为对本申请及其应用或使用的任何限制。
对于相关领域普通技术人员已知的技术、方法和设备可能不作详细讨论,但在适当情况下,所述技术、方法和设备应当被视为说明书的一部分。
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步讨论。
图1为本申请实施例提供的视频摘要生成方法的一个实施例的流程示意图。该方法可以由任意视频摘要提取设备执行,例如终端设备、服务器、移动设备等等,如图1所示,该实施例方法包括:
步骤110,对待处理视频流的镜头序列中的镜头进行特征提取,获得每个镜头的图像特征。
所述视频摘要为:从原始的视频流中提取出关键信息或主旨信息,生成了视频摘要,视频摘要相对于原始的视频流数据流更小,且同时涵盖了原始的视频流的主旨内容或关 键内容,可以用于后续原始的视频流的检索等。
在本实施例中,例如,通过分析所述视频流中特定目标的运动变化,生成表征同一个目标在视频流中运动轨迹的视频摘要。当然此处仅是举例,具体实现不局限于上述举例。
在本实施例中,待处理视频流为获取视频摘要的视频流,视频流包括至少一帧视频图像。为了使获得的视频摘要具有内容含义,而不仅仅是由不同帧的视频图像构成的图像集合,本申请实施例将镜头作为视频摘要的构成单位,每个镜头包括至少一帧视频图像。
在一些实施例中,本申请实施例中的特征提取可以是基于任一特征提取网络实现,基于特征提取网络分别对每个镜头进行特征提取,以获得至少两个图像特征,本申请不限制具体进行特征提取的过程。
步骤120,根据所有镜头的图像特征,获取镜头的全局特征。
在一些实施例中,将视频流对应的所有图像特征经过处理(如:映射或嵌入等)获得对应整体视频流的转换特征序列,转换特征序列再与每个图像特征进行计算获得每个镜头对应的全局特征(全局注意力),通过全局特征可以体现每个镜头与视频流中其他镜头之间的关联关系。
此处的全局特征包括但不限于:表征一个镜头中多个视频图像中同一个图像元素之间对应关系或者位置关系的图像特征。值得注意的上述的关联关系不局限于所述对应关系和/或位置关系。
步骤130,根据镜头的图像特征和全局特征确定镜头的权重。
通过镜头的图像特征及其全局特征确定该镜头的权重,由此得到的权重不仅基于该镜头本身,还基于该镜头与整个视频流中其他镜头之间的关联关系,实现了从视频整体的角度对镜头的重要性进行评估。
步骤140,基于镜头的权重获得待处理视频流的视频摘要。
本实施例中,通过镜头的权重大小确定镜头序列中镜头的重要性,但确定视频摘要不仅仅基于镜头的重要性,还需要控制视频摘要的长度,即,需要结合镜头的权重和镜头的时长(帧数)确定视频摘要。具体如,所述权重与所述镜头的重要性和/或视频摘要的长度等正相关。在本实施例中,可采用背包算法确定视频摘要,还可以采用其他算法确定,这里不一一列举。
上述实施例提供的视频摘要生成方法,对待处理视频流的镜头序列中的镜头进行特征提取,获得每个镜头的图像特征,每个镜头包括至少一帧视频图像;根据所有镜头的图像特征,获取镜头的全局特征;根据镜头的图像特征和全局特征确定镜头的权重;基于镜头的权重获得待处理视频流的视频摘要,结合图像特征和全局特征确定每个镜头的权重,实现了从视频整体的角度来理解视频,利用了每个镜头与整个视频流的全局关联关系,基于本实施例确定的视频摘要,可以在整体上对视频内容进行表达,减少了视频摘要较为片面的问题。
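To make the four steps above concrete, the following is a minimal, illustrative sketch of the whole pipeline in Python/NumPy. All names, shapes, and the randomly initialized stand-in parameters are assumptions for illustration only; they are not the embodiment's actual networks, and a simple greedy selection stands in for the knapsack step described later.

```python
# Illustrative end-to-end sketch of steps 110-140 (assumed names, toy features).
import numpy as np

def summarize(shot_features, shot_lengths, max_len):
    """shot_features: (N, D) one image feature per shot (step 110).
    shot_lengths: (N,) length of each shot in frames.
    Returns indices of the shots selected for the summary (step 140)."""
    N, D = shot_features.shape
    # Step 120: global feature of each shot, computed from all shots
    # (a plain attention read used here as a stand-in for the memory network).
    A = np.random.randn(D, D) * 0.01            # stand-in "input" embedding matrix
    B = np.random.randn(D, D) * 0.01            # stand-in "output" embedding matrix
    input_mem = shot_features @ A               # (N, D)
    output_mem = shot_features @ B              # (N, D)
    scores = shot_features @ input_mem.T        # (N, N) shot-vs-all relevance
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)           # row-wise softmax
    global_feat = p @ output_mem                # (N, D) one global feature per shot
    # Step 130: weight of each shot from its image feature and its global feature.
    w_fc = np.random.randn(D)                   # stand-in fully connected layer
    weights = (shot_features * global_feat) @ w_fc
    # Step 140: pick shots under the summary length budget (greedy stand-in).
    chosen, used = [], 0
    for i in np.argsort(-weights):
        if used + shot_lengths[i] <= max_len:
            chosen.append(int(i))
            used += shot_lengths[i]
    return sorted(chosen)

feats = np.random.randn(8, 16)                  # 8 shots, 16-dim toy features
lengths = np.array([30, 45, 20, 60, 25, 40, 35, 50])
print(summarize(feats, lengths, max_len=150))
```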
图2为本申请实施例提供的视频摘要生成方法的另一个实施例的流程示意图。如图2所示,本实施例方法包括:
步骤210,对待处理视频流的镜头序列中的镜头进行特征提取,获得每个镜头的图像特征。
本申请实施例中步骤210与上述实施例的步骤110类似,可参照上述实施例对该步骤进行理解,在此不再赘述。
步骤220,基于记忆神经网络对所有镜头的图像特征进行处理,获取镜头的全局特征。
在一些实施例中,记忆神经网络可以包括至少两个嵌入矩阵,通过将视频流的所有镜头的图像特征分别输入到至少两个嵌入矩阵中,通过嵌入矩阵的输出获得每个镜头的全局特征,镜头的全局特征可以表达该镜头与视频流中其他镜头之间的关联关系,从镜头的权重看,权重越大,表明该镜头与其他镜头的关联越大,越有可能被包含在视频摘要中。
步骤230,根据镜头的图像特征和全局特征确定镜头的权重。
本申请实施例中步骤230与上述实施例的步骤130类似,可参照上述实施例对该步骤进行理解,在此不再赘述。
步骤240,基于镜头的权重获得待处理视频流的视频摘要。
本申请实施例中步骤240与上述实施例的步骤140类似,可参照上述实施例对该步骤进行理解,在此不再赘述。
本申请实施例通过记忆神经网络模仿人类创造视频摘要时的做法,即从视频整体的角度来理解视频,利用记忆神经网络来存储整个视频流的信息,利用每一个镜头与视频全局的关系,来决定其重要性,从而选择出作为视频摘要的镜头。
图3为本申请实施例提供的视频摘要生成方法的一个可选示例的部分流程示意图。如图3所示,上述实施例中的步骤220包括:
步骤310,将所有镜头的图像特征分别映射到第一嵌入矩阵和第二嵌入矩阵,获得输入记忆和输出记忆。
本实施例中的输入记忆和输出记忆分别对应视频流的全部镜头,每个嵌入矩阵对应一个记忆(输入记忆或输出记忆),通过将所有镜头的图像特征映射到一个嵌入矩阵中,可获得一组新的图像特征,即一个记忆。
步骤320,根据镜头的图像特征、输入记忆和输出记忆,获取镜头的全局特征。
基于输入记忆和输出记忆结合该镜头的图像特征,即可获得该镜头的全局特征,该全局特征体现了该镜头与视频流中所有镜头之间的关联,使基于全局特征获得的镜头的权重与视频流整体相关,进而获得更全面的视频摘要。
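A minimal sketch of step 310, assuming the image features are already given as a matrix with one row per shot: projecting them through two embedding matrices (randomly initialized here, purely for illustration) yields the input memory and the output memory used in the following steps.

```python
# Sketch of step 310 (assumed shapes): all shot features are mapped through a
# first and a second embedding matrix to obtain the input and output memories.
import numpy as np

rng = np.random.default_rng(0)
N, D, E = 8, 16, 16                          # shots, feature dim, embedding dim
U = rng.standard_normal((N, D))              # image features u_1 ... u_N
W_in = rng.standard_normal((D, E)) * 0.01    # first embedding matrix (illustrative)
W_out = rng.standard_normal((D, E)) * 0.01   # second embedding matrix (illustrative)

a = U @ W_in    # input memory: one vector per shot
b = U @ W_out   # output memory: one vector per shot
```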
在一个或多个的实施例中,每个镜头可以对应至少两个全局特征,至少两个全局特征的获取可通过至少两组嵌入矩阵组获得,每组嵌入矩阵组的结构与上述实施例中的第一嵌入矩阵和第二嵌入矩阵类似;
将镜头的图像特征分别映射到至少两组嵌入矩阵组,获得至少两组记忆组,每组嵌入矩阵组包括两个嵌入矩阵,每组记忆组包括输入记忆和输出记忆;
根据至少两组记忆组和镜头的图像特征,获取镜头的至少两个全局特征。
本申请实施例中,为了提高镜头的权重的全局性,通过至少两组记忆组获得至少两个全局特征,结合多个全局特征获得镜头的权重,其中,每组嵌入矩阵组中包括的嵌入矩阵不同或相同,当嵌入矩阵组之间不同时,获得的全局特征能更好的体现镜头与视频整体的关联。
图4为本申请实施例提供的视频摘要生成方法的另一可选示例的部分流程示意图。如图4所示,上述实施例中的步骤320包括:
步骤402,将镜头的图像特征映射到第三嵌入矩阵,得到镜头的特征向量。
在一些实施例中，该第三嵌入矩阵可实现对图像特征的转置，即将该镜头的图像特征进行转置，获得镜头的特征向量，例如：将镜头序列中的第i个镜头对应的图像特征u_i经过转置获得特征向量u_i^T。
步骤404,将特征向量与输入记忆进行内积运算,得到镜头的权值向量。
在一些实施例中，输入记忆对应镜头序列，因此，输入记忆包括至少两个向量(数量对应镜头数量)，将特征向量与输入记忆进行内积运算时，可通过Softmax激活函数将特征向量与输入记忆中的多个向量计算内积得到的结果映射到(0,1)区间内，获得多个概率形式表达的值，多个概率形式表达的值作为该镜头的权值向量，例如：可通过公式(1)获得权值向量：
p_i = Softmax(u_i^T · a)        (1)
其中，u_i表示第i个镜头的图像特征，即当前需要计算权重的镜头对应的图像特征；a表示输入记忆；p_i表示第i个图像特征与输入记忆之间的关联性的权值向量；Softmax激活函数用于多分类过程中，将多个神经元的输出映射到(0,1)区间内，可以看成概率来理解；其中i的取值为1到镜头序列的镜头数量；通过公式(1)即可获得表达第i个图像特征与镜头序列的关联性的权值向量。
步骤406,将权值向量与输出记忆进行加权叠加运算,得到全局向量,将全局向量作为全局特征。
在一些实施例中,通过以下公式(2)获得全局向量:
o_i = ∑_i p_i·b        (2)
其中，b表示基于第二嵌入矩阵获得的输出记忆；o_i表示第i个图像特征与输出记忆计算获得的全局向量。
本实施例通过图像特征与输入记忆进行内积运算,获得该图像特征与每个镜头之间的关联性,可选地,在进行内积运算之前,可以对该图像特征进行转置处理,以保证图像特征与输入记忆中的向量可以进行内积运算,此时获得的权值向量包括多个概率值,每个概率值表示该镜头与镜头序列中每个镜头的关联性,概率值越大,关联性越强,分别将每个概率值与输出记忆中的多个向量进行内积运算,获得该镜头的全局向量作为全局特征。
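As a small illustration of formulas (1) and (2), the sketch below computes, for one shot, the softmax weights against the input memory and the weighted sum over the output memory. The shapes and toy data are assumptions, and the third embedding matrix is reduced to a simple transpose as described above.

```python
# Sketch of formulas (1) and (2) for a single shot i (toy data, assumed shapes).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_vector(u_i, a, b):
    """u_i: (E,) image feature of shot i; a, b: (N, E) input / output memories."""
    p_i = softmax(a @ u_i)   # formula (1): relevance of shot i to every shot
    o_i = p_i @ b            # formula (2): weighted sum over the output memory
    return o_i

rng = np.random.default_rng(1)
a = rng.standard_normal((8, 16))
b = rng.standard_normal((8, 16))
u_i = rng.standard_normal(16)
print(global_vector(u_i, a, b).shape)   # -> (16,)
```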
在一个实施例中,每个镜头对应至少两个全局特征时,根据至少两组记忆组,获取镜头的至少两个全局特征,包括:
将镜头的图像特征映射到第三嵌入矩阵,得到镜头的特征向量;
将特征向量与至少两个输入记忆进行内积运算,得到镜头的至少两个权值向量;
将权值向量与至少两个输出记忆进行加权叠加运算,得到至少两个全局向量,将至少两个全局向量作为至少两个全局特征。
其中，计算每个权值向量和全局向量的过程与上述实施例中类似，可参照理解，在此不再赘述。可选地，获得权值向量的公式可基于上述公式(1)经过变形获得公式(5)实现：
p_i^k = Softmax(u_i^T · a^k)        (5)
其中，u_i表示第i个镜头的图像特征，即当前需要计算权重的镜头对应的图像特征，u_i^T表示第i个镜头的特征向量；a^k表示第k组记忆组中的输入记忆；p_i^k表示第i个图像特征与第k组记忆组中的输入记忆之间的关联性的权值向量；Softmax激活函数用于多分类过程中，将多个神经元的输出映射到(0,1)区间内，可以看成概率来理解；其中k的取值为1到N；通过公式(5)即可获得表达第i个图像特征与镜头序列的关联性的至少两个权值向量。
在一些实施例中，通过对上述公式(2)进行变形获得公式(6)，得到本实施例中的至少两个全局向量：
o_i^k = ∑_i p_i^k·b^k        (6)
其中，b^k表示第k组记忆组中的输出记忆；o_i^k表示第i个图像特征与第k组记忆组中的输出记忆计算获得的全局向量，基于公式(6)即可获得该镜头的至少两个全局向量。
图5为本申请实施例提供的视频摘要生成方法的又一实施例的流程示意图。如图5所示,
步骤510,对待处理视频流的镜头序列中的镜头进行特征提取,获得每个镜头的图像特征。
本申请实施例中步骤510与上述实施例的步骤110类似,可参照上述实施例对该步骤进行理解,在此不再赘述。
步骤520,根据所有镜头的图像特征,获取镜头的全局特征。
本申请实施例中步骤520与上述实施例的步骤120类似,可参照上述任一实施例对该步骤进行理解,在此不再赘述。
步骤530,将镜头的图像特征和镜头的全局特征进行内积运算,得到权重特征。
在一些实施例中,通过镜头的图像特征与镜头的全局特征进行内积运算,使获得的权重特征在体现镜头在视频整体中重要性的同时,还依赖于镜头本身的信息,可选地,可通过以下公式(3)获得权重特征:
u′_i = u_i ⊙ o_i          (3)
其中，u′_i表示第i个镜头的权重特征，o_i表示第i个镜头的全局向量；⊙表示点乘，即内积运算。
步骤540,将权重特征通过全连接神经网络,得到镜头的权重。
权重用于体现镜头的重要性,因此,需要以数值的形式进行体现,可选地,本实施例通过全连接神经网络将权重特征的维度变换,获得一维向量表达的镜头的权重。
在一些实施例中,可基于以下公式(4)获得镜头的权重:
s_i = W_D·u′_i + b_D        (4)
其中，s_i表示第i个镜头的权重，W_D和b_D分别表示目标图像特征经过的全连接网络中的权重和偏移量。
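A brief sketch of formulas (3) and (4): the image feature and the global vector are combined element-wise, and a fully connected layer maps the result to a scalar weight. W_D and b_D are toy values here, assumptions for illustration rather than learned parameters of the embodiment.

```python
# Sketch of formulas (3) and (4) for one shot (toy parameters, assumed shapes).
import numpy as np

def shot_weight(u_i, o_i, W_D, b_D):
    u_prime = u_i * o_i                   # formula (3): element-wise product u_i ⊙ o_i
    return float(W_D @ u_prime + b_D)     # formula (4): fully connected layer -> scalar s_i

rng = np.random.default_rng(2)
u_i = rng.standard_normal(16)
o_i = rng.standard_normal(16)
s_i = shot_weight(u_i, o_i, W_D=rng.standard_normal(16), b_D=0.1)
```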
步骤550,基于镜头的权重获得待处理视频流的视频摘要。
本实施例结合镜头的图像特征和镜头的全局特征确定镜头的权重,在体现该镜头的信息的同时,结合了镜头与视频整体的关联,实现了从视频局部和视频整体的角度来理解视频,使获得的视频摘要更符合人类习惯。
在一些实施例中,根据镜头的图像特征和全局特征确定镜头的权重,包括:
将镜头的图像特征和镜头的至少两个全局特征中的第一全局特征进行内积运算,得到第一权重特征;
将第一权重特征作为图像特征,镜头的至少两个全局特征中的第二全局特征作为第一全局特征,第二全局特征为至少两个全局特征中除了第一全局特征之外的全局特征;
将镜头的图像特征和镜头的至少两个全局特征中的第一全局特征进行内积运算,得到第一权重特征;
直到镜头的至少两个全局特征中不包括第二全局特征,将第一权重特征作为镜头的权重特征;
将权重特征通过全连接神经网络,得到镜头的权重。
本实施例中，由于全局特征具有多个，每次将图像特征与全局特征内积运算的结果作为下一次运算的图像特征，实现循环，每次运算可基于对上述公式(3)变更得到的公式(7)实现：
u′_i = u_i ⊙ o_i^k        (7)
其中，o_i^k表示第i个图像特征与第k组记忆组中的输出记忆计算获得的全局向量；u′_i表示第一权重特征，⊙表示点乘。在循环到第k+1组记忆组中的输出记忆计算获得的全局向量时，将u′_i替换u_i作为第i个镜头的图像特征，此时 u_i ⊙ o_i^k 变换为 u′_i ⊙ o_i^(k+1)，直到完成所有记忆组的运算，将u′_i输出作为镜头的权重特征；通过权重特征确定镜头的权重与上述实施例类似，在此不再赘述。
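One reading of formulas (5)-(7) is a multi-hop loop over the K memory groups: each hop reads the k-th input/output memories and folds the resulting global vector into the running weight feature, which then replaces the image feature for the next hop. The sketch below follows that reading; shapes and data are assumptions for illustration.

```python
# Sketch of a multi-hop pass over K memory groups (formulas (5)-(7)); one
# possible reading of the loop, with assumed shapes and toy data.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_hop_weight_feature(u_i, memory_groups):
    """memory_groups: list of (a_k, b_k) pairs, each of shape (N, E)."""
    feat = u_i
    for a_k, b_k in memory_groups:
        p_k = softmax(a_k @ feat)   # formula (5): weights against the k-th input memory
        o_k = p_k @ b_k             # formula (6): global vector from the k-th output memory
        feat = feat * o_k           # formula (7): result becomes the feature for the next hop
    return feat                     # weight feature u'_i after the last memory group

rng = np.random.default_rng(3)
groups = [(rng.standard_normal((8, 16)), rng.standard_normal((8, 16))) for _ in range(3)]
u_prime = multi_hop_weight_feature(rng.standard_normal(16), groups)
```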
图6为本申请实施例提供的视频摘要生成方法的一些可选示例的示意图。如图6所示,本示例中包括多组记忆组,其中记忆组的数量为n,通过对视频流分割获得多个矩阵,通过对图像特征结合上述公式(5)、(6)、(7)、(4)计算,可获得第i个镜头的权重s i,具体获得权重的过程可参照上述实施例的描述,在此不再赘述。
图7为本申请实施例提供的视频摘要生成方法的又一实施例的流程示意图。如图7所示,该实施例方法包括:
步骤710,对待处理视频流进行镜头分割获得镜头序列。
在一些实施例中,基于待处理视频流中至少两帧视频图像之间的相似度进行镜头分割,获得镜头序列。
在一些实施例中,可通过两帧视频图像对应的特征之间的距离(如:欧式距离、余弦距离等)确定两帧视频图像之间的相似度,两帧视频图像之间的相似度越高,说明两帧视频图像属于同一镜头的可能性越大,本实施例通过视频图像之间的相似度可将差异较为明显的视频图像分割到不同的镜头中,实现准确的镜头分割。
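As an illustration of the similarity test used for shot segmentation, a small cosine-similarity helper is sketched below (a Euclidean-distance version would work analogously); the frame features themselves are assumed to come from some feature extraction step.

```python
# Minimal cosine-similarity helper between two frame features; a higher value
# suggests the two frames are more likely to belong to the same shot.
import numpy as np

def frame_similarity(f1, f2, eps=1e-8):
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + eps))
```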
步骤720,对待处理视频流的镜头序列中的镜头进行特征提取,获得每个镜头的图像特征。
本申请实施例中步骤720与上述实施例的步骤110类似,可参照上述任一实施例对该步骤进行理解,在此不再赘述。
步骤730,根据所有镜头的图像特征,获取镜头的全局特征。
本申请实施例中步骤730与上述实施例的步骤120类似,可参照上述任一实施例对该步骤进行理解,在此不再赘述。
步骤740,根据镜头的图像特征和全局特征确定镜头的权重。
本申请实施例中步骤740与上述实施例的步骤130类似,可参照上述任一实施例对该步骤进行理解,在此不再赘述。
步骤750,基于镜头的权重获得待处理视频流的视频摘要。
本申请实施例中步骤750与上述实施例的步骤140类似,可参照上述任一实施例对该步骤进行理解,在此不再赘述。
本申请实施例以镜头作为提取摘要的单位,首先,需要基于视频流获得至少两个镜头,进行镜头分割的方法可以通过神经网络进行分割或通过已知摄影镜头或人为判断等方法实现;本申请实施例不限制分割镜头的具体手段。
图8为本申请实施例提供的视频摘要生成方法的又一可选示例的部分流程示意图。如图8所示,上述实施例中步骤710包括:
步骤802,基于至少两个大小不同的分割间距对视频流中的视频图像进行分割,获得至少两组视频片段组。
其中,每组视频片段组包括至少两个视频片段,分割间距大于等于1帧。
本申请实施例中通过多个大小不同的分割间距对视频流进行分割,例如:分割间距分别为:1帧、4帧、6帧、8帧等等,通过一个分割间距可将视频流分割为固定大小(如:6帧)的多个视频片段。
步骤804,基于每组视频片段组中至少两个断开帧之间的相似度,确定分割是否正确。
其中,断开帧为视频片段中的第一帧;可选地,响应于至少两个断开帧之间的相似度小于或等于设定值,确定分割正确;
响应于至少两个断开帧之间的相似度大于设定值,确定分割不正确。
在一些实施例中,两帧视频图像之间的关联可以基于特征之间的相似度确定,相似度越大,说明是同一镜头的可能性越大。从拍摄角度讲,场景的切换包括两种,一种是镜头直接切换场景,另一种是通过长镜头逐渐变化场景,本申请实施例主要以场景的变化作为镜头分割的依据,即,即使是同一长镜头中拍摄的视频片段,当某一帧的图像与该长镜头的第一帧图像的关联性小于或等于设定值时,也进行镜头分割。
步骤806,响应于分割正确,确定视频片段作为镜头,获得镜头序列。
本申请实施例中通过多个大小不同的分割间距对视频流进行分割,再判断连续的两个视频片段的断开帧之间的相似度,以确定该位置的分割是否正确,当两个连续的断开帧之间的相似度超过一定值时,说明该位置的分割不正确,即这两个视频片段属于一个镜头,通过正确的分割即可获得镜头序列。
在一些实施例中,步骤806包括:
响应于断开帧对应至少两个分割间距，以大小较小的分割间距获得的视频片段作为所述镜头，获得镜头序列。
当一个断开位置的断开帧同时是至少两个分割间距分割的端口,例如:对包括8帧图像的视频流分别以2帧和4帧作为第一分割间距和第二分割间距,第一分割间距获得4个视频片段,其中第1帧、第3帧、第5帧和第7帧为断开帧,第二分割间距获得2个视频片段,其中第1帧和第5帧为断开帧;此时,如果确定第5帧和第7帧的断开帧对应的分割正确,即第5帧即是第一分割间距的断开帧,也是第二分割间距的断开帧,此时,以第一分割间距为准,即:对该视频流分割获得3个镜头:第1帧到第4帧为一个镜头,第5帧和第6帧为一个镜头,第7帧和第8帧为一个镜头;而不是按照第二分割间距将第5帧到第8帧作为一个镜头。
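The following sketch illustrates one way steps 802-806 could be realized: boundaries are proposed at several fixed intervals, and a proposed break frame is kept only if it is sufficiently dissimilar from the previous break frame; because smaller intervals add more candidate boundaries, they effectively take precedence where several intervals coincide. The threshold, intervals, and per-frame features are all assumptions for illustration.

```python
# Illustrative sketch of steps 802-806 (assumed threshold / intervals / features):
# split the stream at several fixed spacings and keep a split only when the
# break frames on either side are dissimilar enough.
import numpy as np

def cos_sim(f1, f2, eps=1e-8):
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + eps))

def segment_shots(frame_feats, intervals=(4, 2), threshold=0.8):
    """frame_feats: (T, D) per-frame features. Returns start frames of the shots."""
    T = frame_feats.shape[0]
    boundaries = {0}
    for gap in intervals:                       # several candidate spacings
        for t in range(gap, T, gap):
            prev = max(bnd for bnd in boundaries if bnd < t)
            if cos_sim(frame_feats[prev], frame_feats[t]) <= threshold:
                boundaries.add(t)               # dissimilar break frames: keep the split
    return sorted(boundaries)

rng = np.random.default_rng(4)
print(segment_shots(rng.standard_normal((16, 8))))
```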
在一个或多个的实施例中,步骤110包括:
对镜头中的至少一帧视频图像进行特征提取,获得至少一个图像特征;
获取所有图像特征的均值特征,并将均值特征作为镜头的图像特征。
在一些实施例中,通过特征提取网络分别对镜头中的每帧视频图像进行特征提取,当一个镜头仅包括一帧视频图像时,以该图像特征作为图像特征,当包括多帧视频图像时,对多个图像特征计算均值,以均值特征作为该镜头的图像特征。
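A small sketch of the averaging described above; the per-frame features are assumed to come from an arbitrary feature extraction network and are faked with random values here.

```python
# Sketch: a shot's image feature as the mean of its per-frame features
# (per-frame features are assumed given; a single-frame shot is unchanged).
import numpy as np

def shot_image_feature(frame_feats):
    """frame_feats: (num_frames, D) features of the frames in one shot."""
    return frame_feats.mean(axis=0)

rng = np.random.default_rng(5)
print(shot_image_feature(rng.standard_normal((6, 16))).shape)   # -> (16,)
```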
在一个或多个实施例中,步骤140包括:
(1)获取视频摘要的限定时长。
视频摘要又称视频浓缩,是对视频内容的一个简要概括,可实现在相对较短的时间内将视频表达的主要内容进行体现,需要在实现将视频主要内容表达的同时,还要对视频摘要的时长进行限制,否则将达不到简要的功能,与看完整视频无异。本申请实施例通过限定时长来限制视频摘要的时长,即,要求获得的视频摘要的时长小于或等于限定时长,限定时长的具体取值可根据实际情况进行设定。
(2)根据镜头的权重和视频摘要的限定时长,获得待处理视频流的视频摘要。
在一些实施例中,本申请实施例通过01背包算法实现视频摘要的提取,01背包问题解决的问题应用到本实施例中可描述为:镜头序列包括多个镜头,每个镜头具有对应(通常不同)的长度,每个镜头具有对应(通常不同)的权重,需要获得限定时长的视频摘要,如何保证视频摘要在限定时长内权重总和最大。因此,本申请实施例通过背包算法可获得最佳内容的视频摘要。此时还存在一种特殊情况,响应于获得权重最高的至少两个镜头中存在长度大于第二设定帧数的镜头,删除长度大于第二设定帧数的镜头,当获得的某一镜头的重要性分数较高,但是它的长度已经大于第二设定帧数(例如:第 一设定帧数的一半),此时如果还将该镜头加入视频摘要,将导致视频摘要中的内容过少,因此,不将该镜头加入到视频摘要中。
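Selecting shots under a duration limit, with weights as values and shot lengths as costs, is the classic 0/1 knapsack problem mentioned above. The sketch below shows a standard dynamic-programming solution plus the optional rule of excluding shots longer than a set frame count; all inputs are toy values assumed for illustration.

```python
# Sketch of step 140 as a 0/1 knapsack: maximize the total weight of selected
# shots subject to the summary duration limit (toy inputs; the optional
# "drop shots longer than a set frame count" rule is included).
import numpy as np

def select_shots(weights, lengths, max_len, max_shot_len=None):
    if max_shot_len is not None:              # optionally exclude overly long shots
        keep = [i for i, L in enumerate(lengths) if L <= max_shot_len]
    else:
        keep = list(range(len(lengths)))
    # Classic 0/1 knapsack dynamic programming over the duration budget.
    dp = np.zeros(max_len + 1)
    choice = [[False] * (max_len + 1) for _ in keep]
    for idx, i in enumerate(keep):
        for cap in range(max_len, lengths[i] - 1, -1):
            if dp[cap - lengths[i]] + weights[i] > dp[cap]:
                dp[cap] = dp[cap - lengths[i]] + weights[i]
                choice[idx][cap] = True
    # Backtrack to recover the selected shots.
    selected, cap = [], max_len
    for idx in range(len(keep) - 1, -1, -1):
        if choice[idx][cap]:
            selected.append(keep[idx])
            cap -= lengths[keep[idx]]
    return sorted(selected)

w = [0.9, 0.2, 0.7, 0.5]
lens = [40, 30, 50, 20]
print(select_shots(w, lens, max_len=80, max_shot_len=60))
```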
在一个或多个可选的实施例中,本申请实施例方法基于特征提取网络和记忆神经网络实现;
在执行步骤110之前,还包括:
基于样本视频流对特征提取网络和记忆神经网络进行联合训练,样本视频流包括至少两个样本镜头,每个样本镜头包括标注权重。
为了实现获得较准确的权重,在获得权重之前需要对特征提取网络和记忆神经网络进行训练,单独训练特征提取网络和记忆神经网络也可以实现本申请实施例的目的,但将特征提取网络和记忆神经网络联合训练得到的参数更适合本申请实施例,能提供更准确的预测权重;该训练过程假设样本视频流已经分割为至少两个样本镜头,该分割过程可以基于训练好的分割神经网络或其他手段,本申请实施例不限制。
在一些实施例中,联合训练的过程可以包括:
利用特征提取网络对样本视频流包括的至少两个样本镜头中的每个样本镜头进行特征提取,获得至少两个样本图像特征;
利用记忆神经网络基于样本镜头特征确定每个样本镜头的预测权重;
基于预测权重和标注权重确定损失,基于损失调整对特征提取网络和记忆神经网络的参数。
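The joint training loop can be summarized as: predict a weight for every sample shot, compare it with the annotated weight, and use the loss to update both networks. The sketch below shows that structure with a single learnable vector standing in for all trainable parameters and a mean-squared-error loss; both are assumptions for illustration, not the embodiment's actual networks or loss.

```python
# Illustrative joint-training sketch: predict shot weights, compare them with
# the annotated weights, and update the trainable parameters by gradient descent.
# A single learnable vector w stands in for the feature extraction network and
# memory network parameters; MSE is an assumed stand-in loss.
import numpy as np

rng = np.random.default_rng(6)
N, D = 8, 16
u = rng.standard_normal((N, D))          # sample shot image features
o = rng.standard_normal((N, D))          # sample global features
labels = rng.random(N)                   # annotated weights of the sample shots

w = np.zeros(D)                          # trainable stand-in parameters
lr = 0.01
for step in range(200):
    feats = u * o                        # weight features (as in formula (3))
    preds = feats @ w                    # predicted shot weights
    err = preds - labels
    loss = float(np.mean(err ** 2))      # regression loss vs. annotated weights
    grad = 2 * feats.T @ err / N         # gradient of the MSE w.r.t. w
    w -= lr * grad                       # parameter update
```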
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于一计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
图9为本申请实施例提供的视频摘要生成装置的一个实施例的结构示意图。该实施例的装置可用于实现本申请上述各方法实施例。如图9所示,该实施例的装置包括:
特征提取单元91,配置为对待处理视频流的镜头序列中的镜头进行特征提取,获得每个镜头的图像特征。
在本实施例中,待处理视频流为获取视频摘要的视频流,视频流包括至少一帧视频图像。为了使获得的视频摘要具有内容含义,而不仅仅是由不同帧的视频图像构成的图像集合,本申请实施例将镜头作为视频摘要的构成单位,每个镜头包括至少一帧视频图 像。可选地,本申请实施例中的特征提取可以是基于任一特征提取网络实现,基于特征提取网络分别对每个镜头进行特征提取,以获得至少两个图像特征,本申请不限制具体进行特征提取的过程。
全局特征单元92,配置为根据所有镜头的图像特征,获取镜头的全局特征。
在一些实施例中,将视频流对应的所有图像特征经过处理(如:映射或嵌入等)获得对应整体视频流的转换特征序列,转换特征序列再与每个图像特征进行计算获得每个镜头对应的全局特征(全局注意力),通过全局特征可以体现每个镜头与视频流中其他镜头之间的关联关系。
权重获取单元93,配置为根据镜头的图像特征和全局特征确定镜头的权重。
通过镜头的图像特征及其全局特征确定该镜头的权重,由此得到的权重不仅基于该镜头本身,还基于该镜头与整个视频流中其他镜头之间的关联关系,实现了从视频整体的角度对镜头的重要性进行评估。
摘要生成单元94,配置为基于镜头的权重获得待处理视频流的视频摘要。
在一些实施例中,本申请实施例通过镜头的权重体现了每个镜头的重要性,可确定镜头序列中较为重要的一些镜头,但确定视频摘要不仅仅基于镜头的重要性,还需要控制视频摘要的长度,即,需要结合镜头的权重和时长(帧数)确定视频摘要,可选地,可采用背包算法获得视频摘要。
上述实施例提供的视频摘要生成装置,结合图像特征和全局特征确定每个镜头的权重,实现了从视频整体的角度来理解视频,利用了每个镜头与整个视频流的全局关联关系,基于本实施例确定的视频摘要,可以在整体上对视频内容进行表达,避免了视频摘要较为片面的问题。
在一个或多个可选的实施例中,全局特征单元92,配置为基于记忆神经网络对所有镜头的图像特征进行处理,获取镜头的全局特征。
在一些实施例中,记忆神经网络可以包括至少两个嵌入矩阵,通过将视频流的所有镜头的图像特征分别输入到至少两个嵌入矩阵中,通过嵌入矩阵的输出获得每个镜头的全局特征,镜头的全局特征可以表达该镜头与视频流中其他镜头之间的关联关系,从镜头的权重看,权重越大,表明该镜头与其他镜头的关联越大,越有可能被包含在视频摘要中。
在一些实施例中,全局特征单元92,配置为将所有镜头的图像特征分别映射到第一 嵌入矩阵和第二嵌入矩阵,获得输入记忆和输出记忆;根据镜头的图像特征、输入记忆和输出记忆,获取镜头的全局特征。
在一些实施例中,全局特征单元92在根据镜头的图像特征、输入记忆和输出记忆,获取镜头的全局特征时,配置为将镜头的图像特征映射到第三嵌入矩阵,得到镜头的特征向量;将特征向量与输入记忆进行内积运算,得到镜头的权值向量;将权值向量与输出记忆进行加权叠加运算,得到全局向量,将全局向量作为全局特征。
在一个或多个可选的实施例中,权重获取单元93,配置为将镜头的图像特征和镜头的全局特征进行内积运算,得到权重特征;将权重特征通过全连接神经网络,得到镜头的权重。
本实施例结合镜头的图像特征和镜头的全局特征确定镜头的权重,在体现该镜头的信息的同时,结合了镜头与视频整体的关联,实现了从视频局部和视频整体的角度来理解视频,使获得的视频摘要更符合人类习惯。
在一个或多个可选的实施例中,全局特征单元92,配置为基于记忆神经网络对镜头的图像特征进行处理,获取镜头的至少两个全局特征。
本申请实施例中,为了提高镜头的权重的全局性,通过至少两组记忆组获得至少两个全局特征,结合多个全局特征获得镜头的权重,其中,每组嵌入矩阵组中包括的嵌入矩阵不同或相同,当嵌入矩阵组之间不同时,获得的全局特征能更好的体现镜头与视频整体的关联。
在一些实施例中,全局特征单元92,配置为将所述镜头的图像特征分别映射到至少两组嵌入矩阵组,获得至少两组记忆组,每组所述嵌入矩阵组包括两个嵌入矩阵,每组所述记忆组包括输入记忆和输出记忆;根据至少两组所述记忆组和所述镜头的图像特征,获取所述镜头的至少两个全局特征。
在一些实施例中,全局特征单元92在根据至少两组记忆组和镜头的图像特征,获取镜头的至少两个全局特征时,配置为将镜头的图像特征映射到第三嵌入矩阵,得到镜头的特征向量;将特征向量与至少两个输入记忆进行内积运算,得到镜头的至少两个权值向量;将权值向量与至少两个输出记忆进行加权叠加运算,得到至少两个全局向量,将至少两个全局向量作为至少两个全局特征。
在一些实施例中,权重获取单元93,配置为将镜头的图像特征和镜头的至少两个全局特征中的第一全局特征进行内积运算,得到第一权重特征;将第一权重特征作为图像 特征,镜头的至少两个全局特征中的第二全局特征作为第一全局特征,第二全局特征为至少两个全局特征中除了第一全局特征之外的全局特征;将镜头的图像特征和镜头的至少两个全局特征中的第一全局特征进行内积运算,得到第一权重特征;直到镜头的至少两个全局特征中不包括第二全局特征,将第一权重特征作为镜头的权重特征;将权重特征通过全连接神经网络,得到镜头的权重。
在一个或多个可选的实施例中,装置还包括:
镜头分割单元,用于对待处理视频流进行镜头分割获得镜头序列。
在一些实施例中,基于待处理视频流中至少两帧视频图像之间的相似度进行镜头分割,获得镜头序列。
在一些实施例中,可通过两帧视频图像对应的特征之间的距离(如:欧式距离、余弦距离等)确定两帧视频图像之间的相似度,两帧视频图像之间的相似度越高,说明两帧视频图像属于同一镜头的可能性越大,本实施例通过视频图像之间的相似度可将差异较为明显的视频图像分割到不同的镜头中,实现准确的镜头分割。
在一些实施例中,镜头分割单元,配置为基于待处理视频流中至少两帧视频图像之间的相似度进行镜头分割,获得镜头序列。
在一些实施例中,镜头分割单元,配置为基于至少两个大小不同的分割间距对视频流中的视频图像进行分割,获得至少两组视频片段组,每组视频片段组包括至少两个视频片段,分割间距大于等于1帧;基于每组视频片段组中至少两个断开帧之间的相似度,确定分割是否正确,断开帧为视频片段中的第一帧;响应于分割正确,确定视频片段作为镜头,获得镜头序列。
在一些实施例中,镜头分割单元在基于每组视频片段组中至少两个断开帧之间的相似度,确定分割是否正确时,配置为响应于至少两个断开帧之间的相似度小于或等于设定值,确定分割正确;响应于至少两个断开帧之间的相似度大于设定值,确定分割不正确。
在一些实施例中,镜头分割单元在响应于分割正确,确定视频片段作为镜头,获得镜头序列时,配置为响应于断开帧对应至少两个分割间距,以大小较小的分割间距获得的视频片段作为镜头,获得镜头序列。
在一个或多个可选的实施例中,特征提取单元91,配置为对镜头中的至少一帧视频图像进行特征提取,获得至少一个图像特征;获取所有图像特征的均值特征,并将均值 特征作为镜头的图像特征。
在一些实施例中,通过特征提取网络分别对镜头中的每帧视频图像进行特征提取,当一个镜头仅包括一帧视频图像时,以该图像特征作为图像特征,当包括多帧视频图像时,对多个图像特征计算均值,以均值特征作为该镜头的图像特征。
在一个或多个可选的实施例中,摘要生成单元,配置为获取视频摘要的限定时长;根据镜头的权重和视频摘要的限定时长,获得待处理视频流的视频摘要。
视频摘要又称视频浓缩,是对视频内容的一个简要概括,可实现在相对较短的时间内将视频表达的主要内容进行体现,需要在实现将视频主要内容表达的同时,还要对视频摘要的时长进行限制,否则将达不到简要的功能,与看完整视频无异,本申请实施例通过限定时长来限制视频摘要的时长,即,要求获得的视频摘要的时长小于或等于限定时长,限定时长的具体取值可根据实际情况进行设定。
在一个或多个实施例中,本申请实施例装置还包括:
联合训练单元,配置为基于样本视频流对特征提取网络和记忆神经网络进行联合训练,样本视频流包括至少两个样本镜头,每个样本镜头包括标注权重。
为了实现获得较准确的权重,在获得权重之前需要对特征提取网络和记忆神经网络进行训练,单独训练特征提取网络和记忆神经网络也可以实现本申请实施例的目的,但将特征提取网络和记忆神经网络联合训练得到的参数更适合本申请实施例,能提供更准确的预测权重;该训练过程假设样本视频流已经分割为至少两个样本镜头,该分割过程可以基于训练好的分割神经网络或其他手段,本申请实施例不限制。
本申请实施例的另一个方面,还提供了一种电子设备,包括处理器,该处理器包括上述任意一项实施例提供的视频摘要生成装置。
本申请实施例的又一个方面,还提供了一种电子设备,包括:存储器,配置为存储可执行指令;
以及处理器,配置为与该存储器通信以执行所述可执行指令从而完成上述任意一项实施例提供的视频摘要生成方法的操作。
本申请实施例的还一个方面,还提供了一种计算机存储介质,配置为存储计算机可读取的指令,该指令被执行时执行上述任意一项实施例提供的视频摘要生成方法的操作。
本申请实施例的再一个方面,还提供了一种计算机程序产品,包括计算机可读代码, 当所述计算机可读代码在设备上运行时,该设备中的处理器执行用于实现上述任意一项实施例提供的视频摘要生成方法的指令。
本申请实施例还提供了一种电子设备,例如可以是移动终端、个人计算机(PC)、平板电脑、服务器等。下面参考图10,其示出了适于用来实现本申请实施例的终端设备或服务器的电子设备1000的结构示意图:如图10所示,电子设备1000包括一个或多个处理器、通信部等,所述一个或多个处理器例如:一个或多个中央处理单元(CPU)1001,和/或一个或多个专用处理器,专用处理器可作为加速单元1013,可包括但不限于图像处理器(GPU)、FPGA、DSP以及其它的ASIC芯片之类专用处理器等,处理器可以根据存储在只读存储器(ROM)1002中的可执行指令或者从存储部分1008加载到随机访问存储器(RAM)1003中的可执行指令而执行各种适当的动作和处理。通信部1012可包括但不限于网卡,所述网卡可包括但不限于IB(Infiniband)网卡。
处理器可与只读存储器1002和/或随机访问存储器1003中通信以执行可执行指令,通过总线1004与通信部1012相连、并经通信部1012与其他目标设备通信,从而完成本申请实施例提供的任一项方法对应的操作,例如,对待处理视频流的镜头序列中的镜头进行特征提取,获得每个镜头的图像特征,每个镜头包括至少一帧视频图像;根据所有镜头的图像特征,获取镜头的全局特征;根据镜头的图像特征和全局特征确定镜头的权重;基于镜头的权重获得待处理视频流的视频摘要。
此外,在RAM 1003中,还可存储有装置操作所需的各种程序和数据。CPU1001、ROM1002以及RAM1003通过总线1004彼此相连。在有RAM1003的情况下,ROM1002为可选模块。RAM1003存储可执行指令,或在运行时向ROM1002中写入可执行指令,可执行指令使中央处理单元1001执行上述通信方法对应的操作。输入/输出(I/O)接口1005也连接至总线1004。通信部1012可以集成设置,也可以设置为具有多个子模块(例如多个IB网卡),并在总线链接上。
以下部件连接至I/O接口1005:包括键盘、鼠标等的输入部分1006;包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的输出部分1007;包括硬盘等的存储部分1008;以及包括诸如LAN卡、调制解调器等的网络接口卡的通信部分1009。通信部分1009经由诸如因特网的网络执行通信处理。驱动器1010也根据需要连接至I/O接口1005。可拆卸介质1011,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器1010上,以便于从其上读出的计算机程序根据需要被安装入存储部分 1008。
需要说明的,如图10所示的架构仅为一种可选实现方式,在具体实践过程中,可根据实际需要对上述图10的部件数量和类型进行选择、删减、增加或替换;在不同功能部件设置上,也可采用分离设置或集成设置等实现方式,例如加速单元1013和CPU1001可分离设置或者可将加速单元1013集成在CPU1001上,通信部可分离设置,也可集成设置在CPU1001或加速单元1013上,等等。这些可替换的实施方式均落入本申请公开的保护范围。
特别地,根据本申请的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本申请的实施例包括一种计算机程序产品,其包括有形地包含在机器可读介质上的计算机程序,计算机程序包含用于执行流程图所示的方法的程序代码,程序代码可包括对应执行本申请实施例提供的方法步骤对应的指令,例如,对待处理视频流的镜头序列中的镜头进行特征提取,获得每个镜头的图像特征,每个镜头包括至少一帧视频图像;根据所有镜头的图像特征,获取镜头的全局特征;根据镜头的图像特征和全局特征确定镜头的权重;基于镜头的权重获得待处理视频流的视频摘要。在这样的实施例中,该计算机程序可以通过通信部分1009从网络上被下载和安装,和/或从可拆卸介质1011被安装。在该计算机程序被中央处理单元(CPU)1001执行时,执行本申请的方法中限定的上述功能的操作。
可能以许多方式来实现本申请的方法和装置。例如,可通过软件、硬件、固件或者软件、硬件、固件的任何组合来实现本申请的方法和装置。用于所述方法的步骤的上述顺序仅是为了进行说明,本申请的方法的步骤不限于以上具体描述的顺序,除非以其它方式特别说明。此外,在一些实施例中,还可将本申请实施为记录在记录介质中的程序,这些程序包括用于实现根据本申请的方法的机器可读指令。因而,本申请还覆盖存储用于执行根据本申请的方法的程序的记录介质。
本申请的描述是为了示例和描述起见而给出的,而并不是无遗漏的或者将本申请限于所公开的形式。很多修改和变化对于本领域的普通技术人员而言是显然的。选择和描述实施例是为了更好说明本申请的原理和实际应用,并且使本领域的普通技术人员能够理解本申请从而设计适于特定用途的带有各种修改的各种实施例。

Claims (38)

  1. 一种视频摘要生成方法,其中,包括:
    对待处理视频流的镜头序列中的镜头进行特征提取,获得每个所述镜头的图像特征,每个所述镜头包括至少一帧视频图像;
    根据所有所述镜头的图像特征,获取所述镜头的全局特征;
    根据所述镜头的图像特征和所述全局特征确定所述镜头的权重;
    基于所述镜头的权重获得所述待处理视频流的视频摘要。
  2. 根据权利要求1所述的方法,其中,所述根据所有所述镜头的图像特征,获取所述镜头的全局特征,包括:
    基于记忆神经网络对所有所述镜头的图像特征进行处理,获取所述镜头的全局特征。
  3. 根据权利要求2所述的方法,其中,所述记忆神经网络对所述所有镜头的图像特征进行处理,获取所述镜头的全局特征,包括:
    将所述所有镜头的图像特征分别映射到第一嵌入矩阵和第二嵌入矩阵,获得输入记忆和输出记忆;
    根据所述镜头的图像特征、所述输入记忆和所述输出记忆,获取所述镜头的全局特征。
  4. 根据权利要求3所述的方法,其中,所述根据所述镜头的图像特征、所述输入记忆和所述输出记忆,获取所述镜头的全局特征,包括:
    将所述镜头的图像特征映射到第三嵌入矩阵,得到所述镜头的特征向量;
    将所述特征向量与所述输入记忆进行内积运算,得到所述镜头的权值向量;
    将所述权值向量与所述输出记忆进行加权叠加运算,得到所述全局向量,将所述全局向量作为所述全局特征。
  5. 根据权利要求1-4任一项所述的方法,其中,所述根据所述镜头的图像特征和所述全局特征确定所述镜头的权重,包括:
    将所述镜头的图像特征和所述镜头的全局特征进行内积运算,得到权重特征;
    将所述权重特征通过全连接神经网络,得到所述镜头的权重。
  6. 根据权利要求2-5任一所述的方法,其中,所述基于记忆神经网络对所述镜头的图像特征进行处理,获取所述镜头的全局特征,包括:
    基于记忆神经网络对所述镜头的图像特征进行处理,获取所述镜头的至少两个全局 特征。
  7. 根据权利要求6所述的方法,其中,所述基于记忆神经网络对所述镜头的图像特征进行处理,获取所述镜头的至少两个全局特征,包括:
    将所述镜头的图像特征分别映射到至少两组嵌入矩阵组,获得至少两组记忆组,每组所述嵌入矩阵组包括两个嵌入矩阵,每组所述记忆组包括输入记忆和输出记忆;
    根据至少两组所述记忆组和所述镜头的图像特征,获取所述镜头的至少两个全局特征。
  8. 根据权利要求7所述的方法,其中,所述根据至少两组所述记忆组和所述镜头的图像特征,获取所述镜头的至少两个全局特征,包括:
    将所述镜头的图像特征映射到第三嵌入矩阵,得到所述镜头的特征向量;
    将所述特征向量与至少两个所述输入记忆进行内积运算,得到所述镜头的至少两个权值向量;
    将所述权值向量与至少两个所述输出记忆进行加权叠加运算,得到至少两个全局向量,将所述至少两个全局向量作为所述至少两个全局特征。
  9. 根据权利要求6-8任一项所述的方法,其中,所述根据所述镜头的图像特征和所述全局特征确定所述镜头的权重,包括:
    将所述镜头的图像特征和所述镜头的至少两个全局特征中的第一全局特征进行内积运算,得到第一权重特征;
    将所述第一权重特征作为所述图像特征,所述镜头的至少两个全局特征中的第二全局特征作为第一全局特征,所述第二全局特征为所述至少两个全局特征中除了第一全局特征之外的全局特征;
    将所述镜头的图像特征和所述镜头的至少两个全局特征中的第一全局特征进行内积运算,得到第一权重特征;
    直到所述镜头的至少两个全局特征中不包括第二全局特征,将所述第一权重特征作为所述镜头的权重特征;
    将所述权重特征通过全连接神经网络,得到所述镜头的权重。
  10. 根据权利要求1-9任一所述的方法,其中,所述对待处理视频流的镜头序列中的镜头进行特征提取,获得所述镜头的图像特征之前,还包括:
    对所述待处理视频流进行镜头分割获得所述镜头序列。
  11. 根据权利要求10所述的方法,其中,所述对所述待处理视频流进行镜头分割获得所述镜头序列,包括:
    基于所述待处理视频流中至少两帧视频图像之间的相似度进行镜头分割,获得所述镜头序列。
  12. 根据权利要求11所述的方法,其中,所述基于所述待处理视频流中至少两帧视频图像之间的相似度进行镜头分割,获得所述镜头序列,包括:
    基于至少两个大小不同的分割间距对所述视频流中的视频图像进行分割,获得至少两组视频片段组,每组所述视频片段组包括至少两个视频片段,所述分割间距大于等于1帧;
    基于所述每组视频片段组中至少两个断开帧之间的相似度,确定所述分割是否正确,所述断开帧为所述视频片段中的第一帧;
    响应于所述分割正确,确定所述视频片段作为所述镜头,获得所述镜头序列。
  13. 根据权利要求12所述的方法,其中,所述基于所述每组视频片段组中至少两个断开帧之间的相似度,确定所述分割是否正确,包括:
    响应于所述至少两个断开帧之间的相似度小于或等于设定值,确定所述分割正确;
    响应于所述至少两个断开帧之间的相似度大于设定值,确定所述分割不正确。
  14. 根据权利要求12或13所述的方法,其中,所述响应于所述分割正确,确定所述视频片段作为所述镜头,获得所述镜头序列,包括:
    响应于所述断开帧对应至少两个所述分割间距,以大小较小的分割间距获得的视频片段作为所述镜头,获得所述镜头序列。
  15. 根据权利要求1-14任一所述的方法,其中,所述对待处理视频流的镜头序列中的镜头进行特征提取,获得每个所述镜头的图像特征,包括:
    对所述镜头中的至少一帧视频图像进行特征提取,获得至少一个图像特征;
    获取所有所述图像特征的均值特征,并将所述均值特征作为所述镜头的图像特征。
  16. 根据权利要求1-15任一所述的方法,其中,所述基于所述镜头的权重获得所述待处理视频流的视频摘要,包括:
    获取所述视频摘要的限定时长;
    根据所述镜头的权重和所述视频摘要的限定时长,获得所述待处理视频流的视频摘要。
  17. 根据权利要求1-16任一所述的方法,其中,所述方法基于特征提取网络和记忆神经网络实现;
    所述对待处理视频流的镜头序列中的镜头进行特征提取,获得每个所述镜头的图像特征之前,还包括:
    基于样本视频流对所述特征提取网络和记忆神经网络进行联合训练,所述样本视频流包括至少两个样本镜头,每个所述样本镜头包括标注权重。
  18. 一种视频摘要生成装置，包括：
    特征提取单元,配置为对待处理视频流的镜头序列中的镜头进行特征提取,获得每个所述镜头的图像特征,每个所述镜头包括至少一帧视频图像;
    全局特征单元,配置为根据所有所述镜头的图像特征,获取所述镜头的全局特征;
    权重获取单元,配置为根据所述镜头的图像特征和所述全局特征确定所述镜头的权重;
    摘要生成单元,配置为基于所述镜头的权重获得所述待处理视频流的视频摘要。
  19. 根据权利要求18所述的装置,其中,所述全局特征单元,配置为基于记忆神经网络对所有所述镜头的图像特征进行处理,获取所述镜头的全局特征。
  20. 根据权利要求19所述的装置,其中,所述全局特征单元,配置为将所述所有镜头的图像特征分别映射到第一嵌入矩阵和第二嵌入矩阵,获得输入记忆和输出记忆;根据所述镜头的图像特征、所述输入记忆和所述输出记忆,获取所述镜头的全局特征。
  21. 根据权利要求20所述的装置,其中,所述全局特征单元在根据所述镜头的图像特征、所述输入记忆和所述输出记忆,获取所述镜头的全局特征时,配置为将所述镜头的图像特征映射到第三嵌入矩阵,得到所述镜头的特征向量;将所述特征向量与所述输入记忆进行内积运算,得到所述镜头的权值向量;将所述权值向量与所述输出记忆进行加权叠加运算,得到所述全局向量,将所述全局向量作为所述全局特征。
  22. 根据权利要求18-21任一项所述的装置,其中,所述权重获取单元,配置为将所述镜头的图像特征和所述镜头的全局特征进行内积运算,得到权重特征;将所述权重特征通过全连接神经网络,得到所述镜头的权重。
  23. 根据权利要求19-22任一所述的装置,其中,所述全局特征单元,配置为基于记忆神经网络对所述镜头的图像特征进行处理,获取所述镜头的至少两个全局特征。
  24. 根据权利要求23所述的装置,其中,所述全局特征单元,配置为将所述镜头的图像特征分别映射到至少两组嵌入矩阵组,获得至少两组记忆组,每组所述嵌入矩阵组包括两个嵌入矩阵,每组所述记忆组包括输入记忆和输出记忆;根据至少两组所述记忆组和所述镜头的图像特征,获取所述镜头的至少两个全局特征。
  25. 根据权利要求24所述的装置,其中,所述全局特征单元在根据至少两组所述记忆组和所述镜头的图像特征,获取所述镜头的至少两个全局特征时,配置为将所述镜头的图像特征映射到第三嵌入矩阵,得到所述镜头的特征向量;将所述特征向量与至少 两个所述输入记忆进行内积运算,得到所述镜头的至少两个权值向量;将所述权值向量与至少两个所述输出记忆进行加权叠加运算,得到至少两个全局向量,将所述至少两个全局向量作为所述至少两个全局特征。
  26. 根据权利要求23-25任一项所述的装置,其中,所述权重获取单元,配置为将所述镜头的图像特征和所述镜头的至少两个全局特征中的第一全局特征进行内积运算,得到第一权重特征;将所述第一权重特征作为所述图像特征,所述镜头的至少两个全局特征中的第二全局特征作为第一全局特征,所述第二全局特征为所述至少两个全局特征中除了第一全局特征之外的全局特征;将所述镜头的图像特征和所述镜头的至少两个全局特征中的第一全局特征进行内积运算,得到第一权重特征;直到所述镜头的至少两个全局特征中不包括第二全局特征,将所述第一权重特征作为所述镜头的权重特征;将所述权重特征通过全连接神经网络,得到所述镜头的权重。
  27. 根据权利要求18-26任一所述的装置,其中,所述装置还包括:
    镜头分割单元,配置为对所述待处理视频流进行镜头分割获得所述镜头序列。
  28. 根据权利要求27所述的装置,其中,所述镜头分割单元,配置为基于所述待处理视频流中至少两帧视频图像之间的相似度进行镜头分割,获得所述镜头序列。
  29. 根据权利要求28所述的装置,其中,所述镜头分割单元,配置为基于至少两个大小不同的分割间距对所述视频流中的视频图像进行分割,获得至少两组视频片段组,每组所述视频片段组包括至少两个视频片段,所述分割间距大于等于1帧;基于所述每组视频片段组中至少两个断开帧之间的相似度,确定所述分割是否正确,所述断开帧为所述视频片段中的第一帧;响应于所述分割正确,确定所述视频片段作为所述镜头,获得所述镜头序列。
  30. 根据权利要求29所述的装置,其中,所述镜头分割单元在基于所述每组视频片段组中至少两个断开帧之间的相似度,确定所述分割是否正确时,配置为响应于所述至少两个断开帧之间的相似度小于或等于设定值,确定所述分割正确;响应于所述至少两个断开帧之间的相似度大于设定值,确定所述分割不正确。
  31. 根据权利要求29或30所述的装置,其中,所述镜头分割单元在响应于所述分割正确,确定所述视频片段作为所述镜头,获得所述镜头序列时,配置为响应于所述断开帧对应至少两个所述分割间距,以大小较小的分割间距获得的视频片段作为所述镜头,获得所述镜头序列。
  32. 根据权利要求18-31任一所述的装置,其中,所述特征提取单元,配置为对所述镜头中的至少一帧视频图像进行特征提取,获得至少一个图像特征;获取所有所述图 像特征的均值特征,并将所述均值特征作为所述镜头的图像特征。
  33. 根据权利要求18-32任一所述的装置,其中,所述摘要生成单元,配置为获取所述视频摘要的限定时长;根据所述镜头的权重和所述视频摘要的限定时长,获得所述待处理视频流的视频摘要。
  34. 根据权利要求18-33任一所述的装置,其中,所述装置还包括:
    联合训练单元,配置为基于样本视频流对所述特征提取网络和记忆神经网络进行联合训练,所述样本视频流包括至少两个样本镜头,每个所述样本镜头包括标注权重。
  35. 一种电子设备,包括处理器,所述处理器包括权利要求18至34任意一项所述的视频摘要生成装置。
  36. 一种电子设备,包括:存储器,配置为存储可执行指令;
    以及处理器,配置为与所述存储器通信以执行所述可执行指令从而完成权利要求1至17任意一项所述视频摘要生成方法的操作。
  37. 一种计算机存储介质,配置为存储计算机可读取的指令,其中,所述指令被执行时执行权利要求1至17任意一项所述视频摘要生成方法的操作。
  38. 一种计算机程序产品,包括计算机可读代码,其中,当所述计算机可读代码在设备上运行时,所述设备中的处理器执行配置为实现权利要求1至17任意一项所述视频摘要生成方法的指令。
PCT/CN2019/088020 2018-10-19 2019-05-22 视频摘要生成方法和装置、电子设备、计算机存储介质 WO2020077999A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
SG11202003999QA SG11202003999QA (en) 2018-10-19 2019-05-22 Video summary generation method and apparatus, electronic device, and computer storage medium
JP2020524009A JP7150840B2 (ja) 2018-10-19 2019-05-22 ビデオ要約生成方法及び装置、電子機器並びにコンピュータ記憶媒体
US16/884,177 US20200285859A1 (en) 2018-10-19 2020-05-27 Video summary generation method and apparatus, electronic device, and computer storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811224169.XA CN109413510B (zh) 2018-10-19 2018-10-19 视频摘要生成方法和装置、电子设备、计算机存储介质
CN201811224169.X 2018-10-19

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/884,177 Continuation US20200285859A1 (en) 2018-10-19 2020-05-27 Video summary generation method and apparatus, electronic device, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2020077999A1 true WO2020077999A1 (zh) 2020-04-23

Family

ID=65468671

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/088020 WO2020077999A1 (zh) 2018-10-19 2019-05-22 视频摘要生成方法和装置、电子设备、计算机存储介质

Country Status (6)

Country Link
US (1) US20200285859A1 (zh)
JP (1) JP7150840B2 (zh)
CN (1) CN109413510B (zh)
SG (1) SG11202003999QA (zh)
TW (1) TWI711305B (zh)
WO (1) WO2020077999A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113556577A (zh) * 2021-07-21 2021-10-26 北京字节跳动网络技术有限公司 一种视频生成方法及装置

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109413510B (zh) * 2018-10-19 2021-05-18 深圳市商汤科技有限公司 视频摘要生成方法和装置、电子设备、计算机存储介质
CN110381392B (zh) * 2019-06-06 2021-08-10 五邑大学 一种视频摘要提取方法及其系统、装置、存储介质
CN110933519A (zh) * 2019-11-05 2020-03-27 合肥工业大学 一种基于多路特征的记忆网络视频摘要方法
CN111641868A (zh) * 2020-05-27 2020-09-08 维沃移动通信有限公司 预览视频生成方法、装置及电子设备
CN112532897B (zh) * 2020-11-25 2022-07-01 腾讯科技(深圳)有限公司 视频剪辑方法、装置、设备及计算机可读存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120293686A1 (en) * 2011-05-18 2012-11-22 Keith Stoll Karn Video summary including a feature of interest
CN102906745A (zh) * 2010-05-25 2013-01-30 伊斯曼柯达公司 使用选择准则确定关键视频片段以形成视频概要
CN106612468A (zh) * 2015-10-21 2017-05-03 上海文广互动电视有限公司 视频摘要自动生成系统及方法
CN107222795A (zh) * 2017-06-23 2017-09-29 南京理工大学 一种多特征融合的视频摘要生成方法
CN107590442A (zh) * 2017-08-22 2018-01-16 华中科技大学 一种基于卷积神经网络的视频语义场景分割方法
CN108073902A (zh) * 2017-12-19 2018-05-25 深圳先进技术研究院 基于深度学习的视频总结方法、装置及终端设备
CN109413510A (zh) * 2018-10-19 2019-03-01 深圳市商汤科技有限公司 视频摘要生成方法和装置、电子设备、计算机存储介质

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5758257A (en) * 1994-11-29 1998-05-26 Herz; Frederick System and method for scheduling broadcast of and access to video programs and other data using customer profiles
CN101778257B (zh) * 2010-03-05 2011-10-26 北京邮电大学 用于数字视频点播中的视频摘要片断的生成方法
US10387729B2 (en) * 2013-07-09 2019-08-20 Outward, Inc. Tagging virtualized content
US10386440B2 (en) * 2014-07-03 2019-08-20 Koninklijke Philips N.V. Multi-shot magnetic-resonance (MR) imaging system and method of operation thereof
US9436876B1 (en) * 2014-12-19 2016-09-06 Amazon Technologies, Inc. Video segmentation techniques
CN105228033B (zh) * 2015-08-27 2018-11-09 联想(北京)有限公司 一种视频处理方法及电子设备
US9807473B2 (en) * 2015-11-20 2017-10-31 Microsoft Technology Licensing, Llc Jointly modeling embedding and translation to bridge video and language
CN106851437A (zh) * 2017-01-17 2017-06-13 南通同洲电子有限责任公司 一种提取视频摘要的方法
US10592751B2 (en) * 2017-02-03 2020-03-17 Fuji Xerox Co., Ltd. Method and system to generate targeted captions and summarize long, continuous media files
CN106888407B (zh) * 2017-03-28 2019-04-02 腾讯科技(深圳)有限公司 一种视频摘要生成方法及装置
CN107484017B (zh) * 2017-07-25 2020-05-26 天津大学 基于注意力模型的有监督视频摘要生成方法
CN108024158A (zh) * 2017-11-30 2018-05-11 天津大学 利用视觉注意力机制的有监督视频摘要提取方法

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102906745A (zh) * 2010-05-25 2013-01-30 伊斯曼柯达公司 使用选择准则确定关键视频片段以形成视频概要
US20120293686A1 (en) * 2011-05-18 2012-11-22 Keith Stoll Karn Video summary including a feature of interest
CN106612468A (zh) * 2015-10-21 2017-05-03 上海文广互动电视有限公司 视频摘要自动生成系统及方法
CN107222795A (zh) * 2017-06-23 2017-09-29 南京理工大学 一种多特征融合的视频摘要生成方法
CN107590442A (zh) * 2017-08-22 2018-01-16 华中科技大学 一种基于卷积神经网络的视频语义场景分割方法
CN108073902A (zh) * 2017-12-19 2018-05-25 深圳先进技术研究院 基于深度学习的视频总结方法、装置及终端设备
CN109413510A (zh) * 2018-10-19 2019-03-01 深圳市商汤科技有限公司 视频摘要生成方法和装置、电子设备、计算机存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113556577A (zh) * 2021-07-21 2021-10-26 北京字节跳动网络技术有限公司 一种视频生成方法及装置
CN113556577B (zh) * 2021-07-21 2022-09-09 北京字节跳动网络技术有限公司 一种视频生成方法及装置

Also Published As

Publication number Publication date
SG11202003999QA (en) 2020-05-28
CN109413510A (zh) 2019-03-01
US20200285859A1 (en) 2020-09-10
TWI711305B (zh) 2020-11-21
JP2021503123A (ja) 2021-02-04
TW202032999A (zh) 2020-09-01
JP7150840B2 (ja) 2022-10-11
CN109413510B (zh) 2021-05-18

Similar Documents

Publication Publication Date Title
WO2020077999A1 (zh) 视频摘要生成方法和装置、电子设备、计算机存储介质
Zhong et al. Ghostvlad for set-based face recognition
WO2022111506A1 (zh) 视频动作识别方法、装置、电子设备和存储介质
Weinzaepfel et al. Mimetics: Towards understanding human actions out of context
WO2020228525A1 (zh) 地点识别及其模型训练的方法和装置以及电子设备
US8750602B2 (en) Method and system for personalized advertisement push based on user interest learning
WO2020177673A1 (zh) 一种视频序列选择的方法、计算机设备及存储介质
US11270124B1 (en) Temporal bottleneck attention architecture for video action recognition
Zhang et al. Feature aggregation with reinforcement learning for video-based person re-identification
Kucer et al. Leveraging expert feature knowledge for predicting image aesthetics
Dhall et al. Finding happiest moments in a social context
WO2018196718A1 (zh) 图像消歧方法、装置、存储介质和电子设备
Zhang et al. Deep metric learning with improved triplet loss for face clustering in videos
Huang et al. Benchmarking still-to-video face recognition via partial and local linear discriminant analysis on COX-S2V dataset
CN111209897A (zh) 视频处理的方法、装置和存储介质
CN111553838A (zh) 模型参数的更新方法、装置、设备及存储介质
Zhang et al. Contrastive positive mining for unsupervised 3d action representation learning
WO2023109361A1 (zh) 用于视频处理的方法、系统、设备、介质和产品
CN107220597B (zh) 一种基于局部特征和词袋模型人体动作识别过程的关键帧选取方法
Hou et al. Deep generative image priors for semantic face manipulation
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
CN117315752A (zh) 人脸情绪识别网络模型的训练方法、装置、设备和介质
Zhou et al. Test-time domain generalization for face anti-spoofing
Rao et al. Non-local attentive temporal network for video-based person re-identification
Lee et al. Sequence feature generation with temporal unrolling network for zero-shot action recognition

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2020524009

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19873613

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19/08/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19873613

Country of ref document: EP

Kind code of ref document: A1