WO2020077999A1 - Video summary generation method and apparatus, electronic device, and computer storage medium - Google Patents
Video summary generation method and apparatus, electronic device, and computer storage medium
- Publication number
- WO2020077999A1 (PCT/CN2019/088020)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- lens
- feature
- global
- video
- shot
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V20/47—Detecting features for summarising video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/48—Matching video sequences
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- the present application relates to, but is not limited to, computer vision technology, and in particular to a video summary generation method and apparatus, an electronic device, and a computer storage medium.
- video summarization is an emerging video understanding technology: it extracts shots from a long video and synthesizes them into a short new video that contains the story line or highlight shots of the original video.
- the embodiments of the present application provide a method and an apparatus for generating a video summary, an electronic device, and a computer storage medium.
- a method for generating a video summary includes: performing feature extraction on the shots in a shot sequence of a video stream to be processed to obtain image features of each shot, where each shot includes at least one frame of video image; acquiring global features of the shots according to the image features of all shots; determining the weight of each shot according to its image features and the global features; and obtaining a video summary of the to-be-processed video stream based on the weights of the shots.
- a feature extraction unit configured to perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image characteristics of each of the shots, and each of the shots includes at least one frame of video image;
- a global feature unit configured to acquire global features of the shots according to the image features of all the shots;
- a weight acquisition unit configured to determine the weight of each shot according to the image features and global features of the shot;
- the summary generating unit is configured to obtain a video summary of the to-be-processed video stream based on the weight of the shot.
- an electronic device including a processor, where the processor includes the video summary generating apparatus according to any one of the above.
- an electronic device includes: a memory for storing executable instructions;
- a processor configured to communicate with the memory to execute the executable instructions to complete the operations of any one of the video summary generation methods described above.
- a computer program product including computer readable code, where, when the computer readable code runs on a device, a processor in the device executes instructions for implementing the video summary generation method described in any one of the above.
- feature extraction is performed on shots in a shot sequence of a video stream to be processed to obtain image features of each shot, where each shot includes at least one frame of video image; global features of the shots are acquired according to the image features of all shots; the weight of each shot is determined according to its image features and the global features; and a video summary of the video stream to be processed is obtained based on the weights of the shots. Determining the weight of each shot by combining image features and global features realizes an understanding of the video from the perspective of the video as a whole.
- the video summary determined based on the shot weights in this embodiment can therefore express the video content as a whole, reducing the problem of one-sided video summaries.
- FIG. 1 is a schematic flowchart of an embodiment of a video summary generation method provided by an embodiment of the present application.
- FIG. 2 is a schematic flowchart of another embodiment of a video summary generation method provided by an embodiment of the present application.
- FIG. 3 is a partial flowchart of an optional example of a video summary generation method provided by an embodiment of the present application.
- FIG. 4 is a partial flowchart of another optional example of a video summary generation method provided by an embodiment of the present application.
- FIG. 5 is a schematic flowchart of still another embodiment of a video summary generation method provided by an embodiment of the present application.
- FIG. 6 is a schematic diagram of some optional examples of a video summary generation method provided by an embodiment of the present application.
- FIG. 7 is a schematic flowchart of another embodiment of a video summary generation method provided by an embodiment of the present application.
- FIG. 8 is a partial flowchart of another optional example of a video summary generation method provided by an embodiment of the present application.
- FIG. 9 is a schematic structural diagram of an embodiment of a video summary generating apparatus provided by an embodiment of the present application.
- FIG. 10 is a schematic structural diagram of an electronic device suitable for implementing a terminal device or a server according to an embodiment of the present application.
- FIG. 1 is a schematic flowchart of an embodiment of a video summary generation method provided by an embodiment of the present application. The method can be performed by any video summary extraction device, such as a terminal device, a server, or a mobile device. As shown in FIG. 1, the method in this embodiment includes:
- Step 110 Perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each shot.
- video summarization means extracting key information or subject information from the original video stream to generate a video summary; the video summary has a smaller data stream than the original video stream while still covering the subject content or key content of the original video stream, and can be used for subsequent retrieval of the original video stream.
- for example, a video summary representing the movement trajectory of the same target in the video stream may be generated; this is only an example, and specific implementations are not limited to it.
- the to-be-processed video stream is a video stream from which a video summary is to be obtained, and the video stream includes at least one frame of video image.
- the embodiments of the present application use the shot as the constituent unit of the video summary, and each shot includes at least one frame of video image.
- the feature extraction in the embodiments of the present application may be implemented based on any feature extraction network; feature extraction is performed for each shot separately based on the feature extraction network to obtain at least two image features. The present application does not limit the specific process of feature extraction.
- Step 120 Obtain the global features of the shots according to the image features of all the shots.
- all image features corresponding to the video stream are processed (for example, by mapping or embedding) to obtain a conversion feature sequence corresponding to the video stream as a whole; the conversion feature sequence is then combined with each image feature in a calculation to obtain a global feature (global attention) corresponding to each shot. The global feature can reflect the association between each shot and the other shots in the video stream.
- the global features include, but are not limited to, image features that characterize the correspondence or positional relationship of the same image element across the multiple video images of a shot. It should be noted that the above-mentioned association relationship is not limited to correspondence and/or positional relationships.
- Step 130 Determine the weight of each shot according to the image features and global features of the shot.
- the weight of a shot is determined jointly by the image features of the shot and its global features.
- the weight obtained is therefore based not only on the shot itself but also on the correlation between the shot and the other shots in the entire video stream, evaluating the importance of the shot from the perspective of the video as a whole.
- Step 140 Obtain a video summary of the video stream to be processed based on the weights of the shots.
- the importance of the shots in the shot sequence is determined by their weights, but the video summary is not determined by shot importance alone: the length of the video summary also needs to be controlled. That is, the video summary must be determined by combining the weight of each shot with its length (number of frames).
- the weight is positively related to the importance of the shot and/or the length of the video summary.
- the knapsack algorithm may be used to determine the video summary; other algorithms may also be used, which are not listed here one by one.
- the video summary generation method performs feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each shot, each shot including at least one frame of video image; obtains the global features of the shots based on the image features of all shots; determines the weight of each shot according to its image features and global features; and obtains the video summary of the video stream to be processed based on the shot weights. Combining image features and global features to determine the weight of each shot makes use of the global association between each shot and the entire video stream; the video summary determined in this embodiment can therefore express the video content as a whole, reducing the problem of one-sided video summaries.
- FIG. 2 is a schematic flowchart of another embodiment of a video summary generation method provided by an embodiment of the present application. As shown in FIG. 2, the method in this embodiment includes:
- Step 210 Perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each shot.
- Step 210 in the embodiment of the present application is similar to step 110 in the above-mentioned embodiment, and the step can be understood by referring to the above-mentioned embodiment, which will not be repeated here.
- Step 220 Process the image features of all the shots based on a memory neural network to obtain the global features of the shots.
- the memory neural network may include at least two embedding matrices; by inputting the image features of all shots of the video stream into the at least two embedding matrices, the global features of each shot are obtained from the output of the embedding matrices.
- the global features of a shot can express the association between the shot and the other shots in the video stream; in terms of the shot's weight, the larger the weight, the greater the correlation between the shot and the other shots, and the more likely the shot is to be included in the video summary.
- Step 230 Determine the weight of each shot according to the image features and global features of the shot.
- Step 230 in the embodiment of the present application is similar to step 130 in the foregoing embodiment, and this step can be understood with reference to the foregoing embodiment; details are not described herein again.
- Step 240 Obtain a video summary of the video stream to be processed based on the weight of the shot.
- Step 240 in the embodiment of the present application is similar to step 140 in the foregoing embodiment, and this step can be understood by referring to the foregoing embodiment, and details are not described herein again.
- the embodiments of the present application imitate, through the memory neural network, the way humans create video summaries: the video is understood from the perspective of the whole video. The memory neural network stores the information of the entire video stream, the global relationship between each shot and the video is used to decide the shot's importance, and the shots are then chosen for the video summary.
- FIG. 3 is a partial flowchart of an optional example of a video summary generation method provided by an embodiment of the present application. As shown in FIG. 3, step 220 in the above embodiment includes:
- Step 310 Map the image features of all shots to a first embedding matrix and a second embedding matrix, respectively, to obtain an input memory and an output memory.
- the input memory and the output memory in this embodiment correspond to all the shots of the video stream, and each embedding matrix corresponds to one memory (the input memory or the output memory).
- Step 320 Obtain the global features of each shot according to the image features of the shot, the input memory, and the output memory.
- the global features reflect the association between the shot and all the shots in the video stream, so that the shot weight obtained based on the global features is related to the video stream as a whole, which in turn leads to a more comprehensive video summary.
- each shot may correspond to at least two global features, and the at least two global features may be obtained through at least two embedding matrix groups.
- the structure of each embedding matrix group is similar to that of the first embedding matrix and the second embedding matrix in the above embodiments; each embedding matrix group includes two embedding matrices, and each memory group includes an input memory and an output memory.
- at least two global features are obtained from the at least two memory groups and the image features of the shot, and the weight of the shot is obtained by combining the multiple global features. The embedding matrices included in each group may be different or the same; when the embedding matrix groups are different, the obtained global features better reflect the association between the shot and the video as a whole.
- FIG. 4 is a partial flowchart of another optional example of a video summary generation method provided by an embodiment of the present application. As shown in FIG. 4, step 320 in the above embodiment includes:
- Step 402 Map the image feature of the shot to a third embedding matrix to obtain the feature vector of the shot. The third embedding matrix can transpose the image feature; that is, the image feature of the shot is transposed to obtain the feature vector of the shot. For example, the image feature u_i corresponding to the i-th shot in the shot sequence is transposed to obtain the feature vector u_i^T.
- Step 404 Perform an inner product operation on the feature vector and the input memory to obtain the weight vector of the shot.
- the input memory corresponds to the shot sequence and therefore includes at least two vectors (their number corresponds to the number of shots).
- when the inner product operation is performed on the feature vector and the input memory, the results of the inner products between the feature vector and the multiple vectors in the input memory are mapped to the (0, 1) interval, obtaining multiple values expressed in probability form.
- the values expressed in probability form are used as the weight vector of the shot; for example, the weight vector is obtained by formula (1), a reconstruction of which follows the symbol definitions below:
- u_i represents the image feature of the i-th shot, that is, the image feature corresponding to the shot whose weight is currently being calculated;
- a represents the input memory;
- p_i represents the weight vector of the correlation between the i-th image feature and the input memory;
- the Softmax activation function, used in multi-class classification, maps the outputs of multiple neurons into the (0, 1) interval, and the result can be understood as a probability; here i ranges over the number of shots in the shot sequence. Through formula (1), a weight vector expressing the correlation between the i-th image feature and the shot sequence can be obtained.
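A plausible reconstruction of formula (1) from the definitions above, in the standard memory-network form (the typeset equation itself is not reproduced in this text):

```latex
p_i = \operatorname{Softmax}\left( u_i^{\top} a \right) \tag{1}
```

Here the Softmax runs over the inner products between the feature vector u_i^T and the vectors of the input memory a, so p_i has one entry per shot.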
- Step 406 Perform a weighted superposition operation on the weight vector and the output memory to obtain a global vector, and use the global vector as the global feature.
- the global vector is obtained by formula (2), reconstructed below:
- b represents the output memory obtained based on the second embedding matrix;
- o_i represents the global vector obtained from the i-th image feature and the output memory.
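A matching reconstruction of formula (2), reading the weighted superposition as a sum over the output-memory vectors b_j weighted by the entries p_{i,j} of the weight vector:

```latex
o_i = \sum_{j} p_{i,j} \, b_j \tag{2}
```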
- the inner product operation between the image feature and the input memory yields the correlation between the image feature and each shot; the image feature may first be transposed to ensure that it can form inner products with the vectors in the input memory.
- the weight vector obtained at this point contains multiple probability values, each indicating the correlation between the shot and one shot in the shot sequence; the greater the probability value, the stronger the correlation. Each probability value then weights the corresponding vector in the output memory, and the weighted superposition yields the global vector of the shot, which serves as its global feature.
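To make steps 402-406 concrete, the following is a minimal NumPy sketch of the single-memory-group computation. The names W_a, W_b, W_u for the first, second, and third embedding matrices and the square matrix shapes are illustrative assumptions; the patent does not fix dimensions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_feature(U, u_i, W_a, W_b, W_u):
    """Global vector of one shot, single memory group (steps 402-406).

    U   : (n, d) image features of all n shots in the shot sequence
    u_i : (d,)   image feature of the shot being scored
    W_a : (d, d) first embedding matrix  -> input memory a
    W_b : (d, d) second embedding matrix -> output memory b
    W_u : (d, d) third embedding matrix  -> feature vector of the shot
    """
    a = U @ W_a          # input memory: one vector per shot
    b = U @ W_b          # output memory: one vector per shot
    u = W_u @ u_i        # feature vector (transposed query) of the shot
    p = softmax(a @ u)   # formula (1): weight vector over all shots
    return p @ b         # formula (2): weighted superposition -> global vector
```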
- obtaining at least two global features of the shot according to at least two memory groups includes the following. The at least two weight vectors in this embodiment are obtained by adapting the above formula (1) into formula (5):
- u_i represents the image feature of the i-th shot, that is, the image feature corresponding to the shot whose weight is currently being calculated; u_i^T represents the feature vector of the i-th shot; a^k represents the input memory in the k-th memory group; p_i^k represents the weight vector of the correlation between the i-th image feature and the input memory in the k-th memory group; the Softmax activation function, used in multi-class classification, maps the outputs of multiple neurons into the (0, 1) interval and can be understood as a probability; k takes values from 1 to N. At least two weight vectors expressing the correlation between the i-th image feature and the shot sequence can be obtained by formula (5).
- at least two global vectors in this embodiment are obtained by adapting the above formula (2) into formula (6):
- b^k represents the output memory of the k-th memory group, and o_i^k represents the global vector obtained from the i-th image feature and the output memory of the k-th memory group; at least two global vectors of the shot can be obtained based on formula (6). Reconstructions of formulas (5) and (6) follow.
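Reconstructions of formulas (5) and (6) from the definitions above, i.e. the per-memory-group versions of formulas (1) and (2):

```latex
p_i^{k} = \operatorname{Softmax}\left( u_i^{\top} a^{k} \right) \tag{5}
```

```latex
o_i^{k} = \sum_{j} p_{i,j}^{k} \, b_j^{k} \tag{6}
```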
- FIG. 5 is a schematic flowchart of still another embodiment of a video summary generation method provided by an embodiment of the present application. As shown in FIG. 5, the method includes:
- Step 510 Perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each shot.
- step 510 is similar to step 110 in the foregoing embodiment, and the step can be understood by referring to the foregoing embodiment, and details are not described herein again.
- Step 520 Obtain the global features of the shots according to the image features of all the shots.
- step 520 is similar to step 120 in the foregoing embodiment, and this step can be understood by referring to any of the foregoing embodiments, and details are not described herein again.
- Step 530 Perform an inner product operation on the image features of the shot and the global features of the shot to obtain the weight feature.
- the weight feature obtained in this way not only reflects the importance of the shot within the overall video but also depends on the information of the shot itself.
- the weight feature can be obtained by formula (3), reconstructed below:
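A hedged reconstruction of formula (3): because the result is subsequently passed through a fully connected layer to produce a scalar, the "inner product" here is read as an element-wise (Hadamard) product that keeps a vector-valued weight feature h_i; this reading is an assumption, not confirmed by the text:

```latex
h_i = u_i \odot o_i \tag{3}
```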
- Step 540 Pass the weight feature through a fully connected neural network to obtain the weight of the shot.
- the weight is used to reflect the importance of the shot and therefore needs to be expressed in numerical form.
- the dimension of the weight feature is transformed through a fully connected neural network to obtain the weight of the shot expressed as a one-dimensional vector.
- the weight of the shot can be obtained based on formula (4), reconstructed below:
- s_i represents the weight of the i-th shot;
- W_D and b_D represent the weight and the offset of the fully connected network through which the weight feature passes.
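A reconstruction of formula (4) from these definitions, with h_i the weight feature of formula (3):

```latex
s_i = W_D \, h_i + b_D \tag{4}
```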
- Step 550 Obtain a video summary of the video stream to be processed based on the weight of the shot.
- this embodiment combines the image features of the shot and the global features of the shot to determine the shot's weight. While embodying the shot's own information, it also incorporates the association between the shot and the entire video, realizing an understanding of the video from both the local and the overall perspective and making the obtained video summary better match human habits.
- determining the weight of the shot according to the image features and global features of the shot includes: performing an inner product operation on the image features of the shot and a first global feature among the at least two global features of the shot to obtain a first weight feature;
- using the first weight feature as the image features and a second global feature among the at least two global features of the shot as the first global feature, where the second global feature is a global feature other than the first global feature among the at least two global features, and repeating the inner product operation;
- until the at least two global features of the shot no longer include a second global feature, using the first weight feature as the weight feature of the shot;
- passing the weight feature through a fully connected neural network to obtain the weight of the shot. A sketch of this iteration follows.
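Read procedurally, the iteration above is a multi-hop loop: each memory group supplies one hop, and the weight feature of one hop becomes the query of the next. A hedged sketch, reusing global_feature from the earlier sketch and again assuming an element-wise product for the inner product operation:

```python
def shot_weight(U, u_i, groups, W_u, W_D, b_D):
    """Weight s_i of one shot using N memory groups (one hop per group).

    groups   : list of (W_a, W_b) pairs, the k-th embedding matrix group
    W_D, b_D : weight (d,) and offset of the fully connected layer
    """
    q = u_i
    for W_a, W_b in groups:                      # hop k uses memory group k
        o = global_feature(U, q, W_a, W_b, W_u)  # formulas (5) and (6)
        q = q * o                                # formula (3), assumed element-wise
    return float(W_D @ q + b_D)                  # formula (4): scalar weight
```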
- FIG. 6 is a schematic diagram of some optional examples of a video summary generation method provided by an embodiment of the present application.
- this example includes multiple memory groups, where the number of memory groups is n; the shots obtained by segmenting the video stream provide the image features, and through calculation with the above formulas (5), (6), (7), and (4), the weight s_i of the i-th shot can be obtained.
- Step 710 Perform shot segmentation on the video stream to be processed to obtain a shot sequence.
- shot segmentation is performed based on the similarity between at least two frames of video images in the video stream to be processed to obtain a shot sequence.
- the similarity between two frames of video images may be determined by the distance (such as the Euclidean distance or cosine distance) between the features corresponding to the two frames: the higher the similarity, the greater the possibility that the two video images belong to the same shot.
- the similarity between video images can thus be used to divide clearly different video images into different shots, achieving accurate shot segmentation.
- Step 720 Perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each shot.
- step 720 is similar to step 110 in the foregoing embodiment, and this step can be understood by referring to any of the foregoing embodiments, and details are not described herein again.
- Step 730 Obtain the global features of the shots according to the image features of all the shots.
- Step 730 in the embodiment of the present application is similar to step 120 in the foregoing embodiment, and this step can be understood by referring to any of the foregoing embodiments, and details are not described herein again.
- Step 740 Determine the weight of each shot according to the image features and global features of the shot.
- step 740 is similar to step 130 in the foregoing embodiment, and this step can be understood by referring to any of the foregoing embodiments, and details are not described herein again.
- Step 750 Obtain a video summary of the video stream to be processed based on the weight of the shot.
- step 750 is similar to step 140 in the foregoing embodiment, and this step can be understood by referring to any of the foregoing embodiments, and details are not described herein again.
- the shot is used as the unit for extracting the summary.
- the shots may be segmented by a neural network, based on the known camera shots, or by human judgment; this embodiment does not limit the specific means of shot segmentation.
- FIG. 8 is a partial flowchart of another optional example of a video summary generation method provided by an embodiment of the present application. As shown in FIG. 8, step 710 in the above embodiment includes:
- Step 802 Segment the video images in the video stream based on at least two segmentation intervals of different sizes to obtain at least two video segment groups.
- each video segment group includes at least two video segments, and each segmentation interval is greater than or equal to 1 frame.
- the video stream is segmented with multiple segmentation intervals of different sizes, for example intervals of 1 frame, 4 frames, 6 frames, and 8 frames; one segmentation interval divides the video stream into multiple video segments of a fixed size (e.g., 6 frames).
- Step 804 Determine whether the segmentation is correct based on the similarity between at least two disconnected frames in each video segment group, where a disconnected frame is the first frame in a video segment; optionally, in response to the similarity between at least two disconnected frames being less than or equal to a set value, it is determined that the segmentation is correct.
- the association between two frames of video images may be determined based on the similarity between their features: the greater the similarity, the greater the likelihood that they belong to the same shot.
- the embodiments of the present application mainly use scene changes as the basis of shot segmentation; that is, even for video segments shot within the same long take, when the correlation between a certain frame and the first frame of the long take is less than or equal to the set value, the shot is likewise segmented there.
- Step 806 In response to the segmentation being correct, determine the video segment as a shot to obtain the shot sequence.
- the video stream is segmented with multiple segmentation intervals of different sizes, and the similarity between the disconnected frames of two consecutive video segments is then judged to determine whether the segmentation at that position is correct.
- if the similarity between the disconnected frames exceeds the set value, the segmentation at that position is incorrect, that is, the two video segments belong to one shot; the shot sequence is obtained from the correct segmentations.
- when a disconnected frame corresponds to at least two segmentation intervals, the video segments obtained with the smaller segmentation interval are used as the shots to obtain the shot sequence.
- for example, when a disconnected frame lies at the cut position of at least two segmentation intervals: for a video stream of 8 frames, with 2 frames and 4 frames as the first and second segmentation intervals respectively, the first interval yields 4 video segments whose disconnected frames are frames 1, 3, 5, and 7, and the second interval yields 2 video segments whose disconnected frames are frames 1 and 5. If the segmentations corresponding to the disconnected frames at frame 5 and frame 7 are determined to be correct, then since frame 5 is a disconnected frame of both the first and the second segmentation interval, the first (smaller) interval prevails: the video stream is divided into three shots, namely frames 1-4 as one shot, frames 5-6 as one shot, and frames 7-8 as one shot, instead of taking frames 5-8 as one shot according to the second segmentation interval.
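A compact sketch of this multi-interval procedure. The per-frame cosine-similarity measure and the threshold value are assumptions; the patent only requires some frame similarity and a set value.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))

def shot_boundaries(frame_feats, intervals=(2, 4), threshold=0.9):
    """Return confirmed shot-boundary frame indices (0-based first frames).

    Each interval slices the stream into fixed-size segments; a candidate
    cut is kept only if its disconnected frame is dissimilar enough from
    the previous disconnected frame of the same interval. Taking the union
    of confirmed cuts keeps the finer (smaller-interval) segmentation
    wherever two intervals share a disconnected frame.
    """
    n = len(frame_feats)
    confirmed = {0}
    for step in sorted(intervals):
        prev = 0
        for t in range(step, n, step):
            if cosine(frame_feats[prev], frame_feats[t]) <= threshold:
                confirmed.add(t)   # dissimilar enough: the cut is correct
            prev = t
    return sorted(confirmed)
```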
- feature extraction is performed separately on each frame of video image in the shot through a feature extraction network.
- when the shot includes one frame of video image, the extracted image feature is used as the image feature of the shot;
- when the shot includes multiple frames, the average of the multiple image features is calculated, and the average feature is used as the image feature of the shot.
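As a sketch, the shot-level image feature is simply the mean of the per-frame features; a one-frame shot reduces to its own feature:

```python
import numpy as np

def shot_feature(frame_features):
    """Image feature of a shot: the average of its per-frame features.

    frame_features : (num_frames, d) array of per-frame image features.
    """
    return np.asarray(frame_features).mean(axis=0)
```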
- step 140 includes:
- a video summary, also known as video condensation, is a brief summary of the video content: it expresses the main content of the video in a relatively short time. While expressing the main content of the video, the duration of the video summary must be limited, otherwise the "brief" function is not achieved and watching it is no different from watching the full video.
- the embodiment of the present application restricts the duration of the video summary through a limited duration: the duration of the video summary to be obtained is less than or equal to the limited duration, and the specific value of the limited duration may be set according to the actual situation.
- the embodiment of the present application uses the 0/1 knapsack algorithm to extract the video summary.
- applied to this embodiment, the 0/1 knapsack problem can be described as follows:
- the shot sequence includes multiple shots, each shot has a corresponding (usually different) length, and each shot has a corresponding (usually different) weight; a video summary of a limited duration is required, and the question is how to maximize the sum of the weights of the shots in the video summary within the limited duration. The embodiment of the present application can therefore obtain the video summary with the best content through the knapsack algorithm.
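The selection step is then a standard 0/1 knapsack: shot lengths (in frames) are the item sizes, shot weights are the item values, and the limited duration is the capacity. A dynamic-programming sketch:

```python
def select_shots(lengths, weights, max_frames):
    """0/1 knapsack over shots.

    lengths    : frame count of each shot (item size)
    weights    : predicted importance of each shot (item value)
    max_frames : limited duration of the summary, in frames (capacity)
    Returns indices of the chosen shots in temporal order.
    """
    # best[c] = (total weight, chosen shot indices) within capacity c
    best = [(0.0, [])] * (max_frames + 1)
    for i in range(len(lengths)):
        for c in range(max_frames, lengths[i] - 1, -1):  # reverse: each shot used once
            cand = best[c - lengths[i]][0] + weights[i]
            if cand > best[c][0]:
                best[c] = (cand, best[c - lengths[i]][1] + [i])
    return sorted(best[max_frames][1])
```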
- before performing step 110, the method further includes:
- the feature extraction network and the memory neural network are jointly trained based on the sample video stream.
- the sample video stream includes at least two sample shots, and each sample shot includes a label weight.
- to obtain more accurate weights, the feature extraction network and the memory neural network need to be trained before the weights are obtained. Training the feature extraction network and the memory neural network separately can also achieve the purpose of the embodiments of the present application, but the parameters obtained by jointly training the feature extraction network and the memory neural network are better suited to the embodiments of the present application and can provide more accurate predicted weights. The training process assumes that the sample video stream has already been divided into at least two sample shots; the segmentation may be based on a trained segmentation neural network or on other segmentation methods, which the embodiments of the present application do not limit.
- the process of joint training may include:
- the predicted weights of the sample shots are obtained by processing the sample video stream with the feature extraction network and the memory neural network; the loss is determined based on the predicted weights and the labeled weights, and the parameters of the feature extraction network and the memory neural network are adjusted based on the loss.
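A hedged PyTorch-style sketch of one joint training step: the feature extraction network and the memory network are optimized together against the labeled shot weights. The mean-squared-error loss is an assumption; the text only states that a loss is computed from the predicted and labeled weights.

```python
import torch
import torch.nn.functional as F

def joint_train_step(feature_net, memory_net, sample_shots, label_weights, optimizer):
    """One joint training step on a segmented sample video stream.

    sample_shots  : list of per-shot frame tensors, each (frames, C, H, W)
    label_weights : (n,) tensor of annotated shot weights
    """
    # shot feature = mean of per-frame features, as in the method above
    feats = torch.stack([feature_net(s).mean(dim=0) for s in sample_shots])
    pred = memory_net(feats)                  # predicted weights, shape (n,)
    loss = F.mse_loss(pred, label_weights)    # assumed loss form
    optimizer.zero_grad()
    loss.backward()                           # gradients reach both networks
    optimizer.step()
    return loss.item()
```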
- FIG. 9 is a schematic structural diagram of an embodiment of an apparatus for generating a video summary provided by an embodiment of the present application.
- the apparatus of this embodiment may be used to implement the above method embodiments of the present application.
- the device of this embodiment includes:
- the feature extraction unit 91 is configured to perform feature extraction on the shots in the shot sequence of the video stream to be processed to obtain the image features of each shot.
- the to-be-processed video stream is a video stream from which a video summary is to be obtained, and the video stream includes at least one frame of video image.
- the embodiments of the present application use the shot as the constituent unit of the video summary, and each shot includes at least one frame of video image.
- the feature extraction in the embodiments of the present application may be implemented based on any feature extraction network; feature extraction is performed for each shot separately based on the feature extraction network to obtain at least two image features, and the present application does not limit the specific feature extraction process.
- the global feature unit 92 is configured to acquire the global features of the shots according to the image features of all the shots.
- all image features corresponding to the video stream are processed (for example, by mapping or embedding) to obtain a conversion feature sequence corresponding to the video stream as a whole; the conversion feature sequence is then combined with each image feature in a calculation to obtain a global feature (global attention) corresponding to each shot, which can reflect the association between each shot and the other shots in the video stream.
- the weight acquisition unit 93 is configured to determine the weight of each shot according to the image features of the shot and its global features.
- the weight obtained is therefore based not only on the shot itself but also on the correlation between the shot and the other shots in the entire video stream, evaluating the importance of the shot from the perspective of the video as a whole.
- the summary generating unit 94 is configured to obtain a video summary of the to-be-processed video stream based on the weight of the shot.
- the embodiments of the present application reflect the importance of each shot through its weight and can identify the more important shots in the shot sequence, but the video summary is not determined by shot importance alone: the length of the video summary also needs to be controlled. That is, the video summary must be determined in combination with the weight and duration (number of frames) of the shots.
- a knapsack algorithm can be used to obtain the video summary.
- the video summary generating device provided in the above embodiment combines the image features and global features to determine the weight of each shot, and realizes the understanding of the video from the perspective of the entire video.
- the global association relationship between each shot and the entire video stream is used.
- the video summary determined in the embodiment can express the video content as a whole, avoiding the problem of one-sidedness of the video summary.
- the global feature unit 92 is configured to process the image features of all shots based on a memory neural network to obtain the global features of the shots.
- the memory neural network may include at least two embedding matrices; by inputting the image features of all shots of the video stream into the at least two embedding matrices, the global features of each shot are obtained from the output of the embedding matrices.
- the global features of a shot can express the association between the shot and the other shots in the video stream: the larger the shot's weight, the greater its correlation with the other shots, and the more likely it is to be included in the video summary.
- the global feature unit 92 is configured to map the image features of all shots to a first embedding matrix and a second embedding matrix, respectively, to obtain an input memory and an output memory, and to acquire the global features of each shot according to the image features of the shot, the input memory, and the output memory.
- when acquiring the global features of a shot according to its image features, the input memory, and the output memory, the global feature unit 92 is configured to map the image features of the shot to a third embedding matrix to obtain the feature vector of the shot; perform an inner product operation on the feature vector and the input memory to obtain the weight vector of the shot; and perform a weighted superposition operation on the weight vector and the output memory to obtain a global vector, the global vector serving as the global feature.
- the weight acquisition unit 93 is configured to perform an inner product operation on the image features of the shot and the global features of the shot to obtain the weight feature, and to pass the weight feature through a fully connected neural network to obtain the weight of the shot.
- this embodiment combines the image features of the shot and its global features to determine the shot's weight: while embodying the shot's own information, it also incorporates the association between the shot and the entire video, realizing an understanding of the video from both the local and the overall perspective and making the obtained video summary better match human habits.
- the global feature unit 92 is configured to process the image features of the shot based on the memory neural network to obtain at least two global features of the shot.
- the at least two global features are obtained through at least two memory groups, and the weight of the shot is obtained by combining the multiple global features; the embedding matrices included in each group may be different or the same, and when the embedding matrix groups are different, the obtained global features better reflect the association between the shot and the video as a whole.
- the global feature unit 92 is configured to map the image features of the shot to at least two embedding matrix groups to obtain at least two memory groups, each embedding matrix group including two embedding matrices and each memory group including an input memory and an output memory, and to acquire the at least two global features of the shot according to the at least two memory groups and the image features of the shot.
- when acquiring at least two global features of the shot based on the at least two memory groups and the image features of the shot, the global feature unit 92 is configured to map the image features of the shot to a third embedding matrix to obtain the feature vector of the shot; perform inner product operations on the feature vector and the at least two input memories to obtain at least two weight vectors of the shot; and perform weighted superposition operations on the weight vectors and the at least two output memories to obtain at least two global vectors, the at least two global vectors serving as the at least two global features.
- the weight acquisition unit 93 is configured to perform an inner product operation on the image features of the shot and a first global feature among the at least two global features of the shot to obtain a first weight feature; use the first weight feature as the image features and a second global feature among the at least two global features of the shot as the first global feature, the second global feature being a global feature other than the first global feature among the at least two global features; repeat the inner product operation of the image features of the shot and the first global feature to obtain a first weight feature, until the at least two global features of the shot no longer include a second global feature; use the first weight feature as the weight feature of the shot;
- and pass the weight feature through a fully connected neural network to obtain the weight of the shot.
- the device further includes:
- shot segmentation is performed based on the similarity between at least two frames of video images in the video stream to be processed to obtain the shot sequence.
- the similarity between two frames of video images may be determined by the distance (such as the Euclidean distance or cosine distance) between the features corresponding to the two frames: the higher the similarity, the greater the possibility that the two video images belong to the same shot.
- the similarity between video images can thus be used to divide clearly different video images into different shots, achieving accurate shot segmentation.
- the shot segmentation unit is configured to perform shot segmentation based on the similarity between at least two frames of video images in the video stream to be processed to obtain a shot sequence.
- the shot segmentation unit is configured to segment the video images in the video stream based on at least two segmentation intervals of different sizes to obtain at least two video segment groups, each video segment group including at least two video segments,
- where each segmentation interval is greater than or equal to 1 frame; determine whether the segmentation is correct based on the similarity between at least two disconnected frames in each video segment group, a disconnected frame being the first frame in a video segment; and, in response to the segmentation being correct, determine the video segments as shots to obtain the shot sequence.
- when determining whether the segmentation is correct based on the similarity between at least two disconnected frames in each video segment group, the shot segmentation unit is configured to determine that the segmentation is correct in response to the similarity between at least two disconnected frames being less than or equal to a set value, and to determine that the segmentation is incorrect in response to the similarity between at least two disconnected frames being greater than the set value.
- when determining the video segments as shots in response to the segmentation being correct to obtain the shot sequence, the shot segmentation unit is configured to, in response to a disconnected frame corresponding to at least two segmentation intervals, use the video segments obtained with the smaller segmentation interval as the shots to obtain the shot sequence.
- the feature extraction unit 91 is configured to perform feature extraction on at least one frame of video image in the shot to obtain at least one image feature, to obtain the average feature of all the image features, and to use the average feature as the image feature of the shot.
- feature extraction is performed separately on each frame of video image in the shot through a feature extraction network.
- when the shot includes one frame of video image, the extracted image feature is used as the image feature of the shot;
- when the shot includes multiple frames, the average of the multiple image features is calculated, and the average feature is used as the image feature of the shot.
- the summary generating unit is configured to obtain a limited duration of the video summary; according to the weight of the shot and the limited duration of the video summary, the video summary of the video stream to be processed is obtained.
- a video summary, also known as video condensation, is a brief summary of the video content: it expresses the main content of the video in a relatively short time. While expressing the main content of the video, the duration of the video summary must be limited, otherwise the "brief" function is not achieved, which is no different from watching the full video.
- the embodiment of the present application restricts the duration of the video summary through a limited duration, that is, the duration of the video summary to be obtained is less than or equal to the limited duration;
- the specific value of the limited duration can be set according to the actual situation.
- the device of the embodiments of the present application further includes:
- the joint training unit is configured to perform joint training on the feature extraction network and the memory neural network based on the sample video stream.
- the sample video stream includes at least two sample shots, and each sample shot includes a labeling weight.
- to obtain more accurate weights, the feature extraction network and the memory neural network need to be trained before the weights are obtained. Training the feature extraction network and the memory neural network separately can also achieve the purpose of the embodiments of the present application, but the parameters obtained by jointly training the feature extraction network and the memory neural network are better suited to the embodiments of the present application and can provide more accurate predicted weights. The training process assumes that the sample video stream has already been divided into at least two sample shots; the segmentation may be based on a trained segmentation neural network or on other segmentation methods, which the embodiments of the present application do not limit.
- an electronic device is provided, which includes a processor, and the processor includes the video summary generating apparatus provided in any one of the foregoing embodiments.
- an electronic device including: a memory configured to store executable instructions;
- and a processor configured to communicate with the memory to execute the executable instructions to complete the operations of the video summary generation method provided by any one of the foregoing embodiments.
- a computer storage medium configured to store computer-readable instructions, where, when the instructions are executed, the operations of the video summary generation method provided in any one of the foregoing embodiments are performed.
- a computer program product including computer readable code, where, when the computer readable code runs on a device, a processor in the device executes instructions for implementing the video summary generation method provided in any one of the above embodiments.
- referring to FIG. 10, the electronic device includes a central processing unit (CPU) 1001 and may further include special-purpose processors serving as an acceleration unit 1013, which may include, but is not limited to, a graphics processor (GPU), an FPGA, a DSP, and other dedicated processors such as ASIC chips.
- the processor can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 1002 or executable instructions loaded from a storage section 1008 into a random access memory (RAM) 1003.
- the communication unit 1012 may include, but is not limited to, a network card, and the network card may include, but is not limited to, an IB (Infiniband) network card.
- the processor may communicate with the read-only memory 1002 and/or the random access memory 1003 to execute executable instructions, connect to the communication unit 1012 through a bus 1004, and communicate with other target devices via the communication unit 1012, thereby completing the operation corresponding to any method provided by the embodiments of the present application, for example: performing feature extraction on the shots in the shot sequence of the video stream to be processed to obtain the image features of each shot, each shot including at least one frame of video image; obtaining the global features of the shots according to the image features of all shots; determining the weight of each shot according to the image features and global features of the shot; and obtaining the video summary of the video stream to be processed based on the weights of the shots.
- the RAM 1003 can also store various programs and data necessary for the operation of the device.
- the CPU 1001, the ROM 1002, and the RAM 1003 are connected to each other via the bus 1004.
- the ROM 1002 is an optional module.
- the RAM 1003 stores executable instructions, or executable instructions are written into the ROM 1002 at runtime, and the executable instructions cause the central processing unit 1001 to perform the operations corresponding to the above method.
- An input / output (I / O) interface 1005 is also connected to the bus 1004.
- the communication unit 1012 may be provided in an integrated manner, or may be provided with multiple sub-modules (for example, multiple IB network cards) that are on the bus link.
- the following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card or a modem. The communication section 1009 performs communication processing via a network such as the Internet.
- the driver 1010 is also connected to the I / O interface 1005 as needed.
- a removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed on the drive 1010 as necessary, so that the computer program read out therefrom is installed into the storage section 1008 as necessary.
- FIG. 10 is only an optional implementation.
- the number and types of the components in FIG. 10 can be selected, deleted, added, or replaced according to actual needs; for different functional components, separate or integrated arrangements can also be adopted.
- for example, the acceleration unit 1013 and the CPU 1001 can be provided separately, or the acceleration unit 1013 can be integrated on the CPU 1001; likewise, the communication unit may be provided separately, or integrated on the CPU 1001 or on the acceleration unit 1013, and so on.
- embodiments of the present application include a computer program product, which includes a computer program tangibly contained on a machine-readable medium; the computer program contains program code for performing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided in the embodiments of the present application, for example: performing feature extraction on the shots in the shot sequence of the video stream to be processed to obtain image features of each shot, each shot including at least one frame of video image; obtaining the global features of the shots according to the image features of all shots; determining the weight of each shot according to the image features and global features of the shot; and obtaining the video summary of the video stream to be processed based on the weights of the shots.
- The computer program may be downloaded and installed from a network through the communication section 1009, and/or installed from the removable medium 1011.
- When the computer program is executed by the central processing unit (CPU) 1001, the above-mentioned functions defined in the method of the present application are performed.
- The method and apparatus of the present application may be implemented in many ways.
- For example, the method and apparatus of the present application may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware.
- The above order of the steps of the method is for illustration only, and the steps of the method of the present application are not limited to the order specifically described above unless otherwise specifically stated.
- The present application may also be implemented as programs recorded in a recording medium, and these programs include machine-readable instructions for implementing the method according to the present application.
- Accordingly, the present application also covers a recording medium storing a program for executing the method according to the present application.
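As a reading aid, the four operations enumerated in the description above can be strung together in a few lines. This is a minimal sketch only: `extract`, `memory_net`, `fc_net`, and `select` are hypothetical stand-ins for the networks and the selection step, not the concrete implementation of this application.

```python
import numpy as np

def video_summary(shots, extract, memory_net, fc_net, select):
    """Minimal sketch of the described flow; every callable is a
    hypothetical stand-in, not this application's concrete network."""
    feats = np.stack([extract(s) for s in shots])       # image feature per shot
    g = memory_net(feats)                               # global feature from all shots
    weights = np.array([fc_net(f, g) for f in feats])   # weight per shot
    return select(shots, weights)                       # shots kept for the summary
```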
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Computer Security & Cryptography (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Studio Devices (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
Description
Claims (38)
- A video summary generation method, comprising: performing feature extraction on shots in a shot sequence of a video stream to be processed to obtain an image feature of each shot, each shot comprising at least one frame of video image; obtaining a global feature of the shots according to the image features of all the shots; determining a weight of the shot according to the image feature of the shot and the global feature; and obtaining a video summary of the video stream to be processed based on the weight of the shot.
- The method according to claim 1, wherein obtaining the global feature of the shots according to the image features of all the shots comprises: processing the image features of all the shots based on a memory neural network to obtain the global feature of the shots.
- The method according to claim 2, wherein processing the image features of all the shots by the memory neural network to obtain the global feature of the shots comprises: mapping the image features of all the shots to a first embedding matrix and a second embedding matrix respectively to obtain an input memory and an output memory; and obtaining the global feature of the shot according to the image feature of the shot, the input memory, and the output memory.
- The method according to claim 3, wherein obtaining the global feature of the shot according to the image feature of the shot, the input memory, and the output memory comprises: mapping the image feature of the shot to a third embedding matrix to obtain a feature vector of the shot; performing an inner product operation on the feature vector and the input memory to obtain a weight vector of the shot; and performing a weighted superposition operation on the weight vector and the output memory to obtain a global vector, the global vector serving as the global feature [sketched after the claims as `global_feature`].
- The method according to any one of claims 1-4, wherein determining the weight of the shot according to the image feature of the shot and the global feature comprises: performing an inner product operation on the image feature of the shot and the global feature of the shot to obtain a weight feature; and passing the weight feature through a fully connected neural network to obtain the weight of the shot [sketched after the claims as `shot_weight`].
- The method according to any one of claims 2-5, wherein processing the image features of the shots based on the memory neural network to obtain the global feature of the shots comprises: processing the image features of the shots based on the memory neural network to obtain at least two global features of the shots.
- The method according to claim 6, wherein processing the image features of the shots based on the memory neural network to obtain the at least two global features of the shots comprises: mapping the image features of the shots to at least two embedding matrix groups respectively to obtain at least two memory groups, each embedding matrix group comprising two embedding matrices, and each memory group comprising an input memory and an output memory; and obtaining the at least two global features of the shot according to the at least two memory groups and the image feature of the shot.
- The method according to claim 7, wherein obtaining the at least two global features of the shot according to the at least two memory groups and the image feature of the shot comprises: mapping the image feature of the shot to a third embedding matrix to obtain a feature vector of the shot; performing inner product operations on the feature vector and at least two of the input memories to obtain at least two weight vectors of the shot; and performing weighted superposition operations on the weight vectors and at least two of the output memories to obtain at least two global vectors, the at least two global vectors serving as the at least two global features.
- The method according to any one of claims 6-8, wherein determining the weight of the shot according to the image feature of the shot and the global feature comprises: performing an inner product operation on the image feature of the shot and a first global feature among the at least two global features of the shot to obtain a first weight feature; taking the first weight feature as the image feature and taking a second global feature among the at least two global features of the shot as the first global feature, the second global feature being a global feature among the at least two global features other than the first global feature; performing the inner product operation on the image feature of the shot and the first global feature among the at least two global features of the shot to obtain the first weight feature, until the at least two global features of the shot include no second global feature; taking the first weight feature as the weight feature of the shot; and passing the weight feature through a fully connected neural network to obtain the weight of the shot [sketched after the claims as `multi_hop_weight_feature`].
- The method according to any one of claims 1-9, wherein before performing feature extraction on the shots in the shot sequence of the video stream to be processed to obtain the image features of the shots, the method further comprises: performing shot segmentation on the video stream to be processed to obtain the shot sequence.
- The method according to claim 10, wherein performing shot segmentation on the video stream to be processed to obtain the shot sequence comprises: performing shot segmentation based on a similarity between at least two frames of video images in the video stream to be processed to obtain the shot sequence.
- The method according to claim 11, wherein performing shot segmentation based on the similarity between at least two frames of video images in the video stream to be processed to obtain the shot sequence comprises: segmenting the video images in the video stream based on at least two segmentation pitches of different sizes to obtain at least two video segment groups, each video segment group comprising at least two video segments, and each segmentation pitch being greater than or equal to one frame; determining whether the segmentation is correct based on a similarity between at least two break frames in each video segment group, a break frame being the first frame of a video segment; and in response to the segmentation being correct, determining the video segments as the shots to obtain the shot sequence [claims 12-14 are sketched after the claims as `segment_shots`].
- The method according to claim 12, wherein determining whether the segmentation is correct based on the similarity between at least two break frames in each video segment group comprises: in response to the similarity between the at least two break frames being less than or equal to a set value, determining that the segmentation is correct; and in response to the similarity between the at least two break frames being greater than the set value, determining that the segmentation is incorrect.
- The method according to claim 12 or 13, wherein, in response to the segmentation being correct, determining the video segments as the shots to obtain the shot sequence comprises: in response to a break frame corresponding to at least two of the segmentation pitches, taking the video segment obtained with the smaller segmentation pitch as the shot to obtain the shot sequence.
- The method according to any one of claims 1-14, wherein performing feature extraction on the shots in the shot sequence of the video stream to be processed to obtain the image feature of each shot comprises: performing feature extraction on at least one frame of video image in the shot to obtain at least one image feature; and obtaining a mean feature of all the image features and taking the mean feature as the image feature of the shot [sketched after the claims as `shot_image_feature`].
- The method according to any one of claims 1-15, wherein obtaining the video summary of the video stream to be processed based on the weight of the shot comprises: obtaining a limited duration of the video summary; and obtaining the video summary of the video stream to be processed according to the weights of the shots and the limited duration of the video summary [sketched after the claims as `select_shots`].
- The method according to any one of claims 1-16, wherein the method is implemented based on a feature extraction network and a memory neural network, and before performing feature extraction on the shots in the shot sequence of the video stream to be processed to obtain the image feature of each shot, the method further comprises: jointly training the feature extraction network and the memory neural network based on a sample video stream, the sample video stream comprising at least two sample shots, and each sample shot comprising a labeled weight [sketched after the claims as `joint_training_step`].
- A video summary generation apparatus, comprising: a feature extraction unit configured to perform feature extraction on shots in a shot sequence of a video stream to be processed to obtain an image feature of each shot, each shot comprising at least one frame of video image; a global feature unit configured to obtain a global feature of the shots according to the image features of all the shots; a weight obtaining unit configured to determine a weight of the shot according to the image feature of the shot and the global feature; and a summary generation unit configured to obtain a video summary of the video stream to be processed based on the weight of the shot.
- The apparatus according to claim 18, wherein the global feature unit is configured to process the image features of all the shots based on a memory neural network to obtain the global feature of the shots.
- The apparatus according to claim 19, wherein the global feature unit is configured to map the image features of all the shots to a first embedding matrix and a second embedding matrix respectively to obtain an input memory and an output memory, and to obtain the global feature of the shot according to the image feature of the shot, the input memory, and the output memory.
- The apparatus according to claim 20, wherein, when obtaining the global feature of the shot according to the image feature of the shot, the input memory, and the output memory, the global feature unit is configured to map the image feature of the shot to a third embedding matrix to obtain a feature vector of the shot; perform an inner product operation on the feature vector and the input memory to obtain a weight vector of the shot; and perform a weighted superposition operation on the weight vector and the output memory to obtain a global vector, the global vector serving as the global feature.
- The apparatus according to any one of claims 18-21, wherein the weight obtaining unit is configured to perform an inner product operation on the image feature of the shot and the global feature of the shot to obtain a weight feature, and to pass the weight feature through a fully connected neural network to obtain the weight of the shot.
- The apparatus according to any one of claims 19-22, wherein the global feature unit is configured to process the image features of the shots based on the memory neural network to obtain at least two global features of the shots.
- The apparatus according to claim 23, wherein the global feature unit is configured to map the image features of the shots to at least two embedding matrix groups respectively to obtain at least two memory groups, each embedding matrix group comprising two embedding matrices, and each memory group comprising an input memory and an output memory; and to obtain the at least two global features of the shot according to the at least two memory groups and the image feature of the shot.
- The apparatus according to claim 24, wherein, when obtaining the at least two global features of the shot according to the at least two memory groups and the image feature of the shot, the global feature unit is configured to map the image feature of the shot to a third embedding matrix to obtain a feature vector of the shot; perform inner product operations on the feature vector and at least two of the input memories to obtain at least two weight vectors of the shot; and perform weighted superposition operations on the weight vectors and at least two of the output memories to obtain at least two global vectors, the at least two global vectors serving as the at least two global features.
- The apparatus according to any one of claims 23-25, wherein the weight obtaining unit is configured to perform an inner product operation on the image feature of the shot and a first global feature among the at least two global features of the shot to obtain a first weight feature; take the first weight feature as the image feature and take a second global feature among the at least two global features of the shot as the first global feature, the second global feature being a global feature among the at least two global features other than the first global feature; perform the inner product operation on the image feature of the shot and the first global feature among the at least two global features of the shot to obtain the first weight feature, until the at least two global features of the shot include no second global feature; take the first weight feature as the weight feature of the shot; and pass the weight feature through a fully connected neural network to obtain the weight of the shot.
- The apparatus according to any one of claims 18-26, further comprising: a shot segmentation unit configured to perform shot segmentation on the video stream to be processed to obtain the shot sequence.
- The apparatus according to claim 27, wherein the shot segmentation unit is configured to perform shot segmentation based on a similarity between at least two frames of video images in the video stream to be processed to obtain the shot sequence.
- The apparatus according to claim 28, wherein the shot segmentation unit is configured to segment the video images in the video stream based on at least two segmentation pitches of different sizes to obtain at least two video segment groups, each video segment group comprising at least two video segments, and each segmentation pitch being greater than or equal to one frame; determine whether the segmentation is correct based on a similarity between at least two break frames in each video segment group, a break frame being the first frame of a video segment; and, in response to the segmentation being correct, determine the video segments as the shots to obtain the shot sequence.
- The apparatus according to claim 29, wherein, when determining whether the segmentation is correct based on the similarity between at least two break frames in each video segment group, the shot segmentation unit is configured to determine that the segmentation is correct in response to the similarity between the at least two break frames being less than or equal to a set value, and to determine that the segmentation is incorrect in response to the similarity between the at least two break frames being greater than the set value.
- The apparatus according to claim 29 or 30, wherein, when determining the video segments as the shots to obtain the shot sequence in response to the segmentation being correct, the shot segmentation unit is configured to take, in response to a break frame corresponding to at least two of the segmentation pitches, the video segment obtained with the smaller segmentation pitch as the shot to obtain the shot sequence.
- The apparatus according to any one of claims 18-31, wherein the feature extraction unit is configured to perform feature extraction on at least one frame of video image in the shot to obtain at least one image feature, and to obtain a mean feature of all the image features and take the mean feature as the image feature of the shot.
- The apparatus according to any one of claims 18-32, wherein the summary generation unit is configured to obtain a limited duration of the video summary, and to obtain the video summary of the video stream to be processed according to the weights of the shots and the limited duration of the video summary.
- The apparatus according to any one of claims 18-33, further comprising: a joint training unit configured to jointly train the feature extraction network and the memory neural network based on a sample video stream, the sample video stream comprising at least two sample shots, and each sample shot comprising a labeled weight.
- An electronic device, comprising a processor, the processor comprising the video summary generation apparatus according to any one of claims 18 to 34.
- An electronic device, comprising: a memory configured to store executable instructions; and a processor configured to communicate with the memory to execute the executable instructions so as to complete the operations of the video summary generation method according to any one of claims 1 to 17.
- A computer storage medium configured to store computer-readable instructions, wherein, when the instructions are executed, the operations of the video summary generation method according to any one of claims 1 to 17 are performed.
- A computer program product comprising computer-readable code, wherein, when the computer-readable code runs on a device, a processor in the device executes instructions configured to implement the video summary generation method according to any one of claims 1 to 17.
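To make the claimed operations concrete, a few illustrative sketches follow; none of them is the patented implementation. First, the single-hop memory read of claims 3-4, in numpy. `A`, `C`, and `B` stand for the first, second, and third embedding matrices; the softmax normalization of the weight vector is our assumption, since the claims specify only the inner product and the weighted superposition.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_feature(all_feats, shot_feat, A, C, B):
    """Single-hop memory read per claims 3-4 (softmax is our assumption).

    all_feats: (N, D) image features of all shots; shot_feat: (D,) feature
    of the current shot; A, C, B: (D, E) embedding matrices.
    """
    input_memory = all_feats @ A     # first embedding matrix -> input memory
    output_memory = all_feats @ C    # second embedding matrix -> output memory
    u = shot_feat @ B                # third embedding matrix -> feature vector
    p = softmax(input_memory @ u)    # inner product -> weight vector, (N,)
    return p @ output_memory         # weighted superposition -> global vector
```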
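Claim 5, sketched under two stated assumptions: the claimed "inner product" is read as an element-wise product so that a vector rather than a scalar reaches the network, and the fully connected network is taken to be a single hidden layer with ReLU. `W1`, `b1`, `w2`, and `b2` are hypothetical parameters.

```python
import numpy as np

def shot_weight(image_feat, global_feat, W1, b1, w2, b2):
    """Claim 5 sketch: combined weight feature, then a small FC net.

    image_feat and global_feat are assumed to share one dimensionality E;
    W1 is (H, E), b1 is (H,), w2 is (H,), b2 is a scalar.
    """
    weight_feat = image_feat * global_feat        # combined "weight feature"
    h = np.maximum(0.0, W1 @ weight_feat + b1)    # hidden layer with ReLU
    return float(w2 @ h + b2)                     # scalar weight of the shot
```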
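Claim 9's multi-hop combination, under the same element-wise reading: each pass consumes one global feature, the running result becomes the "image feature" of the next pass, and the final weight feature feeds the fully connected network of claim 5.

```python
def multi_hop_weight_feature(image_feat, global_feats):
    """Claim 9 sketch: fold the shot's feature through each global feature
    in turn. Inputs are numpy arrays of equal length."""
    feat = image_feat
    for g in global_feats:   # the first global feature, then each "second" one
        feat = feat * g
    return feat              # final weight feature, then the FC net of claim 5
```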
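A sketch of the multi-pitch segmentation of claims 12-14 over per-frame feature vectors. The cosine similarity, the pitch sizes, and the threshold are illustrative choices only; the claims fix just the break-frame comparison against a set value and the preference for the smaller pitch when two pitches propose the same cut.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def segment_shots(frame_feats, pitches=(8, 16), threshold=0.8):
    """Claims 12-14 sketch: cut at several pitches, keep a cut when its break
    frame (the first frame of the candidate segment) is dissimilar enough
    from the previous break frame; smaller pitches are processed first so
    they win where two pitches agree (claim 14).
    """
    n = len(frame_feats)
    cuts = set()
    for pitch in sorted(pitches):
        for start in range(pitch, n, pitch):
            if start in cuts:          # already accepted at a smaller pitch
                continue
            if cosine(frame_feats[start - pitch], frame_feats[start]) <= threshold:
                cuts.add(start)        # segmentation judged correct (claim 13)
    bounds = [0] + sorted(cuts) + [n]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
```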
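Claim 15's shot-level feature is simply the mean over the per-frame features:

```python
import numpy as np

def shot_image_feature(frame_feats):
    """Claim 15: the shot's image feature is the mean of its frames' features."""
    return np.stack(frame_feats).mean(axis=0)
```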
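Claim 16 leaves the selection rule open beyond "the weights and a limited duration"; one natural formulation is a 0/1 knapsack over integer (e.g. whole-second) shot durations, sketched here. A simpler greedy pass would also satisfy the claim's wording.

```python
def select_shots(durations, weights, budget):
    """Claim 16 sketch: keep the shots maximizing total weight within the
    summary's limited duration (our knapsack formulation; durations and
    budget are integers).
    """
    n = len(durations)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d, w = durations[i - 1], weights[i - 1]
        for t in range(budget + 1):
            best[i][t] = best[i - 1][t]
            if d <= t and best[i - 1][t - d] + w > best[i][t]:
                best[i][t] = best[i - 1][t - d] + w
    chosen, t = [], budget              # backtrack the chosen shot indices
    for i in range(n, 0, -1):
        if best[i][t] != best[i - 1][t]:
            chosen.append(i - 1)
            t -= durations[i - 1]
    return sorted(chosen)
```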
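Finally, claim 17's joint training, sketched with PyTorch; the MSE loss against the labeled per-shot weights is our choice, as the claim does not name a loss, and `feature_net` and `memory_net` are hypothetical modules.

```python
import torch
import torch.nn.functional as F

def joint_training_step(feature_net, memory_net, sample_shots, labeled_weights, optimizer):
    """Claim 17 sketch: one end-to-end step over the feature extraction
    network and the memory neural network, supervised by labeled weights."""
    feats = torch.stack([feature_net(s) for s in sample_shots])  # (N, D)
    predicted = memory_net(feats)                                # (N,) weights
    loss = F.mse_loss(predicted, labeled_weights)
    optimizer.zero_grad()
    loss.backward()          # gradients flow into both networks jointly
    optimizer.step()
    return loss.item()
```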
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SG11202003999QA SG11202003999QA (en) | 2018-10-19 | 2019-05-22 | Video summary generation method and apparatus, electronic device, and computer storage medium |
JP2020524009A JP7150840B2 (ja) | 2018-10-19 | 2019-05-22 | Video summary generation method and apparatus, electronic device, and computer storage medium |
US16/884,177 US20200285859A1 (en) | 2018-10-19 | 2020-05-27 | Video summary generation method and apparatus, electronic device, and computer storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811224169.XA CN109413510B (zh) | 2018-10-19 | 2018-10-19 | Video summary generation method and apparatus, electronic device, and computer storage medium |
CN201811224169.X | 2018-10-19 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/884,177 Continuation US20200285859A1 (en) | 2018-10-19 | 2020-05-27 | Video summary generation method and apparatus, electronic device, and computer storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020077999A1 true WO2020077999A1 (zh) | 2020-04-23 |
Family
ID=65468671
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/088020 WO2020077999A1 (zh) | 2018-10-19 | 2020-04-23 | Video summary generation method and apparatus, electronic device, and computer storage medium |
Country Status (6)
Country | Link |
---|---|
US (1) | US20200285859A1 (zh) |
JP (1) | JP7150840B2 (zh) |
CN (1) | CN109413510B (zh) |
SG (1) | SG11202003999QA (zh) |
TW (1) | TWI711305B (zh) |
WO (1) | WO2020077999A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113556577A (zh) * | 2021-07-21 | 2021-10-26 | 北京字节跳动网络技术有限公司 | Video generation method and apparatus |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109413510B (zh) * | 2018-10-19 | 2021-05-18 | 深圳市商汤科技有限公司 | Video summary generation method and apparatus, electronic device, and computer storage medium |
CN110381392B (zh) * | 2019-06-06 | 2021-08-10 | 五邑大学 | Video summary extraction method, and system, apparatus, and storage medium therefor |
CN110933519A (zh) * | 2019-11-05 | 2020-03-27 | 合肥工业大学 | Memory network video summarization method based on multi-channel features |
CN111641868A (zh) * | 2020-05-27 | 2020-09-08 | 维沃移动通信有限公司 | Preview video generation method and apparatus, and electronic device |
CN112532897B (zh) * | 2020-11-25 | 2022-07-01 | 腾讯科技(深圳)有限公司 | Video clipping method and apparatus, device, and computer-readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120293686A1 (en) * | 2011-05-18 | 2012-11-22 | Keith Stoll Karn | Video summary including a feature of interest |
CN102906745A (zh) * | 2010-05-25 | 2013-01-30 | 伊斯曼柯达公司 | Determining key video snippets using selection criteria to form a video summary |
CN106612468A (zh) * | 2015-10-21 | 2017-05-03 | 上海文广互动电视有限公司 | System and method for automatically generating a video summary |
CN107222795A (zh) * | 2017-06-23 | 2017-09-29 | 南京理工大学 | Multi-feature fusion video summary generation method |
CN107590442A (zh) * | 2017-08-22 | 2018-01-16 | 华中科技大学 | Video semantic scene segmentation method based on a convolutional neural network |
CN108073902A (zh) * | 2017-12-19 | 2018-05-25 | 深圳先进技术研究院 | Deep-learning-based video summarization method and apparatus, and terminal device |
CN109413510A (zh) * | 2018-10-19 | 2019-03-01 | 深圳市商汤科技有限公司 | Video summary generation method and apparatus, electronic device, and computer storage medium |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5758257A (en) * | 1994-11-29 | 1998-05-26 | Herz; Frederick | System and method for scheduling broadcast of and access to video programs and other data using customer profiles |
CN101778257B (zh) * | 2010-03-05 | 2011-10-26 | 北京邮电大学 | Generation method for video summary segments in digital video-on-demand |
US10387729B2 (en) * | 2013-07-09 | 2019-08-20 | Outward, Inc. | Tagging virtualized content |
US10386440B2 (en) * | 2014-07-03 | 2019-08-20 | Koninklijke Philips N.V. | Multi-shot magnetic-resonance (MR) imaging system and method of operation thereof |
US9436876B1 (en) * | 2014-12-19 | 2016-09-06 | Amazon Technologies, Inc. | Video segmentation techniques |
CN105228033B (zh) * | 2015-08-27 | 2018-11-09 | 联想(北京)有限公司 | Video processing method and electronic device |
US9807473B2 (en) * | 2015-11-20 | 2017-10-31 | Microsoft Technology Licensing, Llc | Jointly modeling embedding and translation to bridge video and language |
CN106851437A (zh) * | 2017-01-17 | 2017-06-13 | 南通同洲电子有限责任公司 | Method for extracting a video summary |
US10592751B2 (en) * | 2017-02-03 | 2020-03-17 | Fuji Xerox Co., Ltd. | Method and system to generate targeted captions and summarize long, continuous media files |
CN106888407B (zh) * | 2017-03-28 | 2019-04-02 | 腾讯科技(深圳)有限公司 | Video summary generation method and apparatus |
CN107484017B (zh) * | 2017-07-25 | 2020-05-26 | 天津大学 | Supervised video summary generation method based on an attention model |
CN108024158A (zh) * | 2017-11-30 | 2018-05-11 | 天津大学 | Supervised video summary extraction method using a visual attention mechanism |
- 2018
- 2018-10-19 CN CN201811224169.XA patent/CN109413510B/zh active Active
- 2019
- 2019-05-22 SG SG11202003999QA patent/SG11202003999QA/en unknown
- 2019-05-22 WO PCT/CN2019/088020 patent/WO2020077999A1/zh active Application Filing
- 2019-05-22 JP JP2020524009A patent/JP7150840B2/ja active Active
- 2019-08-27 TW TW108130688A patent/TWI711305B/zh active
- 2020
- 2020-05-27 US US16/884,177 patent/US20200285859A1/en not_active Abandoned
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102906745A (zh) * | 2010-05-25 | 2013-01-30 | 伊斯曼柯达公司 | Determining key video snippets using selection criteria to form a video summary |
US20120293686A1 (en) * | 2011-05-18 | 2012-11-22 | Keith Stoll Karn | Video summary including a feature of interest |
CN106612468A (zh) * | 2015-10-21 | 2017-05-03 | 上海文广互动电视有限公司 | System and method for automatically generating a video summary |
CN107222795A (zh) * | 2017-06-23 | 2017-09-29 | 南京理工大学 | Multi-feature fusion video summary generation method |
CN107590442A (zh) * | 2017-08-22 | 2018-01-16 | 华中科技大学 | Video semantic scene segmentation method based on a convolutional neural network |
CN108073902A (zh) * | 2017-12-19 | 2018-05-25 | 深圳先进技术研究院 | Deep-learning-based video summarization method and apparatus, and terminal device |
CN109413510A (zh) * | 2018-10-19 | 2019-03-01 | 深圳市商汤科技有限公司 | Video summary generation method and apparatus, electronic device, and computer storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113556577A (zh) * | 2021-07-21 | 2021-10-26 | 北京字节跳动网络技术有限公司 | Video generation method and apparatus |
CN113556577B (zh) * | 2021-07-21 | 2022-09-09 | 北京字节跳动网络技术有限公司 | Video generation method and apparatus |
Also Published As
Publication number | Publication date |
---|---|
SG11202003999QA (en) | 2020-05-28 |
CN109413510A (zh) | 2019-03-01 |
US20200285859A1 (en) | 2020-09-10 |
TWI711305B (zh) | 2020-11-21 |
JP2021503123A (ja) | 2021-02-04 |
TW202032999A (zh) | 2020-09-01 |
JP7150840B2 (ja) | 2022-10-11 |
CN109413510B (zh) | 2021-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020077999A1 (zh) | Video summary generation method and apparatus, electronic device, and computer storage medium | |
Zhong et al. | Ghostvlad for set-based face recognition | |
WO2022111506A1 (zh) | Video action recognition method and apparatus, electronic device, and storage medium | |
Weinzaepfel et al. | Mimetics: Towards understanding human actions out of context | |
WO2020228525A1 (zh) | Method and apparatus for place recognition and model training, and electronic device | |
US8750602B2 (en) | Method and system for personalized advertisement push based on user interest learning | |
WO2020177673A1 (zh) | Video sequence selection method, computer device, and storage medium | |
US11270124B1 (en) | Temporal bottleneck attention architecture for video action recognition | |
Zhang et al. | Feature aggregation with reinforcement learning for video-based person re-identification | |
Kucer et al. | Leveraging expert feature knowledge for predicting image aesthetics | |
Dhall et al. | Finding happiest moments in a social context | |
WO2018196718A1 (zh) | Image disambiguation method and apparatus, storage medium, and electronic device | |
Zhang et al. | Deep metric learning with improved triplet loss for face clustering in videos | |
Huang et al. | Benchmarking still-to-video face recognition via partial and local linear discriminant analysis on COX-S2V dataset | |
CN111209897A (zh) | Video processing method and apparatus, and storage medium | |
CN111553838A (zh) | Model parameter updating method and apparatus, device, and storage medium | |
Zhang et al. | Contrastive positive mining for unsupervised 3d action representation learning | |
WO2023109361A1 (zh) | Method, system, device, medium, and product for video processing | |
CN107220597B (zh) | Key-frame selection method for a human action recognition process based on local features and a bag-of-words model | |
Hou et al. | Deep generative image priors for semantic face manipulation | |
Dong et al. | A supervised dictionary learning and discriminative weighting model for action recognition | |
CN117315752A (zh) | Training method and apparatus, device, and medium for a facial emotion recognition network model | |
Zhou et al. | Test-time domain generalization for face anti-spoofing | |
Rao et al. | Non-local attentive temporal network for video-based person re-identification | |
Lee et al. | Sequence feature generation with temporal unrolling network for zero-shot action recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
ENP | Entry into the national phase |
Ref document number: 2020524009 Country of ref document: JP Kind code of ref document: A |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19873613 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19/08/2021) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19873613 Country of ref document: EP Kind code of ref document: A1 |