WO2020024958A1 - Method and system for generating video abstract - Google Patents

Method and system for generating video abstract Download PDF

Info

Publication number
WO2020024958A1
Authority
WO
WIPO (PCT)
Prior art keywords
shot
video
lens
shots
feature vector
Prior art date
Application number
PCT/CN2019/098495
Other languages
French (fr)
Chinese (zh)
Inventor
曾建平
吴立薪
吕晶晶
包勇军
Original Assignee
北京京东尚科信息技术有限公司
北京京东世纪贸易有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京京东尚科信息技术有限公司, 北京京东世纪贸易有限公司
Publication of WO2020024958A1 publication Critical patent/WO2020024958A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 Assembly of content; Generation of multimedia applications
    • H04N 21/854 Content authoring
    • H04N 21/8549 Creating video summaries, e.g. movie trailer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks

Definitions

  • the present disclosure relates to the field of video technology, and in particular, to a method and system for generating a video abstract.
  • A video summary selects key frames or key segments from a longer video and stitches them into a shorter video, so that the viewer can understand the content of the original video, or enjoy its highlights, in a short time.
  • Video summarization has a wide range of application scenarios, including personal video editing, TV and movie plot introduction, video-assisted criminal investigation, and Internet short videos.
  • In related-art methods for generating a video summary, because video evaluation is highly subjective, the generated video summary may lose some important segments or exciting content.
  • For example, related-art video summarization methods generally select key frames and key segments based on general-purpose criteria, and there are few methods that generate video summaries for specific scenes and applications. As a result, such methods perform poorly in some specific application scenarios, especially in the field of video advertising.
  • A summarized advertisement video may lose the key segments that introduce the product brand and product characteristics.
  • a technical problem solved by the embodiments of the present disclosure is to provide a method for generating a video summary, so that the video summary can include some relatively important shots or fragments.
  • According to an aspect of the embodiments of the present disclosure, there is provided a computer-implemented method for generating a video summary, including: receiving a video and dividing the video into multiple shots according to changes in the video scene of the video, wherein each of the multiple shots is a video scene with continuous content; calculating an importance score of each shot; selecting a group of shots from the multiple shots such that the total importance score of the selected group of shots is the largest while a constraint on the total duration of the video summary is satisfied; and stitching the selected group of shots into a video summary and outputting the video summary.
  • In some embodiments, the step of calculating the importance score of each shot includes: extracting a feature vector from each shot using a three-dimensional convolutional network to obtain a feature vector sequence of the shot set composed of the multiple shots; and inputting the feature vector sequence into a pre-trained shot importance score calculation network to calculate the importance score of each shot.
  • In some embodiments, before the video is divided into multiple shots, the method further includes: training the shot importance score calculation network using a reinforcement learning method, wherein the key elements of the reinforcement learning method include an action and a value reward function, and the value reward function includes a diversity indicator and a representativeness indicator.
  • In some embodiments, before the group of shots is selected from the multiple shots, the method further includes: identifying, among the multiple shots, at least one shot that shows a key feature.
  • the key feature includes at least one of a product brand trademark and a product brand text.
  • In some embodiments, the step of identifying, among the multiple shots, at least one shot that shows a key feature includes: detecting a trademark region in each frame of the video using a deep-learning-based object detection method, inputting the image of the trademark region into a pre-trained depth model to extract an embedded feature vector, and comparing the embedded feature vector with the feature vectors of trademark images in a database to obtain the brand type of the trademark, thereby identifying at least one shot showing the product brand trademark; or recognizing the text in each frame of the video using a deep-learning-based optical character recognition method, performing word segmentation on the text, matching the processed text with the brand text in a database, and retaining the text related to the product brand, thereby identifying at least one shot showing the product brand text.
  • In some embodiments, the step of selecting a group of shots from the multiple shots includes: selecting at least one main shot from the at least one shot showing a key feature, selecting at least one auxiliary shot from the remaining shots of the multiple shots other than the at least one main shot, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots.
  • In some embodiments, the step of selecting at least one main shot from the at least one shot showing a key feature includes: if a shot selected from the at least one shot showing a key feature is among the frontmost Ng shots or the rearmost Ng shots of the video, determining the shot among the frontmost Ng shots or the rearmost Ng shots to be the at least one main shot, where Ng is a positive integer. The step of selecting at least one auxiliary shot from the remaining shots of the multiple shots and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots includes: selecting at least one auxiliary shot from the remaining shots of the multiple shots other than the at least one main shot, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots, so that the selected group of shots has the largest total importance score while the constraint on the total duration of the video summary is satisfied. The step of stitching the selected group of shots into a video summary includes: stitching the at least one main shot and the at least one auxiliary shot into the video summary in chronological order.
  • In some embodiments, before the at least one shot showing a key feature is identified among the multiple shots, the method further includes: calculating the similarity between each shot and a picture of the advertised product, and using the similarity to correct the importance score of the corresponding shot.
  • In some embodiments, the step of calculating the similarity between each shot and the advertised product picture and using the similarity to correct the importance score of the corresponding shot includes: calculating a feature vector of the advertised product picture; sampling the multi-frame images of each shot to obtain sample frames, and calculating the feature vectors of the sample frames of each shot; calculating the similarity between each shot and the product picture from the feature vector of the product picture and the feature vectors of the sample frames of the shot; and correcting the importance score of the corresponding shot according to the similarity and a preset similarity threshold.
  • According to another aspect of the embodiments of the present disclosure, there is provided a system for generating a video summary, including: a video segmentation unit configured to receive a video and divide the video into multiple shots according to changes in the video scene of the video, wherein each of the multiple shots is a video scene with continuous content; a calculation unit configured to calculate an importance score of each shot; a selection unit configured to select a group of shots from the multiple shots such that the total importance score of the selected group of shots is the largest while the constraint on the total duration of the video summary is satisfied; and a stitching unit configured to stitch the selected group of shots into a video summary and output the video summary.
  • In some embodiments, the calculation unit is configured to extract a feature vector from each shot using a three-dimensional convolutional network to obtain a feature vector sequence of the shot set composed of the multiple shots, and to input the feature vector sequence into a pre-trained shot importance score calculation network to calculate the importance score of each shot.
  • In some embodiments, the system further includes: a training unit configured to train the shot importance score calculation network using a reinforcement learning method, wherein the key elements of the reinforcement learning method include an action and a value reward function, and the value reward function includes a diversity indicator and a representativeness indicator.
  • the system further includes a recognition unit configured to identify at least one shot among the plurality of shots that exhibits a key feature.
  • the key feature includes at least one of a product brand trademark and a product brand text.
  • In some embodiments, the recognition unit is configured to: detect a trademark region in each frame of the video using a deep-learning-based object detection method, input the image of the trademark region into a pre-trained depth model to extract an embedded feature vector, and compare the embedded feature vector with the feature vectors of trademark images in a database to obtain the brand type of the trademark, thereby identifying at least one shot showing the product brand trademark; or recognize the text in each frame of the video using a deep-learning-based optical character recognition method, perform word segmentation on the text, match the processed text with the brand text in the database, and retain the text related to the product brand, thereby identifying at least one shot showing the product brand text.
  • In some embodiments, the selection unit is configured to select at least one main shot from the at least one shot showing a key feature, select at least one auxiliary shot from the remaining shots of the multiple shots other than the at least one main shot, and take the at least one main shot and the at least one auxiliary shot as the selected group of shots.
  • In some embodiments, the selection unit is configured to: if a shot selected from the at least one shot showing a key feature is among the frontmost Ng shots or the rearmost Ng shots of the video, determine the shot among the frontmost Ng shots or the rearmost Ng shots to be the at least one main shot, where Ng is a positive integer; and select at least one auxiliary shot from the remaining shots of the multiple shots other than the at least one main shot, and take the at least one main shot and the at least one auxiliary shot as the selected group of shots, so that the selected group of shots has the largest total importance score while the constraint on the total duration of the video summary is satisfied.
  • the stitching unit is configured to stitch the at least one main shot and the at least one auxiliary shot into a video summary in chronological order.
  • the system further includes a correction unit configured to calculate the similarity between each shot and the advertised product picture, and use the similarity to correct the importance score of the corresponding shot.
  • In some embodiments, the correction unit is configured to: calculate a feature vector of the advertised product picture; sample the multi-frame images of each shot to obtain sample frames, and calculate the feature vectors of the sample frames of each shot; calculate the similarity between each shot and the product picture from the feature vector of the product picture and the feature vectors of the sample frames of the shot; and correct the importance score of the corresponding shot according to the similarity and a preset similarity threshold.
  • According to another aspect of the embodiments of the present disclosure, there is provided a system for generating a video summary, including: a memory; and a processor coupled to the memory, the processor being configured to execute the method described above based on instructions stored in the memory.
  • a computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor, implement the steps of the method described above.
  • In the embodiments of the present disclosure, the importance score of each shot is calculated, and in the process of selecting a group of shots, the group of shots whose total importance score is the largest under the constraint on the total duration of the video summary is selected, stitched into a video summary, and the video summary is output. Therefore, this method can make the video summary contain some of the more important shots or segments.
  • FIG. 1 is a flowchart illustrating a method for generating a video digest according to some embodiments of the present disclosure
  • FIG. 2 is a flowchart illustrating a method of calculating an importance score of each shot according to some embodiments of the present disclosure
  • FIG. 3 is a flowchart illustrating a method of calculating an importance score of each shot according to other embodiments of the present disclosure
  • FIG. 4 is a flowchart illustrating a method for generating a video digest according to other embodiments of the present disclosure
  • FIG. 5 is a flowchart illustrating a method of correcting an importance score of a lens according to some embodiments of the present disclosure
  • FIG. 6 is a structural diagram illustrating a system for generating a video digest according to some embodiments of the present disclosure
  • FIG. 7 is a structural diagram illustrating a system for generating a video digest according to other embodiments of the present disclosure
  • FIG. 8 is a structural diagram illustrating a system for generating a video digest according to other embodiments of the present disclosure.
  • FIG. 9 is a structural diagram illustrating a system for generating a video digest according to other embodiments of the present disclosure.
  • Any specific value should be construed as merely exemplary rather than as a limitation. Therefore, other examples of the exemplary embodiments may have different values.
  • FIG. 1 is a flowchart illustrating a method for generating a video digest according to some embodiments of the present disclosure.
  • FIG. 2 is a flowchart illustrating a method of calculating an importance score of each shot according to some embodiments of the present disclosure.
  • FIG. 3 is a flowchart illustrating a method of calculating an importance score of each shot according to other embodiments of the present disclosure.
  • the method for generating a video summary executed by a computer according to some embodiments of the present disclosure is described in detail below with reference to FIGS. 1 to 3. As shown in FIG. 1, the method may include steps S102 to S108.
  • step S102 a video is received, and the video is divided into multiple shots according to changes in the video scene of the video, where each shot of the multiple shots is a segment of continuous video scene.
  • For example, a KTS (Kernel Temporal Segmentation) method may be used to divide the video into shots.
  • This method has a good segmentation effect and is fast.
  • However, the present disclosure is not limited to the KTS method; other shot segmentation methods may also be adopted.
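  • As a concrete illustration of this segmentation step, the sketch below splits a video into shots by thresholding the similarity of consecutive frames' color histograms. This is a simplified stand-in, not the KTS algorithm itself, and the threshold value is an assumption to be tuned per dataset:

```python
import cv2

def segment_into_shots(video_path, threshold=0.6):
    """Split a video into shots at abrupt scene changes.

    A simplified stand-in for KTS: a shot boundary is declared whenever the
    correlation between consecutive frames' HSV color histograms drops below
    `threshold` (an assumed value). Returns (start_frame, end_frame) pairs.
    """
    cap = cv2.VideoCapture(video_path)
    shots, start, prev_hist, idx = [], 0, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < threshold:            # scene change: close the current shot
                shots.append((start, idx - 1))
                start = idx
        prev_hist = hist
        idx += 1
    cap.release()
    if idx > 0:
        shots.append((start, idx - 1))     # close the final shot
    return shots
```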
  • step S104 the importance score of each shot is calculated.
  • In some embodiments, step S104 may include: extracting a feature vector from each shot using a three-dimensional convolutional network to obtain a feature vector sequence of the shot set composed of the multiple shots; and inputting the feature vector sequence into a pre-trained shot importance score calculation network to calculate the importance score of each shot.
  • For step S104 of FIG. 1, a block diagram of a model that implements the calculation of importance scores (the process of calculating importance scores may also be referred to as importance scoring) is shown in FIG. 2.
  • A video shot is a sequence of images that can be represented by a three-dimensional matrix.
  • Therefore, a three-dimensional convolutional network (C3D Net) can be used to process the shots and extract one-dimensional feature vectors; that is, a three-dimensional convolutional network is used as the feature extraction network for video shots.
  • In some embodiments, an inflated 3D convolutional network (I3D for short) can be used to process the shots.
  • Kinetics-600 is a video classification dataset, which contains the activities of people in 600 categories, with a total of more than 500,000 10-second video clips.
  • First, the I3D network is pre-trained using the Kinetics-600 dataset; then the I3D network is used to process each video shot S_t, and the output of the last pooling layer of the network is used as the feature vector X_t, so that the shot set S = {S_t | t = 1, ..., T} is mapped to a feature vector sequence X = {X_t | t = 1, ..., T}.
  • For the I3D network, the output of its last pooling layer is a type of feature embedding that characterizes the essential properties of the video content.
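  • As a hedged sketch of this feature-extraction step: the text uses an I3D network pre-trained on Kinetics-600, which is not bundled with common libraries, so the example below substitutes torchvision's r3d_18 3D CNN (pre-trained on Kinetics-400) and takes the globally pooled output as the shot feature X_t:

```python
import torch
from torchvision.models.video import r3d_18

# Stand-in for the I3D feature extractor of the text: a 3D CNN whose output
# after global average pooling serves as the shot feature vector X_t.
backbone = r3d_18(pretrained=True)      # pre-trained on Kinetics-400
backbone.fc = torch.nn.Identity()       # drop the classifier, keep pooled features
backbone.eval()

@torch.no_grad()
def shot_feature(clip):
    """clip: float tensor of shape (3, num_frames, 112, 112), normalized.
    Returns a 512-dimensional feature vector X_t for the shot."""
    return backbone(clip.unsqueeze(0)).squeeze(0)
```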
  • In some embodiments, the shot importance score calculation network may be a time-series network, for example, a recurrent neural network (RNN for short).
  • The feature vector sequence {X_t | t = 1, ..., T} is input to this network.
  • For example, a bidirectional LSTM (Long Short-Term Memory) network may be used, as sketched below.
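  • A minimal sketch of such a bidirectional LSTM scorer; the hidden size and the sigmoid output head are assumptions:

```python
import torch.nn as nn

class ShotScorer(nn.Module):
    """Bidirectional LSTM mapping a shot feature sequence {X_t} to per-shot
    importance scores in (0, 1). Feature and hidden sizes are assumptions."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, feats):              # feats: (1, T, feat_dim)
        h, _ = self.lstm(feats)            # (1, T, 2 * hidden)
        return self.head(h).squeeze(-1)    # (1, T) importance scores
```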
  • the method may further include: training a shot importance score calculation network using a reinforcement learning method.
  • The key elements of this reinforcement learning method include an action and a value reward function.
  • the value reward function includes: a diversity indicator and a representative indicator. Reinforcement learning is used to train the above model without labeling the video.
  • the reinforcement learning method is an unsupervised learning method.
  • reinforcement learning has two key elements: actions and reward functions.
  • In some embodiments, the action is the shot selection action: Y = {y_i | i = 1, ..., |Y|} denotes the set of time indices of the selected shots, where y_i means that the shot whose time index is y_i is selected, so Y represents a shot selection action. θ denotes the parameters of the above-mentioned bidirectional LSTM model, so the probability that the shot selection action Y occurs is p_θ(Y).
  • The value reward function R(S) has two indicators, diversity R_div and representativeness R_rep:
  • R(S) = R_div + R_rep. (4)
  • For example, the diversity indicator may be defined as the mean pairwise dissimilarity of the selected shot features, R_div = 1 / (|Y| (|Y| - 1)) Σ_{t ∈ Y} Σ_{t' ∈ Y, t' ≠ t} (1 - X_t^T X_t' / (||X_t||_2 ||X_t'||_2)), and the representativeness indicator as R_rep = exp(-(1/T) Σ_{t=1}^{T} min_{t' ∈ Y} ||X_t - X_t'||_2). Here ||X_t||_2 denotes the length (L2 norm) of the feature vector X_t, obtained by summing the squares of the elements of X_t and taking the square root; ||X_t'||_2 similarly denotes the length of the feature vector X_t'; and X_t^T denotes the transpose of the feature vector X_t.
  • the diversity index measures the diversity of content between different shots, and the representative index measures how much the selected video shots represent the original video.
  • During training, the objective is to maximize the expected reward J(θ) = E_{p_θ(a_{1:T})}[R(S)], where a_{1:T} represents the actions taken, that is, which shots are selected and which are not, and p_θ(a_{1:T}) represents the probability that the action sequence a_{1:T} occurs. In practice, the expectation may be approximated by sampling action sequences, where N is the number of action sequences sampled.
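  • The reward can be computed directly from the shot features and a sampled selection; the sketch below implements the diversity and representativeness terms as written above (that exact form is itself a reconstruction, so treat it as an assumption):

```python
import torch
import torch.nn.functional as F

def reward(feats, selected):
    """R(S) = R_div + R_rep for one sampled selection action.

    feats: (T, D) tensor of shot features X_t; selected: non-empty 1-D
    LongTensor of selected shot indices Y. Follows the definitions above."""
    X = F.normalize(feats, dim=1)
    Y = X[selected]
    n = Y.size(0)
    if n < 2:
        r_div = torch.tensor(0.0)
    else:
        d = 1.0 - Y @ Y.t()                     # pairwise cosine dissimilarity
        r_div = d.sum() / (n * (n - 1))         # mean over off-diagonal pairs
    # Representativeness: every shot should have a nearby selected shot.
    dist = torch.cdist(feats, feats[selected])  # (T, |Y|) Euclidean distances
    r_rep = torch.exp(-dist.min(dim=1).values.mean())
    return r_div + r_rep
```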
  • In some embodiments, the aforementioned bidirectional LSTM network is trained using a large number of advertisement videos from Jingdong Mall, and the trained shot importance score calculation network is obtained as the video shot importance scoring network model.
  • step S106 a group of shots is selected from a plurality of shots, so that the total importance score of the selected group of shots is the largest when the constraint condition of the total length of the video summary is satisfied.
  • For example, the constraint condition may be that the total duration of the video summary does not exceed a required maximum total duration.
  • a set of shots is selected from the plurality of shots, and the set of shots has the largest total importance score when the constraint condition of the total length of the video summary is satisfied.
  • step S108 the selected group of shots is stitched into a video summary and the video summary is output.
  • the video summary may be output to a display to display the video summary.
  • the video summary can also be output to other devices.
  • a method for generating a video summary of some embodiments is provided.
  • In this method, after a video is received and divided into multiple shots, the importance score of each shot is calculated; a shot with a larger importance score is a more important shot.
  • A group of shots whose total importance score is the largest under the constraint on the total duration of the video summary is then selected, stitched into a video summary, and the video summary is output. Therefore, this method can make the video summary contain some of the more important shots or segments.
  • the method may further include identifying at least one shot that exhibits a key feature among the plurality of shots.
  • the key feature may include at least one of a product brand trademark and a product brand text.
  • In some embodiments, the above step S106 may include: selecting at least one main shot from the at least one shot showing key features, selecting at least one auxiliary shot from the remaining shots of the multiple shots other than the at least one main shot, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots.
  • At least one shot showing key features is identified, and at least one main shot is selected from at least one shot showing key features, and at least one auxiliary shot is selected from other remaining shots.
  • the at least one main shot and the at least one auxiliary shot are used as a selected group of shots, so that the total importance score of the set of shots is the largest when the constraint condition of the total length of the video summary is satisfied. Stitch the set of shots into a video summary. In this way, the obtained video summary contains key shots, such as the key shots used to introduce product brands or product names in advertising videos, so as to highlight the important information of the video as much as possible.
  • In some embodiments, the method may further include: calculating the similarity between each shot and the advertised product picture, and using the similarity to correct the importance score of the corresponding shot (i.e., the shot corresponding to the similarity). With this correction, the importance of shots that focus on displaying the product can be enhanced, thereby enhancing the ability of the video summary to display the product.
  • FIG. 4 is a flowchart illustrating a method for generating a video digest according to other embodiments of the present disclosure. As shown in FIG. 4, the method may include steps S402 to S412.
  • step S402 a video is received, and the video is divided into multiple shots according to changes in the video scene of the video, where each shot of the multiple shots is a segment of a continuous video scene.
  • This step S402 is the same as or similar to step S102, and details are not described herein again.
  • step S404 the importance score of each shot is calculated. This step S404 is the same as or similar to step S104, and details are not described herein again.
  • In step S406, the similarity between each shot and the advertised product picture is calculated, and the importance score of the corresponding shot is corrected using the similarity.
  • the process of step S406 will be described in detail later with reference to FIG. 5.
  • step S408 at least one shot showing key features is identified among the plurality of shots.
  • the key feature may include at least one of a product brand trademark and a product brand text.
  • In advertising videos, there are generally shots showing the product brand at the beginning or the end of the video, in order to deepen the advertising audience's impression of the product brand and promote the brand. Therefore, the advertising brand shots can be identified and extracted, and retained in the summarized advertisement video.
  • The two sources of information used in the embodiments of the present disclosure to identify advertising brand shots are product brand trademarks and product brand text, for example, the Jingdong mascot and the Jingdong brand characters.
  • In some embodiments, advertising brand shot recognition may include two steps: brand trademark or text recognition, and brand shot determination. It proceeds as follows: (1) use object detection technology to identify the brand trademark, or use OCR (Optical Character Recognition) technology to identify the brand text; (2) brand shot determination: for a shot S_t whose length (i.e., number of video frames) is sl_t, if the brand trademark or text lies in the center area of the image and appears in N_c consecutive frames, this shot is determined to be an advertising brand shot. For example, N_c ≥ sl_t / 2.
  • step S408 may include: using a deep learning-based object detection method to detect a trademark area in each frame of the video.
  • For example, the object detection method can use Faster R-CNN (Faster Region-based CNN detector), SSD (Single Shot Detector), YOLO ("You Only Look Once" detector), etc., but is not limited to these methods.
  • In some embodiments, step S408 may further include: inputting the image of the trademark region into a pre-trained depth model to extract an embedded feature vector, and comparing the embedded feature vector with the feature vectors of trademark images in a database to obtain the brand type of the trademark (such as JD.com, Apple, Haier, etc.), so as to identify at least one shot showing the product brand trademark.
  • Note that the at least one shot may include multiple shots. For example, if the database stores feature vectors of N trademark images, the extracted embedded feature vector is compared with the feature vectors of these N trademark images to obtain the brand type of the trademark.
  • Alternatively, step S408 may include: recognizing the text in each frame of the video using a deep-learning-based OCR method; performing word segmentation on the text, matching the processed text with the brand text in the database, and retaining the text related to the product brand, so as to identify at least one shot showing the product brand text.
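  • A sketch of the trademark-matching part of this step, assuming the embedded feature vectors of the detected region and of the database's trademark images have already been extracted; the acceptance threshold is an assumption:

```python
import numpy as np

def identify_brand(region_embedding, brand_db, min_sim=0.8):
    """Compare a detected trademark region's embedding against a database of
    trademark-image feature vectors and return the best-matching brand.

    brand_db: dict mapping brand name -> feature vector. min_sim is an
    assumed acceptance threshold; returns None when nothing matches."""
    q = region_embedding / np.linalg.norm(region_embedding)
    best_brand, best_sim = None, min_sim
    for brand, vec in brand_db.items():
        sim = float(q @ (vec / np.linalg.norm(vec)))  # cosine similarity
        if sim > best_sim:
            best_brand, best_sim = brand, sim
    return best_brand
```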
  • step S410 a set of shots is selected from a plurality of shots, so that the total importance score of the selected set of shots is the largest when the constraint condition of the total length of the video summary is satisfied.
  • In some embodiments, a group of shots needs to be selected and stitched together to obtain the final summary video file. This can be formulated as a 0/1 knapsack problem. Introduce selection variables {su_t | t = 1, ..., T}, where su_t ∈ {0, 1} indicates whether shot S_t is selected; for example, su_t = 1 means the shot is selected, and su_t = 0 means it is not.
  • The objective is to maximize Σ_{t=1}^{T} su_t · sv_t subject to Σ_{t=1}^{T} su_t · sl_t ≤ ST, where sv_t is the importance score of the shot, sl_t is the length of the shot, su_t indicates whether the shot is selected, and ST is the maximum duration of the summary video.
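  • This 0/1 knapsack problem can be solved exactly by dynamic programming over the duration budget; a sketch assuming integer shot lengths (e.g., frame counts):

```python
def select_shots(scores, lengths, max_len):
    """0/1 knapsack: choose su_t in {0, 1} maximizing sum(su_t * sv_t)
    subject to sum(su_t * sl_t) <= ST (here max_len). Shot lengths are
    assumed to be non-negative integers. Returns selected shot indices."""
    best = [(0.0, [])] * (max_len + 1)   # best[l] = (score, indices) within budget l
    for t in range(len(scores)):
        new = best[:]
        for l in range(lengths[t], max_len + 1):
            cand = best[l - lengths[t]][0] + scores[t]
            if cand > new[l][0]:
                new[l] = (cand, best[l - lengths[t]][1] + [t])
        best = new
    return best[max_len][1]
```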
  • In some embodiments, step S410 may include: selecting at least one main shot from the at least one shot showing key features, selecting at least one auxiliary shot from the remaining shots of the multiple shots other than the at least one main shot, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots.
  • In some embodiments, the step of selecting at least one main shot from the at least one shot showing key features may include: if a shot selected from the at least one shot showing key features is among the frontmost N_g shots or the rearmost N_g shots of the video, determining the shot among the frontmost N_g shots or the rearmost N_g shots to be the at least one main shot, where N_g is a positive integer.
  • For example, if shot S_t is identified as a shot displaying the advertised product brand, and it is among the first N_g or last N_g shots of the shot set S, that is, t ≤ N_g or t > K - N_g, where K is the total number of shots, then this shot S_t is selected as an advertisement brand shot.
  • For example, the value of N_g is 1 or 2. Because a basic purpose of advertising is to let the advertising audience know the product brand, the product brand can be displayed and emphasized in the summary video.
  • In some embodiments, the step of selecting at least one auxiliary shot from the remaining shots and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots may include: selecting at least one auxiliary shot from the remaining shots other than the at least one main shot, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots, so that the selected group of shots has the largest total importance score while the constraint on the total duration of the video summary is satisfied.
  • For example, let S_pre be the set of brand advertisement shots selected above. In the remaining shot set S \ S_pre (that is, the shots left after excluding S_pre), the optimization problem is solved using a dynamic programming method to select the auxiliary shots, subject to the remaining duration constraint.
  • step S412 the selected group of shots is stitched into a video summary and the video summary is output.
  • the step of stitching the selected group of shots into a video summary may include: stitching at least one main shot and at least one auxiliary shot into a video summary in chronological order. For example, at least one main shot and at least one auxiliary shot may be sorted in time, and finally stitched into an advertisement video summary.
  • In some embodiments, the shots showing key features may not be among the frontmost N_g or rearmost N_g shots of the video, but may instead be shots in the middle portion of the video.
  • In this case, one or some of the shots showing key features can be selected as main shots, and auxiliary shots are then selected from the remaining shots.
  • In some embodiments, the main shots are placed at the front or the rear of the video summary, and the auxiliary shots are arranged in chronological order, so that the main shots and the auxiliary shots are stitched into a video summary, as sketched below.
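  • Pulling the pieces together, the sketch below treats brand shots among the frontmost or rearmost N_g shots as main shots, fills the remaining duration budget with auxiliary shots via the select_shots helper from the earlier sketch, and orders everything chronologically; the names here are illustrative assumptions:

```python
def build_summary(shots, brand_idx, scores, lengths, max_len, n_g=1):
    """shots: list of shot clips in time order; brand_idx: indices of shots
    identified as showing the product brand. Main shots are brand shots among
    the first/last n_g shots (t <= N_g or t > K - N_g in the text's notation);
    auxiliary shots fill the remaining budget; output is chronological."""
    K = len(shots)
    main = [t for t in brand_idx if t < n_g or t >= K - n_g]
    budget = max(0, max_len - sum(lengths[t] for t in main))
    rest = [t for t in range(K) if t not in main]   # the set S \ S_pre
    aux = select_shots([scores[t] for t in rest],
                       [lengths[t] for t in rest], budget)
    chosen = sorted(main + [rest[i] for i in aux])  # chronological order
    return [shots[t] for t in chosen]
```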
  • In the method for generating a video summary of these embodiments, after a video is received and divided into multiple shots, the importance score of each shot is calculated; a shot with a larger importance score is a more important shot. At least one shot showing key features is identified, at least one main shot is selected from the at least one shot showing key features, and at least one auxiliary shot is selected from the remaining shots. The at least one main shot and the at least one auxiliary shot are taken as the selected group of shots, so that the total importance score of the group of shots is the largest while the constraint on the total duration of the video summary is satisfied.
  • the set of shots is stitched into a video summary and the video summary is output. In this way, the obtained video summary contains key shots, such as the key shots used to introduce product brands or product names in advertising videos, so as to highlight the important information of the video as much as possible.
  • the method of some embodiments of the present disclosure focuses on retaining key segments that introduce product brands and product characteristics in short video advertisements, and guarantees a certain degree of continuity and excitement of the video content after the summary.
  • One purpose of advertising is to show the appearance of the product to the advertising audience, to build an impression of the product in their minds, so the shots that highlight the product can be identified in the advertising video and output to the video summary.
  • the main image of the product generally contains the overall appearance of the product.
  • If the main picture of the product promoted by the advertising video can be obtained, the shots that mainly display the product can be identified from the similarity between each video shot and the product's main image, and the shot importance scores can be corrected accordingly.
  • FIG. 5 is a flowchart illustrating a method of correcting an importance score of a lens according to some embodiments of the present disclosure.
  • the process shown in FIG. 5 is a specific implementation manner of step S406 in FIG. 4.
  • the specific process of step S406 in FIG. 4 is described in detail below with reference to FIG. 5.
  • the process of correcting the importance score of the lens may include steps S502 to S508.
  • step S502 a feature vector of the advertised product picture is calculated.
  • For example, a classification model based on deep learning, such as a Very Deep Convolutional Network (VGG), a Google Inception convolutional network (Inception), or a Residual Convolutional Network (ResNet), may be used to calculate the feature vector of the advertised product picture.
  • In step S504, the multi-frame images of each shot are sampled to obtain sample frames, and the feature vectors of the sample frames of each shot are calculated.
  • For example, one frame is selected every several frames (for example, every 5 frames) of the video images in each shot S_t, and the classification model of step S502 is used to calculate the embedded feature vectors of these images, yielding a feature vector set {X_ti | i = 1, ..., N_t}, where N_t represents the number of images sampled from shot S_t.
  • step S506 the similarity between each shot and the product picture is calculated according to the feature vector of the product picture and the feature vector of the sampling frame of each shot.
  • In step S508, the importance score of the corresponding shot is corrected according to the similarity and a preset similarity threshold. For example, the importance scores of some important shots may be corrected, or the importance score of every shot may be corrected.
  • In some embodiments, a correction formula may be used to modify the shot importance score sv_t according to the similarity, where tsm is a similarity threshold; for example, the similarity threshold may take a value of 0.5 to 0.6.
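  • As a sketch of steps S506 and S508 combined, the code below takes the maximum cosine similarity between the product-picture feature and a shot's sampled-frame features, then boosts the score when the similarity exceeds the threshold tsm; this particular correction rule is an illustrative assumption rather than the formula of the text:

```python
import numpy as np

def corrected_score(sv_t, frame_feats, product_feat, tsm=0.55, boost=1.0):
    """Correct the shot importance score sv_t using shot-to-product similarity.

    frame_feats: (N_t, D) array of the shot's sampled-frame feature vectors
    X_ti; product_feat: (D,) feature vector of the advertised product picture.
    The additive boost above the threshold tsm is an assumed rule."""
    p = product_feat / np.linalg.norm(product_feat)
    F = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = float((F @ p).max())   # similarity between the shot and the picture
    return sv_t + boost * sim if sim > tsm else sv_t
```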
  • Thus, a method of correcting the importance score of a shot is provided.
  • In this way, the importance of shots that focus on the product can be enhanced, thereby enhancing the ability of the video summary to display the product.
  • FIG. 6 is a structural diagram illustrating a system for generating a video digest according to some embodiments of the present disclosure. As shown in FIG. 6, the system may include a video segmentation unit 602, a calculation unit 604, a selection unit 606, and a stitching unit 608.
  • the video segmentation unit 602 is configured to receive a video, and segment the video into multiple shots according to a change in the video scene of the video. Each shot of the plurality of shots is a continuous video scene.
  • the calculation unit 604 is configured to calculate an importance score of each shot.
  • the selecting unit 606 is configured to select a group of shots from the plurality of shots, so that the total importance score of the selected set of shots is the largest when the constraint condition of the total length of the video summary is satisfied.
  • the stitching unit 608 is configured to stitch the selected group of shots into a video summary and output the video summary.
  • the video segmentation unit receives the video and divides the video into multiple shots according to changes in the video scene; the calculation unit calculates the importance score of each shot; and the selection unit selects from among the multiple shots Selecting a group of shots so that the total importance score of the selected group of shots is the largest under the constraint condition of the total length of the video summary; and the stitching unit stitches the selected group of shots into a video summary and outputs the Video summary.
  • This system can make the video summary contain some more important shots or clips.
  • In some embodiments, the calculation unit 604 may be configured to extract a feature vector from each shot using a three-dimensional convolutional network to obtain a feature vector sequence of the shot set composed of the multiple shots, and to input the feature vector sequence into a pre-trained shot importance score calculation network to calculate the importance score of each shot.
  • FIG. 7 is a structural diagram illustrating a system for generating a video digest according to other embodiments of the present disclosure. As shown in FIG. 7, the system may include a video segmentation unit 602, a calculation unit 604, a selection unit 606, and a stitching unit 608.
  • the system may further include a training unit 714.
  • The training unit 714 is configured to train the shot importance score calculation network using a reinforcement learning method.
  • the key elements of this reinforcement learning method include: action and value reward functions.
  • the value reward function includes: a diversity indicator and a representative indicator.
  • the system may further include an identification unit 710.
  • the identification unit 710 is configured to identify at least one shot that exhibits a key feature among the plurality of shots.
  • the key feature may include at least one of a product brand trademark and a product brand text.
  • In some embodiments, the recognition unit 710 may be configured to: detect a trademark region in each frame of the video using a deep-learning-based object detection method; and input the image of the trademark region into a pre-trained depth model to extract an embedded feature vector, and compare the embedded feature vector with the feature vectors of trademark images in a database to obtain the brand type of the trademark, thereby identifying at least one shot showing the product brand trademark.
  • Alternatively, the recognition unit 710 may be configured to: recognize the text in each frame of the video using a deep-learning-based optical character recognition method; and perform word segmentation on the text, match the processed text with the brand text in the database, and retain the text related to the product brand, thereby identifying at least one shot showing the product brand text.
  • In some embodiments, the selection unit 606 may be configured to select at least one main shot from the at least one shot showing key features, select at least one auxiliary shot from the remaining shots of the multiple shots other than the at least one main shot, and take the at least one main shot and the at least one auxiliary shot as the selected group of shots.
  • In some embodiments, the selection unit 606 may be configured to: if a shot selected from the at least one shot showing a key feature is among the frontmost N_g shots or the rearmost N_g shots of the video, determine the shot among the frontmost N_g shots or the rearmost N_g shots to be the at least one main shot, where N_g is a positive integer; and select at least one auxiliary shot from the remaining shots of the multiple shots other than the at least one main shot, and take the at least one main shot and the at least one auxiliary shot as the selected group of shots, so that the selected group of shots has the largest total importance score while the constraint on the total duration of the video summary is satisfied.
  • the stitching unit 608 may be configured to stitch the at least one main shot and the at least one auxiliary shot into a video summary in chronological order.
  • the system may further include a correction unit 712.
  • the correction unit 712 is configured to calculate the similarity between each shot and the advertised product picture, and use the similarity to correct the importance score of the corresponding shot.
  • the correction unit 712 may be configured to: calculate a feature vector of the advertised product picture; sample multiple frames of each shot to obtain a sampling frame, and calculate a feature vector of the sampling frame of each shot ; Calculate the similarity between each shot and the product picture according to the feature vector of the product picture and the feature vector of the sampling frame of each shot; and the importance score of the corresponding shot according to the similarity and a preset similarity threshold Make corrections.
  • FIG. 8 is a structural diagram illustrating a system for generating a video digest according to other embodiments of the present disclosure.
  • As shown in FIG. 8, the system includes a memory 810 and a processor 820, wherein:
  • the memory 810 may be a magnetic disk, a flash memory, or any other non-volatile storage medium.
  • the memory is configured to store instructions in the embodiment corresponding to at least one of FIG. 1 to FIG. 5.
  • the processor 820 is coupled to the memory 810 and may be implemented as one or more integrated circuits, such as a microprocessor or a microcontroller.
  • the processor 820 is configured to execute instructions stored in the memory, so that the video summary includes some more important shots or clips, or contains some key shots or clips.
  • As shown in FIG. 9, the system 900 includes a memory 910 and a processor 920.
  • The processor 920 is coupled to the memory 910 through a bus 930.
  • the system 900 may also be connected to an external storage device 950 through a storage interface 940 to call external data, and may also be connected to a network or another computer system (not shown) through a network interface 960, which is not described in detail here.
  • In this system, data and instructions are stored in the memory and processed by the processor, so that the video summary includes some of the more important shots or segments, or contains some key shots or segments.
  • In some embodiments, the present disclosure also provides a computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method in at least one of the embodiments corresponding to FIG. 1 to FIG. 5.
  • It should be understood that the embodiments of the present disclosure may be provided as a method, an apparatus, or a computer program product. Therefore, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • the methods and systems of the present disclosure may be implemented in many ways.
  • the methods and systems of the present disclosure may be implemented by software, hardware, firmware or any combination of software, hardware, firmware.
  • the above order of the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated.
  • the present disclosure may also be implemented as programs recorded in a recording medium, which programs include machine-readable instructions for implementing the method according to the present disclosure.
  • the present disclosure also covers a recording medium storing a program for executing a method according to the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computer Security & Cryptography (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present invention relates to the technical field of videos, and provides a method and system for generating a video abstract. The method may comprise: receiving a video and splitting the video into multiple shots according to the video scene changes of the video, wherein each of the multiple shots is a video scene having continuous content; calculating the importance score of each shot; selecting a group of shots from the multiple shots, so that the total importance score of the selected group of shots is maximal when the constraint condition on the total duration of the video abstract is satisfied; and splicing the selected group of shots into the video abstract and outputting the video abstract. The present invention may make the video abstract comprise some important shots or segments.

Description

Method and system for generating a video summary
Cross-reference to related applications
This application is based on, and claims priority to, CN application No. 201810874321.2 filed on August 3, 2018; the disclosure of that CN application is incorporated into this application in its entirety.
Technical field
The present disclosure relates to the field of video technology, and in particular, to a method and system for generating a video abstract.
Background
A video summary selects key frames or key segments from a longer video and stitches them into a shorter video, so that the viewer can understand the content of the original video, or enjoy its highlights, in a short time. Video summarization has a wide range of application scenarios, including personal video editing, TV and movie plot introduction, video-assisted criminal investigation, and Internet short videos. In related-art methods for generating a video summary, because video evaluation is highly subjective, the generated video summary may lose some important segments or exciting content.
For example, related-art video summarization methods generally select key frames and key segments based on general-purpose criteria, and there are few methods that generate video summaries for specific scenes and applications. As a result, such methods perform poorly in some specific application scenarios, especially video advertising: a summarized advertisement video may lose the key segments that introduce the product brand and product characteristics.
Summary of the invention
A technical problem solved by the embodiments of the present disclosure is to provide a method for generating a video summary, so that the video summary can include some relatively important shots or segments.
According to an aspect of the embodiments of the present disclosure, there is provided a computer-implemented method for generating a video summary, including: receiving a video and dividing the video into multiple shots according to changes in the video scene of the video, wherein each of the multiple shots is a video scene with continuous content; calculating an importance score of each shot; selecting a group of shots from the multiple shots such that the total importance score of the selected group of shots is the largest while a constraint on the total duration of the video summary is satisfied; and stitching the selected group of shots into a video summary and outputting the video summary.
In some embodiments, the step of calculating the importance score of each shot includes: extracting a feature vector from each shot using a three-dimensional convolutional network to obtain a feature vector sequence of the shot set composed of the multiple shots; and inputting the feature vector sequence into a pre-trained shot importance score calculation network to calculate the importance score of each shot.
In some embodiments, before the video is divided into multiple shots, the method further includes: training the shot importance score calculation network using a reinforcement learning method, wherein the key elements of the reinforcement learning method include an action and a value reward function, and the value reward function includes a diversity indicator and a representativeness indicator.
In some embodiments, before the group of shots is selected from the multiple shots, the method further includes: identifying, among the multiple shots, at least one shot that shows a key feature.
In some embodiments, the key feature includes at least one of a product brand trademark and product brand text.
In some embodiments, the step of identifying, among the multiple shots, at least one shot that shows a key feature includes: detecting a trademark region in each frame of the video using a deep-learning-based object detection method, inputting the image of the trademark region into a pre-trained depth model to extract an embedded feature vector, and comparing the embedded feature vector with the feature vectors of trademark images in a database to obtain the brand type of the trademark, thereby identifying at least one shot showing the product brand trademark; or recognizing the text in each frame of the video using a deep-learning-based optical character recognition method, performing word segmentation on the text, matching the processed text with the brand text in a database, and retaining the text related to the product brand, thereby identifying at least one shot showing the product brand text.
In some embodiments, the step of selecting a group of shots from the multiple shots includes: selecting at least one main shot from the at least one shot showing a key feature, selecting at least one auxiliary shot from the remaining shots of the multiple shots other than the at least one main shot, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots.
In some embodiments, the step of selecting at least one main shot from the at least one shot showing a key feature includes: if a shot selected from the at least one shot showing a key feature is among the frontmost Ng shots or the rearmost Ng shots of the video, determining the shot among the frontmost Ng shots or the rearmost Ng shots to be the at least one main shot, where Ng is a positive integer. The step of selecting at least one auxiliary shot from the remaining shots of the multiple shots and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots includes: selecting at least one auxiliary shot from the remaining shots of the multiple shots other than the at least one main shot, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots, so that the selected group of shots has the largest total importance score while the constraint on the total duration of the video summary is satisfied. The step of stitching the selected group of shots into a video summary includes: stitching the at least one main shot and the at least one auxiliary shot into the video summary in chronological order.
In some embodiments, before the at least one shot showing a key feature is identified among the multiple shots, the method further includes: calculating the similarity between each shot and a picture of the advertised product, and using the similarity to correct the importance score of the corresponding shot.
In some embodiments, the step of calculating the similarity between each shot and the advertised product picture and using the similarity to correct the importance score of the corresponding shot includes: calculating a feature vector of the advertised product picture; sampling the multi-frame images of each shot to obtain sample frames, and calculating the feature vectors of the sample frames of each shot; calculating the similarity between each shot and the product picture from the feature vector of the product picture and the feature vectors of the sample frames of the shot; and correcting the importance score of the corresponding shot according to the similarity and a preset similarity threshold.
According to another aspect of the embodiments of the present disclosure, a system for generating a video summary is provided, including: a video segmentation unit configured to receive a video and segment the video into a plurality of shots according to changes in its video scene, each shot of the plurality of shots being a video scene with continuous content; a calculation unit configured to calculate an importance score for each shot; a selection unit configured to select a group of shots from the plurality of shots such that the total importance score of the selected group of shots is maximized subject to the constraint on the total duration of the video summary; and a stitching unit configured to stitch the selected group of shots into a video summary and output the video summary.
In some embodiments, the calculation unit is configured to extract a feature vector from each shot using a three-dimensional convolutional network to obtain a feature vector sequence for the shot set composed of the plurality of shots, and to input the feature vector sequence into a pre-trained shot importance scoring network to calculate the importance score of each shot.
In some embodiments, the system further includes: a training unit configured to train the shot importance scoring network using reinforcement learning, where the key elements of the reinforcement learning method include actions and a value reward function, the value reward function including a diversity term and a representativeness term.
In some embodiments, the system further includes: a recognition unit configured to identify, among the plurality of shots, at least one shot showing a key feature.
In some embodiments, the key feature includes at least one of a product brand trademark and product brand text.
In some embodiments, the recognition unit is configured to: detect a trademark region in each frame of the video using a deep-learning-based object detection method, input the image of the trademark region into a pre-trained deep model to extract an embedded feature vector, and compare the embedded feature vector with feature vectors of trademark images in a database to obtain the brand type of the trademark, thereby identifying at least one shot showing the product brand trademark; or recognize the text in each frame of the video using a deep-learning-based optical character recognition method, perform word segmentation on the text, match the processed text with brand text in a database, and retain the text related to the product brand, thereby identifying at least one shot showing the product brand text.
In some embodiments, the selection unit is configured to select at least one main shot from the at least one shot showing a key feature, select at least one auxiliary shot from the shots remaining after the at least one main shot is excluded from the plurality of shots, and take the at least one main shot and the at least one auxiliary shot as the selected group of shots.
In some embodiments, the selection unit is configured to: if a shot selected from the at least one shot showing a key feature belongs to the first Ng shots or the last Ng shots of the video, determine the first Ng shots or the last Ng shots as the at least one main shot, Ng being a positive integer; and select at least one auxiliary shot from the shots remaining after the at least one main shot is excluded from the plurality of shots, taking the at least one main shot and the at least one auxiliary shot as the selected group of shots such that the total importance score of the selected group of shots is maximized subject to the constraint on the total duration of the video summary. The stitching unit is configured to stitch the at least one main shot and the at least one auxiliary shot into a video summary in chronological order.
In some embodiments, the system further includes: a correction unit configured to calculate the similarity between each shot and the picture of the advertised product, and to use the similarity to correct the importance score of the corresponding shot.
In some embodiments, the correction unit is configured to: calculate a feature vector of the picture of the advertised product; sample the frames of each shot to obtain sampled frames and calculate feature vectors of the sampled frames of each shot; calculate the similarity between each shot and the product picture from the feature vector of the product picture and the feature vectors of the sampled frames of the shot; and correct the importance score of the corresponding shot according to the similarity and a preset similarity threshold.
According to another aspect of the embodiments of the present disclosure, a system for generating a video summary is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute the method described above based on instructions stored in the memory.
According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method described above.
In the above method, a video is received and segmented into a plurality of shots, the importance score of each shot is calculated, and in the process of selecting a group of shots, the group whose total importance score is maximal subject to the constraint on the total duration of the video summary is selected, stitched into a video summary, and output. The method therefore allows the video summary to include the more important shots or segments.
Other features and advantages of the present disclosure will become apparent from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which form a part of the specification, describe embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart illustrating a method for generating a video summary according to some embodiments of the present disclosure;
FIG. 2 is a flowchart illustrating a method of calculating the importance score of each shot according to some embodiments of the present disclosure;
FIG. 3 is a flowchart illustrating a method of calculating the importance score of each shot according to other embodiments of the present disclosure;
FIG. 4 is a flowchart illustrating a method for generating a video summary according to other embodiments of the present disclosure;
FIG. 5 is a flowchart illustrating a method of correcting the importance scores of shots according to some embodiments of the present disclosure;
FIG. 6 is a structural diagram illustrating a system for generating a video summary according to some embodiments of the present disclosure;
FIG. 7 is a structural diagram illustrating a system for generating a video summary according to other embodiments of the present disclosure;
FIG. 8 is a structural diagram illustrating a system for generating a video summary according to other embodiments of the present disclosure;
FIG. 9 is a structural diagram illustrating a system for generating a video summary according to other embodiments of the present disclosure.
DETAILED DESCRIPTION
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
It should also be understood that, for ease of description, the dimensions of the various parts shown in the drawings are not drawn to scale.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be regarded as part of the specification.
In all examples shown and discussed herein, any specific value should be interpreted as merely illustrative rather than limiting; other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item is defined in one drawing, it need not be discussed further in subsequent drawings.
FIG. 1 is a flowchart illustrating a method for generating a video summary according to some embodiments of the present disclosure. FIG. 2 is a flowchart illustrating a method of calculating the importance score of each shot according to some embodiments of the present disclosure. FIG. 3 is a flowchart illustrating a method of calculating the importance score of each shot according to other embodiments of the present disclosure. A computer-implemented method for generating a video summary according to some embodiments of the present disclosure is described in detail below with reference to FIG. 1 to FIG. 3. As shown in FIG. 1, the method may include steps S102 to S108.
As shown in FIG. 1, in step S102, a video is received and segmented into a plurality of shots according to changes in its video scene, each shot of the plurality of shots being a video scene with continuous content.
For example, consider a video sequence V = {I_i | i = 1, ..., N}, where I_i is one frame of video image. According to changes in the video scene, the sequence is segmented into multiple shots S_t of varying length, which form the shot set S = {S_t | t = 1, ..., T}, where T > 1 and T is a positive integer. Each shot is a video scene with continuous content. Denoting the length of each shot (i.e., the number of video frames it contains) by sl_t, the set of all shot lengths is SL = {sl_t | t = 1, ..., T}.
In some embodiments, a KTS (Kernel Temporal Segmentation) method may be used to segment the video into multiple shots. This method segments well and runs fast. The present disclosure is not limited to the KTS method, however; other shot segmentation methods may also be used.
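For illustration only, the following sketch shows a much simpler scene-change segmentation than KTS, assuming OpenCV (cv2) is available: shot boundaries are placed wherever the color-histogram difference between consecutive frames exceeds a threshold. The function name and threshold value are illustrative, not taken from this disclosure.

```python
# Simplified stand-in for KTS: split a video into shots at large
# frame-to-frame HSV color-histogram changes.
import cv2

def segment_into_shots(video_path, threshold=0.4):
    cap = cv2.VideoCapture(video_path)
    shots, start, prev_hist, idx = [], 0, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Bhattacharyya distance in [0, 1]; large means scene change.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                shots.append((start, idx - 1))  # close shot S_t
                start = idx
        prev_hist = hist
        idx += 1
    cap.release()
    if idx > 0:
        shots.append((start, idx - 1))
    return shots  # list of (first_frame, last_frame); sl_t = last - first + 1
```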
In step S104, the importance score of each shot is calculated.
In some embodiments, step S104 may include: extracting a feature vector from each shot using a three-dimensional convolutional network to obtain a feature vector sequence for the shot set composed of the plurality of shots; and inputting the feature vector sequence into a pre-trained shot importance scoring network to calculate the importance score of each shot.
For example, a block diagram of the model that calculates the importance scores (a process that may also be called importance scoring) is shown in FIG. 2. A three-dimensional convolutional network (C3D Net) extracts a feature vector from each video shot, yielding the feature vector sequence X = {X_t | t = 1, ..., T} for the shot set S = {S_t | t = 1, ..., T}, where X_t ∈ R^d1, R is the set of real numbers, and d1 denotes the dimension. The feature vector sequence X is then input into the trained shot importance scoring network, which computes for each shot an importance score (or importance probability value) sv_t ∈ [0, 1], giving the shot importance sequence SV = {sv_t | t = 1, ..., T}. The two sub-networks used to compute the importance scores are described below.
(1) Video shot feature extraction network
A video shot is a sequence of images and can be represented by a three-dimensional matrix. A three-dimensional convolutional network (C3D Net) can process the shot and extract a one-dimensional feature vector; that is, the three-dimensional convolutional network serves as the video shot feature extraction network. For example, an Inflated 3D convolutional network (I3D) may be used to process the shots.
For example, Kinetics-600 is a video classification dataset that covers 600 categories of human activity with more than 500,000 ten-second video clips. The I3D network is first pre-trained on the Kinetics-600 dataset; the I3D network then processes each video shot S_t, and the output of the network's last pooling layer is taken as the feature vector X_t, so that the shot set S = {S_t | t = 1, ..., T} is converted into the feature vector sequence X = {X_t | t = 1, ..., T}. Because the pre-trained I3D network has strong video classification ability, the output of its last pooling layer is a feature embedding that captures the essential characteristics of the video content.
Embodiments of the present disclosure are not limited to the I3D network; other types of three-dimensional convolutional networks may also be used to extract features from the video shots.
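For illustration, per-shot feature extraction of this kind might be sketched as follows, assuming PyTorch and a recent torchvision; torchvision's pre-trained r3d_18 stands in here for the I3D network (it is not the network described above), and input normalization details are omitted.

```python
# Extract one embedding X_t per shot from a 3D CNN backbone by
# dropping the classification head and keeping the pooled output.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

backbone = r3d_18(weights="DEFAULT")
backbone.fc = nn.Identity()  # keep the output of the last pooling layer
backbone.eval()

@torch.no_grad()
def shot_feature(shot_frames):
    # shot_frames: float tensor (T_frames, 3, H, W), values in [0, 1]
    clip = shot_frames.permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, T, H, W)
    return backbone(clip).squeeze(0)                     # X_t, here d1 = 512
```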
(2) Shot importance scoring network
The shot importance scoring network may be a temporal network, for example a recurrent neural network (RNN). It takes the chronologically ordered feature vector sequence X = {X_t | t = 1, ..., T} as input and outputs the shot importance score sequence SV = {sv_t | t = 1, ..., T}. For example, a bidirectional LSTM (Long Short-Term Memory) network may implement this network, as shown in FIG. 3.
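A minimal sketch of such a bidirectional LSTM scoring network, assuming PyTorch; the class name and layer sizes are illustrative, not taken from this disclosure.

```python
# Map a feature sequence X (one vector per shot) to per-shot scores
# sv_t in [0, 1] with a bidirectional LSTM followed by a sigmoid head.
import torch
import torch.nn as nn

class ShotScorer(nn.Module):
    def __init__(self, d1=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(d1, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, x):                # x: (1, T, d1) feature sequence X
        h, _ = self.lstm(x)              # (1, T, 2 * hidden)
        return self.head(h).squeeze(-1)  # (1, T) scores sv_t in [0, 1]
```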
In some embodiments, before the video is segmented into multiple shots, the method may further include: training the shot importance scoring network using reinforcement learning. The key elements of the reinforcement learning method include actions and a value reward function, the value reward function including a diversity term and a representativeness term. Training the model with reinforcement learning requires no video annotation; it is an unsupervised learning method.
The basic idea of reinforcement learning is to take several actions at random in a given state of the system, compute the value produced by each action, and optimize the system by rewarding high-value actions and penalizing low-value ones, so that the system tends to choose higher-value actions. Reinforcement learning therefore has two key elements: actions and a value reward function.
For example, the actions related to shot selection are defined as

$$Y = \{\, y_i \mid a_{y_i} = 1,\ i = 1, \ldots, |Y| \,\},$$

which indicates that the shot with time index y_i is selected; Y thus represents a shot selection action, namely the set of time indices of the selected shots, and |Y| denotes the number of elements of that set. The network outputs an importance probability value p_t = sv_t for each video shot, and whether a shot is selected is sampled from a Bernoulli distribution, i.e., a_t ~ Bernoulli(p_t), written π_θ(a_t | p_t), where θ denotes the parameters of the bidirectional LSTM model described above. The probability of occurrence of the shot selection action Y is therefore

$$p_\theta(a_{1:T}) = \prod_{t=1}^{T} \pi_\theta(a_t \mid p_t).$$
The value reward function R(S) has two terms, diversity R_div and representativeness R_rep, defined respectively as

$$R_{div} = \frac{1}{|Y|(|Y| - 1)} \sum_{t \in Y} \sum_{\substack{t' \in Y \\ t' \neq t}} d(X_t, X_{t'}),$$

$$R_{rep} = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \min_{t' \in Y} \lVert X_t - X_{t'} \rVert_2 \right),$$

where

$$d(X_t, X_{t'}) = 1 - \frac{X_t^{\mathsf{T}} X_{t'}}{\lVert X_t \rVert_2\, \lVert X_{t'} \rVert_2},$$

$$R(S) = R_{div} + R_{rep}. \qquad (4)$$
Here, ||X_t||_2 denotes the length of the feature vector X_t, obtained by taking the square root of the sum of the squares of its elements; ||X_{t'}||_2 likewise denotes the length of the feature vector X_{t'}; and X_t^T denotes the transpose of the feature vector X_t.
The diversity term measures the diversity of content between different shots, and the representativeness term measures the extent to which the selected video shots represent the original video.
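For illustration, the two reward terms and their sum R(S) might be computed as follows, assuming PyTorch, with feats holding the (T, d1) feature matrix X and picks the selected time indices Y; the names are illustrative.

```python
# Diversity-plus-representativeness reward for one sampled selection.
import torch

def reward(feats, picks):
    sel = feats[picks]                                   # (|Y|, d1)
    # Diversity: mean pairwise dissimilarity d(X_t, X_t') over selected shots.
    normed = sel / sel.norm(dim=1, keepdim=True)
    dissim = 1.0 - normed @ normed.t()                   # d(X_t, X_t'); diagonal is 0
    n = len(picks)
    r_div = dissim.sum() / (n * (n - 1)) if n > 1 else torch.tensor(0.0)
    # Representativeness: how close every shot is to its nearest selected shot.
    dist = torch.cdist(feats, sel)                       # (T, |Y|) L2 distances
    r_rep = torch.exp(-dist.min(dim=1).values.mean())
    return r_div + r_rep                                 # R(S) = R_div + R_rep
```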
The goal of reinforcement learning is to maximize the expectation of the reward function R(S) over all possible actions, described mathematically as

$$J(\theta) = \mathbb{E}_{p_\theta(a_{1:T})}\big[ R(S) \big],$$

where a_{1:T} denotes the actions taken, i.e., which shots are selected and which are not, and p_θ(a_{1:T}) denotes the probability that the actions a_{1:T} occur.
Since the probability of occurrence of the shot selection action Y is

$$p_\theta(a_{1:T}) = \prod_{t=1}^{T} \pi_\theta(a_t \mid p_t),$$

the gradient of the objective function can be expressed as

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(a_{1:T})}\left[ R(S) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid p_t) \right].$$

By sampling shot selection actions, this gradient expectation can be approximated as

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(S_n) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid p_t);$$

that is, in practice the expectation is approximated by sampling a number of actions, where N is the number of sampled actions.
Based on the reinforcement learning method above, the aforementioned bidirectional LSTM network is trained using, for example, a large number of advertisement videos from JD.com (京东商城), yielding the trained shot importance scoring network as the video shot importance scoring model.
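A minimal sketch of one REINFORCE-style update for the scoring network, assuming PyTorch and the ShotScorer and reward sketches above; the episode count N and the optimizer settings are illustrative.

```python
# One policy-gradient update: sample N selection actions from the
# Bernoulli policy, score each with R(S), and ascend E[R * log-prob].
import torch

def train_step(scorer, feats, optimizer, episodes=5):
    # feats: (T, d1) feature sequence X of one training video
    probs = scorer(feats.unsqueeze(0)).squeeze(0)        # p_t = sv_t
    dist = torch.distributions.Bernoulli(probs)
    loss = 0.0
    for _ in range(episodes):                            # N sampled actions
        actions = dist.sample()                          # a_t ~ Bernoulli(p_t)
        picks = actions.nonzero(as_tuple=True)[0].tolist()
        if len(picks) < 2:
            continue
        r = reward(feats, picks)                         # R(S) for this episode
        # REINFORCE: maximize E[R * sum_t log pi_theta(a_t | p_t)]
        loss = loss - r.detach() * dist.log_prob(actions).sum() / episodes
    optimizer.zero_grad()
    if isinstance(loss, torch.Tensor):
        loss.backward()
        optimizer.step()
```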
Returning to FIG. 1, in step S106, a group of shots is selected from the plurality of shots such that the total importance score of the selected group is maximized subject to the constraint on the total duration of the video summary.
For example, the constraint on the total duration of the video summary may be that it does not exceed the required total duration. A group of shots is selected from the plurality of shots whose total importance score is maximal subject to that constraint.
In step S108, the selected group of shots is stitched into a video summary and the video summary is output. For example, the group of shots may be stitched together in chronological order. The video summary may be output to a display for presentation, or, of course, to other devices.
This provides a method for generating a video summary according to some embodiments. In the method, a video is received and segmented into multiple shots, and the importance score of each shot is calculated, shots with larger importance scores being the more important ones. In selecting a group of shots, the group whose total importance score is maximal subject to the constraint on the total duration of the video summary is selected, stitched into a video summary, and output. The method thus allows the video summary to include the more important shots or segments.
In some embodiments, before step S106, the method may further include: identifying, among the plurality of shots, at least one shot showing a key feature. For example, the key feature may include at least one of a product brand trademark and product brand text.
In some embodiments, step S106 may include: selecting at least one main shot from the at least one shot showing a key feature, selecting at least one auxiliary shot from the shots remaining after the at least one main shot is excluded from the plurality of shots, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots.
In the method of the above embodiment, at least one shot showing a key feature is identified, at least one main shot is selected from those shots, and at least one auxiliary shot is selected from the remaining shots. The at least one main shot and the at least one auxiliary shot form the selected group of shots, whose total importance score is maximal subject to the constraint on the total duration of the video summary, and the group is stitched into a video summary. The resulting video summary therefore contains the key shots, such as those in an advertisement video that introduce the product brand or product name, so that the important information of the video is highlighted as much as possible.
In some embodiments, before the at least one shot showing a key feature is identified among the plurality of shots, the method may further include: calculating the similarity between each shot and a picture of the advertised product, and using the similarity to correct the importance score of the corresponding shot (i.e., the shot to which the similarity corresponds). This correction raises the importance of the shots that prominently display the product, strengthening the summary's ability to present the product.
FIG. 4 is a flowchart illustrating a method for generating a video summary according to other embodiments of the present disclosure. As shown in FIG. 4, the method may include steps S402 to S412.
In step S402, a video is received and segmented into a plurality of shots according to changes in its video scene, each shot being a video scene with continuous content. Step S402 is the same as or similar to step S102 and is not repeated here.
In step S404, the importance score of each shot is calculated. Step S404 is the same as or similar to step S104 and is not repeated here.
In step S406, the similarity between each shot and the picture of the advertised product is calculated, and the similarity is used to correct the importance score of the corresponding shot. The process of step S406 is described in detail later with reference to FIG. 5.
In step S408, at least one shot showing a key feature is identified among the plurality of shots. For example, the key feature may include at least one of a product brand trademark and product brand text.
For example, advertisement videos generally contain shots at the beginning or end that display the product brand, in order to deepen the audience's impression of the brand and promote it; such advertisement brand shots can therefore be identified, extracted, and presented in the summarized advertisement video. The two sources of information used in embodiments of the present disclosure to identify advertisement brand shots are the product brand trademark and the product brand text, for example the JD.com mascot and the JD.com wordmark.
In some embodiments, advertisement brand shot recognition may comprise two steps, brand trademark or text recognition and brand shot determination, as follows: (1) identify the brand trademark using object detection techniques, or identify the brand text using OCR (Optical Character Recognition) techniques; (2) brand shot determination: for a shot S_t of length sl_t (i.e., number of video frames), if the brand trademark or text lies in the central region of the image and appears in N_c consecutive frames, the shot is determined to be an advertisement brand shot. For example, N_c ≥ sl_t / 2.
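For illustration, the brand shot determination rule above might be sketched as follows; boxes holds one per-frame detection (or None) for the shot, and the "central region" bounds are illustrative, not taken from this disclosure.

```python
# A shot is an advertisement brand shot if the detected logo/text box
# stays in the central region for a long enough run of consecutive frames.
def is_brand_shot(boxes, frame_w, frame_h, min_run_ratio=0.5):
    # boxes: per-frame detection, (x, y, w, h) or None, one entry per frame
    cx0, cx1 = frame_w * 0.25, frame_w * 0.75   # illustrative central region
    cy0, cy1 = frame_h * 0.25, frame_h * 0.75
    run = best = 0
    for box in boxes:
        centered = False
        if box is not None:
            x, y, w, h = box
            centered = cx0 <= x + w / 2 <= cx1 and cy0 <= y + h / 2 <= cy1
        run = run + 1 if centered else 0
        best = max(best, run)
    # N_c >= sl_t / 2 with the example ratio from the text
    return best >= len(boxes) * min_run_ratio
```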
In some embodiments, step S408 may include: detecting the trademark region in each frame of the video using a deep-learning-based object detection method. For example, the object detection method may use Faster-RCNN (Faster Region CNN detector), SSD (Single Shot Detector), YOLO (the "You Only Look Once" detector), and the like, but is not limited to these methods. Step S408 may further include: inputting the image of the trademark region into a pre-trained deep model to extract an embedded feature vector, and comparing the embedded feature vector with the feature vectors of trademark images in a database to obtain the brand type of the trademark (for example JD.com, Apple, or Haier), thereby identifying at least one shot showing the product brand trademark. The at least one shot may, for example, include multiple shots. For example, if the database stores feature vectors of N trademark images, the extracted embedded feature vector is compared with the feature vectors of those N trademark images to obtain the brand type of the trademark.
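For illustration, the comparison of the embedded feature vector against the database might be sketched as follows, assuming an embed() function (e.g., the pooled output of a pre-trained CNN) and a database mapping brand names to unit-length feature vectors; all names and the threshold are illustrative.

```python
# Match a cropped logo against stored brand embeddings by cosine similarity.
import numpy as np

def match_brand(logo_crop, brand_db, embed, min_sim=0.7):
    v = embed(logo_crop)
    v = v / np.linalg.norm(v)
    best_brand, best_sim = None, min_sim
    for brand, ref in brand_db.items():      # ref: unit feature vector
        sim = float(v @ ref)                 # cosine similarity
        if sim > best_sim:
            best_brand, best_sim = brand, sim
    return best_brand                        # None if no brand matches
```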
In other embodiments, step S408 may include: recognizing the text in each frame of the video using a deep-learning-based OCR method; and performing word segmentation on the text, matching the processed text with brand text in a database, and retaining the text related to the product brand, thereby identifying at least one shot showing the product brand text.
In step S410, a group of shots is selected from the plurality of shots such that the total importance score of the selected group is maximized subject to the constraint on the total duration of the video summary.
In embodiments of the present disclosure, generating the video summary requires selecting a group of shots and stitching them together into the final summary video file. Which shots are selected can be represented by the set SU = {su_t | t = 1, ..., T}, where su_t ∈ {0, 1} indicates whether the shot is selected: su_t = 1 means the shot is selected, and su_t = 0 means it is not.
For the shot set S = {S_t | t = 1, ..., T}, selecting a group of shots that maximizes the total shot importance score subject to the total duration constraint reduces to the following optimization problem:
$$\max_{SU}\ \sum_{t=1}^{T} su_t \cdot sv_t$$

subject to

$$\sum_{t=1}^{T} su_t \cdot sl_t \leq ST,$$

where sv_t is the importance score of a shot, sl_t is the length of the shot, su_t indicates whether the shot is selected, and ST is the maximum duration of the summary video. This optimization problem can be solved by dynamic programming.
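For illustration, the dynamic program for this 0/1 knapsack problem might be sketched as follows, with shot lengths sl_t as integer weights (frames or seconds), importance scores sv_t as values, and ST as the capacity; the function name is illustrative.

```python
# Standard 0/1-knapsack dynamic program over shots: maximize total
# importance subject to the summary-duration budget ST.
def select_shots(sv, sl, ST):
    # dp[c] = (best total score, chosen shot indices) within capacity c
    dp = [(0.0, [])] * (ST + 1)
    for t in range(len(sv)):
        for c in range(ST, sl[t] - 1, -1):   # backwards: each shot used once
            cand = dp[c - sl[t]][0] + sv[t]
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - sl[t]][1] + [t])
    return dp[ST][1]                         # su_t = 1 for returned indices
```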
In some embodiments, step S410 may include: selecting at least one main shot from the at least one shot showing a key feature, selecting at least one auxiliary shot from the shots remaining after the at least one main shot is excluded from the plurality of shots, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots.
In some embodiments, the step of selecting at least one main shot from the at least one shot showing a key feature may include: if a shot selected from the at least one shot showing a key feature belongs to the first N_g shots or the last N_g shots of the video, determining the first N_g shots or the last N_g shots as the at least one main shot, N_g being a positive integer, for example with a value of 1 to 2.
For example, if a shot S_t is identified as a shot displaying the advertised product's brand and belongs to the first N_g or the last N_g shots of the shot set S, i.e., t ≤ N_g or t > K - N_g, where K is the total number of shots, then S_t is a selected advertisement brand shot. For example, N_g takes a value of 1 to 2. Because a basic purpose of advertising is to make the audience aware of the product brand, the brand can be displayed and emphasized in the summary video.
In some embodiments, the step of selecting at least one auxiliary shot from the remaining shots of the plurality of shots and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots may include: selecting at least one auxiliary shot from the shots remaining after the at least one main shot is excluded from the plurality of shots, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots, such that the total importance score of the selected group is maximized subject to the constraint on the total duration of the video summary.
For example, with S_pre denoting the set of advertisement brand shots selected above, the optimization problem described earlier is solved by dynamic programming over the shot set S \ S_pre (i.e., the shots remaining after S_pre is excluded), selecting shots subject to the remaining duration constraint, as in the sketch below.
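For illustration, combining the pre-selected brand shots S_pre with auxiliary shots chosen under the remaining duration budget might look as follows, reusing the select_shots sketch above; the names are illustrative.

```python
# Reserve duration for the brand (main) shots, then fill the rest of
# the budget with auxiliary shots chosen by the knapsack DP.
def build_summary(sv, sl, ST, brand_ids):
    brand_set = set(brand_ids)
    budget = max(0, ST - sum(sl[t] for t in brand_set))   # remaining duration
    rest = [t for t in range(len(sv)) if t not in brand_set]
    aux = select_shots([sv[t] for t in rest], [sl[t] for t in rest], budget)
    chosen = sorted(brand_set | {rest[i] for i in aux})
    return chosen                                         # stitch in this order
```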
In step S412, the selected group of shots is stitched into a video summary and the video summary is output.
In some embodiments, the step of stitching the selected group of shots into a video summary may include: stitching the at least one main shot and the at least one auxiliary shot into a video summary in chronological order. For example, the at least one main shot and the at least one auxiliary shot may be ordered by time and finally stitched into the advertisement video summary.
In other embodiments, the shots showing a key feature may not be among the first N_g or the last N_g shots of the video but somewhere in its middle portion. In that case, one or more of the shots showing a key feature can be selected as the main shot(s), and the auxiliary shots then selected from the remaining shots. When the main and auxiliary shots are stitched into the video summary, the main shot is placed at the very beginning or the very end of the summary and the auxiliary shots are arranged in chronological order, and the main and auxiliary shots are thus stitched into the video summary.
This provides a method for generating a video summary according to other embodiments of the present disclosure. In the method, a video is received and segmented into multiple shots, and the importance score of each shot is calculated, shots with larger scores being the more important ones. At least one shot showing a key feature is identified; at least one main shot is selected from those shots and at least one auxiliary shot from the remaining shots. The main and auxiliary shots form the selected group, whose total importance score is maximal subject to the constraint on the total duration of the video summary; the group is stitched into a video summary and output. The resulting summary therefore contains the key shots, such as those introducing the product brand or product name in an advertisement video, so that the important information of the video is highlighted as much as possible.
The method of some embodiments of the present disclosure preferentially retains, in short video advertisements, the key segments that introduce the product brand and product characteristics, while ensuring that the summarized video content remains reasonably continuous and engaging.
One purpose of advertising is to show the audience the appearance of the product and build an impression of it in their minds, so the shots that prominently display the product can be identified in the advertisement video and output into the video summary. The main product image generally shows the product's overall appearance, and the similarity between a video shot and the main product image can be used to identify shots whose main content is the product display. If the main image of the product promoted by the advertisement video is available, the shot importance scores can be corrected accordingly.
FIG. 5 is a flowchart illustrating a method of correcting the importance scores of shots according to some embodiments of the present disclosure. The process shown in FIG. 5 is one specific implementation of step S406 in FIG. 4; the specific process of step S406 is described in detail below with reference to FIG. 5. As shown in FIG. 5, the process of correcting the importance scores may include steps S502 to S508.
In step S502, a feature vector of the picture of the advertised product is calculated.
For example, a deep-learning-based classification model (e.g., VGG (Very Deep Convolutional Network), Inception (Google Inception Convolutional Network), ResNet (Residual Convolutional Network), etc.) may be used to compute the embedded feature vector X_M ∈ R^d2 of the product picture (also called the main product image) I_M, where X_M is a d2-dimensional feature vector.
In step S504, the frames of each shot are sampled to obtain sampled frames, and feature vectors of the sampled frames of each shot are calculated.
For example, one frame may be selected for every several frames (e.g., every 5 frames) of the video images in each shot S_t, and the classification model of step S502 used to compute the embedded feature vectors of these images, yielding the feature vector set {X_ti | i = 1, ..., N_t}, where N_t denotes the number of images sampled from shot S_t.
In step S506, the similarity between each shot and the product picture is calculated from the feature vector of the product picture and the feature vectors of the sampled frames of the shot.
For example, for each shot S_t, the cosine similarity between each element of its feature vector set {X_ti | i = 1, ..., N_t} and the product picture's feature vector X_M is computed to obtain the similarity set {sm_ti | i = 1, ..., N_t}, and the median of the similarity set, sm_t = median{sm_ti | i = 1, ..., N_t}, is taken as the similarity between the shot and the product picture.
In step S508, the importance score of the corresponding shot is corrected according to the similarity and a preset similarity threshold. For example, the importance scores of certain important shots may be corrected, or the importance score of every shot may be corrected.
For example, the shot importance score sv_t may be corrected using the following formula, where tsm is the similarity threshold, which may take a value of, for example, 0.5 to 0.6:

$$sv_t \leftarrow \begin{cases} sv_t \cdot (1 + sm_t), & sm_t \geq tsm, \\ sv_t, & sm_t < tsm. \end{cases}$$
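For illustration, steps S502 to S508 might be sketched as follows, assuming NumPy and an embed() function taken from a classification backbone; the boost rule mirrors the piecewise correction above, and the function name, threshold, and sampling step are illustrative.

```python
# Boost the scores of shots whose sampled frames resemble the main
# product image: median cosine similarity per shot, then thresholding.
import numpy as np

def corrected_scores(sv, shots_frames, product_img, embed, tsm=0.55, step=5):
    x_m = embed(product_img)
    x_m = x_m / np.linalg.norm(x_m)                     # X_M, unit length
    sv = list(sv)
    for t, frames in enumerate(shots_frames):
        sims = []
        for frame in frames[::step]:                    # sample every `step` frames
            x = embed(frame)
            sims.append(float(x @ x_m) / np.linalg.norm(x))
        sm_t = float(np.median(sims)) if sims else 0.0  # median similarity sm_t
        if sm_t >= tsm:                                 # boost product-heavy shots
            sv[t] = sv[t] * (1.0 + sm_t)
    return sv
```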
This provides a method of correcting shot importance scores according to some embodiments. By computing the similarity between each shot and the product picture and correcting the shot's importance score according to that similarity, the importance of the shots that prominently display the product can be raised, strengthening the video summary's ability to present the product.
FIG. 6 is a structural diagram illustrating a system for generating a video summary according to some embodiments of the present disclosure. As shown in FIG. 6, the system may include a video segmentation unit 602, a calculation unit 604, a selection unit 606, and a stitching unit 608.
The video segmentation unit 602 is configured to receive a video and segment it into a plurality of shots according to changes in the video scene, each shot of the plurality of shots being a video scene with continuous content.
The calculation unit 604 is configured to calculate the importance score of each shot.
The selection unit 606 is configured to select a group of shots from the plurality of shots such that the total importance score of the selected group is maximized subject to the constraint on the total duration of the video summary.
The stitching unit 608 is configured to stitch the selected group of shots into a video summary and output the video summary.
In the system of this embodiment, the video segmentation unit receives a video and segments it into multiple shots according to changes in the video scene; the calculation unit calculates the importance score of each shot; the selection unit selects a group of shots from the plurality of shots such that the total importance score of the selected group is maximized subject to the constraint on the total duration of the video summary; and the stitching unit stitches the selected group into a video summary and outputs it. The system thus allows the video summary to include the more important shots or segments.
In some embodiments, the calculation unit 604 may be configured to extract a feature vector from each shot using a three-dimensional convolutional network to obtain the feature vector sequence of the shot set composed of the plurality of shots, and to input the feature vector sequence into a pre-trained shot importance scoring network to calculate the importance score of each shot.
FIG. 7 is a structural diagram illustrating a system for generating a video summary according to other embodiments of the present disclosure. As shown in FIG. 7, the system may include a video segmentation unit 602, a calculation unit 604, a selection unit 606, and a stitching unit 608.
In some embodiments, as shown in FIG. 7, the system may further include a training unit 714 configured to train the shot importance scoring network using reinforcement learning, where the key elements of the reinforcement learning method include actions and a value reward function, the value reward function including a diversity term and a representativeness term.
In some embodiments, as shown in FIG. 7, the system may further include a recognition unit 710 configured to identify, among the plurality of shots, at least one shot showing a key feature. For example, the key feature may include at least one of a product brand trademark and product brand text.
In some embodiments, the recognition unit 710 may be configured to: detect the trademark region in each frame of the video using a deep-learning-based object detection method; and input the image of the trademark region into a pre-trained deep model to extract an embedded feature vector, and compare the embedded feature vector with the feature vectors of trademark images in a database to obtain the brand type of the trademark, thereby identifying at least one shot showing the product brand trademark.
In other embodiments, the recognition unit 710 may be configured to: recognize the text in each frame of the video using a deep-learning-based optical character recognition method; and perform word segmentation on the text, match the processed text with brand text in a database, and retain the text related to the product brand, thereby identifying at least one shot showing the product brand text.
In some embodiments, the selection unit 606 may be configured to select at least one main shot from the at least one shot showing a key feature, select at least one auxiliary shot from the shots remaining after the at least one main shot is excluded from the plurality of shots, and take the at least one main shot and the at least one auxiliary shot as the selected group of shots.
In some embodiments, the selection unit 606 may be configured to: if a shot selected from the at least one shot showing a key feature belongs to the first N_g shots or the last N_g shots of the video, determine the first N_g shots or the last N_g shots as the at least one main shot, N_g being a positive integer; and select at least one auxiliary shot from the shots remaining after the at least one main shot is excluded from the plurality of shots, taking the at least one main shot and the at least one auxiliary shot as the selected group of shots such that the total importance score of the selected group is maximized subject to the constraint on the total duration of the video summary.
In some embodiments, the stitching unit 608 may be configured to stitch the at least one main shot and the at least one auxiliary shot into a video summary in chronological order.
In some embodiments, as shown in FIG. 7, the system may further include a correction unit 712 configured to calculate the similarity between each shot and the picture of the advertised product, and to use the similarity to correct the importance score of the corresponding shot.
In some embodiments, the correction unit 712 may be configured to: calculate a feature vector of the picture of the advertised product; sample the frames of each shot to obtain sampled frames and calculate feature vectors of the sampled frames of each shot; calculate the similarity between each shot and the product picture from the feature vector of the product picture and the feature vectors of the sampled frames of the shot; and correct the importance score of the corresponding shot according to the similarity and a preset similarity threshold.
FIG. 8 is a structural diagram illustrating a system for generating a video summary according to other embodiments of the present disclosure. The system includes a memory 810 and a processor 820, where:
The memory 810 may be a magnetic disk, a flash memory, or any other non-volatile storage medium, and is used to store the instructions of the embodiments corresponding to at least one of FIG. 1 to FIG. 5.
The processor 820 is coupled to the memory 810 and may be implemented as one or more integrated circuits, such as a microprocessor or a microcontroller. The processor 820 is used to execute the instructions stored in the memory, so that the video summary includes the more important shots or segments, or includes certain key shots or segments.
In some embodiments, as shown in FIG. 9, a system 900 includes a memory 910 and a processor 920, the processor 920 being coupled to the memory 910 via a bus 930. The system 900 may also be connected to an external storage device 950 through a storage interface 940 in order to access external data, and may be connected to a network or another computer system (not shown) through a network interface 960, which is not described in detail here.
In this embodiment, data instructions are stored in the memory and processed by the processor, so that the video summary includes the more important shots or segments, or includes certain key shots or segments.
In other embodiments, the present disclosure also provides a computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method in the embodiments corresponding to at least one of FIG. 1 to FIG. 5. As will be appreciated by those skilled in the art, embodiments of the present disclosure may be provided as a method, an apparatus, or a computer program product. The present disclosure may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
The present disclosure has thus been described in detail. To avoid obscuring its concept, some details known in the art have not been described. Based on the above description, those skilled in the art can fully understand how to implement the technical solutions disclosed herein.
The methods and systems of the present disclosure may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the method is for illustration only; the steps of the method of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the method according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail through examples, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. Those skilled in the art should understand that the above embodiments can be modified without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (22)

  1. A computer-implemented method for generating a video summary, comprising:
    receiving a video, and segmenting the video into a plurality of shots according to changes in the video scenes of the video, wherein each of the plurality of shots is a video scene with continuous content;
    calculating an importance score for each shot;
    selecting a group of shots from the plurality of shots such that the total importance score of the selected group of shots is maximized while a constraint on the total duration of the video summary is satisfied; and
    stitching the selected group of shots into a video summary and outputting the video summary.
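The selection step of claim 1 is a 0/1 knapsack problem: maximize the summed importance scores of the chosen shots under a total-duration budget. The claim does not name a solver, so the following is only a minimal Python sketch, assuming integer shot durations in seconds and a hypothetical helper name `select_shots`:

```python
def select_shots(scores, durations, budget):
    """Hedged sketch of the shot-selection step: classic 0/1-knapsack dynamic
    programming over shots, maximizing total importance score subject to a
    cap on total duration (durations and budget in whole seconds)."""
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d, s = durations[i - 1], scores[i - 1]
        for t in range(budget + 1):
            best[i][t] = best[i - 1][t]               # skip shot i-1
            if d <= t:
                best[i][t] = max(best[i][t], best[i - 1][t - d] + s)
    chosen, t = [], budget                             # backtrack the choices
    for i in range(n, 0, -1):
        if best[i][t] != best[i - 1][t]:               # shot i-1 was taken
            chosen.append(i - 1)
            t -= durations[i - 1]
    return sorted(chosen)                              # chronological order
```

For example, `select_shots([0.9, 0.2, 0.7], [10, 5, 8], 18)` returns `[0, 2]`: the two high-scoring shots fill the 18-second budget and the low-scoring shot is dropped.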
  2. The method according to claim 1, wherein calculating the importance score of each shot comprises:
    extracting a feature vector from each shot using a three-dimensional convolutional network, to obtain a feature vector sequence for the shot set composed of the plurality of shots; and
    inputting the feature vector sequence into a pre-trained shot importance score calculation network to calculate the importance score of each shot.
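Claim 2 fixes neither the 3D convolutional backbone nor the architecture of the score network. A minimal PyTorch sketch, assuming an off-the-shelf R3D-18 backbone as the 3D convolutional network and a bidirectional LSTM as a hypothetical score network (both assumptions, not the patent's stated models):

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Hypothetical stand-in for the claim's unspecified 3D convolutional network:
# an R3D-18 backbone with the classifier removed, so each clip of sampled
# frames yields a 512-dimensional shot feature (weight loading omitted here).
backbone = r3d_18()
backbone.fc = nn.Identity()
backbone.eval()

class ShotScorer(nn.Module):
    """Assumed architecture for the shot importance score calculation
    network: a bidirectional LSTM over the shot feature sequence with a
    per-shot sigmoid head producing scores in [0, 1]."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, feats):                 # feats: (1, num_shots, feat_dim)
        h, _ = self.rnn(feats)
        return self.head(h).squeeze(-1)       # (1, num_shots)

@torch.no_grad()
def score_shots(shot_clips, scorer):
    # shot_clips: one tensor per shot, each (3, 16, 112, 112) sampled frames
    feats = torch.stack([backbone(c.unsqueeze(0)).squeeze(0) for c in shot_clips])
    return scorer(feats.unsqueeze(0)).squeeze(0)   # one score per shot
```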
  3. The method according to claim 2, wherein, before the video is segmented into the plurality of shots, the method further comprises:
    training the shot importance score calculation network using a reinforcement learning method, wherein the key elements of the reinforcement learning method include an action and a value reward function, and the value reward function includes a diversity indicator and a representativeness indicator.
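Claim 3 names diversity and representativeness indicators without giving formulas. Below is a hedged sketch of one common formulation from the video-summarization reinforcement-learning literature (pairwise dissimilarity for diversity, distance to the nearest selected shot for representativeness); the exact expressions are an assumption, not the patent's:

```python
import torch

def value_reward(feats, picks):
    """Hedged sketch of a diversity + representativeness reward. feats is an
    (n, d) tensor of L2-normalized shot features; picks are the indices of
    the shots the policy chose (its "action")."""
    sel = feats[picks]                           # (k, d) selected shots
    k = sel.shape[0]
    # Diversity indicator: mean pairwise dissimilarity among selected shots.
    sim = sel @ sel.t()                          # cosine similarity matrix
    off_diag = sim.sum() - sim.diagonal().sum()
    r_div = 1.0 - off_diag / max(k * (k - 1), 1)
    # Representativeness indicator: selected shots should lie close to all
    # shots; reward decays with the mean distance to the nearest selection.
    dists = torch.cdist(feats, sel)              # (n, k)
    r_rep = torch.exp(-dists.min(dim=1).values.mean())
    return r_div + r_rep
```

During training, the network's sampled selection (the action) receives this reward, and a policy-gradient update reinforces selections that are both varied and representative of the whole video.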
  4. The method according to claim 1, wherein, before the group of shots is selected from the plurality of shots, the method further comprises:
    identifying, among the plurality of shots, at least one shot exhibiting a key feature.
  5. The method according to claim 4, wherein the key feature includes at least one of a product brand trademark and product brand text.
  6. The method according to claim 5, wherein identifying, among the plurality of shots, at least one shot exhibiting a key feature comprises:
    detecting a trademark region in each frame of the video using a deep-learning-based object detection method, inputting the image of the trademark region into a pre-trained deep model to extract an embedded feature vector, and comparing the embedded feature vector with feature vectors of trademark images in a database to obtain the brand type of the trademark, thereby identifying at least one shot exhibiting the product brand trademark; or
    recognizing the text in each frame of the video using a deep-learning-based optical character recognition method, performing word segmentation on the recognized text, matching the processed text with brand text in a database, and retaining the text related to the product brand, thereby identifying at least one shot exhibiting the product brand text.
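Claim 6 specifies neither the detector nor the embedding model, but its trademark branch reduces to a nearest-neighbor lookup over logo embeddings. A minimal sketch of that comparison step, assuming L2-normalized embeddings already extracted by some detector and deep model (the helper name, database shape, and 0.8 threshold are all hypothetical):

```python
import numpy as np

def match_brand(region_embedding, brand_db, threshold=0.8):
    """Hedged sketch of the trademark comparison step only. `region_embedding`
    is the embedded feature vector of a detected logo region; `brand_db` maps
    brand name -> L2-normalized reference embedding."""
    v = region_embedding / np.linalg.norm(region_embedding)
    best_brand, best_sim = None, threshold
    for brand, ref in brand_db.items():
        sim = float(v @ ref)                 # cosine similarity
        if sim > best_sim:
            best_brand, best_sim = brand, sim
    return best_brand                        # None if no match clears the bar
```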
  7. The method according to claim 4, wherein selecting a group of shots from the plurality of shots comprises:
    selecting at least one main shot from the at least one shot exhibiting a key feature, and selecting at least one auxiliary shot from the remaining shots of the plurality of shots other than the at least one main shot, the at least one main shot and the at least one auxiliary shot serving as the selected group of shots.
  8. The method according to claim 7, wherein:
    selecting at least one main shot from the at least one shot exhibiting a key feature comprises: if the shots selected from the at least one shot exhibiting a key feature are the first Ng shots or the last Ng shots of the video, determining the first Ng shots or the last Ng shots to be the at least one main shot, Ng being a positive integer;
    selecting at least one auxiliary shot from the remaining shots, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots, comprises: selecting at least one auxiliary shot from the remaining shots of the plurality of shots other than the at least one main shot, and taking the at least one main shot and the at least one auxiliary shot as the selected group of shots, such that the total importance score of the selected group of shots is maximized while the constraint on the total duration of the video summary is satisfied; and
    stitching the selected group of shots into a video summary comprises: stitching the at least one main shot and the at least one auxiliary shot into a video summary in chronological order.
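Putting claims 7 and 8 together: reserve key-feature shots that open or close the video as main shots, fill the leftover duration budget with auxiliary shots, and stitch in time order. A hedged sketch reusing the hypothetical `select_shots` knapsack helper sketched after claim 1:

```python
def assemble_summary(shots, key_idx, scores, durations, budget, ng=1):
    """Hedged sketch of claims 7-8. Key-feature shots falling within the
    first or last `ng` positions are reserved as main shots; the leftover
    duration budget is filled with auxiliary shots via `select_shots`;
    stitching order is chronological."""
    n = len(shots)
    main = [i for i in key_idx if i < ng or i >= n - ng]
    remaining = budget - sum(durations[i] for i in main)
    rest = [i for i in range(n) if i not in main]
    aux = [rest[j] for j in select_shots([scores[i] for i in rest],
                                         [durations[i] for i in rest],
                                         max(remaining, 0))]
    order = sorted(main + aux)               # chronological stitching order
    return [shots[i] for i in order]
```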
  9. The method according to claim 5, wherein, before the at least one shot exhibiting a key feature is identified among the plurality of shots, the method further comprises:
    calculating the similarity between each shot and a picture of the advertised product, and correcting the importance score of the corresponding shot using the similarity.
  10. The method according to claim 9, wherein calculating the similarity between each shot and the picture of the advertised product and correcting the importance score of the corresponding shot using the similarity comprises:
    calculating a feature vector of the picture of the advertised product;
    sampling the frames of each shot to obtain sampling frames, and calculating feature vectors of the sampling frames of each shot;
    calculating the similarity between each shot and the product picture according to the feature vector of the product picture and the feature vectors of the sampling frames of the shot; and
    correcting the importance score of the corresponding shot according to the similarity and a preset similarity threshold.
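Claims 9 and 10 leave the correction rule open. A minimal sketch, assuming cosine similarity over sampled-frame features and a multiplicative boost above the threshold (the threshold value, the boost, and the helper name are assumptions; the patent only states that the score is corrected):

```python
import numpy as np

def corrected_scores(scores, shot_frame_feats, product_feat, thr=0.5, boost=1.5):
    """Hedged sketch of claims 9-10. A shot's similarity to the advertised
    product picture is the best cosine similarity between the product feature
    vector and the shot's sampled-frame features; shots above the preset
    threshold have their importance score raised."""
    p = product_feat / np.linalg.norm(product_feat)
    out = list(scores)
    for i, frames in enumerate(shot_frame_feats):   # frames: (m, d) array
        f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
        sim = float((f @ p).max())                  # best-matching sampled frame
        if sim >= thr:
            out[i] = scores[i] * boost
    return out
```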
  11. A system for generating a video summary, comprising:
    a video segmentation unit configured to receive a video and segment the video into a plurality of shots according to changes in the video scenes of the video, wherein each of the plurality of shots is a video scene with continuous content;
    a calculation unit configured to calculate an importance score for each shot;
    a selection unit configured to select a group of shots from the plurality of shots such that the total importance score of the selected group of shots is maximized while a constraint on the total duration of the video summary is satisfied; and
    a stitching unit configured to stitch the selected group of shots into a video summary and output the video summary.
  12. The system according to claim 11, wherein:
    the calculation unit is configured to extract a feature vector from each shot using a three-dimensional convolutional network, to obtain a feature vector sequence for the shot set composed of the plurality of shots, and to input the feature vector sequence into a pre-trained shot importance score calculation network to calculate the importance score of each shot.
  13. The system according to claim 12, further comprising:
    a training unit configured to train the shot importance score calculation network using a reinforcement learning method, wherein the key elements of the reinforcement learning method include an action and a value reward function, and the value reward function includes a diversity indicator and a representativeness indicator.
  14. The system according to claim 11, further comprising:
    an identification unit configured to identify, among the plurality of shots, at least one shot exhibiting a key feature.
  15. The system according to claim 14, wherein the key feature includes at least one of a product brand trademark and product brand text.
  16. The system according to claim 15, wherein:
    the identification unit is configured to: detect a trademark region in each frame of the video using a deep-learning-based object detection method, input the image of the trademark region into a pre-trained deep model to extract an embedded feature vector, and compare the embedded feature vector with feature vectors of trademark images in a database to obtain the brand type of the trademark, thereby identifying at least one shot exhibiting the product brand trademark; or recognize the text in each frame of the video using a deep-learning-based optical character recognition method, perform word segmentation on the recognized text, match the processed text with brand text in a database, and retain the text related to the product brand, thereby identifying at least one shot exhibiting the product brand text.
  17. The system according to claim 14, wherein:
    the selection unit is configured to select at least one main shot from the at least one shot exhibiting a key feature, and to select at least one auxiliary shot from the remaining shots of the plurality of shots other than the at least one main shot, the at least one main shot and the at least one auxiliary shot serving as the selected group of shots.
  18. The system according to claim 17, wherein:
    the selection unit is configured to: if the shots selected from the at least one shot exhibiting a key feature are the first Ng shots or the last Ng shots of the video, determine the first Ng shots or the last Ng shots to be the at least one main shot, Ng being a positive integer; and select at least one auxiliary shot from the remaining shots of the plurality of shots other than the at least one main shot, taking the at least one main shot and the at least one auxiliary shot as the selected group of shots, such that the total importance score of the selected group of shots is maximized while the constraint on the total duration of the video summary is satisfied; and
    the stitching unit is configured to stitch the at least one main shot and the at least one auxiliary shot into a video summary in chronological order.
  19. The system according to claim 15, further comprising:
    a correction unit configured to calculate the similarity between each shot and a picture of the advertised product, and to correct the importance score of the corresponding shot using the similarity.
  20. The system according to claim 19, wherein:
    the correction unit is configured to: calculate a feature vector of the picture of the advertised product; sample the frames of each shot to obtain sampling frames and calculate feature vectors of the sampling frames of each shot; calculate the similarity between each shot and the product picture according to the feature vector of the product picture and the feature vectors of the sampling frames of the shot; and correct the importance score of the corresponding shot according to the similarity and a preset similarity threshold.
  21. A system for generating a video summary, comprising:
    a memory; and
    a processor coupled to the memory, the processor being configured to perform the method according to any one of claims 1 to 10 based on instructions stored in the memory.
  22. A computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.
PCT/CN2019/098495 2018-08-03 2019-07-31 Method and system for generating video abstract WO2020024958A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810874321.2A CN110798752B (en) 2018-08-03 2018-08-03 Method and system for generating video summary
CN201810874321.2 2018-08-03

Publications (1)

Publication Number Publication Date
WO2020024958A1

Family

ID=69230586

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/098495 WO2020024958A1 (en) 2018-08-03 2019-07-31 Method and system for generating video abstract

Country Status (2)

Country Link
CN (1) CN110798752B (en)
WO (1) WO2020024958A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021184153A1 (en) * 2020-03-16 2021-09-23 阿里巴巴集团控股有限公司 Summary video generation method and device, and server
CN113453040B (en) * 2020-03-26 2023-03-10 华为技术有限公司 Short video generation method and device, related equipment and medium
CN111488807B (en) * 2020-03-29 2023-10-10 复旦大学 Video description generation system based on graph rolling network
WO2021240651A1 (en) * 2020-05-26 2021-12-02 日本電気株式会社 Information processing device, control method, and storage medium
CN112004111B (en) * 2020-09-01 2023-02-24 南京烽火星空通信发展有限公司 News video information extraction method for global deep learning
CN112052841B (en) * 2020-10-12 2021-06-29 腾讯科技(深圳)有限公司 Video abstract generation method and related device
CN112261472A (en) * 2020-10-19 2021-01-22 上海博泰悦臻电子设备制造有限公司 Short video generation method and related equipment
CN112291589B (en) * 2020-10-29 2023-09-22 腾讯科技(深圳)有限公司 Method and device for detecting structure of video file
CN112445935B (en) * 2020-11-25 2023-07-04 开望(杭州)科技有限公司 Automatic generation method of video selection collection based on content analysis
CN112532897B (en) * 2020-11-25 2022-07-01 腾讯科技(深圳)有限公司 Video clipping method, device, equipment and computer readable storage medium
CN112464814A (en) 2020-11-27 2021-03-09 北京百度网讯科技有限公司 Video processing method and device, electronic equipment and storage medium
CN113242464A (en) * 2021-01-28 2021-08-10 维沃移动通信有限公司 Video editing method and device
CN113438509A (en) * 2021-06-23 2021-09-24 腾讯音乐娱乐科技(深圳)有限公司 Video abstract generation method, device and storage medium
CN115022711B (en) * 2022-04-28 2024-05-31 之江实验室 System and method for ordering shot videos in movie scene
US20240054782A1 (en) * 2022-08-12 2024-02-15 Nec Laboratories America, Inc. Few-shot video classification

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101438284A (en) * 2006-05-05 2009-05-20 皇家飞利浦电子股份有限公司 Method of updating a video summary by user relevance feedback
US20120076357A1 (en) * 2010-09-24 2012-03-29 Kabushiki Kaisha Toshiba Video processing apparatus, method and system
US20130156321A1 (en) * 2011-12-16 2013-06-20 Shigeru Motoi Video processing apparatus and method
CN104980772A (en) * 2014-04-14 2015-10-14 北京酷云互动科技有限公司 Product placement monitoring method and monitoring device
WO2016014724A1 (en) * 2014-07-23 2016-01-28 Gopro, Inc. Scene and activity identification in video summary generation
CN107967482A (en) * 2017-10-24 2018-04-27 广东中科南海岸车联网技术有限公司 Icon-based programming method and device
CN108073902A (en) * 2017-12-19 2018-05-25 深圳先进技术研究院 Video summary method, apparatus and terminal device based on deep learning

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807198A (en) * 2010-01-08 2010-08-18 中国科学院软件研究所 Video abstraction generating method based on sketch
WO2012068154A1 (en) * 2010-11-15 2012-05-24 Huawei Technologies Co., Ltd. Method and system for video summarization
US8665345B2 (en) * 2011-05-18 2014-03-04 Intellectual Ventures Fund 83 Llc Video summary including a feature of interest
US8643746B2 (en) * 2011-05-18 2014-02-04 Intellectual Ventures Fund 83 Llc Video summary including a particular person
CN106034264B (en) * 2015-03-11 2020-04-03 中国科学院西安光学精密机械研究所 Method for acquiring video abstract based on collaborative model
CN106612468A (en) * 2015-10-21 2017-05-03 上海文广互动电视有限公司 A video abstract automatic generation system and method
CN107203636B (en) * 2017-06-08 2020-06-16 天津大学 Multi-video abstract acquisition method based on hypergraph master set clustering

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11836181B2 (en) 2019-05-22 2023-12-05 SalesTing, Inc. Content summarization leveraging systems and processes for key moment identification and extraction
CN111694984A (en) * 2020-06-12 2020-09-22 百度在线网络技术(北京)有限公司 Video searching method and device, electronic equipment and readable storage medium
CN113810782A (en) * 2020-06-12 2021-12-17 阿里巴巴集团控股有限公司 Video processing method and device, server and electronic device
CN111694984B (en) * 2020-06-12 2023-06-20 百度在线网络技术(北京)有限公司 Video searching method, device, electronic equipment and readable storage medium
US11900682B2 (en) 2020-08-25 2024-02-13 Beijing Xiaomi Pinecone Electronics Co., Ltd. Method and apparatus for video clip extraction, and storage medium
CN112423112A (en) * 2020-11-16 2021-02-26 北京意匠文枢科技有限公司 Method and equipment for releasing video information
CN115442660A (en) * 2022-08-31 2022-12-06 杭州影象官科技有限公司 Method and device for extracting self-supervision confrontation video abstract
CN115442660B (en) * 2022-08-31 2023-05-19 杭州影象官科技有限公司 Self-supervision countermeasure video abstract extraction method, device, equipment and storage medium
CN115731498A (en) * 2022-12-01 2023-03-03 石家庄铁道大学 Video abstract generation method combining reinforcement learning and contrast learning

Also Published As

Publication number Publication date
CN110798752A (en) 2020-02-14
CN110798752B (en) 2021-10-15

Similar Documents

Publication Publication Date Title
WO2020024958A1 (en) Method and system for generating video abstract
CN109587554B (en) Video data processing method and device and readable storage medium
CN101281540B (en) Apparatus, method and computer program for processing information
Li et al. Video object segmentation with re-identification
JP5510167B2 (en) Video search system and computer program therefor
CN113542777B (en) Live video editing method and device and computer equipment
CN111836118B (en) Video processing method, device, server and storage medium
US20190303499A1 (en) Systems and methods for determining video content relevance
Voigtlaender et al. Boltvos: Box-level tracking for video object segmentation
JP2016502194A5 (en)
US10699156B2 (en) Method and a device for image matching
CN104883515A (en) Video annotation processing method and video annotation processing server
CN111046904B (en) Image description method, image description device and computer storage medium
CN107358490A (en) A kind of image matching method, device and electronic equipment
JP2019003585A (en) Summary video creation device and program of the same
CN114677402A (en) Poster text layout, poster generation method and related device
Han et al. Video scene change detection using convolution neural network
CN105847964A (en) Movie and television program processing method and movie and television program processing system
CN115665508A (en) Video abstract generation method and device, electronic equipment and storage medium
CN104850600A (en) Method and device for searching images containing faces
Bhaumik et al. Towards redundancy reduction in storyboard representation for static video summarization
CN115278300A (en) Video processing method, video processing apparatus, electronic device, storage medium, and program product
CN110781345B (en) Video description generation model obtaining method, video description generation method and device
US20210124995A1 (en) Selecting training symbols for symbol recognition
JP4904316B2 (en) Specific scene extraction apparatus and specific scene extraction program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19843213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17/05/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19843213

Country of ref document: EP

Kind code of ref document: A1