WO2023065663A1 - Video editing method and apparatus, electronic device, and storage medium - Google Patents

Video editing method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023065663A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
content feature
segment
video segment
prediction model
Prior art date
Application number
PCT/CN2022/094576
Other languages
English (en)
French (fr)
Inventor
梅立军
付瑞吉
李月雷
张德兵
Original Assignee
北京达佳互联信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2023065663A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 - Details of television systems
    • H04N5/222 - Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 - Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265 - Mixing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016 - Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 - Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 - Structuring of content by decomposing the content in the time domain, e.g. in time segments

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a video editing method, device, electronic equipment, storage medium, computer program product and computer program.
  • The usual approach is to insert multiple clips from different short videos into one video, or to directly combine a group of short video clips into a single video. However, these approaches require manually marking and collecting video clips and rely on manual completion of the mixing-and-cutting operation; automated short-video mixed editing is lacking, and the few automated mixed-cut solutions that exist merely aggregate video clips by simple attributes, so no intelligence is reflected in how the video segments connect.
  • the present disclosure provides a video clipping method, device, electronic equipment, storage medium, computer program product and computer program, so as to at least solve the problem of lack of automatic intelligent mixed clipping.
  • the disclosed technical scheme is as follows:
  • a video clipping method applied to an electronic device comprising:
  • obtaining a selection instruction for an editing point of an original video, and extracting a target video segment from the original video; the target video segment is a video segment of a preset duration before or after the editing point in the original video;
  • the video content feature corresponding to the target video segment is input to the content feature prediction model to obtain the predicted video content feature;
  • according to the predicted video content feature, a video segment to be inserted is determined from a set of video material segments; the degree of matching between the video content feature corresponding to the video segment to be inserted and the predicted video content feature satisfies a preset condition;
  • the video segment to be inserted is fed back to the user, so as to insert the video segment to be inserted into the editing point of the original video.
  • the determination of the video segment to be inserted from the set of video material segments according to the predicted video content features includes:
  • when the matching degree is greater than a preset threshold, it is determined that the matching degree between the video content feature and the predicted video content feature satisfies the preset condition;
  • the video material segment corresponding to the video content feature is used as the video segment to be inserted.
  • there are a plurality of video clips to be inserted, and feeding back the video clips to be inserted to the user includes:
  • the method further includes:
  • determining the target insertion video segment from the plurality of video segments to be inserted according to the insertion selection information returned by the user; and inserting the target insertion video segment before or after the editing point of the original video.
  • a method for obtaining a content feature prediction model which is applied to an electronic device, and the method includes:
  • the training sample data includes a plurality of video clip pairs; each of the video clip pairs includes a first video clip and a second video clip belonging to the same sample video; the first video clip is a video segment of a preset duration before the video key point in the sample video; the second video clip is a video segment of a preset duration after the video key point in the sample video;
  • the content feature prediction model to be trained is trained to obtain the content feature prediction model.
  • the training of the content feature prediction model to be trained using the training sample data to obtain the content feature prediction model includes:
  • the video content feature corresponding to the first video segment is input to the content feature prediction model to be trained, and the predicted video content feature corresponding to the first video segment is obtained;
  • the training sample data is used to train the content feature prediction model to be trained to obtain the content feature prediction model, including:
  • the method further includes:
  • for each image content feature dimension, according to the image preprocessing method corresponding to the image content feature dimension, each image frame in the first video segment and the second video segment of each video segment pair is adjusted to obtain an adjusted image frame;
  • the acquisition of training sample data includes:
  • for each video highlight, determine the first video segment of the preset duration before the video highlight in the sample video, and the second video segment of the preset duration after the video highlight in the sample video;
  • according to the first video segment and the second video segment, a video clip pair corresponding to the video highlight is obtained.
  • the acquisition of the video highlights set of the sample video includes:
  • the highlight extraction information is used to identify video highlights according to picture information, sound information, and text information in the video;
  • a plurality of video highlight points are determined from the sample video according to the highlight point extraction information, and a video highlight set of the sample video is obtained.
  • a video editing device including:
  • the acquiring unit is configured to acquire a selection instruction for an editing point of an original video, and extract a target video segment from the original video; the target video segment is a video segment of a preset duration before or after the editing point in the original video;
  • the prediction unit is configured to input the video content feature corresponding to the target video segment into the content feature prediction model to obtain the predicted video content feature;
  • the video segment matching unit is configured to determine the video segment to be inserted from the set of video material segments according to the predicted video content feature; the matching degree between the video content feature corresponding to the video segment to be inserted and the predicted video content feature satisfies a preset condition;
  • the feedback unit is configured to feed back the video segment to be inserted to the user, so as to insert the video segment to be inserted into the editing point of the original video.
  • the video clip matching unit is specifically configured to determine matching-degree sorting results between a plurality of video content features and the predicted video content feature; when the matching degree is greater than a preset threshold, determine that the matching degree between the video content feature and the predicted video content feature meets a preset condition; and use the video material clip corresponding to the video content feature as the video clip to be inserted.
  • there are a plurality of video clips to be inserted, and the feedback unit is specifically configured to acquire preset feedback index information; sort the multiple video clips to be inserted according to the feedback index information to obtain a feedback sorting result; and, based on the feedback sorting result, feed back the plurality of video clips to be inserted.
  • the device further includes:
  • the target insertion video segment determination unit is configured to determine the target insertion video segment from the plurality of video segments to be inserted according to the insertion selection information returned by the user;
  • the target insertion video segment inserting unit is configured to insert the target insertion video segment before or after the cutting point of the original video.
  • an apparatus for obtaining a content feature prediction model comprising:
  • the training sample data acquisition unit is configured to obtain training sample data;
  • the training sample data includes a plurality of video segment pairs; each of the video segment pairs includes a first video segment and a second video segment belonging to the same sample video;
  • the first video clip is a video clip with a preset duration before the video key point in the sample video;
  • the second video clip is a video clip with a preset duration after the video key point in the sample video;
  • the model training unit is configured to use the training sample data to train the content feature prediction model to be trained to obtain the content feature prediction model.
  • the model training unit is specifically configured to input the video content feature corresponding to the first video segment into the content feature prediction model to be trained to obtain the predicted video content feature corresponding to the first video segment; and, based on the difference between the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment, adjust the model parameters of the content feature prediction model to be trained until the adjusted content feature prediction model meets the preset training conditions, so as to obtain the content feature prediction model;
  • the model training unit is specifically configured to input the video content feature corresponding to the second video clip into the content feature prediction model to be trained to obtain the predicted video content feature corresponding to the second video segment; and, based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment, adjust the model parameters of the content feature prediction model to be trained until the adjusted content feature prediction model meets the preset training conditions, so as to obtain the content feature prediction model.
  • the device further includes:
  • the image preprocessing unit is configured to, for each image content feature dimension, adjust each image frame in the first video clip and the second video clip of each of the video clip pairs according to the image preprocessing method corresponding to the image content feature dimension, so as to obtain an adjusted image frame;
  • An image feature extraction unit configured to perform image feature extraction on the adjusted image frame to obtain a plurality of image feature vectors
  • the splicing unit is configured to splice the plurality of image feature vectors to obtain video feature vectors respectively corresponding to the first video segment and the second video segment; the video feature vectors are used to characterize the video content features corresponding to the first video segment and the second video segment.
  • the training sample data acquisition unit is specifically configured to acquire a set of video highlight points of the sample video; for each video highlight point, determine the first video segment of the preset duration before the video highlight point in the sample video, and the second video segment of the preset duration after the video highlight point in the sample video; and obtain, according to the first video segment and the second video segment, the video clip pair corresponding to the video highlight point.
  • the training sample data acquisition unit is specifically configured to acquire preset highlight extraction information, where the highlight extraction information is used to identify video highlights according to the picture information, sound information, and text information in the video; and to determine, according to the highlight extraction information, a plurality of video highlights from the sample video, so as to obtain the video highlight set of the sample video.
  • an electronic device including a memory and a processor, where the memory stores a computer program, and the processor, when executing the computer program, implements the video editing method described in the first aspect or any embodiment of the first aspect, or the method for obtaining a content feature prediction model described in the second aspect or any embodiment of the second aspect.
  • a non-volatile computer-readable storage medium on which a computer program is stored, where, when the computer program is executed by a processor, the video editing method described in the first aspect or any embodiment of the first aspect, or the method for obtaining a content feature prediction model described in the second aspect or any embodiment of the second aspect, is realized.
  • a computer program product includes a computer program stored in a readable storage medium; at least one processor of a device reads and executes the computer program from the readable storage medium, so that the device executes the video editing method described in the first aspect or any embodiment of the first aspect, or the method for obtaining a content feature prediction model described in the second aspect or any embodiment of the second aspect.
  • a computer program includes computer program code, and when the computer program code is run on a computer, the computer executes the video editing method described in the first aspect or any embodiment of the first aspect, or the method for obtaining a content feature prediction model described in the second aspect or any embodiment of the second aspect.
  • In the technical solution of the present disclosure, a selection instruction for an editing point of the original video is obtained and a target video segment is extracted from the original video, where the target video segment is a video segment of a preset duration before or after the editing point in the original video; the video content feature corresponding to the target video segment is then input into the content feature prediction model to obtain the predicted video content feature; a video segment to be inserted, whose corresponding video content feature matches the predicted video content feature to a degree satisfying the preset condition, is then determined from the set of video material segments according to the predicted video content feature; and the video segment to be inserted is fed back to the user, so as to insert the video segment to be inserted at the editing point of the original video.
  • In this way, the predicted video content features can be obtained based on the video content features corresponding to the target video segment, and the video segment to be inserted can then be matched from the video material segment set and fed back; the clip selection is optimized so that the edited video is more natural and smooth, reflecting intelligence in the video connection and avoiding abrupt transitions in the edited video.
  • Fig. 1 is an application environment diagram of a video editing method according to an embodiment of the present disclosure.
  • Fig. 2 is a flow chart of a video editing method according to an embodiment of the present disclosure.
  • Fig. 3 is a schematic diagram of a processing flow of an intelligent video mixed-cut editing according to an embodiment of the present disclosure.
  • Fig. 4 is a flow chart of obtaining a content feature prediction model according to an embodiment of the present disclosure.
  • Fig. 5a is a schematic diagram of model training according to an embodiment of the present disclosure.
  • Fig. 5b is a schematic diagram showing a processing flow of training data preparation and model training according to an embodiment of the present disclosure.
  • Fig. 6 is a flow chart of another video editing method according to an embodiment of the present disclosure.
  • Fig. 7 is a block diagram of a video clipping device according to an embodiment of the disclosure.
  • Fig. 8 is a block diagram of an apparatus for obtaining a content feature prediction model according to an embodiment of the present disclosure.
  • Fig. 9 is a diagram showing an internal structure of a server according to an embodiment of the present disclosure.
  • the video clipping method provided by the embodiments of the present disclosure can be applied to the application environment shown in FIG. 1 .
  • the client 110 interacts with the server 120 through the network.
  • the server 120 obtains the selection instruction for the editing point of the original video, extracts the target video segment from the original video, and inputs the video content feature corresponding to the target video segment into the content feature prediction model to obtain the predicted video content feature, and then according to Predict the feature of the video content, determine the video segment to be inserted from the video material segment set, and the server 120 feeds back the video segment to be inserted to the client 110 .
  • the client 110 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices.
  • the server 120 can be implemented as an independent server or a server cluster composed of multiple servers.
  • Fig. 2 is a flow chart of a video editing method according to an embodiment of the present disclosure. As shown in Fig. 2 , the method is used in the server 120 in Fig. 1 and includes steps S210-S240.
  • step S210 an instruction to select an editing point of the original video is obtained, and a target video segment is extracted from the original video; the target video segment is a video segment with a preset duration before or after the editing point in the original video.
  • the original video may be a video to be inserted into a clip, for example, the only base video currently being edited by the client may be used as the original video.
  • the target video segment may be a video segment of the video content to be predicted in the original video, for example, based on the target video segment extracted from the original video, the video segment connected with the video content of the target video segment may be predicted.
  • the editing point may be the time position of inserting the clip in the original video specified by the user terminal, such as specifying the insertion time position p based on the user's requirement.
  • in the process of video clipping, the server can receive the selection instruction for the editing point of the original video sent by the client, and then extract the target video segment from the original video according to the obtained selection instruction; the target video segment can be a video segment of a preset duration before or after the editing point in the original video.
  • a video based on a certain time interval [tp-n, tp] before the insertion time position may be extracted from the original video as a target video segment.
  • the preset duration n (that is, the time interval) before the editing point can be selected within the range of 10-15s before the time position p, and the preset duration n can also be Other setting values are not limited in this disclosure.
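  • As a concrete illustration of the interval arithmetic above, the following Python sketch extracts the target video segment of preset duration n on either side of the editing point p; the frame representation and function names are hypothetical and not part of the disclosure.

```python
# Illustrative sketch only; Frame and extract_target_segment are hypothetical
# names, not from the patent text.
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    timestamp: float  # seconds from the start of the original video
    data: bytes       # encoded frame payload (placeholder)

def extract_target_segment(frames: List[Frame], clip_point: float,
                           preset_duration: float = 12.0,
                           before: bool = True) -> List[Frame]:
    """Return the frames in [tp-n, tp] before the editing point (or the
    mirrored interval [tp, tp+n] after it), per the description above."""
    if before:
        start, end = max(0.0, clip_point - preset_duration), clip_point
    else:
        start, end = clip_point, clip_point + preset_duration
    return [f for f in frames if start <= f.timestamp <= end]
```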
  • step S220 the video content features corresponding to the target video segment are input into the content feature prediction model to obtain predicted video content features.
  • the video content feature corresponding to the target video segment may be a feature vector sequence obtained by performing multi-dimensional feature extraction on the target video segment, which may be used to characterize the video content feature of the video segment.
  • multi-dimensional feature extraction can be performed on the target video segment to obtain the video content feature corresponding to the target video segment, and then the video content feature corresponding to the target video segment can be input into the content feature prediction model, A predicted video content feature is obtained, and the video content represented by the predicted video content feature can be connected with the video content represented by the video content feature corresponding to the target video segment.
  • the predicted video content feature can be a set of vector sequences. For example, based on the target video segment extracted at the insertion time position p, the optional set Yp can be predicted by the pre-trained content feature prediction model; the set can have multiple elements, each element y can refer to a vector sequence, each vector can correspond to a video frame, and a complete vector sequence can correspond to a video segment. That is, the optional set Yp can be the set of vector sequences corresponding to the predicted optional video segments.
  • step S230 according to the feature of the predicted video content, the video segment to be inserted is determined from the set of video material segments; the matching degree between the feature of the video content corresponding to the video segment to be inserted and the feature of the predicted video content satisfies the preset condition.
  • the video clip set may be a group of video clip sets, and each video clip may correspond to a vector sequence characterizing the video content characteristics of the video clip.
  • the video material segment set can be searched according to the predicted video content feature, and through the similarity matching process between the predicted video content feature and the vector sequence of the video content feature corresponding to each video segment in the video material segment set, it can be searched out A vector sequence whose matching degree with the feature of the predicted video content satisfies a preset condition, and then the video segment corresponding to the searched vector sequence can be used as the video segment to be inserted.
  • the searched similarity matching results may be the N video segments with the highest matching degree with the predicted video content features, that is, the video segments to be inserted.
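  • A minimal sketch of this similarity search, assuming each clip's content feature has been pooled into a fixed-length vector and cosine similarity serves as the matching degree; the function and array names are illustrative, not from the disclosure.

```python
# Hedged sketch: cosine-similarity retrieval of the N best-matching material
# clips; pooling features into fixed-length vectors is an assumption.
import numpy as np

def top_n_matches(predicted_feature: np.ndarray,
                  material_features: np.ndarray,  # shape (num_clips, dim)
                  n: int = 10) -> np.ndarray:
    """Return indices of the N material clips whose video content features
    best match the predicted video content feature."""
    a = predicted_feature / np.linalg.norm(predicted_feature)
    b = material_features / np.linalg.norm(material_features, axis=1,
                                           keepdims=True)
    sims = b @ a                  # matching degree for every material clip
    return np.argsort(-sims)[:n]  # highest-matching clips first
```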
  • step S240 the video segment to be inserted is fed back to the user, so as to insert the video segment to be inserted into the editing point of the original video.
  • the server can feed back the video segment to be inserted to the user end, and then insert the video segment to be inserted into the editing point of the original video based on the user operation to obtain a mixed-cut edited video.
  • The above video editing method extracts the target video segment from the original video by obtaining the selection instruction for the editing point of the original video, and then inputs the video content feature corresponding to the target video segment into the content feature prediction model to obtain the predicted video content feature; according to the predicted video content feature, the video clips to be inserted are determined from the set of video material clips and fed back to the user, so as to insert the video clips to be inserted at the editing point of the original video. In this way, the predicted video content features can be obtained based on the video content features corresponding to the target video segment, the video segment to be inserted can be matched from the video material segment set for feedback, and the clip selection is optimized so that the edited video is more natural and smooth, reflecting intelligence in the video connection and avoiding abruptness in the edited video.
  • obtaining the selection instruction for the editing point of the original video and extracting the target video segment from the original video includes: obtaining the selection instruction for the editing point of the original video, and determining the time interval before or after the editing point in the original video; and extracting the target video segment from the original video based on the time interval.
  • the server can receive the selection instruction of the editing point of the original video sent by the client, and then the server can determine the time interval before or after the editing point in the original video according to the obtained selection instruction, and obtain the time interval After that, the target video segment corresponding to the time interval can be extracted from the original video.
  • the insertion time position p can be determined according to the selection instruction, and then based on the preset duration n, the time interval [tp-n, tp] before the insertion time position can be obtained, and the time interval [tp-n , tp] corresponding to the video segment, as the target video segment.
  • the technical solution of the embodiment of the present disclosure determines the time interval before or after the editing point in the original video by obtaining the selection instruction for the editing point of the original video, and then extracts the target video from the original video based on the time interval Segments, which can accurately extract target video segments from original videos based on user needs, and provide data support for subsequent prediction of video content features.
  • determining the video clips to be inserted from the video material clip set according to the predicted video content features includes: determining matching-degree sorting results between a plurality of video content features and the predicted video content features; when the matching degree is greater than the preset threshold, determining that the matching degree between the video content feature and the predicted video content feature meets the preset condition; and using the video material clip corresponding to the video content feature as the video clip to be inserted.
  • the video material segment set can be searched according to the predicted video content features, based on the video content features corresponding to each of the multiple video material segments.
  • For example, for each of the 5 elements in the optional set, the 10 video clips with the highest similarity matching degree can be searched from the video material clip collection, and these 10 clips per element, 50 video clips in total, constitute the video clips to be inserted.
  • After the server obtains the predicted video content features, it determines the matching-degree sorting results between the multiple video content features and the predicted video content features based on the video content features corresponding to each of the multiple video material segments in the video material segment set; when the matching degree is greater than the preset threshold, it determines that the matching degree between the video content feature and the predicted video content feature meets the preset condition, and uses the video material segment corresponding to the video content feature as the video segment to be inserted. In this way, video material segments with high similarity can be effectively matched according to the predicted video content features, improving the video content connection effect.
  • multiple video clips to be inserted may be included, and feedback to the user of the video clips to be inserted includes: obtaining preset feedback index information; sorting the multiple video clips to be inserted according to the feedback index information, and obtaining feedback Sorting results; based on the feedback sorting results, multiple video clips to be inserted are fed back.
  • the feedback index information may include a plurality of designated indexes, such as relevance, excitement, and the like.
  • there may be multiple video clips to be inserted, and the recommendation ranking of the multiple video clips to be inserted may be performed according to the preset feedback index information to obtain the feedback sorting result; the server may then feed back the multiple video clips to be inserted to the client based on the feedback sorting result.
  • In the technical solution, there may be a plurality of video clips to be inserted; by obtaining the preset feedback index information, sorting the multiple video clips to be inserted according to the feedback index information to obtain the feedback sorting result, and then feeding back the multiple video clips to be inserted based on the feedback sorting result (a sketch of such a sorting step is given below), users are provided with intelligent video mixed-cut materials, which can make the edited video more natural and smooth.
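  • A hedged sketch of the feedback-index sorting step; the index names mirror the "relevance" and "excitement" examples above, while the weights and function names are assumptions.

```python
# Illustrative ranking by feedback indexes; the weighting scheme is assumed.
from typing import Dict, List, Optional

def rank_candidates(candidates: List[Dict[str, float]],
                    weights: Optional[Dict[str, float]] = None
                    ) -> List[Dict[str, float]]:
    """Sort candidate clips to be inserted by a weighted sum of their
    feedback indexes (highest score first)."""
    weights = weights or {"relevance": 0.7, "excitement": 0.3}  # assumed
    def score(clip: Dict[str, float]) -> float:
        return sum(w * clip.get(name, 0.0) for name, w in weights.items())
    return sorted(candidates, key=score, reverse=True)
```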
  • after the step of feeding back the video segment to be inserted to the user, the method further includes: determining the target insertion video segment from the plurality of video segments to be inserted according to the insertion selection information returned by the user; and inserting the target insertion video segment before or after the editing point of the original video.
  • after feeding back the video clips to be inserted to the user, the target insertion video clip can be determined from the multiple video clips to be inserted according to the insertion selection information returned by the user, and the target insertion video clip can then be inserted into the original video before or after the editing point; for example, according to the user's selection operation on the sorted video segments to be inserted, the target insertion video segment can be determined and spliced into the original video to obtain the mixed-cut edited video.
  • when the target video segment of the video content to be predicted is a video segment of a preset duration before the editing point in the original video, the target insertion video segment may be inserted after the editing point of the original video; when the target video segment of the video content to be predicted is a video segment of a preset duration after the editing point in the original video, the target insertion video segment may be inserted before the editing point of the original video.
  • In the technical solution, the target insertion video segment is determined from the multiple video segments to be inserted according to the insertion selection information returned by the user, and is inserted before or after the editing point of the original video; intelligent video mixing and cutting can thus be carried out based on the user's choice, which embodies intelligence in the video connection and makes the edited video more natural and smooth.
  • FIG. 3 exemplarily provides a schematic diagram of a processing flow of intelligent video mixed-cut editing; as shown in FIG. 3 , the processing flow of intelligent video mixed-cut editing includes steps S301-S307.
  • step S301 the user specifies, based on the base video (i.e. the original video) at the client, an insertion time position p (i.e. the editing point); in step S302, the server extracts from the existing video (i.e. the original video) the video corresponding to the time interval [tp-n, tp] (i.e. the target video segment);
  • step S303 multi-dimensional feature extraction is performed on the video corresponding to the time interval [tp-n, tp] to obtain the video content features;
  • step S304 the optional set Yp (i.e. the predicted video content features) can be generated by the generative deep learning model (i.e. the content feature prediction model);
  • step S305 for each element y in the optional set Yp, the editable videos to be selected (i.e. the video material clip collection) are searched to obtain the video clip collection Yy to be inserted (i.e. the video clips to be inserted);
  • step S306 the video clip collection Yy to be inserted is sorted according to the specified index, and feedback is given according to the sorting result; in step S307, the user selects from the sorted set of video clips Yy to be inserted and performs insertion.
  • Fig. 4 is a flowchart of a method for obtaining a content feature prediction model according to an embodiment of the present disclosure. As shown in Fig. 4 , the method is used in the server 120 in Fig. 1 , including steps S410-S420.
  • step S410 the training sample data is obtained; the training sample data includes a plurality of video clip pairs; each video clip pair includes a first video clip and a second video clip belonging to the same sample video; the first video clip is a video clip of a preset duration before the video key point in the sample video; the second video clip is a video clip of a preset duration after the video key point in the sample video.
  • before obtaining the selection instruction for the editing point of the original video and extracting the target video segment from the original video, the server also needs to train the above-mentioned content feature prediction model, and can obtain training sample data for this purpose. The training sample data can include a plurality of video clip pairs; each video clip pair can include a first video clip and a second video clip belonging to the same sample video; the first video clip can be a video segment of a preset duration before the video key point in the sample video, and the second video clip can be a video segment of a preset duration after the video key point in the sample video.
  • the content feature prediction model can be a generative deep learning model; the generative deep learning model can use a VAE, a GAN and their variants, a recurrent neural network such as a Bidirectional RNN (bidirectional recurrent neural network), a Deep (Bidirectional) RNN (deep (two-way) recurrent neural network) or an LSTM, a Convolutional Neural Network (CNN), and so on.
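  • As one concrete possibility among the model families listed above, the following PyTorch sketch shows an LSTM sequence-to-sequence predictor that maps the per-frame feature sequence of one clip to a predicted feature sequence for the adjoining clip; the architecture and dimensions are assumptions, since the disclosure only names candidate model types.

```python
# A minimal sketch of a content feature prediction model; all sizes are
# assumed, not specified by the patent.
import torch
import torch.nn as nn

class ContentFeaturePredictor(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 256,
                 out_len: int = 32):
        super().__init__()
        self.out_len = out_len                       # frames to predict
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_len, feat_dim), the per-frame features of the
        # segment before (or after) the editing point.
        _, state = self.encoder(x)                   # summarize the segment
        step = torch.zeros(x.size(0), 1, x.size(2), device=x.device)
        outputs = []
        for _ in range(self.out_len):                # autoregressive decoding
            out, state = self.decoder(step, state)
            step = self.proj(out)
            outputs.append(step)
        return torch.cat(outputs, dim=1)             # predicted feature sequence
```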
  • step S420 the training sample data is used to train the content feature prediction model to be trained to obtain the content feature prediction model.
  • the server can use the training sample data to train the content feature prediction model to be trained to obtain the content feature prediction model; specifically, the content feature prediction model to be trained is trained based on the first video clip and the second video clip of each video clip pair to obtain the content feature prediction model.
  • In this way, the content feature prediction model can be obtained, video content prediction can be performed based on the pre-trained content feature prediction model, and the clip selection can be optimized, embodying intelligence in the connection of clipped videos.
  • the training sample data is used to train the content feature prediction model to be trained to obtain the content feature prediction model, including:
  • the video content feature corresponding to the first video clip is input to the content feature prediction model to be trained, and the predicted video content feature corresponding to the first video clip is obtained;
  • based on the difference between the predicted video content feature corresponding to the first video clip and the video content feature corresponding to the second video clip, the model parameters of the content feature prediction model to be trained are adjusted until the adjusted content feature prediction model meets the preset training conditions, so as to obtain the content feature prediction model;
  • the training sample data is used to train the content feature prediction model to be trained to obtain the content feature prediction model, including:
  • the video content feature corresponding to the second video clip is input to the content feature prediction model to be trained, and the predicted video content feature corresponding to the second video clip is obtained;
  • Specifically, the video content features corresponding to the first video segment can be input into the content feature prediction model to be trained to obtain the predicted video content features corresponding to the first video segment; based on the difference between the predicted video content features corresponding to the first video segment and the video content features corresponding to the second video segment, the model parameters of the content feature prediction model to be trained are adjusted until the adjusted content feature prediction model meets the preset training conditions, and the content feature prediction model can then be obtained.
  • Alternatively, the video content features corresponding to the second video segment can be input into the content feature prediction model to be trained to obtain the predicted video content features corresponding to the second video segment; based on the difference between the predicted video content features corresponding to the second video segment and the video content features corresponding to the first video segment, the model parameters of the content feature prediction model to be trained are adjusted until the adjusted content feature prediction model meets the preset training conditions, and the content feature prediction model can then be obtained.
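  • The training step just described can be sketched as follows, with mean-squared error standing in for the "difference" between the predicted and actual video content features; the loss choice and optimizer usage are assumptions.

```python
# Hedged sketch of one parameter-adjustment step for the content feature
# prediction model to be trained.
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               first_clip_feats: torch.Tensor,    # (batch, len, dim)
               second_clip_feats: torch.Tensor    # (batch, len, dim)
               ) -> float:
    predicted = model(first_clip_feats)            # predicted content features
    # The "difference" between predicted and actual features (MSE assumed).
    loss = nn.functional.mse_loss(predicted, second_clip_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # adjust model parameters
    return loss.item()
```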
  • When the target video segment is a video segment of a preset duration before the editing point in the original video, the video content feature corresponding to the first video segment is input into the content feature prediction model to be trained to obtain the predicted video content feature corresponding to the first video segment; based on the difference between the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment, the model parameters of the content feature prediction model to be trained are adjusted until the adjusted content feature prediction model meets the preset training conditions, and the content feature prediction model is obtained.
  • When the target video segment is a video segment of a preset duration after the editing point in the original video, the video content feature corresponding to the second video segment is input into the content feature prediction model to be trained to obtain the predicted video content feature corresponding to the second video segment; based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment, the model parameters of the content feature prediction model to be trained are adjusted until the adjusted content feature prediction model meets the preset training conditions, and the content feature prediction model is obtained.
  • after the step of obtaining the training sample data, the method also includes: for each image content feature dimension, according to the image preprocessing method corresponding to the image content feature dimension, adjusting each image frame in the first video segment and the second video segment of each video segment pair to obtain adjusted image frames; performing image feature extraction on the adjusted image frames to obtain a plurality of image feature vectors; and splicing the plurality of image feature vectors to obtain video feature vectors respectively corresponding to the first video segment and the second video segment, where the video feature vectors are used to characterize the video content features corresponding to the first video segment and the second video segment.
  • Specifically, the image sequence corresponding to each video clip in the training sample data can be preprocessed from multiple dimensions: for each image content feature dimension, according to the image preprocessing method corresponding to that dimension, each image frame in the first video segment and the second video segment of each video segment pair is adjusted to obtain adjusted image frames; image feature extraction is then performed on the adjusted image frames to obtain multiple image feature vectors, and the multiple image feature vectors can be spliced to obtain the video feature vectors corresponding to the first video clip and the second video clip respectively.
  • the image feature extraction process may be as follows: converting the video segment into a picture sequence, and then performing image feature extraction on each picture in the picture sequence using a convolutional neural network to obtain an image feature vector.
  • By splicing the image feature vectors corresponding to the multiple pictures, the video feature vector corresponding to the video clip, such as a feature vector sequence, can be obtained.
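  • A sketch of this frame-level extraction and splicing, using a pretrained torchvision ResNet-18 as the convolutional neural network; the backbone choice, input size, and 512-dimensional feature width are assumptions.

```python
# Hedged sketch: per-picture CNN features, spliced into the clip's feature
# vector sequence (requires torchvision >= 0.13 for the weights enum).
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # keep the 512-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def clip_feature_sequence(pictures) -> torch.Tensor:
    """Convert a clip's picture sequence (PIL images) into the spliced
    feature-vector sequence that characterizes its video content."""
    batch = torch.stack([preprocess(p) for p in pictures])  # (frames, 3, 224, 224)
    return backbone(batch)                                  # (frames, 512)
```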
  • the multiple dimensions may include whether to include the background (include, not include), whether to ignore picture colors (yes, no), whether to include only people (include, not include), and whether to target only moving objects (yes, no), where including the background and not including the background can be used as two dimensions.
  • multi-dimensional feature extraction can be performed on the video clip pair to obtain the corresponding video content features of the first video clip and the second video clip in the video clip pair.
  • the content feature prediction model can be trained.
  • For example, based on the video clip pair, the first video clip (i.e. the video segment of a preset duration before the video key point in the sample video) can have its input data spliced through the dimension-1 input feature data (i.e. the multiple image feature vectors corresponding to the first video clip) to obtain the video content feature corresponding to the first video clip, and the second video clip (i.e. the video segment of a preset duration after the video key point in the sample video) can have its output data spliced through the dimension-1 output feature data (i.e. the multiple image feature vectors corresponding to the second video clip) to obtain the video content feature corresponding to the second video clip; the generative deep learning model (i.e. the content feature prediction model to be trained) is then trained to predict, from the video content feature of the first video clip, the video content feature corresponding to the second video clip.
  • In the technical solution, each image frame in the first video segment and the second video segment of each video segment pair is adjusted to obtain adjusted image frames; image feature extraction is then performed on the adjusted image frames to obtain a plurality of image feature vectors, which are spliced to obtain the video feature vectors respectively corresponding to the first video segment and the second video segment. In this way, model training can be performed based on multiple image content feature dimensions, which enhances the generalization ability of the content feature prediction model.
  • obtaining training sample data includes: obtaining a video highlight set of a sample video; for each video highlight, determining a first video segment of a preset duration before the video highlight in the sample video, and A second video clip with a preset duration after the video highlights in the sample video; and a pair of video clips corresponding to the video highlights is obtained according to the first video clip and the second video clip.
  • Specifically, after determining the first video segment of the preset duration before the video highlight and the second video segment of the preset duration after the video highlight, the video segment pair corresponding to the video highlight can further be obtained according to the first video segment and the second video segment.
  • In the technical solution, by obtaining the video highlight set of the sample video, then, for each video highlight point, determining the first video segment of the preset duration before the video highlight point in the sample video and the second video segment of the preset duration after the video highlight point in the sample video, and obtaining the video segment pair corresponding to the video highlight point according to the first video segment and the second video segment, the video segments to be used for training can be accurately obtained based on the video highlight points, providing data support for model training.
  • Fig. 5b exemplarily provides a schematic diagram of the processing flow of training data preparation and model training. As shown in Fig. 5b, a key point set K (i.e. the video highlight set of the sample video) is extracted, and for each key point k (i.e. a video highlight) in the key point set K, a video training pair <xk, yk> (i.e. a video segment pair) can be extracted from the existing video (i.e. the sample video), where xk is the video segment of the preset duration before the key point k and yk is the video segment of the preset duration after it.
  • obtaining the video highlight set of the sample video includes: acquiring preset highlight extraction information, where the highlight extraction information is used to identify video highlights according to the picture information, sound information, and text information in the video; and determining, according to the highlight extraction information, a plurality of video highlight points from the sample video to obtain the video highlight set of the sample video.
  • the highlight point of the video may be a time center point of the highlight segment in the video.
  • Specifically, using the highlight extraction information, multiple video highlight points can be identified from the sample video according to the picture information, sound information, and text information in the video, and the video highlight set of the sample video can then be obtained.
  • For example, based on picture information and taking a football match as an example, the video highlight can be the time point whose video frames include shooting, scoring, or red and yellow cards; using an acoustic recognition model, again taking a football match as an example, the part where the loudness of the sound exceeds a threshold (for example, 1.5 times the average of the overall audio loudness) can be recognized as a highlight, and the video highlight may be the time point at which the sound loudness exceeds the threshold; text information can likewise be used, for example via ASR (Automatic Speech Recognition).
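  • The sound-based rule above can be sketched as follows, treating per-window RMS energy as loudness and flagging windows above 1.5 times the overall average; the window length and the RMS measure are assumptions.

```python
# Hedged sketch of loudness-threshold highlight detection.
import numpy as np

def loudness_highlights(samples: np.ndarray, sample_rate: int,
                        window_sec: float = 1.0, factor: float = 1.5):
    """Return time points (seconds) whose loudness exceeds factor times the
    average loudness of the whole audio track."""
    win = int(sample_rate * window_sec)
    n = len(samples) // win
    rms = np.array([np.sqrt(np.mean(samples[i * win:(i + 1) * win] ** 2))
                    for i in range(n)])
    threshold = factor * rms.mean()                # e.g. 1.5x overall average
    return [i * window_sec for i in range(n) if rms[i] > threshold]
```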
  • In the technical solution, a plurality of video highlight points are determined from the sample video according to the highlight extraction information, and the set of video highlight points of the sample video is obtained; the video highlight points identify the highlights of the video, which is helpful for users to perform video editing operations.
  • Fig. 6 is a flow chart of another video clipping method according to an embodiment of the present disclosure. As shown in Fig. 6, the method is used in the server 120 in Fig. 1 , including steps S601-S611.
  • step S601 the training sample data is obtained; the training sample data includes a plurality of video clip pairs; each of the video clip pairs includes a first video clip and a second video clip belonging to the same sample video; the first video The segment is a video segment with a preset duration before the video key point in the sample video; the second video segment is a video segment with a preset duration after the video key point in the sample video.
  • step S602 for each image content feature dimension, according to the image preprocessing method corresponding to the image content feature dimension, each image frame in the first video segment and the second video segment of each video segment pair is processed Adjust to get the adjusted image frame.
  • step S603 image feature extraction is performed on the adjusted image frame to obtain a plurality of image feature vectors.
  • step S604 the plurality of image feature vectors are spliced to obtain video feature vectors respectively corresponding to the first video segment and the second video segment; the video feature vectors are used to characterize the first video The segment and the video content feature corresponding to the second video segment.
  • step S605 the training sample data is used to train the content feature prediction model to be trained to obtain the content feature prediction model.
  • step S606 a selection instruction for the editing point of the original video is obtained, and a target video segment is extracted from the original video; the target video segment is a video segment of a preset duration before or after the editing point in the original video.
  • step S607 the video content features corresponding to the target video segment are input into the content feature prediction model to obtain predicted video content features.
  • step S608 according to the feature of the predicted video content, determine the video segment to be inserted from the set of video material segments; the matching degree between the feature of the video content corresponding to the video segment to be inserted and the feature of the predicted video content satisfies preset conditions.
  • step S609 the video segment to be inserted is fed back to the user.
  • step S610 according to the insertion selection information returned by the user, a target insertion video segment is determined from the plurality of video segments to be inserted.
  • step S611 the target insertion video segment is inserted before or after the clip point of the original video.
  • Although the steps in the flow charts of FIG. 2, FIG. 4, and FIG. 6 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in FIG. 2, FIG. 4, and FIG. 6 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; the execution sequence of these sub-steps or stages is not necessarily sequential either, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
  • Fig. 7 is a block diagram of a video clipping device according to an embodiment of the disclosure. Referring to Figure 7, the device includes:
  • the obtaining unit 701 is configured to obtain a selection instruction for an editing point of the original video, and extract a target video segment from the original video; the target video segment is a video segment of a preset duration before or after the editing point in the original video;
  • the prediction unit 702 is configured to input the video content features corresponding to the target video segment into the content feature prediction model to obtain the predicted video content features;
  • the video segment matching unit 703 is configured to determine the video segment to be inserted from the set of video material segments according to the predicted video content feature; the matching degree between the video content feature corresponding to the video segment to be inserted and the predicted video content feature meets the preset conditions;
  • the feedback unit 704 is configured to feed back the video segment to be inserted to the user, so as to insert the video segment to be inserted into the editing point of the original video.
  • the video clip matching unit 703 is specifically configured to determine matching-degree sorting results between a plurality of video content features and the predicted video content features; when the matching degree is greater than a preset threshold, determine that the matching degree between the video content features and the predicted video content features meets a preset condition; and use the video material segment corresponding to the video content feature as the video segment to be inserted.
  • there are a plurality of video clips to be inserted, and the feedback unit 704 is specifically configured to obtain preset feedback index information; sort the multiple video clips to be inserted according to the feedback index information to obtain a feedback sorting result; and, based on the feedback sorting result, feed back the plurality of video segments to be inserted.
  • the device further includes:
  • the target insertion video segment determination unit is configured to determine the target insertion video segment from the plurality of video segments to be inserted according to the insertion selection information returned by the user;
  • the target insertion video segment inserting unit is configured to insert the target insertion video segment before or after the cutting point of the original video.
  • Fig. 8 is a block diagram of an apparatus for obtaining a content feature prediction model according to an embodiment of the present disclosure.
  • the device includes:
  • the training sample data obtaining unit 901 is configured to obtain training sample data;
  • the training sample data includes a plurality of video clip pairs; each of the video clip pairs includes a first video clip and a second video clip belonging to the same sample video ;
  • the first video clip is a video clip of a preset duration before the video key point in the sample video;
  • the second video clip is a video clip of a preset duration after the video key point in the sample video ;
  • the model training unit 902 is configured to use the training sample data to train the content feature prediction model to be trained to obtain the content feature prediction model.
  • the model training unit is specifically configured to input the video content feature corresponding to the first video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the first video segment; and to adjust the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment, until the adjusted content feature prediction model meets the preset training conditions, whereupon the content feature prediction model is obtained;
  • the model training unit is specifically configured to input the video content feature corresponding to the second video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the second video segment; and to adjust the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment, until the adjusted content feature prediction model meets the preset training conditions, whereupon the content feature prediction model is obtained.
  • the device further includes:
  • the image preprocessing unit is configured to adjust, for each image content feature dimension and according to the image preprocessing method corresponding to that dimension, each image frame in the first video segment and the second video segment of each video segment pair, to obtain adjusted image frames;
  • the image feature extraction unit is configured to perform image feature extraction on the adjusted image frames to obtain a plurality of image feature vectors;
  • the splicing unit is configured to splice the plurality of image feature vectors to obtain video feature vectors respectively corresponding to the first video segment and the second video segment; the video feature vectors are used to characterize the video content features respectively corresponding to the first video segment and the second video segment.
  • the training sample data acquisition unit is specifically configured to acquire a set of video highlight points of the sample video; for each video highlight point, determine a first video segment of a preset duration before the video highlight point in the sample video and a second video segment of a preset duration after the video highlight point in the sample video; and obtain, according to the first video segment and the second video segment, the video segment pair corresponding to the video highlight point.
  • the training sample data acquisition unit is specifically configured to acquire preset highlight extraction information; the highlight extraction information is used to identify video highlight points according to the picture information, sound information, and text information in a video; according to the highlight extraction information, a plurality of video highlight points are determined from the sample video to obtain the set of video highlight points of the sample video.
  • Fig. 9 is a block diagram of an apparatus 800 for performing a video clipping method according to an embodiment of the present disclosure.
  • the electronic device 800 may be a server.
  • electronic device 800 includes a processing component 820, which further includes one or more processors, and memory resources represented by a memory 822 for storing instructions executable by the processing component 820, such as application programs.
  • the application program stored in the memory 822 may include one or more modules, each corresponding to a set of instructions.
  • the processing component 820 is configured to execute instructions to perform the above video clipping method.
  • the electronic device 800 may also include a power component 824 configured to perform power management of the electronic device 800, a wired or wireless network interface 826 configured to connect the electronic device 800 to a network, and an input/output (I/O) interface 828.
  • the electronic device 800 can operate based on an operating system stored in the memory 822, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
  • the processor of the electronic device 800 is configured to execute instructions, so as to implement the method for obtaining a content feature prediction model as described above.
  • a computer-readable storage medium including instructions is also provided, such as the memory 822 including instructions; the instructions can be executed by the processor of the electronic device 800 to complete the above video clipping method or the above method for obtaining a content feature prediction model.
  • the storage medium may be a non-volatile computer-readable storage medium; for example, the non-volatile computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • a computer program product is also provided; the computer program product includes instructions that can be executed by the processor of the electronic device 800 to complete the above video clipping method or the above method for obtaining a content feature prediction model.
  • a computer program is also provided; the computer program includes computer program code which, when run on a computer, causes the computer to execute the above methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Molecular Biology (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Television Signal Processing For Recording (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

The present disclosure relates to a video clipping method and apparatus, an electronic device, and a non-volatile computer-readable storage medium. The method includes: obtaining a selection instruction for an editing point of an original video, and extracting a target video segment from the original video, the target video segment being a video segment of a preset duration before or after the editing point in the original video; inputting the video content feature corresponding to the target video segment into a content feature prediction model to obtain a predicted video content feature; determining, according to the predicted video content feature, a video segment to be inserted from a set of video material segments, the degree of matching between the video content feature corresponding to the video segment to be inserted and the predicted video content feature meeting a preset condition; and feeding back the video segment to be inserted to a user, so that the video segment to be inserted can be inserted at the editing point of the original video.

Description

Video clipping method and apparatus, electronic device, and storage medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on and claims priority to Chinese patent application No. 202111211990.X filed on October 18, 2021, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of computer technology, and in particular to a video clipping method and apparatus, an electronic device, a storage medium, a computer program product, and a computer program.
BACKGROUND
At present, the usual approach to short-video clipping is to insert segments from several different short videos into one video, or to directly assemble a group of short-video segments into a single video. These approaches, however, require manual annotation to collect the video segments and rely on human editors to complete the mixed-cut operation; automated short-video mixed cutting is lacking, and the few automated mixed cuts that do exist are merely assembled video segments obtained by simple attribute aggregation, which cannot provide intelligent transitions between videos.
Automated intelligent mixed-cut technology therefore still needs to be improved.
SUMMARY
The present disclosure provides a video clipping method and apparatus, an electronic device, a storage medium, a computer program product, and a computer program, to at least solve the problem of the lack of automated intelligent mixed cutting. The technical solutions of the present disclosure are as follows:
According to a first aspect of the embodiments of the present disclosure, a video clipping method applied to an electronic device is provided, the method including:
obtaining a selection instruction for an editing point of an original video, and extracting a target video segment from the original video; the target video segment being a video segment of a preset duration before or after the editing point in the original video;
inputting the video content feature corresponding to the target video segment into a content feature prediction model to obtain a predicted video content feature;
determining, according to the predicted video content feature, a video segment to be inserted from a set of video material segments; the degree of matching between the video content feature corresponding to the video segment to be inserted and the predicted video content feature meeting a preset condition;
feeding back the video segment to be inserted to a user, so that the video segment to be inserted can be inserted at the editing point of the original video.
In an embodiment of the present disclosure, determining the video segment to be inserted from the set of video material segments according to the predicted video content feature includes:
determining, based on the video content features respectively corresponding to a plurality of video material segments in the set of video material segments, a ranking of the degrees of matching between the plurality of video content features and the predicted video content feature;
when a matching degree is greater than a preset threshold, determining that the matching degree between the video content feature and the predicted video content feature meets the preset condition;
taking the video material segment corresponding to the video content feature as the video segment to be inserted.
In an embodiment of the present disclosure, there are multiple video segments to be inserted, and feeding back the video segments to be inserted to the user includes:
obtaining preset feedback index information;
sorting the multiple video segments to be inserted according to the feedback index information to obtain a feedback sorting result;
feeding back the multiple video segments to be inserted based on the feedback sorting result.
In an embodiment of the present disclosure, the method further includes:
determining a target insertion video segment from the multiple video segments to be inserted according to insertion selection information returned by the user;
inserting the target insertion video segment before or after the editing point of the original video.
According to a second aspect of the embodiments of the present disclosure, a method for obtaining a content feature prediction model applied to an electronic device is provided, the method including:
obtaining training sample data; the training sample data including a plurality of video segment pairs; each video segment pair including a first video segment and a second video segment belonging to the same sample video; the first video segment being a video segment of a preset duration before a video key point in the sample video; the second video segment being a video segment of a preset duration after the video key point in the sample video;
training a content feature prediction model to be trained by using the training sample data, to obtain the content feature prediction model.
In an embodiment of the present disclosure, when the target video segment is a video segment of a preset duration before the editing point in the original video, training the content feature prediction model to be trained by using the training sample data to obtain the content feature prediction model includes:
inputting the video content feature corresponding to the first video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the first video segment;
adjusting the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment, until the adjusted content feature prediction model meets a preset training condition, to obtain the content feature prediction model;
when the target video segment is a video segment of a preset duration after the editing point in the original video, training the content feature prediction model to be trained by using the training sample data to obtain the content feature prediction model includes:
inputting the video content feature corresponding to the second video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the second video segment;
adjusting the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment, until the adjusted content feature prediction model meets the preset training condition, to obtain the content feature prediction model.
In an embodiment of the present disclosure, the method further includes:
for each image content feature dimension, adjusting each image frame in the first video segment and the second video segment of each video segment pair according to the image preprocessing method corresponding to that image content feature dimension, to obtain adjusted image frames;
performing image feature extraction on the adjusted image frames to obtain a plurality of image feature vectors;
splicing the plurality of image feature vectors to obtain video feature vectors respectively corresponding to the first video segment and the second video segment; the video feature vectors being used to characterize the video content features respectively corresponding to the first video segment and the second video segment.
In an embodiment of the present disclosure, obtaining the training sample data includes:
acquiring a set of video highlight points of a sample video;
for each video highlight point, determining a first video segment of a preset duration before the video highlight point in the sample video, and a second video segment of a preset duration after the video highlight point in the sample video;
obtaining, according to the first video segment and the second video segment, the video segment pair corresponding to the video highlight point.
In an embodiment of the present disclosure, acquiring the set of video highlight points of the sample video includes:
acquiring preset highlight extraction information; the highlight extraction information being used to identify video highlight points according to the picture information, sound information, and text information in a video;
determining a plurality of video highlight points from the sample video according to the highlight extraction information, to obtain the set of video highlight points of the sample video.
According to a third aspect of the embodiments of the present disclosure, a video clipping apparatus is provided, including:
an obtaining unit configured to obtain a selection instruction for an editing point of an original video and extract a target video segment from the original video; the target video segment being a video segment of a preset duration before or after the editing point in the original video;
a prediction unit configured to input the video content feature corresponding to the target video segment into a content feature prediction model to obtain a predicted video content feature;
a video segment matching unit configured to determine, according to the predicted video content feature, a video segment to be inserted from a set of video material segments; the degree of matching between the video content feature corresponding to the video segment to be inserted and the predicted video content feature meeting a preset condition;
a feedback unit configured to feed back the video segment to be inserted to a user, so that the video segment to be inserted can be inserted at the editing point of the original video.
In an embodiment of the present disclosure, the video segment matching unit is specifically configured to determine, based on the video content features respectively corresponding to a plurality of video material segments in the set of video material segments, a ranking of the degrees of matching between the plurality of video content features and the predicted video content feature; when a matching degree is greater than a preset threshold, determine that the matching degree between the video content feature and the predicted video content feature meets the preset condition; and take the video material segment corresponding to the video content feature as the video segment to be inserted.
In an embodiment of the present disclosure, there are multiple video segments to be inserted, and the feedback unit is specifically configured to obtain preset feedback index information; sort the multiple video segments to be inserted according to the feedback index information to obtain a feedback sorting result; and feed back the multiple video segments to be inserted based on the feedback sorting result.
In an embodiment of the present disclosure, the apparatus further includes:
a target insertion video segment determination unit configured to determine a target insertion video segment from the multiple video segments to be inserted according to insertion selection information returned by the user;
a target insertion video segment inserting unit configured to insert the target insertion video segment before or after the editing point of the original video.
According to a fourth aspect of the embodiments of the present disclosure, an apparatus for obtaining a content feature prediction model is provided, the apparatus including:
a training sample data acquisition unit configured to obtain training sample data; the training sample data including a plurality of video segment pairs; each video segment pair including a first video segment and a second video segment belonging to the same sample video; the first video segment being a video segment of a preset duration before a video key point in the sample video; the second video segment being a video segment of a preset duration after the video key point in the sample video;
a model training unit configured to train a content feature prediction model to be trained by using the training sample data, to obtain the content feature prediction model.
In an embodiment of the present disclosure, when the target video segment is a video segment of a preset duration before the editing point in the original video, the model training unit is specifically configured to input the video content feature corresponding to the first video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the first video segment; and adjust the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment, until the adjusted content feature prediction model meets a preset training condition, to obtain the content feature prediction model;
when the target video segment is a video segment of a preset duration after the editing point in the original video, the model training unit is specifically configured to input the video content feature corresponding to the second video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the second video segment; and adjust the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment, until the adjusted content feature prediction model meets the preset training condition, to obtain the content feature prediction model.
In an embodiment of the present disclosure, the apparatus further includes:
an image preprocessing unit configured to adjust, for each image content feature dimension and according to the image preprocessing method corresponding to that image content feature dimension, each image frame in the first video segment and the second video segment of each video segment pair, to obtain adjusted image frames;
an image feature extraction unit configured to perform image feature extraction on the adjusted image frames to obtain a plurality of image feature vectors;
a splicing unit configured to splice the plurality of image feature vectors to obtain video feature vectors respectively corresponding to the first video segment and the second video segment; the video feature vectors being used to characterize the video content features respectively corresponding to the first video segment and the second video segment.
In an embodiment of the present disclosure, the training sample data acquisition unit is specifically configured to acquire a set of video highlight points of the sample video; for each video highlight point, determine a first video segment of a preset duration before the video highlight point in the sample video and a second video segment of a preset duration after the video highlight point in the sample video; and obtain, according to the first video segment and the second video segment, the video segment pair corresponding to the video highlight point.
In an embodiment of the present disclosure, the training sample data acquisition unit is specifically configured to acquire preset highlight extraction information; the highlight extraction information being used to identify video highlight points according to the picture information, sound information, and text information in a video; and determine a plurality of video highlight points from the sample video according to the highlight extraction information, to obtain the set of video highlight points of the sample video.
According to a fifth aspect of the embodiments of the present disclosure, an electronic device is provided, including a memory and a processor, the memory storing a computer program, and the processor, when executing the computer program, implementing the video clipping method of the first aspect or any embodiment of the first aspect, or the method for obtaining a content feature prediction model of the second aspect or any embodiment of the second aspect.
According to a sixth aspect of the embodiments of the present disclosure, a non-volatile computer-readable storage medium is provided, on which a computer program is stored, the computer program, when executed by a processor, implementing the video clipping method of the first aspect or any embodiment of the first aspect, or the method for obtaining a content feature prediction model of the second aspect or any embodiment of the second aspect.
According to a seventh aspect of the embodiments of the present disclosure, a computer program product is provided, the program product including a computer program stored in a readable storage medium, at least one processor of a device reading and executing the computer program from the readable storage medium, so that the device performs the video clipping method of the first aspect or any embodiment of the first aspect, or the method for obtaining a content feature prediction model of the second aspect or any embodiment of the second aspect.
According to an eighth aspect of the embodiments of the present disclosure, a computer program is provided, the computer program including computer program code, the computer program code, when run on a computer, causing the computer to perform the video clipping method of the first aspect or any embodiment of the first aspect, or the method for obtaining a content feature prediction model of the second aspect or any embodiment of the second aspect.
By obtaining a selection instruction for an editing point of an original video, a target video segment is extracted from the original video, the target video segment being a video segment of a preset duration before or after the editing point in the original video; the video content feature corresponding to the target video segment is then input into a content feature prediction model to obtain a predicted video content feature; a video segment to be inserted is then determined from a set of video material segments according to the predicted video content feature, the degree of matching between the video content feature corresponding to the video segment to be inserted and the predicted video content feature meeting a preset condition; and the video segment to be inserted is fed back to the user so that it can be inserted at the editing point of the original video. In this way, the predicted video content feature can be obtained based on the video content feature corresponding to the target video segment, and the video segment to be inserted can be matched from the set of video material segments and fed back, which optimizes video clipping and makes the clipped video more natural and fluent, thereby providing intelligent video transitions and preventing the clipped video from appearing abrupt.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings herein are incorporated into and constitute a part of this specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure; they do not constitute an improper limitation of the present disclosure.
Fig. 1 is a diagram of an application environment of a video clipping method according to an embodiment of the present disclosure.
Fig. 2 is a flowchart of a video clipping method according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a processing flow of intelligent mixed-cut video editing according to an embodiment of the present disclosure.
Fig. 4 is a flowchart of obtaining a content feature prediction model according to an embodiment of the present disclosure.
Fig. 5a is a schematic diagram of model training according to an embodiment of the present disclosure.
Fig. 5b is a schematic diagram of a processing flow of training data preparation and model training according to an embodiment of the present disclosure.
Fig. 6 is a flowchart of another video clipping method according to an embodiment of the present disclosure.
Fig. 7 is a block diagram of a video clipping apparatus according to an embodiment of the present disclosure.
Fig. 8 is a block diagram of an apparatus for obtaining a content feature prediction model according to an embodiment of the present disclosure.
Fig. 9 is an internal structure diagram of a server according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
In order to enable those of ordinary skill in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings.
It should be noted that the terms "first", "second", and the like in the specification and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific sequence or order. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented in an order other than those illustrated or described herein. The implementations described in the following embodiments do not represent all implementations consistent with the present disclosure.
The video clipping method provided by the embodiments of the present disclosure can be applied in the application environment shown in Fig. 1, in which a client 110 interacts with a server 120 through a network. The server 120 obtains a selection instruction for an editing point of an original video, extracts a target video segment from the original video, inputs the video content feature corresponding to the target video segment into a content feature prediction model to obtain a predicted video content feature, and then determines, according to the predicted video content feature, a video segment to be inserted from a set of video material segments; the server 120 feeds the video segment to be inserted back to the client 110. In practical applications, the client 110 may be, but is not limited to, any of various personal computers, laptops, smartphones, tablet computers, and portable wearable devices, and the server 120 may be implemented as an independent server or as a server cluster composed of multiple servers.
Fig. 2 is a flowchart of a video clipping method according to an embodiment of the present disclosure; as shown in Fig. 2, the method is used in the server 120 of Fig. 1 and includes steps S210 to S240.
In step S210, a selection instruction for an editing point of an original video is obtained, and a target video segment is extracted from the original video; the target video segment is a video segment of a preset duration before or after the editing point in the original video.
Here, the original video may be the video into which a clipped segment is to be inserted; for example, the single base video currently being edited on the client may be used as the original video.
The target video segment may be a video segment of the original video whose adjoining video content is to be predicted; for example, based on the target video segment extracted from the original video, a video segment whose content connects with the content of the target video segment may be predicted.
In an embodiment of the present disclosure, the editing point may be the time position in the original video, specified by the client, at which a segment is to be inserted, e.g., an insertion time position p specified according to user requirements.
In a specific implementation, during video clipping, the server may receive a selection instruction for the editing point of the original video sent by the client, and then extract the target video segment from the original video according to the obtained selection instruction; the target video segment may be a video segment of a preset duration before or after the editing point in the original video.
For example, after the insertion time position p is determined, the video within a certain time interval [tp-n, tp] before the insertion time position may be extracted from the original video as the target video segment.
In an embodiment of the present disclosure, since the total duration of a short video is short, the preset duration n (i.e., the time interval) before the editing point may be chosen within the range of 10-15 s before the time position p; the preset duration n may also be another set value, which is not limited by the present disclosure.
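Purely as an illustration of this extraction step (not part of the disclosed embodiments), the interval [tp-n, tp] before an editing point p could be cut out as follows, assuming the ffmpeg command-line tool is installed; the helper name extract_target_segment is hypothetical:

```python
import subprocess

def extract_target_segment(src: str, dst: str, p: float, n: float = 10.0) -> None:
    """Cut the [max(0, p - n), p] interval before editing point p (seconds)."""
    start = max(0.0, p - n)
    duration = p - start
    # -ss/-t are standard ffmpeg flags: seek to `start`, keep `duration` seconds.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration),
         "-i", src, "-c", "copy", dst],
        check=True,
    )

# Example: 12 s of video before an editing point at t = 95 s.
# extract_target_segment("original.mp4", "target_segment.mp4", p=95.0, n=12.0)
```

Note that stream copying (`-c copy`) cuts on keyframe boundaries; re-encoding would be needed for frame-accurate cuts.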
In step S220, the video content feature corresponding to the target video segment is input into a content feature prediction model to obtain a predicted video content feature.
Here, the video content feature corresponding to the target video segment may be a feature vector sequence obtained by multi-dimensional feature extraction of the target video segment, which may be used to characterize the video content of the segment.
In a specific implementation, after the target video segment is obtained, multi-dimensional feature extraction may be performed on it to obtain the corresponding video content feature, which may then be input into the content feature prediction model to obtain the predicted video content feature; the video content characterized by the predicted video content feature may connect with the video content characterized by the video content feature corresponding to the target video segment.
In an embodiment of the present disclosure, the predicted video content feature may be a set of vector sequences. For example, based on the target video segment extracted at the insertion time position p, a candidate set Yp may be predicted by the pre-trained content feature prediction model; it may have multiple elements, where each element y may refer to a vector sequence, each vector may correspond to a video frame, and a complete vector sequence may correspond to a video segment; that is, the candidate set Yp may be the set of vector sequences corresponding to the predicted candidate video segments.
In step S230, a video segment to be inserted is determined from a set of video material segments according to the predicted video content feature; the degree of matching between the video content feature corresponding to the video segment to be inserted and the predicted video content feature meets a preset condition.
Here, the set of video material segments may be a group of video segments, each of which may correspond to a vector sequence characterizing the video content feature of that segment.
In a specific implementation, a search may be performed in the set of video material segments according to the predicted video content feature; through a similarity-matching process between the predicted video content feature and the vector sequences of the video content features corresponding to the segments in the set, vector sequences whose degree of matching with the predicted video content feature meets the preset condition can be found, and the video segments corresponding to those vector sequences can be taken as the video segments to be inserted.
In an embodiment of the present disclosure, the retrieved similarity-matching result may be the N video segments that match the predicted video content feature most closely, i.e., the video segments to be inserted.
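As an illustrative sketch of this top-N similarity search, assuming each segment's feature-vector sequence has been mean-pooled into one fixed-length vector and using cosine similarity as the matching degree (both are assumptions; the disclosure does not fix a particular similarity measure):

```python
import numpy as np

def top_n_matches(predicted: np.ndarray, library: dict, n: int = 10,
                  threshold: float = 0.8) -> list:
    """Return up to n segment ids whose cosine similarity to the predicted
    feature exceeds the preset threshold, sorted best-first."""
    p = predicted / np.linalg.norm(predicted)
    scores = []
    for seg_id, feat in library.items():
        f = feat / np.linalg.norm(feat)
        scores.append((seg_id, float(np.dot(p, f))))
    scores.sort(key=lambda s: s[1], reverse=True)
    return [(seg_id, s) for seg_id, s in scores[:n] if s > threshold]

# Example with a toy library of 512-d pooled features.
rng = np.random.default_rng(0)
library = {f"clip_{i:03d}": rng.normal(size=512) for i in range(100)}
candidates = top_n_matches(rng.normal(size=512), library, n=10, threshold=0.0)
```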
In step S240, the video segment to be inserted is fed back to the user, so that the video segment to be inserted can be inserted at the editing point of the original video.
After the video segment to be inserted is obtained, the server may feed it back to the client, and the segment may then be inserted at the editing point of the original video based on a user operation to obtain the mixed-cut edited video.
In the above video clipping method, a selection instruction for an editing point of the original video is obtained, a target video segment is extracted from the original video, the video content feature corresponding to the target video segment is input into a content feature prediction model to obtain a predicted video content feature, a video segment to be inserted is then determined from a set of video material segments according to the predicted video content feature, and the video segment to be inserted is fed back to the user so that it can be inserted at the editing point of the original video. In this way, the predicted video content feature can be obtained based on the video content feature corresponding to the target video segment, and the video segment to be inserted can be matched from the set of video material segments and fed back, which optimizes video clipping and makes the clipped video more natural and fluent, thereby providing intelligent video transitions and preventing the clipped video from appearing abrupt.
In an embodiment of the present disclosure, obtaining the selection instruction for the editing point of the original video and extracting the target video segment from the original video includes: obtaining the selection instruction for the editing point of the original video, and determining a time interval before or after the editing point in the original video; and extracting the target video segment from the original video based on the time interval.
In a specific implementation, the server may receive the selection instruction for the editing point of the original video sent by the client, determine a time interval before or after the editing point in the original video according to the obtained selection instruction, and, after obtaining the time interval, extract the target video segment corresponding to that time interval from the original video.
For example, the insertion time position p may be determined according to the selection instruction, the time interval [tp-n, tp] before the insertion time position may then be obtained based on the preset duration n, and the video segment corresponding to the time interval [tp-n, tp] may be extracted from the original video as the target video segment.
In the technical solution of this embodiment of the present disclosure, by obtaining the selection instruction for the editing point of the original video, determining the time interval before or after the editing point in the original video, and then extracting the target video segment from the original video based on the time interval, the target video segment can be accurately extracted from the original video according to user requirements, providing data support for the subsequent prediction of video content features.
In an embodiment of the present disclosure, determining the video segment to be inserted from the set of video material segments according to the predicted video content feature includes: determining, based on the video content features respectively corresponding to a plurality of video material segments in the set, a ranking of the degrees of matching between the plurality of video content features and the predicted video content feature; when a matching degree is greater than a preset threshold, determining that the matching degree between the video content feature and the predicted video content feature meets the preset condition; and taking the video material segment corresponding to the video content feature as the video segment to be inserted.
In a specific implementation, the set of video material segments contains a plurality of video material segments. A search may be performed in the set according to the predicted video content feature, based on the video content features respectively corresponding to the plurality of video material segments; through a similarity-matching process between the predicted video content feature and those video content features, video content features whose degree of matching with the predicted video content feature meets the preset condition can be found, and the video material segments corresponding to the retrieved video content features can be taken as the video segments to be inserted.
For example, when the predicted video content feature has 5 elements, for each element the 10 video segments with the highest similarity may be retrieved from the set of video material segments, and the video segments to be inserted may then be composed of the 10 highest-similarity segments for each of the 5 elements, i.e., 50 video segments.
In the technical solution of this embodiment of the present disclosure, after obtaining the predicted video content feature, the server determines, based on the video content features respectively corresponding to a plurality of video material segments in the set, a ranking of the degrees of matching between the plurality of video content features and the predicted video content feature; when a matching degree is greater than the preset threshold, it determines that the matching degree between the video content feature and the predicted video content feature meets the preset condition, and takes the video material segment corresponding to that video content feature as the video segment to be inserted. Video material segments with high similarity to the predicted video content feature can thus be effectively matched, improving the continuity of the video content.
In an embodiment of the present disclosure, there may be multiple video segments to be inserted, and feeding back the video segments to be inserted to the user includes: obtaining preset feedback index information; sorting the multiple video segments to be inserted according to the feedback index information to obtain a feedback sorting result; and feeding back the multiple video segments to be inserted based on the feedback sorting result.
Here, the feedback index information may include multiple specified indexes, such as relevance and highlight degree.
In a specific implementation, there may be multiple video segments to be inserted; they may be ranked by recommendation degree according to the preset feedback index information to obtain a feedback sorting result, and the server may then feed the multiple video segments to be inserted back to the client based on the feedback sorting result.
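Purely as a sketch of such index-based ranking (the index names and weights below are hypothetical and not specified by the disclosure):

```python
def rank_candidates(candidates, weights=None):
    """Sort candidate segments by a weighted sum of their feedback indexes.

    Each candidate is a dict such as
    {"id": "clip_001", "relevance": 0.92, "highlight": 0.61}.
    """
    weights = weights or {"relevance": 0.7, "highlight": 0.3}

    def score(c):
        return sum(w * c.get(k, 0.0) for k, w in weights.items())

    return sorted(candidates, key=score, reverse=True)
```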
In the technical solution of this embodiment of the present disclosure, there may be multiple video segments to be inserted; by obtaining preset feedback index information, sorting the multiple video segments to be inserted according to the feedback index information to obtain a feedback sorting result, and then feeding back the multiple video segments to be inserted based on the feedback sorting result, intelligent mixed-cut video material is provided to the user, which can make the clipped video more natural and fluent.
In an embodiment of the present disclosure, after the step of feeding back the video segments to be inserted to the user, the method further includes: determining a target insertion video segment from the multiple video segments to be inserted according to insertion selection information returned by the user; and inserting the target insertion video segment before or after the editing point of the original video.
In practical applications, after the video segments to be inserted are fed back to the user, a target insertion video segment may be determined from the multiple video segments to be inserted according to the insertion selection information returned by the user, and the target insertion video segment may then be inserted before or after the editing point of the original video. For example, the target insertion video segment may be determined according to the user's selection operation on the sorted video segments to be inserted, and may then be spliced into the original video to obtain the mixed-cut edited video.
In an embodiment of the present disclosure, when the target video segment whose adjoining content is to be predicted is a video segment of a preset duration before the editing point in the original video, the target insertion video segment may be inserted after the editing point of the original video; when the target video segment is a video segment of a preset duration after the editing point in the original video, the target insertion video segment may be inserted before the editing point of the original video.
In the technical solution of this embodiment of the present disclosure, by determining the target insertion video segment from the multiple video segments to be inserted according to the insertion selection information returned by the user, and inserting the target insertion video segment before or after the editing point of the original video, intelligent mixed cutting can be performed based on the user's selection, providing intelligent video transitions and making the clipped video more natural and fluent.
To facilitate understanding by those skilled in the art, Fig. 3 provides an exemplary schematic diagram of the processing flow of intelligent mixed-cut video editing; as shown in Fig. 3, the processing flow includes steps S301 to S307.
Specifically, in step S301, the user may specify, via the client, an insertion time position p (i.e., the editing point) in the base video (i.e., the original video); in step S302, the server may extract the video corresponding to the time interval [tp-n, tp] (i.e., the target video segment) from the existing video (i.e., the original video) according to the specified insertion time position p; in step S303, multi-dimensional feature extraction is performed on the video corresponding to the time interval [tp-n, tp] to obtain its video content feature; then, in step S304, a candidate set Yp (i.e., the predicted video content feature) may be generated by a generative deep learning model (i.e., the content feature prediction model); further, in step S305, for each element y in the candidate set Yp, a search is performed among the candidate clippable videos (i.e., the set of video material segments) to obtain a set Yy of video segments to be inserted (i.e., the video segments to be inserted); in step S306, the set Yy may be sorted according to the specified indexes and fed back according to the sorting result; and in step S307, the user is allowed to select from the sorted set Yy and perform the insertion operation.
Fig. 4 is a flowchart of a method for obtaining a content feature prediction model according to an embodiment of the present disclosure; as shown in Fig. 4, the method is used in the server 120 of Fig. 1 and includes steps S410 and S420.
In step S410, training sample data is obtained; the training sample data includes a plurality of video segment pairs; each video segment pair includes a first video segment and a second video segment belonging to the same sample video; the first video segment is a video segment of a preset duration before a video key point in the sample video; the second video segment is a video segment of a preset duration after the video key point in the sample video.
In a specific implementation, before obtaining the selection instruction for the editing point of the original video and extracting the target video segment from the original video, the server also needs to train the above content feature prediction model. Training sample data may be obtained, which may include a plurality of video segment pairs; each video segment pair may include a first video segment and a second video segment belonging to the same sample video, where the first video segment may be a video segment of a preset duration before a video key point in the sample video and the second video segment may be a video segment of a preset duration after the video key point in the sample video.
In an embodiment of the present disclosure, the content feature prediction model may be a generative deep learning model, which may adopt a VAE, a GAN, or variants thereof; for example, recurrent neural networks such as a Bidirectional RNN, a Deep (Bidirectional) RNN, or an LSTM may be used, as well as a convolutional neural network (CNN).
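As a minimal, illustrative sketch only (not the disclosed model itself), one of the recurrent architectures mentioned above could look as follows, assuming PyTorch is available; the layer sizes and the name FeaturePredictor are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

class FeaturePredictor(nn.Module):
    """Map the feature-vector sequence of the target segment to a predicted
    feature-vector sequence for the segment that should adjoin it."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, x: torch.Tensor, out_len: int) -> torch.Tensor:
        # x: (batch, in_len, feat_dim) -> (batch, out_len, feat_dim)
        _, (h, c) = self.encoder(x)
        # Feed the encoder's final hidden state as a repeated decoder input.
        dec_in = h[-1].unsqueeze(1).repeat(1, out_len, 1)
        dec_out, _ = self.decoder(dec_in, (h, c))
        return self.out(dec_out)
```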
In step S420, the content feature prediction model to be trained is trained by using the training sample data, to obtain the content feature prediction model.
In practical applications, the server may train the content feature prediction model to be trained by using the training sample data to obtain the content feature prediction model; specifically, the model may be trained based on the first video segment and the second video segment of each video segment pair.
In the technical solution of this embodiment of the present disclosure, by obtaining training sample data and using it to train the content feature prediction model to be trained to obtain the content feature prediction model, video content prediction can be performed based on the pre-trained content feature prediction model, which optimizes video clipping and provides intelligent transitions in the clipped video.
In an embodiment of the present disclosure, when the target video segment is a video segment of a preset duration before the editing point in the original video, training the content feature prediction model to be trained by using the training sample data to obtain the content feature prediction model includes:
inputting the video content feature corresponding to the first video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the first video segment;
adjusting the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment, until the adjusted content feature prediction model meets a preset training condition, to obtain the content feature prediction model;
when the target video segment is a video segment of a preset duration after the editing point in the original video, training the content feature prediction model to be trained by using the training sample data to obtain the content feature prediction model includes:
inputting the video content feature corresponding to the second video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the second video segment;
adjusting the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment, until the adjusted content feature prediction model meets the preset training condition, to obtain the content feature prediction model.
In a specific implementation, if the target video segment is a video segment of a preset duration before the editing point in the original video, during model training the video content feature corresponding to the first video segment may be input into the content feature prediction model to be trained to obtain the predicted video content feature corresponding to the first video segment, and the model parameters of the content feature prediction model to be trained may be adjusted based on the difference between the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment, until the adjusted content feature prediction model meets the preset training condition, whereupon the content feature prediction model is obtained.
If the target video segment is a video segment of a preset duration after the editing point in the original video, during model training the video content feature corresponding to the second video segment may be input into the content feature prediction model to be trained to obtain the predicted video content feature corresponding to the second video segment, and the model parameters of the content feature prediction model to be trained may be adjusted based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment, until the adjusted content feature prediction model meets the preset training condition, whereupon the content feature prediction model is obtained.
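As an illustration of this parameter-adjustment loop, a minimal sketch follows, reusing the hypothetical FeaturePredictor above; using mean squared error as the "difference" is my assumption, since the disclosure does not fix a particular loss:

```python
import torch
import torch.nn as nn

def train(model, pairs, epochs: int = 10, lr: float = 1e-3):
    """pairs: iterable of (first_feat, second_feat) tensors, each shaped
    (batch, seq_len, feat_dim); the first segment's features are the input
    and the second segment's features are the training target."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        for first_feat, second_feat in pairs:
            pred = model(first_feat, out_len=second_feat.shape[1])
            loss = loss_fn(pred, second_feat)  # difference between predicted
            opt.zero_grad()                    # and actual "after" features
            loss.backward()
            opt.step()
```

For a target segment located after the editing point, the roles of the two tensors would simply be swapped.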
In the technical solution of this embodiment of the present disclosure, when the target video segment is a video segment of a preset duration before the editing point in the original video, the video content feature corresponding to the first video segment is input into the content feature prediction model to be trained to obtain the predicted video content feature corresponding to the first video segment, and the model parameters are adjusted based on the difference between that predicted video content feature and the video content feature corresponding to the second video segment until the adjusted model meets the preset training condition, whereupon the content feature prediction model is obtained; when the target video segment is a video segment of a preset duration after the editing point in the original video, the video content feature corresponding to the second video segment is input into the model to obtain the predicted video content feature corresponding to the second video segment, and the model parameters are adjusted based on the difference between that predicted video content feature and the video content feature corresponding to the first video segment until the adjusted model meets the preset training condition, whereupon the content feature prediction model is obtained. In this way, video content prediction can be performed effectively for the video segment before or after the editing point, improving the video clipping effect.
In an embodiment of the present disclosure, after the step of obtaining the training sample data, the method further includes: for each image content feature dimension, adjusting each image frame in the first video segment and the second video segment of each video segment pair according to the image preprocessing method corresponding to that image content feature dimension, to obtain adjusted image frames; performing image feature extraction on the adjusted image frames to obtain a plurality of image feature vectors; and splicing the plurality of image feature vectors to obtain video feature vectors respectively corresponding to the first video segment and the second video segment, the video feature vectors being used to characterize the video content features respectively corresponding to the first video segment and the second video segment.
In a specific implementation, since the video content and video quality of the videos being mix-cut are inconsistent, in order to enhance the generalization ability of the content feature prediction model, after the training sample data is obtained, the picture sequence corresponding to each video segment pair in the training sample data may be preprocessed along multiple dimensions: for each image content feature dimension, each image frame in the first and second video segments of each pair is adjusted according to the image preprocessing method corresponding to that dimension to obtain adjusted image frames; image feature extraction is then performed on the adjusted frames to obtain a plurality of image feature vectors, which may then be spliced to obtain the video feature vectors respectively corresponding to the first and second video segments.
In an embodiment of the present disclosure, the image feature extraction process may be as follows: convert the video segment into a picture sequence, then perform image feature extraction on each picture in the sequence by using a convolutional neural network to obtain an image feature vector; by splicing the image feature vectors corresponding to multiple pictures, the video feature vector corresponding to the video segment, e.g., a feature vector sequence, can be obtained.
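Purely as an illustrative sketch of such frame-by-frame extraction and splicing (assuming PyTorch, torchvision, and OpenCV are available; a pretrained ResNet-18 stands in here for the unspecified convolutional neural network):

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

# Drop the classification head so the network outputs a 512-d feature vector.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def video_feature_sequence(path: str, stride: int = 10) -> torch.Tensor:
    """Sample every `stride`-th frame and return the spliced
    (num_frames, 512) feature-vector sequence for the segment."""
    cap = cv2.VideoCapture(path)
    feats, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            feats.append(backbone(preprocess(rgb).unsqueeze(0)).squeeze(0))
        i += 1
    cap.release()
    return torch.stack(feats)  # per-frame vectors, spliced in frame order
```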
For example, the multiple dimensions may include whether the background is included (included, not included), whether picture color is ignored (yes, no), whether only people are included (included, not included), and whether only moving objects are considered (yes, no), where including the background and not including the background may serve as two dimensions. For each dimension, multi-dimensional feature extraction may be performed on the video segment pair to obtain the video content features, e.g., feature vector sequences, respectively corresponding to the first and second video segments of the pair, whereupon the content feature prediction model can be trained.
As shown in Fig. 5a, for dimension 1, based on the first video segment of the video segment pair, e.g., the video segment of a preset duration before the video key point in the sample video, input-data splicing may be performed on the dimension-1 input feature data (i.e., the multiple image feature vectors corresponding to the first video segment) to obtain the video content feature corresponding to the first video segment; and based on the second video segment of the pair, e.g., the video segment of a preset duration after the video key point in the sample video, output-data splicing may be performed on the dimension-1 output feature data (i.e., the multiple image feature vectors corresponding to the second video segment) to obtain the video content feature corresponding to the second video segment; the generative deep learning model (i.e., the content feature prediction model to be trained) may then be trained according to the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment.
In the technical solution of this embodiment of the present disclosure, for each image content feature dimension, each image frame in the first and second video segments of each video segment pair is adjusted according to the image preprocessing method corresponding to that dimension to obtain adjusted image frames; image feature extraction is then performed on the adjusted frames to obtain a plurality of image feature vectors, which are spliced to obtain the video feature vectors respectively corresponding to the first and second video segments. Model training can thus be performed based on multiple image content feature dimensions, enhancing the generalization ability of the content feature prediction model.
In an embodiment of the present disclosure, obtaining the training sample data includes: acquiring a set of video highlight points of a sample video; for each video highlight point, determining a first video segment of a preset duration before the video highlight point in the sample video and a second video segment of a preset duration after the video highlight point in the sample video; and obtaining, according to the first video segment and the second video segment, the video segment pair corresponding to the video highlight point.
In a specific implementation, the set of video highlight points of the sample video is acquired; for each video highlight point, a first video segment of a preset duration before the highlight point in the sample video and a second video segment of a preset duration after the highlight point are determined; and the video segment pair corresponding to the highlight point can then be obtained from the first and second video segments.
In the technical solution of this embodiment of the present disclosure, by acquiring the set of video highlight points of the sample video, then determining, for each video highlight point, a first video segment of a preset duration before the highlight point in the sample video and a second video segment of a preset duration after the highlight point, and obtaining the video segment pair corresponding to the highlight point from the first and second video segments, the video segments to be used for training can be accurately obtained based on the video highlight points, providing data support for model training.
To facilitate understanding by those skilled in the art, Fig. 5b provides an exemplary schematic diagram of the processing flow of training data preparation and model training; as shown in Fig. 5b, a key point set K (i.e., the set of video highlight points of the sample video) is extracted from an existing video (i.e., the sample video); for each key point k (i.e., video highlight point) in the key point set K, a training pair <xk, yk> (i.e., a video segment pair) may be extracted from the existing video, where xk is the video of the time interval [tk-n, tk] (i.e., the first video segment) and yk is the video of the time interval [tk, tk+n] (i.e., the second video segment); multi-dimensional feature extraction may then be performed on the training pair <xk, yk> to obtain the corresponding video feature vectors, so as to train the generative deep learning model (i.e., the content feature prediction model to be trained).
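A small sketch of this training-pair construction, reusing the hypothetical extract_target_segment helper from the earlier sketch (the key-point list is assumed to come from the highlight-point extraction described below):

```python
def build_training_pairs(video: str, key_points: list, n: float = 10.0):
    """For each key point t_k, cut x_k = [t_k - n, t_k] and y_k = [t_k, t_k + n]."""
    pairs = []
    for idx, t_k in enumerate(key_points):
        x_path = f"pair{idx}_x.mp4"
        y_path = f"pair{idx}_y.mp4"
        extract_target_segment(video, x_path, p=t_k, n=n)      # n seconds before
        extract_target_segment(video, y_path, p=t_k + n, n=n)  # n seconds after
        pairs.append((x_path, y_path))
    return pairs
```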
In an embodiment of the present disclosure, acquiring the set of video highlight points of the sample video includes: acquiring preset highlight extraction information, the highlight extraction information being used to identify video highlight points according to the picture information, sound information, and text information in a video; and determining a plurality of video highlight points from the sample video according to the highlight extraction information, to obtain the set of video highlight points of the sample video.
Here, a video highlight point may be the temporal center of a highlight segment in the video.
In a specific implementation, by acquiring the preset highlight extraction information, a plurality of video highlight points can be identified from the sample video according to the picture information, sound information, and text information in the video, and the set of video highlight points of the sample video can then be obtained.
For example, since short videos are short in duration, in order to attract users the most exciting parts of the video need to be found; the following methods may be used to extract video highlight points (a minimal sketch of the loudness-based method follows this list):
1. Identify video highlight points by training a visual recognition model: taking a football match as an example, the video highlight points may be the time points corresponding to video frames showing shots on goal, goals, and red or yellow cards;
2. Identify video highlight points by training an acoustic recognition model: taking a football match as an example, the parts where the loudness of the sound exceeds a threshold (e.g., a threshold of 1.5 times the mean loudness of the whole audio) may be identified as highlight segments, and the video highlight points may then be the time points at which the loudness exceeds the threshold;
3. Through ASR (Automatic Speech Recognition) technology, the speech in the audio can be converted into text, and video highlight points can then be identified by recognizing keywords in the text, such as "goal", "red card", and "yellow card".
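As a minimal sketch of the loudness rule in method 2 above, assuming the audio track has already been decoded into a mono PCM array (e.g., with soundfile or librosa) and using windowed RMS energy as the loudness measure:

```python
import numpy as np

def loudness_highlight_points(samples: np.ndarray, sr: int,
                              win_s: float = 1.0,
                              factor: float = 1.5) -> list:
    """Return the times (s) of windows whose RMS loudness exceeds
    `factor` times the mean loudness of the whole audio."""
    win = int(win_s * sr)
    n_win = len(samples) // win
    rms = np.array([
        np.sqrt(np.mean(samples[i * win:(i + 1) * win] ** 2))
        for i in range(n_win)
    ])
    threshold = factor * rms.mean()
    # Report the center of each over-threshold window as a highlight point.
    return [(i + 0.5) * win_s for i, r in enumerate(rms) if r > threshold]
```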
In the technical solution of this embodiment of the present disclosure, by acquiring the preset highlight extraction information and then determining a plurality of video highlight points from the sample video according to it to obtain the set of video highlight points of the sample video, video highlight points can be determined for the highlight segments in the video, which helps the user perform video clipping operations.
Fig. 6 is a flowchart of another video clipping method according to an embodiment of the present disclosure; as shown in Fig. 6, the method is used in the server 120 of Fig. 1 and includes steps S601 to S611.
In step S601, training sample data is obtained; the training sample data includes a plurality of video segment pairs; each video segment pair includes a first video segment and a second video segment belonging to the same sample video; the first video segment is a video segment of a preset duration before a video key point in the sample video; the second video segment is a video segment of a preset duration after the video key point in the sample video. In step S602, for each image content feature dimension, each image frame in the first video segment and the second video segment of each video segment pair is adjusted according to the image preprocessing method corresponding to that image content feature dimension, to obtain adjusted image frames. In step S603, image feature extraction is performed on the adjusted image frames to obtain a plurality of image feature vectors. In step S604, the plurality of image feature vectors are spliced to obtain video feature vectors respectively corresponding to the first video segment and the second video segment; the video feature vectors are used to characterize the video content features respectively corresponding to the first video segment and the second video segment. In step S605, the content feature prediction model to be trained is trained by using the training sample data, to obtain the content feature prediction model. In step S606, a selection instruction for an editing point of an original video is obtained, and a target video segment is extracted from the original video; the target video segment is a video segment of a preset duration before or after the editing point in the original video. In step S607, the video content feature corresponding to the target video segment is input into the content feature prediction model to obtain a predicted video content feature. In step S608, a video segment to be inserted is determined from a set of video material segments according to the predicted video content feature; the degree of matching between the video content feature corresponding to the video segment to be inserted and the predicted video content feature meets a preset condition. In step S609, the video segments to be inserted are fed back to the user. In step S610, a target insertion video segment is determined from the multiple video segments to be inserted according to the insertion selection information returned by the user. In step S611, the target insertion video segment is inserted before or after the editing point of the original video. It should be noted that for the specific limitations of the above steps, reference may be made to the specific limitations of the video clipping method of the embodiments of the present disclosure above, which will not be repeated here.
It should be understood that although the steps in the flowcharts of Fig. 2, Fig. 4, and Fig. 6 are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in Fig. 2, Fig. 4, and Fig. 6 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
Fig. 7 is a block diagram of a video clipping apparatus according to an embodiment of the present disclosure. Referring to Fig. 7, the apparatus includes:
an obtaining unit 701 configured to obtain a selection instruction for an editing point of an original video and extract a target video segment from the original video; the target video segment being a video segment of a preset duration before or after the editing point in the original video;
a prediction unit 702 configured to input the video content feature corresponding to the target video segment into a content feature prediction model to obtain a predicted video content feature;
a video segment matching unit 703 configured to determine, according to the predicted video content feature, a video segment to be inserted from a set of video material segments; the degree of matching between the video content feature corresponding to the video segment to be inserted and the predicted video content feature meeting a preset condition;
a feedback unit 704 configured to feed back the video segment to be inserted to a user, so that the video segment to be inserted can be inserted at the editing point of the original video.
In an embodiment of the present disclosure, the video segment matching unit 703 is specifically configured to determine, based on the video content features respectively corresponding to a plurality of video material segments in the set of video material segments, a ranking of the degrees of matching between the plurality of video content features and the predicted video content feature; when a matching degree is greater than a preset threshold, determine that the matching degree between the video content feature and the predicted video content feature meets the preset condition; and take the video material segment corresponding to the video content feature as the video segment to be inserted.
In an embodiment of the present disclosure, there are multiple video segments to be inserted, and the feedback unit 704 is specifically configured to obtain preset feedback index information; sort the multiple video segments to be inserted according to the feedback index information to obtain a feedback sorting result; and feed back the multiple video segments to be inserted based on the feedback sorting result.
In an embodiment of the present disclosure, the apparatus further includes:
a target insertion video segment determination unit configured to determine a target insertion video segment from the multiple video segments to be inserted according to insertion selection information returned by the user;
a target insertion video segment inserting unit configured to insert the target insertion video segment before or after the editing point of the original video.
Fig. 8 is a block diagram of an apparatus for obtaining a content feature prediction model according to an embodiment of the present disclosure. Referring to Fig. 8, the apparatus includes:
a training sample data acquisition unit 901 configured to obtain training sample data; the training sample data including a plurality of video segment pairs; each video segment pair including a first video segment and a second video segment belonging to the same sample video; the first video segment being a video segment of a preset duration before a video key point in the sample video; the second video segment being a video segment of a preset duration after the video key point in the sample video;
a model training unit 902 configured to train a content feature prediction model to be trained by using the training sample data, to obtain the content feature prediction model.
In an embodiment of the present disclosure, when the target video segment is a video segment of a preset duration before the editing point in the original video, the model training unit is specifically configured to input the video content feature corresponding to the first video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the first video segment; and adjust the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment, until the adjusted content feature prediction model meets a preset training condition, to obtain the content feature prediction model;
when the target video segment is a video segment of a preset duration after the editing point in the original video, the model training unit is specifically configured to input the video content feature corresponding to the second video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the second video segment; and adjust the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment, until the adjusted content feature prediction model meets the preset training condition, to obtain the content feature prediction model.
In an embodiment of the present disclosure, the apparatus further includes:
an image preprocessing unit configured to adjust, for each image content feature dimension and according to the image preprocessing method corresponding to that image content feature dimension, each image frame in the first video segment and the second video segment of each video segment pair, to obtain adjusted image frames;
an image feature extraction unit configured to perform image feature extraction on the adjusted image frames to obtain a plurality of image feature vectors;
a splicing unit configured to splice the plurality of image feature vectors to obtain video feature vectors respectively corresponding to the first video segment and the second video segment; the video feature vectors being used to characterize the video content features respectively corresponding to the first video segment and the second video segment.
In an embodiment of the present disclosure, the training sample data acquisition unit is specifically configured to acquire a set of video highlight points of the sample video; for each video highlight point, determine a first video segment of a preset duration before the video highlight point in the sample video and a second video segment of a preset duration after the video highlight point in the sample video; and obtain, according to the first video segment and the second video segment, the video segment pair corresponding to the video highlight point.
In an embodiment of the present disclosure, the training sample data acquisition unit is specifically configured to acquire preset highlight extraction information; the highlight extraction information being used to identify video highlight points according to the picture information, sound information, and text information in a video; and determine a plurality of video highlight points from the sample video according to the highlight extraction information, to obtain the set of video highlight points of the sample video.
With regard to the apparatuses in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments relating to the method, and will not be elaborated here.
Fig. 9 is a block diagram of a device 800 for performing a video clipping method according to an embodiment of the present disclosure. For example, the electronic device 800 may be a server. Referring to Fig. 9, the electronic device 800 includes a processing component 820, which further includes one or more processors, and memory resources represented by a memory 822 for storing instructions executable by the processing component 820, such as application programs. The application program stored in the memory 822 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 820 is configured to execute the instructions to perform the above video clipping method.
The electronic device 800 may also include a power component 824 configured to perform power management of the electronic device 800, a wired or wireless network interface 826 configured to connect the electronic device 800 to a network, and an input/output (I/O) interface 828. The electronic device 800 can operate based on an operating system stored in the memory 822, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In an embodiment of the present disclosure, the processor of the electronic device 800 is configured to execute instructions to implement the method for obtaining a content feature prediction model as described above.
In an embodiment of the present disclosure, a computer-readable storage medium including instructions is also provided, for example the memory 822 including instructions; the instructions can be executed by the processor of the electronic device 800 to complete the above video clipping method or the above method for obtaining a content feature prediction model. The storage medium may be a non-volatile computer-readable storage medium; for example, the non-volatile computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an embodiment of the present disclosure, a computer program product is also provided; the computer program product includes instructions that can be executed by the processor of the electronic device 800 to complete the above video clipping method or the above method for obtaining a content feature prediction model.
In an embodiment of the present disclosure, a computer program is also provided; the computer program includes computer program code which, when run on a computer, causes the computer to execute the above methods.
It should be noted that, based on the description of the method embodiments, the above apparatus, electronic device, non-volatile computer-readable storage medium, computer program product, computer program, and so on may also include other implementations; for the specific implementations, reference may be made to the description of the related method embodiments, which will not be repeated one by one here.
Those skilled in the art will readily conceive of other embodiments of the present disclosure after considering the specification and practicing the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed by the present disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (22)

  1. A video clipping method, applied to an electronic device, characterized in that the method comprises:
    obtaining a selection instruction for an editing point of an original video, and extracting a target video segment from the original video; the target video segment being a video segment of a preset duration before or after the editing point in the original video;
    inputting the video content feature corresponding to the target video segment into a content feature prediction model to obtain a predicted video content feature;
    determining, according to the predicted video content feature, a video segment to be inserted from a set of video material segments; the degree of matching between the video content feature corresponding to the video segment to be inserted and the predicted video content feature meeting a preset condition;
    feeding back the video segment to be inserted to a user, so that the video segment to be inserted can be inserted at the editing point of the original video.
  2. The method according to claim 1, characterized in that determining the video segment to be inserted from the set of video material segments according to the predicted video content feature comprises:
    determining, based on the video content features respectively corresponding to a plurality of video material segments in the set of video material segments, a ranking of the degrees of matching between the plurality of video content features and the predicted video content feature;
    when a matching degree is greater than a preset threshold, determining that the matching degree between the video content feature and the predicted video content feature meets the preset condition;
    taking the video material segment corresponding to the video content feature as the video segment to be inserted.
  3. The method according to claim 1 or 2, characterized in that there are multiple video segments to be inserted, and feeding back the video segments to be inserted to the user comprises:
    obtaining preset feedback index information;
    sorting the multiple video segments to be inserted according to the feedback index information to obtain a feedback sorting result;
    feeding back the multiple video segments to be inserted based on the feedback sorting result.
  4. The method according to claim 3, characterized in that the method further comprises:
    determining a target insertion video segment from the multiple video segments to be inserted according to insertion selection information returned by the user;
    inserting the target insertion video segment before or after the editing point of the original video.
  5. A method for obtaining a content feature prediction model, applied to an electronic device, characterized by comprising:
    obtaining training sample data; the training sample data including a plurality of video segment pairs; each video segment pair including a first video segment and a second video segment belonging to the same sample video; the first video segment being a video segment of a preset duration before a video key point in the sample video; the second video segment being a video segment of a preset duration after the video key point in the sample video;
    training a content feature prediction model to be trained by using the training sample data, to obtain the content feature prediction model.
  6. The method according to claim 5, characterized in that, when a target video segment is a video segment of a preset duration before an editing point in an original video, training the content feature prediction model to be trained by using the training sample data to obtain the content feature prediction model comprises:
    inputting the video content feature corresponding to the first video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the first video segment;
    adjusting the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment, until the adjusted content feature prediction model meets a preset training condition, to obtain the content feature prediction model;
    when the target video segment is a video segment of a preset duration after the editing point in the original video, training the content feature prediction model to be trained by using the training sample data to obtain the content feature prediction model comprises:
    inputting the video content feature corresponding to the second video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the second video segment;
    adjusting the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment, until the adjusted content feature prediction model meets the preset training condition, to obtain the content feature prediction model.
  7. The method according to claim 5, characterized in that the method further comprises:
    for each image content feature dimension, adjusting each image frame in the first video segment and the second video segment of each video segment pair according to the image preprocessing method corresponding to that image content feature dimension, to obtain adjusted image frames;
    performing image feature extraction on the adjusted image frames to obtain a plurality of image feature vectors;
    splicing the plurality of image feature vectors to obtain video feature vectors respectively corresponding to the first video segment and the second video segment; the video feature vectors being used to characterize the video content features respectively corresponding to the first video segment and the second video segment.
  8. The method according to claim 5, characterized in that obtaining the training sample data comprises:
    acquiring a set of video highlight points of a sample video;
    for each video highlight point, determining a first video segment of a preset duration before the video highlight point in the sample video, and a second video segment of a preset duration after the video highlight point in the sample video;
    obtaining, according to the first video segment and the second video segment, the video segment pair corresponding to the video highlight point.
  9. The method according to claim 8, characterized in that acquiring the set of video highlight points of the sample video comprises:
    acquiring preset highlight extraction information; the highlight extraction information being used to identify video highlight points according to the picture information, sound information, and text information in a video;
    determining a plurality of video highlight points from the sample video according to the highlight extraction information, to obtain the set of video highlight points of the sample video.
  10. A video clipping apparatus, characterized by comprising:
    an obtaining unit configured to obtain a selection instruction for an editing point of an original video and extract a target video segment from the original video; the target video segment being a video segment of a preset duration before or after the editing point in the original video;
    a prediction unit configured to input the video content feature corresponding to the target video segment into a content feature prediction model to obtain a predicted video content feature;
    a video segment matching unit configured to determine, according to the predicted video content feature, a video segment to be inserted from a set of video material segments; the degree of matching between the video content feature corresponding to the video segment to be inserted and the predicted video content feature meeting a preset condition;
    a feedback unit configured to feed back the video segment to be inserted to a user, so that the video segment to be inserted can be inserted at the editing point of the original video.
  11. The apparatus according to claim 10, characterized in that the video segment matching unit is specifically configured to: determine, based on the video content features respectively corresponding to a plurality of video material segments in the set of video material segments, a ranking of the degrees of matching between the plurality of video content features and the predicted video content feature; when a matching degree is greater than a preset threshold, determine that the matching degree between the video content feature and the predicted video content feature meets the preset condition; and take the video material segment corresponding to the video content feature as the video segment to be inserted.
  12. The apparatus according to claim 10 or 11, characterized in that there are multiple video segments to be inserted, and the feedback unit is specifically configured to: obtain preset feedback index information; sort the multiple video segments to be inserted according to the feedback index information to obtain a feedback sorting result; and feed back the multiple video segments to be inserted based on the feedback sorting result.
  13. The apparatus according to claim 12, characterized in that the apparatus further comprises:
    a target insertion video segment determination unit configured to determine a target insertion video segment from the multiple video segments to be inserted according to insertion selection information returned by the user;
    a target insertion video segment inserting unit configured to insert the target insertion video segment before or after the editing point of the original video.
  14. An apparatus for obtaining a content feature prediction model, characterized in that the apparatus comprises:
    a training sample data acquisition unit configured to obtain training sample data; the training sample data including a plurality of video segment pairs; each video segment pair including a first video segment and a second video segment belonging to the same sample video; the first video segment being a video segment of a preset duration before a video key point in the sample video; the second video segment being a video segment of a preset duration after the video key point in the sample video;
    a model training unit configured to train a content feature prediction model to be trained by using the training sample data, to obtain the content feature prediction model.
  15. The apparatus according to claim 14, characterized in that, when a target video segment is a video segment of a preset duration before an editing point in an original video, the model training unit is specifically configured to: input the video content feature corresponding to the first video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the first video segment; and adjust the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the first video segment and the video content feature corresponding to the second video segment, until the adjusted content feature prediction model meets a preset training condition, to obtain the content feature prediction model;
    when the target video segment is a video segment of a preset duration after the editing point in the original video, the model training unit is specifically configured to: input the video content feature corresponding to the second video segment into the content feature prediction model to be trained, to obtain the predicted video content feature corresponding to the second video segment; and adjust the model parameters of the content feature prediction model to be trained based on the difference between the predicted video content feature corresponding to the second video segment and the video content feature corresponding to the first video segment, until the adjusted content feature prediction model meets the preset training condition, to obtain the content feature prediction model.
  16. The apparatus according to claim 14, characterized in that the apparatus further comprises:
    an image preprocessing unit configured to adjust, for each image content feature dimension and according to the image preprocessing method corresponding to that image content feature dimension, each image frame in the first video segment and the second video segment of each video segment pair, to obtain adjusted image frames;
    an image feature extraction unit configured to perform image feature extraction on the adjusted image frames to obtain a plurality of image feature vectors;
    a splicing unit configured to splice the plurality of image feature vectors to obtain video feature vectors respectively corresponding to the first video segment and the second video segment; the video feature vectors being used to characterize the video content features respectively corresponding to the first video segment and the second video segment.
  17. The apparatus according to claim 14, characterized in that the training sample data acquisition unit is specifically configured to: acquire a set of video highlight points of the sample video; for each video highlight point, determine a first video segment of a preset duration before the video highlight point in the sample video and a second video segment of a preset duration after the video highlight point in the sample video; and obtain, according to the first video segment and the second video segment, the video segment pair corresponding to the video highlight point.
  18. The apparatus according to claim 17, characterized in that the training sample data acquisition unit is specifically configured to: acquire preset highlight extraction information; the highlight extraction information being used to identify video highlight points according to the picture information, sound information, and text information in a video; and determine a plurality of video highlight points from the sample video according to the highlight extraction information, to obtain the set of video highlight points of the sample video.
  19. An electronic device, characterized by comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to execute the instructions to implement the video clipping method according to any one of claims 1 to 4 or the method for obtaining a content feature prediction model according to any one of claims 5 to 9.
  20. A non-volatile computer-readable storage medium, wherein, when instructions in the non-volatile computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the video clipping method according to any one of claims 1 to 4 or the method for obtaining a content feature prediction model according to any one of claims 5 to 9.
  21. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the video clipping method according to any one of claims 1 to 4 or the method for obtaining a content feature prediction model according to any one of claims 5 to 9.
  22. A computer program, characterized in that the computer program comprises computer program code which, when run on a computer, causes the computer to perform the video clipping method according to any one of claims 1 to 4 or the method for obtaining a content feature prediction model according to any one of claims 5 to 9.
PCT/CN2022/094576 2021-10-18 2022-05-23 Video clipping method and apparatus, electronic device and storage medium WO2023065663A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111211990.XA CN113949828B (zh) 2021-10-18 2021-10-18 Video clipping method and apparatus, electronic device and storage medium
CN202111211990.X 2021-10-18

Publications (1)

Publication Number Publication Date
WO2023065663A1 true WO2023065663A1 (zh) 2023-04-27

Family

ID=79331391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/094576 WO2023065663A1 (zh) 2021-10-18 2022-05-23 视频剪辑方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN113949828B (zh)
WO (1) WO2023065663A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113949828B (zh) * 2021-10-18 2024-04-30 北京达佳互联信息技术有限公司 视频剪辑方法、装置、电子设备及存储介质
CN117278801B (zh) * 2023-10-11 2024-03-22 广州智威智能科技有限公司 一种基于ai算法的学生活动精彩瞬间拍摄与分析方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030234803A1 (en) * 2002-06-19 2003-12-25 Kentaro Toyama System and method for automatically generating video cliplets from digital video
CN101714155A (zh) * 2008-10-07 2010-05-26 汤姆森特许公司 Method for inserting an advertisement clip into a video sequence and corresponding device
CN102543136A (zh) * 2012-02-17 2012-07-04 广州盈可视电子科技有限公司 Video clipping method and device
CN111708915A (zh) * 2020-06-12 2020-09-25 腾讯科技(深圳)有限公司 Content recommendation method and apparatus, computer device and storage medium
US20210289266A1 (en) * 2018-11-28 2021-09-16 Huawei Technologies Co.,Ltd. Video playing method and apparatus
CN113949828A (zh) * 2021-10-18 2022-01-18 北京达佳互联信息技术有限公司 Video clipping method and apparatus, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9554093B2 (en) * 2006-02-27 2017-01-24 Microsoft Technology Licensing, Llc Automatically inserting advertisements into source video content playback streams
CN110855904B (zh) * 2019-11-26 2021-10-01 Oppo广东移动通信有限公司 Video processing method, electronic device and storage medium
CN111726685A (zh) * 2020-06-28 2020-09-29 百度在线网络技术(北京)有限公司 Video processing method and apparatus, electronic device and medium
CN111988638B (zh) * 2020-08-19 2022-02-18 北京字节跳动网络技术有限公司 Method and apparatus for obtaining a spliced video, electronic device and storage medium

Also Published As

Publication number Publication date
CN113949828A (zh) 2022-01-18
CN113949828B (zh) 2024-04-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882288

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE