WO2022134698A1 - 视频处理方法及装置 - Google Patents

视频处理方法及装置 Download PDF

Info

Publication number
WO2022134698A1
WO2022134698A1 PCT/CN2021/120383 CN2021120383W WO2022134698A1 WO 2022134698 A1 WO2022134698 A1 WO 2022134698A1 CN 2021120383 W CN2021120383 W CN 2021120383W WO 2022134698 A1 WO2022134698 A1 WO 2022134698A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
feature
feature extraction
extraction model
image
Prior art date
Application number
PCT/CN2021/120383
Other languages
English (en)
French (fr)
Inventor
徐宝函
李佩易
Original Assignee
上海幻电信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海幻电信息科技有限公司 filed Critical 上海幻电信息科技有限公司
Priority to EP21908699.8A priority Critical patent/EP4207770A4/en
Publication of WO2022134698A1 publication Critical patent/WO2022134698A1/zh
Priority to US18/300,310 priority patent/US20230252785A1/en

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25866Management of end-user data
    • H04N21/25891Management of end-user data being end-user preferences
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4666Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4668Learning process for intelligent management, e.g. learning user preferences for recommending movies for recommending content, e.g. movies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/482End-user interface for program selection
    • H04N21/4826End-user interface for program selection using recommendation lists, e.g. of programs or channels sorted out according to their score
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Definitions

  • the present application relates to the field of computer technology, and in particular, to a video processing method.
  • the present application also relates to a video processing apparatus, a computing device, a computer-readable storage medium and a computer program product.
  • embodiments of the present application provide a video processing method.
  • the present application also relates to a video processing apparatus, a computing device, and a computer-readable storage medium, so as to solve the technical defect in the prior art that the highlight video segment in the video is not accurately extracted.
  • a video processing method including:
  • a video processing apparatus including:
  • a video segmenting module configured to segment the received initial video into at least one video segment
  • a feature extraction module configured to obtain a first modal feature, a second modal feature, and a third modal feature corresponding to each video segment in the at least one video segment based on the feature extraction model;
  • a target determination module configured to input the first modal feature, the second modal feature and the third modal feature corresponding to each video clip into the recognition model, to obtain a video score corresponding to each video clip, and determining a target video segment in the initial video based on the video score.
  • a computing device including a memory, a processor, and computer instructions stored in the memory and executable on the processor, the processor implementing the instructions when the processor executes the instructions The steps of a video processing method.
  • a computer-readable storage medium which stores computer instructions, and when the instructions are executed by a processor, implements the steps of the video processing method.
  • a computer program product is provided, when the computer program product is executed in a computer, the computer is made to execute the steps of the aforementioned video processing method.
  • the present application provides the video processing method and device, wherein the video processing method includes dividing the received initial video into at least one video segment; obtaining the corresponding correspondence of each video segment in the at least one video segment based on a feature extraction model the first modal feature, the second modal feature and the third modal feature; input the first modal feature, the second modal feature and the third modal feature corresponding to each video clip into the recognition model, A video score corresponding to each video segment is obtained, and a target video segment in the initial video is determined based on the video score.
  • the video processing method fuses the acquired first modal feature, second modal feature, and third modal feature of the video, and performs processing on highlight video segments of the video based on the multimodal features obtained after fusion.
  • the highlight video clips of the video are accurately obtained to enhance the user experience.
  • FIG. 1 is a schematic diagram of a specific application structure of a video processing method provided by an embodiment of the present application
  • FIG. 2 is a flowchart of a video processing method provided by an embodiment of the present application.
  • FIG. 3 is a processing flowchart of a video processing method applied to a live broadcast scene provided by an embodiment of the present application
  • FIG. 4 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present application.
  • FIG. 5 is a structural block diagram of a computing device provided by an embodiment of the present application.
  • MFCC Mel Frequency Cepstrum Coefficient, Mel Frequency Cepstrum Coefficient.
  • VGGish A tensorflow-based VGG model for extracting audio data features.
  • CNN Convolutional Neural Networks, Convolutional Neural Networks.
  • MobileNet Focuses on lightweight CNN neural networks on mobile and embedded devices.
  • ResNet Residual network, deep CNN neural network.
  • word2vec is a group of related models used to generate word vectors.
  • GBDT Gradient Boosting Decision Tree, gradient boosting decision tree.
  • Attention Attention mechanism model.
  • Highlight clips There may be different standards for different types of videos. For example, in game videos, highlight clips may be clips that kill opponents; in game videos, highlight clips may be score clips; In the video, the highlight clip may contain some clips of the anchor interacting with the audience; therefore, pictures, audio, interactive information, etc. are closely related to the highlight clip moment.
  • a video processing method is provided.
  • the present application also relates to a video processing apparatus, a computing device, and a computer-readable storage medium, which will be described in detail in the following embodiments.
  • FIG. 1 shows a schematic diagram of a specific application structure of a video processing method provided according to an embodiment of the present application.
  • the video processing method provided by the embodiment of the present application is applied to a computer, a server, or a cloud service.
  • the application scenario of FIG. 1 includes a CPU (Central Processing Unit, central processing unit)/GPU (Graphics Processing Unit, graphics processing unit) 101, a data storage module 103, a preprocessing module 105, a feature extraction module 107, a feature fusion module 109 and Highlight recognition model 111; specifically, the CPU/GPU 101 starts to work, acquires the video to be processed stored in the data storage module 103, and then controls the preprocessing module 105 to cut the to-be-processed video into multiple video clips, and then divides the video into multiple video clips.
  • a CPU Central Processing Unit, central processing unit
  • GPU Graphics Processing Unit, graphics processing unit
  • Each subsequent video clip is input to the feature extraction module 107 to extract the multimodal features of each video clip, such as voice features, text features, and image features, etc.; then the multimodal features of each video clip extracted by the feature extraction module 107 The features are fused to obtain the overall features of each video clip, and finally the overall features of each video clip are input into the highlight identification module 109, and the pre-trained model in the highlight identification module 109 scores each video clip.
  • the scoring result of each video clip obtains video clips with higher scores, and these video clips with higher scores are used as highlight clips to be displayed and recommended to users, or used to assist subsequent video editing.
  • the highlights of the video to be processed are identified by the acquired multi-modal features of the video to be processed, and based on such multi-modal comprehensive video information, the identification of the highlights of the video to be processed is more comprehensive and accurate.
  • FIG. 2 shows a flowchart of a video processing method provided according to an embodiment of the present application, which specifically includes the following steps:
  • Step 202 Divide the received initial video into at least one video segment.
  • the initial video may be any type of video with any duration, for example, the initial video may include entertainment video, news video, or TV series video, and so on.
  • the server can divide the received initial video into at least one video segment according to a preset segmentation method. In practical applications, it can be divided according to the number of video frames, the preset segmentation duration, etc. Split the initial video; for example, split the initial video into multiple video clips consisting of 30 video frames, or if the preset duration is 6 seconds, then split the initial video into multiple videos with a duration of 6 seconds
  • the splitting methods of the initial video include but are not limited to the above two methods, which can be set according to specific applications, which are not limited in this application.
  • Step 204 Obtain a first modal feature, a second modal feature, and a third modal feature corresponding to each video segment in the at least one video segment based on the feature extraction model.
  • the first modal feature, the second modal feature, and the third modal feature are features of three different modalities.
  • each video segment is input into the feature extraction model to obtain the first modal feature, the second modal feature and the third modal feature corresponding to each video segment. Based on these three modal features, the following can be used to more accurately identify the wonderful video clips in the initial video.
  • the specific implementation methods are as follows:
  • the obtaining the first modal feature, the second modal feature and the third modal feature corresponding to each video segment in the at least one video segment based on the feature extraction model includes:
  • a structured feature corresponding to each video segment in the at least one video segment is obtained based on the third feature extraction model.
  • the used feature extraction models are also different.
  • the first modal feature, the second modal feature and the third modal feature are features of three different modalities
  • the first The feature extraction model, the second feature extraction model and the third feature extraction model are also three different feature extraction models.
  • the first feature extraction model can be understood as an audio feature extraction model
  • the second feature extraction model can be understood as an image feature extraction model
  • the third feature extraction model can be understood as a structured feature extraction model.
  • each video clip is input into the audio feature extraction model, the image feature extraction model and the structured feature extraction model respectively, and the audio features, image features and structural features of each video clip can be obtained.
  • Multi-modal features such as audio features, image features, and structural features can more accurately identify the wonderful video clips in the original video.
  • the obtaining the audio feature corresponding to each video clip in the at least one video clip based on the first feature extraction model includes:
  • the first modal feature is an audio feature
  • first extract the audio information in each video segment and then input the audio information into the first feature extraction model to obtain the corresponding audio information in each video segment.
  • audio features first extract the audio information in each video segment, and then input the audio information into the first feature extraction model to obtain the corresponding audio information in each video segment.
  • the first feature extraction model may be an audio feature extraction model, such as a pre-trained MFCC model or a VGGish model.
  • the audio information in each video clip is extracted, and then the audio features corresponding to the audio information in each video clip are accurately obtained based on the pre-trained audio feature extraction model.
  • the features of the modalities are fused to obtain an accurate score for each video segment.
  • the audio information includes audio information corresponding to a video picture and audio information corresponding to a non-video picture;
  • inputting the audio information in each video clip into the first feature extraction model to obtain the audio features corresponding to each video clip includes:
  • the audio features corresponding to the video pictures in each video clip and the audio features corresponding to the non-video pictures in each video clip are fused to obtain the audio features corresponding to each video clip.
  • the audio information corresponding to the video screen can be understood as the audio information corresponding to the main video screen, such as the voice of the game character in the game screen;
  • the audio information corresponding to the non-video screen can be understood as the audio information corresponding to the non-video main screen, such as video The voice of the commentator in the live broadcast or the audio of other small window videos in the video screen, etc.
  • each segmented video may contain There are two types of audio information, that is, the audio information corresponding to the video screen and the audio information corresponding to the non-video screen.
  • the audio information corresponding to the video screen can be understood as the voice of the game character in the game screen, and the non-video screen
  • the corresponding audio information can be understood as the sound of the game explanation by the live broadcast instructor.
  • the two types of audio information in each acquired video clip will be extracted through the first feature extraction model to extract the audio features.
  • the first feature extraction model may be an audio feature extraction model, such as a pre-trained MFCC model or a VGGish model.
  • the initial video may only include video dubbing, that is, the audio information corresponding to the video picture.
  • video dubbing that is, the audio information corresponding to the video picture.
  • video dubbing and other additional dubbing in order to ensure the complete and accurate extraction of audio features, it is necessary to extract and fuse these two types of audio information in each video clip separately to avoid simultaneous audio feature extraction. features are confusing.
  • the obtaining the image feature corresponding to each video segment in the at least one video segment based on the second feature extraction model includes:
  • the second modal feature is an image feature
  • extract image information in each video clip and then input the image information into the second feature extraction model to obtain image features corresponding to the image information in each video clip .
  • the second feature extraction model may be an image feature extraction model, such as a pre-trained MobileNet model or a ResNet model.
  • the image information in each video clip is extracted, and then the image features corresponding to the image information in each video clip are accurately obtained based on the pre-trained image feature extraction model, and the accurate image features can be combined with other The features of the modalities are fused to obtain an accurate score for each video segment.
  • the image information includes a video picture and a key area picture
  • the second feature extraction model includes a first image feature extraction model and a second image feature extraction model
  • inputting the image information into the second feature extraction model to obtain image features corresponding to each video clip includes:
  • the first image feature and the second image feature corresponding to each video clip are fused to obtain the image feature corresponding to each video clip.
  • the video picture of the initial video may include, but is not limited to, key area information (such as the score area, kill area, etc. in the game video) and the overall picture information of each video frame. Both types of image information are also included in each video frame of the clip.
  • the two types of image information in each video clip obtained will be extracted through different image feature extraction models. Obtain image features corresponding to each type of image information.
  • the image feature extraction is performed by the CNN model, and for the key area information (ie, the key area picture) of each video frame in each video clip, it will be According to different types of videos, image feature extraction is performed based on different image feature extraction models.
  • the image feature extraction model can be a score image feature extraction model. The score feature of the score region of the video frame.
  • the initial video may only include video pictures. In this case, it is only necessary to extract the image features of the video pictures in each video segment. If the initial video includes video picture information and the initial video In the case of the image of the key area corresponding to the domain of the video, in order to ensure the complete and accurate extraction of image features, it is necessary to extract the image features of these two types of image information in each video clip separately, so as to ensure the image of each video clip. Completeness and comprehensiveness of features.
  • the image information includes a video image, a key area image, and an anchor image
  • the second feature extraction model includes a first image feature extraction model, a second image feature extraction model, and a third image feature extraction model
  • inputting the image information into the second feature extraction model to obtain image features corresponding to each video clip includes:
  • the image features corresponding to each video clip are obtained by fusing the video picture features, key area picture features and host image features corresponding to each video clip.
  • the initial video is a video in a live broadcast scenario
  • the initial video will not only contain video images and key area images related to the field of the initial video, but may also include the facial images of the live broadcast instructor. Therefore, These three types of image information will also exist in each segmented video segment, namely, the video picture, the key area picture, and the anchor image.
  • the three types of image information obtained in each video clip will be extracted through different image feature extraction models. Obtain image features corresponding to each type of image information.
  • image feature extraction is performed by the CNN model; for the key area information (ie key area picture) of each video frame in each video clip, the According to different types of videos, image feature extraction is performed based on different image feature extraction models.
  • the image feature extraction model can be a score image feature extraction model.
  • the score feature of the score area of the video frame; and for the face information of the live broadcast commentator (that is, the anchor image) of each video frame in each video segment, the image features can be performed according to the convolutional neural network pre-trained by the facial emotion. extract.
  • the video picture of each video clip is input into the first image feature extraction model to obtain the video picture feature corresponding to the video clip, and then the key area picture of the video clip is input into the second image feature extraction model to obtain the The key area image features corresponding to the video clip, and then the anchor image is input into the third image feature extraction model to obtain the anchor image features corresponding to the video clip, and finally the video image features, key area image features and anchor image features are fused to obtain The final image features of the video clip.
  • the initial video may only include video picture information. In this case, it is only necessary to extract the image features of the video pictures in each video clip. If the initial video includes video pictures and the initial video In the case of the image of the key area corresponding to the domain of the video, in order to ensure the complete and accurate extraction of image features, it is necessary to extract the image features of these two types of image information in each video clip separately, so as to ensure the image of each video clip.
  • Integrity and comprehensiveness of features then if the initial video includes video pictures, key area pictures corresponding to the field of the initial video, and images of live broadcast instructors, in order to ensure the complete and accurate extraction of image features, then It is necessary to extract the image features of the three types of image information in each video clip separately to ensure the integrity and comprehensiveness of the image features of each video clip.
  • the obtaining, based on the third feature extraction model, the structural feature corresponding to each video segment in the at least one video segment includes:
  • the structured information includes but is not limited to text information, numerical information, etc., such as the video title, comment information, and bullet screen information in the initial video. If the initial video is a live video, the structured information may also include gift information, recharge information, etc. information and the amount of recharge, etc.
  • the third modal feature can be understood as a structured feature, then if the third modal feature is a structured feature, extract the structured information in each video segment, and then input the structured information into the third modal feature.
  • the feature extraction model obtains the structured features corresponding to the structured information in each video segment.
  • the third feature extraction model may be a structured feature extraction model, such as a pre-trained Word2vec model or a Bert model.
  • the structured information in each video segment is extracted, and then the structured features corresponding to the structured information in each video segment are accurately obtained based on the pre-trained structured feature extraction model, and the accurate structured features can be obtained subsequently.
  • the structured features are fused with the above audio features and image features to obtain accurate scores for each video segment.
  • Step 206 Input the first modal feature, the second modal feature and the third modal feature corresponding to each video clip into the recognition model, obtain the video score corresponding to each video clip, and based on the The video score determines the target video segment in the initial video.
  • the first modal feature (that is, the audio feature), the second modal feature (that is, the image feature), and the third modal feature (that is, the structural feature) corresponding to each video clip may be spliced, Obtain the target video feature corresponding to each video clip, and then input the target video feature corresponding to each video clip into the recognition model to obtain the video score corresponding to each video clip;
  • the first modal feature, the second modal feature, and the third modal feature corresponding to the video clip are input to the recognition model, and each video clip is directly output after dimensionality reduction, normalization, and weighting are processed in the recognition model. Corresponding video score.
  • the recognition model includes but is not limited to the GBDT model and the Attention-based deep neural network model.
  • the GBDT model provides a more commonly used feature fusion algorithm, which can identify the importance of the input features, and then use the labeled training data to return the scores of the corresponding video clips; while the Attention-based deep neural network model, The distribution of the importance of different modal features and the regression of the video clip scores will be simultaneously trained through the training data, and the trained recognition model will be stored on the corresponding device.
  • the modal features are input into the recognition model, and the video scores corresponding to the video clips can be directly output.
  • the determining a target video segment in the initial video based on the video score includes:
  • a video segment whose video score is greater than or equal to a preset score threshold is determined as a target video segment in the initial video.
  • the preset score threshold may be set according to actual requirements, which is not limited in this application, for example, the preset score threshold may be 80 points.
  • a video segment with a video score greater than or equal to 80 points is determined as the target video segment in the initial video.
  • the target video clips may also be determined in other ways, such as sorting the video clips in descending order according to the video score, and then selecting the top three, four or six video clips as the target video clips.
  • the target video clip in the initial video can be accurately obtained based on the video score of each video clip, and subsequent video recommendation or Video collection generation, etc.
  • the method further includes:
  • a target video is generated based on the target video segment, and the target video is sent to the user.
  • the target video clips can be spliced to generate the target video. Since the video scores of the target video clips are all high and contain more content that attracts users' attention, the target video clips are The segment generates a target video and sends the target video to the user, which can increase the user's click-through rate and viewing rate of the target video. If there is an advertisement in the target video, it can also greatly increase the exposure rate of the advertisement.
  • the video processing method provided in this application is applied in different scenarios, and the adopted models may also be different.
  • the video processing method is applied in a high real-time scene such as a live broadcast
  • the feature extraction model, recognition model, etc. in the method can adopt a lightweight model to improve the overall processing speed of the video processing method; and if the video processing method is applied in background processing, the feature extraction in the video processing method Models, recognition models, etc. can use more complex deep learning models to ensure the accuracy of video processing methods; the specific implementation methods are as follows:
  • the method also includes:
  • the type information of the feature extraction model and/or the recognition model is determined according to the scene to which the video processing method is applied and/or the resource requirements of the video processing method.
  • a type of feature extraction model and/or recognition model may be used (such as a lightweight initial feature extraction model and/or recognition model); wherein, the preset resource threshold can be set according to actual needs, which is not limited in this application.
  • a background processing scene such as a scene after a live broadcast
  • another type of feature extraction model and / or identify the model when the video processing method is applied in a background processing scene (such as a scene after a live broadcast), or the resource requirement of the video processing method is greater than a preset resource threshold.
  • one type of model is used for audio features, image features, structured features, and feature fusion; in background scenarios, audio features, image features, structured features, and feature fusion are used.
  • Another type of model see Table 1 for details.
  • the video processing method is applied to a real-time processing scenario (such as a live broadcast scenario), or when the resource requirement of the video processing method is less than or equal to a preset resource threshold, the audio feature extraction model can be the MFCC model.
  • the video processing method is applied to the background processing scene (such as the scene after the end of the live broadcast), or when the resource requirement of the video processing method is greater than the preset resource threshold, the audio feature extraction model may be the VGGish model; the video The processing method is applied to a real-time processing scene (such as a live broadcast scene), or when the resource requirement of the video processing method is less than or equal to a preset resource threshold, the image feature extraction model can be a MobileNet model, and the video processing method is applied to background processing.
  • the scene (such as the scene after the live broadcast ends), or when the resource requirement of the video processing method is greater than the preset resource threshold, the image feature extraction model can be the ResNet model; the video processing method is applied to real-time processing scenes (such as live broadcasts).
  • the structured feature extraction model can be the Word2vec model, and the video processing method is applied to the background processing scene (such as the scene after the live broadcast ends) , or when the resource requirement of the video processing method is greater than the preset resource threshold, the structured feature extraction model may be a Bert model; the video processing method is applied to real-time processing scenarios (such as live broadcast scenarios), or the video processing
  • the feature fusion model can be the GBDT model, and the video processing method is applied to the background processing scene (such as the scene after the live broadcast), or the resource requirement of the video processing method.
  • the feature fusion model can be the Attention model.
  • the video processing method fuses the acquired audio features, image features and structural features of the video, and recognizes the high-light video segments of the video based on the multimodal features obtained after the fusion.
  • the comprehensive feature information of multi-modality can accurately obtain the highlight video clip of the video, and enhance the user experience.
  • the video processing method performs feature fusion by extracting different features of multi-channel audio, multi-channel video and structured information, so as to score the brilliance of each video segment of the current video through the fusion features, and finally can Accurately and quickly identify the highlight video clips of the video; the video processing method can obtain the global information of the video through multi-modal feature fusion, so as to accurately obtain the highlight video clips of the video; in addition, the method can be aimed at different applications Different algorithm configurations and deployments are performed according to the scene and/or resource requirements, so as to meet the processing speed and accuracy requirements of the video processing method.
  • FIG. 3 shows a processing flowchart of a video processing method applied to a live broadcast scenario provided by an embodiment of the present application, which specifically includes the following steps.
  • Step 302 Extract audio in the live video to obtain audio features of the live video.
  • the live video contains the audio of the video content and the audio of the video uploader during the live broadcast. If the live video does not contain the audio of the video uploader, only the audio of the video content is extracted.
  • the audio content has a strong correlation with the video as a whole.
  • the high-light video clips of the video are often accompanied by the higher pitch of the video itself, or the video uploader raises the volume, makes laughter, etc.
  • the corresponding audio features can be extracted through traditional audio features such as MFCC, volume, and pitch, or a deep neural network based on VGGish.
  • VGGish is a model that provides audio feature extraction after classification and pre-training on a large amount of audio data.
  • Step 304 Extract video images in the live video to obtain image features of the live video.
  • the video image includes the overall picture of the video, the picture of the video uploader, and key area information.
  • the overall picture of the video contains the characteristics of the entire video, and the highlight moments are often accompanied by rich colors and content.
  • the video uploader's screen mostly contains the facial information of the video uploader, in which the expression and fluctuation of emotions are closely related to the highlight moment.
  • the key area information refers to the areas that users often pay attention to, and usually the highlight moments are also related to these key areas. For example, in game videos, users often pay attention to the score area and the kill prompt area, while in videos such as dancing, users may often pay attention to the area containing characters in the middle of the screen.
  • the overall picture features of the video are extracted through the CNN convolutional neural network pre-trained by ImageNet; the picture information of the video uploader can be extracted using the pre-trained convolutional neural network based on facial emotions; and the key area information will be According to different types of videos, different detectors are trained to extract corresponding features, such as score information features, kill information features, human action information features, etc.
  • Step 306 Extract the structured information in the live video to obtain the structured features of the live video.
  • the video usually also contains a lot of structured information, such as titles, comments, bullet screens, etc.
  • structured information such as titles, comments, bullet screens, etc.
  • gifts, recharges, etc. are also included.
  • These structured information will also be related to highlight moments, such as comment content, bullet chat content, bullet chat number, gift number, etc.
  • the present application proposes word2vec or Bert-based structured feature extraction for texts such as titles, comments, and bullet screens.
  • numerical information such as the number of bullet screens, the number of gifts, the value of gifts, and the number of recharges can be normalized to [0,1], which is also used as the feature extraction of structured information.
  • Step 308 Perform feature fusion on the audio features, image features and structural features of the live video.
  • the present application adopts a feature-level fusion strategy.
  • feature fusion the features of each video clip are summarized, and after dimensionality reduction and normalization operations, traditional GBDT (Gradient Boosting Decision Tree) or Attention-based deep neural network fusion methods can be used for feature fusion.
  • GBDT Gradient Boosting Decision Tree
  • Attention-based deep neural network fusion methods can be used for feature fusion.
  • the GBDT algorithm based on multiple decision trees is a more commonly used feature fusion algorithm, which can identify the importance of the input features and use the labeled training data (that is, extract the prepared video clips and score the corresponding degree of excitement) Regresses the score for the corresponding segment.
  • the Attention-based neural network will simultaneously train the distribution of the importance of different modalities and the regression of the segment scores through the training data.
  • extract the features of the corresponding video clips, and the trained regression model will automatically output the scores of the clips to identify the wonderful video clips of the video.
  • Step 310 Input the highlight video clip of the live video.
  • the present application can also select different algorithm configurations according to different scenarios and resources. For example, when real-time and limited computing resources are required, traditional features such as MFCC can be used for audio feature extraction; lightweight networks such as Mobilenet can be used for video feature extraction, and faster extraction methods such as Word2vec can be used for structured feature extraction.
  • traditional features such as MFCC can be used for audio feature extraction
  • lightweight networks such as Mobilenet can be used for video feature extraction
  • faster extraction methods such as Word2vec can be used for structured feature extraction.
  • the extraction of audio features, image features and structural features can be based on deep neural network feature extraction methods in order to output more accurate results.
  • the video processing method performs feature fusion by extracting different features of multi-channel audio, multi-channel video and structured information, so as to score the brilliance of each video clip of the current video through the fusion features, and finally
  • the high-light video clips of the video can be identified accurately and quickly;
  • the video processing method can obtain the global information of the video through multi-modal feature fusion, so as to accurately obtain the high-light video clips of the video;
  • the environment is configured and deployed with different algorithms to meet the needs of processing speed and accuracy.
  • FIG. 4 shows a schematic structural diagram of a video processing apparatus provided by an embodiment of the present application.
  • the device includes:
  • a video segmentation module 402 configured to segment the received initial video into at least one video segment
  • a feature extraction module 404 configured to obtain, based on the feature extraction model, a first modal feature, a second modal feature, and a third modal feature corresponding to each video segment in the at least one video segment;
  • the target determination module 406 is configured to input the first modal feature, the second modal feature and the third modal feature corresponding to each video clip into the recognition model, and obtain the video score corresponding to each video clip , and determine a target video segment in the initial video based on the video score.
  • the feature extraction module 404 is further configured to:
  • a structured feature corresponding to each video segment in the at least one video segment is obtained based on the third feature extraction model.
  • the feature extraction module 404 is further configured to:
  • the feature extraction module 404 is further configured to:
  • the feature extraction module 404 is further configured to:
  • the image information includes a video picture and a key area picture
  • the second feature extraction model includes a first image feature extraction model and a second image feature extraction model
  • the feature extraction module 404 is further configured to:
  • the image features corresponding to each video segment are obtained by fusing the video picture feature corresponding to each video clip with the key area picture feature.
  • the image information includes a video image, a key area image, and an anchor image
  • the second feature extraction model includes a first image feature extraction model, a second image feature extraction model, and a third image feature extraction model
  • the feature extraction module 404 is further configured to:
  • the image features corresponding to each video clip are obtained by fusing the video picture features, key area picture features and host image features corresponding to each video clip.
  • the audio information includes audio information corresponding to a video picture and audio information corresponding to a non-video picture;
  • the feature extraction module 404 is further configured to:
  • the audio features corresponding to the video pictures in each video clip and the audio features corresponding to the non-video pictures in each video clip are fused to obtain the audio features corresponding to each video clip.
  • the target determination module 406 is further configured to:
  • a video segment whose video score is greater than or equal to a preset score threshold is determined as a target video segment in the initial video.
  • the device further includes:
  • a target video generation module configured to generate a target video based on the target video segment, and send the target video to a user.
  • the device further includes:
  • the model determination module is configured to determine the type information of the feature extraction model and/or the recognition model according to the scene to which the video processing method is applied and/or the resource requirements of the video processing method.
  • the video processing apparatus fuses the acquired first modal feature, second modal feature and third modal feature of the video, and based on the multi-modal feature obtained after fusion, fuses the highlight video segment of the video.
  • the high-light video clips of the video can be accurately obtained through the multi-modal comprehensive feature information of the video, thereby enhancing the user experience.
  • the above is a schematic solution of a video processing apparatus according to this embodiment. It should be noted that the technical solution of the video processing device and the technical solution of the above-mentioned video processing method belong to the same concept, and the details that are not described in detail in the technical solution of the video processing device can be referred to the description of the technical solution of the above-mentioned video processing method. .
  • FIG. 5 shows a structural block diagram of a computing device 500 provided according to an embodiment of the present specification.
  • Components of the computing device 500 include, but are not limited to, memory 510 and processor 520 .
  • the processor 520 is connected with the memory 510 through the bus 530, and the database 550 is used for saving data.
  • Computing device 500 also includes access device 540 that enables computing device 500 to communicate via one or more networks 560 .
  • examples of such networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • Access device 540 may include one or more of any type of network interface (e.g., a network interface card (NIC)), wired or wireless, such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
  • the above components of computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, for example through a bus.
  • it should be understood that the structural block diagram of the computing device shown in FIG. 5 is for illustration only and does not limit the scope of this specification; those skilled in the art can add or replace other components as required.
  • Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smartphone), a wearable computing device (e.g., a smart watch, smart glasses, etc.) or another type of mobile device, or a stationary computing device such as a desktop computer or a PC.
  • Computing device 500 may also be a mobile or stationary server.
  • the processor 520 is configured to execute the following computer-executable instructions, and when the processor 520 executes the instructions, the steps of the video processing method are implemented.
  • the above is a schematic solution of a computing device according to this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned video processing method belong to the same concept, and the details not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the above-mentioned video processing method.
  • An embodiment of the present application further provides a computer-readable storage medium, which stores computer instructions that, when executed by a processor, implement the steps of the aforementioned video processing method.
  • the above is a schematic solution of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above-mentioned video processing method belong to the same concept. For details not described in detail in the technical solution of the storage medium, refer to the description of the technical solution of the above-mentioned video processing method.
  • An embodiment of the present application further provides a computer program product, which, when the computer program product is executed in a computer, causes the computer to execute the steps of the aforementioned video processing method.
  • the computer instructions include computer program product code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program product code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunication signals, software distribution media, and the like.
  • the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present application provides a video processing method and apparatus. The video processing method includes: splitting a received initial video into at least one video clip; obtaining, based on feature extraction models, a first modal feature, a second modal feature, and a third modal feature corresponding to each video clip of the at least one video clip; inputting the first modal feature, the second modal feature, and the third modal feature corresponding to each video clip into a recognition model to obtain a video score corresponding to each video clip, and determining a target video clip in the initial video based on the video scores. By fusing the acquired first, second, and third modal features of the video and identifying the highlight video clips of the video based on the fused multi-modal features, the video processing method uses the comprehensive multi-modal feature information of the video to accurately obtain the highlight video clips of the video, thereby enhancing the user experience.

Description

Video processing method and apparatus
This application claims priority to Chinese patent application No. 202011531808.4, entitled "Video processing method and apparatus" and filed with the Chinese Patent Office on December 22, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a video processing method. The present application also relates to a video processing apparatus, a computing device, a computer-readable storage medium, and a computer program product.
Background Art
With the development of the Internet, people increasingly watch videos and socialize through online platforms. The audience of video websites is becoming wider and wider: users can watch all kinds of videos and interact with streamers or other viewers in real time. In this process, users want to know about the highlight video clips in a video (for example, exciting moments), so that when watching a live stream or other videos they can pick the clips they care about most. Many streamers and video uploaders also want to edit the highlight video clips of a live video after a long broadcast ends; in addition, video websites hope to identify highlight video clips in order to make more accurate video recommendations for more users.
However, in the prior art the extraction of highlight video clips from a video is not very accurate, and clips that differ greatly from the video content are easily extracted, resulting in a poor user experience.
Summary of the Invention
In view of this, the embodiments of the present application provide a video processing method. The present application also relates to a video processing apparatus, a computing device, and a computer-readable storage medium, so as to solve the technical defect of inaccurate extraction of highlight video clips from a video in the prior art.
According to a first aspect of the embodiments of the present application, a video processing method is provided, including:
splitting a received initial video into at least one video clip;
obtaining, based on a feature extraction model, a first modal feature, a second modal feature, and a third modal feature corresponding to each video clip of the at least one video clip;
inputting the first modal feature, the second modal feature, and the third modal feature corresponding to each video clip into a recognition model to obtain a video score corresponding to each video clip, and determining a target video clip in the initial video based on the video scores.
According to a second aspect of the embodiments of the present application, a video processing apparatus is provided, including:
a video splitting module configured to split a received initial video into at least one video clip;
a feature extraction module configured to obtain, based on a feature extraction model, a first modal feature, a second modal feature, and a third modal feature corresponding to each video clip of the at least one video clip;
a target determination module configured to input the first modal feature, the second modal feature, and the third modal feature corresponding to each video clip into a recognition model, obtain a video score corresponding to each video clip, and determine a target video clip in the initial video based on the video scores.
According to a third aspect of the embodiments of the present application, a computing device is provided, including a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the video processing method when executing the instructions.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, which stores computer instructions that, when executed by a processor, implement the steps of the video processing method.
According to a fifth aspect of the embodiments of the present application, a computer program product is provided, which, when executed in a computer, causes the computer to execute the steps of the video processing method described above.
The present application provides the video processing method and apparatus, wherein the video processing method includes splitting a received initial video into at least one video clip; obtaining, based on a feature extraction model, a first modal feature, a second modal feature, and a third modal feature corresponding to each video clip of the at least one video clip; and inputting the first modal feature, the second modal feature, and the third modal feature corresponding to each video clip into a recognition model to obtain a video score corresponding to each video clip, and determining a target video clip in the initial video based on the video scores. Specifically, by fusing the acquired first, second, and third modal features of the video and identifying the highlight video clips based on the fused multi-modal features, the video processing method uses the comprehensive multi-modal feature information of the video to accurately obtain the highlight video clips of the video and enhance the user experience.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a specific application structure of a video processing method provided by an embodiment of the present application;
FIG. 2 is a flowchart of a video processing method provided by an embodiment of the present application;
FIG. 3 is a processing flowchart of a video processing method applied to a live-streaming scenario provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present application;
FIG. 5 is a structural block diagram of a computing device provided by an embodiment of the present application.
Detailed Description
Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the present application; therefore, the present application is not limited to the specific implementations disclosed below.
The terms used in one or more embodiments of the present application are only for the purpose of describing specific embodiments and are not intended to limit the one or more embodiments of the present application. The singular forms "a", "said", and "the" used in one or more embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of the present application refers to and includes any or all possible combinations of one or more associated listed items.
It should be understood that although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various kinds of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present application, "first" may also be referred to as "second", and similarly, "second" may also be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the terms involved in one or more embodiments of the present application are explained.
MFCC: Mel Frequency Cepstrum Coefficient.
VGGish: a TensorFlow-based VGG model used to extract features from audio data.
CNN: Convolutional Neural Networks.
MobileNet: a lightweight CNN neural network focused on mobile and embedded devices.
ResNet: residual network, a deep CNN neural network.
word2vec: a group of related models used to produce word vectors.
bert: Bidirectional Encoder Representation from Transformers, the encoder of a bidirectional Transformer.
GBDT: Gradient Boosting Decision Tree.
Attention: attention mechanism model.
Highlight clip: different types of videos may have different standards. For example, in a game video, a highlight clip may be a clip containing a kill of an opponent; in a match video, a highlight clip may be a scoring clip; and in a live interactive video, a highlight clip may contain clips of interaction between the streamer and the audience. Therefore, pictures, audio, interaction information, and the like are all closely related to highlight moments.
The present application provides a video processing method. The present application also relates to a video processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Referring to FIG. 1, FIG. 1 shows a schematic diagram of a specific application structure of a video processing method provided according to an embodiment of the present application.
Specifically, the video processing method provided by the embodiments of the present application is applied to a computer, a server, or a cloud service. The application scenario of FIG. 1 includes a CPU (Central Processing Unit)/GPU (Graphics Processing Unit) 101, a data storage module 103, a preprocessing module 105, a feature extraction module 107, a feature fusion module 109, and a highlight clip recognition model 111. Specifically, the CPU/GPU 101 starts to work, acquires the to-be-processed video stored in the data storage module 103, and controls the preprocessing module 105 to split the to-be-processed video into multiple video clips; each split video clip is then input into the feature extraction module 107 to extract the multi-modal features of each video clip, such as speech features, text features, and image features; the extracted multi-modal features of each video clip are then fused to obtain the overall feature of each video clip; finally, the overall feature of each video clip is input into the highlight clip recognition model 111, and the pre-trained model scores each video clip. Based on the scoring results, the video clips with higher scores are obtained and used as highlight clips to be displayed and recommended to users, or used to assist subsequent video editing.
In the present application, the highlight clips of the to-be-processed video are recognized through the acquired multi-modal features of the to-be-processed video. Based on such comprehensive multi-modal video information, the recognition of the highlight clips of the to-be-processed video is more comprehensive and accurate.
Referring to FIG. 2, FIG. 2 shows a flowchart of a video processing method provided according to an embodiment of the present application, which specifically includes the following steps:
Step 202: splitting a received initial video into at least one video clip.
The initial video may be a video of any type and any duration; for example, the initial video may include an entertainment video, a news video, a TV series video, and so on.
Specifically, after receiving an initial video, the server may split the received initial video into at least one video clip according to a preset splitting manner. In practical applications, the initial video may be split according to the number of video frames, a preset splitting duration, and the like; for example, the initial video is split into multiple video clips each consisting of 30 video frames, or, if the preset duration is 6 seconds, the initial video is split into multiple 6-second video clips. However, the splitting manners of the initial video include, but are not limited to, the above two manners and may be set according to the specific application, which is not limited in the present application.
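By way of illustration only (this sketch does not form part of the original disclosure), the duration-based splitting described above can be expressed as follows; the 6-second default and the helper names are assumptions of the example.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    index: int     # position of the clip in the initial video
    start: float   # start time in seconds
    end: float     # end time in seconds

def split_by_duration(total_duration: float, clip_duration: float = 6.0) -> list[Clip]:
    """Split an initial video of `total_duration` seconds into consecutive
    clips of `clip_duration` seconds; the last clip may be shorter."""
    clips, start, index = [], 0.0, 0
    while start < total_duration:
        end = min(start + clip_duration, total_duration)
        clips.append(Clip(index, start, end))
        start, index = end, index + 1
    return clips

# Example: a 20-second video yields clips [0,6), [6,12), [12,18), [18,20)
print(split_by_duration(20.0))
```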
Step 204: obtaining, based on a feature extraction model, a first modal feature, a second modal feature, and a third modal feature corresponding to each video clip of the at least one video clip.
The first modal feature, the second modal feature, and the third modal feature are features of three different modalities.
Specifically, after the initial video is split into multiple video clips, each video clip is input into the feature extraction models to obtain the first modal feature, the second modal feature, and the third modal feature corresponding to each video clip, so that the highlight video clips in the initial video can subsequently be recognized more accurately based on these three modal features. The specific implementation is as follows:
The obtaining, based on a feature extraction model, a first modal feature, a second modal feature, and a third modal feature corresponding to each video clip of the at least one video clip includes:
obtaining, based on a first feature extraction model, an audio feature corresponding to each video clip of the at least one video clip;
obtaining, based on a second feature extraction model, an image feature corresponding to each video clip of the at least one video clip;
obtaining, based on a third feature extraction model, a structured feature corresponding to each video clip of the at least one video clip.
The feature extraction models used differ according to the modality. In the case where the first modal feature, the second modal feature, and the third modal feature are features of three different modalities, the first feature extraction model, the second feature extraction model, and the third feature extraction model are also three different feature extraction models.
For example, when the first modal feature is an audio feature, the first feature extraction model can be understood as an audio feature extraction model; when the second modal feature is an image feature, the second feature extraction model can be understood as an image feature extraction model; and when the third modal feature is a structured feature, the third feature extraction model can be understood as a structured feature extraction model.
In specific implementation, each video clip is input into the audio feature extraction model, the image feature extraction model, and the structured feature extraction model respectively, so that the audio feature, the image feature, and the structured feature of each video clip can be obtained. Based on these multi-modal features, the highlight video clips in the initial video can subsequently be recognized more accurately.
Specifically, the obtaining, based on a first feature extraction model, an audio feature corresponding to each video clip of the at least one video clip includes:
extracting the audio information of each video clip of the at least one video clip;
inputting the audio information of each video clip into the first feature extraction model to obtain the audio feature corresponding to each video clip.
Specifically, when the first modal feature is an audio feature, the audio information of each video clip is first extracted, and the audio information is then input into the first feature extraction model to obtain the audio feature corresponding to the audio information of each video clip.
When the first modal feature is an audio feature, the first feature extraction model may be an audio feature extraction model, such as a pre-trained MFCC model or VGGish model.
In the present application, the audio information of each video clip is extracted, and the audio feature corresponding to the audio information of each video clip is then accurately obtained based on the pre-trained audio feature extraction model. This audio feature can subsequently be fused with features of other modalities to obtain an accurate score for each video clip.
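As a minimal sketch of the MFCC-based audio feature extraction mentioned above, assuming the audio track of a clip has already been exported to a WAV file: the use of the librosa library, the 16 kHz sampling rate, and the mean-pooling over time are illustrative choices, not part of the original disclosure.

```python
import numpy as np
import librosa  # assumed third-party dependency for audio analysis

def clip_audio_feature(wav_path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return a fixed-length audio feature vector for one video clip."""
    y, sr = librosa.load(wav_path, sr=16000)                 # mono waveform at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    # Mean-pool over time so every clip yields a vector of the same size
    return mfcc.mean(axis=1)
```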
Optionally, the audio information includes audio information corresponding to the video picture and audio information corresponding to the non-video picture;
correspondingly, the inputting the audio information of each video clip into the first feature extraction model to obtain the audio feature corresponding to each video clip includes:
inputting the audio information corresponding to the video picture in each video clip into the first feature extraction model to obtain the audio feature corresponding to the video picture in each video clip;
inputting the audio information corresponding to the non-video picture in each video clip into the first feature extraction model to obtain the audio feature corresponding to the non-video picture in each video clip;
fusing the audio feature corresponding to the video picture in each video clip and the audio feature corresponding to the non-video picture in each video clip to obtain the audio feature corresponding to each video clip.
The audio information corresponding to the video picture can be understood as the audio information corresponding to the main video picture, such as the sound of a game character in a game picture; the audio information corresponding to the non-video picture can be understood as the audio information corresponding to the non-main video picture, such as the voice of the live commentator or the audio of other small-window videos in the video picture.
In practical applications, if the initial video is a video in a live-streaming scenario, the initial video will contain not only the dubbing of the video picture but possibly also the voice of the live commentator. Therefore, each split video clip may contain two types of audio information, namely the audio information corresponding to the video picture and the audio information corresponding to the non-video picture. For example, in a game live-streaming scenario, the audio information corresponding to the video picture can be understood as the sound of the game characters in the game picture, and the audio information corresponding to the non-video picture can be understood as the voice of the game commentator explaining the game.
In order to ensure the accuracy of the audio features of the audio information in each video clip, in practical applications these two types of audio information acquired from each video clip are separately passed through the first feature extraction model for audio feature extraction.
Specifically, the audio information corresponding to the video picture in each video clip is input into the first feature extraction model to obtain the first audio feature corresponding to the video clip; the audio information corresponding to the non-video picture in the video clip is then input into the first feature extraction model to obtain the audio feature corresponding to the non-video picture in the video clip; finally, the first audio feature and the audio feature corresponding to the non-video picture in each video clip are fused to obtain the final audio feature of the video clip. The first feature extraction model may be an audio feature extraction model, such as a pre-trained MFCC model or VGGish model.
In practical applications, the initial video may contain only the video dubbing, that is, the audio information corresponding to the video picture; in this case, only the audio feature of the video dubbing in each video clip needs to be extracted. If the initial video contains both the video dubbing and other additional dubbing, in order to ensure complete and accurate extraction of the audio features, the two types of audio information in each video clip need to be separately subjected to audio feature extraction and then fused, so as to avoid feature confusion caused by extracting audio features from the mixed audio at the same time.
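A possible sketch of the fusion of the two audio streams described above, assuming each stream has already been reduced to a fixed-length feature vector; concatenation is just one fusion choice and is not prescribed by the original disclosure.

```python
import numpy as np

def fuse_audio_features(picture_audio_feat: np.ndarray,
                        non_picture_audio_feat: np.ndarray) -> np.ndarray:
    """Fuse the audio feature of the video picture (e.g. game sound) with the
    audio feature of the non-video picture (e.g. commentator voice) for one
    clip; the downstream recognition model sees both parts."""
    return np.concatenate([picture_audio_feat, non_picture_audio_feat])
```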
In another embodiment of this specification, the obtaining, based on a second feature extraction model, an image feature corresponding to each video clip of the at least one video clip includes:
extracting the image information of each video clip of the at least one video clip;
inputting the image information into the second feature extraction model to obtain the image feature corresponding to each video clip.
Specifically, when the second modal feature is an image feature, the image information of each video clip is extracted, and the image information is then input into the second feature extraction model to obtain the image feature corresponding to the image information of each video clip.
When the second modal feature is an image feature, the second feature extraction model may be an image feature extraction model, such as a pre-trained MobileNet model or ResNet model.
In the present application, the image information of each video clip is extracted, and the image feature corresponding to the image information of each video clip is then accurately obtained based on the pre-trained image feature extraction model. This image feature can subsequently be fused with features of other modalities to obtain an accurate score for each video clip.
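A sketch of frame-level image feature extraction with a pre-trained MobileNet, as one possible instantiation of the lightweight image feature extraction model mentioned above; the torchvision calls, the frame sampling, and the mean-pooling over frames are assumptions of this example rather than the original disclosure.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# MobileNetV2 pre-trained on ImageNet, used only as a feature extractor
backbone = models.mobilenet_v2(weights="IMAGENET1K_V1")
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def clip_image_feature(frame_paths: list[str]) -> torch.Tensor:
    """Average the 1280-d MobileNet features of the frames sampled from one clip."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in frame_paths])
    fmap = backbone.features(batch)                                   # (N, 1280, 7, 7)
    vecs = torch.nn.functional.adaptive_avg_pool2d(fmap, 1).flatten(1)
    return vecs.mean(dim=0)                                           # one vector per clip
```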
Optionally, the image information includes a video picture and a key area picture, and the second feature extraction model includes a first image feature extraction model and a second image feature extraction model;
correspondingly, the inputting the image information into the second feature extraction model to obtain the image feature corresponding to each video clip includes:
inputting the video picture into the first image feature extraction model to obtain the first image feature corresponding to each video clip;
inputting the key area picture into the second image feature extraction model to obtain the second image feature corresponding to each video clip;
fusing the first image feature and the second image feature corresponding to each video clip to obtain the image feature corresponding to each video clip.
In practical applications, the video picture of the initial video may include, but is not limited to, key area information (such as the score area and the kill area in a game video) and the overall picture information of each video frame; therefore, each video frame of each split video clip also contains these two types of image information.
In order to ensure the comprehensiveness and accuracy of the image features of the image information in each video clip, in practical applications these two types of image information acquired from each video clip are passed through different image feature extraction models to obtain the image feature corresponding to each type of image information.
For example, for the overall picture information (that is, the video picture) of each video frame in each video clip, image features are extracted through a CNN model; for the key area information (that is, the key area picture) of each video frame, image features are extracted based on different image feature extraction models according to the type of video. For example, in a game score scenario, the image feature extraction model may be a score image feature extraction model, through which the score feature of the score area of each video frame in each video clip is obtained.
Specifically, the video picture of each video clip is input into the first image feature extraction model to obtain the first image feature corresponding to the video clip; the key area picture of the video clip is then input into the second image feature extraction model to obtain the second image feature corresponding to the video clip; finally, the first image feature and the second image feature are fused to obtain the final image feature of the video clip.
In practical applications, the initial video may contain only the video picture; in this case, only the image feature of the video picture in each video clip needs to be extracted. If the initial video contains both the video picture information and a key area picture corresponding to the domain of the initial video, in order to ensure complete and accurate extraction of the image features, the two types of image information in each video clip need to be separately subjected to image feature extraction, so as to ensure the completeness and comprehensiveness of the image features of each video clip.
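Purely as an illustration of the key-area branch described above, the following sketch crops a hypothetical score region from each frame, extracts a feature from each crop with an arbitrary extractor, and fuses it with the whole-picture feature; the region coordinates, the extractor interface, and the concatenation-based fusion are all assumptions of this example.

```python
from typing import Callable
from PIL import Image
import numpy as np

# Hypothetical score-area coordinates (left, top, right, bottom) in pixels; a real
# system would use a detector or a per-game configuration instead.
SCORE_BOX = (560, 0, 720, 60)

def key_area_feature(frame_paths: list[str],
                     extract: Callable[[Image.Image], np.ndarray]) -> np.ndarray:
    """Crop the key area from each sampled frame and average the features
    returned by the image feature extractor `extract`."""
    feats = [extract(Image.open(p).convert("RGB").crop(SCORE_BOX)) for p in frame_paths]
    return np.mean(feats, axis=0)

def fuse_image_features(picture_feat: np.ndarray, key_area_feat: np.ndarray) -> np.ndarray:
    """One simple fusion of the whole-picture feature and the key-area feature."""
    return np.concatenate([picture_feat, key_area_feat])
```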
Optionally, the image information includes a video picture, a key area picture, and an anchor image, and the second feature extraction model includes a first image feature extraction model, a second image feature extraction model, and a third image feature extraction model;
correspondingly, the inputting the image information into the second feature extraction model to obtain the image feature corresponding to each video clip includes:
inputting the video picture into the first image feature extraction model to obtain the video picture feature corresponding to each video clip;
inputting the key area picture into the second image feature extraction model to obtain the key area picture feature corresponding to each video clip;
inputting the anchor image into the third image feature extraction model to obtain the anchor image feature corresponding to each video clip;
fusing the video picture feature, the key area picture feature, and the anchor image feature corresponding to each video clip to obtain the image feature corresponding to each video clip.
In practical applications, if the initial video is a video in a live-streaming scenario, the initial video will contain not only the video picture and the key area picture related to the domain of the initial video, but possibly also facial images of the live commentator. Therefore, each split video clip will also contain these three types of image information, namely the video picture, the key area picture, and the anchor image.
In order to ensure the comprehensiveness and accuracy of the image features of the image information in each video clip, in practical applications these three types of image information acquired from each video clip are passed through different image feature extraction models to obtain the image feature corresponding to each type of image information.
For example, for the overall picture information (that is, the video picture) of each video frame in each video clip, image features are extracted through a CNN model; for the key area information (that is, the key area picture) of each video frame, image features are extracted based on different image feature extraction models according to the type of video. For example, in a game score scenario, the image feature extraction model may be a score image feature extraction model, through which the score feature of the score area of each video frame in each video clip is obtained. For the facial information of the live commentator (that is, the anchor image) in each video frame, image features may be extracted by a convolutional neural network pre-trained on facial emotions.
Specifically, the video picture of each video clip is input into the first image feature extraction model to obtain the video picture feature corresponding to the video clip; the key area picture of the video clip is then input into the second image feature extraction model to obtain the key area picture feature corresponding to the video clip; the anchor image is then input into the third image feature extraction model to obtain the anchor image feature corresponding to the video clip; finally, the video picture feature, the key area picture feature, and the anchor image feature are fused to obtain the final image feature of the video clip.
In practical applications, the initial video may contain only the video picture information; in this case, only the image feature of the video picture in each video clip needs to be extracted. If the initial video contains the video picture and a key area picture corresponding to the domain of the initial video, in order to ensure complete and accurate extraction of the image features, the two types of image information in each video clip need to be separately subjected to image feature extraction, so as to ensure the completeness and comprehensiveness of the image features of each video clip. Then, in the case where the initial video contains the video picture, a key area picture corresponding to the domain of the initial video, and images of the live commentator, in order to ensure complete and accurate extraction of the image features, the three types of image information in each video clip need to be separately subjected to image feature extraction, so as to ensure the completeness and comprehensiveness of the image features of each video clip.
In another embodiment of this specification, the obtaining, based on the third feature extraction model, a structured feature corresponding to each video clip of the at least one video clip includes:
extracting the structured information of each video clip of the at least one video clip;
inputting the structured information into the third feature extraction model to obtain the structured feature corresponding to each video clip.
The structured information includes, but is not limited to, text information, numerical information, and the like, such as the video title, comment information, and bullet-screen comments of the initial video; if the initial video is a live video, the structured information may also include gift information, top-up information, top-up amounts, and so on.
Specifically, the third modal feature can be understood as a structured feature. When the third modal feature is a structured feature, the structured information of each video clip is extracted, and the structured information is then input into the third feature extraction model to obtain the structured feature corresponding to the structured information of each video clip.
When the third modal feature is a structured feature, the third feature extraction model may be a structured feature extraction model, such as a pre-trained Word2vec model or Bert model.
In the present application, the structured information of each video clip is extracted, and the structured feature corresponding to the structured information of each video clip is then accurately obtained based on the pre-trained structured feature extraction model. This structured feature can subsequently be fused with the above audio feature and image feature to obtain an accurate score for each video clip.
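A sketch of one way to turn the structured information above into a feature vector: text such as titles or bullet-screen comments is averaged over pre-trained word vectors, and numeric fields are min-max normalized to [0, 1]. The `embeddings` lookup, the field names, and the maximum values are assumptions of this example.

```python
import numpy as np

def normalize(value: float, min_value: float, max_value: float) -> float:
    """Min-max normalize a numeric field (e.g. bullet-screen count) to [0, 1]."""
    if max_value == min_value:
        return 0.0
    return (value - min_value) / (max_value - min_value)

def text_feature(tokens: list[str], embeddings: dict, dim: int = 100) -> np.ndarray:
    """Average pre-trained word vectors (e.g. from a Word2vec model) for a
    title, comment, or bullet-screen string; `embeddings` maps token -> vector."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def structured_feature(tokens, embeddings, danmaku_count, gift_count,
                       max_danmaku=5000, max_gifts=1000) -> np.ndarray:
    numeric = np.array([normalize(danmaku_count, 0, max_danmaku),
                        normalize(gift_count, 0, max_gifts)])
    return np.concatenate([text_feature(tokens, embeddings), numeric])
```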
Step 206: inputting the first modal feature, the second modal feature, and the third modal feature corresponding to each video clip into a recognition model to obtain a video score corresponding to each video clip, and determining a target video clip in the initial video based on the video scores.
In specific implementation, the first modal feature (that is, the audio feature), the second modal feature (that is, the image feature), and the third modal feature (that is, the structured feature) corresponding to each video clip may be concatenated to obtain the target video feature corresponding to each video clip, which is then input into the recognition model to obtain the video score corresponding to each video clip. Alternatively, the three modal features corresponding to each video clip may be input directly into the recognition model, which performs dimensionality reduction, normalization, weighting, and other processing and then directly outputs the video score corresponding to each video clip.
The recognition model includes, but is not limited to, a GBDT model, an Attention-based deep neural network model, and the like.
In practical applications, the GBDT model provides a commonly used feature fusion algorithm that can identify the importance of the input features and uses labeled training data to regress the score of the corresponding video clip; an Attention-based deep neural network model uses the training data to simultaneously learn the distribution of the importance of different modal features and the regression of the video clip scores. The trained recognition model is stored on the corresponding device; in actual use, the extracted multi-modal features corresponding to a video clip are input into the recognition model, which directly outputs the video score corresponding to the video clip.
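A minimal sketch of the GBDT-style scoring described above, using scikit-learn's gradient boosting regressor; the feature dimensionality, the score range, and the randomly generated placeholder training data are assumptions of this example, not real annotations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# X_train: one row per labelled clip (concatenated audio/image/structured features)
# y_train: human-annotated highlight scores for those clips (placeholders below)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))
y_train = rng.uniform(0, 100, size=200)

recognizer = GradientBoostingRegressor(n_estimators=200, max_depth=3)
recognizer.fit(X_train, y_train)

def score_clip(clip_feature: np.ndarray) -> float:
    """Return the predicted video score of one clip."""
    return float(recognizer.predict(clip_feature.reshape(1, -1))[0])
```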
In another embodiment of this specification, the determining a target video clip in the initial video based on the video scores includes:
determining a video clip whose video score is greater than or equal to a preset score threshold as a target video clip in the initial video.
The preset score threshold may be set according to actual needs, which is not limited in the present application; for example, the preset score threshold may be 80 points.
Then, in the case where the preset score threshold is 80 points, the video clips whose video scores are greater than or equal to 80 points are determined as the target video clips in the initial video.
In practical applications, the target video clips may also be determined in other ways, for example, by sorting the video clips in descending order of their video scores and then selecting the top three, four, or six video clips as the target video clips.
In the present application, after the video score of each video clip in the initial video is obtained, the target video clips in the initial video can be accurately obtained based on the video score of each video clip, and video recommendation, video highlight compilation, and the like can subsequently be performed based on the target video clips.
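The two selection strategies just described (score threshold, or top-N by descending score) can be sketched as follows; the function name and the default 80-point threshold are illustrative.

```python
from typing import Optional

def select_target_clips(scores: dict,
                        threshold: float = 80.0,
                        top_n: Optional[int] = None) -> list:
    """Pick target clips either by a preset score threshold or, alternatively,
    by keeping the top-N highest-scoring clips."""
    if top_n is not None:
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:top_n]
    return [idx for idx, s in scores.items() if s >= threshold]

# Example: clips 2 and 5 meet the 80-point threshold
print(select_target_clips({1: 35.0, 2: 91.5, 3: 62.0, 5: 80.0}))  # -> [2, 5]
```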
In specific implementation, after the determining a video clip whose video score is greater than or equal to a preset score threshold as a target video clip in the initial video, the method further includes:
generating a target video based on the target video clips, and sending the target video to a user.
Specifically, after the target video clips are obtained, the target video clips may be spliced to generate a target video. Since the target video clips all have high video scores and contain much content that attracts users' attention, generating a target video from the target video clips and sending it to users can increase the click-through rate and viewing rate of the target video; if advertisements exist in the target video, the exposure of the advertisements can also be greatly increased.
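One way the splicing step could look, assuming the moviepy 1.x API is available; the function name, the output file name, and the (start, end) segment representation are assumptions of this sketch rather than the original disclosure.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def build_target_video(source_path: str, segments: list,
                       output_path: str = "highlights.mp4") -> None:
    """Cut the target clips (a list of (start, end) pairs in seconds) out of the
    initial video and splice them into a single target video."""
    source = VideoFileClip(source_path)
    clips = [source.subclip(start, end) for start, end in segments]
    concatenate_videoclips(clips).write_videofile(output_path)
    source.close()
```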
In addition, when the video processing method provided by the present application is applied in different scenarios, the models used may also differ. For example, if the video processing method is applied in a scenario with high real-time requirements, such as a live-streaming scenario, the feature extraction models, the recognition model, and the like may be lightweight models to improve the overall processing speed of the video processing method; if the video processing method is applied to background processing, the feature extraction models, the recognition model, and the like may be more complex deep learning models to ensure the accuracy of the video processing method. The specific implementation is as follows:
The method further includes:
determining the type information of the feature extraction model and/or the recognition model according to the scenario to which the video processing method is applied and/or the resource requirements of the video processing method.
For example, when the video processing method is applied to a real-time processing scenario (such as a live-streaming scenario), or when the resource requirements of the video processing method are less than or equal to a preset resource threshold, one type of feature extraction model and/or recognition model (such as a lightweight initial feature extraction model and/or recognition model) may be used, where the preset resource threshold may be set according to actual needs and is not limited in the present application. When the video processing method is applied to a background processing scenario (such as a scenario after the live stream ends), or when the resource requirements of the video processing method are greater than the preset resource threshold, another type of feature extraction model and/or recognition model may be used.
In practical applications, in a real-time scenario, one type of model is used for the audio features, the image features, the structured features, the feature fusion, and so on; in a background scenario, another type of model is used for each of them, as shown in Table 1.
Table 1

  Feature type          Real-time scenario (e.g., during a live stream)    Background scenario (e.g., after the live stream ends)
  Audio feature         MFCC                                               VGGish
  Image feature         MobileNet                                          ResNet
  Structured feature    Word2vec                                           Bert
  Feature fusion        GBDT                                               Attention
As can be seen from Table 1, when the video processing method is applied to a real-time processing scenario (such as a live-streaming scenario), or when the resource requirements of the video processing method are less than or equal to the preset resource threshold, the audio feature extraction model may be an MFCC model, the image feature extraction model may be a MobileNet model, the structured feature extraction model may be a Word2vec model, and the feature fusion model may be a GBDT model. When the video processing method is applied to a background processing scenario (such as a scenario after the live stream ends), or when the resource requirements of the video processing method are greater than the preset resource threshold, the audio feature extraction model may be a VGGish model, the image feature extraction model may be a ResNet model, the structured feature extraction model may be a Bert model, and the feature fusion model may be an Attention model.
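An illustrative configuration lookup mirroring Table 1, with lightweight models for real-time (live) processing and heavier models for background processing; the dictionary keys, the scenario names, and the resource-threshold rule are assumptions of this sketch.

```python
MODEL_CONFIG = {
    "realtime":   {"audio": "MFCC",   "image": "MobileNet", "structured": "Word2vec", "fusion": "GBDT"},
    "background": {"audio": "VGGish", "image": "ResNet",    "structured": "Bert",     "fusion": "Attention"},
}

def choose_models(scenario: str, resource_requirement: float,
                  resource_threshold: float = 1.0) -> dict:
    """Pick the model types from the application scenario and/or resource requirement."""
    if scenario == "live" or resource_requirement <= resource_threshold:
        return MODEL_CONFIG["realtime"]
    return MODEL_CONFIG["background"]
```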
In the present application, the video processing method fuses the acquired audio features, image features, and structured features of the video and, when identifying the highlight video clips of the video based on the fused multi-modal features, accurately obtains the highlight video clips of the video through the comprehensive multi-modal feature information of the video, thereby enhancing the user experience.
Specifically, the video processing method extracts different features from multiple audio channels, multiple video channels, and structured information, performs feature fusion, and scores the highlight degree of each video clip of the current video through the fused features, so that the highlight video clips of the video can finally be identified accurately and quickly. Through multi-modal feature fusion, the video processing method can obtain the global information of the video and thus accurately obtain the highlight video clips of the video. In addition, the method can adopt different algorithm configurations and deployments for different application scenarios and/or resource requirements, thereby meeting the processing speed and accuracy requirements of the video processing method.
Referring to FIG. 3, FIG. 3 shows a processing flowchart of a video processing method applied to a live-streaming scenario provided by an embodiment of the present application, which specifically includes the following steps.
Step 302: extracting the audio from the live video to obtain the audio features of the live video.
Specifically, the audio containing the video content and the audio of the video uploader during the live stream are extracted from the live video; if the live video does not contain the audio of the video uploader, only the audio of the video content is extracted. The audio content is strongly correlated with the video as a whole: the highlight video clips of a video are often accompanied by a higher pitch in the video itself, or by the video uploader raising the volume, laughing, and so on. After the audio is extracted, the corresponding audio features can be extracted through traditional audio features such as MFCC, volume, and pitch, or through a VGGish-based deep neural network, where VGGish is a model that provides audio feature extraction after classification pre-training on a large amount of audio data.
Step 304: extracting the video images from the live video to obtain the image features of the live video.
Specifically, the video images include the overall video picture, the video uploader picture, and the key area information. The overall video picture contains the features of the entire video, and highlight moments are often accompanied by rich colors and content. The video uploader picture mostly contains the facial information of the video uploader, where the expression and fluctuation of emotions are strongly correlated with highlight moments. The key area information refers to the areas that users often pay attention to, and highlight moments are usually closely related to these key areas. For example, in game videos, users often pay attention to the score area and the kill prompt area, while in dance and similar videos, users may often pay attention to the area in the middle of the picture that contains people.
In the present application, the features of the overall video picture are extracted through a CNN convolutional neural network pre-trained on ImageNet; the video uploader picture information can be extracted using a convolutional neural network pre-trained on facial emotions; and for the key area information, different detectors are trained according to different types of videos to extract the corresponding features, such as score information features, kill information features, and human action information features.
Step 306: extracting the structured information from the live video to obtain the structured features of the live video.
Specifically, videos usually also contain a lot of structured information, such as titles, comments, and bullet-screen comments; live videos additionally contain gifts, top-ups, and the like. Such structured information is also related to highlight moments, for example the comment content, the bullet-screen content, the number of bullet-screen comments, and the number of gifts. For this structured information, the present application proposes structured feature extraction based on Word2vec or Bert for text such as titles, comments, and bullet-screen comments. At the same time, numerical information such as the number of bullet-screen comments, the number of gifts, the value of gifts, and the number of top-ups can be normalized to [0, 1] and likewise used as features extracted from the structured information.
Step 308: performing feature fusion on the audio features, image features, and structured features of the live video.
Specifically, after the multi-modal feature extraction of the live video is completed, the present application adopts a feature-level fusion strategy. During feature fusion, the features of each video clip are aggregated, and after dimensionality reduction and normalization, feature fusion can be performed using a traditional GBDT (Gradient Boosting Decision Tree) or an Attention-based deep neural network. Specifically, the GBDT algorithm based on multiple decision trees is a commonly used feature fusion algorithm that can identify the importance of the input features and uses labeled training data (that is, the prepared video clips and the corresponding highlight scores) to regress the score of the corresponding clip. An Attention-based neural network uses the training data to simultaneously learn the distribution of the importance of different modalities and the regression of the clip scores. The trained model is stored on the corresponding device; during testing and use, the features of the corresponding video clip are extracted, and the trained regression model automatically outputs the score of the clip, thereby identifying the highlight video clips of the video.
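A small sketch of what the Attention-based fusion and score regression described above could look like in PyTorch; the modality dimensions, hidden size, and head structure are illustrative assumptions, not the original network.

```python
import torch
from torch import nn

class AttentionFusionScorer(nn.Module):
    """Project each modality feature to a common size, weight the modalities with
    learned attention, and regress a highlight score for the clip."""
    def __init__(self, dims=(40, 1280, 102), hidden=128):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, audio, image, structured):
        # (batch, n_modalities, hidden)
        h = torch.stack([p(x) for p, x in zip(self.proj, (audio, image, structured))], dim=1)
        weights = torch.softmax(self.attn(torch.tanh(h)), dim=1)   # per-modality importance
        fused = (weights * h).sum(dim=1)                           # weighted fusion
        return self.head(fused).squeeze(-1)                        # predicted clip scores

# Example forward pass with random features for a batch of 4 clips
model = AttentionFusionScorer()
scores = model(torch.randn(4, 40), torch.randn(4, 1280), torch.randn(4, 102))
```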
Step 310: outputting the highlight video clips of the live video.
In addition, the present application can also select different algorithm configurations according to different scenarios and resources. For example, when real-time processing is required and computing resources are limited, traditional features such as MFCC can be used for audio feature extraction, lightweight networks such as MobileNet can be used for video feature extraction, and faster extraction methods such as Word2vec can be used for structured feature extraction.
In a background environment without high real-time requirements, the audio features, image features, and structured features can be extracted using deep-neural-network-based feature extraction methods, so as to output more accurate results.
In the present application, the video processing method extracts different features from multiple audio channels, multiple video channels, and structured information, performs feature fusion, and scores the highlight degree of each video clip of the current video through the fused features, so that the highlight video clips of the video can finally be identified accurately and quickly. Through multi-modal feature fusion, the video processing method can obtain the global information of the video and thus accurately obtain the highlight video clips of the video. In addition, the method can adopt different algorithm configurations and deployments for different environments, thereby meeting the processing speed and accuracy requirements.
Corresponding to the above method embodiments, the present application also provides embodiments of a video processing apparatus. FIG. 4 shows a schematic structural diagram of a video processing apparatus provided by an embodiment of the present application. As shown in FIG. 4, the apparatus includes:
a video splitting module 402 configured to split a received initial video into at least one video clip;
a feature extraction module 404 configured to obtain, based on a feature extraction model, a first modal feature, a second modal feature, and a third modal feature corresponding to each video clip of the at least one video clip;
a target determination module 406 configured to input the first modal feature, the second modal feature, and the third modal feature corresponding to each video clip into a recognition model, obtain a video score corresponding to each video clip, and determine a target video clip in the initial video based on the video scores.
Optionally, the feature extraction module 404 is further configured to:
obtain, based on a first feature extraction model, an audio feature corresponding to each video clip of the at least one video clip;
obtain, based on a second feature extraction model, an image feature corresponding to each video clip of the at least one video clip;
obtain, based on a third feature extraction model, a structured feature corresponding to each video clip of the at least one video clip.
Optionally, the feature extraction module 404 is further configured to:
extract the audio information of each video clip of the at least one video clip;
input the audio information of each video clip into the first feature extraction model to obtain the audio feature corresponding to each video clip.
Optionally, the feature extraction module 404 is further configured to:
extract the image information of each video clip of the at least one video clip;
input the image information into the second feature extraction model to obtain the image feature corresponding to each video clip.
Optionally, the feature extraction module 404 is further configured to:
extract the structured information of each video clip of the at least one video clip;
input the structured information into the third feature extraction model to obtain the structured feature corresponding to each video clip.
Optionally, the image information includes a video picture and a key area picture, and the second feature extraction model includes a first image feature extraction model and a second image feature extraction model;
correspondingly, the feature extraction module 404 is further configured to:
input the video picture into the first image feature extraction model to obtain the video picture feature corresponding to each video clip;
input the key area picture into the second image feature extraction model to obtain the key area picture feature corresponding to each video clip;
fuse the video picture feature and the key area picture feature corresponding to each video clip to obtain the image feature corresponding to each video clip.
Optionally, the image information includes a video picture, a key area picture, and an anchor image, and the second feature extraction model includes a first image feature extraction model, a second image feature extraction model, and a third image feature extraction model;
correspondingly, the feature extraction module 404 is further configured to:
input the video picture into the first image feature extraction model to obtain the video picture feature corresponding to each video clip;
input the key area picture into the second image feature extraction model to obtain the key area picture feature corresponding to each video clip;
input the anchor image into the third image feature extraction model to obtain the anchor image feature corresponding to each video clip;
fuse the video picture feature, the key area picture feature, and the anchor image feature corresponding to each video clip to obtain the image feature corresponding to each video clip.
Optionally, the audio information includes audio information corresponding to the video picture and audio information corresponding to the non-video picture;
correspondingly, the feature extraction module 404 is further configured to:
input the audio information corresponding to the video picture in each video clip into the first feature extraction model to obtain the audio feature corresponding to the video picture in each video clip;
input the audio information corresponding to the non-video picture in each video clip into the first feature extraction model to obtain the audio feature corresponding to the non-video picture in each video clip;
fuse the audio feature corresponding to the video picture in each video clip and the audio feature corresponding to the non-video picture in each video clip to obtain the audio feature corresponding to each video clip.
Optionally, the target determination module 406 is further configured to:
determine a video clip whose video score is greater than or equal to a preset score threshold as a target video clip in the initial video.
Optionally, the apparatus further includes:
a target video generation module configured to generate a target video based on the target video clips and send the target video to a user.
Optionally, the apparatus further includes:
a model determination module configured to determine the type information of the feature extraction model and/or the recognition model according to the scenario to which the video processing method is applied and/or the resource requirements of the video processing method.
In the present application, the video processing apparatus fuses the acquired first modal feature, second modal feature, and third modal feature of the video and, when identifying the highlight video clips of the video based on the fused multi-modal features, accurately obtains the highlight video clips of the video through the comprehensive multi-modal feature information of the video, thereby enhancing the user experience.
The above is a schematic solution of the video processing apparatus of this embodiment. It should be noted that the technical solution of the video processing apparatus and the technical solution of the above video processing method belong to the same concept; for details not described in detail in the technical solution of the video processing apparatus, reference may be made to the description of the technical solution of the above video processing method.
FIG. 5 shows a structural block diagram of a computing device 500 provided according to an embodiment of this specification. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. The processor 520 is connected to the memory 510 through a bus 530, and a database 550 is used to store data.
The computing device 500 further includes an access device 540 that enables the computing device 500 to communicate via one or more networks 560. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 540 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
In one embodiment of this specification, the above components of the computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, for example through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 5 is for illustration only and does not limit the scope of this specification. Those skilled in the art may add or replace other components as required.
The computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smart watch, smart glasses, etc.), another type of mobile device, or a stationary computing device such as a desktop computer or a PC. The computing device 500 may also be a mobile or stationary server.
The processor 520 is configured to execute computer-executable instructions, and the processor 520 implements the steps of the video processing method when executing the instructions.
The above is a schematic solution of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above video processing method belong to the same concept; for details not described in detail in the technical solution of the computing device, reference may be made to the description of the technical solution of the above video processing method.
An embodiment of the present application further provides a computer-readable storage medium that stores computer instructions which, when executed by a processor, implement the steps of the video processing method described above.
The above is a schematic solution of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above video processing method belong to the same concept; for details not described in detail in the technical solution of the storage medium, reference may be made to the description of the technical solution of the above video processing method.
An embodiment of the present application further provides a computer program product which, when executed in a computer, causes the computer to execute the steps of the video processing method described above.
The above is a schematic solution of the computer program product of this embodiment. It should be noted that the technical solution of the computer program product and the technical solution of the above video processing method belong to the same concept; for details not described in detail in the technical solution of the computer program product, reference may be made to the description of the technical solution of the above video processing method.
Specific embodiments of the present application have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program product code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program product code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily all required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are only intended to help explain the present application. The optional embodiments do not describe all the details exhaustively, nor do they limit the invention to only the specific implementations described. Obviously, many modifications and changes can be made according to the content of the present application. These embodiments are selected and specifically described in the present application in order to better explain the principles and practical applications of the present application, so that those skilled in the art can well understand and use the present application. The present application is limited only by the claims and their full scope and equivalents.

Claims (15)

  1. A video processing method, comprising:
    splitting a received initial video into at least one video clip;
    obtaining, based on a feature extraction model, a first modal feature, a second modal feature, and a third modal feature corresponding to each video clip of the at least one video clip;
    inputting the first modal feature, the second modal feature, and the third modal feature corresponding to each video clip into a recognition model to obtain a video score corresponding to each video clip, and determining a target video clip in the initial video based on the video scores.
  2. The video processing method according to claim 1, wherein the obtaining, based on a feature extraction model, a first modal feature, a second modal feature, and a third modal feature corresponding to each video clip of the at least one video clip comprises:
    obtaining, based on a first feature extraction model, an audio feature corresponding to each video clip of the at least one video clip;
    obtaining, based on a second feature extraction model, an image feature corresponding to each video clip of the at least one video clip;
    obtaining, based on a third feature extraction model, a structured feature corresponding to each video clip of the at least one video clip.
  3. The video processing method according to claim 1 or 2, wherein the obtaining, based on a first feature extraction model, an audio feature corresponding to each video clip of the at least one video clip comprises:
    extracting the audio information of each video clip of the at least one video clip;
    inputting the audio information of each video clip into the first feature extraction model to obtain the audio feature corresponding to each video clip.
  4. The video processing method according to claim 2, wherein the obtaining, based on a second feature extraction model, an image feature corresponding to each video clip of the at least one video clip comprises:
    extracting the image information of each video clip of the at least one video clip;
    inputting the image information into the second feature extraction model to obtain the image feature corresponding to each video clip.
  5. The video processing method according to claim 2 or 4, wherein the obtaining, based on the third feature extraction model, a structured feature corresponding to each video clip of the at least one video clip comprises:
    extracting the structured information of each video clip of the at least one video clip;
    inputting the structured information into the third feature extraction model to obtain the structured feature corresponding to each video clip.
  6. The video processing method according to claim 4, wherein the image information comprises a video picture and a key area picture, and the second feature extraction model comprises a first image feature extraction model and a second image feature extraction model;
    correspondingly, the inputting the image information into the second feature extraction model to obtain the image feature corresponding to each video clip comprises:
    inputting the video picture into the first image feature extraction model to obtain the video picture feature corresponding to each video clip;
    inputting the key area picture into the second image feature extraction model to obtain the key area picture feature corresponding to each video clip;
    fusing the video picture feature and the key area picture feature corresponding to each video clip to obtain the image feature corresponding to each video clip.
  7. The video processing method according to claim 4 or 6, wherein the image information comprises a video picture, a key area picture, and an anchor image, and the second feature extraction model comprises a first image feature extraction model, a second image feature extraction model, and a third image feature extraction model;
    correspondingly, the inputting the image information into the second feature extraction model to obtain the image feature corresponding to each video clip comprises:
    inputting the video picture into the first image feature extraction model to obtain the video picture feature corresponding to each video clip;
    inputting the key area picture into the second image feature extraction model to obtain the key area picture feature corresponding to each video clip;
    inputting the anchor image into the third image feature extraction model to obtain the anchor image feature corresponding to each video clip;
    fusing the video picture feature, the key area picture feature, and the anchor image feature corresponding to each video clip to obtain the image feature corresponding to each video clip.
  8. The video processing method according to claim 3, wherein the audio information comprises audio information corresponding to the video picture and audio information corresponding to the non-video picture;
    correspondingly, the inputting the audio information of each video clip into the first feature extraction model to obtain the audio feature corresponding to each video clip comprises:
    inputting the audio information corresponding to the video picture in each video clip into the first feature extraction model to obtain the audio feature corresponding to the video picture in each video clip;
    inputting the audio information corresponding to the non-video picture in each video clip into the first feature extraction model to obtain the audio feature corresponding to the non-video picture in each video clip;
    fusing the audio feature corresponding to the video picture and the audio feature corresponding to the non-video picture in each video clip to obtain the audio feature corresponding to each video clip.
  9. The video processing method according to any one of claims 1 to 8, wherein the determining a target video clip in the initial video based on the video scores comprises:
    determining a video clip whose video score is greater than or equal to a preset score threshold as a target video clip in the initial video.
  10. The video processing method according to claim 9, further comprising, after the determining a video clip whose video score is greater than or equal to a preset score threshold as a target video clip in the initial video:
    generating a target video based on the target video clip, and sending the target video to a user.
  11. The video processing method according to any one of claims 1 to 10, further comprising:
    determining the type information of the feature extraction model and/or the recognition model according to the scenario to which the video processing method is applied and/or the resource requirements of the video processing method.
  12. A video processing apparatus, comprising:
    a video splitting module configured to split a received initial video into at least one video clip;
    a feature extraction module configured to obtain, based on a feature extraction model, a first modal feature, a second modal feature, and a third modal feature corresponding to each video clip of the at least one video clip;
    a target determination module configured to input the first modal feature, the second modal feature, and the third modal feature corresponding to each video clip into a recognition model, obtain a video score corresponding to each video clip, and determine a target video clip in the initial video based on the video scores.
  13. A computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 11 when executing the instructions.
  14. A computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1 to 11.
  15. A computer program product which, when executed in a computer, causes the computer to execute the steps of the method according to any one of claims 1 to 11.
PCT/CN2021/120383 2020-12-22 2021-09-24 Video processing method and apparatus WO2022134698A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21908699.8A EP4207770A4 (en) 2020-12-22 2021-09-24 VIDEO PROCESSING METHOD AND APPARATUS
US18/300,310 US20230252785A1 (en) 2020-12-22 2023-04-13 Video processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011531808.4A CN112738557A (zh) 2020-12-22 2020-12-22 视频处理方法及装置
CN202011531808.4 2020-12-22

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/300,310 Continuation US20230252785A1 (en) 2020-12-22 2023-04-13 Video processing

Publications (1)

Publication Number Publication Date
WO2022134698A1 true WO2022134698A1 (zh) 2022-06-30

Family

ID=75604094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120383 WO2022134698A1 (zh) 2020-12-22 2021-09-24 视频处理方法及装置

Country Status (4)

Country Link
US (1) US20230252785A1 (zh)
EP (1) EP4207770A4 (zh)
CN (1) CN112738557A (zh)
WO (1) WO2022134698A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905584A (zh) * 2023-01-09 2023-04-04 共道网络科技有限公司 一种视频拆分方法及装置
US11699463B1 (en) * 2022-04-07 2023-07-11 Lemon Inc. Video processing method, electronic device, and non-transitory computer-readable storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112738557A (zh) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 视频处理方法及装置
CN114501132B (zh) * 2021-12-24 2024-03-12 北京达佳互联信息技术有限公司 一种资源处理方法、装置、电子设备及存储介质
CN114581821A (zh) * 2022-02-23 2022-06-03 腾讯科技(深圳)有限公司 一种视频检测方法、系统及存储介质和服务器
CN115529378A (zh) * 2022-02-28 2022-12-27 荣耀终端有限公司 一种视频处理方法及相关装置

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080193016A1 (en) * 2004-02-06 2008-08-14 Agency For Science, Technology And Research Automatic Video Event Detection and Indexing
CN109089133A (zh) * 2018-08-07 2018-12-25 北京市商汤科技开发有限公司 视频处理方法及装置、电子设备和存储介质
CN109691124A (zh) * 2016-06-20 2019-04-26 皮克索洛特公司 用于自动生成视频亮点的方法和系统
CN110191357A (zh) * 2019-06-28 2019-08-30 北京奇艺世纪科技有限公司 视频片段精彩度评估、动态封面生成方法及装置
CN110267119A (zh) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 视频精彩度的评价方法及相关设备
CN111460219A (zh) * 2020-04-01 2020-07-28 百度在线网络技术(北京)有限公司 视频处理方法及装置、短视频平台
CN111787354A (zh) * 2019-04-03 2020-10-16 浙江大学 一种视频生成方法及其装置
CN112738557A (zh) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 视频处理方法及装置

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100118033A1 (en) * 2008-11-10 2010-05-13 Vistaprint Technologies Limited Synchronizing animation to a repetitive beat source
CN103631932B (zh) * 2013-12-06 2017-03-01 中国科学院自动化研究所 一种对重复视频进行检测的方法
CN108229302A (zh) * 2017-11-10 2018-06-29 深圳市商汤科技有限公司 特征提取方法、装置、计算机程序、存储介质和电子设备
CN108833973B (zh) * 2018-06-28 2021-01-19 腾讯科技(深圳)有限公司 视频特征的提取方法、装置和计算机设备
CN110149541B (zh) * 2019-04-23 2021-08-03 腾讯科技(深圳)有限公司 视频推荐方法、装置、计算机设备及存储介质
CN110263220A (zh) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 一种视频精彩片段识别方法及装置
CN111400601B (zh) * 2019-09-16 2023-03-10 腾讯科技(深圳)有限公司 一种视频推荐的方法及相关设备
CN111581437A (zh) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 一种视频检索方法及装置
CN111787356B (zh) * 2020-07-09 2022-09-30 易视腾科技股份有限公司 目标视频片段提取方法和装置
CN112069951A (zh) * 2020-08-25 2020-12-11 北京小米松果电子有限公司 视频片段提取方法、视频片段提取装置及存储介质

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080193016A1 (en) * 2004-02-06 2008-08-14 Agency For Science, Technology And Research Automatic Video Event Detection and Indexing
CN109691124A (zh) * 2016-06-20 2019-04-26 皮克索洛特公司 用于自动生成视频亮点的方法和系统
CN109089133A (zh) * 2018-08-07 2018-12-25 北京市商汤科技开发有限公司 视频处理方法及装置、电子设备和存储介质
CN111787354A (zh) * 2019-04-03 2020-10-16 浙江大学 一种视频生成方法及其装置
CN110191357A (zh) * 2019-06-28 2019-08-30 北京奇艺世纪科技有限公司 视频片段精彩度评估、动态封面生成方法及装置
CN110267119A (zh) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 视频精彩度的评价方法及相关设备
CN111460219A (zh) * 2020-04-01 2020-07-28 百度在线网络技术(北京)有限公司 视频处理方法及装置、短视频平台
CN112738557A (zh) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 视频处理方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4207770A4 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11699463B1 (en) * 2022-04-07 2023-07-11 Lemon Inc. Video processing method, electronic device, and non-transitory computer-readable storage medium
CN115905584A (zh) * 2023-01-09 2023-04-04 共道网络科技有限公司 一种视频拆分方法及装置
CN115905584B (zh) * 2023-01-09 2023-08-11 共道网络科技有限公司 一种视频拆分方法及装置

Also Published As

Publication number Publication date
US20230252785A1 (en) 2023-08-10
EP4207770A4 (en) 2024-03-06
EP4207770A1 (en) 2023-07-05
CN112738557A (zh) 2021-04-30

Similar Documents

Publication Publication Date Title
WO2022134698A1 (zh) 视频处理方法及装置
US11625920B2 (en) Method for labeling performance segment, video playing method, apparatus and system
US10679063B2 (en) Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics
US20170065889A1 (en) Identifying And Extracting Video Game Highlights Based On Audio Analysis
CN109218629B (zh) 视频生成方法、存储介质和装置
Hong et al. Video accessibility enhancement for hearing-impaired users
CN108833973A (zh) 视频特征的提取方法、装置和计算机设备
CN110602516A (zh) 基于视频直播的信息交互方法、装置及电子设备
WO2023197979A1 (zh) 一种数据处理方法、装置、计算机设备及存储介质
US11973993B2 (en) Machine learning based media content annotation
CN111050201A (zh) 数据处理方法、装置、电子设备及存储介质
WO2024046189A1 (zh) 文本生成方法以及装置
WO2022228235A1 (zh) 生成视频语料的方法、装置及相关设备
CN113392273A (zh) 视频播放方法、装置、计算机设备及存储介质
US11581020B1 (en) Facial synchronization utilizing deferred neural rendering
US11582519B1 (en) Person replacement utilizing deferred neural rendering
CN110750996A (zh) 多媒体信息的生成方法、装置及可读存储介质
US20220415360A1 (en) Method and apparatus for generating synopsis video and server
CN113992973B (zh) 视频摘要生成方法、装置、电子设备和存储介质
CN116828246B (zh) 一种数字人直播交互方法、系统、设备及存储介质
CN116229311B (zh) 视频处理方法、装置及存储介质
WO2023142590A1 (zh) 手语视频的生成方法、装置、计算机设备及存储介质
US20220375223A1 (en) Information generation method and apparatus
CN116389849A (zh) 视频生成方法、装置、设备及存储介质
CN114449297A (zh) 一种多媒体信息的处理方法、计算设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908699

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021908699

Country of ref document: EP

Effective date: 20230331

NENP Non-entry into the national phase

Ref country code: DE