WO2023160515A1 - Video processing method, apparatus, device and medium - Google Patents

Video processing method, apparatus, device and medium

Info

Publication number
WO2023160515A1
WO2023160515A1 (PCT/CN2023/077309)
Authority
WO
WIPO (PCT)
Prior art keywords
video, target, recommended, original, audio
Application number
PCT/CN2023/077309
Other languages
English (en)
French (fr)
Inventor
谭伟林
张兴华
曾立峰
黄继豪
池颉
陈辜少雄
Original Assignee
北京字跳网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 北京字跳网络技术有限公司
Priority to US18/574,263 (published as US20240244290A1)
Publication of WO2023160515A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/233: Processing of audio elementary streams (server side)
    • H04N21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs (server side)
    • H04N21/439: Processing of audio elementary streams (client side)
    • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H04N21/4668: Learning process for intelligent management, e.g. learning user preferences for recommending content
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/26: Speech to text systems
    • G10L25/57: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for comparison or discrimination, for processing of video signals

Definitions

  • The present disclosure relates to the technical field of computer applications, and in particular to a video processing method, apparatus, device, and medium.
  • To beautify a captured video, a user selects suitable materials in video editing software and adds them to the video as decoration.
  • However, selecting and adding materials one by one increases time cost and reduces processing efficiency.
  • At present, relevant video editing software offers video templates and one-click video decoration solutions, which insert a captured video or picture into a selected video template and automatically edit a beautified video with the template's effects.
  • A video processing method is provided, comprising: extracting video content features based on analysis of an original video; acquiring at least one recommended material matching the video content features; and performing video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
  • A video processing apparatus is provided, comprising: an extraction module configured to extract video content features based on analysis of the original video; an acquisition module configured to acquire at least one recommended material matching the video content features; and a processing module configured to perform video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
  • An electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor. The processor is configured to read the executable instructions from the memory and execute them to implement the video processing method provided by any embodiment of the present disclosure.
  • a computer-readable storage medium stores a computer program, and the computer program is used to execute the video processing method provided in any embodiment of the present disclosure.
  • a computer program is also provided, the computer program includes instructions, and when the instructions are executed by a processor, the processor is enabled to implement the video processing method provided in any embodiment of the present disclosure.
  • FIG. 1 is a schematic flowchart of a video processing method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic flowchart of another video processing method provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a video processing scenario provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic flowchart of another video processing method provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure.
  • FIG. 10 is a schematic flowchart of another video processing method provided by an embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure.
  • FIG. 12 is a schematic flowchart of another video processing method provided by an embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure.
  • FIG. 14 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure.
  • FIG. 15 is a schematic flowchart of another video processing method provided by an embodiment of the present disclosure.
  • FIG. 16 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure.
  • FIG. 17 is a schematic structural diagram of a video processing device provided by an embodiment of the present disclosure.
  • FIG. 18 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the term "comprise" and its variations are open-ended, i.e., "including but not limited to".
  • the term "based on" means "based at least in part on".
  • the term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments".
  • Relevant definitions of other terms will be given in the description below. It should be noted that concepts such as "first" and "second" mentioned in this disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order of the functions performed by these devices, modules, or units, or the interdependence between them.
  • The embodiments of the present disclosure provide a video processing method in which materials related to effect processing are recommended based on the content of the video, so that a video processed with the recommended materials has a high degree of matching between the processing effect and the video content. Videos with different content show clearly different processing effects, achieving a "one effect per video" (千频千面) result that satisfies the personalized requirements for video processing effects.
  • FIG. 1 is a schematic flow chart of a video processing method provided by an embodiment of the present disclosure.
  • The method can be executed by a video processing apparatus, which can be implemented by software and/or hardware and can generally be integrated into an electronic device. As shown in FIG. 1, the method includes steps 101-103.
  • In step 101, video content features are extracted based on analysis of the original video.
  • In some embodiments, in order to adapt video effect processing to the personalized characteristics of the video content, video content features are extracted based on analysis of the original video.
  • The original video is an uploaded video awaiting effect processing.
  • The video content features include but are not limited to one or more of the audio features of the video, the text features of the video, the image features of the video, the filter features of the video, and the features of the shooting objects contained in the video.
  • In step 102, at least one recommended material matching the video content features is acquired.
  • The recommended material includes but is not limited to one or more of audio materials, sticker materials, animation materials, filter materials, and the like.
  • The manner of acquiring at least one recommended material matching the video content features may vary with the scenario; specific acquisition manners are illustrated in subsequent embodiments and will not be repeated here.
  • In step 103, video processing is performed on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
  • Each material has a corresponding track on which it is added; therefore, a material can be added based on its track.
  • The track of each material is defined by its corresponding field name, type, and description information.
  • Table 1 is an example of a material track.
  • Each recommended material also has corresponding parameters, which facilitate personalized adjustments to the display effect when the material is added; for example, in subsequent embodiments, the material size is adjusted after the material area is determined.
  • The parameters of the text_template material shown in Table 2 may include a scaling factor, a rotation angle, and the like; a sketch of this model follows.
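To make the track-and-parameter model concrete, the following is a minimal Python sketch. The field names are assumptions for illustration, since the contents of Table 1 and Table 2 are not reproduced in this text.

```python
from dataclasses import dataclass, field

@dataclass
class MaterialParams:
    """Per-material display parameters (hypothetical field names)."""
    scale: float = 1.0      # scaling factor applied when the material is rendered
    rotation: float = 0.0   # rotation angle in degrees

@dataclass
class MaterialTrack:
    """One track per added material (hypothetical field names)."""
    material_id: str        # identifier of the recommended material
    material_type: str      # e.g. "sticker", "text_template", "sound_effect"
    start_ms: int           # material addition time on the timeline, in milliseconds
    duration_ms: int        # how long the material stays visible or audible
    params: MaterialParams = field(default_factory=MaterialParams)

# Each recommended material is added on its own track:
track = MaterialTrack("m_applause_01", "sticker", start_ms=3200, duration_ms=1500)
```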
  • In the video processing method of the embodiments of the present disclosure, after the video content features of the original video are extracted, at least one recommended material matching the video content features is acquired, and the target video is then obtained by adding the recommended material to the original video.
  • Materials are thus added in a way adapted to the video content, which improves the matching degree between the video content and the video materials and achieves personalized effect processing of the video.
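As an orientation aid, a minimal sketch of the three-step flow is shown below; the helper functions are hypothetical stubs standing in for steps 101-103, not an implementation from the disclosure.

```python
def extract_video_content_features(video_path: str) -> list[str]:
    """Step 101 (stub): audio/text/image/filter/subject features."""
    return ["haha", "applause"]

def recommend_materials(feature: str) -> list[str]:
    """Step 102 (stub): look up materials matching one feature."""
    return {"haha": ["laugh_sticker"], "applause": ["applause_sticker"]}.get(feature, [])

def add_materials_and_export(video_path: str, materials: list[str]) -> str:
    """Step 103 (stub): add each material on its track and export."""
    return video_path.replace(".mp4", "_target.mp4")

def process_video(original_video_path: str) -> str:
    features = extract_video_content_features(original_video_path)      # step 101
    materials = [m for f in features for m in recommend_materials(f)]   # step 102
    return add_materials_and_export(original_video_path, materials)     # step 103
```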
  • In one embodiment of the present disclosure, in order to enhance the atmosphere of the video, extraction is performed based on the text content of the original video.
  • As shown in FIG. 2, extracting video content features includes steps 201-202.
  • In step 201, speech recognition processing is performed on the target audio data of the original video to obtain corresponding text content.
  • The preset video editing application can also identify each audio track contained in the original video, where each audio track corresponds to one sound source. For example, original video A contains the speaking voices of users a and b; in this embodiment, the audio track corresponding to the voice of a and the audio track corresponding to the voice of b can be identified.
  • All audio tracks displayed in the video editing application for the video file of the original video are obtained. It is easy to understand that the sound source corresponding to each audio track has an occurrence time; therefore, in some embodiments, the audio tracks are also displayed based on the time axis.
  • For example, the video file of the original video is split into a video track (video) and two audio tracks (audio1 and audio2), and the corresponding audio tracks can be displayed in the video editing application.
  • all audio tracks will be merged based on the time axis to generate total audio data.
  • audio1 and audio2 are merged based on the time axis to generate total audio data complex-audio, which includes all audio data in the original video.
  • The total audio data is also related to time. If the first duration of the total audio data is longer than the second duration of the original video, some audio data has no corresponding video content; in order to keep the project length consistent, the total audio data is trimmed from the first duration to obtain the target audio data, where the duration of the target audio data matches the second duration.
  • the audio file corresponding to the original video may include background sound in addition to the audio data of the interaction between the shooting objects.
  • background sound includes the sound of music played in the environment, or the sound of vehicles passing by on the road in the environment.
  • Such background sound is usually irrelevant to the video content. Therefore, in order to facilitate the subsequent extraction of video content features and avoid interference from background sounds (for example, when extracting the text features of the video, text content might be mistakenly recognized from the background sound), in some embodiments the background sound in the original video can also be removed.
  • The audio identifier of each audio track is detected; that is, sound features such as the sound spectrum of the audio corresponding to each audio track are identified and matched against the preset sound features corresponding to each audio identifier, and the audio identifier of each audio track is determined based on the matching result. If a target audio track whose identifier represents background music is detected, all audio tracks other than the target audio track are merged based on the time axis to generate the total audio data.
  • In other embodiments, the target audio data can be obtained by merging all audio tracks corresponding to the original video, or by merging only the audio tracks that meet a certain type of preset sound features, and so on, as the scenario requires; this is not limited here. A sketch of this merging step follows.
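A minimal sketch of assembling the target audio data with pydub is given below, assuming the per-track offsets and any background-music track identifiers have already been determined; pydub and ffmpeg are assumed to be available.

```python
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

def build_target_audio(tracks, video_duration_ms, background_ids=frozenset()):
    """Merge non-background audio tracks on a shared time axis and trim the
    result to the video duration, so no audio outlives the video frames.
    tracks: list of (track_id, file_path, offset_ms) tuples."""
    total = AudioSegment.silent(duration=video_duration_ms)
    for track_id, path, offset_ms in tracks:
        if track_id in background_ids:   # skip tracks identified as background music
            continue
        total = total.overlay(AudioSegment.from_file(path), position=offset_ms)
    return total[:video_duration_ms]     # target audio data, trimmed to video length
```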
  • Speech recognition processing is performed on the target audio data of the original video to obtain the corresponding text content; the text content can be recognized through speech recognition technology.
  • In step 202, semantic analysis is performed on the text content to obtain a first keyword.
  • The first keyword allows recommended materials to be matched for the video in the content dimension.
  • For example, the first keyword can be an emotional keyword such as "haha, so funny", so that materials rendering the emotion can be recommended for the video based on the first keyword, such as laughing sticker materials or fireworks animation materials.
  • Alternatively, the first keyword can be a term from a professional field, such as "basin"; based on the first keyword, professional texture materials in the corresponding field can be recommended for the video, making terms in that professional field easier to understand.
  • In some embodiments, semantic analysis is performed on the text content, the semantic result of the analysis is matched against preset keyword semantics, and the successfully matched first keyword is determined.
  • For example, the text content of the target audio data can be recognized sentence by sentence using automatic speech recognition (ASR) technology, and the semantics of the corresponding text sentences are then understood through natural language processing (NLP) technology to obtain the corresponding first keyword, as sketched below.
  • the relevance between the recommended material and the video content can be ensured in the content dimension, so as to better render the corresponding video content.
  • For example, the corresponding first keyword can be displayed in the form of subtitles, and an "applause" sticker material can be recommended, so that in the processed video an "applause" sticker is displayed when the "haha" audio occurs. This further renders the happy atmosphere; the added recommended material is more consistent with the video content, and its addition does not appear abrupt.
  • In another embodiment of the present disclosure, as shown in FIG. 7, extracting video content features includes steps 701-702.
  • In step 701, sound detection processing is performed on the target audio data of the original video to obtain corresponding spectrum data.
  • Even audio data from which no text can be recognized may still reflect the content characteristics of the video. For example, if the audio data contains "applause", "explosion", and the like, recommended materials can be added based on this audio data, and the atmosphere of the video can be further enhanced together with the corresponding audio.
  • In step 702, the spectrum data is analyzed and processed to obtain a second keyword.
  • The spectrum data is analyzed and processed to obtain the second keyword, where recommended materials corresponding to the spectrum data can be acquired based on the second keyword.
  • In some embodiments, the spectrum data may be input into a deep learning model trained in advance on a large amount of sample data, and the second keyword output by the deep learning model is obtained.
  • In other embodiments, the acquired spectrum data may be matched against the preset spectrum data of each keyword, and the second keyword corresponding to the spectrum data is determined based on the matching degree. For example, if the matching degree between the obtained spectrum data and the spectrum data corresponding to the keyword "explosion" is greater than a preset threshold, it is determined that the second keyword corresponding to the target audio data is "explosion". A sketch of this matching follows.
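A sketch of the spectrum-matching approach is shown below; cosine similarity and the 0.8 default threshold are assumptions, since the text only requires a matching degree above a preset threshold.

```python
import numpy as np

def second_keyword(samples: np.ndarray, reference_spectra: dict,
                   threshold: float = 0.8):
    """Match the spectrum of the target audio against preset per-keyword spectra,
    e.g. reference_spectra = {"explosion": ref}; returns the best keyword or None."""
    spectrum = np.abs(np.fft.rfft(samples))
    spectrum = spectrum / (np.linalg.norm(spectrum) + 1e-9)
    best_kw, best_score = None, threshold
    for keyword, ref in reference_spectra.items():
        ref = ref / (np.linalg.norm(ref) + 1e-9)
        n = min(len(spectrum), len(ref))
        score = float(np.dot(spectrum[:n], ref[:n]))  # cosine-like matching degree
        if score > best_score:
            best_kw, best_score = keyword, score
    return best_kw
```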
  • Recommendation can also be performed jointly based on the first keyword and the second keyword; the second keyword can be identified using audio event detection (AED) technology.
  • For example, sound detection processing is performed on the target audio data of the original video to obtain the corresponding spectrum data; if the second keyword obtained from the spectrum data is "explosion", the matched recommended material is an "explosion" sticker, so that the corresponding "explosion" sticker is displayed on the corresponding video frames to further render the video content containing the explosion audio.
  • The video processing method of the embodiments of the present disclosure can extract, as video content features, any features that reflect the video content. The extracted video content features are strongly correlated with the video content, which ensures the relevance between the materials recommended based on those features and the video content, and provides technical support for personalized video processing effects.
  • After the video content features are determined, recommended materials matching them are further determined; the choice of recommended material determines the processing effect of the video.
  • The determination of the recommended material is described below with reference to specific examples.
  • In some embodiments, as shown in FIG. 10, acquiring at least one recommended material matching the video content features includes steps 1001-1002.
  • In step 1001, video style features are determined according to the video images of the original video.
  • Different videos have different styles; therefore, adding the same recommended material to videos of different styles also affects the matching degree with the video content.
  • For example, the first keyword obtained by semantic analysis of the target audio data of original video S1 is "haha", and the first keyword obtained by semantic analysis of the target audio data of original video S2 is also "haha"; however, the sounding object of "haha" in S1 is an anime character, while the sounding object of "haha" in S2 is a real person. If the same recommended material were applied to both styles, the processing effect of the videos would obviously be affected.
  • the video style feature is determined according to the video image of the original video.
  • the video style feature includes the image feature of the video content, the theme style feature of the video content, the feature of the shooting object contained in the video, etc., which is not limited here.
  • In a possible implementation, a convolutional network model is trained in advance on a large amount of sample data; the video images are input into the convolutional network model, and the video style features output by the model are obtained.
  • In some embodiments, as shown in FIG. 12, determining the video style features according to the video images of the original video includes steps 1201-1202.
  • In step 1201, image recognition processing is performed on the video images of the original video, and at least one shooting object is determined according to the recognition result.
  • A shooting object may be a subject contained in a video image, including but not limited to people, animals, furniture, tableware, and the like.
  • In step 1202, a weighted calculation is performed on the at least one shooting object according to preset object weights, and the calculation result is matched against preset style classes to determine the video style features corresponding to the original video.
  • In some embodiments, the object type of each shooting object can be identified, and a preset database is queried to obtain the object weight of each shooting object, where the database contains each object type and its corresponding object weight, obtained by training on a large amount of sample data. A weighted calculation is then performed on the at least one shooting object according to the preset object weights, the calculation result is matched against the preset style classes, and the video style features corresponding to the successfully matched style class are determined.
  • In some embodiments, multiple video frames extracted from the original video may be used as the video images of the original video, so as to further improve style recognition efficiency.
  • For example, video frames can be extracted from the original video at a preset time interval (such as 1 second), or video segments of a preset length can be extracted at intervals, and the multiple video frames contained in those segments are used as the video images of the original video. A sketch of this sampling step follows.
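A sketch of interval-based frame sampling with OpenCV, assuming a fixed 1-second interval:

```python
import cv2  # pip install opencv-python

def sample_frames(video_path: str, interval_s: float = 1.0) -> list:
    """Extract one frame per preset interval to use as the video images for
    style recognition, instead of scanning every frame of the original video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS is unreported
    step = max(1, round(fps * interval_s))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```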
  • In some embodiments, the video images can be input into a pre-trained intelligent image recognition model, and at least one shooting object is determined according to the recognition result.
  • For example, the shooting objects in a frame include a human face, an object, and an environment; the classification features t1, t2, and t3 corresponding to each shooting object are identified, with corresponding object weights z1, z2, and z3 respectively. The value of t1·z1 + t2·z2 + t3·z3 is then calculated as the calculation result, and the calculation result is matched against the preset style classes to determine the video style features corresponding to the original video, as sketched below.
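The weighted style matching could look like the sketch below; the specific weights, score ranges, and style names are invented for illustration, as the disclosure only specifies the weighted sum and the matching against preset style classes.

```python
# Hypothetical preset object weights (z_i) and style score ranges.
OBJECT_WEIGHTS = {"face": 0.5, "object": 0.3, "environment": 0.2}
STYLE_CLASSES = [(0.0, 0.4, "live-action"), (0.4, 1.0, "girly anime")]

def classify_style(classification_features: dict) -> str:
    """classification_features maps object type -> feature score t_i;
    the result is the weighted sum t1*z1 + t2*z2 + t3*z3 matched to a class."""
    score = sum(t * OBJECT_WEIGHTS.get(obj, 0.0)
                for obj, t in classification_features.items())
    for low, high, style in STYLE_CLASSES:
        if low <= score < high:
            return style
    return "default"

print(classify_style({"face": 0.9, "object": 0.2, "environment": 0.1}))  # girly anime
```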
  • In step 1002, at least one recommended material matching the video style features and the video content features is acquired.
  • At least one recommended material matching both the video style features and the video content features is acquired, so that the recommended material matches the video content in both dimensions, further improving the video processing effect.
  • In some embodiments, a material library matching the video style features can be obtained first, and at least one recommended material matching the video content features is then acquired from that library, ensuring that the obtained recommended materials both match the content of the video and are consistent with its style.
  • For example, if the video style feature is "girly anime", a material library composed of various girly-style materials matching "girly anime" is obtained first, and materials in that library are then matched based on the video content features, ensuring that the recommended materials obtained are all girly in style.
  • In some embodiments, as shown in FIG. 15, acquiring at least one recommended material matching the video content features includes steps 1501-1504.
  • In step 1501, the playing time of the video frame corresponding to a video content feature is determined in the original video, where the video content feature is generated according to the video content of that video frame.
  • Since a video content feature is generated from the video content of a video frame, the playing time of the video frame corresponding to the feature is determined in the original video, so that according to the playing time, the corresponding material is recommended and added only for the video frames containing the corresponding video content feature.
  • In step 1502, a timestamp is marked for the video content feature according to the playing time of the video frame.
  • The video content feature is marked with a timestamp according to the playing time of the video frame, so as to facilitate matching recommended materials in the time dimension.
  • In step 1503, for the same timestamp, if it is determined that there are multiple corresponding video content features, the multiple video content features are combined into a video feature set, and at least one recommended material matching the video feature set is acquired.
  • That is, when multiple video content features share one timestamp, they are combined into a video feature set, and at least one recommended material matching the video feature set is acquired.
  • In some embodiments, multiple video content features can be combined to generate video content feature combinations (video feature sets); a preset correspondence is queried to determine whether an enhanced material corresponds to each combination. If no enhanced material is matched, the combination is split into individual content features, each matched with recommended materials; if an enhanced material is matched, the enhanced material is used as the corresponding recommended material. A sketch of this lookup follows.
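A minimal sketch of the feature-set lookup; the correspondence tables are hypothetical placeholders for the preset correspondences.

```python
# Hypothetical preset correspondences: co-occurring keywords -> enhanced material,
# plus per-feature fallbacks.
ENHANCED_MATERIALS = {frozenset({"haha", "applause"}): "transition_effect_01"}
PER_FEATURE_MATERIALS = {"haha": ["laugh_sticker"], "applause": ["applause_sticker"]}

def materials_for_timestamp(features: set) -> list:
    """If the combination of features sharing a timestamp has an enhanced
    material, use it; otherwise fall back to matching each feature alone."""
    enhanced = ENHANCED_MATERIALS.get(frozenset(features))
    if enhanced is not None:
        return [enhanced]
    return [m for f in features for m in PER_FEATURE_MATERIALS.get(f, [])]

print(materials_for_timestamp({"haha", "applause"}))  # ['transition_effect_01']
```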
  • The recommended material for a video feature set is not necessarily a simple combination of the recommended materials corresponding to the individual video content features; when the multiple video content features are correlated, a different recommended material with a stronger sense of atmosphere may be generated to further strengthen the video atmosphere.
  • For example, if the first keyword corresponding to video content feature 1 is "haha" and the second keyword corresponding to video content feature 2 is "applause", the recommended material jointly determined by the two keywords may be a transition effect material, rather than the aforementioned sticker materials corresponding to "haha" and "applause" separately.
  • In step 1504, for the same timestamp, if it is determined that there is only one corresponding video content feature, at least one recommended material matching that single video content feature is acquired.
  • That is, if there is a single video content feature at a timestamp, at least one recommended material is matched for it separately.
  • In some embodiments, the material addition time of the recommended material matching a video content feature is set according to the timestamp of that feature, so that the addition time coincides with the display time of the video frame corresponding to the video content feature.
  • The original video is then clipped according to the material addition times of the recommended materials to generate the target video. In this way, a recommended material is added only when a video frame containing the corresponding video content feature is played, avoiding inconsistency between the added material and the video content. A sketch of this clipping step follows.
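A sketch of cutting the original video at the material addition times (such as t1 and t2) with moviepy's 1.x-style API; the function shape is an assumption, not the disclosure's implementation.

```python
from moviepy.editor import VideoFileClip  # pip install moviepy (1.x API)

def clips_at(video_path: str, addition_times_s: list) -> list:
    """Cut the original video at each material addition time so that each
    recommended material can be attached to the segment where its video
    content feature actually occurs (e.g. clip1 at t1, clip2 at t2)."""
    video = VideoFileClip(video_path)
    cuts = [0.0, *sorted(addition_times_s), video.duration]
    return [video.subclip(a, b) for a, b in zip(cuts, cuts[1:])]
```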
  • In practice, some materials, such as sound effect materials and transition effect materials, have no size information, while others do.
  • Materials with size information include sticker materials and text materials. To prevent materials with size information from blocking important display content when they are added, for example to avoid covering a face in the video frame, the area in which these materials are added must be determined.
  • If the material type of the recommended material meets the preset target type, that is, if the recommended material has the attribute of size information to be added, the target video frame corresponding to the material addition time of the recommended material is obtained from the original video, and image recognition is performed on the target video frame to obtain the subject area of the shooting object. The subject area can be any position information reflecting the location of the shooting object; for example, it may be a center coordinate point, or it may be a position range, or the like.
  • For example, the shooting object may be the sounding object corresponding to the "haha" audio.
  • Then, according to the subject area of the shooting object, the material area in which the recommended material is added on the target video frame is determined.
  • In some embodiments, the material type label of the recommended material can be determined, and a preset correspondence is queried according to the material type label to determine the regional features of the material area (such as a background area in the image); the region on the target video frame that matches those regional features is determined as the material area.
  • In other embodiments, the object type label of the shooting object can be determined, and a preset correspondence is queried according to the object type label to determine the regional features of the material area (for example, if the shooting object is of the face type, the corresponding regional feature is the area above the head); the region on the target video frame that matches those regional features is determined as the material area.
  • Further, the original video is clipped to generate the target video, and the corresponding material is added to the material area of the video frame corresponding to the material addition time.
  • The material area may be the coordinates of the center point at which the material is added in the corresponding video frame, or the coordinate range in which the material is added. A sketch of one placement strategy follows.
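One possible placement strategy is sketched below: among a few candidate corner regions, pick the one whose overlap with the subject area is smallest, so a sticker or text material never covers, for example, a face. The candidate set and the overlap criterion are assumptions.

```python
def pick_material_region(frame_w: int, frame_h: int, subject_box: tuple,
                         mat_w: int, mat_h: int) -> tuple:
    """subject_box = (x, y, w, h) of the recognized shooting object;
    returns the top-left (x, y) at which to add the sized material."""
    candidates = [(0, 0), (frame_w - mat_w, 0),
                  (0, frame_h - mat_h), (frame_w - mat_w, frame_h - mat_h)]
    sx, sy, sw, sh = subject_box

    def overlap(px: int, py: int) -> int:
        ox = max(0, min(px + mat_w, sx + sw) - max(px, sx))
        oy = max(0, min(py + mat_h, sy + sh) - max(py, sy))
        return ox * oy  # area of intersection with the subject area

    return min(candidates, key=lambda p: overlap(*p))
```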
  • In some embodiments, since the server that determines the subject area and the like may not be the same server that performs style feature recognition, the style feature recognition may be performed on a local server to improve recognition efficiency, while the identification of the material addition time and material area of the recommended material may be performed on a remote server to reduce the corresponding computing load.
  • For example, the material addition times of the recommended materials matching video content features F1 and F2 are set to t1 and t2 respectively.
  • The original video is edited according to the material addition times of the recommended materials,
  • and the video segment clip1 corresponding to F1 and the video segment clip2 corresponding to F2 are obtained.
  • The corresponding server then optimizes the added materials: it performs image recognition on the target video frames to obtain the subject areas of the shooting objects, and determines, according to those subject areas, the material areas in which the recommended materials are added on the target video frames.
  • In summary, the video processing method of the embodiments of the present disclosure, after determining the video content features, determines at least one recommended material matching the multi-dimensional video content features, and also ensures the correspondence between materials and video frames in both position and time, further ensuring that the processing effect of the video satisfies the personalized characteristics of the video content.
  • FIG. 17 is a schematic structural diagram of a video processing device provided by an embodiment of the present disclosure.
  • The device can be implemented by software and/or hardware, and generally can be integrated into an electronic device. As shown in FIG. 17, the device includes: an extraction module 1710, an acquisition module 1720, and a processing module 1730.
  • the extraction module 1710 is used for extracting video content features based on the analysis of the original video.
  • the acquiring module 1720 is configured to acquire at least one recommended material matching the feature of the video content.
  • the processing module 1730 is configured to perform video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
  • the video processing device provided by the embodiment of the present disclosure can execute the video processing method provided by any embodiment of the present disclosure, and has corresponding functional modules and beneficial effects for executing the method, which will not be repeated here.
  • the present disclosure further proposes a computer program product, including computer programs/instructions, when the computer program/instructions are executed by a processor, the video processing method in any of the above embodiments is implemented.
  • FIG. 18 shows a schematic structural diagram of an electronic device 1800 suitable for implementing an embodiment of the present disclosure.
  • The electronic device 1800 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (such as car navigation terminals), as well as stationary terminals such as digital TVs and desktop computers.
  • The electronic device shown in FIG. 18 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • The electronic device 1800 may include a processor (such as a central processing unit or a graphics processing unit) 1801, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1802 or a program loaded from the memory 1808 into a random access memory (RAM) 1803.
  • In the RAM 1803, various programs and data necessary for the operation of the electronic device 1800 are also stored.
  • the processor 1801, ROM 1802, and RAM 1803 are connected to each other through a bus 1804.
  • An input/output (I/O) interface 1805 is also connected to the bus 1804.
  • The following devices can be connected to the I/O interface 1805: an input device 1806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 1807 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; a memory 1808 including, for example, a magnetic tape, hard disk, etc.; and a communication device 1809.
  • the communication means 1809 may allow the electronic device 1800 to perform wireless or wired communication with other devices to exchange data. While FIG. 18 shows electronic device 1800 having various means, it is to be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 1809, or from memory 1808, or from ROM 1802.
  • When the computer program is executed by the processor 1801, the above-mentioned functions defined in the video processing method of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
  • A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • The client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium.
  • Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • The above-mentioned computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extract the video content features of the original video, acquire at least one recommended material matching the video content features, and then add the recommended material to the original video to obtain the target video.
  • Materials are thus added in a way adapted to the video content, which improves the matching degree between the video content and the video materials and achieves personalized effect processing of the video.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages, such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more logical functions for implementing specified executable instructions.
  • The functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware; the name of a unit does not, under certain circumstances, constitute a limitation of the unit itself.
  • The functions described above may be performed, at least in part, by one or more hardware logic components; for example, without limitation, exemplary types of hardware logic components that can be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • the present disclosure provides a video processing method, including:
  • the extraction of video content features based on the analysis of the original video includes:
  • Semantic analysis is performed on the text content to obtain the first keyword.
  • the extraction of video content features based on the analysis of the original video includes:
  • the method for obtaining the target audio data includes:
  • the merging of all the audio tracks based on the time axis to generate the total audio data includes:
  • all audio tracks other than the target audio track are combined based on the time axis to generate total audio data.
  • the obtaining at least one recommended material matching the characteristics of the video content includes:
  • At least one recommended material matching the video style feature and the video content feature is acquired.
  • the determining the video style feature according to the video image of the original video includes:
  • Weighting calculation is performed on the at least one object to be photographed according to a preset object weight, and the calculation result is matched with a preset style classification to determine a video style feature corresponding to the original video.
  • the acquiring at least one recommended material matching the characteristics of the video content includes:
  • At least one recommended material matching the one video content feature is acquired.
  • performing video processing on the original video according to the recommended material to generate a target video includes:
  • according to the timestamp of the video content feature, the material addition time of the recommended material matching the video content feature is set;
  • the target video is generated by clipping the original video according to the material addition time of the recommended material.
  • the clipping of the original video according to the material addition time of the recommended material to generate the target video includes:
  • according to the subject area of the shooting object, the material area in which the recommended material is added on the target video frame is determined;
  • the original video is clipped to generate a target video.
  • the detecting the audio identifier of each audio track includes:
  • an audio identifier is determined for each audio track based on the matching results.
  • combining the plurality of video content features into a video feature set and obtaining at least one recommended material matching the video feature set includes:
  • according to the video feature set, a preset correspondence is queried to determine whether an enhanced material corresponds to the video feature set;
  • if no enhanced material is matched, the video feature set is split into individual content features, each matched with recommended materials;
  • if an enhanced material is matched, the enhanced material is used as the corresponding recommended material.
  • the present disclosure provides a video processing device, including:
  • the extraction module is used to extract video content features based on the analysis of the original video
  • An acquisition module configured to acquire at least one recommended material matching the characteristics of the video content
  • a processing module configured to perform video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
  • the extraction module is used for:
  • Semantic analysis is performed on the text content to obtain the first keyword.
  • the extraction module is used for:
  • the extraction module is configured to: obtain all audio tracks displayed in the video clip application of the video file of the original video;
  • the extraction module is used for:
  • all audio tracks other than the target audio track are combined based on the time axis to generate total audio data.
  • the acquisition module is specifically configured to:
  • At least one recommended material matching the video style feature and the video content feature is acquired.
  • the acquisition module is specifically configured to: perform image recognition processing on the video image of the original video, and determine at least one shooting object according to the recognition result;
  • the acquisition module is used for:
  • At least one recommended material matching the one video content feature is acquired.
  • the acquisition module is used for:
  • according to the timestamp of the video content feature, the material addition time of the recommended material matching the video content feature is set;
  • the target video is generated by clipping the original video according to the material addition time of the recommended material.
  • the acquisition module is used for:
  • according to the subject area of the shooting object, the material area in which the recommended material is added on the target video frame is determined;
  • the original video is clipped to generate a target video.
  • the extraction module is used for:
  • an audio identifier is determined for each audio track based on the matching results.
  • the acquisition module is used for:
  • according to the video feature set, a preset correspondence is queried to determine whether an enhanced material corresponds to the video feature set;
  • if no enhanced material is matched, the video feature set is split into individual content features, each matched with recommended materials;
  • if an enhanced material is matched, the enhanced material is used as the corresponding recommended material.
  • the present disclosure provides an electronic device, including:
  • the processor is configured to read the executable instructions from the memory, and execute the instructions to implement any video processing method provided in the present disclosure.
  • the present disclosure provides a computer-readable storage medium, the storage medium stores a computer program, and the computer program is used to execute any one of the video processing methods provided in the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the present disclosure relate to a video processing method, apparatus, device, and medium. The method includes: extracting video content features based on analysis of an original video (101); acquiring at least one recommended material matching the video content features (102); and performing video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video (103).

Description

Video processing method, apparatus, device and medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on and claims priority to CN application No. 202210178794.5 filed on February 25, 2022, the disclosure of which is incorporated herein in its entirety.
TECHNICAL FIELD
The present disclosure relates to the field of computer application technologies, and in particular to a video processing method, apparatus, device and medium.
BACKGROUND
To apply beautification and other effect processing to a captured video, a user selects suitable materials in video clipping software and adds them to the video as decoration. Selecting and adding materials one by one, however, increases the time cost and reduces processing efficiency.
At present, related video clipping software offers video templates or one-click video decoration schemes, in which captured videos or pictures are fitted into a selected video template and a beautified video carrying the template effects is clipped automatically.
SUMMARY
According to some embodiments of the present disclosure, a video processing method is provided, the method including: extracting video content features based on an analysis of an original video; acquiring at least one recommended material matching the video content features; and performing video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
According to other embodiments of the present disclosure, a video processing apparatus is further provided, the apparatus including: an extraction module configured to extract video content features based on an analysis of an original video; an acquisition module configured to acquire at least one recommended material matching the video content features; and a processing module configured to perform video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
According to still other embodiments of the present disclosure, an electronic device is further provided, the electronic device including: a processor; a memory for storing instructions executable by the processor; the processor being configured to read the executable instructions from the memory and execute the instructions to implement the video processing method provided in any embodiment of the present disclosure.
According to yet other embodiments of the present disclosure, a computer-readable storage medium is further provided, the storage medium storing a computer program, the computer program being used to execute the video processing method provided in any embodiment of the present disclosure.
According to further embodiments of the present disclosure, a computer program is further provided, the computer program including instructions which, when executed by a processor, cause the processor to implement the video processing method provided in any embodiment of the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
FIG. 1 is a schematic flowchart of a video processing method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of another video processing method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a video processing scenario provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure;
FIG. 7 is a schematic flowchart of another video processing method provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure;
FIG. 10 is a schematic flowchart of another video processing method provided by an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure;
FIG. 12 is a schematic flowchart of another video processing method provided by an embodiment of the present disclosure;
FIG. 13 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure;
FIG. 14 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure;
FIG. 15 is a schematic flowchart of another video processing method provided by an embodiment of the present disclosure;
FIG. 16 is a schematic diagram of another video processing scenario provided by an embodiment of the present disclosure;
FIG. 17 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present disclosure;
FIG. 18 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
DETAILED DESCRIPTION
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as being limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. Furthermore, the method embodiments may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below. It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules or units, and are not used to limit the order or interdependence of the functions performed by these apparatuses, modules or units. It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more". The names of messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
The inventors found that, since a pre-made video template is fixed, a template that the user spends time selecting sometimes cannot intelligently adapt to the original video imported by the user and cannot be applied directly. Moreover, the number of template effects is limited, so multiple videos often share the same video template, and adaptive beautification and other effect processing cannot be performed according to the specific video content.
To solve the above-mentioned problem in the prior art that, when a video is beautified, the beautification effect does not match the video content well, embodiments of the present disclosure provide a video processing method in which effect-related materials are recommended based on the content of the video, so that the processing effect of a video processed according to the recommended materials matches the video content closely, videos with different content show clearly different processing effects ("a thousand looks for a thousand videos"), and the personalized requirements on video processing effects are met.
The method of the present disclosure is described below with reference to specific embodiments.
FIG. 1 is a schematic flowchart of a video processing method provided by an embodiment of the present disclosure. The method may be performed by a video processing apparatus, which may be implemented in software and/or hardware and may generally be integrated in an electronic device. As shown in FIG. 1, the method includes steps 101 to 103.
In step 101, video content features are extracted based on an analysis of an original video.
In some embodiments, to adapt the effect processing to the personalized characteristics of the video content, video content features are extracted based on an analysis of the original video. The original video is an uploaded video to be processed, and the video content features include, but are not limited to, one or more of audio features of the video, text features of the video, image features of the video, filter features of the video, features of shooting objects contained in the video, and the like.
In step 102, at least one recommended material matching the video content features is acquired.
In some embodiments, at least one recommended material matching the video content features is acquired, where the recommended material includes, but is not limited to, one or more of audio materials, sticker materials, animation materials, filter materials and the like. In practical applications, the way of acquiring the at least one recommended material matching the video content features may vary with the scenario; specific acquisition methods are illustrated in subsequent embodiments and are not repeated here.
In step 103, video processing is performed on the original video according to the recommended material to generate a target video, where the target video is a video generated by adding the recommended material to the original video.
In this embodiment, video processing is performed on the original video according to the recommended material to generate a target video; that is, the target video is generated by adding the corresponding recommended material to the original video. In actual execution, each material has a corresponding addition track, so the corresponding material can be added based on its track. For example, as shown in Table 1 below, the track of each material is defined by its corresponding field name, type and description information; Table 1 is an example of a material track.
Table 1

In addition, each recommended material also contains corresponding parameters, to further facilitate personalized adjustments to the display effect when the material is added, for example the adjustment of the material size after the material area is determined in subsequent embodiments. For instance, the parameters of the text_template material shown in Table 2 below may include a scaling factor, a rotation angle and the like.
Table 2
In the actual process of adding materials, different addition methods may be used according to the material type. Different addition methods may differ in addition time, addition position, addition frequency and the like, so as to better echo the recommended material with the video content and present a strong correlation between the recommended material and the displayed video content. Specific addition methods are illustrated in subsequent embodiments and are not repeated here.
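By way of a non-limiting illustration, a material track and its parameters can be represented as a simple record. The Python sketch below is purely illustrative: the concrete contents of Table 1 and Table 2 are not reproduced in this text, so every field name here is an assumption rather than the actual schema.

```python
from dataclasses import dataclass, field

# Illustrative only: the concrete fields of Table 1 and Table 2 are not
# reproduced in this text, so all names below are assumptions.
@dataclass
class MaterialTrack:
    material_id: str       # identifier of the recommended material
    track_type: str        # e.g. "sticker", "audio", "effect", "text_template"
    start_ms: int          # material addition time on the timeline
    duration_ms: int       # how long the material stays visible/audible
    params: dict = field(default_factory=dict)  # per-material display parameters

# A text_template material whose parameters include a scaling factor and a
# rotation angle, following the description of Table 2.
track = MaterialTrack("sticker_clap", "text_template", start_ms=3200,
                      duration_ms=1500, params={"scale": 1.2, "rotation": 15.0})
```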
In summary, in the video processing method of the embodiments of the present disclosure, after the video content features of the original video are extracted, at least one recommended material matching the video content features is acquired, and the target video is then obtained by adding the recommended material to the original video. In this way, video materials are added in a manner adapted to the video content, the matching degree between the video content and the video materials is improved, and personalized effect processing of the video is realized.
As mentioned above, in actual execution the video content features differ in different application scenarios, as illustrated below.
In one embodiment of the present disclosure, to enhance the atmosphere of the video, extraction is performed based on the text content of the original video.
In this embodiment, as shown in FIG. 2, extracting video content features based on the analysis of the original video includes steps 201 to 202.
In step 201, speech recognition is performed on the target audio data of the original video to obtain the corresponding text content.
The target audio data of the original video is acquired. In some embodiments, in addition to identifying the video track of the original video, a preset video clipping application can also identify each audio track contained in the original video, where each audio track corresponds to one audio source. For example, if original video A contains the speaking voices of user a and user b, the audio track corresponding to a's voice and the audio track corresponding to b's voice can be identified in this embodiment.
In some embodiments, to facilitate processing each audio track, all audio tracks of the video file of the original video displayed in the video clipping application are acquired. It is easy to understand that the audio source corresponding to each audio track has an occurrence time; therefore, in some embodiments, the corresponding audio tracks are also displayed based on the time axis.
For example, as shown in FIG. 3, if the video track "video" and two audio tracks "audio1" and "audio2" are split from the file of the original video, the corresponding audio tracks can be displayed in the video clipping application.
In some embodiments, still referring to FIG. 3, all audio tracks are merged based on the time axis to generate total audio data. For example, audio1 and audio2 are merged based on the time axis to generate total audio data "complex-audio", which contains all the audio data in the original video.
Of course, as mentioned above, the total audio data is also time-related. Therefore, if the first duration of the total audio data is longer than the second duration of the original video, then, to keep the project length consistent and avoid audio data without corresponding video content, the first duration of the total audio data is trimmed to obtain the target audio data, where the duration of the target audio data is consistent with the second duration.
For example, still referring to FIG. 3, if complex-audio is longer than video, the part of complex-audio that exceeds the duration of video on the time axis is trimmed off to obtain target-audio, so that target-audio is aligned with video on the time axis, facilitating subsequent video processing.
Of course, in actual execution, in addition to audio data of the interaction between shooting objects, the audio file corresponding to the original video may also include background sound, for example music playing in the environment, or the sound of vehicles passing on the road. Such background sound is usually irrelevant to the video content. Therefore, to improve the precision of the subsequent extraction of video content features and avoid interference from the background sound (for example, text content in the background sound might be recognized when extracting the text features of the video), in some embodiments the background sound in the original video can also be removed.
In some embodiments, the audio identity of each audio track is detected; that is, based on the recognition of sound features such as the sound spectrum of the audio corresponding to each audio track, the sound features of each audio track are matched against the preset sound features corresponding to each audio identity, and the audio identity of each audio track is determined based on the matching results. If a target audio track carrying a background-music identity is detected, all audio tracks other than the target audio track are merged based on the time axis to generate the total audio data.
For example, as shown in FIG. 4, continuing with the scenario shown in FIG. 3, if the audio identities of the audio tracks of the original video are recognized as audio1, audio2 and bg-audio, then, since bg-audio matches the background-music identity, only the audio tracks corresponding to audio1 and audio2 are merged when generating the total audio data.
Of course, in actual execution, the target audio data may also be obtained by merging all audio tracks corresponding to the original video, or by merging only the audio tracks that conform to a certain preset class of sound features, etc.; this can be set according to the needs of the scenario and is not limited here.
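As a non-limiting illustration of this track preparation, the following Python sketch merges all non-background audio tracks on the time axis and trims the result to the duration of the original video. It uses the pydub library for mixing; the "bg-audio" identity label and the function name are assumptions made for illustration only.

```python
from pydub import AudioSegment

def build_target_audio(track_paths, identities, video_ms):
    """Merge all non-background tracks on the time axis, then trim to the video length.

    track_paths: list of audio file paths, one per audio track
    identities:  parallel list of detected audio identities, e.g. ["audio1", "bg-audio"]
    video_ms:    second duration (length of the original video) in milliseconds
    """
    kept = [AudioSegment.from_file(p)
            for p, ident in zip(track_paths, identities)
            if ident != "bg-audio"]           # drop the background-music track
    kept.sort(key=len, reverse=True)          # overlay onto the longest track
    total = kept[0]
    for seg in kept[1:]:
        total = total.overlay(seg)            # merge on the shared time axis
    if len(total) > video_ms:                 # first duration > second duration
        total = total[:video_ms]              # trim so the lengths are consistent
    return total
```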
In some embodiments, after the original video is acquired, speech recognition is performed on the target audio data of the original video to obtain the corresponding text content; the text content can be obtained through speech recognition technology.
In step 202, semantic analysis is performed on the text content to obtain the first keyword.
The first keyword can match recommended materials for the video in the content dimension. For example, the first keyword may be an emotion keyword such as "haha, so funny", so that materials rendering the emotion can be recommended for the video based on the first keyword, for example sticker materials of laughter, or animation materials of fireworks. For another example, the first keyword may be a term from a professional field, such as "basin", in which case a professional sticker material introducing the corresponding field can be recommended for the video based on the first keyword, making the terminology of the corresponding field easier to understand.
In some embodiments, semantic analysis is performed on the text content, and the parsed semantic result is matched against preset keyword semantics to determine the successfully matched first keyword.
In some embodiments, to improve the efficiency and accuracy of recognizing the first keyword, as shown in FIG. 5, automatic speech recognition (ASR) technology can be used to recognize the text content of the target audio data to obtain sentences, and natural language processing (NLP) technology can then be used to understand the semantics of the corresponding text sentences to obtain the corresponding first keyword.
In some embodiments, materials recommended based on the first keyword can ensure the correlation between the recommended materials and the video content in the content dimension, so as to better render the corresponding video content. For example, as shown in FIG. 6 (in which the corresponding first keyword is displayed in subtitle form so that those skilled in the art can understand this solution more intuitively), after semantic analysis is performed on the text content of the original video, the first keyword obtained is "haha", and an "applause" sticker material can be recommended; thus, in the processed video, the "applause" sticker is displayed for the "haha" audio, which further renders the happy atmosphere. The added recommended material fits the video content better, and its addition does not appear abrupt.
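A minimal Python sketch of this ASR-plus-semantic-matching step is given below. The `transcribe` callable stands in for whichever speech-recognition backend is used (the source does not name one), and the keyword-to-material mapping is invented for illustration; the semantic matching is reduced to exact token lookup.

```python
# Hypothetical preset mapping from keyword semantics to material ids.
KEYWORD_MATERIALS = {
    "haha": "sticker_applause",  # emotion keyword -> atmosphere sticker
    "basin": "sticker_geology",  # professional term -> explanatory sticker
}

def first_keywords(target_audio_path, transcribe):
    """Return (keyword, material_id) pairs found in the target audio data."""
    text = transcribe(target_audio_path)   # ASR: audio -> sentences
    tokens = text.lower().split()          # trivial stand-in for NLP parsing
    return [(tok, KEYWORD_MATERIALS[tok]) for tok in tokens
            if tok in KEYWORD_MATERIALS]
```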
In some embodiments of the present disclosure, as shown in FIG. 7, extracting video content features based on the analysis of the original video includes steps 701 to 702.
In step 701, sound detection is performed on the target audio data of the original video to obtain the corresponding spectrum data.
In some embodiments, it is considered that in some scenarios, even audio data that cannot be converted into corresponding text content may still reflect the content features of the video. For example, if the audio data contains the sound of "applause" or an "explosion", adding recommended materials based on such audio data can also cooperate with the corresponding audio to further enhance the atmosphere of the video.
Therefore, sound detection is performed on the target audio data mentioned in the above embodiments to extract the corresponding spectrum data; based on the spectrum data, information that cannot be converted into text content but reflects the content features of the video can clearly be extracted.
In step 702, the spectrum data is analyzed to obtain the second keyword.
In some embodiments, the spectrum data is analyzed to obtain the second keyword, where a recommended material corresponding to the spectrum data can be acquired based on the second keyword.
In some embodiments, the spectrum data can be input into a deep learning model trained in advance on a large amount of sample data, and the second keyword output by the deep learning model can be obtained.
In other embodiments, the acquired spectrum data can be matched against the preset spectrum data of each keyword, and the second keyword corresponding to the spectrum data can be determined based on the matching degree. For example, if the matching degree between the acquired spectrum data and the spectrum data corresponding to the keyword "explosion" is greater than a preset threshold, it is determined that the second keyword corresponding to the target audio data is "explosion".
Referring to FIG. 8, when determining the recommended material, the first keyword and the second keyword can also be combined for joint recommendation, where the second keyword can be recognized based on audio event detection (AED) technology; the corresponding recommended material is then determined based on the first keyword and the second keyword.
For example, as shown in FIG. 9, if sound detection is performed on the target audio data of the original video to obtain the corresponding spectrum data, and the second keyword obtained from the spectrum data is "explosion", the matched recommended material is an "explosion" sticker; thus, the corresponding "explosion" sticker is displayed on the corresponding video frame to further render the video content containing the explosion audio.
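As a rough illustration of the spectrum-matching variant, the Python sketch below averages a spectrogram over time and compares it against preset keyword spectra by cosine similarity. The templates, how they were built, and the threshold value are all assumptions; a production AED system would be considerably more sophisticated.

```python
import numpy as np
from scipy.signal import spectrogram

def second_keyword(samples, sample_rate, templates, threshold=0.8):
    """Match the spectrum of the target audio against preset keyword spectra.

    templates: hypothetical dict mapping keyword -> reference spectrum
               (same length as the computed frequency profile).
    """
    _, _, spec = spectrogram(samples, fs=sample_rate)
    profile = spec.mean(axis=1)                  # average spectrum over time
    profile = profile / (np.linalg.norm(profile) + 1e-9)
    best_word, best_score = None, threshold
    for word, template in templates.items():
        t = template / (np.linalg.norm(template) + 1e-9)
        score = float(np.dot(profile, t))        # matching degree (cosine)
        if score > best_score:
            best_word, best_score = word, score
    return best_word  # e.g. "explosion" when the match exceeds the threshold
```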
In summary, in the video processing method of the embodiments of the present disclosure, any feature reflecting the video content can be used as a video content feature. The extracted video content features are strongly correlated with the video content, which ensures the correlation between the materials recommended based on those features and the video content, and provides technical support for personalized video processing effects.
Based on the above embodiments, after the video content features are acquired, recommended materials matching the video content features are further recommended; the determination of the recommended materials decides the processing effect of the video. The determination of the recommended materials is described below with reference to specific examples.
In some embodiments of the present disclosure, as shown in FIG. 10, acquiring at least one recommended material matching the video content features includes steps 1001 to 1002.
In step 1001, a video style feature is determined according to the video image of the original video.
It is easy to understand that even for the same video content features, the corresponding video styles may differ; adding the same recommended material would therefore also affect the matching degree with the video content. For example, the first keyword obtained by semantic analysis of the target audio data of original video S1 is "haha", and the first keyword obtained by semantic analysis of the target audio data of original video S2 is also "haha"; however, the speaker of "haha" in S1 is an anime character, while the speaker of "haha" in S2 is a real person. Whether the recommended material adapts to these two styles obviously also affects the processing effect of the video.
In the embodiments of the present disclosure, to ensure the processing effect of the video, the video style feature is determined according to the video image of the original video, where the video style feature includes image features of the video content, theme style features of the video content, features of the shooting objects contained in the video, and the like, which are not limited here.
It should be noted that, in different application scenarios, the way of determining the video style feature according to the video image of the original video differs, as illustrated below.
In some embodiments, as shown in FIG. 11, a convolutional network model is trained in advance on a large amount of sample data, the video image is input into the corresponding convolutional network model, and the video style feature output by the convolutional network model is obtained.
In some embodiments, as shown in FIG. 12, determining the video style feature according to the video image of the original video includes steps 1201 to 1202.
In step 1201, image recognition is performed on the video image of the original video, and at least one shooting object is determined according to the recognition result.
The shooting object may be a subject contained in the video image, including but not limited to a person, an animal, furniture, tableware, etc.
In step 1202, a weighted calculation is performed on the at least one shooting object according to preset object weights, the calculation result is matched against preset style categories, and the video style feature corresponding to the original video is determined.
In this embodiment, to determine the video style, the object type of each shooting object can be recognized, and a preset database can be queried to obtain the object weight of each shooting object, where the database contains each shooting type and the corresponding object weight obtained by training on a large amount of sample data. A weighted calculation is then performed on the at least one shooting object according to the preset object weights, the calculation result is matched against preset style categories, and the video style feature corresponding to the successfully matched style category is determined.
As shown in FIG. 13, multiple video frames can be extracted from the original video and used as the video image of the original video, to further improve the efficiency of style recognition. In some embodiments, multiple video frames can be extracted from the original video at a preset time interval (for example, 1 second); alternatively, a video segment of a preset length can be extracted at intervals, and the multiple video frames contained in the video segment are used as the video image of the original video.
In some embodiments, after the corresponding video image is acquired, the video image can be input into a pre-trained intelligent image recognition model, and at least one shooting object is determined according to the recognition result; the shooting objects in the figure include a human face, objects, the environment, etc. The classification features t1, t2 and t3 corresponding to each shooting object are then recognized, with corresponding object weights z1, z2 and z3 respectively; the value of t1z1 + t2z2 + t3z3 is calculated as the calculation result, and the calculation result is matched against preset style categories to determine the video style feature corresponding to the original video.
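The weighted style matching of step 1202 can be illustrated with the following Python sketch. The object weights z and the style category centers are placeholders standing in for values that would be learned from sample data; the nearest-center matching rule is likewise an assumption.

```python
# Placeholder weights and category centers; real values would come from training.
OBJECT_WEIGHTS = {"face": 0.5, "object": 0.3, "environment": 0.2}  # z1, z2, z3
STYLE_CENTERS = {"girl_anime": 0.9, "realistic": 0.3}              # preset style categories

def video_style(detections):
    """detections: list of (object_type, classification_feature t in [0, 1])."""
    # weighted calculation: t1*z1 + t2*z2 + t3*z3
    score = sum(t * OBJECT_WEIGHTS.get(obj, 0.0) for obj, t in detections)
    # match the calculation result against the preset style categories
    return min(STYLE_CENTERS, key=lambda s: abs(STYLE_CENTERS[s] - score))

print(video_style([("face", 0.8), ("object", 0.4), ("environment", 0.6)]))
```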
In step 1002, at least one recommended material matching the video style feature and the video content features is acquired.
In some embodiments, after the video style feature is acquired, at least one recommended material matching both the video style feature and the video content features is acquired; thus, the recommended material matches the video content in both the video style feature and the video content features, further improving the video processing effect.
In some embodiments, as shown in FIG. 14, a material library matching the video style feature can be acquired first, and at least one recommended material matching the video content features can then be acquired from that material library, thereby ensuring that the acquired recommended materials not only match the video content but are also consistent with the style of the video.
For example, if the video style feature is "girl anime", a material library composed of various girl-style materials matching "girl anime" is acquired, and video materials are then matched in that library based on the video content features, so as to ensure that all the acquired recommended materials are in the girl style.
In some embodiments of the present disclosure, as shown in FIG. 15, acquiring at least one recommended material matching the video content features includes steps 1501 to 1504.
In step 1501, the playback time of the video frame corresponding to the video content feature is determined in the original video, where the video content feature is generated according to the video content of the video frame.
In some embodiments, since not every video frame contains the same video content features, and a video content feature is generated according to the video content of a video frame, the playback time of the video frame corresponding to the video content feature is determined in the original video, so that, according to that playback time, materials are recommended and added only for the video frames containing the corresponding video content feature.
In step 1502, the video content feature is marked with a timestamp according to the playback time of the video frame.
In some embodiments, the video content feature is marked with a timestamp according to the playback time of the video frame, so as to facilitate temporal matching of recommended materials.
In step 1503, for the same timestamp, if it is determined that multiple corresponding video content features exist, the multiple video content features are combined into a video feature set, and at least one recommended material matching the video feature set is acquired.
In some embodiments, for the same timestamp, i.e., for the same video frame corresponding to the same time, if it is determined that multiple corresponding video content features exist, the multiple video content features are combined into a video feature set, and at least one recommended material matching the video feature set is acquired.
In some embodiments, multiple video content features can be combined to generate multiple video content feature combinations (video feature sets); a preset correspondence is queried to determine whether an enhanced material corresponds to each video content feature combination. If no enhanced material is matched, the video content feature combination is split into individual content features to match recommended materials; if an enhanced material is matched, the enhanced material is used as the corresponding recommended material.
It should be understood that the video feature set here does not necessarily lead to a simple combination of the recommended materials corresponding to the multiple video content features; when the multiple video content features are correlated, another recommended material with a stronger sense of atmosphere may be generated to further enhance the video atmosphere.
For example, if, among the multiple video content features, the first keyword corresponding to video content feature 1 is "haha" and the second keyword corresponding to video content feature 2 is "applause", the recommended material jointly determined based on the first keyword and the second keyword is a transition effect material, rather than the sticker materials respectively corresponding to "haha" and "applause" mentioned above.
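A minimal Python sketch of this enhanced-material lookup is shown below. The mapping entries are invented examples ("haha" plus "applause" yielding a transition effect), and the use of frozenset keys is an implementation assumption.

```python
# Hypothetical preset correspondences.
ENHANCED = {frozenset({"haha", "applause"}): "transition_effect"}
SINGLE = {"haha": "sticker_applause", "applause": "sticker_clap_hands",
          "explosion": "sticker_explosion"}

def materials_for(timestamp, features):
    """features: video content features (keywords) sharing one timestamp."""
    enhanced = ENHANCED.get(frozenset(features))   # query the preset correspondence
    if enhanced is not None:
        return [(timestamp, enhanced)]             # enhanced material replaces the set
    # no enhanced material: split the set into individual content features
    return [(timestamp, SINGLE[f]) for f in features if f in SINGLE]
```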
In step 1504, for the same timestamp, if it is determined that one corresponding video content feature exists, at least one recommended material matching the one video content feature is acquired.
In some embodiments, for the same timestamp, if it is determined that one corresponding video content feature exists, at least one recommended material matching the one video content feature is acquired; that is, for a single video content feature, at least one individually matched recommended material is acquired.
Further, after the corresponding recommended material is acquired, when video processing is performed on the original video according to the recommended material to generate the target video, the material addition time of the recommended material matching the video content feature is set according to the timestamp of the video content feature, and the addition time is consistent with the display time of the video frame of the corresponding video content feature.
The original video is then clipped according to the material addition time of the recommended material to generate the target video. In this way, the corresponding recommended material is added only when the video frame containing the video content feature corresponding to the material is played, avoiding additions of material that do not match the video content.
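As an illustration of setting material addition times from the feature timestamps, the sketch below builds an edit list using the illustrative MaterialTrack record from the Table 1 discussion above; the default display duration is an assumption.

```python
def build_edit_list(feature_hits, duration_ms=1500):
    """feature_hits: list of (timestamp_ms, material_id) pairs, as produced by
    the matching step; returns material tracks for the clipping step."""
    return [MaterialTrack(material_id=mid, track_type="sticker",
                          start_ms=ts, duration_ms=duration_ms)
            for ts, mid in feature_hits]

edits = build_edit_list([(3200, "sticker_applause"), (7800, "sticker_explosion")])
# Each material is rendered only while its video frame is on screen,
# i.e. from start_ms to start_ms + duration_ms.
```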
In addition, in actual execution, some materials have no size information, such as sound effect materials and transition effect materials, while other materials do have size information, such as sticker materials and text materials. To prevent materials with size information from occluding important display content when added, for example to avoid occluding a human face in a video frame, the addition area of such materials needs to be determined.
In some embodiments, in the case where the material type of the recommended material satisfies a preset target type, i.e., in the case where the recommended material has a size-information attribute, the corresponding material is considered to satisfy the preset target type. The target video frame corresponding to the material addition time of the recommended material is then obtained from the original video, and image recognition is performed on the target video frame to obtain the subject area of the shooting object, where the subject area may be any position information reflecting the location of the shooting object, for example a center coordinate point or a position range.
For example, if the material is added according to the first keyword, the shooting object is the speaker of the "haha" audio; after the subject area of the shooting object is determined, the material area in which the recommended material is added on the target video frame is determined according to the subject area of the shooting object.
In some embodiments, a material type label of the recommended material can be determined, a preset correspondence can be queried according to the material type label to determine the area feature of the material area (for example, a background area of the image), and an area on the target video frame matching that area feature is determined as the material area.
In other embodiments, an object type label of the shooting object can be determined, a preset correspondence can be queried according to the object type label to determine the area feature of the material area (for example, if the shooting object is of the face type, the corresponding area feature corresponds to the top of the head, etc.), and an area on the target video frame matching that area feature is determined as the material area.
After the material area is determined, the original video is clipped according to the material addition time and the material area of the recommended material to generate the target video, and the corresponding material is added to the material area in the video frame corresponding to the material addition time. The material area may be the center point coordinates at which the material is added in the corresponding video frame, or the coordinate range of the material added in the corresponding video frame, etc. Since the server that determines the subject area and the like may not be the same server as the one performing the style feature recognition described above, the style feature recognition may be performed on a local server to improve recognition efficiency, while the analysis of the material addition time and the material area may be performed on a remote server to reduce the pressure on computing power.
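The "above the head" placement mentioned for face-type shooting objects can be sketched as simple rectangle arithmetic, as below; the margin value and the clamping behavior are assumptions for illustration.

```python
def material_area(frame_w, frame_h, face_box, sticker_w, sticker_h):
    """face_box: (x, y, w, h) subject area from image recognition.
    Returns (x, y) top-left coordinates for the sticker, clamped to the frame."""
    fx, fy, fw, fh = face_box
    x = fx + fw // 2 - sticker_w // 2        # center the sticker over the face
    y = fy - sticker_h - 10                  # place it above the head, 10 px margin
    x = max(0, min(x, frame_w - sticker_w))  # keep the sticker inside the frame
    y = max(0, y)
    return x, y

print(material_area(1920, 1080, face_box=(800, 300, 200, 240),
                    sticker_w=180, sticker_h=120))
```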
In some embodiments, as shown in FIG. 16, when the recommended materials include F1 and F2, both having the size attribute, the material addition times of the recommended materials matching the video content features are set to t1 and t2 respectively according to the timestamps of the video content features, and the original video is clipped according to the material addition times to obtain video segment clip1 corresponding to F1 and video segment clip2 corresponding to F2. After clip1 and clip2 are sent to the corresponding server, that server optimizes the added materials: image recognition is performed on the target video frame to obtain the subject area of the shooting object, the material area in which the recommended material is added on the target video frame is determined according to the subject area, and target video segment sticker1 corresponding to clip1 and target video segment sticker2 corresponding to clip2 are obtained according to the material addition times and material areas of the recommended materials. The original video is then edited according to sticker1 and sticker2 to obtain the corresponding target video.
In summary, in the video processing method of the embodiments of the present disclosure, after the video content features are determined, at least one recommended material matching the multi-dimensional video content features is determined, and the correspondence between the materials and the video frames in both position and time is ensured, further guaranteeing that the video processing effect satisfies the personalized characteristics of the video content.
To implement the above embodiments, the present disclosure further provides a video processing apparatus. FIG. 17 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the present disclosure. The apparatus may be implemented in software and/or hardware and may generally be integrated in an electronic device. As shown in FIG. 17, the apparatus includes: an extraction module 1710, an acquisition module 1720 and a processing module 1730.
The extraction module 1710 is configured to extract video content features based on an analysis of an original video.
The acquisition module 1720 is configured to acquire at least one recommended material matching the video content features.
The processing module 1730 is configured to perform video processing on the original video according to the recommended material to generate a target video, where the target video is a video generated by adding the recommended material to the original video.
The video processing apparatus provided by the embodiments of the present disclosure can execute the video processing method provided by any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to the executed method, which are not repeated here.
To implement the above embodiments, the present disclosure further provides a computer program product, including a computer program/instructions which, when executed by a processor, implement the video processing method in any of the above embodiments.
FIG. 18 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Referring specifically to FIG. 18, it shows a schematic structural diagram of an electronic device 1800 suitable for implementing embodiments of the present disclosure. The electronic device 1800 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 18 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 18, the electronic device 1800 may include a processor (for example, a central processing unit, a graphics processing unit, etc.) 1801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1802 or a program loaded from a memory 1808 into a random access memory (RAM) 1803. Various programs and data required for the operation of the electronic device 1800 are also stored in the RAM 1803. The processor 1801, the ROM 1802 and the RAM 1803 are connected to each other via a bus 1804. An input/output (I/O) interface 1805 is also connected to the bus 1804.
Generally, the following devices can be connected to the I/O interface 1805: input devices 1806 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 1807 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a memory 1808 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 1809. The communication device 1809 may allow the electronic device 1800 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 18 shows the electronic device 1800 with various devices, it should be understood that it is not required to implement or provide all the devices shown; more or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 1809, or installed from the memory 1808, or installed from the ROM 1802. When the computer program is executed by the processor 1801, the above functions defined in the video processing method of the embodiments of the present disclosure are executed.
It should be noted that the above computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, the computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus or device. In the present disclosure, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium; the computer-readable signal medium may send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to: electric wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (for example, a communication network) in any form or medium. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet) and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The above computer-readable medium may be contained in the above electronic device, or it may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extract video content features of the original video, acquire at least one recommended material matching the video content features, and then add the recommended material to the original video to obtain the target video. In this way, video materials are added in a manner adapted to the video content, the matching degree between the video content and the video materials is improved, and personalized effect processing of the video is realized.
Computer program code for executing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or part of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented in software or in hardware, where the name of a unit does not constitute a limitation on the unit itself in some cases.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include: field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the above. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to some embodiments of the present disclosure, the present disclosure provides a video processing method, including:
extracting video content features based on an analysis of an original video;
acquiring at least one recommended material matching the video content features;
performing video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, extracting video content features based on an analysis of the original video includes:
performing speech recognition on target audio data of the original video to obtain corresponding text content;
performing semantic analysis on the text content to obtain a first keyword.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, extracting video content features based on an analysis of the original video includes:
performing sound detection on target audio data of the original video to obtain corresponding spectrum data;
analyzing the spectrum data to obtain a second keyword.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, the method of acquiring the target audio data includes:
acquiring all audio tracks of the video file of the original video displayed in a video clipping application;
merging all of the audio tracks based on a time axis to generate total audio data;
comparing the first duration of the total audio data with the second duration of the original video, and in the case where the first duration is longer than the second duration, trimming the first duration of the total audio data to obtain the target audio data, wherein the duration of the target audio data is consistent with the second duration.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, merging all of the audio tracks based on the time axis to generate the total audio data includes:
detecting an audio identity of each audio track;
in the case where a target audio track carrying a background-music identity is detected, merging all audio tracks other than the target audio track based on the time axis to generate the total audio data.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, acquiring at least one recommended material matching the video content features includes:
determining a video style feature according to a video image of the original video;
acquiring at least one recommended material matching the video style feature and the video content features.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, determining the video style feature according to the video image of the original video includes:
performing image recognition on the video image of the original video, and determining at least one shooting object according to the recognition result;
performing a weighted calculation on the at least one shooting object according to preset object weights, matching the calculation result against preset style categories, and determining the video style feature corresponding to the original video.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, acquiring at least one recommended material matching the video content features includes:
determining, in the original video, the playback time of the video frame corresponding to the video content feature, wherein the video content feature is generated according to the video content of the video frame;
marking the video content feature with a timestamp according to the playback time of the video frame;
for the same timestamp, in the case where multiple corresponding video content features are determined to exist, combining the multiple video content features into a video feature set, and acquiring at least one recommended material matching the video feature set;
for the same timestamp, if one corresponding video content feature is determined to exist, acquiring at least one recommended material matching the one video content feature.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, performing video processing on the original video according to the recommended material to generate the target video includes:
setting, according to the timestamp of the video content feature, the material addition time of the recommended material matching the video content feature;
clipping the original video according to the material addition time of the recommended material to generate the target video.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, clipping the original video according to the material addition time of the recommended material to generate the target video includes:
in the case where the material type of the recommended material satisfies a preset target type, acquiring, from the original video, the target video frame corresponding to the material addition time of the recommended material;
performing image recognition on the target video frame to obtain the subject area of the shooting object;
determining, according to the subject area of the shooting object, the material area in which the recommended material is added on the target video frame;
clipping the original video according to the material addition time of the recommended material and the material area to generate the target video.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, detecting the audio identity of each audio track includes:
identifying the sound features of the audio corresponding to each audio track;
matching the sound features of the audio corresponding to each audio track against the preset sound features corresponding to each audio identity;
determining the audio identity of each audio track based on the matching results.
According to some embodiments of the present disclosure, in the video processing method provided by the present disclosure, combining the multiple video content features into the video feature set and acquiring at least one recommended material matching the video feature set includes:
querying a preset correspondence according to the video feature set, and determining whether an enhanced material corresponds to the video feature set;
in the case where no enhanced material is matched, splitting the video feature set into individual content features to match recommended materials;
in the case where an enhanced material is matched, using the enhanced material as the corresponding recommended material.
According to some embodiments of the present disclosure, the present disclosure provides a video processing apparatus, including:
an extraction module configured to extract video content features based on an analysis of an original video;
an acquisition module configured to acquire at least one recommended material matching the video content features;
a processing module configured to perform video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the extraction module is configured to:
perform speech recognition on target audio data of the original video to obtain corresponding text content;
perform semantic analysis on the text content to obtain a first keyword.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the extraction module is configured to:
perform sound detection on target audio data of the original video to obtain corresponding spectrum data;
analyze the spectrum data to obtain a second keyword.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the extraction module is configured to: acquire all audio tracks of the video file of the original video displayed in a video clipping application;
merge all of the audio tracks based on a time axis to generate total audio data;
compare the first duration of the total audio data with the second duration of the original video and, in the case where the first duration is longer than the second duration, trim the first duration of the total audio data to obtain the target audio data, wherein the duration of the target audio data is consistent with the second duration.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the extraction module is configured to:
detect an audio identity of each audio track;
in the case where a target audio track carrying a background-music identity is detected, merge all audio tracks other than the target audio track based on the time axis to generate the total audio data.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the acquisition module is specifically configured to:
determine a video style feature according to a video image of the original video;
acquire at least one recommended material matching the video style feature and the video content features.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the acquisition module is specifically configured to: perform image recognition on the video image of the original video, and determine at least one shooting object according to the recognition result;
perform a weighted calculation on the at least one shooting object according to preset object weights, match the calculation result against preset style categories, and determine the video style feature corresponding to the original video.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the acquisition module is configured to:
determine, in the original video, the playback time of the video frame corresponding to the video content feature, wherein the video content feature is generated according to the video content of the video frame;
mark the video content feature with a timestamp according to the playback time of the video frame;
for the same timestamp, in the case where multiple corresponding video content features are determined to exist, combine the multiple video content features into a video feature set, and acquire at least one recommended material matching the video feature set;
for the same timestamp, in the case where one corresponding video content feature is determined to exist, acquire at least one recommended material matching the one video content feature.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the acquisition module is configured to:
set, according to the timestamp of the video content feature, the material addition time of the recommended material matching the video content feature;
clip the original video according to the material addition time of the recommended material to generate the target video.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the acquisition module is configured to:
in the case where the material type of the recommended material satisfies a preset target type, acquire, from the original video, the target video frame corresponding to the material addition time of the recommended material;
perform image recognition on the target video frame to obtain the subject area of the shooting object;
determine, according to the subject area of the shooting object, the material area in which the recommended material is added on the target video frame;
clip the original video according to the material addition time of the recommended material and the material area to generate the target video.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the extraction module is configured to:
identify the sound features of the audio corresponding to each audio track;
match the sound features of the audio corresponding to each audio track against the preset sound features corresponding to each audio identity;
determine the audio identity of each audio track based on the matching results.
According to some embodiments of the present disclosure, in the video processing apparatus provided by the present disclosure, the acquisition module is configured to:
query a preset correspondence according to the video feature set, and determine whether an enhanced material corresponds to the video feature set;
in the case where no enhanced material is matched, split the video feature set into individual content features to match recommended materials;
in the case where an enhanced material is matched, use the enhanced material as the corresponding recommended material.
According to some embodiments of the present disclosure, the present disclosure provides an electronic device, including:
a processor;
a memory for storing instructions executable by the processor;
the processor being configured to read the executable instructions from the memory and execute the instructions to implement any of the video processing methods provided by the present disclosure.
According to some embodiments of the present disclosure, the present disclosure provides a computer-readable storage medium, the storage medium storing a computer program, the computer program being used to execute any of the video processing methods provided by the present disclosure.
The above description is only a description of the preferred embodiments of the present disclosure and the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by specific combinations of the above technical features, but should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
In addition, although the operations are depicted in a specific order, this should not be understood as requiring these operations to be performed in the specific order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments individually or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or method logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims.

Claims (16)

  1. A video processing method, comprising:
    extracting video content features based on an analysis of an original video;
    acquiring at least one recommended material matching the video content features;
    performing video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
  2. The video processing method according to claim 1, wherein extracting video content features based on an analysis of the original video comprises:
    performing speech recognition on target audio data of the original video to obtain corresponding text content;
    performing semantic analysis on the text content to obtain a first keyword.
  3. The video processing method according to claim 1 or 2, wherein extracting video content features based on an analysis of the original video comprises:
    performing sound detection on target audio data of the original video to obtain corresponding spectrum data;
    analyzing the spectrum data to obtain a second keyword.
  4. The video processing method according to claim 2 or 3, wherein acquiring the target audio data comprises:
    acquiring all audio tracks of the video file of the original video displayed in a video clipping application;
    merging all of the audio tracks based on a time axis to generate total audio data;
    comparing a first duration of the total audio data with a second duration of the original video, and in the case where the first duration is longer than the second duration, trimming the first duration of the total audio data to obtain the target audio data, wherein the duration of the target audio data is consistent with the second duration.
  5. The video processing method according to claim 4, wherein merging all of the audio tracks based on the time axis to generate the total audio data comprises:
    detecting an audio identity of each audio track;
    in the case where a target audio track carrying a background-music identity is detected, merging all audio tracks other than the target audio track based on the time axis to generate the total audio data.
  6. The video processing method according to any one of claims 1-5, wherein acquiring at least one recommended material matching the video content features comprises:
    determining a video style feature according to a video image of the original video;
    acquiring at least one recommended material matching the video style feature and the video content features.
  7. The video processing method according to claim 6, wherein determining the video style feature according to the video image of the original video comprises:
    performing image recognition on the video image of the original video, and determining at least one shooting object according to the recognition result;
    performing a weighted calculation on the at least one shooting object according to preset object weights, matching the calculation result against preset style categories, and determining the video style feature corresponding to the original video.
  8. The video processing method according to any one of claims 1-7, wherein acquiring at least one recommended material matching the video content features comprises:
    determining, in the original video, a playback time of a video frame corresponding to the video content feature, wherein the video content feature is generated according to the video content of the video frame;
    marking the video content feature with a timestamp according to the playback time of the video frame;
    for a same timestamp, in the case where multiple corresponding video content features are determined to exist, combining the multiple video content features into a video feature set, and acquiring at least one recommended material matching the video feature set;
    for a same timestamp, in the case where one corresponding video content feature is determined to exist, acquiring at least one recommended material matching the one video content feature.
  9. The video processing method according to claim 8, wherein performing video processing on the original video according to the recommended material to generate the target video comprises:
    setting, according to the timestamp of the video content feature, a material addition time of the recommended material matching the video content feature;
    clipping the original video according to the material addition time of the recommended material to generate the target video.
  10. The video processing method according to claim 9, wherein clipping the original video according to the material addition time of the recommended material to generate the target video comprises:
    in the case where a material type of the recommended material satisfies a preset target type, acquiring, from the original video, a target video frame corresponding to the material addition time of the recommended material;
    performing image recognition on the target video frame to obtain a subject area of a shooting object;
    determining, according to the subject area of the shooting object, a material area in which the recommended material is added on the target video frame;
    clipping the original video according to the material addition time of the recommended material and the material area to generate the target video.
  11. The video processing method according to any one of claims 5-10, wherein detecting the audio identity of each audio track comprises:
    identifying sound features of audio corresponding to each audio track;
    matching the sound features of the audio corresponding to each audio track against preset sound features corresponding to each audio identity;
    determining the audio identity of each audio track based on the matching results.
  12. The video processing method according to any one of claims 8-10, wherein combining the multiple video content features into the video feature set and acquiring at least one recommended material matching the video feature set comprises:
    querying a preset correspondence according to the video feature set, and determining whether an enhanced material corresponds to the video feature set;
    in the case where no enhanced material is matched, splitting the video feature set into individual content features to match recommended materials;
    in the case where an enhanced material is matched, using the enhanced material as the corresponding recommended material.
  13. A video processing apparatus, comprising:
    an extraction module configured to extract video content features based on an analysis of an original video;
    an acquisition module configured to acquire at least one recommended material matching the video content features;
    a processing module configured to perform video processing on the original video according to the recommended material to generate a target video, wherein the target video is a video generated by adding the recommended material to the original video.
  14. An electronic device, comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    the processor being configured to read the executable instructions from the memory and execute the instructions to implement the video processing method according to any one of claims 1-12.
  15. A computer-readable storage medium, wherein the storage medium stores a computer program, and the computer program is used to execute the video processing method according to any one of claims 1-12.
  16. A computer program, comprising instructions which, when executed by a processor, implement the video processing method according to any one of claims 1-12.
PCT/CN2023/077309 2022-02-25 2023-02-21 Video processing method, apparatus, device and medium WO2023160515A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/574,263 US20240244290A1 (en) 2022-02-25 2023-02-21 Video processing method and apparatus, device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210178794.5 2022-02-25
CN202210178794.5A CN116708917A (zh) 2022-02-25 2022-02-25 Video processing method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
WO2023160515A1 true WO2023160515A1 (zh) 2023-08-31

Family

ID=87764746

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/077309 WO2023160515A1 (zh) 2022-02-25 2023-02-21 视频处理方法、装置、设备及介质

Country Status (3)

Country Link
US (1) US20240244290A1 (zh)
CN (1) CN116708917A (zh)
WO (1) WO2023160515A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150169747A1 (en) * 2013-12-12 2015-06-18 Google Inc. Systems and methods for automatically suggesting media accompaniments based on identified media content
US9270964B1 (en) * 2013-06-24 2016-02-23 Google Inc. Extracting audio components of a portion of video to facilitate editing audio of the video
CN110381371A * 2019-07-30 2019-10-25 维沃移动通信有限公司 Video clipping method and electronic device
CN111541936A * 2020-04-02 2020-08-14 腾讯科技(深圳)有限公司 Video and image processing method and apparatus, electronic device, and storage medium
CN111556335A * 2020-04-15 2020-08-18 早安科技(广州)有限公司 Video sticker processing method and apparatus
CN113518256A * 2021-07-23 2021-10-19 腾讯科技(深圳)有限公司 Video processing method and apparatus, electronic device, and computer-readable storage medium

Also Published As

Publication number Publication date
US20240244290A1 (en) 2024-07-18
CN116708917A (zh) 2023-09-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23759140

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18574263

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2023759140

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2023759140

Country of ref document: EP

Effective date: 20240925