WO2022121447A1 - Background audio construction method and apparatus - Google Patents

Background audio construction method and apparatus

Info

Publication number
WO2022121447A1
WO2022121447A1 (PCT/CN2021/120377)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
semantic segmentation
feature
video data
sample
Prior art date
Application number
PCT/CN2021/120377
Other languages
English (en)
French (fr)
Inventor
张奕
Original Assignee
上海幻电信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海幻电信息科技有限公司
Priority to EP21902155.7A (EP4207746A4)
Publication of WO2022121447A1
Priority to US18/133,641 (US20230245451A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265 Mixing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata automatically derived from the content, using audio features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel

Definitions

  • the embodiments of the present application relate to the field of computer technologies, and in particular, to a method for constructing background audio.
  • One or more embodiments of the present application simultaneously relate to a background audio construction apparatus, a computing device, a computer-readable storage medium, and a computer program product.
  • The technology for adding background music to a video can search a background music library for background music that matches the video according to content information of the video to which background music needs to be added, such as the video theme, and use it as the background music of the video.
  • However, in the current related art, the video is generally compared directly with each piece of background music in the background music library to obtain the background music that best matches the theme of the video. In this way, the efficiency of obtaining background music is low, and the obtained background music is weakly correlated with the video.
  • the embodiments of the present application provide a background audio construction method.
  • One or more embodiments of the present application also relate to a background audio construction apparatus, a computing device, a computer-readable storage medium, and a computer program product, so as to solve the technical defect in the prior art that the video features extracted directly by a video classification method are relatively single, which leads to low correlation of background music matching results.
  • a background audio construction method, including: performing semantic segmentation on to-be-processed video data to generate a corresponding semantic segmentation map, and extracting semantic segmentation features of the to-be-processed video data based on the semantic segmentation map; extracting audio features of each audio file in a pre-established audio set; and aligning the audio features with the semantic segmentation features, screening target audio files in the audio set according to the alignment result, and constructing background audio of the to-be-processed video data based on the target audio files.
  • a background audio construction device including:
  • a first extraction module configured to perform semantic segmentation on the video data to be processed to generate a corresponding semantic segmentation map, and extract semantic segmentation features of the video data to be processed based on the semantic segmentation map;
  • the second extraction module is configured to extract the audio features of each audio file in the pre-established audio set
  • the construction module is configured to align the audio feature with the semantic segmentation feature, screen target audio files in the audio set according to the alignment result, and construct the background audio of the to-be-processed video data based on the target audio file.
  • a computing device, including a memory and a processor;
  • the memory is used for storing computer-executable instructions;
  • the processor is used for executing the computer-executable instructions, wherein when the processor executes the computer-executable instructions, the steps of the background audio construction method are implemented.
  • a computer-readable storage medium which stores computer-executable instructions, and when the instructions are executed by a processor, implements the steps of the background audio construction method.
  • a computer program product is provided, when the computer program product is executed in a computer, the computer is made to execute the steps of the above background audio construction method.
  • An embodiment of the present application implements a background audio construction method and apparatus, wherein the background audio construction method includes performing semantic segmentation on to-be-processed video data to generate a corresponding semantic segmentation map, extracting the semantic segmentation features of the to-be-processed video data based on the semantic segmentation map, extracting the audio features of each audio file in a pre-established audio set, aligning the audio features with the semantic segmentation features, screening target audio files in the audio set according to the alignment result, and constructing the background audio of the to-be-processed video data based on the target audio files.
  • Constructing background audio for the to-be-processed video data in this way helps improve the efficiency of acquiring the background audio of the to-be-processed video data and, at the same time, helps improve the correlation between the acquired background audio and the to-be-processed video data, so that background audio matching is more accurate and the video display effect is better.
  • FIG. 1 is a flowchart of a background audio construction method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a process of generating a semantic segmentation map provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of an audio feature extraction process provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an alignment process provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a background audio construction process provided by an embodiment of the present application.
  • FIG. 6 is a process flow diagram of a background music construction method in which the background audio construction method provided by an embodiment of the present application is applied to the video field;
  • FIG. 7 is a schematic structural diagram of a background audio construction device provided by an embodiment of the present application.
  • FIG. 8 is a structural block diagram of a computing device provided by an embodiment of the present application.
  • Semantic segmentation map: a grayscale image of the same size as the input original image, in which each pixel holds the category label of the corresponding pixel in the original image.
  • a background audio construction method is provided.
  • One or more embodiments of the present application simultaneously relate to a background audio construction apparatus, a computing device, and a computer-readable storage medium, which will be described in detail in the following embodiments.
  • FIG. 1 shows a flowchart of a background audio construction method provided according to an embodiment of the present application, including the following steps.
  • Step 102 Perform semantic segmentation on the video data to be processed to generate a corresponding semantic segmentation map, and extract semantic segmentation features of the video data to be processed based on the semantic segmentation map.
  • The background audio construction method of the embodiments of the present application can be applied to various scenarios in which background audio (background music) needs to be constructed. For example, when a user publishes a video on a short-video platform, the background audio construction method provided by the embodiments of the present application can be used to add background music to the video, and background music highly relevant to the video can be obtained more quickly through this method. Likewise, when background music needs to be added to live-streamed or recorded video or audio, highly relevant background music can still be quickly acquired through the background audio construction method.
  • semantic segmentation is a classification at the pixel level, where pixels belonging to the same class in an image or video frame are grouped together. Therefore, after performing semantic segmentation on the video data to be processed to generate a corresponding semantic segmentation map, the semantic segmentation features of the video data to be processed can be extracted based on the semantic segmentation map.
  • the to-be-processed video data in the embodiments of the present application includes to-be-processed video data and to-be-processed audio data
  • The to-be-processed video data can be presented on clients such as large-scale video playback devices, game consoles, desktop computers, smart phones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, e-book readers, and other display terminals.
  • The semantic segmentation features include, but are not limited to, category label distribution statistics, edge-pixel proportion statistics, and difference statistics between the semantic segmentation maps of preceding and following key frames. The category label distribution statistic is the ratio of the number of pixels corresponding to each category label. If the pixels above, below, to the left of, and to the right of a pixel are defined as its adjacent pixels, a pixel is an edge pixel when any of its adjacent pixels has a category label different from its own; the edge-pixel proportion statistic is, for each category label, the ratio of edge pixels to the total number of pixels of that label. The key-frame difference statistic counts, between the semantic segmentation maps corresponding to adjacent video-segment key frames, the differences of the category labels of pixels at the same position: if the labels of the preceding and following frames are the same at a position, the difference at that position is 0, otherwise it is 1.
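  • For illustration only, a minimal Python sketch of these three statistics, assuming the semantic segmentation map is a 2-D integer array of category labels and the class count is known (the function names and array layout are assumptions, not part of the original disclosure), could look like this:

```python
import numpy as np

def label_distribution(seg_map: np.ndarray, num_classes: int) -> np.ndarray:
    """Ratio of the number of pixels corresponding to each category label."""
    counts = np.bincount(seg_map.ravel(), minlength=num_classes)
    return counts / seg_map.size

def edge_pixel_ratio(seg_map: np.ndarray, num_classes: int) -> np.ndarray:
    """For each label: ratio of edge pixels (an up/down/left/right neighbour
    has a different label) to the total number of pixels carrying that label."""
    padded = np.pad(seg_map, 1, mode="edge")
    center = padded[1:-1, 1:-1]
    edge = np.zeros_like(center, dtype=bool)
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        edge |= padded[1 + dy:padded.shape[0] - 1 + dy,
                       1 + dx:padded.shape[1] - 1 + dx] != center
    ratios = np.zeros(num_classes)
    for c in range(num_classes):
        mask = center == c
        if mask.any():
            ratios[c] = edge[mask].mean()
    return ratios

def keyframe_difference(prev_map: np.ndarray, next_map: np.ndarray) -> float:
    """Difference statistic between the maps of adjacent key frames:
    0 where the labels at the same position agree, 1 where they differ."""
    return float((prev_map != next_map).mean())
```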
  • the background music of the video data to be processed can be constructed based on the semantic segmentation features and the audio features of each audio file in the audio set.
  • In a specific implementation, FIG. 2 shows a schematic diagram of a process of generating a semantic segmentation map provided by an embodiment of the present application, including steps 202 to 212.
  • Step 202 acquiring to-be-processed video data.
  • Step 204 Perform video segment segmentation on the to-be-processed video data according to a preset duration threshold.
  • Step 206 Extract the first key frame of each first video segment in the segmentation result.
  • Step 208 Input the first key frame into a semantic segmentation model for processing.
  • Step 210 Generate a first semantic segmentation map of each of the first video segments.
  • Step 212 generating semantic segmentation features.
  • Further, extracting the semantic segmentation features of the to-be-processed video data based on the semantic segmentation map can specifically be implemented in the following manner: extracting the first semantic segmentation feature of each first video segment based on the first semantic segmentation map; and calculating the mean of the first semantic segmentation features of the first video segments in the segmentation result, and using the mean as the semantic segmentation feature of the to-be-processed video data.
  • Specifically, in the embodiments of the present application, a semantic segmentation map corresponding to the to-be-processed video data can be generated by a semantic segmentation model. Before the to-be-processed video data is input into the semantic segmentation model, the to-be-processed video data can first be segmented into video segments according to a preset duration threshold, the key frame of each video segment in the segmentation result is extracted, and the key frames are then input into the semantic segmentation model, so that the semantic segmentation model performs semantic segmentation on the key frames and generates the semantic segmentation map of each key frame.
  • If the to-be-processed video data is segmented into n video segments, the key frame may be any one or more of a random frame, a start frame, an end frame, or an intermediate frame of each of the n video segments.
  • In addition, as described above, the semantic segmentation features include but are not limited to category label distribution statistics, edge-pixel proportion statistics, and difference statistics between the semantic segmentation maps of preceding and following key frames. Therefore, after the key frame of each of the n video segments is determined and the semantic segmentation map of each key frame is generated by the semantic segmentation model, these semantic segmentation features can be extracted from each semantic segmentation map; the means of the category label distribution statistics, the edge-pixel proportion statistics, and the key-frame difference statistics over the key frames are then calculated, and the mean results are used as the semantic segmentation feature of the to-be-processed video data, so that the background music of the to-be-processed video data is constructed based on the semantic segmentation feature of the to-be-processed video data and the audio features of the audio files in the audio set.
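  • A sketch of this flow, segmenting the video by a fixed duration, taking one key frame per segment, running an (assumed) segmentation model on it, and averaging the per-key-frame statistics over the time dimension, is shown below; `segmentation_model` and the statistic helpers above are illustrative assumptions:

```python
import numpy as np

def extract_video_semantic_feature(frames, fps, segment_seconds,
                                   segmentation_model, num_classes):
    """frames: decoded video frames in temporal order."""
    seg_len = max(1, int(fps * segment_seconds))
    keyframe_maps = []
    for start in range(0, len(frames), seg_len):
        segment = frames[start:start + seg_len]
        key_frame = segment[len(segment) // 2]                # e.g. the middle frame
        keyframe_maps.append(segmentation_model(key_frame))   # -> 2-D label map

    per_frame_feats = []
    for i, seg_map in enumerate(keyframe_maps):
        diff = keyframe_difference(keyframe_maps[i - 1], seg_map) if i > 0 else 0.0
        feat = np.concatenate([
            label_distribution(seg_map, num_classes),
            edge_pixel_ratio(seg_map, num_classes),
            [diff],
        ])
        per_frame_feats.append(feat)

    # the mean over the time dimension is used as the video-level feature
    return np.mean(per_frame_feats, axis=0)
```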
  • Further, the semantic segmentation model is trained in the following manner: segmenting a sample video file into video segments according to the preset duration threshold; extracting the second key frame of each second video segment in the segmentation result; and using the second key frame as sample data and the category identifier of each pixel in the semantic segmentation map of the second key frame as a label, which are input into the semantic segmentation model to be trained for training to obtain the semantic segmentation model, where the semantic segmentation model associates the second key frame with the category identifier of each pixel.
  • Specifically, after the sample video file is acquired, it can be segmented into video segments according to a fixed duration, the key frame (second key frame) of each video segment (second video segment) in the segmentation result is extracted, and the semantic segmentation map of the key frame is extracted, so that the key frame is used as sample data and the category identifier of each pixel in its semantic segmentation map is used as a label to train the semantic segmentation model to be trained. The semantic segmentation model obtained by training associates the key frame with the category identifier of each pixel in the semantic segmentation map; during model application, a key frame (video frame) is input into the semantic segmentation model, and the semantic segmentation map of the key frame is output.
  • During model training, the category of each pixel in the semantic segmentation map is determined according to the objects contained in the key frame. For example, if the key frame is a landscape image, it may contain objects such as sky, grass, roads, and buildings, and the category of each pixel in the key frame can accordingly be sky, grass, road, building, and so on. In practical applications, different categories can be represented by different colors or different numbers: for example, the pixels where the sky is located are all represented in light blue and the pixels where the road is located are all represented in gray, or the sky pixels are all represented by the number 1 and the road pixels by the number 2.
  • the semantic segmentation model is a multi-layer convolutional network, which is divided into two parts: downsampling and upsampling.
  • During training of the semantic segmentation model, the second key frame is used as the sample data, and the category identifier of each pixel in the semantic segmentation map of the second key frame is used as the label. After the second key frame and these category identifiers are input into the semantic segmentation model to be trained, the model performs downsampling on the second key frame to scale it down, then performs upsampling on the scaled key frame to enlarge it again, and processes the enlarged key frame, thereby outputting the predicted category identifier of each pixel in the semantic segmentation map of the second key frame. The error between the predicted category identifier of each pixel and the (true) category identifier of the corresponding pixel in the label is then calculated, so that the parameters of the semantic segmentation model are adjusted according to the error.
  • the parameters of the semantic segmentation model are adjusted in the above manner to obtain the trained semantic segmentation model, which is beneficial to ensure the accuracy of the output result of the semantic segmentation model.
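  • A training sketch consistent with this description (a small downsampling/upsampling convolutional network supervised by per-pixel category identifiers) is given below in PyTorch; the layer widths, depth, and optimizer usage are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Minimal downsample/upsample network predicting a category per pixel."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.down = nn.Sequential(               # downsampling: scale the key frame down
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.up = nn.Sequential(                 # upsampling: scale it back up
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, x):                        # x: (B, 3, H, W)
        return self.up(self.down(x))             # logits: (B, num_classes, H, W)

def train_step(model, optimizer, key_frames, label_maps):
    """key_frames: (B, 3, H, W) sample data; label_maps: (B, H, W) long tensor of labels."""
    logits = model(key_frames)
    loss = nn.functional.cross_entropy(logits, label_maps)  # error vs. true labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                             # adjust the parameters according to the error
    return loss.item()
```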
  • Step 104 Extract audio features of each audio file in the pre-established audio set.
  • Specifically, the audio set is a soundtrack library. In the embodiments of the present application, the audio files contained in the soundtrack library are used to construct background music for the to-be-processed video data. After the semantic segmentation features of the to-be-processed video data are extracted, the audio features of each audio file in the soundtrack library are extracted, so that the background music of the to-be-processed video data is constructed based on the semantic segmentation features and the audio features of each audio file in the audio set.
  • In a specific implementation, FIG. 3 shows a schematic diagram of an audio feature extraction process provided by an embodiment of the present application, including steps 302 to 310.
  • Step 302 Acquire audio files in the audio collection.
  • Step 304 Segment each audio file in the audio set according to a preset duration threshold.
  • Step 306 Perform Fourier transform on each of the first audio segments in the segmentation result to generate a first spectral signal of each of the first audio segments.
  • Step 308 Input the first spectrum signal into an audio feature extraction model for processing.
  • Step 310 Generate audio features of each audio file in the audio set.
  • Specifically, in the embodiments of the present application, the audio features of each audio file in the soundtrack library can be extracted by an audio feature extraction model. Before each audio file is input into the audio feature extraction model, each audio file can first be segmented according to a preset duration threshold, where the preset duration threshold is consistent with the preset duration threshold used above for segmenting the to-be-processed video data into video segments.
  • After an audio file is segmented, a Fourier transform is performed on each audio segment in the segmentation result to generate the spectrum signal of each audio segment, and the spectrum signals are input into the audio feature extraction model so that the model extracts the audio features of the audio segments. If an audio file in the soundtrack library is segmented into m audio segments, the spectrum signals of the m audio segments are input into the audio feature extraction model to generate m audio features.
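  • The preprocessing described here, cutting the waveform into fixed-duration segments and turning each segment into a spectrum signal via a Fourier transform before feeding it to the feature model, might be sketched as follows; the sample-rate handling and `audio_feature_model` are assumptions:

```python
import numpy as np

def audio_to_spectra(waveform: np.ndarray, sample_rate: int, segment_seconds: float):
    """Split a mono waveform into fixed-duration segments and return the
    magnitude spectrum of each segment."""
    seg_len = int(sample_rate * segment_seconds)
    spectra = []
    for start in range(0, len(waveform) - seg_len + 1, seg_len):
        segment = waveform[start:start + seg_len]
        spectra.append(np.abs(np.fft.rfft(segment)))   # Fourier transform -> spectrum signal
    return np.stack(spectra)                           # shape: (m, seg_len // 2 + 1)

# usage sketch: m spectrum signals -> m audio features
# spectra = audio_to_spectra(waveform, 44100, 5.0)
# audio_features = audio_feature_model(spectra)        # hypothetical trained model
```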
  • Further, the audio feature extraction model is trained in the following manner: segmenting a sample audio file according to the preset duration threshold; performing a Fourier transform on each second audio segment in the segmentation result to generate the second spectrum signal of each second audio segment; and using the second spectrum signal as sample data and the audio type of the sample audio file as a label, which are input into the audio feature extraction model to be trained for training to obtain the audio feature extraction model, where the audio feature extraction model associates the second spectrum signal with the audio type.
  • Specifically, after the sample audio file is acquired, it can be segmented into audio segments according to a fixed duration, and this fixed duration is kept consistent with the fixed duration (preset duration threshold) used for segmenting the sample video file into video segments.
  • After the sample audio file is segmented, a Fourier transform is performed on each audio segment in the segmentation result to generate the spectrum signal of each segment; the spectrum signals are used as sample data and the audio type of the sample audio file is used as the label for model training. During application of the audio feature extraction model obtained by training, the spectrum signal of audio data is input into the audio feature extraction model, and the audio features of the audio data can be output.
  • the audio feature extraction model is a convolutional neural network.
  • During training of the audio feature extraction model to be trained, the second spectrum signal of the second audio segment can be used as the sample data, and the audio type of the sample audio file can be used as the label to train the convolutional neural network; the convolutional neural network processes the second spectrum signal and outputs a prediction of the audio type corresponding to the second spectrum signal.
  • By calculating the loss value between the prediction result and the label of the second spectrum signal, the model parameters of the audio feature extraction model are iteratively updated with the back-propagation algorithm of the convolutional neural network according to the loss value, and the trained audio feature extraction model is thereby obtained.
  • During model training, segmenting the sample audio file, performing a Fourier transform on the audio segments in the segmentation result, and using the generated spectrum signals as the input of the audio feature extraction model helps ensure the accuracy of the output of the audio feature extraction model.
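  • A minimal training sketch matching this description, a 1-D convolutional network over the spectrum signal supervised by the audio type of the sample file, is shown below; the architecture and label set are illustrative assumptions, and the penultimate activations stand in for the extracted audio feature:

```python
import torch
import torch.nn as nn

class AudioTypeNet(nn.Module):
    """CNN over a spectrum signal; the penultimate activations serve as the audio feature."""
    def __init__(self, num_audio_types: int, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(1, 16, 9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(16, 32, 9, stride=4, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_audio_types)

    def forward(self, spectrum):                   # spectrum: (B, spectrum_length)
        feat = self.backbone(spectrum.unsqueeze(1))
        return self.classifier(feat), feat         # audio-type prediction + audio feature

def train_step(model, optimizer, spectra, audio_type_labels):
    logits, _ = model(spectra)
    loss = nn.functional.cross_entropy(logits, audio_type_labels)  # loss vs. the label
    optimizer.zero_grad()
    loss.backward()                                # back-propagation
    optimizer.step()                               # iteratively update the model parameters
    return loss.item()
```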
  • Step 106 Align the audio feature with the semantic segmentation feature, filter target audio files in the audio set according to the alignment result, and construct background audio of the video data to be processed based on the target audio file.
  • the aligning process of the audio feature and the semantic segmentation feature described in the embodiment of the present application means forcibly aligning the audio feature and the semantic segmentation feature, that is, determining the time interval corresponding to the semantic segmentation feature in the audio.
  • Forced alignment is a technique for obtaining the temporal correspondence between given semantic segmentation features and audio features. It can be achieved with forced alignment tools: for example, Kaldi (an open-source speech recognition toolkit that uses WFST-based decoding algorithms) or HTK (HMM Toolkit, a speech processing tool based on hidden Markov models) can be used to align the semantic segmentation features with the audio features.
  • the background audio of the video data to be processed can be constructed according to the alignment result.
  • In a specific implementation, aligning the audio feature with the semantic segmentation feature may be implemented in the following manner: performing dimension scaling on the audio feature and the semantic segmentation feature according to a preset feature dimension to generate a target audio feature and a target semantic segmentation feature; and aligning the target audio feature with the target semantic segmentation feature.
  • Specifically, since the feature dimensions of the semantic segmentation feature and the audio feature may differ, in order to ensure the accuracy of the background audio construction result, feature dimension scaling may first be performed on the semantic segmentation feature and the audio feature before they are aligned: the feature dimensions of the two are unified by scaling both to the same dimension, the scaled target audio feature and target semantic segmentation feature are obtained, and the target semantic segmentation feature and the target audio feature are then aligned.
  • In practical applications, a fully connected layer can be added before the output layer of the audio feature extraction model and before the output layer of the semantic segmentation model, respectively. If the m1-dimensional audio features need to be scaled to n dimensions, the fully connected layer newly added before the output layer of the audio feature extraction model takes the m1-dimensional features as input and outputs n-dimensional features; similarly, if the m2-dimensional semantic segmentation features need to be scaled to n dimensions, the fully connected layer newly added before the output layer of the semantic segmentation model takes the m2-dimensional features as input and outputs n-dimensional features.
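  • In code, such a dimension-unifying fully connected layer is simply a linear projection appended before each model's output; a short sketch (the dimensions m1, m2, and n are placeholders) could be:

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Fully connected layer that scales an input feature to a preset dimension."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, feat):
        return self.fc(feat)

# e.g. scale m1-dim audio features and m2-dim semantic segmentation features to n dims:
# audio_proj = ProjectionHead(m1, n)  # added before the audio feature extraction model's output layer
# video_proj = ProjectionHead(m2, n)  # added before the semantic segmentation model's output layer
```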
  • In addition, aligning the audio feature with the semantic segmentation feature, screening target audio files in the audio set according to the alignment result, and constructing the background audio of the to-be-processed video data based on the target audio files can specifically be achieved in the following manner: calculating the distance between the audio feature and the semantic segmentation feature;
  • An audio file corresponding to an audio feature whose distance from the semantic segmentation feature is less than a preset distance threshold is taken as a target audio file, and the background audio is constructed based on the target audio file.
  • Specifically, after the semantic segmentation features of the to-be-processed video data and the audio features of the audio files are extracted, the audio features and the semantic segmentation features are aligned, which can be achieved by calculating the distance between the audio features and the semantic segmentation features; the audio file corresponding to an audio feature whose distance from the semantic segmentation feature is smaller than the preset distance threshold is used as the target audio file.
  • Here, the audio feature extraction model is used to extract the audio features of the audio files, and the semantic segmentation model is used to extract the semantic segmentation features of the key frames in the to-be-processed video data. The semantic segmentation features corresponding to these key frames are averaged over the time dimension and then compared with the audio features corresponding to the audio files in the audio set: the distances between the semantic segmentation feature and the audio features are calculated and sorted, and the audio files whose distance is smaller than the preset threshold are selected as the background audio.
  • the distance between the audio feature and the semantic segmentation feature may include, but is not limited to, Euclidean distance or cosine distance.
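  • A sketch of this screening step, computing the Euclidean (or cosine) distance between the averaged video feature and each audio file's feature and keeping the files whose distance falls below the threshold, is given below; the variable names are illustrative:

```python
import numpy as np

def screen_target_audio(video_feature: np.ndarray, audio_features: dict,
                        distance_threshold: float, metric: str = "euclidean"):
    """audio_features: mapping from audio file id to its feature vector (same dimension
    as video_feature). Returns the ids of the target audio files, sorted by distance."""
    selected = []
    for file_id, audio_feat in audio_features.items():
        if metric == "euclidean":
            dist = float(np.linalg.norm(video_feature - audio_feat))
        else:  # cosine distance
            dist = 1.0 - float(np.dot(video_feature, audio_feat)) / (
                np.linalg.norm(video_feature) * np.linalg.norm(audio_feat) + 1e-8)
        if dist < distance_threshold:
            selected.append((dist, file_id))
    return [file_id for _, file_id in sorted(selected)]
```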
  • In addition, when the background audio is constructed based on the target audio files, target audio segments can be determined according to the distances between different audio segments in the target audio files and different video segments in the to-be-processed video data, and the background audio of the to-be-processed video data is then constructed according to the correspondence between the target audio segments and the video segments of the to-be-processed video data.
  • For example, in the case where the distance is the Euclidean distance, suppose the determined target audio files are audio file Y1 and audio file Y2, and the video segment division result of the to-be-processed video data is video segment V1, video segment V2, and video segment V3. It may be determined that the Euclidean distance between audio segment Y11 in audio file Y1 and video segment V3 is greater than the preset distance threshold (audio segment Y11 and video segment V3 have equal durations), the Euclidean distance between audio segment Y15 in audio file Y1 and video segment V2 is greater than the preset distance threshold (audio segment Y15 and video segment V2 have equal durations), and the Euclidean distance between audio segment Y23 in audio file Y2 and video segment V1 is greater than the preset distance threshold (audio segment Y23 and video segment V1 have equal durations).
  • Therefore, the background audio of the to-be-processed video data constructed based on the target audio files is audio segment Y23 - audio segment Y15 - audio segment Y11.
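  • One possible way to assemble the background audio from such segment-level correspondences, pairing each video segment with an equal-duration audio segment from the target files and concatenating the choices in video order, is sketched below; the pairing-by-smallest-distance rule is an assumption used only for illustration:

```python
import numpy as np

def assemble_background_audio(video_segment_feats, audio_segments):
    """video_segment_feats: per-video-segment features in playback order;
    audio_segments: list of (segment_id, feature) over all segments of the target audio files.
    Returns the chosen audio segment ids in video order."""
    chosen = []
    for v_feat in video_segment_feats:
        distances = [np.linalg.norm(v_feat - a_feat) for _, a_feat in audio_segments]
        best = int(np.argmin(distances))       # closest equal-duration audio segment
        chosen.append(audio_segments[best][0])
    return chosen                              # e.g. ["Y23", "Y15", "Y11"] for V1, V2, V3
```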
  • the audio feature and the semantic segmentation feature are aligned, that is, the audio feature and the semantic segmentation feature are input into an audio alignment model for alignment.
  • FIG. 4 shows a schematic diagram of the alignment process provided by an embodiment of the present application, in which the audio alignment model includes a video feature processing module and an audio feature processing module. After the audio features and the semantic segmentation features are input into the audio alignment model, the video feature processing module concatenates the semantic segmentation features and inputs the concatenation result into a fully connected layer, and the audio feature processing module inputs the audio features into a fully connected layer, so that the feature dimensions of the audio features and the semantic segmentation features are unified; finally, the outputs of the two modules are used to calculate a loss value, and the parameters of the audio alignment model are adjusted using the loss calculation result.
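  • The two-branch structure described for FIG. 4 might be sketched as follows, one branch concatenating and projecting the semantic segmentation features and the other projecting the audio features into a common embedding space whose outputs feed the loss calculation; all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class AudioAlignModel(nn.Module):
    def __init__(self, video_dim: int, audio_dim: int, embed_dim: int = 128):
        super().__init__()
        self.video_branch = nn.Sequential(nn.Linear(video_dim, embed_dim), nn.ReLU(),
                                          nn.Linear(embed_dim, embed_dim))
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU(),
                                          nn.Linear(embed_dim, embed_dim))

    def forward(self, seg_feats, audio_feat):
        # video feature processing: concatenate the per-key-frame segmentation features
        if isinstance(seg_feats, (list, tuple)):
            seg_feats = torch.cat(seg_feats, dim=-1)
        v = self.video_branch(seg_feats)       # fully connected layers -> unified dimension
        a = self.audio_branch(audio_feat)
        return v, a                            # outputs used for the loss-value calculation
```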
  • FIG. 5 shows a schematic diagram of the background audio construction process provided by an embodiment of the present application. After the to-be-processed video data is acquired, semantic segmentation is performed on it to generate a corresponding semantic segmentation map, and the semantic segmentation features of the to-be-processed video data are extracted based on the semantic segmentation map; after the audio files are acquired, they are input into the audio feature extraction model to generate the corresponding audio features; the Euclidean distances between the semantic segmentation features and the audio features are then calculated, and the audio files whose distance in the calculation result is smaller than the preset distance threshold are used as the background audio of the to-be-processed video data.
  • Further, the audio alignment model is trained in the following manner: constructing multiple triplet training samples each consisting of sample video data, positive sample audio data, and negative sample audio data; inputting the multiple triplet training samples into the audio alignment model to obtain the feature vectors of the sample video data, the positive sample audio data, and the negative sample audio data in each triplet training sample; and calculating, for each triplet training sample, the first distance between the feature vectors of the sample video data and the positive sample audio data and the second distance between the feature vectors of the sample video data and the negative sample audio data, inputting the first distance and the second distance into a metric learning loss function, and training the audio alignment model according to the output of the loss function until the loss function tends to be stable.
  • the sample video data is randomly selected video data
  • the positive sample audio data is the audio data that successfully matches the sample video data
  • the negative sample audio data is the audio data that fails to match the sample video data.
  • Assuming that 1000 pieces of sample video data are selected, 1000³ triplets can be constructed. All of these triplets can be used for model training, or some of the triplets can be randomly selected for model training.
  • For example, a piece of video data is selected as the sample video data, audio data that successfully matches the sample video data is selected as the positive sample audio data, and audio data that fails to match the sample video data is selected as the negative sample audio data; the sample video data is divided into video segments, the positive sample audio data and the negative sample audio data are divided into audio segments, a Fourier transform is performed on the audio segment division results, and the video segment division result and the Fourier transform results are then input into the audio alignment model to be trained for training.
  • The distance between the semantic segmentation feature of the sample video data and the audio feature of the positive sample audio data, and the distance between the semantic segmentation feature of the sample video data and the audio feature of the negative sample audio data, are calculated; the two distances are input into the metric learning loss function, and the audio alignment model is trained according to the output of the loss function until the loss function tends to be stable.
  • Specifically, the following triplet loss function can be used:
  L = \sum_{i=1}^{N} \max\left( \left\| f(x_a^i) - f(x_p^i) \right\|_2^2 - \left\| f(x_a^i) - f(x_n^i) \right\|_2^2 + \alpha,\; 0 \right)
  • where i is the index of a triplet, N is the number of triplets, x_a is the semantic segmentation feature of the sample video data, x_p is the audio feature of the positive sample audio data, and x_n is the audio feature of the negative sample audio data; \left\| f(x_a^i) - f(x_p^i) \right\|_2^2 is the Euclidean distance between the semantic segmentation feature of the sample video data and the audio feature of the positive sample audio data, \left\| f(x_a^i) - f(x_n^i) \right\|_2^2 is the Euclidean distance between the semantic segmentation feature of the sample video data and the audio feature of the negative sample audio data, and α is the minimum margin between the Euclidean distance from the semantic segmentation feature of the sample video data to the audio feature of the negative sample audio data and the Euclidean distance from the semantic segmentation feature of the sample video data to the audio feature of the positive sample audio data; the specific parameter value of α can be determined according to the model performance.
  • After the audio alignment model is iteratively computed according to the loss function and its parameters are updated, the value of the loss function decreases from its initially larger value until it tends to be stable; the loss function tending to be stable means that its value no longer decreases and convergence is reached, for example it is close to zero. The training of the audio alignment model is then completed, and the trained audio alignment model is obtained.
  • After the audio alignment model is trained with the triplet loss function, the feature vectors output by the audio alignment model achieve a small Euclidean distance between the semantic segmentation feature of video data and the audio feature of audio data that match successfully, and a large Euclidean distance between the semantic segmentation feature of video data and the audio feature of audio data that fail to match.
  • In addition to calculating the loss value of the audio alignment model from the Euclidean distance between the semantic segmentation feature of the sample video data and the audio feature of the negative sample audio data and the Euclidean distance between the semantic segmentation feature of the sample video data and the audio feature of the positive sample audio data, as described above, the loss value may also be calculated from the corresponding cosine distances between the semantic segmentation feature of the sample video data and the audio features of the negative and positive sample audio data, so that the audio alignment model is iteratively computed and its parameters are updated according to the loss value.
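  • A training-loop sketch for this metric-learning step, reusing the two-branch model sketched above and PyTorch's built-in triplet margin loss (Euclidean distance) over anchor/positive/negative embeddings, is shown below; batching, data loading, and the margin value are assumptions:

```python
import torch
import torch.nn as nn

def train_alignment(model, optimizer, triplets, margin=0.2, epochs=10):
    """triplets: iterable of (video_seg_feature, pos_audio_feature, neg_audio_feature) tensors."""
    triplet_loss = nn.TripletMarginLoss(margin=margin, p=2)   # Euclidean distance
    for _ in range(epochs):
        for x_a, x_p, x_n in triplets:
            anchor, positive = model(x_a, x_p)   # video-branch and audio-branch embeddings
            _, negative = model(x_a, x_n)
            loss = triplet_loss(anchor, positive, negative)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                     # iterate until the loss tends to be stable
    return model
```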
  • In the embodiments of the present application, a corresponding semantic segmentation map is generated by performing semantic segmentation on the to-be-processed video data, the semantic segmentation features of the to-be-processed video data are extracted based on the semantic segmentation map, the audio features of each audio file in the pre-established audio set are extracted, the audio features are aligned with the semantic segmentation features, target audio files are screened in the audio set according to the alignment result, and the background audio of the to-be-processed video data is constructed based on the target audio files.
  • Constructing background audio for the to-be-processed video data in this way helps improve the efficiency of acquiring the background audio of the to-be-processed video data and, at the same time, helps improve the correlation between the acquired background audio and the to-be-processed video data, so that background audio matching is more accurate and the video display effect is better.
  • Referring to FIG. 6, the background audio construction method is further described below by taking the application of the background audio construction method provided by the embodiments of the present application to background music construction in the video field as an example.
  • FIG. 6 shows a process flowchart of a background audio construction method applied to the video field provided by an embodiment of the present application, which specifically includes the following steps.
  • Step 602 Perform video segment segmentation on the video data to be processed according to a preset duration threshold.
  • Step 604 Extract the first key frame of each first video segment in the segmentation result.
  • Step 606 Input the first key frame into a semantic segmentation model for processing, and generate a first semantic segmentation map of each first video segment.
  • Step 608 Extract first semantic segmentation features of each of the first video segments based on the first semantic segmentation map.
  • Step 610 Calculate the mean value of the first semantic segmentation feature of each first video segment in the segmentation result, and use the mean value as the semantic segmentation feature of the video data to be processed.
  • Step 612 Segment each music file in the music library according to a preset duration threshold.
  • Step 614 Perform Fourier transform on each music segment in the segmentation result to generate a spectrum signal of each music segment.
  • Step 616 Input the spectral signal into a sound feature extraction model for processing to generate sound features of each music file in the music library.
  • Step 618 Perform dimension scaling processing on the sound feature and the semantic segmentation feature according to the preset feature dimension, to generate the target sound feature and the target semantic segmentation feature.
  • Step 620 Input the target sound feature and the target semantic segmentation feature into a feature alignment model for alignment processing.
  • Step 622 Screen target music files in the music library according to the alignment result, and construct background music of the video data to be processed based on the target music files.
  • Constructing background music for the to-be-processed video data in this way helps improve the efficiency of acquiring the background music of the to-be-processed video data and, at the same time, helps improve the correlation between the acquired background music and the to-be-processed video data, so that background music matching is more accurate and the video display effect is better.
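  • Putting steps 602 to 622 together, an end-to-end sketch that reuses the hypothetical helpers from the earlier sketches (type conversions and error handling omitted; the class count and callables are assumptions) might read:

```python
def build_background_music(frames, fps, music_library, segment_seconds,
                           segmentation_model, sound_feature_model,
                           video_proj, audio_proj, distance_threshold):
    # steps 602-610: segment the video, extract key frames, average semantic features
    video_feat = extract_video_semantic_feature(frames, fps, segment_seconds,
                                                segmentation_model, num_classes=21)
    # steps 612-616: segment each music file, Fourier-transform, extract sound features
    music_feats = {}
    for file_id, (waveform, sample_rate) in music_library.items():
        spectra = audio_to_spectra(waveform, sample_rate, segment_seconds)
        music_feats[file_id] = sound_feature_model(spectra).mean(axis=0)
    # step 618: scale both feature kinds to the same preset dimension
    target_video_feat = video_proj(video_feat)
    target_music_feats = {fid: audio_proj(f) for fid, f in music_feats.items()}
    # steps 620-622: align, screen target music files, and build the background music
    return screen_target_audio(target_video_feat, target_music_feats, distance_threshold)
```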
  • FIG. 7 shows a schematic structural diagram of a background audio construction apparatus provided by an embodiment of the present application.
  • the device includes:
  • the first extraction module 702 is configured to perform semantic segmentation on the video data to be processed to generate a corresponding semantic segmentation map, and extract semantic segmentation features of the video data to be processed based on the semantic segmentation map;
  • the second extraction module 704 is configured to extract the audio features of each audio file in the pre-established audio set
  • a construction module 706, configured to align the audio feature with the semantic segmentation feature, screen target audio files in the audio set according to the alignment result, and construct the background audio of the to-be-processed video data based on the target audio file.
  • the first extraction module 702 includes:
  • a segmentation submodule, configured to perform video segment segmentation on the to-be-processed video data according to a preset duration threshold;
  • the first extraction submodule is configured to extract the first key frame of each first video segment in the segmentation result
  • the first processing submodule is configured to input the first key frame into a semantic segmentation model for processing, and generate a first semantic segmentation map of each first video segment.
  • the first extraction module 702 further includes:
  • a second extraction sub-module configured to extract the first semantic segmentation feature of each of the first video segments based on the first semantic segmentation map
  • the first calculation submodule is configured to calculate the mean value of the first semantic segmentation features of each first video segment in the segmentation result, and use the mean value as the semantic segmentation feature of the video data to be processed.
  • the semantic segmentation model is trained in the following manner:
  • the second key frame is used as sample data and the category identifier of each pixel in the semantic segmentation map of the second key frame is used as a label, which are input into the semantic segmentation model to be trained for training to obtain the semantic segmentation model, where the semantic segmentation model associates the second key frame with the category identifier of each pixel.
  • the second extraction module 704 includes:
  • a segmentation submodule, configured to segment each audio file in the audio set according to a preset duration threshold;
  • a second processing submodule configured to perform Fourier transform on each of the first audio segments in the segmentation result, to generate a first spectral signal of each of the first audio segments
  • the third processing submodule is configured to input the first spectral signal into an audio feature extraction model for processing, and generate audio features of each audio file in the audio set.
  • the audio feature extraction model is trained in the following manner:
  • the second spectrum signal is used as sample data and the audio type of the sample audio file is used as a label, which are input into the audio feature extraction model to be trained for training to obtain the audio feature extraction model, where the audio feature extraction model associates the second spectrum signal with the audio type.
  • the building module 706 includes:
  • a generating submodule configured to perform dimension scaling processing on the audio feature and the semantic segmentation feature according to a preset feature dimension, to generate a target audio feature and a target semantic segmentation feature;
  • the first alignment processing sub-module is configured to perform alignment processing on the target audio feature and the target semantic segmentation feature.
  • the building module 706 includes:
  • a calculation submodule configured to calculate the distance between the audio feature and the semantic segmentation feature
  • a construction sub-module is configured to use an audio file corresponding to an audio feature whose distance from the semantic segmentation feature is less than a preset distance threshold as a target audio file, and construct the background audio based on the target audio file.
  • the building module 706 includes:
  • the second alignment processing sub-module is configured to input the audio feature and the semantic segmentation feature into an audio alignment model for alignment processing.
  • Optionally, the audio alignment model is trained in the manner described in the foregoing method embodiment, that is, by constructing triplet training samples of sample video data, positive sample audio data, and negative sample audio data and training the model with a metric learning loss function until the loss function tends to be stable.
  • The above is a schematic solution of the background audio construction apparatus of this embodiment. It should be noted that the technical solution of the background audio construction apparatus and the technical solution of the above background audio construction method belong to the same concept; for details not described in detail in the technical solution of the background audio construction apparatus, reference may be made to the description of the technical solution of the background audio construction method above.
  • FIG. 8 shows a structural block diagram of a computing device 800 according to an embodiment of the present application.
  • Components of the computing device 800 include, but are not limited to, a memory 810 and a processor 820 .
  • the processor 820 is connected with the memory 810 through the bus 830, and the database 850 is used for saving data.
  • Computing device 800 also includes access device 840 that enables computing device 800 to communicate via one or more networks 860 .
  • networks 860 include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • Access device 840 may include one or more of any type of network interface (eg, network interface card (NIC)), wired or wireless, such as IEEE 802.11 wireless local area network (WLAN) wireless interface, World Interoperability for Microwave Access ( Wi-MAX) interface, Ethernet interface, Universal Serial Bus (USB) interface, cellular network interface, Bluetooth interface, Near Field Communication (NFC) interface, and the like.
  • the above-described components of the computing device 800 and other components not shown in FIG. 8 may also be connected to each other, eg, through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 8 is only for the purpose of example, rather than limiting the scope of the present application. Those skilled in the art can add or replace other components as required.
  • Computing device 800 may be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (eg, tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (eg, smart phones) ), wearable computing devices (eg, smart watches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs.
  • Computing device 800 may also be a mobile or stationary server.
  • The processor 820 is configured to execute computer-executable instructions, wherein the steps of the background audio construction method are implemented when the processor executes the computer-executable instructions.
  • The above is a schematic solution of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above background audio construction method belong to the same concept; for details not described in detail in the technical solution of the computing device, reference may be made to the description of the technical solution of the background audio construction method above.
  • An embodiment of the present application further provides a computer-readable storage medium, which stores computer-executable instructions, and when the instructions are executed by a processor, implements the steps of the background audio construction method.
  • An embodiment of the present application further provides a computer program product, wherein, when the computer program product is executed in a computer, the computer is made to execute the steps of the above-mentioned background audio construction method.
  • the computer instructions include computer program product code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program product code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory) ), random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunication signals, and software distribution media, etc.
  • the content contained in the computer-readable media may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction, for example, in some jurisdictions, according to legislation and patent practice, the computer-readable media Electric carrier signals and telecommunication signals are not included.

Abstract

Embodiments of the present application provide a background audio construction method and apparatus, wherein the background audio construction method includes: performing semantic segmentation on to-be-processed video data to generate a corresponding semantic segmentation map, and extracting semantic segmentation features of the to-be-processed video data based on the semantic segmentation map; extracting audio features of each audio file in a pre-established audio set; and aligning the audio features with the semantic segmentation features, screening target audio files in the audio set according to the alignment result, and constructing background audio of the to-be-processed video data based on the target audio files.

Description

Background audio construction method and apparatus
This application claims priority to Chinese patent application No. 202011437857.1, filed with the Chinese Patent Office on December 10, 2020 and entitled "Background audio construction method and apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of computer technologies, and in particular to a background audio construction method. One or more embodiments of the present application also relate to a background audio construction apparatus, a computing device, a computer-readable storage medium, and a computer program product.
Background Art
With the development of information technology, and especially the rapid development of the Internet, online video accounts for an ever larger share of network content. To make a video more attractive, background music that matches the theme of the video is generally added to it; good background music can raise users' interest in the video and thus increase its play count.
The technology for adding background music to a video can search a background music library for background music that matches the video according to content information of the video to which background music needs to be added, such as the video theme, and use it as the background music of the video. In the current related art, however, the video is generally compared directly with each piece of background music in the background music library to obtain the background music that best fits the theme of the video. In this way, the efficiency of obtaining background music is low, and the obtained background music is weakly correlated with the video.
Summary of the Invention
In view of this, the embodiments of the present application provide a background audio construction method. One or more embodiments of the present application also relate to a background audio construction apparatus, a computing device, a computer-readable storage medium, and a computer program product, so as to solve the technical defect in the prior art that the video features extracted directly by a video classification method are relatively single, which leads to low correlation of background music matching results.
According to a first aspect of the embodiments of the present application, a background audio construction method is provided, including:
performing semantic segmentation on to-be-processed video data to generate a corresponding semantic segmentation map, and extracting semantic segmentation features of the to-be-processed video data based on the semantic segmentation map;
extracting audio features of each audio file in a pre-established audio set;
aligning the audio features with the semantic segmentation features, screening target audio files in the audio set according to the alignment result, and constructing background audio of the to-be-processed video data based on the target audio files.
According to a second aspect of the embodiments of the present application, a background audio construction apparatus is provided, including:
a first extraction module, configured to perform semantic segmentation on to-be-processed video data to generate a corresponding semantic segmentation map, and to extract semantic segmentation features of the to-be-processed video data based on the semantic segmentation map;
a second extraction module, configured to extract audio features of each audio file in a pre-established audio set;
a construction module, configured to align the audio features with the semantic segmentation features, screen target audio files in the audio set according to the alignment result, and construct background audio of the to-be-processed video data based on the target audio files.
According to a third aspect of the embodiments of the present application, a computing device is provided, including:
a memory and a processor;
the memory is used for storing computer-executable instructions, and the processor is used for executing the computer-executable instructions, wherein the processor implements the steps of the background audio construction method when executing the computer-executable instructions.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, which stores computer-executable instructions, and the instructions, when executed by a processor, implement the steps of the background audio construction method.
According to a fifth aspect of the embodiments of the present application, a computer program product is provided, and when the computer program product is executed in a computer, the computer is caused to execute the steps of the above background audio construction method.
One embodiment of the present application implements a background audio construction method and apparatus, wherein the background audio construction method includes performing semantic segmentation on to-be-processed video data to generate a corresponding semantic segmentation map, extracting semantic segmentation features of the to-be-processed video data based on the semantic segmentation map, extracting audio features of each audio file in a pre-established audio set, aligning the audio features with the semantic segmentation features, screening target audio files in the audio set according to the alignment result, and constructing background audio of the to-be-processed video data based on the target audio files.
Constructing background audio for the to-be-processed video data in this way helps improve the efficiency of acquiring the background audio of the to-be-processed video data and, at the same time, helps improve the correlation between the acquired background audio and the to-be-processed video data, so that background audio matching is more accurate and the video display effect is better.
Brief Description of the Drawings
FIG. 1 is a flowchart of a background audio construction method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a semantic segmentation map generation process provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an audio feature extraction process provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an alignment process provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a background audio construction process provided by an embodiment of the present application;
FIG. 6 is a process flowchart of a background music construction method in which the background audio construction method provided by an embodiment of the present application is applied to the video field;
FIG. 7 is a schematic structural diagram of a background audio construction apparatus provided by an embodiment of the present application;
FIG. 8 is a structural block diagram of a computing device provided by an embodiment of the present application.
Detailed Description
Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways different from those described herein, and those skilled in the art can make similar generalizations without departing from the essence of the present application; therefore, the present application is not limited by the specific implementations disclosed below.
The terms used in one or more embodiments of the present application are for the purpose of describing particular embodiments only and are not intended to limit the one or more embodiments of the present application. The singular forms "a", "the", and "said" used in one or more embodiments of the present application and in the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of the present application refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, and the like may be used in one or more embodiments of the present application to describe various kinds of information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present application, "first" may also be referred to as "second", and similarly, "second" may also be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
First, the terms involved in one or more embodiments of the present application are explained.
Semantic segmentation map: a grayscale image of the same size as the input original image, in which each pixel holds the category label of the corresponding pixel in the original image.
The present application provides a background audio construction method. One or more embodiments of the present application also relate to a background audio construction apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Referring to FIG. 1, FIG. 1 shows a flowchart of a background audio construction method provided according to an embodiment of the present application, which includes the following steps.
Step 102: Perform semantic segmentation on to-be-processed video data to generate a corresponding semantic segmentation map, and extract semantic segmentation features of the to-be-processed video data based on the semantic segmentation map.
The background audio construction method of the embodiments of the present application can be applied to various scenarios in which background audio (background music) needs to be constructed. For example, when a user publishes a video on a short-video platform, background music can be added to the video through the background audio construction method provided by the embodiments of the present application, and background music highly relevant to the video can be obtained more quickly through this method. Likewise, when background music needs to be added to live-streamed or recorded video or audio, highly relevant background music can still be quickly acquired through the background audio construction method.
Specifically, semantic segmentation is classification at the pixel level: pixels belonging to the same class in an image or video frame are grouped into one class. Therefore, after semantic segmentation is performed on the to-be-processed video data to generate the corresponding semantic segmentation map, the semantic segmentation features of the to-be-processed video data can be extracted based on the semantic segmentation map.
In a specific implementation, the to-be-processed video data in the embodiments of the present application includes to-be-processed video data and to-be-processed audio data, and the to-be-processed video data can be presented on clients such as large-scale video playback devices, game consoles, desktop computers, smart phones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, e-book readers, and other display terminals.
In practical applications, the semantic segmentation features include but are not limited to category label distribution statistics, edge-pixel proportion statistics, and difference statistics between the semantic segmentation maps of preceding and following key frames. The category label distribution statistic is the ratio of the number of pixels corresponding to each category label. If the pixels above, below, to the left of, and to the right of a pixel are defined as its adjacent pixels, a pixel is an edge pixel when any of its adjacent pixels has a category label different from its own; the edge-pixel proportion statistic is, for each category label, the ratio of edge pixels to the total number of pixels of that label. The key-frame difference statistic counts, between the semantic segmentation maps corresponding to adjacent video-segment key frames, the differences of the category labels of pixels at the same position: if the labels of the preceding and following frames are the same at a position, the difference at that position is 0, otherwise it is 1.
After the semantic segmentation features are extracted, the background music of the to-be-processed video data can be constructed based on the semantic segmentation features and the audio features of each audio file in the audio set.
In a specific implementation, FIG. 2 shows a schematic diagram of a semantic segmentation map generation process provided by an embodiment of the present application, including steps 202 to 212.
Step 202: Acquire to-be-processed video data.
Step 204: Segment the to-be-processed video data into video segments according to a preset duration threshold.
Step 206: Extract the first key frame of each first video segment in the segmentation result.
Step 208: Input the first key frame into a semantic segmentation model for processing.
Step 210: Generate a first semantic segmentation map of each first video segment.
Step 212: Generate semantic segmentation features.
Further, extracting the semantic segmentation features of the to-be-processed video data based on the semantic segmentation map can specifically be implemented in the following manner:
extracting the first semantic segmentation feature of each first video segment based on the first semantic segmentation map;
calculating the mean of the first semantic segmentation features of the first video segments in the segmentation result, and using the mean as the semantic segmentation feature of the to-be-processed video data.
Specifically, in the embodiments of the present application, the semantic segmentation map corresponding to the to-be-processed video data can be generated by a semantic segmentation model. Before the to-be-processed video data is input into the semantic segmentation model, the to-be-processed video data can first be segmented into video segments according to the preset duration threshold, the key frame of each video segment in the segmentation result is extracted, and the key frames are then input into the semantic segmentation model, so that the semantic segmentation model performs semantic segmentation on the key frames and generates the semantic segmentation maps of the key frames.
If the to-be-processed video data is segmented into n video segments, the key frame may be any one or more of a random frame, a start frame, an end frame, or an intermediate frame of each of the n video segments.
In addition, as described above, the semantic segmentation features include but are not limited to category label distribution statistics, edge-pixel proportion statistics, and difference statistics between the semantic segmentation maps of preceding and following key frames. Therefore, after the key frame of each of the n video segments is determined and the semantic segmentation map of each key frame is generated by the semantic segmentation model, the category label distribution statistics, edge-pixel proportion statistics, key-frame difference statistics, and other semantic segmentation features of each key frame can be extracted based on each semantic segmentation map; the means of the category label distribution statistics, the edge-pixel proportion statistics, and the key-frame difference statistics over the key frames are calculated respectively, and the mean calculation results are used as the semantic segmentation feature of the to-be-processed video data, so that the background music of the to-be-processed video data is constructed based on the semantic segmentation feature of the to-be-processed video data and the audio features of the audio files in the audio set.
By extracting the semantic segmentation maps of the to-be-processed video data and extracting the semantic segmentation features of the to-be-processed video data based on the semantic segmentation maps, the diversity of the extracted video features of the to-be-processed video data is ensured, which in turn ensures the relevance of the background audio of the to-be-processed video data constructed based on the video features.
Further, the semantic segmentation model is trained in the following manner:
segmenting a sample video file into video segments according to the preset duration threshold;
extracting the second key frame of each second video segment in the segmentation result;
using the second key frame as sample data and the category identifier of each pixel in the semantic segmentation map of the second key frame as a label, which are input into the semantic segmentation model to be trained for training to obtain the semantic segmentation model, where the semantic segmentation model associates the second key frame with the category identifiers of the pixels.
Specifically, after the sample video file is acquired, it can be segmented into video segments according to a fixed duration, the key frame (second key frame) of each video segment (second video segment) in the segmentation result is extracted, and the semantic segmentation map of the key frame is extracted, so that the key frame is used as sample data and the category identifier of each pixel in the semantic segmentation map of the key frame is used as a label to train the semantic segmentation model to be trained. The semantic segmentation model obtained by training associates the key frame with the category identifier of each pixel in the semantic segmentation map; during model application, a key frame (video frame) is input into the semantic segmentation model, and the semantic segmentation map of the key frame can be output.
During model training, the category of each pixel in the semantic segmentation map is determined according to the objects contained in the key frame. For example, if the key frame is a landscape image, it may contain objects such as sky, grass, roads, and buildings, and the category of each pixel in the key frame may accordingly be sky, grass, road, building, and so on. In practical applications, different categories can be represented by different colors or different numbers: for example, the pixels where the sky is located are all represented in light blue and the pixels where the road is located are all represented in gray, or the pixels where the sky is located are all represented by the number 1 and the pixels where the road is located are all represented by the number 2.
In practical applications, the semantic segmentation model is a multi-layer convolutional network divided into a downsampling part and an upsampling part. During training of the semantic segmentation model, the second key frame is used as the sample data, and the category identifier of each pixel in the semantic segmentation map of the second key frame is used as the label.
Therefore, after the second key frame and the category identifiers of the pixels in its semantic segmentation map are input into the semantic segmentation model to be trained, the semantic segmentation model performs downsampling on the second key frame to scale it down, then performs upsampling on the scaled key frame to enlarge it, and processes the enlarged key frame, thereby outputting the predicted category identifier of each pixel in the semantic segmentation map of the second key frame; the error between the predicted category identifier of each pixel and the (true) category identifier of the corresponding pixel in the semantic segmentation map of the second key frame in the label is calculated, and the parameters of the semantic segmentation model are adjusted according to the error.
Adjusting the parameters of the semantic segmentation model in the above manner to obtain the trained semantic segmentation model helps ensure the accuracy of the output result of the semantic segmentation model.
Step 104: Extract audio features of each audio file in the pre-established audio set.
Specifically, the audio set is a soundtrack library. In the embodiments of the present application, background music is constructed for the to-be-processed video data using the audio files contained in the soundtrack library. After the semantic segmentation features of the to-be-processed video data are extracted, the audio features of each audio file in the soundtrack library can be extracted, so that the background music of the to-be-processed video data is constructed based on the semantic segmentation features and the audio features of each audio file in the audio set.
In a specific implementation, FIG. 3 shows a schematic diagram of an audio feature extraction process provided by an embodiment of the present application, including steps 302 to 310.
Step 302: Acquire audio files in the audio set.
Step 304: Segment each audio file in the audio set according to a preset duration threshold.
Step 306: Perform a Fourier transform on each first audio segment in the segmentation result to generate the first spectrum signal of each first audio segment.
Step 308: Input the first spectrum signal into an audio feature extraction model for processing.
Step 310: Generate audio features of each audio file in the audio set.
Specifically, in the embodiments of the present application, the audio features of each audio file in the soundtrack library can be extracted by an audio feature extraction model. Before each audio file is input into the audio feature extraction model, each audio file can first be segmented according to a preset duration threshold, where the preset duration threshold is consistent with the preset duration threshold used above for segmenting the to-be-processed video data into video segments.
After an audio file is segmented to obtain the segmentation result, a Fourier transform is performed on each audio segment in the segmentation result to generate the spectrum signal of each audio segment, and the spectrum signal is then input into the audio feature extraction model so that the audio feature extraction model extracts the audio features of the audio segments.
If an audio file in the soundtrack library is segmented into m audio segments, the spectrum signals of the m audio segments are input into the audio feature extraction model to generate m audio features.
In the process of constructing background audio for the to-be-processed video data, segmenting each audio file in the audio set, performing a Fourier transform on the audio segments in the segmentation result, and using the generated spectrum signals as the input of the audio feature extraction model helps ensure the accuracy of the output result of the audio feature extraction model.
Further, the audio feature extraction model is trained in the following manner:
segmenting a sample audio file according to the preset duration threshold;
performing a Fourier transform on each second audio segment in the segmentation result to generate the second spectrum signal of each second audio segment;
using the second spectrum signal as sample data and the audio type of the sample audio file as a label, which are input into the audio feature extraction model to be trained for training to obtain the audio feature extraction model, where the audio feature extraction model associates the second spectrum signal with the audio type.
Specifically, after the sample audio file is acquired, it can be segmented into audio segments according to a fixed duration, and this fixed duration is kept consistent with the fixed duration (preset duration threshold) used for segmenting the sample video file into video segments.
After the sample audio file is segmented, a Fourier transform is performed on each audio segment in the segmentation result to generate the spectrum signal of each audio segment, so that the spectrum signals are used as sample data and the audio type of the sample audio file is used as the label for model training. During application of the audio feature extraction model obtained by training, the spectrum signal of audio data is input into the audio feature extraction model, and the audio features of the audio data can be output.
In practical applications, the audio feature extraction model is a convolutional neural network. In the process of training the audio feature extraction model to be trained, the second spectrum signal of the second audio segment can be used as the sample data, and the audio type of the sample audio file can be used as the label to train the convolutional neural network; the convolutional neural network processes the second spectrum signal and outputs a prediction result of the audio type corresponding to the second spectrum signal.
By calculating the loss value between the prediction result and the label of the second spectrum signal, the model parameters of the audio feature extraction model are iteratively updated with the back-propagation algorithm of the convolutional neural network according to the loss value, so that the trained audio feature extraction model is obtained.
During model training, segmenting the sample audio file, performing a Fourier transform on the audio segments in the segmentation result, and using the generated spectrum signals as the input of the audio feature extraction model helps ensure the accuracy of the output result of the audio feature extraction model.
Step 106: Align the audio features with the semantic segmentation features, screen target audio files in the audio set according to the alignment result, and construct background audio of the to-be-processed video data based on the target audio files.
Specifically, aligning the audio features with the semantic segmentation features in the embodiments of the present application means forcibly aligning the audio features with the semantic segmentation features, that is, determining the time interval in the audio corresponding to the semantic segmentation features.
Forced alignment is a technique for obtaining the temporal correspondence between given semantic segmentation features and audio features, which can be achieved by forced alignment tools; for example, the alignment of the semantic segmentation features and the audio features can be achieved with Kaldi (an open-source speech recognition toolkit that uses WFST-based decoding algorithms) or HTK (HMM Toolkit, a speech processing tool based on hidden Markov models).
After the audio features and the semantic segmentation features are aligned, the background audio of the to-be-processed video data can be constructed according to the alignment result.
In a specific implementation, aligning the audio features with the semantic segmentation features can specifically be implemented in the following manner:
performing dimension scaling on the audio features and the semantic segmentation features according to a preset feature dimension to generate target audio features and target semantic segmentation features;
aligning the target audio features with the target semantic segmentation features.
Specifically, since the feature dimensions of the semantic segmentation features and the audio features may differ, in order to ensure the accuracy of the background audio construction result, in the embodiments of the present application, feature dimension scaling may first be performed on the semantic segmentation features and the audio features before they are aligned: the feature dimensions of the semantic segmentation features and the audio features are unified by scaling them to the same dimension, the scaled target audio features and target semantic segmentation features are obtained, and the target semantic segmentation features and the target audio features are then aligned.
In practical applications, a fully connected layer can be added before the output layer of the audio feature extraction model and before the output layer of the semantic segmentation model, respectively. If the m1-dimensional audio features need to be scaled to n dimensions, the fully connected layer newly added before the output layer of the audio feature extraction model scales the m1-dimensional input features and outputs n-dimensional features; similarly, if the m2-dimensional semantic segmentation features need to be scaled to n dimensions, the fully connected layer newly added before the output layer of the semantic segmentation model scales the m2-dimensional input features and outputs n-dimensional features.
另外,所述将所述音频特征与所述语义分割特征进行对齐处理,根据对齐结果在所述音频集合中筛选目标音频文件,并基于所述目标音频文件构建所述待处理视频数据的背景音频,具体可通过以下方式实现:
计算所述音频特征与所述语义分割特征间的距离;
将与所述语义分割特征的距离小于预设距离阈值的音频特征对应的音频文件作为目标音频文件,并基于所述目标音频文件构建所述背景音频。
具体的,提取待处理视频数据的语义分割特征以及音频文件的音频特征后,对所述音频特征和所述语义分割特征进行对齐处理,具体可通过计算所述音频 特征与所述语义分割特征间距离的方式实现,并将与所述语义分割特征的距离小于预设距离阈值的音频特征对应的音频文件作为目标音频文件。
其中,用音频特征提取模型提取音频文件的音频特征,用语义分割模型提取待处理视频数据中关键帧的语义分割特征,并将这些关键帧对应的语义分割特征在时间维度上求平均,随后与音频集合中个音频文件对应的音频特征比较,计算语义分割特征与音频特征间的距离并排序,选取距离小于预设阈值的音频文件作为背景音频。
实际应用中,所述音频特征与所述语义分割特征间的距离可以包括但不限于欧式距离或余弦距离等。
另外,基于所述目标音频文件构建背景音频,具体可根据目标音频文件中不同音频片段与所述待处理视频数据中不同视频片段之间的距离确定目标音频片段,并根据目标音频片段与所述待处理视频数据中视频片段的对应关系构建所述待处理视频数据的背景音频。
例如,在所述距离为欧式距离的情况下,若确定的目标音频文件为音频文件Y1、音频文件Y2,并且待处理视频数据的视频片段划分结果为视频片段V1、视频片段V2以及视频片段V3,若确定音频文件Y1中的音频片段Y11与视频片段V3间的欧式距离小于预设距离阈值(音频片段Y11与视频片段V3的时长相等),音频文件Y1中的音频片段Y15与视频片段V2间的欧式距离小于预设距离阈值(音频片段Y15与视频片段V2的时长相等),音频文件Y2中的音频片段Y23与视频片段V1间的欧式距离小于预设距离阈值(音频片段Y23与视频片段V1的时长相等)。
因此,基于目标音频文件构建的所述待处理视频数据的背景音频即为音频片段Y23-音频片段Y15-音频片段Y11。
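下面用一个示意性草图说明上述按片段匹配并拼接背景音频的思路:为每个视频片段选取距离小于阈值且最近的音频片段,再按视频片段顺序拼接;其中特征、阈值与V1、Y11等片段名称均为示例假设,并假设各片段时长均等于预设时长阈值。

```python
import numpy as np

def build_background_audio(video_segment_features, audio_segment_features, threshold):
    """为每个视频片段挑选距离小于阈值且最近的音频片段,按视频片段顺序返回匹配结果。"""
    playlist = []
    for video_name, video_feat in video_segment_features.items():
        best_name, best_dist = None, float("inf")
        for audio_name, audio_feat in audio_segment_features.items():
            dist = float(np.linalg.norm(video_feat - audio_feat))
            if dist < threshold and dist < best_dist:
                best_name, best_dist = audio_name, dist
        playlist.append((video_name, best_name))   # best_name 为 None 表示未匹配到
    return playlist

video_feats = {f"V{i}": np.random.randn(64) for i in (1, 2, 3)}
audio_feats = {name: np.random.randn(64) for name in ("Y11", "Y15", "Y23")}
print(build_background_audio(video_feats, audio_feats, threshold=12.0))
# 例如可能得到 [('V1', 'Y23'), ('V2', 'Y15'), ('V3', 'Y11')],对应背景音频 Y23-Y15-Y11
```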
此外,将所述音频特征与所述语义分割特征进行对齐处理,即将所述音频特征及所述语义分割特征输入音频对齐模型进行对齐处理。
本申请实施例提供的对齐处理过程的示意图如图4所示,其中,音频对齐模型包括视频特征处理模块和音频特征处理模块,将音频特征和语义分割特征输入音频对齐模型后,由视频特征处理模块对语义分割特征进行特征连接,并将连接结果输入全连接层,由音频特征处理模块将音频特征输入全连接层,以对所述音频特征和所述语义分割特征进行特征维度的统一;最后将两个模块的输出结果进行损失值计算,以利用损失值计算结果对音频对齐模型的参数进行调整。
本申请实施例提供的背景音频构建过程的示意图如图5所示,获取待处理视频数据后,对待处理视频数据进行语义分割生成对应的语义分割图,并基于所述语义分割图提取所述待处理视频数据的语义分割特征;获取音频文件后,将所述音频文件输入音频特征提取模型,生成对应的音频特征,然后计算所述语义分割特征与所述音频特征间的欧式距离,并将计算结果中小于预设距离阈值的音频文件作为所述待处理视频数据的背景音频。
进一步的,音频对齐模型通过以下方式训练,包括:
构建由样本视频数据、正样本音频数据、负样本音频数据构成的多个三元组训练样本;
将所述多个三元组训练样本输入所述音频对齐模型,获得每个三元组训练样本中样本视频数据、正样本音频数据、负样本音频数据的特征向量;
计算每个三元组训练样本中样本视频数据与正样本音频数据的特征向量之间的第一距离,以及样本视频数据与负样本音频数据的特征向量之间的第二距离,将所述第一距离和所述第二距离输入度量学习损失函数,根据所述损失函数的输出对所述音频对齐模型进行训练,直至所述损失函数趋于稳定。
具体的,样本视频数据是随机选择的视频数据,正样本音频数据为与样本视频数据匹配成功的音频数据,负样本音频数据为与样本视频数据匹配失败的音频数据。假设选取1000段样本视频数据,则可以构建出1000³个三元组。可以将这些三元组全部用于模型的训练,也可以随机挑选一些三元组用于模型的训练。
例如,选用一段视频数据作为样本视频数据,选择与所述样本视频数据匹配成功的音频数据作为正样本音频数据,选择与所述样本视频数据匹配失败的音频数据作为负样本音频数据,对所述样本视频数据进行视频片段划分,对所述正样本音频数据和负样本音频数据进行音频片段划分,并对音频片段划分结果进行傅里叶变换,再将视频片段划分结果和傅里叶变换结果输入待训练的音频对齐模型进行训练。
计算作为样本视频数据的视频数据的语义分割特征与作为正样本音频数据的音频特征之间的距离以及作为样本视频数据的视频数据的语义分割特征与作为负样本音频数据的音频特征之间的距离,将两个距离输入度量学习损失函数,根据损失函数的输出对音频对齐模型进行训练,直至损失函数趋于稳定。
具体地,三元组损失函数可以采用:

$$L=\sum_{i=1}^{N}\max\left(\left\|x_{i}^{a}-x_{i}^{p}\right\|_{2}-\left\|x_{i}^{a}-x_{i}^{n}\right\|_{2}+\alpha,\;0\right)$$

其中,i为三元组的编号,N为三元组的数目,$x^{a}$是样本视频数据的语义分割特征,$x^{p}$是正样本音频数据的音频特征,$x^{n}$是负样本音频数据的音频特征;$\left\|x_{i}^{a}-x_{i}^{p}\right\|_{2}$是样本视频数据的语义分割特征与正样本音频数据的音频特征之间的欧氏距离,$\left\|x_{i}^{a}-x_{i}^{n}\right\|_{2}$是样本视频数据的语义分割特征与负样本音频数据的音频特征之间的欧氏距离,α是这两个欧氏距离之间要求的最小间隔,α的具体参数值可以根据模型表现确定。
根据损失函数对音频对齐模型进行迭代计算并更新其参数后,损失函数的值从最初的较大值逐渐减小直至趋于稳定,其中,损失函数趋于稳定是指损失函数的值不再减小,达到收敛,例如接近于零,此时完成对音频对齐模型的训练,得到经训练的音频对齐模型。
通过三元组损失函数对音频对齐模型进行训练后,音频对齐模型输出的特征向量可以实现匹配成功的视频数据的语义分割特征与音频数据的音频特征的欧氏距离小,匹配失败的视频数据的语义分割特征与音频数据的音频特征的欧氏距离大。
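下面给出利用上述三元组损失对音频对齐模型进行一次迭代训练的示意性草图:假设使用PyTorch,video_encoder、audio_encoder以线性层代替实际的特征处理模块,特征维度、间隔alpha的取值等均为示例假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 假设 video_encoder / audio_encoder 分别输出 n 维的语义分割特征与音频特征(此处用线性层示意)
n = 64
video_encoder = nn.Linear(256, n)
audio_encoder = nn.Linear(128, n)
margin = 0.2                                   # 即公式中的最小间隔 alpha,取值为假设
optimizer = torch.optim.Adam(
    list(video_encoder.parameters()) + list(audio_encoder.parameters()), lr=1e-3)

# 一个批次的三元组:样本视频数据、正样本音频数据、负样本音频数据
video_batch = torch.randn(8, 256)
pos_audio_batch = torch.randn(8, 128)
neg_audio_batch = torch.randn(8, 128)

anchor = video_encoder(video_batch)
positive = audio_encoder(pos_audio_batch)
negative = audio_encoder(neg_audio_batch)

d_pos = F.pairwise_distance(anchor, positive)      # 第一距离(与正样本的欧氏距离)
d_neg = F.pairwise_distance(anchor, negative)      # 第二距离(与负样本的欧氏距离)
loss = torch.clamp(d_pos - d_neg + margin, min=0).mean()  # 三元组度量学习损失

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

重复上述迭代直至损失值不再减小,即完成音频对齐模型的训练。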
除通过前述的计算样本视频数据的语义分割特征与负样本音频数据的音频特征之间的欧氏距离,以及样本视频数据的语义分割特征与正样本音频数据的音频特征之间的欧氏距离,以计算音频对齐模型的损失值外,还可通过计算样本视频数据的语义分割特征与负样本音频数据的音频特征之间的余弦距离,以及样本视频数据的语义分割特征与正样本音频数据的音频特征之间的余弦距离,以计算音频对齐模型的损失值,从而根据损失值对所述音频对齐模型迭代计算并更新音频对齐模型的参数。
实际应用中,除通过计算欧式距离或计算余弦距离的方式计算损失值,还可选择其他方式进行损失值计算,具体的计算方式可根据实际需求确定,在此不做限制。
本申请实施例通过对待处理视频数据进行语义分割生成对应的语义分割图,并基于所述语义分割图提取所述待处理视频数据的语义分割特征,提取预先建立的音频集合中各音频文件的音频特征,将所述音频特征与所述语义分割特征进行对齐处理,根据对齐结果在所述音频集合中筛选目标音频文件,并基于所述目标音频文件构建所述待处理视频数据的背景音频;
通过上述方式为待处理视频数据构建背景音频,有利于提升获取待处理视频数据的背景音频的效率,同时有利于提高所获取到的待处理视频数据的背景音频与所述待处理视频数据的相关度,使背景音频匹配的准确性更高,视频展示效果更好。
参见图6,以本申请实施例提供的所述背景音频构建方法在视频领域中构建背景音乐的应用为例,对所述背景音频构建方法进行进一步说明。其中,图6示出了本申请一个实施例提供的一种应用于视频领域的背景音频构建方法的处理过程流程图,具体包括以下步骤。
步骤602,按照预设时长阈值对待处理视频数据进行视频片段切分。
步骤604,提取切分结果中每个第一视频片段的第一关键帧。
步骤606,将所述第一关键帧输入语义分割模型进行处理,生成所述每个第一视频片段的第一语义分割图。
步骤608,基于所述第一语义分割图提取所述每个第一视频片段的第一语义分割特征。
步骤610,计算切分结果中各第一视频片段的第一语义分割特征的均值,并将所述均值作为所述待处理视频数据的语义分割特征。
步骤612,按照预设时长阈值对曲库中的各音乐文件进行切分。
步骤614,对切分结果中的每个音乐片段进行傅里叶变换,生成所述每个音乐片段的频谱信号。
步骤616,将所述频谱信号输入声音特征提取模型进行处理,生成所述曲库中各音乐文件的声音特征。
步骤618,按照预设特征维度对所述声音特征及所述语义分割特征进行维度缩放处理,生成目标声音特征及目标语义分割特征。
步骤620,将所述目标声音特征与所述目标语义分割特征输入特征对齐模型进行对齐处理。
步骤622,根据对齐结果在所述曲库中筛选目标音乐文件,并基于所述目标音乐文件构建所述待处理视频数据的背景音乐。
通过上述方式为待处理视频数据构建背景音乐,有利于提升获取待处理视频数据的背景音乐的效率,同时有利于提高所获取到的待处理视频数据的背景音乐与所述待处理视频数据的相关度,使背景音乐匹配的准确性更高,视频展示效果更好。
与上述方法实施例相对应,本申请还提供了背景音频构建装置实施例,图7示出了本申请一个实施例提供的一种背景音频构建装置的结构示意图。如图7所示,该装置包括:
第一提取模块702,被配置为对待处理视频数据进行语义分割生成对应的语义分割图,并基于所述语义分割图提取所述待处理视频数据的语义分割特征;
第二提取模块704,被配置为提取预先建立的音频集合中各音频文件的音频特征;
构建模块706,被配置为将所述音频特征与所述语义分割特征进行对齐处理,根据对齐结果在所述音频集合中筛选目标音频文件,并基于所述目标音频文件构建所述待处理视频数据的背景音频。
可选地,所述第一提取模块702,包括:
第一切分子模块,被配置为按照预设时长阈值对所述待处理视频数据进行视频片段切分;
第一提取子模块,被配置为提取切分结果中每个第一视频片段的第一关键帧;
第一处理子模块,被配置为将所述第一关键帧输入语义分割模型进行处理,生成所述每个第一视频片段的第一语义分割图。
可选地,所述第一提取模块702,还包括:
第二提取子模块,被配置为基于所述第一语义分割图提取所述每个第一视频片段的第一语义分割特征;
第一计算子模块,被配置为计算切分结果中各第一视频片段的第一语义分割特征的均值,并将所述均值作为所述待处理视频数据的语义分割特征。
可选地,所述语义分割模型通过以下方式训练:
按照所述预设时长阈值对样本视频文件进行视频片段切分;
提取切分结果中每个第二视频片段的第二关键帧;
将所述第二关键帧作为样本数据,并将所述第二关键帧的语义分割图中各像素点的类别标识作为标签,输入待训练的语义分割模型进行训练,获得所述语义分割模型,所述语义分割模型使得所述第二关键帧与所述各像素点的类别标识相关联。
可选地,所述第二提取模块704,包括:
第二切分子模块,被配置为按照预设时长阈值对所述音频集合中的各音频文件进行切分;
第二处理子模块,被配置为对切分结果中的每个第一音频片段进行傅里叶变换,生成所述每个第一音频片段的第一频谱信号;
第三处理子模块,被配置为将所述第一频谱信号输入音频特征提取模型进行处理,生成所述音频集合中各音频文件的音频特征。
可选地,所述音频特征提取模型通过以下方式训练:
按照预设时长阈值对样本音频文件进行切分;
对切分结果中的每个第二音频片段进行傅里叶变换,生成所述每个第二音频片段的第二频谱信号;
将所述第二频谱信号作为样本数据,并将所述样本音频文件的音频类型作为标签,输入待训练的音频特征提取模型进行训练,获得所述音频特征提取模型,所述音频特征提取模型使得所述第二频谱信号与所述音频类型相关联。
可选地,所述构建模块706,包括:
生成子模块,被配置为按照预设特征维度对所述音频特征及所述语义分割特征进行维度缩放处理,生成目标音频特征及目标语义分割特征;
第一对齐处理子模块,被配置为将所述目标音频特征与所述目标语义分割特征进行对齐处理。
可选地,所述构建模块706,包括:
计算子模块,被配置为计算所述音频特征与所述语义分割特征间的距离;
构建子模块,被配置为将与所述语义分割特征的距离小于预设距离阈值的音频特征对应的音频文件作为目标音频文件,并基于所述目标音频文件构建所述背景音频。
可选地,所述构建模块706,包括:
第二对齐处理子模块,被配置为将所述音频特征及所述语义分割特征输入音频对齐模型进行对齐处理。
可选地,所述音频对齐模型通过以下方式训练,包括:
构建由样本视频数据、正样本音频数据、负样本音频数据构成的多个三元组训练样本;
将所述多个三元组训练样本输入所述音频对齐模型,获得每个三元组训练样本中样本视频数据、正样本音频数据、负样本音频数据的特征向量;
计算每个三元组训练样本中样本视频数据与正样本音频数据的特征向量之间的第一距离,以及样本视频数据与负样本音频数据的特征向量之间的第二距离,将所述第一距离和所述第二距离输入度量学习损失函数,根据所述损失函数的输出对所述音频对齐模型进行训练,直至所述损失函数趋于稳定。
上述为本实施例的一种背景音频构建装置的示意性方案。需要说明的是,该背景音频构建装置的技术方案与上述的背景音频构建方法的技术方案属于同一构思,背景音频构建装置的技术方案未详细描述的细节内容,均可以参见上述背景音频构建方法的技术方案的描述。
图8示出了根据本申请一个实施例提供的一种计算设备800的结构框图。该计算设备800的部件包括但不限于存储器810和处理器820。处理器820与存储器810通过总线830相连接,数据库850用于保存数据。
计算设备800还包括接入设备840,接入设备840使得计算设备800能够经由一个或多个网络860通信。这些网络的示例包括公用交换电话网(PSTN)、局域网(LAN)、广域网(WAN)、个域网(PAN)或诸如因特网的通信网络的组合。接入设备840可以包括有线或无线的任何类型的网络接口(例如, 网络接口卡(NIC))中的一个或多个,诸如IEEE802.11无线局域网(WLAN)无线接口、全球微波互联接入(Wi-MAX)接口、以太网接口、通用串行总线(USB)接口、蜂窝网络接口、蓝牙接口、近场通信(NFC)接口,等等。
在本申请的一个实施例中,计算设备800的上述部件以及图8中未示出的其他部件也可以彼此相连接,例如通过总线。应当理解,图8所示的计算设备结构框图仅仅是出于示例的目的,而不是对本申请范围的限制。本领域技术人员可以根据需要,增添或替换其他部件。
计算设备800可以是任何类型的静止或移动计算设备,包括移动计算机或移动计算设备(例如,平板计算机、个人数字助理、膝上型计算机、笔记本计算机、上网本等)、移动电话(例如,智能手机)、可佩戴的计算设备(例如,智能手表、智能眼镜等)或其他类型的移动设备,或者诸如台式计算机或PC的静止计算设备。计算设备800还可以是移动式或静止式的服务器。
其中,处理器820用于执行计算机可执行指令,所述处理器执行所述计算机可执行指令时实现所述背景音频构建方法的步骤。
上述为本实施例的一种计算设备的示意性方案。需要说明的是,该计算设备的技术方案与上述的背景音频构建方法的技术方案属于同一构思,计算设备的技术方案未详细描述的细节内容,均可以参见上述背景音频构建方法的技术方案的描述。
本申请一实施例还提供一种计算机可读存储介质,其存储有计算机可执行指令,该指令被处理器执行时实现所述背景音频构建方法的步骤。
上述为本实施例的一种计算机可读存储介质的示意性方案。需要说明的是,该存储介质的技术方案与上述的背景音频构建方法的技术方案属于同一构思,存储介质的技术方案未详细描述的细节内容,均可以参见上述背景音频构建方法的技术方案的描述。
本申请一实施例还提供一种计算机程序产品,其中,当所述计算机程序产品在计算机中执行时,令计算机执行上述背景音频构建方法的步骤。
上述为本实施例的一种计算机程序产品的示意性方案。需要说明的是,该计算机程序产品的技术方案与上述的背景音频构建方法的技术方案属于同一构思,计算机程序产品的技术方案未详细描述的细节内容,均可以参见上述背景音频构建方法的技术方案的描述。
上述对本申请特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。
所述计算机指令包括计算机程序产品代码,所述计算机程序产品代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括:能够携带所述计算机程序产品代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质等。需要说明的是,所述计算机可读介质包含的内容可以根据司法管辖区内立法和专利实践的要求进行适当的增减,例如在某些司法管辖区,根据立法和专利实践,计算机可读介质不包括电载波信号和电信信号。
需要说明的是,对于前述的各方法实施例,为了简便描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请实施例并不受所描述的动作顺序的限制,因为依据本申请实施例,某些步骤可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定都是本申请实施例所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
以上公开的本申请优选实施例只是用于帮助阐述本申请。可选实施例并没有详尽叙述所有的细节,也不限制该发明仅为所述的具体实施方式。显然,根据本申请实施例的内容,可作很多的修改和变化。本申请选取并具体描述这些实施例,是为了更好地解释本申请实施例的原理和实际应用,从而使所属技术领域技术人员能很好地理解和利用本申请。本申请仅受权利要求书及其全部范围和等效物的限制。

Claims (14)

  1. 一种背景音频构建方法,包括:
    对待处理视频数据进行语义分割生成对应的语义分割图,并基于所述语义分割图提取所述待处理视频数据的语义分割特征;
    提取预先建立的音频集合中各音频文件的音频特征;
    将所述音频特征与所述语义分割特征进行对齐处理,根据对齐结果在所述音频集合中筛选目标音频文件,并基于所述目标音频文件构建所述待处理视频数据的背景音频。
  2. 根据权利要求1所述的背景音频构建方法,所述对待处理视频数据进行语义分割生成对应的语义分割图,包括:
    按照预设时长阈值对所述待处理视频数据进行视频片段切分;
    提取切分结果中每个第一视频片段的第一关键帧;
    将所述第一关键帧输入语义分割模型进行处理,生成所述每个第一视频片段的第一语义分割图。
  3. 根据权利要求2所述的背景音频构建方法,所述基于所述语义分割图提取所述待处理视频数据的语义分割特征,包括:
    基于所述第一语义分割图提取所述每个第一视频片段的第一语义分割特征;
    计算切分结果中各第一视频片段的第一语义分割特征的均值,并将所述均值作为所述待处理视频数据的语义分割特征。
  4. 根据权利要求2或3所述的背景音频构建方法,所述语义分割模型通过以下方式训练:
    按照所述预设时长阈值对样本视频文件进行视频片段切分;
    提取切分结果中每个第二视频片段的第二关键帧;
    将所述第二关键帧作为样本数据,并将所述第二关键帧的语义分割图中各像素点的类别标识作为标签,输入待训练的语义分割模型进行训练,获得所述语义分割模型,所述语义分割模型使得所述第二关键帧与所述各像素点的类别标识相关联。
  5. 根据权利要求1至4任意一项所述的背景音频构建方法,所述提取预先建立的音频集合中各音频文件的音频特征,包括:
    按照预设时长阈值对所述音频集合中的各音频文件进行切分;
    对切分结果中的每个第一音频片段进行傅里叶变换,生成所述每个第一音频片段的第一频谱信号;
    将所述第一频谱信号输入音频特征提取模型进行处理,生成所述音频集合中各音频文件的音频特征。
  6. 根据权利要求5所述的背景音频构建方法,所述音频特征提取模型通过以下方式训练:
    按照预设时长阈值对样本音频文件进行切分;
    对切分结果中的每个第二音频片段进行傅里叶变换,生成所述每个第二音频片段的第二频谱信号;
    将所述第二频谱信号作为样本数据,并将所述样本音频文件的音频类型作为标签,输入待训练的音频特征提取模型进行训练,获得所述音频特征提取模型,所述音频特征提取模型使得所述第二频谱信号与所述音频类型相关联。
  7. 根据权利要求1或3所述的背景音频构建方法,所述将所述音频特征与所述语义分割特征进行对齐处理,包括:
    按照预设特征维度对所述音频特征及所述语义分割特征进行维度缩放处理,生成目标音频特征及目标语义分割特征;
    将所述目标音频特征与所述目标语义分割特征进行对齐处理。
  8. 根据权利要求1至6任意一项所述的背景音频构建方法,所述将所述音频特征与所述语义分割特征进行对齐处理,根据对齐结果在所述音频集合中筛选目标音频文件,并基于所述目标音频文件构建所述待处理视频数据的背景音频,包括:
    计算所述音频特征与所述语义分割特征间的距离;
    将与所述语义分割特征的距离小于预设距离阈值的音频特征对应的音频文件作为目标音频文件,并基于所述目标音频文件构建所述背景音频。
  9. 根据权利要求1至6任意一项所述的背景音频构建方法,所述将所述音频特征与所述语义分割特征进行对齐处理,包括:
    将所述音频特征及所述语义分割特征输入音频对齐模型进行对齐处理。
  10. 根据权利要求9所述的背景音频构建方法,所述音频对齐模型通过以下方式训练,包括:
    构建由样本视频数据、正样本音频数据、负样本音频数据构成的多个三元组训练样本;
    将所述多个三元组训练样本输入所述音频对齐模型,获得每个三元组训练样本中样本视频数据、正样本音频数据、负样本音频数据的特征向量;
    计算每个三元组训练样本中样本视频数据与正样本音频数据的特征向量之间的第一距离,以及样本视频数据与负样本音频数据的特征向量之间的第二距离,将所述第一距离和所述第二距离输入度量学习损失函数,根据所述损失函数的输出对所述音频对齐模型进行训练,直至所述损失函数趋于稳定。
  11. 一种背景音频构建装置,包括:
    第一提取模块,被配置为对待处理视频数据进行语义分割生成对应的语义分割图,并基于所述语义分割图提取所述待处理视频数据的语义分割特征;
    第二提取模块,被配置为提取预先建立的音频集合中各音频文件的音频特征;
    构建模块,被配置为将所述音频特征与所述语义分割特征进行对齐处理,根据对齐结果在所述音频集合中筛选目标音频文件,并基于所述目标音频文件构建所述待处理视频数据的背景音频。
  12. 一种计算设备,包括:
    存储器和处理器;
    所述存储器用于存储计算机可执行指令,所述处理器用于执行所述计算机可执行指令,其中,所述处理器执行所述计算机可执行指令时实现权利要求1-10任意一项所述的背景音频构建方法的步骤。
  13. 一种计算机可读存储介质,其存储有计算机指令,该指令被处理器执行时实现权利要求1-10任意一项所述的背景音频构建方法的步骤。
  14. 一种计算机程序产品,当所述计算机程序产品在计算机中执行时,令计算机执行权利要求1-10任意一项所述背景音频构建方法的步骤。