WO2017181852A1 - Song determining method and apparatus, and storage medium - Google Patents

Song determining method and apparatus, and storage medium

Info

Publication number
WO2017181852A1
Authority
WO
WIPO (PCT)
Prior art keywords
song
audio
matching
identifier
candidate
Prior art date
Application number
PCT/CN2017/079631
Other languages
English (en)
French (fr)
Inventor
赵伟锋
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2018526229A (patent JP6576557B2)
Priority to MYPI2018701777A (patent MY194965)
Priority to KR1020187010247A (patent KR102110057B1)
Publication of WO2017181852A1
Priority to US16/102,478 (patent US10719551B2)

Classifications

    • G06F16/632 Information retrieval of audio data; querying; query formulation
    • G06F16/634 Query formulation by example, e.g. query by humming
    • G06F16/683 Retrieval of audio data characterised by using metadata automatically derived from the content
    • G06F16/685 Retrieval of audio data using an automatically derived transcript of the audio data, e.g. lyrics
    • G06F16/78 Retrieval of video data characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval of video data using metadata automatically derived from the content
    • G06F16/7834 Retrieval of video data using metadata automatically derived from the content, using audio features
    • G10L25/18 Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L25/54 Speech or voice analysis techniques specially adapted for comparison or discrimination, for retrieval
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals
    • G10L2025/937 Signal energy in various frequency bands
    • H04N21/439 Processing of audio elementary streams
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/8352 Generation of protective data involving content or source identification data, e.g. Unique Material Identifier [UMID]

Definitions

  • the present invention relates to audio and video processing technologies, and in particular, to a song determining method and apparatus, and a storage medium.
  • Embodiments of the present invention provide a song determining method and apparatus, and a storage medium, which can improve the accuracy of determining a song corresponding to a video episode.
  • an embodiment of the present invention provides a method for determining a song, including:
  • the embodiment of the present invention further provides a song determining apparatus, including:
  • an identifier obtaining unit, configured to extract an audio file from a video, and obtain the candidate song identifiers of the candidate songs to which the episode in the audio file belongs, to obtain a candidate song identifier set;
  • an audio frame acquiring unit, configured to acquire the candidate song file corresponding to each candidate song identifier, and obtain the matching audio frames matched between the candidate song file and the audio file, to obtain a matching audio frame unit, wherein the matching audio frame unit includes a plurality of consecutive matching audio frames; and
  • a song determining unit, configured to acquire a target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to each candidate song identifier, and determine the target song to which the episode belongs according to the target song identifier.
  • an embodiment of the present invention provides a song determining apparatus, including: a memory and a processor, wherein the memory stores executable instructions, where the executable instructions are used to cause the processor to perform operations including:
  • the matching audio frame unit includes a plurality of consecutive matching audio frames
  • an embodiment of the present invention provides a storage medium, where executable instructions are stored for performing a song determining method provided by an embodiment of the present invention.
  • The embodiment of the present invention extracts an audio file from a video and acquires the candidate song identifiers of the candidate songs to which the episode in the audio file belongs, obtaining a candidate song identifier set; it then acquires the candidate song file corresponding to each candidate song identifier and the matching audio frames matched between the candidate song file and the audio file, obtaining a matching audio frame unit that includes a plurality of consecutive matching audio frames; finally, according to the matching audio frame unit corresponding to each candidate song identifier, it obtains from the candidate song identifier set the target song identifier of the target song to which the episode belongs, and determines the target song according to that identifier.
  • The scheme first obtains a candidate song identifier set of the candidate songs to which the video episode may belong, and then selects the identifier of the song to which the episode actually belongs from that set, based on the audio frames matched between the video's audio file and each candidate song, thereby determining the song to which the video episode belongs; relative to the related art, this improves the accuracy of determining or locating the song corresponding to a video episode.
  • FIG. 1 is a flowchart of a method for determining a song according to an embodiment of the present invention;
  • FIG. 2a is a flowchart of obtaining a candidate song identifier according to an embodiment of the present invention;
  • FIG. 2b is a spectrum peak point distribution diagram provided by an embodiment of the present invention;
  • FIG. 2c is a filtered peak point distribution diagram according to an embodiment of the present invention;
  • FIG. 3a is a schematic structural diagram of a first song determining apparatus according to an embodiment of the present invention;
  • FIG. 3b is a schematic structural diagram of a second song determining apparatus according to an embodiment of the present invention;
  • FIG. 3c is a schematic structural diagram of a third song determining apparatus according to an embodiment of the present invention;
  • FIG. 4 is a schematic structural diagram of the hardware of a song determining apparatus according to an embodiment of the present invention.
  • Embodiments of the present invention provide a song determining method and apparatus. The details will be described separately below.
  • the embodiment of the present invention will be described from the perspective of a song determining apparatus, and the song determining apparatus may be specifically integrated in a device such as a server that needs to determine a song corresponding to a video episode.
  • the song determining device can also be integrated in a device such as a user terminal (such as a smart phone or a tablet) that needs to determine a song corresponding to a video episode.
  • An embodiment of the present invention provides a song determining method, including: extracting an audio file from a video; acquiring the candidate song identifiers of the candidate songs to which the episode in the audio file belongs, to obtain a candidate song identifier set; acquiring the candidate song file corresponding to each candidate song identifier, and obtaining the matching audio frame unit matched between the candidate song file and the audio file; and, according to the matching audio frame unit, obtaining the identifier of the target song to which the episode belongs, that is, the target song identifier, and determining the target song to which the episode belongs according to the target song identifier.
  • the specific process of the song determination method can be as follows:
  • Step 101: Extract an audio file from the video, and obtain the candidate song identifiers of the candidate songs to which the episode in the audio file belongs, to obtain a candidate song identifier set.
  • The video may be obtained in various ways; for example, a video acquisition request may be sent to a video server to obtain the video, or the video may be extracted from local storage.
  • The audio file may likewise be extracted from the video in various ways; for example, audio-video separation may be performed on the video to obtain its audio file. That is, the step of “extracting the audio file in the video” may include: performing audio-video separation processing on the video to obtain the audio file of the video.
  • A candidate song to which the episode belongs is a song that may match the video episode, and the candidate song identifier is the identifier of such a possibly matching song.
  • The candidate song identifiers may be obtained in various ways. For example, the audio file of the video may first be divided into multiple audio segments, and each audio segment may then be matched against the songs in a music library; the identifiers of the songs that match the video episode are taken as candidate song identifiers. The matching may be performed based on the audio fingerprints of the audio segment and of the song (that is, the digitized features of the song's audio). That is, the step of “acquiring the candidate song identifiers of the candidate songs to which the episode in the audio file belongs” may include: dividing the audio file into multiple audio segments, matching each audio segment against the songs in the music library, and selecting the identifiers of the matched songs as the candidate song identifiers.
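The segment-against-library matching sketched above can be illustrated with a hypothetical fingerprint inverted index; `fingerprint_index`, `min_hits`, and the exact-value lookup are illustrative assumptions, not the patent's concrete retrieval scheme:

```python
from collections import Counter

def candidate_song_ids(segment_fingerprints, fingerprint_index, min_hits=3):
    """Collect candidate song identifiers for one audio segment.

    fingerprint_index is a hypothetical inverted index mapping a
    fingerprint value to the set of song IDs whose files contain it.
    """
    hits = Counter()
    for fp in segment_fingerprints:
        for song_id in fingerprint_index.get(fp, ()):
            hits[song_id] += 1
    # Keep songs sharing at least min_hits fingerprints with the segment.
    return {song_id for song_id, n in hits.items() if n >= min_hits}
```

A segment would then contribute its surviving song IDs to the candidate song identifier set; thresholding on shared fingerprints keeps the set small before the per-frame matching of Step 102.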
  • Step 102: Obtain the candidate song file corresponding to each candidate song identifier, and obtain the matching audio frames matched between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes multiple consecutive matching audio frames.
  • The candidate song file corresponding to a candidate song identifier may be obtained in various ways, for example, from the song database of a song server. That is, the step of “acquiring the candidate song file corresponding to the candidate song identifier” may include: sending a song acquisition request to the song server, and receiving the candidate song file returned by the song server according to the song acquisition request.
  • A matching audio frame is an audio frame matched between the candidate song file and the audio file. For example, when the candidate song file includes multiple first audio frames and the audio file includes multiple second audio frames, a first audio frame in the candidate song file that matches a second audio frame in the audio file is a matching audio frame; likewise, a second audio frame in the audio file that matches a first audio frame in the candidate song file is also a matching audio frame.
  • the matching audio frame unit may be an audio frame unit in the candidate song file, or may be an audio frame unit in the audio file.
  • The term “first audio frame” above represents any audio frame in the candidate song that is compared with an audio frame (that is, a second audio frame) in the audio file, not a specific audio frame of the candidate song; similarly, “second audio frame” represents any audio frame in the audio file, not a specific one.
  • A matching audio frame may be obtained in various ways, for example, by matching an audio frame in the candidate song with an audio frame in the audio file.
  • The audio frame matching may adopt an audio-feature-based manner, such as matching the audio features of the first audio frames in the candidate song file with the audio features of the second audio frames in the audio file and obtaining the matching audio frames according to the audio feature matching result. That is, the step of “acquiring the matching audio frames matched between the candidate song file and the audio file, to obtain a matching audio frame unit” may include: matching the audio features corresponding to the first audio frames with the audio features corresponding to the second audio frames to obtain a matching result, acquiring the matching audio frames according to the matching result, and obtaining the matching audio frame unit based on the matching audio frames.
  • the audio feature of the audio frame may be referred to as an audio fingerprint.
  • The audio feature of an audio frame may be obtained in multiple manners, for example, according to the average amplitude values of the frequency bands of the audio frame. That is, before the step of “matching the audio feature corresponding to the first audio frame in the candidate song file with the audio feature corresponding to the second audio frame in the audio file”, the song determining method may further include: acquiring the audio feature corresponding to the first audio frame in the candidate song file. For example, that acquiring step may include the following.
  • First, the candidate song file is converted into audio of a preset format, such as 8k 16-bit audio (that is, audio with an 8k sampling rate and 16-bit quantization). Then a first preset number of sampling points is taken as one frame, a second preset number of sampling points is taken as the frame shift, and a Fourier transform is applied to each frame to obtain its spectrum (for example, 1856 sampling points as one frame, with a frame shift of 58 sampling points). The spectrum is divided equally into a third preset number (such as 32) of frequency bands, and the average amplitude value corresponding to each band is calculated. Each band is then compared with the corresponding band of the previous frame (the first band of the second audio frame with the first band of the first audio frame, the second band with the second band, and so on, until all bands are compared): if the value is greater, the bit is 1; otherwise it is 0. Each frame thus yields a data unit composed of the third preset number of bit values, and this data unit is the audio feature of the frame. For example, when the spectrum is divided into 32 frequency bands, each audio frame yields a data unit including 32 bit values, and those 32 bit values are the audio feature of the audio frame.
  • The audio features of the audio file in the video can be obtained by the same acquisition method; the acquisition process can refer to the above description and is not repeated here.
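The fingerprint computation described above (frames of 1856 sampling points, frame shift of 58, 32 equal bands, a bit set when a band's average magnitude exceeds the same band in the previous frame) might be sketched as follows; the NumPy-based details, function name, and treatment of the first frame are assumptions of this sketch:

```python
import numpy as np

FRAME_LEN = 1856   # sampling points per frame (from the description)
FRAME_SHIFT = 58   # sampling points between successive frames
NUM_BANDS = 32     # equal sub-bands of the spectrum

def frame_fingerprints(samples):
    """Return one 32-bit fingerprint per frame of a decoded signal.

    Each frame's spectrum is split into NUM_BANDS equal bands; a bit is 1
    when a band's average magnitude exceeds the same band of the previous
    frame, else 0 (the first frame is skipped: it has no predecessor).
    """
    band_means = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_SHIFT):
        spectrum = np.abs(np.fft.rfft(samples[start:start + FRAME_LEN]))
        bands = np.array_split(spectrum, NUM_BANDS)
        band_means.append(np.array([b.mean() for b in bands]))
    prints = []
    for prev, cur in zip(band_means, band_means[1:]):
        bits = (cur > prev).astype(int)
        value = 0
        for bit in bits:          # pack the 32 band comparisons into one int
            value = (value << 1) | int(bit)
        prints.append(value)
    return prints
```

The 32-bit integers produced here play the role of the per-frame "data unit of 32 bit values" in the text.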
  • When matching audio features, an audio frame unit may be used as the unit of feature matching. That is, the step of “matching the audio features corresponding to the first audio frames in the candidate song file with the audio features corresponding to the second audio frames in the audio file to obtain a matching result” may include the following.
  • The step of “acquiring the matching audio frames matched between the candidate song file and the audio file according to the matching result” may include: acquiring, according to the audio feature matching result, the matching audio frames matched between the candidate song file and the audio file, a matching audio frame being an audio frame whose audio features are successfully matched.
  • The step of “acquiring the matching audio frame unit according to the matching audio frames” may include: acquiring the number of consecutive matching audio frames, and acquiring the corresponding matching audio frame unit according to the number.
  • The step of “acquiring the number of consecutive matching audio frames and obtaining the corresponding matching audio frame unit according to the number” may include: acquiring frame contiguous units, each frame contiguous unit comprising a plurality of consecutive matching audio frames, and counting the matching audio frames in each frame contiguous unit.
  • Specifically, assuming the audio file includes m second audio frames, n second audio frames are first consecutively selected from the m second audio frames to form an audio frame unit a. The audio features of the second audio frames in audio frame unit a are matched with the audio features of the corresponding first audio frames in the candidate song (for example, the 1st second audio frame in audio frame unit a is matched with the 1st first audio frame of the candidate song, the 2nd with the 2nd, and so on, until the nth second audio frame in audio frame unit a is matched with the nth first audio frame of the candidate song). Feature matching is therefore performed n times to obtain an audio feature matching result, which includes the first audio frames and second audio frames whose audio features are successfully matched; the matching audio frames are obtained according to this result, and the frame contiguous units and the numbers of matching audio frames in the frame contiguous units are acquired.
  • Then, a new set of n second audio frames is consecutively selected from the m second audio frames to form a new audio frame unit b, where audio frame unit b differs from audio frame unit a in at least one second audio frame (that is, the newly selected n consecutive second audio frames differ from the previously selected n consecutive second audio frames in at least one audio frame; for example, if the 1st to 10th second audio frames formed audio frame unit a, the 2nd to 11th second audio frames may form audio frame unit b). The audio features of the second audio frames in audio frame unit b are matched with the audio features of the corresponding first audio frames in the candidate song (the 1st frame of audio frame unit b with the 1st frame of the candidate song, the 2nd with the 2nd, and so on, until the nth frame of audio frame unit b with the nth frame of the candidate song) to obtain an audio feature matching result, which again includes the first and second audio frames whose audio features are successfully matched; the matching audio frames are obtained according to this result, and the frame contiguous units and the numbers of matching audio frames in them are acquired. This process repeats, each time consecutively selecting a new set of n second audio frames to form an audio frame unit and performing audio feature matching to obtain the numbers of consecutive matching audio frames, until every second audio frame has been matched.
  • After the frame contiguous units and the numbers of matching audio frames are obtained, the frame contiguous unit with the largest number of matching audio frames may be determined to be the matching audio frame unit. That is, the step of “determining the frame contiguous unit as the matching audio frame unit according to the number” may include: when the number of matching audio frames of a frame contiguous unit is greater than the numbers of matching audio frames of the remaining frame contiguous units, determining that this frame contiguous unit is the matching audio frame unit.
  • For example, suppose the candidate song contains 10 first audio frames p and the audio file contains 20 second audio frames q. The 1st to 10th audio frames q can be selected to form the first audio frame unit, and the frames of that unit (that is, the 1st to 10th audio frames q in the audio file) are matched with the 10 audio frames p of the candidate song to obtain matching audio frames (for example, the 1st audio frame q in the unit is feature-matched with the 1st audio frame p, and so on, up to the 10th audio frame q with the 10th audio frame p); the consecutive matching audio frames form a frame contiguous unit, and the number of matching audio frames in that unit is obtained. Then the 2nd to 11th audio frames q form the second audio frame unit, which is matched with the 10 audio frames p in the same way, and so on, until the 11th to 20th audio frames q have formed an audio frame unit for feature matching. In this way, the frame contiguous units and the corresponding numbers of matching audio frames are obtained, and the frame contiguous unit including the largest number of matching audio frames, that is, the longest frame contiguous unit, is selected as the matching audio frame unit.
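The sliding selection of n-frame units and the search for the longest run of consecutive matches might be sketched as follows; exact equality of fingerprint values stands in here for "audio feature matching success" (a real system would typically tolerate a few differing bits), and the function and parameter names are illustrative:

```python
def longest_matching_unit(song_fps, audio_fps, n):
    """Slide an n-frame window over the audio fingerprints.

    Returns (audio_offset, run_length): the offset of the window and the
    length of the longest run of consecutive frames whose fingerprints
    equal the candidate song's corresponding fingerprints.
    """
    best = (0, 0)
    for offset in range(len(audio_fps) - n + 1):
        run = longest = 0
        for i in range(n):                     # n comparisons per window
            if audio_fps[offset + i] == song_fps[i]:
                run += 1
                longest = max(longest, run)
            else:
                run = 0                        # the run of matches is broken
        if longest > best[1]:
            best = (offset, longest)
    return best
```

With `song_fps = [1, 2, 3, 4]` and `audio_fps = [9, 1, 2, 3, 7, 4]`, the window at offset 1 matches three consecutive frames, so that window's run becomes the matching audio frame unit.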
  • Step 103: According to the matching audio frame unit corresponding to each candidate song identifier, obtain from the candidate song identifier set the target song identifier of the target song to which the episode belongs, and determine the target song to which the episode belongs according to the target song identifier.
  • Step 102 yields, for each candidate song identifier, the matching audio frame unit matched between the corresponding candidate song file and the audio file, so the target song identifier of the target song to which the video episode belongs can be selected from the candidate song identifier set according to the matching audio frame unit corresponding to each candidate song identifier.
  • This may be done in various ways. For example, the matching audio frame unit may be frame-expanded to obtain a matching song segment matched between the candidate song file and the audio file, and the target song identifier is then obtained based on the matching song segment. That is, the step of “obtaining the target song identifier of the target song to which the episode belongs from the candidate song identifier set according to the matching audio frame unit corresponding to the candidate song identifier” may include the following.
  • First, frame expansion is performed on the matching audio frame unit to obtain the matching song segment corresponding to the candidate song identifier; the time information corresponding to the candidate song identifier is then obtained according to the matching song segment, including a first start time of the matching song segment in the video, a second start time in the candidate song, and the duration of the matching song segment; finally, the target song identifier is acquired from the candidate song identifier set according to the time information.
  • The term “first start time” indicates the start time of the matching song segment in the video, to distinguish it from the start time of the matching song segment in the candidate song (that is, the second start time); it does not refer to a specific time.
  • The matching song segment corresponding to a candidate song identifier is the song segment over which the candidate song corresponding to that identifier matches the audio file; the matching song segment may be a song segment in the candidate song, or a song segment in the audio file.
  • Since the matching song segment is composed of audio frames, after the matching song segment is acquired, its start time in the candidate song, its start time in the video, and its duration (that is, the length of the segment) can be obtained according to the audio frames in the segment. For example, the start time of the segment in the candidate song may be obtained according to the sequence numbers of the segment's audio frames in the candidate song, and the start time of the segment in the video according to the sequence numbers of its audio frames in the audio file.
  • performing audio frame expansion on the matching audio frame unit to obtain a matching song segment corresponding to the candidate song identifier may include:
  • performing frame expansion synchronously in the audio file and the candidate song file, that is, extending the same number of audio frames in the same direction in both files.
  • there are various ways to determine the matching song segment according to the number of matching audio frames between the expanded units; for example, when the number is greater than a certain preset number, the expanded unit at that point is determined to be a matching song segment, or when the ratio of the number of matching audio frames to the total number of audio frames in the expanded unit is greater than a preset ratio (eg, 90%), the expanded unit at that point is determined to be a matching song segment.
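The synchronized expansion with the match-ratio stopping rule described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `expand_matching_unit`, its per-frame fingerprint inputs, and the index alignment between the two files are all assumptions; the 90% threshold is the preset ratio mentioned in the text.

```python
def expand_matching_unit(video_frames, song_frames, unit_start, unit_len,
                         step=10, min_match_ratio=0.9):
    """Grow a matching audio-frame unit symmetrically in both files.

    video_frames / song_frames: per-frame fingerprint values, assumed aligned
    so that index i in one corresponds to index i in the other for the
    initial matching unit. Expansion extends the same number of frames in the
    same direction in both files and stops once the fraction of matching
    frames in the expanded unit drops below min_match_ratio (e.g. 90%).
    """
    start, end = unit_start, unit_start + unit_len
    while True:
        new_start = max(0, start - step)
        new_end = min(len(video_frames), len(song_frames), end + step)
        if (new_start, new_end) == (start, end):
            break  # cannot expand any further
        matches = sum(1 for i in range(new_start, new_end)
                      if video_frames[i] == song_frames[i])
        if matches / (new_end - new_start) < min_match_ratio:
            break  # expanded unit no longer qualifies as a match
        start, end = new_start, new_end
    return start, end  # frame boundaries of the matching song segment
```

The returned frame boundaries, together with the frame rate, give the segment's start times and duration described above.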
  • the step of "acquiring the target song identifier from the candidate song identifier set according to the time information corresponding to the candidate song identifier" may include:
  • filtering the candidate identifier set according to the playing times, and identifying the candidate songs in the filtered candidate identifier set as target songs.
  • for example, the candidate song identifiers whose playing times have an inclusion relationship may be determined, and the candidate song identifiers whose playing time is contained within another's may then be filtered out; that is, among the identifiers whose playing times have an inclusion relationship, those with the shorter playing times are filtered out.
  • for example, the playing time corresponding to song ID1 is 1s to 10s, the playing time corresponding to song ID2 is 2s to 5s, and the playing time corresponding to song ID3 is 3s to 8s;
  • the playing times corresponding to songs ID1, ID2, and ID3 have an inclusion relationship, so the song IDs with the shorter playing times can be filtered out; here, song ID2 and song ID3 are filtered out.
  • the candidate song identifiers whose playing times overlap may also be determined, and among them the candidate song identifiers with the shorter playing times filtered out.
  • for example, the playing time corresponding to song ID1 is from 1s to 10s, and the playing time corresponding to song ID2 is from 5s to 12s;
  • the two playing times overlap, so the song ID with the shorter playing duration can be filtered out: the playing duration of song ID1 is 10s and the playing duration of song ID2 is 7s, so song ID2 is filtered out.
  • the song corresponding to the target song identifier may be used as the target song to which the episode belongs.
  • after the target song is determined, the lyrics of the video episode may be filled into the video, so that the lyrics are displayed when the video episode is played; that is, after step 103, the method may further include:
  • acquiring the lyrics corresponding to the episode and filling the lyrics into the video.
  • when the matching audio frame unit has been expanded to obtain a matching song segment and its time information, this may include: acquiring the lyrics corresponding to the episode according to the target song identifier and its corresponding time information, and filling the lyrics into the video, where the time information is the time information of the matching song segment corresponding to the target song identifier.
  • specifically, the lyrics corresponding to the episode may be obtained according to the start time of the matching song segment corresponding to the target song identifier in the target song and the duration of the matching song segment, and the lyrics may be filled in according to the start time and duration of the matching song segment in the video; that is, the step "acquiring the lyrics corresponding to the episode and filling the lyrics into the video according to the target song identifier and its corresponding time information" may include:
  • acquiring the lyrics corresponding to the episode according to the target song identifier and the corresponding first start time and duration;
  • filling the lyrics into the video according to the second start time and the duration corresponding to the target song identifier.
  • for example, the target lyric file of the target song may be obtained according to the target song identifier, and the lyrics corresponding to the episode may then be extracted from the target lyric file according to the start time of the matching song segment in the target song and the duration of the matching song segment; that is, the step "acquiring the lyrics corresponding to the episode according to the target song identifier and the corresponding first start time and duration" may include:
  • obtaining the target lyric file of the target song according to the target song identifier, and extracting the corresponding lyrics from the lyric file as the lyrics of the episode.
  • for example, if the target song identifier is song 1, the matching song segment corresponding to song 1 has a start time of 5s in song 1, and the duration of the matching song segment is 10s, then the lyrics from 5s to 15s can be obtained from the lyric file of song 1.
  • the step of "filling the lyrics into the video according to the second start time and the duration corresponding to the target song identifier" may include:
  • obtaining the display time of the lyrics in the video according to the second start time and the duration, and populating the lyrics into the video based on the display time.
  • for example, if the second start time of the matching song segment corresponding to the target song identifier is 7s in the video and the duration of the matching song segment is 8s, the display time of the lyrics in the video is from the 7th second to the 15th second; the lyrics can then be inserted at the corresponding position of the video based on the display time.
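The two operations above — extracting the lyric lines that fall inside the matching segment of the song, then shifting their timestamps into video time — can be sketched as follows. Both helpers (`lyrics_for_segment`, `display_times_in_video`) and the `(timestamp, text)` lyric representation are illustrative assumptions, not an API from the patent:

```python
def lyrics_for_segment(lrc_lines, song_start, duration):
    """Pick the lyric lines falling inside the matching song segment.

    lrc_lines: list of (timestamp_seconds, text) parsed from a lyric file.
    song_start: the segment's start time in the song (first start time).
    """
    return [(t, text) for t, text in lrc_lines
            if song_start <= t < song_start + duration]


def display_times_in_video(lines, song_start, video_start):
    """Shift each line's timestamp from song time to video display time.

    video_start: the segment's start time in the video (second start time).
    """
    return [(video_start + (t - song_start), text) for t, text in lines]
```

For example, with a segment starting at 5s in the song and 7s in the video, a line sung at the 6th second of the song would be displayed at the 8th second of the video.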
  • in order to display complete sentences of lyrics and enhance the user experience, after the episode lyrics are obtained it may be determined whether the lyrics form complete sentences, and if so, the lyric filling operation is performed; that is, after the step "acquiring the lyrics corresponding to the episode" and before the step "filling the lyrics into the video", the method may further include: determining whether the lyrics form complete sentences, and if so, performing the step of filling the lyrics into the video.
  • in addition, a jump interface may be set in the video, so that when the video episode is played, the terminal may jump through the interface to play the song to which the video episode belongs; that is, after the step "acquiring the target song identifier of the target song to which the episode belongs", the method may further include:
  • setting a jump interface in the video according to the target song identifier, so that the terminal jumps through the jump interface to play the target song when playing the episode.
  • the form of the jump interface can be various, such as a button, an input box, etc., and can be set according to actual needs.
  • an add interface may also be set in the video, so that when the video episode is played, the target song to which it belongs may be added through the interface to a song list of the music software; that is, after the step "acquiring the target song identifier of the target song to which the episode belongs", the method may also include:
  • An add interface is set in the video according to the target song identifier, so that the terminal adds the target song to the song list of the music software through the add interface when playing the episode.
  • the form of the add interface may likewise be various, such as a button or an input box, and may be set according to actual needs;
  • the music software may be commonly used music playing software, such as cloud-based music playing software or online music playing software; the song list can be a song list or a song playlist of the music software, such as a favorite song list.
  • As can be seen from the above, the embodiment of the present invention extracts an audio file from a video, obtains the candidate song identifiers of the candidate songs to which the episode in the audio file belongs to obtain a candidate song identifier set, and then obtains the candidate song files corresponding to the candidate song identifiers; a matching audio frame unit between each candidate song file and the audio file is obtained, the matching audio frame unit including a plurality of consecutive matching audio frames; according to the matching audio frame unit corresponding to the candidate song identifier, the target song identifier of the target song to which the episode belongs is obtained from the candidate song identifier set, and the target song is determined according to the target song identifier. The scheme may first obtain the candidate song identifier set of the candidate songs to which the video episode may belong and then, based on the matching audio frames between the audio file of the video and the songs, select the identifier of the song to which the video episode belongs from the candidate song identifier set, thereby determining the song to which the video episode belongs; relative to the related art, this can improve the accuracy and efficiency of identifying or locating the song to which a video episode belongs.
  • further, the embodiment of the present invention fills the lyrics corresponding to the episode into the video according to the target song identifier and its corresponding matching audio frame unit; the scheme can automatically complete the matching of the video episode and the song to determine the song to which the video episode belongs, and automatically obtain the lyrics of the video episode for filling, which can improve the accuracy and efficiency of video episode lyric filling relative to the related art.
  • the candidate song identifier can be obtained based on the audio fingerprint matching between the audio file and the candidate song file in the video.
  • this embodiment mainly introduces the process of acquiring the candidate song identifier based on audio fingerprint matching.
  • as shown in Figure 2a, the process of obtaining the candidate song identifier is as follows:
  • Step 201 Divide the audio file into a plurality of audio segments, and obtain an audio fingerprint of the audio segment.
  • there may be multiple ways to divide the audio file.
  • the audio file can be divided into multiple audio segments by a preset frame length and a preset frame shift, and the duration of each audio segment is equal to the preset frame length. That is, the step of dividing the audio file into a plurality of audio segments may include:
  • the audio file is divided into a plurality of audio segments by a preset frame length and a preset frame shift.
  • for example, the audio file may first be converted into 8k16bit PCM (Pulse Code Modulation) audio, that is, audio with an 8k (8×1024) sample rate and 16-bit quantization; then, with a preset frame length of 10 seconds and a preset frame shift of 1 second, the audio is divided into a plurality of small 10-second audio segments. For example, when the duration of each frame is 1s, the first frame through the tenth frame form one audio segment, the second frame through the eleventh frame form the next audio segment, and so on. In a specific implementation, an appropriate division method can be selected according to actual needs.
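The overlapping 10-second / 1-second-shift segmentation described above can be sketched on a raw PCM sample array as follows. `split_into_segments` is a hypothetical helper; the 8000 Hz default is only a stand-in for the 8k sample rate mentioned in the text:

```python
def split_into_segments(pcm, sample_rate=8000, seg_seconds=10, shift_seconds=1):
    """Split a mono PCM sample array into overlapping segments.

    With a 10 s segment length and a 1 s shift, the samples for seconds
    1-10 form the first segment, seconds 2-11 the second, and so on.
    """
    seg_len = seg_seconds * sample_rate
    shift = shift_seconds * sample_rate
    return [pcm[i:i + seg_len]
            for i in range(0, max(len(pcm) - seg_len, 0) + 1, shift)]
```

A 12-second clip at 8 kHz thus yields three 10-second segments, starting at 0s, 1s, and 2s.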
  • there may also be multiple ways to obtain the audio fingerprint.
  • in this embodiment, a small audio fingerprint is used; the small audio fingerprint is a data structure that may be composed of spectral peak points of the spectrum. For example, the spectrum corresponding to each audio frame of the audio segment is obtained, the peak points of the spectra corresponding to the audio frames are extracted to obtain the spectral peak points corresponding to the audio segment, and the peak points in the resulting set are then combined to obtain the audio fingerprint; that is, the step "acquiring the audio fingerprint of the audio segment" may include:
  • acquiring the spectra corresponding to the audio frames of the audio segment, extracting the peak points of the spectra to obtain a peak set corresponding to the audio segment, and combining the spectral peak points in the peak set to obtain the audio fingerprint of the audio segment.
  • the step of “combining the peak points of the peaks in the peak set to obtain the audio fingerprint of the audio segment” may include:
  • combining each spectral peak point with a target spectral peak point to obtain the audio fingerprint of the audio segment, the audio fingerprint including: the frequency corresponding to the spectral peak point, and the time difference and frequency difference between the spectral peak point and the target spectral peak point.
  • the target spectral peak point combined with a given spectral peak point may be a spectral peak point other than that point; for example, after the peak set corresponding to the audio segment is acquired, a frequency peak point distribution map is generated according to the peak set; a target area corresponding to a certain frequency peak point (also referred to as an anchor point) may then be determined in the distribution map, the target area containing the target frequency peak points to be combined with the anchor point; the anchor point is then combined with each target frequency peak point in the target area, and after combining, multiple audio fingerprints are obtained.
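The anchor-plus-target-zone combination described above can be sketched as follows. `peaks_to_fingerprints` is an illustrative helper; the zone bounds (15-63 frames, ±31 bands) and the limit of three target peaks per anchor are taken from the text, while the `(f1, Δt, Δf)` tuple layout is an assumption:

```python
def peaks_to_fingerprints(peaks, time_zone=(15, 63), freq_zone=31, fan_out=3):
    """Combine spectral peaks into (f1, dt, df) audio fingerprints.

    peaks: list of (frame_no, band_index) sorted by frame_no.
    Each peak acts as an anchor and is paired with up to fan_out target
    peaks inside its target zone: 15-63 frames later, within +/-31 bands.
    """
    fingerprints = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt, df = t2 - t1, f2 - f1
            if dt > time_zone[1]:
                break  # peaks are time-sorted: no more candidates
            if time_zone[0] <= dt and abs(df) <= freq_zone:
                fingerprints.append((f1, dt, df))
                paired += 1
                if paired == fan_out:
                    break  # at most fan_out target peaks per anchor
    return fingerprints
```

Each emitted tuple carries the anchor's frequency plus the time and frequency differences to one target peak, matching the fingerprint contents listed above.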
  • the horizontal axis of the frequency peak point distribution map is time, and the vertical axis is the frequency of the peak point; since audio frames have a corresponding relationship with time, in order to quickly acquire the audio fingerprint, the audio frame sequence number can be used to represent time in the embodiment of the present invention.
  • the frequency of a peak point can also be represented by a frequency band index number, where the index number may range from 0 to 255; that is, the above peak point coordinates t and f can be indicated by the audio frame sequence number and the frequency band index number, respectively.
  • similarly, the target area can be represented by the audio frame sequence number and the frequency band index number.
  • the target area can be composed of a time area and a frequency domain area, where the time area can be 15 to 63 frames (the time difference is represented by 6 bits) and the frequency domain area may be -31 to 31 frequency bands (the frequency difference is represented by 6 bits); the size of the target area may be set according to actual requirements.
  • in addition, the target area may be restricted to include only three target spectral peak points, that is, the number of target spectral peak points corresponding to each anchor point is 3.
  • since short-time peak frequency points interact with each other, and one frequency component may mask a frequency component close to it (the so-called auditory masking effect), peak points with small time intervals and small frequency spacings are filtered out to ensure that the selected peak points are distributed more evenly along the time and frequency axes; that is, after the step "obtaining the peak set corresponding to the audio segment" and before the step "combining the spectral peak points in the peak set two by two", the song determination method may further include:
  • filtering the spectral peak points in the peak set according to the time differences and frequency differences between the spectral peak points.
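One simple way to realize this filtering is to drop any peak that sits too close, in both time and frequency, to a peak that has already been kept. `filter_peaks` is only a sketch of that idea; the thresholds `min_dt` and `min_df` are illustrative, not values from the patent:

```python
def filter_peaks(peaks, min_dt=3, min_df=5):
    """Thin out spectral peaks that crowd each other in time and frequency.

    A peak is dropped when it lies within min_dt frames AND min_df bands
    of an already-kept peak, approximating the auditory-masking rationale
    in the text; peaks close in only one dimension are kept.
    """
    kept = []
    for t, f in sorted(peaks):
        if all(abs(t - kt) >= min_dt or abs(f - kf) >= min_df
               for kt, kf in kept):
            kept.append((t, f))
    return kept
```

The surviving peaks are spread more evenly along the time and frequency axes, which is the stated goal of the filtering step.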
  • in order to make the spectral peak point distribution uniform, the peak points in the peak set may be filtered; a spectral peak point distribution map corresponding to the peak set of an audio segment is shown, and the filtered peak set corresponds to the spectral peak point distribution map shown in Figure 2c.
  • in the embodiments of the present invention, audio features may be distinguished based on the size of the audio fingerprint: the audio feature in Embodiment 1 may be referred to as a large audio fingerprint, and the audio fingerprint of the audio segment described in this Embodiment 2 may be referred to as a small audio fingerprint.
  • Step 202: Determine whether there is a fingerprint sample matching the audio fingerprint in the preset sample set; if yes, go to step 203, and if no, end the process.
  • the preset sample set may include at least one fingerprint sample, where each fingerprint sample is an audio fingerprint of a song; for example, the preset sample set may have multiple fingerprint samples, and each fingerprint sample It may correspond to a song ID, for example, fingerprint sample 1 corresponds to song 1, fingerprint sample 2 corresponds to song 2, and fingerprint sample n corresponds to song n.
  • specifically, a plurality of audio fingerprints of the audio segment may be acquired, and it is then determined whether the preset sample set contains a fingerprint sample matching (ie, identical to) each audio fingerprint, obtaining a plurality of matching fingerprint samples; the song identifier corresponding to each matching fingerprint sample is then acquired to obtain a song identifier set, the song identifier set including a plurality of song identifiers.
  • for example, if the audio fingerprints corresponding to the audio segment include audio fingerprint D1 and audio fingerprint D2, audio fingerprint D1 is compared with each fingerprint sample in the preset sample set; if a fingerprint sample identical to audio fingerprint D1 exists, it is determined that the preset sample set contains a fingerprint sample matching audio fingerprint D1.
  • likewise, audio fingerprint D2 can be compared with each fingerprint sample in the preset sample set, and if a fingerprint sample identical to audio fingerprint D2 exists, it is determined that the preset sample set contains a fingerprint sample matching audio fingerprint D2.
  • the fingerprint samples may be obtained in advance; for example, songs may be extracted from a song database, and the audio fingerprint of each song extracted as a fingerprint sample.
  • the manner of extracting the audio fingerprint of a song may be the same as that used to obtain the audio fingerprint of an audio segment.
  • for example, the spectra corresponding to the audio frames in the song are acquired, the peak points of the spectra are extracted, and the spectral peak points are combined to obtain the audio fingerprint of the song (ie, the fingerprint sample); the song can be extracted from a certain song database;
  • that is, before the step "extracting the audio file in the video", the song determining method may further include: extracting the audio fingerprints of songs as fingerprint samples to obtain the preset sample set.
  • Step 203: Obtain the song identifier corresponding to the matching fingerprint sample, to obtain a first song identifier set corresponding to the audio segment, the first song identifier set comprising a plurality of song identifiers.
  • there may be multiple ways to obtain the song identifier corresponding to the fingerprint sample.
  • for example, a mapping relationship set may be used to obtain the song identifier corresponding to the matching fingerprint sample, where the mapping relationship set may include the mapping relationships (ie, correspondences) between fingerprint samples and song identifiers; that is, the step "acquiring the song identifier corresponding to the matching fingerprint sample" may include:
  • acquiring, according to the mapping relationship set, the song identifier corresponding to the matching fingerprint sample, where the mapping relationship set includes the mapping relationship between the fingerprint sample and the song identifier.
  • the mapping relationship set may be preset, and the mapping relationship between fingerprint samples and song identifiers may be preset by the system or set by the user; that is, before the step "extracting the audio file in the video", the song determination method may further include:
  • receiving a mapping relationship setting request, where the request indicates a fingerprint sample and a song identifier for which a mapping relationship needs to be established;
  • establishing the mapping relationship between the fingerprint sample and the song identifier according to the mapping relationship setting request, to obtain the mapping relationship set.
  • the mapping relationship set may be presented in the form of a table, called a mapping relationship table; the mapping relationship table may include the preset sample set and the song identifier corresponding to each fingerprint sample in the preset sample set, and the mapping relationship table can be stored in a database, which can be called a fingerprint library.
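The fingerprint library described above can be sketched as a lookup table keyed by fingerprint, so that one lookup yields every matching fingerprint sample together with its song identifier and offset time. `build_fingerprint_library` and its input layout are illustrative assumptions:

```python
from collections import defaultdict

def build_fingerprint_library(songs):
    """Build a mapping-relationship table (a 'fingerprint library').

    songs: dict song_id -> list of (fingerprint, offset_time) pairs,
    one pair per fingerprint sample extracted from the song.
    Returns: dict fingerprint -> list of (song_id, offset_time), folding
    the fingerprint-to-song and fingerprint-to-offset mappings into one
    table, as the total mapping relationship set described later allows.
    """
    library = defaultdict(list)
    for song_id, prints in songs.items():
        for fp, offset in prints:
            library[fp].append((song_id, offset))
    return library
```

A query fingerprint then matches a sample exactly when it is present as a key in this dictionary.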
  • Step 204 Select, from the song identification set, a candidate song identifier of the candidate song to which the episode belongs.
  • for example, after the step "acquiring the audio fingerprint of the audio segment" and before the step "selecting the candidate song identifier from the song identifier set", the method may further include: obtaining a first offset time of the audio fingerprint in the audio segment, and acquiring a second offset time of the matching fingerprint sample in the matching song, where the first offset time is the time of the spectral peak point in the audio segment, and the matching song is the song corresponding to the song identifier;
  • in this case, the step "selecting the candidate song identifier of the candidate song to which the episode belongs from the song identifier set" may include:
  • obtaining the start time of the audio segment in the matching song according to the first offset time and the second offset time, and selecting the candidate song identifier from the song identifier set based on the start time of the audio segment in the matching song.
  • for example, the offset time t1 of the audio fingerprint D1 (f1, Δf', Δt') in the audio segment can be obtained, where t1 is the time of the spectral peak point a1 in the audio segment; since the fingerprint samples are also extracted in the above manner, the offset time of a fingerprint sample in the song to which it belongs is the time, in that song, of the spectral peak point (ie, the anchor point) corresponding to the fingerprint sample.
  • the offset time of the matching fingerprint sample in the matching song may be acquired based on a preset time mapping relationship set, where the preset time mapping relationship set may include the mapping relationships (correspondences) between fingerprint samples and their offset times in the songs to which they belong; that is, the step "acquiring the second offset time of the matching fingerprint sample in the matching song" may include:
  • acquiring, according to the preset time mapping relationship set, the second offset time corresponding to the matching fingerprint sample, where the preset time mapping relationship set includes the mapping relationship between the fingerprint sample and the offset time of the fingerprint sample in the song to which it belongs.
  • the time mapping relationship set may be preset, and the mapping relationship between fingerprint samples and offset times may be preset by the system or set by the user; that is, before the step "extracting the audio file in the video", the lyrics filling method may further include:
  • receiving a time mapping relationship setting request, where the request indicates a fingerprint sample and an offset time for which a mapping relationship needs to be established, the offset time being the offset time of the fingerprint sample in the song to which it belongs;
  • establishing the mapping relationship between the fingerprint sample and the offset time according to the request, to obtain the time mapping relationship set.
  • the time mapping relationship set may be presented in the form of a table, called a time mapping relationship table; the time mapping relationship table may include the preset sample set and the offset time corresponding to each fingerprint sample in the preset sample set.
  • in practical applications, the time mapping relationship set and the mapping relationship set may be placed in the same mapping relationship set; for example, a total mapping relationship set may be established, which includes both the mapping relationship between fingerprint samples and song identifiers and the mapping relationship between fingerprint samples and offset times.
  • correspondingly, a total mapping relationship table may be set, which may include: the preset sample set, the song identifier corresponding to each fingerprint sample in the preset sample set, and the offset time corresponding to each fingerprint sample in the preset sample set.
  • if the start time of the audio segment is the same in a plurality of different songs, it indicates that those songs are most likely the candidate songs corresponding to the audio segment, that is, songs to which the video episode may belong; that is, the step "selecting the candidate song identifier from the song identifier set according to the start time corresponding to the song identifiers in the set" may include:
  • counting the start times corresponding to the song identifiers to obtain a time set, determining a target start time from the time set according to the number of identical start times, and selecting the song identifiers corresponding to the target start time from the song identifier set as the candidate song identifiers.
  • for example, a start time whose number of identical occurrences reaches a preset number may be selected as the target start time; that is, the step "determining the target start time from the time set according to the number of identical start times" may include: selecting, as the target start time, a start time whose count is greater than the preset number.
  • the preset number can be set according to actual needs; for example, it can be 5, 6, 9, and so on.
  • the start time of the audio segment in a song may be obtained according to the offset time corresponding to the audio fingerprint and the offset time corresponding to the song identifier in the song identifier set.
  • for example, the time difference between the offset time corresponding to the song identifier and the offset time corresponding to the audio fingerprint may be calculated; this time difference is the start time of the audio segment in the song.
  • for example, if the offset time corresponding to the audio fingerprint of the audio segment is t' and the offset time corresponding to the fingerprint sample (ie, the offset time corresponding to the song identifier) is t, then the start time of the audio segment in the song is Δt = t - t'; in this way, the start time Δt corresponding to each song identifier in the song identifier set can be calculated to obtain a time set, such as (Δt1, Δt2, Δt1, Δt1, Δt2, Δt3, Δt3, Δtn).
  • then, the number of occurrences of each start time can be counted, and it is determined whether the number is greater than the preset number; if so, the corresponding start time is determined to be the target start time. For example, if the preset number is 8, the counted number of Δt1 is 10, the number of Δt2 is 6, and the number of Δt3 is 12, then the number of Δt1 is greater than the preset number, the number of Δt2 is less than the preset number, and the number of Δt3 is greater than the preset number, so it can be determined that Δt1 and Δt3 are the target start times.
  • in practical applications, in order to improve the matching speed of audio fingerprints, the audio fingerprint may be converted into a specific feature number by using a preset algorithm, named a hash value (hash_key); for example:
  • hash_key = f1 × 2^12 + Δf × 2^6 + Δt
  • where ^ is an exponentiation operator; the fingerprint (f1, Δf, Δt) is thus converted into a specific number, with the fields packed from high bits to low bits to constitute a 20-bit integer, so that subsequent audio fingerprint matching only needs to perform hash_key matching; that is, the step "determining whether there is a fingerprint sample matching the audio fingerprint in the preset sample set" may include:
  • determining whether there is a digital sample matching the feature number of the audio fingerprint in a preset digital sample set, where the preset digital sample set includes at least one feature number, called a digital sample, and each digital sample can correspond to a song identifier.
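The hash_key packing formula above can be sketched directly. `fingerprint_to_hash_key` is a hypothetical helper; the patent gives only the formula, so masking the signed Δf (range -31 to 31) into its 6-bit field is an assumption made here to keep the result a non-negative 20-bit integer:

```python
def fingerprint_to_hash_key(f1, df, dt):
    """Pack an (f1, df, dt) fingerprint into the hash_key feature number.

    hash_key = f1 * 2**12 + df * 2**6 + dt, with dt in 0..63 (6 bits)
    and df masked to 6 bits so the signed range -31..31 fits; f1 is the
    frequency band index (0..255), occupying the high bits.
    """
    return f1 * (1 << 12) + (df & 0x3F) * (1 << 6) + dt
```

With f1 at most 255, the result stays below 2^20, so every fingerprint maps to a 20-bit integer and matching reduces to integer comparison.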
  • the step of “acquiring the song identifier corresponding to the matching fingerprint sample” may include: acquiring a song identifier corresponding to the matching digital sample.
  • the song identifier corresponding to the matching digital sample may be obtained based on the digital mapping relationship set, that is, the step of “acquiring the song identifier corresponding to the matching digital sample” may include: acquiring the song identifier corresponding to the matching digital sample according to the digital mapping relationship set, where The set of digital mapping relationships includes: a correspondence between a digital sample and a song identifier.
  • the digital mapping relationship set may be preset, and the mapping relationship between the digital sample and the song identifier may be preset by the system or set by the user; that is, before the step "extracting the audio file in the video", the song determination method may further include:
  • receiving a digital mapping relationship setting request, the request indicating a digital feature and a song identifier for which a mapping relationship needs to be established;
  • establishing the mapping relationship between the digital feature and the song identifier according to the request, to obtain the digital mapping relationship set.
  • the step of “acquiring the second offset time of the matching fingerprint sample in the matching song” may include: acquiring a second offset time corresponding to the matching digital sample according to the digital time mapping relationship set, wherein the digital time mapping relationship set includes a number The mapping between the sample and the offset time.
  • the method for obtaining the digital time mapping relationship set can refer to the way the digital mapping relationship set or the time mapping relationship set is created, and is not repeated here.
  • in practical applications, the digital mapping relationship set and the digital time mapping relationship set may be placed in one set; for example, a total mapping relationship set may be established, which includes the mapping relationship between digital samples and song identifiers and the mapping relationship between digital samples and offset times; correspondingly, a mapping relationship table may be set, which may include: the preset digital sample set, the song identifier corresponding to each digital sample in the preset digital sample set, and the offset time corresponding to each digital sample in the preset digital sample set.
  • As can be seen from the above, the embodiment of the present invention divides the audio file into a plurality of audio segments and acquires the audio fingerprint of each audio segment; it then determines whether there is a fingerprint sample matching the audio fingerprint in the preset sample set, and if so, obtains the song identifier corresponding to the matching fingerprint sample to obtain a first song identifier set corresponding to the audio segment, and selects, from the song identifier set, the candidate song identifiers of the candidate songs to which the episode belongs. The scheme may acquire all candidate songs of the video episode and then determine the song corresponding to the video episode from the candidates based on the matching between the candidate songs and the audio of the video, which can improve the accuracy and efficiency of determining the song corresponding to the video episode compared to the related art.
  • further, since this embodiment uses spectral peak points to construct the audio fingerprint, the candidate songs corresponding to the video episode and their identifiers can be accurately obtained, further improving the accuracy of determining or locating the candidate songs to which the video episode belongs.
  • the embodiment of the present invention further provides a song determining apparatus.
  • the song determining apparatus may include an identifier acquiring unit 301, an audio frame acquiring unit 302, and a song determining unit 303, as follows:
  • the identifier obtaining unit 301 is configured to extract an audio file in the video, and obtain a candidate song identifier of the candidate song to which the episode belongs in the audio file, to obtain a candidate song identifier set.
  • the identifier obtaining unit 301 may include: an audio extraction subunit, a fingerprint acquisition subunit, a determination subunit, an identifier collection acquisition subunit, and a selection subunit;
  • the audio extraction subunit is configured to extract an audio file in the video
  • the fingerprint acquisition subunit is configured to divide the audio file into a plurality of audio segments, and acquire an audio fingerprint of the audio segment;
  • the determining subunit is configured to determine whether a fingerprint sample matching the audio fingerprint exists in the preset sample set
  • the identifier collection acquisition sub-unit is configured to: when determining that there is a fingerprint sample matching the audio fingerprint, obtain a song identifier corresponding to the matching fingerprint sample, and obtain a song identifier set corresponding to the audio segment, where the song identifier set includes a plurality of the songs Identification
  • the selection subunit is configured to select a candidate song identifier of the candidate song to which the episode belongs from the song identification set.
  • there are multiple ways to obtain the video; for example, a request may be sent to the video server to obtain the video, or the video may be extracted from local storage.
  • for example, audio/video separation processing may be performed on the video to obtain the audio file of the video; that is, the step of "extracting the audio file in the video" may include: performing audio/video separation processing on the video to obtain the audio file of the video.
  • the audio file may be divided in multiple manners; for example, the audio file may be divided into multiple audio segments using a preset frame length and a preset frame shift, the duration of each audio segment being equal to the preset frame length.
  • the candidate song to which the episode belongs may be a song that potentially matches the video episode, and the candidate song identifier is the identifier of such a song.
  • there are many ways to obtain the audio fingerprint of an audio segment, such as the following:
  • the peak points of the spectrum in the peak set are combined two by two to obtain an audio fingerprint of the audio segment.
  • the step "combining the spectral peak points in the peak set pairwise to obtain the audio fingerprint of the audio segment" may include: determining a target spectral peak point to be combined with the spectral peak point, and combining the spectral peak point with the target spectral peak point to obtain the audio fingerprint of the audio segment, where the audio fingerprint includes the frequency corresponding to the spectral peak point, and the time difference and frequency difference between the spectral peak point and the target spectral peak point.
  • there are multiple ways to select the candidate song identifier from the song identifier set; for example, it may be selected based on the offset time of the audio fingerprint;
  • the song determining apparatus may further include: an offset time acquisition unit, configured to acquire, after the fingerprint acquisition subunit acquires the audio fingerprint and before the selection subunit selects the candidate song identifier, a first offset time of the audio fingerprint in the audio segment and a second offset time of the matching fingerprint sample in the matching song, where the first offset time is the time of the spectral peak point in the audio segment, and the matching song is the song corresponding to the song identifier;
  • the selection subunit may be specifically configured to:
  • the candidate song identification is selected from the song identification set based on the start time of the audio segment in the matching song.
  • in an example, the selection subunit is specifically configured as follows:
  • a song identifier corresponding to the target start time is selected from the song identification set as a candidate song identifier.
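  The offset-based selection above can be sketched as follows (a minimal illustration: the helper name `select_candidates`, the `min_votes` threshold, and the vote-counting rule are assumptions for illustration, not the patent's exact procedure). Matches belonging to the correct song agree on the start time of the audio segment within that song, computed as the second offset minus the first offset:

  ```python
  from collections import Counter

  def select_candidates(matches, min_votes=2):
      """Hypothetical selection of candidate song IDs from fingerprint matches.

      Each match is (song_id, first_offset, second_offset), where
      first_offset is the fingerprint's time in the audio segment and
      second_offset is the matching sample's time in the song.
      """
      votes = Counter()
      for song_id, first_offset, second_offset in matches:
          # Start time of the audio segment within the matching song:
          # aligned matches of the same song agree on this value.
          start = second_offset - first_offset
          votes[(song_id, start)] += 1
      # Keep song IDs whose (song, start-time) pair gathers enough votes.
      return {song_id for (song_id, _), n in votes.items() if n >= min_votes}

  matches = [("A", 1.0, 31.0), ("A", 2.0, 32.0), ("A", 3.0, 33.0),
             ("B", 1.0, 10.0), ("B", 2.0, 50.0)]
  print(select_candidates(matches))  # → {'A'}
  ```

  Here the three matches of song "A" all imply a start time of 30.0 s and accumulate votes, while the two matches of song "B" disagree and are discarded.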
  • the audio frame obtaining unit 302 is configured to acquire the candidate song file corresponding to the candidate song identifier, and acquire the matching audio frames between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames.
  • the audio frame obtaining unit 302 may specifically include: a matching subunit, a first acquiring subunit, and a second acquiring subunit;
  • the matching subunit is configured to match an audio feature of the first audio frame in the candidate song file with an audio feature of the second audio frame in the audio file to obtain a matching result
  • the first obtaining subunit is configured to acquire, according to the matching result, a matching audio frame that matches the candidate song file and the audio file;
  • the second obtaining subunit is configured to obtain a matching audio frame unit according to the matched audio frame.
  • the matching subunit is specifically configured to: acquire the frame count of the first audio frames in the candidate song file, select an audio frame unit from the audio file, the audio frame unit including a number of second audio frames equal to the frame count, and match the audio features of the first audio frames in the candidate song file with the audio features of the second audio frames in the audio frame unit, to obtain an audio feature matching result;
  • the first acquiring subunit is configured to: obtain, according to the audio feature matching result, the matching audio frames between the candidate song file and the audio file, where a matching audio frame is an audio frame whose audio features are matched successfully;
  • the second obtaining subunit is specifically configured to: obtain, according to the matching audio frames, a frame contiguous unit including a plurality of consecutive matching audio frames, acquire the number of matching audio frames in the frame contiguous unit, and determine, according to the number, the frame contiguous unit as the matching audio frame unit.
  • the song determining apparatus of the embodiment of the present invention may further include: a feature acquiring unit, configured to acquire the audio features corresponding to the first audio frames in the candidate song file after the identifier acquiring unit 301 acquires the candidate song identifier and before the matching subunit performs feature matching.
  • the feature acquiring unit may be specifically configured to: convert the candidate song file into audio of a preset format (such as 8k16bit audio), take a first preset number of sampling points as one frame and a second preset number of sampling points as the frame shift, and perform a Fourier transform to obtain the spectrum (for example, one frame of 1856 sampling points with a frame shift of 58 sampling points); divide the spectrum into a third preset number (such as 32) of frequency bands and calculate the average amplitude value corresponding to each frequency band; then compare each frequency band with the corresponding frequency band of the previous frame (the first frequency band of the second audio frame is compared with the first frequency band of the first audio frame, the second frequency band of the second audio frame with the second frequency band of the first audio frame, and so on until all frequency bands are compared); if greater, the bit is 1, otherwise 0, so that each frame yields a data unit consisting of the third preset number of bit values, and this data unit is the audio feature of the frame.
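  The band-comparison feature above can be sketched as follows (a minimal sketch; the function name, the use of NumPy, and the dropping of leftover spectrum bins are illustrative assumptions):

  ```python
  import numpy as np

  def frame_features(samples, frame_len=1856, hop=58, n_bands=32):
      """Sketch of the per-frame audio feature described above: each frame's
      spectrum is split into n_bands bands, and each band's average amplitude
      is compared with the same band of the previous frame (1 if greater,
      else 0), giving an n_bands-bit feature per frame."""
      feats = []
      prev = None
      for start in range(0, len(samples) - frame_len + 1, hop):
          spectrum = np.abs(np.fft.rfft(samples[start:start + frame_len]))
          # Average amplitude per band (drop any remainder bins for simplicity).
          usable = (len(spectrum) // n_bands) * n_bands
          bands = spectrum[:usable].reshape(n_bands, -1).mean(axis=1)
          if prev is not None:
              feats.append(tuple(int(b > p) for b, p in zip(bands, prev)))
          prev = bands
      return feats

  rng = np.random.default_rng(0)
  feats = frame_features(rng.standard_normal(8000))
  print(len(feats), len(feats[0]))  # → 105 32
  ```

  The first frame has no predecessor to compare against, so the feature list has one entry fewer than the number of frames.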
  • the song determining unit 303 is configured to acquire the target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to the candidate song identifier, and determine, according to the target song identifier, the target song to which the episode belongs.
  • the song determining unit 303 may specifically include: an audio frame extension subunit, a time acquisition subunit, an identifier acquisition subunit, and a song determination subunit;
  • the audio frame extension subunit is configured to perform audio frame expansion on the matched audio frame unit corresponding to the candidate song identifier, to obtain a matching song segment corresponding to the candidate song identifier;
  • the time acquisition subunit is configured to acquire, according to the matching song segment, time information corresponding to the candidate song identifier, the time information including: a first start time of the matching song segment in the video, a second start time in the candidate song, and the duration of the matching song segment;
  • the identifier acquisition subunit is configured to obtain a target song identifier from the candidate song identifier set according to time information corresponding to the candidate identifier;
  • the song determining subunit is configured to determine a target song to which the episode belongs according to the target song identifier.
  • the audio frame extension subunit may be specifically configured to: perform audio frame extension on the matching audio frame unit in the candidate song file and in the audio file respectively, to obtain a first matching audio frame extension unit in the candidate song file and a second matching audio frame extension unit in the audio file; match the audio features of the first audio frames in the first matching audio frame extension unit with the audio features of the second audio frames in the second matching audio frame extension unit, to obtain the matching audio frames between the extension units; and determine, according to the number of matching audio frames between the extension units, the first matching audio frame extension unit or the second matching audio frame extension unit as the matching song segment.
  • the identifier obtaining subunit may be specifically configured to: acquire, according to the second start time and the duration corresponding to the candidate song identifier, the play time corresponding to the candidate song identifier; filter the candidate song identifiers in the candidate song identifier set according to the play times, to obtain a filtered candidate identifier set; and use the candidate song identifier in the filtered candidate identifier set as the target song identifier of the target song to which the episode belongs.
  • for example, candidate song identifiers whose play time is contained within that of another candidate are filtered out; for another example, after the play times corresponding to the candidate song identifiers are acquired, candidate song identifiers whose play times overlap may be determined, and the candidate song identifier with the shorter play duration is filtered out.
  • the song determining apparatus of the embodiment of the present invention may further include: a lyrics filling unit 304;
  • the lyrics filling unit 304 is configured to fill the lyrics corresponding to the episode to the video according to the target song identifier and its corresponding matching audio frame unit;
  • the lyrics filling unit 304 may include: a lyrics acquiring subunit and a filling subunit;
  • the lyrics obtaining sub-unit is configured to acquire the lyrics corresponding to the episode according to the target song identifier and the corresponding first start time and the duration;
  • the padding subunit is configured to fill the lyrics to the video according to the second start time and the duration corresponding to the target song identifier.
  • the target lyric file of the corresponding target song may be obtained according to the target song identifier, and then the lyrics corresponding to the episode are extracted from the target lyric file according to the start time of the matching song segment in the target song and the duration of the matching song segment. That is, the lyrics obtaining subunit can be specifically configured as:
  • the corresponding lyrics are extracted from the lyric file as the lyrics of the episode.
  • the padding subunit may be specifically configured to: acquire the presentation time of the lyrics in the video according to the second start time and the duration corresponding to the target song identifier, and populate the lyrics into the video based on the presentation time.
  • in an embodiment, the song determining apparatus may further include a lyrics determining unit 305, referring to FIG. 3c;
  • the lyrics determining unit 305 may be configured to determine, after the lyrics filling unit 304 acquires the lyrics corresponding to the episode and before the lyrics are filled into the video, whether the lyrics form a complete sentence;
  • the lyrics filling unit 304 may be configured to: when the lyrics determining unit 305 determines that the lyrics form a complete sentence, perform the step of filling the lyrics into the video according to the second start time and the duration corresponding to the target song identifier.
  • in an embodiment of the present invention, an interface may further be provided in the video, so that when the video episode is played, the terminal can jump through the interface to play the song to which the video episode belongs; that is, the song determining apparatus of the embodiment of the present invention may further include: an interface setting unit;
  • the interface setting unit may be configured to: after the song determining unit 303 acquires the target song identifier of the episode, set a jump interface in the video according to the target song identifier, so that when playing the episode, the terminal jumps through the jump interface to play the target song to which the episode belongs.
  • the form of the jump interface can be various, such as a button, an input box, etc., and can be set according to actual needs.
  • the interface setting unit may be further configured to: after the song determining unit 303 acquires the target song identifier, set an add interface in the video according to the target song identifier, so that when playing the episode, the terminal adds the target song to the song list of the music software through the add interface.
  • the foregoing units may be implemented as a separate entity, or may be implemented in any combination, and may be implemented as the same or a plurality of entities.
  • for specific implementation of the foregoing units, reference may be made to the foregoing method embodiments; details are not described herein again.
  • as can be seen from the above, the identifier obtaining unit 301 of the song determining apparatus of the embodiment of the present invention extracts the audio file in the video and acquires the candidate song identifiers of the candidate songs to which the episode belongs in the audio file, to obtain a candidate song identifier set; then the audio frame obtaining unit 302 acquires the candidate song file corresponding to the candidate song identifier and acquires the matching audio frames between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames; the song determining unit 303 obtains the target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to the candidate song identifier, and determines, according to the target song identifier, the target song to which the episode belongs;
  • this scheme may first obtain the candidate song identifier set of the candidate songs to which the video episode belongs, and then select, from the candidate song identifier set, the identifier of the song to which the video episode belongs based on the matching audio frames between the video's audio file and the songs, thereby determining the song to which the video episode belongs; compared with the related art, this can improve the accuracy and efficiency of determining or locating the song corresponding to the video episode.
  • in addition, after determining the song to which the video episode belongs, the apparatus of the embodiment of the present invention may further fill the lyrics corresponding to the episode into the video according to the target song identifier and its corresponding matching audio frame unit; this scheme can also automatically complete the matching between the video episode and the songs to determine the song to which the video episode belongs, and can automatically acquire the lyrics of the video episode for filling, which can improve the accuracy and efficiency of video episode lyrics filling compared with the related art.
  • FIG. 4 exemplarily shows a schematic diagram of the structure of the song determining apparatus 40 provided by the embodiment of the present invention.
  • the structure shown in FIG. 4 is only one example of a suitable structure and is not intended to suggest any limitation regarding the structure of the song determining apparatus 40.
  • the song determining device 40 can be implemented in a distributed computing environment including, for example, a server computer, a small computer, a mainframe computer, and any of the above-described devices.
  • Computer readable instructions may be distributed via computer readable media (discussed below).
  • Computer readable instructions may be implemented as program modules, such as functions, objects, application programming interfaces (APIs), data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the functionality of the computer readable instructions can be combined or distributed at will in various environments.
  • FIG. 4 illustrates an example of the structure of a song determining apparatus 40 provided in accordance with an embodiment of the present invention.
  • the song determining device 40 includes at least one processing unit 41 and a storage unit 42.
  • storage unit 42 may be volatile (such as Random Access Memory (RAM)), non-volatile (such as Read Only Memory (ROM), flash memory, etc.), or some combination of the two; this configuration is illustrated by dashed lines in FIG. 4.
  • song determining device 40 may include additional features and/or functionality.
  • song determining device 40 may also include additional storage devices (eg, removable and/or non-removable) including, but not limited to, magnetic storage devices, optical storage devices, and the like.
  • This additional storage device is illustrated by storage unit 43 in FIG.
  • computer readable instructions for implementing one or more embodiments provided by embodiments of the present invention may be in storage unit 43.
  • the storage unit 43 can also store other computer readable instructions for implementing an operating system, applications, and the like.
  • Computer readable instructions may be loaded into storage unit 42 for execution by, for example, processing unit 41.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions or other data.
  • the storage unit 42 and the storage unit 43 are examples of computer storage media.
  • Computer storage media includes, but is not limited to, RAM, ROM, Electrically Erasable Programmable Read-Only Memory, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage device
  • the song determining device 40 may also include a communication connection 46 that allows the song determining device 40 to communicate with other devices.
  • Communication connection 46 may include, but is not limited to, a modem, a network interface card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interface for connecting song determination device 40 to other song determination devices.
  • Communication connection 46 may include a wired connection or a wireless connection. Communication connection 46 can transmit and/or receive communication media.
  • Computer readable medium can include a communication medium.
  • Communication media typically embodies computer readable instructions or other data in a "modulated data signal" such as a carrier wave or other transport mechanism, and includes any information delivery media.
  • A "modulated data signal" may be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information into the signal.
  • Song determining device 40 may include an input unit 45, such as a keyboard, mouse, pen, voice input device, touch input device, infrared camera, video input device, and/or any other input device.
  • Output unit 44 may also be included in song determining device 40, such as one or more displays, speakers, printers, and/or any other output device.
  • the input unit 45 and the output unit 44 may be connected to the song determining device 40 via a wired connection, a wireless connection, or any combination thereof.
  • an input device or output device from another song determining device may be used as the input unit 45 or output unit 44 of the song determining device 40.
  • the components of song determining device 40 may be connected by various interconnects, such as a bus. Such interconnections may include a Peripheral Component Interconnect (PCI), a Universal Serial Bus (USB), a FireWire (IEEE 1394), an optical bus structure, and the like.
  • the components of song determining device 40 may be interconnected by a network.
  • storage unit 42 may be comprised of a plurality of physical memory units that are interconnected by a network located in different physical locations.


Abstract

A song determining method and apparatus, and a storage medium. The method includes: extracting an audio file in a video, and acquiring candidate song identifiers of candidate songs to which an episode in the audio file belongs, to obtain a candidate song identifier set (101); acquiring a candidate song file corresponding to a candidate song identifier, and acquiring matching audio frames between the candidate song file and the audio file, to obtain a matching audio frame unit (102), where the matching audio frame unit includes a plurality of consecutive matching audio frames; and according to the matching audio frame unit corresponding to the candidate song identifier, acquiring, from the candidate song identifier set, a target song identifier of a target song to which the episode belongs, and determining, according to the target song identifier, the target song to which the episode belongs (103). The method can improve the accuracy of determining or locating the song corresponding to a video episode.

Description

Song Determining Method and Apparatus, and Storage Medium

Technical Field

The present invention relates to audio and video processing technologies, and in particular to a song determining method and apparatus, and a storage medium.
Background

With the development of the Internet and of communication networks, video technology has also advanced rapidly; online video has been widely promoted, and more and more users watch videos over the network.

At present, episodes (inserted songs) often appear in videos. In this case, lyrics need to be added for the video episode so that users can see the lyrics of the episode, improving the user experience. Filling lyrics for a video episode first requires determining or locating the song to which the video episode belongs. The current way of determining or locating the song to which a video episode belongs is mainly: extracting the video episode segment from the video, then roughly matching the video episode segment against the songs in a music library, and taking the successfully matched song as the song to which the video episode belongs.

In the solutions provided by the related art for determining or locating the song to which a video episode belongs, because the accuracy of extracting the video episode segment is low and the song matching uses a relatively simple matching method, the accuracy of determining the song corresponding to the video episode is relatively low.
Summary

Embodiments of the present invention provide a song determining method and apparatus, and a storage medium, which can improve the accuracy of determining the song corresponding to a video episode.
In a first aspect, an embodiment of the present invention provides a song determining method, including:

extracting an audio file in a video, and acquiring candidate song identifiers of candidate songs to which an episode in the audio file belongs, to obtain a candidate song identifier set;

acquiring a candidate song file corresponding to a candidate song identifier, and acquiring matching audio frames between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames; and

according to the matching audio frame unit corresponding to the candidate song identifier, acquiring a target song identifier from the candidate song identifier set, and determining, according to the target song identifier, the target song to which the episode belongs.
In a second aspect, an embodiment of the present invention further provides a song determining apparatus, including:

an identifier acquiring unit, configured to extract an audio file in a video, and acquire candidate song identifiers of candidate songs to which an episode in the audio file belongs, to obtain a candidate song identifier set;

an audio frame acquiring unit, configured to acquire a candidate song file corresponding to a candidate song identifier, and acquire matching audio frames between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames; and

a song determining unit, configured to acquire, according to the matching audio frame unit corresponding to the candidate song identifier, a target song identifier from the candidate song identifier set, and determine, according to the target song identifier, the target song to which the episode belongs.
In a third aspect, an embodiment of the present invention provides a song determining apparatus, including a memory and a processor, where the memory stores executable instructions for causing the processor to perform operations including:

extracting an audio file in a video;

acquiring candidate song identifiers of candidate songs to which an episode in the audio file belongs, to form a candidate song identifier set;

acquiring a candidate song file corresponding to a candidate song identifier, and acquiring matching audio frames between the candidate song file and the audio file;

forming a matching audio frame unit based on the acquired matching audio frames, where the matching audio frame unit includes a plurality of consecutive matching audio frames;

acquiring a target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to the candidate song identifier; and

determining, according to the target song identifier, the target song to which the episode belongs.
In a fourth aspect, an embodiment of the present invention provides a storage medium storing executable instructions for performing the song determining method provided by the embodiments of the present invention.
The embodiments of the present invention extract an audio file in a video and acquire candidate song identifiers of candidate songs to which an episode in the audio file belongs, to obtain a candidate song identifier set; then acquire a candidate song file corresponding to a candidate song identifier and acquire matching audio frames between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames; and according to the matching audio frame unit corresponding to the candidate song identifier, acquire, from the candidate song identifier set, the target song identifier of the target song to which the episode belongs, and determine, according to the target song identifier, the target song to which the episode belongs.

This scheme may first obtain the candidate song identifier set of the candidate songs to which the video episode belongs, and then select, from the candidate song identifier set, the identifier of the song to which the video episode belongs based on the matching audio frames between the video's audio file and the songs, thereby determining the song to which the video episode belongs; compared with the related art, this can improve the accuracy of determining or locating the song corresponding to the video episode.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person skilled in the art may derive other drawings from these drawings without creative efforts.
FIG. 1 is a flowchart of a song determining method according to an embodiment of the present invention;

FIG. 2a is a flowchart of acquiring candidate song identifiers according to an embodiment of the present invention;

FIG. 2b is a spectral peak point distribution diagram according to an embodiment of the present invention;

FIG. 2c is a filtered spectral peak point distribution diagram according to an embodiment of the present invention;

FIG. 3a is a schematic structural diagram of a first song determining apparatus according to an embodiment of the present invention;

FIG. 3b is a schematic structural diagram of a second song determining apparatus according to an embodiment of the present invention;

FIG. 3c is a schematic structural diagram of a third song determining apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the hardware structure of a song determining apparatus according to an embodiment of the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely some rather than all of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.
Embodiments of the present invention provide a song determining method and apparatus, which are described in detail below.

The embodiments of the present invention will be described from the perspective of a song determining apparatus, which may be integrated in a device that needs to determine the song corresponding to a video episode, such as a server.

Of course, the song determining apparatus may also be integrated in a device that needs to determine the song corresponding to a video episode such as a user terminal (for example, a smartphone or a tablet computer).
An embodiment of the present invention provides a song determining method, including: extracting an audio file in a video, and acquiring candidate song identifiers of candidate songs to which an episode in the audio file belongs, to obtain a candidate song identifier set; then acquiring a candidate song file corresponding to a candidate song identifier, and acquiring matching audio frames between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames; and according to the matching audio frame unit corresponding to the candidate song identifier, acquiring, from the candidate song identifier set, the identifier of the target song to which the episode belongs, that is, the target song identifier, and determining, according to the target song identifier, the target song to which the episode belongs.

As shown in FIG. 1, the specific procedure of the song determining method may be as follows:
Step 101: Extract an audio file in a video, and acquire candidate song identifiers of candidate songs to which an episode in the audio file belongs, to obtain a candidate song identifier set.

There are multiple ways to obtain the video; for example, a request may be sent to a video server to obtain the video, or the video may be extracted from local storage; that is, the step of "extracting an audio file in a video" may include:

sending a video acquisition request to a video server;

receiving the video returned by the video server according to the video acquisition request; and

extracting the audio file in the video.
There are multiple ways to extract the audio file in the video; for example, audio/video separation processing may be performed on the video to obtain the audio file of the video; that is, the step of "extracting the audio file in the video" may include: performing audio/video separation processing on the video to obtain the audio file of the video.

In the embodiments of the present invention, a candidate song to which the episode belongs may be a song that potentially matches the video episode, and the candidate song identifier is the identifier of a song that matches the video episode.
There are multiple ways to acquire the candidate song identifiers. For example, the audio file of the video is first divided into multiple audio segments; then each audio segment is matched against songs (songs in a music library) to obtain songs matching the video episode, and the identifiers of these songs are used as candidate song identifiers; for example, song matching is performed based on the audio fingerprints of the audio segments and the songs (that is, digitized features of the songs' audio); that is, the step of "acquiring candidate song identifiers of candidate songs to which an episode in the audio file belongs" may include:

dividing the audio file into multiple audio segments, and acquiring an audio fingerprint of each audio segment;

determining whether a fingerprint sample matching the audio fingerprint exists in a preset sample set;

if so, obtaining the song identifier corresponding to the matching fingerprint sample, to obtain a song identifier set corresponding to the audio segment, where the song identifier set includes a plurality of song identifiers; and

selecting, from the song identifier set, the candidate song identifiers of the candidate songs to which the episode belongs.

The specific process of acquiring the candidate song identifiers will be further described later in the embodiments of the present invention.
Step 102: Acquire a candidate song file corresponding to a candidate song identifier, and acquire matching audio frames between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames.

For example, the candidate song file corresponding to the candidate song identifier may be acquired from the song database of a song server; for instance, a request may be sent to the song server to obtain the corresponding song file; that is, the step of "acquiring a candidate song file corresponding to a candidate song identifier" may include:

sending a song acquisition request carrying the candidate song identifier to a song server; and

receiving the candidate song file returned by the song server according to the song acquisition request.
A matching audio frame is an audio frame on which the candidate song file and the audio file match; for example, when the candidate song file includes multiple first audio frames and the audio file includes multiple second audio frames, a first audio frame in the candidate song file that matches a second audio frame in the audio file is a matching audio frame, and likewise a second audio frame in the audio file that matches a first audio frame in the candidate song file is also a matching audio frame. In this case, the matching audio frame unit may be an audio frame unit in the candidate song file, or an audio frame unit in the audio file.

It can be understood that the above "first audio frame" denotes an audio frame in the candidate song, used for comparison with an audio frame in the audio file (that is, a second audio frame), and does not refer to a specific audio frame in the candidate song; similarly, "second audio frame" denotes an audio frame in the audio file and does not refer to a specific audio frame in the audio file.
In the embodiments of the present invention, there are multiple ways to acquire the matching audio frames, for example, matching the audio frames in the candidate song against the audio frames in the audio file.

For example, the audio frame matching may use audio-feature-based matching, such as matching the audio features of the first audio frames in the candidate song file with the audio features of the second audio frames in the audio file, and acquiring the matching audio frames according to the audio feature matching result; that is, the step of "acquiring matching audio frames between the candidate song file and the audio file, to obtain a matching audio frame unit" may include:

matching the audio features of the first audio frames in the candidate song file with the audio features of the second audio frames in the audio file, to obtain a matching result;

acquiring, according to the matching result, the matching audio frames between the candidate song file and the audio file; and

acquiring a matching audio frame unit according to the matching audio frames.
The audio feature of an audio frame may be called an audio fingerprint, and there are multiple ways to acquire it; for example, it may be acquired according to the average amplitudes of the frequency bands corresponding to the audio frame. That is, after the step of "acquiring the corresponding candidate song file according to the candidate song identifier" and before the step of "matching the audio features corresponding to the first audio frames in the candidate song file with the audio features corresponding to the second audio frames in the audio file", the song determining method may further include: acquiring the audio features corresponding to the first audio frames in the candidate song file; for example, the step of "acquiring the audio features corresponding to the first audio frames in the candidate song file" may include:

acquiring the spectrum corresponding to each first audio frame in the candidate song file;

dividing the spectrum corresponding to the first audio frame into a preset number of frequency bands, and acquiring the average amplitude corresponding to each frequency band;

comparing the average amplitude of each frequency band with the average amplitude of the corresponding frequency band of the previous first audio frame, to obtain a comparison result; and

acquiring, according to the comparison result, the audio feature corresponding to the first audio frame.
For example, the candidate song file is converted into audio of a preset format, such as 8k16bit audio (that is, 8*1024 sampling rate, 16-bit quantized audio); then a Fourier transform is performed with a first preset number of sampling points as one frame and a second preset number of sampling points as the frame shift, to obtain the spectrum (for example, one frame of 1856 sampling points with a frame shift of 58 sampling points); next, the spectrum is evenly divided into a third preset number (such as 32) of frequency bands, and the average amplitude value corresponding to each frequency band is calculated; then each frequency band is compared with the corresponding frequency band of the previous frame (the first frequency band of the second audio frame is compared with the first frequency band of the first audio frame, the second frequency band of the second audio frame with the second frequency band of the first audio frame, and so on until all frequency bands are compared); if greater, the bit is 1, and if smaller, 0, so each frame yields a data unit consisting of the third preset number of bit values, and this data unit is the audio feature of the frame; for example, when the spectrum is divided into 32 frequency bands, each audio frame yields a data unit of 32 bit values, and these 32 bit values are the audio feature of the audio frame.
Similarly, the audio features of the audio file in the video may also be acquired in the above manner; the acquisition process can refer to the above description and is not repeated here.
In the embodiments of the present invention, there are multiple ways to match the audio features; for example, feature matching may be performed with a frame unit as one unit; that is, the step of "matching the audio features corresponding to the first audio frames in the candidate song file with the audio features corresponding to the second audio frames in the audio file, to obtain a matching result" may include:

acquiring the frame count of the first audio frames in the candidate song file, and selecting an audio frame unit from the audio file, the audio frame unit including a number of second audio frames equal to the frame count; and

matching the audio features of the first audio frames in the candidate song file with the audio features of the second audio frames in the audio frame unit, to obtain an audio feature matching result.

In this case, the step of "acquiring, according to the matching result, the matching audio frames between the candidate song file and the audio file" may include: acquiring, according to the audio feature matching result, the matching audio frames between the candidate song file and the audio file, where a matching audio frame is an audio frame whose audio features are matched successfully.

Correspondingly, the step of "acquiring a matching audio frame unit according to the matching audio frames" may include: acquiring the number of consecutive matching audio frames, and acquiring the corresponding matching audio frame unit according to the number.

For example, the step of "acquiring the number of consecutive matching audio frames, and acquiring the corresponding matching audio frame unit according to the number" may include:

acquiring frame contiguous units according to the matching audio frames, a frame contiguous unit including a plurality of consecutive matching audio frames; and

acquiring the number of matching audio frames in the frame contiguous unit, and determining, according to the number, the frame contiguous unit as the matching audio frame unit.
For example, when the candidate song has n first audio frames and the audio file has m second audio frames, with m > n and both positive integers, n consecutive second audio frames are selected from the m second audio frames to form audio frame unit a; then the audio features of the second audio frames in audio frame unit a are matched with the audio features of the corresponding first audio frames in the candidate song (for example, the first audio frame in audio frame unit a is matched with the audio feature of the first audio frame in the candidate song, the second audio frame in audio frame unit a with the second audio frame in the candidate song, and so on, until the n-th audio frame in audio frame unit a is matched with the audio feature of the n-th audio frame in the candidate song); in this case, n feature matchings are performed to obtain the audio feature matching result.

If the audio feature matching result includes first and second audio frames whose audio features are matched successfully, the matching audio frames are acquired according to the matching result, and the frame contiguous units and the number of matching audio frames in each frame contiguous unit are acquired.

Next, a new set of n consecutive second audio frames is selected from the m second audio frames to form a new audio frame unit b, where audio frame unit b differs from audio frame unit a in at least one second audio frame (that is, the newly selected n consecutive second audio frames differ from the previously selected n consecutive second audio frames in at least one audio frame; for example, if the 1st to 10th second audio frames previously formed audio frame unit a, the 2nd to 11th second audio frames may then form audio frame unit b). The audio features of the second audio frames in audio frame unit b are matched with the audio features of the corresponding first audio frames in the candidate song (the first audio frame in unit b with the first audio frame in the candidate song, the second with the second, and so on, until the n-th with the n-th) to obtain an audio feature matching result; if the result includes successfully matched first and second audio frames, the matching audio frames are acquired according to the matching result, and the frame contiguous units and the number of matching audio frames in each are acquired; and so on: new sets of n consecutive second audio frames keep being selected to form audio frame units for audio feature matching, to acquire the numbers of consecutive matching audio frames, until every second audio frame has been matched.

After the foregoing matching, a series of frame contiguous units and their corresponding numbers of matching audio frames are obtained; at this point, the frame contiguous unit can be determined as the matching audio frame unit based on the number. For example, the frame contiguous unit with the largest number of matching audio frames may be selected as the matching audio frame unit; that is, the step of "determining, according to the number, the frame contiguous unit as the matching audio frame unit" may include: when the number of matching audio frames of a frame contiguous unit is greater than the numbers of matching audio frames of the other frame contiguous units, determining that frame contiguous unit as the matching audio frame unit.

For example, when the candidate song has 10 audio frames p (10 frames) and the audio file has 20 audio frames q (20 frames), the 1st to 10th audio frames q may be selected to form a first audio frame unit; then the 1st to 10th audio frames q in the first audio frame unit (that is, the 1st to 10th audio frames of the audio file) are matched with the 10 audio frames p of the candidate song to obtain the matching audio frames (for example, the 1st audio frame q in the audio frame unit is feature-matched with the 1st audio frame p, ..., the 10th audio frame q with the 10th audio frame p); the consecutive matching audio frames are grouped into frame contiguous units, and the number of matching audio frames in each frame contiguous unit is acquired.

Next, the 2nd to 11th audio frames q of the audio file are selected to form a second audio frame unit; then the 1st to 10th audio frames q in the second audio frame unit (that is, the 2nd to 11th audio frames q of the audio file) are matched with the 10 audio frames p to obtain the matching audio frames, the consecutive matching audio frames are grouped into frame contiguous units, and the number of matching audio frames in each is acquired; and so on, until the 11th to 20th audio frames q are selected to form an audio frame unit for feature matching.

Through the foregoing feature matching, multiple frame contiguous units and their corresponding numbers of matching audio frames are obtained; at this point, the frame contiguous unit including the largest number of matching audio frames may be selected as the matching audio frame unit, that is, the longest frame contiguous unit is selected as the matching audio frame unit.
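The sliding-window matching above can be sketched as follows (an illustrative simplification: features are compared with exact equality, and the helper returns the longest run of consecutive per-frame matches together with its position; the names are assumptions):

```python
def best_matching_unit(song_feats, audio_feats):
    """Sketch: slide the n-frame candidate song over the m-frame audio file,
    match features frame by frame at each offset, and keep the longest run
    of consecutive matching frames as the matching audio frame unit."""
    n, m = len(song_feats), len(audio_feats)
    best = (0, 0, 0)  # (run length, audio offset, start index within window)
    for off in range(m - n + 1):          # each candidate audio frame unit
        run = start = 0
        for i in range(n):
            if audio_feats[off + i] == song_feats[i]:
                if run == 0:
                    start = i
                run += 1
                if run > best[0]:
                    best = (run, off, start)
            else:
                run = 0
    return best

song = ["a", "b", "c", "d"]
audio = ["x", "a", "b", "c", "y", "d", "a"]
print(best_matching_unit(song, audio))  # → (3, 1, 0)
```

At offset 1 the window "a b c y" matches the song's first three frames consecutively, so the longest frame contiguous unit has length 3.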
Step 103: According to the matching audio frame unit corresponding to the candidate song identifier, acquire, from the candidate song identifier set, the target song identifier of the target song to which the episode belongs, and determine, according to the target song identifier, the target song to which the episode belongs.

Through step 102, the matching audio frame unit between the candidate song file corresponding to each candidate song identifier and the audio file can be acquired, that is, the matching audio frame unit corresponding to each candidate song identifier; in this way, the target song identifier of the target song to which the video episode belongs can be selected from the candidate song identifier set according to the matching audio frame units corresponding to the candidate song identifiers.
For example, frame extension may be performed on the matching audio frame unit to obtain the matching song segment on which the candidate song file and the audio file match, and then the target song identifier is acquired based on the matching song segment; that is, the step of "acquiring, from the candidate song identifier set, the target song identifier of the target song to which the episode belongs according to the matching audio frame unit corresponding to the candidate song identifier" may include:

performing audio frame extension on the matching audio frame unit corresponding to the candidate song identifier, to obtain the matching song segment corresponding to the candidate song identifier;

acquiring, according to the matching song segment, time information corresponding to the candidate song identifier, the time information including: a first start time of the matching song segment in the video, a second start time in the candidate song, and the duration of the matching song segment; and

acquiring, from the candidate song identifier set according to the time information corresponding to the candidate identifier, the target song identifier of the target song to which the episode belongs.

It can be understood that the first start time denotes the start time of the matching song segment in the video, to distinguish it from the start time of the matching song segment in the candidate song (that is, the second start time), and does not refer to a specific time.

The matching song segment corresponding to a candidate song identifier is the song segment on which the candidate song corresponding to that identifier and the audio file match; the matching song segment may be a song segment in the candidate song, or a song segment in the audio file. In the embodiments of the present invention, since the matching song segment consists of audio frames, after the matching song segment is acquired, its start time in the candidate song, its start time in the video, and its duration (that is, the length of the segment) can be acquired according to the audio frames in the segment.

For example, the start time of the segment in the candidate song may be acquired according to the indices of the segment's audio frames in the song, and the start time of the segment in the video may be acquired according to the indices of the segment's audio frames in the audio file.
In the embodiments of the present invention, there are multiple ways to perform frame extension on the matching audio frame unit, for example, performing frame extension in the candidate song file and in the audio file respectively; that is, the step of "performing audio frame extension on the matching audio frame unit corresponding to the candidate song identifier, to obtain the matching song segment corresponding to the candidate song identifier" may include:

performing audio frame extension on the matching audio frame unit in the candidate song file and in the audio file respectively, to obtain a first matching audio frame extension unit in the candidate song file and a second matching audio frame extension unit in the audio file;

matching the audio features of the first audio frames in the first matching audio frame extension unit with the audio features of the second audio frames in the second matching audio frame extension unit, to obtain the matching audio frames between the extension units; and

determining, according to the number of matching audio frames between the extension units, the first matching audio frame extension unit or the second matching audio frame extension unit as the matching song segment on which the candidate song and the audio file match.

In an implementation of the embodiments of the present invention, the frame extension may be performed synchronously in the candidate song file and the audio file, that is, with the same number of extended audio frames and the same direction.

There are multiple ways to determine the matching song segment according to the number of matching audio frames between the extension units; for example, when the number is greater than a certain preset number, the current extension unit is determined as the matching song segment; for another example, when the ratio of the number of matching audio frames to the total number of audio frames in the extension unit is greater than a preset ratio (such as 90%), the current extension unit is determined as the matching song segment.
In the case where the time information corresponding to the candidate song identifiers is acquired, the step of "acquiring the target song identifier from the candidate song identifier set according to the time information corresponding to the candidate identifier" may include:

acquiring, according to the second start time and the duration corresponding to the candidate song identifier, the play time corresponding to the candidate song identifier, the play time being the play time of the matching song segment in the video;

filtering the candidate song identifiers in the candidate song identifier set according to the play times corresponding to the candidate song identifiers, to obtain a filtered candidate identifier set; and

using the candidate song identifier in the filtered candidate identifier set as the target song identifier.

For example, after the play times corresponding to the candidate song identifiers are acquired, candidate song identifiers whose play times have a containment relationship may be determined, and the candidate song identifiers whose play times are contained are filtered out, that is, among the candidate song identifiers whose play times have a containment relationship, those with the shorter play times are filtered out; for example, the play time corresponding to song ID1 is the 1st to the 10th second, the play time corresponding to song ID2 is the 2nd to the 5th second, and the play time corresponding to song ID3 is the 3rd to the 8th second; in this case, the play times corresponding to song ID1, ID2, and ID3 have a containment relationship, so the song IDs with shorter play times can be filtered out, here song ID2 and ID3.

For another example, after the play times corresponding to the candidate song identifiers are acquired, candidate song identifiers whose play times have an overlapping relationship may be determined, and the candidate song identifiers with the shorter play durations are filtered out. For example, the play time corresponding to song ID1 is the 1st to the 10th second and the play time corresponding to song ID2 is the 5th to the 12th second; in this case, the song ID with the shorter play duration can be filtered out; here the play duration of song ID1 is 10 s and that of song ID2 is 7 s, so song ID2 is filtered out.
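The two filtering rules above (containment and overlap) can be sketched as follows (a minimal illustration; the function name and the dictionary-based interval representation are assumptions):

```python
def filter_by_play_time(play_times):
    """Sketch of the filtering rules above: drop a candidate whose play
    time is contained in another's, and of two overlapping candidates keep
    the one with the longer play duration. play_times maps song_id -> (s, e)."""
    kept = dict(play_times)
    for a, (sa, ea) in play_times.items():
        for b, (sb, eb) in play_times.items():
            if a == b or a not in kept or b not in kept:
                continue
            if sa <= sb and eb <= ea:            # b contained in a: drop b
                kept.pop(b, None)
            elif sb < ea and sa < eb:            # overlap: drop the shorter
                shorter = a if ea - sa < eb - sb else b
                kept.pop(shorter, None)
    return set(kept)

times = {"ID1": (1, 10), "ID2": (2, 5), "ID3": (5, 12)}
print(filter_by_play_time(times))  # → {'ID1'}
```

ID2 is contained in ID1's interval and dropped; ID3 overlaps ID1 with a shorter duration and is also dropped, leaving ID1 as the target.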
After acquiring the target song identifier, the embodiments of the present invention may use the song corresponding to the target song identifier as the target song to which the episode belongs.

In an implementation of the embodiments of the present invention, after the target song identifier corresponding to the video episode is acquired, the lyrics of the video episode may further be filled into the video, so that the lyrics of the video episode are displayed when the video episode is played; that is, after step 103, the method may further include:

filling, according to the target song identifier and its corresponding matching audio frame unit, the lyrics corresponding to the episode into the video.
For example, in the case where the matching audio frame unit is extended to obtain the matching song segment and its time information, the step of "filling, according to the target song identifier and its corresponding matching audio frame unit, the lyrics corresponding to the episode into the video" may include: acquiring, according to the target song identifier and its corresponding time information, the lyrics corresponding to the episode, and filling the lyrics into the video, where the time information is the time information of the matching song segment corresponding to the target song.

For example, the lyrics corresponding to the episode may be acquired according to the start time of the matching song segment corresponding to the target song identifier in the song and the duration of the matching song segment, and the lyrics filling may be performed according to the start time and duration of the matching song segment in the video; that is, the step of "acquiring, according to the target song identifier and its corresponding time information, the lyrics corresponding to the episode, and filling the lyrics into the video" may include:

acquiring, according to the target song identifier and its corresponding first start time and duration, the lyrics corresponding to the episode; and

filling the lyrics into the video according to the second start time and the duration corresponding to the target song identifier.
For example, the target lyric file of the corresponding target song may be acquired according to the target song identifier, and then the lyrics corresponding to the episode are extracted from the target lyric file according to the start time of the matching song segment in the target song and the duration of the matching song segment; that is, the step of "acquiring, according to the target song identifier and its corresponding first start time and duration, the lyrics corresponding to the episode" may include:

acquiring the lyric file of the corresponding target song according to the target song identifier; and

extracting the corresponding lyrics from the lyric file according to the first start time and the duration corresponding to the target song identifier, as the lyrics of the episode.

For example, if the target song identifier is song 1, the start time of the matching song segment corresponding to song 1 within song 1 is the 5th second, and the matching song segment is 10 s long, the lyrics of the 5th to 15th second can be acquired from the lyric file of song 1.
For another example, the step of "filling the lyrics into the video according to the second start time and the duration corresponding to the target song identifier" may include:

acquiring the display time of the lyrics in the video according to the second start time and the duration corresponding to the target song; and

filling the lyrics into the video according to the display time.

For example, if the second start time of the matching song segment corresponding to the target song identifier in the video is the 7th second and the duration of the matching song segment is 8 s, the display time of the lyrics in the video is obtained as the 7th to the 15th second; afterwards, the lyrics can be inserted at the corresponding position of the video based on the display time.
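The timing arithmetic in the examples above can be combined into one small sketch (purely illustrative; the helper name `lyric_window` is an assumption):

```python
def lyric_window(song_start, video_start, duration):
    """Sketch: the lyrics to extract span [song_start, song_start + duration]
    in the target song's lyric file, and are displayed in the video over
    [video_start, video_start + duration]."""
    extract = (song_start, song_start + duration)
    display = (video_start, video_start + duration)
    return extract, display

# Example values from the text: the segment starts at 5 s in the song,
# 7 s in the video, and lasts 8 s.
print(lyric_window(song_start=5, video_start=7, duration=8))
# → ((5, 13), (7, 15))
```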
In an implementation of the embodiments of the present invention, to display episode lyrics as complete sentences and improve the user experience, after the episode lyrics are acquired, it may be determined whether the lyrics form a complete sentence, and if so, the lyrics filling operation is performed; that is, after the step of "acquiring the lyrics corresponding to the episode" and before the step of "filling the lyrics into the video", the method may further include:

determining whether the lyrics form a complete sentence; and

if so, performing the step of filling the lyrics into the video according to the second start time and the duration corresponding to the target song identifier.
In an implementation of the embodiments of the present invention, to improve the user experience, an interface may further be provided in the video, so that when the video episode is played, the terminal can jump through the interface to play the song to which the video episode belongs; that is, after the step of "acquiring the target song identifier of the episode", the method may further include:

setting a jump interface in the video according to the target song identifier, so that when playing the episode, the terminal jumps through the jump interface to play the target song to which the episode belongs.

The jump interface may take various forms, such as a button or an input box, and may be set according to actual needs.
In an implementation of the embodiments of the present invention, to improve the user experience, an interface may likewise be provided in the video, so that when the video episode is played, the target song to which the video episode belongs can be added through the interface to the song list of music software; that is, after the step of "acquiring the target song identifier of the target song to which the episode belongs", the method may further include:

setting an add interface in the video according to the target song identifier, so that when playing the episode, the terminal adds the target song to the song list of the music software through the add interface.

The add interface may take various forms, such as a button or an input box, and may be set according to actual needs; the music software may be commonly used music playing software, such as cloud-based music playing software or online music playing software, and the song list may be a playlist or song play queue, such as a favorites playlist.
As can be seen from the above, the embodiments of the present invention extract the audio file in the video and acquire the candidate song identifiers of the candidate songs to which the episode in the audio file belongs, to obtain a candidate song identifier set; then acquire the candidate song file corresponding to a candidate song identifier, and acquire the matching audio frames between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames; and according to the matching audio frame unit corresponding to the candidate song identifier, acquire, from the candidate song identifier set, the target song identifier of the target song to which the episode belongs, and determine, according to the target song identifier, the target song to which the episode belongs. This scheme may first obtain the candidate song identifier set of the candidate songs to which the video episode belongs, and then select, from the candidate song identifier set, the identifier of the song to which the video episode belongs based on the matching audio frames between the video's audio file and the songs, thereby determining the song to which the video episode belongs; compared with the related art, this can improve the accuracy and efficiency of determining or locating the song corresponding to the video episode.

In addition, in the embodiments of the present invention, after the song to which the video episode belongs is determined, the lyrics corresponding to the episode are filled into the video according to the target song identifier and its corresponding matching audio frame unit; this scheme can also automatically complete the matching between the video episode and the songs to determine the song to which the video episode belongs, and can automatically acquire the lyrics of the video episode for filling; compared with the related art, this can also improve the accuracy and efficiency of filling the lyrics of the video episode.
本发明实施例将在前述记载的歌曲确定方法的基础上,作进一步说明。
由本发明实施例前述的记载可知,可以基于视频中音频文件和候选歌曲文件之间的音频指纹匹配获取候选歌曲标识,本发明实施例中将着重介绍基于音频指纹匹配获取候选歌曲标识的过程,参考图2a,该获取候选歌曲标识的流程如下:
步骤201、将该音频文件划分成多个音频段,并获取该音频段的音频指纹。
例如,音频文件的划分方式可以有多种,比如,可以以预设帧长和预设帧移,将音频文件划分成多个音频段,每个音频段的时长与预设帧长相等,也即步骤“将该音频文件划分成多个音频段”可以包括:
将音频文件转换成相应格式的音频;
以预设帧长和预设帧移,将音频文件划分成多个音频段。
例如,可以先将音频文件转换成8k16bit(即8kHz采样率、16比特量化)的脉冲编码调制(PCM,Pulse Code Modulation)音频,然后,以10秒为帧长,1秒为帧移,将其分割为多个以10秒钟为一段的小音频段,如,在每帧时长为1s时,将第一帧至第十帧划分成一个音频段,将第二帧至第十一帧划分成一个音频段。具体实施时可以根据实际需求选取合适的划分方式。
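上述按预设帧长和预设帧移划分音频段的方式,可以用如下Python草图示意(8kHz采样率、10s帧长、1s帧移取自上文举例,split_segments的函数名与入参形式为本示例的假设):

```python
def split_segments(samples, sr=8000, seg_len_s=10, hop_s=1):
    """以预设帧长(seg_len_s)和预设帧移(hop_s)将PCM采样序列
    划分成多个相互重叠的音频段,每段时长与预设帧长相等。"""
    seg_len, hop = seg_len_s * sr, hop_s * sr
    return [samples[start:start + seg_len]
            for start in range(0, len(samples) - seg_len + 1, hop)]

pcm = list(range(8000 * 12))       # 假设为12秒的8kHz单声道PCM采样
segs = split_segments(pcm)
print(len(segs), len(segs[0]))     # 3段,每段10秒即80000个采样点
```

12秒的音频按10秒帧长、1秒帧移可划出起点为第0、1、2秒的三个相互重叠的音频段。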
本发明实施例中音频指纹可以为多种,为了减少计算量,加快歌词填充速度,选用音频小指纹,该音频小指纹为一种数据结构,其可以由频谱上频谱峰值点组合而成,例如,获取音频的音频帧对应的频谱,然后,提取音频帧对应的频谱峰值点,从而得到该音频对应的频谱峰值点,然后,将集合中峰值点两两组合得到音频指纹;也即步骤“获取该音频段的音频指纹”可以包括:
获取该音频段中音频帧对应的频谱;
从该频谱中提取该音频帧对应的频谱峰值点,得到该音频段对应的峰值集合,该峰值集合包括该音频帧对应的频谱峰值点;
将该峰值集合中频谱峰值点两两进行组合,得到该音频段的音频指纹。
在本发明实施例一实施方式中,步骤“将该峰值集合中频谱峰值点两两进行组合,得到该音频段的音频指纹”可以包括:
确定与该频谱峰值点相组合的目标频谱峰值点;
将该频谱峰值点与该目标频谱峰值点进行组合,得到音频段的音频指纹,该音频指纹包括:该频谱峰值点对应的频率、该频谱峰值点与该目标频谱峰值点之间的时间差和频率差。
其中,与该频谱峰值点相组合的目标频谱峰值点,可以为除了该频谱峰值点以外的频谱峰值点;比如,在获取音频段对应的峰值集合之后,根据峰值集合生成频率峰值点分布图,然后,可以在频率峰值点分布图中确定某个频率峰值点(也称为锚点)对应的目标区域,该目标区域包括:与该频率峰值点相组合的目标频率峰值点,接着,将该锚点与目标区域中的目标频率峰值点进行组合,组合之后,可以得到多个音频指纹。
例如,将某个频谱峰值点a1(t1,f1)与目标区域内的目标频谱峰值点a2(t2,f2)进行组合构成音频指纹D1(f1,Δf’,Δt’),其中,Δf’=f2-f1,Δt’=t2-t1,该t1可以为音频指纹D1在该音频段内的偏移时间;同样将该频谱峰值点a1(t1,f1)分别与目标区域内的目标频谱峰值点a3(t3,f3)、a4(t4,f4)进行组合,可以得到音频指纹D2(f1,Δf”,Δt”)、D3(f1,Δf”’,Δt”’),其中,Δf”=f3-f1,Δt”=t3-t1,Δf”’=f4-f1,Δt”’=t4-t1,以此类推,可以得到音频段对应的音频指纹集合。
其中,频率峰值点分布图的横轴为时间,纵轴为峰值点的频率,由于音频帧与时间具有对应关系,为了快速获取音频指纹,本发明实施例中可以用音频帧序号来表示时间,此外,还可以用频带索引号来表示峰值点的频率,索引号的范围可以为(0~255),即上述峰值点t和f分别可以用音频帧序号和频带索引号来表示。此时,目标区域即可用音频帧序号和频带索引号来表示,比如,目标区域可以由时间区域和频域区域构成,其中,时间区域可以为(15~63)帧(时间差用6bit表示),频域区域可以为(-31~31)个频带(频带差用6bit表示),该目标区域的大小可以根据实际需求设定,为了节省资源加快获取指纹速度,在本发明实施例一实施方式中,该目标区域中仅可以包括三个目标频谱峰值点,即锚点对应的目标频谱峰值点的个数为3。
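上述“锚点+目标区域”的两两组合过程可以用如下Python草图示意(时间差15~63帧、频带差-31~31、每个锚点最多3个目标峰值点均取自上文;peaks_to_fingerprints的函数名及示例数据为本示例的假设):

```python
def peaks_to_fingerprints(peaks, max_targets=3):
    """将频谱峰值点(帧序号, 频带索引)两两组合为音频指纹。
    每个指纹为 (f1, Δf, Δt, t1):锚点频带、频带差、时间差、锚点帧序号;
    目标区域取时间差15~63帧、频带差-31~31,每个锚点最多组合3个目标点。"""
    peaks = sorted(peaks)                      # 按帧序号升序排列
    fingerprints = []
    for i, (t1, f1) in enumerate(peaks):
        targets = 0
        for t2, f2 in peaks[i + 1:]:
            dt, df = t2 - t1, f2 - f1
            if 15 <= dt <= 63 and -31 <= df <= 31:
                fingerprints.append((f1, df, dt, t1))
                targets += 1
                if targets == max_targets:     # 锚点最多对应3个目标峰值点
                    break
    return fingerprints

peaks = [(0, 100), (20, 110), (30, 90), (40, 130), (200, 50)]
print(peaks_to_fingerprints(peaks))
```

帧序号为0的锚点与落入目标区域的三个峰值点各组成一个指纹,帧序号为200的峰值点因时间差超出范围不参与组合。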
在本发明实施例一实施方式中,为了能够精确地提取音频指纹,需要保证频谱峰值点分布均匀,因此,需要对音频段的峰值集合进行峰值点过滤,例如,过滤掉存在相互影响的峰值点,比如,由于人对声音感知时,短时谱峰值频率点之间是相互影响,一个频率分量可能掩蔽与其相近的频率分量(即所谓的听觉掩蔽效应),所以要将时间间距较小,且频率间距较小的峰值点过滤掉,以保证选取的峰值点沿时间和频率轴分布比较均匀;即在步骤“得到音频段对应的峰值集合”之后,步骤“将该峰值集合中频谱峰值点两两进行组合”之前,该歌曲确定方法还可以包括:
根据频谱峰值点之间的时间差以及频率差,对峰值集合中频谱峰值点进行过滤。
参考图2b,为某个音频的峰值集合对应的频谱峰值点分布图,为了使得频谱峰值点分布均匀,可以对该峰值集合中峰值点进行过滤,过滤后的峰值集合对应的频谱峰值点分布参考图2c。
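上述峰值点过滤可以用一个贪心式的Python草图示意(filter_peaks的函数名、时间与频率间距阈值均为本示例的假设,并非本方案限定的取值):

```python
def filter_peaks(peaks, min_dt=10, min_df=5):
    """过滤时间间距和频率间距都较小的频谱峰值点(对应听觉掩蔽效应),
    使保留的峰值点沿时间轴和频率轴分布较均匀;阈值为示意性假设值。"""
    kept = []
    for t, f in sorted(peaks):
        # 与每个已保留峰值点在时间或频率上拉开足够距离时才保留
        if all(abs(t - t0) >= min_dt or abs(f - f0) >= min_df
               for t0, f0 in kept):
            kept.append((t, f))
    return kept

peaks = [(0, 100), (3, 102), (3, 140), (30, 101)]
print(filter_peaks(peaks))   # (3, 102)因与(0, 100)时间、频率间距都较小而被过滤
```

只有在时间和频率上同时靠得很近的峰值点才被滤除,仅时间相近(如(3, 140))或仅频率相近(如(30, 101))的峰值点仍被保留。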
为了与本发明实施例前述记载的音频指纹(即音频特征)进行区别,比如,可以基于音频指纹的大小进行区别,该实施例一中的音频特征可以称为大音频指纹,本发明实施例基于图2a的记载中音频段的音频指纹可以称为小音频指纹。
步骤202、确定预设样本集合中是否存在与该音频指纹匹配的指纹样本,若是,执行步骤203,若否,结束流程。
其中,预设样本集合可以包括至少一种指纹样本,该预设样本集合中每一种指纹样本为一首歌曲的音频指纹;例如,预设样本集合可以包括多种指纹样本,每一种指纹样本可以对应一种歌曲ID,例如,指纹样本1对应歌曲1、指纹样本2对应歌曲2……指纹样本n对应歌曲n。
例如,可以获取音频段的多个音频指纹,然后,确定预设样本集合中是否存在与每个音频指纹相匹配(即相同)的指纹样本,得到多个匹配指纹样本,然后,获取每个匹配指纹样本对应的歌曲标识,以得到歌曲标识集合,该歌曲标识集合包括多个该歌曲标识。比如,对于某个音频段,在该音频段对应的音频指纹包括音频指纹D1、音频指纹D2时,将该音频段的音频指纹D1分别与预设样本集合中指纹样本一一比较,若有与音频指纹D1相同的指纹样本,则确定预设样本集合中存在与该音频指纹D1相匹配的指纹样本,同样,可以将音频指纹D2分别与预设样本集合中指纹样本一一比较,若有与音频指纹D2相同的指纹样本,则确定预设样本集合中存在与该音频指纹D2相匹配的指纹样本。
本发明实施例中,可以从歌曲数据库中提取歌曲,然后,提取该歌曲的音频指纹作为指纹样本,其中,提取歌曲的音频指纹的方式也可以采用上述音频段的音频指纹获取方式,即可以获取歌曲中音频帧对应的频谱,然后,提取频谱峰值点,并对频谱峰值点两两组合,以得到歌曲的音频指纹(即指纹样本),该歌曲可以从某个歌曲数据库中提取;也即在步骤201之前,该歌曲确定方法还可以包括:
从歌曲数据库中获取歌曲;
获取歌曲对应的音频指纹,并将该歌曲对应的音频指纹作为指纹样本,以得到预设样本集合。
步骤203、获取匹配指纹样本对应的歌曲标识,以得到该音频段对应的第一歌曲标识集合,该第一歌曲标识集合包括多个该歌曲标识。
其中,获取匹配指纹样本对应的歌曲标识的方式可以有多种,比如,可以采用映射关系集合来获取匹配指纹样本对应的歌曲标识,该映射关系集合可以包括指纹样本与歌曲标识之间的映射关系(即对应关系),也即步骤“获取匹配指纹样本对应的歌曲标识”具体可以包括:
基于映射关系集合获取该匹配指纹样本对应的歌曲标识,该映射关系集合包括指纹样本与歌曲标识之间的映射关系。
其中,该映射关系集合可以为预置的映射关系集合,该指纹样本与歌曲标识之间的映射关系可以由系统预先设置,也可以由用户自行进行设置;也即在步骤“提取视频中的音频文件”之前,该歌曲确定方法还可以包括:
接收映射关系设置请求,该映射关系设置请求指示需要建立映射关系的指纹样本和歌曲标识;
根据该映射关系设置请求建立指纹样本与歌曲标识之间的映射关系,以得到映射关系集合。
本发明实施例中映射关系集合可以以表格的形式呈现,称为映射关系表,该映射关系表可以包括:预设样本集合,以及预设样本集合中指纹样本对应的歌曲标识,其中,该映射关系表可以存储在某个数据库,可称为指纹库。
步骤204、从该歌曲标识集合中,选取该插曲所属候选歌曲的候选歌曲标识。
在本发明实施例获取音频段对应的歌曲标识集合之后,还需要进一步地作筛选,获取最有可能与音频段匹配的歌曲标识;由于最有可能与音频段匹配的歌曲(即插曲所属的候选歌曲)与音频段在歌曲标识对应的歌曲中的起始时间相关,因此,可以基于音频段在歌曲中的起始时间从歌曲标识集合中选取插曲所属候选歌曲的候选歌曲标识;也即步骤“获取音频指纹”之后,步骤“从歌曲标识集合中选取候选歌曲标识”之前,该方法还可以包括:获取该音频指纹在该音频段中的第一偏移时间、以及该匹配指纹样本在匹配歌曲中的第二偏移时间,其中,该第一偏移时间为该频谱峰值点在该音频段内的时间,该匹配歌曲为该歌曲标识对应的歌曲;
此时,步骤“从该歌曲标识集合中,选取该插曲所属候选歌曲的候选歌曲标识”可以包括:
根据该第一偏移时间和该第二偏移时间,获取该音频段在该匹配歌曲中的起始时间;
根据该音频段在匹配歌曲中的起始时间,从该歌曲标识集合中选取该候选歌曲标识。
比如,可以获取音频指纹D1(f1,Δf’,Δt’)在音频段内的偏移时间t1,该t1即为频谱峰值点a1在音频段中的时间,同样在采用上述方式提取指纹样本时,该指纹样本在其所属歌曲中的偏移时间,即为指纹样本对应的频谱峰值点(即锚点)在其所属歌曲中的时间。
例如,本发明实施例中可以基于预设时间映射关系集合来获取匹配指纹样本在匹配歌曲中的偏移时间,该预设时间映射关系集合可以包括:指纹样本与该指纹样本在其所属歌曲中的偏移时间之间的映射关系(对应关系),也即步骤“获取该匹配指纹样本在匹配歌曲中的第二偏移时间”可以包括:
根据预设时间映射关系集合,获取匹配指纹样本在该歌曲标识对应的匹配歌曲中的第二偏移时间,其中,预设时间映射关系集合包括:指纹样本与该指纹样本在其所属歌曲中的偏移时间之间的映射关系。
其中,该预设时间映射关系集合可以为预置的时间映射关系集合,该指纹样本与偏移时间之间的映射关系可以由系统预先设置,也可以由用户自行进行设置;也即在步骤“提取视频中的音频文件”之前,该歌曲确定方法还可以包括:
接收时间映射关系设置请求,该时间映射关系设置请求指示需要建立映射关系的指纹样本和偏移时间,该偏移时间为该指纹样本在其所属歌曲中的偏移时间;
根据该时间映射关系设置请求建立指纹样本与偏移时间之间的映射关系,以得到时间映射关系集合。
本发明实施例中时间映射关系集合可以以表格的形式呈现,称为时间映射关系表,该映射关系表可以包括:预设样本集合,以及预设样本集合中指纹样本对应的偏移时间。
在本发明实施例一实施方式中,为方便获取歌曲标识和偏移时间,时间映射关系集合与上述映射关系集合设置在同一个映射关系集合,比如,设置一个总映射关系集合,该集合可以包括:指纹样本与歌曲标识之间的映射关系,和指纹样本与偏移时间之间的映射关系,例如,可以设置一张总映射关系表,该关系表可以包括:预设样本集合、预设样本集合中指纹样本对应的歌曲标识、预设样本集合中指纹样本对应偏移时间。
实际应用中,如果音频段在多个不同歌曲中的起始时间相同时,表明该多个歌曲最有可能是与音频段匹配的歌曲即视频插曲所属的候选歌曲,也即步骤“根据该音频段在匹配歌曲中的起始时间,从该歌曲标识集合中选取该候选歌曲标识”可以包括:
获取该歌曲标识集合中歌曲标识对应的起始时间,以得到时间集合;
根据该起始时间的相同个数从该时间集合中确定目标起始时间;
从歌曲标识集合中选取该目标起始时间对应的歌曲标识作为候选歌曲标识。
比如,可以选取相同个数达到预设个数的起始时间作为目标起始时间,也即步骤“根据该起始时间的相同个数从该时间集合中确定目标起始时间”可以包括:
获取该时间集合中每种该起始时间的个数;
判断该个数是否大于预设个数;
若是,则确定该种起始时间为目标起始时间。
其中,预设个数可以根据实际需求设定,比如,可以为5、6、9等等。
本发明实施例中,音频段在歌曲中的起始时间可以根据该音频指纹对应的偏移时间、以及该歌曲标识集合中该歌曲标识对应的偏移时间得到,例如,可以计算歌曲标识对应的偏移时间与音频指纹对应的偏移时间之间的时间差,该时间差即为该音频段在该歌曲中的起始时间。例如,音频段的音频指纹对应的偏移时间为t’,匹配指纹样本对应的偏移时间(即歌曲标识对应的偏移时间)为t”,此时,音频段在该歌曲标识对应的歌曲中的起始时间(也即该歌曲标识对应的起始时间)为Δt=t”-t’,采用此方式可以计算歌曲标识集合中每个歌曲标识对应的起始时间Δt,得到时间集合,比如(Δt1、Δt2、Δt1、Δt1、Δt2、Δt3……Δt3……Δtn)。
在得到时间集合之后,可以获取每种起始时间的个数,然后,判断该个数是否大于预设个数,若是,则确定该种起始时间为目标起始时间;比如,在预设个数为8时,统计Δt1的个数为10、Δt2的个数为6,Δt3的个数为12,此时Δt1的个数大于预设个数,Δt2的个数小于预设个数,Δt3的个数大于预设个数,那么可以确定Δt1和Δt3为目标起始时间。
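上述“按起始时间Δt=t”-t’计票、取票数超过预设个数者”的过程可以用如下Python草图示意(select_candidates的函数名、matches的数据形式及song1/song3等标识均为本示例的假设):

```python
from collections import Counter

def select_candidates(matches, min_votes=8):
    """matches: (歌曲标识, 指纹样本偏移t'', 音频指纹偏移t') 列表。
    对每个歌曲标识统计起始时间Δt=t''-t'的相同个数,
    个数大于预设个数的 (歌曲标识, Δt) 即为候选歌曲标识及其目标起始时间。"""
    votes = Counter((song_id, t2 - t1) for song_id, t2, t1 in matches)
    return [(song_id, dt) for (song_id, dt), n in votes.items() if n > min_votes]

matches = ([("song1", 35, 5)] * 10      # Δt1=30,个数10
           + [("song1", 45, 5)] * 6     # Δt2=40,个数6
           + [("song3", 17, 5)] * 12)   # Δt3=12,个数12
print(select_candidates(matches))       # 个数大于预设个数8的Δt1、Δt3被选中
```

与上文例子一致:个数为10的Δt1和个数为12的Δt3超过预设个数8而被确定为目标起始时间,个数为6的Δt2被舍弃。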
在本发明实施例一实施方式中,为提高音频指纹的匹配速度,可以对音频指纹进行转换,比如,采用预设算法将音频指纹转换成一个具体的特征数字,命名为哈希值(hash_key)。例如,对于音频指纹D1(f1,Δf’,Δt’),可以采用公式:hash_key=f1·2^12+Δf·2^6+Δt,“^”为指数运算符,将其转换成一个具体的数字,即按位高低构成一个20bit整数,这样在后续进行音频指纹匹配时只需进行hash_key匹配即可,也即步骤“确定预设样本集合中是否存在与该音频指纹匹配的指纹样本”可以包括:
将该音频指纹转换成相应的特征数字;
确定预设数字集合中是否存在与该特征数字匹配的数字样本;
若是,则确定预设样本集合中存在与该音频指纹匹配的指纹样本;
若否,则确定预设样本集合中不存在与该音频指纹匹配的指纹样本。
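上述公式 hash_key=f1·2^12+Δf·2^6+Δt(拼成20bit整数)可以用如下Python草图示意;其中对负的频带差采用6bit二进制补码存放,这一处理细节是本示例的假设,并非本方案限定的实现:

```python
def to_hash_key(f1, df, dt):
    """按 hash_key = f1·2^12 + Δf·2^6 + Δt 将音频指纹拼成20bit整数。
    Δf可为负(-31~31),这里以6bit二进制补码存放(实现细节为本示例假设)。"""
    return (f1 << 12) | ((df & 0x3F) << 6) | (dt & 0x3F)

def from_hash_key(key):
    """反向拆出 (f1, Δf, Δt),用于校验转换是否可逆。"""
    f1 = key >> 12
    df = (key >> 6) & 0x3F
    if df >= 32:               # 恢复负的频带差
        df -= 64
    dt = key & 0x3F
    return f1, df, dt

key = to_hash_key(100, -10, 20)
print(key < 2 ** 20, from_hash_key(key))   # True (100, -10, 20)
```

f1用8bit、Δf和Δt各用6bit,按位高低拼接正好落在20bit以内,后续匹配只需比较整数hash_key是否相等。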
其中,预设数字样本集合中包括至少一种特征数字,称为数字样本,每一种数字样本可以对应一种歌曲标识。
此时,步骤“获取匹配指纹样本对应的歌曲标识”可以包括:获取匹配数字样本对应的歌曲标识。
例如,可以基于数字映射关系集合来获取匹配数字样本对应的歌曲标识,也即步骤“获取匹配数字样本对应的歌曲标识”可以包括:根据数字映射关系集合获取匹配数字样本对应的歌曲标识,其中,该数字映射关系集合包括:数字样本与歌曲标识之间的对应关系。
其中,该数字映射关系集合可以为预置的数字映射关系集合,该数字样本与歌曲标识之间的映射关系可以由系统预先设置,也可以由用户自行进行设置;也即在步骤“提取视频中的音频文件”之前,该歌曲确定方法还可以包括:
获取歌曲的音频指纹,并将该音频指纹转换成数字特征;
接收数字映射关系设置请求,该数字映射关系设置请求指示需要建立映射关系的数字特征和歌曲标识;
根据该数字映射关系设置请求建立数字特征与歌曲标识之间的映射关系,得到数字映射关系集合。
同样,步骤“获取该匹配指纹样本在匹配歌曲中的第二偏移时间”可以包括:根据数字时间映射关系集合获取匹配数字样本对应的第二偏移时间,其中,数字时间映射关系集合包括数字样本与偏移时间之间的映射关系。例如,数字时间映射关系集合的获取方式可以参考上述数字映射关系集合或者时间映射关系集合的创建方式,这里就不再赘述。
在本发明实施例一实施方式中,该数字映射关系集合、该数字时间映射关系集合可以设置在一个集合中,比如,设置一个总映射关系集合,该集合包括:数字样本与歌曲标识之间的映射关系、数字样本与偏移时间之间的映射关系;例如,可以设置一个映射关系表,该映射关系表可以包括:预设数字样本集合、预设数字样本集合中数字样本对应的歌曲标识、预设数字样本集合中数字样本对应的偏移时间。
例如,可以从歌曲数据库中获取歌曲,然后,获取歌曲的音频指纹及其对应的偏移时间,将音频指纹转换成特征数字hash_key,之后可以创建一张hash表,该hash表包括多个hash记录,每个hash记录包括:{hash_key}:(value),其中,hash_key=f1·2^12+Δf·2^6+Δt(按位高低构成一个20bit整数),value={song_id:t1},表示成32bit数字,其中song_id占用19bit(可表示52万首歌曲),t1占用13bit(如果帧移为0.032s,可表示最长歌曲长度为5min)。
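上述“song_id占19bit、t1占13bit、拼成32bit的value”的打包方式可以用如下Python草图示意(pack_value等函数名及示例数值为本示例的假设,hash表用Python字典简化表示):

```python
def pack_value(song_id, t1):
    """将hash记录的value打包成32bit数字:song_id占19bit,t1占13bit。"""
    assert 0 <= song_id < (1 << 19) and 0 <= t1 < (1 << 13)
    return (song_id << 13) | t1

def unpack_value(value):
    """从32bit的value中拆出 (song_id, t1)。"""
    return value >> 13, value & 0x1FFF

fingerprint_table = {}             # {hash_key: [value, ...]} 的简化hash表
fingerprint_table.setdefault(413076, []).append(pack_value(519999, 5000))
print(unpack_value(fingerprint_table[413076][0]))   # (519999, 5000)
```

19bit最多可表示2^19=524288个song_id(即文中所说约52万首歌曲),13bit最多可表示8192个帧移位置。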
由上可知,本发明实施例采用将该音频文件划分成多个音频段,并获取该音频段的音频指纹,然后,确定预设样本集合中是否存在与该音频指纹匹配的指纹样本,若是,则获取匹配指纹样本对应的歌曲标识,得到该音频段对应的第一歌曲标识集合,从该歌曲标识集合中,选取该插曲所属候选歌曲的候选歌曲标识;该方案可以获取视频插曲所属的所有候选歌曲,然后,基于候选歌曲与视频的音频的匹配从该候选歌曲中确定视频插曲对应的歌曲,与相关技术相比,可以提高确定视频插曲对应歌曲的精确性和效率。
此外,由于本发明实施例采用频谱峰值点来构建音频指纹,可以精确地获取视频插曲对应的候选歌曲及其标识,进一步提高了确定或者定位视频插曲所属候选歌曲的准确性。
本发明实施例还提供一种歌曲确定装置,如图3a所示,该歌曲确定装置可以包括标识获取单元301、音频帧获取单元302以及歌曲确定单元303,如下:
(1)标识获取单元301;
标识获取单元301,配置为提取视频中的音频文件,并获取该音频文件中插曲所属候选歌曲的候选歌曲标识,得到候选歌曲标识集合。
比如,该标识获取单元301可以包括:音频提取子单元、指纹获取子单元、确定子单元、标识集合获取子单元以及选取子单元;
该音频提取子单元,配置为提取视频中的音频文件;
该指纹获取子单元,配置为将该音频文件划分成多个音频段,并获取该音频段的音频指纹;
该确定子单元,配置为确定预设样本集合中是否存在与该音频指纹匹配的指纹样本;
该标识集合获取子单元,配置为在确定存在与该音频指纹匹配的指纹样本时,获取匹配指纹样本对应的歌曲标识,得到该音频段对应的歌曲标识集合,该歌曲标识集合包括多个该歌曲标识;
该选取子单元,配置为从该歌曲标识集合中,选取该插曲所属候选歌曲的候选歌曲标识。
其中,获取视频的方式可以有多种,比如,可以向视频服务器发送请求来获取视频,也可以从本地存储中提取视频;也即音频提取子单元可以具体配置为:
向视频服务器发送视频获取请求;
接收该视频服务器根据该视频获取请求返回的视频;
提取该视频中的音频文件。
该提取视频中的音频文件的方式可以有多种,比如,可以对视频进行音视频分离处理,以得到视频的音频文件;即步骤“提取视频中的音频文件”可以包括:对视频进行音视频分离处理,以得到视频的音频文件。
例如,该音频文件的划分方式可以有多种,比如,可以以预设帧长和预设帧移,将音频文件划分成多个音频段,每个音频段的时长与预设帧长相等。
在本发明实施例中,插曲所属的候选歌曲可以为可能与视频插曲相匹配的歌曲,该候选歌曲标识为与视频插曲匹配的歌曲的标识。
例如,获取音频段的音频指纹的方式也有多种,比如可以采用以下方式获取:
获取该音频段中音频帧对应的频谱;
从该频谱中提取该音频帧对应的频谱峰值点,以得到该音频段对应的峰值集合,该峰值集合包括该音频帧对应的频谱峰值点;
将该峰值集合中频谱峰值点两两进行组合,以得到该音频段的音频指纹。
比如,步骤“将该峰值集合中频谱峰值点两两进行组合,以得到该音频段的音频指纹”可以包括:
确定与该频谱峰值点相组合的目标频谱峰值点;
将该频谱峰值点与该目标频谱峰值点进行组合,以得到该音频段的音频指纹,该音频指纹包括:该频谱峰值点对应的频率、该频谱峰值点与该目标频谱峰值点之间的时间差和频率差。
在本发明实施例一实施方式中,从歌曲标识集合中选取候选歌曲标识的方式可以有多种,比如,可以基于音频指纹的偏移时间来获取,也即,该歌曲确定装置还可以包括:偏移时间获取单元,该偏移时间获取单元,配置为在指纹获取子单元获取音频指纹之后,选取子单元选取候选歌曲标识之前,获取该音频指纹在该音频段中的第一偏移时间、以及该匹配指纹样本在匹配歌曲中的第二偏移时间,其中,该第一偏移时间为该频谱峰值点在该音频段内的时间,该匹配歌曲为该歌曲标识对应的歌曲;
此时,选取子单元,可以具体配置为:
根据该第一偏移时间和该第二偏移时间,获取该音频段在该匹配歌曲中的起始时间;
根据该音频段在匹配歌曲中的起始时间,从该歌曲标识集合中选取该候选歌曲标识。
比如,选取子单元具体配置为:
获取该歌曲标识集合中歌曲标识对应的起始时间,以得到时间集合;
根据每种该起始时间的个数从该时间集合中确定目标起始时间;
从歌曲标识集合中选取该目标起始时间对应的歌曲标识作为候选歌曲标识。
(2)、音频帧获取单元302;
该音频帧获取单元302,配置为获取候选歌曲标识对应的候选歌曲文件,并获取该候选歌曲文件与该音频文件之间相匹配的匹配音频帧,以得到匹配音频帧单元,其中,该匹配音频帧单元包括多个连续的匹配音频帧。
比如,该音频帧获取单元302,可以具体包括:匹配子单元、第一获取子单元以及第二获取子单元;
该匹配子单元,配置为将该候选歌曲文件中第一音频帧的音频特征与该音频文件中第二音频帧的音频特征进行匹配,以得到匹配结果;
该第一获取子单元,配置为根据该匹配结果获取该候选歌曲文件与该音频文件之间相匹配的匹配音频帧;
该第二获取子单元,配置为根据该匹配音频帧获取匹配音频帧单元。
其中,该匹配子单元,具体配置为:
获取该候选歌曲文件中第一音频帧的帧数,从该音频文件中选取音频帧单元,该音频帧单元包括与该帧数相等数量的第二音频帧;
将该候选歌曲文件中第一音频帧的音频特征与该音频帧单元中第二音频帧的音频特征进行匹配,得到音频特征匹配结果;
此时,该第一获取子单元,具体配置为:根据该音频特征匹配结果获取该候选歌曲文件与该音频文件之间相匹配的匹配音频帧,该匹配音频帧为音频特征匹配成功的音频帧;
该第二获取子单元,具体配置为:
根据匹配音频帧获取帧连续单元,该帧连续单元包括多个连续的该匹配音频帧;
获取帧连续单元中匹配音频帧的个数,并根据该个数确定该帧连续单元为匹配音频帧单元。
在本发明实施例一实施方式中,本发明实施例歌曲确定装置还可以包括:特征获取单元,该特征获取单元在标识获取单元301获取候选歌曲标识之后,匹配子单元进行特征匹配之前,配置为获取该候选歌曲文件中第一音频帧对应的音频特征。
比如,该特征获取单元,可以具体配置为:
获取该候选歌曲文件中每个第一音频帧对应的频谱;
将该第一音频帧对应的频谱划分成预设数量的频段,并获取该频段对应的平均幅值;
将每个该频段的平均幅值与上一个第一音频帧对应频段的平均幅值进行比较,得到比较结果;
根据该比较结果获取该第一音频帧对应的音频特征。
例如,将候选歌曲文件转换成预设格式的音频(如8k16bit音频),然后,以第一预设数量的采样点为一帧,以第二预设数量的采样点为帧移进行傅立叶变换,得到频谱(如以1856个采样点为一帧,以58个采样点为帧移进行傅立叶变换),接着,将该频谱平均分成第三预设数量(如32个)的频段,并计算每个频段对应的平均幅度值,随后,将每个频段与上一个帧中对应频段进行比较(第二音频帧中第一个频段与第一个音频帧第一个频段进行比较,第二音频帧中第二个频段与第一个音频帧第二个频段进行比较,以此类推直到比较完所有频段),若大于则为1,小于则为0,这样每一个帧将会得到第三预设数量个bit值组成的数据单元,该数据单元即为该帧的音频特征;例如,在将频谱划分成32个频段的情况下,每一音频帧将会得到一个包括32个bit值的数据单元,该32个bit值即为每一音频帧的音频特征。
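上述“分频段求平均幅值、与上一帧逐频段比较得到bit串”的音频特征提取过程,可以用如下Python草图示意(为便于演示这里直接以已算好的幅度谱为输入、只分4个频段,band_averages等函数名为本示例的假设):

```python
def band_averages(spectrum, n_bands=32):
    """将一帧幅度谱平均分成n_bands个频段,计算每个频段的平均幅值。"""
    size = len(spectrum) // n_bands
    return [sum(spectrum[i * size:(i + 1) * size]) / size
            for i in range(n_bands)]

def frame_features(spectra, n_bands=32):
    """相邻帧逐频段比较平均幅值:大于上一帧对应频段记1,否则记0,
    每帧得到由n_bands个bit组成的整数音频特征。"""
    features = []
    prev = band_averages(spectra[0], n_bands)
    for spectrum in spectra[1:]:
        cur = band_averages(spectrum, n_bands)
        bits = 0
        for b in range(n_bands):
            bits = (bits << 1) | (1 if cur[b] > prev[b] else 0)
        features.append(bits)
        prev = cur
    return features

spectra = [[1, 1, 2, 2, 3, 3, 4, 4],    # 第一帧各频段平均幅值:1,2,3,4
           [2, 2, 1, 1, 4, 4, 3, 3]]    # 第二帧各频段平均幅值:2,1,4,3
print(frame_features(spectra, n_bands=4))   # [0b1010] 即 [10]
```

第二帧相对第一帧各频段的大小关系为“大、小、大、小”,故得到bit串1010;实际方案中分32个频段时,每帧即得到文中所述32个bit值的特征。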
(3)、歌曲确定单元303;
该歌曲确定单元303,配置为根据所述候选歌曲标识对应的所述匹配音频帧单元,从所述候选歌曲标识集合中获取目标歌曲标识,并根据所述目标歌曲标识确定所述插曲所属的目标歌曲。
比如,该歌曲确定单元303具体可以包括:音频帧扩展子单元、时间获取子单元、标识获取子单元以及歌曲确定子单元;
该音频帧扩展子单元,配置为对该候选歌曲标识对应的该匹配音频帧单元进行音频帧扩展,得到该候选歌曲标识对应的匹配歌曲片段;
该时间获取子单元,配置为根据该匹配歌曲片段获取候选歌曲标识对应的时间信息,该时间信息包括:该匹配歌曲片段在该视频中的第一起始时间、在该候选歌曲中的第二起始时间以及该匹配歌曲片段的时长;
所述标识获取子单元,配置为根据所述候选标识对应的时间信息从所述候选歌曲标识集合中获取目标歌曲标识;
所述歌曲确定子单元,配置为根据所述目标歌曲标识确定所述插曲所属的目标歌曲。
其中,音频帧扩展子单元可以具体配置为:
分别在该候选歌曲文件和该音频文件中对该匹配音频帧单元进行音频帧扩展,以得到该候选歌曲文件中的第一匹配音频帧扩展单元以及该音频文件中的第二匹配音频帧扩展单元;
将该第一匹配音频帧扩展单元中第一音频帧的音频特征与该第二匹配音频帧扩展单元中第二音频帧的音频特征进行匹配,以得到扩展单元之间的匹配音频帧;
根据该扩展单元之间的匹配音频帧的数量,确定该第一匹配音频帧扩展单元或者第二匹配音频帧扩展单元为该候选歌曲与该音频文件之间相匹配的匹配歌曲片段。
其中,标识获取子单元可以具体配置为:
根据候选歌曲标识对应的第二起始时间和该时长获取该候选歌曲标识对应的播放时间,该播放时间为该匹配歌曲片段在该视频中的播放时间;
根据候选歌曲标识对应的播放时间对该候选歌曲标识集合中的候选歌曲标识进行过滤,以得到过滤后的候选标识集合;
将该过滤后的候选标识集合中的该候选歌曲标识作为该插曲所属目标歌曲的目标歌曲标识。
比如,在获取候选歌曲标识对应的播放时间之后,可以确定播放时间具有包含关系的候选歌曲标识,然后,过滤掉播放时间被包含的候选歌曲标识;又比如,在获取候选歌曲标识对应的播放时间之后,还可以确定播放时间具有重叠关系的候选歌曲标识,然后,过滤掉播放时长较短的候选歌曲标识。
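上述“过滤播放时间被包含的候选标识、对重叠者保留播放时长较长的候选标识”的筛选可以用如下Python草图示意(filter_by_playtime的函数名、候选项的三元组形式及song1等标识均为本示例的假设):

```python
def filter_by_playtime(candidates):
    """candidates: (候选歌曲标识, 片段在视频中的起始时间, 时长) 列表。
    先过滤掉播放区间被其他候选完全包含的标识,
    再对播放区间仍有重叠的候选保留播放时长较长的一个。"""
    def interval(c):
        _, start, dur = c
        return start, start + dur

    # 第一步:过滤被其他候选播放区间完全包含的候选
    kept = []
    for c in candidates:
        s, e = interval(c)
        contained = any(o is not c
                        and interval(o)[0] <= s and e <= interval(o)[1]
                        for o in candidates)
        if not contained:
            kept.append(c)

    # 第二步:按时长从长到短取,区间重叠时保留时长较长者
    kept.sort(key=lambda c: -c[2])
    result = []
    for c in kept:
        s, e = interval(c)
        if all(e <= interval(o)[0] or interval(o)[1] <= s for o in result):
            result.append(c)
    return result

cands = [("song1", 10, 20), ("song2", 12, 5), ("song3", 40, 10)]
print(filter_by_playtime(cands))   # song2的区间(12,17)被song1的(10,30)包含,被过滤
```

区间完全被包含的候选(song2)先被滤除,其余互不重叠的候选(song1、song3)均被保留为目标歌曲标识的依据。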
在本发明实施例一实施方式中,参考图3b,基于图3a,本发明实施例歌曲确定装置还可以包括:歌词填充单元304;
该歌词填充单元304,配置为根据所述目标歌曲标识及其对应的匹配音频帧单元,将所述插曲对应的歌词填充至所述视频;
比如,该歌词填充单元304可以包括:歌词获取子单元和填充子单元;
该歌词获取子单元,配置为根据目标歌曲标识及其对应的该第一起始时间、该时长,获取该插曲对应的歌词;
该填充子单元,配置为根据该目标歌曲标识对应的该第二起始时间和该时长,将该歌词填充至该视频。
比如,可以根据目标歌曲标识获取相应目标歌曲的目标歌词文件,然后,根据匹配歌曲片段在该目标歌曲中的起始时间和匹配歌曲片段的时长,从该目标歌词文件中提取插曲对应的歌词,也即歌词获取子单元可以具体配置为:
根据目标歌曲标识获取相应的目标歌曲的歌词文件;
根据目标歌曲标识对应的第一起始时间和时长,从该歌词文件中提取相应的歌词,以作为插曲的歌词。
又比如,填充子单元,可以具体配置为:
根据该目标歌曲对应的第二起始时间和时长,获取歌词在视频中的展示时间;
根据该展示时间将该歌词填充至该视频。
在本发明实施例一实施方式中,为展示完整语句的插曲歌词,以提升用户体验,在获取插曲歌词之后可以确定歌词是否为完整的语句,若是,则进行歌词填充操作;也即本发明实施例歌曲确定装置还可以包括:歌词确定单元305,参考图3c;
该歌词确定单元305,可以配置为在歌词填充单元304获取该插曲对应的歌词之后,将歌词填充至该视频之前,确定该歌词是否为完整的语句;
此时,该歌词填充单元304,可以具体配置为在歌词确定单元305确定歌词是完整的语句时,执行根据该目标歌曲标识对应的该第二起始时间和该时长,将该歌词填充至该视频的步骤。
在本发明实施例一实施方式中,本发明实施例还可以在视频中设置一个接口,以使得在播放视频插曲时可以通过该接口跳转到播放该视频插曲所属的歌曲;也即本发明实施例歌曲确定装置还可以包括:接口设置单元;
该接口设置单元可以配置为在歌曲确定单元303获取插曲目标歌曲标识之后,根据目标歌曲标识在该视频中设置跳转接口,以使得终端在播放该插曲时通过该跳转接口跳转至播放该插曲所属的目标歌曲。
其中,该跳转接口的形式可以为多种,比如可以为按钮、输入框等等,可以根据实际需求设定。
在本发明实施例一实施方式中,该接口设置单元,还可以配置为在歌曲确定单元303获取目标歌曲标识之后,根据目标歌曲标识在所述视频中设置添加接口,以使得终端在播放所述插曲时通过所述添加接口将所述目标歌曲添加到音乐软件的歌曲列表中。
具体实施时,以上各个单元可以作为独立的实体来实现,也可以进行任意组合,作为同一或若干个实体来实现,以上各个单元的具体实施可参见前面的方法实施例,在此不再赘述。
由上可知,本发明实施例歌曲确定装置标识获取单元301采用提取视频中的音频文件,并获取该音频文件中插曲所属候选歌曲的候选歌曲标识,以得到候选歌曲标识集合,然后,由音频帧获取单元302获取候选歌曲标识对应的候选歌曲文件,并获取该候选歌曲文件与该音频文件之间相匹配的匹配音频帧,以得到匹配音频帧单元,其中,该匹配音频帧单元包括多个连续的匹配音频帧,由歌曲确定单元303根据所述候选歌曲标识对应的所述匹配音频帧单元,从所述候选歌曲标识集合中获取目标歌曲标识,并根据所述目标歌曲标识确定所述插曲所属的目标歌曲;
该方案可以先获取视频插曲所属候选歌曲的候选歌曲标识集合,然后,基于视频的音频文件与歌曲之间的匹配音频帧,从候选歌曲标识集合中选取视频插曲所属歌曲的标识,从而确定视频插曲所属的歌曲,相对于相关技术而言,可以提高确定或者定位视频插曲对应歌曲的精确性和效率。
此外,本发明实施例装置还可以在确定视频插曲所属歌曲之后,根据目标歌曲标识及其对应的匹配音频帧单元将该插曲对应的歌词填充至该视频;该方案还可以自动完成视频插曲与歌曲的匹配,以确定视频插曲所属的歌曲,并可以自动获取视频插曲的歌词进行填充,相对于相关技术而言,还可以提高视频插曲歌词填充的准确性以及效率。
图4示例性示出了本发明实施例提供的歌曲确定装置40的结构的示意图。图4示出的结构仅仅是适当的结构的一个实例并且不旨在建议关于歌曲确定装置40的结构的任何限制。歌曲确定装置40可以在包括如服务器计算机、小型计算机、大型计算机以及任意的上述设备的分布式计算环境中实现。
尽管没有要求,但是在“计算机可读指令”被一个或多个歌曲确定装置执行的通用背景下描述实施例。计算机可读指令可以经由计算机可读介质来分布(下文讨论)。计算机可读指令可以实现为程序模块,比如执行特定任务或实现特定抽象数据类型的功能、对象、应用编程接口(API)、数据结构等等。典型地,该计算机可读指令的功能可以在各种环境中随意组合或分布。
图4图示了包括本发明实施例提供的歌曲确定装置40的结构的实例。在一种配置中,歌曲确定装置40包括至少一个处理单元41和存储单元42。根据歌曲确定装置的确切配置和类型,存储单元42可以是易失性的(比如随机存取存储器(RAM,Random Access Memory))、非易失性的(比如只读存储器(ROM,Read Only Memory)、闪存等)或二者的某种组合。该配置在图4中由虚线图示。
在其他实施例中,歌曲确定装置40可以包括附加特征和/或功能。例如,歌曲确定装置40还可以包括附加的存储装置(例如可移除和/或不可移除的),其包括但不限于磁存储装置、光存储装置等等。这种附加存储装置在图4中由存储单元43图示。在一个实施例中,用于实现本发明实施例所提供的一个或多个实施例的计算机可读指令可以在存储单元43中。存储单元43还可以存储用于实现操作系统、应用程序等的其他计算机可读指令。计算机可读指令可以载入存储单元42中由例如处理单元41执行。
本发明实施例所使用的术语“计算机可读介质”包括计算机存储介质。计算机存储介质包括以用于存储诸如计算机可读指令或其他数据之类的信息的任何方法或技术实现的易失性和非易失性、可移除和不可移除介质。存储单元42和存储单元43是计算机存储介质的实例。计算机存储介质包括但不限于RAM、ROM、电可擦可编程只读存储器(Electrically Erasable Programmable Read-Only Memory)、闪存或其他存储器技术、CD-ROM、数字通用盘(DVD)或其他光存储装置、盒式磁带、磁带、磁盘存储装置或其他磁存储设备、或可以用于存储期望信息并可以被歌曲确定装置40访问的任何其他介质。任意这样的计算机存储介质可以是歌曲确定装置40的一部分。
歌曲确定装置40还可以包括允许歌曲确定装置40与其他设备通信的通信连接46。通信连接46可以包括但不限于调制解调器、网络接口卡(NIC)、集成网络接口、射频发射器/接收器、红外端口、USB连接或用于将歌曲确定装置40连接到其他歌曲确定装置的其他接口。通信连接46可以包括有线连接或无线连接。通信连接46可以发射和/或接收通信媒体。
术语“计算机可读介质”可以包括通信介质。通信介质典型地包含计算机可读指令或诸如载波或其他传输机构之类的“已调制数据信号”中的其他数据,并且包括任何信息递送介质。术语“已调制数据信号”可以包括这样的信号:该信号特性中的一个或多个按照将信息编码到信号中的方式来设置或改变。
歌曲确定装置40可以包括输入单元45,比如键盘、鼠标、笔、语音输入设备、触摸输入设备、红外相机、视频输入设备和/或任何其他输入设备。歌曲确定装置40中也可以包括输出单元44,比如一个或多个显示器、扬声器、打印机和/或任意其他输出设备。输入单元45和输出单元44可以经由有线连接、无线连接或其任意组合连接到歌曲确定装置40。在一个实施例中,来自另一个歌曲确定装置的输入设备或输出设备可以被用作歌曲确定装置40的输入单元45或输出单元44。
歌曲确定装置40的组件可以通过各种互连(比如总线)连接。这样的互连可以包括外部设备互连总线(PCI,Peripheral Component Interconnect)(比如快速PCI)、通用串行总线(USB,Universal Serial Bus)、火线(IEEE1394)、光学总线结构等等。在另一个实施例中,歌曲确定装置40的组件可以通过网络互连。例如,存储单元42可以由位于不同物理位置中的、通过网络互连的多个物理存储器单元构成。
以上对本发明实施例所提供的一种歌曲确定方法、装置和存储介质进行了详细介绍,本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想;同时,对于本领域的技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本发明的限制。

Claims (27)

  1. 一种歌曲确定方法,包括:
    提取视频中的音频文件;
    获取所述音频文件中插曲所属候选歌曲的候选歌曲标识,形成候选歌曲标识集合;
    获取候选歌曲标识对应的候选歌曲文件,并获取所述候选歌曲文件与所述音频文件之间相匹配的匹配音频帧;
    基于所获取的匹配音频帧形成匹配音频帧单元,其中,所述匹配音频帧单元包括多个连续的匹配音频帧;
    根据所述候选歌曲标识对应的所述匹配音频帧单元,从所述候选歌曲标识集合中获取目标歌曲标识;
    根据所述目标歌曲标识确定所述插曲所属的目标歌曲。
  2. 如权利要求1所述的歌曲确定方法,其中,
    所述获取所述候选歌曲文件与所述音频文件之间相匹配的匹配音频帧,基于所获取的匹配音频帧形成匹配音频帧单元,包括:
    将所述候选歌曲文件中第一音频帧的音频特征与所述音频文件中第二音频帧的音频特征进行匹配,得到匹配结果;
    根据所述匹配结果获取所述候选歌曲文件与所述音频文件之间相匹配的匹配音频帧;
    根据所述匹配音频帧获取匹配音频帧单元。
  3. 如权利要求2所述的歌曲确定方法,其中,
    所述将所述候选歌曲文件中第一音频帧对应的音频特征与所述音频文件中第二音频帧对应的音频特征进行匹配,得到匹配结果,包括:
    获取所述候选歌曲文件中第一音频帧的帧数,从所述音频文件中选取音频帧单元,所述音频帧单元包括与所述帧数相等数量的第二音频帧;
    将所述候选歌曲文件中第一音频帧的音频特征与所述音频帧单元中第二音频帧的音频特征进行匹配,得到音频特征匹配结果;
    所述根据所述匹配结果获取所述候选歌曲文件与所述音频文件之间相匹配的匹配音频帧,包括:
    根据所述音频特征匹配结果获取所述候选歌曲文件与所述音频文件之间相匹配的匹配音频帧,所述匹配音频帧为音频特征匹配成功的音频帧;
    所述根据所述匹配音频帧获取匹配音频帧单元,包括:
    根据匹配音频帧获取帧连续单元,所述帧连续单元包括多个连续的所述匹配音频帧;
    获取帧连续单元中匹配音频帧的个数,并根据所述个数确定所述帧连续单元为匹配音频帧单元。
  4. 如权利要求1所述的歌曲确定方法,其中,
    所述根据所述候选歌曲标识对应的所述匹配音频帧单元,从所述候选歌曲标识集合中获取目标歌曲标识,包括:
    对所述候选歌曲标识对应的所述匹配音频帧单元进行音频帧扩展,得到所述候选歌曲标识对应的匹配歌曲片段;
    根据所述匹配歌曲片段获取候选歌曲标识对应的时间信息,其中,所述时间信息包括:所述匹配歌曲片段在所述视频中的第一起始时间、在所述候选歌曲中的第二起始时间以及所述匹配歌曲片段的时长;
    根据所述候选标识对应的时间信息从所述候选歌曲标识集合中获取目标歌曲标识。
  5. 如权利要求4所述的歌曲确定方法,其中,
    所述对所述候选歌曲标识对应的所述匹配音频帧单元进行音频帧扩展,得到所述候选歌曲标识对应的匹配歌曲片段,包括:
    分别在所述候选歌曲文件和所述音频文件中对所述匹配音频帧单元进行音频帧扩展,得到所述候选歌曲文件中的第一匹配音频帧扩展单元以及所述音频文件中的第二匹配音频帧扩展单元;
    将所述第一匹配音频帧扩展单元中第一音频帧的音频特征与所述第二匹配音频帧扩展单元中第二音频帧的音频特征进行匹配,得到扩展单元之间的匹配音频帧;
    根据所述扩展单元之间的匹配音频帧的数量,确定所述第一匹配音频帧扩展单元或者所述第二匹配音频帧扩展单元为所述候选歌曲与所述音频文件之间相匹配的匹配歌曲片段。
  6. 如权利要求2所述的歌曲确定方法,其中,
    在根据候选歌曲标识获取相应的候选歌曲文件之后,将所述候选歌曲文件中第一音频帧对应的音频特征与所述音频文件中第二音频帧对应的音频特征进行匹配之前,所述歌曲确定方法还包括:
    获取所述候选歌曲文件中每个第一音频帧对应的频谱;
    将所述第一音频帧对应的频谱划分成预设数量的频段,并获取所述频段对应的平均幅值;
    将每个所述频段的平均幅值与上一个第一音频帧对应频段的平均幅值进行比较,得到比较结果;
    根据所述比较结果获取所述第一音频帧对应的音频特征。
  7. 如权利要求4所述的歌曲确定方法,其中,
    所述根据所述候选标识对应的时间信息从所述候选歌曲标识集合中获取目标歌曲标识,包括:
    根据候选歌曲标识对应的第二起始时间和所述时长获取所述候选歌曲标识对应的播放时间,所述播放时间为所述匹配歌曲片段在所述视频中的播放时间;
    根据候选歌曲标识对应的播放时间对所述候选歌曲标识集合中的候选歌曲标识进行过滤,得到过滤后的候选标识集合;
    将所述过滤后的候选标识集合中的所述候选歌曲标识作为目标歌曲标识。
  8. 如权利要求4所述的歌曲确定方法,其中,还包括:
    在获取所述插曲所属目标歌曲的目标歌曲标识之后,
    根据所述目标歌曲标识及其对应的匹配音频帧单元,将所述插曲对应的歌词填充至所述视频。
  9. 如权利要求5所述的歌曲确定方法,其中,
    所述根据所述目标歌曲标识及其对应的匹配音频帧单元,将所述插曲对应的歌词填充至所述视频,包括:
    根据目标歌曲标识及其对应的所述第一起始时间、所述时长,获取所述插曲对应的歌词;
    根据所述目标歌曲标识对应的所述第二起始时间和所述时长,将所述歌词填充至所述视频。
  10. 如权利要求9所述的歌曲确定方法,其中,还包括:
    在获取所述插曲对应的歌词之后,且在将歌词填充至所述视频之前,
    确定所述歌词是否为完整的语句;
    若是,则执行根据所述目标歌曲标识对应的所述第二起始时间和所述时长,将所述歌词填充至所述视频的步骤。
  11. 如权利要求1所述的歌曲确定方法,其中,
    所述获取所述音频文件中插曲所属候选歌曲的候选歌曲标识,包括:
    将所述音频文件划分成多个音频段,并获取所述音频段的音频指纹;
    确定预设样本集合中是否存在与所述音频指纹匹配的指纹样本;
    若是,则获取匹配指纹样本对应的歌曲标识,得到所述音频段对应的歌曲标识集合,所述歌曲标识集合包括多个所述歌曲标识;
    从所述歌曲标识集合中,选取所述插曲所属候选歌曲的候选歌曲标识。
  12. 如权利要求11所述的歌曲确定方法,其中,
    所述获取所述音频段的音频指纹,包括:
    获取所述音频段中音频帧对应的频谱;
    从所述频谱中提取所述音频帧对应的频谱峰值点,得到所述音频段对应的峰值集合,所述峰值集合包括所述音频帧对应的频谱峰值点;
    将所述峰值集合中频谱峰值点两两进行组合,得到所述音频段的音频指纹。
  13. 如权利要求12所述的歌曲确定方法,其中,
    所述将所述峰值集合中频谱峰值点两两进行组合,得到所述音频段的音频指纹,包括:
    确定与所述频谱峰值点相组合的目标频谱峰值点;
    将所述频谱峰值点与所述目标频谱峰值点进行组合,得到音频段的音频指纹,所述音频指纹包括:所述频谱峰值点对应的频率、所述频谱峰值点与所述目标频谱峰值点之间的时间差和频率差。
  14. 如权利要求13所述的歌曲确定方法,其中,还包括:
    在获取所述音频指纹之后,选取候选歌曲标识之前,
    获取所述音频指纹在所述音频段中的第一偏移时间、以及所述匹配指纹样本在匹配歌曲中的第二偏移时间,其中,所述第一偏移时间为所述频谱峰值点在所述音频段内的时间,所述匹配歌曲为所述歌曲标识对应的歌曲;
    所述从所述歌曲标识集合中,选取所述插曲所属候选歌曲的候选歌曲标识,包括:
    根据所述第一偏移时间和所述第二偏移时间,获取所述音频段在所述匹配歌曲中的起始时间;
    根据所述音频段在匹配歌曲中的起始时间,从所述歌曲标识集合中选取所述候选歌曲标识。
  15. 如权利要求14所述的歌曲确定方法,其中,
    所述根据所述音频段在匹配歌曲中的起始时间,从所述歌曲标识集合中选取所述候选歌曲标识,包括:
    获取所述歌曲标识集合中歌曲标识对应的起始时间,得到时间集合;
    根据每种所述起始时间的个数从所述时间集合中确定目标起始时间;
    从歌曲标识集合中选取所述目标起始时间对应的歌曲标识作为候选歌曲标识。
  16. 如权利要求4所述的歌曲确定方法,其中,还包括:
    在获取所述插曲所属目标歌曲的目标歌曲标识之后,
    根据目标歌曲标识在所述视频中设置跳转接口,供终端在播放所述插曲时通过所述跳转接口跳转至播放所述插曲所属的目标歌曲。
  17. 如权利要求1所述的歌曲确定方法,其中,还包括:
    在获取目标歌曲标识之后,
    根据目标歌曲标识在所述视频中设置添加接口,供终端在播放所述插曲时通过所述添加接口将所述目标歌曲添加到音乐软件的歌曲列表中。
  18. 一种歌曲确定装置,包括:
    标识获取单元,配置为提取视频中的音频文件,并获取所述音频文件中插曲所属候选歌曲的候选歌曲标识,形成候选歌曲标识集合;
    音频帧获取单元,配置为获取候选歌曲标识对应的候选歌曲文件,并获取所述候选歌曲文件与所述音频文件之间相匹配的匹配音频帧,基于所获取的匹配音频帧形成匹配音频帧单元,其中,所述匹配音频帧单元包括多个连续的匹配音频帧;
    歌曲确定单元,配置为根据所述候选歌曲标识对应的所述匹配音频帧单元,从所述候选歌曲标识集合中获取目标歌曲标识,并根据所述目标歌曲标识确定所述插曲所属的目标歌曲。
  19. 如权利要求18所述的歌曲确定装置,其中,
    所述音频帧获取单元具体包括:匹配子单元、第一获取子单元以及第二获取子单元;
    所述匹配子单元,配置为将所述候选歌曲文件中第一音频帧的音频特征与所述音频文件中第二音频帧的音频特征进行匹配,得到匹配结果;
    所述第一获取子单元,配置为根据所述匹配结果获取所述候选歌曲文件与所述音频文件之间相匹配的匹配音频帧;
    所述第二获取子单元,配置为根据所述匹配音频帧获取匹配音频帧单元。
  20. 如权利要求19所述的歌曲确定装置,其中,
    所述匹配子单元,具体配置为:
    获取所述候选歌曲文件中第一音频帧的帧数,从所述音频文件中选取音频帧单元,所述音频帧单元包括与所述帧数相等数量的第二音频帧;
    将所述候选歌曲文件中第一音频帧的音频特征与所述音频帧单元中第二音频帧的音频特征进行匹配,得到音频特征匹配结果;
    所述第一获取子单元,具体配置为:根据所述音频特征匹配结果获取所述候选歌曲文件与所述音频文件之间相匹配的匹配音频帧,所述匹配音频帧为音频特征匹配成功的音频帧;
    所述第二获取子单元,具体配置为:
    根据匹配音频帧获取帧连续单元,所述帧连续单元包括多个连续的所述匹配音频帧;
    获取帧连续单元中匹配音频帧的个数,并根据所述个数确定所述帧连续单元为匹配音频帧单元。
  21. 如权利要求18所述的歌曲确定装置,其中,
    所述歌曲确定单元具体包括:音频帧扩展子单元、时间获取子单元、标识获取子单元以及歌曲确定子单元;
    所述音频帧扩展子单元,配置为对所述候选歌曲标识对应的所述匹配音频帧单元进行音频帧扩展,得到所述候选歌曲标识对应的匹配歌曲片段;
    所述时间获取子单元,配置为根据所述匹配歌曲片段获取候选歌曲标识对应的时间信息,其中,所述时间信息包括:所述匹配歌曲片段在所述视频中的第一起始时间、在所述候选歌曲中的第二起始时间以及所述匹配歌曲片段的时长;
    所述标识获取子单元,配置为根据所述候选标识对应的时间信息从所述候选歌曲标识集合中获取目标歌曲标识;
    所述歌曲确定子单元,配置为根据所述目标歌曲标识确定所述插曲所属的目标歌曲。
  22. 如权利要求21所述的歌曲确定装置,其中,
    所述标识获取子单元,具体配置为:
    根据候选歌曲标识对应的第二起始时间和所述时长获取所述候选歌曲标识对应的播放时间,所述播放时间为所述匹配歌曲片段在所述视频中的播放时间;
    根据候选歌曲标识对应的播放时间对所述候选歌曲标识集合中的候选歌曲标识进行过滤,得到过滤后的候选标识集合;
    将所述过滤后的候选标识集合中的所述候选歌曲标识作为所述插曲所属目标歌曲的目标歌曲标识。
  23. 如权利要求21所述的歌曲确定装置,其中,还包括:
    歌词填充单元,配置为根据所述目标歌曲标识及其对应的匹配音频帧单元,将所述插曲对应的歌词填充至所述视频。
  24. 如权利要求23所述的歌曲确定装置,其中,
    所述歌词填充单元包括:歌词获取子单元和填充子单元;
    所述歌词获取子单元,配置为根据目标歌曲标识及其对应的所述第一起始时间、所述时长,获取所述插曲对应的歌词;
    所述填充子单元,配置为根据所述目标歌曲标识对应的所述第二起始时间和所述时长,将所述歌词填充至所述视频。
  25. 如权利要求18所述的歌曲确定装置,其中,
    所述标识获取单元具体包括:音频提取子单元、指纹获取子单元、确定子单元、标识集合获取子单元以及选取子单元;
    所述音频提取子单元,配置为提取视频中的音频文件;
    所述指纹获取子单元,配置为将所述音频文件划分成多个音频段,并获取所述音频段的音频指纹;
    所述确定子单元,配置为确定预设样本集合中是否存在与所述音频指纹匹配的指纹样本;
    所述标识集合获取子单元,配置为在确定存在与所述音频指纹匹配的指纹样本时,获取匹配指纹样本对应的歌曲标识,得到所述音频段对应的歌曲标识集合,所述歌曲标识集合包括多个所述歌曲标识;
    所述选取子单元,配置为从所述歌曲标识集合中,选取所述插曲所属候选歌曲的候选歌曲标识。
  26. 一种歌曲确定装置,包括:存储器和处理器,所述存储器中存储有可执行指令,所述可执行指令用于引起所述处理器执行包括以下的操作:
    提取视频中的音频文件;
    获取所述音频文件中插曲所属候选歌曲的候选歌曲标识,形成候选歌曲标识集合;
    获取候选歌曲标识对应的候选歌曲文件,并获取所述候选歌曲文件与所述音频文件之间相匹配的匹配音频帧;
    基于所获取的匹配音频帧得到匹配音频帧单元,其中,所述匹配音频帧单元包括多个连续的匹配音频帧;
    根据所述候选歌曲标识对应的所述匹配音频帧单元,从所述候选歌曲标识集合中获取目标歌曲标识;
    根据所述目标歌曲标识确定所述插曲所属的目标歌曲。
  27. 一种存储介质,存储有可执行指令,用于执行权利要求1至17任一项所述的歌曲确定方法。
PCT/CN2017/079631 2016-04-19 2017-04-06 一种歌曲确定方法和装置、存储介质 WO2017181852A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2018526229A JP6576557B2 (ja) 2016-04-19 2017-04-06 歌曲確定方法及び装置、記憶媒体
MYPI2018701777A MY194965A (en) 2016-04-19 2017-04-06 Song determining method and device, and storage medium
KR1020187010247A KR102110057B1 (ko) 2016-04-19 2017-04-06 노래 확정 방법과 장치, 기억 매체
US16/102,478 US10719551B2 (en) 2016-04-19 2018-08-13 Song determining method and device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610244446.8A CN105868397B (zh) 2016-04-19 2016-04-19 一种歌曲确定方法和装置
CN201610244446.8 2016-04-19

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/102,478 Continuation US10719551B2 (en) 2016-04-19 2018-08-13 Song determining method and device and storage medium

Publications (1)

Publication Number Publication Date
WO2017181852A1 true WO2017181852A1 (zh) 2017-10-26

Family

ID=56633482

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/079631 WO2017181852A1 (zh) 2016-04-19 2017-04-06 一种歌曲确定方法和装置、存储介质

Country Status (6)

Country Link
US (1) US10719551B2 (zh)
JP (1) JP6576557B2 (zh)
KR (1) KR102110057B1 (zh)
CN (1) CN105868397B (zh)
MY (1) MY194965A (zh)
WO (1) WO2017181852A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515662A (zh) * 2021-07-09 2021-10-19 北京百度网讯科技有限公司 一种相似歌曲检索方法、装置、设备以及存储介质
US11570506B2 (en) 2017-12-22 2023-01-31 Nativewaves Gmbh Method for synchronizing an additional signal to a primary signal

Families Citing this family (24)

Publication number Priority date Publication date Assignee Title
CN105868397B (zh) * 2016-04-19 2020-12-01 腾讯科技(深圳)有限公司 一种歌曲确定方法和装置
CN106708990B (zh) * 2016-12-15 2020-04-24 腾讯音乐娱乐(深圳)有限公司 一种音乐片段提取方法和设备
US20190019522A1 (en) * 2017-07-11 2019-01-17 Dubbydoo, LLC, c/o Fortis LLP Method and apparatus for multilingual film and audio dubbing
CN107918663A (zh) * 2017-11-22 2018-04-17 腾讯科技(深圳)有限公司 音频文件检索方法及装置
US10929097B2 (en) * 2018-06-26 2021-02-23 ROVl GUIDES, INC. Systems and methods for switching operational modes based on audio triggers
CN109558509B (zh) * 2018-07-04 2021-10-15 北京邮电大学 一种广播音频中广告检索的方法和装置
CN112004134B (zh) * 2019-05-27 2022-12-09 腾讯科技(深圳)有限公司 多媒体数据的展示方法、装置、设备及存储介质
CN110992983B (zh) * 2019-11-26 2023-04-18 腾讯音乐娱乐科技(深圳)有限公司 识别音频指纹的方法、装置、终端及存储介质
CN111161758B (zh) * 2019-12-04 2023-03-31 厦门快商通科技股份有限公司 一种基于音频指纹的听歌识曲方法、系统及音频设备
CN111400543B (zh) * 2020-03-20 2023-10-10 腾讯科技(深圳)有限公司 音频片段的匹配方法、装置、设备及存储介质
CN111475672B (zh) * 2020-03-27 2023-12-08 咪咕音乐有限公司 一种歌词分配方法、电子设备及存储介质
CN111404808B (zh) * 2020-06-02 2020-09-22 腾讯科技(深圳)有限公司 一种歌曲的处理方法
US20220027407A1 (en) * 2020-07-27 2022-01-27 Audible Magic Corporation Dynamic identification of unknown media
US11657814B2 (en) * 2020-10-08 2023-05-23 Harman International Industries, Incorporated Techniques for dynamic auditory phrase completion
CN112866584B (zh) * 2020-12-31 2023-01-20 北京达佳互联信息技术有限公司 视频合成方法、装置、终端及存储介质
CN112764612A (zh) * 2021-01-21 2021-05-07 北京字跳网络技术有限公司 互动方法、装置、电子设备和存储介质
CN112906369A (zh) * 2021-02-19 2021-06-04 脸萌有限公司 一种歌词文件生成方法及装置
CN113436641A (zh) * 2021-06-22 2021-09-24 腾讯音乐娱乐科技(深圳)有限公司 一种音乐转场时间点检测方法、设备及介质
CN113780180A (zh) * 2021-09-13 2021-12-10 江苏环雅丽书智能科技有限公司 一种音频长时指纹提取及匹配方法
CN114020958B (zh) * 2021-09-26 2022-12-06 天翼爱音乐文化科技有限公司 一种音乐分享方法、设备及存储介质
CN114071184A (zh) * 2021-11-11 2022-02-18 腾讯音乐娱乐科技(深圳)有限公司 一种字幕定位方法、电子设备及介质
US20230186953A1 (en) * 2021-12-09 2023-06-15 Bellevue Investments Gmbh & Co. Kgaa System and method for ai/xi based automatic song finding method for videos
CN114339081A (zh) * 2021-12-22 2022-04-12 腾讯音乐娱乐科技(深圳)有限公司 一种字幕生成方法、电子设备及计算机可读存储介质
CN114666653A (zh) * 2022-03-23 2022-06-24 腾讯音乐娱乐科技(深圳)有限公司 一种音乐片段的字幕显示方法、设备及可读存储介质

Citations (6)

Publication number Priority date Publication date Assignee Title
WO2005101243A1 (en) * 2004-04-13 2005-10-27 Matsushita Electric Industrial Co. Ltd. Method and apparatus for identifying audio such as music
CN103971689A (zh) * 2013-02-04 2014-08-06 腾讯科技(深圳)有限公司 一种音频识别方法及装置
CN104142989A (zh) * 2014-07-28 2014-11-12 腾讯科技(深圳)有限公司 一种匹配检测方法及装置
CN104409087A (zh) * 2014-11-18 2015-03-11 广东欧珀移动通信有限公司 歌曲文件播放方法和系统
CN104598541A (zh) * 2014-12-29 2015-05-06 乐视网信息技术(北京)股份有限公司 多媒体文件的识别方法、装置
CN105868397A (zh) * 2016-04-19 2016-08-17 腾讯科技(深圳)有限公司 一种歌曲确定方法和装置

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
US5596705A (en) * 1995-03-20 1997-01-21 International Business Machines Corporation System and method for linking and presenting movies with their underlying source information
US5809471A (en) * 1996-03-07 1998-09-15 Ibm Corporation Retrieval of additional information not found in interactive TV or telephony signal by application using dynamically extracted vocabulary
US6209028B1 (en) * 1997-03-21 2001-03-27 Walker Digital, Llc System and method for supplying supplemental audio information for broadcast television programs
KR100716290B1 (ko) * 2005-07-04 2007-05-09 삼성전자주식회사 영상처리장치, 부가정보처리장치 및 영상처리방법
US8168876B2 (en) 2009-04-10 2012-05-01 Cyberlink Corp. Method of displaying music information in multimedia playback and related electronic device
US20110292992A1 (en) * 2010-05-28 2011-12-01 Microsoft Corporation Automating dynamic information insertion into video
US9093120B2 (en) * 2011-02-10 2015-07-28 Yahoo! Inc. Audio fingerprint extraction by scaling in time and resampling
CN103116629B (zh) * 2013-02-01 2016-04-20 腾讯科技(深圳)有限公司 一种音频内容的匹配方法和系统
CN103440330A (zh) * 2013-09-03 2013-12-11 网易(杭州)网络有限公司 一种音乐节目信息获取方法和设备
CN103853836B (zh) * 2014-03-14 2017-01-25 广州酷狗计算机科技有限公司 一种基于音乐指纹特征的音乐检索方法及系统



Also Published As

Publication number Publication date
MY194965A (en) 2022-12-28
US10719551B2 (en) 2020-07-21
CN105868397A (zh) 2016-08-17
JP6576557B2 (ja) 2019-09-18
US20180349494A1 (en) 2018-12-06
KR20180050745A (ko) 2018-05-15
KR102110057B1 (ko) 2020-05-12
CN105868397B (zh) 2020-12-01
JP2019505874A (ja) 2019-02-28

Similar Documents

Publication Publication Date Title
WO2017181852A1 (zh) 一种歌曲确定方法和装置、存储介质
CN107591149B (zh) 音频合成方法、装置及存储介质
US10776422B2 (en) Dual sound source audio data processing method and apparatus
TWI494917B (zh) 音頻識別方法及裝置
EP1855216A2 (en) System, device, method, and program for segmenting radio broadcast audio data
CN106055659B (zh) 一种歌词数据匹配方法及其设备
US9558272B2 (en) Method of and a system for matching audio tracks using chromaprints with a fast candidate selection routine
CN111640411B (zh) 音频合成方法、装置及计算机可读存储介质
CN110209872B (zh) 片段音频歌词生成方法、装置、计算机设备和存储介质
US20210304776A1 (en) Method and apparatus for filtering out background audio signal and storage medium
CN108280074A (zh) 音频的识别方法及系统
WO2023040520A1 (zh) 视频配乐方法、装置、计算机设备和存储介质
US9881083B2 (en) Method of and a system for indexing audio tracks using chromaprints
KR100916310B1 (ko) 오디오 신호처리 기반의 음악 및 동영상간의 교차 추천 시스템 및 방법
US9990911B1 (en) Method for creating preview track and apparatus using the same
CN106775567B (zh) 一种音效匹配方法及系统
WO2023005193A1 (zh) 字幕显示方法及装置
CN113747233B (zh) 一种音乐替换方法、装置、电子设备及存储介质
CN110400578B (zh) 哈希码的生成及其匹配方法、装置、电子设备和存储介质
CN108268572B (zh) 一种歌曲同步方法及系统
CN108205550B (zh) 音频指纹的生成方法及装置
JP2010086273A (ja) 楽曲検索装置、楽曲検索方法、および楽曲検索プログラム
JP6413828B2 (ja) 情報処理方法、情報処理装置、及びプログラム
KR101365592B1 (ko) Mgi음악 파일 생성 시스템 및 방법
CN115203342A (zh) 一种音频识别方法、电子设备及可读存储介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 20187010247

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2018526229

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17785337

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17785337

Country of ref document: EP

Kind code of ref document: A1