WO2017181852A1 - Song determining method and apparatus, and storage medium - Google Patents
- Publication number
- WO2017181852A1 (PCT/CN2017/079631, CN2017079631W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- song
- audio
- matching
- identifier
- candidate
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/683—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/685—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
- G06F16/632—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
- G06F16/632—Query formulation
- G06F16/634—Query by example, e.g. query by humming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/683—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/54—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/835—Generation of protective data, e.g. certificates
- H04N21/8352—Generation of protective data, e.g. certificates involving content or source identification data, e.g. Unique Material Identifier [UMID]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
- G10L2025/937—Signal energy in various frequency bands
Definitions
- the present invention relates to audio and video processing technologies, and in particular, to a song determining method and apparatus, and a storage medium.
- Embodiments of the present invention provide a song determining method and apparatus, and a storage medium, which can improve the accuracy of determining a song corresponding to a video episode.
- an embodiment of the present invention provides a method for determining a song, including:
- the embodiment of the present invention further provides a song determining apparatus, including:
- an identifier obtaining unit, configured to extract an audio file from the video, and obtain candidate song identifiers of the candidate songs to which the episode in the audio file belongs, to obtain a candidate song identifier set;
- an audio frame acquiring unit, configured to acquire the candidate song file corresponding to a candidate song identifier, and obtain the matching audio frames matched between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames;
- a song determining unit, configured to acquire a target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to each candidate song identifier, and determine the target song to which the episode belongs according to the target song identifier.
- an embodiment of the present invention provides a song determining apparatus, including: a memory and a processor, wherein the memory stores executable instructions, where the executable instructions are used to cause the processor to perform operations including:
- the matching audio frame unit includes a plurality of consecutive matching audio frames
- an embodiment of the present invention provides a storage medium, where executable instructions are stored for performing a song determining method provided by an embodiment of the present invention.
- In the embodiments of the present invention, an audio file is extracted from a video, candidate song identifiers of the candidate songs to which the episode in the audio file belongs are acquired to obtain a candidate song identifier set, the candidate song file corresponding to each candidate song identifier is acquired, and the matching audio frames matched between the candidate song file and the audio file are obtained to form a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames; according to the matching audio frame unit corresponding to each candidate song identifier, the target song identifier of the target song to which the episode belongs is obtained from the candidate song identifier set, and the target song is determined according to the target song identifier.
- The scheme first obtains a candidate song identifier set of the candidate songs to which the video episode may belong, and then selects the identifier of the song to which the episode actually belongs from that set based on the audio frames matched between the video's audio file and each song, thereby determining the song to which the video episode belongs. Relative to the related art, this improves the accuracy of determining or locating the song corresponding to a video episode.
- FIG. 1 is a flowchart of a method for determining a song according to an embodiment of the present invention
- 2a is a flowchart of obtaining a candidate song identifier according to an embodiment of the present invention
- 2b is a spectrum peak point distribution diagram provided by an embodiment of the present invention.
- 2c is a filtered peak point distribution map of the embodiment of the present invention.
- 3a is a schematic structural diagram of a first song determining apparatus according to an embodiment of the present invention.
- 3b is a schematic structural diagram of a second song determining apparatus according to an embodiment of the present invention.
- 3c is a schematic structural diagram of a third song determining apparatus according to an embodiment of the present invention.
- FIG. 4 is a schematic diagram showing the structure of hardware of a song determining apparatus according to an embodiment of the present invention.
- Embodiments of the present invention provide a song determining method and apparatus. The details will be described separately below.
- the embodiment of the present invention will be described from the perspective of a song determining apparatus, and the song determining apparatus may be specifically integrated in a device such as a server that needs to determine a song corresponding to a video episode.
- the song determining apparatus may also be integrated in a device that needs to determine the song corresponding to a video episode, such as a user terminal (for example, a smartphone or a tablet).
- An embodiment of the present invention provides a song determining method, including: extracting an audio file from a video, and acquiring candidate song identifiers of the candidate songs to which the episode in the audio file belongs, to obtain a candidate song identifier set; acquiring the candidate song file corresponding to each candidate song identifier, and obtaining the matching audio frames matched between the candidate song file and the audio file, to obtain a matching audio frame unit; and obtaining, according to the matching audio frame unit, the identifier of the target song to which the episode belongs, that is, the target song identifier, and determining the target song according to the target song identifier.
- the specific process of the song determination method can be as follows:
- Step 101: Extract an audio file from the video, and obtain candidate song identifiers of the candidate songs to which the episode in the audio file belongs, to obtain a candidate song identifier set.
- The video may be obtained in various ways. For example, a video acquisition request may be sent to a video server to obtain the video, or the video may be extracted from local storage; that is, the step of "extracting the audio file in the video" may include: obtaining the video, and extracting the audio file from the obtained video.
- The audio file may be extracted from the video in various ways. For example, audio-video separation may be performed on the video to obtain the audio file of the video; that is, the step of "extracting the audio file in the video" may include: performing audio-video separation processing on the video to obtain the audio file of the video.
- A candidate song to which the episode belongs is a song that may match the video episode, and a candidate song identifier is the identifier of such a song.
- The candidate song identifiers may be obtained in various ways. For example, the audio file of the video may first be divided into multiple audio segments, and each audio segment may then be matched against the songs in a music library; the identifiers of the songs that match the video episode are taken as candidate song identifiers. The matching may be performed based on the audio fingerprints of the audio segments and the songs (that is, digitized features of the song audio). That is, the step of "acquiring the candidate song identifiers of the candidate songs to which the episode in the audio file belongs" may include: dividing the audio file into multiple audio segments, matching each audio segment against the songs in the music library based on audio fingerprints, and selecting the candidate song identifiers of the candidate songs to which the episode belongs.
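As an illustrative sketch of this retrieval step (not the patent's exact procedure): assuming a hypothetical inverted index `library_index` that maps a fingerprint value to the identifiers of library songs containing it, the segmentation and lookup might look like:

```python
from collections import Counter

def candidate_song_ids(audio_fingerprints, library_index, segment_len=500, top_k=5):
    """Split the video's fingerprint sequence into segments and look each
    fingerprint up in a (hypothetical) library index; the songs hit most
    often in any segment become the candidate song identifiers."""
    candidates = set()
    for start in range(0, len(audio_fingerprints), segment_len):
        segment = audio_fingerprints[start:start + segment_len]
        hits = Counter()
        for fp in segment:
            for song_id in library_index.get(fp, ()):
                hits[song_id] += 1
        # keep the songs most frequently hit by this segment
        candidates.update(song_id for song_id, _ in hits.most_common(top_k))
    return candidates
```

The index layout, segment length, and `top_k` cutoff are all assumptions made for illustration.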
- Step 102: Obtain the candidate song file corresponding to each candidate song identifier, and obtain the matching audio frames matched between the candidate song file and the audio file, to obtain a matching audio frame unit, where the matching audio frame unit includes multiple consecutive matching audio frames.
- The candidate song file corresponding to a candidate song identifier may be obtained from the song database of a song server. For example, a request may be sent to the song server to obtain the corresponding song file; that is, the step of "acquiring the candidate song file corresponding to the candidate song identifier" may include: sending a song acquisition request to the song server, and receiving the candidate song file returned by the song server according to the song acquisition request.
- A matching audio frame is an audio frame matched between the candidate song file and the audio file. For example, when the candidate song file includes multiple first audio frames and the audio file includes multiple second audio frames, a first audio frame in the candidate song file that matches a second audio frame in the audio file is a matching audio frame.
- the second audio frame in the audio file that matches the first audio frame in the candidate song file is also a matching audio frame.
- the matching audio frame unit may be an audio frame unit in the candidate song file, or may be an audio frame unit in the audio file.
- The term "first audio frame" denotes any audio frame in the candidate song that is compared with an audio frame (that is, a "second audio frame") in the audio file; it does not refer to one specific audio frame in the candidate song. Likewise, "second audio frame" denotes any audio frame in the audio file, not one specific audio frame.
- A matching audio frame may be obtained, for example, by matching the audio frames in the candidate song against the audio frames in the audio file.
- The audio frame matching may adopt an audio-feature-based matching manner, such as matching the audio features of the first audio frames in the candidate song file with the audio features of the second audio frames in the audio file, and obtaining the matching audio frames according to the audio feature matching result; that is, the step of "acquiring matching audio frames that match the candidate song file with the audio file to obtain a matching audio frame unit" may include: matching the audio features of the first audio frames with those of the second audio frames to obtain a matching result, acquiring the matching audio frames according to the matching result, and obtaining a matching audio frame unit based on the matching audio frames.
- the audio feature of the audio frame may be referred to as an audio fingerprint.
- The audio feature may be obtained in multiple manners, for example, according to the average amplitudes of the frequency bands of the audio frame. Before the step of "matching the audio feature corresponding to the first audio frame in the candidate song file with the audio feature corresponding to the second audio frame in the audio file", the song determining method may further include: acquiring the audio feature corresponding to the first audio frame in the candidate song file. For example, this acquiring step may include:
- converting the candidate song file into audio of a preset format, such as 8k16bit audio (that is, audio with a sampling rate of 8*1024 and 16-bit quantization);
- taking a first preset number of sampling points as one frame and a second preset number of sampling points as the frame shift, and performing a Fourier transform on each frame to obtain its spectrum (for example, 1856 sampling points as one frame and 58 sampling points as the frame shift);
- dividing the spectrum evenly into a third preset number (such as 32) of frequency bands, and calculating the average amplitude of each frequency band;
- comparing each frequency band with the corresponding frequency band of the previous frame (the first frequency band of the second audio frame with the first frequency band of the first audio frame, the second frequency band with the second frequency band, and so on, until all frequency bands are compared): if the amplitude is greater, the bit value is 1; otherwise it is 0. Each frame thus yields a data unit composed of the third preset number of bit values, and this data unit is the audio feature of the frame. For example, when the spectrum is divided into 32 frequency bands, each audio frame yields a data unit of 32 bit values, and those 32 bit values are the audio feature of the audio frame.
- the audio feature of the audio file in the video can also be obtained by using the above acquisition method.
- the acquisition process can refer to the above description, and details are not described herein.
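A minimal sketch of the band-comparison fingerprint described above, using NumPy. The frame length of 1856 samples, frame shift of 58, and 32 bands are the example values from the text; the function name and the packing of the bits into an integer are our own assumptions:

```python
import numpy as np

def frame_fingerprints(samples, frame_len=1856, frame_shift=58, n_bands=32):
    """Per-frame audio features: split each frame's spectrum into n_bands
    equal bands, average the magnitudes, and set bit k to 1 when band k's
    average amplitude exceeds the same band in the previous frame."""
    features = []
    prev_bands = None
    for start in range(0, len(samples) - frame_len + 1, frame_shift):
        frame = samples[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame))
        # divide the spectrum evenly into n_bands bands and average each
        bands = np.array([band.mean() for band in
                          np.array_split(spectrum, n_bands)])
        if prev_bands is not None:
            bits = (bands > prev_bands).astype(int)           # 1 if louder, else 0
            features.append(int("".join(map(str, bits)), 2))  # pack the bits
        prev_bands = bands
    return features
```

With 32 bands, each feature is a 32-bit value, matching the example in the text; the first frame yields no feature because it has no previous frame to compare against.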
- A frame unit may be used as the unit of feature matching; that is, the step of "matching the audio feature corresponding to the first audio frame in the candidate song file with the audio feature corresponding to the second audio frame in the audio file to obtain a matching result" may include:
- The step of "acquiring a matching audio frame that matches the candidate song file and the audio file according to the matching result" may include: acquiring, according to the audio feature matching result, the matching audio frames matched between the candidate song file and the audio file, a matching audio frame being an audio frame whose audio features are successfully matched.
- The step of "acquiring matching audio frame units according to the matched audio frames" may include: acquiring the numbers of consecutive matching audio frames, and acquiring the corresponding matching audio frame unit according to those numbers.
- The step of "acquiring the number of consecutive matching audio frames and obtaining the corresponding matching audio frame unit according to the number" may include: acquiring the frame-contiguous units, each comprising a plurality of consecutive matching audio frames, and the number of matching audio frames in each unit.
- For example, assume that the audio file includes m second audio frames. First, n consecutive second audio frames are selected from the m second audio frames to form audio frame unit a. The audio features of the second audio frames in unit a are matched with the audio features of the corresponding first audio frames in the candidate song (the first frame of unit a with the first frame of the candidate song, the second frame of unit a with the second frame of the candidate song, and so on, until the nth frame of unit a is matched with the nth frame of the candidate song).
- Feature matching is thus performed n times to obtain an audio feature matching result. The result includes the first audio frames and second audio frames whose audio features are successfully matched; the matching audio frames are obtained from this result, and the frame-contiguous units and the number of matching audio frames in each frame-contiguous unit are recorded.
- Then, a new set of n consecutive second audio frames is selected from the m second audio frames to form a new audio frame unit b, where unit b differs from unit a by at least one second audio frame (that is, the newly selected n consecutive second audio frames differ from the previously selected ones by at least one frame; for example, if the 1st to 10th second audio frames formed unit a, the 2nd to 11th second audio frames may form unit b). The audio features of the second audio frames in unit b are matched with those of the corresponding first audio frames in the candidate song in the same way to obtain an audio feature matching result; the matching audio frames are obtained, and the frame-contiguous units and their matching-frame counts are recorded.
- This process repeats, each time selecting a new set of n consecutive second audio frames to form an audio frame unit and performing audio feature matching to obtain the numbers of consecutive matching audio frames, until every second audio frame has been matched.
- A frame-contiguous unit may then be determined as the matching audio frame unit based on the numbers. For example, the frame-contiguous unit with the largest number of matching audio frames may be selected; that is, the step of "determining the frame-contiguous unit as the matching audio frame unit according to the number" may include: when the number of matching audio frames of a frame-contiguous unit is greater than the numbers of matching audio frames of the remaining frame-contiguous units, determining that frame-contiguous unit to be the matching audio frame unit.
- For example, suppose the audio file contains 20 audio frames q and the candidate song contains 10 audio frames p. The 1st to 10th audio frames q may be selected to form a first audio frame unit and matched frame by frame against the 10 audio frames p (the 1st audio frame q in the unit against the 1st audio frame p, ..., the 10th audio frame q against the 10th audio frame p) to obtain the matching audio frames; the consecutive matching audio frames form a frame-contiguous unit, and the number of matching audio frames in that unit is recorded.
- Then the 2nd to 11th audio frames q form a second audio frame unit and are matched against the 10 audio frames p in the same way, and so on, until the 11th to 20th audio frames q form the last audio frame unit for feature matching.
- This yields a number of frame-contiguous units and the corresponding numbers of matching audio frames; the frame-contiguous unit containing the largest number of matching audio frames, that is, the longest frame-contiguous unit, is selected as the matching audio frame unit.
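The sliding-window matching above might be sketched as follows. `longest_matching_unit` is a hypothetical helper that, for every alignment of the audio file's feature sequence against the candidate song's first n features, counts the longest run of consecutively matching frames and keeps the best alignment:

```python
def longest_matching_unit(song_feats, audio_feats, n=None):
    """Slide a window of n consecutive audio-file frames over the feature
    sequence; for each alignment against the candidate song's first n
    frames, find the longest run of consecutively matching frames.
    Returns (best_offset, best_run_length)."""
    if n is None:
        n = len(song_feats)
    best_offset, best_run = 0, 0
    for offset in range(len(audio_feats) - n + 1):
        run = longest = 0
        for i in range(n):
            if audio_feats[offset + i] == song_feats[i]:
                run += 1
                longest = max(longest, run)   # track longest contiguous run
            else:
                run = 0
        if longest > best_run:
            best_offset, best_run = offset, longest
    return best_offset, best_run
```

A production fingerprint matcher would likely tolerate a few bit errors per frame rather than require exact equality; exact comparison is used here only to keep the sketch short.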
- Step 103: According to the matching audio frame unit corresponding to each candidate song identifier, obtain from the candidate song identifier set the target song identifier of the target song to which the episode belongs, and determine the target song to which the episode belongs according to the target song identifier.
- The matching audio frame unit matched between the audio file and the candidate song file corresponding to each candidate song identifier can be obtained by step 102, so that the target song identifier of the target song to which the video episode belongs can be selected from the candidate song identifier set according to the matching audio frame unit corresponding to each candidate song identifier.
- The matching audio frame unit may be frame-expanded to obtain a matching song segment matched between the candidate song file and the audio file, and the target song identifier may then be obtained based on the matching song segment; that is, the step of "obtaining, according to the matching audio frame unit corresponding to the candidate song identifier, the target song identifier of the target song to which the episode belongs from the candidate song identifier set" may include:
- obtaining time information corresponding to the candidate song identifier according to the matching song segment, the time information including: a first start time of the matching song segment in the video, a second start time of the matching song segment in the candidate song, and the duration of the matching song segment;
- The term "first start time" merely distinguishes the start time of the matching song segment in the video from its start time in the candidate song (the "second start time"); it does not refer to one specific time.
- The matching song segment corresponding to a candidate song identifier is the song segment matched between the candidate song corresponding to that identifier and the audio file; the matching song segment may be a song segment in the candidate song, or a song segment in the audio file.
- since the matching song segment is composed of audio frames, after the matching song segment is acquired, the start time of the segment in the candidate song, the start time of the segment in the video, and the duration of the segment (ie, the length of the segment) can be obtained according to the audio frames in the segment.
- for example, the start time of the segment in the candidate song may be obtained according to the sequence numbers of the audio frames of the segment, and the start time of the segment in the video may likewise be obtained according to those sequence numbers.
- performing audio frame expansion on the matching audio frame unit to obtain the matching song segment corresponding to the candidate song identifier may include:
- expanding the frames of the audio file and of the candidate song file synchronously, that is, extending the same number of audio frames in the same direction.
- the manner of determining the matching song segment according to the number of matching audio frames between the expanded units may be various; for example, when the number is greater than a certain preset number, the expanded unit is determined to be a matching song segment, or, when the ratio of the number of matching audio frames to the total number of audio frames in the expanded unit is greater than a preset ratio (eg, 90%), the expanded unit is determined to be a matching song segment.
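The ratio-based decision described above can be sketched as follows; this is an illustrative helper (the function name and the use of a 90% threshold are assumptions drawn from the example), not the patent's implementation:

```python
def is_matching_segment(matched_flags, preset_ratio=0.9):
    """Decide whether an expanded frame unit counts as a matching song
    segment: the ratio of matched audio frames to all frames in the
    expanded unit must reach the preset ratio (eg, 90%)."""
    if not matched_flags:
        return False
    return sum(matched_flags) / len(matched_flags) >= preset_ratio

# 9 of 10 expanded frames match -> 0.9, reaches the threshold
print(is_matching_segment([1] * 9 + [0]))       # True
# 7 of 10 expanded frames match -> 0.7, below the threshold
print(is_matching_segment([1] * 7 + [0] * 3))   # False
```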
- the step of “acquiring the target song identifier from the candidate song identifier set according to the time information corresponding to the candidate song identifier” may include:
- using the candidate song identifiers in the filtered candidate song identifier set as target song identifiers.
- for example, the candidate song identifiers whose play times have an inclusion relationship may be determined, and the candidate song identifiers whose play times are included in others are then filtered out.
- for example, the play time corresponding to song ID1 is 1s to 10s, the play time corresponding to song ID2 is 2s to 5s, and the play time corresponding to song ID3 is 3s to 8s; the play times corresponding to songs ID1, ID2, and ID3 have an inclusion relationship, so the song identifiers with the shorter play times can be filtered out, here song ID2 and song ID3.
- the candidate song identifiers whose play times have an overlapping relationship may also be determined, and the candidate song identifiers with the shorter play times are then filtered out.
- for example, the play time corresponding to song ID1 is 1s to 10s, and the play time corresponding to song ID2 is 5s to 12s; the song identifier with the shorter play length can be filtered out; here the play length of song ID1 is 10s and the play length of song ID2 is 7s, so song ID2 is filtered out.
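The two filtering rules in the examples above (drop play times contained in another candidate's play time, then among overlapping candidates keep the longer play time) can be sketched as follows; the helper name and interval representation are hypothetical:

```python
def filter_song_ids(play_times):
    """play_times: dict song_id -> (start, end) in seconds.
    First drop ids whose interval is contained in another id's
    interval; then, of two overlapping survivors, keep the longer."""
    ids = list(play_times)
    kept = set(ids)
    # inclusion relationship: drop the contained interval
    for a in ids:
        for b in ids:
            if a != b and a in kept and b in kept:
                sa, ea = play_times[a]
                sb, eb = play_times[b]
                if sa <= sb and eb <= ea and (sa, ea) != (sb, eb):
                    kept.discard(b)
    # overlap relationship: keep the longer of two overlapping intervals
    for a in ids:
        for b in ids:
            if a != b and a in kept and b in kept:
                sa, ea = play_times[a]
                sb, eb = play_times[b]
                if sa < eb and sb < ea and ea - sa >= eb - sb:
                    kept.discard(b)
    return kept

# the document's example: ID2 and ID3 are contained in ID1 and dropped
print(filter_song_ids({'ID1': (1, 10), 'ID2': (2, 5), 'ID3': (3, 8)}))
```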
- the song corresponding to the target song identifier may then be used as the target song to which the episode belongs.
- in addition, the lyrics of the video episode may be filled into the video, so that the lyrics are displayed when the video episode is played; that is, after step 103, the method may further include:
- filling the lyrics corresponding to the episode into the video.
- when the matching audio frame unit is expanded to obtain the matching song segment and its time information, this step may include: acquiring the lyrics corresponding to the episode according to the target song identifier and its corresponding time information, and filling the lyrics into the video, where the time information is the time information of the matching song segment corresponding to the target song identifier.
- the lyrics corresponding to the episode may be obtained according to the start time of the matching song segment corresponding to the target song identifier in the target song and the duration of the matching song segment, and the lyrics may be filled in according to the start time and duration of the matching song segment in the video; that is, the step “acquiring the lyrics corresponding to the episode according to the target song identifier and its corresponding time information, and filling the lyrics into the video” may include:
- acquiring the lyrics corresponding to the episode according to the second start time and the duration corresponding to the target song identifier, and filling the lyrics into the video according to the first start time and the duration.
- specifically, the target lyric file of the corresponding target song may be obtained according to the target song identifier, and the lyrics corresponding to the episode may then be extracted from the target lyric file according to the start time of the matching song segment in the target song (ie, the second start time) and the duration of the matching song segment; that is, the step “acquiring the lyrics corresponding to the episode according to the target song identifier and the corresponding second start time and duration” may include:
- acquiring the target lyric file of the target song according to the target song identifier, and extracting the corresponding lyrics from the lyric file as the lyrics of the episode.
- for example, if the target song identifier is song 1, the start time of the matching song segment corresponding to song 1 in song 1 is 5s, and the duration of the matching song segment is 10s, the lyrics from 5s to 15s can be obtained from the lyric file of song 1.
- the step of “filling the lyrics into the video according to the first start time and the duration corresponding to the target song identifier” may include:
- obtaining the presentation time of the lyrics in the video according to the first start time and the duration, and filling the lyrics into the video based on the presentation time.
- for example, if the start time of the matching song segment corresponding to the target song identifier in the video (ie, the first start time) is 7s, and the duration of the matching song segment is 8s, the presentation time of the lyrics in the video is from 7s to 15s; the lyrics can then be inserted at the corresponding position of the video based on the presentation time.
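Under the times in the examples above (segment starting at 5s in the song and 7s in the video), the extract-then-shift step can be sketched as follows; representing the lyric file as LRC-style `(time_in_song, text)` pairs is an assumption for illustration:

```python
def fill_lyrics(lyric_lines, song_start, video_start, duration):
    """lyric_lines: list of (time_in_song_seconds, text).  Extract the
    lines falling inside [song_start, song_start + duration] and map
    each one to its presentation time on the video timeline."""
    out = []
    for t, text in lyric_lines:
        if song_start <= t <= song_start + duration:
            out.append((video_start + (t - song_start), text))
    return out

# segment starts at 5s in the song and 7s in the video, lasting 8s,
# so lyrics are presented between 7s and 15s of the video
lyrics = [(4, 'line a'), (6, 'line b'), (12, 'line c'), (14, 'line d')]
print(fill_lyrics(lyrics, song_start=5, video_start=7, duration=8))
```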
- in order to display complete sentences of lyrics and enhance the user experience, after the episode lyrics are obtained it may be determined whether the lyrics form complete sentences, and if so, the lyrics filling operation is performed; that is, after the step “acquiring the lyrics corresponding to the episode” and before the step “filling the lyrics into the video”, the method may further include: determining whether the lyrics form complete sentences, and if so, performing the step of filling the lyrics into the video.
- in addition, a jump interface may be set in the video, so that when the video episode is played, the terminal may jump through the interface to play the song to which the video episode belongs; that is, after step 103, the method may further include:
- the form of the jump interface can be various, such as a button, an input box, etc., and can be set according to actual needs.
- in addition, an add interface may also be set in the video, so that when the video episode is played, the target song to which the video episode belongs may be added to a song list of the music software through the interface; that is, after the step “obtaining the target song identifier of the target song to which the episode belongs”, the method may further include:
- An add interface is set in the video according to the target song identifier, so that the terminal adds the target song to the song list of the music software through the add interface when playing the episode.
- the form of the added interface may be various, such as a button, an input box, etc., which may be set according to actual needs;
- the music software may be commonly used music playing software, such as cloud-based music playing software or online music playing software; the song list may be a song collection or a song playlist, such as a favorite song list.
- the embodiment of the present invention extracts an audio file from a video, obtains the candidate song identifiers of the candidate songs to which the episode in the audio file belongs to obtain a candidate song identifier set, and then obtains the candidate song files corresponding to the candidate song identifiers.
- the matching audio frame unit includes a plurality of consecutive matching audio frames; according to the matching audio frame unit corresponding to the candidate song identifier, the target song identifier of the target song to which the episode belongs is obtained from the candidate song identifier set, and the target song to which the episode belongs is determined according to the target song identifier. The scheme may first obtain the candidate song identifier set of the candidate songs to which the video episode belongs, and then, based on the matching audio frames between the audio file of the video and the songs, select from the candidate song identifier set the identifier of the song to which the video episode belongs, thereby determining the song to which the video episode belongs; relative to the related art, this can improve the accuracy and efficiency of identifying or locating the song to which a video episode belongs.
- further, the embodiment of the present invention fills the lyrics corresponding to the episode into the video according to the target song identifier and its corresponding matching audio frame unit; the scheme can automatically complete the matching of the video episode and the song to determine the song to which the video episode belongs, and automatically obtain the lyrics of the video episode for filling, which, relative to the related art, can improve the accuracy and efficiency of video episode lyrics filling.
- the candidate song identifier can be obtained based on audio fingerprint matching between the audio file in the video and the candidate song files.
- this embodiment mainly introduces the process of acquiring the candidate song identifier based on audio fingerprint matching.
- as shown in Figure 2a, the process of obtaining the candidate song identifier is as follows:
- Step 201 Divide the audio file into a plurality of audio segments, and obtain an audio fingerprint of the audio segment.
- the manner of dividing the audio file may be various.
- the audio file can be divided into multiple audio segments by a preset frame length and a preset frame shift, and the duration of each audio segment is equal to the preset frame length. That is, the step of dividing the audio file into a plurality of audio segments may include:
- the audio file is divided into a plurality of audio segments by a preset frame length and a preset frame shift.
- for example, the audio file may first be converted into 8k16bit PCM (Pulse Code Modulation) audio, that is, audio with an 8*1024 sampling rate and 16-bit quantization; then, with a frame length of 10 seconds and a frame shift of 1 second, the audio is divided into a plurality of small 10-second audio segments. For example, when the duration of each frame is 1s, the first frame to the tenth frame are divided into one audio segment, the second frame to the eleventh frame are divided into another audio segment, and so on. In a specific implementation, an appropriate division manner can be selected according to actual needs.
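The sliding division in the example above (frame length of 10 frames, frame shift of 1 frame, with each frame being 1s of audio) can be sketched as follows; `split_audio` is a hypothetical helper operating on frame indices rather than raw PCM:

```python
def split_audio(num_frames, frame_len=10, frame_shift=1):
    """Divide an audio stream, viewed as a sequence of 1-second
    frames, into overlapping segments of frame_len frames that slide
    forward by frame_shift frames, as in the 10s / 1s example above."""
    segments = []
    start = 0
    while start + frame_len <= num_frames:
        segments.append(list(range(start, start + frame_len)))
        start += frame_shift
    return segments

segs = split_audio(12)
print(segs[0])  # frames 0..9 form the first audio segment
print(segs[1])  # frames 1..10 form the second audio segment
```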
- the audio fingerprints may take multiple forms; in this embodiment, the small audio fingerprint is selected. The small audio fingerprint is a data structure that may be composed of spectral peak points on the spectrum; for example, the spectrum corresponding to the audio frames of the audio segment is obtained, the peak points of the spectrum corresponding to the audio frames are extracted to obtain the peak points of the spectrum corresponding to the audio segment, and the peak points in the set are then combined to obtain the audio fingerprint; that is, the step “acquiring the audio fingerprint of the audio segment” may include:
- the peak points of the spectrum in the peak set are combined to obtain an audio fingerprint of the audio segment.
- the step of “combining the spectral peak points in the peak set to obtain the audio fingerprint of the audio segment” may include:
- combining each spectral peak point with a target spectral peak point to obtain the audio fingerprint of the audio segment, the audio fingerprint including: the frequency corresponding to the spectral peak point, and the time difference and frequency difference between the spectral peak point and the target spectral peak point.
- the target spectral peak point combined with a spectral peak point may be a spectral peak point other than that spectral peak point; for example, after the peak set corresponding to the audio segment is acquired, a frequency peak point distribution map is generated according to the peak set; a target area corresponding to a certain frequency peak point (also referred to as an anchor point) may then be determined in the distribution map, the target area including the target frequency peak points to be combined with the anchor point; the anchor point is then combined with the target frequency peak points in the target area, and after combining, multiple audio fingerprints can be obtained.
- the horizontal axis of the frequency peak point distribution map is time, and the vertical axis is the frequency of the peak point; since the audio frames have a corresponding relationship with time, in order to quickly acquire the audio fingerprint, the audio frame sequence number can be used to represent time in the embodiment of the present invention.
- the frequency of a peak point can also be represented by a frequency band index number, and the index number can range from 0 to 255; that is, the above peak point coordinates t and f can be indicated by the audio frame sequence number and the frequency band index number, respectively.
- the target area can be represented by the audio frame number and the frequency band index number.
- the target area can be composed of a time area and a frequency domain area, where the time area can be 15 to 63 frames (the time difference is represented by 6 bits), the frequency domain area can be -31 to 31 frequency bands (the frequency difference is represented by 6 bits), and the size of the target area can be set according to actual requirements.
- the target area may also be restricted to include only three target spectral peak points, that is, the number of target spectral peak points corresponding to an anchor point is 3.
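The anchor/target-area pairing described above can be sketched as follows; `pair_peaks` is a hypothetical helper using the stated bounds (time difference 15 to 63 frames, frequency difference -31 to 31 bands, at most 3 targets per anchor):

```python
def pair_peaks(peaks, dt_range=(15, 63), df_range=(-31, 31), fan_out=3):
    """peaks: list of (frame_no, band_index) sorted by frame_no.
    Combine each anchor peak with up to fan_out later peaks inside its
    target area, yielding fingerprints (f1, df, dt)."""
    fingerprints = []
    for i, (t1, f1) in enumerate(peaks):
        targets = 0
        for t2, f2 in peaks[i + 1:]:
            dt, df = t2 - t1, f2 - f1
            if dt_range[0] <= dt <= dt_range[1] and df_range[0] <= df <= df_range[1]:
                fingerprints.append((f1, df, dt))
                targets += 1
                if targets == fan_out:
                    break
    return fingerprints

peaks = [(0, 100), (20, 110), (40, 90), (200, 50)]
print(pair_peaks(peaks))  # the (200, 50) peak is outside every target area
```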
- since short-time spectral peak points interact with each other, and one frequency component may mask a frequency component close to it (the auditory masking effect), the peak points with small time spacing and small frequency spacing are filtered out to ensure that the selected peak points are more evenly distributed along the time and frequency axes; that is, after the step “obtaining the peak set corresponding to the audio segment” and before the step “combining the spectral peak points in the peak set two by two”, the song determination method may further include:
- filtering the spectral peak points in the peak set according to the time differences and frequency differences between the spectral peak points.
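One way to realize this filtering is a greedy pass that drops a peak when it is close to an already-kept peak in both time and frequency; the helper and the specific thresholds below are assumptions for illustration, since the text does not fix them:

```python
def filter_peaks(peaks, min_dt=3, min_df=4):
    """peaks: list of (frame_no, band_index).  Keep a peak only if,
    for every peak kept so far, it differs by at least min_dt frames
    or at least min_df bands, spreading survivors more evenly along
    the time and frequency axes (thresholds are assumed)."""
    kept = []
    for t, f in sorted(peaks):
        if all(abs(t - t0) >= min_dt or abs(f - f0) >= min_df
               for t0, f0 in kept):
            kept.append((t, f))
    return kept

# (1, 11) is close to (0, 10) in both time and frequency and is dropped
print(filter_peaks([(0, 10), (1, 11), (1, 40), (10, 12)]))
```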
- for example, for a spectral peak point distribution map corresponding to the peak set of an audio segment, in order to make the spectral peak point distribution uniform, the peak points in the peak set may be filtered; the spectral peak point distribution corresponding to the filtered peak set is shown in Figure 2c.
- the audio features may be distinguished based on the size of the audio fingerprint: the audio feature in the first embodiment may be referred to as a large audio fingerprint, and the audio fingerprint of the audio segment described in this embodiment may be referred to as a small audio fingerprint.
- Step 202: Determine whether there is a fingerprint sample matching the audio fingerprint in the preset sample set; if yes, go to step 203, and if no, end the process.
- the preset sample set may include at least one fingerprint sample, where each fingerprint sample is an audio fingerprint of a song; for example, the preset sample set may have multiple fingerprint samples, and each fingerprint sample It may correspond to a song ID, for example, fingerprint sample 1 corresponds to song 1, fingerprint sample 2 corresponds to song 2, and fingerprint sample n corresponds to song n.
- a plurality of audio fingerprints of the audio segment may be acquired; it is then determined whether there is a fingerprint sample in the preset sample set matching (ie, identical to) each of the audio fingerprints, to obtain a plurality of matching fingerprint samples; the song identifier corresponding to each matching fingerprint sample is then acquired to obtain a song identifier set, the song identifier set including a plurality of the song identifiers.
- for example, the audio fingerprints corresponding to the audio segment include the audio fingerprint D1 and the audio fingerprint D2; the audio fingerprint D1 of the audio segment is compared with the fingerprint samples in the preset sample set, and if there is a fingerprint sample identical to the audio fingerprint D1, it is determined that the preset sample set has a fingerprint sample matching the audio fingerprint D1; likewise, the audio fingerprint D2 can be compared with the fingerprint samples in the preset sample set, and if there is a fingerprint sample identical to the audio fingerprint D2, it is determined that there is a fingerprint sample matching the audio fingerprint D2 in the preset sample set.
- the fingerprint samples may be obtained by extracting songs from a song database and then extracting the audio fingerprint of each song as a fingerprint sample; the manner of extracting the audio fingerprint of a song may be the same as the manner of obtaining the audio fingerprint of the audio segment described above: obtaining the spectrum corresponding to the audio frames in the song, extracting the peak points of the spectrum, and combining the peak points of the spectrum to obtain the audio fingerprint of the song (ie, the fingerprint sample).
- Step 203: Obtain the song identifier corresponding to the matching fingerprint sample, to obtain a first song identifier set corresponding to the audio segment, the first song identifier set comprising a plurality of the song identifiers.
- the manner of obtaining the song identifier corresponding to the matching fingerprint sample may be various.
- for example, a mapping relationship set may be used to obtain the song identifier corresponding to the matching fingerprint sample, where the mapping relationship set may include the mapping relationships (ie, correspondences) between fingerprint samples and song identifiers; that is, the step “acquiring the song identifier corresponding to the matching fingerprint sample” may include:
- acquiring, according to the mapping relationship set, the song identifier corresponding to the matching fingerprint sample, where the mapping relationship set includes the mapping relationships between fingerprint samples and song identifiers.
- the mapping relationship set may be preset; the mapping relationships between fingerprint samples and song identifiers may be preset by the system or set by the user; that is, before the step “extracting the audio file in the video”, the song determination method may further include:
- receiving a mapping relationship setting request, where the request indicates the fingerprint samples and song identifiers for which a mapping relationship needs to be established;
- establishing the mapping relationships between the fingerprint samples and the song identifiers according to the mapping relationship setting request, to obtain the mapping relationship set.
- the mapping relationship set may be presented in the form of a table, called a mapping relationship table; the mapping relationship table may include the preset sample set and the song identifiers corresponding to the fingerprint samples in the preset sample set, and the mapping relationship table can be stored in a database, which can be called a fingerprint library.
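A minimal sketch of such a fingerprint library, assuming a toy in-memory table mapping fingerprint samples to song identifiers (the table contents and helper name are hypothetical):

```python
# toy "mapping relationship table": fingerprint sample -> song identifiers
fingerprint_library = {
    (100, 10, 20): ['song1'],
    (110, -20, 20): ['song1', 'song2'],  # a sample may occur in several songs
}

def lookup_song_ids(audio_fingerprints):
    """Return the song identifier set for every audio fingerprint of
    the audio segment that has a matching (identical) fingerprint
    sample in the library."""
    ids = set()
    for fp in audio_fingerprints:
        ids.update(fingerprint_library.get(fp, []))
    return ids

# only the first fingerprint has a matching sample
print(lookup_song_ids([(100, 10, 20), (1, 2, 3)]))
```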
- Step 204 Select, from the song identification set, a candidate song identifier of the candidate song to which the episode belongs.
- the method may further include: before the step “selecting the candidate song identifier from the song identifier set”, obtaining a first offset time of the audio fingerprint in the audio segment and a second offset time of the matching fingerprint sample in the matching song, where the first offset time is the time of the spectral peak point in the audio segment, and the matching song is the song corresponding to the song identifier;
- the step of “selecting the candidate song identifier of the candidate song to which the episode belongs from the song identification set” may include:
- the candidate song identification is selected from the song identification set based on the start time of the audio segment in the matching song.
- for example, the offset time t1 of the audio fingerprint D1 (f1, Δf', Δt') in the audio segment can be obtained, where t1 is the time of the spectral peak point a1 in the audio segment; since the fingerprint samples are also extracted in the above manner, the offset time of a fingerprint sample in the song to which it belongs is the time, in that song, of the spectral peak point (ie, the anchor point) corresponding to the fingerprint sample.
- the offset time of the matching fingerprint sample in the matching song may be acquired based on a preset time mapping relationship set, where the preset time mapping relationship set may include the mapping relationships (correspondences) between fingerprint samples and the offset times of the fingerprint samples in the songs to which they belong; that is, the step “acquiring the second offset time of the matching fingerprint sample in the matching song” may include:
- acquiring, according to the time mapping relationship set, the second offset time corresponding to the matching fingerprint sample, where the time mapping relationship set includes the mapping relationships between fingerprint samples and the offset times of the fingerprint samples in the songs to which they belong.
- the time mapping relationship set may be preset; the mapping relationships between fingerprint samples and offset times may be preset by the system or set by the user; that is, before the step “extracting the audio file in the video”, the song determination method may further include:
- receiving a time mapping relationship setting request, where the request indicates the fingerprint samples and offset times for which a mapping relationship needs to be established, the offset time being the offset time of the fingerprint sample in the song to which it belongs;
- establishing the mapping relationships between the fingerprint samples and the offset times according to the request, to obtain the time mapping relationship set.
- the time mapping relationship set may be presented in the form of a table, called a time mapping relationship table; the time mapping relationship table may include the preset sample set and the offset times corresponding to the fingerprint samples in the preset sample set.
- the time mapping relationship set and the mapping relationship set may also be merged into one mapping relationship set; for example, a total mapping relationship set may be set, which may include both the mapping relationships between fingerprint samples and song identifiers and the mapping relationships between fingerprint samples and offset times; for example, a total mapping relationship table may be set, which may include: the preset sample set, the song identifiers corresponding to the fingerprint samples in the preset sample set, and the offset times corresponding to the fingerprint samples in the preset sample set.
- if a large number of the computed start times of the audio segment in a song are the same, it indicates that the song is very likely to be a candidate song corresponding to the audio segment, that is, a song to which the video episode belongs; that is, the step “selecting the candidate song identifier from the song identifier set according to the start time corresponding to the song identifiers in the song identifier set” may include:
- selecting the song identifier corresponding to the target start time from the song identifier set as the candidate song identifier.
- for example, a start time whose count of identical occurrences reaches a preset number may be selected as the target start time; that is, the step “determining the target start time from the time set according to the number of identical start times” may include:
- the preset number can be set according to actual needs, for example, it can be 5, 6, 9, and so on.
- the start time of the audio segment in a song may be obtained according to the offset time corresponding to the audio fingerprint and the offset time corresponding to the song identifier in the song identifier set; for example, the time difference between the offset time corresponding to the song identifier and the offset time corresponding to the audio fingerprint may be calculated, which is the start time of the audio segment in the song.
- for example, if the offset time corresponding to the audio fingerprint of the audio segment is t' and the offset time corresponding to the fingerprint sample (ie, the offset time corresponding to the song identifier) is t, the start time Δt = t - t' corresponding to each song identifier in the song identifier set can be calculated to obtain a time set, such as (Δt1, Δt2, Δt1, Δt1, Δt2, Δt3, Δt3, Δtn).
- the number of occurrences of each start time can be counted, and it is then determined whether the number is greater than the preset number; if so, that start time is determined to be the target start time. For example, if the preset number is 8, and the counted number of Δt1 is 10, the number of Δt2 is 6, and the number of Δt3 is 12, then the number of Δt1 is greater than the preset number, the number of Δt2 is less than the preset number, and the number of Δt3 is greater than the preset number, so it can be determined that Δt1 and Δt3 are the target start times.
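The counting step in the Δt1/Δt2/Δt3 example above amounts to histogram voting over the computed start times; a minimal sketch (the helper name is hypothetical):

```python
from collections import Counter

def target_start_times(start_times, preset_number=8):
    """Count how often each start time occurs in the time set and keep
    the start times whose count is greater than the preset number."""
    counts = Counter(start_times)
    return {dt for dt, n in counts.items() if n > preset_number}

# Δt1 occurs 10 times, Δt2 6 times, Δt3 12 times; preset number is 8,
# so Δt1 and Δt3 are the target start times
times = ['dt1'] * 10 + ['dt2'] * 6 + ['dt3'] * 12
print(sorted(target_start_times(times)))  # ['dt1', 'dt3']
```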
- in order to improve the matching speed of the audio fingerprint, the audio fingerprint may be converted; for example, the audio fingerprint is converted into a specific feature number by using a preset algorithm, named a hash value (hash_key):
- hash_key = f1 × 2^12 + Δf × 2^6 + Δt
- where ^ is the exponential operator; the audio fingerprint is thus converted into a specific number, that is, the three fields are packed by bit position into a 20-bit integer, so that subsequent audio fingerprint matching only needs to perform hash_key matching; that is, the step “determining whether there is a fingerprint sample matching the audio fingerprint in the preset sample set” may include:
- determining whether there is a digital sample matching the hash value of the audio fingerprint in a preset digital sample set.
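The hash_key packing can be sketched directly from the formula; note that the text leaves the 6-bit encoding of a negative Δf unspecified, so the sketch uses the literal formula and demonstrates it with non-negative values only:

```python
def hash_key(f1, df, dt):
    """hash_key = f1 * 2**12 + df * 2**6 + dt, packing the band index
    (8 bits), frequency difference (6 bits) and time difference
    (6 bits) by bit position into one 20-bit integer.  A negative df
    would first need a 6-bit encoding (e.g. an offset), which the
    text does not specify."""
    return f1 * 2 ** 12 + df * 2 ** 6 + dt

key = hash_key(200, 10, 20)
print(key, key < 2 ** 20)  # fits in 20 bits
```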
- the preset digital sample set includes at least one feature number, which is called a digital sample, and each digital sample can correspond to a song identifier.
- the step of “acquiring the song identifier corresponding to the matching fingerprint sample” may include: acquiring a song identifier corresponding to the matching digital sample.
- the song identifier corresponding to the matching digital sample may be obtained based on the digital mapping relationship set, that is, the step of “acquiring the song identifier corresponding to the matching digital sample” may include: acquiring the song identifier corresponding to the matching digital sample according to the digital mapping relationship set, where The set of digital mapping relationships includes: a correspondence between a digital sample and a song identifier.
- the digital mapping relationship set may be preset; the mapping relationships between digital samples and song identifiers may be preset by the system or set by the user; that is, before the step “extracting the audio file in the video”, the song determination method may further include:
- receiving a digital mapping relationship setting request, the request indicating the digital features and song identifiers for which a mapping relationship needs to be established;
- establishing the mapping relationships between the digital features and the song identifiers to obtain the digital mapping relationship set.
- correspondingly, the step of “acquiring the second offset time of the matching fingerprint sample in the matching song” may include: acquiring, according to a digital time mapping relationship set, the second offset time corresponding to the matching digital sample, where the digital time mapping relationship set includes the mapping relationships between digital samples and offset times.
- the manner of obtaining the digital time mapping relationship set can refer to the manner of creating the digital mapping relationship set or the time mapping relationship set described above, and is not repeated here.
- similarly, the digital mapping relationship set and the digital time mapping relationship set may be merged into one set; for example, a total mapping relationship set may be set, which includes the mapping relationships between digital samples and song identifiers and the mapping relationships between digital samples and offset times; for example, a mapping relationship table may be set, which may include: the preset digital sample set, the song identifiers corresponding to the digital samples in the preset digital sample set, and the offset times corresponding to the digital samples in the preset digital sample set.
- the embodiment of the present invention divides the audio file into a plurality of audio segments and acquires the audio fingerprints of the audio segments; it then determines whether there are fingerprint samples matching the audio fingerprints in the preset sample set, and if so, obtains the song identifiers corresponding to the matching fingerprint samples to obtain a first song identifier set corresponding to the audio segment, and selects from the song identifier set the candidate song identifiers of the candidate songs to which the episode belongs. The scheme may acquire all candidate songs of the video episode, and then determine the song corresponding to the video episode from the candidate songs based on the matching between the candidate songs and the audio of the video, which, compared with the related art, can improve the accuracy and efficiency of determining the song corresponding to a video episode.
- in addition, since this embodiment uses spectral peak points to construct the audio fingerprint, the candidate songs corresponding to the video episode and their identifiers can be accurately obtained, further improving the accuracy of determining or locating the candidate songs to which the video episode belongs.
- the embodiment of the present invention further provides a song determining apparatus.
- the song determining apparatus may include an identifier acquiring unit 301, an audio frame acquiring unit 302, and a song determining unit 303, as follows:
- the identifier obtaining unit 301 is configured to extract an audio file in the video, and obtain a candidate song identifier of the candidate song to which the episode belongs in the audio file, to obtain a candidate song identifier set.
- the identifier obtaining unit 301 may include: an audio extraction subunit, a fingerprint acquisition subunit, a determination subunit, an identifier collection acquisition subunit, and a selection subunit;
- the audio extraction subunit is configured to extract an audio file in the video
- the fingerprint acquisition subunit is configured to divide the audio file into a plurality of audio segments, and acquire an audio fingerprint of the audio segment;
- the determining subunit is configured to determine whether a fingerprint sample matching the audio fingerprint exists in the preset sample set
- the identifier collection acquisition sub-unit is configured to: when determining that there is a fingerprint sample matching the audio fingerprint, obtain a song identifier corresponding to the matching fingerprint sample, and obtain a song identifier set corresponding to the audio segment, where the song identifier set includes a plurality of the songs Identification
- the selection subunit is configured to select a candidate song identifier of the candidate song to which the episode belongs from the song identification set.
- there are multiple ways to obtain the video; for example, an acquisition request may be sent to the video server to obtain the video, or the video may be extracted from local storage.
- after the video is obtained, audio and video separation processing is performed on it to obtain the audio file of the video; that is, the step of "extracting the audio file in the video" may include: performing audio and video separation processing on the video to obtain the audio file of the video.
- there are multiple ways to divide the audio file; for example, the audio file may be divided into multiple audio segments by a preset frame length and a preset frame shift, with the duration of each audio segment equal to the preset frame length.
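As an illustrative aid only (not part of the original disclosure; the function name and the toy values are hypothetical), the division by a preset frame length and frame shift described above can be sketched as:

```python
# Sketch: split a sequence of audio samples into segments of a preset
# frame length, advancing by a preset frame shift, so that each segment's
# duration equals the frame length and consecutive segments may overlap.

def split_into_segments(samples, frame_length, frame_shift):
    segments = []
    start = 0
    while start + frame_length <= len(samples):
        segments.append(samples[start:start + frame_length])
        start += frame_shift
    return segments

# Toy input: 10 "samples", frame length 4, frame shift 2.
segments = split_into_segments(list(range(10)), frame_length=4, frame_shift=2)
```

With these toy values the audio is cut into four overlapping segments, each of the preset frame length.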
- the candidate song to which the episode belongs is a song that may match the video episode, and the candidate song identifier is the identifier of such a song.
- there are many ways to obtain an audio fingerprint of an audio segment, such as the following:
- the spectral peak points in the peak set are combined in pairs to obtain the audio fingerprint of the audio segment.
- the step "combining the spectral peak points in the peak set in pairs to obtain the audio fingerprint of the audio segment" may include: determining a target spectral peak point to be combined with the spectral peak point, and combining the spectral peak point with the target spectral peak point to obtain the audio fingerprint of the audio segment, where the audio fingerprint includes the frequency of the spectral peak point and the time difference and frequency difference between the spectral peak point and the target spectral peak point.
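For illustration only (not part of the disclosed implementation; the pairing window `max_dt` and all values are hypothetical), the pairwise peak combination can be sketched as follows, where each fingerprint records the anchor peak's frequency plus the time and frequency differences to a target peak:

```python
# Sketch: build fingerprints from (time, frequency) spectral peak points,
# pairing each anchor peak with nearby later peaks. Peaks are assumed to
# be sorted by time, which lets the inner loop stop early.

def build_fingerprints(peaks, max_dt=5):
    fingerprints = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            if t2 - t1 > max_dt:  # only pair peaks close in time
                break
            # (anchor frequency, time difference, frequency difference)
            fingerprints.append((f1, t2 - t1, f2 - f1))
    return fingerprints

fps = build_fingerprints([(0, 100), (2, 150), (10, 120)])
```

Here only the first two peaks are close enough in time to form a pair, yielding a single fingerprint triple.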
- there are multiple ways to select a candidate song identifier from the song identifier set; for example, it may be selected based on the offset time of the audio fingerprint.
- the song determining apparatus may further include: an offset time acquisition unit, configured to acquire, after the fingerprint acquisition subunit acquires the audio fingerprint and before the selection subunit selects the candidate song identifier, a first offset time of the audio fingerprint in the audio segment and a second offset time of the matching fingerprint sample in the matching song, where the first offset time is the time of the spectral peak point within the audio segment, and the matching song is the song corresponding to the song identifier;
- the selection subunit may be specifically configured to: acquire, according to the first offset time and the second offset time, the start time of the audio segment in the matching song, and select the candidate song identifier from the song identifier set based on the start time of the audio segment in the matching song.
- further, the selection subunit may be specifically configured to: acquire the start times corresponding to the song identifiers in the song identifier set to obtain a time set, determine a target start time from the time set according to the number of occurrences of each start time, and select the song identifier corresponding to the target start time from the song identifier set as the candidate song identifier.
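As an illustrative aid only (not the patent's implementation; the data and names are hypothetical), the start-time vote can be sketched as: each matched fingerprint implies a start time equal to the sample's offset in the song minus the fingerprint's offset in the segment, and the (song, start time) pair seen most often wins:

```python
# Sketch: vote on (song identifier, start time) pairs derived from
# fingerprint matches; the most frequent pair gives the candidate.
from collections import Counter

def pick_candidate(matches):
    """matches: list of (song_id, first_offset, second_offset) tuples."""
    times = Counter((song_id, second - first)
                    for song_id, first, second in matches)
    (song_id, start_time), _ = times.most_common(1)[0]
    return song_id, start_time

# Three matches for "s1" (two agreeing on start time 30) and one for "s2".
song, start = pick_candidate([("s1", 1, 31), ("s1", 4, 34),
                              ("s1", 7, 38), ("s2", 2, 12)])
```

Agreeing start times reinforce each other, so scattered false matches are outvoted.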
- the audio frame obtaining unit 302 is configured to acquire a candidate song file corresponding to the candidate song identifier, and acquire the matching audio frames between the candidate song file and the audio file to obtain a matching audio frame unit, where the matching audio frame unit includes a plurality of consecutive matching audio frames.
- the audio frame obtaining unit 302 may specifically include: a matching subunit, a first acquiring subunit, and a second acquiring subunit;
- the matching subunit is configured to match an audio feature of the first audio frame in the candidate song file with an audio feature of the second audio frame in the audio file to obtain a matching result
- the first obtaining subunit is configured to acquire, according to the matching result, a matching audio frame that matches the candidate song file and the audio file;
- the second obtaining subunit is configured to obtain a matching audio frame unit according to the matched audio frame.
- the matching subunit may be specifically configured to: acquire the number of first audio frames in the candidate song file, and select an audio frame unit from the audio file, the audio frame unit including an equal number of second audio frames; and match the audio features of the first audio frames in the candidate song file with the audio features of the second audio frames in the audio frame unit to obtain an audio feature matching result;
- the first acquiring subunit is specifically configured to: acquire, according to the audio feature matching result, the matching audio frames between the candidate song file and the audio file, where a matching audio frame is an audio frame whose audio features are successfully matched;
- the second obtaining subunit is specifically configured to: acquire a frame-contiguous unit according to the matching audio frames, the frame-contiguous unit including a plurality of consecutive matching audio frames; and acquire the number of matching audio frames in the frame-contiguous unit and determine, according to that number, whether the frame-contiguous unit is a matching audio frame unit.
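For illustration only (not part of the original disclosure; the threshold `min_length` and the indices are hypothetical), the frame-contiguous unit check can be sketched as collecting runs of consecutive matched frame indices and keeping runs long enough to count as units:

```python
# Sketch: group matched frame indices into runs of consecutive indices,
# keeping only runs whose length reaches a minimum threshold.

def contiguous_units(matched_indices, min_length=3):
    units, run = [], []
    for idx in sorted(matched_indices):
        if run and idx != run[-1] + 1:
            if len(run) >= min_length:
                units.append(run)   # close the previous run
            run = []
        run.append(idx)
    if len(run) >= min_length:
        units.append(run)           # close the final run
    return units

units = contiguous_units([2, 3, 4, 7, 10, 11, 12, 13])
```

The isolated match at index 7 is discarded; the two sufficiently long runs survive as matching audio frame units.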
- the song determining apparatus of the embodiment of the present invention may further include: a feature acquiring unit, configured to acquire the audio features corresponding to the first audio frames in the candidate song file after the identifier obtaining unit 301 acquires the candidate song identifier and before the matching subunit performs feature matching.
- the feature acquiring unit may be specifically configured to: convert the candidate song file into audio of a preset format (such as 8k16bit audio); take a first preset number of sampling points as one frame and perform a Fourier transform with a frame shift of a second preset number of sampling points to obtain the spectrum (for example, 1856 sampling points per frame with a frame shift of 58 sampling points); divide the spectrum into a third preset number of frequency bands (such as 32) and calculate the average amplitude value corresponding to each frequency band; and compare the average amplitude of each frequency band with that of the corresponding frequency band of the previous frame (the first frequency band of the current frame is compared with the first frequency band of the previous frame, the second with the second, and so on until all frequency bands are compared) to obtain the audio feature of the frame.
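As an illustrative aid only (not the disclosed implementation; band counts and amplitudes are toy values), the band-comparison feature can be sketched as one bit per band, set when the band's average amplitude grew relative to the previous frame:

```python
# Sketch: average the spectrum amplitude over equal-sized bands, then
# compare each band with the same band of the previous frame to get bits.

def band_averages(spectrum, num_bands):
    size = len(spectrum) // num_bands
    return [sum(spectrum[i * size:(i + 1) * size]) / size
            for i in range(num_bands)]

def frame_feature(prev_bands, cur_bands):
    # bit is 1 where the band's average amplitude increased
    return [1 if cur > prev else 0
            for prev, cur in zip(prev_bands, cur_bands)]

prev = band_averages([1, 1, 2, 2, 3, 3, 4, 4], num_bands=4)
cur = band_averages([2, 2, 1, 1, 4, 4, 3, 3], num_bands=4)
bits = frame_feature(prev, cur)
```

With 32 bands, the resulting bit vector packs into a compact per-frame feature that two files can compare frame by frame.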
- the song determining unit 303 is configured to acquire a target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to the candidate song identifier, and determine, according to the target song identifier, the target song to which the episode belongs.
- the song determining unit 303 may specifically include: an audio frame extension subunit, a time acquisition subunit, an identifier acquisition subunit, and a song determination subunit;
- the audio frame extension subunit is configured to perform audio frame expansion on the matched audio frame unit corresponding to the candidate song identifier, to obtain a matching song segment corresponding to the candidate song identifier;
- the time acquisition subunit is configured to acquire, according to the matching song segment, time information corresponding to the candidate song identifier, the time information including: a first start time of the matching song segment in the video, a second start time of the matching song segment in the candidate song, and the duration of the matching song segment;
- the identifier acquisition subunit is configured to obtain a target song identifier from the candidate song identifier set according to time information corresponding to the candidate identifier;
- the song determining subunit is configured to determine a target song to which the episode belongs according to the target song identifier.
- the audio frame extension subunit may be specifically configured to: perform audio frame expansion on the matching audio frame unit in the candidate song file and in the audio file respectively, to obtain a first expanded matching audio frame unit in the candidate song file and a second expanded matching audio frame unit in the audio file; match the audio features of the first audio frames in the first expanded unit with the audio features of the second audio frames in the second expanded unit to obtain the matching audio frames between the expanded units; and determine, according to the number of matching audio frames between the expanded units, the first or second expanded matching audio frame unit as the matching song segment between the candidate song and the audio file.
- the identifier obtaining subunit may be specifically configured to: acquire, according to the second start time and the duration corresponding to the candidate song identifier, the play time of the matching song segment in the video; filter the candidate song identifiers in the candidate song identifier set according to their play times to obtain a filtered candidate identifier set (for example, after the play times corresponding to the candidate song identifiers are acquired, candidate song identifiers whose play times overlap may be determined, and those with a shorter play duration filtered out); and take the candidate song identifiers in the filtered candidate identifier set as the target song identifiers of the target song to which the episode belongs.
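For illustration only (not part of the original disclosure; the interval data and names are hypothetical), the overlap filter can be sketched as: when the play times of two candidates in the video overlap, drop the one with the shorter play duration:

```python
# Sketch: filter a mapping of song identifier -> (start, end) play time
# in the video, removing the shorter of any overlapping pair.

def filter_overlaps(candidates):
    kept = dict(candidates)
    ids = list(kept)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if a in kept and b in kept:
                (s1, e1), (s2, e2) = kept[a], kept[b]
                if s1 < e2 and s2 < e1:  # intervals overlap
                    shorter = a if (e1 - s1) < (e2 - s2) else b
                    del kept[shorter]
    return kept

kept = filter_overlaps({"s1": (10, 40), "s2": (30, 35), "s3": (50, 60)})
```

Here "s2" overlaps "s1" but plays for a shorter time, so it is filtered out, while the non-overlapping "s3" survives.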
- the song determining apparatus of the embodiment of the present invention may further include: a lyrics filling unit 304;
- the lyrics filling unit 304 is configured to fill the lyrics corresponding to the episode to the video according to the target song identifier and its corresponding matching audio frame unit;
- the lyrics filling unit 304 may include: a lyrics acquiring subunit and a filling subunit;
- the lyrics obtaining sub-unit is configured to acquire the lyrics corresponding to the episode according to the target song identifier and the corresponding first start time and the duration;
- the padding subunit is configured to fill the lyrics to the video according to the second start time and the duration corresponding to the target song identifier.
- specifically, the target lyric file of the corresponding target song may be obtained according to the target song identifier, and the lyrics corresponding to the episode may then be extracted from the target lyric file according to the start time of the matching song segment in the target song and the duration of the matching song segment; that is, the lyrics obtaining subunit may be specifically configured to: obtain the target lyric file according to the target song identifier, and extract the corresponding lyrics from the lyric file as the lyrics of the episode.
- the filling subunit may be specifically configured to: determine, according to the second start time and the duration corresponding to the target song identifier, the presentation time of the lyrics in the video, and fill the lyrics into the video based on the presentation time.
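As an illustrative aid only (not the disclosed implementation; the timed lyric lines and values are hypothetical), extracting the segment's lyrics and re-timing them for the video can be sketched as:

```python
# Sketch: lyric lines are (time_in_song, text) pairs; select the lines
# inside the matching segment in the song, then shift each line's time
# by the difference between the segment's start in the video and its
# start in the song.

def fill_lyrics(lines, song_start, video_start, duration):
    offset = video_start - song_start
    return [(t + offset, text)
            for t, text in lines
            if song_start <= t < song_start + duration]

filled = fill_lyrics([(58, "la"), (62, "li"), (75, "lo")],
                     song_start=60, video_start=300, duration=10)
```

Only the line timed inside the segment survives, re-timed to its presentation time in the video.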
- in some embodiments, the song determining apparatus may further include a lyrics determining unit 305, referring to FIG. 3c;
- the lyrics determining unit 305 may be configured to determine, after the lyrics filling unit 304 acquires the lyrics corresponding to the episode and before it fills the lyrics into the video, whether the lyrics form a complete sentence;
- the lyrics filling unit 304 may be configured to: when the lyrics determining unit 305 determines that the lyrics form a complete sentence, perform the step of filling the lyrics into the video according to the second start time and the duration corresponding to the target song identifier.
- an embodiment of the present invention may further provide an interface in the video, so that when the video episode is played, the terminal can jump through the interface to play the song to which the video episode belongs; that is, the song determining apparatus of the embodiment of the present invention may further include: an interface setting unit;
- the interface setting unit may be configured to: after the song determining unit 303 acquires the target song identifier of the episode, set a jump interface in the video according to the target song identifier, so that the terminal, when playing the episode, jumps through the jump interface to play the target song to which the episode belongs.
- the form of the jump interface can be various, such as a button, an input box, etc., and can be set according to actual needs.
- the interface setting unit may be further configured to: after the song determining unit 303 acquires the target song identifier, set an add interface in the video according to the target song identifier, so that the terminal, when playing the episode, adds the target song to a song list of music software through the add interface.
- in specific implementations, the foregoing units may each be implemented as a separate entity, or may be combined in any manner and implemented as one or more entities.
- for the specific implementation of the foregoing units, reference may be made to the foregoing method embodiments; details are not described herein again.
- in the song determining apparatus of the embodiment of the present invention, the identifier obtaining unit 301 extracts the audio file from the video and acquires the candidate song identifiers of the candidate songs to which the episode in the audio file belongs, to obtain a candidate song identifier set; the audio frame obtaining unit 302 then acquires the candidate song file corresponding to the candidate song identifier and acquires the matching audio frames between the candidate song file and the audio file to obtain a matching audio frame unit, where the matching audio frame unit includes multiple consecutive matching audio frames; and the song determining unit 303 acquires the target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to the candidate song identifier, and determines, according to the target song identifier, the target song to which the episode belongs.
- this scheme may first obtain a candidate song identifier set of the candidate songs to which the video episode belongs, and then select the identifier of the song to which the video episode belongs from the candidate song identifier set based on the matching audio frames between the video's audio file and the songs, thereby determining the song associated with the video episode; compared with the related art, this can improve the accuracy and efficiency of determining or locating the song corresponding to the video episode.
- after determining the song to which the video episode belongs, the apparatus of the embodiment of the present invention may further fill the lyrics corresponding to the episode into the video according to the target song identifier and its corresponding matching audio frame unit; this scheme can automatically complete the matching between the video episode and the song, that is, determine the song to which the video episode belongs, and automatically obtain the lyrics of the video episode for filling, so the accuracy and efficiency of video episode lyrics filling can be improved compared with the related art.
- FIG. 4 exemplarily shows a schematic diagram of the structure of the song determining apparatus 40 provided by the embodiment of the present invention.
- the structure shown in FIG. 4 is only one example of a suitable structure and is not intended to suggest any limitation regarding the structure of the song determining apparatus 40.
- the song determining device 40 can be implemented in a distributed computing environment including, for example, a server computer, a minicomputer, a mainframe computer, and any of the above-described devices.
- Computer readable instructions may be distributed via computer readable media (discussed below).
- Computer readable instructions may be implemented as program modules, such as functions, objects, application programming interfaces (APIs), data structures, etc. that perform particular tasks or implement particular abstract data types.
- the functionality of the computer readable instructions can be combined or distributed at will in various environments.
- FIG. 4 illustrates an example of the structure of a song determining apparatus 40 provided in accordance with an embodiment of the present invention.
- the song determining device 40 includes at least one processing unit 41 and a storage unit 42.
- storage unit 42 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. This configuration is illustrated by dashed lines in FIG. 4.
- song determining device 40 may include additional features and/or functionality.
- song determining device 40 may also include additional storage devices (eg, removable and/or non-removable) including, but not limited to, magnetic storage devices, optical storage devices, and the like.
- This additional storage device is illustrated by storage unit 43 in FIG. 4.
- computer readable instructions for implementing one or more embodiments provided by the embodiments of the present invention may be stored in storage unit 43.
- the storage unit 43 can also store other computer readable instructions for implementing an operating system, applications, and the like.
- Computer readable instructions may be loaded into storage unit 42 for execution by, for example, processing unit 41.
- Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions or other data.
- the storage unit 42 and the storage unit 43 are examples of computer storage media.
- Computer storage media includes, but is not limited to, RAM, ROM, Electrically Erasable Programmable Read-Only Memory, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage device
- the song determining device 40 may also include a communication connection 46 that allows the song determining device 40 to communicate with other devices.
- Communication connection 46 may include, but is not limited to, a modem, a network interface card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interface for connecting song determination device 40 to other song determination devices.
- Communication connection 46 may include a wired connection or a wireless connection. Communication connection 46 can transmit and/or receive communication media.
- Computer readable medium can include a communication medium.
- Communication media typically embody computer readable instructions or other data in a “modulated data signal” such as a carrier wave or other transport mechanism, and include any information delivery media.
- A modulated data signal may be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information into the signal.
- Song determining device 40 may include an input unit 45, such as a keyboard, mouse, pen, voice input device, touch input device, infrared camera, video input device, and/or any other input device.
- Output unit 44 may also be included in song determining device 40, such as one or more displays, speakers, printers, and/or any other output device.
- the input unit 45 and the output unit 44 may be connected to the song determining device 40 via a wired connection, a wireless connection, or any combination thereof.
- an input device or output device from another song determining device may be used as the input unit 45 or output unit 44 of the song determining device 40.
- the components of song determining device 40 may be connected by various interconnects, such as a bus. Such interconnections may include a Peripheral Component Interconnect (PCI), a Universal Serial Bus (USB), a FireWire (IEEE 1394), an optical bus structure, and the like.
- the components of song determining device 40 may be interconnected by a network.
- storage unit 42 may comprise a plurality of physical memory units located in different physical locations and interconnected by a network.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Signal Processing (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Mathematical Physics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Computer Security & Cryptography (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management Or Editing Of Information On Record Carriers (AREA)
- Television Signal Processing For Recording (AREA)
- Reverberation, Karaoke And Other Acoustics (AREA)
- Indexing, Searching, Synchronizing, And The Amount Of Synchronization Travel Of Record Carriers (AREA)
Abstract
Description
Claims (27)
- A song determining method, comprising: extracting an audio file from a video; acquiring candidate song identifiers of candidate songs to which an episode in the audio file belongs, to form a candidate song identifier set; acquiring a candidate song file corresponding to a candidate song identifier, and acquiring matching audio frames between the candidate song file and the audio file; forming a matching audio frame unit based on the acquired matching audio frames, wherein the matching audio frame unit comprises a plurality of consecutive matching audio frames; acquiring a target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to the candidate song identifier; and determining, according to the target song identifier, a target song to which the episode belongs.
- The song determining method according to claim 1, wherein acquiring the matching audio frames between the candidate song file and the audio file and forming the matching audio frame unit based on the acquired matching audio frames comprises: matching audio features of first audio frames in the candidate song file with audio features of second audio frames in the audio file to obtain a matching result; acquiring, according to the matching result, the matching audio frames between the candidate song file and the audio file; and acquiring the matching audio frame unit according to the matching audio frames.
- The song determining method according to claim 2, wherein matching the audio features corresponding to the first audio frames in the candidate song file with the audio features corresponding to the second audio frames in the audio file to obtain the matching result comprises: acquiring the number of first audio frames in the candidate song file, and selecting an audio frame unit from the audio file, the audio frame unit comprising a number of second audio frames equal to the number of first audio frames; and matching the audio features of the first audio frames in the candidate song file with the audio features of the second audio frames in the audio frame unit to obtain an audio feature matching result; wherein acquiring, according to the matching result, the matching audio frames between the candidate song file and the audio file comprises: acquiring, according to the audio feature matching result, the matching audio frames between the candidate song file and the audio file, the matching audio frames being audio frames whose audio features are successfully matched; and wherein acquiring the matching audio frame unit according to the matching audio frames comprises: acquiring a frame-contiguous unit according to the matching audio frames, the frame-contiguous unit comprising a plurality of consecutive matching audio frames; and acquiring the number of matching audio frames in the frame-contiguous unit, and determining, according to the number, that the frame-contiguous unit is a matching audio frame unit.
- The song determining method according to claim 1, wherein acquiring the target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to the candidate song identifier comprises: performing audio frame expansion on the matching audio frame unit corresponding to the candidate song identifier to obtain a matching song segment corresponding to the candidate song identifier; acquiring, according to the matching song segment, time information corresponding to the candidate song identifier, wherein the time information comprises: a first start time of the matching song segment in the video, a second start time of the matching song segment in the candidate song, and a duration of the matching song segment; and acquiring the target song identifier from the candidate song identifier set according to the time information corresponding to the candidate identifier.
- The song determining method according to claim 4, wherein performing audio frame expansion on the matching audio frame unit corresponding to the candidate song identifier to obtain the matching song segment corresponding to the candidate song identifier comprises: performing audio frame expansion on the matching audio frame unit in the candidate song file and in the audio file respectively, to obtain a first expanded matching audio frame unit in the candidate song file and a second expanded matching audio frame unit in the audio file; matching the audio features of the first audio frames in the first expanded matching audio frame unit with the audio features of the second audio frames in the second expanded matching audio frame unit to obtain matching audio frames between the expanded units; and determining, according to the number of matching audio frames between the expanded units, the first expanded matching audio frame unit or the second expanded matching audio frame unit as the matching song segment matched between the candidate song and the audio file.
- The song determining method according to claim 2, wherein after the corresponding candidate song file is acquired according to the candidate song identifier and before the audio features corresponding to the first audio frames in the candidate song file are matched with the audio features corresponding to the second audio frames in the audio file, the song determining method further comprises: acquiring a spectrum corresponding to each first audio frame in the candidate song file; dividing the spectrum corresponding to the first audio frame into a preset number of frequency bands, and acquiring average amplitude values corresponding to the frequency bands; comparing the average amplitude value of each frequency band with the average amplitude value of the corresponding frequency band of the previous first audio frame to obtain a comparison result; and acquiring the audio feature corresponding to the first audio frame according to the comparison result.
- The song determining method according to claim 4, wherein acquiring the target song identifier from the candidate song identifier set according to the time information corresponding to the candidate identifier comprises: acquiring, according to the second start time and the duration corresponding to the candidate song identifier, a play time corresponding to the candidate song identifier, the play time being the play time of the matching song segment in the video; filtering the candidate song identifiers in the candidate song identifier set according to the play times corresponding to the candidate song identifiers, to obtain a filtered candidate identifier set; and taking the candidate songs in the filtered candidate identifier set as target song identifiers.
- The song determining method according to claim 4, further comprising: after the target song identifier of the target song to which the episode belongs is acquired, filling lyrics corresponding to the episode into the video according to the target song identifier and its corresponding matching audio frame unit.
- The song determining method according to claim 5, wherein filling the lyrics corresponding to the episode into the video according to the target song identifier and its corresponding matching audio frame unit comprises: acquiring the lyrics corresponding to the episode according to the target song identifier and the corresponding first start time and duration; and filling the lyrics into the video according to the second start time and the duration corresponding to the target song identifier.
- The song determining method according to claim 9, further comprising: after the lyrics corresponding to the episode are acquired and before the lyrics are filled into the video, determining whether the lyrics form a complete sentence; and if so, performing the step of filling the lyrics into the video according to the second start time and the duration corresponding to the target song identifier.
- The song determining method according to claim 1, wherein acquiring the candidate song identifiers of the candidate songs to which the episode in the audio file belongs comprises: dividing the audio file into a plurality of audio segments, and acquiring audio fingerprints of the audio segments; determining whether a fingerprint sample matching the audio fingerprint exists in a preset sample set; if so, acquiring the song identifiers corresponding to the matching fingerprint samples to obtain a song identifier set corresponding to the audio segment, the song identifier set comprising a plurality of the song identifiers; and selecting, from the song identifier set, the candidate song identifiers of the candidate songs to which the episode belongs.
- The song determining method according to claim 11, wherein acquiring the audio fingerprint of the audio segment comprises: acquiring spectra corresponding to the audio frames in the audio segment; extracting, from the spectra, spectral peak points corresponding to the audio frames to obtain a peak set corresponding to the audio segment, the peak set comprising the spectral peak points corresponding to the audio frames; and combining the spectral peak points in the peak set in pairs to obtain the audio fingerprint of the audio segment.
- The song determining method according to claim 12, wherein combining the spectral peak points in the peak set in pairs to obtain the audio fingerprint of the audio segment comprises: determining a target spectral peak point to be combined with the spectral peak point; and combining the spectral peak point with the target spectral peak point to obtain the audio fingerprint of the audio segment, the audio fingerprint comprising: the frequency corresponding to the spectral peak point, and the time difference and frequency difference between the spectral peak point and the target spectral peak point.
- The song determining method according to claim 13, further comprising: after the audio fingerprint is acquired and before the candidate song identifier is selected, acquiring a first offset time of the audio fingerprint in the audio segment and a second offset time of the matching fingerprint sample in the matching song, wherein the first offset time is the time of the spectral peak point within the audio segment, and the matching song is the song corresponding to the song identifier; and wherein selecting, from the song identifier set, the candidate song identifiers of the candidate songs to which the episode belongs comprises: acquiring, according to the first offset time and the second offset time, a start time of the audio segment in the matching song; and selecting the candidate song identifier from the song identifier set according to the start time of the audio segment in the matching song.
- The song determining method according to claim 14, wherein selecting the candidate song identifier from the song identifier set according to the start times corresponding to the song identifiers in the song identifier set comprises: acquiring the start times corresponding to the song identifiers in the song identifier set to obtain a time set; determining a target start time from the time set according to the number of occurrences of each start time; and selecting, from the song identifier set, the song identifier corresponding to the target start time as the candidate song identifier.
- The song determining method according to claim 4, further comprising: after the target song identifier of the target song to which the episode belongs is acquired, setting a jump interface in the video according to the target song identifier, so that a terminal, when playing the episode, jumps through the jump interface to play the target song to which the episode belongs.
- The song determining method according to claim 1, further comprising: after the target song identifier is acquired, setting an add interface in the video according to the target song identifier, so that a terminal, when playing the episode, adds the target song to a song list of music software through the add interface.
- A song determining apparatus, comprising: an identifier acquiring unit configured to extract an audio file from a video and acquire candidate song identifiers of candidate songs to which an episode in the audio file belongs, to form a candidate song identifier set; an audio frame acquiring unit configured to acquire a candidate song file corresponding to a candidate song identifier, acquire matching audio frames between the candidate song file and the audio file, and form a matching audio frame unit based on the acquired matching audio frames, wherein the matching audio frame unit comprises a plurality of consecutive matching audio frames; and a song determining unit configured to acquire a target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to the candidate song identifier, and determine, according to the target song identifier, a target song to which the episode belongs.
- The song determining apparatus according to claim 18, wherein the audio frame acquiring unit specifically comprises: a matching subunit, a first acquiring subunit, and a second acquiring subunit; the matching subunit being configured to match the audio features of the first audio frames in the candidate song file with the audio features of the second audio frames in the audio file to obtain a matching result; the first acquiring subunit being configured to acquire, according to the matching result, the matching audio frames between the candidate song file and the audio file; and the second acquiring subunit being configured to acquire the matching audio frame unit according to the matching audio frames.
- The song determining apparatus according to claim 19, wherein the matching subunit is specifically configured to: acquire the number of first audio frames in the candidate song file, and select an audio frame unit from the audio file, the audio frame unit comprising a number of second audio frames equal to the number of first audio frames; and match the audio features of the first audio frames in the candidate song file with the audio features of the second audio frames in the audio frame unit to obtain an audio feature matching result; the first acquiring subunit is specifically configured to: acquire, according to the audio feature matching result, the matching audio frames between the candidate song file and the audio file, the matching audio frames being audio frames whose audio features are successfully matched; and the second acquiring subunit is specifically configured to: acquire a frame-contiguous unit according to the matching audio frames, the frame-contiguous unit comprising a plurality of consecutive matching audio frames; and acquire the number of matching audio frames in the frame-contiguous unit and determine, according to the number, that the frame-contiguous unit is a matching audio frame unit.
- The song determining apparatus according to claim 18, wherein the song determining unit specifically comprises: an audio frame expansion subunit, a time acquiring subunit, an identifier acquiring subunit, and a song determining subunit; the audio frame expansion subunit being configured to perform audio frame expansion on the matching audio frame unit corresponding to the candidate song identifier to obtain a matching song segment corresponding to the candidate song identifier; the time acquiring subunit being configured to acquire, according to the matching song segment, time information corresponding to the candidate song identifier, wherein the time information comprises: a first start time of the matching song segment in the video, a second start time of the matching song segment in the candidate song, and a duration of the matching song segment; the identifier acquiring subunit being configured to acquire a target song identifier from the candidate song identifier set according to the time information corresponding to the candidate identifier; and the song determining subunit being configured to determine, according to the target song identifier, the target song to which the episode belongs.
- The song determining apparatus according to claim 21, wherein the identifier acquiring subunit is specifically configured to: acquire, according to the second start time and the duration corresponding to the candidate song identifier, a play time corresponding to the candidate song identifier, the play time being the play time of the matching song segment in the video; filter the candidate song identifiers in the candidate song identifier set according to the play times corresponding to the candidate song identifiers, to obtain a filtered candidate identifier set; and take the candidate songs in the filtered candidate identifier set as target song identifiers of the target song to which the episode belongs.
- The song determining apparatus according to claim 21, further comprising: a lyrics filling unit configured to fill lyrics corresponding to the episode into the video according to the target song identifier and its corresponding matching audio frame unit.
- The song determining apparatus according to claim 23, wherein the lyrics filling unit comprises: a lyrics acquiring subunit and a filling subunit; the lyrics acquiring subunit being configured to acquire the lyrics corresponding to the episode according to the target song identifier and the corresponding first start time and duration; and the filling subunit being configured to fill the lyrics into the video according to the second start time and the duration corresponding to the target song identifier.
- The song determining apparatus according to claim 18, wherein the identifier acquiring unit specifically comprises: an audio extraction subunit, a fingerprint acquiring subunit, a determining subunit, an identifier set acquiring subunit, and a selecting subunit; the audio extraction subunit being configured to extract the audio file from the video; the fingerprint acquiring subunit being configured to divide the audio file into a plurality of audio segments and acquire audio fingerprints of the audio segments; the determining subunit being configured to determine whether a fingerprint sample matching the audio fingerprint exists in a preset sample set; the identifier set acquiring subunit being configured to, when it is determined that a fingerprint sample matching the audio fingerprint exists, acquire the song identifiers corresponding to the matching fingerprint samples to obtain a song identifier set corresponding to the audio segment, the song identifier set comprising a plurality of the song identifiers; and the selecting subunit being configured to select, from the song identifier set, the candidate song identifiers of the candidate songs to which the episode belongs.
- A song determining apparatus, comprising: a memory and a processor, the memory storing executable instructions for causing the processor to perform operations comprising: extracting an audio file from a video; acquiring candidate song identifiers of candidate songs to which an episode in the audio file belongs, to form a candidate song identifier set; acquiring a candidate song file corresponding to a candidate song identifier, and acquiring matching audio frames between the candidate song file and the audio file; obtaining a matching audio frame unit based on the acquired matching audio frames, wherein the matching audio frame unit comprises a plurality of consecutive matching audio frames; acquiring a target song identifier from the candidate song identifier set according to the matching audio frame unit corresponding to the candidate song identifier; and determining, according to the target song identifier, a target song to which the episode belongs.
- A storage medium storing executable instructions for performing the song determining method according to any one of claims 1 to 17.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2018526229A JP6576557B2 (ja) | 2016-04-19 | 2017-04-06 | 歌曲確定方法及び装置、記憶媒体 |
MYPI2018701777A MY194965A (en) | 2016-04-19 | 2017-04-06 | Song determining method and device, and storage medium |
KR1020187010247A KR102110057B1 (ko) | 2016-04-19 | 2017-04-06 | 노래 확정 방법과 장치, 기억 매체 |
US16/102,478 US10719551B2 (en) | 2016-04-19 | 2018-08-13 | Song determining method and device and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610244446.8A CN105868397B (zh) | 2016-04-19 | 2016-04-19 | 一种歌曲确定方法和装置 |
CN201610244446.8 | 2016-04-19 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/102,478 Continuation US10719551B2 (en) | 2016-04-19 | 2018-08-13 | Song determining method and device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017181852A1 true WO2017181852A1 (zh) | 2017-10-26 |
Family
ID=56633482
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2017/079631 WO2017181852A1 (zh) | 2016-04-19 | 2017-04-06 | 一种歌曲确定方法和装置、存储介质 |
Country Status (6)
Country | Link |
---|---|
US (1) | US10719551B2 (zh) |
JP (1) | JP6576557B2 (zh) |
KR (1) | KR102110057B1 (zh) |
CN (1) | CN105868397B (zh) |
MY (1) | MY194965A (zh) |
WO (1) | WO2017181852A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113515662A (zh) * | 2021-07-09 | 2021-10-19 | 北京百度网讯科技有限公司 | 一种相似歌曲检索方法、装置、设备以及存储介质 |
US11570506B2 (en) | 2017-12-22 | 2023-01-31 | Nativewaves Gmbh | Method for synchronizing an additional signal to a primary signal |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868397B (zh) * | 2016-04-19 | 2020-12-01 | 腾讯科技(深圳)有限公司 | 一种歌曲确定方法和装置 |
CN106708990B (zh) * | 2016-12-15 | 2020-04-24 | 腾讯音乐娱乐(深圳)有限公司 | 一种音乐片段提取方法和设备 |
US20190019522A1 (en) * | 2017-07-11 | 2019-01-17 | Dubbydoo, LLC, c/o Fortis LLP | Method and apparatus for multilingual film and audio dubbing |
CN107918663A (zh) * | 2017-11-22 | 2018-04-17 | 腾讯科技(深圳)有限公司 | 音频文件检索方法及装置 |
US10929097B2 (en) * | 2018-06-26 | 2021-02-23 | ROVl GUIDES, INC. | Systems and methods for switching operational modes based on audio triggers |
CN109558509B (zh) * | 2018-07-04 | 2021-10-15 | 北京邮电大学 | 一种广播音频中广告检索的方法和装置 |
CN112004134B (zh) * | 2019-05-27 | 2022-12-09 | 腾讯科技(深圳)有限公司 | 多媒体数据的展示方法、装置、设备及存储介质 |
CN110992983B (zh) * | 2019-11-26 | 2023-04-18 | 腾讯音乐娱乐科技(深圳)有限公司 | 识别音频指纹的方法、装置、终端及存储介质 |
CN111161758B (zh) * | 2019-12-04 | 2023-03-31 | 厦门快商通科技股份有限公司 | 一种基于音频指纹的听歌识曲方法、系统及音频设备 |
CN111400543B (zh) * | 2020-03-20 | 2023-10-10 | 腾讯科技(深圳)有限公司 | 音频片段的匹配方法、装置、设备及存储介质 |
CN111475672B (zh) * | 2020-03-27 | 2023-12-08 | 咪咕音乐有限公司 | 一种歌词分配方法、电子设备及存储介质 |
CN111404808B (zh) * | 2020-06-02 | 2020-09-22 | 腾讯科技(深圳)有限公司 | 一种歌曲的处理方法 |
US20220027407A1 (en) * | 2020-07-27 | 2022-01-27 | Audible Magic Corporation | Dynamic identification of unknown media |
US11657814B2 (en) * | 2020-10-08 | 2023-05-23 | Harman International Industries, Incorporated | Techniques for dynamic auditory phrase completion |
CN112866584B (zh) * | 2020-12-31 | 2023-01-20 | 北京达佳互联信息技术有限公司 | 视频合成方法、装置、终端及存储介质 |
CN112764612A (zh) * | 2021-01-21 | 2021-05-07 | 北京字跳网络技术有限公司 | 互动方法、装置、电子设备和存储介质 |
CN112906369A (zh) * | 2021-02-19 | 2021-06-04 | 脸萌有限公司 | 一种歌词文件生成方法及装置 |
CN113436641A (zh) * | 2021-06-22 | 2021-09-24 | 腾讯音乐娱乐科技(深圳)有限公司 | 一种音乐转场时间点检测方法、设备及介质 |
CN113780180A (zh) * | 2021-09-13 | 2021-12-10 | 江苏环雅丽书智能科技有限公司 | 一种音频长时指纹提取及匹配方法 |
CN114020958B (zh) * | 2021-09-26 | 2022-12-06 | 天翼爱音乐文化科技有限公司 | 一种音乐分享方法、设备及存储介质 |
CN114071184A (zh) * | 2021-11-11 | 2022-02-18 | 腾讯音乐娱乐科技(深圳)有限公司 | 一种字幕定位方法、电子设备及介质 |
US20230186953A1 (en) * | 2021-12-09 | 2023-06-15 | Bellevue Investments Gmbh & Co. Kgaa | System and method for ai/xi based automatic song finding method for videos |
CN114339081A (zh) * | 2021-12-22 | 2022-04-12 | 腾讯音乐娱乐科技(深圳)有限公司 | 一种字幕生成方法、电子设备及计算机可读存储介质 |
CN114666653A (zh) * | 2022-03-23 | 2022-06-24 | 腾讯音乐娱乐科技(深圳)有限公司 | 一种音乐片段的字幕显示方法、设备及可读存储介质 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005101243A1 (en) * | 2004-04-13 | 2005-10-27 | Matsushita Electric Industrial Co. Ltd. | Method and apparatus for identifying audio such as music |
CN103971689A (zh) * | 2013-02-04 | 2014-08-06 | 腾讯科技(深圳)有限公司 | 一种音频识别方法及装置 |
CN104142989A (zh) * | 2014-07-28 | 2014-11-12 | 腾讯科技(深圳)有限公司 | 一种匹配检测方法及装置 |
CN104409087A (zh) * | 2014-11-18 | 2015-03-11 | 广东欧珀移动通信有限公司 | 歌曲文件播放方法和系统 |
CN104598541A (zh) * | 2014-12-29 | 2015-05-06 | 乐视网信息技术(北京)股份有限公司 | 多媒体文件的识别方法、装置 |
CN105868397A (zh) * | 2016-04-19 | 2016-08-17 | 腾讯科技(深圳)有限公司 | 一种歌曲确定方法和装置 |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5596705A (en) * | 1995-03-20 | 1997-01-21 | International Business Machines Corporation | System and method for linking and presenting movies with their underlying source information |
US5809471A (en) * | 1996-03-07 | 1998-09-15 | Ibm Corporation | Retrieval of additional information not found in interactive TV or telephony signal by application using dynamically extracted vocabulary |
US6209028B1 (en) * | 1997-03-21 | 2001-03-27 | Walker Digital, Llc | System and method for supplying supplemental audio information for broadcast television programs |
KR100716290B1 (ko) * | 2005-07-04 | 2007-05-09 | 삼성전자주식회사 | 영상처리장치, 부가정보처리장치 및 영상처리방법 |
US8168876B2 (en) | 2009-04-10 | 2012-05-01 | Cyberlink Corp. | Method of displaying music information in multimedia playback and related electronic device |
US20110292992A1 (en) * | 2010-05-28 | 2011-12-01 | Microsoft Corporation | Automating dynamic information insertion into video |
US9093120B2 (en) * | 2011-02-10 | 2015-07-28 | Yahoo! Inc. | Audio fingerprint extraction by scaling in time and resampling |
CN103116629B (zh) * | 2013-02-01 | 2016-04-20 | 腾讯科技(深圳)有限公司 | A matching method and system for audio content |
CN103440330A (zh) * | 2013-09-03 | 2013-12-11 | 网易(杭州)网络有限公司 | A music program information acquisition method and device |
CN103853836B (zh) * | 2014-03-14 | 2017-01-25 | 广州酷狗计算机科技有限公司 | A music retrieval method and system based on music fingerprint features |
- 2016
  - 2016-04-19 CN CN201610244446.8A patent/CN105868397B/zh active Active
- 2017
  - 2017-04-06 MY MYPI2018701777A patent/MY194965A/en unknown
  - 2017-04-06 WO PCT/CN2017/079631 patent/WO2017181852A1/zh active Application Filing
  - 2017-04-06 KR KR1020187010247A patent/KR102110057B1/ko active IP Right Grant
  - 2017-04-06 JP JP2018526229A patent/JP6576557B2/ja active Active
- 2018
  - 2018-08-13 US US16/102,478 patent/US10719551B2/en active Active
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11570506B2 (en) | 2017-12-22 | 2023-01-31 | Nativewaves Gmbh | Method for synchronizing an additional signal to a primary signal |
CN113515662A (zh) * | 2021-07-09 | 2021-10-19 | 北京百度网讯科技有限公司 | A similar song retrieval method, apparatus, device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
MY194965A (en) | 2022-12-28 |
US10719551B2 (en) | 2020-07-21 |
CN105868397A (zh) | 2016-08-17 |
JP6576557B2 (ja) | 2019-09-18 |
US20180349494A1 (en) | 2018-12-06 |
KR20180050745A (ko) | 2018-05-15 |
KR102110057B1 (ko) | 2020-05-12 |
CN105868397B (zh) | 2020-12-01 |
JP2019505874A (ja) | 2019-02-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2017181852A1 (zh) | A song determination method and apparatus, and storage medium | |
CN107591149B (zh) | Audio synthesis method, apparatus, and storage medium | |
US10776422B2 (en) | Dual sound source audio data processing method and apparatus | |
TWI494917B (zh) | Audio recognition method and apparatus | |
EP1855216A2 (en) | System, device, method, and program for segmenting radio broadcast audio data | |
CN106055659B (zh) | A lyric data matching method and device | |
US9558272B2 (en) | Method of and a system for matching audio tracks using chromaprints with a fast candidate selection routine | |
CN111640411B (zh) | Audio synthesis method, apparatus, and computer-readable storage medium | |
CN110209872B (zh) | Method, apparatus, computer device, and storage medium for generating lyrics for audio segments | |
US20210304776A1 (en) | Method and apparatus for filtering out background audio signal and storage medium | |
CN108280074A (zh) | Audio recognition method and system | |
WO2023040520A1 (zh) | Video soundtrack matching method, apparatus, computer device, and storage medium | |
US9881083B2 (en) | Method of and a system for indexing audio tracks using chromaprints | |
KR100916310B1 (ko) | Cross-recommendation system and method between music and video based on audio signal processing | |
US9990911B1 (en) | Method for creating preview track and apparatus using the same | |
CN106775567B (zh) | A sound effect matching method and system | |
WO2023005193A1 (zh) | Subtitle display method and apparatus | |
CN113747233B (zh) | A music replacement method, apparatus, electronic device, and storage medium | |
CN110400578B (zh) | Hash code generation and matching method, apparatus, electronic device, and storage medium | |
CN108268572B (zh) | A song synchronization method and system | |
CN108205550B (zh) | Audio fingerprint generation method and apparatus | |
JP2010086273A (ja) | Music search device, music search method, and music search program | |
JP6413828B2 (ja) | Information processing method, information processing apparatus, and program | |
KR101365592B1 (ko) | MGI music file generation system and method | |
CN115203342A (zh) | An audio recognition method, electronic device, and readable storage medium |
Legal Events
Code | Title | Description |
---|---|---|
ENP | Entry into the national phase | Ref document number: 20187010247; Country of ref document: KR; Kind code of ref document: A |
WWE | Wipo information: entry into national phase | Ref document number: 2018526229; Country of ref document: JP |
NENP | Non-entry into the national phase | Ref country code: DE |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17785337; Country of ref document: EP; Kind code of ref document: A1 |
122 | Ep: pct application non-entry in european phase | Ref document number: 17785337; Country of ref document: EP; Kind code of ref document: A1 |