WO2017050175A1 - Audio recognition method and system - Google Patents

Audio recognition method and system Download PDF

Info

Publication number
WO2017050175A1
WO2017050175A1 (PCT/CN2016/099053)
Authority
WO
WIPO (PCT)
Prior art keywords
feature point
audio file
spectrogram
feature
window
Prior art date
Application number
PCT/CN2016/099053
Other languages
English (en)
French (fr)
Inventor
杜志军
Original Assignee
阿里巴巴集团控股有限公司
杜志军
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司, 杜志军 filed Critical 阿里巴巴集团控股有限公司
Priority to KR1020187008373A priority Critical patent/KR102077411B1/ko
Priority to SG11201801808RA priority patent/SG11201801808RA/en
Priority to JP2018515493A priority patent/JP6585835B2/ja
Priority to EP16848057.2A priority patent/EP3355302B1/en
Publication of WO2017050175A1 publication Critical patent/WO2017050175A1/zh
Priority to US15/897,431 priority patent/US10679647B2/en

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An audio recognition method and system. The method comprises: performing diffusion processing on first feature points in a spectrogram of an audio file to be recognized to obtain a feature point map, the number of the first feature points being more than one (S110); searching the spectrogram of a target audio file for second feature points respectively corresponding to the diffused first feature points in the feature point map (S120); and if such points are found, determining that the audio file to be recognized is part of the target audio file (S130). The method can improve the feature point matching success rate in audio recognition.

Description

Audio recognition method and system

Technical Field

This application relates to the field of Internet technologies, and in particular to an audio recognition method and system.

Background

With the continuous development of Internet technology, the Internet has become an indispensable tool in daily life. Using Internet devices to recognize unknown audio, and building interactions on top of audio recognition, is emerging as a new application trend.

Audio-recognition-based interaction has many applications. In one of them, a user who hears a song without knowing its title can record a segment of the song and then use audio recognition technology to identify the song's title, singer, and other information.

In the prior art, feature points are generally extracted from the audio to be recognized, and recognition is performed with feature point pairs. As shown in FIG. 1, the horizontal axis represents time and the vertical axis represents frequency. The extracted feature points are the "X" marks in the figure; two feature points form a feature point pair, and there are eight feature point pairs in the target region. Recognition is performed against a database using these feature point pairs; the database stores the feature points of songs together with song information such as title and singer. If an identical feature point pair can be matched within the same target region in the database, the match succeeds, and the corresponding song information can then be obtained. However, because noise interference is unavoidable when recording audio, the extracted feature points do not always appear at their normal positions, so the probability of successfully matching a feature point pair is low.

In summary, the prior art suffers from a low feature point matching success rate in audio recognition.
Summary of the Invention

The purpose of the embodiments of this application is to provide an audio recognition method and system that solve the prior-art problem of a low feature point matching success rate in audio recognition.

To solve the above technical problem, an audio recognition method provided by an embodiment of this application includes:

performing diffusion processing on first feature points in a spectrogram of an audio file to be recognized to obtain a feature point map, the number of the first feature points being more than one;

searching the spectrogram of a target audio file for second feature points respectively corresponding to the diffused first feature points in the feature point map;

and if such points are found, determining that the audio file to be recognized is part of the target audio file.

An audio recognition system provided by an embodiment of this application includes:

a diffusion unit configured to perform diffusion processing on first feature points in a spectrogram of an audio file to be recognized to obtain a feature point map, the number of the first feature points being more than one;

a search unit configured to search the spectrogram of a target audio file for second feature points respectively corresponding to the diffused first feature points in the feature point map;

a determination unit configured to determine, when second feature points respectively corresponding to the diffused first feature points in the feature point map are found in the spectrogram of the target audio file, that the audio file to be recognized is part of the target audio file.

As can be seen from the technical solutions above, the audio recognition method and system provided by the embodiments of this application reduce the noise-induced deviation of the first feature points by applying diffusion processing to them in the spectrogram of the audio file to be recognized, thereby raising the matching rate between the diffused first feature points and the target audio file, i.e., raising the feature point matching success rate.
Brief Description of the Drawings

To explain the technical solutions of the embodiments of this application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some of the embodiments recorded in this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of prior-art recognition using feature point pairs;

FIG. 2 is a flowchart of an audio recognition method provided in an embodiment of this application;

FIG. 3 is a schematic diagram of a spectrogram of an audio file to be recognized;

FIG. 4a is a schematic diagram of first feature points before diffusion processing;

FIG. 4b is a schematic diagram of first feature points after diffusion processing;

FIG. 5 is a flowchart of step S120 of FIG. 2;

FIG. 6 is a schematic diagram of searching the spectrogram of a target audio file for the second feature points respectively corresponding to the diffused first feature points in a feature point map;

FIG. 7 is a flowchart of an audio recognition method provided in an embodiment of this application;

FIG. 8a is a schematic diagram of first feature points determined in a spectrogram;

FIG. 8b is a partially enlarged view of FIG. 8a;

FIG. 9 is a schematic block diagram of an audio recognition system provided in an embodiment of this application.
Detailed Description

To help those skilled in the art better understand the technical solutions in this application, the technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings of those embodiments. The described embodiments are obviously only a part of the embodiments of this application rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
FIG. 2 is a flowchart of an audio recognition method provided in an embodiment of this application. In this embodiment, the audio recognition method includes the following steps:

S110: Perform diffusion processing on first feature points in a spectrogram of an audio file to be recognized to obtain a feature point map, the number of the first feature points being more than one.

A spectrogram, also called a speech spectrogram, is generally obtained by processing a received time-domain signal. In general, the abscissa of a spectrogram represents time, the ordinate represents frequency, and the value at a coordinate point represents the energy of the speech data. Since a two-dimensional plane is typically used to express this three-dimensional information, the energy value at a coordinate point can be represented by color. In a color rendering, the darker the color, the stronger the speech energy at that point; conversely, the lighter the color, the weaker the speech energy at that point. In a grayscale rendering, the closer the color is to white, the stronger the speech energy at that point; conversely, the closer the color is to black, the weaker the speech energy at that point.

In this way, a spectrogram can intuitively represent how the spectral characteristics of a speech signal vary over time, with the strength of any given frequency component at a given moment indicated by the gray level or color shade of the corresponding point.

Specifically, the spectrogram can be obtained through the following steps:

A1: Divide the audio file to be recognized into frames according to a preset time.

The preset time may be an empirical value derived from the user's past experience. In this embodiment the preset time is 32 milliseconds; that is, the audio file to be recognized is divided into frames of 32 milliseconds each, with a 16-millisecond overlap between consecutive frames.

A2: Perform short-time spectrum analysis on the framed audio segments to obtain the spectrogram.

The short-time spectrum analysis includes the Fast Fourier Transform (FFT), a fast algorithm for the discrete Fourier transform. The FFT converts the audio signal into a spectrogram that records the joint distribution information of the time domain and the frequency domain.

Because the signal is framed at 32 milliseconds and 32 milliseconds corresponds to sampling at 8000 Hz (i.e., 256 samples per frame), 256 frequency points are obtained after the FFT computation.

As shown in FIG. 3, the horizontal axis may represent the frame index, i.e., the number of frames after the audio file is framed, corresponding to the width of the spectrogram; the vertical axis may represent frequency, with 256 frequency points in total, corresponding to the height of the spectrogram; the value at a coordinate point represents the energy of the first feature point.
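By way of illustration (not part of the original disclosure), a minimal Python/numpy sketch of steps A1-A2 under the framing just described; the function and parameter names are our own:

```python
import numpy as np

def spectrogram(samples, frame_len=256, hop=128):
    # A1: at 8000 Hz, 32 ms = 256 samples per frame; a 128-sample hop
    # gives the 16 ms overlap between consecutive frames.
    starts = range(0, len(samples) - frame_len + 1, hop)
    frames = np.stack([samples[s:s + frame_len] for s in starts])
    # A2: short-time spectrum analysis; the squared FFT magnitude serves
    # as the per-point energy value. A 256-sample frame yields the
    # 256 frequency points mentioned in the text.
    return (np.abs(np.fft.fft(frames, axis=1)) ** 2).T  # frequency x frame
```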
Preferably, after the short-time spectrum analysis of the framed audio segments, the method may further include:

A3: Extract the 300-2000 Hz frequency band from the result of the short-time spectrum analysis.

Since the dominant frequencies of typical songs are concentrated in the 300-2000 Hz band, extracting this band eliminates the negative influence of noise in other frequency bands.
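Such a band-limiting step could look as follows, continuing the numpy sketch above (the helper name and the bin-to-frequency mapping are illustrative assumptions, not the patent's own):

```python
def band_limit(spec, sample_rate=8000, n_fft=256, lo=300.0, hi=2000.0):
    # FFT bin k of an n_fft-point transform sits at k * sample_rate / n_fft Hz;
    # keep only the spectrogram rows that fall inside 300-2000 Hz.
    freqs = np.arange(spec.shape[0]) * sample_rate / n_fft
    return spec[(freqs >= lo) & (freqs <= hi), :]
```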
In another embodiment of this application, before step S110, the method may further include:

normalizing the energy values of the first feature points of the spectrogram of the audio file to be recognized into gray values of the first feature points.

In this embodiment, the range of energy values of the first feature points after the FFT is large, sometimes reaching 0-2^8 or even 0-2^16 (the range is proportional to the signal strength of the audio file). The energy values are therefore normalized to the range 0-255, so that they can be mapped to gray values, with 0 representing black and 255 representing white.

A general normalization method is: traverse the energy values of the first feature points over the entire spectrogram to obtain the maximum value and minimum value, and normalize each first feature point by

$$V' = \frac{V - V_{\min}}{V_{\max} - V_{\min}} \times 255$$

where $V$ is the energy value of a first feature point, $V_{\min}$ is the minimum value, and $V_{\max}$ is the maximum value.

The embodiments of this application may adopt this general normalization method. However, when some near-silent passages are present, the $V_{\min}$ obtained this way is too small (it may approach 0), so that the normalization formula degenerates into

$$V' = \frac{V}{V_{\max}} \times 255,$$

which no longer depends on $V_{\min}$ at all. Such a $V_{\min}$ is therefore unrepresentative and distorts the overall normalization result.
An embodiment of this application therefore provides a new normalization method, which may include:

traversing the spectrogram frame by frame with a window of a first preset length;

obtaining the local maximum value and local minimum value among the energy values of the first feature points within the window;

normalizing the energy values of the first feature points into gray values of the first feature points according to the local maximum value and the local minimum value,

using the formula shown in (2):

$$V' = \frac{V - V_{\min}}{V_{\max} - V_{\min}} \times 255 \qquad (2)$$

where $V$ is the energy value of a first feature point, $V_{\min}$ is the local minimum value, and $V_{\max}$ is the local maximum value.

This embodiment is described in terms of the framed signal: the first preset length may span from the T-th frame before the current frame to the T-th frame after it; that is, the first preset length is 2T frames, and the 2T+1 frames cover more than one second.
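One way to read this requirement, under the 32 ms frame / 16 ms overlap framing assumed above: consecutive frames advance by 16 ms, so $2T+1$ frames span $16(2T+1)+16$ ms, and

$$16(2T+1) + 16 > 1000 \;\Longrightarrow\; 2T+1 > 61.5 \;\Longrightarrow\; T \ge 31.$$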
With the normalization method provided in this embodiment, a near-silent passage can only affect the normalization result within the first preset length containing it, not the result outside that length. Such a normalization method therefore reduces the impact of near-silence on the overall normalization result.
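A minimal sketch of this windowed normalization (numpy again; the symmetric 2T+1-frame window and the small epsilon guard against a constant window are our assumptions):

```python
def normalize_local(spec, T=31):
    # Formula (2) applied per frame: V_min and V_max are taken only over
    # the 2T+1 frames centred on the current frame, so a near-silent
    # stretch elsewhere in the file cannot drag V_min down here.
    gray = np.zeros_like(spec, dtype=float)
    n_frames = spec.shape[1]
    for t in range(n_frames):
        win = spec[:, max(0, t - T):min(n_frames, t + T + 1)]
        v_min, v_max = win.min(), win.max()
        gray[:, t] = (spec[:, t] - v_min) / (v_max - v_min + 1e-9) * 255
    return gray
```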
The diffusion processing may include Gaussian-function diffusion, i.e., diffusing the first feature points with a Gaussian function; it may also include amplification, i.e., enlarging the first feature points by some factor, for example 10x.

Taking Gaussian-function diffusion as an example, the following formula is used:

$$f(x) = a\, \exp\!\left(-\frac{(x-b)^2}{2c^2}\right) \qquad (1)$$

where $a$, $b$, and $c$ are constants and $a > 0$.

That is, formula (1) is applied to the radius or diameter of each first feature point to perform Gaussian-function diffusion.

Taking the amplification of the first feature points as an example, the radius or diameter of each first feature point is enlarged, for example by a factor of 10. Of course, in some embodiments a first feature point may also be enlarged by some factor into at least one of a circle, a diamond, a rectangle, and so on.

As shown in FIG. 4a, before diffusion the white points (first feature points of the audio file to be recognized) deviate from the black points (feature points of the target audio file), so few second feature points are obtained in the final match; as shown in FIG. 4b, after diffusion each white point spreads from a single point into a region, and the regions coincide with the black points.

Diffusion processing spreads each first feature point from a point into a region and thus provides a degree of robustness against noise: under noise interference, a first feature point of the recorded audio may deviate slightly in position from the corresponding first feature point of the original audio, and after diffusion this deviation can be ignored, increasing the number of second feature points obtained in matching.
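The amplification variant is the easier one to sketch (numpy; the binary feature-point-map representation and the 10-pixel radius are our assumptions; a Gaussian variant would weight the disc with formula (1) instead of filling it with ones):

```python
def diffuse(point_map, radius=10):
    # Turn every first feature point into a filled disc of the given
    # radius, so a slightly shifted target point still falls inside it.
    h, w = point_map.shape              # binary map: 1 where a point sits
    out = np.zeros_like(point_map)
    y, x = np.ogrid[:h, :w]
    for r, c in zip(*np.nonzero(point_map)):
        out[(y - r) ** 2 + (x - c) ** 2 <= radius ** 2] = 1
    return out
```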
S120: Search the spectrogram of the target audio file for second feature points respectively corresponding to the diffused first feature points in the feature point map.

As shown in FIG. 5, step S120 may specifically include:

S121: Traverse the spectrogram of the target audio file frame by frame with the feature point map as a window;

S122: In each traversal step, determine as second feature points those feature points in the window's portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffused first feature points in the window;

S123: Check whether the window's portion of the target audio file's spectrogram contains second feature points respectively corresponding to each diffused first feature point.

FIG. 6 illustrates searching the spectrogram of the target audio file for the second feature points respectively corresponding to the diffused first feature points in the feature point map. Suppose the feature point map is N frames long and the spectrogram of the target audio file is L frames long, with L greater than or equal to N. The search first covers the region of frames [0, N] of the target audio file's spectrogram, then the region [1, N+1], and so on frame by frame until traversal ends at the region [L-N, L]. In each traversal step, within the window [t, t+N] (t being the frame index), the feature points of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffused first feature points are determined as second feature points, and the second feature points respectively corresponding to each diffused first feature point are searched for in the target audio file.
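The traversal itself is a plain sliding window. A sketch under the same numpy assumptions; `score` stands in for whichever matching-degree measure the later embodiment chooses:

```python
def best_offset(feature_map, target_points, score):
    # S121: slide the N-frame feature point map over the L-frame target
    # spectrogram one frame at a time: windows [0,N), [1,N+1), ... [L-N,L).
    N, L = feature_map.shape[1], target_points.shape[1]
    scores = [(score(feature_map, target_points[:, t:t + N]), t)
              for t in range(L - N + 1)]
    return max(scores) if scores else None  # (best score, frame offset)
```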
In other embodiments, all audio files in a database may also be traversed. In this way, the audio information of the audio file to be recognized can be identified more accurately.

S130: If such points are found, determine that the audio file to be recognized is part of the target audio file.

If second feature points respectively corresponding to each diffused first feature point are found in the spectrogram of the target audio file, it can be determined that the audio file to be recognized is part of the target audio file.

In this embodiment, diffusion processing of the first feature points in the spectrogram of the audio file to be recognized reduces the noise-induced deviation of the first feature points, thereby raising the matching rate between the diffused first feature points and the target audio file, i.e., raising the audio feature point matching success rate.
In an embodiment of this application, step S122 may specifically include:

determining the matching degree between the first feature points and those feature points in the window's portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffused first feature points in the window;

determining the feature points whose matching degree is greater than a first threshold as second feature points.

The matching degree includes either the ratio of the number of feature points in the window's spectrogram that fall within the coordinate ranges of the diffused first feature points to the number of first feature points, or the sum of the energy values or gray values of the first feature points corresponding to those feature points. The first threshold may be a statistical result obtained by the user from comprehensive consideration of the relevant factors.

Taking the ratio as an example: if there are 100 diffused first feature points and 60 such feature points, the matching degree between the first feature points and the feature points is 60%; if the matching degree exceeds the first threshold (say, a threshold of 50%), the feature points are determined as second feature points.

Taking the sum of energy values as an example: if there are 10 such feature points, the energy values of the 10 corresponding first feature points are added to obtain the sum of energy values; if the sum is greater than the first threshold, the feature points are determined as second feature points.

Taking the sum of gray values as an example: if there are 10 such feature points, the gray values of the 10 corresponding first feature points are added to obtain the sum of gray values; if the sum is greater than the first threshold, the feature points are determined as second feature points.
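A sketch of these matching-degree variants (numpy; reading the 100-to-60 example above as a point-count ratio is our interpretation, and the per-pixel energy map is an assumed representation):

```python
def matching_degree(diffused, window_points, n_first, energy=None):
    # Target feature points that land inside the diffused regions are
    # the candidate second feature points.
    inside = (window_points > 0) & (diffused > 0)
    if energy is None:
        # Variant 1: matched points as a fraction of the first feature
        # points, e.g. 60 matches against 100 first points gives 60%.
        return inside.sum() / n_first
    # Variant 2: sum the energy (or gray) values of the first feature
    # points whose diffused ranges caught a target point; this sum is
    # compared against an absolute first threshold, not a percentage.
    return energy[inside].sum()
```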
In an embodiment of this application, steps S101 and S102 may also be included before step S110, as shown in FIG. 7:

S101: Take as key points the feature points in the spectrogram of the audio file to be recognized whose energy value or gray value is greater than a second threshold.

The second threshold may be a statistical result obtained by the user from comprehensive consideration of the relevant factors. The smaller the second threshold, the more key points can be extracted, which may lengthen the subsequent matching time; the larger the second threshold, the fewer key points can be extracted, which may make the success probability of subsequent matching too low.

S102: If the energy value or gray value of a key point is the maximum within a preset region, determine that key point as a first feature point.

The preset region may be a circular region centred on the key point with a preset radius, or a rectangular region centred on the key point with a preset length and width.

The preset region may be a statistical result obtained by the user from comprehensive consideration of the relevant factors. The smaller the preset region, the more first feature points can be determined, which may lengthen the subsequent matching time; the larger the preset region, the fewer first feature points can be determined, which may make the success probability of subsequent matching too low.

FIG. 8a is a schematic diagram of the determined first feature points on the spectrogram; the white points in the figure are the first feature points. Specifically, suppose the second threshold is 30 and the preset region is 15*15 (centred on a key point, 15 frames along the abscissa and a length of 15 along the ordinate). FIG. 8b is a partially enlarged view of FIG. 8a: the energy or gray values of the white points in the figure are greater than the second threshold of 30 and are still the maxima within their 15*15 preset regions, and such points are extracted as first feature points.

This embodiment differs from the previous one in that feature points with large energy or gray values in the spectrogram are extracted as the first feature points, which excludes the interference of weak feature points with subsequent matching and also greatly reduces the amount of data to be diffused, thereby improving system performance.
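The two pre-selection steps might look like this (numpy sketch; the handling of neighbourhoods at the spectrogram edges is our own choice):

```python
def key_points(gray, second_threshold=30, region=15):
    # S101: candidates must exceed the second threshold (30 in FIG. 8).
    # S102: a candidate survives only if it is the maximum of its
    # region x region (here 15x15) neighbourhood.
    half = region // 2
    points = []
    for r, c in zip(*np.nonzero(gray > second_threshold)):
        patch = gray[max(0, r - half):r + half + 1,
                     max(0, c - half):c + half + 1]
        if gray[r, c] >= patch.max():
            points.append((r, c))
    return points
```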
In an embodiment of this application, the target audio file may carry audio information. When this application is applied in a song recognition scenario, the audio information may include a song title. A user records an audio file to be recognized without knowing the song title, or the audio file to be recognized is itself a song whose title is unknown; once the audio file to be recognized is determined to be part of a target audio file, the song title of the audio file to be recognized can be identified.
FIG. 9 is a schematic block diagram of an audio recognition system provided in an embodiment of this application. In this embodiment, the audio recognition system includes:

a diffusion unit 210 configured to perform diffusion processing on first feature points in a spectrogram of an audio file to be recognized to obtain a feature point map, the number of the first feature points being more than one;

a search unit 220 configured to search the spectrogram of a target audio file for second feature points respectively corresponding to the diffused first feature points in the feature point map;

a determination unit 230 configured to determine, when second feature points respectively corresponding to the diffused first feature points in the feature point map are found in the spectrogram of the target audio file, that the audio file to be recognized is part of the target audio file.

Preferably, the system may further include, upstream of the diffusion unit 210:

a normalization unit configured to normalize the energy values of the first feature points of the spectrogram of the audio file to be recognized into gray values of the first feature points.

Preferably, the diffusion processing includes at least one of Gaussian-function diffusion processing and amplification processing.

Preferably, the normalization unit may specifically include:

a first normalization subunit configured to traverse the spectrogram frame by frame with a window of a first preset length;

a second normalization subunit configured to obtain the local maximum value and local minimum value among the energy values of the first feature points within the window;

a third normalization subunit configured to normalize the energy values of the first feature points into gray values of the first feature points according to the local maximum value and the local minimum value.

Preferably, the search unit 220 may specifically include:

a first search subunit configured to traverse the spectrogram of the target audio file frame by frame with the feature point map as a window;

a second search subunit configured to determine, in each traversal step, as second feature points those feature points in the window's portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffused first feature points in the window;

a third search subunit configured to check whether the window's portion of the target audio file's spectrogram contains second feature points respectively corresponding to each diffused first feature point.

Preferably, the second search subunit may specifically include:

a fourth search subunit configured to determine the matching degree between the first feature points and those feature points in the window's portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffused first feature points in the window;

a fifth search subunit configured to determine the feature points whose matching degree is greater than a first threshold as second feature points.

Preferably, the matching degree includes either the ratio of the number of feature points in the window's spectrogram that fall within the coordinate ranges of the diffused first feature points to the number of first feature points, or the sum of the energy values or gray values of the first feature points corresponding to those feature points.

Preferably, the system may further include, upstream of the diffusion processing:

a first processing unit configured to take as key points the feature points in the spectrogram of the audio file to be recognized whose energy value or gray value is greater than a second threshold;

a second processing unit configured to determine a key point as a first feature point when its energy value or gray value is the maximum within a preset region.

Preferably, the target audio file carries audio information, and the audio information includes a song title.
In the 1990s, an improvement to a technology could clearly be distinguished as a hardware improvement (for example, an improvement to circuit structures such as diodes, transistors, or switches) or a software improvement (an improvement to a method flow). With the development of technology, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore it cannot be said that an improvement to a method flow cannot be realized with a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system "onto" a single PLD by themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the source code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); currently the most commonly used are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art will also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by doing a little logic programming of the method flow in one of the above hardware description languages and programming it into an integrated circuit.

A controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included within it for realizing various functions may also be regarded as structures within the hardware component. Or, the means for realizing various functions may even be regarded as both software modules implementing the method and structures within the hardware component.

The systems, devices, modules, or units illustrated in the above embodiments may specifically be implemented by a computer chip or an entity, or by a product having a certain function.

For convenience of description, the above devices are described in terms of functions divided into separate units. Of course, when implementing this application, the functions of the units may be realized in one or more pieces of software and/or hardware.
Those skilled in the art will understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. The present invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory among computer-readable media, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.

Computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
Those skilled in the art will understand that embodiments of this application may be provided as a method, a system, or a computer program product. This application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. This application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices.

The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively simply because it is basically similar to the method embodiment; for relevant parts, refer to the description of the method embodiment.

The above descriptions are merely embodiments of this application and are not intended to limit it. Various modifications and variations of this application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall be included within the scope of the claims of this application.

Claims (18)

  1. An audio recognition method, characterized by comprising:
    performing diffusion processing on first feature points in a spectrogram of an audio file to be recognized to obtain a feature point map, the number of the first feature points being more than one;
    searching the spectrogram of a target audio file for second feature points respectively corresponding to the diffused first feature points in the feature point map;
    and if such points are found, determining that the audio file to be recognized is part of the target audio file.
  2. The method according to claim 1, characterized by further comprising, before performing diffusion processing on the first feature points in the spectrogram of the audio file to be recognized:
    normalizing the energy values of the first feature points in the spectrogram of the audio file to be recognized into gray values of the first feature points.
  3. The method according to claim 1 or 2, characterized in that the diffusion processing comprises at least one of Gaussian-function diffusion processing and amplification processing.
  4. The method according to claim 2, characterized in that normalizing the energy values of the first feature points in the spectrogram of the audio file to be recognized into gray values of the first feature points specifically comprises:
    traversing the spectrogram frame by frame with a window of a first preset length;
    obtaining the local maximum value and local minimum value among the energy values of the first feature points within the window;
    normalizing the energy values of the first feature points into gray values of the first feature points according to the local maximum value and the local minimum value.
  5. The method according to claim 1 or 2, characterized in that searching the spectrogram of the target audio file for second feature points respectively corresponding to the diffused first feature points in the feature point map specifically comprises:
    traversing the spectrogram of the target audio file frame by frame with the feature point map as a window;
    in each traversal step, determining as second feature points those feature points in the window's portion of the spectrogram of the target audio file whose coordinates fall within the coordinate ranges of the diffused first feature points in the window;
    checking whether the window's portion of the spectrogram of the target audio file contains second feature points respectively corresponding to each diffused first feature point.
  6. The method according to claim 5, characterized in that determining as second feature points those feature points in the window's portion of the spectrogram of the target audio file whose coordinates fall within the coordinate ranges of the diffused first feature points in the window comprises:
    determining the matching degree between the first feature points and those feature points in the window's portion of the spectrogram of the target audio file whose coordinates fall within the coordinate ranges of the diffused first feature points in the window;
    determining the feature points whose matching degree is greater than a first threshold as second feature points.
  7. The method according to claim 6, characterized in that the matching degree comprises the ratio of the number of feature points in the window's spectrogram that fall within the coordinate ranges of the diffused first feature points to the number of first feature points, or the sum of the energy values or gray values of the first feature points corresponding to the feature points in the window's spectrogram that fall within the coordinate ranges of the diffused first feature points.
  8. The method according to claim 1 or 2, characterized by further comprising, before performing diffusion processing on the first feature points of the spectrogram of the audio file to be recognized:
    taking as key points the feature points in the spectrogram of the audio file to be recognized whose energy value or gray value is greater than a second threshold;
    if the energy value or gray value of a key point is the maximum within a preset region, determining that key point as a first feature point.
  9. The method according to claim 1, characterized in that the target audio file carries audio information, the audio information comprising a song title.
  10. An audio recognition system, characterized by comprising:
    a diffusion unit configured to perform diffusion processing on first feature points in a spectrogram of an audio file to be recognized to obtain a feature point map, the number of the first feature points being more than one;
    a search unit configured to search the spectrogram of a target audio file for second feature points respectively corresponding to the diffused first feature points in the feature point map;
    a determination unit configured to determine, when second feature points respectively corresponding to the diffused first feature points in the feature point map are found in the spectrogram of the target audio file, that the audio file to be recognized is part of the target audio file.
  11. The system according to claim 10, characterized by further comprising, upstream of the diffusion unit:
    a normalization unit configured to normalize the energy values of the first feature points in the spectrogram of the audio file to be recognized into gray values of the first feature points.
  12. The system according to claim 10 or 11, characterized in that the diffusion processing comprises at least one of Gaussian-function diffusion processing and amplification processing.
  13. The system according to claim 11, characterized in that the normalization unit specifically comprises:
    a first normalization subunit configured to traverse the spectrogram frame by frame with a window of a first preset length;
    a second normalization subunit configured to obtain the local maximum value and local minimum value among the energy values of the first feature points within the window;
    a third normalization subunit configured to normalize the energy values of the first feature points into gray values of the first feature points according to the local maximum value and the local minimum value.
  14. The system according to claim 10 or 11, characterized in that the search unit specifically comprises:
    a first search subunit configured to traverse the spectrogram of the target audio file frame by frame with the feature point map as a window;
    a second search subunit configured to determine, in each traversal step, as second feature points those feature points in the window's portion of the spectrogram of the target audio file whose coordinates fall within the coordinate ranges of the diffused first feature points in the window;
    a third search subunit configured to check whether the window's portion of the spectrogram of the target audio file contains second feature points respectively corresponding to each diffused first feature point.
  15. The system according to claim 14, characterized in that the second search subunit specifically comprises:
    a fourth search subunit configured to determine the matching degree between the first feature points and those feature points in the window's portion of the spectrogram of the target audio file whose coordinates fall within the coordinate ranges of the diffused first feature points in the window;
    a fifth search subunit configured to determine the feature points whose matching degree is greater than a first threshold as second feature points.
  16. The system according to claim 15, characterized in that the matching degree comprises the ratio of the number of feature points in the window's spectrogram that fall within the coordinate ranges of the diffused first feature points to the number of first feature points, or the sum of the energy values or gray values of the first feature points corresponding to the feature points in the window's spectrogram that fall within the coordinate ranges of the diffused first feature points.
  17. The system according to claim 10 or 11, characterized by further comprising, upstream of the diffusion processing:
    a first processing unit configured to take as key points the feature points in the spectrogram of the audio file to be recognized whose energy value or gray value is greater than a second threshold;
    a second processing unit configured to determine a key point as a first feature point when its energy value or gray value is the maximum within a preset region.
  18. The system according to claim 10, characterized in that the target audio file carries audio information, the audio information comprising a song title.
PCT/CN2016/099053 2015-09-24 2016-09-14 Audio recognition method and system WO2017050175A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
KR1020187008373A KR102077411B1 (ko) 2015-09-24 2016-09-14 음성 인식 방법 및 시스템
SG11201801808RA SG11201801808RA (en) 2015-09-24 2016-09-14 Audio recognition method and system
JP2018515493A JP6585835B2 (ja) 2015-09-24 2016-09-14 オーディオ認識方法及びシステム
EP16848057.2A EP3355302B1 (en) 2015-09-24 2016-09-14 Audio recognition method and system
US15/897,431 US10679647B2 (en) 2015-09-24 2018-02-15 Audio recognition method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510618550.4 2015-09-24
CN201510618550.4A CN106558318B (zh) 2015-09-24 2015-09-24 Audio recognition method and system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/897,431 Continuation US10679647B2 (en) 2015-09-24 2018-02-15 Audio recognition method and system

Publications (1)

Publication Number Publication Date
WO2017050175A1 (zh) 2017-03-30

Family

ID=58385690

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/099053 WO2017050175A1 (zh) Audio recognition method and system

Country Status (7)

Country Link
US (1) US10679647B2 (zh)
EP (1) EP3355302B1 (zh)
JP (1) JP6585835B2 (zh)
KR (1) KR102077411B1 (zh)
CN (1) CN106558318B (zh)
SG (1) SG11201801808RA (zh)
WO (1) WO2017050175A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294947A * 2022-07-29 2022-11-04 腾讯科技(深圳)有限公司 Audio data processing method and apparatus, electronic device, and medium

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10397663B2 (en) * 2016-04-08 2019-08-27 Source Digital, Inc. Synchronizing ancillary data to content including audio
CN108364661B * 2017-12-15 2020-11-24 海尔优家智能科技(北京)有限公司 Visualized speech performance evaluation method and apparatus, computer device, and storage medium
CN108615006B * 2018-04-23 2020-04-17 百度在线网络技术(北京)有限公司 Method and apparatus for outputting information
CN109035419A (zh) * 2018-08-06 2018-12-18 深圳市果壳文化科技有限公司 Social interaction method and system based on AR technology
WO2020102979A1 (zh) * 2018-11-20 2020-05-28 深圳市欢太科技有限公司 Speech information processing method and apparatus, storage medium, and electronic device
KR20210037229A (ko) 2019-09-27 2021-04-06 주식회사 케이티 User terminal, server, and method for transmitting multimedia data over multiple channels
CN111444384B * 2020-03-31 2023-10-13 北京字节跳动网络技术有限公司 Audio key point determination method, apparatus, device, and storage medium
CN111640421B * 2020-05-13 2023-06-16 广州国音智能科技有限公司 Speech comparison method, apparatus, device, and computer-readable storage medium
CN112101301B * 2020-11-03 2021-02-26 武汉工程大学 Acoustic-stability early-warning method and apparatus for a screw-type water-cooled chiller unit, and storage medium
US11929078B2 * 2021-02-23 2024-03-12 Intuit, Inc. Method and system for user voice identification using ensembled deep learning algorithms
CN114255741B * 2022-02-28 2022-06-10 腾讯科技(深圳)有限公司 Repeated audio detection method, device, and storage medium
CN117789706B * 2024-02-27 2024-05-03 富迪科技(南京)有限公司 Audio information content recognition method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2969862B2 (ja) * 1989-10-04 1999-11-02 松下電器産業株式会社 音声認識装置
CN101720048A (zh) * 2009-12-04 2010-06-02 山东大学 基于音频特征的收视率调查系统及收视信息检索方法
CN103729368A (zh) * 2012-10-13 2014-04-16 复旦大学 一种基于局部频谱图像描述子的鲁棒音频识别方法
CN103971676A (zh) * 2014-04-23 2014-08-06 上海师范大学 一种快速语音孤立词识别算法及其用途、语音识别系统
CN103971689A (zh) * 2013-02-04 2014-08-06 腾讯科技(深圳)有限公司 一种音频识别方法及装置
CN103999473A (zh) * 2011-12-20 2014-08-20 雅虎公司 用于内容识别的音频指纹
CN104125509A (zh) * 2013-04-28 2014-10-29 腾讯科技(深圳)有限公司 节目识别方法、装置及服务器

Family Cites Families (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6990453B2 (en) 2000-07-31 2006-01-24 Landmark Digital Services Llc System and methods for recognizing sound and music signals in high noise and distortion
CA2483104C (en) * 2002-04-25 2011-06-21 Shazam Entertainment, Ltd. Robust and invariant audio pattern matching
US20070195963A1 (en) 2006-02-21 2007-08-23 Nokia Corporation Measuring ear biometrics for sound optimization
KR20090083098A (ko) 2008-01-29 2009-08-03 삼성전자주식회사 하모닉 특징을 이용한 음악 인식 방법 및 음악 인식을이용한 이동 로봇의 동작 생성 방법
US8706276B2 (en) * 2009-10-09 2014-04-22 The Trustees Of Columbia University In The City Of New York Systems, methods, and media for identifying matching audio
US8886531B2 (en) 2010-01-13 2014-11-11 Rovi Technologies Corporation Apparatus and method for generating an audio fingerprint and using a two-stage query
JP5728888B2 (ja) * 2010-10-29 2015-06-03 ソニー株式会社 信号処理装置および方法、並びにプログラム
US20120296458A1 (en) 2011-05-18 2012-11-22 Microsoft Corporation Background Audio Listening for Content Recognition
US9461759B2 (en) 2011-08-30 2016-10-04 Iheartmedia Management Services, Inc. Identification of changed broadcast media items
US8586847B2 (en) 2011-12-02 2013-11-19 The Echo Nest Corporation Musical fingerprinting based on onset intervals
US9292894B2 (en) 2012-03-14 2016-03-22 Digimarc Corporation Content recognition and synchronization using local caching
US9113203B2 (en) 2012-06-28 2015-08-18 Google Inc. Generating a sequence of audio fingerprints at a set top box
US9661361B2 (en) 2012-09-19 2017-05-23 Google Inc. Systems and methods for live media content matching
US8867028B2 (en) * 2012-10-19 2014-10-21 Interfiber Analysis, LLC System and/or method for measuring waveguide modes
US9373336B2 (en) 2013-02-04 2016-06-21 Tencent Technology (Shenzhen) Company Limited Method and device for audio recognition
US9269022B2 (en) * 2013-04-11 2016-02-23 Digimarc Corporation Methods for object recognition and related arrangements
GB2518663A (en) * 2013-09-27 2015-04-01 Nokia Corp Audio analysis apparatus
JP2015103088A (ja) * 2013-11-26 2015-06-04 キヤノン株式会社 画像処理装置、画像処理方法、及びプログラム
US10321842B2 (en) * 2014-04-22 2019-06-18 Interaxon Inc. System and method for associating music with brain-state data
US9894413B2 (en) 2014-06-12 2018-02-13 Google Llc Systems and methods for locally detecting consumed video content
US9805125B2 (en) 2014-06-20 2017-10-31 Google Inc. Displaying a summary of media content items
US9838759B2 (en) 2014-06-20 2017-12-05 Google Inc. Displaying information related to content playing on a device
US9946769B2 (en) 2014-06-20 2018-04-17 Google Llc Displaying information related to spoken dialogue in content playing on a device
US9905233B1 (en) 2014-08-07 2018-02-27 Digimarc Corporation Methods and apparatus for facilitating ambient content recognition using digital watermarks, and related arrangements
JP6464650B2 2014-10-03 2019-02-06 日本電気株式会社 Speech processing device, speech processing method, and program
US10750236B2 (en) 2015-04-23 2020-08-18 The Nielsen Company (Us), Llc Automatic content recognition with local matching
US9743138B2 (en) 2015-07-31 2017-08-22 Mutr Llc Method for sound recognition task trigger
US9913056B2 (en) 2015-08-06 2018-03-06 Dolby Laboratories Licensing Corporation System and method to enhance speakers connected to devices with microphones

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2969862B2 (ja) * 1989-10-04 1999-11-02 松下電器産業株式会社 音声認識装置
CN101720048A (zh) * 2009-12-04 2010-06-02 山东大学 基于音频特征的收视率调查系统及收视信息检索方法
CN103999473A (zh) * 2011-12-20 2014-08-20 雅虎公司 用于内容识别的音频指纹
CN103729368A (zh) * 2012-10-13 2014-04-16 复旦大学 一种基于局部频谱图像描述子的鲁棒音频识别方法
CN103971689A (zh) * 2013-02-04 2014-08-06 腾讯科技(深圳)有限公司 一种音频识别方法及装置
CN104125509A (zh) * 2013-04-28 2014-10-29 腾讯科技(深圳)有限公司 节目识别方法、装置及服务器
CN103971676A (zh) * 2014-04-23 2014-08-06 上海师范大学 一种快速语音孤立词识别算法及其用途、语音识别系统

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294947A * 2022-07-29 2022-11-04 腾讯科技(深圳)有限公司 Audio data processing method and apparatus, electronic device, and medium
CN115294947B * 2022-07-29 2024-06-11 腾讯科技(深圳)有限公司 Audio data processing method and apparatus, electronic device, and medium

Also Published As

Publication number Publication date
SG11201801808RA (en) 2018-04-27
CN106558318A (zh) 2017-04-05
US10679647B2 (en) 2020-06-09
US20180174599A1 (en) 2018-06-21
EP3355302A1 (en) 2018-08-01
CN106558318B (zh) 2020-04-28
KR102077411B1 (ko) 2020-02-13
KR20180044957A (ko) 2018-05-03
EP3355302A4 (en) 2019-06-05
JP6585835B2 (ja) 2019-10-02
JP2018534609A (ja) 2018-11-22
EP3355302B1 (en) 2022-02-09

Similar Documents

Publication Publication Date Title
WO2017050175A1 (zh) Audio recognition method and system
CN109065044B (zh) Wake-up word recognition method and apparatus, electronic device, and computer-readable storage medium
US20210149939A1 (en) Responding to remote media classification queries using classifier models and context parameters
US10535365B2 (en) Analog voice activity detection
WO2019148586A1 (zh) Speaker identification method and device for multi-speaker speech
Roma et al. Recurrence quantification analysis features for environmental sound recognition
WO2021114733A1 (zh) Noise suppression method based on sub-band processing, and system therefor
CN113327626B (zh) Speech noise reduction method, apparatus, device, and storage medium
KR102441063B1 (ko) Endpoint detection apparatus, system including the same, and method therefor
CN113421586B (zh) Sleep-talking recognition method, apparatus, and electronic device
CN103854661A (zh) Method and apparatus for extracting music features
CN108877779B (zh) Method and apparatus for detecting speech endpoints
US20160027438A1 (en) Concurrent Segmentation of Multiple Similar Vocalizations
CN104077336B (zh) Method and apparatus for retrieving audio file information by dragging an audio file
US9978392B2 (en) Noisy signal identification from non-stationary audio signals
TW201828285A (zh) Audio recognition method and system
WO2019174392A1 (zh) Vector processing for RPC information
Song et al. Feature extraction and classification for audio information in news video
CN112397073B (zh) Audio data processing method and apparatus
KR101925248B1 (ko) Method and apparatus for utilizing voice feature vectors for voice authentication optimization
US11790931B2 (en) Voice activity detection using zero crossing detection
CN111275095B (zh) Object type recognition method and apparatus
Ge et al. Design and Implementation of Intelligent Singer Recognition System
KR20140050951A (ko) Speech recognition system
Waghela et al. SUV detection algorithm for speech signals

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16848057

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 11201801808R

Country of ref document: SG

ENP Entry into the national phase

Ref document number: 20187008373

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2018515493

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE