WO2019232845A1 - Voice data processing method and apparatus, computer device, and storage medium


Info

Publication number
WO2019232845A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice data, frame, speech, short term
Application number
PCT/CN2018/094184
Other languages
French (fr)
Chinese (zh)
Inventor
涂宏 (TU Hong)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2019232845A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Abstract

A voice data processing method and apparatus, a computer device, and a storage medium. The voice data processing method comprises: obtaining original voice data (S10); performing framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected (S20); performing feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected (S30); recognizing the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value (S40); and, if the recognition probability value is greater than a preset probability value, using the voice data to be detected as target voice data (S50). The voice data processing method effectively eliminates the interference of noise and silence, thereby improving the accuracy of model recognition.

Description

Voice data processing method, apparatus, computer device and storage medium
This application is based on, and claims priority to, Chinese invention patent application No. 201810561725.6, filed on June 4, 2018 and entitled "Voice data processing method, apparatus, computer device and storage medium".
Technical Field
This application relates to the technical field of speech recognition, and in particular to a voice data processing method, apparatus, computer device and storage medium.
Background
Voice Activity Detection (VAD), also known as voice endpoint detection or voice boundary detection, identifies and removes long periods of silence from an audio signal stream, so that channel resources are saved without degrading service quality.
At present, training or running a speech recognition model requires relatively clean voice data, but the voice data actually available is often mixed with noise or silence. When such noisy voice data is used for training, the resulting speech recognition model has low accuracy, which hinders the wider adoption of speech recognition models.
Summary
In view of this, it is necessary to provide a voice data processing method, apparatus, computer device and storage medium that address the technical problem of the low accuracy of speech recognition models in the prior art.
A voice data processing method includes:
obtaining original voice data;
performing framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected;
performing feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected;
recognizing the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
if the recognition probability value is greater than a preset probability value, using the voice data to be detected as target voice data.
A voice data processing apparatus includes:
an original voice data obtaining module, configured to obtain original voice data;
a to-be-detected voice data obtaining module, configured to perform framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected;
a to-be-detected filter voice feature obtaining module, configured to perform feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected;
a recognition probability value obtaining module, configured to recognize the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
a target voice data obtaining module, configured to use the voice data to be detected as target voice data if the recognition probability value is greater than a preset probability value.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the following steps:
obtaining original voice data;
performing framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected;
performing feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected;
recognizing the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
if the recognition probability value is greater than a preset probability value, using the voice data to be detected as target voice data.
One or more non-volatile readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining original voice data;
performing framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected;
performing feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected;
recognizing the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
if the recognition probability value is greater than a preset probability value, using the voice data to be detected as target voice data.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features and advantages of this application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the following briefly introduces the drawings required in the description of the embodiments. Apparently, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a diagram of an application environment of a voice data processing method according to an embodiment of this application;
FIG. 2 is a flowchart of a voice data processing method according to an embodiment of this application;
FIG. 3 is a detailed flowchart of step S20 in FIG. 2;
FIG. 4 is a detailed flowchart of step S30 in FIG. 2;
FIG. 5 is another flowchart of a voice data processing method according to an embodiment of this application;
FIG. 6 is a detailed flowchart of step S63 in FIG. 5;
FIG. 7 is a schematic diagram of a voice data processing apparatus according to an embodiment of this application;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings in the embodiments of this application. Apparently, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The voice data processing method provided in this application can be applied in the application environment shown in FIG. 1, in which a computer device communicates with a server through a network. The computer device may be, but is not limited to, a personal computer, a laptop, a smartphone, a tablet, or a portable wearable device. The server may be implemented as a stand-alone server.
Specifically, the voice data processing method is applied on computer devices deployed by banks, securities firms, insurance companies, or other institutions, and is used to preprocess original voice data to obtain training data, so that the training data can be used to train a voiceprint model or another speech model and thereby improve model recognition accuracy.
In an embodiment, as shown in FIG. 2, a voice data processing method is provided. Taking the method applied to the server in FIG. 1 as an example, the method includes the following steps:
S10: Obtain original voice data.
The original voice data is speaker voice data recorded with a recording device and has not yet been processed. In this embodiment, the original voice data may be voice data in wav, mp3, or another format. The original voice data includes target voice data and interfering voice data. The target voice data refers to the portion of the original voice data in which the voiceprint changes continuously and distinctly, generally the speaker's speech. Correspondingly, the interfering voice data refers to the portion of the original voice data other than the target voice data, that is, everything other than the speaker's speech. Specifically, the interfering voice data includes silent segments and noise segments. A silent segment is a portion of the original voice data in which nothing is voiced, for example when the speaker pauses to think or breathe while speaking. A noise segment is a portion corresponding to environmental noise, such as the sound of doors and windows opening and closing or of objects colliding.
S20: Perform framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected.
The voice data to be detected is the original voice data from which the silent segments of the interfering voice data have been cut out by the VAD algorithm. The VAD (Voice Activity Detection) algorithm accurately locates the start and end of the target voice data in a noisy environment. The VAD algorithm can identify and remove long silent segments from the signal stream of the original voice data, thereby eliminating the silent-segment interference in the original voice data and improving the precision of voice data processing.
A frame is the smallest observation unit of voice data, and framing is the process of dividing voice data along its time axis. The original voice data is not stationary as a whole but can be regarded as stationary locally, so framing the original voice data yields relatively stationary single frames of voice data. Speech recognition and voiceprint recognition require a stationary signal as input, so the server first performs framing on the original voice data.
Segmentation is the process of cutting out the single-frame voice data that belongs to silent segments. In this embodiment, the VAD algorithm is used to segment the framed original voice data and remove the silent segments, so as to obtain at least two frames of voice data to be detected.
In an embodiment, as shown in FIG. 3, step S20 of performing framing and segmentation processing on the original voice data by using the VAD algorithm to obtain at least two frames of voice data to be detected specifically includes the following steps:
S21: Perform framing on the original voice data to obtain at least two single frames of voice data.
Framing collects N sampling points into one observation unit, called a frame. Typically N is 256 or 512, covering roughly 20-30 ms. To avoid excessive change between two adjacent frames, the adjacent frames share an overlapping region of M sampling points, where M is usually about 1/2 or 1/3 of N. This process is called framing. Specifically, after the original voice data is framed, at least two single frames of voice data are obtained, each containing N sampling points.
Further, in the single frames obtained by framing the original voice data, discontinuities appear at the beginning and end of each frame, and the more frames there are, the larger the error between the framed single-frame voice data and the original voice data. To make the framed single frames continuous, so that each frame exhibits the characteristics of a periodic function, each single frame of voice data is further subjected to windowing and pre-emphasis, yielding single-frame voice data of better quality.
Windowing multiplies each frame by a Hamming window. Because the amplitude-frequency characteristic of the Hamming window has large side-lobe attenuation, windowing the single-frame voice data increases the continuity between the left end and the right end of the frame; that is, windowing the framed single-frame voice data converts a non-stationary speech signal into a short-time stationary signal. Let the framed signal be S(n), n = 0, 1, ..., N-1, where N is the frame size, and let the Hamming window be W(n); then the windowed signal is S'(n) = S(n) × W(n), where

W(n) = (1 - a) - a × cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1,

N is the frame size, different values of a produce different Hamming windows, and a is generally taken as 0.46.
To increase the amplitude of the high-frequency components of the speech signal relative to the low-frequency components, and thus cancel the effects of glottal excitation and lip/nasal radiation, the single-frame voice data needs pre-emphasis, which helps improve the signal-to-noise ratio. The signal-to-noise ratio is the ratio of signal to noise in an electronic device or electronic system.

Pre-emphasis passes the windowed single-frame voice data through a high-pass filter H(Z) = 1 - μZ⁻¹, where the value of μ is between 0.9 and 1.0 and Z denotes the transform variable of the single-frame voice data. The goal of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flatter; the spectrum can then be computed with the same signal-to-noise ratio over the whole band from low to high frequencies, highlighting the high-frequency formants.
Understandably, preprocessing the original voice data by framing, windowing, and pre-emphasis gives the preprocessed single-frame voice data high resolution and good stationarity with little deviation from the original voice data, which improves the efficiency and quality of the subsequent segmentation that produces the at least two frames of voice data to be detected.
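To make the preceding steps concrete, the following is a minimal Python/NumPy sketch of the preprocessing pipeline, assuming a mono floating-point waveform. The frame size N, the Hamming coefficient a = 0.46, and the half-frame overlap follow the text; the pre-emphasis coefficient μ = 0.97 is an assumed value inside the 0.9-1.0 range stated above, and pre-emphasis is applied before windowing, the usual order in practice.

```python
import numpy as np

def preprocess(signal, N=256, hop=128, mu=0.97, a=0.46):
    """Pre-emphasis, framing, and Hamming windowing of a mono waveform."""
    # pre-emphasis: y[k] = x[k] - mu * x[k-1], i.e. H(Z) = 1 - mu * Z^-1
    y = np.append(signal[0], signal[1:] - mu * signal[:-1])
    # zero-pad so the last partial frame still has N samples
    num_frames = max(1, 1 + int(np.ceil((len(y) - N) / hop)))
    y = np.pad(y, (0, max(0, (num_frames - 1) * hop + N - len(y))))
    frames = np.stack([y[i * hop:i * hop + N] for i in range(num_frames)])
    # Hamming window: W(n) = (1 - a) - a * cos(2*pi*n / (N - 1))
    window = (1 - a) - a * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    return frames * window
```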
S22: Perform segmentation on the single-frame voice data by using the short-time energy calculation formula to obtain the short-time energy corresponding to each single frame of voice data, and retain the single frames whose short-time energy is greater than a first threshold as first voice data.
The short-time energy calculation formula is

E(n) = Σ_{m=0}^{N-1} x_n(m)²,

where N is the frame length of the single-frame voice data, x_n(m) is the n-th single frame of voice data, E(n) is the short-time energy, and m is the time index.
Short-time energy is the energy of one frame of the speech signal. The first threshold is a preset lower threshold. The first voice data is the single-frame voice data whose short-time energy is greater than the first threshold. The VAD algorithm can detect four parts in the single-frame voice data: the silent segment, the transition segment, the speech segment, and the ending segment. Specifically, the short-time energy calculation formula is applied to every single frame of voice data to obtain its short-time energy, and the single frames whose short-time energy is greater than the first threshold are retained as first voice data. In this embodiment, retaining the single frames whose short-time energy is greater than the first threshold marks the starting point and shows that the single-frame voice data after that point enters the transition segment; that is, the first voice data finally obtained comprises the transition segment, the speech segment, and the ending segment. Understandably, the first voice data obtained on the basis of short-time energy in step S22 results from cutting out the single frames whose short-time energy is not greater than the first threshold, that is, from removing the silent-segment interference in the single-frame voice data.
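A minimal sketch of this first pass, assuming `frames` is the (num_frames, N) array produced by the preprocessing sketch above; the threshold value t1 is a placeholder to be tuned.

```python
import numpy as np

def short_time_energy(frames):
    # E(n) = sum over m = 0..N-1 of x_n(m)^2, one scalar per frame
    return np.sum(frames ** 2, axis=1)

def keep_above_energy_threshold(frames, t1):
    # retain the single frames whose short-time energy exceeds the
    # (low) first threshold t1; the result is the "first voice data"
    return frames[short_time_energy(frames) > t1]
```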
S23: Perform segmentation on the first voice data by using the zero-crossing rate calculation formula to obtain the zero-crossing rate corresponding to the first voice data, and retain the first voice data whose zero-crossing rate is greater than a second threshold to obtain at least two frames of voice data to be detected.
The zero-crossing rate calculation formula is

Z_n = (1/2) Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|,

where sgn[] is the sign function,

sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0,

x_n(m) is the n-th frame of first voice data, Z_n is the zero-crossing rate, and m is the time index.
The second threshold is a preset higher threshold. Exceeding the first threshold does not necessarily mark the beginning of the speech segment; it may be caused by a short burst of noise. Therefore, the zero-crossing rate of each frame of first voice data (that is, the original voice data in and after the transition segment) is calculated. If the zero-crossing rate corresponding to a frame of first voice data is not greater than the second threshold, that frame is regarded as belonging to a silent segment and is cut out; that is, only the first voice data whose zero-crossing rate is greater than the second threshold is retained, so that at least two frames of voice data to be detected are obtained and the interfering voice data in the transition segment of the first voice data is further removed.
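Continuing the sketch, the zero-crossing rate and the combined double-threshold pass might look as follows; the thresholds t1 and t2 are again placeholders.

```python
import numpy as np

def zero_crossing_rate(frames):
    # Z_n = 1/2 * sum over m of |sgn(x_n(m)) - sgn(x_n(m-1))|,
    # with sgn(x) = 1 for x >= 0 and -1 for x < 0
    signs = np.where(frames >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

def double_threshold_vad(frames, t1, t2):
    # pass 1: keep frames whose short-time energy exceeds t1
    first = frames[np.sum(frames ** 2, axis=1) > t1]
    # pass 2: keep frames whose zero-crossing rate exceeds t2
    return first[zero_crossing_rate(first) > t2]
```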
In this embodiment, the short-time energy calculation formula is first applied to segment the original voice data: the corresponding short-time energy is computed, and the single frames whose short-time energy is greater than the first threshold are retained, marking the starting point and showing that the subsequent single-frame voice data enters the transition segment, which preliminarily cuts out the silent segments. Then the zero-crossing rate of each frame of first voice data (the original voice data in and after the transition segment) is computed, and the first voice data whose zero-crossing rate is not greater than the second threshold is cut out, leaving at least two frames of voice data to be detected whose zero-crossing rate is greater than the second threshold. In this embodiment, the VAD algorithm removes the interfering voice data corresponding to the silent segments in the first voice data by this double-threshold method, which is simple to implement and improves the efficiency of voice data processing.
S30: Perform feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected.
The filter voice features to be detected are the filter-bank features obtained by performing feature extraction on the voice data to be detected with the ASR voice feature extraction algorithm. Filter-bank (Fbank) features are speech features commonly used in speech recognition. The Mel-cepstral features in common use undergo dimensionality reduction during model training or recognition, which loses part of the information; to avoid this problem, this embodiment uses filter-bank features instead of the common Mel features, which helps improve the accuracy of subsequent model recognition. ASR (Automatic Speech Recognition) is a technology that converts human speech into text and generally comprises three parts: speech feature extraction, acoustic models with pattern matching, and language models with language processing. The ASR voice feature extraction algorithm is the algorithm in ASR technology that implements speech feature extraction.
Because an acoustic model or speech recognition model recognizes the speech features extracted from the voice data to be detected, rather than the voice data itself, feature extraction must be performed on the voice data to be detected first. In this embodiment, the ASR voice feature extraction algorithm is applied to each frame of the voice data to be detected to obtain the filter voice features to be detected, providing technical support for subsequent model recognition.
In an embodiment, as shown in FIG. 4, step S30 of performing feature extraction on the voice data to be detected by using the ASR voice feature extraction algorithm to obtain the filter voice features to be detected specifically includes the following steps:
S31: Perform a fast Fourier transform on each frame of the voice data to be detected to obtain the spectrum corresponding to each frame of the voice data to be detected.
The spectrum corresponding to the voice data to be detected is its energy spectrum in the frequency domain. Signal characteristics are usually hard to see from the time-domain form of a speech signal, so the signal is converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different speech. In this embodiment, a fast Fourier transform is performed on each frame of the voice data to be detected to obtain its spectrum, that is, the energy spectrum.
The fast Fourier transform (FFT) is the collective name for fast algorithms that compute the discrete Fourier transform (DFT). It converts a time-domain signal into a frequency-domain energy spectrum. Because the voice data to be detected is the signal left after preprocessing and voice activity detection of the original voice data, it is mainly a time-domain signal whose characteristics are hard to see directly; therefore, a fast Fourier transform is performed on each frame to obtain the energy distribution over the spectrum.
The formula of the fast Fourier transform is X_i(w) = FFT{x_i(k)}, where x_i(k) is the i-th frame of voice data to be detected in the time domain, X_i(w) is the speech signal spectrum corresponding to the i-th frame in the frequency domain, k is the time index, and w is the frequency in the speech signal spectrum. Specifically, the discrete Fourier transform is computed as

X_i(w) = Σ_{k=0}^{N-1} x_i(k) · W_N^{wk},

where the rotation (twiddle) factor is

W_N = e^{-j2π/N},

and N is the number of sampling points contained in each frame of the voice data to be detected. When the amount of data is large, the DFT has high algorithmic complexity, a heavy computational load, and long running time, so the fast Fourier transform is used instead to speed up the computation and save time. Specifically, the fast Fourier transform exploits the properties of the rotation factor W_N in the DFT formula, namely its periodicity, symmetry, and reducibility, and converts the above formula through butterfly operations to reduce the algorithmic complexity.
Specifically, the DFT of N sampling points is decomposed into butterfly operations, and the FFT consists of several iterated stages of butterflies. Suppose the number of sampling points in each frame of the voice data to be detected is 2^L (L being a positive integer); if there are fewer than 2^L points, zeros are padded until the frame contains 2^L points. The butterfly computation is then

X(k) = X'(k) + W_N^k · X''(k),
X(k + N/2) = X'(k) - W_N^k · X''(k), k = 0, 1, ..., N/2 - 1,

where X'(k) is the discrete Fourier transform of the even-indexed branch and X''(k) is the discrete Fourier transform of the odd-indexed branch. Converting the N-point DFT into even-indexed and odd-indexed discrete Fourier transforms through butterfly operations reduces the algorithmic complexity and achieves efficient computation.
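A minimal sketch of step S31, computing the per-frame energy spectrum with NumPy's FFT; zero-padding to the next power of two mirrors the 2^L assumption of the butterfly decomposition.

```python
import numpy as np

def frame_power_spectrum(frames):
    # zero-pad each frame to 2^L samples, then take |X_i(w)|^2;
    # np.fft.rfft returns the non-redundant half of the spectrum
    n = frames.shape[1]
    nfft = 1 << (n - 1).bit_length()
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2
```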
S32: Pass the spectrum through a Mel filter bank to obtain the filter voice features to be detected.
The Mel filter bank passes the energy spectrum output by the fast Fourier transform (that is, the spectrum of the voice data to be detected) through a set of Mel-scale triangular filters; a filter bank with M filters is defined, the filters used are triangular, and the center frequency of the m-th filter is f(m), m = 1, 2, ..., M. M is usually 22 to 26. The Mel filter bank smooths the spectrum and performs filtering cancellation, highlighting the formant characteristics of the speech and reducing the computational load. Then the logarithmic energy output by each triangular filter in the Mel filter bank is computed as

s(m) = ln( Σ_w |X_i(w)|² · H_m(w) ), 1 ≤ m ≤ M,

where M is the number of triangular filters, m denotes the m-th triangular filter, H_m(w) is the frequency response of the m-th triangular filter, X_i(w) is the speech signal spectrum corresponding to the i-th frame of voice data to be detected, and w is the frequency in the speech signal spectrum. This logarithmic energy constitutes the filter voice features to be detected.
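A sketch of step S32, assuming the power spectrum from the previous sketch. The filter-bank construction follows the common Mel-scale recipe, since the exact center frequencies f(m) are not specified in the text, and M = 26 is an assumed value inside the stated 22-26 range.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(num_filters, nfft, sample_rate):
    # M triangular filters with centers spaced evenly on the Mel scale
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = np.linspace(low, high, num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def fbank_features(power_spec, sample_rate, num_filters=26):
    # s(m) = ln(sum over w of |X_i(w)|^2 * H_m(w))
    nfft = (power_spec.shape[1] - 1) * 2
    H = mel_filter_bank(num_filters, nfft, sample_rate)
    return np.log(np.maximum(power_spec @ H.T, 1e-10))
```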
In this embodiment, a fast Fourier transform is first performed on each frame of the voice data to be detected to obtain the corresponding spectrum, reducing computational complexity, speeding up the computation, and saving time. Then the spectrum is passed through the Mel filter bank and the logarithmic energy output by each triangular filter in the Mel filter bank is computed to obtain the filter voice features to be detected, which cancels the filtering, highlights the formant characteristics of the speech, and reduces the computational load.
S40: Recognize the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value.
The ASR-LSTM speech recognition model is a pre-trained model for distinguishing speech from noise in the filter voice features to be detected. Specifically, it is obtained by training an LSTM (long short-term memory) neural network on training filter voice features extracted with the ASR voice feature extraction algorithm. The recognition probability value is the probability, produced when the ASR-LSTM speech recognition model recognizes the filter voice features to be detected, that the features are speech; it is a real number between 0 and 1. Specifically, the filter voice features corresponding to each frame of the voice data to be detected are input into the ASR-LSTM speech recognition model for recognition, yielding for each frame a recognition probability value, that is, the likelihood that the frame is speech.
S50: If the recognition probability value is greater than a preset probability value, use the voice data to be detected as target voice data.
Because the voice data to be detected is single-frame voice data from which the silent segments have been removed, silent-segment interference has already been excluded. Specifically, if the recognition probability value is greater than the preset probability value, the voice data to be detected is considered not to be a noise segment; that is, the voice data to be detected whose recognition probability value is greater than the preset probability value is determined to be target voice data. Understandably, by recognizing the voice data to be detected after silence removal, the server can exclude both the silent-segment and noise-segment interference from the target voice data, so that the target voice data can be used as training data for a voiceprint model or another speech model, improving the model's recognition accuracy. If the recognition probability value is not greater than the preset probability value, the frame of voice data to be detected is very likely noise and is excluded, avoiding the low recognition accuracy that would result from training a model on such data.
In this embodiment, the original voice data, which includes target voice data and interfering voice data, is obtained first, and the VAD algorithm is used to frame and segment it, preliminarily cutting out the silent-segment interference and laying the groundwork for obtaining relatively clean target voice data. The ASR voice feature extraction algorithm then extracts features from each frame of the voice data to be detected to obtain the filter voice features to be detected, which effectively avoids the information loss caused by dimensionality reduction during model training. If the recognition probability value is greater than the preset probability value, the voice data to be detected is taken as target voice data, so the target voice data obtained contains neither the removed silent segments nor the noise segments; that is, relatively clean target voice data is obtained, which helps subsequent training of a voiceprint model or another speech model on the target voice data and improves the model's recognition accuracy.
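Putting steps S20-S50 together, a hypothetical end-to-end selection might look like the sketch below; `model` stands for any trained classifier returning a per-frame speech probability (the ASR-LSTM model of step S40), and p0 is the preset probability value.

```python
import numpy as np

def select_target_voice_data(frames, features, model, p0=0.5):
    # features: (num_frames, M) filter voice features of the frames to detect;
    # model(features) is assumed to return speech probabilities in [0, 1]
    probs = np.asarray(model(features))
    return frames[probs > p0]   # frames kept as target voice data
```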
In an embodiment, the voice data processing method further includes pre-training the ASR-LSTM speech recognition model.
As shown in FIG. 5, pre-training the ASR-LSTM speech recognition model specifically includes the following steps:
S61: Obtain training voice data.
The training voice data is time-continuous voice data obtained from an open-source speech database and used for model training. The training voice data includes clean speech data and clean noise data, which are already labeled in the open-source speech database for model training. The ratio of clean speech data to clean noise data in the training voice data is 1:1; obtaining equal proportions of clean speech data and clean noise data effectively prevents the model training from overfitting, making the recognition of the model trained on the training voice data more accurate. In this embodiment, after the server obtains the training voice data, it also frames the training voice data into at least two frames so that features can subsequently be extracted from each frame of training voice data.
S62: Perform feature extraction on the training voice data by using the ASR voice feature extraction algorithm to obtain training filter voice features.
Because acoustic model training operates on the speech features extracted from the training voice data rather than directly on the training voice data, feature extraction must be performed on the training voice data first. Understandably, because the training voice data is sequential in time, the training filter voice features extracted from each frame carry that temporal ordering. Specifically, the server applies the ASR voice feature extraction algorithm to each frame of the training voice data to obtain training filter voice features that carry the timing state, providing technical support for subsequent model training. In this embodiment, the feature extraction steps applied to the training voice data with the ASR voice feature extraction algorithm are the same as those in step S30 and, to avoid repetition, are not repeated here.
S63: Input the training filter voice features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
The long short-term memory (LSTM) model is a recurrent neural network model suited to processing and predicting important events in time series with relatively long intervals and delays. The LSTM model has temporal memory and is therefore used to process the training filter voice features that carry the timing state. It is one of the neural network models with long-term memory capability and has a three-layer structure of input layer, hidden layer, and output layer. The input layer, the first layer of the LSTM model, receives external signals, that is, it receives the training filter voice features. The output layer, the last layer, outputs signals to the outside, that is, it outputs the computation results of the LSTM model. The hidden layers are the layers of the LSTM model between the input and output layers; they are trained on the filter voice features so that the parameters of each hidden layer are adjusted to obtain the ASR-LSTM speech recognition model. Understandably, training with the LSTM model exploits the temporal ordering of the filter voice features and thereby improves the accuracy of the ASR-LSTM speech recognition model. In this embodiment, the output layer of the LSTM model uses Softmax (a regression model) for regression processing to classify the output weight matrix. Softmax is a classification function commonly used in neural networks; it maps the outputs of multiple neurons into the interval [0, 1], which can be understood as probabilities and is simple and convenient to compute, so it is used for multi-class output, making the output more accurate.
In this embodiment, equal proportions of speech data and noise data are first obtained from an open-source speech database to prevent the model training from overfitting, so that the speech recognition model obtained by training on the training voice data is more accurate. Then, the ASR voice feature extraction algorithm is applied to each frame of the training voice data to obtain the training filter voice features. Finally, a long short-term memory neural network model with temporal memory capability is trained on the training filter voice features to obtain the trained ASR-LSTM speech recognition model, which therefore attains a high recognition accuracy.
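For illustration, a high-level sketch of such a speech/noise classifier in tf.keras is shown below; the layer width, optimizer, and input shape are illustrative assumptions rather than values from the text, which only fixes the LSTM hidden layer and the Softmax output.

```python
import tensorflow as tf

def build_asr_lstm(timesteps, num_features):
    # LSTM hidden layer followed by a Softmax output over {speech, noise}
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(timesteps, num_features)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```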
In an embodiment, as shown in FIG. 6, step S63 of inputting the training filter voice features into the long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model specifically includes the following steps:
S631: In the hidden layer of the long short-term memory neural network model, compute on the training filter voice features with a first activation function to obtain the neurons carrying an activation state identifier.
Each neuron in the hidden layer of the long short-term memory neural network model includes three gates: an input gate, a forget gate, and an output gate. The forget gate decides which past information is discarded in the neuron. The input gate decides which information is added to the neuron. The output gate decides which information the neuron outputs. The first activation function is a function for activating the neuron state, and the neuron state determines the information discarded, added, and output by each gate (that is, the input gate, forget gate, and output gate). The activation state identifier is either a pass identifier or a fail identifier. In this embodiment, the identifiers corresponding to the input gate, forget gate, and output gate are i, f, and o, respectively.
In this embodiment, the Sigmoid (S-shaped growth curve) function is specifically selected as the first activation function. The Sigmoid function is an S-shaped function common in biology; in information science, because it is monotonically increasing and has a monotonically increasing inverse, it is often used as the threshold function of neural networks and maps a variable into the interval 0-1. The first activation function is computed as

σ(z) = 1 / (1 + e^(-z)),

where z denotes the output value of the forget gate.
Specifically, the activation state of each neuron (training filter voice feature) is computed to obtain the neurons whose activation state identifier is the pass identifier. In this embodiment, the forget gate formula f_t = σ(z) = σ(W_f · [h_{t-1}, x_t] + b_f) is used to compute which information the forget gate receives (that is, only the neurons carrying the pass identifier are received), where f_t is the forget threshold (that is, the activation state), W_f is the weight matrix of the forget gate, b_f is the bias term of the forget gate, h_{t-1} is the output of the neuron at the previous time step, x_t is the input data at the current time step (the training filter voice features), t denotes the current time step, and t-1 denotes the previous time step. The forget gate also includes a forget threshold: applying the forget gate formula to the training filter voice features yields a scalar in the interval 0-1 (the forget threshold), which determines the proportion of past information the neuron receives based on a joint judgment of the current and past states, achieving dimensionality reduction of the data, reducing the computational load, and improving training efficiency.
S632: In the hidden layer of the long short-term memory neural network model, compute on the neurons carrying the activation state identifier with a second activation function to obtain the output values of the hidden layer of the long short-term memory neural network model.
The output values of the hidden layer of the long short-term memory neural network model include the output value of the input gate, the output value of the output gate, and the neuron state. Specifically, in the input gate of the hidden layer of the long short-term memory neural network model, the second activation function is applied to the neurons carrying the pass activation state identifier to obtain the output values of the hidden layer. In this embodiment, because a linear model is not expressive enough, the tanh (hyperbolic tangent) function is used as the activation function of the input gate (that is, the second activation function); it adds the non-linear factors that let the trained ASR-LSTM speech recognition model solve more complex problems. Moreover, the tanh activation function converges quickly, which saves training time and increases training efficiency.
Specifically, the output value of the input gate is computed by the input gate formula i_t = σ(W_i · [h_{t-1}, x_t] + b_i), where W_i is the weight matrix of the input gate, i_t is the input threshold, and b_i is the bias term of the input gate. Applying the input gate formula to the training filter voice features yields a scalar in the interval 0-1 (the input threshold), which controls the proportion of current information the neuron receives, that is, the proportion of newly input information accepted, based on a joint judgment of the current and past states, reducing the computational load and improving training efficiency.
Then the neuron state formulas

C'_t = tanh(W_c · [h_{t-1}, x_t] + b_c)

and

C_t = f_t * C_{t-1} + i_t * C'_t

are used to compute the current neuron state, where W_c is the weight matrix of the neuron state, b_c is the bias term of the neuron state, C_{t-1} is the neuron state at the previous time step, C_t is the neuron state at the current time step, and C'_t is the candidate state. The element-wise product of the neuron state with the forget threshold (and of the candidate state with the input threshold) lets the model output only the required information, improving the efficiency of model learning.
Finally, the output gate formula o_t = σ(W_o · [h_{t-1}, x_t] + b_o) is used to compute which information the output gate outputs, and then the formula h_t = o_t * tanh(C_t) is used to compute the output value of the neuron at the current time step, where o_t is the output threshold, W_o is the weight matrix of the output gate, b_o is the bias term of the output gate, and h_t is the output value of the neuron at the current time step.
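The gate equations of steps S631 and S632 can be collected into one forward step; a minimal NumPy sketch, with the weight matrices acting on the concatenation [h_{t-1}, x_t], is given below.

```python
import numpy as np

def sigmoid(z):
    # first activation function: sigma(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One forward step of the LSTM cell described in S631-S632."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_bar = np.tanh(W_c @ z + b_c)           # candidate state C'_t
    c_t = f_t * c_prev + i_t * c_bar         # C_t = f_t * C_{t-1} + i_t * C'_t
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # h_t = o_t * tanh(C_t)
    return h_t, c_t
```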
S633: Perform error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer, to obtain the trained ASR-LSTM speech recognition model.
First, the error terms at any time t are computed by back-propagation through time: the error term of the output gate $\delta_{o,t}$, the error term of the input gate $\delta_{i,t}$, the error term of the forget gate $\delta_{f,t}$, and the error term of the neuron state $\delta_{\tilde{c},t}$. With $\delta_t$ denoting the error propagated to the hidden-layer output $h_t$ and $\circ$ denoting element-wise multiplication, these take the standard form $\delta_{o,t} = \delta_t \circ \tanh(C_t) \circ o_t \circ (1-o_t)$, $\delta_{i,t} = \delta_t \circ o_t \circ (1-\tanh^2(C_t)) \circ \tilde{C}_t \circ i_t \circ (1-i_t)$, $\delta_{f,t} = \delta_t \circ o_t \circ (1-\tanh^2(C_t)) \circ C_{t-1} \circ f_t \circ (1-f_t)$, and $\delta_{\tilde{c},t} = \delta_t \circ o_t \circ (1-\tanh^2(C_t)) \circ i_t \circ (1-\tilde{C}_t^2)$.
Then, error back-propagation updates are performed according to the weight-update formula $\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \delta_{a,t} \cdot (b_h^{t-1})^{\top}$, where T denotes the time step, W denotes a weight matrix such as $W_i$, $W_c$, $W_o$, or $W_f$, the subscript a indexes the corresponding gate output such as $i_t$, $f_t$, $o_t$, or $\tilde{C}_t$, $\delta$ denotes the error term, $C_{t-1}$ is the state data of the neuron at the previous moment, and $b_h^{t-1}$ is the output value of the hidden layer at the previous moment. The biases are updated according to the bias-update formula $\frac{\partial E}{\partial b} = \sum_{t=1}^{T} \delta_{a,t}$, where b is the bias term of each gate and $\delta_{a,t}$ is the error of each gate at time t.
Finally, the updated weights are obtained by evaluating the weight-update formula, and the biases are updated according to the bias-update formula; applying the updated weights and biases of each layer to the long short-term memory neural network model yields the trained ASR-LSTM speech recognition model. Further, the weights of the ASR-LSTM speech recognition model implement its ability to decide which old information to discard, which new information to add, and which information to output. The output layer of the ASR-LSTM speech recognition model ultimately outputs a probability value, which represents the probability that the training speech data is determined to be speech after recognition by the model. This can be widely applied to speech data processing to accurately identify the training filter speech features.
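As a concrete illustration of the update step, the sketch below accumulates the weight and bias gradients of one gate over T time steps from its per-step error terms, in the spirit of the formulas above; the caching of forward activations and all names are assumptions made for the example.

```python
import numpy as np

def gate_gradients(deltas, h_hist, x_hist):
    """Accumulate dE/dW and dE/db for one gate over T time steps.

    deltas[t] is the gate's error term delta_{a,t}; h_hist[t] is the hidden
    output h_{t-1} and x_hist[t] the input x_t cached from the forward pass.
    The h part of the concatenation corresponds to the document's b_h^{t-1}.
    """
    T = len(deltas)
    z0 = np.concatenate([h_hist[0], x_hist[0]])
    dW = np.zeros((deltas[0].shape[0], z0.shape[0]))
    db = np.zeros_like(deltas[0])
    for t in range(T):
        z = np.concatenate([h_hist[t], x_hist[t]])   # [h_{t-1}, x_t]
        dW += np.outer(deltas[t], z)                 # sum_t delta_{a,t} (b_h^{t-1})^T
        db += deltas[t]                              # sum_t delta_{a,t}
    return dW, db

# A plain gradient step would then apply: W -= lr * dW; b -= lr * db
```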
In this embodiment, the first activation function is applied to the training filter speech features in the hidden layer of the long short-term memory neural network model to obtain the neurons carrying the activation-state identifier, which reduces the dimensionality of the data, lowers the amount of computation, and improves training efficiency. The second activation function is then applied in the hidden layer to the neurons carrying the activation-state identifier to obtain the output values of the hidden layer, so that error back-propagation updates can be performed on the model based on those output values to obtain the updated weights and biases. Applying the updated weights and biases to the long short-term memory neural network model yields the ASR-LSTM speech recognition model, which can be widely applied to speech data processing to accurately identify the training filter speech features.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of this application in any way.
In one embodiment, a voice data processing device is provided, which corresponds one-to-one to the voice data processing method in the above embodiments. As shown in FIG. 7, the voice data processing device includes an original voice data acquisition module 10, a voice-data-to-be-tested acquisition module 20, a filter-speech-feature-to-be-tested acquisition module 30, a recognition probability value acquisition module 40, and a target voice data acquisition module 50. Each functional module is described in detail as follows:
The original voice data acquisition module 10 is configured to acquire original voice data.

The voice-data-to-be-tested acquisition module 20 is configured to frame and segment the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested.

The filter-speech-feature-to-be-tested acquisition module 30 is configured to perform feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested.

The recognition probability value acquisition module 40 is configured to recognize the filter speech features to be tested by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value.

The target voice data acquisition module 50 is configured to use the voice data to be tested as target voice data if the recognition probability value is greater than a preset probability value.
Specifically, the voice-data-to-be-tested acquisition module 20 includes a single-frame voice data acquisition unit 21, a first voice data acquisition unit 22, and a voice-data-to-be-tested acquisition unit 23.

The single-frame voice data acquisition unit 21 is configured to frame the original voice data to obtain at least two frames of single-frame voice data.

The first voice data acquisition unit 22 is configured to segment the single-frame voice data by using the short-time energy calculation formula to obtain the corresponding short-time energy, and to retain the single-frame voice data whose short-time energy is greater than a first threshold as first voice data.

The voice-data-to-be-tested acquisition unit 23 is configured to segment the first voice data by using the zero-crossing rate calculation formula to obtain the corresponding zero-crossing rate, and to retain the first voice data whose zero-crossing rate is greater than a second threshold, obtaining at least two frames of voice data to be tested.
Specifically, the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index.

The zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
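For illustration, the two formulas translate directly into code. The sketch below frames a signal, then applies the short-time energy threshold followed by the zero-crossing-rate threshold; the frame length and threshold values are placeholders, not values taken from the disclosure.

```python
import numpy as np

def vad_segment(signal, frame_len=256, energy_thresh=1e-3, zcr_thresh=10.0):
    """Frame the signal, then keep frames passing both thresholds."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)

    kept = []
    for x in frames:
        energy = np.sum(x ** 2)                       # E(n), short-time energy
        if energy <= energy_thresh:                   # first threshold
            continue
        # Z_n, zero-crossing rate over the frame
        zcr = 0.5 * np.sum(np.abs(np.sign(x[1:]) - np.sign(x[:-1])))
        if zcr > zcr_thresh:                          # second threshold
            kept.append(x)
    return kept
```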
Specifically, the filter-speech-feature-to-be-tested acquisition module 30 includes a spectrum acquisition unit 31 and a filter-speech-feature-to-be-tested acquisition unit 32.

The spectrum acquisition unit 31 is configured to perform a fast Fourier transform on each frame of the voice data to be tested to obtain the spectrum corresponding to the voice data to be tested.

The filter-speech-feature-to-be-tested acquisition unit 32 is configured to pass the spectrum through a Mel filter bank to obtain the filter speech features to be tested.
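For illustration, the two units can be sketched as follows, assuming a triangular Mel filter bank; the sampling rate, the number of filters, and the bin mapping are assumptions made for the example, not values from the disclosure.

```python
import numpy as np

def mel_filter_features(frame, sample_rate=16000, n_filters=26):
    """FFT magnitude spectrum passed through a triangular Mel filter bank."""
    spectrum = np.abs(np.fft.rfft(frame))             # spectrum of one frame
    n_fft = 2 * (spectrum.shape[0] - 1)

    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter edges equally spaced on the Mel scale, mapped back to FFT bins
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor(n_fft * mel_to_hz(mel_pts) / sample_rate).astype(int)

    feats = np.zeros(n_filters)
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, right):
            if k < centre:                            # rising slope of the triangle
                w = (k - left) / max(centre - left, 1)
            else:                                     # falling slope of the triangle
                w = (right - k) / max(right - centre, 1)
            feats[i] += w * spectrum[k]
    return np.log(feats + 1e-10)                      # log filter-bank energies
```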
Specifically, the voice data processing device further includes an ASR-LSTM speech recognition model training module 60, configured to pre-train the ASR-LSTM speech recognition model.

The ASR-LSTM speech recognition model training module 60 includes a training voice data acquisition unit 61, a training filter speech feature acquisition unit 62, and an ASR-LSTM speech recognition model acquisition unit 63.

The training voice data acquisition unit 61 is configured to acquire training voice data.

The training filter speech feature acquisition unit 62 is configured to perform feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features.

The ASR-LSTM speech recognition model acquisition unit 63 is configured to input the training filter speech features into the long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
Specifically, the ASR-LSTM speech recognition model acquisition unit 63 includes an activation-state neuron acquisition subunit 631, a model output value acquisition subunit 632, and an ASR-LSTM speech recognition model acquisition subunit 633.

The activation-state neuron acquisition subunit 631 is configured to apply the first activation function to the training filter speech features in the hidden layer of the long short-term memory neural network model to obtain the neurons carrying the activation-state identifier.

The model output value acquisition subunit 632 is configured to apply the second activation function to the neurons carrying the activation-state identifier in the hidden layer of the long short-term memory neural network model to obtain the output values of the hidden layer.

The ASR-LSTM speech recognition model acquisition subunit 633 is configured to perform error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the trained ASR-LSTM speech recognition model.
For the specific limitations of the voice data processing device, reference may be made to the limitations of the voice data processing method above, which are not repeated here. Each module in the above voice data processing device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device stores data generated or acquired during execution of the voice data processing method, such as the target voice data. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a voice data processing method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: acquiring original voice data; framing and segmenting the original voice data by using the VAD algorithm to obtain at least two frames of voice data to be tested; performing feature extraction on each frame of the voice data to be tested by using the ASR speech feature extraction algorithm to obtain filter speech features to be tested; recognizing the filter speech features to be tested by using the trained ASR-LSTM speech recognition model to obtain a recognition probability value; and, if the recognition probability value is greater than a preset probability value, using the voice data to be tested as the target voice data.
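Read together, these steps form a single pipeline. The sketch below chains the helper functions from the earlier examples in this section; `model.predict` and the preset probability value are hypothetical stand-ins for the trained ASR-LSTM model, used only for illustration.

```python
def process_voice(signal, model, preset_prob=0.5):
    """Hypothetical end-to-end pipeline following the five steps above."""
    targets = []
    for frame in vad_segment(signal):              # S10/S20: framing + VAD segmentation
        feats = mel_filter_features(frame)         # S30: ASR feature extraction
        prob = model.predict(feats)                # S40: trained ASR-LSTM recognition
        if prob > preset_prob:                     # S50: compare with preset probability
            targets.append(frame)                  # keep as target voice data
    return targets
```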
In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: framing the original voice data to obtain at least two frames of single-frame voice data; segmenting the single-frame voice data by using the short-time energy calculation formula to obtain the corresponding short-time energy, and retaining the single-frame voice data whose short-time energy is greater than the first threshold as the first voice data; and segmenting the first voice data by using the zero-crossing rate calculation formula to obtain the corresponding zero-crossing rate, and retaining the first voice data whose zero-crossing rate is greater than the second threshold to obtain at least two frames of voice data to be tested.
Specifically, the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index; the zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: performing a fast Fourier transform on each frame of the voice data to be tested to obtain the spectrum corresponding to the voice data to be tested; and passing the spectrum through the Mel filter bank to obtain the filter speech features to be tested.

In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: acquiring training voice data; performing feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features; and inputting the training filter speech features into the long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.

In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: applying the first activation function to the training filter speech features in the hidden layer of the long short-term memory neural network model to obtain the neurons carrying the activation-state identifier; applying the second activation function to the neurons carrying the activation-state identifier in the hidden layer to obtain the output values of the hidden layer; and performing error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the ASR-LSTM speech recognition model.
In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps: acquiring original voice data; framing and segmenting the original voice data by using the VAD algorithm to obtain at least two frames of voice data to be tested; performing feature extraction on each frame of the voice data to be tested by using the ASR speech feature extraction algorithm to obtain filter speech features to be tested; recognizing the filter speech features to be tested by using the trained ASR-LSTM speech recognition model to obtain a recognition probability value; and, if the recognition probability value is greater than a preset probability value, using the voice data to be tested as the target voice data.
In one embodiment, when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: framing the original voice data to obtain at least two frames of single-frame voice data; segmenting the single-frame voice data by using the short-time energy calculation formula to obtain the corresponding short-time energy, and retaining the single-frame voice data whose short-time energy is greater than the first threshold as the first voice data; and segmenting the first voice data by using the zero-crossing rate calculation formula to obtain the corresponding zero-crossing rate, and retaining the first voice data whose zero-crossing rate is greater than the second threshold to obtain at least two frames of voice data to be tested.
Specifically, the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index; the zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
In one embodiment, when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: performing a fast Fourier transform on each frame of the voice data to be tested to obtain the spectrum corresponding to the voice data to be tested; and passing the spectrum through the Mel filter bank to obtain the filter speech features to be tested.

In one embodiment, when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: acquiring training voice data; performing feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features; and inputting the training filter speech features into the long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.

In one embodiment, when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: applying the first activation function to the training filter speech features in the hidden layer of the long short-term memory neural network model to obtain the neurons carrying the activation-state identifier; applying the second activation function to the neurons carrying the activation-state identifier in the hidden layer to obtain the output values of the hidden layer; and performing error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the trained ASR-LSTM speech recognition model.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units and modules as required; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or equivalently replace some of the technical features; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of this application, and shall all be included within the protection scope of this application.

Claims (20)

  1. A voice data processing method, comprising:
    acquiring original voice data;
    framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested;
    performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested;
    recognizing the filter speech features to be tested by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
    if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
  2. The voice data processing method according to claim 1, wherein the framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested comprises:
    framing the original voice data to obtain at least two frames of single-frame voice data;
    segmenting the single-frame voice data by using a short-time energy calculation formula to obtain corresponding short-time energy, and retaining the single-frame voice data whose short-time energy is greater than a first threshold as first voice data; and
    segmenting the first voice data by using a zero-crossing rate calculation formula to obtain a corresponding zero-crossing rate, and retaining the first voice data whose zero-crossing rate is greater than a second threshold to obtain at least two frames of the voice data to be tested.
  3. The voice data processing method according to claim 2, wherein the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index; and
    the zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
  4. The voice data processing method according to claim 1, wherein the performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested comprises:
    performing a fast Fourier transform on each frame of the voice data to be tested to obtain a spectrum corresponding to the voice data to be tested; and
    passing the spectrum through a Mel filter bank to obtain the filter speech features to be tested.
  5. The voice data processing method according to claim 1, further comprising: pre-training the ASR-LSTM speech recognition model;
    wherein the pre-training the ASR-LSTM speech recognition model comprises:
    acquiring training voice data;
    performing feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features; and
    inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
  6. The voice data processing method according to claim 5, wherein the inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model comprises:
    applying a first activation function to the training filter speech features in a hidden layer of the long short-term memory neural network model to obtain neurons carrying an activation-state identifier;
    applying a second activation function to the neurons carrying the activation-state identifier in the hidden layer of the long short-term memory neural network model to obtain output values of the hidden layer; and
    performing error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the trained ASR-LSTM speech recognition model.
  7. A voice data processing device, comprising:
    an original voice data acquisition module, configured to acquire original voice data;
    a voice-data-to-be-tested acquisition module, configured to frame and segment the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested;
    a filter-speech-feature-to-be-tested acquisition module, configured to perform feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested;
    a recognition probability value acquisition module, configured to recognize the filter speech features to be tested by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
    a target voice data acquisition module, configured to use the voice data to be tested as target voice data if the recognition probability value is greater than a preset probability value.
  8. The voice data processing device according to claim 7, wherein the voice-data-to-be-tested acquisition module comprises:
    a single-frame voice data acquisition unit, configured to frame the original voice data to obtain at least two frames of single-frame voice data;
    a first voice data acquisition unit, configured to segment the single-frame voice data by using a short-time energy calculation formula to obtain corresponding short-time energy, and to retain the original voice data whose short-time energy is greater than a first threshold as first voice data; and
    a voice-data-to-be-tested acquisition unit, configured to segment the first voice data by using a zero-crossing rate calculation formula to obtain a corresponding zero-crossing rate, and to retain the original voice data whose zero-crossing rate is greater than a second threshold, obtaining at least two frames of the voice data to be tested.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    acquiring original voice data;
    framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested;
    performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested;
    recognizing the filter speech features to be tested by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
    if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
  10. The computer device according to claim 9, wherein the framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested comprises:
    framing the original voice data to obtain at least two frames of single-frame voice data;
    segmenting the single-frame voice data by using a short-time energy calculation formula to obtain corresponding short-time energy, and retaining the single-frame voice data whose short-time energy is greater than a first threshold as first voice data; and
    segmenting the first voice data by using a zero-crossing rate calculation formula to obtain a corresponding zero-crossing rate, and retaining the first voice data whose zero-crossing rate is greater than a second threshold to obtain at least two frames of the voice data to be tested.
  11. The computer device according to claim 10, wherein the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index; and
    the zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
  12. The computer device according to claim 9, wherein the performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested comprises:
    performing a fast Fourier transform on each frame of the voice data to be tested to obtain a spectrum corresponding to the voice data to be tested; and
    passing the spectrum through a Mel filter bank to obtain the filter speech features to be tested.
  13. The computer device according to claim 9, wherein the processor further implements the following step when executing the computer-readable instructions: pre-training the ASR-LSTM speech recognition model;
    wherein the pre-training the ASR-LSTM speech recognition model comprises:
    acquiring training voice data;
    performing feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features; and
    inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
  14. The computer device according to claim 13, wherein the inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model comprises:
    applying a first activation function to the training filter speech features in a hidden layer of the long short-term memory neural network model to obtain neurons carrying an activation-state identifier;
    applying a second activation function to the neurons carrying the activation-state identifier in the hidden layer of the long short-term memory neural network model to obtain output values of the hidden layer; and
    performing error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the trained ASR-LSTM speech recognition model.
  15. One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring original voice data;
    framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested;
    performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested;
    recognizing the filter speech features to be tested by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
    if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
  16. The non-volatile readable storage medium according to claim 15, wherein the framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested comprises:
    framing the original voice data to obtain at least two frames of single-frame voice data;
    segmenting the single-frame voice data by using a short-time energy calculation formula to obtain corresponding short-time energy, and retaining the single-frame voice data whose short-time energy is greater than a first threshold as first voice data; and
    segmenting the first voice data by using a zero-crossing rate calculation formula to obtain a corresponding zero-crossing rate, and retaining the first voice data whose zero-crossing rate is greater than a second threshold to obtain at least two frames of the voice data to be tested.
  17. The non-volatile readable storage medium according to claim 16, wherein the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index; and
    the zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
  18. The non-volatile readable storage medium according to claim 15, wherein the performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested comprises:
    performing a fast Fourier transform on each frame of the voice data to be tested to obtain a spectrum corresponding to the voice data to be tested; and
    passing the spectrum through a Mel filter bank to obtain the filter speech features to be tested.
  19. The non-volatile readable storage medium according to claim 15, wherein the computer-readable instructions, when executed by the one or more processors, further cause the one or more processors to perform the following step: pre-training the ASR-LSTM speech recognition model;
    wherein the pre-training the ASR-LSTM speech recognition model comprises:
    acquiring training voice data;
    performing feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features; and
    inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
  20. The non-volatile readable storage medium according to claim 15, wherein the inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model comprises:
    applying a first activation function to the training filter speech features in a hidden layer of the long short-term memory neural network model to obtain neurons carrying an activation-state identifier;
    applying a second activation function to the neurons carrying the activation-state identifier in the hidden layer of the long short-term memory neural network model to obtain output values of the hidden layer; and
    performing error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the trained ASR-LSTM speech recognition model.
Kind code of ref document: A1