WO2020015546A1 - Far-field speech recognition method, speech recognition model training method, and server - Google Patents

Far-field speech recognition method, speech recognition model training method, and server Download PDF

Info

Publication number
WO2020015546A1
WO2020015546A1 PCT/CN2019/095075 CN2019095075W WO2020015546A1 WO 2020015546 A1 WO2020015546 A1 WO 2020015546A1 CN 2019095075 W CN2019095075 W CN 2019095075W WO 2020015546 A1 WO2020015546 A1 WO 2020015546A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
speech recognition
time
band energy
frequency
Prior art date
Application number
PCT/CN2019/095075
Other languages
English (en)
French (fr)
Inventor
薛少飞 (Xue Shaofei)
Original Assignee
Alibaba Group Holding Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited filed Critical Alibaba Group Holding Limited
Publication of WO2020015546A1 publication Critical patent/WO2020015546A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Definitions

  • the present application belongs to the field of Internet technology, and particularly relates to a far-field speech recognition method, a speech recognition model training method, and a server.
  • Far-field speech recognition is an important technology in the field of speech interaction. Long-distance sounds can be recognized through far-field speech recognition technology (for example, speech within 1m to 5m can be recognized). Far-field speech recognition is mainly used in the field of smart homes. For example, it can be used in smart speakers, smart TVs and other devices, as well as in conference transcription and other fields.
  • the purpose of this application is to provide a far-field speech recognition method, a speech recognition model training method, and a server, so as to improve the recognition accuracy of the speech recognition model.
  • This application provides a far-field speech recognition method, a speech recognition model training method, and a server that are implemented as follows:
  • the speech data is recognized by a speech recognition model, where the speech recognition model is obtained by training on speech features produced by performing band energy regularization on the speech features of the speech data according to the time dimension information and frequency dimension information of the speech data.
  • a speech recognition model training method includes:
  • the speech recognition model is trained according to the speech features obtained after the band energy is regularized.
  • a model training server includes a processor and a memory for storing processor-executable instructions. When the processor executes the instructions, the following steps are implemented:
  • the speech recognition model is trained according to the speech features obtained after the band energy is regularized.
  • a computer-readable storage medium stores computer instructions thereon, the steps of the above method being implemented when the instructions are executed.
  • the far-field speech recognition method, the speech recognition model training method, and the server provided by the present application perform band energy regularization on the filtered speech features according to the time dimension information and frequency dimension information of the speech data, and train the speech recognition model on the speech features obtained after band energy regularization. Because time dimension information and frequency dimension information are introduced into the band energy regularization process, the influence of time and frequency on speech recognition accuracy can be weakened; far-field speech recognition based on this model can effectively raise the recognition accuracy, thereby achieving the technical effect of effectively improving the recognition accuracy of the speech recognition model.
  • FIG. 1 is a flowchart of a method for extracting Filter-Bank voice features
  • FIG. 2 is a flowchart of a method for extracting and obtaining static PCEN voice features
  • FIG. 3 is a method flowchart of a speech recognition model training method provided by the present application.
  • FIG. 4 is a schematic diagram of a training model provided by the present application.
  • FIG. 5 is a schematic diagram of a scenario for determining a voice feature provided by the present application.
  • FIG. 6 is a schematic structural diagram of a model training server provided by the present application.
  • FIG. 7 is a structural block diagram of a speech recognition model training device provided by the present application.
  • the following method can be used to extract features: obtain continuous voice data, pre-emphasize the acquired voice data, split the pre-emphasized voice data into frames, apply a window to the framed voice data, perform an FFT on the windowed voice data, and filter the voice data through the MEL filter bank to obtain the voice features.
  • the speech features may be compressed after filtering the speech data.
  • the following two methods may be used to obtain the speech features:
  • Filter-Bank voice features are extracted. As shown in Figure 1, after the voice data is filtered through the MEL filter bank, a Log operation compresses the features behind the Mel filter bank into a range convenient for processing.
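  • For illustration, the following is a minimal Python/NumPy sketch of this extraction pipeline (pre-emphasis, framing, windowing, FFT, Mel filtering, Log compression). The sample rate, frame length, hop size, filter count, and the function name are illustrative assumptions, not values specified by the application.

```python
import numpy as np

def log_mel_filterbank(signal, sr=16000, frame_len=400, hop=160,
                       n_fft=512, n_mels=40, preemph=0.97):
    """Log Filter-Bank features: pre-emphasis -> framing -> windowing
    -> FFT -> mel filtering -> Log compression (illustrative values;
    assumes the signal is at least one frame long)."""
    # Pre-emphasis
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # Framing + Hamming window
    n_frames = 1 + max(0, (len(sig) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # Power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filter bank
    def hz_to_mel(hz):  return 2595.0 * np.log10(1.0 + hz / 700.0)
    def mel_to_hz(mel): return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    E = power @ fbank.T          # filterbank energies E(t, f)
    return np.log(E + 1e-6)      # Log compression
```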
  • PCEN (per-channel energy normalization, referred to here as band energy regularization)
  • the PCEN speech feature extraction process may include static extraction of PCEN speech features and dynamic extraction of PCEN speech features.
  • as shown in Figure 2, static PCEN feature extraction replaces the Log operation with the PCEN operation, whose formula can be expressed as:
  • PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r
  • M(t,f) = (1 - s)·M(t-1,f) + s·E(t,f)
  • where E(t,f) represents the filterbank energy of each time-frequency block, M(t,f) represents the intermediate smoothed energy, s represents the smoothing coefficient, and α, δ, r, ε are preset parameters whose values can be determined empirically, for example: s=0.025, α=0.98, δ=2, r=0.5, ε=0.000001.
  • the setting values of the parameters in the above example are only an exemplary description, and other values may be used in actual implementation.
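  • A minimal NumPy sketch of the static PCEN operation above, using the example parameter values; the function name and the assumption that E is a (frames x channels) array of filterbank energies are illustrative.

```python
import numpy as np

def static_pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Static PCEN: first-order smoothing of the filterbank energies
    E[t, f] over time, then gain normalization and root compression."""
    M = np.zeros_like(E)
    M[0] = E[0]  # initialize the smoother with the first frame
    for t in range(1, len(E)):
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    # PCEN(t,f) = (E / (eps + M)^alpha + delta)^r - delta^r
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```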
  • for dynamic PCEN speech feature extraction, PCEN can be set up as a layer in the neural network, and learning the parameters of the PCEN operation formula can effectively improve the accuracy of the obtained speech features.
  • in implementation, this can be understood as an approximate-FIR-filter treatment: the parameters in the calculation formula are prescribed, with no feedback and no transformation.
  • multiple smoothing coefficients s_k can be set to obtain multiple intermediate smoothed energies M_k(t,f), and these intermediate smoothed energies are then weighted to obtain the final M(t,f).
  • the PCEN operation formula can then be expressed as:
  • PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r
  • M(t,f) = Σ_k z_k(f)·M_k(t,f)
  • M_k(t,f) = (1 - s_k)·M_k(t-1,f) + s_k·E(t,f)
  • where s_k may be a preset parameter value, z_k(f) may be a learned parameter, and the other parameters may be preset or learned, which is not limited in this application.
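  • A hedged NumPy sketch of this multi-smoother variant; the particular s_k values and the uniform placeholder weights z are illustrative assumptions, since in the application z_k(f) is learned during training.

```python
import numpy as np

def dynamic_pcen(E, s_list=(0.015, 0.04, 0.1), z=None,
                 alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Dynamic PCEN: several fixed smoothers M_k combined by per-channel
    weights z_k(f); here z is given, in training it would be learned."""
    T, F = E.shape
    if z is None:                       # uniform weights as a placeholder
        z = np.full((len(s_list), F), 1.0 / len(s_list))
    M = np.zeros((T, F))
    for k, s in enumerate(s_list):      # M_k(t,f) = (1-s_k)M_k(t-1,f) + s_k E(t,f)
        Mk = np.zeros((T, F))
        Mk[0] = E[0]
        for t in range(1, T):
            Mk[t] = (1.0 - s) * Mk[t - 1] + s * E[t]
        M += z[k] * Mk                  # weighted combination over k
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```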
  • time dimension information can be introduced to reduce the influence of time on recognition accuracy to a certain extent.
  • FIG. 3 is a method flowchart of an embodiment of a speech recognition model training method according to the present application.
  • this application provides method operation steps or device structures as shown in the following embodiments or drawings, but based on conventional or non-inventive effort the method or device may include more or fewer operation steps or module units.
  • in steps or structures with no logically necessary causal relationship, the execution order of these steps or the module structure of the device is not limited to the execution order or module structure shown in the embodiments of this application and in the accompanying drawings.
  • when applied in a real device or terminal product, the method or module structure shown in the embodiments or drawings can be executed sequentially or in parallel (for example, in a parallel-processor or multi-threaded environment, or even a distributed processing environment).
  • the speech recognition model training method may include the following steps:
  • Step 301 Acquire a filtered voice feature, where the voice feature is extracted from voice data
  • voice features can be extracted as follows: acquire continuous voice data, pre-emphasize the acquired voice data, split the pre-emphasized voice data into frames, window the framed voice data, perform an FFT on the windowed voice data, and filter the voice data with the MEL filter bank to obtain the voice features.
  • Step 302 perform frequency band energy regularization on the voice characteristics by using time dimension information and frequency dimension information of the voice data
  • performing frequency band energy regularization on the voice characteristics by using time dimension information and frequency dimension information of the voice data may include:
  • S1: determine a time influence parameter;
  • S2: use the time influence parameter to weight the intermediate smoothed energy of the previous moment and the energy of the time-frequency block of the current moment, obtaining the intermediate smoothed energy of the current moment;
  • S3: perform band energy regularization on the speech features according to the intermediate smoothed energy of the current moment.
  • Step 303 Train the speech recognition model according to the speech features obtained after the band energy is regularized.
  • the above determination of the time influence parameter may be: obtaining the band energy regularization result of the previous moment and then calculating the time influence parameter from it; or obtaining the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment, and calculating the time influence parameter from both.
  • the time influence parameter, which can also be called an input gate, can be calculated according to one of the following formulas:
  • i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * log(E(t,f)) + bias)
  • i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - E_M(f)) + bias)
  • i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - log(M(t-1,f))) + bias)
  • where i_t(t,f) represents the time influence parameter, used to weight E(t,f) and M(t-1,f); W_ir represents the weight coefficient of the connection from the previous moment's band energy regularization result PCEN(t-1,f) back to the current moment's time influence parameter; W_ie represents the weight coefficient of the connection from the energy of the current moment's time-frequency block to the current moment's time influence parameter; bias represents the bias; σ() represents the sigmoid function (a common S-shaped function in biology, also known as the S-shaped growth curve); * represents matrix multiplication; t represents time; f represents frequency; · represents point (elementwise) multiplication; E(t,f) represents the energy of the time-frequency block at the current moment; and E_M represents the mean of log(E(t,f)) calculated from global data, a parameter that can be fixed or learned during training.
  • PCEN can then be calculated according to the following formulas:
  • PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r
  • M(t,f) = (1 - i_t(t,f))·M(t-1,f) + i_t(t,f)·E(t,f)
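  • The following NumPy sketch implements this gated recursion frame by frame, using the first of the three gate formulas above; treating W_ir, W_ie, and bias as given arrays is an illustrative assumption, since the application learns them together with the network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_recurrent_pcen(E, W_ir, W_ie, bias,
                         alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Gated-Recurrent-PCEN: an input gate i_t, computed from the previous
    frame's PCEN output and the current frame energy, replaces the fixed
    smoothing coefficient s in the M(t,f) recursion."""
    T, F = E.shape
    out = np.zeros((T, F))
    M_prev = E[0]                       # initialize smoother with first frame
    pcen_prev = np.zeros(F)
    for t in range(T):
        # i_t = sigmoid(W_ir @ PCEN(t-1) + W_ie @ log E(t) + bias)
        i_t = sigmoid(W_ir @ pcen_prev + W_ie @ np.log(E[t] + eps) + bias)
        # M(t,f) = (1 - i_t) . M(t-1,f) + i_t . E(t,f)   (elementwise)
        M = (1.0 - i_t) * M_prev + i_t * E[t]
        out[t] = (E[t] / (eps + M) ** alpha + delta) ** r - delta ** r
        M_prev, pcen_prev = M, out[t]
    return out
```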
  • the band energy regularization can be used as a layer in the neural network acoustic model; that is, band energy regularization (which may be called Gated-Recurrent-PCEN) can be used as the band energy regularization layer in the training model of the speech recognition model to train the speech recognition model.
  • as shown in FIG. 4, Gated-Recurrent-PCEN is used as a layer in the neural network acoustic model, where BLSTM (Bidirectional Long Short-term Memory) can represent one or more BLSTM hidden layers and DNN can represent one or more DNN layers. BLSTM + DNN is a typical speech recognition acoustic model structure; that is, in this example a Gated-Recurrent-PCEN (band energy regularization) layer is inserted between the input and the BLSTM, and the parameters of Gated-Recurrent-PCEN can be adjusted as the network trains.
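  • A hedged PyTorch sketch of this architecture: a trainable PCEN-style layer between the input energies and the BLSTM, followed by DNN layers. The layer sizes, the target count, and the simplified single-smoother PCEN parameterization (learned per-channel alpha, delta, r; fixed s; no gate) are illustrative assumptions rather than the application's exact configuration.

```python
import torch
import torch.nn as nn

class PCENLayer(nn.Module):
    """PCEN as a trainable network layer (per-channel alpha, delta, r);
    a simplified stand-in for the Gated-Recurrent-PCEN layer of FIG. 4."""
    def __init__(self, n_mels, s=0.025, eps=1e-6):
        super().__init__()
        self.s, self.eps = s, eps
        # log-domain parameters keep alpha, delta, r positive during training
        self.log_alpha = nn.Parameter(torch.zeros(n_mels))
        self.log_delta = nn.Parameter(torch.log(torch.full((n_mels,), 2.0)))
        self.log_r = nn.Parameter(torch.log(torch.full((n_mels,), 0.5)))

    def forward(self, E):               # E: (batch, time, n_mels) energies
        alpha, delta, r = (p.exp() for p in
                           (self.log_alpha, self.log_delta, self.log_r))
        M, out = E[:, 0], []
        for t in range(E.size(1)):      # first-order smoothing over time
            M = (1 - self.s) * M + self.s * E[:, t]
            out.append((E[:, t] / (self.eps + M) ** alpha + delta) ** r
                       - delta ** r)
        return torch.stack(out, dim=1)

class AcousticModel(nn.Module):
    """Input energies -> PCEN layer -> BLSTM hidden layers -> DNN outputs."""
    def __init__(self, n_mels=40, hidden=320, n_targets=3000):
        super().__init__()
        self.pcen = PCENLayer(n_mels)
        self.blstm = nn.LSTM(n_mels, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_targets))

    def forward(self, E):
        x = self.pcen(E)
        x, _ = self.blstm(x)
        return self.dnn(x)
```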
  • recorded real far-field test data is used as the test set, where the test set contains 1000 really recorded far-field utterances at distances of 1 m to 5 m, including environmental noise such as music and vocal interference. On this basis, the results shown in Table 1 below are obtained:
  • Table 1
    Speech feature extraction method    Test data (word error rate %)
    Plain Log filter-bank               36
    Static PCEN                         33.7
    Dynamic PCEN                        28.4
    Gated-Recurrent-PCEN                26.5
  • the above method can be used in, but is not limited to, any smart home device, such as a speaker or a TV, or in a voice interaction system.
  • the speech features are obtained by performing band energy regularization using the time dimension information and the frequency dimension information, thereby improving the recognition accuracy of the finally trained model.
  • a sound pickup device (for example, a smart speaker, a smart TV, or a conference transcription device)
  • a voice processing device (for example, a processor)
  • the processor can process the voice data (for example, pre-emphasize the acquired voice data, split the pre-emphasized voice data into frames, window the framed voice data, perform an FFT on the windowed voice data, and filter the voice data through the MEL filter bank) to obtain the voice features.
  • after the band-energy-regularized speech features are obtained, the speech recognition model can be called for speech recognition, or the speech recognition model can be trained on the speech features so that its recognition accuracy becomes higher.
  • the specific application scenario is not limited in this application and can be selected according to actual needs.
  • an embodiment of the present application further provides a far-field speech recognition method, which may include the following steps:
  • Step 1 Acquire the filtered speech features, wherein the speech features are extracted from the speech data
  • voice features can be extracted as follows: acquire continuous voice data, pre-emphasize the acquired voice data, split the pre-emphasized voice data into frames, window the framed voice data, perform an FFT on the windowed voice data, and filter the voice data with the MEL filter bank to obtain the voice features.
  • Step 2 perform frequency band energy regularization on the voice characteristics by using time dimension information and frequency dimension information of the voice data
  • Step 3 The speech features obtained after the band energy is normalized are input into a speech recognition model for speech recognition.
  • the above determination of the time influence parameter may be: obtaining the band energy regularization result of the previous moment and then calculating the time influence parameter from it; or obtaining the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment, and calculating the time influence parameter from both.
  • a far-field speech recognition method is further provided, which may include the following steps:
  • the voice data is recognized by a voice recognition model trained by the above-mentioned voice recognition model training method.
  • the above-mentioned speech recognition model can be applied to speech recognition of remote speech data, and can effectively improve the recognition accuracy of far-field speech data.
  • FIG. 6 is a block diagram of a hardware structure of a server for a method for training a speech recognition model according to an embodiment of the present invention.
  • the server 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions.
  • the structure shown in FIG. 6 is only schematic, and it does not limit the structure of the electronic device.
  • the server 10 may further include more or fewer components than those shown in FIG. 6, or have a different configuration from that shown in FIG. 6.
  • the memory 104 may be used to store software programs and modules of application software, such as program instructions / modules corresponding to the speech recognition model training method in the embodiment of the present invention.
  • by running the software programs and modules stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the speech recognition model training method of the above application program.
  • the memory 104 may include a high-speed random access memory, and may further include a non-volatile memory, such as one or more magnetic storage devices, a flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include memory remotely disposed with respect to the processor 102, and these remote memories may be connected to the computer terminal 10 through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the transmission module 106 is configured to receive or send data via a network.
  • a specific example of the above network may include a wireless network provided by a communication provider of the computer terminal 10.
  • the transmission module 106 includes a network adapter (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission module 106 may be a radio frequency (RF) module, which is used to communicate with the Internet in a wireless manner.
  • the above-mentioned speech recognition model training device may be as shown in FIG. 7 and includes: an acquisition module 701, a regularization module 702, and a training module 703, where:
  • An acquisition module 701 configured to acquire a filtered voice feature, wherein the voice feature is extracted from voice data
  • a regularization module 702 configured to perform frequency band energy regularization on the voice characteristics by using time dimension information and frequency dimension information of the voice data
  • a training module 703 is configured to train a speech recognition model according to the speech features obtained after the band energy is normalized.
  • the regularization module 702 may perform frequency band energy regularization on the voice characteristics by using time dimension information and frequency dimension information of the voice data according to the following steps:
  • S1: determine a time influence parameter;
  • S2: use the time influence parameter to weight the intermediate smoothed energy of the previous moment and the energy of the time-frequency block of the current moment, obtaining the intermediate smoothed energy of the current moment;
  • S3: perform band energy regularization on the speech features according to the intermediate smoothed energy of the current moment.
  • determining the time influence parameter may include: obtaining a frequency band energy regularization result at a previous moment; and calculating and obtaining the time influence parameter according to the frequency band energy regularization result at the previous moment.
  • determining the time influence parameter according to the band energy regularization result of the previous moment may include: multiplying the weight coefficient matrix by the band energy regularization result of the previous moment to obtain a first result, where the weight coefficient is the weight coefficient of the connection from the previous moment's band energy regularization result back to the current moment's time influence parameter; adding a bias to the first result to obtain a second result; and applying the sigmoid function to the second result to obtain the time influence parameter.
  • for example, the time influence parameter can be calculated according to the following formula:
  • i_t(t,f) = σ(W_ir * PCEN(t-1,f) + bias)
  • where i_t(t,f) represents the time influence parameter; W_ir represents the weight coefficient of the connection from the previous moment's band energy regularization result PCEN(t-1,f) back to the current moment's time influence parameter; bias represents the bias; σ() represents the sigmoid function; * represents matrix multiplication; t represents time; and f represents frequency.
  • determining the time influence parameter may include: obtaining the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment; and calculating the time influence parameter from the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment.
  • a time influence parameter may be calculated according to one of the following formulas:
  • i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * log(E(t,f)) + bias)
  • i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - E_M(f)) + bias)
  • i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - log(M(t-1,f))) + bias)
  • where i_t(t,f) represents the time influence parameter; W_ir represents the weight coefficient of the connection from the previous moment's band energy regularization result PCEN(t-1,f) back to the current moment's time influence parameter; bias represents the bias; σ() represents the sigmoid function; * represents matrix multiplication; t represents time; f represents frequency; E(t,f) represents the energy of the time-frequency block at the current moment; and E_M represents the mean of log(E(t,f)) calculated from global data.
  • the speech recognition model is trained by using the frequency band energy regularization as a frequency band energy regularization layer in the training model of the speech recognition model.
  • the band energy regularization layer may be located between the input of the training model and the bidirectional long-term and short-term memory neural network layer.
  • the speech recognition model training method and server provided by this application perform band energy regularization on the filtered speech features according to the time dimension information and frequency dimension information of the speech data, and train the speech recognition model on the speech features obtained after band energy regularization. Because the time dimension information and the frequency dimension information are introduced in the process of band energy regularization, the influence of time and frequency on the accuracy of speech recognition can be weakened, and the technical effect of effectively improving the recognition accuracy of the speech recognition model is achieved.
  • the devices or modules described in the foregoing embodiments may be specifically implemented by a computer chip or entity, or may be implemented by a product having a certain function.
  • the functions are divided into various modules and described separately.
  • the functions of each module may be implemented in the same or multiple software and / or hardware.
  • a module that implements a certain function may also be implemented by combining multiple submodules or subunits.
  • the method, device, or module described in this application may be implemented in computer-readable program code, and the controller may be implemented in any suitable manner.
  • for example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller.
  • examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320.
  • the memory controller can also be implemented as part of the control logic of the memory.
  • those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be considered a hardware component, and the devices included in it for implementing various functions can also be considered structures within the hardware component. Or even, the means for implementing various functions can be regarded as structures that can be both software modules implementing the method and structures within the hardware component.
  • program modules include routines, programs, objects, components, data structures, classes, etc. that perform specific tasks or implement specific abstract data types.
  • program modules may be located in local and remote computer storage media, including storage devices.
  • the present application can be implemented by means of software plus the necessary hardware. Based on such an understanding, the technical solution of the present application, in essence or the part contributing to the existing technology, may be embodied in the form of a software product, or may be reflected in the implementation of data migration.
  • the computer software product can be stored in a storage medium, such as a ROM/RAM, magnetic disk, or optical disc, and includes a number of instructions to enable a computer device (which can be a personal computer, a mobile terminal, a server, or a network device, etc.) to execute the methods described in the embodiments of this application or in certain parts of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This application provides a far-field speech recognition method, a speech recognition model training method, and a server. The far-field speech recognition method includes: acquiring speech data; determining whether the speech data is far-field speech data; and, when the speech data is determined to be far-field speech data, recognizing the speech data with a speech recognition model, where the speech recognition model is obtained by training on speech features produced by performing band energy regularization on the speech features of the speech data according to the time dimension information and frequency dimension information of the speech data. With the technical solution provided by the embodiments of this application, because time dimension information and frequency dimension information are introduced into the band energy regularization process, the influence of time and frequency on speech recognition accuracy can be weakened; performing far-field speech recognition based on this speech recognition model can effectively raise the recognition accuracy rate, thereby achieving the technical effect of effectively improving the recognition accuracy of the speech recognition model.

Description

Far-field speech recognition method, speech recognition model training method, and server
This application claims priority to Chinese Patent Application No. 201810775407.X, filed on July 16, 2018 and entitled "Far-field speech recognition method, speech recognition model training method, and server", the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the field of Internet technology, and in particular relates to a far-field speech recognition method, a speech recognition model training method, and a server.
Background
Far-field speech recognition is an important technology in the field of speech interaction; far-field speech recognition technology can recognize sounds from a distance (for example, speech within 1 m to 5 m). Far-field speech recognition is mainly applied in the smart home field, for example in devices such as smart speakers and smart TVs, and also in fields such as meeting transcription.
In real environments, however, there is generally a large amount of interference such as noise, multipath reflection, and reverberation, which degrades the quality of the picked-up sound signal. For far-field speech recognition, the main cause of the drop in recognition accuracy is the attenuation of speech energy caused by distance.
No effective solution has yet been proposed for how to effectively reduce the degradation of speech recognition model accuracy caused by speech energy attenuation.
Summary of the Invention
The purpose of this application is to provide a far-field speech recognition method, a speech recognition model training method, and a server, so as to improve the recognition accuracy of the speech recognition model.
The far-field speech recognition method, speech recognition model training method, and server provided by this application are implemented as follows:
A far-field speech recognition method includes:
acquiring speech data;
determining whether the speech data is far-field speech data;
when the speech data is determined to be far-field speech data, recognizing the speech data with a speech recognition model, where the speech recognition model is obtained by training on speech features produced by performing band energy regularization on the speech features of the speech data according to the time dimension information and frequency dimension information of the speech data.
A speech recognition model training method includes:
acquiring filtered speech features, where the speech features are extracted from speech data;
performing band energy regularization on the speech features according to the time dimension information and frequency dimension information of the speech data;
training a speech recognition model on the speech features obtained after band energy regularization.
A model training server includes a processor and a memory for storing processor-executable instructions; when the processor executes the instructions, the following steps are implemented:
acquiring filtered speech features, where the speech features are extracted from speech data;
performing band energy regularization on the speech features according to the time dimension information and frequency dimension information of the speech data;
training a speech recognition model on the speech features obtained after band energy regularization.
A computer-readable storage medium has computer instructions stored thereon; when the instructions are executed, the steps of the above method are implemented.
The far-field speech recognition method, speech recognition model training method, and server provided by this application perform band energy regularization on the filtered speech features according to the time dimension information and frequency dimension information of the speech data, and train the speech recognition model on the speech features obtained after band energy regularization. Because time dimension information and frequency dimension information are introduced into the band energy regularization process, the influence of time and frequency on speech recognition accuracy can be weakened; performing far-field speech recognition based on this model can effectively raise the recognition accuracy, thereby achieving the technical effect of effectively improving the recognition accuracy of the speech recognition model.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for extracting Filter-Bank speech features;
FIG. 2 is a flowchart of a method for extracting static PCEN speech features;
FIG. 3 is a flowchart of the speech recognition model training method provided by this application;
FIG. 4 is a schematic diagram of the training model provided by this application;
FIG. 5 is a schematic diagram of a speech feature determination scenario provided by this application;
FIG. 6 is a schematic diagram of the architecture of the model training server provided by this application;
FIG. 7 is a structural block diagram of the speech recognition model training device provided by this application.
Detailed Description
To enable those skilled in the art to better understand the technical solutions in this application, the technical solutions in the embodiments of this application will be described clearly and completely below with reference to the accompanying drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of this application rather than all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort shall fall within the protection scope of this application.
Beyond factors such as environmental noise that lower far-field speech recognition accuracy, changes in distance cause speech energy attenuation, which also lowers the recognition accuracy of far-field speech. In real acoustic scenes, moreover, not only distance affects recognition accuracy: changes in a speaker's volume from one moment to the next also affect it.
For a speech recognition model, speech features generally need to be extracted first and then fed into a training model to train the speech recognition model.
In implementation, features can be extracted as follows: acquire continuous speech data, pre-emphasize the acquired speech data, split the pre-emphasized speech data into frames, apply a window to the framed speech data, apply an FFT to the windowed speech data, and filter the result with a MEL filter bank to obtain the speech features.
Specifically, to make the speech recognition model trained on the extracted speech features more accurate, the speech features can be compressed after the speech data is filtered, for example in either of the following two ways:
1) Extract Filter-Bank speech features. As shown in FIG. 1, after the speech data is filtered by the MEL filter bank, a Log operation compresses the features behind the Mel filter bank into a range convenient for processing.
However, a simple Log operation has low resolution for low-energy audio features, which loses information from the speech data.
2) Extract PCEN (per-channel energy normalization, band energy regularization) speech features. The PCEN feature extraction flow can include static extraction of PCEN speech features and dynamic extraction of PCEN speech features.
As shown in FIG. 2, compared with Filter-Bank feature extraction, static PCEN feature extraction replaces the Log operation with the PCEN operation, whose formula can be expressed as:
PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r
M(t,f) = (1 - s)·M(t-1,f) + s·E(t,f)
where E(t,f) denotes the filterbank energy of each time-frequency block, M(t,f) denotes the intermediate smoothed energy, s denotes the smoothing coefficient, and α, δ, r, ε are preset parameters whose values can be determined empirically, for example: s=0.025, α=0.98, δ=2, r=0.5, ε=0.000001. Note, however, that the parameter values in this example are only illustrative; other values can be used in actual implementation.
For dynamic PCEN feature extraction, PCEN can be set up as a layer in a neural network, and the parameters in the PCEN operation formula are learned so as to effectively improve the accuracy of the resulting speech features. In implementation, this can be understood as an approximate-FIR-filter treatment: the parameters in the calculation formula are prescribed, with no feedback and no transformation. Specifically, several values of s can be set to obtain several intermediate smoothed energies M_k(t,f), which are then weighted to obtain the final M(t,f). Specifically, the PCEN operation formula can be expressed as:
PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r
M(t,f) = Σ_k z_k(f)·M_k(t,f)
M_k(t,f) = (1 - s_k)·M_k(t-1,f) + s_k·E(t,f)
where s_k can be a preset parameter value and z_k(f) can be a learned parameter; the other parameters can be preset or learned, which this application does not limit.
For the dynamic PCEN features extracted above, however, only the influence of frequency on the intermediate smoothed energy is considered. In actual sound pickup, not only distance and frequency affect recognition accuracy: if a speaker talks loudly and then softly, or softly and then loudly, that is, if the speaking volume differs between earlier and later moments, the accuracy of speech recognition is also affected. In other words, time also affects the accuracy of speech recognition.
For this reason, this example considers that if the influence of time is added into the dynamic PCEN feature extraction process, recognition accuracy can be effectively improved. Specifically, time dimension information can be introduced, which can reduce the influence of time on recognition accuracy to a certain extent.
FIG. 3 is a flowchart of an embodiment of a speech recognition model training method according to this application. Although this application provides method operation steps or device structures as shown in the following embodiments or drawings, the method or device may include more or fewer operation steps or module units based on conventional or non-inventive effort. In steps or structures with no logically necessary causal relationship, the execution order of the steps or the module structure of the device is not limited to the execution orders or module structures described in the embodiments of this application and shown in the drawings. When the method or module structure is applied in a real device or terminal product, it can be executed sequentially or in parallel according to the method or module structure shown in the embodiments or drawings (for example, in a parallel-processor or multi-threaded environment, or even a distributed processing environment).
Specifically, as shown in FIG. 3, the speech recognition model training method provided by an embodiment of this application can include the following steps:
Step 301: acquire filtered speech features, where the speech features are extracted from speech data.
Specifically, speech features can be extracted as follows: acquire continuous speech data, pre-emphasize the acquired speech data, split the pre-emphasized speech data into frames, window the framed speech data, apply an FFT to the windowed speech data, and filter with a MEL filter bank to obtain the speech features.
Step 302: perform band energy regularization on the speech features according to the time dimension information and frequency dimension information of the speech data.
Considering that in actual sound pickup not only distance and frequency affect recognition accuracy, but differences in a speaker's volume between earlier and later moments (talking loudly then softly, or softly then loudly) also affect the accuracy of speech recognition, time affects the accuracy of speech recognition as well. Time dimension information can therefore be introduced to perform band energy regularization on the speech features.
Specifically, performing band energy regularization on the speech features according to the time dimension information and frequency dimension information of the speech data can include:
S1: determine a time influence parameter;
S2: use the time influence parameter to weight the intermediate smoothed energy of the previous moment and the energy of the time-frequency block of the current moment, obtaining the intermediate smoothed energy of the current moment;
S3: perform band energy regularization on the speech features according to the intermediate smoothed energy of the current moment.
Step 303: train the speech recognition model on the speech features obtained after band energy regularization.
In the above example, determining the time influence parameter may be: obtaining the band energy regularization result of the previous moment and then calculating the time influence parameter from it; or obtaining the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment, and calculating the time influence parameter from both.
Specific formulas for calculating the time influence parameter are listed below for explanation; note, however, that the listed formulas are only illustrative, and this application does not specifically limit them.
The time influence parameter, which can also be called an input gate, can be calculated according to one of the following formulas:
1) i_t(t,f) = σ(W_ir * PCEN(t-1,f) + bias)
2) i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * log(E(t,f)) + bias)
3) i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - E_M(f)) + bias)
4) i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - log(M(t-1,f))) + bias)
where i_t(t,f) denotes the time influence parameter, used to weight E(t,f) and M(t-1,f); W_ir denotes the weight coefficient of the connection from the previous moment's band energy regularization result PCEN(t-1,f) back to the current moment's time influence parameter; W_ie denotes the weight coefficient of the connection from the current moment's time-frequency block energy to the current moment's time influence parameter; bias denotes the bias; σ() denotes the sigmoid function (a common S-shaped function in biology, also known as the S-shaped growth curve); * denotes matrix multiplication; t denotes time; f denotes frequency; · denotes point (elementwise) multiplication; E(t,f) denotes the energy of the time-frequency block at the current moment; and E_M denotes the mean of log(E(t,f)) computed over global data, a parameter that can be fixed or learned during training.
After the time influence parameter has been calculated by the above formulas, PCEN can be calculated as follows:
PCEN(t,f) = (E(t,f) / (ε + M(t,f))^α + δ)^r - δ^r
M(t,f) = (1 - i_t(t,f))·M(t-1,f) + i_t(t,f)·E(t,f)
In the above example, band energy regularization can serve as a layer in a neural-network acoustic model; that is, band energy regularization (which may be called Gated-Recurrent-PCEN) can be used as the band energy regularization layer in the training model of the speech recognition model to train the speech recognition model.
FIG. 4 is a schematic diagram of Gated-Recurrent-PCEN as a layer of a neural-network acoustic model, where BLSTM (Bidirectional Long Short-term Memory) can represent one or more BLSTM hidden layers and DNN can represent one or more DNN layers. BLSTM+DNN is a typical speech recognition acoustic model structure; that is, in this example a Gated-Recurrent-PCEN (band energy regularization) layer is inserted between the input and the BLSTM, and the parameters of Gated-Recurrent-PCEN can be adjusted as the network trains.
In the above example, because feedback-based band energy regularization is adopted, the influence of time dimension information (the input gate i_t(t,f)) on the filter coefficients is introduced; that is, band energy regularization is performed in a manner approximating an FIR filter, which can effectively reduce performance loss relative to an IIR filter and can effectively improve performance, especially when the amount of data is large.
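To make the approximate-FIR reading concrete: with a prescribed smoothing coefficient s (the non-gated case), the first-order recursion unrolls into an exponentially decaying weighted sum of past energies, which can be truncated at a finite depth K (an illustrative choice) to give a finite impulse response:

$$M(t,f)=s\sum_{k=0}^{t-1}(1-s)^{k}\,E(t-k,f)+(1-s)^{t}M(0,f)\approx s\sum_{k=0}^{K}(1-s)^{k}\,E(t-k,f)$$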
The effect of the above method is explained below with a set of actual experimental results. In this example, recorded real far-field test data is used as the test set; the test set contains 1000 really recorded far-field utterances at distances of 1 m to 5 m, including environmental noise such as music and vocal interference. On this basis, the results shown in Table 1 below are obtained:
Table 1
Speech feature extraction method    Test data (word error rate %)
Plain Log filter-bank               36
Static PCEN                         33.7
Dynamic PCEN                        28.4
Gated-Recurrent-PCEN                26.5
As Table 1 shows, performing band energy regularization with the method in this example brings a reduction of around 7% in word error rate.
The above method can be used in, but is not limited to, any smart home device, such as a speaker or a TV, or in a voice interaction system.
In the above example, it is considered that far-field speech recognition accuracy usually drops sharply compared with near-field recognition, mainly because distance greatly reduces speech energy, which sharply lowers recognition accuracy; speech with too little energy usually mismatches the recognition model to a considerable degree, lowering recognition accuracy. In practical application scenarios, both the distance between the person and the pickup microphone and changes in the person's own volume cause varying degrees of speech energy attenuation. Therefore, in this example, speech features are obtained by performing band energy regularization with time dimension information and frequency dimension information, improving the recognition accuracy of the finally trained model.
Specifically, the above speech feature processing can be applied in the scenario shown in FIG. 5. After a user utters speech, a sound pickup device (for example, a smart speaker, smart TV, or meeting transcription device) can pick up the speech data and pass it to a speech processing device (for example, a processor). After acquiring continuous speech data, the processor can process it (for example: pre-emphasize the acquired speech data, split the pre-emphasized data into frames, window the framed data, apply an FFT to the windowed data, and filter it with a MEL filter bank) to obtain the speech features. Once the speech features are obtained, band energy regularization can be performed on them according to the time dimension information and frequency dimension information of the speech data in the manner provided in the above example, weakening the degradation of pickup signal quality caused by the large amounts of noise, multipath reflection, and reverberation present in real environments, and finally yielding the band-energy-regularized speech features.
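To make this scenario concrete, the following hedged Python sketch wires the illustrative pieces from earlier in this document together for one utterance: audio in, log-mel energies, band energy regularization, acoustic-model scoring. The names log_mel_filterbank, gated_recurrent_pcen, and AcousticModel refer to the earlier sketches, the gate weights below are placeholders (learned in practice), and the decoder that would follow is elided since the application does not specify one.

```python
import numpy as np
import torch

# Hypothetical glue code reusing the earlier sketches; all names are illustrative.
audio = np.random.randn(16000 * 3)            # 3 s of audio standing in for mic input
logmel = log_mel_filterbank(audio)            # (frames, n_mels) log energies
E = np.exp(logmel)                            # linear filterbank energies E(t, f)

# Path A: precomputed Gated-Recurrent-PCEN features for an existing recognizer
F = E.shape[1]
feats = gated_recurrent_pcen(E, W_ir=0.1 * np.eye(F),
                             W_ie=0.1 * np.eye(F), bias=np.zeros(F))

# Path B: feed raw energies to a model whose first layer performs the regularization
model = AcousticModel(n_mels=F)               # untrained here, for shapes only
scores = model(torch.tensor(E, dtype=torch.float32).unsqueeze(0))
print(feats.shape, scores.shape)              # (frames, F), (1, frames, n_targets)
```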
After the band-energy-regularized speech features are obtained, a speech recognition model can be called to perform speech recognition on them, or a speech recognition model can be trained on them so that its recognition accuracy becomes higher. This application does not limit the specific application scenario, which can be selected according to actual needs.
On this basis, an embodiment of this application further provides a far-field speech recognition method that can include the following steps:
Step 1: acquire filtered speech features, where the speech features are extracted from speech data.
Specifically, speech features can be extracted as follows: acquire continuous speech data, pre-emphasize the acquired speech data, split the pre-emphasized speech data into frames, window the framed speech data, apply an FFT to the windowed speech data, and filter with a MEL filter bank to obtain the speech features.
Step 2: perform band energy regularization on the speech features according to the time dimension information and frequency dimension information of the speech data.
Considering that in actual sound pickup not only distance and frequency affect recognition accuracy, but differences in a speaker's volume between earlier and later moments also affect the accuracy of speech recognition, time affects the accuracy of speech recognition as well. Time dimension information can therefore be introduced to perform band energy regularization on the speech features.
Step 3: input the speech features obtained after band energy regularization into a speech recognition model for speech recognition.
In the above example, determining the time influence parameter may be: obtaining the band energy regularization result of the previous moment and then calculating the time influence parameter from it; or obtaining the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment, and calculating the time influence parameter from both.
The specific data processing steps in this speech recognition method are similar to the data processing steps in the above speech recognition model training method, and this application does not repeat them.
Further, an embodiment of this application also provides a far-field speech recognition method that can include the following steps:
S1: acquire speech data;
S2: determine whether the speech data is far-field speech data;
S3: when the speech data is determined to be far-field speech data, recognize the speech data with a speech recognition model trained by the above speech recognition model training method.
That is, the above speech recognition model can be applied to speech recognition of far-field speech data and can effectively improve the recognition accuracy of far-field speech data.
The method embodiments provided by the embodiments of this application can be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a server as an example, FIG. 6 is a block diagram of the hardware structure of a server for a speech recognition model training method according to an embodiment of the present invention. As shown in FIG. 6, the server 10 can include one or more processors 102 (only one is shown in the figure; the processor 102 can include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions. Those of ordinary skill in the art will understand that the structure shown in FIG. 6 is only schematic and does not limit the structure of the above electronic device. For example, the server 10 may include more or fewer components than shown in FIG. 6, or have a configuration different from that shown in FIG. 6.
The memory 104 can be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the speech recognition model training method in the embodiment of the present invention. By running the software programs and modules stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the speech recognition model training method of the above application program. The memory 104 can include high-speed random access memory and can also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 can further include memory disposed remotely from the processor 102, and such remote memory can be connected to the computer terminal 10 through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission module 106 is used to receive or send data via a network. Specific examples of the above network can include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission module 106 includes a Network Interface Controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission module 106 can be a Radio Frequency (RF) module used to communicate with the Internet wirelessly.
At the software level, the above speech recognition model training device can be as shown in FIG. 7 and include: an acquisition module 701, a regularization module 702, and a training module 703, where:
the acquisition module 701 is used to acquire filtered speech features, where the speech features are extracted from speech data;
the regularization module 702 is used to perform band energy regularization on the speech features according to the time dimension information and frequency dimension information of the speech data;
the training module 703 is used to train a speech recognition model on the speech features obtained after band energy regularization.
In one embodiment, the regularization module 702 can perform band energy regularization on the speech features according to the time dimension information and frequency dimension information of the speech data by the following steps:
S1: determine a time influence parameter;
S2: use the time influence parameter to weight the intermediate smoothed energy of the previous moment and the energy of the time-frequency block of the current moment, obtaining the intermediate smoothed energy of the current moment;
S3: perform band energy regularization on the speech features according to the intermediate smoothed energy of the current moment.
In one embodiment, determining the time influence parameter can include: obtaining the band energy regularization result of the previous moment; and calculating the time influence parameter from the band energy regularization result of the previous moment.
In one embodiment, determining the time influence parameter from the band energy regularization result of the previous moment can include: multiplying the weight coefficient matrix by the band energy regularization result of the previous moment to obtain a first result, where the weight coefficient is the weight coefficient of the connection from the previous moment's band energy regularization result back to the current moment's time influence parameter; adding a bias to the first result to obtain a second result; and applying the sigmoid function to the second result to obtain the time influence parameter.
For example, the time influence parameter can be calculated according to the following formula:
i_t(t,f) = σ(W_ir * PCEN(t-1,f) + bias)
where i_t(t,f) denotes the time influence parameter, W_ir denotes the weight coefficient of the connection from the previous moment's band energy regularization result PCEN(t-1,f) back to the current moment's time influence parameter, bias denotes the bias, σ() denotes the sigmoid function, * denotes matrix multiplication, t denotes time, and f denotes frequency.
In one embodiment, determining the time influence parameter can include: obtaining the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment; and calculating the time influence parameter from the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment.
In one embodiment, from the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment, the time influence parameter can be calculated according to one of the following formulas:
i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * log(E(t,f)) + bias)
i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - E_M(f)) + bias)
i_t(t,f) = σ(W_ir * PCEN(t-1,f) + W_ie * (log(E(t,f)) - log(M(t-1,f))) + bias)
where i_t(t,f) denotes the time influence parameter, W_ir denotes the weight coefficient of the connection from the previous moment's band energy regularization result PCEN(t-1,f) back to the current moment's time influence parameter, bias denotes the bias, σ() denotes the sigmoid function, * denotes matrix multiplication, t denotes time, f denotes frequency, E(t,f) denotes the energy of the time-frequency block at the current moment, and E_M denotes the mean of log(E(t,f)) computed over global data.
In one embodiment, band energy regularization serves as the band energy regularization layer in the training model of the speech recognition model to train the speech recognition model.
In one embodiment, the band energy regularization layer can be located between the input of the training model and the bidirectional long short-term memory neural network layer.
The speech recognition model training method and server provided by this application perform band energy regularization on the filtered speech features according to the time dimension information and frequency dimension information of the speech data, and train the speech recognition model on the speech features obtained after band energy regularization. Because time dimension information and frequency dimension information are introduced into the band energy regularization process, the influence of time and frequency on speech recognition accuracy can be weakened, achieving the technical effect of effectively improving the recognition accuracy of the speech recognition model.
Although this application provides the method operation steps as described in the embodiments or flowcharts, more or fewer operation steps can be included based on conventional or non-inventive effort. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only one. When executed by an actual device or client product, the steps can be executed sequentially or in parallel according to the methods shown in the embodiments or drawings (for example, in a parallel-processor or multi-threaded environment).
The devices or modules set forth in the above embodiments can be specifically implemented by a computer chip or entity, or by a product having certain functions. For convenience of description, the above devices are described with the functions divided into various modules. When implementing this application, the functions of the modules can be implemented in one or more pieces of software and/or hardware. Of course, a module implementing a certain function can also be implemented by a combination of multiple sub-modules or sub-units.
The methods, devices, or modules described in this application can be implemented in computer-readable program code, and a controller can be implemented in any suitable manner. For example, a controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component. Or the devices for implementing various functions can even be regarded both as software modules implementing the method and as structures within the hardware component.
Some of the modules in the devices described in this application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media, including storage devices.
From the description of the above implementations, those skilled in the art can clearly understand that this application can be implemented by means of software plus the necessary hardware. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product, or embodied in the implementation of data migration. The computer software product can be stored in a storage medium, such as a ROM/RAM, magnetic disk, or optical disc, and includes a number of instructions to cause a computer device (which can be a personal computer, mobile terminal, server, network device, or the like) to execute the methods described in the embodiments of this application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between embodiments, reference can be made from one embodiment to another, and each embodiment focuses on what differs from the others. All or part of this application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Although this application has been depicted through embodiments, those of ordinary skill in the art know that this application has many variations and changes that do not depart from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of this application.

Claims (12)

  1. A far-field speech recognition method, wherein the method comprises:
    acquiring speech data;
    determining whether the speech data is far-field speech data;
    when the speech data is determined to be far-field speech data, recognizing the speech data with a speech recognition model, wherein the speech recognition model is obtained by training on speech features produced by performing band energy regularization on the speech features of the speech data according to the time dimension information and frequency dimension information of the speech data.
  2. The method according to claim 1, further comprising:
    acquiring filtered speech features, wherein the speech features are extracted from speech data;
    performing band energy regularization on the speech features according to the time dimension information and frequency dimension information of the speech data;
    training a speech recognition model on the speech features obtained after band energy regularization, to obtain the speech recognition model.
  3. The method according to claim 2, wherein performing band energy regularization on the speech features according to the time dimension information and frequency dimension information of the speech data comprises:
    determining a time influence parameter;
    weighting, by the time influence parameter, the intermediate smoothed energy of the previous moment and the energy of the time-frequency block of the current moment, to obtain the intermediate smoothed energy of the current moment;
    performing band energy regularization on the speech features according to the intermediate smoothed energy of the current moment.
  4. The method according to claim 3, wherein determining the time influence parameter comprises:
    obtaining the band energy regularization result of the previous moment;
    calculating the time influence parameter from the band energy regularization result of the previous moment.
  5. The method according to claim 4, wherein determining the time influence parameter from the band energy regularization result of the previous moment comprises:
    multiplying the weight coefficient matrix by the band energy regularization result of the previous moment to obtain a first result, wherein the weight coefficient is the weight coefficient of the connection from the previous moment's band energy regularization result back to the current moment's time influence parameter;
    adding a bias to the first result to obtain a second result;
    applying the sigmoid function to the second result to obtain the time influence parameter.
  6. The method according to claim 3, wherein determining the time influence parameter comprises:
    obtaining the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment;
    calculating the time influence parameter from the band energy regularization result of the previous moment and the energy of the time-frequency block of the current moment.
  7. The method according to any one of claims 1 to 6, wherein band energy regularization serves as a band energy regularization layer in a training model of the speech recognition model to train the speech recognition model.
  8. The method according to claim 7, wherein the band energy regularization layer is located between the input of the training model and the bidirectional long short-term memory neural network layer.
  9. A speech recognition model training method, comprising:
    acquiring filtered speech features, wherein the speech features are extracted from speech data;
    performing band energy regularization on the speech features according to the time dimension information and frequency dimension information of the speech data;
    training a speech recognition model on the speech features obtained after band energy regularization.
  10. A far-field speech recognition device, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements the method according to any one of claims 1 to 8.
  11. A model training server, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements the method according to claim 9.
  12. A computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed, implement the steps of the method according to any one of claims 1 to 8.
PCT/CN2019/095075 2018-07-16 2019-07-08 Far-field speech recognition method, speech recognition model training method and server WO2020015546A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810775407.XA CN110797008B (zh) 2018-07-16 Far-field speech recognition method, speech recognition model training method and server
CN201810775407.X 2018-07-16

Publications (1)

Publication Number Publication Date
WO2020015546A1 true WO2020015546A1 (zh) 2020-01-23

Family

ID=69164997

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/095075 WO2020015546A1 (zh) 2018-07-16 2019-07-08 Far-field speech recognition method, speech recognition model training method and server

Country Status (2)

Country Link
CN (1) CN110797008B (zh)
WO (1) WO2020015546A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331177A (zh) * 2020-11-05 2021-02-05 Ctrip Computer Technology (Shanghai) Co., Ltd. Prosody-based speech synthesis method, model training method and related device
CN112331186B (zh) * 2020-11-19 2022-03-25 AISpeech Co., Ltd. Voice wake-up method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683677A (zh) * 2015-11-06 2017-05-17 Alibaba Group Holding Limited Speech recognition method and device
CN107346659A (zh) * 2017-06-05 2017-11-14 Baidu Online Network Technology (Beijing) Co., Ltd. Artificial-intelligence-based speech recognition method, device and terminal
US20180197533A1 (en) * 2017-01-11 2018-07-12 Google Llc Systems and Methods for Recognizing User Speech

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4301896B2 (ja) * 2003-08-22 2009-07-22 Sharp Corporation Signal analysis device, speech recognition device, program, recording medium, and electronic apparatus
CN1975856B (zh) * 2006-10-30 2011-11-09 邹采荣 Speech emotion recognition method based on support vector machines
US10096321B2 (en) * 2016-08-22 2018-10-09 Intel Corporation Reverberation compensation for far-field speaker recognition
CN107610707B (zh) * 2016-12-15 2018-08-31 Ping An Technology (Shenzhen) Co., Ltd. Voiceprint recognition method and device
CN107680602A (zh) * 2017-08-24 2018-02-09 Ping An Technology (Shenzhen) Co., Ltd. Voice fraud recognition method, apparatus, terminal device and storage medium
CN107452372B (zh) * 2017-09-22 2020-12-11 Baidu Online Network Technology (Beijing) Co., Ltd. Training method and apparatus for far-field speech recognition model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683677A (zh) * 2015-11-06 2017-05-17 Alibaba Group Holding Limited Speech recognition method and device
US20180197533A1 (en) * 2017-01-11 2018-07-12 Google Llc Systems and Methods for Recognizing User Speech
CN107346659A (zh) * 2017-06-05 2017-11-14 Baidu Online Network Technology (Beijing) Co., Ltd. Artificial-intelligence-based speech recognition method, device and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, YUXUAN ET AL.: "TRAINABLE FRONTEND FOR ROBUST AND FAR-FIELD KEYWORD SPOTTING", 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 31 December 2017 (2017-12-31), XP033259496 *

Also Published As

Publication number Publication date
CN110797008A (zh) 2020-02-14
CN110797008B (zh) 2024-03-29

Similar Documents

Publication Publication Date Title
CN110600017B (zh) 语音处理模型的训练方法、语音识别方法、系统及装置
US11657823B2 (en) Channel-compensated low-level features for speaker recognition
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
Han et al. Learning spectral mapping for speech dereverberation and denoising
WO2021042870A1 (zh) Speech processing method and apparatus, electronic device, and computer-readable storage medium
CN110211575B (zh) 用于数据增强的语音加噪方法及系统
CN108417224B (zh) 双向神经网络模型的训练和识别方法及系统
WO2018223727A1 (zh) Voiceprint recognition method, apparatus, device and medium
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
WO2015047517A1 (en) Keyword detection
CN111429932A (zh) 语音降噪方法、装置、设备及介质
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
WO2023001128A1 (zh) Audio data processing method, apparatus and device
WO2020015546A1 (zh) Far-field speech recognition method, speech recognition model training method and server
CN106033673B (zh) 一种近端语音信号检测方法及装置
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
EP3516652A1 (en) Channel-compensated low-level features for speaker recognition
CN110875037A (zh) 语音数据处理方法、装置及电子设备
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
Han et al. Reverberation and noise robust feature compensation based on IMM
O’Reilly et al. Effective and inconspicuous over-the-air adversarial examples with adaptive filtering
Lu et al. Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition
Sun et al. Frame selection of interview channel for NIST speaker recognition evaluation
US20240079022A1 (en) General speech enhancement method and apparatus using multi-source auxiliary information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19837724

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19837724

Country of ref document: EP

Kind code of ref document: A1