WO2022063215A1 - Feature-domain speech enhancement method combined with an AI model and related products - Google Patents

Feature-domain speech enhancement method combined with an AI model and related products - Download PDF

Info

Publication number
WO2022063215A1
WO2022063215A1 PCT/CN2021/120226 CN2021120226W WO2022063215A1 WO 2022063215 A1 WO2022063215 A1 WO 2022063215A1 CN 2021120226 W CN2021120226 W CN 2021120226W WO 2022063215 A1 WO2022063215 A1 WO 2022063215A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
model
domain
gain
feature
Prior art date
Application number
PCT/CN2021/120226
Other languages
English (en)
French (fr)
Inventor
康力
叶顺舟
陆成
Original Assignee
紫光展锐(重庆)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 紫光展锐(重庆)科技有限公司
Publication of WO2022063215A1

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals

Definitions

  • the present application relates to the technical field of communication processing, and in particular, to a feature domain speech enhancement method combined with an AI model and related products.
  • speech enhancement helps the interactive terminal better understand the user's intent and improves the user experience.
  • Speech enhancement has been researched for decades and is widely used in communication, security, home and other scenarios.
  • Traditional voice enhancement technologies include single-channel voice enhancement and multi-channel voice enhancement, wherein multi-channel voice enhancement uses microphone array technology.
  • Single-channel speech enhancement has a very wide range of application scenarios. On the one hand, single-channel speech enhancement is low-cost and more flexible and convenient to use.
  • on the other hand, single-channel speech enhancement cannot exploit spatial information such as the angle of arrival, so complex scenes, especially non-stationary noise scenes, are very difficult to handle.
  • in noisy environments, both the voice trigger detection (keyword wake-up) function and the automatic speech recognition function suffer an increased misrecognition rate and a decreased recognition rate, causing interaction difficulties.
  • the embodiments of the present application disclose a feature domain voice enhancement method and related products combined with an AI model, which improve recognition accuracy, reduce interaction difficulty, and improve user experience through feature domain voice enhancement.
  • a first aspect provides a feature domain speech enhancement method combined with an AI model, the method comprising the following steps: performing an initial operation on an initial speech signal to obtain a feature domain signal; determining a gain of the feature domain signal based on an AI model, and enhancing the feature domain signal according to the gain to obtain a feature domain enhanced signal; and
  • inputting the feature domain enhanced signal as input data into an operation model, and performing the operation to obtain the output result of the initial speech signal.
  • a second aspect provides a feature domain speech enhancement device combined with an AI model, the device comprising:
  • a processing unit configured to perform an initial operation on the initial speech signal to obtain a characteristic domain signal; determine the gain of the characteristic domain signal based on the AI model, and perform enhancement processing on the characteristic domain signal according to the gain to obtain a characteristic domain enhanced signal;
  • an operation unit configured to input the feature domain enhanced signal as input data into the operation model and perform the operation to obtain the output result of the initial speech signal.
  • a third aspect provides a terminal comprising a processor, a memory, a communication interface, and one or more programs, the one or more programs being stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps in the method of the first aspect.
  • a fourth aspect provides a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method of the first aspect.
  • a fifth aspect provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute some or all of the steps described in the first aspect of the embodiments of the present application.
  • the computer program product may be a software installation package.
  • a sixth aspect provides a chip system, which includes at least one processor, a memory, and an interface circuit, where the memory, the transceiver, and the at least one processor are interconnected through lines, and a computer program is stored in the at least one memory; when executed by the processor, the computer program implements the method of the first aspect.
  • the technical solution provided by the present application performs an initial operation on an initial speech signal to obtain a feature domain signal; performs gain processing on the feature domain signal based on an AI model to obtain a feature domain enhanced signal; and inputs the feature domain enhanced signal as input data into an operation model, performing the operation to obtain the output result of the initial speech signal.
  • the output of the AI model is the gain and VAD (Voice Activity Detection) information in the feature domain.
  • the feature domain gain can directly enhance the signal in the feature domain, and the VAD information is used as the auxiliary information of KWS/ASR.
  • the enhanced feature domain signal can be used to further compute features and then perform KWS/ASR.
  • This application does not need to restore the enhanced signal to the time domain; instead, the signal is fed to KWS/ASR directly after enhancement in the feature domain.
  • This application needs the speech feature domain information of only one channel, so it can be used both in single-microphone scenarios and for post-processing in multi-microphone arrays. Its hardware requirements are less restrictive and its application scenarios are broader. It therefore improves recognition accuracy and the user experience.
  • FIG. 1 is a system architecture diagram of an example communication system provided by the present application.
  • FIG. 2 is a schematic flowchart of a feature domain speech enhancement method combined with an AI model provided by the present application;
  • FIG. 3 is a schematic flowchart of the feature domain speech enhancement method combined with an AI model provided in Embodiment 1 of the present application;
  • FIG. 4 is a schematic flowchart of the training stage of the AI model provided by the present application;
  • FIG. 5 is a schematic flowchart of the inference stage of the AI model provided by the present application;
  • FIG. 6 is a schematic structural diagram of a feature domain speech enhancement device combined with an AI model provided by the present application;
  • FIG. 7 is a schematic structural diagram of a terminal provided by the present application.
  • the term "connection" in the embodiments of the present application refers to various connection modes such as direct connection or indirect connection, so as to realize communication between devices, which is not limited in the embodiments of the present application.
  • the terminal 100 may include: a processor, a microphone, a memory, and a communication unit.
  • the communication unit may be optionally configured depending on the type of the terminal.
  • the communication unit may be a short-range communication module, such as a Bluetooth module or a Wi-Fi module, and the above-mentioned processor, microphone, memory, and communication unit may be connected through a bus.
  • the terminal 100 may be a portable electronic device that also includes other functions such as a personal digital assistant and/or a music player, such as a mobile phone, a tablet computer, a smart speaker, a Bluetooth headset, a vehicle-mounted terminal, or a wearable electronic device with wireless communication capability (such as a smart watch).
  • exemplary portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, Microsoft, or other operating systems.
  • the above-mentioned portable electronic device may also be other portable electronic devices, such as a laptop computer (Laptop) or the like. It should also be understood that, in some other embodiments, the above-mentioned terminal may not be a portable electronic device, but a desktop computer.
  • the voice enhancement technology used by the terminal as shown in FIG. 1 may include single-channel voice enhancement and multi-channel voice enhancement, wherein the multi-channel voice enhancement uses the microphone array technology.
  • Single-channel speech enhancement technology has a wide range of applications: it can be used in single-microphone scenarios, such as low-end mobile phones (feature phones), smart watches, and devices with tight restrictions on power consumption, size, or cost. It can also be used in the post-processing stage of multi-microphone scenarios. Multiple microphones can exploit multi-channel spatial information, as well as coherence information, to enhance speech. However, single-channel speech enhancement techniques are still needed to suppress incoherent noise.
  • single-channel speech enhancement technology is based on two assumptions: first, that the noise signal is less non-stationary than the speech signal; second, that the amplitudes of both the noise signal and the speech signal follow a Gaussian distribution.
  • the traditional single-channel speech enhancement method is divided into two steps: noise power spectrum estimation and speech enhancement gain calculation.
  • the noise power spectrum estimation estimates the noise that may be contained in the current noisy speech signal, and updates the noise power spectrum.
  • the gain calculation part estimates the a priori signal-to-noise ratio from the noise power spectrum and calculates the gain.
  • the input noisy speech signal is multiplied by the calculated gain to obtain the enhanced speech signal.
  • this speech enhancement processing is built on statistical analyses of the speech signal and the noise signal, which are mainly used to estimate the probability of speech presence. When statistics that do not meet these assumptions are encountered, such as some non-stationary noise, the speech enhancement performance degrades.
  • Fig. 2 provides a feature domain speech enhancement method combined with an AI model. The method, as shown in Fig. 2, can be executed by the terminal shown in Fig. 1 and includes the following steps:
  • Step S200: performing an initial operation on the initial speech signal to obtain a feature domain signal;
  • in an optional solution, the above-mentioned initial operation includes: framing, windowing and FFT, and a feature domain transformation.
  • Step S201: determining the gain of the feature domain signal based on the AI model;
  • in an optional solution, the implementation of the above step S201 may specifically include:
  • signal-to-noise ratio estimation is performed on the feature domain signal based on the AI model to obtain the signal-to-noise ratio of the feature domain signal, and the feature domain gain is calculated from the signal-to-noise ratio.
  • in another optional solution, the implementation of the foregoing step S201 may specifically include:
  • the feature domain gain is obtained by performing gain estimation on the feature domain signal based on the AI model.
  • Step S202: performing enhancement processing on the feature domain signal according to the gain to obtain the feature domain enhanced signal;
  • the implementation of the above step S202 may specifically include: multiplying the feature domain signal by the gain to obtain the feature domain enhanced signal.
  • Step S203: inputting the feature domain enhanced signal as input data into the operation model, and performing the operation to obtain the output result of the initial speech signal.
  • the above-mentioned operation model includes: a KWS (keyword spotting) model or an ASR (automatic speech recognition) model.
  • the technical solution provided by the present application performs an initial operation on an initial speech signal to obtain a feature domain signal; performs gain processing on the feature domain signal based on an AI model to obtain a feature domain enhanced signal; inputs the feature domain enhanced signal as input data into an operation model, and performs an operation to obtain the output result of the initial speech signal.
  • the output of the AI model is the gain and VAD (Voice Activity Detection) information in the feature domain.
  • the feature domain gain can directly enhance the signal in the feature domain, and the VAD information is used as the auxiliary information of KWS/ASR.
  • the enhanced feature domain signal can be used to further compute features and then perform KWS/ASR.
  • This application does not need to restore the enhanced signal to the time domain; instead, the signal is fed to KWS/ASR directly after enhancement in the feature domain.
  • This application needs the speech feature domain information of only one channel, so it can be used both in single-microphone scenarios and for post-processing in multi-microphone arrays. Its hardware requirements are less restrictive and its application scenarios are broader. It therefore improves recognition accuracy and the user experience.
  • before performing the operation to obtain the output result of the initial speech signal, the above method may further include: performing voice activity detection (VAD) estimation on the feature domain signal based on the AI model; if it is determined that the feature domain signal contains voice activity, performing the operation to obtain the output result of the initial speech signal;
  • if it is determined that the feature domain signal contains no voice activity, the input data is discarded.
  • This technical solution reduces the amount of data processing: the KWS/ASR operation is performed only when there is voice activity; when there is no voice activity, the input data is discarded directly and the KWS/ASR operation is skipped, thereby reducing the amount of computation and increasing the speed of speech recognition.
  • Embodiment 1 of the present application provides a feature domain speech enhancement method combined with an AI model. The method can be executed by a terminal; its flow, shown in FIG. 3, can include the following steps:
  • Step S300: subjecting the noisy signal to framing, windowing and FFT processing and a feature domain transformation to obtain a feature domain signal;
  • Step S301: calculating the feature domain gain, and multiplying the feature domain signal by the gain to obtain the feature domain enhanced signal;
  • the above step S301 can be implemented in two ways: in the first, the AI model estimates a feature domain signal-to-noise ratio, and the gain is calculated from the signal-to-noise ratio;
  • in the second, the AI model directly estimates the feature domain gain.
  • Step S302: performing further feature calculation on the feature domain enhanced signal to obtain input data, and inputting the input data into KWS/ASR to compute the speech recognition result.
  • the technical solution provided by the present application performs an initial operation on an initial speech signal to obtain a feature domain signal; performs gain processing on the feature domain signal based on an AI model to obtain a feature domain enhanced signal; inputs the feature domain enhanced signal as input data into an operation model, and performs an operation to obtain the output result of the initial speech signal.
  • the output of the AI model is the gain and VAD (Voice Activity Detection) information in the feature domain.
  • the feature domain gain can directly enhance the signal in the feature domain, and the VAD information is used as the auxiliary information of KWS/ASR.
  • the enhanced feature domain signal can be used to further compute features and then perform KWS/ASR.
  • This application does not need to restore the enhanced signal to the time domain; instead, the signal is fed to KWS/ASR directly after enhancement in the feature domain.
  • This application needs the speech feature domain information of only one channel, so it can be used both in single-microphone scenarios and for post-processing in multi-microphone arrays. Its hardware requirements are less restrictive and its application scenarios are broader. It therefore improves recognition accuracy and the user experience.
  • the method of the AI model provided in the first embodiment of the present application is divided into two stages, namely, a training stage and an inference stage.
  • the flowchart of the training phase is shown in Figure 4.
  • Figure 4 has three rows: the first and second rows produce the training targets, and the third row produces the input features.
  • The input-feature flow is as follows: a segment of clean speech and a segment of pure noise are input; given a random signal-to-noise ratio (SNR), the speech signal gain gs and the noise gain gn can be calculated respectively. Mixing at this ratio yields the noisy signal. The noisy signal is then subjected to framing, windowing, FFT, and feature extraction to form the input features of the neural network.
  • To obtain the target SNR and target gain, the input clean speech and pure noise are multiplied by their respective gains gs and gn and then pass through framing, windowing, FFT, and feature extraction. The target SNR is computed in the feature domain. This SNR cannot be used directly as the neural network target; it must first be mapped to guarantee the convergence of the neural network. The target gain is computed more directly as G = (S/X)^r, where S is the power of the clean speech after multiplication by gs, X is the power of the mixed noisy signal, and r is an exponent, typically 0.5 or 1.
  • the inference stage is shown in Figure 5.
  • one frame of the noisy speech signal is input at a time; after framing, windowing, and FFT, its speech features are extracted and used as the input of the neural network.
  • the output of the network is the predicted feature domain signal-to-noise ratio or gain of the current frame, together with VAD information.
  • the speech gain can be calculated from the signal-to-noise ratio, or the output gain and VAD information can be used directly, to achieve feature domain speech enhancement.
  • a segment of noisy speech signal is input and goes through framing, windowing, and FFT, after which features are extracted. Speech enhancement is performed directly in the feature domain, and the enhanced speech features serve as the input of KWS or ASR.
  • the training targets of the AI model in this application are the gain (or the a priori signal-to-noise ratio) and VAD.
  • the gain and VAD information both lie in the range [0, 1], so convergence during training is not difficult.
  • the a priori signal-to-noise ratio, however, whether expressed as a linear or a logarithmic value, has a distribution that is not conducive to neural network convergence. The signal-to-noise ratio must be mapped into a Gaussian-like distribution for the neural network to reach optimal performance.
  • An optional mapping of the training target is SNR_mapped = 0.5 · (tanh(a · (SNR + b)) + 1),
  • where the variable a is used to control the slope of the tanh() function
  • and the variable b is used to adjust the bias of the tanh() function.
  • the range of the input SNR can be set by adjusting the values of a and b; a typical setting is a = 0.1, b = 6, where b = 6 means that SNR = -6 dB corresponds to SNR_mapped = 0.5. Because the probability of speech presence differs across frequency bins, values of a and b suited to each bin should be obtained from statistics over large amounts of speech and noise data to achieve optimal performance.
  • once the training target has been mapped, its dynamic range is limited to between 0 and 1, and its value distribution conforms to a Gaussian-like distribution.
  • This application can use cross entropy or mean squared error as the loss function; in practical applications, other loss functions can also be used, and this application does not limit the specific form of the loss function.
  • Voice interaction may occur in various scenarios. Different languages have their own pronunciation characteristics, and different scenarios have corresponding environmental signal-to-noise ratios and room sizes. These factors may affect the generalization performance of neural networks.
  • This application uses multilingual clean speech signals as training data, which can enhance the generalization performance in multilingual environments.
  • the present application uses a wide SNR range during training, such as -10dB to 20dB, to calculate the gains of the speech and noise signals in the training data.
  • This application uses multiple real and simulated room impulse responses during training, and the input training data will be randomly convolved with these impulse responses to simulate the effects of different room responses.
  • the user equipment includes corresponding hardware and/or software modules for executing each function.
  • the present application can be implemented in hardware or in the form of a combination of hardware and computer software in conjunction with the algorithm steps of each example described in conjunction with the embodiments disclosed herein. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functionality for each particular application in conjunction with the embodiments, but such implementations should not be considered beyond the scope of this application.
  • the electronic device can be divided into functional modules according to the above method examples.
  • each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware. It should be noted that, the division of modules in this embodiment is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
  • FIG. 6 shows a schematic diagram of a feature domain speech enhancement device combined with an AI model.
  • the feature domain speech enhancement device 600 combined with the AI model may include: an operation unit 601 and a processing unit 602.
  • the processing unit 602 may be used to support the user equipment in performing the above-mentioned step S201, etc., and/or in other processes of the techniques described herein.
  • the operation unit 601 may be used to support the user equipment in performing the above-mentioned steps S202 and S203, etc., and/or in other processes of the techniques described herein.
  • the electronic device provided in this embodiment is used to execute the above-mentioned method shown in FIG. 2 , and thus can achieve the same effect as the above-mentioned implementation method.
  • the user equipment may include a processing module, a storage module and a communication module.
  • the processing module may be used to control and manage the actions of the user equipment, for example, may be used to support the electronic equipment to perform the steps performed by the above computing unit 601 and the processing unit 602 .
  • the storage module may be used to support the electronic device to execute stored program codes and data, and the like.
  • the communication module can be used to support the communication between the electronic device and other devices.
  • the processing module may be a processor or a controller. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
  • the processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of digital signal processing (DSP) and a microprocessor, and the like.
  • the storage module may be a memory.
  • the communication module may specifically be a device that interacts with other electronic devices, such as a radio frequency circuit, a Bluetooth chip, and a Wi-Fi chip.
  • the interface connection relationship between the modules illustrated in the embodiments of the present application is only a schematic illustration, and does not constitute a structural limitation of the user equipment.
  • the user equipment may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • FIG. 7 is a terminal 70 provided by an embodiment of the present application.
  • the terminal 70 includes a processor 701, a memory 702, and a communication interface 703, where the processor 701, the memory 702, and the communication interface 703 are connected to each other through a bus 704.
  • the memory 702 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and the memory 702 is used to store related computer programs and data.
  • the communication interface 703 is used to receive and transmit data.
  • the processor 701 may be one or more central processing units (central processing units, CPUs). In the case where the processor 701 is a CPU, the CPU may be a single-core CPU or a multi-core CPU.
  • the processor 701 may include one or more processing units, for example, the processing unit may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor ( image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural-network processing unit (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent components, or may be integrated in one or more processors.
  • the user equipment may also include one or more processing units.
  • the controller can generate an operation control signal according to the instruction operation code and the timing signal, and complete the control of fetching and executing instructions.
  • memory may also be provided in the processing unit for storing instructions and data.
  • the memory in the processing unit may be a cache memory. This memory can hold instructions or data that the processing unit has just used or uses cyclically. If the processing unit needs the instruction or data again, it can be fetched directly from this memory. This avoids repeated accesses and reduces the waiting time of the processing unit, thereby improving the efficiency of the user equipment in processing data or executing instructions.
  • the processor 701 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a SIM card interface, and/or a USB interface, etc.
  • the USB interface is an interface that conforms to the USB standard specification, and can specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
  • the USB interface can be used to connect a charger to charge the user equipment, and can also be used to transfer data between the user equipment and peripheral devices.
  • the USB port can also be used to connect headphones and play audio through the headphones.
  • the processor 701 in the terminal 70 is configured to read the computer program code stored in the memory 702, and perform the following operations:
  • performing an initial operation on an initial speech signal to obtain a feature domain signal; performing gain processing on the feature domain signal based on an AI model to obtain a feature domain enhanced signal;
  • inputting the feature domain enhanced signal as input data into the operation model, and performing the operation to obtain the output result of the initial speech signal.
  • An embodiment of the present application further provides a chip system, which includes at least one processor, a memory, and an interface circuit, where the memory, the transceiver, and the at least one processor are interconnected through lines, and a computer program is stored in the at least one memory; when the computer program is executed by the processor, the method flows shown in FIG. 2 and FIG. 3 are implemented.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program runs on a network device, the method flows shown in FIG. 2 and FIG. 3 are implemented.
  • the embodiment of the present application further provides a computer program product, when the computer program product runs on the terminal, the method flow shown in FIG. 2 and FIG. 3 is realized.
  • Embodiments of the present application further provide a terminal, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for executing the steps in the methods of the embodiments shown in FIG. 2 and FIG. 3.
  • the electronic device includes corresponding hardware structures and/or software modules for executing each function.
  • the present application can be implemented in hardware or a combination of hardware and computer software with the units and algorithm steps of each example described in conjunction with the embodiments provided herein. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
  • the electronic device may be divided into functional units according to the foregoing method examples.
  • each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units. It should be noted that the division of units in the embodiments of the present application is illustrative, and is only a logical function division, and other division methods may be used in actual implementation.
  • the disclosed apparatus may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of the above-mentioned units is only a logical function division; in actual implementation, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the above-mentioned integrated units if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable memory.
  • in essence, the technical solution of the present application, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions to cause
  • a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Telephone Function (AREA)

Abstract

A feature-domain speech enhancement method combined with an AI model and related products. The method comprises: step S200, performing an initial operation on an initial speech signal to obtain a feature domain signal; step S201, determining a gain of the feature domain signal based on an AI model; step S202, enhancing the feature domain signal according to the gain to obtain a feature domain enhanced signal; and step S203, inputting the feature domain enhanced signal as input data into an operation model and performing the operation to obtain an output result of the initial speech signal. The feature-domain speech enhancement method combined with an AI model and the related products have the advantage of a high degree of user experience.

Description

Feature-domain speech enhancement method combined with an AI model and related products
Technical Field
The present application relates to the technical field of communication processing, and in particular to a feature-domain speech enhancement method combined with an AI model and related products.
Background
Speech enhancement helps an interactive terminal better understand the user's intent and improves the user experience. Speech enhancement has been researched for decades and is widely used in communication, security, home and other scenarios. Traditional speech enhancement technologies include single-channel speech enhancement and multi-channel speech enhancement, where multi-channel speech enhancement uses microphone array technology. Single-channel speech enhancement has a very wide range of application scenarios. On the one hand, single-channel speech enhancement is low-cost and more flexible and convenient to use. On the other hand, single-channel speech enhancement cannot exploit spatial information such as the angle of arrival, so complex scenes, especially non-stationary noise scenes, are very difficult to handle.
When a user uses the voice interaction function of a terminal in a noisy environment, the ambient noise degrades the terminal's voice interaction performance. Specifically, both the keyword wake-up (voice trigger detection) function and the automatic speech recognition function exhibit an increased misrecognition rate and a decreased recognition rate, causing interaction difficulties.
Summary
The embodiments of the present application disclose a feature-domain speech enhancement method combined with an AI model and related products, which improve recognition accuracy, reduce interaction difficulty, and improve the user experience through feature-domain speech enhancement.
In a first aspect, a feature-domain speech enhancement method combined with an AI model is provided, the method comprising the following steps:
performing an initial operation on an initial speech signal to obtain a feature domain signal;
determining a gain of the feature domain signal based on an AI model, and enhancing the feature domain signal according to the gain to obtain a feature domain enhanced signal;
inputting the feature domain enhanced signal as input data into an operation model, and performing the operation to obtain an output result of the initial speech signal.
In a second aspect, a feature-domain speech enhancement device combined with an AI model is provided, the device comprising:
a processing unit configured to perform an initial operation on an initial speech signal to obtain a feature domain signal, determine a gain of the feature domain signal based on an AI model, and enhance the feature domain signal according to the gain to obtain a feature domain enhanced signal;
an operation unit configured to input the feature domain enhanced signal as input data into an operation model and perform the operation to obtain an output result of the initial speech signal.
In a third aspect, a terminal is provided, comprising a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps in the method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, storing a computer program for electronic data exchange, where the computer program causes a computer to perform the method of the first aspect.
In a fifth aspect, a computer program product is provided, where the computer program product comprises a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
In a sixth aspect, a chip system is provided, comprising at least one processor, a memory, and an interface circuit, where the memory, the transceiver, and the at least one processor are interconnected through lines, and a computer program is stored in the at least one memory; when executed by the processor, the computer program implements the method of the first aspect.
By implementing the embodiments of the present application, the technical solution provided herein performs an initial operation on an initial speech signal to obtain a feature domain signal; performs gain processing on the feature domain signal based on an AI model to obtain a feature domain enhanced signal; and inputs the feature domain enhanced signal as input data into an operation model, performing the operation to obtain an output result of the initial speech signal. The output of the AI model is the feature domain gain and VAD (voice activity detection) information. The feature domain gain can directly enhance the signal in the feature domain, while the VAD information serves as auxiliary information for KWS/ASR. The enhanced feature domain signal can be used to further compute features for KWS/ASR. The present application does not need to restore the enhanced signal to the time domain; instead, the signal is fed to KWS/ASR directly after enhancement in the feature domain. The present application needs the speech feature domain information of only one channel, so it can be used both in single-microphone scenarios and for post-processing in multi-microphone arrays. Its hardware requirements are less restrictive and its application scenarios are broader. It therefore improves recognition accuracy and the user experience.
Brief Description of the Drawings
The drawings used in the embodiments of the present application are introduced below.
FIG. 1 is a system architecture diagram of an example communication system provided by the present application;
FIG. 2 is a schematic flowchart of a feature-domain speech enhancement method combined with an AI model provided by the present application;
FIG. 3 is a schematic flowchart of the feature-domain speech enhancement method combined with an AI model provided in Embodiment 1 of the present application;
FIG. 4 is a schematic flowchart of the training stage of the AI model provided by the present application;
FIG. 5 is a schematic flowchart of the inference stage of the AI model provided by the present application;
FIG. 6 is a schematic structural diagram of a feature-domain speech enhancement device combined with an AI model provided by the present application;
FIG. 7 is a schematic structural diagram of a terminal provided by the present application.
Detailed Description
The embodiments of the present application are described below with reference to the drawings.
The term "and/or" in this application merely describes an association between associated objects, indicating that three relationships are possible; for example, A and/or B can mean: A alone, both A and B, or B alone. In addition, the character "/" herein indicates that the associated objects before and after it are in an "or" relationship.
"Multiple" in the embodiments of the present application means two or more. Descriptions such as "first" and "second" in the embodiments of the present application are only for illustration and for distinguishing the described objects; they imply no order, do not specially limit the number of devices, and cannot constitute any limitation on the embodiments. "Connection" in the embodiments of the present application refers to various connection modes such as direct connection or indirect connection so as to realize communication between devices, which is not limited in the embodiments.
The technical solutions of the embodiments of the present application can be applied to the terminal shown in FIG. 1. As shown in FIG. 1, the terminal 100 may include a processor, a microphone, a memory, and a communication unit. The communication unit may be optionally configured depending on the type of the terminal; it may be a short-range communication module, such as a Bluetooth module or a Wi-Fi module. The processor, microphone, memory, and communication unit may be connected through a bus.
The terminal 100 may be a portable electronic device that also includes other functions such as a personal digital assistant and/or a music player, such as a mobile phone, a tablet computer, a smart speaker, a Bluetooth headset, a vehicle-mounted terminal, or a wearable electronic device with wireless communication capability (such as a smart watch). Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, Microsoft, or other operating systems. The above-mentioned portable electronic device may also be another portable electronic device, such as a laptop computer (Laptop). It should also be understood that in some other embodiments, the above-mentioned terminal may not be a portable electronic device but a desktop computer.
The speech enhancement technology used by the terminal shown in FIG. 1 may include single-channel speech enhancement and multi-channel speech enhancement, where multi-channel speech enhancement uses microphone array technology.
Single-channel speech enhancement technology has a wide range of applications. It can be used in single-microphone scenarios, for example on low-end mobile phones (feature phones), smart watches, and devices with tight constraints on power consumption, size, or cost. It can also be used in the post-processing stage of multi-microphone scenarios: multiple microphones can exploit multi-channel spatial information and coherence information to enhance speech, but single-channel speech enhancement is still needed to suppress incoherent noise.
Single-channel speech enhancement technology is based on two assumptions: first, that the noise signal is less non-stationary than the speech signal; second, that the amplitudes of both the noise signal and the speech signal follow a Gaussian distribution. Based on these assumptions, the traditional single-channel speech enhancement method is divided into two steps: noise power spectrum estimation and speech enhancement gain calculation. Noise power spectrum estimation estimates the noise that may be contained in the current noisy speech signal and updates the noise power spectrum. The gain calculation part estimates the a priori signal-to-noise ratio from the noise power spectrum and calculates the gain. The input noisy speech signal is multiplied by the calculated gain to obtain the enhanced speech signal. This processing is built on statistical analyses of the speech and noise signals, which are mainly used to estimate the probability of speech presence. Once statistics that do not meet these assumptions are encountered, such as some non-stationary noise, the speech enhancement performance degrades.
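As a rough illustration of the traditional pipeline just described (and not of the AI-based method of this application), the following minimal Python sketch applies a decision-directed a priori SNR estimate and a Wiener-style gain to one frame. The function name, the smoothing constant alpha, and the gain floor are illustrative assumptions rather than values taken from this publication.

```python
import numpy as np

def enhance_frame(noisy_mag, noise_psd, prev_gain, prev_mag,
                  alpha=0.98, gain_floor=0.1):
    """One frame of a classical single-channel enhancer (illustrative sketch).

    noisy_mag: magnitude spectrum of the current noisy frame.
    noise_psd: running noise power-spectrum estimate, updated elsewhere.
    prev_gain, prev_mag: previous frame's gain and magnitude, used by the
    decision-directed a priori SNR estimate.
    """
    noise_psd = np.maximum(noise_psd, 1e-12)
    snr_post = noisy_mag ** 2 / noise_psd                  # a posteriori SNR
    snr_prior = (alpha * (prev_gain * prev_mag) ** 2 / noise_psd
                 + (1.0 - alpha) * np.maximum(snr_post - 1.0, 0.0))
    gain = np.maximum(snr_prior / (1.0 + snr_prior), gain_floor)  # Wiener-style gain
    return gain * noisy_mag, gain
```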
Referring to FIG. 2, a feature-domain speech enhancement method combined with an AI model is provided. The method, as shown in FIG. 2, can be executed by the terminal shown in FIG. 1 and includes the following steps:
Step S200: performing an initial operation on the initial speech signal to obtain a feature domain signal.
In an optional solution, the above initial operation includes: framing, windowing and FFT, and a feature domain transformation.
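The "initial operation" of step S200 can be pictured with the following minimal Python sketch. The frame length, hop size, Hann window, and the optional mel filterbank mel_fb are illustrative assumptions standing in for the framing, windowing, FFT, and feature-domain transformation named above.

```python
import numpy as np

def feature_domain(signal, frame_len=512, hop=256, mel_fb=None):
    """Framing, windowing and FFT followed by a feature-domain transform (sketch).

    mel_fb is an optional (n_features, frame_len // 2 + 1) filterbank; without
    it the FFT magnitude spectrum itself serves as the feature-domain signal.
    """
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window  # framing + windowing
        spec = np.abs(np.fft.rfft(frame))                 # FFT magnitude
        frames.append(spec if mel_fb is None else mel_fb @ spec)
    return np.stack(frames)                               # (n_frames, n_features)
```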
Step S201: determining the gain of the feature domain signal based on the AI model.
In an optional solution, the implementation of step S201 may specifically include:
performing signal-to-noise ratio estimation on the feature domain signal based on the AI model to obtain the signal-to-noise ratio of the feature domain signal, and calculating the feature domain gain from the signal-to-noise ratio.
In another optional solution, the implementation of step S201 may specifically include:
performing gain estimation on the feature domain signal based on the AI model to obtain the feature domain gain.
Step S202: enhancing the feature domain signal according to the gain to obtain the feature domain enhanced signal.
The implementation of step S202 may specifically include: multiplying the feature domain signal by the gain to obtain the feature domain enhanced signal.
Step S203: inputting the feature domain enhanced signal as input data into the operation model, and performing the operation to obtain the output result of the initial speech signal.
In an optional solution, the operation model includes: a KWS (keyword spotting) model or an ASR (automatic speech recognition) model.
The technical solution provided by the present application performs an initial operation on an initial speech signal to obtain a feature domain signal; performs gain processing on the feature domain signal based on an AI model to obtain a feature domain enhanced signal; and inputs the feature domain enhanced signal as input data into an operation model, performing the operation to obtain the output result of the initial speech signal. The output of the AI model is the feature domain gain and VAD (voice activity detection) information. The feature domain gain can directly enhance the signal in the feature domain, while the VAD information serves as auxiliary information for KWS/ASR. The enhanced feature domain signal can be used to further compute features for KWS/ASR. The application does not need to restore the enhanced signal to the time domain; the signal is fed to KWS/ASR directly after enhancement in the feature domain. Only the speech feature domain information of one channel is needed, so the method can be used both in single-microphone scenarios and for post-processing in multi-microphone arrays; its hardware requirements are less restrictive and its application scenarios are broader. It therefore improves recognition accuracy and the user experience.
In an optional solution, before performing the operation to obtain the output result of the initial speech signal, the method may further include:
performing voice activity detection (VAD) estimation on the feature domain signal based on the AI model; if it is determined that the feature domain signal contains voice activity, performing the operation to obtain the output result of the initial speech signal;
if it is determined that the feature domain signal contains no voice activity, discarding the input data.
This technical solution reduces the amount of data processing: the KWS/ASR operation is performed only when there is voice activity; when there is no voice activity, the input data is discarded directly and the KWS/ASR operation is skipped, reducing the amount of computation and increasing the speed of speech recognition.
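The gating just described might look as follows in a minimal Python sketch; ai_model, kws_model, and the 0.5 threshold are placeholder assumptions, not interfaces defined by this publication.

```python
def process_frame(features, ai_model, kws_model, vad_threshold=0.5):
    """Run the KWS/ASR operation model only when voice activity is detected.

    ai_model returns (gain, vad_probability) for one frame of feature-domain
    data; kws_model stands for the downstream KWS/ASR operation model.
    """
    gain, vad = ai_model(features)
    if vad < vad_threshold:
        return None                    # no voice activity: discard the input data
    return kws_model(gain * features)  # enhance in the feature domain, then recognize
```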
Embodiment 1
Embodiment 1 of the present application provides a feature-domain speech enhancement method combined with an AI model. The method can be executed by a terminal; its flow, shown in FIG. 3, can include the following steps:
Step S300: subjecting the noisy signal to framing, windowing and FFT processing and a feature domain transformation to obtain a feature domain signal;
Step S301: calculating the feature domain gain, and multiplying the feature domain signal by the gain to obtain the feature domain enhanced signal;
The above step S301 can be implemented in two ways.
In the first, the AI model estimates a feature domain signal-to-noise ratio, and the gain is calculated from the signal-to-noise ratio. In the second, the AI model directly estimates the feature domain gain.
Step S302: performing further feature calculation on the feature domain enhanced signal to obtain input data, and inputting the input data into KWS/ASR to compute the speech recognition result.
The technical solution provided by the present application performs an initial operation on an initial speech signal to obtain a feature domain signal; performs gain processing on the feature domain signal based on an AI model to obtain a feature domain enhanced signal; and inputs the feature domain enhanced signal as input data into an operation model, performing the operation to obtain the output result of the initial speech signal. The output of the AI model is the feature domain gain and VAD (voice activity detection) information. The feature domain gain can directly enhance the signal in the feature domain, while the VAD information serves as auxiliary information for KWS/ASR. The enhanced feature domain signal can be used to further compute features for KWS/ASR. The application does not need to restore the enhanced signal to the time domain; the signal is fed to KWS/ASR directly after enhancement in the feature domain. Only the speech feature domain information of one channel is needed, so the method can be used both in single-microphone scenarios and for post-processing in multi-microphone arrays; its hardware requirements are less restrictive and its application scenarios are broader. It therefore improves recognition accuracy and the user experience.
The AI model method provided in Embodiment 1 of the present application is divided into two stages: a training stage and an inference stage. The flowchart of the training stage is shown in FIG. 4.
Referring to FIG. 4, the figure has three rows: the first and second rows produce the training targets, and the third row produces the input features.
The input-feature flow is as follows: a segment of clean speech and a segment of pure noise are input; given a random signal-to-noise ratio (SNR), the speech signal gain gs and the noise gain gn can be calculated respectively. Mixing at this ratio yields the noisy signal. The noisy signal is then subjected to framing, windowing, FFT, and feature extraction to form the input features of the neural network.
The flow for obtaining the target SNR and target gain is as follows: the input clean speech and pure noise are multiplied by their respective gains gs and gn and then pass through framing, windowing, FFT, and feature extraction. The target SNR is computed in the feature domain. This SNR cannot be used directly as the neural network target; it must first be mapped to guarantee the convergence of the neural network. The target gain is computed more directly as G = (S/X)^r, where S is the power of the clean speech after multiplication by the gain gs, X is the power of the mixed noisy signal, and r is an exponent, typically 0.5 or 1.
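A minimal Python sketch of these training-target computations is given below; it derives the mixing gains for a chosen SNR and the target gain G = (S/X)^r. For brevity the powers are taken over whole signals here, whereas the description computes the targets per feature-domain bin after framing, FFT, and feature extraction.

```python
import numpy as np

def mixing_gains(clean, noise, snr_db, eps=1e-12):
    """Gains gs, gn that realize a randomly chosen SNR when mixing (sketch)."""
    ps, pn = np.mean(clean ** 2), np.mean(noise ** 2)
    gs = np.sqrt(pn * 10.0 ** (snr_db / 10.0) / max(ps, eps))  # scale the speech
    gn = 1.0                                                   # keep the noise as-is
    return gs, gn

def target_gain(scaled_clean_power, mixture_power, r=0.5, eps=1e-12):
    """Target gain G = (S/X)**r, with r typically 0.5 or 1."""
    return (scaled_clean_power / np.maximum(mixture_power, eps)) ** r
```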
The inference stage is shown in FIG. 5. One frame of the noisy speech signal is input at a time; after framing, windowing, and FFT, its speech features are extracted and used as the input of the neural network. The network output is the predicted feature-domain signal-to-noise ratio or gain of the current frame, together with VAD information. The speech gain can be calculated from the signal-to-noise ratio, or the output gain and VAD information can be used directly, to achieve feature-domain speech enhancement. A segment of noisy speech is input and goes through framing, windowing, and FFT, after which features are extracted; speech enhancement is performed directly in the feature domain, and the enhanced speech features serve as the input of KWS or ASR.
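The inference flow can be sketched as follows in Python, reusing the feature_domain() sketch above; ai_model is a placeholder for the trained network. No time-domain synthesis is performed, matching the description.

```python
def enhance_features(noisy, ai_model, mel_fb=None):
    """Inference sketch: enhance directly in the feature domain, frame by frame.

    ai_model returns a predicted gain (or mapped SNR) plus VAD per frame; the
    enhanced features go straight to KWS/ASR, with VAD kept as side information.
    """
    feats = feature_domain(noisy, mel_fb=mel_fb)  # framing, windowing, FFT, features
    enhanced, vads = [], []
    for frame in feats:
        gain, vad = ai_model(frame)               # per-frame network outputs
        enhanced.append(gain * frame)             # feature-domain enhancement
        vads.append(vad)                          # auxiliary information for KWS/ASR
    return enhanced, vads
```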
The training targets of the AI model in this application are the gain (or the a priori signal-to-noise ratio) and VAD. The gain and VAD information both lie in the range [0, 1], so convergence during training is not difficult. The a priori signal-to-noise ratio, however, whether expressed as a linear or a logarithmic value, has a distribution that is not conducive to neural network convergence; the signal-to-noise ratio must be mapped into a Gaussian-like distribution for the neural network to reach optimal performance. An optional mapping of the training target is as follows.
SNR_mapped = 0.5 · (tanh(a · (SNR + b)) + 1)
The variable a is used to control the slope of the tanh() function, and the variable b is used to adjust the bias of the tanh() function. The range of the input SNR can be set by adjusting the values of a and b. A typical setting is a = 0.1, b = 6; b = 6 means that SNR = -6dB corresponds to SNR_mapped = 0.5. Because the probability of speech presence differs across frequency bins, values of a and b suited to each bin should be obtained from statistics over large amounts of speech and noise data to achieve optimal performance.
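A direct Python rendering of this mapping, with the typical a = 0.1, b = 6 setting as defaults, is:

```python
import numpy as np

def map_snr(snr_db, a=0.1, b=6.0):
    """Map an SNR in dB into [0, 1]: SNR_mapped = 0.5 * (tanh(a*(SNR+b)) + 1).

    a controls the slope and b the bias; with a = 0.1 and b = 6, an SNR of
    -6 dB maps to 0.5. Per-frequency-bin values of a and b can be fitted
    from speech and noise statistics, as the description suggests.
    """
    return 0.5 * (np.tanh(a * (snr_db + b)) + 1.0)
```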
Loss function
After mapping, the dynamic range of the training target is limited to between 0 and 1, and its value distribution conforms to a Gaussian-like distribution. This application can use cross entropy or mean squared error as the loss function; in practical applications, other loss functions can also be used, and this application does not limit the specific form of the loss function.
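As a sketch of the two loss options named above, assuming a PyTorch training loop (which this publication does not prescribe):

```python
import torch.nn.functional as F

def training_loss(pred, target, use_cross_entropy=True):
    """Loss between predictions and mapped targets, both in [0, 1] (sketch).

    Either cross entropy or mean squared error fits, since the mapped
    targets behave like probabilities; other losses are equally possible.
    """
    if use_cross_entropy:
        return F.binary_cross_entropy(pred, target)  # cross-entropy option
    return F.mse_loss(pred, target)                  # mean-squared-error option
```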
Generalization performance and data augmentation
Voice interaction may occur in various scenarios. Different languages have their own pronunciation characteristics, and different scenarios have corresponding environmental signal-to-noise ratios and room sizes; these factors may all affect the generalization performance of the neural network.
This application uses multilingual clean speech signals as training data, which enhances generalization performance in multilingual environments.
This application uses a wide SNR range during training, such as -10dB to 20dB, to calculate the gains of the speech and noise signals in the training data.
This application uses multiple real and simulated room impulse responses during training; the input training data are randomly convolved with these impulse responses to simulate the effects of different room responses.
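A minimal Python sketch of this augmentation, assuming the noise clip is at least as long as the speech and using SciPy's FFT convolution for the room impulse response:

```python
import numpy as np
from scipy.signal import fftconvolve

def augment(clean, noise, rirs, rng, snr_lo=-10.0, snr_hi=20.0):
    """Random-SNR mixing plus random room-impulse-response convolution (sketch).

    rirs is a list of real or simulated impulse responses; the SNR is drawn
    from a wide range, e.g. -10 dB to 20 dB.
    """
    rir = rirs[rng.integers(len(rirs))]
    speech = fftconvolve(clean, rir)[:len(clean)]   # simulate a room response
    snr_db = rng.uniform(snr_lo, snr_hi)            # random mixing SNR
    ps, pn = np.mean(speech ** 2), np.mean(noise ** 2)
    gn = np.sqrt(ps / max(pn * 10.0 ** (snr_db / 10.0), 1e-12))
    return speech + gn * noise[:len(speech)]        # noisy training signal
```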
It can be understood that, in order to realize the above functions, the user equipment includes corresponding hardware and/or software modules for executing each function. In combination with the algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application in combination with the embodiments, but such implementations should not be considered beyond the scope of this application.
In this embodiment, the electronic device can be divided into functional modules according to the above method examples; for example, each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module. The integrated module can be implemented in the form of hardware. It should be noted that the division of modules in this embodiment is schematic and is only a logical function division; there may be other division manners in actual implementation.
In the case where functional modules are divided corresponding to functions, FIG. 6 shows a schematic diagram of a feature-domain speech enhancement device combined with an AI model. As shown in FIG. 6, the device 600 may include an operation unit 601 and a processing unit 602.
The processing unit 602 may be used to support the user equipment in performing the above step S201, etc., and/or in other processes of the techniques described herein.
The operation unit 601 may be used to support the user equipment in performing the above steps S202 and S203, etc., and/or in other processes of the techniques described herein.
It should be noted that all relevant content of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, and details are not repeated here.
The electronic device provided in this embodiment is used to execute the method shown in FIG. 2 and can therefore achieve the same effect as the above implementation.
Where an integrated unit is used, the user equipment may include a processing module, a storage module, and a communication module. The processing module may be used to control and manage the actions of the user equipment, for example, to support the electronic device in performing the steps performed by the operation unit 601 and the processing unit 602. The storage module may be used to support the electronic device in storing program code, data, and the like. The communication module may be used to support communication between the electronic device and other devices.
The processing module may be a processor or a controller, which may implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the present disclosure. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor (DSP) and a microprocessor. The storage module may be a memory. The communication module may specifically be a device that interacts with other electronic devices, such as a radio frequency circuit, a Bluetooth chip, or a Wi-Fi chip.
It can be understood that the interface connection relationships between the modules illustrated in the embodiments of the present application are only schematic and do not constitute a structural limitation of the user equipment. In other embodiments of the present application, the user equipment may also adopt interface connection manners different from those in the above embodiments, or a combination of multiple interface connection manners.
Referring to FIG. 7, FIG. 7 shows a terminal 70 provided by an embodiment of the present application. The terminal 70 includes a processor 701, a memory 702, and a communication interface 703, which are connected to each other through a bus 704.
The memory 702 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM); the memory 702 is used to store related computer programs and data. The communication interface 703 is used to receive and send data.
The processor 701 may be one or more central processing units (CPUs); where the processor 701 is a CPU, the CPU may be a single-core CPU or a multi-core CPU.
The processor 701 may include one or more processing units; for example, the processing unit may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent components or may be integrated into one or more processors. In some embodiments, the user equipment may also include one or more processing units. The controller can generate an operation control signal according to the instruction operation code and a timing signal, completing the control of fetching and executing instructions. In some other embodiments, a memory may also be provided in the processing unit to store instructions and data. For example, the memory in the processing unit may be a cache, which can hold instructions or data that the processing unit has just used or uses cyclically. If the processing unit needs the instruction or data again, it can be fetched directly from this memory; this avoids repeated accesses, reduces the waiting time of the processing unit, and thereby improves the efficiency of the user equipment in processing data or executing instructions.
In some embodiments, the processor 701 may include one or more interfaces, which may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a SIM card interface, and/or a USB interface, etc. The USB interface conforms to the USB standard specification and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface can be used to connect a charger to charge the user equipment, or to transfer data between the user equipment and peripheral devices; it can also be used to connect earphones and play audio through them.
The processor 701 in the terminal 70 is configured to read the computer program code stored in the memory 702 and perform the following operations:
performing an initial operation on an initial speech signal to obtain a feature domain signal;
performing gain processing on the feature domain signal based on an AI model to obtain a feature domain enhanced signal;
inputting the feature domain enhanced signal as input data into the operation model, and performing the operation to obtain the output result of the initial speech signal.
All relevant content of the scenarios involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, and details are not repeated here.
An embodiment of the present application further provides a chip system, which includes at least one processor, a memory, and an interface circuit, where the memory, the transceiver, and the at least one processor are interconnected through lines, and a computer program is stored in the at least one memory; when the computer program is executed by the processor, the method flows shown in FIG. 2 and FIG. 3 are implemented.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program runs on a network device, the method flows shown in FIG. 2 and FIG. 3 are implemented.
An embodiment of the present application further provides a computer program product; when the computer program product runs on a terminal, the method flows shown in FIG. 2 and FIG. 3 are implemented.
An embodiment of the present application further provides a terminal, comprising a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps in the methods of the embodiments shown in FIG. 2 and FIG. 3.
The above description introduces the solutions of the embodiments of the present application mainly from the perspective of the method-side execution process. It can be understood that, in order to realize the above functions, the electronic device includes corresponding hardware structures and/or software modules for executing each function. Those skilled in the art should easily realize that, in combination with the units and algorithm steps of the examples described in the embodiments provided herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
The embodiments of the present application may divide the electronic device into functional units according to the above method examples; for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments of the present application is schematic and is only a logical function division; other division manners may be used in actual implementation.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative; the division of the units is only a logical function division, and there may be other division manners in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the shown or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If implemented in the form of a software functional unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a memory, which includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing related hardware; the program can be stored in a computer-readable memory, and the memory can include a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.

Claims (17)

  1. A feature-domain speech enhancement method combined with an AI model, characterized in that the method comprises the following steps:
    performing an initial operation on an initial speech signal to obtain a feature domain signal;
    determining a gain of the feature domain signal based on an AI model, and enhancing the feature domain signal according to the gain to obtain a feature domain enhanced signal;
    inputting the feature domain enhanced signal as input data into an operation model, and performing the operation to obtain an output result of the initial speech signal.
  2. The method according to claim 1, characterized in that performing gain processing on the feature domain signal based on the AI model to obtain the feature domain enhanced signal specifically comprises:
    performing signal-to-noise ratio estimation on the feature domain signal based on the AI model to obtain the signal-to-noise ratio of the feature domain signal, calculating a feature domain gain from the signal-to-noise ratio, and multiplying the feature domain signal by the gain to obtain the feature domain enhanced signal.
  3. The method according to claim 1, characterized in that performing gain processing on the feature domain signal based on the AI model to obtain the feature domain enhanced signal specifically comprises:
    performing gain estimation on the feature domain signal based on the AI model to obtain a feature domain gain, and multiplying the feature domain signal by the gain to obtain the feature domain enhanced signal.
  4. The method according to any one of claims 1-3, characterized in that, before performing the operation to obtain the output result of the initial speech signal, the method further comprises:
    performing voice activity detection (VAD) estimation on the feature domain signal based on the AI model; if it is determined that the feature domain signal contains voice activity, performing the operation to obtain the output result of the initial speech signal;
    if it is determined that the feature domain signal contains no voice activity, discarding the input data.
  5. The method according to any one of claims 1-4, wherein
    the initial operation comprises: framing, windowing and FFT, and a feature domain transformation.
  6. The method according to any one of claims 1-5, wherein
    the operation model comprises: a keyword spotting (KWS) model or an automatic speech recognition (ASR) model.
  7. A feature-domain speech enhancement device combined with an AI model, characterized in that the device comprises:
    a processing unit configured to perform an initial operation on an initial speech signal to obtain a feature domain signal, determine a gain of the feature domain signal based on an AI model, and enhance the feature domain signal according to the gain to obtain a feature domain enhanced signal;
    an operation unit configured to input the feature domain enhanced signal as input data into an operation model and perform the operation to obtain an output result of the initial speech signal.
  8. The device according to claim 7, characterized in that
    the processing unit is specifically configured to perform signal-to-noise ratio estimation on the feature domain signal based on the AI model to obtain the signal-to-noise ratio of the feature domain signal, calculate a feature domain gain from the signal-to-noise ratio, and multiply the feature domain signal by the gain to obtain the feature domain enhanced signal.
  9. The device according to claim 7, characterized in that
    the processing unit is specifically configured to perform gain estimation on the feature domain signal based on the AI model to obtain a feature domain gain, and multiply the feature domain signal by the gain to obtain the feature domain enhanced signal.
  10. The device according to any one of claims 7-9, characterized in that
    the processing unit is further configured to perform voice activity detection (VAD) estimation on the feature domain signal based on the AI model; if it is determined that the feature domain signal contains voice activity, the operation is performed to obtain the output result of the initial speech signal;
    if it is determined that the feature domain signal contains no voice activity, the input data is discarded.
  11. The device according to any one of claims 7-10, wherein
    the initial operation comprises: framing, windowing and FFT, and a feature domain transformation.
  12. The device according to any one of claims 7-11, wherein
    the operation model comprises: a keyword spotting (KWS) model or an automatic speech recognition (ASR) model.
  13. A terminal, comprising a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps in the method according to any one of claims 1-6.
  14. A chip system, comprising at least one processor, a memory, and an interface circuit, where the memory, the transceiver, and the at least one processor are interconnected through lines, and a computer program is stored in the at least one memory; when executed by the processor, the computer program implements the method according to any one of claims 1-6.
  15. A network device, characterized in that the network device is configured to support a terminal device in performing the method according to any one of claims 1-6.
  16. A computer-readable storage medium storing a computer program which, when run on user equipment, performs the method according to any one of claims 1-6.
  17. A computer program product, characterized in that the computer program product comprises a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to any one of claims 1-6.
PCT/CN2021/120226 2020-09-28 2021-09-24 Feature-domain speech enhancement method combined with an AI model and related products WO2022063215A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011046052.4A CN112349277B (zh) 2020-09-28 2020-09-28 Feature-domain speech enhancement method combined with an AI model and related products
CN202011046052.4 2020-09-28

Publications (1)

Publication Number Publication Date
WO2022063215A1 true WO2022063215A1 (zh) 2022-03-31

Family

ID=74361251

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120226 WO2022063215A1 (zh) 2020-09-28 2021-09-24 结合ai模型的特征域语音增强方法及相关产品

Country Status (2)

Country Link
CN (1) CN112349277B (zh)
WO (1) WO2022063215A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112349277B (zh) * 2020-09-28 2023-07-04 紫光展锐(重庆)科技有限公司 结合ai模型的特征域语音增强方法及相关产品

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110085249A * 2019-05-09 2019-08-02 南京工程学院 Single-channel speech enhancement method using a recurrent neural network with attention gating
CN110335620A * 2019-07-08 2019-10-15 广州欢聊网络科技有限公司 Noise suppression method and apparatus, and mobile terminal
JP2020076907A * 2018-11-09 2020-05-21 沖電気工業株式会社 Signal processing device, signal processing program, and signal processing method
CN111445919A * 2020-03-13 2020-07-24 紫光展锐(重庆)科技有限公司 Speech enhancement method and system combined with an AI model, electronic device, and medium
CN112349277A * 2020-09-28 2021-02-09 紫光展锐(重庆)科技有限公司 Feature-domain speech enhancement method combined with an AI model and related products

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7058572B1 (en) * 2000-01-28 2006-06-06 Nortel Networks Limited Reducing acoustic noise in wireless and landline based telephony
ES2678415T3 * 2008-08-05 2018-08-10 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for processing an audio signal for speech enhancement using feature extraction
CN104867498A * 2014-12-26 2015-08-26 深圳市微纳集成电路与系统应用研究院 Mobile communication terminal and speech enhancement method and module thereof
CN104952448A * 2015-05-04 2015-09-30 张爱英 Feature enhancement method and system using a bidirectional long short-term memory recurrent neural network
DK3252766T3 (da) 2016-05-30 2021-09-06 Oticon As Audio processing device and method for estimating the signal-to-noise ratio of a sound signal
CN106782504B * 2016-12-29 2019-01-22 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN107977183A * 2017-11-16 2018-05-01 百度在线网络技术(北京)有限公司 Voice interaction method, apparatus and device
CN108877775B * 2018-06-04 2023-03-31 平安科技(深圳)有限公司 Speech data processing method and apparatus, computer device, and storage medium
CN108847251B * 2018-07-04 2022-12-02 武汉斗鱼网络科技有限公司 Speech deduplication method and apparatus, server, and storage medium
EP3694229A1 (en) * 2019-02-08 2020-08-12 Oticon A/s A hearing device comprising a noise reduction system
CN109767760A * 2019-02-23 2019-05-17 天津大学 Far-field speech recognition method based on multi-target learning of amplitude and phase information
CN109712628B * 2019-03-15 2020-06-19 哈尔滨理工大学 Speech noise reduction and speech recognition method using a DRNN noise-reduction model built on an RNN
CN110428849B * 2019-07-30 2021-10-08 珠海亿智电子科技有限公司 Speech enhancement method based on a generative adversarial network
CN110867181B * 2019-09-29 2022-05-06 北京工业大学 Multi-target speech enhancement method based on joint SCNN and TCNN estimation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020076907A * 2018-11-09 2020-05-21 沖電気工業株式会社 Signal processing device, signal processing program, and signal processing method
CN110085249A * 2019-05-09 2019-08-02 南京工程学院 Single-channel speech enhancement method using a recurrent neural network with attention gating
CN110335620A * 2019-07-08 2019-10-15 广州欢聊网络科技有限公司 Noise suppression method and apparatus, and mobile terminal
CN111445919A * 2020-03-13 2020-07-24 紫光展锐(重庆)科技有限公司 Speech enhancement method and system combined with an AI model, electronic device, and medium
CN112349277A * 2020-09-28 2021-02-09 紫光展锐(重庆)科技有限公司 Feature-domain speech enhancement method combined with an AI model and related products

Also Published As

Publication number Publication date
CN112349277A (zh) 2021-02-09
CN112349277B (zh) 2023-07-04

Similar Documents

Publication Publication Date Title
US10469967B2 (en) Utilizing digital microphones for low power keyword detection and noise suppression
US11798531B2 (en) Speech recognition method and apparatus, and method and apparatus for training speech recognition model
US10573301B2 (en) Neural network based time-frequency mask estimation and beamforming for speech pre-processing
CN110554357B (zh) 声源定位方法和装置
TWI802602B (zh) 用於語音喚醒(wov)關鍵詞註冊的處理器實現的方法和系統
WO2021179416A1 (zh) 一种基于分离矩阵初始化频点选择的盲源分离方法及系统
CN110400572B (zh) 音频增强方法及系统
US11456007B2 (en) End-to-end multi-task denoising for joint signal distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) optimization
WO2020088153A1 (zh) 语音处理方法、装置、存储介质和电子设备
WO2021093380A1 (zh) 一种噪声处理方法、装置、系统
EP3501026B1 (en) Blind source separation using similarity measure
US11074249B2 (en) Dynamic adaptation of language understanding systems to acoustic environments
CN107240396B (zh) 说话人自适应方法、装置、设备及存储介质
CN114242044B (zh) 语音质量评估方法、语音质量评估模型训练方法及装置
CN112562742B (zh) 语音处理方法和装置
WO2022063215A1 (zh) 结合ai模型的特征域语音增强方法及相关产品
US10629184B2 (en) Cepstral variance normalization for audio feature extraction
WO2020134547A1 (zh) 数据的定点化加速方法、装置、电子设备及存储介质
CN111722696A (zh) 用于低功耗设备的语音数据处理方法和装置
WO2022100578A1 (zh) 5g系统中ofdm变换方法及相关产品
CN112951263A (zh) 语音增强方法、装置、设备和存储介质
US10650839B2 (en) Infinite impulse response acoustic echo cancellation in the frequency domain
CN114220430A (zh) 多音区语音交互方法、装置、设备以及存储介质
CN114664288A (zh) 一种语音识别方法、装置、设备及可存储介质
CN111489740A (zh) 语音处理方法及装置、电梯控制方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21871584

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21871584

Country of ref document: EP

Kind code of ref document: A1