CN118335089A - Speech interaction method based on artificial intelligence - Google Patents

Speech interaction method based on artificial intelligence

Info

Publication number
CN118335089A
Authority
CN
China
Prior art keywords
audio
voice
real
audio set
artificial intelligence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410764506.3A
Other languages
Chinese (zh)
Other versions
CN118335089B (en
Inventor
沈国良
景奕昕
尚晓波
黄爱军
蔡梁元
王磊
余璐璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Pansheng Dingcheng Technology Co ltd
Original Assignee
Wuhan Pansheng Dingcheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Pansheng Dingcheng Technology Co ltd filed Critical Wuhan Pansheng Dingcheng Technology Co ltd
Priority to CN202410764506.3A priority Critical patent/CN118335089B/en
Publication of CN118335089A publication Critical patent/CN118335089A/en
Application granted granted Critical
Publication of CN118335089B publication Critical patent/CN118335089B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a voice interaction method based on artificial intelligence, which belongs to the technical field of data processing and comprises the following steps: S1, collecting real-time voice of a user with a microphone of a mobile terminal, and generating an audio fusion operator corresponding to the real-time voice; S2, processing the real-time voice with the audio fusion operator to generate characteristic voice; S3, converting the characteristic voice into text. In the voice interaction method based on artificial intelligence, the real-time voice input by the user is split into two audio sets, characteristic parameters of the two audio sets are extracted and spliced to generate an audio fusion operator, and the audio fusion operator is used to correct the gain factor used for enhancement processing; this improves the effect of the voice enhancement processing, the accuracy of converting voice into text, and the noise immunity of voice recognition.

Description

Speech interaction method based on artificial intelligence
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a voice interaction method based on artificial intelligence.
Background
With the development of artificial intelligence, speech synthesis technology based on artificial intelligence is being applied more and more widely. Speech-to-text is a technique that converts speech content into editable text. It helps people quickly convert voice recordings into text and improves work and learning efficiency. When using a mobile phone or a tablet computer, many people prefer to input text by voice recognition, but the environment in which the user inputs voice may be noisy, so the converted text content may be inaccurate.
Disclosure of Invention
To solve the above problem, the invention provides a voice interaction method based on artificial intelligence.
The technical scheme of the invention is as follows: an artificial intelligence-based voice interaction method comprises the following steps:
S1, collecting real-time voice of a user with a microphone of a mobile terminal, and generating an audio fusion operator corresponding to the real-time voice;
S2, processing the real-time voice with the audio fusion operator to generate characteristic voice;
S3, converting the characteristic voice into text.
The conversion of the characteristic voice into text can be implemented with an existing neural network or deep learning method.
Further, the step S1 includes the following substeps:
S11, collecting real-time voice of the user with the microphone of the mobile terminal, and windowing the real-time voice of the user;
S12, splitting the windowed real-time voice into a first audio set and a second audio set;
S13, performing wavelet transform on each frame of audio signal in the first audio set and the second audio set to obtain the detail sub-band coefficients of each frame of audio signal;
S14, determining an audio transform coefficient of the first audio set according to the detail sub-band coefficients of each frame of audio signal in the first audio set, and determining an audio transform coefficient of the second audio set according to the detail sub-band coefficients of each frame of audio signal in the second audio set;
S15, determining a first audio fusion element according to the audio transform coefficient of the first audio set and the audio transform coefficient of the second audio set;
S16, determining a second audio fusion element;
S17, generating an audio fusion operator from the first audio fusion element and the second audio fusion element.
The beneficial effects of the above further scheme are as follows: in the invention, the real-time voice of the user contains a large number of voice signals, so the real-time voice is split into two audio sets, an audio transform coefficient is extracted from the voice signals of each audio set, and the first audio fusion element is determined from the two audio transform coefficients; the gammatone filter cepstral coefficients of the two audio sets are used to obtain the second audio fusion element; and the two elements are spliced to generate an audio fusion operator that contains the voice signal characteristics, embodies the audio characteristics, and facilitates the subsequent voice enhancement processing. The gammatone filter cepstral coefficient, which has high noise resistance, is a characteristic parameter based on the human cochlear auditory model; it is mainly used for extracting audio data features and recognizing voice, has good robustness, and can effectively improve recognition accuracy in noisy and unstable environments.
Further, in the step S11, the audio length of the real-time voice is used as the window width of the windowing function, and the windowing function is used to perform windowing processing on the real-time voice of the user;
The expression of the windowing function Z(k) is as follows:
where S represents the window width of the windowing function and k represents the unit window width index of the windowing function.
Further, in S14, the calculation formula of the audio transform coefficient c1 of the first audio set is:
where pm represents the detail sub-band coefficients of the m-th frame of audio signal in the first audio set, pm-1 represents the detail sub-band coefficients of the (m-1)-th frame of audio signal in the first audio set, pm+1 represents the detail sub-band coefficients of the (m+1)-th frame of audio signal in the first audio set, exp(·) represents the exponential function, and M represents the total number of frames of audio signal in the first audio set;
in S14, the calculation formula of the audio transform coefficient c2 of the second audio set is:
where pn represents the detail sub-band coefficients of the n-th frame of audio signal in the second audio set, pn-1 represents the detail sub-band coefficients of the (n-1)-th frame of audio signal in the second audio set, pn+1 represents the detail sub-band coefficients of the (n+1)-th frame of audio signal in the second audio set, and N represents the total number of frames of audio signal in the second audio set.
Further, in S15, the calculation formula of the first audio fusion element x1 is:
where c1 denotes the audio transform coefficient of the first audio set, c2 denotes the audio transform coefficient of the second audio set, M denotes the total number of frames of audio signal in the first audio set, N denotes the total number of frames of audio signal in the second audio set, and ⌈·⌉ denotes the rounding-up operation.
Further, the step S16 includes the following substeps:
S161, extracting the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set;
S162, calculating the second audio fusion element according to the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set.
Further, in S162, the calculation formula of the second audio fusion element x2 is:
where F1 represents the gammatone filter cepstral coefficients of the first audio set and F2 represents the gammatone filter cepstral coefficients of the second audio set.
Further, in S17, the expression of the audio fusion operator R is: R = [x1, x2]; where [ , ] represents the splicing (concatenation) operation, x1 represents the first audio fusion element, and x2 represents the second audio fusion element.
Further, the step S2 includes the following substeps:
S21, extracting an original gain factor of the real-time voice;
S22, correcting the original gain factor by utilizing an audio fusion operator to obtain a target gain factor;
S23, performing enhancement processing on the real-time voice by using the target gain factor to obtain characteristic voice.
The beneficial effects of the above further scheme are as follows: in the invention, because the real-time voice input by the user is subject to strong noise interference, the gain factor needs to be corrected by using the audio fusion operator, and the real-time voice is enhanced with the corrected result, so that the noise interference is suppressed.
Further, in S22, the calculation formula of the target gain factor Y is:
where R represents the audio fusion operator, y represents the original gain factor, and ⌈·⌉ represents the rounding-up operation.
The beneficial effects of the invention are as follows: in the voice interaction method based on artificial intelligence, the real-time voice input by the user is split into two audio sets, characteristic parameters of the two audio sets are extracted and spliced to generate an audio fusion operator, and the audio fusion operator is used to correct the gain factor used for enhancement processing; this improves the effect of the voice enhancement processing, the accuracy of converting voice into text, and the noise immunity of voice recognition.
Drawings
FIG. 1 is a flow chart of an artificial intelligence based speech interaction method.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a voice interaction method based on artificial intelligence, which includes the following steps:
S1, collecting real-time voice of a user with a microphone of a mobile terminal, and generating an audio fusion operator corresponding to the real-time voice;
S2, processing the real-time voice with the audio fusion operator to generate characteristic voice;
S3, converting the characteristic voice into text.
The conversion of the characteristic voice into text can be implemented with an existing neural network or deep learning method.
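As one possible realization of this step (a sketch only, not the method claimed here), the characteristic voice could be handed to any off-the-shelf speech recognition model; the example below uses the open-source Whisper model, and the model size, file name, and use of the openai-whisper package are assumptions rather than requirements of the invention.

```python
# Minimal sketch of step S3: converting the enhanced (characteristic) voice into text.
# The Whisper model, its size, and the file name are assumptions; any existing ASR model could be used instead.
import whisper


def characteristic_voice_to_text(wav_path: str) -> str:
    model = whisper.load_model("base")    # small general-purpose speech recognition model
    result = model.transcribe(wav_path)   # returns a dict whose "text" field holds the transcript
    return result["text"]


if __name__ == "__main__":
    print(characteristic_voice_to_text("characteristic_voice.wav"))  # hypothetical file name
```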
In an embodiment of the present invention, the step S1 includes the following substeps:
S11, collecting real-time voice of the user with the microphone of the mobile terminal, and windowing the real-time voice of the user;
S12, splitting the windowed real-time voice into a first audio set and a second audio set;
S13, performing wavelet transform on each frame of audio signal in the first audio set and the second audio set to obtain the detail sub-band coefficients of each frame of audio signal;
S14, determining an audio transform coefficient of the first audio set according to the detail sub-band coefficients of each frame of audio signal in the first audio set, and determining an audio transform coefficient of the second audio set according to the detail sub-band coefficients of each frame of audio signal in the second audio set;
S15, determining a first audio fusion element according to the audio transform coefficient of the first audio set and the audio transform coefficient of the second audio set;
S16, determining a second audio fusion element;
S17, generating an audio fusion operator from the first audio fusion element and the second audio fusion element.
In the invention, the real-time voice of the user contains a large number of voice signals, so the real-time voice is split into two audio sets, an audio transform coefficient is extracted from the voice signals of each audio set, and the first audio fusion element is determined from the two audio transform coefficients; the gammatone filter cepstral coefficients of the two audio sets are used to obtain the second audio fusion element; and the two elements are spliced to generate an audio fusion operator that contains the voice signal characteristics, embodies the audio characteristics, and facilitates the subsequent voice enhancement processing. The gammatone filter cepstral coefficient, which has high noise resistance, is a characteristic parameter based on the human cochlear auditory model; it is mainly used for extracting audio data features and recognizing voice, has good robustness, and can effectively improve recognition accuracy in noisy and unstable environments.
In the embodiment of the present invention, in S11, the audio length of the real-time voice is used as the window width of the windowing function, and the windowing function is used to perform windowing processing on the real-time voice of the user;
The expression of the windowing function Z(k) is as follows:
where S represents the window width of the windowing function and k represents the unit window width index of the windowing function.
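Because the exact expression of Z(k) is not reproduced in this text, the following sketch only illustrates the shape of S11 and S12: one window whose width S equals the audio length is applied to the whole real-time voice, and the result is split into two audio sets. The Hann window, the file name, and the half-and-half split are assumptions.

```python
# Sketch of S11-S12 (assumptions noted above): apply one window whose width S equals the audio length,
# then split the windowed real-time voice into a first and a second audio set.
import numpy as np
import soundfile as sf
from scipy.signal.windows import hann


def window_realtime_voice(wav_path: str) -> tuple[np.ndarray, int]:
    x, sr = sf.read(wav_path)      # real-time voice captured by the mobile-terminal microphone
    if x.ndim > 1:
        x = x.mean(axis=1)         # mix down to mono
    S = len(x)                     # window width equals the audio length, as stated above
    return x * hann(S), sr         # Hann shape is an assumption; the patent defines its own Z(k)


windowed, sr = window_realtime_voice("realtime_voice.wav")      # hypothetical file name
half = len(windowed) // 2
first_set, second_set = windowed[:half], windowed[half:]        # S12: assumed half-and-half split
```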
In the embodiment of the present invention, in S14, the calculation formula of the audio transform coefficient c1 of the first audio set is:
where pm represents the detail sub-band coefficients of the m-th frame of audio signal in the first audio set, pm-1 represents the detail sub-band coefficients of the (m-1)-th frame of audio signal in the first audio set, pm+1 represents the detail sub-band coefficients of the (m+1)-th frame of audio signal in the first audio set, exp(·) represents the exponential function, and M represents the total number of frames of audio signal in the first audio set;
in S14, the calculation formula of the audio transform coefficient c2 of the second audio set is:
where pn represents the detail sub-band coefficients of the n-th frame of audio signal in the second audio set, pn-1 represents the detail sub-band coefficients of the (n-1)-th frame of audio signal in the second audio set, pn+1 represents the detail sub-band coefficients of the (n+1)-th frame of audio signal in the second audio set, and N represents the total number of frames of audio signal in the second audio set.
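Continuing from the windowing sketch above, the following sketch illustrates S13 and S14: each frame of each audio set is wavelet-transformed to obtain detail sub-band coefficients, which are then combined into one audio transform coefficient per set. The frame length, the db4 wavelet, the per-frame energy summary, the combination in combine_detail_coefficients, and the x1 line at the end (S15) are placeholders, since the patent's c1, c2, and x1 formulas are not reproduced in this text.

```python
# Sketch of S13-S15, continuing from the windowing sketch: per-frame wavelet detail sub-band coefficients,
# then a placeholder audio transform coefficient per set and a placeholder first fusion element x1.
import numpy as np
import pywt  # PyWavelets


def frame_detail_coefficients(audio_set: np.ndarray, frame_len: int = 512) -> np.ndarray:
    n_frames = len(audio_set) // frame_len
    frames = audio_set[: n_frames * frame_len].reshape(n_frames, frame_len)
    # pywt.dwt returns (approximation, detail); one scalar per frame = detail-coefficient energy (assumption).
    return np.array([np.sum(pywt.dwt(frame, "db4")[1] ** 2) for frame in frames])


def combine_detail_coefficients(p: np.ndarray) -> float:
    # Placeholder for c1 / c2: the patent combines p[m-1], p[m], p[m+1] with exp(.); that formula is not
    # reproduced in this text, so a simple neighbour-difference aggregation stands in for it.
    neighbours = np.exp(-np.abs(p[1:-1] - 0.5 * (p[:-2] + p[2:])))
    return float(np.mean(neighbours))


c1 = combine_detail_coefficients(frame_detail_coefficients(first_set))   # first audio set (S13 + S14)
c2 = combine_detail_coefficients(frame_detail_coefficients(second_set))  # second audio set (S13 + S14)
x1 = float(np.ceil((c1 + c2) / 2.0))  # placeholder for the patent's x1 formula (S15), which uses M, N and rounding up
```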
In the embodiment of the present invention, in S15, the calculation formula of the first audio fusion element x1 is:
where c1 denotes the audio transform coefficient of the first audio set, c2 denotes the audio transform coefficient of the second audio set, M denotes the total number of frames of audio signal in the first audio set, N denotes the total number of frames of audio signal in the second audio set, and ⌈·⌉ denotes the rounding-up operation.
In an embodiment of the present invention, the step S16 includes the following substeps:
S161, extracting the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set;
S162, calculating the second audio fusion element according to the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set.
In the embodiment of the present invention, in S162, the calculation formula of the second audio fusion element x2 is:
where F1 represents the gammatone filter cepstral coefficients of the first audio set and F2 represents the gammatone filter cepstral coefficients of the second audio set.
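A common recipe for gammatone filter cepstral coefficients is a gammatone filterbank followed by log band energies and a discrete cosine transform; the sketch below follows that recipe with SciPy's gammatone filter design (SciPy ≥ 1.6) and continues the variables from the earlier sketches. The band count, frequency range, cepstral order, the scalar summaries of F1 and F2, and the final x2 combination are assumptions, since the patent's x2 formula is not reproduced in this text.

```python
# Sketch of S16-S17, continuing from the earlier sketches: gammatone filter cepstral coefficients (GFCC)
# for each audio set, a placeholder second fusion element x2, and the spliced audio fusion operator R.
import numpy as np
from scipy.fft import dct
from scipy.signal import gammatone, lfilter


def gfcc(signal: np.ndarray, sr: int, n_bands: int = 32, n_ceps: int = 13) -> np.ndarray:
    centre_freqs = np.geomspace(50.0, 0.45 * sr, n_bands)         # log-spaced centre frequencies (assumption)
    log_energies = []
    for fc in centre_freqs:
        b, a = gammatone(fc, "iir", fs=sr)                         # 4th-order gammatone filter centred at fc
        band = lfilter(b, a, signal)
        log_energies.append(np.log(np.mean(band ** 2) + 1e-12))    # log energy of the band output
    return dct(np.array(log_energies), type=2, norm="ortho")[:n_ceps]


F1 = float(np.mean(gfcc(first_set, sr)))   # GFCC of the first audio set, summarised as a scalar (assumption)
F2 = float(np.mean(gfcc(second_set, sr)))  # GFCC of the second audio set, summarised as a scalar (assumption)
x2 = 0.5 * (F1 + F2)                       # placeholder for the patent's x2 formula (S162)
R = np.array([x1, x2])                     # S17: audio fusion operator R = [x1, x2]
```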
In the embodiment of the present invention, in S17, the expression of the audio fusion operator R is: R = [x1, x2]; where [ , ] represents the splicing (concatenation) operation, x1 represents the first audio fusion element, and x2 represents the second audio fusion element.
In an embodiment of the present invention, the step S2 includes the following substeps:
S21, extracting an original gain factor of the real-time voice;
S22, correcting the original gain factor by utilizing an audio fusion operator to obtain a target gain factor;
S23, performing enhancement processing on the real-time voice by using the target gain factor to obtain characteristic voice.
In the invention, because the real-time voice input by the user is subject to strong noise interference, the gain factor needs to be corrected by using the audio fusion operator, and the real-time voice is enhanced with the corrected result, so that the noise interference is suppressed.
In the embodiment of the present invention, in S22, the calculation formula of the target gain factor Y is:
where R represents the audio fusion operator, y represents the original gain factor, and ⌈·⌉ represents the rounding-up operation.
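Since neither the original gain factor extraction nor the exact Y formula is reproduced in this text, the sketch below only illustrates the S21-S23 flow, continuing the variables from the earlier sketches: a crude energy-based gain stands in for the original gain factor, and the correction with the audio fusion operator R is a clearly marked placeholder.

```python
# Sketch of S21-S23, continuing from the earlier sketches: extract a gain factor from the real-time voice,
# correct it with the audio fusion operator R, and apply it. The original gain y and the correction Y are
# placeholders; the patent's own formulas are not reproduced in this text.
import numpy as np


def original_gain_factor(x: np.ndarray, noise_floor: float = 1e-4) -> float:
    # Placeholder S21: a crude energy-based gain relative to an assumed noise floor.
    signal_power = float(np.mean(x ** 2))
    return signal_power / (signal_power + noise_floor)


def target_gain_factor(R: np.ndarray, y: float) -> float:
    # Placeholder S22: the patent's Y combines R and y with a rounding-up operation; this is not that formula.
    correction = float(np.ceil(np.sum(np.abs(R))))
    return y * correction / (1.0 + correction)


y = original_gain_factor(windowed)    # S21
Y = target_gain_factor(R, y)          # S22
characteristic_voice = Y * windowed   # S23: enhanced (characteristic) voice, ready for conversion to text in S3
```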
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and that the scope of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations based on the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.

Claims (10)

1. A voice interaction method based on artificial intelligence, characterized by comprising the following steps:
S1, collecting real-time voice of a user with a microphone of a mobile terminal, and generating an audio fusion operator corresponding to the real-time voice;
S2, processing the real-time voice with the audio fusion operator to generate characteristic voice;
S3, converting the characteristic voice into text.
2. The artificial intelligence based voice interaction method according to claim 1, wherein S1 comprises the sub-steps of:
S11, collecting real-time voice of the user with the microphone of the mobile terminal, and windowing the real-time voice of the user;
S12, splitting the windowed real-time voice into a first audio set and a second audio set;
S13, performing wavelet transform on each frame of audio signal in the first audio set and the second audio set to obtain the detail sub-band coefficients of each frame of audio signal;
S14, determining an audio transform coefficient of the first audio set according to the detail sub-band coefficients of each frame of audio signal in the first audio set, and determining an audio transform coefficient of the second audio set according to the detail sub-band coefficients of each frame of audio signal in the second audio set;
S15, determining a first audio fusion element according to the audio transform coefficient of the first audio set and the audio transform coefficient of the second audio set;
S16, determining a second audio fusion element;
S17, generating an audio fusion operator from the first audio fusion element and the second audio fusion element.
3. The artificial intelligence-based voice interaction method according to claim 2, wherein in S11, the audio length of the real-time voice is used as the window width of the windowing function, and the windowing function is used to perform windowing processing on the real-time voice of the user;
The expression of the windowing function Z(k) is as follows:
where S represents the window width of the windowing function and k represents the unit window width index of the windowing function.
4. The artificial intelligence-based voice interaction method according to claim 2, wherein in S14, the calculation formula of the audio transform coefficient c1 of the first audio set is:
where pm represents the detail sub-band coefficients of the m-th frame of audio signal in the first audio set, pm-1 represents the detail sub-band coefficients of the (m-1)-th frame of audio signal in the first audio set, pm+1 represents the detail sub-band coefficients of the (m+1)-th frame of audio signal in the first audio set, exp(·) represents the exponential function, and M represents the total number of frames of audio signal in the first audio set;
in S14, the calculation formula of the audio transform coefficient c2 of the second audio set is:
where pn represents the detail sub-band coefficients of the n-th frame of audio signal in the second audio set, pn-1 represents the detail sub-band coefficients of the (n-1)-th frame of audio signal in the second audio set, pn+1 represents the detail sub-band coefficients of the (n+1)-th frame of audio signal in the second audio set, and N represents the total number of frames of audio signal in the second audio set.
5. The artificial intelligence-based voice interaction method according to claim 2, wherein in S15, the calculation formula of the first audio fusion element x1 is:
where c1 denotes the audio transform coefficient of the first audio set, c2 denotes the audio transform coefficient of the second audio set, M denotes the total number of frames of audio signal in the first audio set, N denotes the total number of frames of audio signal in the second audio set, and ⌈·⌉ denotes the rounding-up operation.
6. The artificial intelligence based voice interaction method according to claim 2, wherein S16 comprises the sub-steps of:
S161, extracting the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set;
S162, calculating the second audio fusion element according to the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set.
7. The artificial intelligence-based voice interaction method according to claim 6, wherein in S162, the calculation formula of the second audio fusion element x2 is:
where F1 represents the gammatone filter cepstral coefficients of the first audio set and F2 represents the gammatone filter cepstral coefficients of the second audio set.
8. The artificial intelligence-based voice interaction method according to claim 2, wherein in S17, the expression of the audio fusion operator R is: R = [x1, x2]; where [ , ] represents the splicing (concatenation) operation, x1 represents the first audio fusion element, and x2 represents the second audio fusion element.
9. The artificial intelligence based voice interaction method according to claim 1, wherein S2 comprises the sub-steps of:
S21, extracting an original gain factor of the real-time voice;
S22, correcting the original gain factor by utilizing an audio fusion operator to obtain a target gain factor;
S23, performing enhancement processing on the real-time voice by using the target gain factor to obtain characteristic voice.
10. The artificial intelligence-based voice interaction method according to claim 9, wherein in S22, the calculation formula of the target gain factor Y is:
where R represents the audio fusion operator, y represents the original gain factor, and ⌈·⌉ represents the rounding-up operation.
CN202410764506.3A 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence Active CN118335089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410764506.3A CN118335089B (en) 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410764506.3A CN118335089B (en) 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN118335089A true CN118335089A (en) 2024-07-12
CN118335089B CN118335089B (en) 2024-09-10

Family

ID=91777446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410764506.3A Active CN118335089B (en) 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN118335089B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013164572A (en) * 2012-01-10 2013-08-22 Toshiba Corp Voice feature quantity extraction device, voice feature quantity extraction method, and voice feature quantity extraction program
US20140188487A1 (en) * 2011-06-06 2014-07-03 Bridge Mediatech, S.L. Method and system for robust audio hashing
CN115440217A (en) * 2022-08-29 2022-12-06 西安讯飞超脑信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN115910034A (en) * 2022-09-30 2023-04-04 兴业银行股份有限公司 Deep learning-based speech language identification method and system
CN116580706A (en) * 2023-07-14 2023-08-11 合肥朗永智能科技有限公司 Speech recognition method based on artificial intelligence
CN117746892A (en) * 2023-12-18 2024-03-22 国网福建省电力有限公司 Transformer voiceprint fault identification method and equipment based on wavelet transformation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140188487A1 (en) * 2011-06-06 2014-07-03 Bridge Mediatech, S.L. Method and system for robust audio hashing
JP2013164572A (en) * 2012-01-10 2013-08-22 Toshiba Corp Voice feature quantity extraction device, voice feature quantity extraction method, and voice feature quantity extraction program
CN115440217A (en) * 2022-08-29 2022-12-06 西安讯飞超脑信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN115910034A (en) * 2022-09-30 2023-04-04 兴业银行股份有限公司 Deep learning-based speech language identification method and system
CN116580706A (en) * 2023-07-14 2023-08-11 合肥朗永智能科技有限公司 Speech recognition method based on artificial intelligence
CN117746892A (en) * 2023-12-18 2024-03-22 国网福建省电力有限公司 Transformer voiceprint fault identification method and equipment based on wavelet transformation

Also Published As

Publication number Publication date
CN118335089B (en) 2024-09-10

Similar Documents

Publication Publication Date Title
WO2022083083A1 (en) Sound conversion system and training method for same
US6691090B1 (en) Speech recognition system including dimensionality reduction of baseband frequency signals
CN111508498A (en) Conversational speech recognition method, system, electronic device and storage medium
CN106098078A (en) A kind of audio recognition method that may filter that speaker noise and system thereof
CN111883135A (en) Voice transcription method and device and electronic equipment
CN107274887A (en) Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN114495969A (en) Voice recognition method integrating voice enhancement
CN113838471A (en) Noise reduction method and system based on neural network, electronic device and storage medium
CN113782044B (en) Voice enhancement method and device
CN111681649B (en) Speech recognition method, interaction system and achievement management system comprising system
CN118335089B (en) Speech interaction method based on artificial intelligence
CN113269305A (en) Feedback voice strengthening method for strengthening memory
CN117041430A (en) Method and device for improving outbound quality and robustness of intelligent coordinated outbound system
Kaur et al. Optimizing feature extraction techniques constituting phone based modelling on connected words for Punjabi automatic speech recognition
CN116564286A (en) Voice input method and device, storage medium and electronic equipment
Zhou et al. Environmental sound classification of western black-crowned gibbon habitat based on spectral subtraction and VGG16
CN111883178B (en) Double-channel voice-to-image-based emotion recognition method
CN114827363A (en) Method, device and readable storage medium for eliminating echo in call process
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
Cherukuru et al. CNN-based noise reduction for multi-channel speech enhancement system with discrete wavelet transform (DWT) preprocessing
US12094484B2 (en) General speech enhancement method and apparatus using multi-source auxiliary information
Pan et al. Application of hidden Markov models in speech command recognition
CN117909486B (en) Multi-mode question-answering method and system based on emotion recognition and large language model
CN113160816A (en) Man-machine interaction method based on neural network VAD algorithm
CN118571212B (en) Speech recognition method and device of intelligent earphone, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant