CN118335089A - Speech interaction method based on artificial intelligence - Google Patents

Speech interaction method based on artificial intelligence

Info

Publication number
CN118335089A
CN118335089A (application CN202410764506.3A)
Authority
CN
China
Prior art keywords
audio
real
coefficient
audio set
artificial intelligence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410764506.3A
Other languages
Chinese (zh)
Other versions
CN118335089B (en)
Inventor
沈国良
景奕昕
尚晓波
黄爱军
蔡梁元
王磊
余璐璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Pansheng Dingcheng Technology Co ltd
Original Assignee
Wuhan Pansheng Dingcheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Pansheng Dingcheng Technology Co ltd filed Critical Wuhan Pansheng Dingcheng Technology Co ltd
Priority to CN202410764506.3A priority Critical patent/CN118335089B/en
Publication of CN118335089A publication Critical patent/CN118335089A/en
Application granted granted Critical
Publication of CN118335089B publication Critical patent/CN118335089B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention discloses a voice interaction method based on artificial intelligence in the field of data processing technology, comprising the following steps: S1, collecting the user's real-time speech with the microphone of a mobile terminal and generating an audio fusion operator corresponding to the real-time speech; S2, processing the real-time speech with the audio fusion operator to generate characteristic speech; S3, converting the characteristic speech into text. The method splits the real-time speech input by the user, extracts characteristic parameters from each of the two resulting audio sets, and concatenates them to generate an audio fusion operator; the audio fusion operator is used to correct the gain factor of the enhancement processing, which improves the effect of the speech enhancement, the accuracy of speech-to-text conversion, and the noise robustness of speech recognition.

Description

A voice interaction method based on artificial intelligence

Technical Field

The present invention belongs to the technical field of data processing, and in particular relates to a voice interaction method based on artificial intelligence.

Background Art

With the development of artificial intelligence, speech synthesis technology based on artificial intelligence is being applied more and more widely. Speech-to-text is a technology that converts spoken content into editable text; it helps people quickly turn voice recordings into written text and improves work and study efficiency. Many people like to enter text by voice recognition when using a mobile phone or tablet, but the environment in which the user speaks may be noisy, so the converted text can be inaccurate.

Summary of the Invention

To solve the above problems, the present invention proposes a voice interaction method based on artificial intelligence.

The technical solution of the present invention is a voice interaction method based on artificial intelligence comprising the following steps:

S1: collect the user's real-time speech with the microphone of a mobile terminal and generate an audio fusion operator corresponding to the real-time speech;

S2: process the real-time speech with the audio fusion operator to generate characteristic speech;

S3: convert the characteristic speech into text.

The conversion of the characteristic speech into text can be implemented with existing neural-network or deep-learning methods.

Further, S1 comprises the following sub-steps:

S11: collect the user's real-time speech with the microphone of the mobile terminal and apply windowing to the real-time speech;

S12: split the windowed real-time speech into a first audio set and a second audio set;

S13: apply a wavelet transform to each frame of the audio signal in the first audio set and the second audio set to obtain the detail subband coefficient of each frame;

S14: determine the audio transform coefficient of the first audio set from the detail subband coefficients of its frames, and determine the audio transform coefficient of the second audio set from the detail subband coefficients of its frames;

S15: determine a first audio fusion element from the audio transform coefficient of the first audio set and the audio transform coefficient of the second audio set;

S16: determine a second audio fusion element;

S17: generate the audio fusion operator from the determined first audio fusion element and second audio fusion element.

The beneficial effect of the above further scheme is as follows. The user's real-time speech contains a large amount of speech signal, so the real-time speech is split, audio transform coefficients are extracted separately from the speech signals of the two audio sets, and the first audio fusion element is determined from the two audio transform coefficients; the gammatone filter cepstral coefficients of the two audio sets are combined to obtain the second audio fusion element; the two elements are concatenated to generate an audio fusion operator that captures the characteristics of the speech signal and facilitates the subsequent speech enhancement processing. The gammatone filter cepstral coefficient, which offers better noise robustness, is a feature parameter based on a model of the human cochlea; it is mainly used for audio feature extraction and speech recognition, is highly robust, and can effectively improve recognition accuracy in noisy and unstable environments.

Further, in S11 the audio length of the real-time speech is used as the window width of the windowing function, and the windowing function is applied to the user's real-time speech.

The expression of the windowing function Z(k) is:

(Formula not reproduced in the source text.) In the formula, S denotes the window width of the windowing function and k denotes the unit window width index.

Further, in S14 the audio transform coefficient c1 of the first audio set is calculated as:

(Formula not reproduced in the source text.) In the formula, p_m denotes the detail subband coefficient of the m-th frame of the audio signal in the first audio set, p_{m-1} and p_{m+1} denote the detail subband coefficients of the (m-1)-th and (m+1)-th frames, exp(·) denotes the exponential function, and M denotes the total number of audio signal frames in the first audio set.

In S14, the audio transform coefficient c2 of the second audio set is calculated as:

(Formula not reproduced in the source text.) In the formula, p_n denotes the detail subband coefficient of the n-th frame of the audio signal in the second audio set, p_{n-1} and p_{n+1} denote the detail subband coefficients of the (n-1)-th and (n+1)-th frames, and N denotes the total number of audio signal frames in the second audio set.

Further, in S15 the first audio fusion element x1 is calculated as:

(Formula not reproduced in the source text.) In the formula, c1 denotes the audio transform coefficient of the first audio set, c2 denotes the audio transform coefficient of the second audio set, M denotes the total number of audio signal frames in the first audio set, N denotes the total number of audio signal frames in the second audio set, and a ceiling (round-up) operation is applied.

Further, S16 comprises the following sub-steps:

S161: extract the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set;

S162: calculate the second audio fusion element from the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set.

Further, in S162 the second audio fusion element x2 is calculated as:

(Formula not reproduced in the source text.) In the formula, F1 denotes the gammatone filter cepstral coefficient of the first audio set and F2 denotes the gammatone filter cepstral coefficient of the second audio set.

Further, in S17 the audio fusion operator R is expressed as R = [x1, x2], where [ , ] denotes the concatenation operation, x1 denotes the first audio fusion element, and x2 denotes the second audio fusion element.

Further, S2 comprises the following sub-steps:

S21: extract the original gain factor of the real-time speech;

S22: correct the original gain factor with the audio fusion operator to obtain a target gain factor;

S23: enhance the real-time speech with the target gain factor to obtain the characteristic speech.

The beneficial effect of the above further scheme is as follows. Because the real-time speech input by the user suffers from strong noise interference, the gain factor is corrected with the audio fusion operator, and the real-time speech is enhanced with the corrected result so that the noise interference is suppressed.

Further, in S22 the target gain factor Y is calculated as:

(Formula not reproduced in the source text.) In the formula, R denotes the audio fusion operator, y denotes the original gain factor, and a ceiling (round-up) operation is applied.

The beneficial effects of the present invention are as follows: the artificial-intelligence-based voice interaction method splits the real-time speech input by the user, extracts the characteristic parameters of the two audio sets, and concatenates them to generate an audio fusion operator; the audio fusion operator is used to correct the gain factor of the enhancement processing, which improves the effect of the speech enhancement, the accuracy of speech-to-text conversion, and the noise robustness of speech recognition.

Brief Description of the Drawings

FIG. 1 is a flow chart of the voice interaction method based on artificial intelligence.

Detailed Description of the Embodiments

The embodiments of the present invention are further described below in conjunction with the accompanying drawings.

As shown in FIG. 1, the present invention provides a voice interaction method based on artificial intelligence, comprising the following steps:

S1: collect the user's real-time speech with the microphone of a mobile terminal and generate an audio fusion operator corresponding to the real-time speech;

S2: process the real-time speech with the audio fusion operator to generate characteristic speech;

S3: convert the characteristic speech into text.

The conversion of the characteristic speech into text can be implemented with existing neural-network or deep-learning methods.
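The description leaves the concrete text converter open, requiring only an existing neural-network or deep-learning method. A minimal sketch of such an S3 stage, assuming an off-the-shelf pretrained recognizer (Whisper via the Hugging Face transformers pipeline, which the source does not name), could look like this:

```python
# Hypothetical sketch of step S3: transcribe the enhanced "characteristic speech"
# with an existing pretrained recognizer. The model choice is an assumption; the
# source only requires an existing neural-network/deep-learning method.
import numpy as np
from transformers import pipeline

def characteristic_speech_to_text(samples: np.ndarray, sample_rate: int = 16000) -> str:
    """Convert a mono float32 waveform (the characteristic speech) into text."""
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    result = asr({"raw": samples.astype(np.float32), "sampling_rate": sample_rate})
    return result["text"]
```

Any comparable speech-to-text model could be substituted here without changing the rest of the pipeline.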

In the embodiment of the present invention, S1 comprises the following sub-steps:

S11: collect the user's real-time speech with the microphone of the mobile terminal and apply windowing to the real-time speech;

S12: split the windowed real-time speech into a first audio set and a second audio set;

S13: apply a wavelet transform to each frame of the audio signal in the first audio set and the second audio set to obtain the detail subband coefficient of each frame;

S14: determine the audio transform coefficient of the first audio set from the detail subband coefficients of its frames, and determine the audio transform coefficient of the second audio set from the detail subband coefficients of its frames;

S15: determine a first audio fusion element from the audio transform coefficient of the first audio set and the audio transform coefficient of the second audio set;

S16: determine a second audio fusion element;

S17: generate the audio fusion operator from the determined first audio fusion element and second audio fusion element.

In the present invention, the user's real-time speech contains a large amount of speech signal, so the real-time speech is split, audio transform coefficients are extracted separately from the speech signals of the two audio sets, and the first audio fusion element is determined from the two audio transform coefficients; the gammatone filter cepstral coefficients of the two audio sets are combined to obtain the second audio fusion element; the two elements are concatenated to generate an audio fusion operator that captures the characteristics of the speech signal and facilitates the subsequent speech enhancement processing. The gammatone filter cepstral coefficient, which offers better noise robustness, is a feature parameter based on a model of the human cochlea; it is mainly used for audio feature extraction and speech recognition, is highly robust, and can effectively improve recognition accuracy in noisy and unstable environments.
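The source does not state how the windowed real-time speech is divided into the first and second audio sets. The sketch below frames the signal and alternates frames between the two sets; the frame length, hop size, and alternating split rule are all assumptions made only for illustration.

```python
# Sketch of S11-S12 bookkeeping: frame the recorded real-time speech and split the
# frames into a first and a second audio set. Frame length, hop size and the
# even/odd split rule are assumptions; the source only says the windowed speech
# is split into two sets.
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Slice a mono waveform into overlapping frames of frame_len samples."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def split_into_audio_sets(frames: np.ndarray):
    """Assumed split: even-indexed frames form the first set, odd-indexed the second."""
    return frames[0::2], frames[1::2]
```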

In the embodiment of the present invention, in S11 the audio length of the real-time speech is used as the window width of the windowing function, and the windowing function is applied to the user's real-time speech.

The expression of the windowing function Z(k) is:

(Formula not reproduced in the source text.) In the formula, S denotes the window width of the windowing function and k denotes the unit window width index.
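The explicit form of Z(k) appears only as an image in the source, so the sketch below uses a Hann-shaped window purely as a stand-in; what it does take from the text is that the window width S equals the audio length of the real-time speech.

```python
# Sketch of the S11 windowing. Per the text, the window width S equals the audio
# length of the real-time speech; the Hann shape used here is a stand-in because
# the exact Z(k) expression is not reproduced in the source.
import numpy as np

def window_realtime_speech(x: np.ndarray) -> np.ndarray:
    S = len(x)                                               # window width = audio length (S11)
    k = np.arange(S)                                         # unit window width index
    z = 0.5 - 0.5 * np.cos(2.0 * np.pi * k / max(S - 1, 1))  # assumed stand-in for Z(k)
    return x * z
```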

In the embodiment of the present invention, in S14 the audio transform coefficient c1 of the first audio set is calculated as:

(Formula not reproduced in the source text.) In the formula, p_m denotes the detail subband coefficient of the m-th frame of the audio signal in the first audio set, p_{m-1} and p_{m+1} denote the detail subband coefficients of the (m-1)-th and (m+1)-th frames, exp(·) denotes the exponential function, and M denotes the total number of audio signal frames in the first audio set.

In S14, the audio transform coefficient c2 of the second audio set is calculated as:

(Formula not reproduced in the source text.) In the formula, p_n denotes the detail subband coefficient of the n-th frame of the audio signal in the second audio set, p_{n-1} and p_{n+1} denote the detail subband coefficients of the (n-1)-th and (n+1)-th frames, and N denotes the total number of audio signal frames in the second audio set.
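A sketch of S13-S14 follows. The per-frame detail subband coefficient is obtained with a single-level discrete wavelet transform (PyWavelets); because the exact expressions for c1 and c2 are given only as images, the neighbour-smoothed, exponentially weighted average below is merely an assumed combination that uses the quantities the text names (p_{m-1}, p_m, p_{m+1}, exp(·), and the frame count).

```python
# Sketch of S13-S14: detail subband coefficient per frame via a one-level DWT, then
# an audio transform coefficient for the whole audio set. The combination rule is
# an assumption; the source gives the true formula only as an image.
import numpy as np
import pywt

def detail_subband_coefficient(frame: np.ndarray, wavelet: str = "db4") -> float:
    _approx, detail = pywt.dwt(frame, wavelet)   # single-level wavelet decomposition
    return float(np.mean(np.abs(detail)))        # scalar per-frame summary (assumption)

def audio_transform_coefficient(frames: np.ndarray) -> float:
    p = np.array([detail_subband_coefficient(f) for f in frames])
    M = len(p)
    acc = 0.0
    for m in range(1, M - 1):                    # interior frames have both neighbours
        smoothed = (p[m - 1] + p[m] + p[m + 1]) / 3.0
        acc += np.exp(-abs(p[m] - smoothed))     # assumed use of exp(); not the source formula
    return acc / max(M - 2, 1)
```

The same routine would be applied once to the first audio set (giving c1 over its M frames) and once to the second audio set (giving c2 over its N frames).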

In the embodiment of the present invention, in S15 the first audio fusion element x1 is calculated as:

(Formula not reproduced in the source text.) In the formula, c1 denotes the audio transform coefficient of the first audio set, c2 denotes the audio transform coefficient of the second audio set, M denotes the total number of audio signal frames in the first audio set, N denotes the total number of audio signal frames in the second audio set, and a ceiling (round-up) operation is applied.

In the embodiment of the present invention, S16 comprises the following sub-steps:

S161: extract the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set;

S162: calculate the second audio fusion element from the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set.

In the embodiment of the present invention, in S162 the second audio fusion element x2 is calculated as:

(Formula not reproduced in the source text.) In the formula, F1 denotes the gammatone filter cepstral coefficient of the first audio set and F2 denotes the gammatone filter cepstral coefficient of the second audio set.
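A sketch of the S161 feature extraction is given below. It builds a small gammatone filterbank with scipy.signal.gammatone, takes log band energies, and applies a DCT to obtain cepstral coefficients; the band spacing, band count, and the reduction of the coefficients to a single per-set value are assumptions, since the source does not specify them.

```python
# Sketch of S161: gammatone-filterbank cepstral coefficients (GFCC) for one audio set.
# Centre-frequency spacing, band count and the final scalar summary are assumptions.
import numpy as np
from scipy.signal import gammatone, lfilter
from scipy.fft import dct

def gfcc_of_audio_set(frames: np.ndarray, fs: int = 16000,
                      n_bands: int = 16, n_ceps: int = 13) -> float:
    x = frames.reshape(-1)                                   # treat the set as one stream
    centre_freqs = np.geomspace(100.0, 0.45 * fs, n_bands)   # assumed band layout
    log_energies = []
    for fc in centre_freqs:
        b, a = gammatone(fc, "fir", fs=fs)                   # 4th-order gammatone filter
        band = lfilter(b, a, x)
        log_energies.append(np.log(np.mean(band ** 2) + 1e-12))
    ceps = dct(np.array(log_energies), type=2, norm="ortho")[:n_ceps]
    return float(np.mean(ceps))                              # per-set summary F (assumption)
```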

In the embodiment of the present invention, in S17 the audio fusion operator R is expressed as R = [x1, x2], where [ , ] denotes the concatenation operation, x1 denotes the first audio fusion element, and x2 denotes the second audio fusion element.
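Tying S15-S17 together, only the concatenation R = [x1, x2] is stated explicitly in the text; the two helper combinations in the sketch below are placeholders for the unreproduced x1 and x2 formulas and should be read as assumptions.

```python
# Sketch of S15-S17 wiring. The concatenation in audio_fusion_operator follows S17;
# the bodies of the two helpers are assumed placeholders for the image-only formulas.
import math

def first_audio_fusion_element(c1: float, c2: float, M: int, N: int) -> int:
    # assumed frame-count-weighted blend of the two audio transform coefficients,
    # rounded up because the text indicates a ceiling operation is involved
    return math.ceil((M * c1 + N * c2) / (M + N))

def second_audio_fusion_element(F1: float, F2: float) -> float:
    # assumed symmetric combination of the two per-set GFCC values
    return 0.5 * (F1 + F2)

def audio_fusion_operator(x1: float, x2: float) -> list:
    return [x1, x2]            # concatenation [x1, x2], as stated in S17
```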

In the embodiment of the present invention, S2 comprises the following sub-steps:

S21: extract the original gain factor of the real-time speech;

S22: correct the original gain factor with the audio fusion operator to obtain a target gain factor;

S23: enhance the real-time speech with the target gain factor to obtain the characteristic speech.

In the present invention, because the real-time speech input by the user suffers from strong noise interference, the gain factor is corrected with the audio fusion operator, and the real-time speech is enhanced with the corrected result so that the noise interference is suppressed.

In the embodiment of the present invention, in S22 the target gain factor Y is calculated as:

(Formula not reproduced in the source text.) In the formula, R denotes the audio fusion operator, y denotes the original gain factor, and a ceiling (round-up) operation is applied.
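A sketch of the S2 enhancement stage is shown below. The original-gain estimator and the exact correction rule are assumptions; the source states only that the audio fusion operator R corrects the original gain factor y into the target gain factor Y (with a round-up step) and that Y then drives the enhancement of the real-time speech.

```python
# Sketch of S21-S23: estimate an original gain factor from the noisy real-time speech,
# correct it with the audio fusion operator R, and apply the corrected gain.
# The estimator and the correction rule are assumptions, not the source formulas.
import math
import numpy as np

def original_gain_factor(x: np.ndarray) -> float:
    # crude a-priori-SNR-style gain from a percentile noise-floor estimate (assumption)
    noise_power = float(np.percentile(x ** 2, 10)) + 1e-12
    snr = float(np.mean(x ** 2)) / noise_power
    return snr / (1.0 + snr)

def target_gain_factor(y: float, R) -> float:
    # assumed correction: scale y by the mean of R, then round up per the text
    return float(math.ceil(y * float(np.mean(R))))

def enhance(x: np.ndarray, Y: float) -> np.ndarray:
    return Y * x               # simple application of the target gain (assumption)
```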

Those of ordinary skill in the art will appreciate that the embodiments described herein are intended to help readers understand the principles of the present invention, and it should be understood that the scope of protection of the present invention is not limited to these particular statements and embodiments. Those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the present invention based on the technical teachings disclosed herein, and such variations and combinations remain within the scope of protection of the present invention.

Claims (10)

1. A voice interaction method based on artificial intelligence, characterized in that it comprises the following steps:
S1: collecting the user's real-time speech with the microphone of a mobile terminal and generating an audio fusion operator corresponding to the real-time speech;
S2: processing the real-time speech with the audio fusion operator to generate characteristic speech;
S3: converting the characteristic speech into text.
2. The voice interaction method based on artificial intelligence according to claim 1, characterized in that S1 comprises the following sub-steps:
S11: collecting the user's real-time speech with the microphone of the mobile terminal and applying windowing to the user's real-time speech;
S12: splitting the windowed real-time speech into a first audio set and a second audio set;
S13: applying a wavelet transform to each frame of the audio signal in the first audio set and the second audio set to obtain the detail subband coefficient of each frame;
S14: determining the audio transform coefficient of the first audio set from the detail subband coefficients of its frames, and determining the audio transform coefficient of the second audio set from the detail subband coefficients of its frames;
S15: determining a first audio fusion element from the audio transform coefficient of the first audio set and the audio transform coefficient of the second audio set;
S16: determining a second audio fusion element;
S17: generating the audio fusion operator from the determined first audio fusion element and second audio fusion element.
3. The voice interaction method based on artificial intelligence according to claim 2, characterized in that in S11 the audio length of the real-time speech is used as the window width of the windowing function and the windowing function is applied to the user's real-time speech; the expression of the windowing function Z(k) is: (formula not reproduced in the source text), where S denotes the window width of the windowing function and k denotes the unit window width index.
4. The voice interaction method based on artificial intelligence according to claim 2, characterized in that in S14 the audio transform coefficient c1 of the first audio set is calculated as: (formula not reproduced in the source text), where p_m, p_{m-1}, and p_{m+1} denote the detail subband coefficients of the m-th, (m-1)-th, and (m+1)-th frames of the audio signal in the first audio set, exp(·) denotes the exponential function, and M denotes the total number of audio signal frames in the first audio set; and in S14 the audio transform coefficient c2 of the second audio set is calculated as: (formula not reproduced in the source text), where p_n, p_{n-1}, and p_{n+1} denote the detail subband coefficients of the n-th, (n-1)-th, and (n+1)-th frames of the audio signal in the second audio set, and N denotes the total number of audio signal frames in the second audio set.
5. The voice interaction method based on artificial intelligence according to claim 2, characterized in that in S15 the first audio fusion element x1 is calculated as: (formula not reproduced in the source text), where c1 denotes the audio transform coefficient of the first audio set, c2 denotes the audio transform coefficient of the second audio set, M denotes the total number of audio signal frames in the first audio set, N denotes the total number of audio signal frames in the second audio set, and a ceiling (round-up) operation is applied.
6. The voice interaction method based on artificial intelligence according to claim 2, characterized in that S16 comprises the following sub-steps:
S161: extracting the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set;
S162: calculating the second audio fusion element from the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set.
7. The voice interaction method based on artificial intelligence according to claim 6, characterized in that in S162 the second audio fusion element x2 is calculated as: (formula not reproduced in the source text), where F1 denotes the gammatone filter cepstral coefficient of the first audio set and F2 denotes the gammatone filter cepstral coefficient of the second audio set.
8. The voice interaction method based on artificial intelligence according to claim 2, characterized in that in S17 the audio fusion operator R is expressed as R = [x1, x2], where [ , ] denotes the concatenation operation, x1 denotes the first audio fusion element, and x2 denotes the second audio fusion element.
9. The voice interaction method based on artificial intelligence according to claim 1, characterized in that S2 comprises the following sub-steps:
S21: extracting the original gain factor of the real-time speech;
S22: correcting the original gain factor with the audio fusion operator to obtain a target gain factor;
S23: enhancing the real-time speech with the target gain factor to obtain the characteristic speech.
10. The voice interaction method based on artificial intelligence according to claim 9, characterized in that in S22 the target gain factor Y is calculated as: (formula not reproduced in the source text), where R denotes the audio fusion operator, y denotes the original gain factor, and a ceiling (round-up) operation is applied.
CN202410764506.3A 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence Active CN118335089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410764506.3A CN118335089B (en) 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410764506.3A CN118335089B (en) 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN118335089A 2024-07-12
CN118335089B CN118335089B (en) 2024-09-10

Family

ID=91777446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410764506.3A Active CN118335089B (en) 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN118335089B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013164572A (en) * 2012-01-10 2013-08-22 Toshiba Corp Voice feature quantity extraction device, voice feature quantity extraction method, and voice feature quantity extraction program
US20140188487A1 (en) * 2011-06-06 2014-07-03 Bridge Mediatech, S.L. Method and system for robust audio hashing
CN115440217A (en) * 2022-08-29 2022-12-06 西安讯飞超脑信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN115910034A (en) * 2022-09-30 2023-04-04 兴业银行股份有限公司 Speech language recognition method and system based on deep learning
CN116580706A (en) * 2023-07-14 2023-08-11 合肥朗永智能科技有限公司 A Speech Recognition Method Based on Artificial Intelligence
CN117746892A (en) * 2023-12-18 2024-03-22 国网福建省电力有限公司 Transformer voiceprint fault identification method and equipment based on wavelet transformation

Also Published As

Publication number Publication date
CN118335089B (en) 2024-09-10

Similar Documents

Publication Publication Date Title
CN109065067B (en) Conference terminal voice noise reduction method based on neural network model
CN105611477A (en) Depth and breadth neural network combined speech enhancement algorithm of digital hearing aid
CN111508498A (en) Conversational speech recognition method, system, electronic device and storage medium
CN108198571B (en) A bandwidth expansion method and system based on adaptive bandwidth judgment
CN114664322B (en) Single-microphone hearing aid and noise reduction method based on bluetooth headset chip and bluetooth headset
CN106328151A (en) Environment de-noising system and application method
CN111243617B (en) A Speech Enhancement Method Based on Deep Learning to Reduce MFCC Feature Distortion
US12094484B2 (en) General speech enhancement method and apparatus using multi-source auxiliary information
CN110534123A (en) Sound enhancement method, device, storage medium, electronic equipment
CN113838471A (en) Noise reduction method and system based on neural network, electronic device and storage medium
EP4371112A1 (en) Speech enhancement
CN105654955A (en) Voice recognition method and device
CN113409804B (en) Multichannel frequency domain voice enhancement algorithm based on variable expansion into generalized subspace
CN118354237A (en) Awakening method, device and equipment of MEMS earphone and storage medium
CN117457008A (en) Multi-person voiceprint recognition method and device based on telephone channel
CN118398033A (en) A speech-based emotion recognition method, system, device and storage medium
CN114189781B (en) Noise reduction method and system for dual-microphone neural network noise reduction headphones
CN118335089B (en) Speech interaction method based on artificial intelligence
CN111681649B (en) Speech recognition method, interactive system and performance management system including the system
Li et al. A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement
CN108922557A (en) A kind of the multi-person speech separation method and system of chat robots
CN113450814A (en) Voice environment noise reduction method based on deep neural network
CN110197657B (en) A dynamic sound feature extraction method based on cosine similarity
CN114023352B (en) Voice enhancement method and device based on energy spectrum depth modulation
Kaur et al. Optimizing feature extraction techniques constituting phone based modelling on connected words for Punjabi automatic speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant