CN118335089A - Speech interaction method based on artificial intelligence - Google Patents

Speech interaction method based on artificial intelligence

Info

Publication number
CN118335089A
Authority
CN
China
Prior art keywords
audio
voice
real
audio set
artificial intelligence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410764506.3A
Other languages
Chinese (zh)
Other versions
CN118335089B (en
Inventor
沈国良
景奕昕
尚晓波
黄爱军
蔡梁元
王磊
余璐璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Pansheng Dingcheng Technology Co ltd
Original Assignee
Wuhan Pansheng Dingcheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Pansheng Dingcheng Technology Co ltd filed Critical Wuhan Pansheng Dingcheng Technology Co ltd
Priority to CN202410764506.3A priority Critical patent/CN118335089B/en
Publication of CN118335089A publication Critical patent/CN118335089A/en
Application granted granted Critical
Publication of CN118335089B publication Critical patent/CN118335089B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a voice interaction method based on artificial intelligence, which belongs to the technical field of data processing and comprises the following steps: S1, collecting real-time voice of a user with a microphone of a mobile terminal, and generating an audio fusion operator corresponding to the real-time voice; S2, processing the real-time voice with the audio fusion operator to generate characteristic voice; S3, converting the characteristic voice into text. In the voice interaction method based on artificial intelligence, the real-time voice input by the user is split into two audio sets, characteristic parameters of the two audio sets are extracted and spliced to generate an audio fusion operator, and the audio fusion operator is used to correct the gain factor used for enhancement processing; this improves the effect of the voice enhancement processing, the accuracy of converting voice into text, and the noise immunity of voice recognition.

Description

Speech interaction method based on artificial intelligence
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a voice interaction method based on artificial intelligence.
Background
With the development of artificial intelligence, speech synthesis technology based on artificial intelligence is being applied more and more widely. Speech-to-text is a technique that converts speech content into editable text. It helps people quickly convert voice recordings into text and improves work and learning efficiency. When using a mobile phone or a tablet computer, many people prefer to input text by voice recognition, but the environment in which the user inputs voice may be noisy, so the converted text content may be inaccurate.
Disclosure of Invention
To solve the above problem, the invention provides a voice interaction method based on artificial intelligence.
The technical scheme of the invention is as follows: an artificial intelligence-based voice interaction method comprises the following steps:
S1, collecting real-time voice of a user with a microphone of a mobile terminal, and generating an audio fusion operator corresponding to the real-time voice;
S2, processing the real-time voice with the audio fusion operator to generate characteristic voice;
S3, converting the characteristic voice into text.
The conversion of the characteristic voice into text can be implemented with an existing neural network or deep learning method.
Further, the step S1 includes the following substeps:
S11, collecting real-time voice of the user with the microphone of the mobile terminal, and windowing the real-time voice of the user;
S12, splitting the windowed real-time voice into a first audio set and a second audio set;
S13, performing wavelet transform on each frame of audio signal in the first audio set and the second audio set to obtain the detail sub-band coefficients of each frame of audio signal;
S14, determining an audio transform coefficient of the first audio set according to the detail sub-band coefficients of each frame of audio signal in the first audio set, and determining an audio transform coefficient of the second audio set according to the detail sub-band coefficients of each frame of audio signal in the second audio set;
S15, determining a first audio fusion element according to the audio transform coefficient of the first audio set and the audio transform coefficient of the second audio set;
S16, determining a second audio fusion element;
S17, generating an audio fusion operator from the first audio fusion element and the second audio fusion element.
The beneficial effects of the above further scheme are as follows: in the invention, the real-time voice of the user contains a large number of voice signals, so the real-time voice is split into two audio sets, an audio transform coefficient is extracted from the voice signals of each audio set, and the first audio fusion element is determined from the two audio transform coefficients; the gammatone filter cepstral coefficients of the two audio sets are used to obtain the second audio fusion element; and the two elements are spliced to generate an audio fusion operator that contains the voice signal characteristics, embodies the audio characteristics, and facilitates the subsequent voice enhancement processing. The gammatone filter cepstral coefficient, which has high noise resistance, is a characteristic parameter based on the human cochlear auditory model; it is mainly used for extracting audio data features and recognizing voice, has good robustness, and can effectively improve recognition accuracy in noisy and unstable environments.
Further, in the step S11, the audio length of the real-time voice is used as the window width of the windowing function, and the windowing function is used to perform windowing processing on the real-time voice of the user;
The expression of the windowing function Z(k) is as follows:
where S represents the window width of the windowing function and k represents the unit window width index of the windowing function.
Further, in S14, the calculation formula of the audio transform coefficient c1 of the first audio set is:
where pm represents the detail sub-band coefficients of the m-th frame of audio signal in the first audio set, pm-1 represents the detail sub-band coefficients of the (m-1)-th frame of audio signal in the first audio set, pm+1 represents the detail sub-band coefficients of the (m+1)-th frame of audio signal in the first audio set, exp(·) represents the exponential function, and M represents the total number of frames of audio signal in the first audio set;
in S14, the calculation formula of the audio transform coefficient c2 of the second audio set is:
where pn represents the detail sub-band coefficients of the n-th frame of audio signal in the second audio set, pn-1 represents the detail sub-band coefficients of the (n-1)-th frame of audio signal in the second audio set, pn+1 represents the detail sub-band coefficients of the (n+1)-th frame of audio signal in the second audio set, and N represents the total number of frames of audio signal in the second audio set.
Further, in S15, the calculation formula of the first audio fusion element x1 is:
where c1 denotes the audio transform coefficient of the first audio set, c2 denotes the audio transform coefficient of the second audio set, M denotes the total number of frames of audio signal in the first audio set, N denotes the total number of frames of audio signal in the second audio set, and ⌈·⌉ denotes the rounding-up operation.
Further, the step S16 includes the following substeps:
S161, extracting the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set;
S162, calculating the second audio fusion element according to the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set.
Further, in S162, the calculation formula of the second audio fusion element x2 is:
where F1 represents the gammatone filter cepstral coefficients of the first audio set and F2 represents the gammatone filter cepstral coefficients of the second audio set.
Further, in S17, the expression of the audio fusion operator R is: R = [x1, x2]; where [ , ] represents the splicing (concatenation) operation, x1 represents the first audio fusion element, and x2 represents the second audio fusion element.
Further, the step S2 includes the following substeps:
S21, extracting an original gain factor of the real-time voice;
S22, correcting the original gain factor by utilizing an audio fusion operator to obtain a target gain factor;
S23, performing enhancement processing on the real-time voice by using the target gain factor to obtain characteristic voice.
The beneficial effects of the above further scheme are as follows: in the invention, because the real-time voice input by the user is subject to strong noise interference, the gain factor needs to be corrected by using the audio fusion operator, and the real-time voice is enhanced with the corrected result, so that the noise interference is suppressed.
Further, in S22, the calculation formula of the target gain factor Y is:
where R represents the audio fusion operator, y represents the original gain factor, and ⌈·⌉ represents the rounding-up operation.
The beneficial effects of the invention are as follows: in the voice interaction method based on artificial intelligence, the real-time voice input by the user is split into two audio sets, characteristic parameters of the two audio sets are extracted and spliced to generate an audio fusion operator, and the audio fusion operator is used to correct the gain factor used for enhancement processing; this improves the effect of the voice enhancement processing, the accuracy of converting voice into text, and the noise immunity of voice recognition.
Drawings
FIG. 1 is a flow chart of an artificial intelligence based speech interaction method.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a voice interaction method based on artificial intelligence, which includes the following steps:
S1, collecting real-time voice of a user with a microphone of a mobile terminal, and generating an audio fusion operator corresponding to the real-time voice;
S2, processing the real-time voice with the audio fusion operator to generate characteristic voice;
S3, converting the characteristic voice into text.
The conversion of the characteristic voice into text can be implemented with an existing neural network or deep learning method.
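As one possible realization of this step (a sketch only, not the method claimed here), the characteristic voice could be handed to any off-the-shelf speech recognition model; the example below uses the open-source Whisper model, and the model size, file name, and use of the openai-whisper package are assumptions rather than requirements of the invention.

```python
# Minimal sketch of step S3: converting the enhanced (characteristic) voice into text.
# The Whisper model, its size, and the file name are assumptions; any existing ASR model could be used instead.
import whisper


def characteristic_voice_to_text(wav_path: str) -> str:
    model = whisper.load_model("base")    # small general-purpose speech recognition model
    result = model.transcribe(wav_path)   # returns a dict whose "text" field holds the transcript
    return result["text"]


if __name__ == "__main__":
    print(characteristic_voice_to_text("characteristic_voice.wav"))  # hypothetical file name
```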
In an embodiment of the present invention, the step S1 includes the following substeps:
S11, collecting real-time voice of the user with the microphone of the mobile terminal, and windowing the real-time voice of the user;
S12, splitting the windowed real-time voice into a first audio set and a second audio set;
S13, performing wavelet transform on each frame of audio signal in the first audio set and the second audio set to obtain the detail sub-band coefficients of each frame of audio signal;
S14, determining an audio transform coefficient of the first audio set according to the detail sub-band coefficients of each frame of audio signal in the first audio set, and determining an audio transform coefficient of the second audio set according to the detail sub-band coefficients of each frame of audio signal in the second audio set;
S15, determining a first audio fusion element according to the audio transform coefficient of the first audio set and the audio transform coefficient of the second audio set;
S16, determining a second audio fusion element;
S17, generating an audio fusion operator from the first audio fusion element and the second audio fusion element.
In the invention, the real-time voice of the user contains a large number of voice signals, so the real-time voice is split into two audio sets, an audio transform coefficient is extracted from the voice signals of each audio set, and the first audio fusion element is determined from the two audio transform coefficients; the gammatone filter cepstral coefficients of the two audio sets are used to obtain the second audio fusion element; and the two elements are spliced to generate an audio fusion operator that contains the voice signal characteristics, embodies the audio characteristics, and facilitates the subsequent voice enhancement processing. The gammatone filter cepstral coefficient, which has high noise resistance, is a characteristic parameter based on the human cochlear auditory model; it is mainly used for extracting audio data features and recognizing voice, has good robustness, and can effectively improve recognition accuracy in noisy and unstable environments.
In the embodiment of the present invention, in S11, the audio length of the real-time voice is used as the window width of the windowing function, and the windowing function is used to perform windowing processing on the real-time voice of the user;
The expression of the windowing function Z(k) is as follows:
where S represents the window width of the windowing function and k represents the unit window width index of the windowing function.
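Because the exact expression of Z(k) is not reproduced in this text, the following sketch only illustrates the shape of S11 and S12: one window whose width S equals the audio length is applied to the whole real-time voice, and the result is split into two audio sets. The Hann window, the file name, and the half-and-half split are assumptions.

```python
# Sketch of S11-S12 (assumptions noted above): apply one window whose width S equals the audio length,
# then split the windowed real-time voice into a first and a second audio set.
import numpy as np
import soundfile as sf
from scipy.signal.windows import hann


def window_realtime_voice(wav_path: str) -> tuple[np.ndarray, int]:
    x, sr = sf.read(wav_path)      # real-time voice captured by the mobile-terminal microphone
    if x.ndim > 1:
        x = x.mean(axis=1)         # mix down to mono
    S = len(x)                     # window width equals the audio length, as stated above
    return x * hann(S), sr         # Hann shape is an assumption; the patent defines its own Z(k)


windowed, sr = window_realtime_voice("realtime_voice.wav")      # hypothetical file name
half = len(windowed) // 2
first_set, second_set = windowed[:half], windowed[half:]        # S12: assumed half-and-half split
```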
In the embodiment of the present invention, in S14, the calculation formula of the audio transform coefficient c1 of the first audio set is:
where pm represents the detail sub-band coefficients of the m-th frame of audio signal in the first audio set, pm-1 represents the detail sub-band coefficients of the (m-1)-th frame of audio signal in the first audio set, pm+1 represents the detail sub-band coefficients of the (m+1)-th frame of audio signal in the first audio set, exp(·) represents the exponential function, and M represents the total number of frames of audio signal in the first audio set;
in S14, the calculation formula of the audio transform coefficient c2 of the second audio set is:
where pn represents the detail sub-band coefficients of the n-th frame of audio signal in the second audio set, pn-1 represents the detail sub-band coefficients of the (n-1)-th frame of audio signal in the second audio set, pn+1 represents the detail sub-band coefficients of the (n+1)-th frame of audio signal in the second audio set, and N represents the total number of frames of audio signal in the second audio set.
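Continuing from the windowing sketch above, the following sketch illustrates S13 and S14: each frame of each audio set is wavelet-transformed to obtain detail sub-band coefficients, which are then combined into one audio transform coefficient per set. The frame length, the db4 wavelet, the per-frame energy summary, the combination in combine_detail_coefficients, and the x1 line at the end (S15) are placeholders, since the patent's c1, c2, and x1 formulas are not reproduced in this text.

```python
# Sketch of S13-S15, continuing from the windowing sketch: per-frame wavelet detail sub-band coefficients,
# then a placeholder audio transform coefficient per set and a placeholder first fusion element x1.
import numpy as np
import pywt  # PyWavelets


def frame_detail_coefficients(audio_set: np.ndarray, frame_len: int = 512) -> np.ndarray:
    n_frames = len(audio_set) // frame_len
    frames = audio_set[: n_frames * frame_len].reshape(n_frames, frame_len)
    # pywt.dwt returns (approximation, detail); one scalar per frame = detail-coefficient energy (assumption).
    return np.array([np.sum(pywt.dwt(frame, "db4")[1] ** 2) for frame in frames])


def combine_detail_coefficients(p: np.ndarray) -> float:
    # Placeholder for c1 / c2: the patent combines p[m-1], p[m], p[m+1] with exp(.); that formula is not
    # reproduced in this text, so a simple neighbour-difference aggregation stands in for it.
    neighbours = np.exp(-np.abs(p[1:-1] - 0.5 * (p[:-2] + p[2:])))
    return float(np.mean(neighbours))


c1 = combine_detail_coefficients(frame_detail_coefficients(first_set))   # first audio set (S13 + S14)
c2 = combine_detail_coefficients(frame_detail_coefficients(second_set))  # second audio set (S13 + S14)
x1 = float(np.ceil((c1 + c2) / 2.0))  # placeholder for the patent's x1 formula (S15), which uses M, N and rounding up
```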
In the embodiment of the present invention, in S15, the calculation formula of the first audio fusion element x1 is:
where c1 denotes the audio transform coefficient of the first audio set, c2 denotes the audio transform coefficient of the second audio set, M denotes the total number of frames of audio signal in the first audio set, N denotes the total number of frames of audio signal in the second audio set, and ⌈·⌉ denotes the rounding-up operation.
In an embodiment of the present invention, the step S16 includes the following substeps:
S161, extracting the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set;
S162, calculating the second audio fusion element according to the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set.
In the embodiment of the present invention, in S162, the calculation formula of the second audio fusion element x2 is:
where F1 represents the gammatone filter cepstral coefficients of the first audio set and F2 represents the gammatone filter cepstral coefficients of the second audio set.
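A common recipe for gammatone filter cepstral coefficients is a gammatone filterbank followed by log band energies and a discrete cosine transform; the sketch below follows that recipe with SciPy's gammatone filter design (SciPy ≥ 1.6) and continues the variables from the earlier sketches. The band count, frequency range, cepstral order, the scalar summaries of F1 and F2, and the final x2 combination are assumptions, since the patent's x2 formula is not reproduced in this text.

```python
# Sketch of S16-S17, continuing from the earlier sketches: gammatone filter cepstral coefficients (GFCC)
# for each audio set, a placeholder second fusion element x2, and the spliced audio fusion operator R.
import numpy as np
from scipy.fft import dct
from scipy.signal import gammatone, lfilter


def gfcc(signal: np.ndarray, sr: int, n_bands: int = 32, n_ceps: int = 13) -> np.ndarray:
    centre_freqs = np.geomspace(50.0, 0.45 * sr, n_bands)         # log-spaced centre frequencies (assumption)
    log_energies = []
    for fc in centre_freqs:
        b, a = gammatone(fc, "iir", fs=sr)                         # 4th-order gammatone filter centred at fc
        band = lfilter(b, a, signal)
        log_energies.append(np.log(np.mean(band ** 2) + 1e-12))    # log energy of the band output
    return dct(np.array(log_energies), type=2, norm="ortho")[:n_ceps]


F1 = float(np.mean(gfcc(first_set, sr)))   # GFCC of the first audio set, summarised as a scalar (assumption)
F2 = float(np.mean(gfcc(second_set, sr)))  # GFCC of the second audio set, summarised as a scalar (assumption)
x2 = 0.5 * (F1 + F2)                       # placeholder for the patent's x2 formula (S162)
R = np.array([x1, x2])                     # S17: audio fusion operator R = [x1, x2]
```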
In the embodiment of the present invention, in S17, the expression of the audio fusion operator R is: R = [x1, x2]; where [ , ] represents the splicing (concatenation) operation, x1 represents the first audio fusion element, and x2 represents the second audio fusion element.
In an embodiment of the present invention, the step S2 includes the following substeps:
S21, extracting an original gain factor of the real-time voice;
S22, correcting the original gain factor by utilizing an audio fusion operator to obtain a target gain factor;
S23, performing enhancement processing on the real-time voice by using the target gain factor to obtain characteristic voice.
In the invention, because the real-time voice input by the user is subject to strong noise interference, the gain factor needs to be corrected by using the audio fusion operator, and the real-time voice is enhanced with the corrected result, so that the noise interference is suppressed.
In the embodiment of the present invention, in S22, the calculation formula of the target gain factor Y is:
where R represents the audio fusion operator, y represents the original gain factor, and ⌈·⌉ represents the rounding-up operation.
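Since neither the original gain factor extraction nor the exact Y formula is reproduced in this text, the sketch below only illustrates the S21-S23 flow, continuing the variables from the earlier sketches: a crude energy-based gain stands in for the original gain factor, and the correction with the audio fusion operator R is a clearly marked placeholder.

```python
# Sketch of S21-S23, continuing from the earlier sketches: extract a gain factor from the real-time voice,
# correct it with the audio fusion operator R, and apply it. The original gain y and the correction Y are
# placeholders; the patent's own formulas are not reproduced in this text.
import numpy as np


def original_gain_factor(x: np.ndarray, noise_floor: float = 1e-4) -> float:
    # Placeholder S21: a crude energy-based gain relative to an assumed noise floor.
    signal_power = float(np.mean(x ** 2))
    return signal_power / (signal_power + noise_floor)


def target_gain_factor(R: np.ndarray, y: float) -> float:
    # Placeholder S22: the patent's Y combines R and y with a rounding-up operation; this is not that formula.
    correction = float(np.ceil(np.sum(np.abs(R))))
    return y * correction / (1.0 + correction)


y = original_gain_factor(windowed)    # S21
Y = target_gain_factor(R, y)          # S22
characteristic_voice = Y * windowed   # S23: enhanced (characteristic) voice, ready for conversion to text in S3
```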
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and that the scope of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations based on the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.

Claims (10)

1. A voice interaction method based on artificial intelligence, characterized by comprising the following steps:
S1, collecting real-time voice of a user with a microphone of a mobile terminal, and generating an audio fusion operator corresponding to the real-time voice;
S2, processing the real-time voice with the audio fusion operator to generate characteristic voice;
S3, converting the characteristic voice into text.
2. The artificial intelligence based voice interaction method according to claim 1, wherein S1 comprises the sub-steps of:
S11, collecting real-time voice of the user with the microphone of the mobile terminal, and windowing the real-time voice of the user;
S12, splitting the windowed real-time voice into a first audio set and a second audio set;
S13, performing wavelet transform on each frame of audio signal in the first audio set and the second audio set to obtain the detail sub-band coefficients of each frame of audio signal;
S14, determining an audio transform coefficient of the first audio set according to the detail sub-band coefficients of each frame of audio signal in the first audio set, and determining an audio transform coefficient of the second audio set according to the detail sub-band coefficients of each frame of audio signal in the second audio set;
S15, determining a first audio fusion element according to the audio transform coefficient of the first audio set and the audio transform coefficient of the second audio set;
S16, determining a second audio fusion element;
S17, generating an audio fusion operator from the first audio fusion element and the second audio fusion element.
3. The artificial intelligence-based voice interaction method according to claim 2, wherein in S11, the audio length of the real-time voice is used as the window width of the windowing function, and the windowing function is used to perform windowing processing on the real-time voice of the user;
The expression of the windowing function Z(k) is as follows:
where S represents the window width of the windowing function and k represents the unit window width index of the windowing function.
4. The artificial intelligence-based voice interaction method according to claim 2, wherein in S14, the calculation formula of the audio transform coefficient c1 of the first audio set is:
where pm represents the detail sub-band coefficients of the m-th frame of audio signal in the first audio set, pm-1 represents the detail sub-band coefficients of the (m-1)-th frame of audio signal in the first audio set, pm+1 represents the detail sub-band coefficients of the (m+1)-th frame of audio signal in the first audio set, exp(·) represents the exponential function, and M represents the total number of frames of audio signal in the first audio set;
in S14, the calculation formula of the audio transform coefficient c2 of the second audio set is:
where pn represents the detail sub-band coefficients of the n-th frame of audio signal in the second audio set, pn-1 represents the detail sub-band coefficients of the (n-1)-th frame of audio signal in the second audio set, pn+1 represents the detail sub-band coefficients of the (n+1)-th frame of audio signal in the second audio set, and N represents the total number of frames of audio signal in the second audio set.
5. The artificial intelligence-based voice interaction method according to claim 2, wherein in S15, the calculation formula of the first audio fusion element x1 is:
where c1 denotes the audio transform coefficient of the first audio set, c2 denotes the audio transform coefficient of the second audio set, M denotes the total number of frames of audio signal in the first audio set, N denotes the total number of frames of audio signal in the second audio set, and ⌈·⌉ denotes the rounding-up operation.
6. The artificial intelligence based voice interaction method according to claim 2, wherein S16 comprises the sub-steps of:
S161, extracting the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set;
S162, calculating the second audio fusion element according to the gammatone filter cepstral coefficients of the first audio set and the gammatone filter cepstral coefficients of the second audio set.
7. The artificial intelligence-based voice interaction method according to claim 6, wherein in S162, the calculation formula of the second audio fusion element x2 is:
where F1 represents the gammatone filter cepstral coefficients of the first audio set and F2 represents the gammatone filter cepstral coefficients of the second audio set.
8. The artificial intelligence-based voice interaction method according to claim 2, wherein in S17, the expression of the audio fusion operator R is: R = [x1, x2]; where [ , ] represents the splicing (concatenation) operation, x1 represents the first audio fusion element, and x2 represents the second audio fusion element.
9. The artificial intelligence based voice interaction method according to claim 1, wherein S2 comprises the sub-steps of:
S21, extracting an original gain factor of the real-time voice;
S22, correcting the original gain factor by utilizing an audio fusion operator to obtain a target gain factor;
S23, performing enhancement processing on the real-time voice by using the target gain factor to obtain characteristic voice.
10. The artificial intelligence-based voice interaction method according to claim 9, wherein in S22, the calculation formula of the target gain factor Y is:
where R represents the audio fusion operator, y represents the original gain factor, and ⌈·⌉ represents the rounding-up operation.
CN202410764506.3A 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence Active CN118335089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410764506.3A CN118335089B (en) 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410764506.3A CN118335089B (en) 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN118335089A true CN118335089A (en) 2024-07-12
CN118335089B CN118335089B (en) 2024-09-10

Family

ID=91777446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410764506.3A Active CN118335089B (en) 2024-06-14 2024-06-14 Speech interaction method based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN118335089B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013164572A (en) * 2012-01-10 2013-08-22 Toshiba Corp Voice feature quantity extraction device, voice feature quantity extraction method, and voice feature quantity extraction program
US20140188487A1 (en) * 2011-06-06 2014-07-03 Bridge Mediatech, S.L. Method and system for robust audio hashing
CN115440217A (en) * 2022-08-29 2022-12-06 西安讯飞超脑信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN115910034A (en) * 2022-09-30 2023-04-04 兴业银行股份有限公司 Deep learning-based speech language identification method and system
CN116580706A (en) * 2023-07-14 2023-08-11 合肥朗永智能科技有限公司 Speech recognition method based on artificial intelligence
CN117746892A (en) * 2023-12-18 2024-03-22 国网福建省电力有限公司 Transformer voiceprint fault identification method and equipment based on wavelet transformation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140188487A1 (en) * 2011-06-06 2014-07-03 Bridge Mediatech, S.L. Method and system for robust audio hashing
JP2013164572A (en) * 2012-01-10 2013-08-22 Toshiba Corp Voice feature quantity extraction device, voice feature quantity extraction method, and voice feature quantity extraction program
CN115440217A (en) * 2022-08-29 2022-12-06 西安讯飞超脑信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN115910034A (en) * 2022-09-30 2023-04-04 兴业银行股份有限公司 Deep learning-based speech language identification method and system
CN116580706A (en) * 2023-07-14 2023-08-11 合肥朗永智能科技有限公司 Speech recognition method based on artificial intelligence
CN117746892A (en) * 2023-12-18 2024-03-22 国网福建省电力有限公司 Transformer voiceprint fault identification method and equipment based on wavelet transformation

Also Published As

Publication number Publication date
CN118335089B (en) 2024-09-10

Similar Documents

Publication Publication Date Title
WO2022083083A1 (en) Sound conversion system and training method for same
US6691090B1 (en) Speech recognition system including dimensionality reduction of baseband frequency signals
CN111508498A (en) Conversational speech recognition method, system, electronic device and storage medium
CN106098078A (en) A kind of audio recognition method that may filter that speaker noise and system thereof
CN111883135A (en) Voice transcription method and device and electronic equipment
CN107274887A (en) Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN114495969A (en) Voice recognition method integrating voice enhancement
CN113838471A (en) Noise reduction method and system based on neural network, electronic device and storage medium
CN113782044B (en) Voice enhancement method and device
CN111681649B (en) Speech recognition method, interaction system and achievement management system comprising system
CN118335089B (en) Speech interaction method based on artificial intelligence
CN113269305A (en) Feedback voice strengthening method for strengthening memory
CN117041430A (en) Method and device for improving outbound quality and robustness of intelligent coordinated outbound system
Kaur et al. Optimizing feature extraction techniques constituting phone based modelling on connected words for Punjabi automatic speech recognition
CN116564286A (en) Voice input method and device, storage medium and electronic equipment
Zhou et al. Environmental sound classification of western black-crowned gibbon habitat based on spectral subtraction and VGG16
CN111883178B (en) Double-channel voice-to-image-based emotion recognition method
CN114827363A (en) Method, device and readable storage medium for eliminating echo in call process
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
Cherukuru et al. CNN-based noise reduction for multi-channel speech enhancement system with discrete wavelet transform (DWT) preprocessing
US12094484B2 (en) General speech enhancement method and apparatus using multi-source auxiliary information
Pan et al. Application of hidden Markov models in speech command recognition
CN117909486B (en) Multi-mode question-answering method and system based on emotion recognition and large language model
CN113160816A (en) Man-machine interaction method based on neural network VAD algorithm
CN118571212B (en) Speech recognition method and device of intelligent earphone, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant