CN118335089A - Speech interaction method based on artificial intelligence - Google Patents
Speech interaction method based on artificial intelligence
- Publication number
- CN118335089A (application CN202410764506.3A; granted as CN118335089B)
- Authority
- CN
- China
- Prior art keywords
- audio
- voice
- real
- audio set
- artificial intelligence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a voice interaction method based on artificial intelligence, belonging to the technical field of data processing. The method comprises the following steps: S1, collecting real-time voice of a user with a microphone of a mobile terminal and generating an audio fusion operator corresponding to the real-time voice; S2, processing the real-time voice with the audio fusion operator to generate characteristic voice; S3, converting the characteristic voice into text. The method splits the real-time voice input by the user, extracts characteristic parameters from the two resulting audio sets, and splices these parameters into an audio fusion operator. The audio fusion operator is then used to correct the gain factor employed in enhancement processing, which improves the voice enhancement effect, raises the accuracy of speech-to-text conversion, and strengthens the noise immunity of speech recognition.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a voice interaction method based on artificial intelligence.
Background
With the development of artificial intelligence, speech processing technology based on artificial intelligence is finding ever wider application. Speech-to-text is a technique that converts spoken content into editable text; it helps people quickly turn voice recordings into written text and improves work and study efficiency. When using a mobile phone or tablet computer, many people prefer to enter text by voice recognition, but the environment in which the user speaks may be noisy, so the converted text can be inaccurate.
Disclosure of Invention
To solve the above problem, the invention provides a voice interaction method based on artificial intelligence.
The technical scheme of the invention is as follows: an artificial intelligence-based voice interaction method comprises the following steps:
S1, utilizing a microphone of a mobile terminal to collect real-time voice of a user, and generating an audio fusion operator corresponding to the real-time voice;
s2, processing the real-time voice by utilizing an audio fusion operator to generate characteristic voice;
s3, converting the characteristic voice into a text.
The conversion of the characteristic voice into text can be implemented with an existing neural network or deep learning method.
Further, the step S1 includes the following substeps:
S11, utilizing a microphone of the mobile terminal to collect real-time voice of a user, and windowing the real-time voice of the user;
S12, splitting the windowed real-time voice into a first audio set and a second audio set;
S13, carrying out wavelet transformation on each frame of audio signals in the first audio set and the second audio set to obtain detail sub-band coefficients of each frame of audio signals;
S14, determining an audio conversion coefficient of the first audio set according to the detail sub-band coefficient of each frame of audio signal in the first audio set, and determining an audio conversion coefficient of the second audio set according to the detail sub-band coefficient of each frame of audio signal in the second audio set;
S15, determining a first audio fusion element according to the audio conversion coefficient of the first audio set and the audio conversion coefficient of the second audio set;
S16, determining a second audio fusion element;
s17, generating an audio fusion operator according to the determined first audio fusion element and the determined second audio fusion element.
The beneficial effects of the above further scheme are: because the real-time voice of the user contains a large amount of speech signal, the real-time voice is split, an audio conversion coefficient is extracted from the audio signals of each of the two audio sets, and the first audio fusion element is determined from the two audio conversion coefficients; the gamma-pass filter cepstrum coefficients of the two audio sets are then used to obtain the second audio fusion element; splicing the two elements yields an audio fusion operator that carries the characteristics of the speech signal and facilitates the subsequent voice enhancement processing. The gamma-pass filter cepstrum coefficient (gammatone filter cepstral coefficient) is a characteristic parameter based on a model of the human cochlea; it is mainly used for extracting audio features and recognizing speech, offers strong noise robustness, and can effectively improve recognition accuracy in noisy and non-stationary environments.
Further, in the step S11, the audio length of the real-time voice is used as the window width of the windowing function, and the windowing function is used to perform windowing processing on the real-time voice of the user;
The expression of the windowing function Z (k) is as follows:
; where S represents the window width of the windowing function and k represents the unit window width number of the windowing function.
Further, in S14, the calculation formula of the audio transform coefficient c_1 of the first audio set is:
; where p_m represents the detail sub-band coefficients of the m-th frame of audio signal in the first audio set, p_{m-1} represents the detail sub-band coefficients of the (m-1)-th frame, p_{m+1} represents the detail sub-band coefficients of the (m+1)-th frame, exp(·) represents the exponential function, and M represents the total number of frames of audio signals in the first audio set;
in S14, the calculation formula of the audio transform coefficient c_2 of the second audio set is:
; where p_n represents the detail sub-band coefficients of the n-th frame of audio signal in the second audio set, p_{n-1} represents the detail sub-band coefficients of the (n-1)-th frame, p_{n+1} represents the detail sub-band coefficients of the (n+1)-th frame, and N represents the total number of frames of audio signals in the second audio set.
Further, in S15, the calculation formula of the first audio fusion element x_1 is:
; where c_1 denotes the audio transform coefficient of the first audio set, c_2 denotes the audio transform coefficient of the second audio set, M denotes the total number of frames of audio signals in the first audio set, N denotes the total number of frames of audio signals in the second audio set, and ⌈·⌉ represents the rounding-up (ceiling) operation.
Further, the step S16 includes the following substeps:
S161, extracting a gamma-pass filter cepstrum coefficient of the first audio set and a gamma-pass filter cepstrum coefficient of the second audio set;
S162, calculating a second audio fusion element according to the gamma-pass filter cepstrum coefficient of the first audio set and the gamma-pass filter cepstrum coefficient of the second audio set.
Further, in S162, the calculation formula of the second audio fusion element x_2 is:
; where F_1 represents the gamma-pass filter cepstrum coefficient of the first audio set and F_2 represents the gamma-pass filter cepstrum coefficient of the second audio set.
Further, in S17, the expression of the audio fusion operator R is: R = [x_1, x_2]; where [ , ] represents the splicing operation, x_1 represents the first audio fusion element, and x_2 represents the second audio fusion element.
Further, the step S2 includes the following substeps:
s21, extracting an original gain factor of real-time voice;
S22, correcting the original gain factor by utilizing an audio fusion operator to obtain a target gain factor;
S23, performing enhancement processing on the real-time voice by using the target gain factor to obtain characteristic voice.
The beneficial effects of the above further scheme are: because the real-time voice input by the user suffers strong noise interference, the gain factor is corrected with the audio fusion operator, and the corrected result is used to enhance the real-time voice, thereby suppressing the noise interference.
Further, in S22, the calculation formula of the target gain factor Y is:
; where R represents the audio fusion operator, y represents the original gain factor, and ⌈·⌉ represents the rounding-up (ceiling) operation.
The beneficial effects of the invention are as follows: the artificial-intelligence-based voice interaction method splits the real-time voice input by the user, extracts characteristic parameters from the two resulting audio sets, and splices these parameters into an audio fusion operator; the audio fusion operator is used to correct the gain factor employed in enhancement processing, which improves the voice enhancement effect, raises the accuracy of speech-to-text conversion, and strengthens the noise immunity of speech recognition.
Drawings
FIG. 1 is a flow chart of an artificial intelligence based speech interaction method.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a voice interaction method based on artificial intelligence, which includes the following steps:
S1, utilizing a microphone of a mobile terminal to collect real-time voice of a user, and generating an audio fusion operator corresponding to the real-time voice;
s2, processing the real-time voice by utilizing an audio fusion operator to generate characteristic voice;
s3, converting the characteristic voice into a text.
The conversion of the characteristic voice into text can be implemented with an existing neural network or deep learning method.
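For orientation only, a minimal Python sketch of how the three steps fit together is given below. The function names (generate_fusion_operator, enhance, speech_to_text, voice_interaction) are illustrative assumptions rather than terms from the invention, and the recognizer in S3 is left as a stub because any existing neural-network model may be plugged in; the first two functions are fleshed out by the sketches accompanying the later embodiment steps.

```python
# Sketch of the S1-S3 flow; names and signatures are illustrative assumptions.
import numpy as np

def generate_fusion_operator(speech: np.ndarray, fs: int) -> np.ndarray:
    """S1: derive the audio fusion operator R = [x1, x2] (see the S11-S17 sketches below)."""
    raise NotImplementedError

def enhance(speech: np.ndarray, fs: int, R: np.ndarray) -> np.ndarray:
    """S2: correct the gain factor with R and enhance the real-time voice (see the S21-S23 sketch)."""
    raise NotImplementedError

def speech_to_text(speech: np.ndarray, fs: int) -> str:
    """S3: hand the characteristic voice to any existing speech recognizer (stubbed here)."""
    raise NotImplementedError

def voice_interaction(speech: np.ndarray, fs: int) -> str:
    R = generate_fusion_operator(speech, fs)         # S1
    characteristic_voice = enhance(speech, fs, R)    # S2
    return speech_to_text(characteristic_voice, fs)  # S3
```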
In an embodiment of the present invention, the step S1 includes the following substeps:
S11, utilizing a microphone of the mobile terminal to collect real-time voice of a user, and windowing the real-time voice of the user;
S12, splitting the windowed real-time voice into a first audio set and a second audio set;
S13, carrying out wavelet transformation on each frame of audio signals in the first audio set and the second audio set to obtain detail sub-band coefficients of each frame of audio signals;
S14, determining an audio conversion coefficient of the first audio set according to the detail sub-band coefficient of each frame of audio signal in the first audio set, and determining an audio conversion coefficient of the second audio set according to the detail sub-band coefficient of each frame of audio signal in the second audio set;
S15, determining a first audio fusion element according to the audio conversion coefficient of the first audio set and the audio conversion coefficient of the second audio set;
S16, determining a second audio fusion element;
s17, generating an audio fusion operator according to the determined first audio fusion element and the determined second audio fusion element.
In the invention, because the real-time voice of the user contains a large amount of speech signal, the real-time voice is split, an audio conversion coefficient is extracted from the audio signals of each of the two audio sets, and the first audio fusion element is determined from the two audio conversion coefficients; the gamma-pass filter cepstrum coefficients of the two audio sets are then used to obtain the second audio fusion element; splicing the two elements yields an audio fusion operator that carries the characteristics of the speech signal and facilitates the subsequent voice enhancement processing. The gamma-pass filter cepstrum coefficient (gammatone filter cepstral coefficient) is a characteristic parameter based on a model of the human cochlea; it is mainly used for extracting audio features and recognizing speech, offers strong noise robustness, and can effectively improve recognition accuracy in noisy and non-stationary environments.
In the embodiment of the present invention, in S11, the audio length of the real-time voice is used as the window width of the windowing function, and the windowing function is used to perform windowing processing on the real-time voice of the user;
The expression of the windowing function Z (k) is as follows:
; where S represents the window width of the windowing function and k represents the unit window width number of the windowing function.
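A minimal sketch of S11-S12 follows. Because the expression of the window Z(k) is not reproduced in this text, a Hann window spanning the full audio length stands in for it; the split into first and second halves and the 25 ms framing are likewise assumptions made only for illustration.

```python
# Sketch of S11-S12: full-length windowing, split into two audio sets, framing.
import numpy as np

def window_and_split(speech: np.ndarray, fs: int, frame_ms: int = 25):
    S = len(speech)                    # S11: window width taken as the audio length
    z = np.hanning(S)                  # stand-in assumption for the patent's Z(k)
    windowed = speech * z
    half = S // 2                      # S12: split rule assumed to be first/second half
    first_set, second_set = windowed[:half], windowed[half:]
    frame_len = max(1, int(fs * frame_ms / 1000))
    def to_frames(x: np.ndarray) -> np.ndarray:
        n = len(x) // frame_len
        return x[:n * frame_len].reshape(n, frame_len)
    return to_frames(first_set), to_frames(second_set)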
In the embodiment of the present invention, in S14, the calculation formula of the audio transform coefficient c_1 of the first audio set is:
; where p_m represents the detail sub-band coefficients of the m-th frame of audio signal in the first audio set, p_{m-1} represents the detail sub-band coefficients of the (m-1)-th frame, p_{m+1} represents the detail sub-band coefficients of the (m+1)-th frame, exp(·) represents the exponential function, and M represents the total number of frames of audio signals in the first audio set;
in S14, the calculation formula of the audio transform coefficient c_2 of the second audio set is:
; where p_n represents the detail sub-band coefficients of the n-th frame of audio signal in the second audio set, p_{n-1} represents the detail sub-band coefficients of the (n-1)-th frame, p_{n+1} represents the detail sub-band coefficients of the (n+1)-th frame, and N represents the total number of frames of audio signals in the second audio set.
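The sketch below illustrates S13-S14 with PyWavelets. A single-level discrete wavelet transform ('db4') supplies the detail sub-band coefficients of each frame; because the formulas for c_1 and c_2 are not reproduced in this text, the frame-averaged aggregation shown here is only a placeholder assumption, not the patent's formula.

```python
# Sketch of S13-S14: per-frame wavelet detail coefficients and a placeholder
# aggregation into an audio transform coefficient.
import numpy as np
import pywt

def detail_subband_coefficients(frames: np.ndarray) -> np.ndarray:
    """S13: one detail-coefficient summary p_m per frame via a single-level DWT ('db4')."""
    return np.array([np.mean(np.abs(pywt.dwt(frame, "db4")[1])) for frame in frames])

def audio_transform_coefficient(frames: np.ndarray) -> float:
    """S14: placeholder aggregation of the p_m values into c_1 (or c_2)."""
    p = detail_subband_coefficients(frames)
    return float(np.mean(p))  # the patent's exact formula is not reproduced in this text
```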
In the embodiment of the present invention, in S15, the calculation formula of the first audio fusion element x_1 is:
; where c_1 denotes the audio transform coefficient of the first audio set, c_2 denotes the audio transform coefficient of the second audio set, M denotes the total number of frames of audio signals in the first audio set, N denotes the total number of frames of audio signals in the second audio set, and ⌈·⌉ represents the rounding-up (ceiling) operation.
In an embodiment of the present invention, the step S16 includes the following substeps:
S161, extracting a gamma-pass filter cepstrum coefficient of the first audio set and a gamma-pass filter cepstrum coefficient of the second audio set;
S162, calculating a second audio fusion element according to the gamma-pass filter cepstrum coefficient of the first audio set and the gamma-pass filter cepstrum coefficient of the second audio set.
In the embodiment of the present invention, in S162, the calculation formula of the second audio fusion element x_2 is:
; where F_1 represents the gamma-pass filter cepstrum coefficient of the first audio set and F_2 represents the gamma-pass filter cepstrum coefficient of the second audio set.
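The "gamma-pass filter cepstrum coefficients" correspond to what the literature usually calls Gammatone filter cepstral coefficients (GFCC). The sketch below shows one generic GFCC extraction (ERB-spaced fourth-order gammatone filters, log band energies, DCT); the filterbank parameters are assumptions, and the x_2 formula itself is not reproduced in this text. With x_1 and x_2 computed, the splicing of S17 is simply R = np.array([x1, x2]).

```python
# Sketch of S161: generic GFCC extraction over a gammatone filterbank.
import numpy as np
from scipy.signal import fftconvolve
from scipy.fft import dct

def gammatone_ir(fc: float, fs: int, duration: float = 0.05, order: int = 4) -> np.ndarray:
    """Fourth-order gammatone impulse response at centre frequency fc."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * (24.7 + 0.108 * fc)  # ERB bandwidth (Glasberg & Moore)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gfcc(signal: np.ndarray, fs: int, n_filters: int = 32, n_ceps: int = 13,
         fmin: float = 50.0) -> np.ndarray:
    fmax = fs / 2
    # ERB-rate-spaced centre frequencies between fmin and fmax
    erb = np.linspace(21.4 * np.log10(4.37e-3 * fmin + 1),
                      21.4 * np.log10(4.37e-3 * fmax + 1), n_filters)
    centre = (10 ** (erb / 21.4) - 1) / 4.37e-3
    energies = np.array([np.sum(fftconvolve(signal, gammatone_ir(fc, fs), mode="same") ** 2)
                         for fc in centre])
    # log band energies followed by a DCT give the cepstral coefficients
    return dct(np.log(energies + 1e-12), type=2, norm="ortho")[:n_ceps]
```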
In the embodiment of the present invention, in S17, the expression of the audio fusion operator R is: R = [x_1, x_2]; where [ , ] represents the splicing operation, x_1 represents the first audio fusion element, and x_2 represents the second audio fusion element.
In an embodiment of the present invention, the step S2 includes the following substeps:
s21, extracting an original gain factor of real-time voice;
S22, correcting the original gain factor by utilizing an audio fusion operator to obtain a target gain factor;
S23, performing enhancement processing on the real-time voice by using the target gain factor to obtain characteristic voice.
In the invention, because the real-time voice input by the user suffers strong noise interference, the gain factor is corrected with the audio fusion operator, and the corrected result is used to enhance the real-time voice, thereby suppressing the noise interference.
In the embodiment of the present invention, in S22, the calculation formula of the target gain factor Y is:
; where R represents the audio fusion operator, y represents the original gain factor, and ⌈·⌉ represents the rounding-up (ceiling) operation.
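A sketch of S21-S23 in the short-time Fourier transform domain is given below. Neither the extraction of the original gain factor y nor the exact correction formula for Y is reproduced in this text, so the spectral-subtraction-style gain and the R-based scaling are stand-in assumptions used only to show where the audio fusion operator enters the enhancement.

```python
# Sketch of S21-S23: estimate a gain in the STFT domain, correct it with R,
# and apply it to obtain the characteristic voice.
import numpy as np
from scipy.signal import stft, istft

def enhance(speech: np.ndarray, fs: int, R: np.ndarray, noise_frames: int = 10) -> np.ndarray:
    f, t, Z = stft(speech, fs=fs, nperseg=512)
    # S21: stand-in "original gain factor" from a Wiener-style a-posteriori SNR estimate
    noise_psd = np.mean(np.abs(Z[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    snr = np.maximum(np.abs(Z) ** 2 / (noise_psd + 1e-12) - 1.0, 0.0)
    y = snr / (snr + 1.0)
    # S22: placeholder correction of the gain by the fusion operator R (not the patent's formula)
    Y = np.clip(y * (1.0 + np.mean(R)), 0.0, 1.0)
    # S23: apply the target gain and reconstruct the characteristic voice
    _, enhanced = istft(Y * Z, fs=fs, nperseg=512)
    return enhanced
```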
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the scope of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can make various other modifications and combinations based on the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.
Claims (10)
1. A voice interaction method based on artificial intelligence, characterized by comprising the following steps:
S1, utilizing a microphone of a mobile terminal to collect real-time voice of a user, and generating an audio fusion operator corresponding to the real-time voice;
s2, processing the real-time voice by utilizing an audio fusion operator to generate characteristic voice;
s3, converting the characteristic voice into a text.
2. The artificial intelligence based voice interaction method according to claim 1, wherein S1 comprises the sub-steps of:
S11, utilizing a microphone of the mobile terminal to collect real-time voice of a user, and windowing the real-time voice of the user;
S12, splitting the windowed real-time voice into a first audio set and a second audio set;
S13, carrying out wavelet transformation on each frame of audio signals in the first audio set and the second audio set to obtain detail sub-band coefficients of each frame of audio signals;
S14, determining an audio conversion coefficient of the first audio set according to the detail sub-band coefficient of each frame of audio signal in the first audio set, and determining an audio conversion coefficient of the second audio set according to the detail sub-band coefficient of each frame of audio signal in the second audio set;
S15, determining a first audio fusion element according to the audio conversion coefficient of the first audio set and the audio conversion coefficient of the second audio set;
S16, determining a second audio fusion element;
s17, generating an audio fusion operator according to the determined first audio fusion element and the determined second audio fusion element.
3. The artificial intelligence based voice interaction method according to claim 2, wherein in S11, the audio length of the real-time voice is used as the window width of the windowing function, and the windowing function is used to perform the windowing processing on the real-time voice of the user;
The expression of the windowing function Z (k) is as follows:
; where S represents the window width of the windowing function and k represents the unit window width number of the windowing function.
4. The artificial intelligence-based voice interaction method according to claim 2, wherein in S14, the calculation formula of the audio transform coefficient c_1 of the first audio set is:
; wherein p_m represents the detail sub-band coefficients of the m-th frame of audio signal in the first audio set, p_{m-1} represents the detail sub-band coefficients of the (m-1)-th frame, p_{m+1} represents the detail sub-band coefficients of the (m+1)-th frame, exp(·) represents the exponential function, and M represents the total number of frames of audio signals in the first audio set;
in S14, the calculation formula of the audio transform coefficient c_2 of the second audio set is:
; wherein p_n represents the detail sub-band coefficients of the n-th frame of audio signal in the second audio set, p_{n-1} represents the detail sub-band coefficients of the (n-1)-th frame, p_{n+1} represents the detail sub-band coefficients of the (n+1)-th frame, and N represents the total number of frames of audio signals in the second audio set.
5. The artificial intelligence-based voice interaction method according to claim 2, wherein in S15, the calculation formula of the first audio fusion element x_1 is:
; wherein c_1 denotes the audio transform coefficient of the first audio set, c_2 denotes the audio transform coefficient of the second audio set, M denotes the total number of frames of audio signals in the first audio set, N denotes the total number of frames of audio signals in the second audio set, and ⌈·⌉ represents the rounding-up (ceiling) operation.
6. The artificial intelligence based voice interaction method according to claim 2, wherein S16 comprises the sub-steps of:
S161, extracting a gamma-pass filter cepstrum coefficient of the first audio set and a gamma-pass filter cepstrum coefficient of the second audio set;
S162, calculating a second audio fusion element according to the gamma-pass filter cepstrum coefficient of the first audio set and the gamma-pass filter cepstrum coefficient of the second audio set.
7. The artificial intelligence-based voice interaction method according to claim 6, wherein in S162, the calculation formula of the second audio fusion element x_2 is:
; wherein F_1 represents the gamma-pass filter cepstrum coefficient of the first audio set and F_2 represents the gamma-pass filter cepstrum coefficient of the second audio set.
8. The artificial intelligence based voice interaction method according to claim 2, wherein in S17, the expression of the audio fusion operator R is: R = [x_1, x_2]; wherein [ , ] represents the splicing operation, x_1 represents the first audio fusion element, and x_2 represents the second audio fusion element.
9. The artificial intelligence based voice interaction method according to claim 1, wherein S2 comprises the sub-steps of:
s21, extracting an original gain factor of real-time voice;
S22, correcting the original gain factor by utilizing an audio fusion operator to obtain a target gain factor;
S23, performing enhancement processing on the real-time voice by using the target gain factor to obtain characteristic voice.
10. The artificial intelligence based voice interaction method according to claim 9, wherein in S22, the calculation formula of the target gain factor Y is:
; wherein R represents the audio fusion operator, y represents the original gain factor, and ⌈·⌉ represents the rounding-up (ceiling) operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410764506.3A CN118335089B (en) | 2024-06-14 | 2024-06-14 | Speech interaction method based on artificial intelligence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118335089A true CN118335089A (en) | 2024-07-12 |
CN118335089B CN118335089B (en) | 2024-09-10 |
Family
ID=91777446
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410764506.3A Active CN118335089B (en) | 2024-06-14 | 2024-06-14 | Speech interaction method based on artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118335089B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013164572A (en) * | 2012-01-10 | 2013-08-22 | Toshiba Corp | Voice feature quantity extraction device, voice feature quantity extraction method, and voice feature quantity extraction program |
US20140188487A1 (en) * | 2011-06-06 | 2014-07-03 | Bridge Mediatech, S.L. | Method and system for robust audio hashing |
CN115440217A (en) * | 2022-08-29 | 2022-12-06 | 西安讯飞超脑信息科技有限公司 | Voice recognition method, device, equipment and storage medium |
CN115910034A (en) * | 2022-09-30 | 2023-04-04 | 兴业银行股份有限公司 | Deep learning-based speech language identification method and system |
CN116580706A (en) * | 2023-07-14 | 2023-08-11 | 合肥朗永智能科技有限公司 | Speech recognition method based on artificial intelligence |
CN117746892A (en) * | 2023-12-18 | 2024-03-22 | 国网福建省电力有限公司 | Transformer voiceprint fault identification method and equipment based on wavelet transformation |
Also Published As
Publication number | Publication date |
---|---|
CN118335089B (en) | 2024-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022083083A1 (en) | Sound conversion system and training method for same | |
US6691090B1 (en) | Speech recognition system including dimensionality reduction of baseband frequency signals | |
CN111508498A (en) | Conversational speech recognition method, system, electronic device and storage medium | |
CN106098078A (en) | A kind of audio recognition method that may filter that speaker noise and system thereof | |
CN111883135A (en) | Voice transcription method and device and electronic equipment | |
CN107274887A (en) | Speaker's Further Feature Extraction method based on fusion feature MGFCC | |
CN114495969A (en) | Voice recognition method integrating voice enhancement | |
CN113838471A (en) | Noise reduction method and system based on neural network, electronic device and storage medium | |
CN113782044B (en) | Voice enhancement method and device | |
CN111681649B (en) | Speech recognition method, interaction system and achievement management system comprising system | |
CN118335089B (en) | Speech interaction method based on artificial intelligence | |
CN113269305A (en) | Feedback voice strengthening method for strengthening memory | |
CN117041430A (en) | Method and device for improving outbound quality and robustness of intelligent coordinated outbound system | |
Kaur et al. | Optimizing feature extraction techniques constituting phone based modelling on connected words for Punjabi automatic speech recognition | |
CN116564286A (en) | Voice input method and device, storage medium and electronic equipment | |
Zhou et al. | Environmental sound classification of western black-crowned gibbon habitat based on spectral subtraction and VGG16 | |
CN111883178B (en) | Double-channel voice-to-image-based emotion recognition method | |
CN114827363A (en) | Method, device and readable storage medium for eliminating echo in call process | |
Zheng et al. | Bandwidth extension WaveNet for bone-conducted speech enhancement | |
Cherukuru et al. | CNN-based noise reduction for multi-channel speech enhancement system with discrete wavelet transform (DWT) preprocessing | |
US12094484B2 (en) | General speech enhancement method and apparatus using multi-source auxiliary information | |
Pan et al. | Application of hidden Markov models in speech command recognition | |
CN117909486B (en) | Multi-mode question-answering method and system based on emotion recognition and large language model | |
CN113160816A (en) | Man-machine interaction method based on neural network VAD algorithm | |
CN118571212B (en) | Speech recognition method and device of intelligent earphone, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |