CN116631392A - Voice interaction processing method and device, electronic equipment and storage medium - Google Patents

Voice interaction processing method and device, electronic equipment and storage medium

Info

Publication number
CN116631392A
Authority
CN
China
Prior art keywords
voice
microphone
microphones
loudness
main microphone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310477203.9A
Other languages
Chinese (zh)
Inventor
王英茂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Oppo Communication Technology Co., Ltd.
Original Assignee
Xi'an Oppo Communication Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Oppo Communication Technology Co., Ltd.
Priority to CN202310477203.9A
Publication of CN116631392A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166 Microphone arrays; Beamforming
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 29/00 Monitoring arrangements; Testing arrangements
    • H04R 29/004 Monitoring arrangements; Testing arrangements for microphones
    • H04R 29/005 Microphone arrays

Abstract

The application relates to a voice interaction processing method and apparatus, an electronic device, and a storage medium. The method includes the following steps: determining the loudness of the voice signals respectively received by at least two microphones; determining a loudness difference between the loudness of the voice signals received by the microphones; selecting, from the microphones, the microphone with the largest loudness of the voice signal as a main microphone when the loudness difference is greater than or equal to a preset threshold; performing voice command recognition processing on the voice signal of the main microphone to obtain a voice command in the voice signal of the main microphone; and carrying out an interactive response according to the voice command. By adopting the method, voice interaction efficiency can be improved.

Description

Voice interaction processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for processing voice interaction, an electronic device, and a storage medium.
Background
With the development of speech recognition technology, electronic devices with a human-machine voice interaction function, such as mobile phones and smart speakers, are becoming increasingly popular. The user may speak to the electronic device, and the electronic device may interactively respond to the user's instructions by recognizing the voice signal input by the user.
In a conventional voice interaction method, a specific wake-up word is generally preset, and only after the device recognizes the wake-up word from a voice signal does it respond to the user's subsequent voice operations. With this method, the user must speak the wake-up word before every voice interaction before the interaction can proceed, resulting in low voice interaction efficiency.
Disclosure of Invention
The embodiment of the application provides a voice interaction processing method, a voice interaction processing device, electronic equipment, a computer readable storage medium and a computer program product, which can improve voice interaction efficiency.
In a first aspect, the present application provides a method for processing voice interaction. The method comprises the following steps:
determining the loudness of the voice signals respectively received by the at least two microphones;
determining a loudness difference between the loudness of the speech signals received by the microphones;
selecting a microphone with the largest loudness of the voice signal from the microphones as a main microphone when the loudness difference is larger than or equal to a preset threshold value;
performing voice instruction recognition processing on the voice signal of the main microphone to obtain a voice instruction in the voice signal of the main microphone;
And carrying out interactive response according to the voice command.
In a second aspect, the application also provides a voice interaction processing device. The device comprises:
the loudness difference determining module is used for determining the loudness of the voice signals respectively received by the at least two microphones; determining a loudness difference between the loudness of the speech signals received by the microphones;
the main microphone determining module is used for selecting a microphone with the largest loudness of the voice signals from the microphones as a main microphone when the loudness difference is larger than or equal to a preset threshold value;
the voice command recognition module is used for carrying out voice command recognition processing on the voice signal of the main microphone to obtain a voice command in the voice signal of the main microphone;
and the interaction response module is used for carrying out interaction response according to the voice instruction.
In a third aspect, the application further provides electronic equipment. The electronic device comprises a memory and a processor, the memory stores a computer program, and the processor executes the computer program to realize the following steps:
determining the loudness of the voice signals respectively received by the at least two microphones;
Determining a loudness difference between the loudness of the speech signals received by the microphones;
selecting a microphone with the largest loudness of the voice signal from the microphones as a main microphone when the loudness difference is larger than or equal to a preset threshold value;
performing voice instruction recognition processing on the voice signal of the main microphone to obtain a voice instruction in the voice signal of the main microphone;
and carrying out interactive response according to the voice command.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
determining the loudness of the voice signals respectively received by the at least two microphones;
determining a loudness difference between the loudness of the speech signals received by the microphones;
selecting a microphone with the largest loudness of the voice signal from the microphones as a main microphone when the loudness difference is larger than or equal to a preset threshold value;
performing voice instruction recognition processing on the voice signal of the main microphone to obtain a voice instruction in the voice signal of the main microphone;
And carrying out interactive response according to the voice command.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
determining the loudness of the voice signals respectively received by the at least two microphones;
determining a loudness difference between the loudness of the speech signals received by the microphones;
selecting a microphone with the largest loudness of the voice signal from the microphones as a main microphone when the loudness difference is larger than or equal to a preset threshold value;
performing voice instruction recognition processing on the voice signal of the main microphone to obtain a voice instruction in the voice signal of the main microphone;
and carrying out interactive response according to the voice command.
According to the voice interaction processing method and apparatus, the electronic device, the computer-readable storage medium and the computer program product, the loudness of the voice signals received by at least two microphones is determined, and the loudness difference between the loudness of the voice signals received by the microphones is determined. When the loudness difference is greater than or equal to the preset threshold, this indicates that the user is speaking at close range and is very likely to be using the voice interaction function. Therefore, the microphone with the largest loudness of the voice signal is selected from the microphones as the main microphone, voice command recognition processing is performed directly on the voice signal of the main microphone to obtain the voice command in that signal, and an interactive response is then made according to the voice command. Because the wake-up instruction in the voice signal does not need to be recognized first, the voice wake-up step is omitted and voice interaction efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an application environment diagram of a voice interaction processing method in one embodiment;
FIG. 2 is a flow chart of a method of voice interaction processing in one embodiment;
FIG. 3 is a flow diagram of a method of voice interaction processing in one embodiment;
FIG. 4 is a block diagram of a voice interaction processing apparatus in one embodiment;
FIG. 5 is a block diagram of a voice interaction processing apparatus according to another embodiment;
FIG. 6 is a block diagram showing a structure of a voice interaction processing apparatus according to still another embodiment;
FIG. 7 is a block diagram showing a structure of a voice interaction processing apparatus according to still another embodiment;
fig. 8 is an internal structural diagram of an electronic device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The voice interaction processing method provided by the embodiments of the application can be applied to the application environment shown in FIG. 1, in which at least two microphones are provided on the electronic device 102; for example, a microphone 1022 and a microphone 1024 are provided on the electronic device 102 in FIG. 1. The user may speak to the electronic device 102 to carry out voice interaction between the user and the electronic device 102. The electronic device 102 may perform the voice interaction processing method in the embodiments of the present application to interact with the user by voice. The electronic device 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, an Internet of Things device, a portable wearable device, and the like; the Internet of Things device may be a smart speaker, a smart television, a smart air conditioner, a smart vehicle-mounted device, and the like, and the portable wearable device may be a smart watch, a smart bracelet, a head-mounted device, and the like.
In some embodiments, as shown in fig. 2, a voice interaction processing method is provided, and the method is applied to the electronic device 102 in fig. 1 for illustration, and includes steps 202 to 210. Wherein:
step 202, determining the loudness of speech signals received by at least two microphones, respectively.
In some embodiments, at least two microphones on the electronic device may each receive a speech signal, and the electronic device may determine the loudness of the speech signals respectively received by the at least two microphones.
In some embodiments, the distance or orientation between at least two microphones on the electronic device meets a preset condition.
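As an illustration of step 202 (not part of the patent, which does not specify how loudness is measured), the following sketch estimates the loudness of the most recent frame from each microphone as an RMS level in dBFS; the frame length, sample format and dBFS reference are assumptions.

```python
# Hedged sketch: per-microphone loudness as frame RMS in dBFS (assumed measure).
import numpy as np

def frame_loudness_db(frame: np.ndarray) -> float:
    """RMS loudness of one audio frame in dBFS (full scale = 1.0)."""
    rms = np.sqrt(np.mean(np.square(frame.astype(np.float64))))
    return 20.0 * np.log10(max(rms, 1e-12))  # floor avoids log(0) on silence

def microphone_loudness(frames: list) -> list:
    """Loudness of the voice signal received by each of the (at least two) microphones."""
    return [frame_loudness_db(np.asarray(f)) for f in frames]

# Example: two microphones, the first closer to the speaker (louder signal).
t = np.linspace(0.0, 0.02, 320, endpoint=False)            # 20 ms at 16 kHz (assumed)
near = 0.5 * np.sin(2 * np.pi * 440.0 * t)
far = 0.05 * np.sin(2 * np.pi * 440.0 * t)
print(microphone_loudness([near, far]))                     # roughly [-9.0, -29.0] dBFS
```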
Step 204 determines the loudness differences between the loudness of the speech signals received by the microphones.
In some embodiments, where the number of microphones is greater than 2, the electronic device may determine the loudness difference between each pair of microphones, i.e., the pairwise differences between their loudness values.
In some embodiments, where the number of microphones is 2, the electronic device may determine the loudness difference between the two microphones.
In step 206, when the loudness difference is greater than or equal to the preset threshold, the microphone with the largest loudness of the voice signal is selected from the microphones as the main microphone.
In some embodiments, in a case where the number of microphones is greater than 2 and there is a loudness difference greater than or equal to a preset threshold value among the loudness differences between two microphones, the electronic device may select the microphone with the largest loudness of the speech signal from among the microphones as the primary microphone.
In some embodiments, where the number of microphones is 2 and the loudness difference between the two microphones is greater than or equal to a preset threshold, the electronic device may select, as the primary microphone, the microphone having the largest loudness of the speech signal from the two microphones. For example: the loudness of the voice signals of the two microphones is A and B respectively, and the preset threshold is N; in the case where |A-B| ≥ N (that is, A-B ≥ N or B-A ≥ N), the electronic device may select the microphone with the larger loudness of the voice signal from the two microphones as the main microphone.
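A minimal sketch of the two-microphone selection just described, assuming loudness values A and B and threshold N as in the example above; the function name and the None return convention are illustrative assumptions.

```python
from typing import Optional

def select_primary_two_mics(loudness_a: float, loudness_b: float,
                            threshold_n: float) -> Optional[int]:
    """Return 0 for the first microphone, 1 for the second, or None if |A-B| < N."""
    if abs(loudness_a - loudness_b) >= threshold_n:   # A-B >= N or B-A >= N
        return 0 if loudness_a > loudness_b else 1    # the louder microphone wins
    return None                                       # no primary microphone selected here

print(select_primary_two_mics(-9.0, -29.0, threshold_n=6.0))   # -> 0 (first microphone)
print(select_primary_two_mics(-9.0, -10.0, threshold_n=6.0))   # -> None (below threshold)
```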
Step 208, performing voice command recognition processing on the voice signal of the main microphone to obtain a voice command in the voice signal of the main microphone.
The voice command recognition process is a process of recognizing a voice command in a voice signal.
In some embodiments, when the loudness difference is greater than or equal to the preset threshold, the electronic device may directly perform a voice command recognition process on the voice signal of the main microphone after selecting, from among the microphones, the microphone with the largest loudness of the voice signal as the main microphone, to obtain a voice command in the voice signal of the main microphone.
In some embodiments, in the voice command recognition process, the electronic device may perform voice recognition processing on the voice signal of the main microphone, determine information included in the voice signal, and then perform semantic understanding on the information included in the voice signal to obtain the voice command in the voice signal of the main microphone.
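The two-stage recognition described above (speech recognition followed by semantic understanding) can be sketched as below; both stages are stand-ins rather than the patent's implementation, and the keyword-to-command table is purely an assumption for illustration.

```python
from typing import Optional

def transcribe(voice_signal) -> str:
    # Stand-in for the speech recognition stage; a real system would call an ASR engine.
    return "what is the weather today"

def understand(text: str) -> Optional[str]:
    # Stand-in for the semantic understanding stage: map recognized text to a voice command.
    keyword_to_command = {"weather": "report_weather", "alarm": "set_alarm"}
    for keyword, command in keyword_to_command.items():
        if keyword in text.lower():
            return command
    return None

voice_command = understand(transcribe(voice_signal=None))
print(voice_command)   # -> "report_weather"
```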
Step 210, performing interactive response according to the voice command.
In some embodiments, the type of interactive response may include a voice response, an operation response, or the like. A voice response means that the electronic device responds to the voice command by playing speech. An operation response means that the electronic device responds to the voice command by performing a corresponding operation in the system.
In some embodiments, the electronic device may determine the type of interactive response based on the voice command. For example: if the information in the voice signal is "What is the weather today?", the electronic device may determine that the voice command is "report the weather" and, in a voice response manner, play the speech "Sunny today, temperature 20 to 25 degrees...". As another example: if the information in the voice signal is "Please set an alarm for 10 o'clock", the electronic device may determine that the voice command is "set an alarm for 10 o'clock" and then, in an operation response manner, perform the operation of setting the alarm in the system.
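A hedged sketch of step 210: dispatching the recognized voice command to a voice response or an operation response, matching the two examples above. The command names and handler bodies are assumptions, not an API defined by the patent.

```python
def play_speech(text: str) -> None:
    print(f"[TTS] {text}")                                      # stand-in for speech playback

def set_system_alarm(hour: int, minute: int) -> None:
    print(f"[SYSTEM] alarm set for {hour:02d}:{minute:02d}")    # stand-in for a system operation

def interactive_response(voice_command: str) -> None:
    if voice_command == "report_weather":
        # Voice response: reply by playing speech.
        play_speech("Sunny today, temperature 20 to 25 degrees.")
    elif voice_command == "set_alarm":
        # Operation response: perform the corresponding operation in the system.
        set_system_alarm(hour=10, minute=0)
    else:
        play_speech("Sorry, I did not understand that.")

interactive_response("report_weather")
interactive_response("set_alarm")
```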
In some embodiments, the electronic device may display, in its usage instructions, prompt information that guides the user to speak at close range, so as to trigger the direct voice command recognition processing.
In the above voice interaction processing method, the loudness of the voice signals received by at least two microphones is determined, and the loudness difference between these loudness values is determined. When the loudness difference is greater than or equal to the preset threshold, this indicates that the user is speaking at close range and is very likely to be using the voice interaction function, so the microphone with the largest loudness is selected as the main microphone and voice command recognition is performed directly on its voice signal, omitting the voice wake-up step and improving voice interaction efficiency. It can be understood that, because the voice wake-up step can be omitted and the voice command recognized directly when the user speaks at close range, the interaction is more efficient and convenient; therefore, the user may be prompted to speak at close range whenever the voice interaction function is used, which further improves voice interaction efficiency.
In some embodiments, the method further comprises: and according to the voice signals respectively received by the microphones, performing spatial noise reduction on the voice signals of the main microphone to obtain noise-reduced voice signals. Performing voice command recognition processing on the voice signal of the main microphone to obtain a voice command in the voice signal of the main microphone, including: and carrying out voice instruction recognition processing on the voice signal subjected to the noise reduction processing to obtain a voice instruction in the voice signal of the main microphone.
The spatial noise reduction processing refers to noise reduction processing according to phase information of voice signals received by a plurality of microphones.
In some embodiments, when the loudness difference is greater than or equal to the preset threshold, the electronic device may select, from among the microphones, the microphone with the largest loudness of the speech signal as the main microphone, then perform spatial noise reduction processing on the speech signal of the main microphone according to the speech signals respectively received by the microphones to obtain a noise-reduced speech signal, and then directly perform speech command recognition processing on the noise-reduced speech signal to obtain the speech command in the speech signal of the main microphone. For example, in the case of two microphones, as shown in FIG. 3, when |A-B| ≥ N (i.e., A-B ≥ N or B-A ≥ N), the electronic device may select the microphone with the larger loudness from the two microphones as the main microphone, then perform spatial noise reduction processing on the speech signal of the main microphone, and then perform speech command recognition processing on the spatially noise-reduced speech signal.
In some embodiments, after the spatial noise reduction processing, the electronic device may further perform conventional noise reduction processing on the spatial noise reduction processed voice signal to obtain a noise reduction processed voice signal, and perform voice command recognition processing on the noise reduction processed voice signal to obtain a voice command in the voice signal of the main microphone. The conventional noise reduction process may be a steady-state noise reduction method, for filtering steady-state noise.
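The patent only names the conventional step as a steady-state (stationary) noise reduction method. As a hedged illustration of one such method, the sketch below uses basic spectral subtraction: a stationary noise floor is estimated from the quietest frames and subtracted from each frame's magnitude spectrum. The frame length, the non-overlapping framing, and the quietest-10%-of-frames heuristic are assumptions.

```python
import numpy as np

def stationary_denoise(signal: np.ndarray, frame_len: int = 256) -> np.ndarray:
    """Basic spectral subtraction as one possible steady-state noise reduction step."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    mags = np.abs(spectra)

    # Estimate the stationary noise floor from the lowest-energy 10% of frames.
    energies = mags.sum(axis=1)
    quiet = mags[energies <= np.quantile(energies, 0.1)]
    noise_floor = quiet.mean(axis=0)

    # Subtract the noise floor, keep the original phase, and resynthesize the frames.
    cleaned = np.maximum(mags - noise_floor, 0.0) * np.exp(1j * np.angle(spectra))
    return np.fft.irfft(cleaned, n=frame_len, axis=1).reshape(-1)
```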
In some embodiments, the electronic device may determine the environmental noise according to the time information corresponding to the same waveform in the voice signals of the microphones, and then remove the environmental noise from the voice signals of the main microphone, so as to perform spatial noise reduction on the voice signals of the main microphone, and obtain the voice signals after the noise reduction. In some embodiments, the same waveform may be of the same amplitude or of the same frequency.
In other embodiments, the electronic device may determine a main direction of the voice signal of the main microphone according to the phase information of the voice signals of the microphones, and enhance the voice signal in the main direction in the voice signal of the main microphone, so as to perform spatial noise reduction on the voice signal of the main microphone, and obtain the noise-reduced voice signal.
In the above embodiment, the spatial noise reduction processing is performed on the voice signals of the main microphone according to the voice signals respectively received by each microphone, so that the signal to noise ratio of the voice signals of the main microphone can be greatly improved by combining the voice signals of a plurality of microphones, thereby improving the quality of the voice signals, further improving the success rate of voice recognition based on the voice signals with high quality, and optimizing the performance of voice interaction. It can be understood that, because the difference of phase information between the voice signals of each microphone is larger when the user speaks in a short distance, the voice signals of the main microphone can be spatially noise-reduced according to the voice signals respectively received by each microphone, so that the success rate of voice recognition can be improved, and the performance of voice interaction can be optimized.
In some embodiments, according to the voice signals received by the microphones, performing spatial noise reduction on the voice signals of the main microphone to obtain noise-reduced voice signals, including: determining time information corresponding to the same waveform in the voice signals respectively received by the microphones; if the time information corresponding to the target waveform in the voice signal of the main microphone is later than the time information corresponding to the target waveform in the voice signal of the auxiliary microphone, determining the signal corresponding to the target waveform as environmental noise; the auxiliary microphone is a microphone except the main microphone in each microphone; and removing the environmental noise from the voice signal of the main microphone to obtain a voice signal after noise reduction treatment.
It can be understood that, if the time information corresponding to the target waveform in the voice signal of the main microphone is later than the time information corresponding to the target waveform in the voice signal of each auxiliary microphone, this indicates that the signal corresponding to the target waveform reached each auxiliary microphone first and the main microphone last, and therefore that the signal corresponding to the target waveform is ambient noise.
In some embodiments, in a case where the number of the auxiliary microphones is 1 (i.e., the total number of microphones is 2), if there is time information corresponding to the target waveform in the voice signal of the main microphone later than the time information corresponding to the target waveform in the voice signal of the auxiliary microphone, the electronic device may determine the signal corresponding to the target waveform as the environmental noise.
In other embodiments, in a case where the number of the auxiliary microphones is greater than 1 (i.e., the total number of microphones is greater than 2), if there is time information corresponding to the target waveform in the voice signal of the main microphone later than the time information corresponding to the target waveform in the voice signal of each auxiliary microphone, the electronic device may determine the signal corresponding to the target waveform as the environmental noise.
In some embodiments, the time information may be any one of a time period or a point in time.
In some embodiments, the electronic device may determine the time period corresponding to the same waveform in the voice signals of the microphones, and if the time period corresponding to the target waveform in the voice signal of the primary microphone is later than the time period corresponding to the target waveform in the voice signal of the secondary microphone, determine the signal corresponding to the target waveform to be ambient noise.
In other embodiments, the electronic device may determine the start time point or the end time point corresponding to the same waveform in the voice signals of the microphones, and if the start time point corresponding to the target waveform in the voice signal of the primary microphone is later than the start time point corresponding to the target waveform in the voice signal of the secondary microphone, or if the end time point corresponding to the target waveform in the voice signal of the primary microphone is later than the end time point corresponding to the target waveform in the voice signal of the secondary microphone, determine the signal corresponding to the target waveform to be ambient noise.
In the above embodiment, the time information corresponding to the same waveform in the voice signals received by the microphones is determined; if the time information corresponding to the target waveform in the voice signal of the main microphone is later than that in the voice signal of the auxiliary microphone, the signal corresponding to the target waveform is determined to be ambient noise and is removed from the voice signal of the main microphone. This achieves accurate spatial noise reduction of the voice signal of the main microphone: by combining the voice signals of multiple microphones, the signal-to-noise ratio of the main microphone's voice signal is improved, the quality of the voice signal is improved, the success rate of voice recognition based on the high-quality voice signal is increased, and the performance of voice interaction is optimized.
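A minimal sketch of the later-arrival criterion described above, under the assumption that the time information of the target waveform can be located in each microphone's signal by cross-correlation; the correlation-based localisation and the subtraction used for removal are illustrative assumptions, since the patent only specifies comparing arrival times and removing the ambient noise from the main microphone's signal.

```python
import numpy as np

def arrival_index(signal: np.ndarray, waveform: np.ndarray) -> int:
    """Start index (time information) at which `waveform` best matches `signal`."""
    corr = np.correlate(signal, waveform, mode="valid")
    return int(np.argmax(corr))

def spatial_denoise(primary: np.ndarray, auxiliaries: list, waveform: np.ndarray) -> np.ndarray:
    waveform = np.asarray(waveform, dtype=np.float64)
    t_primary = arrival_index(primary, waveform)
    t_aux = [arrival_index(np.asarray(aux), waveform) for aux in auxiliaries]

    cleaned = np.array(primary, dtype=np.float64)       # copy of the main microphone signal
    if all(t_primary > t for t in t_aux):
        # The target waveform reaches the main microphone later than every auxiliary
        # microphone, so it is treated as ambient noise and removed from the main signal.
        cleaned[t_primary:t_primary + len(waveform)] -= waveform
    return cleaned
```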
In some embodiments, in a case where there is a loudness difference greater than or equal to a preset threshold, selecting, from among the microphones, a microphone with a loudness of the speech signal that is the largest as the primary microphone, including: under the condition that the target loudness difference is larger than or equal to a preset threshold value, selecting a microphone with the largest loudness of the voice signal from all microphones as a main microphone; the target loudness difference is the maximum value among the loudness differences.
In some embodiments, when the maximum value of the loudness differences is greater than or equal to a preset threshold, a microphone with the largest loudness of the voice signal is selected from the microphones as a main microphone, and then voice command recognition processing is performed on the voice signal of the main microphone to obtain a voice command in the voice signal of the main microphone, and interactive response is performed according to the voice command.
In the above embodiment, the maximum value among the loudness differences is compared with the preset threshold, so that it can be determined more accurately that the user is speaking at close range and is very likely to be using the voice interaction function. This makes it possible to decide more accurately whether to perform voice command recognition processing directly on the voice signal of the main microphone, improving voice interaction performance.
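For more than two microphones, the embodiment above takes the target loudness difference as the maximum of the pairwise loudness differences; a sketch under that reading follows, with function and parameter names assumed for illustration.

```python
from itertools import combinations
from typing import Optional

def select_primary_multi_mic(loudness_db: list, threshold_db: float) -> Optional[int]:
    """Return the index of the loudest microphone if the target (maximum pairwise)
    loudness difference reaches the preset threshold, otherwise None."""
    target_diff = max(abs(a - b) for a, b in combinations(loudness_db, 2))
    if target_diff >= threshold_db:
        return max(range(len(loudness_db)), key=lambda i: loudness_db[i])
    return None

print(select_primary_multi_mic([-9.0, -20.0, -28.0], threshold_db=6.0))   # -> 0
print(select_primary_multi_mic([-9.0, -10.0, -11.0], threshold_db=6.0))   # -> None
```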
In some embodiments, each microphone includes a first microphone and a second microphone; the target loudness difference is a loudness difference between the loudness of the speech signals received by the first microphone and the second microphone, respectively.
In some embodiments, in the case that the total number of microphones is 2 and the loudness difference between the voice signals received by the first microphone and the second microphone is greater than or equal to the preset threshold, the electronic device may select, from the first microphone and the second microphone, the microphone with the larger loudness of the voice signal as the main microphone, then perform voice command recognition processing on the voice signal of the main microphone to obtain the voice command in the voice signal of the main microphone, and perform an interactive response according to the voice command.
For example: as shown in FIG. 3, the loudness of the voice signals of the first microphone a and the second microphone b is A and B respectively, and the preset threshold is N. In the case where |A-B| ≥ N: if A > B (i.e., A-B ≥ N), the first microphone a is determined to be the main microphone; if B > A (i.e., B-A ≥ N), the second microphone b is determined to be the main microphone.
In the above embodiment, when the loudness difference between the loudness of the voice signals received by the first microphone and the second microphone is greater than or equal to the preset threshold, the microphone with the larger loudness is selected from the first microphone and the second microphone as the main microphone. This makes it possible to determine more accurately that the user is speaking at close range and is very likely to be using the voice interaction function, so that direct voice command recognition on the voice signal of the main microphone can be enabled more accurately, improving the performance of voice interaction.
In some embodiments, the above method further comprises: under the condition that the loudness differences are smaller than a preset threshold value, any microphone is selected from the microphones to serve as a main microphone; performing wake-up instruction recognition processing on the voice signal of the main microphone; under the condition that the wake-up instruction exists in the voice signal of the main microphone, acquiring the voice signal received by the main microphone again; and carrying out voice instruction recognition processing on the voice signal received by the main microphone again to obtain a voice instruction in the received voice signal again, and carrying out interactive response according to the voice instruction.
The wake-up instruction recognition process is a process of recognizing a wake-up instruction in a speech signal. The wake-up instruction is a preset instruction for waking up the electronic equipment to perform voice instruction recognition processing.
In some embodiments, one of the microphones may be preset as the default primary microphone. In the case where the loudness differences are all smaller than the preset threshold, the electronic device may use a microphone preset as a default primary microphone as the primary microphone. For example: in the case where the number of microphones is 2, assuming that the first microphone is a and the second microphone is b, and microphone a is preset as a default primary microphone, the electronic device may use microphone a as the primary microphone if the loudness difference between microphone a and microphone b is smaller than a preset threshold.
In other embodiments, the electronic device may arbitrarily select one microphone from the microphones as the primary microphone in the event that the loudness differences are both less than a preset threshold. For example: in the case where the number of microphones is 2, assuming that the first microphone is a and the second microphone is b, the electronic device may arbitrarily select either microphone a or microphone b as the primary microphone in the case where the difference in loudness between microphone a and microphone b is less than the preset threshold.
In some embodiments, in the wake-up instruction recognition processing, the electronic device may perform voice recognition on the voice signal of the main microphone, determine the information in the voice signal of the main microphone, and then identify whether the information in the voice signal includes a preset wake-up instruction. If so, the electronic device may determine that a wake-up instruction exists in the voice signal of the main microphone; if not, the electronic device may determine that no wake-up instruction exists in the voice signal of the main microphone. For example: the preset wake-up instruction is "Xiao Ming Xiao Ming", and the information recognized from the voice signal of the main microphone includes "Xiao Ming Xiao Ming", so the electronic device can determine that a wake-up instruction exists in the voice signal.
In some embodiments, under the condition that a wake-up instruction exists in the voice signal of the main microphone, the electronic device may wake up the function of performing voice instruction recognition processing, acquire the voice signal received again by the main microphone, perform voice instruction recognition processing on the voice signal received again by the main microphone, obtain the voice instruction in the received voice signal again, and perform interactive response according to the voice instruction.
In some embodiments, in the case that a wake-up instruction exists in the voice signal of the main microphone, the electronic device may first play the voice of a wake-up response, then acquire the voice signal received again by the main microphone, and perform the voice command recognition processing on the voice signal received again by the main microphone. The voice of the wake-up response is a voice used to respond to the wake-up instruction. For example: the electronic device may play a wake-up response such as "Yes, I'm here".
In some embodiments, in the case that no wake-up instruction exists in the voice signal of the main microphone, the electronic device may return to the step of identifying whether the audio signals respectively received by the at least two microphones are voice signals and the subsequent steps, so that, when the newly received audio signals are voice signals, the step of determining the loudness of the voice signals respectively received by the at least two microphones and the subsequent steps are performed.
In some embodiments, under the condition that the loudness differences are smaller than the preset threshold, the electronic device may perform conventional noise reduction processing on the voice signal of the main microphone to obtain a voice signal after the conventional noise reduction processing, and then perform wake-up instruction recognition processing on the voice signal after the conventional noise reduction processing.
For example: in the case of two microphones, as shown in FIG. 3, when |A-B| < N (i.e., the loudness differences are smaller than the preset threshold), one microphone is arbitrarily selected as the main microphone, conventional noise reduction processing is performed on the voice signal of the main microphone, and wake-up instruction recognition is then performed on the conventionally noise-reduced voice signal. If a wake-up instruction exists, voice command recognition processing is performed on the voice signal received again by the main microphone; if no wake-up instruction exists, the step of identifying whether the audio signal received by the microphone is a voice signal is performed again.
In the above embodiment, when the loudness differences are all smaller than the preset threshold, it indicates that the user is not speaking at close range. Therefore, wake-up instruction recognition processing is performed on the voice signal of the main microphone, and only when a wake-up instruction is recognized in the voice signal of the main microphone is voice command recognition processing performed on the voice signal received again by the main microphone, with an interactive response made according to the voice command. This improves the accuracy of the interactive response, avoids erroneous responses, and improves the performance of voice interaction.
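Pulling the two branches together, the following end-to-end sketch mirrors the flow of FIG. 3 as described above: if |A-B| ≥ N the louder microphone becomes the main microphone and the voice command is recognized directly; otherwise either microphone is taken as the main microphone, wake-up instruction recognition runs first, and a command is only recognized from the newly received signal after the wake word is detected. All recognition calls are stand-ins, and the wake word "Xiao Ming Xiao Ming" is assumed from the example above.

```python
from typing import Optional

WAKE_WORD = "xiao ming xiao ming"           # assumed wake word, for illustration only

def recognise_text(signal) -> str:
    return str(signal)                      # stand-in for speech recognition

def recognise_command(signal) -> Optional[str]:
    text = recognise_text(signal)
    return "report_weather" if "weather" in text.lower() else None   # stand-in NLU

def handle_voice_interaction(loudness_a: float, loudness_b: float, threshold_n: float,
                             signal_a, signal_b, next_signal) -> Optional[str]:
    if abs(loudness_a - loudness_b) >= threshold_n:          # |A-B| >= N: close-range speech
        primary = signal_a if loudness_a > loudness_b else signal_b
        return recognise_command(primary)                    # direct voice command recognition
    primary = signal_a                                       # either microphone may be chosen
    if WAKE_WORD in recognise_text(primary).lower():         # wake-up instruction recognition
        return recognise_command(next_signal)                # command from the re-acquired signal
    return None                                              # keep listening

print(handle_voice_interaction(-9.0, -29.0, 6.0, "what is the weather", "...", None))
print(handle_voice_interaction(-9.0, -10.0, 6.0, "Xiao Ming Xiao Ming", "...", "weather today"))
```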
In some embodiments, prior to determining the loudness of the speech signals received by the at least two microphones, respectively, the method further comprises: identifying whether the audio signals respectively received by the at least two microphones are voice signals; if yes, executing the following steps of determining the loudness of the voice signals respectively received by at least two microphones; if not, returning to the step of identifying whether the audio signals respectively received by the at least two microphones are voice signals for further execution.
In some embodiments, the electronic device may identify whether the audio signals respectively received by the at least two microphones are speech signals, and if so, determine the loudness of the speech signals respectively received by the at least two microphones and subsequent steps.
In some embodiments, the electronic device may identify whether the audio signals respectively received by the at least two microphones are voice signals, and if not, return to the step of identifying whether the audio signals respectively received by the at least two microphones are voice signals to continue execution.
As shown in FIG. 3, the electronic device may identify whether the audio signals received by the microphones are voice signals. If so, it determines the main microphone according to the loudness A of the voice signal of microphone a and the loudness B of the voice signal of microphone b, and performs the subsequent steps; if not, it waits and continues to identify whether the received audio signal is a voice signal.
In the above embodiment, the electronic device first identifies whether the audio signal is a voice signal, and performs the step of determining loudness and the subsequent voice interaction steps only when a voice signal is identified. This makes it possible to accurately determine whether the loudness of the voice signals received by the at least two microphones needs to be determined, avoiding waste of system resources.
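A hedged sketch of this pre-check: before measuring loudness, verify that the audio received by the microphones looks like speech. A real implementation would use a proper voice activity detector; the simple energy-plus-zero-crossing-rate heuristic and its thresholds below are assumptions for illustration only.

```python
import numpy as np

def looks_like_speech(frame: np.ndarray,
                      energy_threshold: float = 1e-4,
                      zcr_low: float = 0.02, zcr_high: float = 0.35) -> bool:
    """Crude speech check: enough energy and a plausible zero-crossing rate."""
    frame = np.asarray(frame, dtype=np.float64)
    energy = float(np.mean(np.square(frame)))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)   # crossings per sample
    return energy > energy_threshold and zcr_low < zcr < zcr_high

def all_microphones_receive_speech(frames: list) -> bool:
    """True only if every microphone's audio signal is identified as a voice signal."""
    return all(looks_like_speech(f) for f in frames)
```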
In some embodiments, the distance between the microphones is greater than or equal to a preset reference distance value.
It is understood that the microphones are disposed on the electronic device, and the distance between the microphones on the electronic device is greater than or equal to the preset reference distance value.
For example: in fig. 1, two microphones are disposed on the electronic device 102, one microphone 1022 is disposed at an upper end of the electronic device 102, and the other microphone 1024 is disposed at a lower end of the electronic device 102, so that a distance between the two microphones is greater than or equal to a preset reference distance value.
In the above embodiment, because the distance between the microphones is greater than or equal to the preset reference distance value, whether the user is speaking at close range can be determined more accurately according to the loudness difference between the loudness of the voice signals received by the microphones, so that it can be accurately determined whether the wake-up instruction recognition processing can be skipped and the voice command recognition processing performed directly, improving the performance of voice interaction.
In some embodiments, the orientations of the microphones are different.
For example: two microphones are arranged on the electronic equipment, one microphone faces upwards, and the other microphone faces downwards.
In the above embodiment, because the orientations of the microphones are different, whether the user is speaking at close range can be determined more accurately according to the loudness difference between the loudness of the voice signals received by the microphones, so as to determine whether the wake-up instruction recognition processing can be skipped and the voice command recognition processing performed directly, thereby improving the performance of voice interaction.
It should be understood that, although the steps in the flowcharts related to the above embodiments are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a voice interaction processing device for realizing the above related voice interaction processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of one or more voice interaction processing devices provided below may refer to the limitation of the voice interaction processing method described above, and will not be repeated here.
In one embodiment, as shown in fig. 4, there is provided a voice interaction processing apparatus 400, including: a loudness difference determination module 402, a primary microphone determination module 404, a speech instruction recognition module 406, and an interactive response module 408, wherein:
a loudness difference determining module 402, configured to determine loudness of speech signals received by at least two microphones respectively; a loudness difference between the loudness of the speech signals received by the microphones is determined.
The primary microphone determining module 404 is configured to, when the loudness difference is greater than or equal to a preset threshold, select, from among the microphones, a microphone with a maximum loudness of the speech signal as the primary microphone.
The voice command recognition module 406 is configured to perform voice command recognition processing on the voice signal of the main microphone, so as to obtain a voice command in the voice signal of the main microphone.
The interactive response module 408 is configured to perform interactive response according to the voice command.
In some embodiments, as shown in fig. 5, the voice interaction processing apparatus 400 further includes:
the noise reduction module 410 is configured to perform spatial noise reduction processing on the voice signals of the main microphone according to the voice signals received by the microphones, so as to obtain noise-reduced voice signals.
The voice command recognition module 406 is further configured to perform voice command recognition processing on the noise-reduced voice signal, so as to obtain a voice command in the voice signal of the main microphone.
In some embodiments, the noise reduction module 410 is further configured to determine time information corresponding to the same waveform in the voice signals respectively received by the microphones; if the time information corresponding to the target waveform in the voice signal of the main microphone is later than the time information corresponding to the target waveform in the voice signal of the auxiliary microphone, determining the signal corresponding to the target waveform as environmental noise; the auxiliary microphone is a microphone except the main microphone in each microphone; and removing the environmental noise from the voice signal of the main microphone to obtain a voice signal after noise reduction treatment.
In some embodiments, the primary microphone determining module 404 is further configured to select, from among the microphones, a microphone with a maximum loudness of the speech signal as the primary microphone if the target loudness difference is greater than or equal to a preset threshold; the target loudness difference is the maximum value among the loudness differences.
In some embodiments, each microphone includes a first microphone and a second microphone; the target loudness difference is a loudness difference between the loudness of the speech signals received by the first microphone and the second microphone, respectively.
In some embodiments, as shown in fig. 6, the voice interaction processing apparatus 400 further includes: a wake instruction identification module 412, wherein:
the primary microphone determining module 404 is further configured to arbitrarily select a microphone from the microphones as the primary microphone if the loudness differences are all smaller than a preset threshold.
A wake-up instruction recognition module 412, configured to perform wake-up instruction recognition processing on a voice signal of the main microphone; and under the condition that the wake-up instruction exists in the voice signals of the main microphone, acquiring the voice signals received by the main microphone again.
The voice command recognition module 406 is further configured to perform voice command recognition processing on the voice signal received again by the primary microphone, so as to obtain a voice command in the received voice signal.
The interactive response module 408 is further configured to perform interactive response according to the voice command.
In some embodiments, as shown in fig. 7, the voice interaction processing apparatus 400 further includes:
a pre-speech recognition module 414, configured to recognize whether the audio signals received by the at least two microphones are speech signals; if so, the loudness difference determination module 402 is notified to perform the determining of the loudness of the speech signals received by the at least two microphones, respectively, and subsequent steps. If not, returning to the step of identifying whether the audio signals respectively received by the at least two microphones are voice signals for further execution.
In some embodiments, the distance between the microphones is greater than or equal to a preset reference distance value.
In some embodiments, the orientations of the microphones are different.
With the above voice interaction processing apparatus, the loudness of the voice signals received by at least two microphones is determined, and the loudness difference between these loudness values is determined. When the loudness difference is greater than or equal to the preset threshold, this indicates that the user is speaking at close range and is very likely to be using the voice interaction function, so the microphone with the largest loudness is selected as the main microphone, voice command recognition processing is performed directly on the voice signal of the main microphone to obtain the voice command, and an interactive response is made according to the voice command. Because the wake-up instruction does not need to be recognized first, the voice wake-up step is omitted and voice interaction efficiency is improved.
The above-mentioned various modules in the voice interaction processing apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or independent of a processor in the electronic device, or may be stored in software in a memory in the electronic device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, an electronic device is provided, which may be a terminal, and an internal structure diagram thereof may be as shown in fig. 8. The electronic device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the electronic device is used to exchange information between the processor and the external device. The communication interface of the electronic device is used for conducting wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a voice interaction processing method. The display unit of the electronic device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the electronic equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the electronic device to which the present inventive arrangements are applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
The embodiment of the application also provides a computer readable storage medium. One or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform steps of a voice interaction processing method.
The embodiment of the application also provides a computer program product containing instructions, which when run on a computer, cause the computer to execute the voice interaction processing method.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (13)

1. A voice interaction processing method, comprising:
determining the loudness of voice signals respectively received by at least two microphones;
determining a loudness difference between the loudness of the voice signals received by the microphones;
selecting, from the microphones, a microphone with the largest loudness of the voice signal as a main microphone in a case where the loudness difference is greater than or equal to a preset threshold;
performing voice instruction recognition processing on the voice signal of the main microphone to obtain a voice instruction in the voice signal of the main microphone; and
performing an interactive response according to the voice instruction.
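For readability only, the selection flow recited in claim 1 can be sketched in Python as follows. The RMS-based loudness estimate, the threshold value, and the commented-out recognizer/responder helpers are assumptions introduced for this illustration; they are not part of the claimed method.

```python
import numpy as np

def rms_loudness(frame: np.ndarray) -> float:
    """Estimate the loudness of one microphone frame as its root-mean-square level."""
    x = np.asarray(frame, dtype=np.float64)
    return float(np.sqrt(np.mean(x * x)))

def select_main_microphone(frames: list, threshold: float):
    """Pick the main microphone per claim 1: compute per-microphone loudness,
    compare the loudness difference against a preset threshold, and return the
    index of the loudest microphone (or None if the condition is not met)."""
    loudness = [rms_loudness(f) for f in frames]
    loudness_difference = max(loudness) - min(loudness)
    if loudness_difference >= threshold:
        return int(np.argmax(loudness))
    return None

# Hypothetical usage (recognize_instruction / respond are placeholders):
#   main_idx = select_main_microphone([frame_mic0, frame_mic1], threshold=0.05)
#   if main_idx is not None:
#       instruction = recognize_instruction([frame_mic0, frame_mic1][main_idx])
#       respond(instruction)
```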
2. The method according to claim 1, wherein the method further comprises:
performing spatial noise reduction processing on the voice signal of the main microphone according to the voice signals respectively received by the microphones, to obtain a noise-reduced voice signal;
wherein the performing voice instruction recognition processing on the voice signal of the main microphone to obtain a voice instruction in the voice signal of the main microphone comprises:
performing voice instruction recognition processing on the noise-reduced voice signal to obtain the voice instruction in the voice signal of the main microphone.
3. The method according to claim 2, wherein the performing spatial noise reduction processing on the voice signal of the main microphone according to the voice signals respectively received by the microphones, to obtain a noise-reduced voice signal, comprises:
determining time information corresponding to the same waveform in the voice signals respectively received by the microphones;
if the time information corresponding to a target waveform in the voice signal of the main microphone is later than the time information corresponding to the target waveform in the voice signal of an auxiliary microphone, determining a signal corresponding to the target waveform as environmental noise, wherein the auxiliary microphones are microphones among the microphones other than the main microphone; and
removing the environmental noise from the voice signal of the main microphone to obtain the noise-reduced voice signal.
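Claims 2 and 3 describe spatial noise reduction driven by time of arrival: a waveform that reaches the main microphone later than an auxiliary microphone is treated as environmental noise. The sketch below is one loose interpretation; the cross-correlation delay estimate and the simple attenuation used as the "removal" step are assumptions made for illustration, not details taken from the claims.

```python
import numpy as np

def arrival_lag(main: np.ndarray, aux: np.ndarray) -> int:
    """Estimate, in samples, how much later the shared waveform reaches the main
    microphone than the auxiliary microphone (positive lag = main hears it later)."""
    corr = np.correlate(main, aux, mode="full")
    return int(np.argmax(corr) - (len(aux) - 1))

def spatial_noise_reduction(main: np.ndarray, aux_frames: list,
                            attenuation: float = 0.1) -> np.ndarray:
    """If the target waveform arrives at the main microphone later than at any
    auxiliary microphone, regard it as environmental noise and attenuate it;
    otherwise keep the main microphone signal unchanged."""
    for aux in aux_frames:
        if arrival_lag(main, aux) > 0:
            return main * attenuation   # crude stand-in for "removing the environmental noise"
    return main
```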
4. The method according to claim 1, wherein the selecting, from the microphones, a microphone with the largest loudness of the voice signal as a main microphone in a case where the loudness difference is greater than or equal to a preset threshold comprises:
selecting, from the microphones, the microphone with the largest loudness of the voice signal as the main microphone in a case where a target loudness difference is greater than or equal to the preset threshold, wherein the target loudness difference is the maximum of the loudness differences.
5. The method according to claim 4, wherein the microphones comprise a first microphone and a second microphone, and the target loudness difference is the loudness difference between the loudness of the voice signals respectively received by the first microphone and the second microphone.
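As a small numeric illustration of claims 4 and 5 (helper name assumed): with more than two microphones, the loudness values are compared pairwise and the largest difference is taken as the target loudness difference; with exactly two microphones this reduces to the single difference between them.

```python
from itertools import combinations

def target_loudness_difference(loudness: list) -> float:
    """Maximum pairwise loudness difference, equivalent to max(loudness) - min(loudness)."""
    return max(abs(a - b) for a, b in combinations(loudness, 2))

# e.g. three microphones with loudness 0.42, 0.18 and 0.21:
#   target_loudness_difference([0.42, 0.18, 0.21])  ->  0.24 (the 0.42 vs 0.18 pair)
```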
6. The method according to claim 1, wherein the method further comprises:
if the loudness differences are smaller than the preset threshold, randomly selecting a microphone from the microphones as the main microphone;
performing wake-up instruction recognition processing on the voice signal of the main microphone;
re-acquiring a voice signal received by the main microphone in a case where a wake-up instruction exists in the voice signal of the main microphone; and
performing voice instruction recognition processing on the re-acquired voice signal to obtain a voice instruction in the re-acquired voice signal, and performing an interactive response according to the voice instruction.
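A hedged sketch of the claim-6 fallback path, in which no microphone stands out by loudness. The `detect_wake_word` stub and the capture/recognize/respond callables are hypothetical placeholders introduced only for this example.

```python
import random
import numpy as np

def detect_wake_word(frame: np.ndarray) -> bool:
    """Toy wake-up instruction detector (a real system would use keyword spotting)."""
    return float(np.max(np.abs(frame))) > 0.1

def fallback_interaction(mic_frames: dict, capture_frame, recognize_instruction, respond):
    """All loudness differences are below the preset threshold, so choose a main
    microphone at random, check for the wake-up instruction, then re-acquire a
    voice signal from that microphone and recognize the voice instruction in it."""
    main_id = random.choice(list(mic_frames))
    if detect_wake_word(mic_frames[main_id]):
        new_frame = capture_frame(main_id)          # re-acquire from the main microphone
        instruction = recognize_instruction(new_frame)
        respond(instruction)
```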
7. The method according to any one of claims 1 to 6, wherein, before the determining the loudness of voice signals respectively received by at least two microphones, the method further comprises:
identifying whether audio signals respectively received by the at least two microphones are voice signals;
if yes, performing the step of determining the loudness of the voice signals respectively received by the at least two microphones and the subsequent steps; and
if not, returning to and re-executing the step of identifying whether the audio signals respectively received by the at least two microphones are voice signals.
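The pre-check of claim 7 is in effect a voice-activity gate placed in front of the loudness comparison. The claim does not say whether every microphone or any single microphone must carry speech; the energy-based detector and the "all microphones" reading below are assumptions chosen to keep the sketch self-contained.

```python
import numpy as np

def is_voice_signal(frame: np.ndarray, energy_threshold: float = 1e-4) -> bool:
    """Very rough voice-activity check based on mean frame energy
    (a deployed system would use a trained voice-activity detector)."""
    x = np.asarray(frame, dtype=np.float64)
    return float(np.mean(x * x)) > energy_threshold

def wait_for_voice(capture_all_frames):
    """Loop of claim 7: keep capturing frames from the microphones until they are
    identified as voice signals, then hand them on to the loudness step."""
    while True:
        frames = capture_all_frames()               # audio frames from the >= 2 microphones
        if all(is_voice_signal(f) for f in frames):
            return frames                           # proceed to determining the loudness
        # otherwise return to the identification step and capture again
```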
8. The method according to any one of claims 1 to 6, wherein a distance between the microphones is greater than or equal to a preset reference distance value.
9. The method according to any one of claims 1 to 6, wherein the microphones are oriented in different directions.
10. A voice interaction processing apparatus, comprising:
a loudness difference determining module, configured to determine the loudness of voice signals respectively received by at least two microphones, and determine a loudness difference between the loudness of the voice signals received by the microphones;
a main microphone determining module, configured to select, from the microphones, a microphone with the largest loudness of the voice signal as a main microphone in a case where the loudness difference is greater than or equal to a preset threshold;
a voice instruction recognition module, configured to perform voice instruction recognition processing on the voice signal of the main microphone to obtain a voice instruction in the voice signal of the main microphone; and
an interaction response module, configured to perform an interactive response according to the voice instruction.
11. An electronic device, comprising a memory and a processor, the memory storing a computer program, wherein the computer program, when executed by the processor, causes the processor to perform the steps of the voice interaction processing method according to any one of claims 1 to 9.
12. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
13. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310477203.9A CN116631392A (en) 2023-04-27 2023-04-27 Voice interaction processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116631392A (en) 2023-08-22

Family

ID=87609001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310477203.9A Pending CN116631392A (en) 2023-04-27 2023-04-27 Voice interaction processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116631392A (en)

Similar Documents

Publication Publication Date Title
US11580964B2 (en) Electronic apparatus and control method thereof
CN107134279B (en) Voice awakening method, device, terminal and storage medium
US9418662B2 (en) Method, apparatus and computer program product for providing compound models for speech recognition adaptation
CN112470217A (en) Method for determining electronic device to perform speech recognition and electronic device
CN105793921A (en) Initiating actions based on partial hotwords
CN109785845B (en) Voice processing method, device and equipment
CN110570857A (en) Voice wake-up method and device, electronic equipment and storage medium
US10269347B2 (en) Method for detecting voice and electronic device using the same
US20200125603A1 (en) Electronic device and system which provides service based on voice recognition
CN111326146A (en) Method and device for acquiring voice awakening template, electronic equipment and computer readable storage medium
CN111079438A (en) Identity authentication method and device, electronic equipment and storage medium
CN111276127B (en) Voice awakening method and device, storage medium and electronic equipment
CN112259076A (en) Voice interaction method and device, electronic equipment and computer readable storage medium
US20220270604A1 (en) Electronic device and operation method thereof
CN116631392A (en) Voice interaction processing method and device, electronic equipment and storage medium
CN116978370A (en) Speech processing method, device, computer equipment and storage medium
CN112232059B (en) Text error correction method and device, computer equipment and storage medium
CN114694661A (en) First terminal device, second terminal device and voice awakening method
KR20220118818A (en) Electronic device and operation method thereof
US20230094274A1 (en) Electronic device and operation method thereof
US20220319497A1 (en) Electronic device and operation method thereof
CN111933138B (en) Voice control method, device, terminal and storage medium
US20220262359A1 (en) Electronic device and operation method thereof
US20220383877A1 (en) Electronic device and operation method thereof
CN116030817B (en) Voice wakeup method, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination