CN108461081B - Voice control method, device, equipment and storage medium - Google Patents

Info

Publication number
CN108461081B
Authority
CN
China
Prior art keywords
domain signal
frequency
frequency domain
signal
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810235732.7A
Other languages
Chinese (zh)
Other versions
CN108461081A (en
Inventor
李志铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Internet Security Software Co Ltd
Original Assignee
Beijing Kingsoft Internet Security Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Internet Security Software Co Ltd
Priority to CN201810235732.7A
Publication of CN108461081A
Application granted
Publication of CN108461081B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/20Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a method, a device, equipment and a storage medium for voice control, wherein the method comprises the following steps: carrying out signal processing on a voice signal acquired by terminal equipment to obtain a frequency domain signal; extracting a target frequency domain signal in a first frequency range from the frequency domain signals, and transferring the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal; and converting the complete frequency domain signal into a time domain signal, performing voice recognition on the time domain signal, and controlling the terminal equipment to respond according to a voice recognition result, so that the voice signal which is easily influenced by the environment is transferred to a frequency band which is not easily influenced by the environment, the anti-interference capability is enhanced, the distortion degree of the voice signal is reduced, and the recognition rate of the voice signal and the accuracy of voice control by the terminal equipment are improved.

Description

Voice control method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of voice recognition, in particular to a voice control method, a voice control device, voice control equipment and a storage medium.
Background
According to the IEC 581 standard of the International Electrotechnical Commission and the Chinese standard GB/T 14277-93, 30-150 Hz is defined as the low frequency range, 150-500 Hz as the medium-low frequency range, 500 Hz-5 kHz as the medium-high frequency range, and 5-16 kHz as the high frequency range. The reference range of the male voice is 60-523 Hz, and that of the female voice is 160-1200 Hz. Pitch period information is widely used in fields such as speech recognition, speaker recognition, speech analysis and synthesis, low-bit-rate speech coding, diagnosis of disorders of the vocal system, and language guidance for the hearing impaired.
The fundamental tone (pitch) plays a key role in speech recognition, and the human fundamental tone region lies in the frequency range below 200 Hz. At present, terminal equipment based on voice control mainly uses a microphone to pick up sound. When the loudspeaker plays at high sound pressure while the microphone is picking up sound, the loudspeaker vibrates strongly at low frequencies, so the distortion of the low-frequency sound is high. An analysis of several smart speakers on the market found that, for audio signals below 200 Hz, the audio played by the loudspeaker is more heavily distorted; in particular, the total harmonic distortion plus noise (THD+N) of low-frequency voice signals at about 100 Hz exceeds 10%, which poses a great challenge to the echo cancellation algorithm of a speech recognition system. Low-frequency voice signals also cause the enclosure of the smart speaker to vibrate strongly, which introduces nonlinear distortion into the sound picked up by the microphone, so that the data collected by the microphone deviates considerably from the real data. As a result, the smart speaker cannot recognize voice signals containing voice control keywords, and the success rate of voice control of devices such as smart speakers is low.
Disclosure of Invention
The invention provides a voice control method, device, equipment and storage medium, which migrate a voice signal that is easily influenced by the environment to a frequency band that is not, thereby enhancing anti-interference capability, reducing the distortion of the voice signal, and improving the terminal device's recognition rate for the voice signal and the accuracy of voice control.
In a first aspect, an embodiment of the present invention provides a method for voice control, where the method includes:
carrying out signal processing on a voice signal acquired by terminal equipment to obtain a frequency domain signal;
extracting a target frequency domain signal in a first frequency range from the frequency domain signals, and transferring the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal;
and converting the complete frequency domain signal into a time domain signal, carrying out voice recognition on the time domain signal, and controlling the terminal equipment to respond according to a voice recognition result.
In a second aspect, an embodiment of the present invention further provides a voice-controlled apparatus, where the apparatus includes:
the signal conversion module is used for carrying out signal processing on the voice signal acquired by the terminal equipment to obtain a frequency domain signal;
the complete frequency domain signal generating module is used for extracting a target frequency domain signal in a first frequency range in the frequency domain signals, and transferring the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal;
and the signal conversion response module is used for converting the complete frequency domain signal into a time domain signal, carrying out voice recognition on the time domain signal and controlling the terminal equipment to respond according to a voice recognition result.
In a third aspect, an embodiment of the present invention further provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the voice control method when executing the program.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the above-mentioned voice control method.
The invention provides a voice control method, a voice control device, voice control equipment and a storage medium, wherein the method comprises the following steps: carrying out signal processing on a voice signal acquired by terminal equipment to obtain a frequency domain signal; extracting a target frequency domain signal in a first frequency range from the frequency domain signals, and transferring the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal; and converting the complete frequency domain signal into a time domain signal, performing voice recognition on the time domain signal, and controlling the terminal equipment to respond according to a voice recognition result, so that the voice signal which is easily influenced by the environment is transferred to a frequency band which is not easily influenced by the environment, the influence of the environmental noise on the voice signal is reduced, and the recognition rate of the voice signal and the accuracy of voice control by the terminal equipment are improved.
Drawings
Fig. 1 is a flowchart of a voice control method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a voice control method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a voice control apparatus according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a terminal device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a voice control method according to a first embodiment of the present invention. This embodiment is applicable to migrating a voice signal from a frequency band that is susceptible to the environment into a frequency band that is not, so as to enhance anti-interference capability, reduce the influence of environmental noise, and improve the recognition rate of the voice signal and the accuracy of voice control by the terminal device. The method may be performed by a voice control apparatus implemented in software and/or hardware, and specifically includes the following steps:
Step S110, carrying out signal processing on the voice signal collected by the terminal equipment to obtain a frequency domain signal.
The terminal equipment refers to equipment with a recording function and a playback function, for example a smart speaker, which uses a microphone to collect voice signals and plays audio information through a loudspeaker.
Voice is an analog signal carrying specific information. The microphone samples the collected analog sound and converts it into a digital signal, and the collected voice signal is a time-domain signal. A time-domain voice signal lets the shape of the signal be observed directly, but the signal cannot be described accurately with a limited number of parameters. The time-domain voice signal takes time as the independent variable and changes as time changes, which makes it hard to analyze, so the frequency structure of the voice signal needs to be analyzed further in order to describe the signal in the frequency domain. The frequency-domain voice signal, by contrast, takes frequency as the independent variable and does not change with time; it decomposes a complex signal into a superposition of simple signals, such as sinusoidal signals, so that the structure of the signal can be understood more accurately.
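As an illustration of describing a signal in the frequency domain, the following minimal sketch decomposes a short sampled signal into sinusoidal components with a naive discrete Fourier transform. The sample rate, frame length, and the two test tones are assumptions chosen for illustration and do not come from the patent; a practical implementation would use a fast Fourier transform instead.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform, O(N^2); adequate for a short demo frame."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


SAMPLE_RATE = 1000  # Hz, assumed for illustration
N = 200             # frame length, so each bin spans SAMPLE_RATE / N = 5 Hz

# Time-domain signal: a 100 Hz tone plus a weaker 300 Hz tone.
signal = [
    math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 300 * t / SAMPLE_RATE)
    for t in range(N)
]

# Frequency-domain view: magnitude of each bin up to the Nyquist frequency.
magnitudes = [abs(c) for c in dft(signal)[: N // 2]]

# The two strongest bins recover the two sinusoids that make up the signal.
top_bins = sorted(range(N // 2), key=lambda k: magnitudes[k], reverse=True)[:2]
peak_freqs = sorted(k * SAMPLE_RATE / N for k in top_bins)
print(peak_freqs)
```

The time-domain list `signal` changes from sample to sample, while the frequency-domain description reduces it to two components and their magnitudes.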
Step S120, extracting a target frequency domain signal in a first frequency range from the frequency domain signals, and transferring the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal.
The target frequency domain signal in the first frequency range may be a voice signal with a frequency of 200Hz or less, and the synthesized frequency domain signal in the second frequency range may be a voice signal with a frequency of 200Hz or more.
While the microphone is collecting voice signals, the loudspeaker may be playing audio, and the audio played by the loudspeaker is immediately picked up by the microphone, which interferes with the collection of the normal voice signal; echo cancellation is therefore needed to handle the audio played by the loudspeaker. In addition, voice signals in the low frequency band cause the speaker enclosure to vibrate strongly during playback. Especially at high sound pressure, the vibration of the enclosure cavity is stronger and introduces more serious nonlinear distortion into the voice data acquired by the microphone, so the data acquired by the microphone differs more from the real data and the wake-up rate is low. Here the low frequency band refers to the frequency band below 200 Hz. Since the voice signal collected by the microphone contains multiple frequencies, the frequency domain signal below 200 Hz, that is, the low frequency band, needs to be synthesized into a frequency domain signal above 200 Hz, so as to reduce the influence of environmental noise on the voice signal. The migration of the target frequency domain signal to the synthesized frequency domain signal in the second frequency range may be carried out with a technique such as a mixing algorithm.
The target frequency domain signal is migrated to the synthesized frequency domain signal in the second frequency range by using a mixing algorithm to form a complete frequency domain signal, where the complete frequency domain signal may include the target frequency domain signal in the first frequency range and the synthesized frequency domain signal in the second frequency range, or may include only the synthesized frequency domain signal in the second frequency range, or only the target frequency domain signal in the first frequency range.
Step S130, converting the complete frequency domain signal into a time domain signal, performing voice recognition on the time domain signal, and controlling the terminal equipment to respond according to a voice recognition result.
Preferably, the complete frequency domain signal is a mixed frequency domain signal of the target frequency domain signal and the synthesized frequency domain signal. Migrating the target frequency domain signal in the first frequency range to the synthesized frequency domain signal in the second frequency range lets the synthesized signal in the high frequency band cover the vibration of the target signal in the low frequency band while retaining the voice content carried by the target signal. This improves pickup accuracy without affecting the pickup performance of terminal equipment such as a speaker, and in turn improves the efficiency and wake-up rate of voice recognition. In this embodiment, the low frequency band refers to the band below 200 Hz and the high frequency band to the band above 200 Hz. The complete frequency domain signal is then transformed back into its corresponding time domain signal, subsequent voice recognition processing is performed on the time domain signal, and the terminal equipment is controlled to respond according to the voice recognition result.
Illustratively, when a user sends a voice signal to the terminal device, for example through the voice input device of a smart speaker, the voice signal may be a single word consisting only of a wake-up word, such as "off", "on", "high", or "low", or a sentence containing the wake-up word, such as "turn off", "turn on", "turn up", or "turn down". Because the voice signal contains components at various frequencies, the voice signal of the single word or sentence containing the wake-up word is processed by steps S110 and S120 to obtain its complete frequency domain signal, and the complete frequency domain signal is converted into a time domain signal. The time domain signal is subjected to digital signal processing such as digital-to-analog conversion, and voice recognition is then performed. The voice recognition process includes extraction of the wake-up word: the time domain signal is preprocessed, features are extracted, and the voice recognition result is obtained after feature matching. For example, if the recognition result for the voice signal is "off", the result is sent to the control module of the terminal device, and the control module controls the terminal device to perform the "off" action according to the received "off" control instruction.
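The final step, in which the control module acts on the recognition result, might be sketched as follows. The `SmartSpeaker` class, the `respond` function, the volume scale of 0 to 10, and the mapping of "high" and "low" to volume changes are all hypothetical names and choices made for illustration; the patent does not specify how the control module is implemented.

```python
class SmartSpeaker:
    """Hypothetical stand-in for the terminal device and its control module."""

    def __init__(self):
        self.powered = True
        self.volume = 5  # assumed scale of 0..10

    def handle(self, wake_word):
        """Apply one control instruction; return False if the word is unknown."""
        if wake_word == "on":
            self.powered = True
        elif wake_word == "off":
            self.powered = False
        elif wake_word == "high":
            self.volume = min(10, self.volume + 1)
        elif wake_word == "low":
            self.volume = max(0, self.volume - 1)
        else:
            return False
        return True


def respond(device, recognition_result):
    """Pass the first known wake-up word in the recognized text to the device."""
    for word in recognition_result.lower().split():
        if device.handle(word):
            return word
    return None


speaker = SmartSpeaker()
respond(speaker, "off")
print(speaker.powered)
```

A sentence such as "turn off" is handled here only because it contains the single wake-up word "off"; a real recognizer would match wake-up words on extracted features rather than on text tokens.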
The working principle of the voice control method can be as follows. This embodiment can be applied in particular to a smart speaker with a microphone and a loudspeaker, which controls its own operation by recognizing a wake-up word spoken by a user. First, the microphone of the smart speaker collects the voice signal from the user and converts the analog voice signal into a digital voice signal, giving a digitally processed time domain signal. A Fourier transform of the time domain signal yields the frequency domain signal corresponding to the voice signal. A target frequency domain signal whose frequency lies in the first frequency range is extracted from the frequency domain signal and dynamically migrated to the second frequency range to obtain a synthesized frequency domain signal, where the first frequency range is 0-200 Hz and covers the whole low frequency band, and the second frequency range is above 200 Hz. The target frequency domain signal is replaced with the synthesized frequency domain signal to obtain a new complete frequency domain signal, which is converted by inverse Fourier transform into its corresponding time domain signal and passed to a voice recognition device for voice recognition. Because the voice signal below 200 Hz has been dynamically migrated to a band above 200 Hz, the interference of environmental noise with the voice signal and the distortion of the voice signal are reduced. In the subsequent voice recognition process, the voice recognition device can therefore recognize the wake-up word accurately and control the smart speaker to respond accordingly, which improves the accuracy and wake-up rate of voice recognition.
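The transform, migrate, and inverse-transform sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it uses a naive DFT in place of a fast Fourier transform, moves each bin below the cutoff upward by a fixed offset of 200 Hz, adds the moved bins to whatever already occupies the destination bins, and leaves the DC bin untouched. The function name `migrate_low_band`, its parameters, and the test tone are hypothetical.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def migrate_low_band(samples, sample_rate, cutoff_hz=200.0, shift_hz=200.0):
    """Move spectral content below cutoff_hz up by shift_hz; return time samples."""
    n = len(samples)
    spec = dft(samples)
    bin_hz = sample_rate / n
    cutoff_bin = int(round(cutoff_hz / bin_hz))
    shift_bins = int(round(shift_hz / bin_hz))

    out = list(spec)
    # Remove the target band (positive bins and their mirrored negative bins).
    for k in range(1, cutoff_bin):
        out[k] = 0
        out[n - k] = 0
    # Add the removed content back at the shifted position, keeping the
    # conjugate symmetry that makes the inverse transform real-valued.
    for k in range(1, cutoff_bin):
        dst = k + shift_bins
        if dst >= n // 2:
            continue  # would pass the Nyquist frequency; dropped in this sketch
        out[dst] += spec[k]
        out[n - dst] += spec[k].conjugate()
    return [v.real for v in idft(out)]


SAMPLE_RATE = 1000  # Hz, assumed
N = 200             # 5 Hz per bin
tone = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(N)]

shifted = migrate_low_band(tone, SAMPLE_RATE)
mags = [abs(c) for c in dft(shifted)[: N // 2]]
peak_hz = max(range(N // 2), key=lambda k: mags[k]) * SAMPLE_RATE / N
print(peak_hz)
```

In this example a 100 Hz input tone leaves the pipeline as a time domain signal whose energy sits at 300 Hz, which is the signal that would then be handed to the voice recognition device.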
In the embodiment, a frequency domain signal is obtained by performing signal processing on a voice signal acquired by a terminal device; extracting a target frequency domain signal in a first frequency range from the frequency domain signals, and transferring the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal; and converting the complete frequency domain signal into a time domain signal, performing voice recognition on the time domain signal, controlling the terminal equipment to respond according to a voice recognition result, synthesizing the voice signal in the first frequency range into a voice signal in a second frequency range, reducing the influence of environmental noise on the voice signal, reducing the distortion degree of the voice signal, and improving the recognition rate of the terminal equipment on the voice signal and the accuracy of voice control.
On the basis of the above embodiment, the target frequency domain signal in the original frequency domain signal may be replaced by the synthesized frequency domain signal, which removes the target frequency domain signal from the first frequency range and is equivalent to a filtering function. The synthesized frequency domain signal is converted into the corresponding time domain signal for voice recognition; it may be recognized multiple times, and the results analyzed and compared with the recognition result of the original frequency domain signal, so as to improve the accuracy of voice recognition.
Voice recognition on the time domain signal specifically includes the following. The time domain signal is sampled, and the sampled signal is divided into a plurality of frame segment signals that partially overlap, so that adjacent frame segments transition smoothly and the continuity of the signal is preserved; overlapping the frame segments improves the accuracy and wake-up rate of voice recognition. Corresponding feature parameters, including feature prediction parameters, linear prediction cepstral coefficients, Mel-frequency cepstral coefficients and the like, are extracted from the framed signals. The similarity to the voice templates is judged from the extracted feature parameters, and the voice recognition result is output according to the judgment result. Judging the similarity to the voice templates includes finding the corresponding voice templates according to the feature parameters and then calculating the distortion measure of the feature parameters against them; the distortion measure includes the gain-normalized measure, the weighted cepstral measure, the likelihood measure, the Euclidean distance and the like, and the similarity judgment result is generated from the distortion measure.
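The template comparison can be illustrated with the simplest of the listed distortion measures, the Euclidean distance. In the sketch below the feature vectors and the template values are made-up numbers standing in for real feature parameters such as cepstral coefficients, and the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Distortion measure between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_template(features, templates):
    """Return the name of the template with the smallest distortion."""
    return min(templates, key=lambda name: euclidean_distance(features, templates[name]))


# Hypothetical three-dimensional templates for two wake-up words.
templates = {
    "on": [1.0, 0.2, 0.1],
    "off": [0.1, 0.9, 0.4],
}
observed = [0.2, 0.8, 0.5]
result = best_template(observed, templates)
print(result)
```

A smaller distortion means a closer match, so the template with the minimum distance is taken as the recognition result.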
It should be noted that the voice recognition processing on the time domain signal may use not only overlapping framing of the sampled time domain signal but also continuous framing; the difference is that in continuous framing the frame segment signals do not overlap.
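The two framing modes differ only in the step between frame starts. The following minimal sketch uses a hypothetical helper `split_frames` whose `hop` parameter is smaller than the frame length for overlapping framing and equal to it for continuous framing; trailing samples that do not fill a whole frame are dropped in this sketch.

```python
def split_frames(samples, frame_len, hop):
    """Split samples into frames of frame_len, advancing hop samples each time."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


samples = list(range(10))

# Overlapping framing: each frame shares half of its samples with the next.
overlapping = split_frames(samples, frame_len=4, hop=2)

# Continuous framing: frames sit back to back with no shared samples.
continuous = split_frames(samples, frame_len=4, hop=4)

print(overlapping)
print(continuous)
```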
Example two
Fig. 2 is a flowchart of a voice control method according to a second embodiment of the present invention. This embodiment is further illustrated and described with respect to the voice control method in the above embodiment, which specifically includes the following steps:
Step S210, performing analog-to-digital conversion processing on a voice signal acquired by a terminal device to obtain an initial time domain signal, performing Fourier transform processing on the initial time domain signal to obtain a corresponding initial frequency domain signal, and performing filtering processing on the initial frequency domain signal to obtain a frequency domain signal.
Both the microphone and the loudspeaker work with analog signals, whereas chips such as a processor can only process digital signals. The initial voice signal collected by a sound input device such as the microphone is an analog signal, so it needs to be converted into a digital voice signal by an analog-to-digital conversion circuit, which gives the initial time domain signal; microphone recording is thus a process of converting an analog signal into a digital signal. Meanwhile, noise in the environment affects the extraction and transmission of information, so in order to process the voice signal more accurately, the collected voice signal needs to be filtered to improve its fidelity. The voice signal collected by the microphone is a time-domain signal. Filtering in the time domain requires a convolution operation on the time domain signal, which is computationally expensive; filtering in the frequency domain requires only a multiplication on the frequency domain signal, and the transforms can be computed quickly with the fast Fourier transform and the inverse fast Fourier transform, so frequency-domain filtering needs less computation and is simpler. Therefore, to avoid distortion during signal processing and to reduce the amount of calculation, the initial time domain signal obtained by analog-to-digital conversion is Fourier transformed to obtain the corresponding initial frequency domain signal, and the initial frequency domain signal is then filtered to obtain the frequency domain signal. The filtering is generally low-pass filtering and may use a Butterworth filter, a Chebyshev filter, or the like.
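The equivalence between time-domain convolution and frequency-domain multiplication can be checked on a small example. The sketch below is illustrative only: it uses a naive DFT rather than a fast Fourier transform, circular convolution on a zero-padded frame, and a two-tap moving average as a crude low-pass filter; none of these specific choices come from the patent.

```python
import cmath


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def circular_convolve(x, h):
    """Time-domain filtering: every output sample is a sum over all inputs."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


def filter_via_frequency_domain(x, h):
    """Frequency-domain filtering: one multiplication per frequency bin."""
    product = [a * b for a, b in zip(dft(x), dft(h))]
    return [v.real for v in idft(product)]


# Zero-padded input frame and a two-tap moving-average filter.
x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
h = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

time_result = circular_convolve(x, h)
freq_result = filter_via_frequency_domain(x, h)
print(time_result)
print([round(v, 6) for v in freq_result])
```

Both routes give the same filtered samples; the saving in computation appears only once the transforms themselves are computed with a fast algorithm.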
Step S220, extracting a target frequency domain signal in a first frequency range from the frequency domain signals, and dynamically transferring the target frequency domain signal from the frequency band in the first frequency range to a frequency band in a second frequency range where the synthesized frequency domain signal is located, so as to form a complete frequency domain signal.
Because the collected frequency-domain voice signal includes components in multiple frequency bands, and audio below 200 Hz is strongly affected by the environment, the voice signal below 200 Hz needs to be dynamically migrated to the band above 200 Hz to reduce the influence of environmental factors, while audio above 200 Hz can be passed to voice recognition directly. First, the voice signal below 200 Hz is extracted from the frequency-domain voice signal; the frequency domain signal in the band below 200 Hz is then migrated to a band above 200 Hz by a mixing algorithm. The waveform is not distorted: the voice signal below 200 Hz before migration is the same as the voice signal in the band above 200 Hz after migration. The mixing algorithm may be a normalized mixing algorithm or averaging after linear superposition. The normalized mixing algorithm attenuates the voice with a variable attenuation factor that changes with the audio data: when overflow occurs, the attenuation factor is reduced so that the overflowing data falls within the critical value after attenuation, and when no overflow occurs, the attenuation factor is increased slowly so that the data changes more gradually. Averaging after linear superposition means linearly superposing the input audio data and then taking the average; this method does not overflow and introduces little noise.
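The two mixing algorithms named above might look like the following sketch for two input streams of 16-bit samples. The limit of 32767, the recovery rate of 1/32, and the function names are assumptions made for illustration; the patent describes the behavior of the attenuation factor but gives no concrete values.

```python
import math

LIMIT = 32767  # assumed critical value: largest 16-bit sample magnitude


def mix_average(a, b):
    """Averaging after linear superposition: add, then divide by the input count."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def mix_normalized(a, b, limit=LIMIT, recovery=1 / 32):
    """Normalized mixing with a variable attenuation factor."""
    factor = 1.0
    mixed = []
    for x, y in zip(a, b):
        total = x + y
        scaled = total * factor
        if abs(scaled) > limit:
            # Overflow: shrink the factor so this sample lands on the limit.
            factor = limit / abs(total)
            scaled = math.copysign(limit, total)
        elif factor < 1.0:
            # No overflow: let the factor drift slowly back toward 1.
            factor += (1.0 - factor) * recovery
        mixed.append(scaled)
    return mixed


a = [20000, 30000, 1000]
b = [20000, 30000, 1000]
averaged = mix_average(a, b)
normalized = mix_normalized(a, b)
print(averaged)
print([round(v, 2) for v in normalized])
```

Averaging can never exceed the input range, at the cost of halving the level of each input. The normalized variant keeps more level on quiet passages but continues to attenuate for a while after a loud sample, as the third output value shows.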
Illustratively, if the extracted frequency domain signal below 200 Hz covers 50-100 Hz, it is dynamically migrated by the mixing algorithm to 250-300 Hz; if the voice signal extracted in the first frequency range covers 80-180 Hz, it is migrated to 280-380 Hz. Seen in the frequency domain, the relative magnitudes and mutual spacing of the frequency components of the voice signal are unchanged by the dynamic migration: the output voice signal above 200 Hz has the same waveform and the same spectral structure as the input voice signal below 200 Hz, and the only difference is its frequency.
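Both numeric examples correspond to adding a fixed offset of 200 Hz to every component. The short sketch below works through that arithmetic; the 200 Hz offset is inferred from the two examples rather than stated as a rule in the text, and the component amplitudes and function names are made up for illustration.

```python
def shift_components(components, offset_hz=200.0):
    """Move every frequency component up by offset_hz, keeping its amplitude."""
    return {freq + offset_hz: amp for freq, amp in components.items()}


def spacings(components):
    """Gaps in Hz between neighboring components."""
    freqs = sorted(components)
    return [hi - lo for lo, hi in zip(freqs, freqs[1:])]


# Hypothetical components (frequency in Hz -> amplitude) within 50-100 Hz.
low_band = {50.0: 1.0, 75.0: 0.4, 100.0: 0.7}
high_band = shift_components(low_band)

print(sorted(high_band))
print(spacings(low_band), spacings(high_band))
```

The amplitudes and the gaps between components are identical before and after the shift, which is what the text means by an unchanged spectral structure.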
The target frequency domain signal in the first frequency range is dynamically migrated to the synthesized frequency domain signal in the second frequency range by a mixing algorithm, such as the normalized mixing algorithm or averaging after linear superposition, to form the complete frequency domain signal. That is, the complete frequency domain signal is a mixture of the low-band and high-band frequency domain signals of the same voice signal. The first frequency range is lower than the second frequency range; illustratively, the first frequency range is the range below 200 Hz and the second frequency range is the range above 200 Hz, so that the target frequency domain signal lies in the band of the first frequency range and the synthesized frequency domain signal lies in the band of the second frequency range. In other words, the mixing algorithm dynamically migrates the target frequency domain signal from the band of the first frequency range to the band of the second frequency range to form the synthesized frequency domain signal, which in turn forms the complete frequency domain signal. In the complete frequency domain signal, the synthesized signal in the high frequency band covers the vibration of the target signal in the low frequency band while the voice content of the target signal is retained, which reduces the low-frequency vibration of the target signal, improves pickup accuracy, and improves voice recognition efficiency and the wake-up rate.
Step S230, transforming the complete frequency domain signal into a corresponding time domain signal through an inverse Fourier transform.
The inverse Fourier transform is the inverse operation of the Fourier transform: the Fourier transform converts a time domain signal into a frequency domain signal, and the inverse Fourier transform converts a frequency domain signal back into a time domain signal. Applying the inverse Fourier transform to the complete frequency domain signal yields the corresponding time domain signal, which facilitates the subsequent processing of the voice signal.
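The inverse relationship between the two transforms can be checked numerically. The sketch below uses a direct discrete Fourier transform written with the Python standard library; the sample rate, frame length, and test tone are arbitrary assumptions, and a production system would normally use an FFT library instead of this quadratic-time form.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform: time domain samples -> frequency domain bins."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(bins):
    """Inverse discrete Fourier transform: frequency domain bins -> time domain samples."""
    n = len(bins)
    return [sum(bins[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


fs = 1600  # assumed sample rate in Hz
n = 160    # assumed frame length, giving 10 Hz per bin
signal = [math.sin(2 * math.pi * 300 * t / fs) for t in range(n)]

restored = [v.real for v in idft(dft(signal))]
max_error = max(abs(a - b) for a, b in zip(signal, restored))
```

The round trip reproduces the original samples up to floating-point rounding, so `max_error` is tiny compared with the signal amplitude.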
Step S240, performing digital-to-analog conversion on the time domain signal, performing voice recognition on the time domain signal after digital-to-analog conversion, and controlling the terminal device to respond according to a voice recognition result.
The time domain signal obtained through the inverse fourier transform is a digital signal, so the digital signal needs to be converted into an analog signal, and the analog signal is transmitted to the voice recognition device for subsequent processing, and the terminal device is controlled to respond according to the voice recognition result.
The speech recognition of the time domain signal corresponding to the complete frequency domain signal in this embodiment follows the foregoing embodiments of the present invention and is not described again here. With the technical scheme provided by this embodiment, the distortion of voice signals below 200 Hz is reduced and the anti-interference capability is enhanced, so the voice recognition accuracy and the wake-up rate of the terminal device can be greatly improved. Because the distortion of the voice signal is reduced, the terminal device can obtain an accurate voice recognition result and send it to the relevant control module, which derives the voice control instruction and controls the terminal device to perform operations such as turning on, turning off, increasing the volume, or decreasing the volume.
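The hand-off from recognition result to control module can be pictured as a lookup from recognized text to an operation. The sketch below is illustrative only: the `Terminal` class, the command phrases, and the volume range are hypothetical and do not come from the patent.

```python
class Terminal:
    """Hypothetical terminal device with a power state and a volume level."""

    def __init__(self, volume=5, max_volume=10):
        self.powered = False
        self.volume = volume
        self.max_volume = max_volume

    def turn_on(self):
        self.powered = True

    def turn_off(self):
        self.powered = False

    def volume_up(self):
        self.volume = min(self.max_volume, self.volume + 1)

    def volume_down(self):
        self.volume = max(0, self.volume - 1)


def respond(terminal, recognition_result):
    """Map recognized text to an operation; return True if a command matched."""
    commands = {
        "turn on": terminal.turn_on,
        "turn off": terminal.turn_off,
        "volume up": terminal.volume_up,
        "volume down": terminal.volume_down,
    }
    action = commands.get(recognition_result.strip().lower())
    if action is None:
        return False  # unrecognized text leaves the device unchanged
    action()
    return True


device = Terminal()
respond(device, "Turn on")
respond(device, "volume up")
```

Keeping the phrase table separate from the device methods means new commands can be added without touching the recognition side.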
According to the technical scheme of this embodiment of the invention, an initial time domain signal is obtained by performing analog-to-digital conversion on the voice signal acquired by the terminal device, a corresponding initial frequency domain signal is obtained by performing Fourier transform processing on the initial time domain signal, and the frequency domain signal is obtained by filtering the initial frequency domain signal; a target frequency domain signal in a first frequency range is extracted from the frequency domain signal and dynamically migrated from the band of the first frequency range to the band of the second frequency range to form a complete frequency domain signal; and the complete frequency domain signal is transformed into a corresponding time domain signal through an inverse Fourier transform. The voice signal that is easily affected by the environment is thereby moved into a frequency band that is less easily affected, which enhances the anti-interference capability, reduces the distortion of the voice signal, and improves the recognition rate of the terminal device for the voice signal and the accuracy of voice control.
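The steps summarized above can be strung together in a short end-to-end sketch. Everything concrete here is an assumption made for illustration: 16-bit quantization stands in for analog-to-digital conversion, removing the DC bin stands in for the unspecified filtering, the migration is a fixed 200 Hz shift, the mix is averaging after linear superposition, and the content already above 200 Hz is treated as the signal in the second frequency range.

```python
import cmath
import math


def rdft(x):
    """One-sided DFT of a real sequence: bins 0..n/2 (n must be even)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def irdft(bins, n):
    """Inverse of rdft, relying on the conjugate symmetry of a real signal's spectrum."""
    half = n // 2
    out = []
    for t in range(n):
        acc = bins[0].real + bins[half].real * math.cos(math.pi * t)
        for k in range(1, half):
            acc += 2 * (bins[k] * cmath.exp(2j * math.pi * k * t / n)).real
        out.append(acc / n)
    return out


def process(analog, fs, cutoff_hz=200.0, offset_hz=200.0, bits=16):
    n = len(analog)
    bin_hz = fs / n
    # Analog-to-digital conversion: clip to [-1, 1] and quantize to `bits` bits.
    full_scale = 2 ** (bits - 1) - 1
    digital = [round(max(-1.0, min(1.0, v)) * full_scale) / full_scale for v in analog]
    spectrum = rdft(digital)
    spectrum[0] = 0j  # assumed filtering step: remove the DC component
    low_bins = min(len(spectrum), math.ceil(cutoff_hz / bin_hz))
    offset_bins = round(offset_hz / bin_hz)
    # Extract the band below cutoff_hz and migrate it upward by offset_hz.
    shifted = [0j] * len(spectrum)
    for k in range(low_bins):
        if k + offset_bins < len(spectrum):
            shifted[k + offset_bins] = spectrum[k]
    high = [0j if k < low_bins else v for k, v in enumerate(spectrum)]
    # Averaging after linear superposition forms the complete spectrum.
    complete = [(a + b) / 2 for a, b in zip(shifted, high)]
    return irdft(complete, n)


fs, n = 1600, 160  # assumed sample rate and frame length (10 Hz per bin)
analog = [0.5 * math.sin(2 * math.pi * 100 * t / fs)
          + 0.3 * math.sin(2 * math.pi * 500 * t / fs) for t in range(n)]
output = process(analog, fs)
out_spec = rdft(output)
peak_bin = max(range(len(out_spec)), key=lambda k: abs(out_spec[k]))
peak_hz = peak_bin * fs / n
```

In this run the 100 Hz component of the input reappears at 300 Hz in the output, the 500 Hz component stays in place at half its level, and nothing remains below 200 Hz.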
Example three
Fig. 3 is a schematic structural diagram of a speech-controlled apparatus according to a third embodiment of the present invention, where the apparatus can be implemented by software and/or hardware, and can perform speech signal processing by performing the speech-controlled method. As shown in fig. 3, the apparatus includes: a signal conversion module 310, a complete frequency domain signal generation module 320, and a signal transformation response module 330.
The signal conversion module 310 is configured to perform signal processing on a voice signal acquired by a terminal device to obtain a frequency domain signal; the complete frequency domain signal generating module 320 is configured to extract a target frequency domain signal in a first frequency range from the frequency domain signals, and migrate the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal; the signal transformation response module 330 is configured to transform the complete frequency domain signal into a time domain signal, perform speech recognition on the time domain signal, and control the terminal device to respond according to a speech recognition result.
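The division of work among the three modules can be pictured as a simple composition in which each module consumes the output of the previous one. The sketch below only shows that data flow: the class name, field names, and the stand-in functions are hypothetical, and the real modules may be implemented in software and/or hardware.

```python
from dataclasses import dataclass
from typing import Callable, List

Samples = List[float]
Spectrum = List[complex]


@dataclass
class VoiceControlApparatus:
    signal_conversion: Callable[[Samples], Spectrum]             # module 310
    complete_signal_generation: Callable[[Spectrum], Spectrum]   # module 320
    transformation_response: Callable[[Spectrum], str]           # module 330

    def handle(self, voice: Samples) -> str:
        spectrum = self.signal_conversion(voice)
        complete = self.complete_signal_generation(spectrum)
        return self.transformation_response(complete)


calls = []


def convert(voice):
    calls.append("310")
    return [complex(v) for v in voice]  # stand-in for the real conversion


def generate(spectrum):
    calls.append("320")
    return list(spectrum)  # stand-in for extraction and migration


def respond(spectrum):
    calls.append("330")
    return "volume up"  # stand-in for inverse transform, recognition, and response


apparatus = VoiceControlApparatus(convert, generate, respond)
result = apparatus.handle([0.0, 0.1])
```

Passing the stages in as callables keeps each module replaceable, which matches the statement that the apparatus can be realized by software and/or hardware.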
In the voice control device provided by the third embodiment of the present invention, the target frequency domain signal in the first frequency range, which is susceptible to the environment, is extracted from the collected voice signal and transferred to the synthesized frequency domain signal in the second frequency range, so as to form a complete frequency domain signal, and the complete frequency domain signal is converted into a time domain signal for voice recognition, thereby reducing the interference of the environmental noise on the voice signal, and improving the recognition rate of the terminal device on the voice signal and the accuracy of voice control.
On the basis of the foregoing embodiment, the signal conversion module 310 is specifically configured to perform analog-to-digital conversion on a voice signal acquired by a terminal device to obtain an initial time domain signal, perform fourier transform processing on the initial time domain signal to obtain a corresponding initial frequency domain signal, and perform filtering processing on the initial frequency domain signal to obtain a frequency domain signal.
On the basis of the foregoing embodiment, the complete frequency domain signal generating module 320 is specifically configured to extract a target frequency domain signal in a first frequency range from the frequency range of the frequency domain signals, and dynamically migrate the target frequency domain signal from the frequency band in which the first frequency range is located to the frequency band in a second frequency range in which the synthesized frequency domain signal is located, so as to form the complete frequency domain signal.
On the basis of the above embodiment, the signal transformation response module 330 further includes a signal inverse transformation unit and a signal response unit, wherein the signal inverse transformation unit is configured to perform fourier inverse transformation on the synthesized complete frequency domain signal to obtain a corresponding time domain signal; and the signal response unit is used for performing digital-to-analog conversion on the time domain signal, performing voice recognition on the time domain signal subjected to digital-to-analog conversion, and controlling the terminal equipment to respond according to a voice recognition result.
The voice control device can execute the voice control method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
Fig. 4 is a schematic structural diagram of a terminal device according to a fourth embodiment of the present invention. Fig. 4 shows a block diagram of an exemplary terminal device 12 suitable for use in implementing embodiments of the present invention. The terminal device 12 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 4, terminal device 12 is in the form of a general purpose computing device. The components of terminal device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Terminal device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by terminal device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. Terminal device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Terminal device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with terminal device 12, and/or with any devices (e.g., network card, modem, etc.) that enable terminal device 12 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 22. Furthermore, terminal device 12 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of terminal device 12 via bus 18. It should be appreciated that, although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with terminal device 12, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 executes various functional applications and data processing by executing programs stored in the system memory 28, for example, implementing a method of voice control provided by an embodiment of the present invention:
that is, the processing unit implements, when executing the program: carrying out signal processing on a voice signal acquired by terminal equipment to obtain a frequency domain signal; extracting a target frequency domain signal in a first frequency range from the frequency domain signals, and transferring the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal; and converting the complete frequency domain signal into a time domain signal, carrying out voice recognition on the time domain signal, and controlling the terminal equipment to respond according to a voice recognition result.
Example five
Fifth embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for voice control provided in all embodiments of the present invention:
that is, the program when executed by the processor implements: carrying out signal processing on a voice signal acquired by terminal equipment to obtain a frequency domain signal; extracting a target frequency domain signal in a first frequency range from the frequency domain signals, and transferring the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal; and converting the complete frequency domain signal into a time domain signal, carrying out voice recognition on the time domain signal, and controlling the terminal equipment to respond according to a voice recognition result.
Embodiment five of the present invention provides a computer-readable storage medium that may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method of voice control, comprising:
carrying out signal processing on a voice signal acquired by terminal equipment to obtain a frequency domain signal;
extracting a target frequency domain signal in a first frequency range from the frequency domain signals, and transferring the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal; the first frequency range is a low-frequency band, and the second frequency range is a high-frequency band;
and converting the complete frequency domain signal into a time domain signal, carrying out voice recognition on the time domain signal, and controlling the terminal equipment to respond according to a voice recognition result.
2. The method according to claim 1, wherein the signal processing is performed on the voice signal collected by the terminal device to obtain a frequency domain signal, and specifically comprises:
the method comprises the steps of carrying out analog-to-digital conversion processing on voice signals collected by terminal equipment to obtain initial time domain signals, carrying out Fourier transform processing on the initial time domain signals to obtain corresponding initial frequency domain signals, and carrying out filtering processing on the initial frequency domain signals to obtain frequency domain signals.
3. The method according to claim 1, wherein the extracting a target frequency-domain signal in a first frequency range from the frequency-domain signals, and migrating the target frequency-domain signal to a synthesized frequency-domain signal in a second frequency range to form a complete frequency-domain signal, specifically:
and extracting a target frequency domain signal in a first frequency range in the frequency domain signals, and dynamically transferring the target frequency domain signal from a frequency band in which the first frequency range is located to a frequency band in a second frequency range in which the synthesized frequency domain signal is located to form a complete frequency domain signal.
4. The method according to claim 1, wherein said transforming the complete frequency domain signal into a time domain signal is specifically:
and transforming the complete frequency domain signal into a corresponding time domain signal through Fourier inversion.
5. The method according to claim 1, wherein performing speech recognition on the time domain signal and controlling the terminal device to respond according to a speech recognition result specifically comprises:
and D/A conversion is carried out on the time domain signal, voice recognition is carried out on the time domain signal after D/A conversion, and the terminal equipment is controlled to respond according to a voice recognition result.
6. A voice-controlled apparatus, comprising:
the signal conversion module is used for carrying out signal processing on the voice signal acquired by the terminal equipment to obtain a frequency domain signal;
the complete frequency domain signal generating module is used for extracting a target frequency domain signal in a first frequency range in the frequency domain signals, and transferring the target frequency domain signal to a synthesized frequency domain signal in a second frequency range to form a complete frequency domain signal; the first frequency range is a low-frequency band, and the second frequency range is a high-frequency band;
and the signal conversion response module is used for converting the complete frequency domain signal into a time domain signal, carrying out voice recognition on the time domain signal and controlling the terminal equipment to respond according to a voice recognition result.
7. The speech-controlled apparatus according to claim 6, wherein the signal conversion module is specifically configured to:
the method comprises the steps of carrying out analog-to-digital conversion processing on voice signals collected by terminal equipment to obtain initial time domain signals, carrying out Fourier transform processing on the initial time domain signals to obtain corresponding initial frequency domain signals, and carrying out filtering processing on the initial frequency domain signals to obtain frequency domain signals.
8. The speech-controlled apparatus according to claim 6, wherein the synthesis frequency-domain signal generating module is specifically configured to:
and extracting a target frequency domain signal in a first frequency range in the frequency domain signals, and dynamically transferring the target frequency domain signal from a frequency band in which the first frequency range is located to a frequency band in a second frequency range in which the synthesized frequency domain signal is located to form a complete frequency domain signal.
9. A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-5 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN201810235732.7A 2018-03-21 2018-03-21 Voice control method, device, equipment and storage medium Active CN108461081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810235732.7A CN108461081B (en) 2018-03-21 2018-03-21 Voice control method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108461081A CN108461081A (en) 2018-08-28
CN108461081B true CN108461081B (en) 2020-07-31

Family

ID=63236771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810235732.7A Active CN108461081B (en) 2018-03-21 2018-03-21 Voice control method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108461081B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190130

Address after: 100085 East District, Second Floor, 33 Xiaoying West Road, Haidian District, Beijing

Applicant after: BEIJING KINGSOFT INTERNET SECURITY SOFTWARE Co.,Ltd.

Address before: 511400 Tian'an Science and Technology Industrial Building, Panyu Energy-saving Science Park, 555 North Panyu Avenue, Donghuan Street, Panyu District, Guangzhou City, Guangdong Province

Applicant before: GUANGZHOU LANBO INTELLIGENT TECHNOLOGY CO.,LTD.

GR01 Patent grant