CN117636836A - Headset voice processing method and headset - Google Patents


Info

Publication number
CN117636836A
Authority
CN
China
Prior art keywords
time domain signal, signal, energy, original time domain signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311587705.3A
Other languages
Chinese (zh)
Inventor
李林峰
黄海荣
Current Assignee
Hubei Xingji Meizu Group Co ltd
Original Assignee
Hubei Xingji Meizu Group Co ltd
Application filed by Hubei Xingji Meizu Group Co ltd
Priority to CN202311587705.3A
Publication of CN117636836A
Legal status: Pending

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

A voice processing method for a head-mounted device, and a head-mounted device, are disclosed. The voice processing method comprises the following steps: acquiring an original time domain signal collected by a microphone; performing direction-based enhancement suppression processing on the original time domain signal to obtain a processed time domain signal, wherein the directions to be enhanced and/or suppressed in the enhancement suppression processing are selected according to the current usage scenario of the head-mounted device; and determining, according to the energies of the original time domain signal and the processed time domain signal, whether to perform voice processing based on the original time domain signal. The method and device determine whether the collected signal contains the target speaker's voice based on the degree of energy attenuation of the processed signal relative to the original signal, and in particular can accurately recognize the case of a non-target speaker speaking loudly, thereby avoiding erroneous operation of the head-mounted device.

Description

Headset voice processing method and headset
Technical Field
The disclosure relates to the field of voice processing, and in particular to a voice processing method for a head-mounted device and a head-mounted device using the method.
Background
A head-mounted device typically takes the form of eyeglasses, goggles, or a helmet. By placing a lens-shaped display screen close to the user's eyes and focusing it through the optical path, a head-mounted device can generate a wide-angle picture at close range in a much smaller volume than a conventional display.
In addition to using lenses for content display, head-mounted devices are often equipped with microphones and speakers for providing various types of voice interaction services. In a voice interaction scenario, it is necessary to correctly collect the voice of the target speaker and filter out extraneous sounds from the surrounding environment. In the prior art, extraneous sounds are typically distinguished based on the magnitude of frequency domain energy. However, in many situations the extraneous sound cannot be filtered out; for example, the loud speech of a nearby person cannot be completely suppressed, so the extraneous sound is wrongly recognized, degrading the wearer's experience with the head-mounted device.
Accordingly, there is a need for an improved method of filtering extraneous sound for a head-mounted device.
Disclosure of Invention
One technical problem to be solved by the present disclosure is to provide a voice processing method for a head-mounted device and a head-mounted device using the method. The voice processing method performs direction-based enhancement/suppression processing, selected according to the current usage scenario, on the original time domain signal collected by the head-mounted device, and decides whether to perform subsequent voice recognition processing on the original signal according to the energies of the time domain signal before and after the processing.
According to a first aspect of the present disclosure, there is provided a method for processing speech of a head-mounted device, including: acquiring an original time domain signal collected by a microphone; performing direction-based enhancement suppression processing on the original time domain signal to obtain a processed time domain signal, wherein the directions requiring enhancement and/or suppression in the enhancement suppression processing are selected according to the current usage scenario of the head-mounted device; and determining, according to the energies of the original time domain signal and the processed time domain signal, whether to perform speech processing based on the original time domain signal.
Optionally, determining whether to perform speech processing based on the original time domain signal based on energy of the original time domain signal and the processed time domain signal comprises: calculating an energy ratio of the processed time domain signal to the original time domain signal; and judging whether to perform voice recognition processing based on the original time domain signal according to the energy ratio.
Optionally, calculating the energy ratio of the processed time domain signal to the original time domain signal comprises: a ratio of energy means of the processed time domain signal to the original time domain signal over a time window is calculated as the energy ratio.
Optionally, determining whether to perform speech processing based on the original time domain signal according to the magnitude of the energy ratio includes: when the energy ratio is less than a first threshold, not performing speech processing based on the original time domain signal; and performing speech processing based on the original time domain signal when the energy ratio is greater than the first threshold.
Optionally, the head-mounted device includes a plurality of microphones arranged at different positions, each of which serves as the primary microphone in one or more of a plurality of usage scenarios; and performing direction-based enhancement suppression processing on the original time domain signal to obtain a processed time domain signal comprises: performing direction-based enhancement suppression processing on the primary microphone original time domain signal collected by the primary microphone in the current usage scenario, so as to obtain the processed time domain signal.
Optionally, acquiring the raw time domain signal acquired by the microphone includes: a primary microphone raw time domain signal acquired by a primary microphone in the current usage scenario and other microphone raw time domain signals acquired by other microphones than the primary microphone are acquired. At this time, performing a direction-based enhancement suppression process on the primary microphone original time domain signal acquired by the primary microphone in the current usage scenario, to obtain the processed time domain signal includes: and performing directional enhancement processing on the primary microphone original time domain signals acquired by the primary microphone in the current use scene by using the other microphone original time domain signals so as to acquire the processed time domain signals.
Optionally, determining whether to perform speech processing based on the original time domain signal based on energy of the original time domain signal and the processed time domain signal comprises: and if the value of the frequency domain energy of the processed time domain signal is larger than a second threshold value and smaller than a third threshold value, judging whether to perform voice processing based on the original time domain signal according to the energy of the original time domain signal and the energy of the processed time domain signal.
Optionally, the method further comprises: when the value of the frequency domain energy is smaller than a second threshold value, not performing voice processing based on the original time domain signal; and performing speech processing based on the original time domain signal when the value of the frequency domain energy is greater than a third threshold.
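As a rough sketch of the combined decision logic in the optional embodiments above, the following Python fragment gates on the frequency domain energy first and falls back to the time domain energy ratio only in the ambiguous band between the two thresholds. All threshold values and the function name are illustrative placeholders, not values taken from the disclosure:

```python
def decide(freq_energy, energy_ratio,
           second_threshold, third_threshold, first_threshold):
    """Combined gate over the optional embodiments: reject clearly weak
    signals, accept clearly strong ones, and consult the time domain
    energy ratio only in the ambiguous band in between."""
    if freq_energy < second_threshold:
        return False              # below the noise floor: skip processing
    if freq_energy > third_threshold:
        return True               # clearly strong: process directly
    return energy_ratio >= first_threshold   # ambiguous: check attenuation

# Illustrative placeholder thresholds: second=0.5, third=5.0, first=0.5.
assert decide(0.1, 0.9, 0.5, 5.0, 0.5) is False   # quiet noise rejected
assert decide(9.0, 0.1, 0.5, 5.0, 0.5) is True    # loud speech accepted
assert decide(2.0, 0.2, 0.5, 5.0, 0.5) is False   # mid band, attenuated
```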
Optionally, the head-mounted device includes a speaker, and performing direction-based enhancement suppression processing on the original time domain signal to obtain the processed time domain signal includes: when the speaker outputs audio in the current usage scenario, performing echo cancellation processing based on the speaker's direction using the audio signal.
Optionally, the direction of the target speaker is determined according to the current usage scenario, and the signal strength in that direction is enhanced in the enhancement suppression processing.
According to a second aspect of the present disclosure, there is provided a head-mounted device comprising: the microphone is used for collecting voice signals; and a processing unit for performing the speech processing method as described in the first aspect.
Thus, the present disclosure determines whether the collected signal contains voice information of the target speaker based on the degree of energy attenuation of the processed signal (e.g., the directionally enhanced signal) compared to the original signal, and in particular enables accurate discrimination of situations where a non-target speaker speaks loudly, thereby avoiding erroneous operation of the head-mounted device. The above decision may serve as an alternative or a supplement to conventional decisions based on frequency domain energy magnitude.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout exemplary embodiments of the disclosure.
Fig. 1 shows an example of a smart glasses structure.
Fig. 2 shows a schematic flow chart of a headset speech processing method according to one embodiment of the present disclosure.
Fig. 3 shows a schematic flow chart of a headset speech processing method according to one embodiment of the present disclosure.
Fig. 4A-B show graphical representations of the relative magnitudes of pre-and post-processing speech signals for speakers of different volumes at different locations.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. In addition, it should be understood that the terms "first," "second," and "third," or the like, in this disclosure are used for descriptive and distinguishing purposes only and are not to be construed as indicating or implying any particular order or importance between such terms.
The head-mounted device provides a more immersive audio-visual experience for the wearer by projecting a wide-view-angle picture at close range on the device lens, combined with headphones or speakers near the ears. Head-mounted devices fall into light-transmissive and non-light-transmissive types. A light-transmissive head-mounted device can display pictures while also allowing the user to see the scene behind the display screen. This feature is often used for developing Augmented Reality (AR) applications.
Smart glasses are a light-transmissive type of head-mounted device. Fig. 1 shows an example of a smart glasses structure. As shown, the smart glasses 100 are first of all a pair of glasses, and thus may have a structure similar to conventional glasses, including lenses 110, a frame 120, and temples 130. Further, the smart glasses 100 can provide various "intelligent" functions that conventional glasses cannot, such as voice interaction, real-time translation, and audio-visual playback. Thus, the smart glasses 100 also need to integrate the components required by a smart computing device. In particular, the processing unit of the smart glasses 100 may be arranged within the temple 130, for example at the illustrated location 131. The processing unit may perform various operations according to acquired instructions, for example, projecting left and right images onto the left and right lenses 110 for display using left and right display units (not shown in the figure). To receive voice instructions from the wearer, the smart glasses 100 may also be equipped with a microphone, such as the illustrated microphone 132. The microphone 132 may preferably be provided on the temple 130, so as to receive the wearer's spoken instructions from a location where the wearer's voice can be clearly picked up. In addition, the smart glasses 100 may further include a speaker 133 for playing voice feedback to the wearer or sound content designated by the wearer, for example music. In order to provide a better sound effect to the wearer, speakers 133 may be provided at the positions of the left and right temples near the ears, as shown.
In a preferred embodiment, the smart glasses 100 may be provided with a plurality of microphones arranged at different positions of the smart glasses for picking up sound from different directions. In the example shown in fig. 1, a microphone 121 is provided in the frame 120 in addition to the microphone 132 provided on the temple 130. Here, the microphone 132 may be regarded as the microphone mainly used for acquiring the wearer's voice instructions, and the microphone 121 as the microphone mainly used for acquiring the voice input of the wearer's conversation partner. In this case, the microphones 132 and 121 may constitute a microphone array. A specific algorithm may be applied to the speech signals collected by the microphone array to extract signals containing valid speech information for voice wake-up and/or speech recognition.
As a computing device, the smart glasses 100 should include a storage unit and a communication unit (not shown in the figure) in addition to the processing unit 131 shown in the figure.
The processing unit 131 may be a single processor or may include a plurality of processors. In some embodiments, processing unit 131 may comprise a general-purpose main processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, at least a portion of processing unit 131 may be implemented using custom circuitry, such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The storage unit may include various types of storage, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processing unit or other modules of the computing device. The persistent storage may be a readable and writable storage device, i.e., a non-volatile device that does not lose stored instructions and data even after the computing device is powered down. The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data required by some or all of the processing units at runtime. The storage unit stores executable code which, when processed by the processing unit, causes the processing unit to perform the methods required to implement the smart-glasses-related functions.
The communication unit may be used for wireless communication with the outside world. For example, the smart glasses may acquire a result of the voice recognition through wireless communication-based interaction of the communication unit with a voice model disposed on the server.
It should be appreciated that while fig. 1 illustrates a particular framed eyeglass configuration, in other embodiments the smart glasses may have a rimless design. In that case, the elements originally arranged in the frame may instead be arranged at other positions of the glasses. In other embodiments, the smart glasses may include components beyond the glasses body structure, such as an image projection device attached to the outside of a lens, an additional near-/far-vision lens on the inside of a lens, or a removable in-ear earplug. The present disclosure does not limit the specific implementation of the smart glasses.
As can be seen from the description above in connection with fig. 1, head-mounted devices such as smart glasses are typically equipped with microphones and speakers and may be used to provide various types of voice interaction services. In a voice interaction scenario, it is necessary to correctly collect the voice of the target speaker and filter out extraneous sounds from the surrounding environment. In the prior art, extraneous sounds are typically distinguished based on the frequency domain energy magnitude of the noise-suppressed signal. However, in many situations the extraneous sound cannot be filtered out; for example, the loud speech of a nearby person cannot be completely suppressed, so the extraneous sound is wrongly recognized and the wearer's device experience is degraded.
For this reason, the present disclosure proposes a time-domain speech signal processing method as an alternative or supplement to the existing frequency-domain energy determination method. The voice processing method of the present disclosure performs a direction enhancing/suppressing process based on a current usage scenario on an original time domain signal acquired by a headset, and determines whether to perform a subsequent voice recognition process on the original signal according to energy of the time domain signal before and after the process (for example, energy attenuation degree of the processed signal compared with the original signal).
Fig. 2 shows a schematic flow chart of a headset speech processing method according to one embodiment of the present disclosure. The speech processing method shown in fig. 2 can be used by various types of head-mounted devices equipped with microphones, for example, the method is suitable for use with eye-shield VR glasses, head-mounted audio-visual devices, and is particularly suitable for use with smart glasses equipped with microphones as shown in fig. 1. In one embodiment, the speech processing method shown in fig. 2 may be performed by a processing unit of a headset, such as processing unit 131 of the smart glasses shown in fig. 1.
In step S210, an original time domain signal collected by the microphone is acquired. Subsequently, in step S220, direction-based enhancement suppression processing is performed on the original time domain signal to obtain a processed time domain signal. Here, direction-based enhancement suppression processing refers to performing enhancement and/or suppression operations in a specific direction on the original time domain signal. In the present application, the direction on which the enhancement suppression processing is based may be selected according to the current usage scenario of the head-mounted device. In different scenarios, the head-mounted device needs to collect speech input from different target speakers. In a specific application scenario of the head-mounted device, the relative position and/or direction of the speaker and the head-mounted device are often fixed, so the enhancement and/or suppression operation in a specific direction can be performed according to the speaker's relative position and/or direction, inferred from the current usage scenario, with respect to the microphone that collects the original time domain signal.
In one embodiment, the direction in which enhancement and/or suppression is desired may be determined based on the current usage scenario. For example, the direction may be a target sound source (speaker) direction in the current use scenario, and the signal strength of the speaker direction is enhanced in the enhancement suppression process, and/or the signal strength in other directions is suppressed.
In one embodiment, the enhancement suppression processing may be directional enhancement processing based on a beamforming algorithm. A beamforming algorithm operates on multiple signals acquired by a plurality of microphones and enhances one signal in a specific direction. As the signal in that particular direction is enhanced, signals from the other directions are effectively suppressed.
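For illustration only, the following minimal two-microphone delay-and-sum beamformer (a standard directional enhancement technique; the disclosure does not specify this particular algorithm) shows how sound from the steered direction adds coherently while sound from another direction cancels. The signals, lag value, and function name are all illustrative assumptions:

```python
def delay_and_sum(primary, secondary, lag):
    """Two-microphone delay-and-sum beamformer with an integer-sample
    steering delay. `lag` is the number of samples by which the target
    direction's wavefront reaches the secondary mic after the primary
    mic (assumed known from the array geometry and usage scenario)."""
    out = []
    for n in range(len(primary)):
        m = n + lag                       # advance secondary to align target
        s = secondary[m] if 0 <= m < len(secondary) else 0.0
        out.append(0.5 * (primary[n] + s))
    return out

# Target talker (steered direction): secondary mic hears x one sample late.
x = [1.0, -1.0, 1.0, -1.0]
target_secondary = [0.0] + x[:-1]
# Interferer from another direction: reaches both mics simultaneously.
interferer_secondary = list(x)

target_out = delay_and_sum(x, target_secondary, lag=1)          # coherent
interferer_out = delay_and_sum(x, interferer_secondary, lag=1)  # cancels
assert target_out[:3] == [1.0, -1.0, 1.0]     # target preserved
assert interferer_out[:3] == [0.0, 0.0, 0.0]  # off-axis suppressed
```

Real implementations use fractional delays and adaptive weights, but the energy contrast between the two outputs is exactly what step S230 exploits.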
It should be understood that in step S220 the original time domain signal is subjected to direction-based enhancement suppression processing in the time domain, so the resulting processed time domain signal is still a time domain signal, but one in which sound in the target direction is enhanced and/or sound in irrelevant directions is suppressed. After the processed time domain signal is acquired, it may be determined in step S230 whether to perform voice processing based on the original time domain signal according to the energies of the processed time domain signal and the original time domain signal. Since the signal strength in the target speaker's direction is enhanced and/or the signal strength in other directions is suppressed in step S220, if the voice signal of the target speaker is present in the original time domain signal, the energy of the processed time domain signal after the enhancement suppression processing of step S220 should not be significantly attenuated compared to the energy of the original time domain signal. For example, in the case where the target speaker is the person opposite the wearer (simply the "opposite person"), step S220 may enhance the signal strength in the frontal direction of the head-mounted device and/or suppress the signal strength in other directions. If the collected signal includes the voice signal of the opposite person, that voice signal is enhanced by the enhancement suppression processing of step S220, so the energy of the processed time domain signal is not significantly attenuated compared to the energy of the original time domain signal.
However, if the collected signal does not include the voice signal of the opposite person but instead contains speech from a person in another direction nearby (a "bystander"), that voice signal originates from another direction and is therefore suppressed by the enhancement suppression processing of step S220, so the energy of the processed time domain signal is greatly attenuated compared with the energy of the original time domain signal. Thus, whether the original signal includes the voice signal of the target speaker may be determined based on the magnitude of the energy attenuation of the signal processed in step S220 compared to the original signal.
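The attenuation-based decision described above can be sketched as follows; the RMS helper, the default 0.5 threshold, and the function names are illustrative assumptions, not values from the disclosure:

```python
import math

def rms(samples):
    """Root mean square energy of a time domain frame."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def should_process(original, enhanced, first_threshold=0.5):
    """If directional enhancement barely attenuated the signal, the
    sound likely came from the target direction; heavy attenuation
    suggests an off-axis (non-target) talker.  The 0.5 threshold is a
    placeholder, not a value from the disclosure."""
    return rms(enhanced) / rms(original) >= first_threshold

# An on-axis voice survives enhancement almost unchanged...
assert should_process([0.4, -0.5, 0.45, -0.4], [0.38, -0.47, 0.43, -0.39])
# ...while an off-axis voice is strongly attenuated by it.
assert not should_process([0.4, -0.5, 0.45, -0.4], [0.05, -0.06, 0.05, -0.05])
```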
The energy referred to in step S230 may be time domain energy or frequency domain energy. In one embodiment, since the original signal and the processed signal are both time domain signals, the energy of the time domain signal may be calculated directly; for example, a mean of the time domain signal energy over a predetermined period may be calculated as the time domain energy. In another embodiment, the original signal and the processed signal may first be transformed into the frequency domain, for example by a short-time Fourier transform, and the frequency domain energy of the transformed signals calculated. Clearly, obtaining the time domain energy requires less computation than the frequency domain energy. Whether the original time domain signal contains the voice information of the target speaker can thus be determined from the signal energies before and after processing, for example by comparing their ratio or difference. It should be understood that "speech processing" here means processing performed on a signal judged to be useful; in different embodiments, the speech processing may be voice wake-up processing based on the original signal, or speech recognition processing performed on the original signal.
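The observation that the time domain computation is cheaper can be checked against Parseval's theorem, which guarantees the two energy measures agree (up to normalization): the time domain sum is O(N), while the frequency domain route requires a transform first. The naive DFT below is purely illustrative:

```python
import cmath

def time_energy(x):
    """Sum of squared samples -- O(N), no transform needed."""
    return sum(v * v for v in x)

def freq_energy(x):
    """Energy via a naive O(N^2) DFT; Parseval's theorem guarantees
    this equals the time domain energy after the 1/N normalization."""
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
         for k in range(N)]
    return sum(abs(c) ** 2 for c in X) / N

x = [0.3, -0.7, 0.2, 0.5]
assert abs(time_energy(x) - freq_energy(x)) < 1e-9  # same energy, less work
```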
Thus, the voice processing method of the present disclosure exploits the fact that the relative position and/or direction between the target speaker and the microphones arranged on the head-mounted device can be estimated in a specific usage scenario: it performs enhancement suppression processing in a specific direction on the collected original signal, and determines whether the original signal contains the target speaker's voice information from the signal energies before and after processing. A fast decision on whether to speech-process or ignore the collected signal can thereby be achieved with relatively small computation and memory requirements. Moreover, compared with schemes that decide whether to process a signal according to its direction of arrival, the voice signal to be processed can be identified without an additional sound source localization module, which helps reduce product cost.
A head-mounted device typically includes a plurality of microphones, e.g., at least two microphones, arranged at different locations. In this case, one signal needs to be selected from the signals collected by the plurality of microphones for the direction-based enhancement suppression processing, and whether to perform speech processing on the original signal is determined from the energy of that one signal before and after processing. For ease of illustration, consider a plurality of microphones disposed in a head-mounted device, each of which may serve as the primary microphone in one or more corresponding usage scenarios. Here, "primary microphone" means that, in a specific usage scenario, the signal collected by this microphone is selected for the direction-based enhancement suppression processing, and processing such as speech recognition or voice wake-up may be performed based on the processed or original signal. In different usage scenarios, different microphones may serve as the primary microphone. Taking the example shown in fig. 1, in a scenario where the wearer's voice needs to be captured (e.g., a voice assistant scenario), microphone 132 is regarded as the primary microphone; whereas in a scenario where another speaker's voice needs to be captured (e.g., a usage scenario where a translation application needs to capture the content of the speaker's speech), microphone 121 may be regarded as the primary microphone.
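A minimal sketch of scenario-based primary microphone selection; the scenario keys and microphone labels are hypothetical names chosen to match the fig. 1 discussion, not identifiers from the disclosure:

```python
# Scenario-to-primary-microphone table for the two-microphone smart
# glasses of fig. 1.  Keys and labels are illustrative assumptions.
PRIMARY_MIC = {
    "voice_assistant": "temple_mic_132",  # capture the wearer's voice
    "translation": "frame_mic_121",       # capture the opposite talker
}

def select_primary(scenario):
    # Default to the wearer-facing temple microphone for unknown scenarios.
    return PRIMARY_MIC.get(scenario, "temple_mic_132")
```

The remaining microphones in the table's scenario then serve as secondary inputs to the beamforming step.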
At this time, step S220 includes: performing direction-based enhancement suppression processing on the primary microphone original time domain signal collected by the primary microphone in the current usage scenario, to obtain the processed time domain signal. In one embodiment, the direction-based enhancement suppression processing of the primary microphone original time domain signal requires signals from microphones other than the primary microphone. For example, the primary microphone original time domain signal may be directionally enhanced using a beamforming algorithm, but a beamforming algorithm requires multiple signals for its computation. In that case, step S210 needs to acquire multiple original signals; specifically, acquiring the original time domain signals collected by the microphones includes: acquiring the primary microphone original time domain signal collected by the primary microphone in the current usage scenario and the other microphone original time domain signals collected by microphones other than the primary microphone. The subsequent step S220 then includes: performing directional enhancement processing on the primary microphone original time domain signal collected by the primary microphone in the current usage scenario, using the other microphone original time domain signals, to obtain the processed time domain signal. The following describes the two-microphone configuration of the smart glasses shown in fig. 1, with microphone 132 at the temple and microphone 121 at the frame.
When the smart glasses are in a use scenario interacting with the wearer, for example, in case the voice assistant of the smart glasses is turned on, it can be assumed that the smart glasses need to collect voice instructions of the wearer. At this time, the wearer is the target speaker. Since the relative positions and directions of the mouth and the various components on the smart glasses are fixed when the wearer wears the smart glasses, the microphone 132 may be considered to be a primary microphone and the microphone 121 may be other than the primary microphone, and may also be referred to as a secondary microphone, in a use scenario where voice input of the wearer is required to be acquired by default. In step S210, two signals acquired by the primary microphone 132 and the secondary microphone 121 are acquired. In the subsequent step S220, the one-way original time-domain signal (i.e., the primary microphone original time-domain signal) collected by the primary microphone 132, which is more likely to collect the clear voice of the wearer, is subjected to the enhancement suppression process. Specifically, using the other original time domain signal acquired by the secondary microphone 121 (i.e., the other microphone original time domain signal), a directional enhancement process for the original time domain signal acquired by the primary microphone is performed according to the direction of the wearer's mouth relative to the primary microphone, for example, based on a beamforming algorithm.
In a usage scenario where the smart glasses need to interact with a speaker other than the wearer, for example when the translation function of the smart glasses is turned on to capture the speech of another speaker talking with the wearer, it may be inferred that the smart glasses need to collect the voice input of that other speaker. Since the wearer usually faces the conversation partner, the voice signal to be collected can be assumed to come from in front of the smart glasses. At this time, the wearer's conversation partner (i.e., the person opposite the wearer, hereinafter the "opposite person") is the target speaker. In other embodiments, the relative position and/or direction of the conversation partner and the smart glasses can also be calculated from image signals collected by the head-mounted device. In this scenario, microphone 121 is regarded as the primary microphone, and microphone 132 is a microphone other than the primary microphone, which may also be referred to as a secondary microphone. Again, the two signals collected by the primary microphone 121 and the secondary microphone 132 are acquired in step S210. In the subsequent step S220, enhancement suppression processing is performed on the original time domain signal collected by the primary microphone 121 (i.e., the primary microphone original time domain signal), which is more likely to capture the opposite person's voice clearly. Specifically, the other original time domain signal collected by the secondary microphone 132 (i.e., the other microphone original time domain signal) may be used to perform directional enhancement processing on the original time domain signal collected by the primary microphone 121 according to the direction of the opposite person relative to the primary microphone 121, for example based on a beamforming algorithm.
In other embodiments, the headset may be equipped with two microphones at locations other than those shown in fig. 1, or with more microphones. Regardless of the arrangement, such headsets can select one microphone signal as the primary-microphone signal according to the estimated or measured speaker position/direction in the current scenario, perform the direction-based enhancement suppression processing, determine from the energies before and after processing whether the original signal contains the target voice information, and on that basis decide whether to perform voice processing or simply ignore the signal.
In one embodiment, the determination of step S230 according to the energies before and after signal processing may be implemented as a determination according to the ratio of those energies. Step S230 may then include: calculating the energy ratio of the processed time-domain signal to the original time-domain signal, and determining, according to the magnitude of the calculated energy ratio, whether to perform voice processing on the original time-domain signal acquired in step S210. The ratio of the energy means of the processed and original time-domain signals over a time window may be calculated as the energy ratio, where the energy mean used may be a root-mean-square (RMS) value. In one embodiment, a time window of two frames may be used; with each frame 15 ms long, the window is 30 ms long and the frame shift is 15 ms, which is equivalent to smoothing over 2 frames. In a specific scenario, a continuous energy-smoothing operation is performed on the original audio signal collected by the selected microphone, yielding the energy mean RMS1 of the original time-domain signal over the time window; the direction-based enhancement suppression processing (e.g., a directional enhancement operation) is performed on that signal, yielding the energy mean RMS2 of the processed time-domain signal over the time window. Whether to process the original audio signal is then determined based on the ratio of RMS1 to RMS2, or of RMS2 to RMS1. When the ratio RMS2/RMS1 is used and it is less than a certain threshold (e.g., the first threshold Hrate, described in detail below in connection with fig. 3), the processed time-domain signal may be considered not to contain enough valid information, so speech processing based on the original time-domain signal is not performed; above this threshold, the processed time-domain signal is considered to still contain enough valid information, so the original audio signal needs to be processed. Similarly, when the ratio RMS1/RMS2 is used, speech processing may be performed when it is less than a particular threshold and the signal ignored when it is greater. In other embodiments, the energy ratio may be obtained by converting the signals before and after processing into the frequency domain, but this increases the amount of computation.
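The RMS-ratio decision described above can be sketched as follows. The 15 ms frame length and two-frame smoothing window come from the embodiment; the 16 kHz sample rate and the 0.6 threshold value are illustrative assumptions:

```python
import math

FS = 16000                        # assumed sample rate
FRAME_MS = 15                     # frame length from the embodiment
FRAME = FS * FRAME_MS // 1000     # 240 samples per 15 ms frame

def frame_rms(x):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def windowed_rms(signal, n_frames=2):
    """Mean frame RMS over a window of n_frames consecutive frames
    (two 15 ms frames give the 30 ms smoothing window)."""
    frames = [signal[i:i + FRAME]
              for i in range(0, len(signal) - FRAME + 1, FRAME)]
    recent = frames[-n_frames:]
    return sum(frame_rms(f) for f in recent) / len(recent)

def keep_for_speech(original, processed, hrate=0.6):
    """Keep the signal for speech processing only if the direction-based
    processing did not strip most of its energy (RMS2/RMS1 > Hrate)."""
    rms1 = windowed_rms(original)    # original time-domain signal
    rms2 = windowed_rms(processed)   # processed time-domain signal
    return (rms2 / rms1) > hrate

# Mild attenuation (target speaker) is kept; heavy attenuation
# (off-axis speech) is dropped.
kept = keep_for_speech([1.0] * (2 * FRAME), [0.8] * (2 * FRAME))
dropped = keep_for_speech([1.0] * (2 * FRAME), [0.4] * (2 * FRAME))
```

The equivalent RMS1/RMS2 form would simply invert the comparison, as the text notes.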
The time-domain signal processing method based on the current usage scenario, as described above in connection with fig. 2, may be used as an alternative or supplement to the conventional method of identifying extraneous signals based on frequency-domain energy. In one embodiment, the determination based on frequency-domain energy may be performed on the processed time-domain signal first, followed by the determination based on the energies before and after signal processing in step S230. To make decisions based on frequency-domain energy, the headset likewise needs to include multiple microphones arranged at different locations. For example, the headset shown in fig. 1 includes the two microphones 132 and 121; other headsets may be arranged with three or even more microphones. In that case, the voice processing method of the present disclosure may further include: calculating the frequency-domain energy of the processed time-domain signal. In one embodiment, a Short-Time Fourier Transform (STFT) may be applied to the processed time-domain signal to obtain its frequency-domain energy.
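What "frequency-domain energy of the processed time-domain signal" amounts to can be illustrated with a plain per-frame DFT. A real STFT additionally applies an analysis window, overlapping frames, and an FFT, so this is only a sketch of the quantity being thresholded:

```python
import cmath

def frame_frequency_energy(frame):
    """Frequency-domain energy of one frame via a direct DFT.

    (A practical implementation would window the frame and use an
    FFT; this O(n^2) loop only illustrates the definition.)
    """
    n = len(frame)
    energy = 0.0
    for k in range(n):
        X = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
        energy += abs(X) ** 2
    # Parseval's theorem: sum|X[k]|^2 / n equals sum of squared samples,
    # so the frequency-domain energy matches the time-domain energy.
    return energy / n

frame = [0.5, -0.25, 0.75, 0.1, -0.6, 0.3, 0.0, -0.1]
E_freq = frame_frequency_energy(frame)
E_time = sum(s * s for s in frame)
```

The equality of `E_freq` and `E_time` is why thresholding either domain gives the same loudness gate; the frequency domain is used here because the beamformed signal is often already available there.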
When a determination based on frequency-domain energy is required, the execution condition of step S230 is that the frequency-domain energy falls within a specific range. Step S230 may then include: if the value of the frequency-domain energy of the processed time-domain signal is greater than a second threshold and less than a third threshold, determining whether to perform voice processing based on the original time-domain signal according to the energies of the original and processed time-domain signals. That is, in one possible embodiment, after the primary-microphone raw data undergoes the enhancement/suppression processing to obtain the processed signal, the frequency-domain energy of the processed signal is calculated first. Only when the frequency-domain energy of the processed signal is greater than the second threshold and less than the third threshold is the determination of step S230, based on the time-domain energies before and after signal processing, carried out. When the value of the frequency-domain energy is less than the second threshold or greater than the third threshold, the determination may be made directly without executing step S230. Specifically, if the value of the frequency-domain energy is less than the second threshold, indicating that the signal is most likely noise, it may be directly determined that no voice processing based on the original time-domain signal is performed. If the value of the frequency-domain energy is greater than the third threshold, the signal most likely requires voice processing, and voice recognition processing based on the original time-domain signal is performed. Evidently, the second threshold is smaller than the third threshold here, and the values of both may be preset based on experience.
Fig. 3 shows a schematic flow chart of a headset speech processing method according to one embodiment of the present disclosure, describing a more refined embodiment. The method is applicable to a headset equipped with a plurality of microphones, and particularly to smart glasses equipped with two microphones, one primarily for capturing the wearer's voice and the other primarily for capturing the voice of the person facing the wearer, such as the microphones 132 and 121 shown in fig. 1. Such a microphone arrangement in itself enables directional acquisition: the pickup volume is relatively high for the wearer and the opposite person, and relatively low for a nearby person speaking.
In addition to directionally collecting audio based on the microphone arrangement, a voice enhancement algorithm combining beamforming and frequency-domain energy comparison can be used to suppress speech from the side and enhance the voice of the wearer or the opposite person. However, when a person to the side speaks loudly, even after directional enhancement of the selected signal with a beamforming algorithm, the processed signal retains a comparable amount of frequency-domain energy and cannot be distinguished, by frequency-domain energy alone, from the wearer (or opposite person) speaking quietly. In that case, making only frequency-domain-energy-based decisions may allow the irrelevant voice of a nearby person to wake up the headset, or cause its speech content to be recognized. Such misrecognitions do not meet the design objectives of smart glasses.
For this situation, the inventors of the present disclosure found that while loud speech from the side and speech from the wearer (or the opposite person) cannot be distinguished by frequency-domain energy, the speech of a person to the side is greatly attenuated by the direction-based enhancement suppression processing, whereas the speech of the wearer (or opposite person) is not. Based on this feature, the energy ratio of the audio before and after processing can be used to distinguish whether the wearer (or opposite person) is speaking quietly or a nearby person is speaking loudly, thereby avoiding false triggering of the smart glasses by a nearby person's loud speech.
In the method shown in fig. 3, two threshold values may be set for the frequency-domain energy: Hx and Hy. Hx corresponds to the second threshold described above and may be used here as a minimum frequency-domain energy threshold. If the frequency-domain energy of the processed time-domain signal resulting from the direction-based enhancement suppression processing (e.g., a directional enhancement operation based on a beamforming algorithm) is less than Hx, the sound is too quiet: it can be assumed that no nearby speech was captured, or that only ambient noise was, and these signals are directly ignored. Hy corresponds to the third threshold described above and may be used here as the frequency-domain energy threshold for normal wake-up audio. If the frequency-domain energy of the processed time-domain signal exceeds Hy, the captured audio is most likely normal wake-up audio, and the corresponding voice processing is performed on it; for example, the smart glasses are woken up and/or the captured content undergoes voice recognition. Clearly, here Hy > Hx.
Further, for the case where the frequency-domain energy lies between Hx and Hy, an energy-ratio threshold Hrate may be set. Hrate corresponds to the first threshold described above. When the ratio of the energy mean RMS2 of the processed time-domain signal to the energy mean RMS1 of the original time-domain signal (i.e., RMS2/RMS1) is greater than Hrate, the energy attenuation of the processed signal relative to the original is considered small, the captured signal is taken to be the target speaker's speech, and the corresponding voice processing is performed. When RMS2/RMS1 is less than Hrate, most of the energy is considered to have been suppressed in the directional enhancement suppression processing, the captured signal is taken to be the irrelevant speech of a nearby person, and the relevant signals can be directly ignored.
Specifically, in step S310, a plurality of signals are acquired. In the example of fig. 1, one original time-domain signal from each of the microphones 132 and 121 is acquired at this point.
Subsequently, in step S320, the directionally enhanced signal is calculated. Based on the current usage scenario of the smart glasses, one signal (that of the microphone 132 when the wearer is the speaker; that of the microphone 121 when the opposite person is the speaker) is selected as the primary-microphone signal, and the direction-based enhancement suppression processing is performed on it; for example, directional enhancement based on a beamforming algorithm may be applied to the primary-microphone signal using the secondary-microphone signal. The directionally enhanced signal thus obtained serves as the processed signal, and in step S330 its frequency-domain energy, for example STFT energy, is calculated.
It is then determined whether to perform subsequent speech processing on the signal, first based on the STFT energy of the directionally enhanced signal itself, and then, if needed, based on the energy ratio of the signal before and after processing. Specifically, in step S340, if the STFT energy of the directionally enhanced signal is less than the second threshold Hx, it may be directly determined that the signal will not undergo any processing, i.e., the flow proceeds to the ignore step S390. If, instead, step S340 determines that the STFT energy is greater than the second threshold Hx, the STFT energy is compared with the third threshold Hy in step S350. If the STFT energy is greater than the third threshold, the energy of the voice signal is sufficiently high, and the flow may proceed directly to step S380, where it is determined that the signal will undergo the corresponding voice processing. If step S350 determines that the STFT energy is less than Hy, then, since the STFT energy now lies between Hx and Hy, the flow proceeds to step S360: the energies of the selected original and processed signals are calculated, for example their respective time-domain energy means RMS1 and RMS2, and step S370 determines whether RMS2/RMS1 is greater than the first threshold Hrate. If it is greater than Hrate, the processed signal still contains sufficient energy compared to the original, so the original signal is not irrelevant speech and can undergo voice processing; if RMS2/RMS1 is less than Hrate, the processed signal does not contain sufficient energy compared to the original, so the original signal is irrelevant speech and its subsequent voice processing can be skipped.
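The branching of steps S340-S390 can be condensed into a single decision function. The concrete values of Hx, Hy, and Hrate below are placeholders, since the disclosure leaves them to be preset based on experience:

```python
def decide(stft_energy, rms1, rms2, hx=0.01, hy=1.0, hrate=0.6):
    """Decision flow of fig. 3 (threshold values are illustrative).

    stft_energy: frequency-domain energy of the processed signal
    rms1, rms2:  windowed RMS of the original / processed signals
    Returns "process" to run speech processing on the original
    signal, or "ignore" to drop it.
    """
    if stft_energy < hx:       # S340: too quiet, likely noise -> ignore
        return "ignore"
    if stft_energy > hy:       # S350: clearly loud enough -> process
        return "process"
    # Between Hx and Hy: fall back to the before/after energy ratio.
    if rms2 / rms1 > hrate:    # S370: little attenuation -> target speaker
        return "process"
    return "ignore"            # heavy attenuation -> bystander speech
```

With these placeholder thresholds, a very quiet signal is ignored outright, a loud one is processed outright, and only the ambiguous middle band invokes the RMS2/RMS1 comparison.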
It should be appreciated that in other embodiments the order may be reversed: the frequency-domain energy magnitude may be evaluated after comparing RMS2/RMS1, i.e., only after excluding the cases of excessive attenuation during processing.
Table 1 below lists the original and processed signal energies measured in a scenario where the wearer is the target speaker, with a non-wearer speaking at different volumes from various positions and the wearer speaking at different volumes. Figs. 4A-B graphically show the relative magnitudes of the speech signals before and after processing for speakers of different volumes at different positions. Fig. 4A is a line graph of the specific values of the original and processed signal energies for each of the 12 cases in Table 1; the abscissa corresponds to the 12 numbered cases and the ordinate to the energy value of the signal, for example the time-domain mean. Fig. 4B shows the ratio of the processed signal energy to the original signal energy, i.e., the RMS2/RMS1 value; the abscissa again corresponds to the 12 numbered cases and the ordinate is a percentage (%). Fig. 4B shows intuitively that, at whatever volume the wearer speaks (numbers 9-12), the energy attenuation of the processed signal relative to the original is small: RMS2/RMS1 exceeds 60% in all four cases, i.e., the processed signal is attenuated by less than 40%. When a non-wearer speaks, the attenuation of the processed signal is large even if the non-wearer speaks loudly or heavy noise is superimposed (wind noise is superimposed in number 8): RMS2/RMS1 for numbers 1-8 is below 60%, i.e., the processed signal is attenuated by more than 40%. Thus, in one example, the decision threshold for RMS2/RMS1 may be set to 60% (i.e., Hrate = 60%; correspondingly, the threshold for RMS1/RMS2 is 1/60% ≈ 1.67).
In another embodiment, Hrate may be set lower, for example to 50%, to prevent missed detection of the wearer speaking (though this may admit some false detections of non-wearer speech). In this way, when the energy attenuation of the processed signal relative to the original exceeds the corresponding amount (40% for Hrate = 60%), the captured signal can be determined to be that of a non-target speaker, and such signals can be ignored without the corresponding voice processing.
| No. | Position | Speaker | Original signal energy | Processed signal energy |
|-----|----------|---------|------------------------|-------------------------|
| 1 | Right, 1 m, loud volume | Non-wearer | 5.686406334 | 3.271750792 |
| 2 | Right, 1 m, loud volume 2 | Non-wearer | 6.811186228 | 2.248167799 |
| 3 | Right, 1 m, normal volume | Non-wearer | 3.773225022 | 1.427485008 |
| 4 | Right, 0.5 m, loud volume | Non-wearer | 4.561988071 | 2.099041611 |
| 5 | Right, 0.5 m, slightly lower volume | Non-wearer | 4.762516492 | 2.049944041 |
| 6 | Right, 0.5 m, slightly lower volume | Non-wearer | 2.865536829 | 1.182368333 |
| 7 | Right, 0.5 m, normal volume | Non-wearer | 5.168982877 | 1.661084051 |
| 8 | Right, 1 m, normal volume, wind | Non-wearer | 7.019005822 | 3.575141032 |
| 9 | Loud volume | Wearer | 11.10117482 | 7.944101545 |
| 10 | Low volume | Wearer | 3.099744673 | 1.953550006 |
| 11 | Normal volume | Wearer | 5.032631461 | 3.281730654 |
| 12 | Minimum volume | Wearer | 3.081347329 | 1.988503758 |

TABLE 1
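The separation claimed above can be checked directly from Table 1. Taking one non-wearer row and one wearer row, the non-wearer ratio falls below the 60% threshold and the wearer ratio stays above it:

```python
# (RMS1, RMS2) pairs copied from Table 1, keyed by row number
rows = {
    1: (5.686406334, 3.271750792),   # non-wearer, right 1 m, loud volume
    9: (11.10117482, 7.944101545),   # wearer, loud volume
}
HRATE = 0.60  # decision threshold chosen in the text

ratios = {k: rms2 / rms1 for k, (rms1, rms2) in rows.items()}
# Non-wearer speech is attenuated past the threshold; wearer speech is not.
is_target = {k: r > HRATE for k, r in ratios.items()}
```

Row 1 gives a ratio of about 0.575 (attenuation above 40%), row 9 about 0.716 (attenuation below 40%), matching the discussion of fig. 4B.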
Additionally, in one embodiment, if the headset includes a speaker as shown in fig. 1, obtaining the processed signal must also account for the case where the speaker is playing audio. In that case, step S220 or step S320 may include: when the speaker outputs audio in the current usage scenario, using the audio signal to perform echo cancellation processing based on the direction of the speaker. Here, the playback audio signal obtained in software may be used as a third signal for the echo cancellation operation.
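The echo-cancellation step is only named in the text. A common way to realize it with a known playback reference is an adaptive filter such as NLMS; the sketch below, including the filter length and step size, is an illustrative assumption rather than the disclosed method:

```python
def nlms_echo_cancel(mic, playback, taps=8, mu=0.5, eps=1e-8):
    """Minimal NLMS adaptive echo canceller (illustrative sketch).

    Models the speaker-to-microphone echo path as a short FIR filter
    adapted against the known playback signal, and outputs the
    microphone signal with the estimated echo subtracted.
    """
    w = [0.0] * taps                  # echo-path estimate (FIR taps)
    out = []
    for n in range(len(mic)):
        x = [playback[n - i] if n - i >= 0 else 0.0 for i in range(taps)]
        echo_hat = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_hat         # residual = mic minus estimated echo
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out

# Toy check: the microphone hears a scaled, 2-sample-delayed copy of
# the playback; after adaptation the residual shrinks toward zero.
playback = [((n * 37) % 11 - 5) / 5.0 for n in range(2000)]
mic = [0.6 * (playback[n - 2] if n >= 2 else 0.0) for n in range(2000)]
residual = nlms_echo_cancel(mic, playback)
```

In practice the canceller would run before the beamforming and energy decisions, so that playback audio does not masquerade as target speech.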
The present disclosure may also be embodied as a headset device comprising: a microphone for collecting voice signals; and a processing unit for performing the speech processing method described above in connection with figs. 2 and 3. In one embodiment, the headset may be smart glasses, such as the smart glasses shown in fig. 1 with two microphones. The smart glasses may treat the microphones 132 and 121 as a microphone array and, based on the current application scenario, select one primary-microphone signal from the voice signals acquired by the array for directional enhancement processing (the processing itself uses the signals acquired by the other microphones), obtaining the directionally enhanced signal as the processed signal. Whether the acquired voice signal needs to be processed may then be determined from the magnitude of the frequency-domain energy (e.g., STFT energy) of the processed signal, and if no decision can be made from the frequency-domain energy alone, a further determination may be made from the energies before and after signal processing. If the energy attenuation of the processed signal relative to the original is not great (e.g., less than 40%), the acquired signal may be deemed the speech signal of the target speaker (e.g., the wearer or the opposite person) and the corresponding speech processing performed; otherwise, the acquired signal may be discarded.
A head-mounted device voice processing method and a head-mounted device using the same according to the present disclosure have been described in detail above with reference to the accompanying drawings. The present disclosure determines whether the acquisition signal contains voice information of the targeted speaker based on the degree of energy attenuation of the processed signal (e.g., the directionally enhanced processed signal) compared to the original signal, particularly enabling accurate discrimination of situations where the non-targeted speaker speaks loudly, thereby avoiding misoperations of the headset. The above decision may be an alternative or supplement to conventional frequency domain energy size based decisions.
Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the above steps defined in the above method of the present disclosure.
Alternatively, the present disclosure may also be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or computer program, or computer instruction code) that, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the present disclosure.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A headset speech processing method, comprising:
acquiring an original time domain signal acquired by a microphone;
performing direction-based enhancement suppression processing on the original time domain signal to obtain a processed time domain signal, wherein the direction which needs enhancement and/or suppression in the enhancement suppression processing is selected according to the current use scene of the head-mounted equipment; and
determining whether to perform speech processing based on the original time domain signal according to the energy of the original time domain signal and the processed time domain signal.
2. The method of claim 1, wherein determining whether to perform speech processing based on the original time domain signal based on energy of the original time domain signal and the processed time domain signal comprises:
calculating an energy ratio of the processed time domain signal to the original time domain signal; and
determining whether to perform speech processing based on the original time domain signal according to the magnitude of the energy ratio.
3. The method of claim 2, wherein calculating an energy ratio of the processed time domain signal to the original time domain signal comprises:
a ratio of energy means of the processed time domain signal to the original time domain signal over a time window is calculated as the energy ratio.
4. The method of claim 2, wherein determining whether to perform speech processing based on the original time domain signal according to the magnitude of the energy ratio comprises:
when the energy ratio is less than a first threshold, not performing speech processing based on the original time domain signal; and
when the energy ratio is greater than the first threshold, performing speech processing based on the original time domain signal.
5. The method of claim 1, wherein the headset comprises a plurality of microphones arranged at different locations, different ones of the plurality of microphones serving as the primary microphone in different usage scenarios;
performing a direction-based enhanced suppression process on the original time domain signal to obtain a processed time domain signal, comprising:
performing direction-based enhancement suppression processing on the primary microphone original time domain signal collected by the primary microphone in the current usage scenario, to obtain the processed time domain signal.
6. The method of claim 5, wherein acquiring the raw time domain signal acquired by the microphone comprises:
acquiring a primary microphone original time-domain signal acquired by a primary microphone in the current usage scenario and other microphone original time-domain signals acquired by other microphones than the primary microphone,
performing direction-based enhancement suppression processing on a primary microphone original time domain signal acquired by the primary microphone in the current use scene to acquire the processed time domain signal includes:
and performing directional enhancement processing on the primary microphone original time domain signals acquired by the primary microphone in the current use scene by using the other microphone original time domain signals so as to acquire the processed time domain signals.
7. The method of claim 5, wherein determining whether to perform speech processing based on the original time domain signal based on energy of the original time domain signal and the processed time domain signal comprises:
if the value of the frequency domain energy of the processed time domain signal is greater than a second threshold and less than a third threshold, determining whether to perform speech processing based on the original time domain signal according to the energy of the original time domain signal and the energy of the processed time domain signal.
8. The method of claim 7, further comprising:
when the value of the frequency domain energy is smaller than a second threshold value, not performing voice processing based on the original time domain signal; and
when the value of the frequency domain energy is greater than a third threshold, performing speech processing based on the original time domain signal.
9. The method of claim 1, wherein the headset comprises a speaker, and performing a direction-based enhanced suppression process on the original time-domain signal to obtain a processed time-domain signal comprises:
when the speaker outputs audio in the current usage scenario, using the audio signal to perform echo cancellation processing based on the direction of the speaker.
10. The method of claim 1, wherein a speaker direction is determined based on the current usage scenario, and the signal strength of the speaker direction is enhanced in the enhancement suppression process.
11. A headset device, comprising:
the microphone is used for collecting voice signals; and
a processing unit for performing the speech processing method according to any of claims 1-9.
CN202311587705.3A 2023-11-23 2023-11-23 Headset voice processing method and headset Pending CN117636836A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311587705.3A CN117636836A (en) 2023-11-23 2023-11-23 Headset voice processing method and headset

Publications (1)

Publication Number Publication Date
CN117636836A true CN117636836A (en) 2024-03-01

Family

ID=90024637



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination