CN115662468A - Handheld posture detection method and device and computer readable storage medium - Google Patents

Handheld posture detection method and device and computer readable storage medium

Info

Publication number
CN115662468A
CN115662468A (application CN202211262021.1A)
Authority
CN
China
Prior art keywords
signal
gesture
energy difference
current
microphone
Prior art date
Legal status
Pending
Application number
CN202211262021.1A
Other languages
Chinese (zh)
Inventor
涂晴莹
董斐
纪伟
Current Assignee
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd filed Critical Spreadtrum Communications Shanghai Co Ltd
Priority to CN202211262021.1A
Publication of CN115662468A


Abstract

A handheld gesture detection method and device and a computer-readable storage medium are provided. The handheld gesture detection method comprises the following steps: acquiring an energy difference between a first signal and a second signal in a current speech frame, where the first signal is a signal acquired by a first microphone on user equipment and is a noisy speech signal, and the second signal is a signal acquired by a second microphone on the user equipment and is a noise signal containing voice; and determining the current handheld gesture according to the magnitude relationship between the energy difference and a preset gesture threshold. With this scheme, the handheld gesture can be detected accurately without adding extra hardware.

Description

Handheld posture detection method and device and computer readable storage medium
Technical Field
The invention relates to the technical field of terminal equipment, in particular to a handheld gesture detection method and device and a computer readable storage medium.
Background
The traditional single-microphone noise reduction algorithm has poor noise reduction capability on non-stationary noise. Currently, most of the user devices (such as smart phones) on the market are configured with at least two microphones to improve the suppression of non-stationary noise. In a handheld call scenario, background noise is picked up by a microphone disposed on the top or back of the user device, and noisy speech signals are picked up by a microphone disposed on the bottom of the user device.
In a handheld call scenario, owing to individual usage habits, some users may hold the user equipment in an outward-expansion gesture. The microphone at the bottom of the user equipment is then far from the user's mouth, which can cause severe speech loss.
In order to solve the above problem, it is necessary to detect the hand-held posture of the user. One prior art solution adds a sensor (e.g., a photosensor) to detect the hand-held posture of the user. However, the addition of the sensor leads to an increase in hardware cost, and the accuracy of detection is also low.
Disclosure of Invention
Embodiments of the invention address the technical problem that additional hardware equipment is otherwise required to detect the user's handheld gesture.
In order to solve the above technical problem, an embodiment of the present invention provides a handheld gesture detection method, including: acquiring an energy difference between a first signal and a second signal in a current speech frame; the first signal is a signal acquired by a first microphone on user equipment and is a noisy speech signal; the second signal is a signal acquired by a second microphone on the user equipment and is a noise signal containing voice; and determining the current handheld gesture according to the magnitude relationship between the energy difference and a preset gesture threshold.
Optionally, the determining the current handheld gesture according to the magnitude relationship between the energy difference and the preset gesture threshold includes: determining that the current handheld gesture is a first outward-expansion gesture if the absolute value of the difference between the energy difference and the gesture threshold is not greater than a first value; determining that the current handheld gesture is a second outward-expansion gesture if the difference between the gesture threshold and the energy difference is greater than the first value; and determining that the current handheld gesture is a non-outward-expansion gesture if the difference between the energy difference and the gesture threshold is greater than the first value; the outward-expansion angle corresponding to the first outward-expansion gesture is smaller than the outward-expansion angle corresponding to the second outward-expansion gesture.
Optionally, when the absolute value of the difference between the energy difference and the gesture threshold is not greater than the first value, determining that the current handheld gesture is the first outward-expansion gesture includes: determining that the current handheld gesture is the first outward-expansion gesture when voice is detected in the first signal and the absolute value of the difference between the energy difference and the gesture threshold is not greater than the first value.
Optionally, the obtaining an energy difference between the first signal and the second signal in the current speech frame includes: and acquiring the energy difference of the first signal and the second signal in the frequency domain.
Optionally, the energy difference between the first signal and the second signal in the frequency domain is obtained using the following formula: EnergyDif(λ) = min(rate1(λ), rate2(λ)); where min(rate1(λ), rate2(λ)) is the minimum value between rate1(λ) and rate2(λ); rate1(λ) and rate2(λ) are ratios of the spectral energies of the first and second frequency-domain signals summed over frequency points FS to FE (the two ratio formulas are given as images in the original);
SNR(λ,i) = α_v · G(λ-1,i)² · SNR(λ-1,i) + (1-α_v) · max(pSNR(λ,i) - 1, 0);
pSNR(λ,i) = |Sa1(λ,i)|² / P_min(λ,i);
if P_min(λ-1,i) < P(λ,i), then P_min(λ,i) = γ · P_min(λ-1,i) + ((1-γ)/(1-β)) · (P(λ,i) - β · P(λ-1,i)); otherwise P_min(λ,i) = P(λ,i); P(λ,i) = α · P(λ-1,i) + (1-α) · |Sa1(λ,i)|²; Sa1(λ,i) is the amplitude spectrum of the first frequency-domain signal corresponding to the first signal at the i-th frequency point, and Sa2(λ,i) is the amplitude spectrum of the second frequency-domain signal corresponding to the second signal at the i-th frequency point; λ is the frame index of the current speech frame, i is the frequency-point index, P(λ,i) is the power spectrum of the i-th frequency point after smoothing with the first smoothing parameter α, and P_min(λ,i) is the noise power spectrum estimated by the minimum-tracking method; β and γ are noise-estimation parameters, and the values of α, β, and γ are all between 0 and 1; FS is the start frequency point and FE is the cut-off frequency point.
Optionally, the obtaining an energy difference between the first signal and the second signal in the current speech frame includes: an energy difference between the first signal and the second signal in the time domain is obtained.
Optionally, the energy difference between the first signal and the second signal in the time domain is obtained using the following formula: EnergyDif(λ) = min(rate3(λ), rate4(λ)); where min(rate3(λ), rate4(λ)) is the minimum value between rate3(λ) and rate4(λ); rate3(λ) and rate4(λ) are ratios of the time-domain energies of the first and second signals over the M sampling points of the frame (the two ratio formulas are given as images in the original); λ is the frame index corresponding to the current speech frame, M is the number of sampling points of the time-domain signal, s1(λ,n) is the first signal at the n-th sampling point, s2(λ,n) is the second signal at the n-th sampling point, and h(λ,n) is the tap coefficient of the time-domain filter at the n-th sampling point of the current speech frame.
Optionally, the handheld gesture detection method further includes: performing inter-frame smoothing on the energy difference.
Optionally, inter-frame smoothing is performed on the energy difference using the following formula: EnergyDif_sm(λ) = α_sm · EnergyDif_sm(λ-1) + (1-α_sm) · EnergyDif(λ); where EnergyDif_sm(λ) is the energy difference after inter-frame smoothing; α_sm is a third smoothing coefficient with 0 < α_sm < 1; EnergyDif_sm(λ-1) is the inter-frame-smoothed energy difference corresponding to the previous speech frame; and EnergyDif(λ) is the energy difference.
The embodiment of the invention also provides a handheld gesture detection device, which comprises: an acquiring unit, configured to acquire an energy difference between a first signal and a second signal in a current speech frame; the first signal is a signal acquired by a first microphone on user equipment and is a noisy speech signal; the second signal is a signal acquired by a second microphone on the user equipment and is a noise signal containing voice; and a determining unit, configured to determine the current handheld gesture according to the magnitude relationship between the energy difference and a preset gesture threshold.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium is a non-volatile storage medium or a non-transitory storage medium, and a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs any of the steps of the above-mentioned hand-held gesture detection method.
The embodiment of the present invention further provides another hand-held gesture detection apparatus, which includes a memory and a processor, where the memory stores a computer program that can be executed on the processor, and the processor executes any of the steps of the hand-held gesture detection method when executing the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
The energy difference between the first signal and the second signal in the current speech frame is acquired, and the user's current handheld gesture is determined according to the magnitude relationship between the energy difference and a preset gesture threshold. With this scheme, no additional gesture-detection sensor is needed, so no extra hardware cost is incurred; the current handheld gesture of the user can be accurately determined from the energy difference between the signals acquired by different microphones.
Drawings
FIG. 1 is a flow chart of a method for detecting a hand-held gesture in an embodiment of the present invention;
fig. 2 is a schematic diagram of the distribution of a dual-microphone structure in a user equipment;
fig. 3 is a schematic distribution diagram of a dual-microphone structure in another user equipment;
FIG. 4 is a schematic hand-held pose of a non-splayed pose in an embodiment of the present invention;
FIG. 5 is a hand-held pose schematic diagram of a first splayed pose in an embodiment of the present invention;
FIG. 6 is a hand-held gesture diagram of a second, outward-extending gesture in an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a handheld gesture detection device in an embodiment of the present invention.
Detailed Description
For a dual-microphone or multi-microphone user device, the signal picked up by the top or back microphone is usually treated as background noise (it may still contain some speech), while the signal picked up by the bottom microphone is a noisy speech signal (i.e., a speech signal doped with noise). In a normal handheld posture, the bottom microphone is close to the mouth, and when the user speaks normally there is an energy difference of nearly 6 dB between the signal picked up by the bottom microphone (referred to as the first microphone) and the signal picked up by the top or back microphone (referred to as the second microphone).
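As an illustration of the energy gap described above, the per-frame energy difference in dB between two microphone signals can be computed as follows; this is a minimal sketch, and the function name and test signals are hypothetical, not from the patent:

```python
import numpy as np

def energy_db_difference(sig1, sig2):
    # Frame energy ratio expressed in dB; sig1 is the bottom (first) mic,
    # sig2 the top/back (second) mic. Hypothetical helper for illustration.
    e1 = np.sum(np.asarray(sig1, dtype=float) ** 2)
    e2 = np.sum(np.asarray(sig2, dtype=float) ** 2)
    return 10.0 * np.log10(e1 / e2)

# A bottom-mic frame with twice the amplitude carries 4x the energy,
# i.e. about 6 dB more -- the gap quoted for a normal handheld posture.
bottom = np.array([0.2, -0.2, 0.2, -0.2])
top = np.array([0.1, -0.1, 0.1, -0.1])
diff_db = energy_db_difference(bottom, top)
```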
However, owing to usage habits, some users may hold the user device in an outward-expansion gesture. In that case the first microphone is far from the user's mouth, the energy of the signal picked up by the first microphone approaches that of the signal picked up by the second microphone, and the second microphone's signal energy may even exceed the first microphone's.
Handheld gesture detection can reduce the degree of speech loss. One existing detection scheme estimates the direction of the voice by sound-source localization and then infers the user's handheld gesture. However, this scheme is affected by the signal-to-noise ratio, frequency-response differences between microphones, the number of microphones, and other factors, so it is difficult to detect the handheld gesture accurately. Another scheme adds a sensor (e.g., a photosensor) to detect the handheld gesture, but the extra sensor increases hardware cost and its detection accuracy is also low.
In the embodiment of the invention, a sensor for detecting the attitude is not required to be additionally arranged, so that the hardware cost is not required to be additionally increased. In different hand-held postures, the energy difference between signals acquired by different microphones is different. Based on the method, the energy difference between the signals acquired by different microphones is compared with the preset gesture threshold value, and the current handheld gesture of the user can be accurately determined.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, embodiments accompanying figures are described in detail below.
The embodiment of the invention provides a handheld gesture detection method, which is described in detail through specific steps in reference to fig. 1.
Step 101, obtaining an energy difference between a first signal and a second signal in a current speech frame.
In a specific implementation, the current speech frame may include a first signal and a second signal, where: the first signal may be a signal picked up by a first microphone provided in the user equipment, and the second signal may be a signal picked up by a second microphone provided in the user equipment.
In the embodiment of the present invention, the first signal may be a noisy speech signal, and the second signal may be a noise signal. The above mentioned noisy speech signal may refer to: a signal dominated by speech signals and doped with noise; the noise signal may be: a signal dominated by noise and with a small amount of speech.
That is, the speech signal in the first signal has a stronger energy than the noise signal, and the noise signal in the second signal has a stronger energy than the speech signal.
In the implementation, referring to fig. 2 and fig. 3, distribution diagrams of a dual-microphone structure in two user equipments are given.
In fig. 2, a first microphone 201 and a second microphone 202 are disposed in the user equipment 20, the first microphone 201 and the second microphone 202 are disposed on the front surface of the user equipment 20, the first microphone 201 is disposed on the bottom of the user equipment 20, and the second microphone 202 is disposed on the top of the user equipment 20.
In fig. 3, the user device 30 is provided with a first microphone 301 and a second microphone 302, the first microphone 301 is disposed at the bottom of the front surface of the user device 30, and the second microphone 302 is disposed at the upper portion of the back surface of the user device 30.
It will be appreciated that fig. 2 and 3 are merely illustrative of the arrangement of two microphones in the user equipment. In practical applications, there may be other arrangements of two microphones, which are not illustrated here.
When the user uses the user device normally, the ear is close to the second microphone and the mouth is close to the first microphone. Therefore, when the user speaks in a normal posture, the signal picked up by the first microphone is mostly voice, and the signal picked up by the second microphone is mostly noise.
As shown in fig. 2, a first signal within a current speech frame is picked up by a first microphone 201; a second signal within the current speech frame is picked up by a second microphone 202. As shown in fig. 3, a first signal within a current speech frame is picked up by a first microphone 301; a second signal within the current speech frame is picked up by a second microphone 302.
Step 102, determining the current handheld gesture according to the magnitude relationship between the energy difference and a preset gesture threshold.
In the embodiment of the invention, if the difference value between the preset posture threshold value and the energy difference is greater than the first value, the current handheld posture is determined to be a second outward expansion posture; if the difference value between the energy difference and the posture threshold value is larger than a first value, determining that the current handheld posture is a non-outward-expansion posture; and if the absolute value of the difference between the energy difference and the posture threshold value is not greater than the first value, determining that the current handheld posture is the first outward expansion posture.
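The three-way decision rule of step 102 can be sketched as a small function; the names and string labels are illustrative, not from the patent:

```python
def classify_pose(energy_dif, pose_threshold, first_value):
    # Decision rule of step 102: compare the energy difference with a
    # preset pose threshold, using a margin (the "first value").
    if pose_threshold - energy_dif > first_value:
        return "second_flare"   # large outward-expansion angle
    if energy_dif - pose_threshold > first_value:
        return "no_flare"       # normal holding pose
    # remaining case: |energy_dif - pose_threshold| <= first_value
    return "first_flare"        # small outward-expansion angle
```

With a threshold of 0.6 and a margin of 0.2, for example, an energy difference of 1.0 maps to the normal pose, 0.2 to the second outward-expansion pose, and 0.65 to the first.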
In a specific implementation, the flare angle corresponding to the first flared attitude is smaller than the flare angle corresponding to the second flared attitude. Referring to fig. 4, a schematic hand-held posture diagram of a non-splayed posture in an embodiment of the present invention is shown. Referring to fig. 5, a schematic hand-held posture diagram of a first outward-extending posture in an embodiment of the present invention is shown. Referring to fig. 6, a handheld gesture diagram of a second outward-extending gesture in an embodiment of the present invention is shown.
With reference to fig. 4 to 6. In fig. 4, the non-extended holding position may be regarded as a normal holding position, i.e. a holding position during normal use of the user equipment 40. The first microphone 401 is proximate to the user's mouth and the second microphone 402 is proximate to the user's ear. The difference between the energy of the voice picked up by the first microphone 401 and the energy of the voice picked up by the second microphone 402 is large.
In fig. 5, in the first outward-expansion gesture, the bottom of the user device 50 is offset some distance from the user's mouth, so the first microphone 501 is at some distance from the mouth. The difference between the voice energy picked up by the first microphone and that picked up by the second microphone becomes small.
In fig. 6, in the second outward-expansion gesture, the bottom of the user device 60 is offset a greater distance from the user's mouth, so the first microphone 601 is far from the mouth. The difference between the voice energy picked up by the first microphone and that picked up by the second microphone becomes even smaller.
In the non-outward-expansion gesture, the user's mouth is close to the first microphone, so the energy of the speech signal in the first signal is large. In the first outward-expansion gesture, the mouth is farther from the first microphone, so the speech energy in the first signal decreases and the relative noise energy increases. In the second outward-expansion gesture, the mouth is farthest from the first microphone, so the speech energy in the first signal is smallest and the relative noise energy is largest.
In the embodiment of the invention, when the voice is detected to exist in the first signal and the absolute value of the difference value between the energy difference and the gesture threshold value is not larger than the first value, the current hand-held gesture is determined to be the first external expansion gesture.
That is, when the energy difference between the first signal and the second signal is small, it may be further determined whether target voice exists in the first signal. If target voice exists in the first signal, the current gesture is judged to be the one with the smaller outward-expansion angle; if no target voice exists in the first signal, no judgment on the current handheld gesture is made.
In an implementation, voice activity detection (VAD) may be performed on the first signal to obtain a VAD value. A VAD value of 1 indicates that target voice exists in the first signal; a VAD value of 0 indicates that no target voice exists in the first signal.
Specifically, the VAD performed on the first signal may be pitch detection, a voice activity detection algorithm based on deep learning, and the like, and may refer to the existing VAD method, which is not described herein again.
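Combining the VAD gate with the threshold test, the first-outward-expansion decision can be sketched as below. The energy-based VAD here is a toy stand-in for the pitch-based or deep-learning detectors the text mentions, and all names and the noise floor are illustrative assumptions:

```python
import numpy as np

def simple_energy_vad(frame, noise_floor=1e-4):
    # Toy VAD: returns 1 (voice present) when mean frame power exceeds
    # a floor; real systems would use pitch or learned detectors.
    return 1 if np.mean(np.asarray(frame, dtype=float) ** 2) > noise_floor else 0

def first_flare_detected(frame, energy_dif, pose_threshold, first_value):
    # First outward-expansion gesture is reported only when voice is
    # present (VAD == 1) AND |energy_dif - pose_threshold| <= first value.
    return (simple_energy_vad(frame) == 1
            and abs(energy_dif - pose_threshold) <= first_value)
```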
In the embodiment of the present invention, the value of the preset gesture threshold may be calibrated in a normal holding state (such as the non-outward-expansion state above). The gesture threshold may take a value between 0.4 and 0.8.
In the embodiment of the present invention, when acquiring the energy difference between the first signal and the second signal, the energy difference between the first signal and the second signal in the frequency domain may be acquired. When acquiring the energy difference between the first signal and the second signal, the energy difference between the first signal and the second signal in the time domain may also be acquired.
In a specific implementation, the energy difference between the first signal and the second signal in the frequency domain can be calculated by using the following formula (1):
EnergyDif(λ)=min(rate1(λ),rate2(λ)); (1)
where min(rate1(λ), rate2(λ)) is the minimum value between rate1(λ) and rate2(λ); λ is the frame index of the current speech frame.
The following describes a specific acquisition process of rate1 (λ) and rate2 (λ).
Let the first signal be s1(n) and the second signal be s2(n). The first frequency-domain signal corresponding to the first signal is:
S1(λ,i) = Σ_{n=0}^{N-1} ω(n) · s1(λ,n) · e^{-j2πni/N}; (2)
and the second frequency-domain signal corresponding to the second signal is:
S2(λ,i) = Σ_{n=0}^{N-1} ω(n) · s2(λ,n) · e^{-j2πni/N}; (3)
where i = 1, 2, 3, …, N; N is the number of Fourier-transform points, ω(n) is an N-point window function, and i is the frequency-point index. The window function may be a rectangular window, a sine window, a Hanning window, a Hamming window, a Tukey window, or the like.
The amplitude spectrum of the first frequency-domain signal at the i-th frequency point is Sa1(λ,i) = abs(S1(λ,i)); the amplitude spectrum of the second frequency-domain signal at the i-th frequency point is Sa2(λ,i) = abs(S2(λ,i)).
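The windowed transform and amplitude spectrum above can be sketched with NumPy; the Hanning window is chosen arbitrarily from the options listed, and all names are illustrative:

```python
import numpy as np

def magnitude_spectrum(frame, window=None):
    # N-point FFT of a windowed speech frame, returning the amplitude
    # spectrum Sa(lambda, i) = abs(S(lambda, i)).
    frame = np.asarray(frame, dtype=float)
    w = np.hanning(len(frame)) if window is None else window
    return np.abs(np.fft.fft(w * frame))

# A pure tone at FFT bin 8 peaks at frequency-point index 8.
n = np.arange(256)
sa1 = magnitude_spectrum(np.sin(2 * np.pi * 8 * n / 256))
```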
The first signal and the second signal are pre-filtered to remove part of their stationary noise and highlight the energy difference of the voice under noisy conditions. The pre-filtering may combine noise estimation with a gain: the noise-estimation part may use the minimum-tracking method, recursive averaging, the MCRA2 algorithm, the IMCRA2 algorithm, and the like; the gain may be computed, for example, as a Wiener gain via an improved decision-directed method.
In the embodiment of the invention, the noise-estimation part adopts the minimum-tracking method, and the gain is computed as a Wiener gain.
Smoothing the power spectrum of the i-th frequency point in the first frequency-domain signal with the first smoothing parameter α gives P(λ,i):
P(λ,i) = α · P(λ-1,i) + (1-α) · |Sa1(λ,i)|²; (4)
λ-1 is the frame index of the previous speech frame, and α takes a value between 0 and 1. P(λ-1,i) is computed by the same recursion as formula (4) with λ replaced by λ-1; likewise, wherever a parameter with index λ-1 appears below, the corresponding quantity is computed by the same formula at frame λ-1.
The noise power spectrum estimated by the minimum-tracking method is P_min(λ,i), where:
if P_min(λ-1,i) < P(λ,i), then
P_min(λ,i) = γ · P_min(λ-1,i) + ((1-γ)/(1-β)) · (P(λ,i) - β · P(λ-1,i)); (5)
otherwise,
P_min(λ,i) = P(λ,i); (6)
in formula (5), β and γ are noise-estimation parameters, both taking values between 0 and 1.
With the noise-estimation result known, the Wiener gain G is calculated as follows.
The posterior signal-to-noise ratio of the first frequency-domain signal corresponding to the current speech frame at the i-th frequency point is:
pSNR(λ,i) = |Sa1(λ,i)|² / P_min(λ,i); (8)
SNR(λ,i) = α_v · G(λ-1,i)² · SNR(λ-1,i) + (1-α_v) · max(pSNR(λ,i) - 1, 0); (9)
α_v is the second smoothing parameter used in the decision-directed method, with a value between 0 and 1; SNR(λ,i) is the noise-estimation result (a-priori signal-to-noise ratio) of the first frequency-domain signal at the i-th frequency point, and pSNR(λ-1,i) is the posterior signal-to-noise ratio of the first frequency-domain signal corresponding to the previous speech frame (i.e., the speech frame with frame index λ-1) at the i-th frequency point.
G(λ,i) = SNR(λ,i) / (1 + SNR(λ,i)); (10)
Finally, rate1(λ) can be calculated using formula (11) and rate2(λ) using formula (12); both formulas are given as images in the original, each being a ratio of spectral energies of the enhanced first frequency-domain signal and the second frequency-domain signal summed over frequency points FS to FE.
In formulas (11) and (12), FS is the start frequency point of speech and FE is the cut-off frequency point of speech. In a specific implementation, FS may be 100 Hz and FE may be 4000 Hz.
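The gain chain can be sketched for a single frequency point as follows. The posterior-SNR and Wiener-gain expressions (pSNR = |Sa1|²/P_min and G = SNR/(1+SNR)) are assumed readings of the image formulas (8) and (10); all names are illustrative:

```python
def wiener_gain_update(sa1_sq, p_min, g_prev, snr_prev, alpha_v=0.98):
    # (8) posterior SNR (assumed form): noisy power over estimated noise.
    psnr = sa1_sq / p_min
    # (9) decision-directed a-priori SNR.
    snr = alpha_v * g_prev ** 2 * snr_prev + (1.0 - alpha_v) * max(psnr - 1.0, 0.0)
    # (10) Wiener gain (assumed form): attenuates low-SNR bins toward 0.
    gain = snr / (1.0 + snr)
    return psnr, snr, gain
```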
In a specific implementation, the energy difference between the first signal and the second signal in the time domain can be calculated by the following equation (13):
EnergyDif(λ)=min(rate3(λ),rate4(λ)); (13)
where EnergyDif (λ) is an energy difference between the first signal and the second signal in the time domain, and min (rate 3 (λ), rate4 (λ)) is a minimum value between rate3 (λ) and rate4 (λ).
rate3(λ) can be calculated using formula (14), and rate4(λ) using formula (15); both formulas are given as images in the original and are ratios of time-domain signal energies over the M sampling points of the current frame, with formula (15) additionally involving the time-domain filter taps h(λ,n).
in the above equation (15), h (λ, n) is a tap coefficient of the time-domain filter at the nth sampling point, and is used for eliminating a part of noise of the first microphone in the time domain. The time domain filter may adopt an LMS adaptive filtering algorithm that uses the noise signal picked up by the second microphone as a reference signal, and may also adopt algorithms such as time domain wiener filtering processing, which are not described herein again.
In the embodiment of the present invention, after obtaining the energy difference between the first signal and the second signal, before determining the current handheld gesture according to the energy difference, inter-frame smoothing may be performed on the calculated energy difference.
In a specific implementation, the energy difference is smoothed between frames using the following formula (16):
EnergyDif_sm(λ) = α_sm · EnergyDif_sm(λ-1) + (1-α_sm) · EnergyDif(λ); (16)
where EnergyDif_sm(λ) is the energy difference after inter-frame smoothing; α_sm is the third smoothing coefficient, with 0 < α_sm < 1; and EnergyDif_sm(λ-1) is the smoothed energy difference corresponding to the previous speech frame.
That is to say, the energy difference after inter-frame smoothing may be used to determine the current handheld gesture. The inter-frame smoothing makes the judgment of the current handheld gesture more accurate.
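Formula (16) amounts to a first-order exponential smoother across frames; a sketch follows, with the smoothed value initialized at 0 as an assumption:

```python
def smooth_energy_dif(energy_difs, alpha_sm=0.5):
    # EnergyDif_sm(l) = a * EnergyDif_sm(l-1) + (1-a) * EnergyDif(l),
    # applied frame by frame (formula (16)); initial state assumed 0.
    smoothed, prev = [], 0.0
    for e in energy_difs:
        prev = alpha_sm * prev + (1.0 - alpha_sm) * e
        smoothed.append(prev)
    return smoothed
```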
In summary, in the embodiment of the present invention, there is no need to additionally provide a sensor for detecting an attitude, so that there is no need to additionally increase hardware cost. The current handheld gesture of the user can be accurately determined through the energy difference between the signals acquired by different microphones.
Referring to fig. 7, a schematic structural diagram of a handheld gesture detection apparatus 70 in the embodiment of the present invention is shown. The handheld gesture detection apparatus 70 includes an acquisition unit 701 and a determination unit 702, where:
an obtaining unit 701, configured to obtain an energy difference between a first signal and a second signal in a current speech frame; the first signal is a signal acquired by a first microphone on user equipment, and the first signal is a noisy voice signal; the second signal is a signal acquired by a second microphone on the user equipment, and the second signal is a noise signal containing voice;
a determining unit 702, configured to determine a current handheld gesture according to a size relationship between the energy difference and a preset gesture threshold.
In a specific implementation, the specific execution processes of the obtaining unit 701 and the determining unit 702 may refer to steps 101 to 102, which are not described herein.
In specific implementation, regarding each module/unit included in each apparatus and product described in the foregoing embodiments, it may be a software module/unit, or may also be a hardware module/unit, or may also be a part of the software module/unit and a part of the hardware module/unit.
For example, for each device or product applied to or integrated into a chip, the modules/units it includes may all be implemented by hardware such as circuits, or at least some modules/units may be implemented by a software program running on a processor integrated within the chip, with the remaining (if any) modules/units implemented by hardware such as circuits. For each device or product applied to or integrated into a chip module, the modules/units it includes may all be implemented by hardware such as circuits, and different modules/units may be located in the same component of the chip module (e.g., a chip or a circuit module) or in different components; alternatively, at least some modules/units may be implemented by a software program running on a processor integrated within the chip module, with the remaining (if any) modules/units implemented by hardware such as circuits. For each device or product applied to or integrated into a terminal, the modules/units it includes may all be implemented by hardware such as circuits, and different modules/units may be located in the same component of the terminal (e.g., a chip or a circuit module) or in different components; alternatively, at least some modules/units may be implemented by a software program running on a processor integrated within the terminal, with the remaining (if any) modules/units implemented by hardware such as circuits.
An embodiment of the present invention further provides a computer-readable storage medium, which is a non-volatile or non-transitory storage medium and has a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the handheld gesture detection method provided in steps 101 to 102.
An embodiment of the present invention further provides a handheld gesture detection apparatus, comprising a memory and a processor, the memory storing a computer program capable of running on the processor, wherein the processor, when running the computer program, executes the steps of the handheld gesture detection method provided in steps 101 to 102.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium, which may include a ROM, a RAM, a magnetic disk, an optical disk, or the like.
Although the present invention is disclosed above, it is not limited thereto. Those skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (12)

1. A hand-held gesture detection method is characterized by comprising the following steps:
acquiring an energy difference between a first signal and a second signal in a current voice frame; the first signal is a signal acquired by a first microphone on user equipment, and the first signal is a noisy voice signal; the second signal is a signal acquired by a second microphone on the user equipment, and the second signal is a noise signal containing voice;
and determining the current handheld gesture according to the magnitude relationship between the energy difference and a preset gesture threshold.
2. The method as claimed in claim 1, wherein the determining the current handheld gesture according to the magnitude relationship between the energy difference and the preset gesture threshold comprises:
determining that the current handheld gesture is a first outward-expansion gesture if the absolute value of the difference between the energy difference and the gesture threshold is not greater than a first value;
determining that the current handheld gesture is a second outward-expansion gesture if the difference between the gesture threshold and the energy difference is greater than the first value;
determining that the current handheld gesture is a non-outward-expansion gesture if the difference between the energy difference and the gesture threshold is greater than the first value;
wherein the outward-expansion angle corresponding to the first outward-expansion gesture is smaller than the outward-expansion angle corresponding to the second outward-expansion gesture.
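The three-way decision of claim 2 can be sketched as follows. This is an illustrative Python sketch only, not the patent's reference implementation; the function name, label strings, and parameter names are assumptions, and the units of the energy difference and threshold are left unspecified, as in the claim.

```python
def classify_gesture(energy_diff, threshold, first_value):
    """Classify the handheld gesture from the dual-microphone energy
    difference, per the three cases of claim 2 (hypothetical helper)."""
    delta = energy_diff - threshold
    if abs(delta) <= first_value:
        # |EnergyDif - threshold| not greater than the first value
        return "first_outward_expansion"
    elif -delta > first_value:
        # threshold exceeds the energy difference by more than the first value
        return "second_outward_expansion"   # larger outward-expansion angle
    else:
        # energy difference exceeds the threshold by more than the first value
        return "non_outward_expansion"
```

Note that the three branches partition all possible values of the energy difference, so exactly one gesture is always returned.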
3. The handheld gesture detection method of claim 2, wherein, when the absolute value of the difference between the energy difference and the gesture threshold is not greater than the first value, determining that the current handheld gesture is the first outward-expansion gesture comprises:
determining that the current handheld gesture is the first outward-expansion gesture when voice is detected in the first signal and the absolute value of the difference between the energy difference and the gesture threshold is not greater than the first value.
4. The method of claim 1, wherein said obtaining an energy difference between a first signal and a second signal within a current speech frame comprises:
and acquiring the energy difference of the first signal and the second signal in the frequency domain.
5. The handheld gesture detection method of claim 4, wherein the energy difference of the first signal and the second signal in the frequency domain is obtained using the following equation:
EnergyDif(λ)=min(rate1(λ),rate2(λ));
wherein min (rate 1 (λ), rate2 (λ)) is the minimum value between rate1 (λ) and rate2 (λ),
[rate1(λ) and rate2(λ) are defined by equation images (FDA0003891877980000021, FDA0003891877980000022) that are not reproduced in this text.]
SNR(λ, i) = α_v · G(λ-1, i)² · SNR(λ-1, i) + (1-α_v) · max(pSNR(λ, i)-1, 0);
[pSNR(λ, i) is defined by an equation image (FDA0003891877980000023) that is not reproduced in this text.]
If Pmin(λ-1, i) < P(λ, i), then Pmin(λ, i) is updated by the tracking rule defined in an equation image (FDA0003891877980000024) that is not reproduced in this text; otherwise Pmin(λ, i) = P(λ, i). P(λ, i) = α·P(λ-1, i) + (1-α)·|Sa1(λ, i)|². Sa1(λ, i) is the amplitude spectrum of the first frequency-domain signal corresponding to the first signal at the i-th frequency point, and Sa2(λ, i) is the amplitude spectrum of the second frequency-domain signal corresponding to the second signal at the i-th frequency point; λ is the frame index of the current speech frame, i is the frequency-point index, P(λ, i) is the power spectrum of the i-th frequency point smoothed by the first smoothing parameter α, Pmin(λ, i) is the noise power spectrum estimated by the minimum tracking method, β and γ are noise-estimation parameters, α_v is a second smoothing parameter, and α_v, β and γ are all between 0 and 1; FS is the start frequency point and FE is the cut-off frequency point.
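The per-frame power-spectrum recursion of claim 5 can be sketched as below. Only the P(λ, i) recursion is stated explicitly in the claim; the β/γ-parameterised Pmin update is defined in an equation image not reproduced here, so the sketch substitutes the plain minimum-tracking rule Pmin = min(Pmin, P) as a stand-in. Function and parameter names are assumptions.

```python
import numpy as np

def update_power_estimates(P_prev, Pmin_prev, Sa1_mag, alpha=0.9):
    """One-frame update of the smoothed power spectrum P(λ, i) and a
    simplified minimum-statistics noise floor Pmin(λ, i).

    P(λ, i) = α·P(λ-1, i) + (1-α)·|Sa1(λ, i)|²  (as in claim 5);
    Pmin uses a simplified tracker in place of the claim's β/γ rule.
    """
    P = alpha * P_prev + (1.0 - alpha) * np.abs(Sa1_mag) ** 2
    Pmin = np.where(Pmin_prev < P, Pmin_prev, P)  # keep the smaller value
    return P, Pmin
```

In practice the recursion is run per frequency bin over successive frames, with Pmin periodically reset so the noise floor can track rising noise levels.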
6. The method of claim 1, wherein said obtaining an energy difference between a first signal and a second signal within a current speech frame comprises:
an energy difference between the first signal and the second signal in the time domain is obtained.
7. The hand-held gesture detection method of claim 6, wherein the energy difference between the first signal and the second signal in the time domain is obtained using the following equation:
EnergyDif(λ)=min(rate3(λ),rate4(λ));
wherein min (rate 3 (λ), rate4 (λ)) is the minimum value between rate3 (λ) and rate4 (λ),
[rate3(λ) and rate4(λ) are defined by an equation image (FDA0003891877980000025) that is not reproduced in this text.]
λ is the frame index corresponding to the current speech frame, M is the number of sampling points of the time-domain signal, s1(λ, n) is the first signal at the n-th sampling point, s2(λ, n) is the second signal at the n-th sampling point, and h(λ, n) is the tap coefficient of the time-domain filter at the n-th sampling point of the current speech frame.
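A time-domain energy comparison of the two microphone signals, as in claims 6 and 7, can be sketched as a per-frame log-energy ratio. Because rate3(λ) and rate4(λ) are defined by an equation image not reproduced in the text, this ratio is only an illustrative stand-in, and the time-domain filter taps h(λ, n) of claim 7 are omitted; the function name and the dB scale are assumptions.

```python
import numpy as np

def frame_energy_ratio_db(s1_frame, s2_frame, eps=1e-12):
    """Per-frame time-domain energy ratio (in dB) between the first
    microphone signal s1 and the second microphone signal s2.

    Illustrative stand-in for the claim's rate3/rate4 terms; eps guards
    against division by zero and log of zero on silent frames.
    """
    e1 = np.sum(np.asarray(s1_frame, dtype=float) ** 2)  # energy of mic 1 frame
    e2 = np.sum(np.asarray(s2_frame, dtype=float) ** 2)  # energy of mic 2 frame
    return 10.0 * np.log10((e1 + eps) / (e2 + eps))
```

A positive ratio indicates the first (voice-dominant) microphone captured more energy in the frame than the second (noise-dominant) microphone.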
8. The hand-held gesture detection method of claim 1, further comprising:
and performing interframe smoothing processing on the energy difference.
9. The hand-held gesture detection method of claim 8, wherein the energy difference is inter-frame smoothed using the following equation:
EnergyDif_sm(λ) = α_sm · EnergyDif_sm(λ-1) + (1-α_sm) · EnergyDif(λ);
where EnergyDif_sm(λ) is the energy difference after inter-frame smoothing; α_sm is a third smoothing coefficient, and 0 < α_sm < 1; EnergyDif_sm(λ-1) is the inter-frame-smoothed energy difference corresponding to the previous speech frame; and EnergyDif(λ) is the energy difference.
10. A hand-held gesture detection device, comprising:
an acquiring unit, configured to acquire an energy difference between a first signal and a second signal in a current voice frame; wherein the first signal is a signal acquired by a first microphone on user equipment, and the first signal is a noisy voice signal; the second signal is a signal acquired by a second microphone on the user equipment, and the second signal is a noise signal containing voice;
and a determining unit, configured to determine the current handheld gesture according to the magnitude relationship between the energy difference and a preset gesture threshold.
11. A computer-readable storage medium, being a non-volatile storage medium or a non-transitory storage medium, having a computer program stored thereon, wherein the computer program, when being executed by a processor, performs the steps of the hand-held gesture detection method according to any one of claims 1 to 9.
12. A handheld gesture detection device, comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, wherein the processor executes the steps of the handheld gesture detection method of any one of claims 1 to 9 when running the computer program.
CN202211262021.1A 2022-10-14 2022-10-14 Handheld posture detection method and device and computer readable storage medium Pending CN115662468A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211262021.1A CN115662468A (en) 2022-10-14 2022-10-14 Handheld posture detection method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211262021.1A CN115662468A (en) 2022-10-14 2022-10-14 Handheld posture detection method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115662468A true CN115662468A (en) 2023-01-31

Family

ID=84986803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211262021.1A Pending CN115662468A (en) 2022-10-14 2022-10-14 Handheld posture detection method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115662468A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117135280A (en) * 2023-04-17 2023-11-28 荣耀终端有限公司 Voice call method and electronic equipment


Similar Documents

Publication Publication Date Title
EP3703052B1 (en) Echo cancellation method and apparatus based on time delay estimation
CN109767783B (en) Voice enhancement method, device, equipment and storage medium
US10972837B2 (en) Robust estimation of sound source localization
CN111418010B (en) Multi-microphone noise reduction method and device and terminal equipment
US9264804B2 (en) Noise suppressing method and a noise suppressor for applying the noise suppressing method
US8898058B2 (en) Systems, methods, and apparatus for voice activity detection
CN103456310B (en) Transient noise suppression method based on spectrum estimation
JP3484801B2 (en) Method and apparatus for reducing noise of audio signal
JP3484757B2 (en) Noise reduction method and noise section detection method for voice signal
CN106486135B (en) Near-end speech detector, speech system and method for classifying speech
CN110853664B (en) Method and device for evaluating performance of speech enhancement algorithm and electronic equipment
JP2008512888A (en) Telephone device with improved noise suppression
CN111063366A (en) Method and device for reducing noise, electronic equipment and readable storage medium
Park et al. Noise Cancellation Based on Voice Activity Detection Using Spectral Variation for Speech Recognition in Smart Home Devices.
CN115662468A (en) Handheld posture detection method and device and computer readable storage medium
WO2022218254A1 (en) Voice signal enhancement method and apparatus, and electronic device
US11594239B1 (en) Detection and removal of wind noise
CN103824563A (en) Hearing aid denoising device and method based on module multiplexing
EP3428918B1 (en) Pop noise control
CN110556128B (en) Voice activity detection method and device and computer readable storage medium
CN110890099A (en) Sound signal processing method, device and storage medium
WO2017128910A1 (en) Method, apparatus and electronic device for determining speech presence probability
CN115641866A (en) Signal processing method and device, computer readable storage medium and terminal
CN112669869B (en) Noise suppression method, device, apparatus and storage medium
JP3279254B2 (en) Spectral noise removal device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination