CN111933140A - Method, device and storage medium for detecting voice of earphone wearer


Info

Publication number
CN111933140A
Authority
CN
China
Prior art keywords: ear, audio signal, difference related, ear audio, phase difference
Prior art date
Legal status
Granted
Application number
CN202010876645.7A
Other languages
Chinese (zh)
Other versions
CN111933140B (en)
Inventor
王治聪
李倩
Current Assignee
Bestechnic Shanghai Co Ltd
Original Assignee
Bestechnic Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Bestechnic Shanghai Co Ltd filed Critical Bestechnic Shanghai Co Ltd
Priority to CN202010876645.7A
Publication of CN111933140A
Application granted
Publication of CN111933140B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 - Details of transducers, loudspeakers or microphones
    • H04R1/10 - Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H04R1/1091 - Details not provided for in groups H04R1/1008 - H04R1/1083
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Headphones And Earphones (AREA)

Abstract

The present disclosure provides a method, an apparatus, and a storage medium for detecting the voice of an earphone wearer. The earphone comprises an in-ear microphone, an out-of-ear microphone, and an intelligent voice system, and the method comprises the following steps: collecting an in-ear audio signal and an out-of-ear audio signal with the in-ear microphone and the out-of-ear microphone, respectively; calculating a phase difference related parameter and an energy difference related parameter between the in-ear and out-of-ear spectral vectors based on data of the two signals; and determining, based on those parameters and using a discrimination model built from a plurality of decision trees, whether the earphone wearer is speaking, so that the intelligent voice system is activated only when the wearer is speaking. The method effectively improves the wearer's experience with intelligent voice, has low power consumption and a small computational load, and greatly improves detection accuracy.

Description

Method, device and storage medium for detecting voice of earphone wearer
Technical Field
The present disclosure relates to the field of voice detection technologies, and in particular, to a method, an apparatus, and a storage medium for detecting a voice of a user wearing an earphone.
Background
As earphone performance improves, more and more earphone chips carry a keyword recognition model as the trigger for an intelligent voice system. However, because keywords may also be uttered by people other than the earphone wearer, the intelligent voice system is often triggered by other sound sources rather than by the wearer, which degrades the wearer's experience with intelligent voice.
Disclosure of Invention
The present disclosure is provided to address the above drawbacks. There is a need for a method, an apparatus, and a storage medium that detect the voice of an earphone wearer with a software algorithm, consume little power, and require no additional hardware such as sensors, thereby simplifying the earphone structure and improving the user experience.
A first aspect of the present disclosure provides a method of detecting speech of a wearer of a headset comprising an in-ear microphone, an out-of-ear microphone and a smart speech system, the method comprising: acquiring an in-ear audio signal and an out-of-ear audio signal by the in-ear microphone and the out-of-ear microphone, respectively; calculating a phase difference related parameter and an energy difference related parameter between an in-ear spectral vector and an out-of-ear spectral vector based on the in-ear audio signal and the out-of-ear audio signal; and determining, based on the phase difference related parameters and energy difference related parameters between the in-ear and out-of-ear spectral vectors and using a discrimination model based on a plurality of decision trees, whether the headset wearer is speaking, so that the smart speech system is activated when the wearer is speaking.
A second aspect of the present disclosure provides an apparatus for detecting speech of a wearer of a headset comprising an in-ear microphone, an out-of-ear microphone and a smart speech system, the apparatus comprising: an interface configured to acquire data of in-ear and out-of-ear audio signals acquired via the in-ear and out-of-ear microphones; and a processor configured to: calculate a phase difference related parameter and an energy difference related parameter between an in-ear spectral vector and an out-of-ear spectral vector based on the data of the in-ear audio signal and the out-of-ear audio signal; and determine, based on those parameters and using a discrimination model based on a plurality of decision trees, whether the headset wearer is speaking, so that the smart speech system is activated when the wearer is speaking.
A third aspect of the present disclosure provides an apparatus for detecting speech of a wearer of a headset comprising an in-ear microphone, an out-of-ear microphone and a smart speech system, the apparatus comprising: an acquisition module configured to acquire data of an in-ear audio signal and an out-of-ear audio signal acquired via the in-ear microphone and the out-of-ear microphone; a parameter calculation module configured to calculate a phase difference related parameter and an energy difference related parameter between an in-ear spectral vector and an out-of-ear spectral vector based on that data; and a speech determination module configured to determine, based on those parameters and using a discrimination model based on a plurality of decision trees, whether the headset wearer is speaking, so that the smart speech system is activated when the wearer is speaking.
A fourth aspect of the disclosure provides a non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a processor, perform the method as any one of the above.
In the method provided by the embodiments of the present disclosure, an in-ear audio signal and an out-of-ear audio signal are acquired by the in-ear microphone and the out-of-ear microphone respectively, a phase difference related parameter and an energy difference related parameter between the in-ear and out-of-ear spectral vectors are calculated from those signals, and a discrimination model based on a plurality of decision trees then determines from those parameters whether the earphone wearer is speaking, realizing real-time detection of the wearer's voice. The detection result can assist the intelligent voice system, so that the system is activated only when the wearer is determined to be speaking. Alternatively, the intelligent voice system can combine this detection result with the output of a speech recognition model (for example, a keyword recognition model that recognizes speech by keywords): only when the recognition model recognizes the speech and the wearer is detected as its source is the system triggered. This avoids the false triggers that occur when activation relies on the recognition model alone, reduces the probability of false triggering, and effectively improves the wearer's experience with intelligent voice. Moreover, the embodiments detect the wearer's voice with a software algorithm: power consumption is low, no additional hardware such as extra sensors is needed, the implementation is not constrained by circuit design, it is flexible and inexpensive, the earphone structure is simplified, and the low-power requirement of the earphone chip is met. In addition, the detection method selects a discrimination model with a small computational load, high robustness, and strong generalization ability, and integrates the results of a plurality of decision trees into the final judgment of whether the wearer is speaking, which greatly improves detection accuracy.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in different views. Like reference numerals having letter suffixes or different letter suffixes may represent different instances of similar components. The drawings illustrate various embodiments generally by way of example and not by way of limitation, and together with the description and claims serve to explain the disclosed embodiments. Such embodiments are illustrative, and are not intended to be exhaustive or exclusive embodiments of the present apparatus or method.
Fig. 1 is a flow chart illustrating a method of detecting a voice of a wearer of a headset according to an embodiment of the present disclosure.
Fig. 2 is a flow chart illustrating step 120 of a method according to an embodiment of the present disclosure.
Fig. 3 is a flow chart illustrating step 120 of a method according to another embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating a LightGBM discrimination model according to an exemplary embodiment of the disclosure.
Fig. 5 is a schematic structural diagram of an apparatus for detecting a voice of a wearer of a headset according to an embodiment of the present disclosure.
Detailed Description
For a better understanding of the technical aspects of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings. Embodiments of the present disclosure are described in further detail below with reference to the figures, but the present disclosure is not limited thereto. The order in which the steps are described is exemplary and should not be construed as limiting where the steps do not depend on one another; those skilled in the art will appreciate that the order may be adjusted as long as the logical relationships among the steps are preserved and the overall process remains practicable.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
Fig. 1 is a flow chart illustrating a method of detecting a voice of a wearer of a headset according to an embodiment of the present disclosure. The headset comprises an in-ear microphone, an out-of-ear microphone and an intelligent voice system, as shown in fig. 1, the method comprises the following steps:
step 110: an in-ear audio signal and an out-of-ear audio signal are acquired by an in-ear microphone and an out-of-ear microphone, respectively.
The intervals at which the in-ear audio signal and the out-of-ear audio signal are acquired may be set according to specific needs, for example every 5 ms, 10 ms, or 20 ms; the present disclosure does not specifically limit this.
Step 120: based on the in-ear audio signal and the out-of-ear audio signal, a phase difference related parameter and an energy difference related parameter between the in-ear and out-of-ear spectral vectors are calculated.
The phase difference related parameter and the energy difference related parameter represent characteristic parameters of the difference between the in-ear audio signal and the out-of-ear audio signal that arises from the different transmission paths sound takes inside and outside the ear, such as the phase shift, energy difference, and correlation of the out-of-ear audio signal relative to the in-ear audio signal, but are not limited thereto. Different sound sources cause different in-ear/out-of-ear differences, which are in turn reflected in these parameters.
Step 130: based on the phase difference related parameters and energy difference related parameters between the in-ear and out-of-ear spectral vectors, determine with a discrimination model based on a plurality of decision trees whether the earphone wearer is speaking, so that the intelligent voice system is activated when the wearer is speaking.
The decision tree is a tree structure, e.g. a binary or non-binary tree, comprising one root node, several internal nodes and several leaf nodes. The root node and each internal node of the decision tree represent a judgment on an attribute, each branch represents the output of a judgment result, and finally each leaf node represents a classification result. The process of using the decision tree to make a decision is to start from the root node, test the corresponding characteristic attributes in the items to be classified, select an output branch according to the values of the characteristic attributes until the leaf nodes are reached, and take the categories stored in the leaf nodes as decision results. The discrimination model has the advantages of small computation amount, high robustness and strong generalization capability.
The embodiments of the present disclosure use a plurality of decision trees, which may be identical or different, as the discrimination model. The phase difference related parameters and energy difference related parameters between the in-ear and out-of-ear spectral vectors are input into each of the decision trees, producing a set of decision results; the outputs of all trees are then combined into a final judgment value that determines whether the earphone wearer is speaking, as sketched below. Integrating the results of multiple trees as the final judgment avoids the low accuracy of a single tree and greatly improves accuracy.
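A minimal sketch of this ensemble evaluation (the Node structure, its field names, and the helper functions are hypothetical; the disclosure does not prescribe a particular implementation):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    feature: int = 0                 # index of the feature attribute tested at this node
    threshold: float = 0.0           # threshold against which the attribute is compared
    left: Optional["Node"] = None    # "yes" branch (attribute value below threshold)
    right: Optional["Node"] = None   # "no" branch (attribute value at or above threshold)
    value: Optional[float] = None    # leaf only: quantized decision result, e.g. 1 or -1

def eval_tree(node: Node, features: List[float]) -> float:
    # Walk from the root node, testing one feature attribute per node, to a leaf.
    while node.value is None:
        node = node.left if features[node.feature] < node.threshold else node.right
    return node.value

def eval_forest(trees: List[Node], features: List[float]) -> List[float]:
    # Each tree votes independently; the votes are combined afterwards (see formula (7)).
    return [eval_tree(t, features) for t in trees]
```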
This method therefore realizes real-time detection of the wearer's voice; its advantages (optional combination with a keyword recognition model to reduce false triggers, a pure-software low-power implementation, and the improved accuracy of a multi-tree discrimination model) are as described in the summary above.
In some embodiments, the phase difference related parameter and the energy difference related parameter between the inner and outer ear spectral vectors employ at least one of: a phase difference related parameter and an energy difference related parameter between inner and outer ear spectrum vectors for a preset frequency band in each preset time period of a plurality of preset time periods; and the phase difference related parameter and the energy difference related parameter between the inner ear spectrum vector and the outer ear spectrum vector of each frequency band in the whole frequency band in at least one preset time period.
In the case of adopting the phase difference related parameter and the energy difference related parameter for the preset frequency band in each of the plurality of preset time periods, as shown in fig. 2, the step 120 may include the following steps:
1201: Filter and sample the collected in-ear and out-of-ear audio signals, respectively, so as to remove signals outside the preset frequency band.
The preset frequency band can be chosen according to specific needs, such as 100 Hz-400 Hz, 100 Hz-500 Hz, or 200 Hz-600 Hz. Preferably, the 200 Hz-500 Hz band is used for detection: the inventors' repeated experimental measurements show that speech signals in this band exhibit a relatively pronounced phase difference, so the phase difference related parameter over 200 Hz-500 Hz, combined with the corresponding energy difference related parameter, yields an accurate detection result.
Taking the 200 Hz-500 Hz band as an example, the in-ear and out-of-ear audio signals can be filtered and sampled with a band-pass filter to remove signals outside that band, as sketched below.
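For illustration, such a band-pass stage could be implemented as follows (a sketch using SciPy; the filter type and order are assumptions, since the disclosure only specifies the pass band):

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 16_000  # sampling rate in Hz (example value used throughout this disclosure)

# 4th-order Butterworth band-pass keeping only the 200 Hz-500 Hz band
SOS = butter(4, [200, 500], btype="bandpass", fs=FS, output="sos")

def bandpass_200_500(signal: np.ndarray) -> np.ndarray:
    # Remove components outside the preset frequency band
    return sosfilt(SOS, signal)
```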
1203: For the preset frequency band, calculate the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors from the amplitudes of the sampling points obtained by sampling the in-ear and out-of-ear audio signals within each preset time period, thereby obtaining a plurality of phase difference related parameters and energy difference related parameters corresponding to the plurality of preset time periods.
As for the preset time period, suppose one frame is used as the preset time period, with a frame length of 10 ms and a sampling rate of 16 kHz (i.e. 16k samples per second); one frame then contains 160 samples. For the preset frequency band, the phase difference related parameter and energy difference related parameter between the in-ear and out-of-ear spectral vectors can be calculated from the amplitudes of the 160 sampling points of each signal within one frame, yielding the parameters for that frame. Over multiple preset time periods, i.e. using the amplitudes of the sampling points from multiple frames, a plurality of phase difference related parameters and energy difference related parameters corresponding to the multi-frame data are obtained. Because speech is continuous, using several frames of data gives a more accurate detection result.
In some specific embodiments, the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors may be, respectively, the correlation and the energy ratio between the in-ear audio signal and the out-of-ear audio signal within the preset frequency band. The correlation and the energy ratio can be calculated by formulas (1) and (2) below, where i is any integer between 1 and M and M is the number of sampling points within the preset time period:
$$\mathrm{correlation}=\frac{\sum_{i=1}^{M}\mathrm{inner}_i\cdot\mathrm{outer}_i}{\sqrt{\sum_{i=1}^{M}\mathrm{inner}_i^2\cdot\sum_{i=1}^{M}\mathrm{outer}_i^2}} \qquad (1)$$

$$\mathrm{energy}_{ratio}=\frac{\sum_{i=1}^{M}\mathrm{inner}_i^2}{\sum_{i=1}^{M}\mathrm{outer}_i^2} \qquad (2)$$
where $\mathrm{inner}_i$ is the amplitude of the i-th sampling point of the in-ear audio signal, and $\mathrm{outer}_i$ is the amplitude of the i-th sampling point of the out-of-ear audio signal.
Again taking one frame as the preset time period, with a frame length of 10 ms and a sampling rate of 16 kHz, M = 160: correlation represents the correlation between the in-ear and out-of-ear audio signals within one frame (160 sampling points), and the energy ratio represents the ratio of their energies within that frame.
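In code, formulas (1) and (2) amount to the per-frame computation below (a sketch assuming the standard normalized-correlation and energy-ratio forms reconstructed above; `inner` and `outer` hold the M = 160 band-passed samples of one frame):

```python
import numpy as np

def frame_features(inner: np.ndarray, outer: np.ndarray) -> tuple:
    # Formula (1): normalized correlation between in-ear and out-of-ear samples
    correlation = np.sum(inner * outer) / np.sqrt(np.sum(inner**2) * np.sum(outer**2))
    # Formula (2): ratio of in-ear to out-of-ear energy within the frame
    energy_ratio = np.sum(inner**2) / np.sum(outer**2)
    return correlation, energy_ratio
```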
It is understood that parameters such as the sampling rate are merely illustrated with the above example values and are not intended to limit the present disclosure; those skilled in the art can choose different settings according to the specific situation and requirements.
Because only one band of the full spectrum is processed as the detection signal, the required computation is small, power consumption is low, and resources are saved.
In the case of using the phase difference related parameter and the energy difference related parameter between the inner and outer ear spectrum vectors for each frequency band in the whole frequency band within at least one preset time period, as shown in fig. 3, step 120 may include the following steps:
1202: fourier transform is performed on the collected in-ear audio signal and the collected out-of-ear audio signal respectively to obtain frequency spectrums of the in-ear audio signal and the out-of-ear audio signal.
1204: Calculate the phase angle and energy of the in-ear audio signal and the out-of-ear audio signal for each frequency band of the entire spectrum, based on the frequency spectra.
Here, the entire frequency band is divided into a plurality of different frequency bands, such as 50 or 65 frequency bands, and the phase angle and energy of the in-ear audio signal and the out-of-ear audio signal are calculated for each frequency band, respectively.
In some specific embodiments, the phase angle and energy of the in-ear audio signal and the out-of-ear audio signal of each frequency band are expressed as formula (3) and formula (4), respectively:
$$\mathrm{angle}=\arctan\!\left(\frac{\mathrm{band}_{image}}{\mathrm{band}_{real}}\right) \qquad (3)$$

$$\mathrm{energy}=\mathrm{band}_{real}^2+\mathrm{band}_{image}^2 \qquad (4)$$
where $\mathrm{band}_{real}$ and $\mathrm{band}_{image}$ denote the real and imaginary parts of the spectral vector of the corresponding frequency band, respectively.
1206: phase differences and energy ratios between the in-ear and out-of-ear spectral vectors of the respective frequency bands are calculated based on phase angles and energies of the in-ear audio signals and the out-of-ear audio signals as phase difference related parameters and energy difference related parameters.
That is, after the phase angles and energies of the in-ear and out-of-ear audio signals have been calculated by formulas (3) and (4), the phase difference and energy ratio between the in-ear and out-of-ear spectral vectors of the corresponding frequency band can be calculated from them and used as the phase difference related parameter and the energy difference related parameter. Specifically, the phase difference $\mathrm{angle}_{difference}$ and the energy ratio $\mathrm{energy}_{ratio}$ between the in-ear and out-of-ear spectral vectors of the corresponding band are calculated by formulas (5) and (6), respectively:
$$\mathrm{angle}_{difference}=\mathrm{angle}_{inner}-\mathrm{angle}_{outer} \qquad (5)$$
$$\mathrm{energy}_{ratio}=\frac{\mathrm{energy}_{inner}}{\mathrm{energy}_{outer}} \qquad (6)$$
where $\mathrm{angle}_{inner}$ and $\mathrm{angle}_{outer}$ are the phase angles of the in-ear and out-of-ear audio signals of the corresponding frequency band, and $\mathrm{energy}_{inner}$ and $\mathrm{energy}_{outer}$ are the energies of the in-ear and out-of-ear audio signals of that band.
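A sketch of the full per-band computation (the 65-band split and the resulting 130-value feature vector follow the examples given later in this disclosure; the FFT sizing convention is an assumption):

```python
import numpy as np

def band_features(inner_frame: np.ndarray, outer_frame: np.ndarray,
                  n_bands: int = 65) -> np.ndarray:
    # Step 1202: Fourier transform of one frame from each microphone.
    # An FFT of this length yields exactly n_bands one-sided bins; truncating or
    # zero-padding the frame to that length is one simple convention.
    n_fft = 2 * (n_bands - 1)
    inner_spec = np.fft.rfft(inner_frame, n=n_fft)
    outer_spec = np.fft.rfft(outer_frame, n=n_fft)
    # Step 1204, formula (3): phase angle per band (arctan2 is the quadrant-aware arctan)
    angle_inner = np.arctan2(inner_spec.imag, inner_spec.real)
    angle_outer = np.arctan2(outer_spec.imag, outer_spec.real)
    # Step 1204, formula (4): energy per band as real^2 + imag^2
    energy_inner = inner_spec.real**2 + inner_spec.imag**2
    energy_outer = outer_spec.real**2 + outer_spec.imag**2
    # Step 1206, formulas (5)-(6): per-band phase difference and energy ratio
    angle_difference = angle_inner - angle_outer
    energy_ratio = energy_inner / (energy_outer + 1e-12)  # guard against division by zero
    # e.g. 65 bands x 2 parameters = 130 feature values per frame
    return np.concatenate([angle_difference, energy_ratio])
```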
In this embodiment, signals over the entire frequency band are considered and every band is computed separately, so a better in-ear/out-of-ear difference, and hence a more accurate detection result, can be obtained. Because this approach already considers all bands of the full spectrum, data from a single preset time period yields a satisfactory detection result while keeping the computation down; using data from several preset time periods can further improve detection accuracy.
In some embodiments, step 130 may include the following steps: inputting the phase difference related parameter and the energy difference related parameter as feature values into each decision tree; summing the results output by the decision trees and normalizing the sum as the probability that the earphone wearer is speaking; and judging that the wearer is speaking when the probability is greater than or equal to a preset probability threshold, and that the wearer is not speaking when it is below that threshold.
The thresholds used to discriminate the feature values in the decision-tree-based discrimination model, as well as the preset probability threshold, are measured in advance over multiple usage scenarios during the design stage of the earphone. A usage scenario is defined, for example, by any one or a combination of the wearing condition (how tightly the earphone is worn), the ear canal structure of the user or artificial ear, the speech volume, and the ambient noise level.
In the design stage, for a specific earphone, the wearer's voice and non-wearers' voices can be recorded as collected data under various usage scenarios. Some scenarios (such as wearing tightness) need to be covered thoroughly during recording; others can be augmented by simulation, such as scaling the volume or adding noise, and designers may handle them differently as the situation requires. A model trained this way works normally in most everyday scenarios, such as music playback, noisy environments, and different speech volumes, with good results.
The collected data can then be processed in two cases: (1) adopting a phase difference related parameter and an energy difference related parameter aiming at a preset frequency band in each preset time period of a plurality of preset time periods; (2) and adopting phase difference related parameters and energy difference related parameters between inner and outer ear spectrum vectors aiming at each frequency band in the whole frequency band in at least one preset time period.
As described above, case (1) requires less computation than case (2) and thus consumes less power; it may be called the low-power version, and case (2) the high-power version. For the low-power version, steps 1201 and 1203 are used to obtain the phase difference related parameter and energy difference related parameter for the preset frequency band in each of a plurality of preset time periods; for the high-power version, steps 1202, 1204 and 1206 are used to obtain the parameters for every band of the entire frequency band within at least one preset time period. The data obtained in the two cases can be fed into the discrimination model to train the two versions separately.
For the decision trees, a gradient boosting tree model can be selected; it trains efficiently and accurately, makes modest demands on training hardware, and scales to large data. As for the algorithm, LightGBM or XGBoost, for example, may be used; the present disclosure does not specifically limit this.
In the high-power version, the full-band signal of one frame is divided into 65 frequency bands, and the phase difference and energy ratio between the in-ear and out-of-ear spectral vectors serve as the phase difference related parameter and energy difference related parameter. Since each band contributes two parameters, 130 parameters are obtained in total. During model training, these 130 parameters are input into each decision tree as feature values (i.e. one feature vector of 130 features), where the phase difference of the first band may serve as the first attribute, the energy ratio of the first band as the second attribute, the phase difference of the second band as the third attribute, the energy ratio of the second band as the fourth attribute, and so on.
Fig. 4 is a schematic diagram illustrating a LightGBM discrimination model according to an exemplary embodiment of the disclosure. After the feature values above are input into the model, each layer from the root node down through the internal nodes tests one randomly selected feature attribute; at each node an output branch is chosen according to the node's judgment value until a leaf node is reached, and the category stored in that leaf is taken as the decision result. The decision result may be quantized, e.g. taking the value 1, 0, or -1 for different outcomes. As shown in fig. 4, the root node tests the first attribute: if it is smaller than the first threshold, the "yes" branch is taken and the result "-1" is output; if larger, the "no" branch leads to the first internal node, which tests the fourth attribute. If the fourth attribute is smaller than the fourth threshold, the "yes" branch outputs "-1"; if larger, the "no" branch leads to the second internal node, and the judgment continues until the final decision result is output.
It will be understood that fig. 4 merely uses specific parameters such as the first and fourth attributes as examples. In actual training, the discrimination model selects the attribute for the root node and each internal node at random, the tree depth (the total number of layers of root and internal nodes) may be set differently as the situation requires, and the threshold for each characteristic attribute is obtained by training the model under various usage scenarios.
The structure and the discrimination principle of other trees in the plurality of decision trees are similar to the discrimination model in fig. 4, but the number of layers may be the same as or different from the discrimination model in fig. 4, and the disclosure does not limit this.
After the output of each decision tree is obtained, the outputs can be summed and the sum normalized to serve as the probability that the earphone wearer is speaking. In one embodiment, the probability P is obtained with formula (7) below as the normalization formula, where AVERAGE denotes the average of the decision trees' outputs:
$$P=\frac{1}{1+e^{-\mathrm{AVERAGE}}} \qquad (7)$$
When the probability P is greater than or equal to the preset probability threshold, the earphone wearer is judged to be speaking; when P is below the threshold, the wearer is judged not to be speaking. The preset probability threshold is likewise obtained by training the discrimination model repeatedly under multiple usage scenarios.
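Combining the tree outputs with formula (7) (a minimal sketch; `eval_forest` is the hypothetical helper from the earlier sketch, and the probability threshold is a trained value as described above):

```python
import math

def wearer_is_speaking(trees, features, prob_threshold: float) -> bool:
    outputs = eval_forest(trees, features)  # per-tree quantized decisions
    average = sum(outputs) / len(outputs)   # AVERAGE in formula (7)
    p = 1.0 / (1.0 + math.exp(-average))    # formula (7): sigmoid normalization
    return p >= prob_threshold              # compare against the preset threshold
```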
While training the model, the number of trees can be adjusted at a preset tree depth until the required accuracy is reached, giving the adjusted number of trees. The inventors' experiments show that tree depth dominates memory usage but plays only a secondary role in accuracy: for the same increase, adding trees improves accuracy more, while adding depth costs more memory. A discrimination model with shallow depth and a larger number of trees can therefore be chosen; specifically, the number of trees (e.g. 20-30) can be tuned at a preset tree depth (e.g. 5-8 layers) to guarantee accuracy while limiting resource usage. Since the model is ultimately deployed on the earphone chip, where memory and computation are severely constrained, this tuning lets the chip achieve the best effect with minimal resources.
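A training sketch using LightGBM's scikit-learn interface (the data here is a random placeholder, and the hyperparameter values merely illustrate the tree counts and depths discussed above):

```python
import numpy as np
import lightgbm as lgb

# Placeholder data: in practice these are the 130-dimensional per-frame feature
# vectors and wearer/non-wearer labels recorded across the usage scenarios above.
X_train = np.random.randn(1000, 130)
y_train = np.random.randint(0, 2, size=1000)

model = lgb.LGBMClassifier(
    objective="binary",  # wearer speaking vs. not speaking
    n_estimators=25,     # number of trees, within the 20-30 range suggested above
    max_depth=6,         # tree depth, within the 5-8 layers suggested above
)
model.fit(X_train, y_train)
p = model.predict_proba(X_train[:1])[:, 1]  # probability that the wearer is speaking
```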
It is to be understood that the specific number of depths and numbers of the trees described above are for example only and are not intended to limit the present disclosure.
The training process of the low power consumption version is similar to that of the high power consumption version, and only parameters used as characteristic values are different, which is not described herein again.
The adjusted model can be frozen and deployed to run in the earphone chip; for different earphones, either the low-power or the high-power version can be chosen as appropriate. For example, if the chip's memory is relatively large, the high-power version can be configured at design time; if the memory is small, the low-power version can be configured instead. Furthermore, although double-precision floating point can be used during training, double-precision storage and arithmetic occupy considerable space on an earphone chip; quantizing all floating-point numbers to fewer bits (such as 8 bits) greatly reduces the size of the compiled code and the model's memory consumption, as sketched below.
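The low-bit quantization mentioned here could, for example, take the following generic form (a sketch of symmetric 8-bit quantization; the actual scheme used on a given chip is not specified by the disclosure):

```python
import numpy as np

def quantize_int8(values: np.ndarray):
    # Symmetric linear quantization: one floating-point scale, int8 payload
    scale = max(float(np.max(np.abs(values))), 1e-12) / 127.0
    q = np.clip(np.round(values / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate floating-point values on the chip at inference time
    return q.astype(np.float32) * scale
```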
In an actual application scenario, if the model configured in the headset is the high-power version, the phase difference related parameter and energy difference related parameter obtained in steps 1202, 1204 and 1206 are input into the model as feature values to judge whether the wearer is speaking; if the configured model is the low-power version, the parameters obtained in steps 1201 and 1203 are input instead. The judgment process mirrors the processing used during training and is not repeated here.
In some embodiments, the method further comprises performing echo cancellation on the in-ear audio signal while the earphone plays music. In a music-playback scenario the in-ear microphone picks up an additional music signal that interferes with feature extraction; echo cancellation filters out the music echo received by the in-ear microphone and avoids that interference.
In particular, when the headset is configured with the high-power version, a higher sampling rate can be used for echo cancellation, such as 8 kHz, with satisfactory results. When the low-power version is configured, a lower sampling rate can be used. Since a sampling rate above twice the highest frequency of a sound signal suffices to reconstruct the original sound from the digital signal, and the low-power version filters the speech, echo cancellation at a 1 kHz sampling rate is adequate in practice if, for example, only the 200 Hz-500 Hz band was selected for training. An acceptable echo cancellation effect is thus obtained while keeping the computation small.
In some embodiments, the decision-tree-based discrimination model is trained without active noise reduction. The method then further comprises: acquiring a reference signal while the earphone performs active noise reduction, calculating the difference in the reference signal between the noise-reduction-on and noise-reduction-off conditions, and compensating the in-ear audio signal according to that difference, so as to eliminate the influence of active noise reduction on the in-ear audio signal. Compared with training a dedicated model on a separate dataset collected under active noise reduction, this approach is more efficient and saves labor and time.
In some embodiments, the method further comprises adding a proportion of white noise to both the in-ear audio signal and the out-of-ear audio signal. During training and deployment, the white noise can be added with the following formulas:
$$\mathrm{Signal}_{inner}=\mathrm{Signal}_{inner\,raw}+a\cdot\mathrm{WhiteNoise} \qquad (8)$$

$$\mathrm{Signal}_{outer}=\mathrm{Signal}_{outer\,raw}+b\cdot\mathrm{WhiteNoise} \qquad (9)$$
where $\mathrm{Signal}_{inner\,raw}$ and $\mathrm{Signal}_{outer\,raw}$ are the original in-ear and out-of-ear audio signals, WhiteNoise is a white noise signal, $\mathrm{Signal}_{inner}$ and $\mathrm{Signal}_{outer}$ are the signals after white noise is added, and a and b are adjustable parameters that can be set according to actual conditions.
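Formulas (8) and (9) translate directly into code (a sketch; the default values of a and b are illustrative placeholders, to be tuned as described below):

```python
import numpy as np

def add_white_noise(inner_raw: np.ndarray, outer_raw: np.ndarray,
                    a: float = 0.01, b: float = 0.01):
    # One shared white-noise realization, scaled per channel (formulas (8)-(9))
    noise = np.random.randn(len(inner_raw))
    return inner_raw + a * noise, outer_raw + b * noise
```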
This approach effectively avoids errors caused by mismatch between the training data and real data. Besides serving as noise augmentation, it removes the distinction between quiet and non-quiet conditions, so non-quiet scenes can be used uniformly as training data. In a music-playback scenario it also partially masks the echo residue left after echo cancellation of the music signal, and at deployment the proportion and level of the noise can be tuned to neutralize the differences introduced by active noise reduction.
The embodiment of the disclosure also provides a device for detecting the voice of a user wearing the earphone. The headset includes an in-ear microphone, an out-of-ear microphone, and a smart voice system, as shown in fig. 5, the apparatus 500 includes a processor 510, a memory 520, and an interface 530. Interface 530 is configured to obtain data for in-ear and out-of-ear audio signals captured via in-ear and out-of-ear microphones, and processor 510 when executing instructions stored in memory 520 may implement: calculating a phase difference related parameter and an energy difference related parameter between the in-ear and out-of-ear spectral vectors based on data of the in-ear audio signal and the out-of-ear audio signal; and determining whether the earphone wearer pronounces voice or not by utilizing a discrimination model based on a plurality of decision trees based on phase difference related parameters and energy difference related parameters between the inner and outer ear spectrum vectors, so that the intelligent voice system is started under the condition that the earphone wearer pronounces voice.
In some embodiments, the phase difference related parameter and the energy difference related parameter between the inner and outer ear spectral vectors employ at least one of: a phase difference related parameter and an energy difference related parameter between inner and outer ear spectrum vectors for a preset frequency band in each preset time period of a plurality of preset time periods; and the phase difference related parameter and the energy difference related parameter between the inner ear spectrum vector and the outer ear spectrum vector of each frequency band in the whole frequency band in at least one preset time period.
In the case of employing the phase difference related parameter and the energy difference related parameter for the preset frequency band in each of the plurality of preset time periods, the processor 510 is further configured to: respectively filtering and sampling the collected in-ear audio signals and the collected out-of-ear audio signals to filter out signals of other frequency bands except the preset frequency band; and aiming at the preset frequency band, calculating phase difference related parameters and energy difference related parameters between the inner ear spectrum vector and the outer ear spectrum vector by using the amplitudes of sampling points which respectively sample the inner ear audio signal and the outer ear audio signal in the preset time period so as to obtain a plurality of phase difference related parameters and energy difference related parameters corresponding to a plurality of preset time periods.
The phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors are, respectively, the correlation and the energy ratio between the in-ear audio signal and the out-of-ear audio signal within a preset frequency band; the correlation and the energy ratio are calculated by formulas (1) and (2) below, where i is any integer between 1 and M and M is the number of sampling points within the preset time period:
$$\mathrm{correlation}=\frac{\sum_{i=1}^{M}\mathrm{inner}_i\cdot\mathrm{outer}_i}{\sqrt{\sum_{i=1}^{M}\mathrm{inner}_i^2\cdot\sum_{i=1}^{M}\mathrm{outer}_i^2}} \qquad (1)$$

$$\mathrm{energy}_{ratio}=\frac{\sum_{i=1}^{M}\mathrm{inner}_i^2}{\sum_{i=1}^{M}\mathrm{outer}_i^2} \qquad (2)$$
where $\mathrm{inner}_i$ is the amplitude of the i-th sampling point of the in-ear audio signal, and $\mathrm{outer}_i$ is the amplitude of the i-th sampling point of the out-of-ear audio signal.
In the case of employing the phase difference-related parameter and the energy difference-related parameter for each of the entire frequency bands for at least one preset time period, the processor 510 is further configured to: performing Fourier transform on the acquired in-ear audio signal and the out-of-ear audio signal respectively to obtain frequency spectrums of the in-ear audio signal and the out-of-ear audio signal; calculating phase angles and energies of the in-ear audio signal and the out-of-ear audio signal, respectively, for each of the entire frequency bands based on the frequency spectra; calculating a phase difference and an energy ratio between the in-ear and out-of-ear spectral vectors of the respective frequency bands based on the phase angles and the energies of the in-ear audio signal and the out-of-ear audio signal as the phase difference related parameter and the energy difference related parameter.
Specifically, the phase angle and the energy of the in-ear audio signal and the out-of-ear audio signal of each frequency band are expressed as formula (3) and formula (4), respectively:
$$\mathrm{angle}=\arctan\!\left(\frac{\mathrm{band}_{image}}{\mathrm{band}_{real}}\right) \qquad (3)$$

$$\mathrm{energy}=\mathrm{band}_{real}^2+\mathrm{band}_{image}^2 \qquad (4)$$
where $\mathrm{band}_{real}$ and $\mathrm{band}_{image}$ respectively denote the real and imaginary parts of the spectral vector of the corresponding frequency band; the phase difference $\mathrm{angle}_{difference}$ and energy ratio $\mathrm{energy}_{ratio}$ between the in-ear and out-of-ear spectral vectors of the corresponding band are calculated by formulas (5) and (6), respectively:
$$\mathrm{angle}_{difference}=\mathrm{angle}_{inner}-\mathrm{angle}_{outer} \qquad (5)$$
$$\mathrm{energy}_{ratio}=\frac{\mathrm{energy}_{inner}}{\mathrm{energy}_{outer}} \qquad (6)$$
where $\mathrm{angle}_{inner}$ and $\mathrm{angle}_{outer}$ are the phase angles of the in-ear and out-of-ear audio signals of the corresponding frequency band, and $\mathrm{energy}_{inner}$ and $\mathrm{energy}_{outer}$ are the energies of the in-ear and out-of-ear audio signals of that band.
In some embodiments, processor 510 is further configured to: input the phase difference related parameter and the energy difference related parameter as feature values into each decision tree; sum the results output by the decision trees and normalize the sum as the probability that the earphone wearer is speaking; and judge that the wearer is speaking when the probability is greater than or equal to a preset probability threshold, and that the wearer is not speaking when it is below that threshold.
In some embodiments, processor 510 is further configured to: perform echo cancellation on the in-ear audio signal while the earphone plays music; use a first sampling rate for echo cancellation when the model is trained on phase difference related parameters and energy difference related parameters between in-ear and out-of-ear spectral vectors for a preset frequency band in each of a plurality of preset time periods; and use a second sampling rate, greater than the first, when the model is trained on those parameters for every band of the entire frequency band within at least one preset time period.
In some embodiments, the decision tree based discriminative model is trained without active noise reduction, the processor 510 is further configured to: and under the condition that the earphone carries out active noise reduction, acquiring a reference signal, calculating the difference of the reference signal under the conditions of active noise reduction and inactive noise reduction, and compensating the in-ear audio signal according to the difference so as to eliminate the influence of active noise reduction on the in-ear audio signal.
In some embodiments, processor 510 is further configured to: simultaneously adding a white noise signal to the in-ear audio signal and the out-of-ear audio signal.
Processor 510 may be a processing device including one or more general-purpose processing devices, such as a microprocessor, a Central Processing Unit (CPU), or a Graphics Processing Unit (GPU). More specifically, processor 510 may be a Complex Instruction Set Computing (CISC) microprocessor, a Reduced Instruction Set Computing (RISC) microprocessor, a Very Long Instruction Word (VLIW) microprocessor, a processor running other instruction sets, or a processor running a combination of instruction sets. Processor 510 may also be one or more special-purpose processing devices such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), or a system on a chip (SoC). Processor 510 may be communicatively coupled to memory 520 and configured to execute the computer-executable instructions stored thereon to perform the method of the above-described embodiments.
The memory 520 may be a non-transitory computer-readable medium such as Read Only Memory (ROM), Random Access Memory (RAM), phase change random access memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), other types of Random Access Memory (RAM), flash disk or other forms of flash memory, cache, registers, static memory, compact disk read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes or other magnetic storage devices, or any other possible non-transitory medium that can be used to store information or instructions that can be accessed by a computer device, and so forth.
The disclosed embodiments also provide a device for detecting the voice of a wearer of an earphone, the earphone comprising an in-ear microphone, an out-of-ear microphone and an intelligent voice system, the device comprising: an acquisition module configured to acquire data of an in-ear audio signal and an out-of-ear audio signal acquired via the in-ear microphone and the out-of-ear microphone; a parameter calculation module configured to calculate a phase difference related parameter and an energy difference related parameter between the in-ear and out-of-ear spectral vectors based on that data; and a speech determination module configured to determine, based on those parameters and using a discrimination model based on a plurality of decision trees, whether the earphone wearer is speaking, so that the intelligent voice system is activated when the wearer is speaking. The acquisition module can be realized in hardware or software, such as an interface; the parameter calculation module and the speech determination module can be realized as software algorithms, which place low power demands on the earphone chip and are simple, flexible, and inexpensive to implement.
The disclosed embodiments also provide a non-transitory computer readable medium having stored thereon instructions that, when executed by a processor, perform a method according to any of the above.
Moreover, although exemplary embodiments have been described herein, the scope includes any and all embodiments based on the disclosure with equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations, or alterations. The elements in the claims are to be interpreted broadly based on the language employed in the claims and are not limited to examples described in the specification or during prosecution of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with the true scope and spirit indicated by the following claims and their full scope of equivalents.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more versions thereof) may be used in combination with each other, and other embodiments will be apparent to those of ordinary skill in the art upon reading the above description. In addition, in the foregoing detailed description, various features may be grouped together to streamline the disclosure. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim; rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that such embodiments may be combined in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above embodiments are merely exemplary embodiments of the present invention and are not intended to limit it; the scope of the present invention is defined by the claims. Those skilled in the art may make various modifications and equivalents within the spirit and scope of the present invention, and such modifications and equivalents shall also be considered to fall within the scope of the present invention.

Claims (24)

1. A method of detecting speech of a wearer of a headset comprising an in-ear microphone, an out-of-ear microphone, and a smart speech system, the method comprising:
acquiring an in-ear audio signal and an out-of-ear audio signal by the in-ear microphone and the out-of-ear microphone, respectively;
calculating a phase difference related parameter and an energy difference related parameter between an in-ear spectral vector and an out-of-ear spectral vector based on data of the in-ear audio signal and the out-of-ear audio signal;
and determining, based on the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors, whether the headset wearer is uttering speech using a discriminative model based on a plurality of decision trees, so that the intelligent voice system is activated when the headset wearer utters speech.
2. The method of claim 1, wherein the phase difference related parameter and the energy difference related parameter between the inner and outer ear spectral vectors employ at least one of:
a phase difference related parameter and an energy difference related parameter between the in-ear and out-of-ear spectral vectors for a preset frequency band in each of a plurality of preset time periods;
a phase difference related parameter and an energy difference related parameter between the in-ear and out-of-ear spectral vectors for each frequency band of the whole frequency band within at least one preset time period.
3. The method of claim 2, wherein, in the case of using the phase difference related parameter and the energy difference related parameter for the preset frequency band in each of the plurality of preset time periods, calculating the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors further comprises:
filtering and sampling the collected in-ear audio signal and out-of-ear audio signal, respectively, to remove signals outside the preset frequency band;
calculating, for the preset frequency band, the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors from the amplitudes of the sample points obtained by sampling the in-ear audio signal and the out-of-ear audio signal within the preset time period, so as to obtain a plurality of phase difference related parameters and energy difference related parameters corresponding to the plurality of preset time periods.
4. The method according to claim 3, wherein the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors are, respectively, a correlation and an energy ratio between the in-ear audio signal and the out-of-ear audio signal in the preset frequency band, calculated by formula (1) and formula (2) below, where i is any integer between 1 and M, and M is the number of sample points within the preset time period:

$$\mathrm{correlation}=\frac{\sum_{i=1}^{M}\mathrm{inner}_i\,\mathrm{outer}_i}{\sqrt{\sum_{i=1}^{M}\mathrm{inner}_i^{2}}\,\sqrt{\sum_{i=1}^{M}\mathrm{outer}_i^{2}}}\qquad\text{formula (1)}$$

$$\mathrm{energy}_{\mathrm{ratio}}=\frac{\sum_{i=1}^{M}\mathrm{inner}_i^{2}}{\sum_{i=1}^{M}\mathrm{outer}_i^{2}}\qquad\text{formula (2)}$$

where $\mathrm{inner}_i$ is the amplitude of each sample point i obtained by sampling the in-ear audio signal, and $\mathrm{outer}_i$ is the amplitude of each sample point i obtained by sampling the out-of-ear audio signal.
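By way of illustration only, a short Python sketch of this computation follows, assuming formula (1) takes the standard normalized cross-correlation form and formula (2) the ratio of sample energies (the published formula images are not reproduced in this text, so the exact expressions may differ); the function name and epsilon guard are assumptions, not part of the claim.

```python
import numpy as np

def band_limited_features(inner: np.ndarray, outer: np.ndarray):
    """Correlation and energy ratio over the M sample points of one preset
    time period of the band-limited in-ear and out-of-ear signals."""
    eps = 1e-12  # numerical guard against division by zero (assumed)
    correlation = np.sum(inner * outer) / (
        np.sqrt(np.sum(inner ** 2)) * np.sqrt(np.sum(outer ** 2)) + eps)
    energy_ratio = np.sum(inner ** 2) / (np.sum(outer ** 2) + eps)
    return correlation, energy_ratio
```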
5. The method according to claim 2, wherein, in the case of using the phase difference related parameter and the energy difference related parameter for each frequency band of the whole frequency band within at least one preset time period, calculating the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors comprises:
performing a Fourier transform on each of the acquired in-ear audio signal and out-of-ear audio signal to obtain frequency spectra of the in-ear audio signal and the out-of-ear audio signal;
calculating, based on the frequency spectra, the phase angle and the energy of the in-ear audio signal and of the out-of-ear audio signal for each frequency band of the whole frequency band;
calculating, based on the phase angles and the energies of the in-ear audio signal and the out-of-ear audio signal, the phase difference and the energy ratio between the in-ear and out-of-ear spectral vectors of the respective frequency bands as the phase difference related parameter and the energy difference related parameter.
6. The method according to claim 5, wherein the phase angle and the energy of the in-ear audio signal and of the out-of-ear audio signal for each frequency band are expressed as formula (3) and formula (4), respectively:

$$\mathrm{angle}=\arctan\left(\frac{\mathrm{band}_{\mathrm{image}}}{\mathrm{band}_{\mathrm{real}}}\right)\qquad\text{formula (3)}$$

$$\mathrm{energy}=\mathrm{band}_{\mathrm{real}}^{2}+\mathrm{band}_{\mathrm{image}}^{2}\qquad\text{formula (4)}$$

where $\mathrm{band}_{\mathrm{real}}$ and $\mathrm{band}_{\mathrm{image}}$ denote the real part and the imaginary part of the spectral vector of the respective frequency band; the phase difference $\mathrm{angle}_{\mathrm{difference}}$ and the energy ratio $\mathrm{energy}_{\mathrm{ratio}}$ between the in-ear and out-of-ear spectral vectors of the respective frequency band are obtained by formula (5) and formula (6), respectively:

$$\mathrm{angle}_{\mathrm{difference}}=\mathrm{angle}_{\mathrm{inner}}-\mathrm{angle}_{\mathrm{outer}}\qquad\text{formula (5)}$$

$$\mathrm{energy}_{\mathrm{ratio}}=\frac{\mathrm{energy}_{\mathrm{inner}}}{\mathrm{energy}_{\mathrm{outer}}}\qquad\text{formula (6)}$$

where $\mathrm{angle}_{\mathrm{inner}}$ and $\mathrm{angle}_{\mathrm{outer}}$ are the phase angles of the in-ear audio signal and the out-of-ear audio signal of the respective frequency band, and $\mathrm{energy}_{\mathrm{inner}}$ and $\mathrm{energy}_{\mathrm{outer}}$ are the energies of the in-ear audio signal and the out-of-ear audio signal of the respective frequency band.
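A corresponding illustrative sketch of the full-band computation of formulas (3) through (6) is given below. NumPy, the FFT length, the epsilon guard, and the use of arctan2 (which resolves the quadrant, where formula (3) writes arctan) are assumed implementation choices, not part of the claim.

```python
import numpy as np

def full_band_features(inner: np.ndarray, outer: np.ndarray, n_fft: int = 256):
    """Per-band phase difference (formula 5) and energy ratio (formula 6)
    from the spectra of one frame of the in-ear and out-of-ear signals."""
    spec_in = np.fft.rfft(inner, n_fft)
    spec_out = np.fft.rfft(outer, n_fft)
    # Formula (3): phase angle from the real/imaginary parts of each band
    # (arctan2 used instead of arctan to resolve the quadrant).
    angle_inner = np.arctan2(spec_in.imag, spec_in.real)
    angle_outer = np.arctan2(spec_out.imag, spec_out.real)
    # Formula (4): band energy as real^2 + imag^2.
    energy_inner = spec_in.real ** 2 + spec_in.imag ** 2
    energy_outer = spec_out.real ** 2 + spec_out.imag ** 2
    # Formulas (5) and (6): per-band phase difference and energy ratio.
    angle_difference = angle_inner - angle_outer
    energy_ratio = energy_inner / (energy_outer + 1e-12)
    return angle_difference, energy_ratio
```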
7. The method of claim 1, wherein determining whether the headset wearer is uttering speech using the multi-decision-tree discriminative model based on the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors further comprises:
inputting the phase difference related parameter and the energy difference related parameter as feature values into each decision tree;
summing the results output by the decision trees and normalizing the sum to obtain a probability value that the headset wearer is uttering speech; and
determining that the headset wearer is uttering speech when the probability value is greater than or equal to a preset probability threshold, and that the headset wearer is not uttering speech when the probability value is below the threshold.
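An illustrative sketch of this decision step follows. The claim does not specify the normalization, so a sigmoid is assumed here, and the tree objects are assumed to expose a scikit-learn-style predict; both are assumptions for the sketch only.

```python
import numpy as np

def wearer_speech_probability(trees, features: np.ndarray) -> float:
    """Sum the outputs of all decision trees for one feature vector and
    normalize the sum into [0, 1] (sigmoid normalization assumed)."""
    x = features.reshape(1, -1)
    score = sum(float(tree.predict(x)[0]) for tree in trees)
    return 1.0 / (1.0 + np.exp(-score))

def wearer_is_speaking(trees, features, prob_threshold: float = 0.5) -> bool:
    # Speech is declared when the probability reaches the preset threshold.
    return wearer_speech_probability(trees, features) >= prob_threshold
```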
8. The method according to claim 7, wherein the feature-value decision thresholds in the multi-decision-tree discriminative model and the preset probability threshold are obtained by measuring a plurality of usage scenarios in advance during the design stage of the headset, a usage scenario being defined by any one or a combination of wearing condition, ear canal structure of a user or an artificial ear, speech volume, and ambient noise.
9. The method according to claim 8, wherein the decision-tree-based discriminative model comprises a gradient-boosted tree model, and, during the design stage of the headset, the number of trees in the gradient-boosted tree model is adjusted at a preset tree depth until a required accuracy is achieved, yielding the tuned number of trees.
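The design-stage tuning described in claim 9 might be sketched as follows, using scikit-learn's GradientBoostingClassifier as a stand-in (the disclosure names no library, and the depth, step, and accuracy values are placeholders): fix the tree depth and grow the ensemble until validation accuracy reaches the required level.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

def tune_tree_count(X_train, y_train, X_val, y_val,
                    tree_depth=3, required_accuracy=0.95, max_trees=500):
    """Increase the number of trees at a preset depth until the required
    accuracy is reached; returns the tuned tree count and the model."""
    model = None
    for n_trees in range(10, max_trees + 1, 10):
        model = GradientBoostingClassifier(n_estimators=n_trees,
                                           max_depth=tree_depth)
        model.fit(X_train, y_train)
        if accuracy_score(y_val, model.predict(X_val)) >= required_accuracy:
            return n_trees, model
    return max_trees, model
```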
10. The method of any one of claims 1 to 9, further comprising:
performing echo cancellation processing on the in-ear audio signal when the headset plays music; wherein
the echo cancellation processing uses a first sampling rate in the case where the discriminative model is trained with phase difference related parameters and energy difference related parameters between the in-ear and out-of-ear spectral vectors for the preset frequency band in each of the plurality of preset time periods;
the echo cancellation processing uses a second sampling rate in the case where the discriminative model is trained with phase difference related parameters and energy difference related parameters between the in-ear and out-of-ear spectral vectors for each frequency band of the whole frequency band within at least one preset time period; and
the second sampling rate is greater than the first sampling rate.
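A trivial illustrative sketch of this rate selection is given below; the concrete rate values are placeholders, as the claim only requires that the second rate exceed the first.

```python
FIRST_SAMPLE_RATE_HZ = 8_000    # band-limited features: cheaper echo cancellation
SECOND_SAMPLE_RATE_HZ = 16_000  # full-band features: must exceed the first rate

def echo_cancellation_rate(full_band_features: bool) -> int:
    # Full-band features need the whole spectrum intact, hence the higher rate.
    return SECOND_SAMPLE_RATE_HZ if full_band_features else FIRST_SAMPLE_RATE_HZ
```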
11. The method according to any one of claims 1 to 9, wherein the decision-tree-based discriminative model is trained without active noise reduction, the method further comprising:
acquiring a reference signal when the headset performs active noise reduction, calculating the difference of the reference signal between the active-noise-reduction and no-active-noise-reduction conditions, and compensating the in-ear audio signal according to the difference so as to eliminate the influence of active noise reduction on the in-ear audio signal.
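Claim 11 leaves the compensation rule abstract; the sketch below assumes a simple additive correction derived from the reference-signal difference between ANC on and ANC off.

```python
import numpy as np

def compensate_for_anc(in_ear: np.ndarray,
                       reference_anc_on: np.ndarray,
                       reference_anc_off: np.ndarray) -> np.ndarray:
    """Estimate the ANC's effect from the reference-signal difference and
    remove it from the in-ear signal (additive model assumed)."""
    difference = reference_anc_off - reference_anc_on
    return in_ear + difference
```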
12. The method of any one of claims 1 to 9, further comprising:
simultaneously adding a white noise signal to the in-ear audio signal and the out-of-ear audio signal.
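The white-noise addition of claim 12 might be sketched as below, reading "a white noise signal" as a single realization added to both channels simultaneously; the noise level and seed are illustrative assumptions.

```python
import numpy as np

_rng = np.random.default_rng(seed=0)

def add_shared_white_noise(inner: np.ndarray, outer: np.ndarray,
                           noise_std: float = 1e-4):
    """Add the same white-noise realization to both signals simultaneously,
    so the added noise does not bias their phase/energy relationship."""
    noise = _rng.normal(0.0, noise_std, size=inner.shape)
    return inner + noise, outer + noise
```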
13. An apparatus for detecting speech of a wearer of a headset including an in-ear microphone, an out-of-ear microphone, and a smart speech system, the apparatus comprising:
an interface configured to acquire data of in-ear and out-of-ear audio signals acquired via the in-ear and out-of-ear microphones;
a processor configured to:
calculating a phase difference related parameter and an energy difference related parameter between an in-ear spectral vector and an out-of-ear spectral vector based on data of the in-ear audio signal and the out-of-ear audio signal;
and determine, based on the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors, whether the headset wearer is uttering speech using a discriminative model based on a plurality of decision trees, so that the intelligent voice system is activated when the headset wearer utters speech.
14. The apparatus of claim 13, wherein the phase difference related parameter and the energy difference related parameter between the inner and outer ear spectral vectors employ at least one of:
a phase difference related parameter and an energy difference related parameter between the in-ear and out-of-ear spectral vectors for a preset frequency band in each of a plurality of preset time periods;
a phase difference related parameter and an energy difference related parameter between the in-ear and out-of-ear spectral vectors for each frequency band of the whole frequency band within at least one preset time period.
15. The apparatus of claim 14, wherein in case of employing the phase difference related parameter and the energy difference related parameter for the preset frequency band in each of the plurality of preset time periods, the processor is further configured to:
filter and sample the collected in-ear audio signal and out-of-ear audio signal, respectively, to remove signals outside the preset frequency band;
calculate, for the preset frequency band, the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors from the amplitudes of the sample points obtained by sampling the in-ear audio signal and the out-of-ear audio signal within the preset time period, so as to obtain a plurality of phase difference related parameters and energy difference related parameters corresponding to the plurality of preset time periods.
16. The apparatus according to claim 15, wherein the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors are, respectively, a correlation and an energy ratio between the in-ear audio signal and the out-of-ear audio signal in the preset frequency band, calculated by formula (1) and formula (2) below, where i is any integer between 1 and M, and M is the number of sample points within the preset time period:

$$\mathrm{correlation}=\frac{\sum_{i=1}^{M}\mathrm{inner}_i\,\mathrm{outer}_i}{\sqrt{\sum_{i=1}^{M}\mathrm{inner}_i^{2}}\,\sqrt{\sum_{i=1}^{M}\mathrm{outer}_i^{2}}}\qquad\text{formula (1)}$$

$$\mathrm{energy}_{\mathrm{ratio}}=\frac{\sum_{i=1}^{M}\mathrm{inner}_i^{2}}{\sum_{i=1}^{M}\mathrm{outer}_i^{2}}\qquad\text{formula (2)}$$

where $\mathrm{inner}_i$ is the amplitude of each sample point i obtained by sampling the in-ear audio signal, and $\mathrm{outer}_i$ is the amplitude of each sample point i obtained by sampling the out-of-ear audio signal.
17. The apparatus of claim 14, wherein in case of employing the phase difference related parameter and the energy difference related parameter for each frequency band in the whole frequency band within at least one preset time period, the processor is further configured to:
perform a Fourier transform on each of the acquired in-ear audio signal and out-of-ear audio signal to obtain frequency spectra of the in-ear audio signal and the out-of-ear audio signal;
calculate, based on the frequency spectra, the phase angle and the energy of the in-ear audio signal and of the out-of-ear audio signal for each frequency band of the whole frequency band;
calculate, based on the phase angles and the energies of the in-ear audio signal and the out-of-ear audio signal, the phase difference and the energy ratio between the in-ear and out-of-ear spectral vectors of the respective frequency bands as the phase difference related parameter and the energy difference related parameter.
18. The apparatus according to claim 17, wherein the phase angle and the energy of the in-ear audio signal and of the out-of-ear audio signal for each frequency band are expressed as formula (3) and formula (4), respectively:

$$\mathrm{angle}=\arctan\left(\frac{\mathrm{band}_{\mathrm{image}}}{\mathrm{band}_{\mathrm{real}}}\right)\qquad\text{formula (3)}$$

$$\mathrm{energy}=\mathrm{band}_{\mathrm{real}}^{2}+\mathrm{band}_{\mathrm{image}}^{2}\qquad\text{formula (4)}$$

where $\mathrm{band}_{\mathrm{real}}$ and $\mathrm{band}_{\mathrm{image}}$ denote the real part and the imaginary part of the spectral vector of the respective frequency band; the phase difference $\mathrm{angle}_{\mathrm{difference}}$ and the energy ratio $\mathrm{energy}_{\mathrm{ratio}}$ between the in-ear and out-of-ear spectral vectors of the respective frequency band are calculated by formula (5) and formula (6), respectively:

$$\mathrm{angle}_{\mathrm{difference}}=\mathrm{angle}_{\mathrm{inner}}-\mathrm{angle}_{\mathrm{outer}}\qquad\text{formula (5)}$$

$$\mathrm{energy}_{\mathrm{ratio}}=\frac{\mathrm{energy}_{\mathrm{inner}}}{\mathrm{energy}_{\mathrm{outer}}}\qquad\text{formula (6)}$$

where $\mathrm{angle}_{\mathrm{inner}}$ and $\mathrm{angle}_{\mathrm{outer}}$ are the phase angles of the in-ear audio signal and the out-of-ear audio signal of the respective frequency band, and $\mathrm{energy}_{\mathrm{inner}}$ and $\mathrm{energy}_{\mathrm{outer}}$ are the energies of the in-ear audio signal and the out-of-ear audio signal of the respective frequency band.
19. The apparatus of claim 13, wherein the processor is further configured to:
input the phase difference related parameter and the energy difference related parameter as feature values into each decision tree;
sum the results output by the decision trees and normalize the sum to obtain a probability value that the headset wearer is uttering speech; and
determine that the headset wearer is uttering speech when the probability value is greater than or equal to a preset probability threshold, and that the headset wearer is not uttering speech when the probability value is below the threshold.
20. The apparatus of any of claims 13 to 19, wherein the processor is further configured to:
perform echo cancellation processing on the in-ear audio signal when the headset plays music; wherein
the echo cancellation processing uses a first sampling rate in the case where the discriminative model is trained with phase difference related parameters and energy difference related parameters between the in-ear and out-of-ear spectral vectors for the preset frequency band in each of the plurality of preset time periods;
the echo cancellation processing uses a second sampling rate in the case where the discriminative model is trained with phase difference related parameters and energy difference related parameters between the in-ear and out-of-ear spectral vectors for each frequency band of the whole frequency band within at least one preset time period; and
the second sampling rate is greater than the first sampling rate.
21. The apparatus according to any one of claims 13 to 19, wherein the decision tree based discriminative model is trained without active noise reduction, the processor further configured to:
acquire a reference signal when the headset performs active noise reduction, calculate the difference of the reference signal between the active-noise-reduction and no-active-noise-reduction conditions, and compensate the in-ear audio signal according to the difference so as to eliminate the influence of active noise reduction on the in-ear audio signal.
22. The apparatus of any of claims 13 to 19, wherein the processor is further configured to:
simultaneously adding a white noise signal to the in-ear audio signal and the out-of-ear audio signal.
23. An apparatus for detecting speech of a wearer of a headset including an in-ear microphone, an out-of-ear microphone, and a smart speech system, the apparatus comprising:
an acquisition module configured to: acquiring data of an in-ear audio signal and an out-of-ear audio signal acquired via the in-ear microphone and the out-of-ear microphone;
a parameter calculation module configured to: calculating a phase difference related parameter and an energy difference related parameter between an in-ear spectral vector and an out-of-ear spectral vector based on data of the in-ear audio signal and the out-of-ear audio signal;
a speech determination module configured to: determine, based on the phase difference related parameter and the energy difference related parameter between the in-ear and out-of-ear spectral vectors, whether the headset wearer is uttering speech using a discriminative model based on a plurality of decision trees, so that the intelligent voice system is activated when the headset wearer utters speech.
24. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-12.
CN202010876645.7A 2020-08-27 2020-08-27 Method, device and storage medium for detecting voice of earphone wearer Active CN111933140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010876645.7A CN111933140B (en) 2020-08-27 2020-08-27 Method, device and storage medium for detecting voice of earphone wearer


Publications (2)

Publication Number Publication Date
CN111933140A true CN111933140A (en) 2020-11-13
CN111933140B CN111933140B (en) 2023-11-03

Family

ID=73308553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010876645.7A Active CN111933140B (en) 2020-08-27 2020-08-27 Method, device and storage medium for detecting voice of earphone wearer

Country Status (1)

Country Link
CN (1) CN111933140B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170311261A1 (en) * 2016-04-25 2017-10-26 Sensory, Incorporated Smart listening modes supporting quasi always-on listening
CN107390851A (en) * 2016-04-25 2017-11-24 感官公司 Support the accurate intelligent listening pattern listened to all the time
WO2017207286A1 (en) * 2016-06-02 2017-12-07 Parrot Drones Audio microphone/headset combination comprising multiple means for detecting vocal activity with supervised classifier
CN110097875A (en) * 2019-06-03 2019-08-06 清华大学 Interactive voice based on microphone signal wakes up electronic equipment, method and medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120603A (en) * 2021-11-26 2022-03-01 歌尔科技有限公司 Voice control method, earphone and storage medium
CN114120603B (en) * 2021-11-26 2023-08-08 歌尔科技有限公司 Voice control method, earphone and storage medium

Also Published As

Publication number Publication date
CN111933140B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN108369813B (en) Specific voice recognition method, apparatus and storage medium
CN106486131B (en) A kind of method and device of speech de-noising
CN110021307B (en) Audio verification method and device, storage medium and electronic equipment
EP2695160B1 (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN101023469B (en) Digital filtering method, digital filtering equipment
CN110364168B (en) Voiceprint recognition method and system based on environment perception
CN112542174A (en) VAD-based multi-dimensional characteristic parameter voiceprint identification method
US20160027438A1 (en) Concurrent Segmentation of Multiple Similar Vocalizations
Zhao et al. Multi-stream spectro-temporal features for robust speech recognition.
CN113192504A (en) Domain-adaptation-based silent voice attack detection method
CN111933140A (en) Method, device and storage medium for detecting voice of earphone wearer
Keronen et al. Mask estimation and imputation methods for missing data speech recognition in a multisource reverberant environment
Maganti et al. Auditory processing-based features for improving speech recognition in adverse acoustic conditions
Wang et al. Low pass filtering and bandwidth extension for robust anti-spoofing countermeasure against codec variabilities
CN111862978A (en) Voice awakening method and system based on improved MFCC (Mel frequency cepstrum coefficient)
CN116386589A (en) Deep learning voice reconstruction method based on smart phone acceleration sensor
CN111261192A (en) Audio detection method based on LSTM network, electronic equipment and storage medium
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
Singh et al. Modified group delay function using different spectral smoothing techniques for voice liveness detection
CN111192569B (en) Double-microphone voice feature extraction method and device, computer equipment and storage medium
Lin et al. A robust method for speech replay attack detection
Kahrizi et al. Long-term spectral pseudo-entropy (ltspe): a new robust feature for speech activity detection
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
CN108962249B (en) Voice matching method based on MFCC voice characteristics and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant