CN117292691A - Audio energy analysis method and related device


Info

Publication number
CN117292691A
Authority
CN
China
Prior art keywords: energy, audio energy, audio, sound source, total
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210697612.5A
Other languages
Chinese (zh)
Inventor
郝斌 (Hao Bin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd and Haier Smart Home Co Ltd
Priority to CN202210697612.5A
Priority to PCT/CN2022/102036
Publication of CN117292691A
Legal status: Pending

Classifications

    • G10L15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/18: Speaker identification or verification; artificial neural networks; connectionist approaches
    • G10L17/20: Speaker identification or verification; pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/22: Speaker identification or verification; interactive procedures; man-machine interfaces
    • G10L25/21: Speech or voice analysis; the extracted parameters being power information
    • G10L25/30: Speech or voice analysis; analysis technique using neural networks


Abstract

The embodiments of the present application disclose an audio energy analysis method and a related device. When performing audio energy analysis, an energy loss parameter of a first device can first be determined; this parameter identifies the loss incurred when a second device transmits audio energy to the first device. The processing device can also acquire the first total audio energy received by the first device and the self audio energy generated by the second device playing audio. By analyzing these data together, the audio energy that the first device receives from the voice-interaction sound source can be obtained, eliminating the interference that the second device's self-played audio causes in voice-interaction recognition. On this basis, the device the user actually wants to interact with can be identified more accurately, improving the user's voice interaction experience.

Description

Audio energy analysis method and related device
Technical Field
The present disclosure relates to the field of data analysis technologies, and in particular, to an audio energy analysis method and a related device.
Background
Voice interaction is a common means of human-machine interaction. When a scene contains multiple devices that support voice interaction, the devices involved need to determine which device the user actually wants to interact with and carry out the corresponding voice interaction.
In the related art, when the devices performing voice interaction emit no sound of their own, the device the user intends to address can be judged accurately; however, when these devices are themselves producing sound, it is difficult to determine which device the user wants to interact with, and the user's voice interaction experience suffers.
Disclosure of Invention
To solve this technical problem, the present application provides an audio energy analysis method in which a processing device can analyze the audio interference produced by a device itself, so that the sound source of the voice interaction can be identified accurately and the user's voice interaction experience is improved.
The embodiment of the application discloses the following technical scheme:
in a first aspect, embodiments of the present application disclose an audio energy analysis method, the method comprising:
determining an energy loss parameter of a first device, the energy loss parameter being used to identify the loss incurred when a second device transmits audio energy to the first device;
acquiring first total audio energy of the first device and self audio energy of the second device, wherein the first total audio energy is audio energy received by the first device, and the self audio energy is energy generated based on audio played by the second device;
and determining first sound source audio energy of the first device according to the energy loss parameter, the self audio energy and the first total audio energy, wherein the first sound source audio energy is the audio energy acquired from a sound source by the first device.
In one possible implementation, the energy loss parameter is determined as follows:
determining, in an environment containing no sound source other than the second device, a test self audio energy of the second device when playing audio and a received audio energy received by the first device from the second device;
and determining the energy loss parameter of the first device according to the ratio of the received audio energy to the test self audio energy.
In one possible implementation, the method further includes:
determining a second total audio energy received by the second device;
determining second sound source audio energy of the second device according to the self audio energy and the second total audio energy, wherein the second sound source audio energy and the first sound source audio energy are from the same sound source;
and waking up a voice interaction function of the second device in response to the second sound source audio energy being greater than the first sound source audio energy.
In a possible implementation manner, the determining the second sound source audio energy of the second device according to the self audio energy and the second total audio energy includes:
determining nonlinear energy of the second device according to the self audio energy, wherein the nonlinear energy is generated based on vibration generated when the second device plays audio;
and removing the self audio energy and the nonlinear energy in the second total audio energy to obtain the second sound source audio energy.
In a possible implementation manner, the determining, according to the self audio energy, nonlinear energy of the second device includes:
and determining nonlinear energy of the second equipment according to the self audio energy through a neural network model.
In one possible implementation, the neural network model is trained by:
acquiring a training sample set, wherein the training sample set comprises sample self audio energy and sample total audio energy acquired by target equipment in an environment without other sound sources;
determining sample nonlinear energy of the target device according to the sample total audio energy and the sample own audio energy;
and training an initial neural network model through the total audio energy of the sample, the self audio energy of the sample and the nonlinear energy of the sample to obtain the neural network model.
In one possible implementation, the neural network model is a convolutional recurrent network model having a single encoder layer and a single decoder layer.
In a second aspect, an embodiment of the present application discloses an audio energy analysis apparatus, the apparatus including a first determination unit, an acquisition unit, and a second determination unit:
the first determining unit is configured to determine an energy loss parameter of a first device, where the energy loss parameter is used to identify a loss when a second device transmits audio energy to the first device;
the acquiring unit is configured to acquire a first total audio energy of the first device and an own audio energy of the second device, where the first total audio energy is audio energy received by the first device, and the own audio energy is energy generated based on audio played by the second device;
the second determining unit is configured to determine, according to the energy loss parameter, the own audio energy, and the first total audio energy, first sound source audio energy of the first device, where the first sound source audio energy is audio energy obtained by the first device from a sound source.
In one possible implementation, the energy loss parameter is determined as follows:
determining, in an environment containing no sound source other than the second device, a test self audio energy of the second device when playing audio and a received audio energy received by the first device from the second device;
and determining the energy loss parameter of the first device according to the ratio of the received audio energy to the test self audio energy.
In a possible implementation manner, the apparatus further includes a third determining unit, a fourth determining unit, and a wake-up unit:
the third determining unit is configured to determine a second total audio energy received by the second device;
the fourth determining unit is configured to determine second sound source audio energy of the second device according to the self audio energy and the second total audio energy, where the second sound source audio energy and the first sound source audio energy are from a same sound source;
the awakening unit is used for awakening the voice interaction function of the second equipment if the second sound source audio energy is larger than the first sound source audio energy;
and if the second sound source audio energy is smaller than the first sound source audio energy, waking up the voice interaction function of the first device.
In a possible implementation manner, the fourth determining unit is specifically configured to:
determining nonlinear energy of the second device according to the self audio energy, wherein the nonlinear energy is generated based on vibration generated when the second device plays audio;
and removing the self audio energy and the nonlinear energy in the second total audio energy to obtain the second sound source audio energy.
In a possible implementation manner, the fourth determining unit is specifically configured to:
and determining nonlinear energy of the second equipment according to the self audio energy through a neural network model.
In one possible implementation, the neural network model is trained by:
acquiring a training sample set, wherein the training sample set comprises sample self audio energy and sample total audio energy acquired by target equipment in an environment without other sound sources;
determining sample nonlinear energy of the target device according to the sample total audio energy and the sample own audio energy;
and training an initial neural network model through the total audio energy of the sample, the self audio energy of the sample and the nonlinear energy of the sample to obtain the neural network model.
In one possible implementation, the neural network model is a convolutional recurrent network model having a single encoder layer and a single decoder layer.
In a third aspect, embodiments of the present application disclose a computer-readable storage medium storing instructions that, when executed by at least one processor, perform the audio energy analysis method of any one of the first aspects.
In a fourth aspect, embodiments of the present application disclose a computer device comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, perform the audio energy analysis method of any one of the first aspects.
As can be seen from the above technical solution, when performing audio energy analysis the processing device may first determine the energy loss parameter of the first device, which identifies the loss incurred when the second device transmits audio energy to the first device. The processing device may also acquire the first total audio energy received by the first device and the self audio energy generated by the second device playing audio. By analyzing these data together, it obtains the audio energy the first device receives from the voice-interaction sound source, eliminating the interference of the second device's self-played audio with voice-interaction recognition. Based on this audio energy, the device the user actually wants to interact with can be identified more accurately, improving the user's voice interaction experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present application, and that a person of ordinary skill in the art may derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of an audio energy analysis method according to an embodiment of the present application;
fig. 2 is a block diagram of an audio energy analysis device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
It will be appreciated that the method may be applied to any processing device capable of audio energy analysis, for example a terminal device or a server with an audio energy analysis function. The method may be executed independently by a terminal device or a server, or it may be applied in a networked scenario in which a terminal device and a server communicate and execute the method cooperatively. The terminal device may be a computer, a mobile phone, or similar equipment. The server may be an application server or a Web server; in an actual deployment it may be an independent server or a server cluster.
Referring to fig. 1, fig. 1 is a flowchart of an audio energy analysis method according to an embodiment of the present application, where the method includes:
s101: an energy loss parameter of the first device is determined.
Here the energy loss parameter is used to identify the loss incurred when the second device transmits audio energy to the first device, i.e., how much of the audio energy emitted by the second device the first device is able to receive. For example, the energy loss parameter may be 0.9, meaning that when the second device emits audio energy, the first device receives 90% of the emitted audio energy.
S102: a first total audio energy of the first device and an own audio energy of the second device are acquired.
The first total audio energy is the audio energy received by the first device, and the self audio energy is the energy generated by the audio the second device plays. It will be appreciated that, in a scenario where a device is playing its own audio, the audio energy received by the first device comprises two parts: audio energy generated by the user performing voice interaction, and audio energy generated by the audio emitted by the second device. In a multi-device scenario, which device the user wants to interact with is typically determined from the audio energy each device receives from the user. Therefore, in the embodiments of the present application, the processing device must first remove the portion of the first total audio energy that comes from the second device before accurate voice interaction can be performed.
S103: a first source audio energy of the first device is determined based on the energy loss parameter, the own audio energy, and the first total audio energy.
The energy loss parameter described above identifies the loss incurred when the second device transmits audio energy to the first device, and the self audio energy is the audio energy generated by the second device playing audio. The processing device can therefore use the energy loss parameter and the self audio energy to determine how much audio energy the first device receives from the second device, and remove that amount from the first total audio energy to obtain the first sound source audio energy of the first device. The first sound source audio energy is the audio energy the first device acquires from the sound source, where the sound source is the one performing voice interaction and is distinct from the second device; for example, it may be a user performing voice interaction.
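As an illustration only (a minimal Python/NumPy sketch; the function and variable names are assumptions, not part of the patent), the per-bin computation of the first sound source audio energy could look like this:

```python
import numpy as np

def first_source_energy(total_first, self_second, loss_param):
    """Estimate the audio energy the first device receives from the
    voice-interaction sound source, per frequency bin.

    total_first : first total audio energy received by the first device
    self_second : self audio energy of the second device's playback
    loss_param  : energy loss parameter for transmission from the
                  second device to the first device
    """
    # Portion of the playback energy that reaches the first device
    received_from_second = loss_param * self_second
    # Remove the interference; clip at zero, since energy is non-negative
    return np.maximum(total_first - received_from_second, 0.0)

# Illustrative numbers: 90% of the playback energy reaches the first device
print(first_source_energy(np.array([1.0]), np.array([0.5]), 0.9))  # [0.55]
```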
As can be seen from the above technical solution, when performing audio energy analysis the processing device may first determine the energy loss parameter of the first device, which identifies the loss incurred when the second device transmits audio energy to the first device. The processing device may also acquire the first total audio energy received by the first device and the self audio energy generated by the second device playing audio. By analyzing these data together, it obtains the audio energy the first device receives from the voice-interaction sound source, eliminating the interference of the second device's self-played audio with voice-interaction recognition. Based on this audio energy, the device the user actually wants to interact with can be identified more accurately, improving the user's voice interaction experience.
In one possible implementation, the energy loss parameter may be determined as follows. In an environment containing no sound source other than the second device, the processing device determines the test self audio energy of the second device while it plays audio, i.e. the audio energy generated by the audio the second device plays, together with the received audio energy that the first device receives from the second device. The processing device can then determine the energy loss parameter of the first device from the ratio of the received audio energy to the test self audio energy; this ratio reflects the difference between the audio energy the first device actually receives and the audio energy the second device emits, and therefore identifies the loss incurred when the second device transmits audio energy to the first device.
In one possible implementation, to determine whether the user wants to interact with the second device or with the first device, the processing device may further determine the second total audio energy received by the second device, and then determine the second sound source audio energy of the second device from the self audio energy and the second total audio energy, that is, by removing the self audio energy portion from the second total audio energy; the second sound source audio energy and the first sound source audio energy come from the same sound source. The processing device may then compare the second sound source audio energy with the first sound source audio energy; this relation reflects, to a certain extent, which device the user intends to address. If the second sound source audio energy is greater than the first sound source audio energy, the user more likely wants to interact with the second device, and the processing device can wake up the voice interaction function of the second device; if it is smaller, the user more likely wants to interact with the first device, and the processing device can wake up the voice interaction function of the first device.
The flow is described below taking two smart speakers, A and B, as an example. The two speakers are configured with the same wake-up word and have a distributed wake-up function. When both devices receive the wake-up word, the processing device can calculate the audio energy and upload it to the cloud for a decision, and the cloud selects the device that should respond according to the scoring criterion.
First, the cloud can instruct speaker A to play white-noise audio for a period of time. The audio signals received by speakers A and B are transformed by STFT, and the average audio energy over this period is computed as X_A(k) and X_{A→B}(k) respectively. Taking a 16 kHz sampling rate and an FFT length of 512 as an example, and restricting the statistics to 200-5000 Hz (frequency bins k = 3-160), the energy loss parameter of speaker B is C_{A→B}(k) = X_{A→B}(k) / X_A(k).
Conversely, the energy loss parameter of speaker A is obtained as C_{B→A}(k) = X_{B→A}(k) / X_B(k).
Here X_A(k) and X_{A→B}(k) are computed on the devices and uploaded to the cloud; the cloud computes C_{A→B}(k) and pushes it to device A, and pushes C_{B→A}(k) in the same way.
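For illustration, this calibration could be sketched in Python as follows (a non-authoritative sketch using SciPy's STFT; the recording names and the simulated attenuation are hypothetical):

```python
import numpy as np
from scipy.signal import stft

FS, NFFT = 16000, 512      # 16 kHz sampling rate, FFT length 512
K_LO, K_HI = 3, 160        # bins covering roughly 200-5000 Hz

def band_energy(signal):
    """Time-averaged power per frequency bin, X(k)."""
    _, _, spec = stft(signal, fs=FS, nperseg=NFFT)
    return (np.abs(spec) ** 2).mean(axis=1)

def loss_parameter(played_at_a, recorded_at_b):
    """C_{A->B}(k) = X_{A->B}(k) / X_A(k) over the analysis band."""
    x_a = band_energy(played_at_a)[K_LO:K_HI + 1]
    x_ab = band_energy(recorded_at_b)[K_LO:K_HI + 1]
    return x_ab / np.maximum(x_a, 1e-12)  # guard against division by zero

# White noise played at A; B is simulated as hearing an attenuated copy
noise = np.random.randn(FS * 5)            # 5 s of white noise
c_ab = loss_parameter(noise, 0.3 * noise)  # ~0.09 in every bin (0.3 ** 2)
```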
When device A is playing audio and the user approaches A and utters the wake-up word, the audio energy received by A comprises:
Y_A(l,k) = S_A(l,k) + E_A(l,k), where S_A(l,k) denotes the sound source audio energy at speaker A, E_A(l,k) denotes speaker A's self audio energy, and Y_A(l,k) denotes the second total audio energy.
The audio energy received by speaker B at this moment comprises:
Y_{A→B}(l,k) = S_B(l,k) + E_{A→B}(l,k), where S_B(l,k) denotes the sound source audio energy at speaker B and E_{A→B}(l,k) denotes the audio energy speaker B receives from the audio emitted by speaker A.
Through AEC, the sound source audio energy of speaker A can be obtained as Y'_A(l,k) = S_A(l,k), so that E_A(l,k) = Y_A(l,k) - Y'_A(l,k). After correction with the loss parameter, E_{A→B}(l,k) = E_A(l,k) · C_{A→B}(k), and the sound source audio energy of speaker B follows as S_B(l,k) = Y_{A→B}(l,k) - E_{A→B}(l,k).
By comparing the magnitudes of S_B(l,k) and S_A(l,k), the processing device can determine which speaker the user actually wants to wake up.
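A sketch of that decision, continuing the example (a hypothetical function; the inputs are the per-frame, per-bin energies defined above):

```python
import numpy as np

def decide_wakeup(y_a, y_a_aec, y_ab, c_ab, band=slice(3, 161)):
    """Choose the speaker to wake, given:
    y_a     : Y_A(l,k), total energy at speaker A (the one playing audio)
    y_a_aec : Y'_A(l,k) = S_A(l,k), A's energy after echo cancellation
    y_ab    : Y_{A->B}(l,k), total energy at speaker B
    c_ab    : C_{A->B}(k), calibrated energy loss parameter
    Arrays are shaped (frames, bins); c_ab is shaped (bins,).
    """
    s_a = y_a_aec                        # A's sound source audio energy
    e_a = y_a - y_a_aec                  # A's self audio energy
    e_ab = e_a * c_ab                    # A's playback as received at B
    s_b = np.maximum(y_ab - e_ab, 0.0)   # B's sound source audio energy
    # Compare total source energy within the analysis band
    return "B" if s_b[:, band].sum() > s_a[:, band].sum() else "A"
```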
It will be appreciated that, owing to vibration and similar effects produced during playback, the audio energy a device receives from its own playback may not equal the energy of the audio being played. Therefore, in one possible implementation, to obtain a more accurate analysis, when analyzing the sound source audio energy of the second device the processing device may determine the nonlinear energy of the second device from its self audio energy, the nonlinear energy being generated by the vibration produced when the second device plays audio. The processing device may then remove both the self audio energy and the nonlinear energy from the second total audio energy to obtain the second sound source audio energy, yielding a more accurate estimate of the sound source energy.
In one possible implementation, the processing device may determine the nonlinear energy of the second device from its self audio energy by means of a neural network model. The neural network model may be trained as follows:
first, the processing device may acquire a training sample set that includes the sample's own audio energy and the sample's total audio energy that the target device acquired in an environment without other sound sources. Since the sample self audio energy is determined directly based on the audio information played by the target device itself, the sample self audio energy is the linear audio energy of the target device; the sample total audio energy is the total audio energy received from the target device, i.e. comprises linear audio energy and nonlinear audio energy, so that the processing device can determine the sample nonlinear energy of the target device according to the sample total audio energy and the sample own audio energy, and then train an initial neural network model through the sample total audio energy, the sample own audio energy and the sample nonlinear energy to obtain the neural network model, and in the training process, the neural network model can learn the association relation between the nonlinear audio energy part and the linear audio energy part, so as to learn how to determine the nonlinear audio energy part based on the linear audio energy part.
For example, the microphone signal Mic (i.e. the sample total audio energy) can be echo-cancelled against the playback reference signal Ref (i.e. the sample self audio energy) to obtain Aec(l,k) (the sample nonlinear energy). Linear methods such as NLMS and RLS remove the linear audio energy component E_linear(l,k), while a portion of the nonlinear audio energy, the residual component E_residual(l,k), remains.
The speech signals mic, ref and aec are complex-valued frequency-domain signals obtained by STFT. Taking a 16 kHz sampling rate, a 16 ms frame length and an FFT length of 512 as an example, each frequency-domain frame is a set of 257 complex values (512/2 + 1 = 257, by the symmetry of the FFT). Taking the absolute value of the complex signal and converting it to the Bark domain yields 64-dimensional data. The model input is the concatenation of the Bark values of aec, ref and mic, i.e. a 64 × 3 vector per frame.
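A sketch of this feature preparation (the 64-band Bark-style filterbank below is an assumption; the patent does not specify how the 257 frequency bins are grouped into 64 Bark values):

```python
import numpy as np

N_BINS, N_BARK = 257, 64   # FFT bins per frame, Bark dimensions

def make_filterbank():
    """Hypothetical stand-in: 64 triangular bands over the 257 bins."""
    edges = np.linspace(0, N_BINS, N_BARK + 2).astype(int)
    fb = np.zeros((N_BARK, N_BINS))
    for i in range(N_BARK):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i, lo:mid] = np.linspace(0.0, 1.0, mid - lo)
        fb[i, mid:hi] = np.linspace(1.0, 0.0, hi - mid)
    return fb

FB = make_filterbank()

def frame_features(mic, ref, aec):
    """mic/ref/aec: complex STFT frames of length 257 -> (64, 3) input."""
    bark = [FB @ np.abs(x) for x in (aec, ref, mic)]  # magnitude -> Bark
    return np.stack(bark, axis=1)                     # one 64 x 3 vector
```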
In one possible implementation, the neural network model is a convolutional recurrent network model having a single encoder layer and a single decoder layer, which makes it suitable for devices with limited processing power while still providing the audio energy analysis function. For example, the model adopts the Convolutional Recurrent Network (CRN) structure, and, in view of on-device memory and performance constraints, the encoder and decoder each have only one layer. Specifically, the encoder layer uses a one-dimensional convolution with 192 input channels, 64 output channels and a kernel size of 3, followed by BatchNorm and PReLU; the enhancement layer is an LSTM with input size 64 and a hidden layer of size 64; the decoder layer uses a two-dimensional convolution with 64 + 64 input channels, 64 output channels and a kernel size of 3, followed by BatchNorm and PReLU; and the output activation function is sigmoid. The loss function is the MSE (mean square error) between the estimated residual echo component and the true residual echo component. Adam is used as the optimizer, with a learning rate of 0.001, β1 = 0.9 and β2 = 0.999.
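A PyTorch sketch of a model matching this description (the layer sizes follow the text, while the tensor layout, the skip connection into the decoder, the padding, and the final 1 × 1 projection are assumptions):

```python
import torch
import torch.nn as nn

class TinyCRN(nn.Module):
    """Single-encoder, single-decoder CRN predicting a 64-band gain mask."""

    def __init__(self):
        super().__init__()
        # Encoder: 1-D conv, 192 in (3 x 64 Bark features), 64 out, kernel 3
        self.encoder = nn.Sequential(
            nn.Conv1d(192, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.PReLU(),
        )
        # Enhancement layer: LSTM, input size 64, hidden size 64
        self.lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
        # Decoder: 2-D conv, 64 + 64 input channels (encoder skip), 64 out
        self.decoder = nn.Sequential(
            nn.Conv2d(128, 64, kernel_size=(3, 1), padding=(1, 0)),
            nn.BatchNorm2d(64),
            nn.PReLU(),
        )
        self.project = nn.Conv2d(64, 64, kernel_size=1)  # assumed projection

    def forward(self, x):                               # x: (batch, T, 192)
        enc = self.encoder(x.transpose(1, 2))           # (batch, 64, T)
        rnn, _ = self.lstm(enc.transpose(1, 2))         # (batch, T, 64)
        h = torch.cat([enc, rnn.transpose(1, 2)], 1)    # skip -> (B, 128, T)
        g = self.decoder(h.unsqueeze(-1))               # (batch, 64, T, 1)
        return torch.sigmoid(self.project(g)).squeeze(-1)  # gains in [0, 1]

model = TinyCRN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
loss_fn = nn.MSELoss()   # MSE between estimated and true residual echo
```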
The model input dimension is 64 × 3 and the output dimension is 64. Converting the output back to the frequency domain gives the gain value G(l,k), and E_residual(l,k) = (Mic(l,k) - E_linear(l,k)) · G(l,k); that is, after the linear audio energy is removed from the total audio energy picked up by the device, the nonlinear audio energy can be determined by applying the gain value. Through this process, the model learns how to determine the nonlinear audio energy portion from the linear audio energy portion.
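Finally, a training-step sketch tying the pieces together (reusing model, optimizer and loss_fn from the sketch above; the tensor names and shapes are hypothetical):

```python
def train_step(model, optimizer, loss_fn, feats, mic_e, lin_e, res_e):
    """One optimization step.
    feats : model input, shape (batch, frames, 192)
    mic_e : Bark-domain microphone energy Mic, shape (batch, 64, frames)
    lin_e : linear echo component E_linear, same shape
    res_e : true residual component, derived from recordings made in an
            environment with no other sound source (the training-set
            construction described above)
    """
    gain = model(feats)              # G(l, k), in [0, 1] via sigmoid
    est = (mic_e - lin_e) * gain     # E_residual = (Mic - E_linear) * G
    loss = loss_fn(est, res_e)       # mean square error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```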
Based on the audio energy analysis method provided in the foregoing embodiments, the present application further provides an audio energy analysis apparatus. Fig. 2 is a block diagram of an audio energy analysis apparatus 200 according to an embodiment of the present application; the apparatus includes a first determining unit 201, an acquiring unit 202, and a second determining unit 203:
the first determining unit 201 is configured to determine an energy loss parameter of a first device, where the energy loss parameter is used to identify a loss when a second device transmits audio energy to the first device;
the acquiring unit 202 is configured to acquire a first total audio energy of the first device and an own audio energy of the second device, where the first total audio energy is audio energy received by the first device, and the own audio energy is energy generated based on audio played by the second device;
the second determining unit 203 is configured to determine a first sound source audio energy of the first device according to the energy loss parameter, the self audio energy, and the first total audio energy, where the first sound source audio energy is an audio energy obtained by the first device from a sound source.
In one possible implementation, the energy loss parameter is determined as follows:
determining, in an environment containing no sound source other than the second device, a test self audio energy of the second device when playing audio and a received audio energy received by the first device from the second device;
and determining the energy loss parameter of the first device according to the ratio of the received audio energy to the test self audio energy.
In a possible implementation manner, the apparatus further includes a third determining unit, a fourth determining unit, and a wake-up unit:
the third determining unit is configured to determine a second total audio energy received by the second device;
the fourth determining unit is configured to determine second sound source audio energy of the second device according to the self audio energy and the second total audio energy, where the second sound source audio energy and the first sound source audio energy are from a same sound source;
the awakening unit is used for awakening the voice interaction function of the second equipment if the second sound source audio energy is larger than the first sound source audio energy;
and if the second sound source audio energy is smaller than the first sound source audio energy, waking up the voice interaction function of the first device.
In a possible implementation manner, the fourth determining unit is specifically configured to:
determining nonlinear energy of the second device according to the self audio energy, wherein the nonlinear energy is generated based on vibration generated when the second device plays audio;
and removing the self audio energy and the nonlinear energy in the second total audio energy to obtain the second sound source audio energy.
In a possible implementation manner, the fourth determining unit is specifically configured to:
and determining nonlinear energy of the second equipment according to the self audio energy through a neural network model.
In one possible implementation, the neural network model is trained by:
acquiring a training sample set, wherein the training sample set comprises sample self audio energy and sample total audio energy acquired by target equipment in an environment without other sound sources;
determining sample nonlinear energy of the target device according to the sample total audio energy and the sample own audio energy;
and training an initial neural network model through the total audio energy of the sample, the self audio energy of the sample and the nonlinear energy of the sample to obtain the neural network model.
In one possible implementation, the neural network model is a convolutional recurrent network model having a single encoder layer and a single decoder layer.
The application discloses a computer-readable storage medium storing instructions that, when executed by at least one processor, perform the audio energy analysis method of any one of the above embodiments.
The application also discloses a computer device comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, perform the audio energy analysis method of any one of the above embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium may be any medium capable of storing program code, such as read-only memory (ROM), RAM, a magnetic disk, or an optical disc.
It should be noted that the embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to across embodiments, and each embodiment focuses on its differences from the others. In particular, since the apparatus and system embodiments are substantially similar to the method embodiments, their description is relatively brief; refer to the description of the method embodiments for the relevant parts. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that can readily be conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of audio energy analysis, the method comprising:
determining an energy loss parameter of a first device, the energy loss parameter being used to identify the loss incurred when a second device transmits audio energy to the first device;
acquiring first total audio energy of the first device and self audio energy of the second device, wherein the first total audio energy is audio energy received by the first device, and the self audio energy is energy generated based on audio played by the second device;
and determining first sound source audio energy of the first device according to the energy loss parameter, the self audio energy and the first total audio energy, wherein the first sound source audio energy is the audio energy acquired from a sound source by the first device.
2. The method of claim 1, wherein the energy loss parameter is determined based on:
determining, in an environment containing no sound source other than the second device, a test self audio energy of the second device when playing audio and a received audio energy received by the first device from the second device;
and determining the energy loss parameter of the first device according to the ratio of the received audio energy to the test self audio energy.
3. The method according to claim 1, wherein the method further comprises:
determining a second total audio energy received by the second device;
determining second sound source audio energy of the second device according to the self audio energy and the second total audio energy, wherein the second sound source audio energy and the first sound source audio energy are from the same sound source;
if the second sound source audio energy is larger than the first sound source audio energy, waking up a voice interaction function of the second device;
and if the second sound source audio energy is smaller than the first sound source audio energy, waking up the voice interaction function of the first device.
4. The method according to claim 3, wherein the determining second sound source audio energy of the second device according to the self audio energy and the second total audio energy comprises:
determining nonlinear energy of the second device according to the self audio energy, wherein the nonlinear energy is generated based on vibration generated when the second device plays audio;
and removing the self audio energy and the nonlinear energy in the second total audio energy to obtain the second sound source audio energy.
5. The method of claim 4, wherein said determining the nonlinear energy of the second device from the self audio energy comprises:
and determining nonlinear energy of the second equipment according to the self audio energy through a neural network model.
6. The method of claim 5, wherein the neural network model is trained by:
acquiring a training sample set, wherein the training sample set comprises sample self audio energy and sample total audio energy acquired by target equipment in an environment without other sound sources;
determining sample nonlinear energy of the target device according to the sample total audio energy and the sample own audio energy;
and training an initial neural network model through the total audio energy of the sample, the self audio energy of the sample and the nonlinear energy of the sample to obtain the neural network model.
7. The method of claim 5, wherein the neural network model is a convolutional recurrent network model having a single encoder layer and a single decoder layer.
8. An audio energy analysis device, characterized in that the device comprises a first determination unit, an acquisition unit and a second determination unit:
the first determining unit is configured to determine an energy loss parameter of a first device, where the energy loss parameter is used to identify a loss when a second device transmits audio energy to the first device;
the acquiring unit is configured to acquire a first total audio energy of the first device and an own audio energy of the second device, where the first total audio energy is audio energy received by the first device, and the own audio energy is energy generated based on audio played by the second device;
the second determining unit is configured to determine, according to the energy loss parameter, the own audio energy, and the first total audio energy, first sound source audio energy of the first device, where the first sound source audio energy is audio energy obtained by the first device from a sound source.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, perform the audio energy analysis method of any of claims 1-7.
10. A computer device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, perform the audio energy analysis method of any of claims 1-7.
CN202210697612.5A (priority date 2022-06-20, filed 2022-06-20): Audio energy analysis method and related device. Status: Pending. Publication: CN117292691A.

Priority Applications (2)

CN202210697612.5A (CN117292691A), priority date 2022-06-20, filed 2022-06-20: Audio energy analysis method and related device
PCT/CN2022/102036 (WO2023245700A1), priority date 2022-06-20, filed 2022-06-28: Audio energy analysis method and related apparatus

Publications (1)

CN117292691A, published 2023-12-26

Family ID: 89252361

Family Applications (1)

CN202210697612.5A: Audio energy analysis method and related device (Pending)

Country Status (2)

CN: CN117292691A
WO: WO2023245700A1

Also Published As

WO2023245700A1, published 2023-12-28


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination