CN117292691A - Audio energy analysis method and related device


Info

Publication number
CN117292691A
Authority
CN
China
Prior art keywords: energy, audio energy, audio, sound source, total
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210697612.5A
Other languages
Chinese (zh)
Inventor
郝斌 (Hao Bin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd and Haier Smart Home Co Ltd
Priority to CN202210697612.5A
Priority to PCT/CN2022/102036
Publication of CN117292691A
Legal status: Pending

Classifications

    • G10L15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/18: Speaker identification or verification; artificial neural networks; connectionist approaches
    • G10L17/20: Speaker identification or verification; pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/22: Speaker identification or verification; interactive procedures; man-machine interfaces
    • G10L25/21: Speech or voice analysis; the extracted parameters being power information
    • G10L25/30: Speech or voice analysis; analysis technique using neural networks


Abstract

The embodiments of the present application disclose an audio energy analysis method and a related device. When performing audio energy analysis, an energy loss parameter of a first device can first be determined; this parameter identifies the loss incurred when a second device transmits audio energy to the first device. The processing device can also acquire the first total audio energy received by the first device and the self audio energy generated by the second device playing audio. By analyzing these data together, the audio energy that the first device receives from the voice-interaction sound source can be obtained, eliminating the interference that the second device's self-played audio causes in voice-interaction recognition. On this basis, the device the user actually wants to interact with can be identified more accurately, improving the user's voice interaction experience.

Description

Audio energy analysis method and related device
Technical Field
The present disclosure relates to the field of data analysis technologies, and in particular, to an audio energy analysis method and a related device.
Background
Voice interaction is a common means of human-machine interaction. When a scene contains multiple devices that support voice interaction, the devices involved need to determine which device the user actually wants to interact with and carry out the corresponding voice interaction.
In the related art, when the devices performing voice interaction emit no sound of their own, the device the user intends to address can be judged accurately; however, when these devices are themselves producing sound, it is difficult to determine which device the user wants to interact with, and the user's voice interaction experience suffers.
Disclosure of Invention
To solve this technical problem, the present application provides an audio energy analysis method in which a processing device can analyze the audio interference produced by a device itself, so that the sound source of the voice interaction can be identified accurately and the user's voice interaction experience is improved.
The embodiment of the application discloses the following technical scheme:
in a first aspect, embodiments of the present application disclose an audio energy analysis method, the method comprising:
determining an energy loss parameter of a first device, the energy loss parameter being used to identify the loss incurred when a second device transmits audio energy to the first device;
acquiring first total audio energy of the first device and self audio energy of the second device, wherein the first total audio energy is audio energy received by the first device, and the self audio energy is energy generated based on audio played by the second device;
and determining first sound source audio energy of the first device according to the energy loss parameter, the self audio energy and the first total audio energy, wherein the first sound source audio energy is the audio energy acquired from a sound source by the first device.
In one possible implementation, the energy loss parameter is determined as follows:
determining, in an environment containing no sound source other than the second device, a test self audio energy of the second device when playing audio and a received audio energy received by the first device from the second device;
and determining the energy loss parameter of the first device according to the ratio of the received audio energy to the test self audio energy.
In one possible implementation, the method further includes:
determining a second total audio energy received by the second device;
determining second sound source audio energy of the second device according to the self audio energy and the second total audio energy, wherein the second sound source audio energy and the first sound source audio energy are from the same sound source;
and waking up a voice interaction function of the second device in response to the second sound source audio energy being greater than the first sound source audio energy.
In a possible implementation manner, the determining the second sound source audio energy of the second device according to the self audio energy and the second total audio energy includes:
determining nonlinear energy of the second device according to the self audio energy, wherein the nonlinear energy is generated based on vibration generated when the second device plays audio;
and removing the self audio energy and the nonlinear energy in the second total audio energy to obtain the second sound source audio energy.
In a possible implementation manner, the determining, according to the self audio energy, nonlinear energy of the second device includes:
and determining nonlinear energy of the second equipment according to the self audio energy through a neural network model.
In one possible implementation, the neural network model is trained by:
acquiring a training sample set, wherein the training sample set comprises sample self audio energy and sample total audio energy acquired by target equipment in an environment without other sound sources;
determining sample nonlinear energy of the target device according to the sample total audio energy and the sample own audio energy;
and training an initial neural network model through the total audio energy of the sample, the self audio energy of the sample and the nonlinear energy of the sample to obtain the neural network model.
In one possible implementation, the neural network model is a convolutional recurrent network model having a single encoder layer and a single decoder layer.
In a second aspect, an embodiment of the present application discloses an audio energy analysis apparatus, the apparatus including a first determination unit, an acquisition unit, and a second determination unit:
the first determining unit is configured to determine an energy loss parameter of a first device, where the energy loss parameter is used to identify a loss when a second device transmits audio energy to the first device;
the acquiring unit is configured to acquire a first total audio energy of the first device and an own audio energy of the second device, where the first total audio energy is audio energy received by the first device, and the own audio energy is energy generated based on audio played by the second device;
the second determining unit is configured to determine, according to the energy loss parameter, the own audio energy, and the first total audio energy, first sound source audio energy of the first device, where the first sound source audio energy is audio energy obtained by the first device from a sound source.
In one possible implementation, the energy loss parameter is determined as follows:
determining, in an environment containing no sound source other than the second device, a test self audio energy of the second device when playing audio and a received audio energy received by the first device from the second device;
and determining the energy loss parameter of the first device according to the ratio of the received audio energy to the test self audio energy.
In a possible implementation manner, the apparatus further includes a third determining unit, a fourth determining unit, and a wake-up unit:
the third determining unit is configured to determine a second total audio energy received by the second device;
the fourth determining unit is configured to determine second sound source audio energy of the second device according to the self audio energy and the second total audio energy, where the second sound source audio energy and the first sound source audio energy are from a same sound source;
the awakening unit is used for awakening the voice interaction function of the second equipment if the second sound source audio energy is larger than the first sound source audio energy;
and if the second sound source audio energy is smaller than the first sound source audio energy, waking up the voice interaction function of the first device.
In a possible implementation manner, the fourth determining unit is specifically configured to:
determining nonlinear energy of the second device according to the self audio energy, wherein the nonlinear energy is generated based on vibration generated when the second device plays audio;
and removing the self audio energy and the nonlinear energy in the second total audio energy to obtain the second sound source audio energy.
In a possible implementation manner, the fourth determining unit is specifically configured to:
and determining nonlinear energy of the second equipment according to the self audio energy through a neural network model.
In one possible implementation, the neural network model is trained by:
acquiring a training sample set, wherein the training sample set comprises sample self audio energy and sample total audio energy acquired by target equipment in an environment without other sound sources;
determining sample nonlinear energy of the target device according to the sample total audio energy and the sample own audio energy;
and training an initial neural network model through the total audio energy of the sample, the self audio energy of the sample and the nonlinear energy of the sample to obtain the neural network model.
In one possible implementation, the neural network model is a convolutional recurrent network model having a single encoder layer and a single decoder layer.
In a third aspect, embodiments of the present application disclose a computer-readable storage medium storing instructions that, when executed by at least one processor, perform the audio energy analysis method of any one of the first aspects.
In a fourth aspect, embodiments of the present application disclose a computer device comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, perform the audio energy analysis method of any one of the first aspects.
As can be seen from the above technical solution, when performing audio energy analysis the processing device may first determine the energy loss parameter of the first device, which identifies the loss incurred when the second device transmits audio energy to the first device. The processing device may also acquire the first total audio energy received by the first device and the self audio energy generated by the second device playing audio. By analyzing these data together, it obtains the audio energy the first device receives from the voice-interaction sound source, eliminating the interference of the second device's self-played audio with voice-interaction recognition. Based on this audio energy, the device the user actually wants to interact with can be identified more accurately, improving the user's voice interaction experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present application, and that a person of ordinary skill in the art may derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of an audio energy analysis method according to an embodiment of the present application;
fig. 2 is a block diagram of an audio energy analysis device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
It will be appreciated that the method may be applied to any processing device capable of audio energy analysis, for example a terminal device or a server with an audio energy analysis function. The method may be executed independently by a terminal device or a server, or it may be applied in a networked scenario in which a terminal device and a server communicate and execute the method cooperatively. The terminal device may be a computer, a mobile phone, or similar equipment. The server may be an application server or a Web server; in an actual deployment it may be an independent server or a server cluster.
Referring to fig. 1, fig. 1 is a flowchart of an audio energy analysis method according to an embodiment of the present application, where the method includes:
s101: an energy loss parameter of the first device is determined.
Here the energy loss parameter is used to identify the loss incurred when the second device transmits audio energy to the first device, i.e., how much of the audio energy emitted by the second device the first device is able to receive. For example, the energy loss parameter may be 0.9, meaning that when the second device emits audio energy, the first device receives 90% of the emitted audio energy.
S102: a first total audio energy of the first device and an own audio energy of the second device are acquired.
The first total audio energy is the audio energy received by the first device, and the self audio energy is the energy generated by the audio the second device plays. It will be appreciated that, in a scenario where a device is playing its own audio, the audio energy received by the first device comprises two parts: audio energy generated by the user performing voice interaction, and audio energy generated by the audio emitted by the second device. In a multi-device scenario, which device the user wants to interact with is typically determined from the audio energy each device receives from the user. Therefore, in the embodiments of the present application, the processing device must first remove the portion of the first total audio energy that comes from the second device before accurate voice interaction can be performed.
S103: a first source audio energy of the first device is determined based on the energy loss parameter, the own audio energy, and the first total audio energy.
The energy loss parameter described above identifies the loss incurred when the second device transmits audio energy to the first device, and the self audio energy is the audio energy generated by the second device playing audio. The processing device can therefore use the energy loss parameter and the self audio energy to determine how much audio energy the first device receives from the second device, and remove that amount from the first total audio energy to obtain the first sound source audio energy of the first device. The first sound source audio energy is the audio energy the first device acquires from the sound source, where the sound source is the one performing voice interaction and is distinct from the second device; for example, it may be a user performing voice interaction.
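As an illustration only (a minimal Python/NumPy sketch; the function and variable names are assumptions, not part of the patent), the per-bin computation of the first sound source audio energy could look like this:

```python
import numpy as np

def first_source_energy(total_first, self_second, loss_param):
    """Estimate the audio energy the first device receives from the
    voice-interaction sound source, per frequency bin.

    total_first : first total audio energy received by the first device
    self_second : self audio energy of the second device's playback
    loss_param  : energy loss parameter for transmission from the
                  second device to the first device
    """
    # Portion of the playback energy that reaches the first device
    received_from_second = loss_param * self_second
    # Remove the interference; clip at zero, since energy is non-negative
    return np.maximum(total_first - received_from_second, 0.0)

# Illustrative numbers: 90% of the playback energy reaches the first device
print(first_source_energy(np.array([1.0]), np.array([0.5]), 0.9))  # [0.55]
```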
As can be seen from the above technical solution, when performing audio energy analysis the processing device may first determine the energy loss parameter of the first device, which identifies the loss incurred when the second device transmits audio energy to the first device. The processing device may also acquire the first total audio energy received by the first device and the self audio energy generated by the second device playing audio. By analyzing these data together, it obtains the audio energy the first device receives from the voice-interaction sound source, eliminating the interference of the second device's self-played audio with voice-interaction recognition. Based on this audio energy, the device the user actually wants to interact with can be identified more accurately, improving the user's voice interaction experience.
In one possible implementation, the energy loss parameter may be determined as follows. In an environment containing no sound source other than the second device, the processing device determines the test self audio energy of the second device while it plays audio, i.e. the audio energy generated by the audio the second device plays, together with the received audio energy that the first device receives from the second device. The processing device can then determine the energy loss parameter of the first device from the ratio of the received audio energy to the test self audio energy; this ratio reflects the difference between the audio energy the first device actually receives and the audio energy the second device emits, and therefore identifies the loss incurred when the second device transmits audio energy to the first device.
In one possible implementation, to determine whether the user wants to interact with the second device or with the first device, the processing device may further determine the second total audio energy received by the second device, and then determine the second sound source audio energy of the second device from the self audio energy and the second total audio energy, that is, by removing the self audio energy portion from the second total audio energy; the second sound source audio energy and the first sound source audio energy come from the same sound source. The processing device may then compare the second sound source audio energy with the first sound source audio energy; this relation reflects, to a certain extent, which device the user intends to address. If the second sound source audio energy is greater than the first sound source audio energy, the user more likely wants to interact with the second device, and the processing device can wake up the voice interaction function of the second device; if it is smaller, the user more likely wants to interact with the first device, and the processing device can wake up the voice interaction function of the first device.
The flow is described below taking two smart speakers, A and B, as an example. The two speakers are configured with the same wake-up word and have a distributed wake-up function. When both devices receive the wake-up word, the processing device can calculate the audio energy and upload it to the cloud for a decision, and the cloud selects the device that should respond according to the scoring criterion.
First, the cloud can instruct speaker A to play white-noise audio for a period of time. The audio signals received by speakers A and B are transformed by STFT, and the average audio energy over this period is computed as X_A(k) and X_{A→B}(k) respectively. Taking a 16 kHz sampling rate and an FFT length of 512 as an example, and restricting the statistics to 200-5000 Hz (frequency bins k = 3-160), the energy loss parameter of speaker B is C_{A→B}(k) = X_{A→B}(k) / X_A(k).
Conversely, the energy loss parameter of speaker A is obtained as C_{B→A}(k) = X_{B→A}(k) / X_B(k).
Here X_A(k) and X_{A→B}(k) are computed on the devices and uploaded to the cloud; the cloud computes C_{A→B}(k) and pushes it to device A, and pushes C_{B→A}(k) in the same way.
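For illustration, this calibration could be sketched in Python as follows (a non-authoritative sketch using SciPy's STFT; the recording names and the simulated attenuation are hypothetical):

```python
import numpy as np
from scipy.signal import stft

FS, NFFT = 16000, 512      # 16 kHz sampling rate, FFT length 512
K_LO, K_HI = 3, 160        # bins covering roughly 200-5000 Hz

def band_energy(signal):
    """Time-averaged power per frequency bin, X(k)."""
    _, _, spec = stft(signal, fs=FS, nperseg=NFFT)
    return (np.abs(spec) ** 2).mean(axis=1)

def loss_parameter(played_at_a, recorded_at_b):
    """C_{A->B}(k) = X_{A->B}(k) / X_A(k) over the analysis band."""
    x_a = band_energy(played_at_a)[K_LO:K_HI + 1]
    x_ab = band_energy(recorded_at_b)[K_LO:K_HI + 1]
    return x_ab / np.maximum(x_a, 1e-12)  # guard against division by zero

# White noise played at A; B is simulated as hearing an attenuated copy
noise = np.random.randn(FS * 5)            # 5 s of white noise
c_ab = loss_parameter(noise, 0.3 * noise)  # ~0.09 in every bin (0.3 ** 2)
```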
When device A is playing audio and the user approaches A and utters the wake-up word, the audio energy received by A comprises:
Y_A(l,k) = S_A(l,k) + E_A(l,k), where S_A(l,k) denotes the sound source audio energy at speaker A, E_A(l,k) denotes speaker A's self audio energy, and Y_A(l,k) denotes the second total audio energy.
The audio energy received by speaker B at this moment comprises:
Y_{A→B}(l,k) = S_B(l,k) + E_{A→B}(l,k), where S_B(l,k) denotes the sound source audio energy at speaker B and E_{A→B}(l,k) denotes the audio energy speaker B receives from the audio emitted by speaker A.
Through AEC, the sound source audio energy of speaker A can be obtained as Y'_A(l,k) = S_A(l,k), so that E_A(l,k) = Y_A(l,k) - Y'_A(l,k). After correction with the loss parameter, E_{A→B}(l,k) = E_A(l,k) · C_{A→B}(k), and the sound source audio energy of speaker B follows as S_B(l,k) = Y_{A→B}(l,k) - E_{A→B}(l,k).
By comparing the magnitudes of S_B(l,k) and S_A(l,k), the processing device can determine which speaker the user actually wants to wake up.
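A sketch of that decision, continuing the example (a hypothetical function; the inputs are the per-frame, per-bin energies defined above):

```python
import numpy as np

def decide_wakeup(y_a, y_a_aec, y_ab, c_ab, band=slice(3, 161)):
    """Choose the speaker to wake, given:
    y_a     : Y_A(l,k), total energy at speaker A (the one playing audio)
    y_a_aec : Y'_A(l,k) = S_A(l,k), A's energy after echo cancellation
    y_ab    : Y_{A->B}(l,k), total energy at speaker B
    c_ab    : C_{A->B}(k), calibrated energy loss parameter
    Arrays are shaped (frames, bins); c_ab is shaped (bins,).
    """
    s_a = y_a_aec                        # A's sound source audio energy
    e_a = y_a - y_a_aec                  # A's self audio energy
    e_ab = e_a * c_ab                    # A's playback as received at B
    s_b = np.maximum(y_ab - e_ab, 0.0)   # B's sound source audio energy
    # Compare total source energy within the analysis band
    return "B" if s_b[:, band].sum() > s_a[:, band].sum() else "A"
```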
It will be appreciated that, owing to vibration and similar effects produced during playback, the audio energy a device receives from its own playback may not equal the energy of the audio being played. Therefore, in one possible implementation, to obtain a more accurate analysis, when analyzing the sound source audio energy of the second device the processing device may determine the nonlinear energy of the second device from its self audio energy, the nonlinear energy being generated by the vibration produced when the second device plays audio. The processing device may then remove both the self audio energy and the nonlinear energy from the second total audio energy to obtain the second sound source audio energy, yielding a more accurate estimate of the sound source energy.
In one possible implementation, the processing device may determine the nonlinear energy of the second device from its self audio energy by means of a neural network model. The neural network model may be trained as follows:
first, the processing device may acquire a training sample set that includes the sample's own audio energy and the sample's total audio energy that the target device acquired in an environment without other sound sources. Since the sample self audio energy is determined directly based on the audio information played by the target device itself, the sample self audio energy is the linear audio energy of the target device; the sample total audio energy is the total audio energy received from the target device, i.e. comprises linear audio energy and nonlinear audio energy, so that the processing device can determine the sample nonlinear energy of the target device according to the sample total audio energy and the sample own audio energy, and then train an initial neural network model through the sample total audio energy, the sample own audio energy and the sample nonlinear energy to obtain the neural network model, and in the training process, the neural network model can learn the association relation between the nonlinear audio energy part and the linear audio energy part, so as to learn how to determine the nonlinear audio energy part based on the linear audio energy part.
For example, the microphone signal Mic (i.e. the sample total audio energy) can be echo-cancelled against the playback reference signal Ref (i.e. the sample self audio energy) to obtain Aec(l,k) (the sample nonlinear energy). Linear methods such as NLMS and RLS remove the linear audio energy component E_linear(l,k), while a portion of the nonlinear audio energy, the residual component E_residual(l,k), remains.
The speech signals mic, ref and aec are complex-valued frequency-domain signals obtained by STFT. Taking a 16 kHz sampling rate, a 16 ms frame length and an FFT length of 512 as an example, each frequency-domain frame is a set of 257 complex values (512/2 + 1 = 257, by the symmetry of the FFT). Taking the absolute value of the complex signal and converting it to the Bark domain yields 64-dimensional data. The model input is the concatenation of the Bark values of aec, ref and mic, i.e. a 64 × 3 vector per frame.
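A sketch of this feature preparation (the 64-band Bark-style filterbank below is an assumption; the patent does not specify how the 257 frequency bins are grouped into 64 Bark values):

```python
import numpy as np

N_BINS, N_BARK = 257, 64   # FFT bins per frame, Bark dimensions

def make_filterbank():
    """Hypothetical stand-in: 64 triangular bands over the 257 bins."""
    edges = np.linspace(0, N_BINS, N_BARK + 2).astype(int)
    fb = np.zeros((N_BARK, N_BINS))
    for i in range(N_BARK):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i, lo:mid] = np.linspace(0.0, 1.0, mid - lo)
        fb[i, mid:hi] = np.linspace(1.0, 0.0, hi - mid)
    return fb

FB = make_filterbank()

def frame_features(mic, ref, aec):
    """mic/ref/aec: complex STFT frames of length 257 -> (64, 3) input."""
    bark = [FB @ np.abs(x) for x in (aec, ref, mic)]  # magnitude -> Bark
    return np.stack(bark, axis=1)                     # one 64 x 3 vector
```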
In one possible implementation, the neural network model is a convolutional recurrent network model having a single encoder layer and a single decoder layer, which makes it suitable for devices with limited processing power while still providing the audio energy analysis function. For example, the model adopts the Convolutional Recurrent Network (CRN) structure, and, in view of on-device memory and performance constraints, the encoder and decoder each have only one layer. Specifically, the encoder layer uses a one-dimensional convolution with 192 input channels, 64 output channels and a kernel size of 3, followed by BatchNorm and PReLU; the enhancement layer is an LSTM with input size 64 and a hidden layer of size 64; the decoder layer uses a two-dimensional convolution with 64 + 64 input channels, 64 output channels and a kernel size of 3, followed by BatchNorm and PReLU; and the output activation function is sigmoid. The loss function is the MSE (mean square error) between the estimated residual echo component and the true residual echo component. Adam is used as the optimizer, with a learning rate of 0.001, β1 = 0.9 and β2 = 0.999.
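A PyTorch sketch of a model matching this description (the layer sizes follow the text, while the tensor layout, the skip connection into the decoder, the padding, and the final 1 × 1 projection are assumptions):

```python
import torch
import torch.nn as nn

class TinyCRN(nn.Module):
    """Single-encoder, single-decoder CRN predicting a 64-band gain mask."""

    def __init__(self):
        super().__init__()
        # Encoder: 1-D conv, 192 in (3 x 64 Bark features), 64 out, kernel 3
        self.encoder = nn.Sequential(
            nn.Conv1d(192, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.PReLU(),
        )
        # Enhancement layer: LSTM, input size 64, hidden size 64
        self.lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
        # Decoder: 2-D conv, 64 + 64 input channels (encoder skip), 64 out
        self.decoder = nn.Sequential(
            nn.Conv2d(128, 64, kernel_size=(3, 1), padding=(1, 0)),
            nn.BatchNorm2d(64),
            nn.PReLU(),
        )
        self.project = nn.Conv2d(64, 64, kernel_size=1)  # assumed projection

    def forward(self, x):                               # x: (batch, T, 192)
        enc = self.encoder(x.transpose(1, 2))           # (batch, 64, T)
        rnn, _ = self.lstm(enc.transpose(1, 2))         # (batch, T, 64)
        h = torch.cat([enc, rnn.transpose(1, 2)], 1)    # skip -> (B, 128, T)
        g = self.decoder(h.unsqueeze(-1))               # (batch, 64, T, 1)
        return torch.sigmoid(self.project(g)).squeeze(-1)  # gains in [0, 1]

model = TinyCRN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
loss_fn = nn.MSELoss()   # MSE between estimated and true residual echo
```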
The model input dimension is 64 × 3 and the output dimension is 64. Converting the output back to the frequency domain gives the gain value G(l,k), and E_residual(l,k) = (Mic(l,k) - E_linear(l,k)) · G(l,k); that is, after the linear audio energy is removed from the total audio energy picked up by the device, the nonlinear audio energy can be determined by applying the gain value. Through this process, the model learns how to determine the nonlinear audio energy portion from the linear audio energy portion.
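Finally, a training-step sketch tying the pieces together (reusing model, optimizer and loss_fn from the sketch above; the tensor names and shapes are hypothetical):

```python
def train_step(model, optimizer, loss_fn, feats, mic_e, lin_e, res_e):
    """One optimization step.
    feats : model input, shape (batch, frames, 192)
    mic_e : Bark-domain microphone energy Mic, shape (batch, 64, frames)
    lin_e : linear echo component E_linear, same shape
    res_e : true residual component, derived from recordings made in an
            environment with no other sound source (the training-set
            construction described above)
    """
    gain = model(feats)              # G(l, k), in [0, 1] via sigmoid
    est = (mic_e - lin_e) * gain     # E_residual = (Mic - E_linear) * G
    loss = loss_fn(est, res_e)       # mean square error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```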
Based on the audio energy analysis method provided in the foregoing embodiments, the present application further provides an audio energy analysis apparatus. Fig. 2 is a block diagram of an audio energy analysis apparatus 200 according to an embodiment of the present application; the apparatus includes a first determining unit 201, an acquiring unit 202, and a second determining unit 203:
the first determining unit 201 is configured to determine an energy loss parameter of a first device, where the energy loss parameter is used to identify a loss when a second device transmits audio energy to the first device;
the acquiring unit 202 is configured to acquire a first total audio energy of the first device and an own audio energy of the second device, where the first total audio energy is audio energy received by the first device, and the own audio energy is energy generated based on audio played by the second device;
the second determining unit 203 is configured to determine a first sound source audio energy of the first device according to the energy loss parameter, the self audio energy, and the first total audio energy, where the first sound source audio energy is an audio energy obtained by the first device from a sound source.
In one possible implementation, the energy loss parameter is determined as follows:
determining, in an environment containing no sound source other than the second device, a test self audio energy of the second device when playing audio and a received audio energy received by the first device from the second device;
and determining the energy loss parameter of the first device according to the ratio of the received audio energy to the test self audio energy.
In a possible implementation manner, the apparatus further includes a third determining unit, a fourth determining unit, and a wake-up unit:
the third determining unit is configured to determine a second total audio energy received by the second device;
the fourth determining unit is configured to determine second sound source audio energy of the second device according to the self audio energy and the second total audio energy, where the second sound source audio energy and the first sound source audio energy are from a same sound source;
the awakening unit is used for awakening the voice interaction function of the second equipment if the second sound source audio energy is larger than the first sound source audio energy;
and if the second sound source audio energy is smaller than the first sound source audio energy, waking up the voice interaction function of the first device.
In a possible implementation manner, the fourth determining unit is specifically configured to:
determining nonlinear energy of the second device according to the self audio energy, wherein the nonlinear energy is generated based on vibration generated when the second device plays audio;
and removing the self audio energy and the nonlinear energy in the second total audio energy to obtain the second sound source audio energy.
In a possible implementation manner, the fourth determining unit is specifically configured to:
and determining nonlinear energy of the second equipment according to the self audio energy through a neural network model.
In one possible implementation, the neural network model is trained by:
acquiring a training sample set, wherein the training sample set comprises sample self audio energy and sample total audio energy acquired by target equipment in an environment without other sound sources;
determining sample nonlinear energy of the target device according to the sample total audio energy and the sample own audio energy;
and training an initial neural network model through the total audio energy of the sample, the self audio energy of the sample and the nonlinear energy of the sample to obtain the neural network model.
In one possible implementation, the neural network model is a convolutional recurrent network model having a single encoder layer and a single decoder layer.
The application discloses a computer-readable storage medium storing instructions that, when executed by at least one processor, perform the audio energy analysis method of any one of the above embodiments.
The application also discloses a computer device comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, perform the audio energy analysis method of any one of the above embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium may be any medium capable of storing program code, such as read-only memory (ROM), RAM, a magnetic disk, or an optical disc.
It should be noted that the embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to across embodiments, and each embodiment focuses on its differences from the others. In particular, since the apparatus and system embodiments are substantially similar to the method embodiments, their description is relatively brief; refer to the description of the method embodiments for the relevant parts. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that can readily be conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of audio energy analysis, the method comprising:
determining an energy loss parameter of a first device, the energy loss parameter being used to identify the loss incurred when a second device transmits audio energy to the first device;
acquiring first total audio energy of the first device and self audio energy of the second device, wherein the first total audio energy is audio energy received by the first device, and the self audio energy is energy generated based on audio played by the second device;
and determining first sound source audio energy of the first device according to the energy loss parameter, the self audio energy and the first total audio energy, wherein the first sound source audio energy is the audio energy acquired from a sound source by the first device.
2. The method of claim 1, wherein the energy loss parameter is determined based on:
determining, in an environment containing no sound source other than the second device, a test self audio energy of the second device when playing audio and a received audio energy received by the first device from the second device;
and determining the energy loss parameter of the first device according to the ratio of the received audio energy to the test self audio energy.
3. The method according to claim 1, wherein the method further comprises:
determining a second total audio energy received by the second device;
determining second sound source audio energy of the second device according to the self audio energy and the second total audio energy, wherein the second sound source audio energy and the first sound source audio energy are from the same sound source;
if the second sound source audio energy is larger than the first sound source audio energy, waking up a voice interaction function of the second device;
and if the second sound source audio energy is smaller than the first sound source audio energy, waking up the voice interaction function of the first device.
4. The method according to claim 3, wherein the determining second sound source audio energy of the second device according to the self audio energy and the second total audio energy comprises:
determining nonlinear energy of the second device according to the self audio energy, wherein the nonlinear energy is generated based on vibration generated when the second device plays audio;
and removing the self audio energy and the nonlinear energy in the second total audio energy to obtain the second sound source audio energy.
5. The method of claim 4, wherein said determining the nonlinear energy of the second device from the self audio energy comprises:
and determining nonlinear energy of the second equipment according to the self audio energy through a neural network model.
6. The method of claim 5, wherein the neural network model is trained by:
acquiring a training sample set, wherein the training sample set comprises sample self audio energy and sample total audio energy acquired by target equipment in an environment without other sound sources;
determining sample nonlinear energy of the target device according to the sample total audio energy and the sample own audio energy;
and training an initial neural network model through the total audio energy of the sample, the self audio energy of the sample and the nonlinear energy of the sample to obtain the neural network model.
7. The method of claim 5, wherein the neural network model is a convolutional recurrent network model having a single encoder layer and a single decoder layer.
8. An audio energy analysis device, characterized in that the device comprises a first determination unit, an acquisition unit and a second determination unit:
the first determining unit is configured to determine an energy loss parameter of a first device, where the energy loss parameter is used to identify a loss when a second device transmits audio energy to the first device;
the acquiring unit is configured to acquire a first total audio energy of the first device and an own audio energy of the second device, where the first total audio energy is audio energy received by the first device, and the own audio energy is energy generated based on audio played by the second device;
the second determining unit is configured to determine, according to the energy loss parameter, the own audio energy, and the first total audio energy, first sound source audio energy of the first device, where the first sound source audio energy is audio energy obtained by the first device from a sound source.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, perform the audio energy analysis method of any of claims 1-7.
10. A computer device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, perform the audio energy analysis method of any of claims 1-7.
CN202210697612.5A (priority date 2022-06-20, filed 2022-06-20): Audio energy analysis method and related device. Status: Pending. Publication: CN117292691A.

Priority Applications (2)

CN202210697612.5A (CN117292691A), priority date 2022-06-20, filed 2022-06-20: Audio energy analysis method and related device
PCT/CN2022/102036 (WO2023245700A1), priority date 2022-06-20, filed 2022-06-28: Audio energy analysis method and related apparatus

Publications (1)

CN117292691A, published 2023-12-26

Family ID: 89252361

Family Applications (1)

CN202210697612.5A: Audio energy analysis method and related device (Pending)

Country Status (2)

CN: CN117292691A
WO: WO2023245700A1

Also Published As

WO2023245700A1, published 2023-12-28


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination