CN115472175A - Echo cancellation method and device for audio resource, storage medium and electronic device - Google Patents

Echo cancellation method and device for audio resource, storage medium and electronic device

Info

Publication number
CN115472175A
Authority
CN
China
Prior art keywords
audio
target
audio resource
generation network
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211064742.1A
Other languages
Chinese (zh)
Inventor
刘溪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Haier Uplus Intelligent Technology Beijing Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Haier Uplus Intelligent Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd, Haier Smart Home Co Ltd, Haier Uplus Intelligent Technology Beijing Co Ltd
Priority to CN202211064742.1A
Publication of CN115472175A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The application discloses an echo cancellation method and device for audio resources, a storage medium and an electronic device, and relates to the technical field of smart homes. The method comprises the following steps: acquiring a first audio resource and a second audio resource, wherein the first audio resource is an audio resource acquired by a target device in the process of playing the second audio resource; inputting the first audio resource and the second audio resource into a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network; and filtering the first audio resource by using the target filtering parameter and a target filter to obtain a target audio resource.

Description

Echo cancellation method and device for audio resource, storage medium and electronic device
Technical Field
The application relates to the technical field of smart homes, in particular to an echo cancellation method and device of audio resources, a storage medium and an electronic device.
Background
With the development of science and technology, audio acquisition technology has come into wide use, for example in internet telephony, video conferencing and human-machine voice interaction. These fields place certain requirements on the quality of the acquired audio resources, and interfering audio (such as echo, noise and reverberation) is an important factor that degrades that quality. Echo, for example, is formed when the audio resource played by a device is picked up again by the same device, which seriously affects the quality of the acquired audio resource. How to eliminate the interfering audio in acquired audio resources has therefore become a research focus for those skilled in the art.
At present, a common way to eliminate interfering audio is to construct an audio elimination network model: an initial audio elimination network model is trained to obtain a trained audio elimination network model, and the audio resource acquired by the device and the audio resource played by the device at the same time are then input into the trained model to obtain a clean audio resource from which the interfering audio has been eliminated. However, although this approach eliminates the interfering audio, the model is highly complex, has a huge number of parameters and consumes substantial computing resources, so such methods cannot run on platforms with limited computing resources.
No effective solution has yet been proposed for problems in the related art such as the low efficiency of echo cancellation for audio resources.
Disclosure of Invention
The embodiment of the application provides an echo cancellation method and device for audio resources, a storage medium and an electronic device, so as to at least solve the problems of low echo cancellation efficiency and the like for the audio resources in the related art.
According to an embodiment of the present application, there is provided an echo cancellation method for an audio resource, including: acquiring a first audio resource and a second audio resource, wherein the first audio resource is an audio resource acquired by target equipment in the process of playing the second audio resource; inputting the first audio resource and the second audio resource into a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network, wherein the target parameter generation network is obtained by training an echo cancellation model by using a training sample labeled with echo cancellation audio, the echo cancellation model comprises an initial parameter generation network and a target filter which are sequentially connected, the training sample comprises a first audio sample and a second audio sample, the first audio sample is an audio resource collected by the target device in a process of playing the second audio sample, and the echo cancellation audio is an audio resource in which the second audio sample is eliminated from the first audio sample; and filtering the first audio resource by using the target filtering parameter and the target filter to obtain a target audio resource.
Optionally, the inputting the first audio resource and the second audio resource to a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network includes: inputting the first audio resource and the second audio resource to the target parameter generation network; and acquiring a first filtering parameter output by a first branch of the target parameter generation network and a second filtering parameter output by a second branch of the target parameter generation network, wherein the first filtering parameter is an operation parameter of the target filter, and the second filtering parameter is used for representing the characteristic of an echo component caused by the second audio resource and carried in the first audio resource.
Optionally, before the first audio resource and the second audio resource are input to a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network, the method further includes: inputting the training sample into the initial parameter generation network to obtain a filtering parameter result output by the initial parameter generation network; filtering the first audio sample by using the filtering parameter result and the target filter to obtain an audio resource result; and adjusting the network parameters of the initial parameter generation network according to the loss value between the audio resource result and the echo cancellation audio until the network converges to obtain the target parameter generation network.
Optionally, the inputting the training sample into the initial parameter generation network to obtain a filtering parameter result output by the initial parameter generation network includes: inputting the training samples into the initial parameter generation network; and acquiring a first result output by a first branch of the initial parameter generation network and a second result output by a second branch of the initial parameter generation network as the filtering parameter result, wherein the first branch of the initial parameter generation network is used for estimating the noise variance of the target filter, and the second branch of the initial parameter generation network is used for estimating an echo component carried in the first audio sample and caused by the second audio sample.
Optionally, the filtering the first audio sample by using the filtering parameter result and the target filter to obtain an audio resource result includes: extracting a reference echo component carried in the first audio sample using the second result, wherein the reference echo component is caused by the second audio sample; and filtering the reference echo component by using the target filter with the first result as an operation parameter to obtain the audio resource result.
Optionally, the filtering the first audio resource by using the target filtering parameter and the target filter to obtain a target audio resource includes: extracting a target audio feature in the first audio resource by using a second filtering parameter included in the target filtering parameter, wherein the second filtering parameter is used for characterizing the echo component caused by the second audio resource and carried in the first audio resource; and inputting the target audio feature into the target filter, which takes a first filtering parameter as an operation parameter, to obtain the target audio resource.
Optionally, the extracting the target audio feature in the first audio resource by using the second filtering parameter included in the target filtering parameter includes: calculating a product of an image mask and the spectrogram of the first audio resource to obtain a characteristic spectrogram, wherein the second filtering parameter comprises the image mask, and the image mask is used for representing a region of the spectrogram, which comprises the target audio characteristic; and determining the spectral features recorded by the feature spectrogram as the target audio features.
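The mask-and-spectrogram product described above can be illustrated with a minimal sketch. It assumes the image mask is a real-valued array in [0, 1] with the same shape as the spectrogram and that the product is taken element by element; the function name `apply_mask` and the toy values are hypothetical and not taken from the application.

```python
import numpy as np

def apply_mask(spectrogram, mask):
    """Element-wise product of a mask with a spectrogram of the same shape.

    Assumption: mask values lie in [0, 1]; 1 keeps a time-frequency cell,
    0 discards it, and intermediate values attenuate it.
    """
    if spectrogram.shape != mask.shape:
        raise ValueError("mask and spectrogram must have the same shape")
    return mask * spectrogram

# Toy 2x3 magnitude spectrogram (rows: frames, columns: frequency bins).
spectrogram = np.array([[1.0, 2.0, 3.0],
                        [4.0, 5.0, 6.0]])
mask = np.array([[1.0, 0.0, 0.5],
                 [0.0, 1.0, 0.0]])

feature_spectrogram = apply_mask(spectrogram, mask)
```

In this toy case the feature spectrogram keeps only the cells the mask marks as carrying the target audio feature; the remaining cells are zeroed or attenuated.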
According to another embodiment of the present application, there is also provided an echo cancellation device for an audio resource, including: an acquisition module, configured to acquire a first audio resource and a second audio resource, wherein the first audio resource is an audio resource acquired by a target device in the process of playing the second audio resource; a first input module, configured to input the first audio resource and the second audio resource to a target parameter generation network, so as to obtain a target filtering parameter output by the target parameter generation network, where the target parameter generation network is obtained by training an echo cancellation model using a training sample labeled with an echo cancellation audio, the echo cancellation model includes an initial parameter generation network and a target filter that are sequentially connected, the training sample includes a first audio sample and a second audio sample, the first audio sample is an audio resource acquired by the target device in a process of playing the second audio sample, and the echo cancellation audio is an audio resource in which the second audio sample is cancelled from the first audio sample; and a first filtering module, configured to filter the first audio resource by using the target filtering parameter and the target filter to obtain a target audio resource.
According to still another aspect of the embodiments of the present application, there is further provided a computer-readable storage medium, in which a computer program is stored, where the computer program is configured to execute the above-mentioned echo cancellation method for an audio resource when the computer program is executed.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the method for canceling an echo of an audio resource through the computer program.
In the embodiments of the present application, a first audio resource and a second audio resource are acquired, wherein the first audio resource is an audio resource collected by a target device in the process of playing the second audio resource; the first audio resource and the second audio resource are input into a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network, wherein the target parameter generation network is obtained by training an echo cancellation model using a training sample labeled with echo cancellation audio, the echo cancellation model comprises an initial parameter generation network and a target filter which are sequentially connected, the training sample comprises a first audio sample and a second audio sample, the first audio sample is an audio resource collected by the target device in the process of playing the second audio sample, and the echo cancellation audio is an audio resource in which the second audio sample is eliminated from the first audio sample; and the first audio resource is filtered by using the target filtering parameter and the target filter to obtain a target audio resource. That is, the echo cancellation model comprises an initial parameter generation network and a target filter connected in sequence, and the echo cancellation model is trained using the training sample labeled with echo cancellation audio to obtain the trained target parameter generation network. When echo cancellation is performed, the collected first audio resource and the played second audio resource are input into the target parameter generation network, which outputs the target filtering parameter used by the target filter, and the first audio resource is filtered using the target filtering parameter and the target filter to obtain the echo-cancelled target audio resource. In other words, the training sample is used to train the initial parameter generation network in the echo cancellation model so that the resulting network outputs the target filtering parameter used by the target filter, rather than having the trained network model directly output the filtered target audio resource. This reduces the complexity of the model and the amount of data processed during model construction, and allows echo cancellation of the first audio resource to be completed with limited computing resources. By adopting this technical solution, the problems in the related art such as the low efficiency of echo cancellation for audio resources are solved, and the technical effect of improving the efficiency of echo cancellation for audio resources is achieved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. Those skilled in the art can obviously obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of an echo cancellation method for an audio resource according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for echo cancellation of an audio resource according to an embodiment of the application;
FIG. 3 is a flow chart of an optional echo cancellation process according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an optional initial parameter generation network according to an embodiment of the present application;
FIG. 5 is a block diagram of an echo cancellation device for an audio resource according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of an embodiment of the present application, there is provided an echo cancellation method for an audio resource. The method is widely applicable to whole-house intelligent digital control scenarios such as the smart home (Smart Home), smart home device ecosystems and intelligent house (Intelligent House) ecosystems. Optionally, in this embodiment, FIG. 1 is a schematic diagram of a hardware environment of an echo cancellation method for an audio resource according to an embodiment of the present application, and the method may be applied to the hardware environment formed by the terminal device 102 and the server 104 shown in FIG. 1. As shown in FIG. 1, the server 104 is connected to the terminal device 102 through a network and may be configured to provide a service (e.g., an application service) for the terminal or for a client installed on the terminal. A database may be set up on the server or independently of the server to provide a data storage service for the server 104, and cloud computing and/or edge computing services may be configured on the server or independently of the server to provide a data operation service for the server 104.
The network may include, but is not limited to, at least one of: a wired network, a wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, a local area network. The wireless network may include, but is not limited to, at least one of: WIFI (Wireless Fidelity), Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, a smart air conditioner, a smart range hood, a smart refrigerator, a smart oven, a smart stove, a smart washing machine, a smart water heater, smart washing equipment, a smart dishwasher, a smart projection device, a smart TV, a smart clothes hanger, a smart curtain, smart audio-visual equipment, a smart socket, a smart stereo, a smart speaker, smart fresh air equipment, smart kitchen and bathroom equipment, smart bathroom equipment, a smart sweeping robot, a smart window-cleaning robot, a smart mopping robot, smart air purification equipment, a smart steam oven, a smart microwave oven, a smart kitchen water heater, a smart purifier, a smart water dispenser, a smart door lock, and the like.
In this embodiment, an echo cancellation method for an audio resource is provided, which is applied to the above terminal device. FIG. 2 is a flowchart of an echo cancellation method for an audio resource according to an embodiment of the present application. As shown in FIG. 2, the flow includes the following steps:
Step S202, acquiring a first audio resource and a second audio resource, wherein the first audio resource is an audio resource collected by a target device in the process of playing the second audio resource;
Step S204, inputting the first audio resource and the second audio resource into a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network, where the target parameter generation network is obtained by training an echo cancellation model using a training sample labeled with an echo cancellation audio, the echo cancellation model includes an initial parameter generation network and a target filter that are sequentially connected, the training sample includes a first audio sample and a second audio sample, the first audio sample is an audio resource acquired by the target device in a process of playing the second audio sample, and the echo cancellation audio is an audio resource in which the second audio sample is cancelled from the first audio sample;
Step S206, filtering the first audio resource by using the target filtering parameter and the target filter to obtain a target audio resource.
Through the above steps, the echo cancellation model comprises an initial parameter generation network and a target filter which are sequentially connected, and the echo cancellation model is trained using a training sample labeled with echo cancellation audio to obtain the trained target parameter generation network. When echo cancellation is performed, the acquired first audio resource and the played second audio resource are input into the target parameter generation network, which outputs the target filtering parameter used by the target filter; the first audio resource is then filtered using the target filtering parameter and the target filter to obtain the echo-cancelled target audio resource. In other words, the training sample is used to train the initial parameter generation network in the echo cancellation model so that the resulting network outputs the target filtering parameter used by the target filter, rather than having the trained network model directly output the filtered target audio resource. This reduces the complexity of the model and the amount of data processed during model construction, and allows echo cancellation of the first audio resource to be completed with limited computing resources. By adopting this technical solution, the problems in the related art such as the low efficiency of echo cancellation for audio resources are solved, and the technical effect of improving the efficiency of echo cancellation for audio resources is achieved.
In the technical solution provided in step S202, the second audio resource is audio played at the target device.
Optionally, in this embodiment, the audio resource is data representing audio content, and the audio resource may include, but is not limited to, audio itself, a spectrogram of the audio, and the like.
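As an illustration of how an audio resource can be represented as a spectrogram, the following minimal sketch frames a signal, applies a window and takes a short-time Fourier transform with NumPy. The frame length, hop size, Hann window and 16 kHz sampling rate are assumptions chosen for the example; the application does not specify them.

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Frame, window and FFT a 1-D signal.

    Returns a complex array of shape (frames, frame_len // 2 + 1).
    Assumes len(signal) >= frame_len; trailing samples that do not
    fill a whole frame are dropped.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)

# One second of a 1 kHz tone sampled at 16 kHz (assumed values).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft(tone)
magnitude = np.abs(spec)                      # magnitude spectrogram
peak_bin = int(np.argmax(magnitude.mean(axis=0)))
peak_hz = peak_bin * sample_rate / 512        # bin index -> frequency
```

With these settings the bin spacing is 31.25 Hz, so the energy of the 1 kHz tone concentrates in a single frequency bin of the spectrogram.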
Optionally, in this embodiment, the target device is a device having an audio playing function and an audio collecting function, for example, the target device may be a phone, a smart television, a smart speaker, and the like.
In the technical solution provided in step S204, the target filtering parameter is a parameter used by the target filter in the process of filtering the echo in the first audio resource by using the target filter, and may include, but is not limited to, a parameter for indicating an operation state of the filter and a parameter for characterizing a component of the echo in the first audio.
Optionally, in this embodiment, the training of the echo cancellation model using the training sample may proceed, but is not limited to proceeding, as follows. The first audio sample and the second audio sample are input into the initial parameter generation network in the echo cancellation model to obtain a filtering parameter result output by the initial parameter generation network, where the filtering parameter result is used to filter out the audio of the echo component caused by the second audio sample. The filtering parameter result is then input into the target filter, and the first audio sample is filtered using the target filter and the filtering parameter result to obtain an audio resource result in which the second audio sample has been removed from the first audio sample. The network parameters of the initial parameter generation network are then adjusted according to the audio resource result output by the target filter and the echo cancellation audio. For example, a loss value of the initial parameter generation network is calculated from the audio resource result and the echo cancellation audio, a gradient update is performed on the parameters of the initial parameter generation network according to the loss value, and updating of the network parameters stops when the loss value is smaller than a target threshold, thereby obtaining the target parameter generation network.
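The training procedure just described (network produces filter parameters, a fixed filter applies them, the loss is taken on the filter output) can be sketched in NumPy under strong simplifying assumptions. Here the "network" is a single linear layer `W`, its input is a normalised cross-correlation feature, the filter is a 3-tap FIR echo subtractor, the loss is mean squared error, and the data are synthetic. All of these choices, and every name in the code, are hypothetical illustrations rather than the application's architecture.

```python
import numpy as np

TAPS, N = 3, 400
rng = np.random.default_rng(0)

def delayed_copies(ref):
    """Columns hold ref delayed by 0..TAPS-1 samples, so X @ h equals np.convolve(ref, h)[:N]."""
    return np.stack(
        [np.concatenate([np.zeros(k), ref[: N - k]]) for k in range(TAPS)], axis=1
    )

def make_sample(path):
    """Return (mic, ref, label); label is the mic signal with the echo removed."""
    ref = rng.standard_normal(N)                 # played audio (second audio sample)
    label = 0.1 * rng.standard_normal(N)         # echo cancellation audio (training label)
    mic = label + delayed_copies(ref) @ path     # collected audio (first audio sample)
    return mic, ref, label

def forward(W, mic, ref):
    """Stand-in network -> filter parameters -> fixed filter output."""
    X = delayed_copies(ref)
    feat = X.T @ mic / np.dot(ref, ref)          # network input: normalised cross-correlation
    taps = W @ feat                              # filtering parameter result
    return mic - X @ taps, X, feat               # filter: subtract the estimated echo

def loss_of(W, sample):
    mic, ref, label = sample
    out, _, _ = forward(W, mic, ref)
    return float(np.mean((out - label) ** 2))

W = np.zeros((TAPS, TAPS))                       # untrained network parameters
held_out = make_sample(np.array([0.6, 0.3, 0.1]))
loss_before = loss_of(W, held_out)

for step in range(300):
    mic, ref, label = make_sample(0.5 * rng.standard_normal(TAPS))
    out, X, feat = forward(W, mic, ref)
    err = out - label
    grad_taps = -2.0 * X.T @ err / N             # d(loss)/d(taps), propagated through the filter
    W -= 0.1 * np.outer(grad_taps, feat)         # gradient step on the network only

loss_after = loss_of(W, held_out)
```

Only `W` is updated; the filter itself has no trainable parameters, which mirrors the idea that the network learns to produce parameters rather than audio. The sketch runs a fixed number of steps for simplicity, whereas the text describes stopping once the loss falls below a target threshold.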
In the technical solution provided in step S206, the target filter may, but is not limited to, simulate an echo path of a real scene according to the target filtering parameter, estimate the audio of the echo component caused by the second audio resource by combining the simulated echo path with the second audio resource, and subtract the audio of the echo component from the first audio resource, thereby achieving the purpose of filtering the first audio resource.
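The estimate-and-subtract operation can be sketched as follows, assuming for illustration that the simulated echo path is a short FIR impulse response applied to the played signal in the time domain. The function name `cancel_echo`, the tap values and the synthetic signals are hypothetical.

```python
import numpy as np

def cancel_echo(mic, ref, echo_path):
    """Estimate the echo as ref convolved with echo_path and subtract it from mic."""
    echo_estimate = np.convolve(ref, echo_path)[: len(mic)]
    return mic - echo_estimate

rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)                 # second audio resource (played)
near = 0.1 * rng.standard_normal(1000)          # sound we want to keep
true_path = np.array([0.6, 0.3, 0.1])           # assumed room echo path
mic = near + np.convolve(ref, true_path)[:1000] # first audio resource (collected)

# If the simulated echo path matches the real one, only the near-end sound remains.
cleaned = cancel_echo(mic, ref, true_path)
```

In practice the echo path is not known in advance, which is why the filter has to estimate it from the parameters it is given.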
Optionally, in this embodiment, the target filter may be, but is not limited to, an adaptive filter such as a Kalman filter or a Wiener filter.
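To show why a noise variance can act as an operation parameter of such a filter, the following sketch runs a scalar Kalman filter that tracks a single echo-path gain. It assumes a one-tap echo path modelled as a slow random walk, with `noise_var` supplied from outside (in the application it would come from the parameter generation network; here it is simply set by hand). The function name and all numeric values are hypothetical.

```python
import numpy as np

def kalman_echo_gain(mic, ref, noise_var, process_var=1e-6):
    """Scalar Kalman filter tracking w in the model mic[n] = w * ref[n] + noise.

    noise_var is the observation-noise variance (the operating parameter);
    process_var lets the gain drift slowly. Returns the final gain estimate
    and the a-priori residual mic[n] - w * ref[n] for every sample.
    """
    w, p = 0.0, 1.0                       # gain estimate and its variance
    residual = np.empty_like(mic)
    for n in range(len(mic)):
        p = p + process_var               # predict step (random-walk state)
        x = ref[n]
        err = mic[n] - w * x              # innovation: echo-cancelled sample
        k = p * x / (x * x * p + noise_var)   # Kalman gain
        w = w + k * err                   # update gain estimate
        p = (1.0 - k * x) * p             # update estimate variance
        residual[n] = err
    return w, residual

rng = np.random.default_rng(0)
ref = rng.standard_normal(2000)
mic = 0.5 * ref + 0.01 * rng.standard_normal(2000)   # true gain 0.5, noise variance 1e-4

w_est, residual = kalman_echo_gain(mic, ref, noise_var=1e-4)
tail_power = float(np.mean(residual[-500:] ** 2))
```

A larger `noise_var` makes the filter trust each observation less and adapt more slowly, while a smaller one makes it adapt faster, so the same filter behaves differently depending on the parameter it receives.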
As an optional embodiment, the inputting the first audio resource and the second audio resource to a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network includes:
inputting the first audio resource and the second audio resource to the target parameter generation network;
and acquiring a first filtering parameter output by a first branch of the target parameter generation network and a second filtering parameter output by a second branch of the target parameter generation network, wherein the first filtering parameter is an operation parameter of the target filter, and the second filtering parameter is used for representing the characteristic of an echo component carried in the first audio resource and caused by the second audio resource.
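The two-branch output can be pictured with a small forward-pass sketch. The layer sizes, the shared hidden layer, the softplus and sigmoid output functions and the random untrained weights are all assumptions made for illustration; this excerpt does not describe the actual network structure.

```python
import numpy as np

rng = np.random.default_rng(0)
BINS, HIDDEN = 257, 32

# Hypothetical, untrained weights: one shared layer feeding two output branches.
params = {
    "shared": 0.05 * rng.standard_normal((2 * BINS, HIDDEN)),
    "var_head": 0.05 * rng.standard_normal((HIDDEN, 1)),
    "mask_head": 0.05 * rng.standard_normal((HIDDEN, BINS)),
}

def two_branch_forward(mic_mag, ref_mag, params):
    """mic_mag, ref_mag: (frames, BINS) magnitude spectrograms of the two audio resources."""
    x = np.concatenate([mic_mag, ref_mag], axis=1)
    hidden = np.tanh(x @ params["shared"])
    # First branch: one positive value per frame (softplus keeps it > 0),
    # standing in for a filter operating parameter such as a noise variance.
    noise_var = np.log1p(np.exp(hidden @ params["var_head"]))[:, 0]
    # Second branch: a mask in (0, 1) per time-frequency cell (sigmoid),
    # standing in for the echo-component feature.
    mask = 1.0 / (1.0 + np.exp(-(hidden @ params["mask_head"])))
    return noise_var, mask

mic_mag = np.abs(rng.standard_normal((10, BINS)))
ref_mag = np.abs(rng.standard_normal((10, BINS)))
noise_var, mask = two_branch_forward(mic_mag, ref_mag, params)
```

The point of the sketch is only the shape of the interface: one call on the pair of inputs yields both a filter operating parameter and a per-cell description of the echo component.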
Optionally, in this embodiment, the echo component caused by the second audio resource is the audio formed when the played second audio resource propagates through the space, and may include, but is not limited to, echo, reverberation, noise and the like formed by the audio propagation.
Optionally, in this embodiment, the feature of the echo component may be a spectral feature of the audio corresponding to the echo component, or may also be a feature of a position of the audio corresponding to the echo component in the first audio resource, a position of a frequency spectrum of the audio corresponding to the echo component in the frequency spectrum of the first audio resource, and the like, so that the echo component in the first audio resource can be determined according to the second filtering parameter.
Optionally, in this embodiment, the target filter may be used, but is not limited to being used, to simulate an echo path of a real scene using the first filtering parameter and the echo component, and then to estimate the echo audio by combining the echo path with the second audio resource, so as to filter out the echo audio in the first audio resource. FIG. 3 is a flow chart of an optional echo cancellation process according to an embodiment of the present application, which, as shown in FIG. 3, may include, but is not limited to, the following steps:
S301, acquiring a second audio resource played by the target device;
S302, while the target device plays the second audio resource, acquiring a first audio resource collected by the target device in the environment;
S303, performing framing, windowing and short-time Fourier transform on the first audio resource and the second audio resource, and inputting the transformed result into the target parameter generation network in the echo cancellation model;
S304, obtaining a first filtering parameter, output by a first branch of the target parameter generation network, for indicating the operation of the target filter, where the first filtering parameter may be, but is not limited to, a noise variance (covariance matrix) estimated according to the first audio resource and the second audio resource; when the estimated first filtering parameter differs, the filtering behavior of the target filter using that parameter also differs;
S305, obtaining a second filtering parameter, output by a second branch of the target parameter generation network, for characterizing a feature of the echo component carried in the first audio resource and caused by the second audio resource (which may include, but is not limited to, echo, noise, reverberation and the like caused by propagation of the second audio resource), where the second filtering parameter may be, but is not limited to, an image mask for indicating the echo component;
S306, calculating the product of the image mask and the first audio resource to obtain the echo component, where the echo component can be used by the target filter to estimate a linear transfer function between the clean audio without the echo component and the echo audio;
S307, inputting the echo component and the first filtering parameter into the target filter, where the target filter may be used for, but is not limited to, simulating an echo path of a real scene using the first filtering parameter and the echo component, and then combining the echo path with the second audio resource to estimate the echo audio, so that the echo audio in the first audio resource can be filtered out;
S308, obtaining the target audio resource output by the target filter.
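The preprocessing in step S303 can be illustrated with a minimal sketch. The frame length, hop size, Hann window, test signals and the way the two spectra are concatenated into `network_input` are all assumptions made for illustration; the application does not specify them.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def stft(signal, frame_len=8, hop=4):
    """Frame, window and DFT a real signal.

    Returns a list of frames, each holding frame_len // 2 + 1 complex bins.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Framing and windowing.
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        # Direct DFT of the windowed frame (non-negative frequencies only).
        spectrogram.append([
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(n_bins)
        ])
    return spectrogram


# Hypothetical signals: the played (second) audio is a pure tone and the
# collected (first) audio is an attenuated copy of it.
far_end = [math.sin(2.0 * math.pi * 2 * n / 8) for n in range(32)]
mic = [0.5 * v for v in far_end]

far_spec = stft(far_end)
mic_spec = stft(mic)

# Assumed network input: per-frame magnitudes of both signals, concatenated.
network_input = [
    [abs(v) for v in mic_frame] + [abs(v) for v in far_frame]
    for mic_frame, far_frame in zip(mic_spec, far_spec)
]
```

A direct DFT is used here only to keep the sketch free of dependencies; a practical implementation would normally use an FFT routine.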
According to this embodiment, the target filter is combined with the target parameter generation network, and the target parameter generation network outputs the target filtering parameter used by the target filter, so that the target filter uses the target filtering parameter to filter the collected audio resource, rather than a trained network model directly outputting the filtered audio as in the related technology.
In the above embodiment, the target parameter generation network is obtained by training an echo cancellation model using a training sample labeled with echo cancellation audio, where the echo cancellation model includes an initial parameter generation network and a target filter connected in sequence. The training sample is input into the initial parameter generation network, the training sample is filtered by the target filter according to the filtering parameter result output by the initial parameter generation network to obtain an audio resource result output by the target filter, and the network parameters of the initial parameter generation network are then adjusted according to the audio resource result and the echo cancellation audio. That is, what is adjusted during the whole training process is the network parameters of the initial parameter generation network, so that the initial parameter generation network can output the filtering parameters used by the target filter, which reduces the large number of parameters involved in model training.
Fig. 4 is a schematic diagram of an optional initial parameter generation network structure according to an embodiment of the present application. As shown in Fig. 4, the initial parameter generation network includes a recurrent neural network with three gated recurrent units (GRUs) as hidden layers, and two parameter output branches (each branch includes a fully connected layer and a sigmoid activation function). The training sample labeled with echo cancellation audio includes a first audio sample and a second audio sample (the first audio sample is an audio resource collected by the target device while playing the second audio sample, and the echo cancellation audio is the audio resource obtained by removing the second audio sample from the first audio sample).
The training sample is input to the recurrent neural network, and the output of the recurrent neural network is fed to the two branches. Both branches consist of a fully connected layer and a sigmoid activation function, and they differ in purpose: the first branch of the initial parameter generation network is used for estimating the noise variance of the target filter, and the second branch of the initial parameter generation network is used for estimating the echo component carried in the first audio sample and caused by the second audio sample. The estimated echo component and noise variance pass through the target filter to obtain the estimated clean speech. The loss of the network is calculated from the estimated clean speech and the echo cancellation audio used as the label, with a minimum mean square error loss function, and the loss is back-propagated through the differentiable Kalman filter to the initial parameter generation network for gradient updating, thereby completing the training of the initial parameter generation network.
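The network structure described for Fig. 4 can be sketched as a forward pass only. The layer sizes, random weights, class names (`GRUCell`, `TwoBranchNet`) and the particular GRU gate formulation are assumptions for illustration and are not taken from the application; no training is performed here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(w, v):
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """One gated recurrent unit layer (one common formulation, assumed)."""

    def __init__(self, rng, n_in, n_hid):
        # One (W, U, b) triple per gate: update z, reset r, candidate n.
        self.params = {
            g: (rand_matrix(rng, n_hid, n_in),
                rand_matrix(rng, n_hid, n_hid),
                [0.0] * n_hid)
            for g in "zrn"
        }

    def step(self, x, h):
        def affine(gate, hidden):
            w, u, b = self.params[gate]
            return [a + c + d
                    for a, c, d in zip(matvec(w, x), matvec(u, hidden), b)]

        z = [sigmoid(v) for v in affine("z", h)]
        r = [sigmoid(v) for v in affine("r", h)]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in affine("n", reset_h)]
        return [(1.0 - zi) * ci + zi * hi for zi, ci, hi in zip(z, cand, h)]


class TwoBranchNet:
    """Three stacked GRU layers followed by two FC + sigmoid branches."""

    def __init__(self, n_in, n_hid, n_bins, seed=0):
        rng = random.Random(seed)
        self.n_hid = n_hid
        self.layers = [GRUCell(rng, n_in if i == 0 else n_hid, n_hid)
                       for i in range(3)]
        self.noise_head = (rand_matrix(rng, n_bins, n_hid), [0.0] * n_bins)
        self.mask_head = (rand_matrix(rng, n_bins, n_hid), [0.0] * n_bins)

    @staticmethod
    def _head(head, h):
        w, b = head
        return [sigmoid(v + bi) for v, bi in zip(matvec(w, h), b)]

    def forward(self, frames):
        states = [[0.0] * self.n_hid for _ in self.layers]
        noise_out, mask_out = [], []
        for x in frames:
            for i, cell in enumerate(self.layers):
                states[i] = cell.step(x, states[i])
                x = states[i]
            noise_out.append(self._head(self.noise_head, x))  # first branch
            mask_out.append(self._head(self.mask_head, x))    # second branch
        return noise_out, mask_out


# Hypothetical input: 4 frames, 3 bins of collected audio + 3 bins of played audio.
frames = [[0.1 * (t + k) for k in range(6)] for t in range(4)]
net = TwoBranchNet(n_in=6, n_hid=8, n_bins=3)
noise_var, echo_mask = net.forward(frames)
```

Because both branches end in a sigmoid, each output lies strictly between 0 and 1, which suits a multiplicative mask; how the first-branch output would be scaled into a usable noise variance is left open here.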
As an optional embodiment, before the inputting the first audio resource and the second audio resource to a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network, the method further includes:
inputting the training sample into the initial parameter generation network to obtain a filtering parameter result output by the initial parameter generation network;
filtering the first audio sample by using the filtering parameter result and the target filter to obtain an audio resource result;
and adjusting the network parameters of the initial parameter generation network according to the loss value between the audio resource result and the echo cancellation audio until the network converges to obtain the target parameter generation network.
Optionally, in this embodiment, the audio resource result may be, but is not limited to, the audio with the echo components filtered out and the spectrogram corresponding to that audio.
Optionally, in this embodiment, the loss value between the audio resource result and the echo cancellation audio may be, but is not limited to, a value calculated by using a loss function, where the loss function may be, but is not limited to, a mean square error loss function.
Optionally, in this embodiment, the network converges when the loss value between the audio resource result and the echo cancellation audio is less than or equal to the target value.
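The training procedure above (filter with the network output, compute a mean square error loss against the labeled echo cancellation audio, adjust the network until the loss is at or below the target value) can be sketched with a deliberately tiny stand-in. Here the "network" is a single trainable scalar `theta`, the "filter" subtracts a scaled copy of the played audio, and the signals, learning rate and target loss are all hypothetical; the point is only that the gradient flows through the filter back to the parameter that generated its setting.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_filter(mic, far_end, gain):
    """Stand-in for the target filter: subtract a scaled copy of the played audio."""
    return [m - gain * f for m, f in zip(mic, far_end)]


def mse(result, label):
    return sum((r - l) ** 2 for r, l in zip(result, label)) / len(label)


def train(mic, far_end, label, target_loss=1e-6, lr=5.0, max_steps=1000):
    theta = 0.0  # the only trainable "network parameter" in this toy
    loss = float("inf")
    for step in range(max_steps):
        gain = sigmoid(theta)                     # filtering parameter result
        result = toy_filter(mic, far_end, gain)   # audio resource result
        loss = mse(result, label)
        if loss <= target_loss:                   # convergence criterion
            return theta, loss, step
        # Chain rule: d loss / d gain, then d gain / d theta.
        d_loss_d_gain = sum(
            2.0 * (r - l) * (-f) for r, l, f in zip(result, label, far_end)
        ) / len(label)
        theta -= lr * d_loss_d_gain * gain * (1.0 - gain)
    return theta, loss, max_steps


# Hypothetical training sample: label = clean audio, mic = clean + 0.7 * played audio.
far_end = [math.sin(0.3 * t) for t in range(64)]
clean = [0.5 * math.cos(0.11 * t) for t in range(64)]
mic = [c + 0.7 * f for c, f in zip(clean, far_end)]

theta, final_loss, steps = train(mic, far_end, clean)
```

In this toy the loss is zero exactly when `sigmoid(theta)` equals the hidden echo gain of 0.7, so reaching the target loss means the generated filter parameter has been learned from the labeled output alone.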
As an optional embodiment, the inputting the training sample into the initial parameter generation network to obtain a filtering parameter result output by the initial parameter generation network includes:
inputting the training samples into the initial parameter generation network;
and acquiring a first result output by a first branch of the initial parameter generation network and a second result output by a second branch of the initial parameter generation network as the filtering parameter result, wherein the first branch of the initial parameter generation network is used for estimating the noise variance of the target filter, and the second branch of the initial parameter generation network is used for estimating an echo component carried in the first audio sample and caused by the second audio sample.
Optionally, in this embodiment, the second branch of the initial parameter generation network may directly estimate the echo component caused by the second audio sample, or may estimate a parameter used for extracting, from the first audio sample, the echo component caused by the second audio sample. The parameter may be, but is not limited to, a feature indicating the echo component in the first audio sample, such as a position of the echo component in the first audio sample or a spectral feature of the echo component, and the echo component in the first audio sample is then determined according to the parameter.
Optionally, in this embodiment, the first branch of the initial parameter generation network is used to estimate a noise variance matched with the target filter, where the estimated noise variance is different according to different types of the target filters, for example, the noise variance may be a covariance matrix estimated according to audio features of the first audio resource and the second audio resource, and the matrix parameters in the estimated covariance matrix are different according to different filters.
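To make the role of the noise variance concrete, the following is a minimal sketch of a scalar Kalman filter operating on a single frequency bin. It is an assumption about how such a target filter could be arranged, not the filter defined by the application: the echo path is modelled as one complex coefficient `w`, the extracted echo component `echo_obs` is used as the observation for adapting `w`, and `noise_var` is the externally supplied measurement noise variance (the quantity the first branch is said to estimate). The function name, the process noise value and the test signals are hypothetical.

```python
import cmath


def kalman_echo_cancel(mic, far_end, echo_obs, noise_var, process_var=1e-4):
    """Per-bin scalar Kalman filter.

    mic, far_end, echo_obs: complex samples of one frequency bin over time.
    noise_var: positive measurement noise variance for each time step.
    Returns the echo-cancelled samples and the final echo-path estimate.
    """
    w = 0j     # echo-path estimate
    p = 1.0    # error variance of the estimate
    output = []
    for y, x, d, r in zip(mic, far_end, echo_obs, noise_var):
        p += process_var                       # predict
        output.append(y - w * x)               # remove the estimated echo
        gain = p * x.conjugate() / (p * abs(x) ** 2 + r)
        w += gain * (d - w * x)                # correct with the echo observation
        p *= 1.0 - (gain * x).real             # gain * x is real up to rounding
    return output, w


# Hypothetical single-bin signals.
true_path = 0.6 - 0.3j
steps = 200
far_end = [cmath.exp(0.7j * t) for t in range(steps)]
near_end = [0.2 * cmath.exp(1.3j * t) for t in range(steps)]
echo = [true_path * x for x in far_end]
mic = [e + s for e, s in zip(echo, near_end)]
noise_var = [0.01] * steps

cleaned, path_estimate = kalman_echo_cancel(mic, far_end, echo, noise_var)
```

A larger `noise_var` makes the filter trust each observation less and adapt more slowly, which is one way a different estimated first filtering parameter changes the filtering behavior.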
As an alternative embodiment, the filtering the first audio sample using the filtering parameter result and the target filter to obtain an audio resource result includes:
extracting a reference echo component carried in the first audio sample using the second result, wherein the reference echo component is caused by the second audio sample;
and filtering the reference echo component by using the target filter with the first result as an operation parameter to obtain the audio resource result.
Optionally, in the present embodiment, the reference echo component may include, but is not limited to, echo, reverberation, noise, etc. caused by the propagation of the second audio sample.
Optionally, in this embodiment, the second result may be an image mask estimated by the initial parameter generation network, and the image mask may be, but is not limited to, a feature indicating the echo component caused by the second audio sample. The reference echo component carried in the first audio sample may be extracted using the second result by calculating the product of the second result and the first audio sample. For example, the second result may indicate a spectral feature of the echo component, position information of the echo component in the first audio sample, and the like; the product of the second result and the first audio sample then yields the echo component.
As an optional embodiment, the filtering the first audio resource by using the target filtering parameter and the target filter to obtain a target audio resource includes:
extracting a target audio feature in the first audio resource by using a second filtering parameter included in the target filtering parameter, wherein the second filtering parameter is used for characterizing the echo component caused by the second audio resource and carried in the first audio resource;
and inputting the target audio feature into the target filter with the first filtering parameter as an operation parameter to obtain the target audio.
Optionally, in this embodiment, the target audio characteristic is a characteristic of an echo component caused by the second audio resource.
Optionally, in this embodiment, after the target audio feature is input into the target filter using the first filtering parameter as the operation parameter, the target filter may filter out the audio of the target audio feature from the first audio resource, so as to obtain the target audio.
As an alternative embodiment, the extracting the target audio feature in the first audio resource by using the second filtering parameter included in the target filtering parameter includes:
calculating a product of an image mask and the spectrogram of the first audio resource to obtain a characteristic spectrogram, wherein the second filtering parameter comprises the image mask, and the image mask is used for representing a region of the spectrogram, which comprises the target audio characteristic;
and determining the spectral features recorded by the feature spectrogram as the target audio features.
Optionally, in this embodiment, the image mask may be used to cover a region of the spectrogram of the first audio resource except the target audio feature, or may also be used to extract the region of the target audio feature from the spectrogram of the first audio resource.
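The mask product described above can be sketched as an element-wise multiplication over a frames-by-bins spectrogram. The helper names and the small example values are hypothetical; the mask is assumed to hold values in [0, 1] with the same shape as the spectrogram.

```python
def apply_mask(mask, spectrogram):
    """Element-wise product of a mask and a spectrogram (frames x bins)."""
    if len(mask) != len(spectrogram) or any(
        len(m_row) != len(s_row) for m_row, s_row in zip(mask, spectrogram)
    ):
        raise ValueError("mask and spectrogram must have the same shape")
    return [
        [m * s for m, s in zip(m_row, s_row)]
        for m_row, s_row in zip(mask, spectrogram)
    ]


def complement(mask):
    """Mask that covers the region selected by `mask` and keeps the rest."""
    return [[1.0 - m for m in row] for row in mask]


# Hypothetical 2-frame, 3-bin spectrogram of the first audio resource.
spectrogram = [[1 + 1j, 2 + 0j, 0.5j],
               [4 + 0j, 1 - 1j, 3 + 0j]]
# Hypothetical image mask marking where the target audio feature lies.
mask = [[1.0, 0.0, 0.5],
        [0.0, 1.0, 0.25]]

feature_spectrogram = apply_mask(mask, spectrogram)
remainder = apply_mask(complement(mask), spectrogram)
```

The two products split the spectrogram into the region carrying the target audio feature and everything else, so adding them back together recovers the original spectrogram.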
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, and an optical disk), and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the method of the embodiments of the present application.
Fig. 5 is a block diagram of an echo cancellation device for audio resources according to an embodiment of the present application. As shown in Fig. 5, the device includes: an obtaining module 52, configured to obtain a first audio resource and a second audio resource, where the first audio resource is an audio resource collected by a target device in a process of playing the second audio resource; a first input module 54, configured to input the first audio resource and the second audio resource to a target parameter generation network, so as to obtain a target filtering parameter output by the target parameter generation network, where the target parameter generation network is obtained by training an echo cancellation model using a training sample labeled with an echo cancellation audio, the echo cancellation model includes an initial parameter generation network and a target filter that are sequentially connected, the training sample includes a first audio sample and a second audio sample, the first audio sample is an audio resource collected by the target device in a process of playing the second audio sample, and the echo cancellation audio is an audio resource in which the second audio sample is cancelled from the first audio sample; and a first filtering module 56, configured to filter the first audio resource by using the target filtering parameter and the target filter, so as to obtain a target audio resource.
Through this embodiment, the echo cancellation model includes the initial parameter generation network and the target filter connected in sequence, and the echo cancellation model is trained using the training sample labeled with echo cancellation audio to obtain the trained target parameter generation network. When echo cancellation is performed, the collected first audio resource and the played second audio resource are input into the target parameter generation network, the target parameter generation network outputs the target filtering parameter used by the target filter, and the first audio resource is filtered using the target filtering parameter and the target filter to obtain the echo-cancelled target audio resource. That is, the initial parameter generation network in the echo cancellation model is trained with the training sample to obtain a target parameter generation network that outputs the target filtering parameter used by the target filter, instead of a trained network model directly outputting the filtered target audio resource. This reduces the complexity of the model and the amount of data processed during model construction, so that the echo cancellation operation on the first audio resource can be completed with limited computing resources. By adopting this technical solution, the problems in the related technology such as low echo cancellation efficiency for audio resources are solved, and the technical effect of improving the echo cancellation efficiency for audio resources is achieved.
Optionally, the first input module includes: a first input unit configured to input the first audio resource and the second audio resource to the target parameter generation network; a first obtaining unit, configured to obtain a first filtering parameter output by a first branch of the target parameter generation network, and a second filtering parameter output by a second branch of the target parameter generation network, where the first filtering parameter is an operating parameter of the target filter, and the second filtering parameter is used to characterize an echo component caused by the second audio resource and carried in the first audio resource.
Optionally, the apparatus further comprises: a second input module, configured to input the training sample into the initial parameter generation network before the first audio resource and the second audio resource are input into a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network, so as to obtain a filtering parameter result output by the initial parameter generation network; a second filtering module, configured to filter the first audio sample by using the filtering parameter result and the target filter to obtain an audio resource result; and the adjusting module is used for adjusting the network parameters of the initial parameter generation network according to the loss value between the audio resource result and the echo cancellation audio until the network converges to obtain the target parameter generation network.
Optionally, the second input module includes: a second input unit, configured to input the training sample into the initial parameter generation network; a second obtaining unit, configured to obtain, as the filtering parameter result, a first result output by a first branch of the initial parameter generating network and a second result output by a second branch of the initial parameter generating network, where the first branch of the initial parameter generating network is used to estimate a noise variance of the target filter, and the second branch of the initial parameter generating network is used to estimate an echo component carried in the first audio sample and caused by the second audio sample.
Optionally, the second filtering module includes: the operation unit is used for operating the second result and the first audio sample to obtain an operation result; and the filtering unit is used for filtering the operation result by using the target filter taking the first result as an operation parameter to obtain the audio resource result.
Optionally, the first filtering module includes: an extracting unit, configured to extract a target audio feature in the first audio resource by using a second filtering parameter included in the target filtering parameter, where the second filtering parameter is used to characterize an echo component caused by the second audio resource and carried in the first audio resource; and the third input unit is used for inputting the target audio characteristics into the target filter which takes the first filtering parameters as the operating parameters to obtain the target audio.
Optionally, the extracting unit is configured to: calculating a product of an image mask and the spectrogram of the first audio resource to obtain a characteristic spectrogram, wherein the second filtering parameter comprises the image mask, and the image mask is used for representing a region of the spectrogram, which comprises the target audio characteristic; and determining the spectral features recorded by the feature spectrogram as the target audio features.
An embodiment of the present application further provides a storage medium including a stored program, where the program, when executed, performs the echo cancellation method for an audio resource according to any one of the above embodiments.
Alternatively, in the present embodiment, the storage medium may be configured to store program codes for performing the following steps: acquiring a first audio resource and a second audio resource, wherein the first audio resource is an audio resource collected by a target device in the process of playing the second audio resource; inputting a first audio resource and a second audio resource into a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network, wherein the target parameter generation network is obtained by training an echo cancellation model by using a training sample marked with echo cancellation audio, the echo cancellation model comprises an initial parameter generation network and a target filter which are sequentially connected, the training sample comprises a first audio sample and a second audio sample, the first audio sample is an audio resource collected by target equipment in the process of playing the second audio sample, and the echo cancellation audio is an audio resource with the second audio sample eliminated from the first audio sample; and filtering the first audio resource by using the target filtering parameter and the target filter to obtain the target audio resource.
Embodiments of the present application further provide an electronic device, comprising a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above embodiments of the method for echo cancellation of an audio resource.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program: acquiring a first audio resource and a second audio resource, wherein the first audio resource is an audio resource collected by a target device in the process of playing the second audio resource; inputting a first audio resource and a second audio resource into a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network, wherein the target parameter generation network is obtained by training an echo cancellation model by using a training sample marked with echo cancellation audio, the echo cancellation model comprises an initial parameter generation network and a target filter which are sequentially connected, the training sample comprises a first audio sample and a second audio sample, the first audio sample is an audio resource collected by target equipment in the process of playing the second audio sample, and the echo cancellation audio is an audio resource with the second audio sample eliminated from the first audio sample; and filtering the first audio resource by using the target filtering parameter and the target filter to obtain the target audio resource.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present application described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as falling within the protection scope of the present application.

Claims (10)

1. A method for echo cancellation of an audio resource, comprising:
acquiring a first audio resource and a second audio resource, wherein the first audio resource is an audio resource collected by a target device in the process of playing the second audio resource;
inputting the first audio resource and the second audio resource into a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network, wherein the target parameter generation network is obtained by training an echo cancellation model by using a training sample labeled with echo cancellation audio, the echo cancellation model comprises an initial parameter generation network and a target filter which are sequentially connected, the training sample comprises a first audio sample and a second audio sample, the first audio sample is an audio resource collected by the target device in a process of playing the second audio sample, and the echo cancellation audio is an audio resource in which the second audio sample is eliminated from the first audio sample;
and filtering the first audio resource by using the target filtering parameter and the target filter to obtain a target audio resource.
2. The method of claim 1, wherein the inputting the first audio resource and the second audio resource to a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network comprises:
inputting the first audio resource and the second audio resource to the target parameter generation network;
and acquiring a first filtering parameter output by a first branch of the target parameter generation network and a second filtering parameter output by a second branch of the target parameter generation network, wherein the first filtering parameter is an operation parameter of the target filter, and the second filtering parameter is used for representing the characteristic of an echo component carried in the first audio resource and caused by the second audio resource.
3. The method of claim 1, wherein before the inputting the first audio resource and the second audio resource to a target parameter generation network to obtain a target filtering parameter output by the target parameter generation network, the method further comprises:
inputting the training sample into the initial parameter generation network to obtain a filtering parameter result output by the initial parameter generation network;
filtering the first audio sample by using the filtering parameter result and the target filter to obtain an audio resource result;
and adjusting the network parameters of the initial parameter generation network according to the loss value between the audio resource result and the echo cancellation audio until the network converges to obtain the target parameter generation network.
4. The method of claim 3, wherein the inputting the training sample into the initial parameter generation network to obtain a filtering parameter result output by the initial parameter generation network comprises:
inputting the training samples into the initial parameter generation network;
and acquiring a first result output by a first branch of the initial parameter generation network and a second result output by a second branch of the initial parameter generation network as the filtering parameter result, wherein the first branch of the initial parameter generation network is used for estimating the noise variance of the target filter, and the second branch of the initial parameter generation network is used for estimating an echo component carried in the first audio sample and caused by the second audio sample.
5. The method of claim 4, wherein the filtering the first audio sample using the filtering parameter result and the target filter to obtain an audio resource result comprises:
extracting a reference echo component carried in the first audio sample using the second result, wherein the reference echo component is caused by the second audio sample;
and filtering the reference echo component by using the target filter with the first result as an operation parameter to obtain the audio resource result.
6. The method of claim 1, wherein the filtering the first audio resource using the target filtering parameter and the target filter to obtain a target audio resource comprises:
extracting a target audio feature in the first audio resource by using a second filtering parameter included in the target filtering parameter, wherein the second filtering parameter is used for characterizing the echo component caused by the second audio resource and carried in the first audio resource;
and inputting the target audio feature into the target filter with the first filtering parameter as an operation parameter to obtain the target audio.
7. The method of claim 6, wherein the extracting a target audio feature in the first audio resource using a second filtering parameter included in the target filtering parameter comprises:
calculating a product of an image mask and the spectrogram of the first audio resource to obtain a characteristic spectrogram, wherein the second filtering parameter comprises the image mask, and the image mask is used for representing a region of the spectrogram, which comprises the target audio characteristic;
and determining the spectral features recorded in the feature spectrogram as the target audio features.
8. An apparatus for echo cancellation of an audio resource, comprising:
an obtaining module, configured to obtain a first audio resource and a second audio resource, wherein the first audio resource is an audio resource collected by a target device in the process of playing the second audio resource;
a first input module, configured to input the first audio resource and the second audio resource to a target parameter generation network, so as to obtain a target filtering parameter output by the target parameter generation network, where the target parameter generation network is obtained by training an echo cancellation model using a training sample labeled with an echo cancellation audio, the echo cancellation model includes an initial parameter generation network and a target filter that are sequentially connected, the training sample includes a first audio sample and a second audio sample, the first audio sample is an audio resource acquired by the target device in a process of playing the second audio sample, and the echo cancellation audio is an audio resource in which the second audio sample is cancelled from the first audio sample;
and the first filtering module is used for filtering the first audio resource by using the target filtering parameter and the target filter to obtain a target audio resource.
9. A computer-readable storage medium, comprising a stored program, wherein the program, when executed, performs the method of any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.
CN202211064742.1A 2022-08-31 2022-08-31 Echo cancellation method and device for audio resource, storage medium and electronic device Pending CN115472175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211064742.1A CN115472175A (en) 2022-08-31 2022-08-31 Echo cancellation method and device for audio resource, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN115472175A (en) 2022-12-13

Family ID: 84370953

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN202211064742.1A Pending CN115472175A (en) 2022-08-31 2022-08-31 Echo cancellation method and device for audio resource, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN115472175A (en)

Similar Documents

Publication Publication Date Title
CN110246515A (en) Removing method, device, storage medium and the electronic device of echo
CN111885275B (en) Echo cancellation method and device for voice signal, storage medium and electronic device
Svensson et al. Errors in MLS measurements caused by time variance in acoustic systems
CN114283795A (en) Training and recognition method of voice enhancement model, electronic equipment and storage medium
CN112820315A (en) Audio signal processing method, audio signal processing device, computer equipment and storage medium
CN111863015A (en) Audio processing method and device, electronic equipment and readable storage medium
CN109493883A (en) A kind of audio time-delay calculation method and apparatus of smart machine and its smart machine
CN111883154B (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
CN114792524B (en) Audio data processing method, apparatus, program product, computer device and medium
CN117789744B (en) Voice noise reduction method and device based on model fusion and storage medium
CN105657203B (en) Noise-reduction method and system in smart machine voice communication
CN116612778B (en) Echo and noise suppression method, related device and medium
CN110931040B (en) Filtering sound signals acquired by a speech recognition system
CN115472175A (en) Echo cancellation method and device for audio resource, storage medium and electronic device
CN115171703B (en) Distributed voice awakening method and device, storage medium and electronic device
CN115083431A (en) Echo cancellation method and device, electronic equipment and computer readable medium
Fujimura et al. Analysis of Noisy-target Training for DNN-based speech enhancement
CN114220451A (en) Audio denoising method, electronic device, and storage medium
CN112331187A (en) Multi-task speech recognition model training method and multi-task speech recognition method
CN113470677B (en) Audio processing method, device and system
WO2023246223A1 (en) Speech enhancement method and apparatus for distributed wake-up, and storage medium
CN116959416A (en) Voice wakeup testing method, storage medium and electronic device
Vairetti Efficient parametric modeling, identification and equalization of room acoustics
CN117153178B (en) Audio signal processing method, device, electronic equipment and storage medium
CN114173259B (en) Echo cancellation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination