CN112489667A - Audio signal processing method and device - Google Patents

Audio signal processing method and device

Info

Publication number
CN112489667A
CN112489667A (application CN201910777904.8A)
Authority
CN
China
Prior art keywords
audio signal
target
microphone
sound source
source position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910777904.8A
Other languages
Chinese (zh)
Inventor
陈孝良
杨晓帆
冯大航
常乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN201910777904.8A priority Critical patent/CN112489667A/en
Publication of CN112489667A publication Critical patent/CN112489667A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • G10L 21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/043: Time compression or expansion by changing speed
    • G10L 2015/223: Execution procedure of a spoken command
    • G10L 2021/02082: Noise filtering where the noise is echo or reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention provides an audio signal processing method and device. The method acquires a pre-recorded source audio signal and a preset target sound source position; converts the source audio signal according to the target sound source position to obtain an audio signal corresponding to each microphone of a microphone array; and combines the audio signals corresponding to the microphones into a target audio signal of the microphone array. The target audio signal produced for a target sound source position is equivalent to an audio signal the microphone array would record of sound emitted from that position. The scheme can therefore simulate, from a pre-recorded audio signal, the audio signal the microphone array would record from any sound source position, without frequently moving the sound source and re-recording at each position. This shortens the time needed to collect the audio samples required to train a smart speaker, and thereby the time needed to train the smart speaker's wake-up model.

Description

Audio signal processing method and device
Technical Field
The present invention relates to the field of signal processing technologies, and in particular, to a method and an apparatus for processing an audio signal.
Background
With the rapid development of artificial intelligence technology, smart speakers have become increasingly popular. Current smart speakers generally use a microphone array to acquire audio signals. In actual use, sounds emitted from different sound source positions relative to the microphone array produce different recorded audio signals. That is, the audio signal recorded by the microphone array varies with the position of the sound source.
Therefore, in the prior art, when training the wake-up model of a smart speaker, the microphone array of the smart speaker generally has to record sounds emitted from many different sound source positions, so as to obtain audio signals for those positions, and the wake-up model is then trained with these audio signals. In actual use, the wake-up model can then accurately recognize the audio signals the microphone array records from different sound source positions, yielding a better wake-up effect.
However, frequently moving the sound source and re-recording the audio signal at each position takes a long time, which makes the existing approach to training a smart speaker's wake-up model inefficient.
Disclosure of Invention
Based on the above drawbacks of the prior art, the present invention provides a method and an apparatus for processing an audio signal, so as to improve the efficiency of training a wake-up model of a smart speaker.
A first aspect of the present invention provides a method for processing an audio signal, including:
acquiring a pre-recorded source audio signal and a preset target sound source position;
converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone; wherein the microphone is each microphone of an array of microphones;
and combining the audio signals corresponding to each microphone to obtain a target audio signal of the microphone array.
Optionally, after the audio signal corresponding to each microphone is combined to obtain the target audio signal of the microphone array, the method further includes:
and training a wake-up model of the intelligent sound box provided with the microphone array by using the target audio signal.
Optionally, the converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone includes:
obtaining impulse response of the microphone; wherein the impulse response is generated in advance according to the target sound source position;
and calculating the source audio signal according to the impulse response corresponding to the target sound source position to obtain an audio signal corresponding to a microphone.
Optionally, before the converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone, the method further includes:
acquiring the preset sound absorption quantity of a target scene and noise data of the target scene;
wherein, the converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone includes:
and converting the source audio signal according to the target sound source position, the sound absorption quantity of the target scene and the noise data of the target scene to obtain an audio signal corresponding to a microphone.
Optionally, after the audio signal corresponding to each microphone is combined to obtain the target audio signal of the microphone array, the method further includes:
copying the target audio signal to obtain a plurality of copies of the target audio signal;
adjusting the tone of the copy of each target audio signal according to pre-collected user tone data to obtain a plurality of adjusted audio signals;
wherein the pitch of each of the adjusted audio signals is unique.
Optionally, after the audio signal corresponding to each microphone is combined to obtain the target audio signal of the microphone array, the method further includes:
copying the target audio signal to obtain a plurality of copies of the target audio signal;
adjusting the speech rate of the copy of each target audio signal according to pre-collected user speech rate data to obtain a plurality of adjusted audio signals;
wherein the speech rate of each of the adjusted audio signals is unique.
A second aspect of the present invention provides an apparatus for processing an audio signal, comprising:
an acquisition unit, configured to acquire a pre-recorded source audio signal and a preset target sound source position;
the conversion unit is used for converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone; wherein the microphone is each microphone of an array of microphones;
and the combination unit is used for combining the audio signals corresponding to the microphones to obtain the target audio signals of the microphone array.
Optionally, the conversion unit includes:
a sub-acquisition unit, configured to acquire an impulse response of the microphone; wherein the impulse response of the microphone is generated in advance according to the target sound source position;
and the computing unit is used for computing the source audio signal according to the impulse response corresponding to the target sound source position to obtain an audio signal corresponding to the microphone.
Optionally, the obtaining unit is further configured to:
acquiring the preset sound absorption quantity of a target scene and noise data of the target scene;
the conversion unit is used for:
and converting the source audio signal according to the target sound source position, the sound absorption quantity of the target scene and the noise data of the target scene to obtain an audio signal corresponding to a microphone.
Optionally, the processing apparatus further includes:
the analog unit is used for copying the target audio signal to obtain a plurality of copies of the target audio signal;
adjusting the tone of the copy of each target audio signal according to pre-collected user tone data to obtain a plurality of adjusted audio signals;
wherein the pitch of each of the adjusted audio signals is unique.
The invention provides an audio signal processing method and device. The method acquires a pre-recorded source audio signal and a preset target sound source position; converts the source audio signal according to the target sound source position to obtain an audio signal corresponding to each microphone of a microphone array; and combines the audio signals corresponding to the microphones into a target audio signal of the microphone array. The target audio signal produced for a target sound source position is equivalent to an audio signal the microphone array would record of sound emitted from that position. The scheme can therefore simulate, from a pre-recorded audio signal, the audio signal the microphone array would record from any sound source position, without frequently moving the sound source and re-recording at each position. This shortens the time needed to collect the audio samples required to train a smart speaker, and thereby the time needed to train the smart speaker's wake-up model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for processing an audio signal according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for processing an audio signal according to another embodiment of the present invention;
fig. 3 is a flowchart of a method for processing an audio signal according to another embodiment of the present invention;
fig. 4 is a flowchart of a method for processing an audio signal according to still another embodiment of the present invention;
fig. 5 is a schematic structural diagram of an apparatus for processing an audio signal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A smart speaker is a widely used electronic device. A user can wake a dormant smart speaker by speaking a specific wake-up word, causing it to enter an activated state; this is the wake-up process of the smart speaker.
During the wake-up process, after the smart speaker receives an audio signal, it performs speech recognition on the audio signal with its wake-up model; if the wake-up model recognizes the wake-up word in the audio signal, the smart speaker enters the activated state.
The recognition capability of the wake-up model is therefore directly tied to the wake-up effect of the smart speaker, i.e. whether the smart speaker responds to the user's wake-up command effectively and promptly.
At present, to obtain a wake-up model with strong recognition capability, a microphone array generally has to record speech emitted from multiple sound source positions, yielding multiple audio signals corresponding to those positions, and the wake-up model is then trained with them. A wake-up model trained this way can accurately recognize far-field speech uttered by the user from any position during actual use of the smart speaker, and thus respond effectively to the user's wake-up command.
However, the existing method requires frequent changes of the position of the sound source, and each time the position of the sound source is changed, the sound emitted from the current sound source position is recorded by the microphone array, thereby obtaining an audio signal corresponding to the current sound source position. Wherein the sound source position refers to the position of the sound source relative to the microphone array. Specifically, the sound source position may be expressed as coordinates of the sound source in a three-dimensional coordinate system, and the three-dimensional coordinate system is a coordinate system established with reference to the position of the microphone array.
That is to say, in the conventional method of training a wake-up model with sample data, ensuring that the sample data covers audio signals recorded from many sound source positions requires spending a long time repeatedly recording audio signals at the sample-acquisition stage. This lengthens the time needed to train a single wake-up model and lowers training efficiency.
In view of the above problems in the prior art, an embodiment of the present invention provides a method for processing an audio signal, please refer to fig. 1, the method includes the following steps:
s101, acquiring a pre-recorded source audio signal and a preset target sound source position.
Wherein, the source audio signal is a near-field audio signal which is pre-recorded and stored in a database before the training. The recording device for recording the source audio signal may be a microphone array of multiple microphones or a single microphone.
Generally, after a user holds a microphone and speaks into the microphone, the microphone records the audio signal, which can be used as a source audio signal. Obviously, by changing the person who speaks and the contents of the sentence, a plurality of source audio signals can be obtained.
For convenience of understanding, the audio signal processing method provided in any embodiment of the present application is described by taking one source audio signal as an example, and based on the audio signal processing method described in the embodiment, a person skilled in the art can use the method for multiple source audio signals respectively, thereby implementing processing on multiple source audio signals.
As described above, before training the wake-up model of a smart speaker, audio signals generated from a plurality of pre-specified sound source positions must be acquired as sample data. Each pre-specified sound source position for which a corresponding audio signal is to be acquired is a target sound source position. Generally, a target sound source position is a far-field position relative to the smart speaker; that is, its distance from the smart speaker is greater than, or close to, 1 meter.
For example, in a coordinate system established with reference to the position of the target smart speaker (i.e., the smart speaker whose wake-up model is to be trained), if the sample-acquisition stage requires the audio signal the target smart speaker would record when the sound source is located at point A, then point A is a target sound source position.
S102, converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to the microphone.
Wherein, the microphone refers to each microphone in the microphone array installed on the target smart speaker.
It can thus be understood that step S102 performs the conversion for each microphone of the target smart speaker's microphone array, obtaining the audio signal corresponding to that microphone. That is, through step S102, the audio signal corresponding to every microphone in the array can be obtained.
When a microphone is used to collect an audio signal, even if the same sound source emits the same sound, the audio signal recorded by the microphone changes as the positional relationship between the sound source and the microphone changes.
The target sound source position can be regarded as describing the relative positional relationship between the sound source and the microphone array, within which the position of each microphone is fixed. For a given microphone array, the positional relationship between the sound source and each microphone can therefore be determined directly from the target sound source position, and the source audio signal can be converted into the audio signal corresponding to each microphone in the array. These converted audio signals simulate the audio signals the microphones would record directly of sound generated at the target sound source position.
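To make the geometry concrete, the following is a minimal numpy sketch (not taken from the patent; the coordinates, array layout and speed of sound are illustrative assumptions) of deriving each microphone's propagation delay from a target sound source position:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def per_mic_delays(source_pos, mic_positions, fs):
    """Propagation delay, in (fractional) samples, from the target
    sound source position to every microphone of the array."""
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    return dists / SPEED_OF_SOUND * fs

# Hypothetical 4-microphone linear array, source about 2 m away
mics = np.array([[0.00, 0.0, 0.0], [0.05, 0.0, 0.0],
                 [0.10, 0.0, 0.0], [0.15, 0.0, 0.0]])
src = np.array([2.0, 1.0, 0.5])
print(per_mic_delays(src, mics, fs=16000))
```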
S103, combining the audio signals corresponding to each microphone to obtain a target audio signal of the microphone array.
The target audio signal serves as the audio signal the microphone array would obtain by recording the target sound, and is used to train the wake-up model of the smart speaker equipped with the microphone array; the target sound is the sound emitted from the target sound source position.
An audio signal recorded by a microphone array is actually multi-channel data formed by combining a plurality of single-channel audio signals recorded by each microphone of the microphone array.
The audio signals corresponding to the microphones converted in step S102 are equivalent to audio signals each microphone would record directly of sound from the target sound source position, so the target audio signal obtained by combining them is equivalent to the audio signal the target microphone array, formed by those microphones, would record directly from that position. In effect, the method of this embodiment processes a pre-recorded source audio signal so that the processed signal simulates an audio signal recorded by the target microphone array.
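A minimal sketch of the combining step, assuming each per-microphone signal is a numpy array (the helper name is ours, not the patent's): the single-channel signals simply become the channels of one multi-channel target signal.

```python
import numpy as np

def combine_channels(per_mic_signals):
    """Stack the converted single-channel signals into one
    multi-channel target audio signal of shape (n_mics, n_samples)."""
    n = min(len(s) for s in per_mic_signals)   # align channel lengths
    return np.stack([s[:n] for s in per_mic_signals])

# Four per-microphone signals -> one 4-channel target audio signal
sigs = [np.random.randn(16000) for _ in range(4)]
print(combine_channels(sigs).shape)  # (4, 16000)
```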
Therefore, a wake-up model trained with the target audio signals output by step S103 can accurately recognize audio signals that the target smart speaker's microphone array records from any sound source position during actual use. In other words, a wake-up model trained on target audio signals produced by this embodiment has the same recognition capability as one trained on audio signals recorded directly from the target sound source positions with the microphone array.
For convenience of understanding the processing method of the audio signal provided by the present embodiment, the following briefly describes the implementation process of the present embodiment with reference to an actual scene:
For example, user A utters the speech "XXX" into a handheld microphone, which records a source audio signal. To train the wake-up model of the target smart speaker, the audio signal that the target smart speaker's microphone array would record if user A uttered "XXX" from position B is needed, where position B is a target sound source position. Based on the audio signal processing method introduced above, to acquire such an audio signal, the positional relationship between each microphone in the array and position B can be determined from the target sound source position; the source audio signal is then converted, according to those positional relationships, into the audio signal corresponding to each microphone, and the per-microphone audio signals are combined into the target audio signal of the target smart speaker's microphone array. This target audio signal is equivalent to an audio signal recorded directly with the array while user A utters "XXX" from position B.
It can be understood that the conversion from a source audio signal to a target audio signal, described here with one source audio signal and one target sound source position as an example, applies directly to any pre-recorded audio signal and any sound source position. Therefore, as long as multiple source audio signals are pre-recorded and multiple sound source positions are set, the method of this embodiment can convert each audio signal for each sound source position, producing enough target audio signals to meet the sample count required for training the wake-up model.
For example, continuing the above example, if the audio signals corresponding to the speech "XXX" recorded by the target smart speaker's microphone array from position B, position C and position D are all required, position B can first be taken as the target sound source position and converted with the method of this embodiment to obtain the target audio signal for position B; position C and position D are then taken as the target sound source position in turn, and the same conversion yields the target audio signals for position C and position D.
The invention provides an audio signal processing method and device. The method acquires a pre-recorded source audio signal and a preset target sound source position; converts the source audio signal according to the target sound source position to obtain an audio signal corresponding to each microphone of a microphone array; and combines the audio signals corresponding to the microphones into a target audio signal of the microphone array. The target audio signal serves as the audio signal the microphone array would obtain by recording the target sound, i.e. the sound emitted from the target sound source position, and is used to train the wake-up model of the smart speaker equipped with the microphone array. The scheme can simulate, from a pre-recorded audio signal, the audio signal the microphone array would record from any sound source position, without frequently moving the sound source and re-recording at each position, which effectively reduces the time required to train the smart speaker's wake-up model.
Another embodiment of the present application provides an audio signal processing method that processes a source audio signal using both a target sound source position and environmental parameters, so as to simulate the audio signals a microphone array would record in various scenes. Referring to fig. 2, the method includes:
s201, obtaining a pre-recorded source audio signal, a preset target sound source position and a preset environment parameter.
Wherein the environmental parameters include an amount of sound absorption of the target scene and noise data of the target scene.
Each set of environmental parameters corresponds to a particular target scenario. For any target scene, the sound absorption amount of the scene can be calculated according to the shape, size and acoustic characteristics (for example, the sound absorption coefficient of an object) of each object in the scene.
For example, a bedroom environment includes a carpet, a cabinet, glass, walls and a bed, with sound absorption coefficients of roughly: carpet 0.9, cabinet 0.3, glass 0.1 and bed 0.8. Assume the bedroom is a cuboid space whose four walls have surface areas a1 to a4 and sound absorption coefficients S1 to S4, whose ceiling has surface area a5 and coefficient S5, and whose floor has surface area a6 and coefficient S6. The sound absorption amount Sa of the target bedroom scene can then be calculated according to formula (1):

$$S_a = \sum_{i=1}^{6} S_i\, a_i + \sum_{j} A_j \qquad (1)$$

where A_j denotes the sound absorption amount of an object in the target scene; in the bedroom environment above, the absorption amount of the carpet is A_1, that of the cabinet is A_2, that of the glass is A_3 and that of the bed is A_4. The sound absorption amount of each object can be calculated from its sound absorption coefficient together with its shape, size and similar information.
The sound absorption coefficients of walls, ceilings and floors differ with the building structure and surface coating; S1 to S6 are generally each set to a value close to 0.4, for example between 0.3 and 0.5.
In addition, the sound absorption coefficient of the object can be adjusted within a certain range according to the material of the object, for example, the sound absorption coefficient of a carpet can be set between 0.8 and 0.95, the sound absorption coefficient of a cabinet can be set between 0.2 and 0.4, the sound absorption coefficient of glass can be set between 0.05 and 0.2, and the sound absorption coefficient of a bed can be set between 0.7 and 0.9.
The objects found in several other environments, and selectable ranges for their sound absorption coefficients, are described below:
the living room environment comprises a television sofa, a balcony, a tea table and the like, the sound absorption coefficient of the balcony is 0.1 to 0.3, the sound absorption coefficient of the rest wall surfaces (including walls, ceilings and floors) is 0.3 to 0.5, the sound absorption coefficient of the sofa is 0.6 to 0.8, and the sound absorption coefficient of the tea table is 0.2 to 0.4.
The kitchen environment comprises a cabinet and a wall surface, wherein the sound absorption coefficient of the cabinet is 0.2 to 0.4, and the sound absorption coefficient of the wall surface is 0.3 to 0.5.
The cafe environment includes pillars and tables, the pillars having a sound absorption coefficient of 0.2 to 0.4 and the tables having a sound absorption coefficient of 0.1 to 0.3.
In any of the above target scenes, the sound absorption coefficients of the objects in the target scene may be determined from the value ranges of the sound absorption coefficients, and then the sound absorption amounts of the objects and the wall surfaces in the target scene are calculated, so that the sound absorption amount of the target scene is calculated based on the formula (1).
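A small sketch of formula (1); the surface areas and coefficients below are hypothetical values chosen only for illustration:

```python
def scene_absorption(surfaces, objects):
    """Sound absorption amount Sa of a target scene per formula (1):
    the sum of coefficient x area over the six room surfaces plus the
    absorption amount of each object in the scene."""
    return sum(S * a for S, a in surfaces) + sum(objects)

# Hypothetical bedroom: four walls, ceiling and floor at coefficient
# 0.4, plus carpet, cabinet, glass and bed (coefficient x area each).
surfaces = [(0.4, 12.0)] * 4 + [(0.4, 16.0), (0.4, 16.0)]
objects = [0.9 * 8.0, 0.3 * 4.0, 0.1 * 2.0, 0.8 * 3.0]
print(scene_absorption(surfaces, objects))
```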
The noise data of the target scene refers to noise in the target scene, and the noise may be collected in the actual scene or simulated for a specific scene. Noise conditions under various scenarios include, but are not limited to:
In a bedroom environment: 40 to 60 dB of car noise from outside the window, and noise generated by 1 to 5 people moving or talking in different directions.
In a living room environment: 40 to 60 dB of car noise from outside the window, noise generated by 2 to 10 people moving or talking in different directions, 40 to 50 dB of air conditioner noise, and 40 to 70 dB of television program noise.
In a kitchen environment: 40 to 60 dB of running-water sound, the sound of operating appliances such as a microwave oven, and noise generated by 1 to 5 people moving or talking in different directions.
In a cafe environment: noise generated by 10 to 60 people moving or talking in different directions (with an intensity of about 40 to 70 dB), and 40 to 50 dB of noise from operating machinery.
In each case, the noise signal may be either recorded or simulated.
Optionally, the environmental parameters may further include background music being played to simulate an audio signal recorded in a scene where the background music is played.
S202, converting the source audio signal according to the position of the target sound source, the sound absorption quantity of the target scene and the noise data of the target scene to obtain an audio signal corresponding to the microphone.
The same sound collected by a microphone in different scenes yields different audio signals. A scene influences the audio signal mainly through two factors: the scene's sound absorption and the noise present in the scene. Therefore, the converted audio signal obtained in step S202 by converting the source audio signal with the target sound source position, the sound absorption amount of the target scene and the noise data of the target scene can simulate the audio signal a microphone in the target scene would record of sound generated at the target sound source position.
For example, suppose the audio signal that the smart speaker would record in a bedroom environment, with the sound source 2 meters away, is required. The sound absorption amount of the target scene in step S202 is then the absorption of the bedroom environment, and the noise data of the target scene is the bedroom noise. The per-microphone audio signal obtained by converting the source audio signal according to this absorption, this noise and the target sound source position is equivalent to the audio signal a microphone would actually record under those conditions, and the signal obtained in step S203 by combining the audio signals of all microphones in the smart speaker's array is equivalent to the audio signal the smart speaker would actually record under those conditions.
S203, combining the audio signals corresponding to each microphone to obtain a target audio signal of the microphone array.
The audio signal processing method of this embodiment converts a source audio signal according to environmental parameters and a target sound source position; the converted audio signal simulates the audio signal the smart speaker would record from the target sound source position in the scene corresponding to those environmental parameters. Training the wake-up model with audio signals generated this way effectively widens the range of signals the trained model can recognize, allowing it to accurately identify audio signals recorded in various scenes and further improving its recognition capability.
The method provided by the embodiments of the present application processes a source audio signal with specific parameters (mainly the target sound source position and the environmental parameters), and the processed audio signal simulates the audio signal the smart speaker would record directly under real conditions. Converting a source audio signal according to a target sound source position (and optionally the environmental parameters) is mainly implemented by computing on the source audio signal with the impulse response corresponding to the target sound source position (and the environmental parameters). Referring to fig. 3, the specific process is as follows:
and S301, acquiring a target impulse response of the microphone.
The microphone is any one of the microphones in the microphone array of the intelligent sound box.
The target impulse response is the microphone's impulse response, simulated in advance from the target sound source position and the environmental parameters. For each microphone in the array, an impulse response can be simulated given a target sound source position and a set of environmental parameters; for one microphone, different target sound source positions and environmental parameters yield different impulse responses.
Optionally, in other embodiments of the application, if only the audio signals the smart speaker records at different sound source positions need to be simulated, without simulating different scenes, the environmental parameters in step S301 can be fixed to those of an empty room.
An impulse response is the response a system produces under excitation by an impulse function. In the present application, each microphone of the smart speaker's microphone array is treated as a system, and the sound of a balloon being punctured at a given position approximates an impulse excitation. The audio signal the microphone collects of the balloon pop is then the microphone's response to an impulse emitted from that sound source position in the current scene.
Correspondingly, simulating the impulse response of a microphone for a given target sound source position and a set of environmental parameters is equivalent to simulating an audio signal recorded by the microphone after a balloon is punctured at the target sound source position in a scene corresponding to the set of environmental parameters.
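The patent does not name a simulation algorithm (the image-source method is the usual choice for this task). The sketch below is therefore only a toy stand-in under stated assumptions: it builds an impulse response from the direct-path delay with 1/r attenuation plus an exponentially decaying noise tail whose decay rate grows with the scene's absorption.

```python
import numpy as np

def toy_impulse_response(dist_m, absorption, fs=16000, length_s=0.3):
    """Toy per-microphone room impulse response: a delayed, attenuated
    direct-path impulse followed by a decaying diffuse tail. Not the
    patent's method; a stand-in for a real room simulator."""
    n = int(length_s * fs)
    h = np.zeros(n)
    d = int(dist_m / 343.0 * fs)        # direct-path delay in samples
    h[d] = 1.0 / max(dist_m, 0.1)       # 1/r distance attenuation
    t = np.arange(n - d - 1) / fs
    tail = np.random.randn(n - d - 1) * np.exp(-20.0 * absorption * t)
    h[d + 1:] += 0.1 * tail             # weak reverberant tail
    return h
```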
S302, calculating the convolution of the source audio signal and the target impulse response to obtain a convolution result.
Specifically, step S302 Fourier-transforms the source audio signal and the target impulse response respectively, multiplies the transformed source audio signal by the transformed target impulse response, and then applies an inverse Fourier transform to the product; the result of the inverse transform is the convolution result.
The source audio signal and the target impulse response may be regarded as two time-domain functions, and the convolution result obtained by calculating the convolution of the two functions is also a time-domain function, and this convolution result is equivalent to the audio signal recorded from the target sound source position by the microphone mentioned in step S301 in the scene corresponding to the environmental parameter.
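A direct numpy transcription of this Fourier-transform, multiply, inverse-transform procedure; padding both transforms to the full convolution length makes the circular convolution linear:

```python
import numpy as np

def convolve_fft(source, h):
    """Convolution of the source audio signal with the target impulse
    response, computed as step S302 describes: FFT both, multiply,
    then inverse-FFT the product."""
    n = len(source) + len(h) - 1        # full linear-convolution length
    S = np.fft.rfft(source, n)
    H = np.fft.rfft(h, n)
    return np.fft.irfft(S * H, n)
    # Equivalent one-liner: scipy.signal.fftconvolve(source, h)
```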
In the method provided by this embodiment, for any microphone in the microphone array, the impulse response of the microphone under the specific target sound source position and the environmental parameter may be used to calculate the source audio signal, so as to obtain the audio signal corresponding to the microphone. After the audio signals corresponding to each microphone in the microphone array are calculated by the method, the audio signals can be combined into the target audio signal corresponding to the microphone array.
Another embodiment of the present application further provides an audio signal processing method that, after the target audio signal has been obtained, further processes it using pre-collected user tone data and user speech rate data. This simulates the audio signals a microphone array would obtain by recording different users in different environments, increases the variety of audio signals in the wake-up model's training samples, and further improves the recognition capability of the trained wake-up model.
Referring to fig. 4, the present embodiment includes the following steps:
s401, acquiring a source audio signal, a target sound source position and an environment parameter.
S402, converting the source audio signal according to the target sound source position and the environmental parameter to obtain an audio signal corresponding to the microphone.
Wherein the microphone is each microphone of a microphone array.
S403, combining the audio signals corresponding to each microphone to obtain a target audio signal of the microphone array.
S404, copying the target audio signal to obtain an audio signal set.
The audio signal set consists of multiple copies of the target audio signal; that is, every audio signal in the set is identical to the target audio signal output in step S403.
The number of audio signals in the set equals the number of kinds of pre-collected user tone data: if M kinds of user tone data were collected in advance, the target audio signal is copied M times in step S404, so the audio signal set consists of M audio signals.
The user tone data is classified according to corresponding tones, one tone corresponding to one type of user tone data.
S405, adjusting each audio signal in the audio signal set according to the pre-collected user tone data to obtain a first audio signal set.
Step S405 adjusts each audio signal in the audio signal set with one kind of user tone data, so that the adjusted signal exhibits the tone corresponding to that data. Each kind of user tone data adjusts exactly one of the audio signals, and each audio signal is adjusted by exactly one kind of user tone data. All adjusted audio signals constitute the first audio signal set.
The user tone data is obtained by analyzing a number of pre-recorded audio signals of a specific tone to extract their spectral characteristics. Adjusting an audio signal with user tone data means adjusting the signal's frequency spectrum based on the spectral characteristics recorded in that data, so that the signal's tone is converted to the tone corresponding to the data.
S406, adjusting each audio signal in the first audio signal set according to the pre-collected user speech rate data to obtain a second audio signal set.
The pre-collected user speech rate data has a plurality of categories, and each category corresponds to a speech rate.
Step S406 specifically includes:
suppose there are X kinds of user speech rate data, each of which is used to adjust the speech rate of each audio signal in the first audio signal set, and M adjusted audio signals after the user speech rate data is adjusted are obtained. After all the user speech rate data are adjusted, M X X adjusted audio signals are obtained, and the adjusted audio signals form a second audio signal set. Each adjusted audio signal represents a speech rate corresponding to the user speech rate data for adjustment.
It follows that any two audio signals in the second set differ in at least one characteristic (speech rate and pitch being two characteristics of an audio signal).
Adjusting the speech rate of an audio signal with user speech rate data is similar to the tone adjustment in step S405: spectral characteristics are analyzed in advance, and the frequency spectrum of the audio signal is then adjusted according to the characteristics recorded in the user speech rate data.
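To illustrate how steps S405 and S406 compose into M × X adjusted signals, here is a sketch on a single-channel signal. The patent derives its adjustments from spectral characteristics in pre-collected user data; librosa's generic pitch-shift and time-stretch are used below purely as stand-ins (an assumption, the patent names no library).

```python
import librosa

def make_variants(signal, sr, pitch_steps, speed_rates):
    """Produce M x X adjusted copies of a single-channel signal:
    M pitch variants (stand-in for user tone data), each further
    adjusted at X speech rates (stand-in for user speech rate data),
    so any two results differ in at least one characteristic."""
    first_set = [librosa.effects.pitch_shift(signal, sr=sr, n_steps=s)
                 for s in pitch_steps]                 # M signals
    return [librosa.effects.time_stretch(y, rate=r)   # M x X signals
            for y in first_set for r in speed_rates]
```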
Optionally, the steps S405 and S406 may be selected and combined according to actual use requirements. In addition, the execution sequence of the steps is not limited to the sequence described in the present embodiment, and can be arbitrarily adjusted.
Optionally, in order to improve the quality of the audio signal for training the wake-up model, each audio signal in the second set of audio signals may be processed as follows:
First, signal gain adjustment is performed on the audio signal; acoustic echo cancellation (AEC) is then applied to the adjusted signal; beamforming is performed on the echo-cancelled signal; noise suppression and automatic gain control are applied to the beamformed signal; and the resulting audio signal is input to the wake-up model.
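A skeleton of this chain under stated assumptions: gain adjustment as a plain scale, AEC and noise suppression left as commented placeholders (real implementations need a loudspeaker reference signal and a noise model), and beamforming as simple delay-and-sum.

```python
import numpy as np

def delay_and_sum(chans, delays):
    """Simple delay-and-sum beamformer: advance each channel by its
    (rounded) steering delay in samples and average the channels."""
    out = np.zeros(chans.shape[1])
    for ch, d in zip(chans, delays):
        d = int(round(d))
        out[:len(ch) - d] += ch[d:]
    return out / len(chans)

def auto_gain(x, target_rms=0.1):
    """Crude automatic gain control: scale to a target RMS level."""
    rms = np.sqrt(np.mean(x ** 2))
    return x * (target_rms / rms) if rms > 0 else x

def prepare_for_wake_model(multichannel, delays):
    x = multichannel * 0.9                 # 1) signal gain adjustment
    # 2) acoustic echo cancellation (AEC) would run here; it needs
    #    the loudspeaker reference signal, omitted in this sketch
    y = delay_and_sum(x, delays)           # 3) beamforming
    # 4) noise suppression would run here, e.g. spectral subtraction
    return auto_gain(y)                    # 5) automatic gain control
```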
Referring to fig. 5, an embodiment of the present application further provides an audio signal processing apparatus, which includes:
an obtaining unit 501 is configured to obtain a pre-recorded source audio signal and a preset target sound source position.
A conversion unit 502, configured to convert the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone; wherein the microphone is each microphone of a microphone array.
A combining unit 503, configured to combine the audio signals corresponding to each microphone to obtain a target audio signal of the microphone array. The target audio signal serves as the audio signal the microphone array would obtain by recording the target sound, i.e. the sound emitted from the target sound source position, and is used to train the wake-up model of the smart speaker equipped with the microphone array.
Specifically, the conversion unit 502 includes:
a sub-acquisition unit, configured to acquire the impulse response corresponding to the target sound source position, where that impulse response is generated in advance according to the target sound source position;
And the computing unit is used for computing the source audio signal according to the impulse response corresponding to the target sound source position to obtain an audio signal corresponding to the microphone.
The process of calculating the audio signal corresponding to the microphone according to the impulse response and the source audio signal may refer to the embodiment corresponding to fig. 3.
Optionally, when the audio signals in different scenes need to be simulated, the impulse response obtained by the sub-obtaining unit may be an impulse response obtained by simulation according to a given target sound source position and an environmental parameter. Wherein the environmental parameters include an amount of sound absorption of the target scene and noise data of the target scene.
Optionally, the obtaining unit 501 is further configured to:
acquiring the preset sound absorption quantity of a target scene and noise data of the target scene;
the conversion unit 502 is configured to:
converting the source audio signal according to the target sound source position, the sound absorption quantity of the target scene and the noise data of the target scene to obtain an audio signal corresponding to a microphone;
the target audio signal is used as an audio signal obtained by recording a target sound by the microphone array in the target scene, and is used for training a wake-up model of the intelligent sound box provided with the microphone array.
Optionally, the processing apparatus further includes an analog unit 504, configured to:
copying the target audio signal to obtain a plurality of copies of the target audio signal;
adjusting the tone of the copy of each target audio signal according to pre-collected user tone data to obtain a plurality of adjusted audio signals;
the tone of each adjusted audio signal is unique, and the plurality of adjusted audio signals are used as the microphone array, are recorded respectively to a plurality of users to obtain audio signals and are used for training the awakening model of the intelligent sound box provided with the microphone array.
Optionally, the analog unit 504 is further configured to:
copying the target audio signal to obtain a plurality of copies of the target audio signal;
adjusting the speech rate of the copy of each target audio signal according to pre-collected user speech rate data to obtain a plurality of adjusted audio signals;
the voice speed of each adjusted audio signal is unique, the adjusted audio signals serve as the microphone array, the audio signals obtained by recording the adjusted audio signals are recorded for a plurality of users respectively, and the voice speed of each adjusted audio signal is used for training the awakening model of the intelligent sound box provided with the microphone array.
It should be noted that the adjustments the analog unit 504 applies to the target audio signal, as described in the audio signal processing methods of the embodiments above, may be combined as needed.
For the audio signal processing apparatus provided in the embodiment of the present application, specific working principles refer to the audio signal processing method in the embodiment of the present application, and details are not repeated here.
The invention provides an audio signal processing apparatus. The acquisition unit 501 acquires a pre-recorded source audio signal and a preset target sound source position; the conversion unit 502 converts the source audio signal according to the target sound source position to obtain an audio signal corresponding to each microphone of a microphone array; and the combining unit 503 combines the audio signals corresponding to the microphones into a target audio signal of the microphone array. The target audio signal serves as the audio signal the microphone array would obtain by recording the target sound, i.e. the sound emitted from the target sound source position, and is used to train the wake-up model of the smart speaker equipped with the microphone array. The scheme can simulate, from a pre-recorded audio signal, the audio signal the microphone array would record from any sound source position, without frequently moving the sound source and re-recording at each position, which effectively reduces the time required to train the smart speaker's wake-up model.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of processing an audio signal, comprising:
acquiring a pre-recorded source audio signal and a preset target sound source position;
converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone; wherein the microphone is each microphone of an array of microphones;
and combining the audio signals corresponding to each microphone to obtain a target audio signal of the microphone array.
2. The processing method of claim 1, wherein after the combining the audio signals corresponding to each of the microphones to obtain the target audio signal of the microphone array, the processing method further comprises:
and training a wake-up model of the intelligent sound box provided with the microphone array by using the target audio signal.
3. The processing method according to claim 1, wherein the converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone comprises:
obtaining impulse response of the microphone; wherein the impulse response is generated in advance according to the target sound source position;
and calculating the source audio signal according to the impulse response corresponding to the target sound source position to obtain an audio signal corresponding to a microphone.
4. The processing method according to claim 1, wherein before converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone, the method further comprises:
acquiring the preset sound absorption quantity of a target scene and noise data of the target scene;
wherein, the converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone includes:
and converting the source audio signal according to the target sound source position, the sound absorption quantity of the target scene and the noise data of the target scene to obtain an audio signal corresponding to a microphone.
5. The processing method as claimed in any one of claims 1 to 4, wherein after the combining the audio signal corresponding to each of the microphones to obtain the target audio signal of the microphone array, the method further comprises:
copying the target audio signal to obtain a plurality of copies of the target audio signal;
adjusting the tone of the copy of each target audio signal according to pre-collected user tone data to obtain a plurality of adjusted audio signals;
wherein the pitch of each of the adjusted audio signals is unique.
6. The processing method as claimed in any one of claims 1 to 4, wherein after the combining the audio signal corresponding to each of the microphones to obtain the target audio signal of the microphone array, the method further comprises:
copying the target audio signal to obtain a plurality of copies of the target audio signal;
adjusting the speech rate of the copy of each target audio signal according to pre-collected user speech rate data to obtain a plurality of adjusted audio signals;
wherein the speech rate of each of the adjusted audio signals is unique.
7. An apparatus for processing an audio signal, comprising:
an acquisition unit, configured to acquire a pre-recorded source audio signal and a preset target sound source position;
the conversion unit is used for converting the source audio signal according to the target sound source position to obtain an audio signal corresponding to a microphone; wherein the microphone is each microphone of an array of microphones;
and the combination unit is used for combining the audio signals corresponding to the microphones to obtain the target audio signals of the microphone array.
8. The processing apparatus according to claim 7, wherein the conversion unit comprises:
a sub-acquisition unit, configured to acquire an impulse response of the microphone, wherein the impulse response of the microphone is generated in advance according to the target sound source position;
and a calculation unit, configured to compute, from the source audio signal and the impulse response corresponding to the target sound source position, the audio signal corresponding to the microphone.
9. The processing apparatus according to claim 7, wherein the acquisition unit is further configured to acquire a preset sound absorption amount of a target scene and noise data of the target scene;
and the conversion unit is configured to convert the source audio signal according to the target sound source position, the sound absorption amount of the target scene, and the noise data of the target scene to obtain the audio signal corresponding to the microphone.
10. The processing apparatus according to any one of claims 7 to 9, further comprising:
a simulation unit, configured to copy the target audio signal to obtain a plurality of copies of the target audio signal, and to adjust the pitch of each copy of the target audio signal according to pre-collected user pitch data to obtain a plurality of adjusted audio signals;
wherein the pitch of each adjusted audio signal is unique.
CN201910777904.8A 2019-08-22 2019-08-22 Audio signal processing method and device Pending CN112489667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910777904.8A CN112489667A (en) 2019-08-22 2019-08-22 Audio signal processing method and device

Publications (1)

Publication Number Publication Date
CN112489667A 2021-03-12

Family

ID=74920132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910777904.8A Pending CN112489667A (en) 2019-08-22 2019-08-22 Audio signal processing method and device

Country Status (1)

Country Link
CN (1) CN112489667A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101478711A (en) * 2008-12-29 2009-07-08 北京中星微电子有限公司 Method for controlling microphone sound recording, digital audio signal processing method and apparatus
CN102013252A (en) * 2010-10-27 2011-04-13 华为终端有限公司 Sound effect adjusting method and sound playing device
CN102918466A (en) * 2010-04-01 2013-02-06 视瑞尔技术公司 Method and device for encoding three-dimensional scenes which include transparent objects in a holographic system
CN107193386A (en) * 2017-06-29 2017-09-22 联想(北京)有限公司 Acoustic signal processing method and electronic equipment
CN108242234A (en) * 2018-01-10 2018-07-03 腾讯科技(深圳)有限公司 Speech recognition modeling generation method and its equipment, storage medium, electronic equipment
US20190200156A1 (en) * 2017-12-21 2019-06-27 Verizon Patent And Licensing Inc. Methods and Systems for Simulating Microphone Capture Within a Capture Zone of a Real-World Scene
CN110049408A (en) * 2019-05-10 2019-07-23 苏州静声泰科技有限公司 A kind of microphone speaker array formation optimization method


Similar Documents

Publication Publication Date Title
Christensen et al. The CHiME corpus: a resource and a challenge for computational hearing in multisource environments
JP2019159306A (en) Far-field voice control device and far-field voice control system
JP2022542387A (en) Managing playback of multiple audio streams through multiple speakers
KR20220044204A (en) Acoustic Echo Cancellation Control for Distributed Audio Devices
Bertin et al. VoiceHome-2, an extended corpus for multichannel speech processing in real homes
KR102633176B1 (en) Methods for reducing errors in environmental noise compensation systems
US20240177726A1 (en) Speech enhancement
CN110072177A (en) Space division information acquisition methods, device and storage medium
Kirsch et al. Computationally-efficient simulation of late reverberation for inhomogeneous boundary conditions and coupled rooms
Morales et al. Receiver placement for speech enhancement using sound propagation optimization
CN110475181A (en) Equipment configuration method, device, equipment and storage medium
JPWO2018193826A1 (en) Information processing device, information processing method, audio output device, and audio output method
WO2023246327A1 (en) Audio signal processing method and apparatus, and computer device
WO2023051622A1 (en) Method for improving far-field speech interaction performance, and far-field speech interaction system
CN112489667A (en) Audio signal processing method and device
CN113782002B (en) Speech recognition testing method and system based on reverberation simulation
CN117643075A (en) Data augmentation for speech enhancement
CN212659307U Speech recognition test device for a smart speaker
Panek et al. Challenges in adopting speech control for assistive robots
Shi et al. Automatic gain control for parametric array loudspeakers
US20240114309A1 (en) Progressive calculation and application of rendering configurations for dynamic applications
RU2818982C2 (en) Acoustic echo cancellation control for distributed audio devices
Eklund et al. Noise, Device and Room Robustness Methods for Pronunciation Error Detection
EP4002889A1 (en) Method for determining a sound field
Castro et al. Walk-through auralization framework for virtual reality environments powered by game engine architectures, Part II

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination