CN110910893B

CN110910893B - Audio processing method, device and storage medium

Info

Publication number: CN110910893B
Application number: CN201911174201.2A
Authority: CN
Inventors: 张巍耀; 任伟; 张新成
Original assignee: Beijing Wutong Chelian Technology Co Ltd
Current assignee: Beijing Wutong Chelian Technology Co Ltd
Priority date: 2019-11-26
Filing date: 2019-11-26
Publication date: 2022-07-22
Anticipated expiration: 2039-11-26
Also published as: CN110910893A

Abstract

The application discloses an audio processing method and device, and belongs to the technical field of audio processing. Time-frequency data can well represent the sound characteristics of a sound source. Therefore, when the quality of two Mics differs greatly, after a first audio signal and a second audio signal collected by the two Mics from the same sound source are acquired, first time-frequency data can be determined according to the first audio signal and second time-frequency data can be determined according to the second audio signal. Third time-frequency data can then be obtained by fitting the first time-frequency data and the second time-frequency data, and a third audio signal can be obtained according to the third time-frequency data. In this way, the third audio signal combines the characteristics of the first audio signal and the second audio signal. Compared with the audio signal collected by the lower-quality Mic, its signal quality is better and more stable, which helps improve the recognition accuracy of subsequent speech recognition.

Description

Audio processing method, device and storage medium

Technical Field

The present application relates to the field of audio processing technologies, and in particular, to an audio processing method, an audio processing apparatus, and a storage medium.

Background

At present, in a dual-sound-zone speech recognition scheme, two microphones (Mics) at the front end each collect one channel of audio signal. For example, two Mics are arranged at the position of the ceiling lamp of an automobile; when the driver speaks, the two Mics collect two channels of audio signals, and the front end then sends the collected audio signals to a speech recognition module at the back end for speech recognition.

Usually, the two Mics are located at different positions, that is, in different sound zones. When audio signals are collected, it can be determined from the audio signals which Mic the sound source is closer to, and the audio signal collected by the Mic closer to the sound source is amplified and then sent to the speech recognition module for speech recognition. In practice, however, the quality of the two Mics may differ, so the quality of the audio signals they collect also differs. In this case, when the sound source is closer to the lower-quality Mic, the poor-quality audio signal collected by that Mic is sent to the speech recognition module, which lowers the recognition accuracy. In other words, when the quality of the two Mics differs, the recognition performance of the two sound zones differs considerably. It is therefore highly desirable to provide an audio signal processing scheme that ensures the quality of the audio signal and thereby the effectiveness of speech recognition.

Disclosure of Invention

The embodiments of the application provide an audio processing method, an audio processing device, and a storage medium, which can solve the problem that, in a dual-sound-zone speech recognition scheme, the recognition rate is low because a low-quality audio signal is collected when the Mic closer to the sound source is of poor quality. The technical solutions are as follows:

In one aspect, an audio processing method is provided, and the method includes:

acquiring a first audio signal acquired by first audio acquisition equipment and a second audio signal acquired by second audio acquisition equipment, wherein the first audio signal and the second audio signal are acquired from the same sound source in the same time period;

determining first time-frequency data according to the first audio signal, and determining second time-frequency data according to the second audio signal;

fitting the first time-frequency data and the second time-frequency data to obtain third time-frequency data;

and generating a third audio signal according to the third time-frequency data.

Optionally, the fitting the first time-frequency data and the second time-frequency data to obtain third time-frequency data includes:

determining a first fitted time-frequency curve according to the first time-frequency data and the second time-frequency data, wherein the first fitted time-frequency curve is used for indicating the relation between time and frequency of the third audio signal;

and determining the third time frequency data according to the first fitted time frequency curve.

Optionally, the determining a first fitted time-frequency curve according to the first time-frequency data and the second time-frequency data includes:

determining a plurality of second fitting parameters according to the first time frequency data, the second time frequency data and the plurality of first fitting parameters;

generating a second fitting time-frequency curve according to the plurality of second fitting parameters;

generating a fourth audio signal according to the second fitted time-frequency curve;

acquiring the identification accuracy of the fourth audio signal;

and if the identification accuracy is smaller than an identification rate threshold value, adjusting the plurality of first fitting parameters and returning to the step of determining a plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data and the plurality of first fitting parameters, until the identification accuracy is not smaller than the identification rate threshold value, and then taking the second fitting time-frequency curve generated from the most recently determined plurality of second fitting parameters as the first fitting time-frequency curve.

Optionally, the first time-frequency data includes a plurality of first time points and a plurality of first frequency values, the plurality of first time points and the plurality of first frequency values are in one-to-one correspondence, the second time-frequency data includes a plurality of second time points and a plurality of second frequency values, and the plurality of second time points and the plurality of second frequency values are in one-to-one correspondence;

determining a plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data and the plurality of first fitting parameters, including:

determining a plurality of third time points and a plurality of third frequency values according to the plurality of first time points, the plurality of first frequency values, the plurality of second time points, the plurality of second frequency values and the plurality of first fitting parameters, wherein the plurality of third time points are a union set of the plurality of first time points and the plurality of second time points, and the plurality of third time points and the plurality of third frequency values are in one-to-one correspondence;

and determining the plurality of second fitting parameters according to the third frequency value corresponding to each third time point, the first frequency value corresponding to each first time point and the second frequency value corresponding to each second time point.

Optionally, the determining first time-frequency data according to the first audio signal and determining second time-frequency data according to the second audio signal includes:

performing Fourier transform on the first audio signal to obtain the first time frequency data;

and carrying out Fourier transform on the second audio signal to obtain the second time-frequency data.

In another aspect, an audio processing apparatus is provided, the apparatus comprising:

an acquisition module, configured to acquire a first audio signal acquired by first audio acquisition equipment and a second audio signal acquired by second audio acquisition equipment, where the first audio signal and the second audio signal are signals acquired from the same sound source in the same time period;

a determining module, configured to determine first time-frequency data according to the first audio signal, and determine second time-frequency data according to the second audio signal;

the fitting module is used for fitting the first time-frequency data and the second time-frequency data to obtain third time-frequency data;

and the generating module is used for generating a third audio signal according to the third time-frequency data.

Optionally, the fitting module comprises:

a first determining unit, configured to determine a first fitted time-frequency curve according to the first time-frequency data and the second time-frequency data, where the first fitted time-frequency curve is used to indicate a relationship between time and frequency of the third audio signal;

and the second determining unit is used for determining the third time-frequency data according to the first fitted time-frequency curve.

Optionally, the first determining unit includes:

the first determining subunit is configured to determine a plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data, and the plurality of first fitting parameters;

the first generating subunit is configured to generate a second fitting time-frequency curve according to the plurality of second fitting parameters;

the second generating subunit is used for generating a fourth audio signal according to the second fitting time-frequency curve;

an obtaining subunit, configured to obtain an identification accuracy of the fourth audio signal;

and the second determining subunit is configured to, if the identification accuracy is smaller than an identification rate threshold, adjust the plurality of first fitting parameters, return to the step of determining the plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data, and the plurality of first fitting parameters, and until the identification accuracy is not smaller than the identification rate threshold, use a second fitting time-frequency curve obtained by fitting the plurality of second fitting parameters determined last time as the first fitting time-frequency curve.

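Optionally, the first time-frequency data includes a plurality of first time points and a plurality of first frequency values in one-to-one correspondence, the second time-frequency data includes a plurality of second time points and a plurality of second frequency values in one-to-one correspondence, and the first determining subunit is specifically configured to:

determine a plurality of third time points and a plurality of third frequency values according to the plurality of first time points, the plurality of first frequency values, the plurality of second time points, the plurality of second frequency values and the plurality of first fitting parameters, where the plurality of third time points are a union of the plurality of first time points and the plurality of second time points, and the plurality of third time points and the plurality of third frequency values are in one-to-one correspondence;

and determine the plurality of second fitting parameters according to the third frequency value corresponding to each third time point, the first frequency value corresponding to each first time point and the second frequency value corresponding to each second time point.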

Optionally, the determining module includes:

the first transform unit is used for carrying out Fourier transform on the first audio signal to obtain the first time-frequency data;

and the second transform unit is used for carrying out Fourier transform on the second audio signal to obtain the second time-frequency data.

In another aspect, an audio processing device is provided, the audio processing device comprising a processor, a communication interface, a memory, and a communication bus;

the processor, the communication interface and the memory are communicated with each other through the communication bus;

the memory is used for storing computer programs;

the processor is used for executing the program stored on the memory so as to realize the audio processing method.

In another aspect, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the audio processing method as provided above.

The beneficial effects of the technical solutions provided in the embodiments of the present application include at least the following:

In the embodiments of the application, time-frequency data can well represent the sound characteristics of a sound source. Therefore, when the quality of two Mics differs greatly, after the first audio signal and the second audio signal collected by the two Mics from the same sound source are acquired, the first time-frequency data can be determined according to the first audio signal and the second time-frequency data can be determined according to the second audio signal. The third time-frequency data is then obtained by fitting the first time-frequency data and the second time-frequency data, and the third audio signal is obtained according to the third time-frequency data. In this way, the third audio signal combines the characteristics of the first audio signal and the second audio signal. Compared with the audio signal collected by the lower-quality Mic, its signal quality is better and more stable, which helps improve the recognition accuracy of subsequent speech recognition.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a system architecture diagram according to an audio processing method provided in an embodiment of the present application;

Fig. 2 is a flowchart of an audio processing method provided in an embodiment of the present application;

Fig. 3 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, the following detailed description of the embodiments of the present application will be made with reference to the accompanying drawings.

Before explaining the embodiments of the present application in detail, an application scenario related to the embodiments of the present application will be described.

At present, in a dual-sound-zone speech recognition scheme, the two Mics arranged in the two sound zones each collect one channel of audio signal. For example, two Mics may be arranged at the position of the ceiling lamp of an automobile, and when the driver speaks, the two Mics collect two channels of audio signals. As another example, when two Mics are arranged at different positions in a room and someone speaks in the room, the two Mics collect two channels of audio signals. The two channels of audio signals can be processed according to the audio processing method provided in the embodiments of the application to obtain a higher-quality audio signal, thereby improving the recognition accuracy of subsequent speech recognition.

Next, a system architecture related to the audio processing method provided by the embodiment of the present application is described.

Fig. 1 is a system architecture diagram of an audio processing method according to an embodiment of the present application. As shown in fig. 1, the system architecture includes a first audio capture device 101, a second audio capture device 102, an audio processing device 103, and a speech recognition device 104. Any two devices of the first audio acquisition device 101, the audio processing device 103 and the voice recognition device 104 can be connected in a wireless or wired manner to perform communication, and any two devices of the second audio acquisition device 102, the audio processing device 103 and the voice recognition device 104 can also be connected in a wireless or wired manner to perform communication.

The first audio collecting device 101 is configured to collect a first audio signal and send the collected first audio signal to the audio processing device 103. The second audio collecting device 102 is configured to collect a second audio signal and send the collected second audio signal to the audio processing device 103. The first audio capturing device 101 and the second audio capturing device 102 may be two devices disposed in two sound zones, and the first audio signal and the second audio signal are signals captured from the same sound source in the same time period.

The audio processing device 103 may be configured to receive a first audio signal sent by the first audio collecting device 101 and a second audio signal sent by the second audio collecting device 102, and may process the first audio signal and the second audio signal according to the audio processing method provided in the embodiment of the present application to obtain a third audio signal.

The voice recognition device 104 may be configured to receive the third audio signal sent by the audio processing device 103 and perform voice recognition on the third audio signal to obtain a recognized text, a recognition accuracy, and the like. In addition, the voice recognition device 104 may receive the fourth audio signal, perform voice recognition on the fourth audio signal, and send the resulting recognition accuracy to the audio processing device 103, so that the audio processing device 103 can further adjust the fourth audio signal according to the recognition accuracy and obtain a third audio signal of higher quality.

In the embodiment of the present application, both the first audio capturing device 101 and the second audio capturing device 102 may be Mic, and may also be other devices with audio capturing functions. The audio processing device 103 may be a mobile phone, a computer, an intelligent sound box, an intelligent television, an intelligent bracelet, or other devices with an audio processing function. The voice recognition device 104 may be a mobile phone, a computer, an intelligent sound box, an intelligent television, an intelligent bracelet, or other devices with a voice recognition function, which is not limited in the embodiment of the present application.

It should be noted that the speech recognition device 104 may also be integrated in the audio processing device 103, in which case the audio processing device 103 may include a speech recognition module, so that the audio processing device 103 may also have a speech recognition function.

Next, an audio processing method provided in an embodiment of the present application will be described.

Fig. 2 is a flowchart of an audio processing method provided in an embodiment of the present application. The method may be applied to the audio processing device shown in fig. 1. As shown in fig. 2, the method comprises the following steps:

Step 201: A first audio signal collected by a first audio collecting device and a second audio signal collected by a second audio collecting device are acquired, where the first audio signal and the second audio signal are signals collected from the same sound source in the same time period.

In this embodiment of the application, when a sound source makes a sound, the first audio collecting device and the second audio collecting device disposed in different sound zones may collect the sound at the same time, so as to obtain the first audio signal and the second audio signal. The first audio signal and the second audio signal are therefore audio signals obtained by collecting the sound emitted by the same sound source in the same time period. The audio processing device may receive the first audio signal collected by the first audio collecting device and the second audio signal collected by the second audio collecting device, that is, the audio processing device obtains the first audio signal and the second audio signal.

It should be noted that both the first audio signal and the second audio signal may be PCM (Pulse Code Modulation) signals, that is, the first audio signal and the second audio signal may be discrete time domain data.
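As an illustration of what "discrete time domain data" means for a PCM signal, the following minimal sketch decodes raw PCM bytes into a list of sample values. The source does not state the bit depth, byte order, or channel count, so 16-bit signed little-endian mono is assumed here, and the function name `decode_pcm16` is hypothetical.

```python
import struct


def decode_pcm16(raw):
    """Decode signed 16-bit little-endian mono PCM bytes into floats in [-1.0, 1.0).

    Assumption: the sample format is not given in the source; 16-bit mono is
    chosen only for illustration.
    """
    count = len(raw) // 2  # two bytes per sample; a trailing odd byte is ignored
    ints = struct.unpack("<%dh" % count, raw[: count * 2])
    return [value / 32768.0 for value in ints]


# Four hand-made samples stand in for a captured audio signal.
raw = struct.pack("<4h", 0, 16384, -16384, 32767)
samples = decode_pcm16(raw)
```

Each element of `samples` is the amplitude at one sampling instant, which is the form the later framing and Fourier transform steps operate on.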

Step 202: first time-frequency data is determined according to the first audio signal, and second time-frequency data is determined according to the second audio signal.

Because time-frequency data can well represent the sound characteristics of a sound source, in the embodiment of the application the audio processing device may determine the first time-frequency data according to the first audio signal and the second time-frequency data according to the second audio signal. The first time-frequency data and the second time-frequency data each represent part of the characteristics of the sound source, and they can subsequently be combined to obtain third time-frequency data that better represents the sound characteristics of the sound source.

In this embodiment, the audio processing device may perform a Fourier transform on the first audio signal to obtain the first time-frequency data, and perform a Fourier transform on the second audio signal to obtain the second time-frequency data.

Optionally, in this embodiment of the application, before performing the Fourier transform, the audio processing device may perform framing processing on the first audio signal and the second audio signal. That is, it may segment the first audio signal according to a first inter-frame distance to obtain first time domain data of the first audio signal within each first inter-frame distance, and segment the second audio signal according to a second inter-frame distance to obtain second time domain data of the second audio signal within each second inter-frame distance.

Wherein the first inter-frame distance and the second inter-frame distance may be the same or different. For example, the first inter-frame distance and the second inter-frame distance may be both 25ms or other values, or the first inter-frame distance may be 20ms, and the second inter-frame distance may be 25ms, which is not limited in this embodiment.
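The framing step can be sketched as below, assuming non-overlapping frames and a known sample rate (the source specifies neither; the 16 kHz rate and the name `split_into_frames` are illustrative assumptions). The two signals are framed independently, so the first and second inter-frame distances may differ.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Cut samples into consecutive, non-overlapping frames of frame_ms milliseconds.

    Returns a list of (start_time_in_seconds, frame_samples) pairs. A trailing
    partial frame is dropped, which is an assumption of this sketch.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame_ms is too short for this sample rate")
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frames.append((start / sample_rate_hz, samples[start:start + frame_len]))
    return frames


signal = [0.0] * 16000  # one second of silence at an assumed 16 kHz
first_frames = split_into_frames(signal, 16000, 20)   # first inter-frame distance: 20 ms
second_frames = split_into_frames(signal, 16000, 25)  # second inter-frame distance: 25 ms
```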

It should be noted that, because the first audio signal and the second audio signal in the embodiment of the present application are both discrete time domain data, the audio processing device may perform a Fourier transform on the time domain data obtained after the framing processing by using the discrete Fourier transform formula (1).

$$F(k) = \sum_{m=0}^{M-1} x(m)\, e^{-j 2\pi k m / M}, \quad k = 0, 1, \dots, M-1 \tag{1}$$

Where x(m) represents the time domain data, M represents the number of time domain sample points, and F(k) represents the frequency domain data.
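For reference, the discrete Fourier transform of formula (1) can be evaluated directly from its definition. The sketch below is a plain O(M²) evaluation for clarity, with the usual sign and normalization conventions assumed; a practical system would more likely use an FFT routine.

```python
import cmath
import math


def dft(x):
    """Direct evaluation of the discrete Fourier transform of a frame x.

    x holds M time domain samples; the result holds M complex frequency
    domain values F(0) .. F(M-1).
    """
    M = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * math.pi * k * m / M) for m in range(M))
        for k in range(M)
    ]


# A cosine that completes exactly 2 cycles over an 8-sample frame.
M = 8
x = [math.cos(2 * math.pi * 2 * m / M) for m in range(M)]
F = dft(x)
magnitudes = [abs(value) for value in F]
```

For this input the energy is concentrated in bin 2 and its mirror bin 6, which is the property the method relies on when it picks the frequency value with the highest energy in each frame.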

In this embodiment, the audio processing device may perform a Fourier transform on each piece of first time domain data according to formula (1) to obtain the frequency domain data within each first inter-frame distance, where the frequency domain data includes frequency values and the corresponding frequency energy. Because each first inter-frame distance is short, the frequency of the audio signal is relatively stable within it. Based on this, the audio processing device may use the frequency value with the highest frequency energy in the frequency domain data of each first inter-frame distance as the first frequency value corresponding to each of the plurality of first time points included in that first inter-frame distance. In this way, across the plurality of first inter-frame distances, the first frequency value corresponding to each of the plurality of first time points is obtained, that is, the first time-frequency data. The first time-frequency data includes a plurality of first time points and a plurality of first frequency values in one-to-one correspondence, and can be used to represent the relationship between the time and the frequency of the first audio signal.

The audio processing device may likewise perform a Fourier transform on each piece of second time domain data according to formula (1) to obtain the frequency domain data within each second inter-frame distance. The audio processing device may then use the frequency value with the highest frequency energy in the frequency domain data of each second inter-frame distance as the second frequency value corresponding to each of the plurality of second time points included in that second inter-frame distance. In this way, across the plurality of second inter-frame distances, the second frequency value corresponding to each of the plurality of second time points is obtained, that is, the second time-frequency data. The second time-frequency data includes a plurality of second time points and a plurality of second frequency values in one-to-one correspondence, and can be used to represent the relationship between the time and the frequency of the second audio signal.
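The peak picking described above for both signals can be sketched as follows. The sketch assumes real-valued samples, takes energy as the squared magnitude of each bin, and searches only bins 0 to M/2 so that the mirrored half of the spectrum is ignored; these choices, the 800 Hz sample rate, and the function names are illustrative assumptions rather than details from the source.

```python
import cmath
import math


def dominant_frequency(frame, sample_rate_hz):
    """Return the frequency (Hz) of the bin with the highest energy in one frame."""
    M = len(frame)
    best_k, best_energy = 0, -1.0
    for k in range(M // 2 + 1):  # real input: the upper half mirrors the lower half
        value = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / M) for m in range(M))
        energy = abs(value) ** 2
        if energy > best_energy:
            best_k, best_energy = k, energy
    return best_k * sample_rate_hz / M


def time_frequency_data(samples, sample_rate_hz, frame_len):
    """Assign each sample time point the dominant frequency of its frame."""
    points = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        freq = dominant_frequency(samples[start:start + frame_len], sample_rate_hz)
        for m in range(start, start + frame_len):
            points.append((m / sample_rate_hz, freq))
    return points


rate = 800  # assumed sample rate, chosen so that 100 Hz falls exactly on a bin
tone = [math.sin(2 * math.pi * 100 * m / rate) for m in range(32)]
first_tf = time_frequency_data(tone, rate, 16)  # two 20 ms frames
```

Every time point inside a frame receives the same frequency value, matching the assumption that the frequency is stable within one inter-frame distance.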

Step 203: and fitting the first time-frequency data and the second time-frequency data to obtain third time-frequency data.

In this embodiment, after the audio processing device determines the first time-frequency data and the second time-frequency data, it may fit the first time-frequency data and the second time-frequency data to obtain the third time-frequency data.

Specifically, the audio processing device may determine a first fitted time-frequency curve according to the first time-frequency data and the second time-frequency data, where the first fitted time-frequency curve indicates the relationship between the time and the frequency of the third audio signal. The audio processing device may then determine the third time-frequency data according to the first fitted time-frequency curve.

In one possible case, the audio processing device may determine a plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data, and a plurality of first fitting parameters; generate a second fitted time-frequency curve according to the plurality of second fitting parameters; generate a fourth audio signal according to the second fitted time-frequency curve; and acquire the recognition accuracy of the fourth audio signal. If the recognition accuracy is smaller than the recognition rate threshold, the audio processing device adjusts the plurality of first fitting parameters and returns to the step of determining the plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data, and the plurality of first fitting parameters. Once the recognition accuracy is not smaller than the recognition rate threshold, the second fitted time-frequency curve generated from the most recently determined second fitting parameters is used as the first fitted time-frequency curve. The plurality of first fitting parameters are initialized fitting parameters.

As can be seen from the foregoing, the first time-frequency data includes a plurality of first time points and a plurality of first frequency values in one-to-one correspondence, and the second time-frequency data includes a plurality of second time points and a plurality of second frequency values in one-to-one correspondence. Based on this, the audio processing device may first determine a plurality of third time points and a plurality of third frequency values, where the plurality of third time points are the union of the plurality of first time points and the plurality of second time points, and the plurality of third time points and the plurality of third frequency values are in one-to-one correspondence. The audio processing device may then determine the plurality of second fitting parameters according to the third frequency value corresponding to each third time point, the first frequency value corresponding to each first time point, and the second frequency value corresponding to each second time point.

It should be noted that, since the sampling periods of the first audio capturing device and the second audio capturing device may be the same or different, the plurality of first time points may be identical to or different from the plurality of second time points. Based on this, in the embodiment of the present application, the audio processing device may take the union of the plurality of first time points and the plurality of second time points to obtain the plurality of third time points. For any one of the third time points, the audio processing device may determine the third frequency value corresponding to that third time point according to the first fitting parameters and the third time point by using the following formula (2).

$$P(x) = a_0 + a_1 x + \dots + a_n x^n \tag{2}$$

Where $a_0, a_1, \dots, a_n$ are the weight parameters among the plurality of first fitting parameters, n is the fitting order among the plurality of first fitting parameters, and P(x) is the third frequency value corresponding to the third time point x obtained after fitting.
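The union of time points and the evaluation of formula (2) can be sketched as follows. The time points and the initial weight values are made-up illustration data, and the function names are hypothetical. Exact floating-point equality is used to merge time points here; real sample times would probably need rounding to a common resolution first.

```python
def union_time_points(first_times, second_times):
    """Third time points: the sorted union of the first and second time points."""
    return sorted(set(first_times) | set(second_times))


def evaluate_polynomial(coefficients, x):
    """Evaluate P(x) = a0 + a1*x + ... + an*x**n using Horner's rule.

    coefficients is [a0, a1, ..., an], so the fitting order n is
    len(coefficients) - 1.
    """
    result = 0.0
    for a in reversed(coefficients):
        result = result * x + a
    return result


first_times = [0.0, 0.5, 1.0]
second_times = [0.0, 0.25, 0.5, 0.75, 1.0]
third_times = union_time_points(first_times, second_times)

first_fitting_parameters = [100.0, 20.0, -4.0]  # illustrative a0, a1, a2 (n = 2)
third_freqs = [evaluate_polynomial(first_fitting_parameters, t) for t in third_times]
```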

After obtaining the third frequency value corresponding to each third time point, the audio processing device may iteratively adjust the weight parameters $a_0, a_1, \dots, a_n$ according to the least squares method, that is, on the principle of minimizing the sum of squared errors between the fitted data and the actual data. Specifically, the square of the error between each third frequency value and the corresponding first frequency value and/or second frequency value is calculated, and the weight parameters are iteratively adjusted so that the sum of the squared errors is minimized. When the number of iterations reaches a preset number, or the sum of squared errors is smaller than an error threshold, the fitting parameters obtained in the last iteration are determined as the plurality of second fitting parameters. In this way, over all third time points x, the sum of the squared errors between the fitted frequency values P(x) and the corresponding actual frequency values approaches its minimum.
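A minimal sketch of the least squares fit follows. Note one deliberate substitution: instead of the iterative adjustment described above, it solves the normal equations in closed form, which reaches the same minimum of the sum of squared errors for a polynomial of fixed order. The data points and function names are illustrative assumptions.

```python
def fit_polynomial(points, order):
    """Least squares coefficients [a0, ..., an] for a list of (x, y) points.

    Solves the normal equations (V^T V) a = V^T y by Gaussian elimination
    with partial pivoting, where V is the Vandermonde matrix of the x values.
    """
    size = order + 1
    matrix = [[sum(x ** (i + j) for x, _ in points) for j in range(size)]
              for i in range(size)]
    rhs = [sum(y * x ** i for x, y in points) for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(matrix[r][col]))
        if abs(matrix[pivot][col]) < 1e-12:
            raise ValueError("not enough distinct time points for this order")
        matrix[col], matrix[pivot] = matrix[pivot], matrix[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, size):
            factor = matrix[row][col] / matrix[col][col]
            for k in range(col, size):
                matrix[row][k] -= factor * matrix[col][k]
            rhs[row] -= factor * rhs[col]
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):  # back substitution
        tail = sum(matrix[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (rhs[row] - tail) / matrix[row][row]
    return coeffs


def sum_squared_error(points, coeffs):
    """Sum of squared errors between P(x) and the actual frequency values."""
    return sum((sum(a * x ** i for i, a in enumerate(coeffs)) - y) ** 2
               for x, y in points)


# Illustrative time-frequency data from two Mics observing the same source.
first_tf = [(0.0, 100.0), (0.5, 110.0), (1.0, 120.0)]
second_tf = [(0.0, 102.0), (0.25, 107.0), (0.75, 117.0), (1.0, 122.0)]
second_fitting_parameters = fit_polynomial(first_tf + second_tf, order=1)
best_error = sum_squared_error(first_tf + second_tf, second_fitting_parameters)
```

Where a time point appears in both data sets, both actual frequency values contribute an error term, which corresponds to the "and/or" in the description above.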

After determining the plurality of second fitting parameters, the audio processing device may generate a second fitted time-frequency curve according to the plurality of second fitting parameters.

In this embodiment of the application, after obtaining the second fitted time-frequency curve, the audio processing device may generate a fourth audio signal according to the second fitted time-frequency curve.

Optionally, the audio processing device may segment the second fitted time-frequency curve according to a third inter-frame distance to obtain the time-frequency curve within each third inter-frame distance, and sample the time-frequency curve within each third inter-frame distance to obtain a plurality of fourth time points and the frequency value corresponding to each fourth time point. Then, the audio processing device may perform an inverse Fourier transform on the frequency values corresponding to the fourth time points within each third inter-frame distance according to formula (3) to obtain the time domain data within each third inter-frame distance, and thereby obtain the fourth audio signal.

Formula (3) is the inverse discrete Fourier transform, which, written with the same symbols as formula (1), is:

$$x(m) = \frac{1}{M} \sum_{k=0}^{M-1} F(k)\, e^{j 2\pi k m / M}, \quad m = 0, 1, \dots, M-1 \tag{3}$$

Where F(k) represents the frequency domain data, M represents the number of sample points, and x(m) represents the time domain data.

It should be noted that the third inter-frame distance may be a value such as 25ms, and the third inter-frame distance may be the same as or different from the first inter-frame distance, which is not limited in the embodiment of the present application.
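One way to picture the generation of time domain data for a single frame is sketched below. The source does not spell out how a frequency value is turned into a spectrum before the inverse transform, so this sketch assumes the simplest reading: the frame's spectrum has a single component at the bin nearest the frame's frequency value, plus its mirror bin so that the output is real. The amplitude, the clamping of the bin index, and the function names are all assumptions, and phase continuity between frames is ignored.

```python
import cmath
import math


def inverse_dft(F):
    """Direct inverse discrete Fourier transform; returns the real part of each sample."""
    M = len(F)
    return [
        sum(F[k] * cmath.exp(2j * math.pi * k * m / M) for k in range(M)).real / M
        for m in range(M)
    ]


def synthesize_frame(frequency_hz, sample_rate_hz, frame_len, amplitude=1.0):
    """Build one frame of time domain data from a single frequency value.

    Assumption: the spectrum holds one component at the nearest bin and its
    mirror, which yields a cosine of the requested amplitude.
    """
    k = round(frequency_hz * frame_len / sample_rate_hz)
    k = max(1, min(k, frame_len // 2 - 1))  # keep the bin strictly between DC and Nyquist
    spectrum = [0j] * frame_len
    spectrum[k] = amplitude * frame_len / 2
    spectrum[frame_len - k] = amplitude * frame_len / 2  # mirror bin keeps the output real
    return inverse_dft(spectrum)


frame = synthesize_frame(100.0, 800, 16)  # 100 Hz at an assumed 800 Hz sample rate
```

Concatenating the frames produced for successive third inter-frame distances would give a candidate fourth audio signal under these assumptions.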

Then, the audio processing device may acquire the recognition accuracy of the fourth audio signal. As described above for the system architecture, when the audio processing device and the voice recognition device are two independent devices, the audio processing device may send the fourth audio signal to the voice recognition device and obtain the recognition accuracy that the voice recognition device derives from the fourth audio signal. When the audio processing device includes a voice recognition module, the audio processing device may send the fourth audio signal to the voice recognition module and directly obtain the recognition accuracy that the voice recognition module derives from the fourth audio signal.

After obtaining the recognition accuracy of the fourth audio signal, the audio processing device may determine whether the recognition accuracy is less than a recognition rate threshold. If the recognition accuracy is less than the recognition rate threshold, the audio processing device may adjust the plurality of first fitting parameters and return to the step of determining the plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data and the plurality of first fitting parameters, until the recognition accuracy is not less than the recognition rate threshold; the second fitted time-frequency curve obtained by the last fitting is then used as the first fitted time-frequency curve. That is, the audio processing device may refit repeatedly to determine the first fitted time-frequency curve.
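The refit loop can be sketched as follows. This is a minimal outline under stated assumptions, not the embodiment's implementation: the four callbacks (`fit`, `synthesize`, `accuracy_of`, `adjust`) are hypothetical stand-ins for the fitting step, the generation of the fourth audio signal, the speech recognizer and the parameter adjustment, and the `max_rounds` cap is added here only so the sketch always terminates.

```python
from typing import Any, Callable


def refit_until_accurate(
    first_params: Any,
    fit: Callable[[Any], Any],            # first fitting parameters -> second fitted curve
    synthesize: Callable[[Any], Any],     # second fitted curve -> fourth audio signal
    accuracy_of: Callable[[Any], float],  # fourth audio signal -> recognition accuracy
    adjust: Callable[[Any], Any],         # produce adjusted first fitting parameters
    threshold: float,
    max_rounds: int = 50,                 # assumption: safety cap, not in the source
) -> Any:
    """Repeat fit -> synthesize -> recognize until accuracy reaches the threshold.

    The curve from the last fitting is returned and plays the role of the
    first fitted time-frequency curve.
    """
    params = first_params
    for _ in range(max_rounds):
        curve = fit(params)
        signal = synthesize(curve)
        if accuracy_of(signal) >= threshold:
            return curve
        params = adjust(params)
    raise RuntimeError("recognition accuracy never reached the threshold")
```

Passing the steps in as callbacks keeps the loop independent of whether recognition runs in a separate device or in a module of the audio processing device.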

It should be noted that, in the embodiment of the present application, when the recognition accuracy is less than the recognition rate threshold, the plurality of first fitting parameters may be adjusted in any of the following ways: n may be kept unchanged and a set of initialization parameters a₀, a₁, …, aₙ may be randomly generated again; or n may be increased or decreased by 1 from its original value; or n may be increased by 1 from its original value and a set of initialization parameters a₀, a₁, …, aₙ, aₙ₊₁ may be randomly generated again; or n may be decreased by 1 from its original value and a set of initialization parameters a₀, a₁, …, aₙ₋₁ may be randomly generated again. Alternatively, the fitting parameters may be adjusted manually based on experience; for example, the first fitting parameters may be adjusted manually according to the wake-up rate of the speech recognition.
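A small sketch of the random re-initialization options, assuming the first fitting parameters are held as a list of n + 1 coefficients. The function name, the mode labels, the value range of the random draw and the lower bound of 0 on n are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


def adjust_first_parameters(
    coeffs: Sequence[float],
    mode: str,
    rng: Optional[random.Random] = None,
    low: float = -1.0,   # assumed range for the random initial values
    high: float = 1.0,
) -> List[float]:
    """Regenerate the initialization parameters with the order n kept, raised or lowered by 1."""
    if rng is None:
        rng = random.Random()
    n = len(coeffs) - 1
    if mode == "keep":
        new_n = n
    elif mode == "increase":
        new_n = n + 1
    elif mode == "decrease":
        new_n = max(0, n - 1)  # assumption: the order never drops below 0
    else:
        raise ValueError(f"unknown mode: {mode}")
    # A fresh set a0..a_new_n is drawn; the old values are discarded.
    return [rng.uniform(low, high) for _ in range(new_n + 1)]
```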

In this embodiment, after the audio processing device obtains the first fitted time-frequency curve, the third time-frequency data may be determined according to the first fitted time-frequency curve.

Optionally, the audio processing device may determine, according to the first fitted time-frequency curve, the frequency values at the plurality of third time points on the first fitted time-frequency curve, and use the plurality of third time points and their corresponding frequency values as the third time-frequency data. Alternatively, the plurality of first time points and their corresponding frequency values on the first fitted time-frequency curve may be used as the third time-frequency data. Alternatively, the plurality of second time points and their corresponding frequency values on the first fitted time-frequency curve may be used as the third time-frequency data. Alternatively, the first fitted time-frequency curve may be sampled, and the plurality of fourth time points obtained by sampling and their corresponding frequency values on the first fitted time-frequency curve may be used as the third time-frequency data.

Optionally, in another possible case, the audio processing device may determine a plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data and the plurality of first fitting parameters with reference to the method described above, and then obtain the first fitted time-frequency curve directly according to the plurality of second fitting parameters. That is, the audio processing device performs fitting only once to obtain the third time-frequency data, and does not need to adjust the plurality of first fitting parameters according to the recognition accuracy of the speech recognition.

Step 204: generating a third audio signal according to the third time-frequency data.

In this embodiment, the audio processing device may generate a third audio signal according to the third time-frequency data.

Optionally, the audio processing device may segment the third time-frequency data according to the fourth inter-frame distance to obtain the time-frequency data within each fourth inter-frame distance, and perform an inverse Fourier transform on the frequency value corresponding to each time point within each fourth inter-frame distance to obtain time-domain data within each fourth inter-frame distance, so as to obtain the third audio signal.

It should be noted that, in the embodiment of the present application, since the third time-frequency data is discrete, it may be processed according to formula (3), that is, the inverse discrete Fourier transform formula, to obtain the third audio signal. In addition, the fourth inter-frame distance may be a value such as 25 ms, and may be the same as or different from the first inter-frame distance, which is not limited in the embodiment of the present application.
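As a rough sketch of the per-frame inverse transform, the following Python applies the standard inverse discrete Fourier transform to each frame and concatenates the results. Treating the frequency values within one frame as the DFT coefficients of that frame, and keeping only the real part of the output, are assumptions of this sketch; the function names are hypothetical.

```python
import cmath
from typing import List, Sequence


def inverse_dft(spectrum: Sequence[complex]) -> List[complex]:
    """Standard inverse DFT: x[m] = (1/N) * sum_k X[k] * exp(2j*pi*k*m/N)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)) / n
        for m in range(n)
    ]


def frames_to_signal(frames: Sequence[Sequence[complex]]) -> List[float]:
    """Inverse-transform each frame and join the time-domain pieces in order."""
    signal: List[float] = []
    for frame in frames:
        # Assumption: the audio samples are real, so the imaginary part is dropped.
        signal.extend(sample.real for sample in inverse_dft(frame))
    return signal
```

This direct form costs O(N²) per frame; a production system would more likely use an FFT routine, which computes the same transform faster.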

In summary, in the embodiment of the present application, time-frequency data can well represent the sound characteristics of a sound source. Therefore, if the quality of the two microphones differs greatly, after the first audio signal and the second audio signal collected by the two microphones for the same sound source are acquired, the first time-frequency data may be determined according to the first audio signal and the second time-frequency data according to the second audio signal; the third time-frequency data is then obtained by fitting the first time-frequency data and the second time-frequency data, and the third audio signal is obtained according to the third time-frequency data. In this way, the third audio signal combines the characteristics of both the first audio signal and the second audio signal; compared with the audio signal collected by the lower-quality microphone, its signal quality is better and more stable, which helps improve the accuracy of subsequent speech recognition.

Referring to fig. 3, an embodiment of the present application provides an audio processing apparatus 300, which may be an audio processing device in the system architecture shown in fig. 1, where the apparatus 300 includes:

an obtaining module 301, configured to obtain a first audio signal collected by a first audio collecting device and a second audio signal collected by a second audio collecting device, where the first audio signal and the second audio signal are signals collected from a same sound source in a same time period;

a determining module 302, configured to determine first time-frequency data according to the first audio signal, and determine second time-frequency data according to the second audio signal;

the fitting module 303 is configured to fit the first time-frequency data and the second time-frequency data to obtain third time-frequency data;

a generating module 304, configured to generate a third audio signal according to the third time-frequency data.

Optionally, the fitting module 303 comprises:

and the second determining unit is used for determining the third time-frequency data according to the first fitted time-frequency curve.

Optionally, the first determination unit includes:

the first determining subunit is used for determining a plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data and the plurality of first fitting parameters;

the first generating subunit is used for generating a second fitting time-frequency curve according to the plurality of second fitting parameters;

and the second determining subunit is used for, if the recognition accuracy is less than the recognition rate threshold, adjusting the plurality of first fitting parameters and returning to the step of determining the plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data and the plurality of first fitting parameters, until the recognition accuracy is not less than the recognition rate threshold, and taking the second fitted time-frequency curve generated from the most recently determined second fitting parameters as the first fitted time-frequency curve.

Optionally, the first time-frequency data includes a plurality of first time points and a plurality of first frequency values, the plurality of first time points and the plurality of first frequency values are in one-to-one correspondence, the second time-frequency data includes a plurality of second time points and a plurality of second frequency values, and the plurality of second time points and the plurality of second frequency values are in one-to-one correspondence;

the first determining subunit is specifically configured to:

and determining a plurality of second fitting parameters according to the third frequency value corresponding to each third time point, the first frequency value corresponding to each first time point and the second frequency value corresponding to each second time point.

Optionally, the determining module 302 includes:

the first transformation unit is used for performing a Fourier transform on the first audio signal to obtain the first time-frequency data;

and the second transformation unit is used for performing a Fourier transform on the second audio signal to obtain the second time-frequency data.
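For illustration, a transformation unit of this kind could be sketched as a frame-wise discrete Fourier transform. The frame length in samples, the absence of windowing or overlap, and the function names are assumptions of this sketch and are not specified by the embodiment.

```python
import cmath
from typing import List, Sequence


def dft(samples: Sequence[float]) -> List[complex]:
    """Standard DFT: X[k] = sum_m x[m] * exp(-2j*pi*k*m/N)."""
    n = len(samples)
    return [
        sum(samples[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def to_time_frequency(signal: Sequence[float], frame_len: int) -> List[List[complex]]:
    """Split the signal into consecutive frames and transform each frame.

    The last frame may be shorter than frame_len; it is transformed as-is.
    """
    return [
        dft(signal[start:start + frame_len])
        for start in range(0, len(signal), frame_len)
    ]
```

The same function would be applied once to the first audio signal and once to the second audio signal.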

It should be noted that, when the audio processing apparatus provided in the foregoing embodiment performs audio processing, the division into the functional modules described above is merely an example. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the audio processing apparatus and the audio processing method provided by the above embodiments belong to the same concept; the specific implementation process is described in the method embodiments and is not repeated here.

Fig. 4 is a block diagram of an audio processing device 400 according to an embodiment of the present disclosure. The audio processing device 400 may be a mobile phone, a computer, an intelligent speaker, an intelligent television, an intelligent bracelet, or another device with an audio processing function. The audio processing device 400 may also be referred to by other names such as user device, portable audio processing device, laptop audio processing device, desktop audio processing device, and so forth.

In general, the audio processing device 400 includes: a processor 401 and a memory 402.

The processor 401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 401 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 401 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 402 is used to store at least one instruction for execution by processor 401 to implement the audio processing method provided by the method embodiments herein.

In some embodiments, the audio processing device 400 may further include: a peripheral interface 403 and at least one peripheral. The processor 401, memory 402 and peripheral interface 403 may be connected by bus or signal lines. Various peripheral devices may be connected to the peripheral interface 403 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 404, touch screen display 405, camera 406, audio circuitry 407, positioning components 408, and power supply 409.

The peripheral interface 403 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 401 and the memory 402. In some embodiments, processor 401, memory 402, and peripheral interface 403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 401, the memory 402, and the peripheral interface 403 may be implemented on separate chips or circuit boards, which is not limited by the embodiment.

The radio frequency circuit 404 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 404 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 404 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 404 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 404 may communicate with other audio processing devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 404 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.

The display screen 405 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch display screen, the display screen 405 also has the ability to capture touch signals on or over the surface of the display screen 405. The touch signal may be input to the processor 401 as a control signal for processing. In this case, the display screen 405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 405, disposed on the front panel of the audio processing device 400; in other embodiments, there may be at least two display screens 405, respectively disposed on different surfaces of the audio processing device 400 or in a folded design; in still other embodiments, the display screen 405 may be a flexible display screen disposed on a curved surface or a folded surface of the audio processing device 400. Furthermore, the display screen 405 may be arranged in a non-rectangular irregular shape, that is, a shaped screen. The display screen 405 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 406 is used to capture images or video. Optionally, camera assembly 406 includes a front camera and a rear camera. In general, a front camera is provided on a front panel of an audio processing apparatus, and a rear camera is provided on a rear surface of the audio processing apparatus. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, the main camera and the wide-angle camera are fused to realize panoramic shooting and a VR (Virtual Reality) shooting function or other fusion shooting functions. In some embodiments, camera assembly 406 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.

The audio circuit 407 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 401 for processing, or inputting the electric signals to the radio frequency circuit 404 for realizing voice communication. For stereo capture or noise reduction purposes, the microphones may be multiple and disposed at different locations of the audio processing device 400. The microphone may also be an array microphone or an omni-directional acquisition microphone. The speaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 407 may also include a headphone jack.

The positioning component 408 is used to locate the current geographic location of the audio processing device 400 for navigation or LBS (Location Based Service). The positioning component 408 may be a positioning component based on the Global Positioning System (GPS) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 409 is used to supply power to the various components in the audio processing device 400. The power source 409 may be alternating current, direct current, disposable or rechargeable. When the power source 409 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery can also be used to support fast charge technology.

In some embodiments, the audio processing device 400 also includes one or more sensors 410. The one or more sensors 410 include, but are not limited to: acceleration sensor 411, gyro sensor 412, pressure sensor 413, fingerprint sensor 414, optical sensor 415, and proximity sensor 416.

The acceleration sensor 411 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the audio processing apparatus 400. For example, the acceleration sensor 411 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 401 may control the touch display screen 405 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 411. The acceleration sensor 411 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 412 may detect a body direction and a rotation angle of the audio processing apparatus 400, and the gyro sensor 412 may collect a 3D motion of the user on the audio processing apparatus 400 in cooperation with the acceleration sensor 411. From the data collected by the gyro sensor 412, the processor 401 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

The pressure sensor 413 may be disposed on a side bezel of the audio processing device 400 and/or on a lower layer of the touch display screen 405. When the pressure sensor 413 is disposed at a side frame of the audio processing apparatus 400, a user's holding signal to the audio processing apparatus 400 may be detected, and left-right hand recognition or shortcut operation may be performed by the processor 401 according to the holding signal collected by the pressure sensor 413. When the pressure sensor 413 is disposed at the lower layer of the touch display screen 405, the processor 401 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 405. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 414 is used for collecting a fingerprint of the user, and the processor 401 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 414, or the fingerprint sensor 414 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 401 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 414 may be disposed on the front, back, or side of the audio processing device 400. When a physical key or vendor Logo is provided on the audio processing device 400, the fingerprint sensor 414 may be integrated with the physical key or vendor Logo.

The optical sensor 415 is used to collect the ambient light intensity. In one embodiment, the processor 401 may control the display brightness of the touch display screen 405 according to the ambient light intensity collected by the optical sensor 415. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 405 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 405 is decreased. In another embodiment, the processor 401 may also dynamically adjust the shooting parameters of the camera assembly 406 according to the ambient light intensity collected by the optical sensor 415.

The proximity sensor 416, also known as a distance sensor, is typically provided on the front panel of the audio processing device 400. The proximity sensor 416 is used to capture the distance between the user and the front of the audio processing device 400. In one embodiment, when the proximity sensor 416 detects that the distance between the user and the front of the audio processing device 400 gradually decreases, the processor 401 controls the touch display screen 405 to switch from the bright screen state to the dark screen state; when the proximity sensor 416 detects that the distance between the user and the front of the audio processing device 400 gradually increases, the processor 401 controls the touch display screen 405 to switch from the dark screen state to the bright screen state.

Those skilled in the art will appreciate that the configuration shown in fig. 4 does not constitute a limitation of the audio processing device 400, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be employed.

An embodiment of the present application further provides a non-transitory computer-readable storage medium, and when a processor of an audio processing device executes instructions in the storage medium, the audio processing device is enabled to execute the audio processing method provided in the embodiment shown in fig. 2.

Embodiments of the present application further provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the audio processing method provided in the embodiment shown in fig. 2.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.

The above description is intended only to illustrate the alternative embodiments of the present application, and should not be construed as limiting the present application, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method of audio processing, the method comprising:

acquiring a first audio signal acquired by first audio acquisition equipment and a second audio signal acquired by second audio acquisition equipment, wherein the first audio signal and the second audio signal are signals acquired from the same sound source in the same time period;

determining a plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data and a plurality of first fitting parameters, wherein the plurality of first fitting parameters comprise a weight parameter and a fitting order; generating a second fitted time-frequency curve according to the plurality of second fitting parameters; generating a fourth audio signal according to the second fitted time-frequency curve; acquiring the recognition accuracy of the fourth audio signal;

if the recognition accuracy is less than a recognition rate threshold, adjusting the plurality of first fitting parameters and returning to the step of determining a plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data and the plurality of first fitting parameters, until the recognition accuracy is not less than the recognition rate threshold, and taking the second fitted time-frequency curve generated from the most recently determined second fitting parameters as a first fitted time-frequency curve;

determining third time-frequency data according to the first fitted time-frequency curve;

and generating a third audio signal according to the third time-frequency data, wherein the first fitted time-frequency curve is used for indicating the relation between the time and the frequency of the third audio signal.

2. The method of claim 1, wherein the first time-frequency data comprises a plurality of first time points and a plurality of first frequency values, wherein the plurality of first time points and the plurality of first frequency values have a one-to-one correspondence, wherein the second time-frequency data comprises a plurality of second time points and a plurality of second frequency values, and wherein the plurality of second time points and the plurality of second frequency values have a one-to-one correspondence;

the determining a plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data and the plurality of first fitting parameters includes:

3. The method according to any of claims 1-2, wherein determining first time-frequency data from the first audio signal and determining second time-frequency data from the second audio signal comprises:

performing Fourier transform on the first audio signal to obtain the first time-frequency data;

4. An audio processing apparatus, characterized in that the apparatus comprises:

a generating module, configured to generate a third audio signal according to the third time-frequency data;

the fitting module includes:

a second determining unit, configured to determine the third time-frequency data according to the first fitted time-frequency curve;

the first determination unit includes:

the first determining subunit is configured to determine a plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data, and a plurality of first fitting parameters, where the plurality of first fitting parameters include a weight parameter and a fitting order;

the second generating subunit is used for generating a fourth audio signal according to the second fitted time-frequency curve;

and the second determining subunit is configured to, if the recognition accuracy is less than a recognition rate threshold, adjust the plurality of first fitting parameters and return to the step of determining the plurality of second fitting parameters according to the first time-frequency data, the second time-frequency data, and the plurality of first fitting parameters, until the recognition accuracy is not less than the recognition rate threshold, and take the second fitted time-frequency curve generated from the most recently determined second fitting parameters as the first fitted time-frequency curve.

5. The apparatus of claim 4, wherein the first time-frequency data comprises a plurality of first time points and a plurality of first frequency values, wherein the plurality of first time points and the plurality of first frequency values are in one-to-one correspondence, wherein the second time-frequency data comprises a plurality of second time points and a plurality of second frequency values, and wherein the plurality of second time points and the plurality of second frequency values are in one-to-one correspondence;

the first determining subunit is specifically configured to:

6. The apparatus of any of claims 4-5, wherein the determining module comprises:

7. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1-3.