CN111370012A

CN111370012A - Bluetooth voice audio acquisition method and system

Info

Publication number: CN111370012A
Application number: CN202010460221.2A
Authority: CN
Inventors: 江德祥
Original assignee: Beijing Xiaomi Mobile Software Co Ltd
Current assignee: Beijing Xiaomi Mobile Software Co Ltd
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2020-07-03
Anticipated expiration: 2040-05-27
Also published as: CN111370012B

Abstract

The invention provides a Bluetooth voice audio acquisition method and a system, wherein the method is applied to a mobile terminal connected with a multi-channel audio acquisition device and comprises the following steps: simultaneously acquiring a plurality of first audio data through a multi-channel audio acquisition device; processing the plurality of first audio data based on a preset processing rule to obtain second audio data; and sending the second audio data to the voice receiving terminal. The Bluetooth voice audio acquisition method of the invention realizes clear user voice acquisition; the user does not need to care which microphone collects the sound, and the user obtains better voice operation experience.

Description

Bluetooth voice audio acquisition method and system

Technical Field

The invention relates to the technical field of voice acquisition, in particular to a Bluetooth voice audio acquisition method and system.

Background

At present, when a user uses a mobile terminal such as a mobile phone and the like to connect with bluetooth audio peripherals such as a bluetooth headset/a sound box and the like and performs some voice operations, such as dialing, WeChat/QQ voice & video, WeChat sending voice, skype voice call and other operations, the voice acquisition operations are usually automatically completed under the assistance of the application and a mobile phone audio system. One possibility for collecting user voice data is that sound is collected by a microphone of an audio peripheral such as an earphone and the like and is transmitted to a mobile phone through a BT SCO link; one is that a microphone arranged on the mobile phone collects sound; different application scenarios may confuse the user because of different on-microphone acquisitions on the handset and the peripheral. In addition, the voice tone quality acquired by mic (Microphone) on the mobile phone or mic on the earphone is higher in different scenes or hardware factors of equipment, and the like, so that a user cannot select a better Microphone for inputting by himself; therefore, the optimization scheme can be used for simultaneously acquiring multi-channel voice input, and the voice acquisition quality and the user experience are improved.

In addition, the two-path microphone acquisition of the existing voice acquisition scene mobile phone terminal or the earphone accessory terminal and the like is generally independent and incompatible, and only one path of operation can be selected. Therefore, to acquire high-quality voice data, a user must speak near the earphone or near the mobile phone, and the user must distinguish which microphone acquires the voice, which results in poor user experience.

Disclosure of Invention

One of the purposes of the invention is to provide a Bluetooth voice audio acquisition method, which realizes clear user voice acquisition; the user does not need to care which microphone collects the sound, and the user obtains better voice operation experience.

The embodiment of the invention provides a Bluetooth voice audio acquisition method, which is applied to a mobile terminal connected with a multi-channel audio acquisition device and comprises the following steps:

simultaneously acquiring a plurality of first audio data through a multi-channel audio acquisition device;

processing the plurality of first audio data based on a preset processing rule to obtain second audio data;

and sending the second audio data to the voice receiving terminal.

Preferably, the preset processing rule includes:

respectively extracting parameters of the plurality of first audio data to obtain parameters representing the quality of the first audio data;

and comparing the parameters to obtain the best quality of the plurality of first audio data as second audio data.

Preferably, the parameters include: frequency response, THD + N, volume.

Preferably, the preset processing rule further includes:

and carrying out fusion processing on the plurality of first audio data to obtain second audio data.

Preferably, the fusion processing method includes: MIX, enhance one or more combinations in the compensation.

The invention also provides a bluetooth voice audio acquisition system, which is applied to a mobile terminal connected with a multi-channel audio acquisition device and comprises:

the first audio acquisition module is used for acquiring a plurality of first audio data through a multi-channel audio acquisition device at the same time;

the second audio generation module is used for processing the plurality of first audio data based on a preset processing rule to acquire second audio data;

and the audio sending module is used for sending the second audio data to the voice receiving terminal.

Preferably, the second audio generating module includes:

the parameter extraction module is used for respectively extracting parameters of the plurality of first audio data to obtain parameters representing the quality of the first audio data;

and the parameter comparison module is used for comparing the parameters and acquiring the best quality of the plurality of first audio data as second audio data.

Preferably, the parameters include: frequency response, THD + N, volume.

Preferably, the second audio generating module further comprises:

and the audio fusion module is used for carrying out fusion processing on the plurality of first audio data to obtain second audio data.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

fig. 1 is a schematic diagram of a bluetooth voice audio acquisition method according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a conventional voice audio acquisition method;

fig. 3 is a schematic diagram of another conventional voice audio acquisition method.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

and sending the second audio data to the voice receiving terminal.

The working principle and the beneficial effects of the technical scheme are as follows:

when a user uses a mobile terminal 1 such as a mobile phone and the like to connect with bluetooth audio peripherals (a bluetooth terminal 2) such as a bluetooth earphone/a sound box and the like, voice operations such as call making, WeChat/QQ voice and video, WeChat voice sending, skype voice call and the like are often performed to send voice data to a remote receiving terminal 3, and the voice acquisition operations are usually automatically completed by assistance of application and a mobile phone audio software and hardware system. The current design of the voice collecting software system only supports collecting from one microphone of the mobile terminal 1 (as shown in fig. 2) or the bluetooth terminal 2 (as shown in fig. 3), and the user is confused because it is unclear whether the voice is collected from the mobile terminal 1 or the bluetooth terminal 2. In addition, the voice tone quality acquired by mic on the mobile terminal 1 or the bluetooth terminal 2 may be higher due to environmental factor influence or equipment hardware factor, and the user cannot select a better microphone for input.

Fig. 1 shows an application scenario of a two-way audio acquisition device. In the application scene, the mobile terminal 1 and the voice receiving terminal 3 carry out voice call operation, and the bluetooth terminal 2 is connected with the mobile terminal 1 through a bluetooth wireless link. When the voice application on the mobile terminal 1 actively or passively triggers a voice operation, such as a telephone operation or a WeChat voice transmission, the application sets the audio input channel as a mobile phone microphone or as a Bluetooth BTSCO (using a Bluetooth microphone for input) by calling an audio software framework interface of the mobile phone system. The mobile phone microphone is a path of audio acquisition device; the Bluetooth microphone is another audio acquisition device.

Two paths of voice acquisition: by optimizing the audio system software architecture on the mobile terminal 1, when the audio channel selection is triggered by the mobile terminal 1, if the bluetooth terminal 2 is currently connected with the mobile terminal 1, the audio software system of the mobile terminal 1 is simultaneously controlled to start a microphone (such as a mobile phone microphone 1) on the mobile terminal 1 and establish a BT SCO Link (to start an earphone microphone), the two microphones simultaneously start voice collection, the actual application scene of a user may be attached to the bluetooth terminal 2 to make voice production or be close to the mobile terminal 1 to make voice production, the two voice collections are simultaneously gathered in the ADSP digital audio processing of the mobile terminal 1, and the audio voice data collected by the two audio nodes are actually compared and fused. And finally, a second audio with good audio quality is obtained after fusion processing, and the second audio is transmitted to the receiving terminal 3, so that the far-end equipment can obtain a clearer voice signal.

The Bluetooth voice audio acquisition method of the invention realizes clear user voice acquisition; the user does not need to care which microphone collects the sound, and the user obtains better voice operation experience.

In one embodiment, the preset processing rule includes:

Wherein the parameters include: frequency response, THD + N, volume.

in the application scenario of fig. 1, the ADSP audio fusion policy: after two paths of audio data are input into ADSP, the indexes such as frequency response, THD + N, volume and the like of two paths of collected audio are compared through a three-party algorithm, audio signals with better parameter performance are directly selected, one path with poor signal performance is discarded, and preferential selection is carried out; no matter whether the user carries out speech sound production by attaching to bluetooth terminal 2 or is close to mobile terminal 1 and carries out speech sound production, the audio frequency that mobile terminal 1 sent for receiving terminal 3 is the clearest, need not that the user distinguishes which microphone to gather sound, and has avoided the user to distinguish the mistake and cause the emergence of the condition that the audio frequency collection quality is low, and then improves user's experience.

In one embodiment, the pre-setting processing rule further includes:

The fusion processing method comprises the following steps: MIX, enhance one or more combinations in the compensation.

in the application scenario of fig. 1, the ADSP audio fusion policy: after two paths of audio data are input into ADSP, indexes such as frequency response, THD + N, volume and the like of two paths of collected audio are compared through a three-way algorithm, and MIX sound mixing and enhancement compensation are carried out on the basis of two paths of signals; and the two paths of signals are fused to realize the complementation of audio signal acquisition and reduce the probability of audio signal distortion, and then the processed signals are output. Therefore, no matter whether the user carries out voice sound production by being attached to the Bluetooth terminal 2 or is close to the mobile terminal 1, the audio sent to the receiving terminal 3 by the mobile terminal 1 is the best effect of the user sound, the user is not required to distinguish which microphone collects the sound, the situation that the audio collection quality is low due to the fact that the user distinguishes errors is avoided, and the user experience is improved.

In one embodiment, performing fusion processing on a plurality of first audio data to obtain second audio data includes the following operations:

step S1: performing time domain alignment operation on a plurality of first audio data;

step S2: calculating an amplitude sequence of each of the first audio data; the amplitude sequence includes: an amplitude value of each frame of the first audio data;

step S3: calculating an effective value corresponding to the amplitude value, wherein the calculation formula is as follows:

；

wherein the content of the first and second substances,

indicating the second in the amplitude sequence

Effective values corresponding to the amplitude values;

、

respectively is a preset maximum standard amplitude and a preset minimum standard amplitude;

step S4: corresponding the effective value based on each amplitude value in the amplitude sequence

Calculating the position of the first audio dataValue of credit

The calculation formula is as follows:

；

wherein N represents the number of the amplitude values in the amplitude sequence; when the confidence value is

When the first audio data is smaller than or equal to a preset value, the corresponding first audio data is abandoned;

step S5: when the number of the first audio data processed in step S4 is one, performing amplitude enhancement on the first audio data to make an amplitude value in the amplitude sequence of the first audio data meet a preset requirement to be used as the second audio data; the preset requirements include: the number of amplitude values in the amplitude sequence reaching the preset amplitude intensity reaches the set number;

step S6: when the number of the first audio data processed in step S4 is greater than one, performing amplitude enhancement on each first audio data to make the amplitude value in the amplitude sequence of each first audio data meet a preset requirement, averaging the amplitude values in the amplitude sequences to form a new amplitude sequence, and using the new first audio data corresponding to the new amplitude sequence as the second audio data.

while processing the first audio data; the audio acquisition devices corresponding to the first audio data are different, the setting positions of the audio acquisition devices are different, and the distances between the audio acquisition devices and the speaking positions of the users are different, so that the audio data directly acquired by the audio acquisition devices have time domain differences, the first audio data are aligned in time domain, the validity of the first audio data is verified after the alignment, and the first audio data acquired by the audio acquisition devices far away from the speaking positions of the users are removed; the collected first audio data are fused to obtain better second audio data; furthermore, the manner of fusion is not limited to fusion in amplitude.

In one embodiment, step S1: performing time domain alignment operation on a plurality of first audio data; the method specifically comprises the following steps:

step S11: sampling the first audio data based on a first preset time to obtain a plurality of first short-time energy data,

step S12: respectively calculating short-time energy values of the first short-time energy data, and arranging the short-time energy values according to a sampling sequence to form a short-time energy sequence;

step S13: comparing the sum of the short-time energy values in the short-time energy sequences of the first audio data, and taking the short-time energy sequence corresponding to the first audio data with the largest sum as a standard short-time energy sequence;

step S14: acquiring the length of an alignment label, and performing translation extraction in the standard short-time energy sequence based on the length of the alignment label to obtain a plurality of short-time energy labels;

step S15: comparing the energy sum value of each short-time energy label, and taking the short-time energy label with the maximum sum value as an alignment label;

step S16: performing translation detection on the alignment tag in each first audio data based on first preset time, and determining a position, with the highest matching degree with the alignment tag, in the first audio data as an alignment position;

step S17: and determining a good alignment position based on each piece of the first audio data, and performing alignment operation on each piece of the first audio data in a time domain.

the short-time energy sequence is used as an alignment standard to realize the accuracy of alignment operation; the alignment label takes the part of the audio data with more concentrated energy, so that the alignment label has the marking property; through the alignment step in the embodiment, the accuracy of the subsequent second audio data generation is ensured, and the mobile terminal realizes clear user voice acquisition. The position of the first audio data with the highest matching degree with the alignment tag specifically is: the closeness of the ratio of the short-time energy values in the first audio data corresponding to the position of the alignment mark to the ratio of the short-time energy values in the alignment mark.

step S16: performing translation detection on the alignment tag in each first audio data based on the first preset time, and determining a position, which is the highest in matching degree with the alignment tag, in the first audio data as a matching position;

step S16A: acquiring matched audio data corresponding to the matched position, and acquiring label audio data corresponding to the aligned label;

step S16B: intercepting and discarding data of a second preset time at the front end of the matched audio data;

step S16C: then sampling the discarded matched audio data based on first preset time to obtain a plurality of second short-time energy data; the second preset time is one M times of the first preset time;

step S16D: respectively calculating short-time energy values of the second short-time energy data, and arranging according to a sampling sequence to form a short-time energy sequence;

step 16E: continuing to intercept data of a second preset time at the front end of the matched audio data and discarding the data, repeating the steps S16C to S16D until M short-time energy sequences are obtained, and discarding the last energy value in the short-time energy sequences before the M short-time energy sequences;

step 16F: discarding the first and last short-time energy values in the short-time energy sequence corresponding to the alignment label to obtain a second standard short-time energy sequence;

step 16G: comparing the matching degrees between the second standard short-time energy sequence and the M short-time energy sequences respectively, and taking the position corresponding to the highest matching degree as the alignment position;

by sampling the matching position and the alignment label again, the more accurate alignment position is obtained, the accuracy of the subsequent second audio data generation is ensured, and the mobile terminal realizes clear user voice acquisition.

The Bluetooth voice audio acquisition system realizes clear voice acquisition of a user; the user does not need to care which microphone collects the sound, and the user obtains better voice operation experience.

In one embodiment, the second audio generation module comprises:

Wherein the parameters include: frequency response, THD + N, volume.

In one embodiment, the second audio generation module further comprises:

In one embodiment, the second audio generation module performs operations comprising:

；

wherein the content of the first and second substances,

indicating the second in the amplitude sequence

Effective values corresponding to the amplitude values;

、

Calculating a confidence value for the first audio data

The calculation formula is as follows:

；

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. The Bluetooth voice audio acquisition method is applied to a mobile terminal connected with a multi-channel audio acquisition device, and comprises the following steps:

sending the second audio data to a voice receiving terminal;

the preset processing rule comprises:

performing fusion processing on the plurality of first audio data to obtain second audio data;

the method for fusing the plurality of first audio data to obtain the second audio data comprises the following operations:

；

wherein the content of the first and second substances,

indicating the second in the amplitude sequence

Corresponding to amplitude value isA virtual value;

、

Calculating a confidence value for the first audio data

The calculation formula is as follows:

；

2. The bluetooth voice audio acquisition method according to claim 1, wherein the preset processing rule comprises:

performing parameter extraction on a plurality of first audio data respectively to obtain the parameters representing the quality of the first audio data;

and comparing the parameters to obtain the best quality of the plurality of first audio data as the second audio data.

3. The bluetooth voice audio capture method of claim 2, characterized in that the parameters comprise: frequency response, THD + N, volume.

4. The bluetooth voice audio acquisition method according to claim 1, wherein the fusion processing method comprises: MIX, enhance one or more combinations in the compensation.

5. The utility model provides a bluetooth pronunciation audio acquisition system which characterized in that is applied to and is connected with multichannel audio acquisition device's mobile terminal, includes:

the audio sending module is used for sending the second audio data to a voice receiving terminal;

the second audio generation module further comprises:

the audio fusion module is used for carrying out fusion processing on the plurality of first audio data to obtain second audio data;

the second audio generation module performs operations comprising:

；

wherein the content of the first and second substances,

indicating the second in the amplitude sequence

Effective values corresponding to the amplitude values;

、

Calculating a confidence value for the first audio data

The calculation formula is as follows:

；

wherein N represents the amplitude sequenceThe number of amplitude values in the column; when the confidence value is

6. The bluetooth voice audio capture system of claim 5, wherein the second audio generation module comprises:

a parameter extraction module, configured to perform parameter extraction on the plurality of first audio data, respectively, to obtain the parameter indicating the quality of the first audio data;

and the parameter comparison module is used for comparing the parameters to obtain the second audio data with the best quality in the plurality of first audio data.

7. The bluetooth voice audio capture system of claim 6, where the parameters comprise: frequency response, THD + N, volume.

8. The bluetooth voice audio acquisition system according to claim 5, wherein the method of fusion processing comprises: MIX, enhance one or more combinations in the compensation.