CN113782043A

CN113782043A - Voice acquisition method and device, electronic equipment and computer readable storage medium

Info

Publication number: CN113782043A
Application number: CN202111041110.9A
Authority: CN
Inventors: 蒋毅; 李健; 武卫东; 陈明
Original assignee: Beijing Sinovoice Technology Co Ltd
Current assignee: Beijing Sinovoice Technology Co Ltd
Priority date: 2021-09-06
Filing date: 2021-09-06
Publication date: 2021-12-10

Abstract

An embodiment of the invention provides a voice acquisition method, a voice acquisition device, electronic equipment and a computer-readable storage medium. The method comprises the following steps: determining a target sampling frequency of the required target audio data; according to the target sampling frequency, first oversampling and then down-sampling the voice uttered by the user, so as to enhance the voice signal in that voice; and taking the enhanced voice data as the required target audio data. In the embodiment of the invention, the voice signal uttered by the user is enhanced while the noise inside the equipment is not affected by the oversampling, so the signal-to-noise ratio of the voice signal can be improved simply and effectively by enhancing the voice signal energy while leaving the noise energy inside the equipment unchanged.

Description

Voice acquisition method and device, electronic equipment and computer readable storage medium

Technical Field

The embodiment of the invention relates to the field of voice processing, in particular to a voice acquisition method, a voice acquisition device, electronic equipment and a computer-readable storage medium.

Background

Speech acquisition is involved in a large number of scenarios, such as television program recording, movie recording, music recording and educational video recording. However, noise interference is difficult to avoid during voice acquisition; it mainly comprises environmental noise and noise inside the recording device. As a result, the finally acquired audio data contains noise, which degrades the audio quality. It is therefore necessary to capture as much of the target speaker's voice as possible, either by increasing its volume or by reducing the influence of noise, so as to raise the signal-to-noise ratio of the acquired audio data and improve the audio quality.

In the related art, there are two main solutions: a microphone array solution and a digital gain solution.

The microphone array scheme uses the array gain of a microphone array to raise the volume of the voice signal. However, this solution involves multiple microphone sensors and acquisition circuits, so the acquisition system has a complex structure and high cost. In addition, because channel gain, frequency response curve and the consistency of the acquisition conditioning circuits differ among the microphone sensors, the enhanced speech exhibits reverberation in the time domain and distortion in the frequency domain.

The digital gain scheme is in essence a dynamic gain adjustment method that amplifies the sound signal with digital gain. While amplifying the weak acquired signal, this scheme also raises the energy of the noise mixed into the voice signal, so the signal-to-noise ratio of the acquired voice cannot be effectively improved.

It is therefore desirable to provide a simple and effective solution for improving the signal-to-noise ratio of audio data to improve the audio quality.

Disclosure of Invention

The invention provides a voice acquisition method, a voice acquisition device, electronic equipment and a computer readable storage medium.

In order to solve the above problem, in a first aspect, an embodiment of the present invention provides a method for acquiring a voice, where the method includes:

determining a target sampling frequency of the required target audio data;

according to the target sampling frequency, firstly performing oversampling on the voice sent by the user, and then performing down-sampling so as to enhance the voice signal in the voice sent by the user;

and taking the voice data after the enhancement processing as the required target audio data.

Optionally, according to the target sampling frequency, performing oversampling on a voice uttered by a user and then performing downsampling includes:

collecting the voice uttered by the user at an actual sampling frequency that is N times the target sampling frequency, to obtain audio data with N sampling values per unit time;

adding every N adjacent sampling values of the audio data to obtain the sampling value for one unit time;

and taking the obtained audio data of the plurality of sampling values as the voice data after enhancement processing.

Optionally, determining a target sampling frequency of the desired target audio data comprises:

and determining the target sampling frequency according to the audio analysis requirement or the actual effective frequency range of the sampling object.

Optionally, collecting the voice uttered by the user at an actual sampling frequency that is N times the target sampling frequency includes:

selecting an actual high-speed audio acquisition circuit consisting of a high-frequency response microphone and a high-speed signal acquisition circuit corresponding to the actual sampling frequency;

and acquiring the voice sent by the user by utilizing the actual high-speed audio acquisition circuit.

Optionally, the method further comprises:

collecting voice sent by a user in an initial time period;

analyzing the collected voice to determine whether the voice sent by the user is far-field voice;

wherein first oversampling and then downsampling the collected voice according to the target sampling frequency comprises:

and under the condition that the collected voice is far-field voice, according to the target sampling frequency, firstly performing oversampling on the collected voice and then performing downsampling.

In a second aspect, an embodiment of the present invention provides a speech acquisition apparatus, where the apparatus includes:

the target sampling frequency determining module is used for determining the target sampling frequency of the required target audio data;

the enhancement processing module is used for firstly carrying out oversampling and then carrying out downsampling on the voice sent by the user according to the target sampling frequency so as to carry out enhancement processing on the voice signal in the voice sent by the user;

and the target audio data acquisition module is used for taking the voice data after the enhancement processing as the required target audio data.

Optionally, the enhancement processing module includes:

the oversampling submodule is used for acquiring the voice uttered by the user at an actual sampling frequency that is N times the target sampling frequency, to obtain audio data with N sampling values per unit time;

the down-sampling submodule is used for adding every N adjacent sampling values of the audio data to obtain the sampling value for one unit time;

and the voice data determination submodule is used for taking the obtained audio data of the plurality of sampling values as the voice data after enhancement processing.

Optionally, the target sampling frequency determination module includes:

and the target sampling frequency determining submodule is used for determining the target sampling frequency according to the audio analysis requirement or the actual effective frequency range of the sampling object.

Optionally, the oversampling submodule includes:

the selection unit is used for selecting an actual high-speed audio acquisition circuit consisting of a high-frequency response microphone and a high-speed signal acquisition circuit corresponding to the actual sampling frequency;

and the acquisition unit is used for acquiring the voice sent by the user by utilizing the actual high-speed audio acquisition circuit.

Optionally, the apparatus further comprises:

the acquisition module is used for acquiring voice sent by a user in an initial time period;

the analysis module is used for analyzing the collected voice and determining whether the voice sent by the user is far-field voice;

the enhancement processing module is further configured to perform oversampling and then down-sampling on the collected voice according to the target sampling frequency under the condition that the collected voice is far-field voice.

In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program that is stored in the memory and is executable on the processor, where the processor executes the computer program to implement the voice collecting method provided in the embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the speech acquisition method proposed in the embodiment of the present invention.

In the embodiment of the invention, the target sampling frequency of the required target audio data is determined first, and the voice uttered by the user is then oversampled and subsequently downsampled according to that target sampling frequency. In this process the voice signal uttered by the user is enhanced, while the noise inside the equipment is not affected by the oversampling. The signal-to-noise ratio of the voice signal can therefore be improved simply and effectively by enhancing the voice signal energy while leaving the noise energy inside the equipment unchanged, which improves the audio quality.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments or by the description of the related art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive labor.

Fig. 1 is a flowchart of a voice collecting method according to an embodiment of the present invention;

Fig. 2 is a flowchart of a voice collecting method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a voice collecting device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A flowchart of a speech acquisition method provided in an embodiment of the present invention is shown in fig. 1. The voice acquisition method provided by the invention can be applied to voice acquisition processes of television program recording, film recording, music recording, teaching video recording and the like. The voice acquisition method comprises the following steps:

Step S110, determining a target sampling frequency of the required target audio data.

In this embodiment, the sound collection scheme may be customized for a specific application. Specifically, a target sampling frequency is determined according to the actual requirement of voice analysis or the effective frequency range of the collection object, and the voice signal is collected with that target sampling frequency as the goal.

Step S120, according to the target sampling frequency, first oversampling and then down-sampling the voice uttered by the user, so as to enhance the voice signal in that voice.

In this embodiment, oversampling means: after the target sampling frequency is determined, the voice signal sent by the sampling object is collected by using the actual sampling frequency N times the target sampling frequency.

In this embodiment, down-sampling means: after the audio data has been acquired at the actual sampling frequency, the sampling values in the audio data are merged according to the ratio between the actual sampling frequency and the target sampling frequency. The N sampling values falling within one unit time (one sampling interval of the target sampling frequency) are added together to give the single sampling value for that unit time.

Step S130, the enhanced voice data is used as the required target audio data.

In this embodiment, the voice uttered by the user is first oversampled, so that N sampling values are acquired per unit time. The collected sampling values are then down-sampled by adding the N sampling values in each unit time, which yields the enhanced voice data. The enhanced voice data is used as the required target audio data, giving audio data in which the voice signal has been enhanced.
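The oversampling and summation described in steps S110 to S130 can be written compactly as follows. The symbols are assumed notation that does not appear in the source: $x[n]$ is the stream captured at the actual sampling frequency $f_{\text{actual}}$, $y[k]$ is the output stream at the target sampling frequency $f_{\text{target}}$, and $N$ is the oversampling multiple.

```latex
f_{\text{actual}} = N \, f_{\text{target}}, \qquad
y[k] = \sum_{i=0}^{N-1} x[kN + i], \qquad k = 0, 1, 2, \dots
```

Each output sample $y[k]$ covers one unit time of length $1 / f_{\text{target}}$ and is the plain sum, not the average, of the $N$ input samples that fall inside it.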

In the embodiment of the invention, the target sampling frequency of the required target audio data is determined first, and the voice uttered by the user is then oversampled and subsequently downsampled according to that target sampling frequency. In this process the voice signal uttered by the user is enhanced, while the noise inside the equipment is not affected by the oversampling and downsampling processes. The invention can therefore improve the signal-to-noise ratio of the voice signal simply and effectively by enhancing the voice signal energy while leaving the noise energy inside the equipment unchanged, thereby improving the audio quality.

In this embodiment, before executing step S120, the voice collecting method may further include:

Step S1, collecting the voice uttered by the user in an initial time period.

In this embodiment, the voice uttered by the user may first be collected within an initial time period for testing.

Step S2, analyzing the collected voice to determine whether the voice uttered by the user is far-field voice.

In this embodiment, the user voice collected in the initial time period may be analyzed to determine whether the voice uttered by the user is far-field voice.

In this embodiment, far-field voice refers to voice from a sound source whose distance from the reference point of the acquisition sensor is much larger than the signal wavelength. Whether the voice of a sound source is far-field voice can therefore be determined by comparing the distance between the reference point of the microphone and the position of the sound source with the wavelength of the sound source's voice signal.
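The far-field criterion above can be sketched as a simple comparison of distance and wavelength. This is only an illustration: the source does not say how large "much larger" is, so the `ratio_threshold` default of 10 is a hypothetical choice, and the speed of sound of 343 m/s assumes air at roughly 20 °C. The function names are likewise invented for this sketch.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: air at about 20 degrees Celsius


def wavelength_m(frequency_hz: float) -> float:
    """Wavelength of a sound wave of the given frequency, in metres."""
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    return SPEED_OF_SOUND_M_S / frequency_hz


def is_far_field(distance_m: float, frequency_hz: float,
                 ratio_threshold: float = 10.0) -> bool:
    """Return True when the source distance is 'much larger' than the wavelength.

    ratio_threshold is a hypothetical stand-in for 'much larger';
    the source gives no numeric value.
    """
    return distance_m >= ratio_threshold * wavelength_m(frequency_hz)
```

For a 1 kHz component the wavelength is about 0.343 m, so with the assumed threshold a speaker 5 m away counts as far-field, while one 0.3 m away does not.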

In this embodiment, when the collected voice is far-field voice, the collected voice is first oversampled and then downsampled according to the target sampling frequency.

In this embodiment, when the voice of the sound source is determined to be far-field voice, a method of oversampling and then downsampling may be adopted to perform enhancement processing on the subsequently acquired voice signal.
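The far-field gating described above can be sketched as a small dispatch function. Several parts are assumptions rather than statements from the source: `capture` is a hypothetical callable standing in for the acquisition hardware, the function and parameter names are invented, and the near-field branch (capturing directly at the target sampling frequency) is only one plausible choice, since the source does not describe that case.

```python
from typing import Callable, List

# Hypothetical capture interface: (sample_rate_hz, seconds) -> list of samples.
Capture = Callable[[int, float], List[float]]


def acquire_target_audio(capture: Capture, target_hz: int, n: int,
                         seconds: float, far_field: bool) -> List[float]:
    """Return audio at target_hz, enhanced only when the voice is far-field."""
    if not far_field:
        # Assumption: without far-field voice, sample directly at the target rate.
        return capture(target_hz, seconds)
    # Oversample at N times the target sampling frequency.
    raw = capture(n * target_hz, seconds)
    # Down-sample by adding every N adjacent samples; drop an incomplete tail.
    usable = len(raw) - len(raw) % n
    return [sum(raw[i:i + n]) for i in range(0, usable, n)]
```

Both branches return the same number of samples per second, so downstream processing sees audio at the target sampling frequency in either case.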

The inventor has found that, under far-field acquisition conditions in which the target speaker is far from the acquisition sensor, the voice is attenuated along the spatial path and reaches the acquisition sensor with little energy. Each directly acquired sample therefore carries little voice signal energy, and the voice is easily disturbed by noise.

The inventor has further found that, in a voice acquisition scene, the noise contained in the acquired audio data is mainly environmental noise and noise inside the device. In a far-field voice acquisition scene, the environmental noise and the internal device noise affect the voice signal to a comparable degree, in a ratio of approximately one to one. The inventor therefore proposes oversampling and then downsampling the voice uttered by the user so as to enhance the voice signal in it. In the oversampling process, although both the speech signal and the environmental noise signal are enhanced, the internal noise signal of the device remains unchanged, so the signal-to-noise ratio of the speech signal can be effectively improved.
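One way to make this reasoning quantitative is the textbook model below. It is a sketch under assumptions that the source does not state: the captured sample is $x[n] = s[n] + e[n] + d[n]$ (speech, environmental noise, internal device noise); $s$ and $e$ vary slowly enough to be nearly constant over $N$ adjacent samples, with values $s_k$ and $e_k$; and $d$ is zero-mean and uncorrelated from sample to sample with variance $\sigma_d^2$. $P_s$ and $P_e$ denote the per-sample powers of speech and environmental noise.

```latex
\begin{aligned}
y[k] &= \sum_{i=0}^{N-1} x[kN+i]
      \;\approx\; N\,s_k + N\,e_k + \sum_{i=0}^{N-1} d[kN+i] \\[4pt]
\frac{P_{s,\text{out}}}{P_{d,\text{out}}}
     &\approx \frac{N^{2} P_s}{N\,\sigma_d^{2}}
      = N \cdot \frac{P_s}{\sigma_d^{2}},
\qquad
\frac{P_{s,\text{out}}}{P_{e,\text{out}}}
      \approx \frac{N^{2} P_s}{N^{2} P_e}
      = \frac{P_s}{P_e}
\end{aligned}
```

In this idealized model the internal noise power still grows, but only by a factor of $N$, while the speech power grows by $N^2$. The ratio of speech to internal device noise therefore improves by roughly $10 \log_{10} N$ dB, and the ratio of speech to environmental noise stays the same, which is consistent with the observation that speech and environmental noise are enhanced together.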

A flowchart of a speech acquisition method provided in an embodiment of the present invention is shown in fig. 2. In this embodiment, the voice collecting method includes:

Step S210, determining a target sampling frequency of the required target audio data.

In this embodiment, step S210 specifically includes: determining the target sampling frequency according to the audio analysis requirement or the actual effective frequency range of the sampling object.

In this embodiment, the target sampling frequency of the target audio data may be determined according to the actual application scenario. For example, a voice call requires a voice sampling frequency of 8 kHz, so the target sampling frequency can be determined to be 8 kHz.

In this embodiment, the target sampling frequency may also be determined according to the actual effective frequency range of the sampling object in the actual application scene.

Step S220, collecting the voice uttered by the user at an actual sampling frequency that is N times the target sampling frequency, to obtain audio data with N sampling values per unit time.

In this embodiment, after the target sampling frequency is determined, an actual sampling frequency equal to N times the target sampling frequency may be chosen according to actual requirements, where N can be any natural number greater than 1 and can be customized as needed.

For example, assuming that the target sampling frequency is 8 kHz, an actual sampling frequency of 2, 4 or 8 times that value (16 kHz, 32 kHz or 64 kHz) may be selected for oversampling, yielding audio data with 2, 4 or 8 sampling values per unit time respectively. In this embodiment, the unit time refers to one sampling interval of the target sampling frequency.
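The arithmetic in this example can be checked with a short sketch. The function name `oversampling_plan` and the dictionary keys are illustrative choices, not terms from the source.

```python
from typing import Dict, List


def oversampling_plan(target_hz: int, multiples: List[int]) -> List[Dict[str, float]]:
    """For each multiple N, list the actual sampling frequency and the unit time."""
    # One unit time is one sampling interval of the target sampling frequency.
    unit_time_us = 1_000_000 / target_hz
    plan = []
    for n in multiples:
        if n <= 1:
            raise ValueError("N must be a natural number greater than 1")
        plan.append({
            "n": n,
            "actual_hz": n * target_hz,      # oversampled capture rate
            "samples_per_unit_time": n,      # N samples fall into each unit time
            "unit_time_us": unit_time_us,
        })
    return plan


for row in oversampling_plan(8000, [2, 4, 8]):
    print(row)
```

With a target of 8 kHz the unit time is 125 microseconds, and the three multiples give capture rates of 16 kHz, 32 kHz and 64 kHz, matching the figures in the text.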

In this embodiment, test sampling may first be performed at the target sampling frequency. The energy of the collected voice is analyzed, the voice signal energy actually required is determined, and the required multiple N is determined from the ratio of the two.
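The source does not spell out how the energy ratio maps to N. The sketch below shows one possible mapping under a labeled assumption: if N adjacent samples are nearly identical, their sum scales the amplitude by N and the energy by N squared, so N is taken as the square root of the energy ratio, rounded up. The function names and the use of mean squared amplitude as "energy" are also assumptions.

```python
import math
from typing import Sequence


def mean_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a test recording (assumed energy measure)."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def choose_multiple(measured_energy: float, required_energy: float) -> int:
    """Pick the oversampling multiple N from the ratio of required to measured energy.

    Assumption (not from the source): summing N adjacent samples raises
    the signal energy by about N**2, hence N = ceil(sqrt(ratio)).
    """
    if measured_energy <= 0 or required_energy < 0:
        raise ValueError("energies must be positive")
    ratio = required_energy / measured_energy
    n = math.ceil(math.sqrt(ratio))
    # The text requires N to be a natural number greater than 1.
    return max(2, n)
```

Under this assumption a 16-fold energy shortfall gives N = 4. A different reading of the text, such as N equal to the energy ratio itself, would only change the line that computes `n`.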

In practical applications, the step S220 specifically includes the following sub-steps:

Step S221, selecting an actual high-speed audio acquisition circuit consisting of a high-frequency-response microphone and a high-speed signal acquisition circuit corresponding to the actual sampling frequency.

In this embodiment, after the required actual sampling frequency is determined, a corresponding high-frequency-response microphone may be selected, and the actual high-speed audio acquisition circuit may be formed from this microphone and the high-speed signal acquisition circuit.

Step S222, acquiring the voice uttered by the user by using the actual high-speed audio acquisition circuit.

In this embodiment, the corresponding actual high-speed audio acquisition circuit can be customized according to the actual voice analysis requirement, so that the actual high-speed audio acquisition circuit is used as an acquisition sensor to acquire the voice of the user.

Step S230, adding every N adjacent sampling values of the audio data to obtain the sampling value for one unit time.

In this embodiment, after audio data with N sampling values per unit time has been obtained, the audio data is down-sampled by the oversampling multiple: every N adjacent sampling values are added to obtain the sampling value for one unit time.

For example, after audio data with 2, 4 or 8 sampling values per unit time has been obtained, every 2, 4 or 8 adjacent sampling values, respectively, may be accumulated into a single sampling value, so that the energy of each single sampling point is increased.
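The multipoint accumulation in step S230 can be sketched in a few lines. The function name `sum_decimate` is illustrative, and dropping an incomplete trailing group is an assumption, since the source does not say how leftover samples are handled.

```python
from typing import List, Sequence


def sum_decimate(samples: Sequence[float], n: int) -> List[float]:
    """Add every n adjacent samples to form one output sample.

    An incomplete group at the end of the input is dropped (assumption).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    usable = len(samples) - len(samples) % n
    return [sum(samples[i:i + n]) for i in range(0, usable, n)]


print(sum_decimate([1, 2, 3, 4, 5, 6, 7, 8], 4))  # -> [10, 26]
```

Because the values are added rather than averaged, each output sample is about N times larger than an individual input sample when the input changes slowly, which is the energy increase the text refers to. With fixed-width integer samples, the output type would need enough headroom to hold the larger sums.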

Step S240, taking the audio data of the plurality of sampling values thus obtained as the enhanced voice data.

In this embodiment, audio data of a plurality of sampling values can be obtained by oversampling and down-sampling, and in the audio data, the internal noise generated by the internal circuit of the acquisition apparatus is not changed, and the energy of the sampling point per unit time is increased, so that audio data in which the sampling value is enhanced but the internal noise is not changed can be obtained.

Step S250, the enhanced voice data is taken as the required target audio data.

In this embodiment, after obtaining the enhanced voice data, the voice data may be used as the required target audio data, so as to obtain high quality audio data with a higher signal-to-noise ratio.

In this embodiment, once the target sampling frequency of the required target audio data has been determined, a corresponding voice acquisition scheme may be customized, in which the voice uttered by the user is first oversampled and then downsampled. During voice acquisition, the voice signal uttered by the user is enhanced, while the noise inside the device is not affected by the oversampling and downsampling processes.

Referring to fig. 3, a block diagram of a structure of a voice collecting apparatus 300 according to the present invention is shown, specifically, the voice collecting apparatus 300 may include the following modules:

a target sampling frequency determination module 301, configured to determine a target sampling frequency of the required target audio data;

the enhancement processing module 302 is configured to perform oversampling on the voice sent by the user according to the target sampling frequency, and then perform down-sampling on the voice to perform enhancement processing on a voice signal in the voice sent by the user;

and a target audio data obtaining module 303, configured to take the enhanced voice data as the required target audio data.

Optionally, the enhancement processing module 302 includes:

Optionally, the target sampling frequency determining module 301 includes:

Optionally, the oversampling submodule includes:

Optionally, the apparatus further comprises:

the enhancement processing module 302 is further configured to, when the collected voice is far-field voice, perform oversampling on the collected voice first and then perform downsampling on the collected voice according to the target sampling frequency.

Since the device embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
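The three-module structure of the voice collecting apparatus 300 can be pictured as a small composition of callables. This is a structural sketch only: the class name, field names and call signatures are hypothetical, and the source does not prescribe any particular software interface for modules 301 to 303.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VoiceAcquisitionApparatus:
    # Module 301: determines the target sampling frequency in Hz.
    determine_target_hz: Callable[[], int]
    # Module 302: oversamples then downsamples, given the target frequency.
    enhance: Callable[[int], List[float]]

    def acquire(self) -> List[float]:
        target_hz = self.determine_target_hz()
        enhanced = self.enhance(target_hz)
        # Module 303: the enhanced voice data is taken as the target audio data.
        return enhanced
```

Passing the modules in as callables keeps the sketch independent of any specific hardware; a real device would bind `enhance` to its high-speed acquisition circuit.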

Correspondingly, the invention further provides an electronic device, which includes a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the voice acquisition method according to the embodiment of the invention when executing the computer program, and can achieve the same technical effects, and the details are not repeated here to avoid repetition. The electronic device can be a PC, a mobile terminal, a personal digital assistant, a tablet computer and the like.

The present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the voice collecting method according to the embodiments of the present invention, and can achieve the same technical effects, and is not described herein again to avoid repetition. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The speech acquisition method, the apparatus, the electronic device and the computer-readable storage medium provided by the present invention are described in detail above, and a specific example is applied in the text to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and can certainly also be implemented by hardware. Based on such understanding, the above technical solutions, in essence or in the part that contributes over the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the method according to the embodiments or some parts of the embodiments.

Claims

1. A method for speech acquisition, the method comprising:

determining a target sampling frequency of the required target audio data;

2. The method of claim 1, wherein oversampling and then downsampling the speech uttered by the user according to the target sampling frequency comprises:

3. The method of claim 1, wherein determining a desired target sampling frequency for the target audio data comprises:

4. The method of claim 2, wherein collecting speech uttered by a user at an actual sampling frequency N times higher than the target sampling frequency comprises:

5. The method according to any one of claims 1-4, further comprising:

collecting voice sent by a user in an initial time period;

6. A speech acquisition device, the device comprising:

7. The apparatus of claim 6, wherein the enhancement processing module comprises:

8. The apparatus of claim 6, wherein the target sampling frequency determination module comprises:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the speech acquisition method of any one of claims 1 to 5 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech acquisition method according to any one of claims 1 to 5.