CN110931047A - Voice data acquisition method and device, acquisition terminal and readable storage medium

Info

Publication number: CN110931047A
Application number: CN201911247229.4A
Authority: CN
Inventors: 黄族良; 龙洪锋
Original assignee: Guangzhou National Acoustic Intelligent Technology Co Ltd
Current assignee: Guangzhou National Acoustic Intelligent Technology Co Ltd
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2020-03-27

Abstract

The invention discloses a voice data acquisition method, a voice data acquisition device, an acquisition terminal and a readable storage medium, wherein the method comprises the following steps: if the voice data is detected to contain human voice data, judging whether the energy of the audio frame of the currently collected voice data meets the preset collection condition or not; and if the energy of the audio frame of the voice data does not accord with the preset acquisition condition, outputting a text display to guide a user to adjust the voice acquisition state. Therefore, if the voice data which do not accord with the preset acquisition condition is detected, the text display is output to guide the user to adjust the voice acquisition state, the voice data with good quality is obtained, and the voiceprint features with high quality are extracted from the voice data.

Description

Voice data acquisition method and device, acquisition terminal and readable storage medium

Technical Field

The invention relates to the field of voice recognition, in particular to a voice data acquisition method, a voice data acquisition device, a voice data acquisition terminal and a readable storage medium.

Background

At present, when an acquisition terminal collects a user's speech, the collected voice may be too quiet because the user is too far from the terminal, or too quiet or too loud because the user speaks too softly or too loudly. In such cases the acquisition terminal may be unable to extract high-quality voiceprint features from the collected voice data of the user.

Disclosure of Invention

The main object of the invention is to provide a voice data acquisition method, a voice data acquisition device, an acquisition terminal and a readable storage medium, so as to solve the technical problem in the prior art that the acquisition terminal cannot extract high-quality voiceprint features from the collected voice data of a user.

In order to achieve the above object, the present invention provides a method for acquiring voice data, the method comprising:

if the voice data is detected to contain human voice data, judging whether the energy of the audio frame of the currently collected voice data meets the preset collection condition or not;

and if the energy of the audio frame of the voice data does not accord with the preset acquisition condition, outputting a text display to guide a user to adjust the voice acquisition state.

Further, the step of determining whether the energy of the audio frame of the currently acquired voice data meets a preset acquisition condition includes:

judging whether the absolute value of the difference value between the energy of the audio frame of the currently acquired voice data and the average energy of the audio frame in a first preset time length is smaller than a preset energy threshold value or not;

if the absolute value of the difference value between the energy of the audio frame of the voice data and the average energy of the audio frame is smaller than or equal to the preset energy threshold value, judging that the energy of the audio frame of the voice data meets the preset acquisition condition;

and if the absolute value of the difference value between the energy of the audio frame of the voice data and the average energy of the audio frame is greater than the preset energy threshold value, judging that the energy of the audio frame of the voice data does not accord with the preset acquisition condition.

Further, the step of outputting a text display to guide a user to adjust a voice acquisition state if the energy of the audio frame of the voice data does not meet the preset acquisition condition includes:

if the energy of the audio frame of the voice data is larger than the average energy of the audio frame, and the absolute value of the difference is larger than the preset energy threshold, outputting a text display to guide a user to increase the distance between the user and the acquisition terminal and/or reduce the speaking sound of the user;

and if the energy of the audio frame of the voice data is less than the average energy of the audio frame and the absolute value of the difference is greater than the preset energy threshold, outputting a text display to guide a user to reduce the distance between the user and the acquisition terminal and/or increase the speaking sound of the user.

Further, the adjusting of the voice acquisition state comprises adjusting a distance between the user and the acquisition terminal and/or adjusting the size of the speaking voice of the user;

the step of outputting a text display to guide a user to adjust a speech acquisition state includes:

and outputting a text display to guide the user to adjust the distance between the user and the acquisition terminal and/or adjust the size of the speaking sound of the user.

Further, the step of outputting a text display to guide the user to adjust the voice capture state comprises:

the acquisition terminal guides a user to adjust the voice acquisition state by increasing the brightness of the display screen and outputting text display in the display screen.

Further, the method further comprises:

and acquiring the voice data according to a second preset time length, and cutting out partial voice data in the voice data to serve as target voice data to be stored.

Further, if it is detected that the voice data includes human voice data, the step of determining whether the energy of the audio frame of the currently acquired voice data meets a preset acquisition condition includes:

inputting the acquired voice data into a preset human voice recognition model to judge whether the voice data contains the human voice data;

and if the voice data comprises the human voice data, judging whether the energy of the audio frame of the currently acquired voice data meets the preset acquisition condition.

The invention also provides a voice data acquisition device, which comprises:

the detection module is used for judging whether the energy of an audio frame of the currently acquired voice data meets a preset acquisition condition or not when the voice data is detected to contain human voice data;

and the guiding module is used for outputting text display to guide a user to adjust the voice acquisition state if the energy of the audio frame of the voice data does not accord with the preset acquisition condition.

The invention also provides an acquisition terminal, comprising: a memory, a processor, and a voice data acquisition program stored on the memory and executable on the processor, wherein the voice data acquisition program, when executed by the processor, implements the steps of the voice data acquisition method described above.

The invention also provides a readable storage medium, which is characterized in that the readable storage medium stores a computer program, and the computer program is executed by a processor to realize the steps of the voice data acquisition method.

According to the voice data acquisition method provided by the embodiment of the invention, if the voice data is detected to contain human voice data, whether the energy of an audio frame of the currently acquired voice data meets a preset acquisition condition is judged; and if the energy of the audio frame of the voice data does not accord with the preset acquisition condition, outputting a text display to guide a user to adjust the voice acquisition state. Therefore, if the voice data which do not accord with the preset acquisition condition is detected, the text display is output to guide the user to adjust the voice acquisition state, the voice data with good quality is obtained, and the voiceprint features with high quality are extracted from the voice data.

Drawings

Fig. 1 is a schematic structural diagram of an acquisition terminal in a hardware operating environment according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of a first embodiment of the voice data acquisition method according to the present invention;

Fig. 3 is a schematic diagram of the frame structure of an embodiment of the voice data acquisition device according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic structural diagram of an acquisition terminal of a hardware operating environment according to an embodiment of the present invention.

The acquisition terminal of the embodiment of the invention may be a PC, or another terminal device with a display function, such as a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a portable computer, and the like.

As shown in fig. 1, the acquisition terminal may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable communication among these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Optionally, the acquisition terminal may further include a camera, a Radio Frequency (RF) circuit, sensors, an audio circuit, a WiFi module, and the like. The sensors may include, for example, light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display screen according to the brightness of ambient light, and a proximity sensor that turns off the display screen and/or backlight when the terminal is moved to the ear. As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the terminal is stationary, and can be used for applications that recognize the terminal's attitude (such as switching between horizontal and vertical screens, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer and tap detection). Of course, the acquisition terminal may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described herein again.

Those skilled in the art will appreciate that the acquisition terminal configuration shown in fig. 1 does not constitute a limitation of the acquisition terminal and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a voice data collecting program.

In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call the voice data collecting program stored in the memory 1005 and perform the following operations:

Further, judging whether the absolute value of the difference value between the energy of the audio frame of the currently acquired voice data and the average energy of the audio frame in the first preset time length is smaller than a preset energy threshold value;

Further, if the energy of the audio frame of the voice data is greater than the average energy of the audio frame, and the absolute value of the difference is greater than the preset energy threshold, outputting a text display to guide the user to increase the distance between the user and the acquisition terminal and/or reduce the speaking sound of the user;

Further, the text display is output to guide the user to adjust the distance between the user and the acquisition terminal and/or adjust the size of the speaking voice of the user.

Further, the acquisition terminal guides a user to adjust the voice acquisition state by increasing the brightness of the display screen and outputting text display in the display screen.

Further, the voice data are collected according to a second preset time length, and part of the voice data are cut out from the voice data to serve as target voice data to be stored.

Further, inputting the acquired voice data to a preset human voice recognition model to judge whether the voice data contains the human voice data;

Referring to fig. 2, the present invention provides various embodiments of the method of the present invention based on the above-mentioned acquisition terminal hardware structure.

The invention provides a voice data acquisition method, which is applied to an acquisition terminal, and in a first embodiment of the voice data acquisition method, referring to fig. 2, the method comprises the following steps:

step S10, if the voice data is detected to contain human voice data, judging whether the energy of the audio frame of the currently collected voice data meets the preset collection condition;

When the acquisition terminal determines that the voice data contains human voice data, it judges whether the energy of the audio frame of the currently acquired voice data meets a preset acquisition condition. The acquisition terminal may be equipped with a microphone or similar device; for example, it may be a PC, a smartphone, a tablet computer, or another device with a voice acquisition function. The preset acquisition condition may be set by an internal program of the acquisition terminal or by the user. If voice data does not satisfy the preset acquisition condition, the voice acquisition state needs to be adjusted.

By requiring the energy of the audio frame of the voice data to meet the preset acquisition condition, voice data that does not meet the condition can be screened out and the acquisition state adjusted, which improves the quality of voice data acquisition.

In step S20, if the energy of the audio frame of the voice data does not meet the preset collection condition, a text display is output to guide the user to adjust the voice collection status.

If the acquisition terminal determines that the energy of the audio frame of the voice data does not meet the preset acquisition condition, it can display text content on its display screen to guide the user to adjust the voice acquisition state. Adjusting the voice acquisition state includes adjusting the distance between the user and the acquisition terminal and/or adjusting the volume of the user's speech. In this embodiment, guiding the user through a text display avoids introducing noise; guidance by voice prompt or vibration, by contrast, could cause unwanted sound to be collected and thereby degrade the acquisition of the voice data.

In this embodiment, if it is detected that voice data includes human voice data, it is determined whether energy of an audio frame of the currently acquired voice data meets a preset acquisition condition; and if the energy of the audio frame of the voice data does not accord with the preset acquisition condition, outputting a text display to guide a user to adjust the voice acquisition state. Therefore, if the voice data which do not accord with the preset acquisition condition is detected, the text display is output to guide the user to adjust the voice acquisition state, the voice data with good quality is obtained, and the voiceprint features with high quality are extracted from the voice data.

Further, in the step S10 of the first embodiment, the step of determining whether the energy of the audio frame of the currently captured speech data meets the preset capturing condition includes:

step S11, judging whether the absolute value of the difference between the energy of the audio frame of the currently collected voice data and the average energy of the audio frame in the first preset duration is smaller than a preset energy threshold value;

if the absolute value of the difference value between the energy of the audio frame of the voice data and the average energy of the audio frame is smaller than or equal to a preset energy threshold value, judging that the energy of the audio frame of the voice data meets a preset acquisition condition;

and if the absolute value of the difference value between the energy of the audio frame of the voice data and the average energy of the audio frame is larger than the preset energy threshold value, judging that the energy of the audio frame of the voice data does not accord with the preset acquisition condition.

The acquisition terminal judges whether the absolute value of the difference between the energy of the audio frame of the currently acquired voice data and the average energy of the audio frames within the first preset duration is smaller than or equal to the preset energy threshold. If so, the energy of the audio frame of the voice data is judged to meet the preset acquisition condition; if the absolute value of the difference is greater than the preset energy threshold, the energy of the audio frame of the voice data is judged not to meet the preset acquisition condition. The first preset duration may be set by the user, for example 2 min or 5 min. The preset energy threshold should be relatively large, for example more than 5 or 10, and may be set by the user. A large threshold screens out non-compliant voice data, whereas a small threshold may cause the voice acquisition state to be adjusted almost continuously and increases the probability of misadjustment.

In this embodiment, the average energy of the audio frames within the first preset duration is calculated, along with the absolute value of the difference between that average and the energy of the audio frame of the currently acquired voice data. If the absolute value of the difference is smaller than or equal to the preset energy threshold, the preset acquisition condition is met; if it is greater than the preset energy threshold, the condition is not met. For example, suppose the preset energy threshold is 5, the first preset duration is 2 min, the energy of the audio frame of the voice data acquired by the acquisition terminal is 10, and the average energy of the audio frames within those 2 min is 15. The absolute value of the difference is 5, which equals the preset energy threshold, so the preset acquisition condition is met.
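The judgment in step S11 can be illustrated with a minimal Python sketch. The function names `average_energy` and `meets_acquisition_condition` are hypothetical and chosen only for illustration; the description does not specify how frame energy is computed or in what unit, so the energies here are plain numbers.

```python
def average_energy(recent_energies):
    """Average energy of the audio frames seen within the first preset duration."""
    return sum(recent_energies) / len(recent_energies)


def meets_acquisition_condition(frame_energy, avg_energy, threshold):
    """True if |frame_energy - avg_energy| is smaller than or equal to the threshold."""
    return abs(frame_energy - avg_energy) <= threshold


# Values from the example in the description: threshold 5, current frame
# energy 10, average energy 15. The recent energies are made-up numbers
# whose average is 15.
avg = average_energy([14, 15, 16])
condition_met = meets_acquisition_condition(10, avg, 5)
```

With these numbers the absolute difference equals the threshold, so `condition_met` is true; a frame energy of 21 against the same average would fail the test.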

Optionally, step S11 may specifically include:

step S111, if the energy of the audio frame of the voice data is larger than the average energy of the audio frame, and the absolute value of the difference is larger than a preset energy threshold, outputting a text display to guide a user to increase the distance between the user and the acquisition terminal and/or reduce the speaking sound of the user;

in step S112, if the energy of the audio frame of the speech data is less than the average energy of the audio frame, and the absolute value of the difference is greater than the preset energy threshold, a text display is output to guide the user to reduce the distance between the user and the collection terminal and/or increase the speaking voice of the user.

In this embodiment, if the energy of the audio frame of the speech data is greater than the average energy of the audio frame, and the absolute value of the difference is greater than the preset energy threshold, that is, it indicates that the energy of the currently acquired speech data is too large, a text display is output to guide the user to increase the distance between the user and the acquisition terminal and/or reduce the user speaking sound, and if the energy of the audio frame of the speech data is less than the average energy of the audio frame, and the absolute value of the difference is greater than the preset energy threshold, that is, it indicates that the energy of the currently acquired speech data is too small, a text display is output to guide the user to reduce the distance between the user and the acquisition terminal and/or increase the user speaking sound. For example, when the distance between the user and the collection terminal is 1 meter away and the speaking voice of the user is too small, the collection terminal determines that the energy of the voice data currently collected by the user is too small, and displays "please reduce the distance from the collection terminal and increase the speaking voice" on the display screen.
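The choice between the two prompts in steps S111 and S112 can be sketched as follows. This is an illustrative assumption rather than the terminal's actual program: the function name `guidance_text` and the exact prompt wording are hypothetical.

```python
from typing import Optional


def guidance_text(frame_energy: float, avg_energy: float,
                  threshold: float) -> Optional[str]:
    """Return the text to display, or None if no adjustment is needed."""
    diff = frame_energy - avg_energy
    if abs(diff) <= threshold:
        # The preset acquisition condition is met: nothing to display.
        return None
    if diff > 0:
        # Energy too large (step S111): move away and/or speak more quietly.
        return ("Please increase the distance from the acquisition terminal "
                "and/or lower your speaking voice")
    # Energy too small (step S112): move closer and/or speak louder.
    return ("Please reduce the distance from the acquisition terminal "
            "and/or raise your speaking voice")
```

Returning `None` when the condition is met keeps the display logic separate: the caller only outputs text when there is something to adjust.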

Further, in step S20 of the above-mentioned first embodiment, the step of outputting a text display to guide the user to adjust the voice capture state includes:

and step S21, outputting a text display to guide the user to adjust the distance between the user and the acquisition terminal and/or adjust the speaking voice of the user.

In this embodiment, the collection terminal displays text content on the display screen, where the text content may be for the user to adjust the distance from the collection terminal and/or adjust the size of the speech sound of the user. For example, "please increase the voice of speaking" may be displayed in text.

Further, in step S20 of the above-mentioned first embodiment, the step of outputting the text display to guide the user to adjust the voice capture state includes:

step S201, the acquisition terminal guides the user to adjust the voice acquisition state by increasing the brightness of the display screen and outputting a text display on the display screen. In this embodiment, increasing the brightness of the display screen of the acquisition terminal lets the user see the text content clearly, and because the prompt itself produces no noise, it does not affect the collection of the user's voice data.

Further, in the step S20 of the first embodiment, if the energy of the audio frame of the speech data does not meet the preset capture condition, the step of outputting a text display to guide the user to adjust the speech capture state includes:

and step A, acquiring voice data according to a second preset time length, and cutting out partial voice data in the voice data to serve as target voice data to be stored.

The second preset duration may be set by the user, for example 2 min or 4 min. In this embodiment, the voice data is collected for the second preset duration. Either the good-quality part of the voice data is cut out and stored as the target voice data, or the poor-quality part is cut out and discarded and the remaining voice data is used as the target voice data. In both cases, high-quality voiceprint features can be extracted from the target voice data.
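One possible way to cut out the good-quality part is sketched below. The description does not say how quality is measured when cutting, so this sketch assumes (hypothetically) that a frame counts as good when its energy is within the threshold of the average over the whole recording, and it keeps the longest contiguous run of such frames. The function name `cut_target_segment` is also hypothetical.

```python
def cut_target_segment(frame_energies, threshold):
    """Return (start, end) frame indices, end exclusive, of the longest
    contiguous run of frames whose energy is within `threshold` of the mean."""
    if not frame_energies:
        return (0, 0)
    mean = sum(frame_energies) / len(frame_energies)
    best = (0, 0)
    run_start = None
    for i, energy in enumerate(frame_energies):
        if abs(energy - mean) <= threshold:
            if run_start is None:
                run_start = i  # a new run of good frames begins here
            if i + 1 - run_start > best[1] - best[0]:
                best = (run_start, i + 1)
        else:
            run_start = None  # a poor frame ends the current run
    return best


# Made-up per-frame energies: one very quiet frame and one very loud frame.
energies = [1, 14, 15, 16, 15, 40, 15]
start, end = cut_target_segment(energies, 5)
target = energies[start:end]
```

Here the mean is about 16.6, so the frames with energy 1 and 40 are rejected and the stored target is the four-frame run between them.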

Further, in step S10 of the above-mentioned first embodiment, if it is detected that the voice data includes human voice data, the step of determining whether the energy of the audio frame of the currently acquired voice data meets the preset acquisition condition includes:

step S101, inputting the acquired voice data into a preset human voice recognition model to judge whether the voice data contains human voice data;

and if the voice data contains human voice data, judging whether the energy of the audio frame of the currently acquired voice data meets the preset acquisition condition.

The acquisition terminal inputs the acquired voice data into a preset human voice recognition model to judge whether the voice data contains human voice data. If it does, the terminal judges whether the energy of the audio frame of the currently acquired voice data meets the preset acquisition condition; if it does not, no judgment is made. The preset human voice recognition model is built in advance: a plurality of sample voice data containing human voice is obtained and used as a training set, and the training set is input to a deep neural network for training. The resulting model is used to judge whether voice data contains human voice data. Training on a large amount of sample voice data yields a more accurate and stable human voice recognition model, which improves the accuracy of recognizing human voice data.
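The gating in step S101 can be sketched as follows. The trained deep neural network is not reproduced here; `contains_human_voice` is a hypothetical callable standing in for the preset human voice recognition model. The description also does not define how frame energy is computed, so `frame_energy` assumes the mean of the squared samples purely for illustration.

```python
from typing import Callable, Optional, Sequence


def frame_energy(samples: Sequence[float]) -> float:
    """Assumed energy measure: mean of squared samples (frame must be non-empty)."""
    return sum(s * s for s in samples) / len(samples)


def judge_frame(samples: Sequence[float],
                contains_human_voice: Callable[[Sequence[float]], bool],
                avg_energy: float,
                threshold: float) -> Optional[bool]:
    """Return None if no human voice is detected (no judgment is made);
    otherwise return whether the frame meets the acquisition condition."""
    if not contains_human_voice(samples):
        return None
    return abs(frame_energy(samples) - avg_energy) <= threshold


# Stand-in detectors used only to exercise the control flow.
frame = [3.0, -3.0, 3.0, -3.0]  # energy 9.0 under the assumed measure
with_voice = judge_frame(frame, lambda s: True, avg_energy=10.0, threshold=5.0)
without_voice = judge_frame(frame, lambda s: False, avg_energy=10.0, threshold=5.0)
```

The three-valued result separates "no voice, so no judgment" (`None`) from a failed energy check (`False`), matching the rule that frames without human voice are not judged at all.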

In addition, referring to fig. 3, an embodiment of the present invention further provides a voice data collecting device, where the device includes:

and the detection module is used for judging whether the energy of the audio frame of the currently acquired voice data meets the preset acquisition condition or not when the voice data is detected to contain the human voice data.

Further, a detection module comprising:

and the judging unit is used for judging whether the absolute value of the difference value between the energy of the audio frame of the currently acquired voice data and the average energy of the audio frame in the first preset duration is smaller than a preset energy threshold value.

And the first judging unit is used for judging that the energy of the audio frame of the voice data meets the preset acquisition condition if the absolute value of the difference value between the energy of the audio frame of the voice data and the average energy of the audio frame is less than or equal to the preset energy threshold.

And the second judging unit is used for judging that the energy of the audio frame of the voice data does not meet the preset acquisition condition if the absolute value of the difference value between the energy of the audio frame of the voice data and the average energy of the audio frame is greater than the preset energy threshold.

Further, the guiding module is further configured to output a text display to guide the user to increase the distance between the user and the acquisition terminal and/or reduce the speaking voice of the user if the energy of the audio frame of the speech data is greater than the average energy of the audio frame and the absolute value of the difference is greater than the preset energy threshold.

The guiding module is further used for outputting text display to guide the user to reduce the distance between the user and the acquisition terminal and/or increase the speaking voice of the user if the energy of the audio frame of the voice data is smaller than the average energy of the audio frame and the absolute value of the difference is larger than the preset energy threshold.

Further, the guiding module is further used for outputting text display to guide the user to adjust the distance between the user and the collecting terminal and/or adjust the size of the speaking voice of the user.

The guiding module is further used for guiding the user to adjust the voice acquisition state by increasing the brightness of the display screen of the acquisition terminal and outputting a text display on the display screen.

Further, the voice data acquisition device comprises:

and the cutting module is used for collecting the voice data according to a second preset time length and cutting out partial voice data in the voice data to be used as target voice data to be stored.

Further, the detection module is further configured to input the acquired voice data to a preset human voice recognition model to determine whether the voice data includes the human voice data.

The judging unit is further configured to judge whether the energy of the audio frame of the currently acquired voice data meets the preset acquisition condition if the voice data includes the human voice data.

For the specific limitations of the voice data acquisition device, reference may be made to the limitations of the voice data acquisition method above, which are not repeated here. The modules in the voice data acquisition device may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.

Furthermore, an embodiment of the present invention further provides a readable storage medium (i.e., a computer-readable memory), where a voice data collection program is stored on the readable storage medium, and when executed by a processor, the voice data collection program implements the following operations:

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for voice data acquisition, the method comprising:

2. The method for acquiring voice data according to claim 1, wherein the step of determining whether the energy of the audio frame of the currently acquired voice data meets a preset acquisition condition comprises:

3. The method of claim 2, wherein the step of outputting a text display to guide the user to adjust the voice capturing status if the energy of the audio frame of the voice data does not meet the preset capturing condition comprises:

4. The voice data collection method according to claim 1, wherein the adjusting the voice collection state comprises adjusting a distance between the user and the collection terminal and/or adjusting a size of a speaking voice of the user;

5. The voice data collection method of claim 1, wherein the step of outputting a text display to guide a user to adjust a voice collection status comprises:

6. The method for collecting voice data according to claim 1, wherein the step of outputting a text display to guide a user to adjust the voice collecting status if the energy of the audio frame of the voice data does not meet the preset collecting condition comprises:

7. The method of claim 1, wherein if it is detected that the voice data includes human voice data, the step of determining whether the energy of the audio frame of the currently acquired voice data meets a preset acquisition condition includes:

8. A voice data acquisition apparatus, the apparatus comprising:

9. An acquisition terminal, characterized in that the acquisition terminal comprises: a memory, a processor and a program stored on the memory and executable on the processor, the voice data acquisition program when executed by the processor implementing the steps of the voice data acquisition method as claimed in any one of claims 1 to 7.

10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the speech data acquisition method according to one of claims 1 to 7.