CN112382310B - Human voice audio recording method and device - Google Patents
- Publication number: CN112382310B (application CN202011258272.3A)
- Authority: CN (China)
- Prior art keywords: audio, human voice, original audio, sub, probability
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B5/00—Electrically-operated educational appliances
- G09B5/04—Electrically-operated educational appliances with audible presentation of the material to be studied
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B20/00—Signal processing not specific to the method of recording or reproducing; Circuits therefor
- G11B20/10—Digital recording or reproducing
- G11B20/10527—Audio or video recording; Data buffering arrangements
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B20/00—Signal processing not specific to the method of recording or reproducing; Circuits therefor
- G11B20/10—Digital recording or reproducing
- G11B20/10527—Audio or video recording; Data buffering arrangements
- G11B2020/10537—Audio or video recording
- G11B2020/10546—Audio or video recording specifically adapted for audio data
- G11B2020/10555—Audio or video recording specifically adapted for audio data wherein the frequency, the amplitude, or other characteristics of the audio signal is taken into account
- G11B2020/10564—Audio or video recording specifically adapted for audio data wherein the frequency, the amplitude, or other characteristics of the audio signal is taken into account frequency
Abstract
The application provides a human voice audio recording method and device. The method comprises the following steps: obtaining a current original audio; acquiring the audio clips identified as human voice in the current original audio; splicing the human voice audio clips in time order to obtain a spliced audio; and storing or outputting the spliced audio. The scheme provided by the application can extract the human voice portion from the original audio, save storage space, save the time a user spends replaying the voice content, and preserve the continuity of speech in the recording.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for recording human voice and audio.
Background
With the development of Internet technology, industries such as online education have flourished, and the number of online learners has increased dramatically.
The spoken responses of students during online learning need to be recorded and output. This requirement arises particularly in language learning, such as English, and also in subjects such as logical thinking that are taught through spoken interaction. On the one hand, teachers learn about students' progress, such as their English pronunciation, from the recordings so as to provide guidance; on the other hand, students can review their own answers by listening to the recordings. A method for extracting and recording human voice that meets the needs of this scenario is therefore required.
Disclosure of Invention
The application provides a human voice audio recording method, which comprises the following steps: obtaining a current original audio; acquiring the audio clips identified as human voice in the current original audio; splicing the human voice audio clips in time order to obtain a spliced audio; and storing or outputting the spliced audio.
The method further comprises the following steps: obtaining a first mean of the durations of the non-human voice audio segments in the current original audio; obtaining, using the mean, a first variance of the durations of the non-human voice audio segments in the current original audio; taking the difference between the first mean and the first variance as a first threshold; and, if non-human voice audio segments whose duration is less than the first threshold exist between the human voice audio segments, splicing those non-human voice audio segments with the human voice audio segments.
Alternatively, the method further comprises: according to the user identifier corresponding to the original audio, obtaining the total duration of the non-human voice audio segments of at least one historical original audio of the user and the variance sum of the non-human voice audio segments of the historical original audio; obtaining a second mean of the non-human voice audio segment durations using the durations of the non-human voice audio segments of the current original audio and the total duration of the non-human voice audio segments of the historical original audio; obtaining a second variance of the non-human voice audio segments using the variance of the non-human voice audio segments of the current original audio and the variance sum of the non-human voice audio segments of the historical original audio; taking the difference between the second mean and the second variance as a second threshold; and, if non-human voice audio segments whose duration is less than the second threshold exist between the human voice audio segments in the current original audio, splicing those non-human voice audio segments with the human voice audio segments.
Further, the method comprises: obtaining and storing the sum of the durations of the non-human voice audio segments of the current original audio and the total duration of the non-human voice audio segments of the historical original audio; and obtaining and storing the sum of the variance of the non-human voice audio segments of the current original audio and the variance sum of the non-human voice audio segments of the historical original audio.
Alternatively, the method further comprises: obtaining a first mean of the durations of the non-human voice audio segments in the current original audio; obtaining, using the mean, a first variance of the durations of the non-human voice audio segments in the current original audio; taking the difference between the first mean and the first variance as a first threshold; acquiring the user identifier corresponding to the original audio; acquiring, as a third mean, the mean of the durations of the non-human voice audio segments in at least one historical original audio of the user; obtaining, as a third variance, the variance of the non-human voice audio segments in the at least one historical original audio; taking the difference between the third mean and the third variance as a third threshold; adjusting the first threshold with a preset weight of the third threshold to obtain a fourth threshold; and, if non-human voice audio segments whose duration is less than the fourth threshold exist between the human voice audio segments in the current original audio, splicing those non-human voice audio segments with the human voice audio segments.
Further, the method comprises: obtaining and storing the mean duration of the non-human voice audio segments of the current original audio and of the historical original audio; and obtaining and storing the variance of the non-human voice audio segments of the current original audio and of the historical original audio.
Alternatively, the method further comprises: if non-human voice audio segments whose duration is less than a fifth threshold exist between adjacent human voice audio segments in the current original audio, splicing those non-human voice audio segments with the human voice audio segments.
In the above embodiments, obtaining the audio clips identified as human voice in the original audio specifically comprises: segmenting the original audio according to a preset method to obtain a plurality of sub-audios; calculating a Mel-frequency cepstrum coefficient feature sequence of each sub-audio; obtaining, by a neural network, the probability that each sub-audio belongs to human voice from the Mel-frequency cepstrum coefficient feature sequence; acquiring the sub-audios whose human voice probability is greater than a decision threshold; acquiring the adjacent sub-audios in the original audio whose human voice probability is greater than the decision threshold; and acquiring the audio clip formed by the determined time points of those adjacent sub-audios.
In the above embodiment, segmenting the original audio according to a preset method to obtain a plurality of sub-audios comprises: obtaining the original audio, adding null data of a first duration before the head of the original audio and null data of a second duration after the tail of the original audio to obtain an expanded audio; and, taking a third duration equal to the sum of the first duration and the second duration as the segmentation window, windowing sequentially from the head of the expanded audio with a first step length to obtain the plurality of sub-audios.
In the above embodiment, after obtaining the probability that the sub-audio belongs to the human voice, the method further includes: obtaining an array of the probability that all the sub-audios of the original audio belong to the human voice; and filtering the probability values in the array by taking the first number as a window to obtain the filtered human voice probability.
In the above embodiment, before acquiring the sub-audios whose human voice probability is greater than the decision threshold, the method further comprises: acquiring the audio energy value at the determined time point of each sub-audio in the original audio; and setting a human voice probability adjustment factor according to the audio energy value, as follows: if the audio energy value is greater than the upper energy limit, the human voice probability adjustment factor of the sub-audio is set to 1; if the audio energy value is less than the lower energy limit, the human voice probability adjustment factor of the sub-audio is set to 0; if the audio energy value is neither greater than the upper energy limit nor less than the lower energy limit, the human voice probability adjustment factor is normalized to a value between 0 and 1 according to the audio energy value; and the human voice probability adjustment factor of the sub-audio is multiplied by the human voice probability of the sub-audio to obtain the corrected human voice probability of the sub-audio.
The present application further provides a human voice audio recording apparatus, which comprises:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method as described above.
By using human voice recognition, the method and device extract the human voice portion of the original audio, store only the human voice audio clips, and discard the non-human voice audio clips; this removes noise and, because the non-human voice clips are discarded, also saves storage space.
Furthermore, based on the characteristics of human speech, in particular its front-to-back continuity and the short pauses and breaths that frequently occur when children answer questions by voice, the application provides several algorithms for tolerant merging, which retain the short non-human voice parts between human voice audio clips and thus preserve the continuity of speech in the recording.
The application further provides a computer readable storage medium having executable code stored thereon which, when executed by a processor of a computing device, causes the processor to perform the method described above.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application, as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
Fig. 1 is a schematic flowchart illustrating a human voice audio recording method according to an embodiment of the present application;
FIG. 2 is a diagram illustrating an original audio slicing preprocessing according to an embodiment of the present application;
FIG. 3 is a graph showing a probability distribution of audio voices before moving average according to an embodiment of the present application;
FIG. 4 is a graph illustrating a probability distribution of audio voices after moving average according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a wide-tolerance merge process.
Detailed Description
Preferred embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
The application provides a human voice audio recording method. The method identifies the human voice in the original audio, extracts the human voice audio from the original audio, and then stores or outputs it; because a large number of non-human-voice parts are removed from the stored audio file, storage resources are saved compared with the prior art.
An embodiment of the present invention is described with reference to fig. 1.
Step 11: the original audio is obtained.
An original audio file is acquired. For example, when a student studies online, the student gives spoken answers following prompts from the learning software, and the smart device captures the student's original audio through a microphone. The original audio may contain both the desired human voice and other non-human-voice audio such as background sound and noise.
Step 12: acquiring an audio clip identified as human voice in the current original audio;
One implementation is described below by way of example; the present invention is not limited to it, and other implementations that achieve the same function may also be used.
In step 121, the portion of the human voice in the original audio is identified.
Respectively adding null data before the head and after the tail of the original audio to obtain expanded audio;
In one embodiment, the original audio is subdivided into smaller sub-audios: a segment of null audio is added to the beginning and to the end of the original audio to obtain the extended audio, and the extended audio is then segmented into sub-audios using a segmentation window, with the ratio of the null-audio duration to the segmentation-window duration kept at 1:2.
In the embodiment shown in fig. 2, in order to obtain accurate statistics of the time points at which speech begins, the sub-audios need a fairly fine slicing granularity. As shown in the figure, a is the original audio array; null data of equal duration, i.e., 480 milliseconds (ms) of zeros, is added at the head and at the tail of the original audio a to obtain the extended audio b. The number of zero samples within the 480 ms is determined by the sampling frequency of the audio, i.e., the zero data within the 480 ms is generated at the same rate as the sampling frequency.
The duration of the null data added before the head and after the tail of the original audio in this embodiment is only exemplary; other durations may be used.
Step 122: taking twice the null-data duration as the segmentation window, a plurality of sub-audios are obtained by windowing sequentially from the head of the extended audio with a first step size.
as shown in fig. 2, in this embodiment, when the original audio is sliced to obtain sub-audio, the slicing window takes 960ms, that is, 2 times the 480 ms. The slicing step size is 10ms, so that the minimum slicing granularity of the sub-audio is 10 ms. The invention does not limit the choice of other segmentation granularities.
According to this segmentation method, a plurality of sub-audios are obtained; adjacent sub-audios are offset by 10 ms, and each sub-audio is 960 ms long.
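As an illustrative sketch of the padding and windowing described above (Python and NumPy are assumed here; they are not specified by the embodiment), the slicing could be written roughly as follows, with the 480 ms padding, 960 ms window, and 10 ms step taken from this example:

```python
import numpy as np

def slice_into_sub_audios(audio, sr, pad_ms=480, win_ms=960, step_ms=10):
    """Pad the original audio with zeros at both ends, then slide a window over it.

    Returns a list of sub-audio arrays; adjacent sub-audios start 10 ms apart.
    Variable names and the return format are illustrative assumptions.
    """
    pad = np.zeros(sr * pad_ms // 1000, dtype=audio.dtype)   # 480 ms of zero samples
    extended = np.concatenate([pad, audio, pad])              # extended audio "b"
    win = sr * win_ms // 1000                                 # 960 ms segmentation window
    step = sr * step_ms // 1000                               # 10 ms slicing step
    return [extended[s:s + win] for s in range(0, len(extended) - win + 1, step)]
```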
Suppose the start and end times of a certain sub-audio in the original audio are t_i and t_i + 0.96 s, respectively. In the embodiment of the present invention, the human voice probability calculated from this sub-audio's feature map in the subsequent steps is taken as the human voice probability corresponding to the time point t_i + 0.48 s. Accordingly, the human voice probability calculated from the first sub-audio is used as the human voice probability at the starting moment of the original audio, and the human voice probability calculated from the last sub-audio is used as the human voice probability at the ending moment of the original audio.
By segmenting the original audio in this way, the human voice probability at each time point is calculated approximately, so that the segments in which the speaker is talking can be detected more accurately.
Step 123: a time-series feature sequence is calculated for each sub-audio.
In the present embodiment, Mel-frequency cepstrum coefficients (MFCCs) are spectral coefficients obtained by a linear transformation of the log energy spectrum on the nonlinear mel scale of sound frequency; they represent the frequency-domain characteristics of the sound.
For each sub-audio obtained by segmentation, a short-time Fourier transform is computed with a preset window length and step length, and the Mel-frequency cepstrum coefficient feature sequence is derived from the result. For example, with a window length of 25 ms and a step size of 10 ms, the short-time Fourier transform is computed and the MFCC features are obtained.
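A minimal sketch of this feature extraction, assuming the librosa library, a 16 kHz sample rate, and 13 coefficients (all assumptions for illustration; the embodiment only fixes the 25 ms window and 10 ms step):

```python
import numpy as np
import librosa

def mfcc_features(sub_audio, sr=16000, n_mfcc=13):
    """Compute an MFCC feature sequence with a 25 ms window and a 10 ms step."""
    return librosa.feature.mfcc(
        y=np.asarray(sub_audio, dtype=np.float32),
        sr=sr,
        n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms step
    )
```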
Step 124: and the neural network obtains the probability that the sub-audio belongs to the specific classification according to the time sequence characteristic sequence.
Inputting the mel frequency cepstrum coefficient characteristic sequence into the trained neural network model, and obtaining the corresponding probability of each audio frequency segment output by the neural network model. The probability ranges from 0 to 1.
For example, the trained neural network model uses 3x3 convolution kernels and pooling layers to simplify the model parameters. Training of the neural network comprises two stages, pre-training and fine-tuning. The left diagram shows a 500-class classification model, an audio classification model that has been trained on a sound data set. The right diagram shows a two-class model; this network reuses the underlying structure and parameters of the 500-class model and is converged through the back-propagation algorithm. The two-class model identifies whether human voice exists in an audio segment and outputs the probability that the current audio segment contains human voice. By introducing pre-training and fine-tuning, the trained network focuses on the classification of human voice versus non-human voice, which improves model performance.
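The embodiment does not disclose the exact network layout beyond 3x3 convolution kernels, pooling layers, and the pre-train/fine-tune scheme; the PyTorch sketch below is therefore only one possible shape of such a binary human-voice classifier, with every layer size an assumption:

```python
import torch
import torch.nn as nn

class VoiceClassifier(nn.Module):
    """Small CNN over an MFCC feature map; outputs the probability of human voice."""

    def __init__(self):
        super().__init__()
        # Backbone meant to be initialized from the pre-trained 500-class audio model.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        # Binary head replacing the original 500-class head during fine-tuning.
        self.head = nn.Linear(32, 1)

    def forward(self, mfcc):
        # mfcc: (batch, 1, n_mfcc, frames)
        x = self.backbone(mfcc).flatten(1)
        return torch.sigmoid(self.head(x))   # probability that the sub-audio is human voice
```

In the fine-tuning stage described above, the backbone weights would be copied from the pre-trained model and the binary head trained by back-propagation on voice/non-voice labels.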
Step 125: comparing the probabilities with a judgment threshold respectively to judge whether the sub-audio belongs to the classification of the human voice; therefore, the sub-audio with the probability of human voice being larger than the judgment threshold in the original audio is obtained.
The decision threshold is set as the basis for judging whether a sub-audio is human voice: if the probability is greater than the decision threshold, the sub-audio is judged to be human voice; if the probability is less than the decision threshold, it is judged to be non-human voice.
Through the above steps, the original audio a is divided into individual human voice segments and non-human voice segments. The total duration of human voice in the original audio can be obtained by accumulating the durations of all the human voice segments.
step 126: for the sub-audio with the human voice probability larger than the judgment threshold and adjacent to the judgment threshold in the original audio; and acquiring an audio clip consisting of the central time points in the adjacent sub-audios, namely the human voice audio clip.
In this method for acquiring the human voice audio clips, null data of equal duration, for example 480 milliseconds, is added before the head and after the tail of the original audio, and a window of twice 480 milliseconds, i.e., 960 milliseconds, is used to segment the original audio into a plurality of sub-audios.
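A sketch of how the per-10 ms human voice probabilities could be grouped into human voice clips; the 0.5 decision threshold and the function shape are assumptions, while the 10 ms grid comes from the embodiment:

```python
def probs_to_segments(probs, threshold=0.5, step_s=0.010):
    """Group consecutive points whose human voice probability exceeds the threshold.

    Returns a list of (start_time, end_time) tuples in seconds.
    """
    segments, start = [], None
    for i, p in enumerate(probs):
        if p > threshold and start is None:
            start = i * step_s                      # a human voice segment begins
        elif p <= threshold and start is not None:
            segments.append((start, i * step_s))    # the segment ends
            start = None
    if start is not None:
        segments.append((start, len(probs) * step_s))
    return segments
```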
Step 13: and splicing the audio segments of the human voice according to a time sequence to obtain spliced audio.
And splicing the voice audio clips according to a time sequence to obtain a spliced voice audio file, namely the voice part audio in the original audio.
Step 14: the spliced audio, i.e., the human voice portion of the original audio, is stored or output. On the one hand, the file occupies little storage space; on the other hand, because the audio no longer contains the non-human-voice parts, it is shorter than the original audio, so its playback time is shorter and the user can listen to the voice content repeatedly without wasting time.
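A hedged sketch of steps 13 and 14, assuming the original audio is held as a NumPy array and the soundfile package is used for writing (neither is specified in the embodiment):

```python
import numpy as np
import soundfile as sf

def save_spliced_voice(audio, sr, segments, out_path):
    """Concatenate the human voice segments in time order and store the result."""
    pieces = [audio[int(s * sr):int(e * sr)] for s, e in sorted(segments)]
    spliced = np.concatenate(pieces) if pieces else np.zeros(0, dtype=audio.dtype)
    sf.write(out_path, spliced, sr)   # store (or output) the spliced human voice audio
    return spliced
```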
On the basis of the embodiment, after the probability that the sub-audio is the human voice is obtained by the neural network, the following preprocessing steps can be executed before threshold judgment is carried out, so that the aim of optimizing the probability value is fulfilled.
1) Moving average preprocessing of the currently obtained probabilities.
Due to the segmentation granularity and noise, the human voice probability array of the original audio obtained according to the method described above contains noise points. As shown in fig. 3, which is a probability distribution diagram of a human voice of 200 milliseconds, the ordinate represents the probability that the audio point is a human voice, the abscissa represents time, and each point represents 10 ms. There are many abrupt changes of probability values, i.e., spikes, in the probability value distribution of 0 to 1 corresponding to the horizontal axis time axis. Therefore, it is necessary to perform a moving average preprocessing on the currently obtained probability so that the probability distribution is smoother, resulting in a human voice probability distribution map of 200 milliseconds as shown in fig. 4.
The moving average preprocessing adopts a median moving filtering method. Let the human voice probability array of all sub-audios in the original audio be

P = {p_1, p_2, p_3, ..., p_i, ..., p_n}

where n is the total number of sub-audios obtained by segmenting the original audio and p_i represents the probability that the i-th sub-audio is a human voice.

w_smooth is the selected window size. For example, in this embodiment the window is 31, i.e., 31 values of the sub-audio human voice probability array.

For p_i, the lower and upper index limits of the moving average are determined as:

lo = max(0, i - 15), the index of the first probability value used in the average;

hi = min(n, i + 15), the index of the last probability value used in the average.

The probability that the i-th sub-audio is a human voice after median filtering is then the average of the probability values inside the window:

p'_i = ( Σ_{j=lo}^{hi} p_j ) / (hi - lo + 1)

In this embodiment, the median filtering averages the probability values of the 31 adjacent points to obtain the probability value of the middle point; according to this method, the probability value of each point is recalculated with a step size of 1.
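The windowed smoothing just described can be sketched as follows; the 31-point window and the lo/hi index bounds follow the embodiment, while the edge handling is an assumption:

```python
import numpy as np

def smooth_probs(probs, w_smooth=31):
    """Replace each probability by the mean of up to w_smooth neighbouring points."""
    probs = np.asarray(probs, dtype=float)
    n = len(probs)
    half = w_smooth // 2                       # 15 for a 31-point window
    smoothed = np.empty(n)
    for i in range(n):
        lo = max(0, i - half)                  # first index of the window
        hi = min(n, i + half + 1)              # one past the last index
        smoothed[i] = probs[lo:hi].mean()      # recomputed for every point, step size 1
    return smoothed
```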
Comparing fig. 3 and fig. 4, it can be seen that after the moving average the spikes in the sub-audio human voice probability curve are effectively smoothed out, which improves the accuracy of the segmentation to a certain extent.
The above median filtering is one implementation of the present invention; other filtering methods may also be adopted.
2) Energy correction preprocessing.
Because the embodiment of the invention adopts fine-grained sub-audio segmentation with large overlap between sub-audios, the moving-average filtering may cause the probability of a small number of non-human-voice points to be pulled toward human voice by the surrounding points, i.e., their human voice probability is increased even though they are essentially non-human voice.
In order to solve the above problem, in the embodiments of the present invention, the characteristic that the energy of noise or silence is weaker than the voice is used, and the energy of the original audio is used to further correct the voice probability, so as to improve the accuracy.
The audio human voice probability array after the moving average is denoted

P_f = {p'_1, p'_2, p'_3, ..., p'_i, ..., p'_n}

Taking 10 ms as the window size and 10 ms as the step length, the energy array of the original audio is calculated:

P_power = {w_1, w_2, w_3, ..., w_i, ..., w_n}

In the above embodiment, the original audio is sliced with a step length of 10 ms to obtain the sub-audios, so the human voice probabilities are spaced 10 ms apart; computing the energy array of the original audio with the same 10 ms step therefore makes the time points of the energy array correspond to the time points of the human voice probability array.

The values of the P_power array are normalized to between 0 and 1. With an upper energy limit P_up and a lower energy limit P_down determined, w_i can be normalized in the following way: when the audio energy at a moment is greater than the upper limit P_up, w_i takes the value 1; when the audio energy at a moment is less than the lower energy limit P_down, w_i takes the value 0; otherwise w_i is mapped to a value between 0 and 1 according to the audio energy. This yields the normalized adjustment-factor array, denoted here P_w.

The array P_f and the array P_w are multiplied value by value (a dot product of the corresponding entries) to obtain the energy-corrected audio human voice probability array P_T. Through this operation, when the audio energy at a moment is greater than the upper energy limit P_up, the human voice probability value at that moment is unchanged; if the audio energy at a moment is less than the lower energy limit P_down, the human voice probability value at that moment becomes 0.

In an embodiment, if the audio energy lies between the lower energy limit and the upper energy limit (inclusive), the resulting probability adjustment factor lies between 0 and 1; the human voice probability value at the corresponding time point is adjusted by this factor, finally yielding the energy-corrected audio human voice probability array P_T.
It can be seen from the above that, by using the energy array of the original audio, if the audio energy at a certain moment is lower than the lower energy limit, the audio at that moment is considered non-human voice and its human voice probability is set to zero; in this way part of the non-human-voice audio is further removed.
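A sketch of this energy correction; the linear mapping between the two energy limits is an assumed form, since the embodiment only states that the adjustment factor is normalized to between 0 and 1:

```python
import numpy as np

def energy_correct(probs, audio, sr, p_up, p_down, step_ms=10):
    """Scale each 10 ms human voice probability by an energy-based factor in [0, 1]."""
    audio = np.asarray(audio, dtype=float)
    step = sr * step_ms // 1000
    n = len(probs)
    # One energy value per 10 ms window, aligned with the probability array.
    energy = np.array([np.sum(audio[i * step:(i + 1) * step] ** 2) for i in range(n)])
    factor = (energy - p_down) / (p_up - p_down)   # assumed linear normalization
    factor = np.clip(factor, 0.0, 1.0)             # 1 above p_up, 0 below p_down
    return np.asarray(probs) * factor              # energy-corrected probabilities (P_T)
```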
In this embodiment, the obtained probabilities undergo moving average preprocessing and energy correction preprocessing, and finally the decision algorithm is used to distinguish human voice from non-human voice, determine the speech segments, and count the user's speaking duration. The energy correction and the moving average are both applied to the currently obtained probabilities and have no fixed order: the energy correction preprocessing may also be performed first, followed by the moving average preprocessing.
The invention may also adopt only one of the two preprocessing methods to improve the accuracy of human voice recognition.
In the above embodiment, the sub-audios are obtained by expanding the original audio with null data of equal duration before the head and after the tail and then segmenting it. However, the durations of the null data added before the head and after the tail need not be equal: null data of a first duration may be added before the head and null data of a second duration after the tail, and the original audio is then segmented with a window equal to a third duration, the sum of the first and second durations, to obtain the sub-audios.
For example, the first duration is 240 ms, the second duration is 720 ms, and the slicing window is the sum of the first duration and the second duration, i.e., 960 ms. It can be seen that the sub-audio duration obtained by the present method is the same as the previous embodiment, and still 960 ms.
With this segmentation method, the calculated human voice probability of a sub-audio approximates the human voice probability at the 1/4 point of the sub-audio. Suppose the start and end times of a certain sub-audio in the original audio are t_i and t_i + 0.96 s, respectively; the human voice probability value of this sub-audio is then approximated as the human voice probability value at time t_i + 0.24 s. The audio segment formed by the 1/4 time points of consecutive sub-audios judged to be human voice is the human voice segment in the original audio. Since the sub-audios are obtained by slicing the original audio with the first step size, the 1/4 time points of adjacent sub-audios are separated by the first step size, for example the 10 ms used in the above embodiment.
When the resulting human voice probability array of the sub-audios is subjected to audio energy correction preprocessing, the energy value at the 1/4 point of each sub-audio is preferably calculated. For example, assume that the start and end times of a certain sub-audio in the original audio are t_i and t_i + 0.96 s, respectively; the energy value at time t_i + 0.24 s is then calculated, and the human voice probability adjustment factor of the sub-audio (t_i, t_i + 0.96 s) is obtained from this energy value.
According to the description of the above embodiments, the human voice audio can be saved or output after the non-human voice in the original audio is removed.
Human speech is continuous from front to back. In particular, in the scenario of online learning by children and teenagers, there are short pauses between the words of a sentence that expresses a complete meaning, usually for taking a breath or conveying a certain mood.
In this embodiment, a certain tolerance is therefore adopted to maintain the front-to-back continuity of the human voice audio clips. In this way, human voice audio of higher quality can be output, providing teachers and students with a corpus of higher content quality and making the recorded content easier for them to use.
Four method embodiments that maintain the front-to-back continuity of the human voice audio segments are described below.
A first method embodiment.
Because the content of each audio differs, and the user's mood and vocal state also differ, the tolerance is dynamically adjusted based on the statistical characteristics of the non-human voice in the original audio. The specific method is as follows:

For an original audio, denote the duration of a non-human voice segment as l_i, and assume that there are n non-human voice audio segments in total in the original audio.

First, a first mean m_l of the durations of the audio segments identified as non-human voice in the current original audio is obtained.

Using this mean, a first variance δ of the durations of the audio segments identified as non-human voice in the current original audio is obtained.

The difference between the first mean and the first variance is used as the first threshold:

T_1 = m_l - δ

The human voice audio segments and the non-human voice audio segments whose duration is less than the first threshold T_1 are spliced together. That is, in the original audio, if the duration of the non-human voice audio segment between two human voice audio segments is less than the first threshold T_1, that non-human voice audio segment is retained in the saved or output audio.
The first method embodiment sets a different segment-splicing threshold for each audio, dynamically adjusting the splicing behavior, and is computationally simple.
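A sketch of this first tolerant-merge strategy, representing segments as (start, end) pairs in seconds; the representation, and the use of only the gaps between voice segments (ignoring leading and trailing non-voice) as the statistics, are simplifying assumptions:

```python
import numpy as np

def merge_with_dynamic_threshold(voice_segments):
    """Merge adjacent human voice segments whose gap is shorter than T1 = mean - variance."""
    voice_segments = sorted(voice_segments)
    gaps = [s2 - e1 for (_, e1), (s2, _) in zip(voice_segments, voice_segments[1:])]
    if not gaps:
        return voice_segments
    t1 = float(np.mean(gaps) - np.var(gaps))       # first threshold T1 = m_l - delta
    merged = [list(voice_segments[0])]
    for (start, end), gap in zip(voice_segments[1:], gaps):
        if gap < t1:
            merged[-1][1] = end                    # keep the short non-voice gap: splice
        else:
            merged.append([start, end])
    return [tuple(seg) for seg in merged]
```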
A second method embodiment.
For each user, the statistical properties of their non-human voice audio segments are preserved. For example, for user u, the mean of the non-human voice audio segment durations statistically derived from the existing data is denoted m_u, the variance δ_u, the number of all non-human voice audio segments in the existing original audio N_u, the sum of the durations of all non-human voice audio segments in the existing original audio S_u, and the variance sum of the non-human voice segments in the existing original audio S_δu.

Thus, for a new user u, the characteristics of his non-human voice audio segments calculated from his first original audio are as follows and are saved.

Mean of the durations of the audio segments identified as non-human voice in the original audio: m_u = (1/n) Σ l_i.

Sum of the durations of the non-human voice audio segments in the existing original audio: S_u = Σ l_i.

Variance of the durations of the audio segments identified as non-human voice in the original audio: δ_u = (1/n) Σ (l_i - m_u)^2.

Sum of the variances of the durations of the audio segments identified as non-human voice in the existing original audio: S_δu = Σ (l_i - m_u)^2.

Number of all non-human voice audio segments in the existing original audio: N_u = n, where the audio contains n non-human voice segments in total, each with duration l_i.
If the user has already recorded at least one original audio, then when a new original audio is recorded, whether a non-human-voice part of the current new original audio may be output or stored is calculated according to the statistical characteristics of the non-human voice audio segments of the user's existing original audio, as described below.
Acquiring a user identifier corresponding to the original audio; thus, the above-described saved characteristics of the non-human voice audio piece of the user are obtained.
The following parameters are calculated using the saved statistical characteristics of the user's non-human voice audio segments and the statistical characteristics of the non-human voice audio segments of the newly recorded audio. As a preferred embodiment, the description below uses the saved non-human-voice characteristics of all the user's historical original audio as an example.
In the following, the suffix old denotes the statistical parameters of the non-human voice segments of all the user's historical original audio generated before the current original audio, and the suffix new denotes the statistical parameters of the non-human voice segments after the current original audio is included.
S_u_old is the sum of the durations of the non-human voice audio segments of all original audio before the user obtains the current original audio; S_u_new is the sum of the durations of the non-human voice audio segments of all original audio after the current original audio is obtained.

S_δu_old is the sum of the duration variances of the non-human voice audio segments of all original audio before the user obtains the current original audio; S_δu_new is the sum of the duration variances of the non-human voice audio segments of all original audio after the current original audio is obtained.

N_u_old is the total number of non-human voice audio segments of all original audio before the user obtains the current original audio; N_u_new is the total number of non-human voice audio segments of all original audio after the current original audio is obtained:

N_u_new = N_u_old + n

From these parameters, the second mean m_u_new and the second variance δ_u_new of the durations of the non-human voice audio segments of all original audio after the user obtains the current original audio are obtained:

m_u_new = S_u_new / N_u_new

δ_u_new = S_δu_new / N_u_new

where n is the number of non-human voice audio segments in the current original audio, each with duration l_i.

Further, the second threshold is obtained as the difference between the second mean and the second variance:

T_2 = m_u_new - δ_u_new

In the current original audio, if the duration of a non-human voice audio segment between two human voice audio segments is less than the second threshold T_2, that non-human voice audio segment is retained in the stored or output audio and is spliced with the human voice audio segments before and after it.
The second method embodiment takes the different speaking habits of different users into account and dynamically adjusts the splicing threshold at user granularity; on this basis, the mean and variance are computed in a streaming manner, which saves computation and storage resources.
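A hedged sketch of this per-user streaming bookkeeping; the storage structure (a plain dict with keys S_u, S_du, N_u) and the way the per-audio variance sum is folded in are one plausible reading of the description above, not the claimed formulas themselves:

```python
def update_user_stats(stats, gap_durations):
    """Fold the non-human-voice segment durations of a new original audio into the
    user's running sums and return the second threshold T2 = mean - variance.

    stats: dict with keys "S_u" (duration sum), "S_du" (variance sum), "N_u" (count).
    gap_durations: durations l_i of the non-human-voice segments of the new audio.
    """
    n = len(gap_durations)
    if n == 0:
        return stats, None
    m = sum(gap_durations) / n
    var_sum = sum((l - m) ** 2 for l in gap_durations)   # variance sum of this audio
    stats["S_u"] += sum(gap_durations)                   # S_u_new = S_u_old + sum(l_i)
    stats["S_du"] += var_sum                             # S_du_new = S_du_old + var_sum
    stats["N_u"] += n                                    # N_u_new = N_u_old + n
    m_new = stats["S_u"] / stats["N_u"]                  # second mean
    d_new = stats["S_du"] / stats["N_u"]                 # second variance
    return stats, m_new - d_new                          # second threshold T2
```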
A third method embodiment.
For each user, the statistical properties of their non-human voice audio segments are preserved. For example, for user u, the mean is denoted m_u, the variance δ_u, the number of all non-human voice audio segments in the existing original audio N_u, the sum of the durations of all non-human voice audio segments in the existing original audio S_u, and the variance sum of the non-human voice segments in the existing original audio S_δu.

Thus, for a new user u, the characteristics of his non-human voice audio segments calculated from his first original audio are as follows.

Mean of the durations of the audio segments identified as non-human voice in the original audio: m_u = (1/n) Σ l_i.

Sum of the durations of the non-human voice audio segments in the existing original audio: S_u = Σ l_i.

Variance of the durations of the audio segments identified as non-human voice in the original audio: δ_u = (1/n) Σ (l_i - m_u)^2.

Sum of the variances of the durations of the audio segments identified as non-human voice in the existing original audio: S_δu = Σ (l_i - m_u)^2.

Number of all non-human voice audio segments in the existing original audio: N_u = n, where the audio contains n non-human voice segments in total, each with duration l_i.
If the user has already recorded at least one original audio, then when a new original audio is recorded, whether a non-human-voice part of the current new original audio may be output or stored is calculated according to the characteristics of the non-human voice audio segments of the user's existing original audio, as described below.
The user identifier corresponding to the original audio is acquired; thus, the characteristics of the non-human voice audio segments in the user's historical original audio, saved as described above, are obtained. Specifically, S_u_old is the sum of the durations of the non-human voice audio segments of all original audio before the user obtains the current original audio; S_δu_old is the sum of the duration variances of the non-human voice audio segments of all original audio before the current original audio is obtained; N_u_old is the total number of non-human voice audio segments of all original audio before the current original audio is obtained.
Using these parameters and the calculation method described above, a third mean m_u_old and a third variance δ_u_old of the durations of the non-human voice audio segments of all the user's historical original audio saved before the current original audio are obtained, and a third threshold is derived from their difference:

T_3 = m_u_old - δ_u_old
With reference to the method described above in the first method embodiment, the first threshold T_1 is obtained from the current original audio.
A weight value α is defined, with 0 < α ≤ 1.
The first threshold T_1 is adjusted using the weight value and the third threshold T_3 to obtain a fourth threshold T_4.
In the original audio, if the duration of a non-human voice audio segment between two human voice audio segments is less than the fourth threshold T_4, that non-human voice audio segment is retained in the stored or output audio and is spliced with the human voice audio segments before and after it.
The third method embodiment considers the user's overall statistical information and, combining it with the statistical information of the specific audio, dynamically adjusts the splicing threshold for the non-human voice audio segments.
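The embodiment does not state the explicit formula that combines T_1 with the weighted T_3; the sketch below therefore assumes a simple weighted combination and should be read as one possible interpretation rather than the claimed adjustment:

```python
def fourth_threshold(t1, t3, alpha):
    """Adjust the per-audio threshold T1 with the user-history threshold T3.

    alpha is the preset weight with 0 < alpha <= 1; the weighted-average form
    below is an assumption, not taken from the patent text.
    """
    assert 0.0 < alpha <= 1.0
    return (1.0 - alpha) * t1 + alpha * t3
```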
A fourth method embodiment.
The original audio a has been divided into a plurality of human voice and non-human voice segments. If the interval between two audio segments judged to be human voice is smaller than a threshold (a fifth threshold), the non-human voice audio segment lying between these adjacent human voice segments is further acquired, and the human voice audio segments and the audio segment between them are spliced together in the current original audio.
Four methods for deciding whether audio segments are spliced have been described above. Referring to FIG. 5, assume that the original audio contains two human voice segments a_i and a_i+1 whose start and end time nodes are (s_i, e_i) and (s_i+1, e_i+1), respectively, and assume that the threshold obtained with any of the methods described above is 500 ms. If the gap s_i+1 - e_i is less than 500 ms, the two segments are merged into one. It can be seen that the human voice audio obtained by this processing maintains the front-to-back continuity of the speech segments.
None of the above tolerant-merge methods is inherently better than the others; each suits different scenarios and user groups. For example, the first method embodiment is suitable for lightweight recording and feedback systems, since it does not need to record any user information. The second method embodiment is suitable for adult users: adults record with a stable mood, so the user's statistical information describes most of their audio well. The third method embodiment is suitable for young children: their moods fluctuate widely and irregularly, and the same text spoken at adjacent moments can sound very different, so the user's statistical characteristics must be considered together with the statistical information of the specific audio and adjusted reasonably.
Corresponding to the above method embodiments, the application also provides a human voice audio recording device. The device comprises:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method recited above. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the applications disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (12)
1. A method for recording human voice audio, comprising:
obtaining a current original audio;
acquiring an audio clip identified as human voice in the current original audio;
splicing the audio segments of the human voice according to a time sequence to obtain spliced audio;
obtaining a first average value of the duration of the non-human voice audio segment in the current original audio; obtaining a first variance of the audio segment duration of the non-human voice in the current original audio by using the mean value; the difference value of the first mean value and the first variance is used as a first threshold; if non-human voice audio segments with the duration being less than the first threshold exist among the human voice audio segments, splicing the non-human voice audio segments with the human voice audio segments;
and storing or outputting the spliced audio.
2. The method of claim 1, further comprising:
according to the user identification corresponding to the original audio, obtaining the total duration of the non-human voice audio fragment of at least one historical original audio of the user and the variance sum of the non-human voice audio fragments of the historical original audio;
obtaining a second average value of the duration of the non-human voice audio segments by using the duration of the non-human voice audio segments of the current original audio and the total duration of the non-human voice audio segments of the historical original audio;
obtaining a second variance of the non-human voice audio fragment by using the variance of the non-human voice audio fragment of the current original audio and the variance sum of the non-human voice audio fragments of the historical original audio;
the difference value of the second mean value and the second variance is used as a second threshold;
and if non-human voice audio segments with the duration less than the second threshold exist among the human voice audio segments in the current original audio, splicing the non-human voice audio segments with the human voice audio segments.
3. The method of claim 2, further comprising:
obtaining the sum of the time length of the non-human voice audio clip of the current original audio and the total time length of the non-human voice audio clip of the historical original audio, and storing the sum;
and obtaining the sum of the variance of the current original audio non-human voice audio fragment and the variance sum of the non-human voice audio fragment of the historical original audio, and storing the sum.
4. The method of claim 1, further comprising:
obtaining a first mean of the durations of the non-human voice audio segments in the current original audio;
obtaining a first variance of the durations of the non-human voice audio segments in the current original audio using the first mean; taking the difference between the first mean and the first variance as a first threshold;
acquiring a user identifier corresponding to the original audio;
acquiring, as a third mean, the mean of the durations of the non-human voice audio segments in at least one historical original audio of the user;
obtaining, as a third variance, the variance of the non-human voice audio segments in the at least one historical original audio; taking the difference between the third mean and the third variance as a third threshold;
adjusting the first threshold using a preset weight of the third threshold to obtain a fourth threshold;
and if non-human voice audio segments whose duration is less than the fourth threshold exist between the human voice audio segments in the current original audio, splicing those non-human voice audio segments together with the human voice audio segments.
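The adjustment in claim 4 can, for example, be realised as a weighted combination of the per-audio first threshold and the user's historical third threshold. The convex-combination form below is an assumption; the claim only requires adjusting the first threshold with a preset weight of the third threshold.

```python
def fourth_threshold(first_thr, third_thr, weight=0.5):
    # Blend the current-audio threshold with the historical threshold using a
    # preset weight in [0, 1] (the exact blending rule is an assumption).
    return (1.0 - weight) * first_thr + weight * third_thr
```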
5. The method of claim 4, further comprising:
obtaining and storing the mean of the durations of the non-human voice audio segments of the current original audio and of the non-human voice audio segments of the historical original audio;
and obtaining and storing the variance of the non-human voice audio segments of the current original audio and of the non-human voice audio segments of the historical original audio.
6. The method of claim 1, further comprising:
and if non-human voice audio segments whose duration is less than a fifth threshold exist between adjacent human voice audio segments in the current original audio, splicing those non-human voice audio segments together with the human voice audio segments.
7. The method according to any one of claims 1 to 6, wherein acquiring the audio segments identified as human voice in the original audio specifically comprises:
segmenting the original audio according to a preset method to obtain a plurality of sub-audios; calculating a Mel frequency cepstrum coefficient feature sequence for each sub-audio;
obtaining, by a neural network, the probability that each sub-audio belongs to human voice from the Mel frequency cepstrum coefficient feature sequence;
acquiring the sub-audios whose human voice probability is greater than a decision threshold;
acquiring, in the original audio, adjacent sub-audios whose human voice probability is greater than the decision threshold;
and acquiring the audio segment composed of the determined time points in the adjacent sub-audios.
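For illustration only, the sketch below shows one way to obtain a per-sub-audio human voice probability from an MFCC feature sequence and to collect adjacent above-threshold sub-audios into segments. The `model` callable, the choice of 13 MFCC coefficients, and the helper names are assumptions; the claim does not fix a particular network or feature dimension.

```python
import numpy as np
import librosa

def sub_audio_voice_probability(sub_audio, sr, model):
    # Compute the MFCC feature sequence of one sub-audio and let a neural-network
    # classifier (hypothetical callable `model`) return a probability in [0, 1].
    mfcc = librosa.feature.mfcc(y=np.asarray(sub_audio, dtype=float), sr=sr, n_mfcc=13)
    return float(model(mfcc))

def voice_segments_from_probabilities(probs, times, threshold):
    # Group the decision time points of consecutive sub-audios whose probability
    # exceeds the decision threshold into (start, end) human voice segments.
    segments, run = [], []
    for prob, t in zip(probs, times):
        if prob > threshold:
            run.append(t)
        elif run:
            segments.append((run[0], run[-1]))
            run = []
    if run:
        segments.append((run[0], run[-1]))
    return segments
```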
8. The method of claim 7, wherein segmenting the original audio according to the preset method to obtain a plurality of sub-audios comprises:
obtaining the original audio, adding empty data of a first duration before the head of the original audio, and adding empty data of a second duration after the tail of the original audio to obtain an expanded audio;
and taking a third duration, equal to the sum of the first duration and the second duration, as a segmentation window, and windowing sequentially from the head of the expanded audio with a first step size to obtain a plurality of sub-audios.
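A minimal sketch of the padding-and-windowing segmentation of claim 8, assuming the audio is a one-dimensional sample array, durations are given in seconds, and the empty data consists of zero-valued samples:

```python
import numpy as np

def split_with_padding(audio, sr, head_pad_s, tail_pad_s, step_s):
    # Prepend empty data of a first duration, append empty data of a second
    # duration, then slide a window whose length is the sum of both durations
    # (the third duration) over the expanded audio with a first step size.
    audio = np.asarray(audio, dtype=float)
    head = np.zeros(int(round(head_pad_s * sr)))
    tail = np.zeros(int(round(tail_pad_s * sr)))
    expanded = np.concatenate([head, audio, tail])
    win = len(head) + len(tail)
    step = int(round(step_s * sr))
    return [expanded[i:i + win] for i in range(0, len(expanded) - win + 1, step)]
```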
9. The method of claim 8, wherein after obtaining the probability that the sub-audio belongs to human voice, the method further comprises:
obtaining an array of the probabilities that all the sub-audios of the original audio belong to human voice;
and filtering the probability values in the array with a window of a first number to obtain filtered human voice probabilities.
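One plausible realisation of the filtering in claim 9 is a fixed-size median filter over the probability array; the claim does not name the filter type, so this choice is an assumption.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_probabilities(probs, window=5):
    # Filter the per-sub-audio human voice probabilities with a window of a
    # first number of values; `window` must be odd for scipy's medfilt.
    return medfilt(np.asarray(probs, dtype=float), kernel_size=window)
```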
10. The method of claim 8 or 9, wherein acquiring the sub-audios whose human voice probability is greater than the decision threshold further comprises:
acquiring, in the original audio, the audio energy value at the determined time point of each sub-audio; and setting a human voice probability adjustment factor according to the audio energy value, comprising:
if the audio energy value is greater than an upper energy limit, setting the human voice probability adjustment factor of the sub-audio to 1;
if the audio energy value is less than a lower energy limit, setting the human voice probability adjustment factor of the sub-audio to 0;
if the audio energy value is neither greater than the upper energy limit nor less than the lower energy limit, normalizing the human voice probability adjustment factor to between 0 and 1 according to the audio energy value;
and multiplying the human voice probability adjustment factor of the sub-audio by the human voice probability of the sub-audio to obtain a corrected human voice probability of the sub-audio.
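The energy-based correction of claim 10 can be sketched as follows. Linear normalisation between the energy limits is an assumption; the claim only requires normalising the adjustment factor to between 0 and 1.

```python
def energy_adjusted_probability(prob, energy, lower_limit, upper_limit):
    # Derive the human voice probability adjustment factor from the audio energy
    # at the sub-audio's decision time point, then scale the probability.
    if energy > upper_limit:
        factor = 1.0
    elif energy < lower_limit:
        factor = 0.0
    else:
        factor = (energy - lower_limit) / (upper_limit - lower_limit)
    return factor * prob
```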
11. A human voice audio recording device, comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method of any one of claims 1 to 10.
12. A computer-readable storage medium having executable code stored thereon which, when executed by a processor of a computing device, causes the processor to perform the method of any one of claims 1 to 10.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011258272.3A CN112382310B (en) | 2020-11-12 | 2020-11-12 | Human voice audio recording method and device |
PCT/CN2021/130305 WO2022100692A1 (en) | 2020-11-12 | 2021-11-12 | Human voice audio recording method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011258272.3A CN112382310B (en) | 2020-11-12 | 2020-11-12 | Human voice audio recording method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112382310A (en) | 2021-02-19
CN112382310B (en) | 2022-09-27
Family
ID=74582989
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011258272.3A Active CN112382310B (en) | 2020-11-12 | 2020-11-12 | Human voice audio recording method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112382310B (en) |
WO (1) | WO2022100692A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112382310B (en) * | 2020-11-12 | 2022-09-27 | 北京猿力未来科技有限公司 | Human voice audio recording method and device |
CN113870896A (en) * | 2021-09-27 | 2021-12-31 | 动者科技(杭州)有限责任公司 | Motion sound false judgment method and device based on time-frequency graph and convolutional neural network |
CN116364064B (en) * | 2023-05-19 | 2023-07-28 | 北京大学 | Audio splicing method, electronic equipment and storage medium |
CN117558284A (en) * | 2023-12-26 | 2024-02-13 | 中邮消费金融有限公司 | Voice enhancement method, device, equipment and storage medium |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9368116B2 (en) * | 2012-09-07 | 2016-06-14 | Verint Systems Ltd. | Speaker separation in diarization |
US8751236B1 (en) * | 2013-10-23 | 2014-06-10 | Google Inc. | Devices and methods for speech unit reduction in text-to-speech synthesis systems |
CN104681036B (en) * | 2014-11-20 | 2018-09-25 | 苏州驰声信息科技有限公司 | A kind of detecting system and method for language audio |
CN106782627B (en) * | 2015-11-23 | 2019-08-27 | 广州酷狗计算机科技有限公司 | Audio file rerecords method and device |
CN109065075A (en) * | 2018-09-26 | 2018-12-21 | 广州势必可赢网络科技有限公司 | A kind of method of speech processing, device, system and computer readable storage medium |
CN111243618B (en) * | 2018-11-28 | 2024-03-19 | 阿里巴巴集团控股有限公司 | Method, device and electronic equipment for determining specific voice fragments in audio |
CN109545242A (en) * | 2018-12-07 | 2019-03-29 | 广州势必可赢网络科技有限公司 | A kind of audio data processing method, system, device and readable storage medium storing program for executing |
CN110085251B (en) * | 2019-04-26 | 2021-06-25 | 腾讯音乐娱乐科技(深圳)有限公司 | Human voice extraction method, human voice extraction device and related products |
CN110400559B (en) * | 2019-06-28 | 2020-09-29 | 北京达佳互联信息技术有限公司 | Audio synthesis method, device and equipment |
CN110827798B (en) * | 2019-11-12 | 2020-09-11 | 广州欢聊网络科技有限公司 | Audio signal processing method and device |
CN111161712A (en) * | 2020-01-22 | 2020-05-15 | 网易有道信息技术(北京)有限公司 | Voice data processing method and device, storage medium and computing equipment |
CN112382310B (en) * | 2020-11-12 | 2022-09-27 | 北京猿力未来科技有限公司 | Human voice audio recording method and device |
- 2020-11-12: CN application CN202011258272.3A, patent CN112382310B (en), status: Active
- 2021-11-12: WO application PCT/CN2021/130305, publication WO2022100692A1 (en), status: Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2022100692A9 (en) | 2022-11-24 |
CN112382310A (en) | 2021-02-19 |
WO2022100692A1 (en) | 2022-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112382310B (en) | Human voice audio recording method and device | |
CN108564942B (en) | Voice emotion recognition method and system based on adjustable sensitivity | |
CN110400579B (en) | Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network | |
Ding et al. | Personal VAD: Speaker-conditioned voice activity detection | |
US11232808B2 (en) | Adjusting speed of human speech playback | |
Grozdić et al. | Whispered speech recognition using deep denoising autoencoder and inverse filtering | |
CN112270933B (en) | Audio identification method and device | |
JP6440967B2 (en) | End-of-sentence estimation apparatus, method and program thereof | |
JP7462739B2 (en) | Structure-preserving attention mechanism in sequence-sequence neural models | |
CN106548775A (en) | A kind of audio recognition method and system | |
O'Shaughnessy | Automatic speech recognition | |
CN113823323A (en) | Audio processing method and device based on convolutional neural network and related equipment | |
Richardson et al. | Improvements on speech recognition for fast talkers | |
Agrawal et al. | Speech emotion recognition of Hindi speech using statistical and machine learning techniques | |
Rabiee et al. | Persian accents identification using an adaptive neural network | |
CN116259312A (en) | Method for automatically editing task by aiming at voice and neural network model training method | |
Mandel et al. | Audio super-resolution using concatenative resynthesis | |
US8600750B2 (en) | Speaker-cluster dependent speaker recognition (speaker-type automated speech recognition) | |
Muttaqin et al. | Speech Emotion Detection Using Mel-Frequency Cepstral Coefficient and Hidden Markov Model | |
Lin et al. | A Noise Robust Method for Word-Level Pronunciation Assessment. | |
Deng et al. | Comparison of static and time-sequential features in automatic fluency detection of spontaneous speech | |
Mandel et al. | Learning a concatenative resynthesis system for noise suppression | |
Sen | Voice activity detector for device with small processor and memory | |
Novakovic | Speaker identification in smart environments with multilayer perceptron | |
Loganathan et al. | Speech Based Emotion Recognition Using Machine Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |