CN112599152A - Voice data labeling method, system, electronic equipment and storage medium - Google Patents

Voice data labeling method, system, electronic equipment and storage medium

Info

Publication number
CN112599152A
CN112599152A
Authority
CN
China
Prior art keywords
voice
text
time
word
vad
Prior art date
Legal status
Granted
Application number
CN202110242305.3A
Other languages
Chinese (zh)
Other versions
CN112599152B (en)
Inventor
张旺
Current Assignee
Beijing Smart Starlight Information Technology Co., Ltd.
Original Assignee
Beijing Smart Starlight Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Smart Starlight Information Technology Co., Ltd.
Priority to CN202110242305.3A
Publication of CN112599152A
Application granted
Publication of CN112599152B
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 — Speech or voice analysis techniques for measuring the quality of voice signals
    • G10L25/78 — Detection of presence or absence of voice signals
    • G10L25/87 — Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a voice data labeling method, system, electronic device and storage medium. The method screens the original voice data and matches the read-aloud text with the screened voice to obtain proofreading voice and proofreading text; performs word segmentation on the proofreading text to obtain a word-segmented text; denoises the proofreading voice, extracts features from the denoised voice, and inputs the voice features into a VAD model to obtain the VAD valid voice duration of the denoised voice; performs forced voice alignment on the word-segmented text with an acoustic model to obtain word-level alignment times, word-level time intervals, segmented-text start times, segmented-text end times and the text alignment time; determines the speech rate, the effective time ratio and the error word count from these times and performs voice quality inspection; and segments the original voice according to the segmented-text start and end times, taking the segmented text and the segmented voice as the voice labeling result. Automatic acquisition of quality-qualified voice labeling text is thereby realized.

Description

Voice data labeling method, system, electronic equipment and storage medium
Technical Field
The invention relates to the field of voice data processing, in particular to a voice data labeling method, a voice data labeling system, electronic equipment and a storage medium.
Background
With the rapid development of speech technology, the demand for reliable, high-quality voice labeling data for model training is growing day by day. In the field of speech recognition in particular, it is difficult to obtain a large amount of reliable labeled data in a short time and to build models quickly. Voice labeling data requirements have several characteristics: large data volume, high labeling quality, many scenes and many languages. The traditional purely manual voice data labeling method can hardly meet current voice production needs. How to obtain voice labeling text automatically while guaranteeing its quality has therefore become an urgent problem to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, a system, an electronic device, and a storage medium for voice data annotation, so as to automatically obtain a voice annotation text with qualified quality.
Therefore, the embodiment of the invention provides the following technical scheme:
according to a first aspect, an embodiment of the present invention provides a method for annotating voice data, including: acquiring original voice data; screening the original voice data to obtain screened voice; matching the screened voice with a pre-stored reading text to obtain a proofreading voice and a proofreading text which correspond to each other; performing word segmentation processing on the proofreading text to obtain a word segmentation text; performing noise reduction processing on the proofreading voice to obtain noise-reduced voice; extracting the characteristics of the noise-reduced voice to obtain voice characteristics; detecting the voice characteristics according to a VAD model to obtain VAD effective voice starting time, VAD effective voice ending time and VAD effective voice duration of the noise-reduced voice; performing voice forced alignment by adopting an acoustic model according to the word segmentation text, the voice characteristics and a pre-stored pronunciation dictionary to obtain an alignment result; obtaining word level alignment time, word level time interval, segmented text starting time, segmented text ending time and text alignment time according to the alignment result; obtaining the total word number of the text in the word segmentation text according to the word segmentation text; obtaining the speech speed, the effective time ratio and the error word number according to the VAD effective speech duration, the text alignment time, the word level alignment time and the total word number of the text; performing voice quality inspection according to the speed, the effective time ratio and the error word number to obtain qualified voice; and segmenting the original voice data corresponding to the qualified voice according to the segmented text starting time and the segmented text ending time to obtain segmented voice corresponding to the segmented text, and taking the segmented text and the segmented voice as voice labeling results.
Optionally, the step of obtaining word-level alignment time, word-level time interval, segmented text start time, segmented text end time, and text alignment time according to the alignment result includes: obtaining word level alignment time and word level time interval according to the alignment result; segmenting the participle text according to a preset word interval threshold and a word level time interval to obtain a segmented text; obtaining the starting time and ending time of the segmented text according to the segmented text; and obtaining the text alignment time according to the segmented text starting time and the segmented text ending time.
Optionally, the step of segmenting the segmented text according to a preset word interval threshold and a word-level time interval to obtain a segmented text includes: acquiring a preset word space threshold, wherein the preset word space threshold is determined according to the mute segment time before and after the effective voice and the voice acquisition pause time; judging whether the word level time interval is smaller than the preset word interval threshold value or not; if the word-level time interval is smaller than the preset word interval threshold, paragraph segmentation is not performed on adjacent words; and if the word-level time interval is greater than or equal to the preset word interval threshold, performing paragraph segmentation on adjacent words.
Optionally, the step of obtaining the speech rate, the effective time ratio and the error word count according to the VAD valid speech duration, the text alignment time, the word-level alignment time and the total word count of the text comprises:

obtaining the word-level average duration according to the word-level alignment time and the total word count of the text, where the word-level average duration is calculated as

$$\bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i$$

where $\bar{t}$ denotes the word-level average duration, $N$ denotes the total word count of the text, and $t_i$ denotes the word-level alignment time of the i-th word, with $1 \le i \le N$;

obtaining the speech rate according to the VAD valid speech duration and the total word count of the text, where the speech rate is calculated as

$$v = \frac{N}{T_{vad}}$$

where $v$ denotes the speech rate, $N$ denotes the total word count of the text, and $T_{vad}$ denotes the VAD valid speech duration;

obtaining the effective time ratio according to the VAD valid speech duration and the text alignment time, where the effective time ratio is calculated as

$$r = \frac{T_{vad}}{T_{align}}$$

where $r$ denotes the effective time ratio, $T_{vad}$ denotes the VAD valid speech duration, and $T_{align}$ denotes the text alignment time;

obtaining the error word count according to the VAD valid speech duration, the word-level alignment time and the word-level average duration, where the error word count is calculated as

$$E = \frac{\left|\, T_{vad} - \sum_{i=1}^{N} t_i \,\right|}{\bar{t}}$$

where $E$ denotes the error word count, $T_{vad}$ denotes the VAD valid speech duration, $t_i$ denotes the word-level alignment time of the i-th word ($1 \le i \le N$), and $\bar{t}$ denotes the word-level average duration.
Optionally, the step of performing voice quality check according to the speech rate, the effective time ratio, and the number of error words to obtain qualified voice includes: judging whether the speech rate is in the range of a preset speech rate threshold value; if the speech rate is not in the range of the preset speech rate threshold value, the voice quality detection is unqualified; if the speech rate is in the range of the preset speech rate threshold value, judging whether the effective time ratio is in the range of a preset time ratio; if the effective time ratio is not in the range of the preset time ratio, the voice quality detection is unqualified; if the effective time ratio is in the range of the preset time ratio, judging whether the error word number is in the range of the preset error word number; if the error word number is not in the range of the preset error word number, the voice quality detection is unqualified; and if the error word number is within the range of the preset error word number, the voice quality detection is qualified, and qualified voice is obtained.
Optionally, the step of detecting the voice feature according to a VAD model to obtain a VAD valid voice start time, a VAD valid voice end time, and a VAD valid voice duration of the noise-reduced voice includes: step S7001: inputting the voice features into the VAD model to obtain a voice prediction result for each frame; step S7002: judging whether the voice prediction results of a continuous first preset number of frames are all valid voice; step S7003: if the voice prediction results of the continuous first preset number of frames are not all valid voice, moving backwards by the first preset number of frames and returning to step S7002; step S7004: if the voice prediction results of the continuous first preset number of frames are all valid voice, taking the time corresponding to the starting position of this continuous run of frames as the VAD valid voice start time; step S7005: judging whether the voice prediction results of a continuous second preset number of frames are all noise; step S7006: if the voice prediction results of the continuous second preset number of frames are not all noise, moving backwards by the second preset number of frames and returning to step S7005; step S7007: if the voice prediction results of the continuous second preset number of frames are all noise, taking the time corresponding to the starting position of this continuous run of frames as the VAD valid voice end time; step S7008: calculating the VAD valid voice duration from the VAD valid voice start time and the VAD valid voice end time; step S7009: judging whether the VAD valid voice duration is less than a preset minimum voice duration; step S7010: if the VAD valid voice duration is less than the preset minimum voice duration, returning to step S7002; step S7011: if the VAD valid voice duration is greater than or equal to the preset minimum voice duration, taking it as the final VAD valid voice duration.
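The frame-scanning logic of steps S7001 to S7011 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the per-frame speech/noise predictions are assumed to come from an already trained VAD model, and the frame length, preset frame counts and minimum duration are placeholder values.

```python
# Minimal sketch of the VAD scanning in steps S7001-S7011. The per-frame
# predictions (1 = speech, 0 = noise) are assumed to come from an already
# trained VAD model; frame_ms, n_start, n_end and min_dur_s are illustrative.
from typing import List, Optional, Tuple

def find_valid_segment(frame_is_speech: List[int],
                       frame_ms: float = 10.0,
                       n_start: int = 30,      # first preset frame number
                       n_end: int = 30,        # second preset frame number
                       min_dur_s: float = 1.0  # preset minimum voice duration
                       ) -> Optional[Tuple[float, float, float]]:
    """Return (start_s, end_s, duration_s) of the first valid speech segment."""
    t = 0
    while t + n_start <= len(frame_is_speech):
        # S7002-S7004: look for n_start consecutive speech frames.
        if all(frame_is_speech[t:t + n_start]):
            start = t * frame_ms / 1000.0
            # S7005-S7007: advance until n_end consecutive noise frames are found.
            e = t + n_start
            while e + n_end <= len(frame_is_speech) and any(frame_is_speech[e:e + n_end]):
                e += n_end
            end = e * frame_ms / 1000.0
            duration = end - start              # S7008
            if duration >= min_dur_s:           # S7009-S7011
                return start, end, duration
            t = e                               # S7010: too short, keep scanning
        else:
            t += n_start                        # S7003: move back by the preset count
    return None
```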
Optionally, the screening process comprises: voice signal-to-noise ratio detection, voice reverberation detection, voice amplitude clipping detection, voice frequency band loss detection, voice volume detection and microphone pop (plosive noise) detection.
According to a second aspect, an embodiment of the present invention provides a system for annotating voice data, including: the acquisition module is used for acquiring original voice data; the first processing module is used for screening the original voice data to obtain screened voice; the second processing module is used for matching the screened voice with a pre-stored reading text to obtain a proofreading voice and a proofreading text which correspond to each other; the third processing module is used for performing word segmentation processing on the proofreading text to obtain a word segmentation text; the fourth processing module is used for carrying out noise reduction processing on the proofreading voice to obtain noise-reduced voice; the fifth processing module is used for extracting the characteristics of the noise-reduced voice to obtain voice characteristics; a sixth processing module, configured to detect the voice feature according to a VAD model, so as to obtain VAD valid voice start time, VAD valid voice end time, and VAD valid voice duration of the noise-reduced voice; the seventh processing module is used for performing voice forced alignment by adopting an acoustic model according to the word segmentation text, the voice characteristics and a pre-stored pronunciation dictionary to obtain an alignment result; the eighth processing module is used for obtaining word level alignment time, word level time interval, segmented text starting time, segmented text ending time and text alignment time according to the alignment result; the ninth processing module is used for obtaining the total word number of the text in the word segmentation text according to the word segmentation text; a tenth processing module, configured to obtain a speech rate, an effective time ratio, and an error word count according to the VAD valid speech duration, the text alignment time, the word level alignment time, and the total number of text words; the eleventh processing module is used for carrying out voice quality inspection according to the speech speed, the effective time ratio and the error word number to obtain qualified voice; and the twelfth processing module is used for segmenting the original voice data corresponding to the qualified voice according to the segmented text starting time and the segmented text ending time to obtain segmented voice corresponding to the segmented text, and taking the segmented text and the segmented voice as voice labeling results.
Optionally, the eighth processing module includes: the first processing unit is used for obtaining word level alignment time and word level time interval according to the alignment result; the second processing unit is used for segmenting the participle text according to a preset word interval threshold value and a word level time interval to obtain a segmented text; the third processing unit is used for obtaining the starting time of the segmented text and the ending time of the segmented text according to the segmented text; and the fourth processing unit is used for obtaining the text alignment time according to the segmented text starting time and the segmented text ending time.
Optionally, the second processing unit comprises: the acquiring subunit is used for acquiring a preset word space threshold, and the preset word space threshold is determined according to the mute segment time before and after the effective voice and the voice acquisition pause time; a judging subunit, configured to judge whether the word-level time interval is smaller than the preset word interval threshold; the first processing subunit is configured to not perform paragraph segmentation on adjacent words if the word-level time interval is smaller than the preset word interval threshold; and the second processing subunit is used for performing paragraph segmentation on adjacent words if the word-level time interval is greater than or equal to the preset word interval threshold.
Optionally, the tenth processing module includes:
a fifth processing unit, configured to obtain the word-level average duration according to the word-level alignment time and the total word count of the text, where the word-level average duration is calculated as

$$\bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i$$

where $\bar{t}$ denotes the word-level average duration, $N$ denotes the total word count of the text, and $t_i$ denotes the word-level alignment time of the i-th word, with $1 \le i \le N$;

obtaining the speech rate according to the VAD valid speech duration and the total word count of the text, where the speech rate is calculated as

$$v = \frac{N}{T_{vad}}$$

where $v$ denotes the speech rate, $N$ denotes the total word count of the text, and $T_{vad}$ denotes the VAD valid speech duration;

obtaining the effective time ratio according to the VAD valid speech duration and the text alignment time, where the effective time ratio is calculated as

$$r = \frac{T_{vad}}{T_{align}}$$

where $r$ denotes the effective time ratio, $T_{vad}$ denotes the VAD valid speech duration, and $T_{align}$ denotes the text alignment time;

obtaining the error word count according to the VAD valid speech duration, the word-level alignment time and the word-level average duration, where the error word count is calculated as

$$E = \frac{\left|\, T_{vad} - \sum_{i=1}^{N} t_i \,\right|}{\bar{t}}$$

where $E$ denotes the error word count, $T_{vad}$ denotes the VAD valid speech duration, $t_i$ denotes the word-level alignment time of the i-th word ($1 \le i \le N$), and $\bar{t}$ denotes the word-level average duration.
Optionally, the eleventh processing module comprises: the first judging unit is used for judging whether the speech rate is in the range of a preset speech rate threshold value; a ninth processing unit, configured to determine that the voice quality detection is not qualified if the speech rate is not within the range of the preset speech rate threshold; a tenth processing unit, configured to determine whether the effective time ratio is within a range of a preset time ratio if the speech rate is within the range of the preset speech rate threshold; the eleventh processing unit is used for determining that the voice quality detection is unqualified if the effective time ratio is not in the range of the preset time ratio; a twelfth processing unit, configured to determine whether the error word count is within a preset error word count range if the valid time ratio is within a preset time ratio range; the thirteenth processing unit is used for determining that the voice quality detection is unqualified if the error word number is not in the range of the preset error word number; and the fourteenth processing unit is used for detecting qualified voice quality if the error word number is within the range of the preset error word number, so as to obtain qualified voice quality.
Optionally, the sixth processing module includes: a fifteenth processing unit, configured to input the speech features into a VAD model to obtain a speech prediction result of each frame; the second judgment unit is used for judging whether the voice prediction result of the continuous first preset frame number is valid voice or not; a sixteenth processing unit, configured to move the first preset frame number backwards if the voice prediction result of the continuous first preset frame number is not valid voice, and return to the second determining unit; a seventeenth processing unit, configured to, if the result of the continuous speech prediction with the first preset frame number is an effective speech, take a time corresponding to a speech start position with the continuous first preset frame number as a VAD effective speech start time; the third judging unit is used for judging whether the voice prediction result of the continuous second preset frame number is noise or not; the eighteenth processing unit is used for moving the second preset frame number backwards and returning to the third judging unit if the voice prediction result of the continuous second preset frame number is not noise; a nineteenth processing unit, configured to, if the result of the voice prediction for the consecutive second preset frame number is noise, take a time corresponding to the voice start position for the consecutive second preset frame number as VAD valid voice end time; a twentieth processing unit, configured to calculate a VAD valid speech duration according to the VAD valid speech start time and the VAD valid speech end time; a fourth judging unit, configured to judge whether the VAD valid voice duration is less than a preset voice minimum duration; a twenty-first processing unit, configured to return to the second determining unit if the VAD valid speech duration is less than the preset speech minimum duration; and a twenty-second processing unit, configured to determine that the VAD effective speech duration is VAD effective speech duration if the VAD effective speech duration is greater than or equal to the preset speech minimum duration.
Optionally, the screening process comprises: voice signal-to-noise ratio detection, voice reverberation detection, voice amplitude clipping detection, voice frequency band loss detection, voice volume detection and microphone pop (plosive noise) detection.
According to a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the method of annotating speech data as described in any one of the above first aspects.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are used to enable a computer to execute the voice data annotation method described in any one of the first aspect.
The technical scheme of the embodiment of the invention has the following advantages:
the embodiment of the invention provides a voice data labeling method, a voice data labeling system, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring original voice data; screening the original voice data to obtain screened voice; matching the screened voice with a pre-stored reading text to obtain a proofreading voice and a proofreading text which correspond to each other; performing word segmentation processing on the proofreading text to obtain a word segmentation text; performing noise reduction processing on the proofreading voice to obtain noise-reduced voice; extracting the characteristics of the noise-reduced voice to obtain voice characteristics; detecting the voice characteristics according to a VAD model to obtain VAD effective voice starting time, VAD effective voice ending time and VAD effective voice duration of the noise-reduced voice; performing voice forced alignment by adopting an acoustic model according to the word segmentation text, the voice characteristics and a pre-stored pronunciation dictionary to obtain an alignment result; obtaining word level alignment time, word level time interval, segmented text starting time, segmented text ending time and text alignment time according to the alignment result; obtaining the total word number of the text in the word segmentation text according to the word segmentation text; obtaining the speech speed, the effective time ratio and the error word number according to the VAD effective speech duration, the text alignment time, the word level alignment time and the total word number of the text; performing voice quality inspection according to the speed, the effective time ratio and the error word number to obtain qualified voice; and segmenting the original voice data corresponding to the qualified voice according to the segmented text starting time and the segmented text ending time to obtain segmented voice corresponding to the segmented text, and taking the segmented text and the segmented voice as voice labeling results. 
The method comprises the steps of firstly screening original voice data, and matching the text read aloud with the screened voice after primary screening to obtain corresponding proofreading voice and proofreading text; then, performing word segmentation on the proofreading text to obtain a word segmentation text, and performing noise reduction on the proofreading voice to obtain noise reduction voice so as to avoid the influence of noise on subsequent operation; performing voice feature extraction on the noise-reduced voice, and inputting the voice feature after the feature extraction into a VAD model to obtain VAD effective voice starting time, VAD effective voice ending time and VAD effective voice duration of the noise-reduced voice; performing voice forced alignment on the word segmentation text by adopting an acoustic model to obtain word level alignment time, word level time interval, segmented text starting time, segmented text ending time and text alignment time; calculating the multiple times to obtain the speed of speech, the effective time ratio and the error word number, then carrying out voice quality inspection according to the speed of speech, the effective time ratio and the error word number to obtain qualified voice, finally segmenting original voice data corresponding to the qualified voice according to the starting time of the segmented text and the ending time of the segmented text to obtain segmented voice corresponding to the segmented text, and taking the segmented text and the segmented voice which are matched with each other as voice labeling results; the automatic acquisition of the voice labeling text is realized, and the quality of the voice labeling is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart illustrating a specific example of a method for annotating voice data according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another specific example of a method for annotating voice data according to an embodiment of the present invention;
FIG. 3 is a block diagram of a specific example of a speech data annotation system according to an embodiment of the invention;
fig. 4 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method in this embodiment is mainly applied to industrial voice data labeling production, providing a customized preprocessing flow for voice data labeling. It addresses the problems in industrial voice labeling production of slow labeling speed, high labor cost and data quality that is difficult to guarantee. The invention uses speech-technology means to rapidly process the voice data to be labeled through information such as voice time information, text content and speech rate, thereby accelerating data production, improving voice data quality and improving the industrial production efficiency of voice data.
An embodiment of the present invention provides a method for annotating voice data, as shown in fig. 1, the method includes steps S1-S13.
Step S1: raw speech data is acquired.
As an exemplary embodiment, the raw speech data may be free-dialog-like speech, or speech from other types of speech capture tasks. Other voice collection tasks can design the text content to be spoken in advance and collect voice according to the text. The present embodiment is only illustrative of the original voice data, and not limited thereto.
Step S2: and screening the original voice data to obtain screened voice.
As an exemplary embodiment, data screening processing is performed on original voice data, and voices which do not meet requirements are filtered out through a technical means, that is, the original voice data which cannot be used in actual voice training is filtered out to obtain screened voices. Specifically, the signal-to-noise ratio and reverberation detection can be performed on the collected original voice data, and unqualified original voice data with too low signal-to-noise ratio and too large reverberation is discarded according to a certain threshold value to obtain screened voice, wherein the screened voice is the original voice data which can be used in voice training.
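The screening thresholds themselves are not specified here, but the idea of the signal-to-noise-ratio check can be illustrated with a small sketch. The following Python snippet is a minimal, illustrative example only (not the patent's implementation): it assumes a mono waveform loaded as a NumPy array, estimates the noise floor from the quietest frames, and uses an assumed 15 dB threshold as a placeholder.

```python
# Minimal, illustrative SNR check for the screening step; assumes a mono
# waveform (NumPy array) and estimates the noise floor from the quietest frames.
import numpy as np

def estimate_snr_db(wave: np.ndarray, frame_len: int = 400) -> float:
    frames = wave[: len(wave) // frame_len * frame_len].reshape(-1, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1) + 1e-10
    noise = np.percentile(energy, 10)      # quietest 10% of frames ~ noise floor
    signal = np.percentile(energy, 90)     # loudest frames ~ speech energy
    return 10.0 * np.log10(signal / noise)

def passes_screening(wave: np.ndarray, snr_threshold_db: float = 15.0) -> bool:
    # Discard recordings whose estimated SNR falls below the (assumed) threshold.
    return estimate_snr_db(wave) >= snr_threshold_db
```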
Step S3: and matching the screened voice with the pre-stored reading text to obtain a proofreading voice and a proofreading text which correspond to each other.
As an exemplary embodiment, the pre-stored speakable text is a collection of voice texts, including several voice speakable texts. And voice labeling requires matching of voice and text, matching of the screened voice and the pre-stored reading text is performed, and a proofreading voice and a proofreading text which correspond to each other are found.
Specifically, when voice is collected for each text, the text and the collected voice corresponding to the text are mapped one by one, so that a reading text matched with the text can be quickly found in the reading text set according to the voice. In this embodiment, the names of the voice and the text may be the same, so that the corresponding text can be found according to the name of the voice. Of course, in other embodiments, the speech and the text may have the same id, which is only schematically illustrated in this embodiment and not limited thereto.
Since the voice is collected according to a text, each voice file and its corresponding read-aloud text are checked for existence; voice files whose text is missing and texts whose voice is missing are removed, yielding matched proofreading voice and proofreading text for use in subsequent voice training.
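A minimal sketch of this matching step follows, assuming (as in this embodiment) that each voice file and its read-aloud text share the same name; the choice of .wav and .txt files and the directory layout are assumptions for illustration.

```python
# Sketch of the matching step: pair each screened wav with the read-aloud text
# that shares its basename, and drop unmatched items on either side.
from pathlib import Path

def match_speech_and_text(wav_dir: str, txt_dir: str):
    wavs = {p.stem: p for p in Path(wav_dir).glob("*.wav")}
    txts = {p.stem: p for p in Path(txt_dir).glob("*.txt")}
    common = wavs.keys() & txts.keys()      # keep only mutually present pairs
    return [(wavs[k], txts[k]) for k in sorted(common)]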
Step S4: and performing word segmentation processing on the proofreading text to obtain a word segmentation text.
As an exemplary embodiment, the specific word segmentation method may be a statistics-based word segmentation method, such as HMM, CRF, SVM or deep learning algorithms, or a dictionary-based word segmentation algorithm such as forward maximum matching, reverse maximum matching and bidirectional matching. Specific word segmentation tools include jieba, pkuseg, Stanford and HanLP.
The word segmentation method is only schematically illustrated in the embodiment, and is not limited to this, and in practical application, the word segmentation method can be reasonably selected according to actual needs, so as to obtain a word segmentation text.
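As an illustration of this step, the sketch below uses the jieba segmenter as one example of the kind of tools mentioned above; any of the listed segmenters or algorithms could be substituted.

```python
# Sketch of the word segmentation step using jieba as one illustrative choice
# of Chinese segmenter (pip install jieba); other tools could be used instead.
import jieba

def segment(proofread_text: str) -> str:
    # Return the proofreading text with segmented words separated by spaces.
    return " ".join(jieba.cut(proofread_text))
```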
Step S5: and carrying out noise reduction processing on the corrected voice to obtain noise-reduced voice.
As an exemplary embodiment, noise reduction is performed on the voice so that unrelated noise in the voice does not interfere with later results. The training voice itself must remain the originally collected voice without any processing; the noise reduction here only prevents unknown noise in the voice from affecting the subsequent processing steps. The noise reduction process removes sounds other than the target voice from the recorded voice.
In this embodiment, the noise reduction method may be wiener filtering noise reduction, spectral subtraction noise reduction, LMS adaptive filter noise reduction, deep neural network model noise reduction, or the like; of course, in other embodiments, other noise reduction methods may also be adopted, and this embodiment is not limited thereto.
Step S6: and extracting the characteristics of the noise-reduced voice to obtain the voice characteristics.
As an exemplary embodiment, the feature extraction may use Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Mel-scale filter banks (FBank), i-vector features, bottleneck features, zero-crossing rate (ZCR), short-time average energy, etc.; this embodiment is only illustrative and not limiting. Different feature extraction methods yield different voice features, and the extraction method can be chosen as needed. Feature extraction is performed on the noise-reduced voice to obtain the voice features, which represent the noise-reduced voice in another dimension so that they can be input into a VAD (Voice Activity Detection) model for detection.
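For illustration, the following sketch extracts MFCC features with the librosa library; the 16 kHz sample rate and 25 ms/10 ms framing are assumed values, and any of the other listed feature types could be used instead.

```python
# Sketch of feature extraction using MFCCs via librosa; FBank, LPC or other
# features listed above could be substituted. Frame settings are illustrative.
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 13):
    y, sr = librosa.load(wav_path, sr=16000)
    # 25 ms windows (400 samples) with a 10 ms hop (160 samples) at 16 kHz
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)
```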
Step S7: and detecting the voice characteristics according to the VAD model to obtain the VAD effective voice starting time, the VAD effective voice ending time and the VAD effective voice duration of the noise-reduced voice.
As an exemplary embodiment, in order to obtain the valid voice duration of the whole utterance, a VAD detection module is used to obtain the VAD valid voice start time and the VAD valid voice end time of the noise-reduced voice, from which the VAD valid voice duration of the noise-reduced voice is then obtained.
For example, the duration of the noise-reduced voice is 10 seconds, the VAD valid voice start time obtained after VAD detection is 3s, the VAD valid voice end time is 7 seconds, and the VAD valid voice duration is 4 seconds.
Step S8: and performing voice forced alignment by adopting an acoustic model according to the word segmentation text, the voice characteristics and a pre-stored pronunciation dictionary to obtain an alignment result.
As an exemplary embodiment, the pre-stored pronunciation dictionary is a huge Chinese character comparison table, which includes Chinese characters and pinyin corresponding to the Chinese characters to provide standard phonetic symbols of the Chinese characters.
As an exemplary embodiment, the acoustic model may be a conventional GMM-HMM model, or may be a deep neural network model, such as CNN, RNN-T, LSTM, BLSTM, DNN, TDNN, CLDNN, FSMN, transducer-entry, etc.; the present embodiment is only illustrative, and not limited thereto.
As an exemplary embodiment, the forced alignment of text and speech uses the Viterbi algorithm. Viterbi decoding is a dynamic-programming algorithm. Briefly, the audio is cut into short frames (each called a sample), usually 5 to 10 ms long; over such a short span the characteristics of the audio are considered unchanged. Features are extracted from every sample of an audio clip and their similarity to the features of the standard phonetic symbols is computed; $b_i(o_t)$ denotes the similarity between the t-th sample and the i-th phonetic-symbol model. $\delta_t(i)$ denotes the maximum probability that the audio has reached phonetic symbol $i$ at sample time $t$; the value at time $t+1$, $\delta_{t+1}(i)$, can then be derived from time $t$ using the recursion formula. During decoding, $t$ increases from 0 until the end of the audio, finally giving $\delta_T(i)$ for every phonetic symbol $i$.

The Viterbi algorithm solves this prediction problem by dynamic programming: it finds the most likely state sequence for a given observation sequence. That is, given the model $\lambda = (A, B, \pi)$ and the observation sequence $O = (o_1, o_2, \ldots, o_T)$, it finds the state sequence $I = (i_1, i_2, \ldots, i_T)$ that maximizes the conditional probability $P(I \mid O)$.

The algorithm first introduces two variables, $\delta$ and $\psi$. $\delta_t(i)$ is the maximum probability over all single paths $(i_1, i_2, \ldots, i_t)$ whose state at time $t$ is $i$:

$$\delta_t(i) = \max_{i_1, i_2, \ldots, i_{t-1}} P(i_t = i, i_{t-1}, \ldots, i_1, o_t, \ldots, o_1 \mid \lambda), \quad i = 1, 2, \ldots, N$$

From this definition, the recursion formula for $\delta$ is:

$$\delta_{t+1}(i) = \max_{1 \le j \le N} \left[ \delta_t(j)\, a_{ji} \right] b_i(o_{t+1})$$

The algorithm sets the initial value $\delta_1(i) = \pi_i\, b_i(o_1)$ and then iterates; the termination condition is:

$$P^{*} = \max_{1 \le i \le N} \delta_T(i)$$

$\psi_t(i)$ is the $(t-1)$-th node of the single path $(i_1, i_2, \ldots, i_t)$ with the highest probability among all paths whose state at time $t$ is $i$:

$$\psi_t(i) = \arg\max_{1 \le j \le N} \left[ \delta_{t-1}(j)\, a_{ji} \right]$$

i.e., it records which state $j$ at time $t-1$ maximizes the probability of transitioning from state $j$ at time $t-1$ to state $i$ at time $t$.
Specifically, the alignment result contains every character in the word-segmented text together with the start time and end time corresponding to that character. When reading the text aloud, it is not guaranteed that the speaker reads it exactly as written. Forced alignment predicts which section of the audio each specific word of the read-aloud text corresponds to, and estimates the start and end time of that word in the audio, giving a word-level alignment result.
For example, the proofreading text is "this is a piece of voice-aligned text" (这是一条语音对齐文本), and word segmentation of this proofreading text yields the word-segmented text. The alignment result obtained after forced voice alignment with the acoustic model lists each character with its start and end time (in seconds):

这 (this)   0.20  0.50
是 (is)     0.60  0.70
一 (a)      0.75  0.85
条 (piece)  0.86  0.95
语 (voice)  1.03  1.10
音 (voice)  1.12  1.20
对 (align)  1.30  1.40
齐 (align)  1.41  1.50
文 (text)   1.65  1.80
本 (text)   1.82  1.99
Step S9: and obtaining word level alignment time, word level time interval, segmented text starting time, segmented text ending time and text alignment time according to the alignment result.
As an exemplary embodiment, time extraction is performed on the alignment result to obtain a word start time and a word end time corresponding to each word, a difference between the word end time and the word start time of the same word is a word level alignment time corresponding to the word, and a difference between the next word start time and the previous word end time is a word level time interval between adjacent words. And then segmenting the word segmentation text according to the word level time interval to obtain a segmented text, segmented text starting time, segmented text ending time and text alignment time so as to perform voice segmentation on the qualified voice in the following.
Step S10: and obtaining the total word number of the text in the word segmentation text according to the word segmentation text.
As an exemplary embodiment, word count of text in the segmented text may be performed by word count statistics in the code, and the total word count of the text may be obtained. The present embodiment is only illustrative, and not limited thereto.
Step S11: and obtaining the speech speed, the effective time ratio and the error word number according to the VAD effective speech duration, the text alignment time, the word level alignment time and the total word number of the text.
As an exemplary embodiment, the speech rate is generally controlled within a certain range during voice data collection; pronouncing too few or too many words per unit time does not meet the requirement. Likewise, when the speaker reads more or fewer words than the text requires, the computed speech rate becomes too fast or too slow, so unqualified voice can be filtered out by the speech-rate check.
Under the condition that the speech speed meets the requirement, the effective time and the number of error words also need to be further checked, and the speech quality is ensured.
The effective time ratio checks the completeness of the text reading: the closer the effective time ratio is to 1, the more completely the text was read. A value much larger than 1 indicates that content not in the text was spoken while reading, or that part of the text was repeated.
The error word count checks how many words were read in excess or missed.
Step S12: and carrying out voice quality inspection according to the speed of speech, the effective time ratio and the number of error words to obtain qualified voice.
As an exemplary embodiment, the voice is checked according to the speed of speech, the effective time ratio and the number of error words, when the voice simultaneously meets the above requirements, the voice quality is qualified, otherwise, the voice quality is unqualified.
Step S13: and segmenting the original voice data corresponding to the qualified voice according to the segmented text starting time and the segmented text ending time to obtain segmented voice corresponding to the segmented text, and taking the segmented text and the segmented voice as voice labeling results.
Labeling is performed on the original voice, and the final labeling result is a set of short audio segments, each with the segmented text corresponding to it.
As an exemplary embodiment, the original voice data corresponding to the qualified voice and the segmented text corresponding to the qualified voice are matched; the original voice is then cut according to the segmented-text start time and end time to form segmented voice and its corresponding segmented text, where each segmented text is matched with its corresponding segmented voice, i.e. the text of a sentence corresponds to the voice of that sentence. The mutually matched segmented voice and segmented text are the voice labeling result; this result meets the requirements of model training and can be used for model training.
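A minimal sketch of this final segmentation step follows, assuming the original recording is a wav file readable with the soundfile package and that the segment boundaries are given in seconds; the file naming scheme is illustrative.

```python
# Sketch of the final segmentation step: cut the original (unprocessed) wav at
# the segmented-text start/end times and pair each clip with its text.
import soundfile as sf

def cut_segments(wav_path: str, segments, out_prefix: str):
    """segments: iterable of (start_s, end_s, text) for one qualified recording."""
    audio, sr = sf.read(wav_path)
    results = []
    for k, (start_s, end_s, text) in enumerate(segments):
        clip = audio[int(start_s * sr): int(end_s * sr)]
        clip_path = f"{out_prefix}_{k:03d}.wav"
        sf.write(clip_path, clip, sr)
        results.append((clip_path, text))        # (segmented voice, segmented text)
    return results
```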
Firstly, screening original voice data, and matching the text read aloud with the screened voice after primary screening to obtain corresponding proofreading voice and proofreading text; then, performing word segmentation on the proofreading text to obtain a word segmentation text, and performing noise reduction on the proofreading voice to obtain noise reduction voice so as to avoid the influence of noise on subsequent operation; performing voice feature extraction on the noise-reduced voice, and inputting the voice feature after the feature extraction into a VAD model to obtain VAD effective voice starting time, VAD effective voice ending time and VAD effective voice duration of the noise-reduced voice; performing voice forced alignment on the word segmentation text by adopting an acoustic model to obtain word level alignment time, word level time interval, segmented text starting time, segmented text ending time and text alignment time; and finally, segmenting original voice data corresponding to the qualified voice according to the starting time of the segmented text and the ending time of the segmented text to obtain segmented voice corresponding to the segmented text, and taking the segmented text and the segmented voice as voice labeling results. The method realizes the automatic acquisition of the voice labeling text and improves the quality of the voice labeling text.
As an exemplary embodiment, the step S9 of obtaining the word-level alignment time, the word-level time interval, the segmented text start time, the segmented text end time, and the text alignment time according to the alignment result includes steps S91-S94.
Step S91: and obtaining the word-level alignment time and the word-level time interval according to the alignment result.
In this embodiment, the alignment result is analyzed and calculated, specifically, the start time and the end time of each word are obtained according to the alignment result, so as to obtain the time corresponding to each word, that is, the word-level alignment time. The time interval between adjacent words, i.e. the word-level time interval, is obtained from the end time of the previous word and the start time of the next word.
Step S92: and segmenting the word segmentation text according to a preset word interval threshold value and a word level time interval to obtain a segmented text.
In this embodiment, the preset word space threshold is used to segment the participle text, where the segmentation specifically refers to segmenting a plurality of words in the participle text to form a plurality of sentence segments, that is, paragraphs. A paragraph in this embodiment refers to one sentence.
Specifically, when the word-level time interval is greater than or equal to the preset word interval threshold, it indicates that the pause time between words is too long, so that the word is segmented there, and the word and the previous word belong to different paragraphs.
Step S93: and obtaining the starting time and the ending time of the segmented text according to the segmented text.
In this embodiment, a segmented text is obtained after segmenting the word text, the start time of the segmented text of the paragraph is obtained according to the word at the start position of each paragraph, and the end time of the segmented text of the paragraph is obtained according to the word at the end position of each paragraph.
Step S94: and obtaining the text alignment time according to the segmented text starting time and the segmented text ending time.
In this embodiment, the paragraph alignment time of each paragraph is obtained according to the start time and the end time of the segmented text, and the paragraph alignment times of all paragraphs are added to obtain the text alignment time.
For example, speech is divided into three segments, with paragraph alignment times of 10 seconds, 15 seconds, and 8 seconds, respectively, and the text alignment time is 10+15+8=33 seconds.
As an exemplary embodiment, the step of segmenting the participle text according to the preset word space threshold and the word level time space in the step S92 includes steps S921-S924.
Step S921: and acquiring a preset word space threshold, wherein the preset word space threshold is determined according to the mute period time before and after the effective voice and the voice acquisition pause time.
In this embodiment, the general voice labeling requires that a silence period of 0.2s is satisfied before and after the effective voice, and in combination with the requirement for the voice pause time when the voice is collected, the preset word space threshold is set to be greater than or equal to 0.4s in the text segmentation process. Specifically, in this embodiment, the preset word interval threshold is set to 0.4 s; of course, in other embodiments, the preset word-space threshold may also be set to other values, such as 0.6s, and may be reasonably set as required.
Step S922: and judging whether the word level time interval is smaller than a preset word interval threshold value or not. If the word-level time interval is smaller than the preset word interval threshold, performing step S923; if the word-level time interval is greater than or equal to the preset word interval threshold, step S924 is performed.
Step S923: and if the word-level time interval is smaller than a preset word interval threshold, segmenting adjacent words.
In this embodiment, when the word-level time interval is smaller than the preset word interval threshold, it indicates that the pause time between adjacent words is short, that is, the time interval between the adjacent words is short, and the adjacent words belong to the same paragraph, so that paragraph segmentation is not required.
Step S924: and if the word-level time interval is greater than or equal to a preset word interval threshold, performing paragraph segmentation on adjacent words.
In this embodiment, when the word-level time interval is greater than or equal to the preset word interval threshold, it indicates that the pause time between adjacent words is longer, that is, the time interval between the adjacent words is longer, the adjacent words do not belong to the same paragraph, the previous word belongs to the previous paragraph, and the next word belongs to the next paragraph, so that paragraph segmentation is performed at this point to distinguish different paragraphs.
In the above steps, the word-segmented text is split into paragraphs by comparing the word-level time interval with the preset word-interval threshold, which improves the accuracy of the text alignment time.
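A minimal sketch of steps S921 to S924 follows, assuming the alignment result is available as a list of (word, start, end) tuples; the 0.4 s threshold is the example value given above.

```python
# Sketch of paragraph segmentation (steps S921-S924): start a new paragraph
# whenever the gap between adjacent words reaches the preset word-interval
# threshold. `aligned` is a non-empty list of (word, start_s, end_s) tuples.
def split_paragraphs(aligned, gap_threshold_s: float = 0.4):
    if not aligned:
        return []
    paragraphs = []
    current = [aligned[0]]
    for prev, cur in zip(aligned, aligned[1:]):
        gap = cur[1] - prev[2]                  # word-level time interval
        if gap >= gap_threshold_s:              # S924: segment between these words
            paragraphs.append(current)
            current = []
        current.append(cur)                     # S923: otherwise same paragraph
    paragraphs.append(current)
    # Each paragraph's start/end time come from its first and last word.
    return [(p[0][1], p[-1][2], [w for w, _, _ in p]) for p in paragraphs]
```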
As an exemplary embodiment, the step S11 of obtaining the speech rate, the effective time ratio and the error word count according to the VAD valid speech duration, the text alignment time, the word level alignment time and the total word count of the text includes steps S111 to S114.
Step S111: obtaining the average duration of the word level according to the alignment time of the word level and the total word number of the text, wherein the formula for calculating the average duration of the word level is as follows:
Figure 672185DEST_PATH_IMAGE014
wherein the content of the first and second substances,
Figure 238295DEST_PATH_IMAGE015
indicating the average duration at the word level,
Figure 602280DEST_PATH_IMAGE020
which represents the total number of words of the text,
Figure 962855DEST_PATH_IMAGE017
representing the word level alignment time of the ith word, wherein the value range of i is more than or equal to 1 and is less than or equal to N;
specifically, the word level alignment times corresponding to each word in the text are added to obtain the alignment times corresponding to all the words in the text
Figure 163023DEST_PATH_IMAGE031
Then, the average duration of the word level is obtained by dividing the total number of words in the text.
Step S112: obtaining the speech rate according to the VAD effective voice duration and the total word number of the text, wherein the formula for calculating the speech rate is as follows:
Figure 532824DEST_PATH_IMAGE018
wherein the content of the first and second substances,
Figure 485737DEST_PATH_IMAGE019
the speed of a speech is represented by,
Figure 282791DEST_PATH_IMAGE020
which represents the total number of words of the text,
Figure 203212DEST_PATH_IMAGE032
represents the VAD active speech duration;
step S113: obtaining the speech rate according to the VAD effective voice duration and the total word number of the text, wherein the formula for calculating the speech rate is as follows:
Figure 376704DEST_PATH_IMAGE033
wherein the content of the first and second substances,
Figure 449702DEST_PATH_IMAGE023
it is shown that the effective time ratio,
Figure 230708DEST_PATH_IMAGE032
indicating the duration of VAD active speech,
Figure 326840DEST_PATH_IMAGE034
representing a text alignment time;
step S114: obtaining the error word number according to the VAD effective voice duration, the word level alignment time and the word level average duration, wherein the formula for calculating the error word number is as follows:
Figure 100761DEST_PATH_IMAGE035
wherein the content of the first and second substances,
Figure 965949DEST_PATH_IMAGE026
the number of words representing the error is,
Figure 419320DEST_PATH_IMAGE021
indicating the duration of VAD active speech,
Figure 2748DEST_PATH_IMAGE017
the word level alignment time of the ith word is represented, i is more than or equal to 1 and less than or equal to N,
Figure 314781DEST_PATH_IMAGE015
indicating the word-level average duration.
The effective time ratio in the above steps is obtained only after the noise in the original voice has been removed, i.e. after the noise interference factor is eliminated: the VAD valid speech duration and the text alignment time are obtained in two independent ways and then compared, which makes the effective time ratio more accurate.
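Putting steps S111 to S114 together, a minimal sketch of the metric computation reads as follows; the error-word formula follows the reconstruction given above, and the word-level alignment times, VAD valid speech duration and text alignment time are assumed to be available in seconds.

```python
# Sketch of the quality metrics from steps S111-S114. `word_times` holds the
# word-level alignment time t_i of each word; vad_dur and text_align are in
# seconds and come from the VAD model and the forced alignment respectively.
def quality_metrics(word_times, vad_dur: float, text_align: float):
    n = len(word_times)
    avg_word = sum(word_times) / n               # word-level average duration
    speech_rate = n / vad_dur                    # words per second of valid speech
    effective_ratio = vad_dur / text_align       # VAD duration vs. aligned text time
    error_words = abs(vad_dur - sum(word_times)) / avg_word
    return speech_rate, effective_ratio, error_words
```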
As an exemplary embodiment, the step S12 of performing voice quality check according to the speech rate, the valid time ratio and the number of error words to obtain a quality qualified voice includes steps S121 to S127.
Step S121: and judging whether the speech rate is in the range of the preset speech rate threshold value. If the speech rate is not within the range of the preset speech rate threshold, executing step S122; if the speech rate is within the range of the preset speech rate threshold, step S123 is executed.
Specifically, the preset speech rate threshold is obtained by statistical analysis of qualified speech data collected in similar past acquisition scenarios, and is usually in the range of 120 to 250 words per minute (roughly 2 to 4 words per second); the specific threshold is determined by the actual acquisition scenario.
Step S122: and if the speech rate is not in the range of the preset speech rate threshold value, the voice quality detection is unqualified.
Specifically, when the speech rate is not within the range of the preset speech rate threshold, the speech is too fast or too slow and does not meet the requirements of speech labeling; the quality check fails and the unqualified speech is discarded.
Step S123: if the speech rate is in the range of the preset speech rate threshold value, judging whether the effective time ratio is in the range of the preset time ratio. If the valid time ratio is not within the preset time ratio, go to step S124; if the valid time ratio is within the preset time ratio, step S125 is executed.
If the speech rate is within the range of the preset speech rate threshold, it is indicated that the speech rate of the speech is qualified, and it is further determined whether the effective time ratio meets the requirement. In this embodiment, the preset time ratio is set to be 0.9-1.1, and certainly, in other embodiments, the preset time ratio can also be set to be other values, such as 0.95-1.05, and the specific value range can be reasonably set as required.
Step S124: and if the effective time ratio is not in the range of the preset time ratio, the voice quality detection is unqualified.
If the effective time ratio is not within the range of the preset time ratio, the text was not read as expected (content was missed or extra content was read), so the quality check fails and the unqualified voice is removed.
For example, suppose the text to be recorded is "I want to eat", the collected voice is 10 s long, the VAD valid speech duration obtained from the VAD model is 7 s, and the text alignment time obtained after aligning the word-segmented text is 5 s, so the effective time ratio is 7/5 = 1.4. Since the ratio is greater than 1, extra content was most likely spoken during collection; for example, the actually collected speech may have been "I want to go to have a meal". This pushes the effective time ratio outside the preset range, the text was over-read, and the voice fails the quality check.
Step S125: and if the effective time ratio is within the range of the preset time ratio, judging whether the error word number is within the range of the preset error word number. If the error word number is not within the preset error word number range, go to step S126; if the number of error words is within the predetermined range of the number of error words, step S127 is executed.
If the effective time ratio is within the range of the preset time ratio, the text has been read completely, and it must further be judged whether the error word count meets the requirement. In this embodiment, the preset error word count range is set to 1-2 words; of course, in other embodiments it may be set to other values, such as 1-3 words, and the specific range can be set as required.
Step S126: and if the error word number is not in the range of the preset error word number, the voice quality detection is unqualified.
If the error word count is not within the preset error word count range, too many words were over-read or missed, so the voice quality check fails and the unqualified voice is removed.
Step S127: and if the error word number is within the range of the preset error word number, the voice quality is detected to be qualified, and qualified voice is obtained.
If the error word number is within the range of the preset error word number, the voice quality detection is qualified, and qualified voice is obtained.
The above steps check the voice quality from multiple aspects through the speech rate, the effective time ratio and the error word count, thereby ensuring the quality of the labeled voice.
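As a concrete illustration of the cascade in steps S121-S127, the following sketch applies the three checks in sequence; the numeric thresholds are placeholder assumptions (a 2 to 4.2 words-per-second speech rate range, a 0.9-1.1 time ratio range and a maximum of 2 error words) and would be set per acquisition scenario in practice.

# Illustrative sketch of the quality-check cascade (steps S121-S127).
def check_quality(speech_rate: float, time_ratio: float, error_words: float,
                  rate_range=(2.0, 4.2),       # assumed words/second range
                  ratio_range=(0.9, 1.1),      # preset time ratio range
                  max_error_words=2.0) -> bool:
    if not (rate_range[0] <= speech_rate <= rate_range[1]):    # S121/S122
        return False
    if not (ratio_range[0] <= time_ratio <= ratio_range[1]):   # S123/S124
        return False
    if error_words > max_error_words:                          # S125/S126
        return False
    return True                                                # S127: qualified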
As an exemplary embodiment, the step S7 is to detect the voice features according to the VAD model, and obtain the VAD valid voice start time, VAD valid voice end time and VAD valid voice duration of the noise-reduced voice, as shown in fig. 2, including steps S7001-S7011.
Step S7001: and inputting the voice characteristics into a VAD model to obtain a voice prediction result of each frame.
Specifically, the prediction result comprises effective speech and noise, and whether each frame is effective speech or noise is obtained through VAD model prediction, so that the start time and the end time of the effective speech are determined according to the speech prediction result of each frame.
Step S7002: and judging whether the voice prediction result of the continuous first preset frame number is effective voice. If the voice prediction result of the continuous first preset frame number is not valid voice, executing the step S7003; if the result of the voice prediction of the first preset frame number is a valid voice, step S7004 is executed.
Specifically, suppose a segment of voice has M frames in total; detection proceeds frame by frame from the first frame. Once a run of consecutive frames equal to the first preset frame number is found to be valid speech, step S7004 is executed; otherwise, step S7003 is executed.
The first preset frame number is determined according to actual requirements, namely the acceptable minimum duration of continuous valid speech and the maximum acceptable duration of continuous invalid speech. Typically, the maximum continuous invalid speech is set to 0.2 s and the minimum continuous valid speech to 0.5 s; therefore, in this embodiment, the total duration of the first preset frame number is set to 0.5 s.
Step S7003: and if the voice prediction result of the continuous first preset frame number is not valid voice, moving the first preset frame number backwards, and returning to the step S7002.
Specifically, when the prediction result of the continuous first preset frame number is not valid speech, that is, there is noise in the part of speech and it is not the starting point of valid speech, the step moves backward by the first preset frame number, and returns to step S7002 to perform valid speech detection on the prediction result of the first preset frame number later.
Step S7004: and if the voice prediction result of the continuous first preset frame number is effective voice, taking the time corresponding to the voice initial position of the continuous first preset frame number as VAD effective voice initial time.
When the result of the continuous voice prediction with the first preset frame number is valid voice, it indicates that each frame of voice with the first preset frame number is valid voice, so the time corresponding to the voice start position with the continuous first preset frame number is used as the VAD valid voice start time.
Step S7005: and judging whether the voice prediction result of the continuous second preset frame number is noise or not. If the voice prediction result of the continuous second preset frame number is not noise, executing the step S7006; if the result of the continuous voice prediction of the second preset frame number is noise, step S7007 is executed.
Specifically, the acceptable continuous minimum voice time length and the continuous maximum invalid voice length are determined according to actual requirements. The normal continuous inactive speech is set to 0.2s and the continuous minimum active speech duration is set to 0.5 s. Therefore, in this embodiment, the total duration of the second preset frame number is set to 0.2 s.
Step S7006: and if the voice prediction result of the continuous second preset frame number is not noise, moving the second preset frame number backwards, and returning to the step S7005.
And when the voice prediction result of the continuous second preset frame number is not noise, indicating that effective voice exists in the part of voice, moving backwards by the second preset frame number, returning to the step S7005, and detecting the noise of the prediction result of the subsequent second preset frame number.
Step S7007: and if the voice prediction result of the continuous second preset frame number is noise, taking the time corresponding to the voice initial position of the continuous second preset frame number as the VAD effective voice ending time.
When the result of the voice prediction with the second preset frame number is noise, it indicates that there is no valid voice in the part of voice and the valid voice is over, so the time corresponding to the voice start position with the second preset frame number is used as the VAD valid voice end time.
Step S7008: and calculating the VAD effective voice duration according to the VAD effective voice starting time and the VAD effective voice ending time.
Specifically, the VAD valid voice duration is obtained by subtracting the VAD valid voice start time from the VAD valid voice end time.
Step S7009: and judging whether the VAD valid voice time length is less than the preset voice minimum time length or not. If the VAD valid voice duration is less than the preset voice minimum duration, executing step S7010; if the VAD valid speech duration is greater than or equal to the preset speech minimum duration, step S7011 is executed. The accuracy of effective speech recognition is improved by presetting the minimum duration of speech.
Specifically, the acceptable continuous minimum voice time length and the continuous maximum invalid voice length are determined according to actual requirements. The normal continuous inactive speech is set to 0.2s and the continuous minimum active speech duration is set to 0.5 s. Therefore, in this embodiment, the preset minimum duration of speech is set to 0.5 s.
Step S7010: if the VAD valid voice duration is less than the preset voice minimum duration, the step S7002 is returned to.
When the VAD valid voice duration is less than the preset voice minimum duration, the VAD valid voice duration does not meet the requirement, and the voice is invalid voice.
Step S7011: and if the VAD valid voice duration is greater than or equal to the preset voice minimum duration, the detected duration is taken as the final VAD valid voice duration.
When the VAD valid voice duration is greater than or equal to the preset voice minimum duration, the requirement is met, the speech is valid speech, and the detected duration is used as the final VAD valid voice duration.
In the above steps, voice after noise reduction is used for VAD detection, so that the detection accuracy can be greatly improved, and more accurate VAD effective voice starting and ending time can be obtained.
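The frame-scanning logic of steps S7001-S7011 can be sketched as follows. This is a minimal illustration that takes per-frame VAD predictions as input; the 10 ms frame length and the 0.5 s / 0.2 s windows mirror the values discussed above, and all names are assumptions made only for this example.

# Illustrative sketch of steps S7001-S7011: locate VAD valid speech start and
# end times from per-frame speech/noise predictions.
from typing import List, Optional, Tuple

def find_valid_segment(frame_is_speech: List[bool],
                       frame_len: float = 0.01,   # assumed 10 ms frames
                       min_speech: float = 0.5,   # first preset window (S7002)
                       max_silence: float = 0.2,  # second preset window (S7005)
                       min_duration: float = 0.5) -> Optional[Tuple[float, float]]:
    n_speech = int(min_speech / frame_len)
    n_silence = int(max_silence / frame_len)
    i = 0
    while i + n_speech <= len(frame_is_speech):
        if all(frame_is_speech[i:i + n_speech]):        # S7004: run of valid frames
            start = i * frame_len                        # VAD valid speech start time
            j = i + n_speech
            while (j + n_silence <= len(frame_is_speech)
                   and any(frame_is_speech[j:j + n_silence])):
                j += n_silence                           # S7006: still valid speech
            end = j * frame_len                          # S7007: VAD valid speech end time
            if end - start >= min_duration:              # S7009/S7011
                return start, end
            i = j                                        # S7010: too short, keep scanning
        else:
            i += n_speech                                # S7003: shift the window
    return None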
As an exemplary embodiment, the screening process includes: voice signal-to-noise ratio detection, voice reverberation detection, voice amplitude clipping detection, frequency band loss detection, volume detection and microphone pop detection.
In this embodiment, the collected original voice data is subjected to voice signal-to-noise ratio detection, voice reverberation detection, voice amplitude clipping detection, frequency band loss detection, volume detection and microphone pop detection, so as to screen the original voice data and remove unqualified voice data. Of course, in other embodiments, the screening process may use only some of the above checks, or may include checks other than those listed, and can be configured as needed.
Specifically, the speech SNR detection calculates the SNR of the original speech data and compares it with a preset SNR; when the SNR is smaller than the preset SNR, the noise in the original speech data is too strong and the data needs to be discarded.
The signal-to-noise ratio is calculated as follows:

SNR = 10 × log10(P_signal / P_noise) = 20 × log10(A_signal / A_noise)

wherein P_signal is the signal power (Power of Signal), P_noise is the noise power (Power of Noise), A_signal is the signal amplitude (Amplitude of Signal), and A_noise is the noise amplitude (Amplitude of Noise).
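A direct way to apply this formula is sketched below; treating the first 0.3 s of the recording as a noise-only reference is an assumption made purely for this example, as the disclosure does not specify how the noise segment is chosen.

# Illustrative sketch: SNR estimate in dB using SNR = 10*log10(P_signal/P_noise).
import numpy as np

def estimate_snr_db(samples: np.ndarray, sample_rate: int,
                    noise_head: float = 0.3) -> float:
    n_noise = int(noise_head * sample_rate)
    noise = samples[:n_noise].astype(np.float64)
    speech = samples[n_noise:].astype(np.float64)
    p_noise = np.mean(noise ** 2) + 1e-12      # noise power
    p_signal = np.mean(speech ** 2) + 1e-12    # signal power
    return 10.0 * np.log10(p_signal / p_noise)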
Specifically, when sound waves propagate indoors, they are reflected by obstacles such as walls, ceilings and floors, and part of the sound energy is absorbed at each reflection. After the sound source stops sounding, the sound waves are therefore reflected and absorbed many times before they finally die away, and multiple sound waves remain mixed for a period of time after the source stops (the sound persists in the room after the source stops sounding); this phenomenon is called reverberation, and this period is called the reverberation time. Different sites have different reverberation times; the reverberation time is calculated as T = 0.161V/(S × a), where T is the reverberation time, V is the room volume, S is the total wall area of the room, and a is the average sound absorption coefficient of the room surfaces. Voice data collected in different scenes have different optimal reverberation times; see table 1.
Table 1: optimal reverberation time for different scenes (seconds)

Cinema and conference hall: 1.0~1.2
Television studio: 0.7~1.0
Speech and drama: 1.0~1.4
Language recording: 0.3~0.4
Opera and music hall: 1.5~1.8
Music recording: 1.4~1.6
Multifunctional hall: 1.2~1.4
Multifunctional gymnasium: less than 1.8
In this embodiment, the collected original voice data is a language recording, so the preset reverberation time is 0.3-0.4 seconds; the reverberation time calculated for the original voice data is compared with this preset range, and voices with severe reverberation are removed.
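The room-acoustics formula quoted above can be evaluated directly; the sketch below only reproduces that formula with made-up example inputs, and does not estimate reverberation from the audio itself (which would require a decay-based measurement).

# Illustrative sketch: reverberation time T = 0.161 * V / (S * a) and a check
# against the 0.3-0.4 s range used for language recordings.
def reverberation_time(volume_m3: float, wall_area_m2: float,
                       absorption_coeff: float) -> float:
    return 0.161 * volume_m3 / (wall_area_m2 * absorption_coeff)

t60 = reverberation_time(volume_m3=60.0, wall_area_m2=94.0, absorption_coeff=0.35)
ok_for_language_recording = 0.3 <= t60 <= 0.4   # preset reverberation range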
When the waveform amplitude of the voice is too large, it may exceed the linear range of the system or device; the part of the waveform exceeding the linear range is clipped and the voice data becomes incomplete. Voice amplitude clipping detection removes voices whose waveform amplitude does not meet the requirement. Specifically, the amplitude of the voice is compared with a preset amplitude (which can be determined from the waveform amplitude the device can handle linearly); when the amplitude of the voice exceeds the preset amplitude range, the voice is discarded.
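One simple way to flag clipped recordings is to count samples that sit at or near full scale; the sketch below assumes 16-bit PCM samples, and the 0.99 margin and 0.1% sample ratio are illustrative thresholds.

# Illustrative sketch: amplitude clipping check for 16-bit PCM audio.
import numpy as np

def is_clipped(samples: np.ndarray, full_scale: float = 32768.0,
               margin: float = 0.99, max_ratio: float = 0.001) -> bool:
    near_full_scale = np.abs(samples.astype(np.float64)) >= margin * full_scale
    return bool(near_full_scale.mean() > max_ratio)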
The frequency band loss detection checks whether the frequency band information of the voice is complete; when band information is missing and the voice is therefore incomplete, the voice is removed. For example, if the voice data is nominally 8 kHz data but the highest frequency band actually detected is 3.5 kHz, multiplying that frequency by 2 gives an equivalent bandwidth of 3.5 × 2 = 7 kHz, which is smaller than 8 kHz; this indicates that the upper part of the frequency-domain information is lost and the frequency band information of the voice is incomplete.
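The band-loss check in the example above can be sketched as follows: find the highest frequency with meaningful energy and compare twice that frequency with the nominal sampling rate. The -40 dB energy floor is an assumption made only for this sketch.

# Illustrative sketch of the frequency band loss check.
import numpy as np

def has_band_loss(samples: np.ndarray, sample_rate: int,
                  floor_db: float = -40.0) -> bool:
    spectrum = np.abs(np.fft.rfft(samples.astype(np.float64)))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    level_db = 20.0 * np.log10(spectrum / (spectrum.max() + 1e-12) + 1e-12)
    active = freqs[level_db > floor_db]
    highest = float(active.max()) if active.size else 0.0  # highest frequency present
    return 2.0 * highest < sample_rate    # e.g. 2 * 3.5 kHz = 7 kHz < 8 kHz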
Voice volume detection checks the volume of the voice; when the volume is lower than the preset volume, the waveform amplitude is too small and the sound energy is too low for the voice to be used for labeling, so voices with too small a volume are removed.
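A minimal volume check can compare the RMS level against a preset floor; the -30 dBFS threshold and 16-bit full scale below are assumptions made only for this example.

# Illustrative sketch: reject recordings whose average level is too low.
import numpy as np

def volume_too_low(samples: np.ndarray, full_scale: float = 32768.0,
                   threshold_dbfs: float = -30.0) -> bool:
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2)) + 1e-12
    return 20.0 * np.log10(rms / full_scale) < threshold_dbfs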
Microphone pop detection checks whether strong airflow interference appears in the valid speech, which happens when the speaker is too close to the microphone. When the airflow noise is too strong, the speech waveform vibrates violently and degrades the quality of the valid speech. By examining the energy and waveform changes in the speech, recordings that change too abruptly are removed.
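As a rough illustration of the energy-change criterion, the sketch below flags recordings whose frame-to-frame energy jumps too sharply; the 10 ms frame size and the 20x jump ratio are assumptions made only for this example.

# Illustrative sketch: microphone pop check based on abrupt frame energy jumps.
import numpy as np

def has_mic_pop(samples: np.ndarray, sample_rate: int,
                frame_ms: float = 10.0, max_jump_ratio: float = 20.0) -> bool:
    frame_len = int(sample_rate * frame_ms / 1000.0)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].astype(np.float64).reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1) + 1e-12
    jumps = energy[1:] / energy[:-1]
    return bool((jumps > max_jump_ratio).any())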
It should be noted that, in this embodiment, the order between the various methods of the screening process is arbitrary, and this is only schematically described in this embodiment; in practice, screening methods other than those listed above may be included.
In this embodiment, a speech data annotation system is further provided, and the system is used to implement the foregoing embodiments and preferred embodiments, and the description already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or a combination of software and hardware are also possible and contemplated.
The embodiment further provides a voice data annotation system, as shown in fig. 3, including: an acquisition module 1, a first processing module 2, a second processing module 3, a third processing module 4, a fourth processing module 5, a fifth processing module 6, a sixth processing module 7, a seventh processing module 8, an eighth processing module 9, a ninth processing module 10, a tenth processing module 11, an eleventh processing module 12 and a twelfth processing module 13.
The acquisition module 1 is used for acquiring original voice data;
the first processing module 2 is used for screening the original voice data to obtain screened voice;
the second processing module 3 is used for matching the screened voice with a pre-stored reading text to obtain a proofreading voice and a proofreading text which correspond to each other;
the third processing module 4 is used for performing word segmentation processing on the proofreading text to obtain a word segmentation text;
the fourth processing module 5 is configured to perform noise reduction processing on the corrected voice to obtain noise-reduced voice;
a fifth processing module 6, configured to perform feature extraction on the noise-reduced speech to obtain a speech feature;
a sixth processing module 7, configured to detect the voice feature according to the VAD model, so as to obtain VAD valid voice start time, VAD valid voice end time, and VAD valid voice duration of the noise-reduced voice;
the seventh processing module 8 is configured to perform forced speech alignment by using an acoustic model according to the segmented text, the speech features, and a pre-stored pronunciation dictionary to obtain an alignment result;
an eighth processing module 9, configured to obtain, according to the alignment result, word-level alignment time, word-level time interval, segmented text start time, segmented text end time, and text alignment time;
a ninth processing module 10, configured to obtain a total number of text words in the word segmentation text according to the word segmentation text;
a tenth processing module 11, configured to obtain a speech rate, an effective time ratio, and an error word count according to the VAD valid speech duration, the text alignment time, the word level alignment time, and the total number of text words;
the eleventh processing module 12 is configured to perform voice quality inspection according to the speech rate, the effective time ratio, and the error word count, so as to obtain qualified voice;
a twelfth processing module 13, configured to segment the original voice data corresponding to the qualified-quality voice according to the start time of the segmented text and the end time of the segmented text, so as to obtain a segmented voice corresponding to the segmented text, and use the segmented text and the segmented voice as a voice tagging result.
As an exemplary embodiment, the eighth processing module includes: the first processing unit is used for obtaining word level alignment time and word level time interval according to the alignment result; the second processing unit is used for segmenting the participle text according to a preset word interval threshold value and a word level time interval to obtain a segmented text; the third processing unit is used for obtaining the starting time and the ending time of the segmented text according to the segmented text; and the fourth processing unit is used for obtaining the text alignment time according to the segmented text starting time and the segmented text ending time.
As an exemplary embodiment, the second processing unit includes: the acquiring subunit is used for acquiring a preset word space threshold, and the preset word space threshold is determined according to the mute segment time before and after the effective voice and the voice acquisition pause time; a judging subunit, configured to judge whether the word-level time interval is smaller than the preset word interval threshold; the first processing subunit is configured to not perform paragraph segmentation on adjacent words if the word-level time interval is smaller than the preset word interval threshold; and the second processing subunit is used for performing paragraph segmentation on adjacent words if the word-level time interval is greater than or equal to the preset word interval threshold.
As an exemplary embodiment, the tenth processing module includes:
a fifth processing unit, configured to obtain a word-level average duration according to the word-level alignment time and the total number of words in the text, where the formula for calculating the word-level average duration is as follows:
T_avg = (t_1 + t_2 + ... + t_N) / N

wherein T_avg represents the word-level average duration, N represents the total number of words of the text, and t_i represents the word-level alignment time of the i-th word, with 1 ≤ i ≤ N;
obtaining the speech rate according to the VAD effective voice duration and the total word number of the text, wherein the formula for calculating the speech rate is as follows:
v = N / T_vad

wherein v represents the speech rate, N represents the total number of words of the text, and T_vad represents the VAD valid speech duration;
obtaining an effective time ratio according to the VAD effective voice duration and the text alignment time, wherein the formula for calculating the effective time ratio is as follows:
R = T_vad / T_align

wherein R represents the effective time ratio, T_vad represents the VAD valid speech duration, and T_align represents the text alignment time;
obtaining the error word number according to the VAD effective voice duration, the word level alignment time and the word level average duration, wherein the formula for calculating the error word number is as follows:
n_err = |T_vad - (t_1 + t_2 + ... + t_N)| / T_avg

wherein n_err represents the error word count, T_vad represents the VAD valid speech duration, t_i represents the word-level alignment time of the i-th word with 1 ≤ i ≤ N, and T_avg represents the word-level average duration.
As an exemplary embodiment, the eleventh processing module includes: the first judging unit is used for judging whether the speech rate is in the range of a preset speech rate threshold value; a ninth processing unit, configured to determine that the voice quality detection is not qualified if the speech rate is not within the range of the preset speech rate threshold; a tenth processing unit, configured to determine whether the effective time ratio is within a range of a preset time ratio if the speech rate is within the range of the preset speech rate threshold; the eleventh processing unit is used for determining that the voice quality detection is unqualified if the effective time ratio is not in the range of the preset time ratio; a twelfth processing unit, configured to determine whether the error word count is within a preset error word count range if the valid time ratio is within a preset time ratio range; the thirteenth processing unit is used for determining that the voice quality detection is unqualified if the error word number is not in the range of the preset error word number; and the fourteenth processing unit is used for detecting qualified voice quality if the error word number is within the range of the preset error word number, so as to obtain qualified voice quality.
As an exemplary embodiment, the sixth processing module includes: a fifteenth processing unit, configured to input the speech features into a VAD model to obtain a speech prediction result of each frame; the second judgment unit is used for judging whether the voice prediction result of the continuous first preset frame number is valid voice or not; a sixteenth processing unit, configured to move the first preset frame number backwards if the voice prediction result of the continuous first preset frame number is not valid voice, and return to the second determining unit; a seventeenth processing unit, configured to, if the result of the continuous speech prediction with the first preset frame number is an effective speech, take a time corresponding to a speech start position with the continuous first preset frame number as a VAD effective speech start time; the third judging unit is used for judging whether the voice prediction result of the continuous second preset frame number is noise or not; the eighteenth processing unit is used for moving the second preset frame number backwards and returning to the third judging unit if the voice prediction result of the continuous second preset frame number is not noise; a nineteenth processing unit, configured to, if the result of the voice prediction for the consecutive second preset frame number is noise, take a time corresponding to the voice start position for the consecutive second preset frame number as VAD valid voice end time; a twentieth processing unit, configured to calculate a VAD valid speech duration according to the VAD valid speech start time and the VAD valid speech end time; a fourth judging unit, configured to judge whether the VAD valid voice duration is less than a preset voice minimum duration; a twenty-first processing unit, configured to return to the second determining unit if the VAD valid speech duration is less than the preset speech minimum duration; and a twenty-second processing unit, configured to determine that the VAD effective speech duration is VAD effective speech duration if the VAD effective speech duration is greater than or equal to the preset speech minimum duration.
As an exemplary embodiment, the screening process includes: voice signal-to-noise ratio detection, voice reverberation detection, voice amplitude clipping detection, voice frequency band loss detection, voice volume detection, microphone pop detection and the like.
The voice data tagging system in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory that execute one or more software or firmware programs, and/or other devices that provide the above functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, as shown in fig. 4, the electronic device includes one or more processors 71 and a memory 72, where one processor 71 is taken as an example in fig. 4.
The controller may further include: an input device 73 and an output device 74.
The processor 71, the memory 72, the input device 73 and the output device 74 may be connected by a bus or other means, as exemplified by the bus connection in fig. 4.
The processor 71 may be a Central Processing Unit (CPU). The Processor 71 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 72 is a non-transitory computer readable storage medium, and can be used for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the voice data tagging method in the embodiment of the present application. The processor 71 executes various functional applications of the server and data processing, namely, implements the voice data tagging method of the above-described method embodiment, by running non-transitory software programs, instructions and modules stored in the memory 72.
The memory 72 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a processing device operated by the server, and the like. Further, the memory 72 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 72 may optionally include memory located remotely from the processor 71, which may be connected to a network connection device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 73 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 74 may include a display device such as a display screen.
One or more modules are stored in the memory 72 and, when executed by the one or more processors 71, perform the methods shown in fig. 1-2.
It will be understood by those skilled in the art that all or part of the processes in the method for implementing the above embodiment may be implemented by instructing relevant hardware through a computer program, and the executed program may be stored in a computer-readable storage medium, and when executed, may include the processes in the embodiment of the above-mentioned voice data annotation method. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A method for labeling voice data, comprising:
acquiring original voice data;
screening the original voice data to obtain screened voice;
matching the screened voice with a pre-stored reading text to obtain a proofreading voice and a proofreading text which correspond to each other;
performing word segmentation processing on the proofreading text to obtain a word segmentation text;
performing noise reduction processing on the proofreading voice to obtain noise-reduced voice;
extracting the characteristics of the noise-reduced voice to obtain voice characteristics;
detecting the voice characteristics according to a VAD model to obtain VAD effective voice starting time, VAD effective voice ending time and VAD effective voice duration of the noise-reduced voice;
performing voice forced alignment by adopting an acoustic model according to the word segmentation text, the voice characteristics and a pre-stored pronunciation dictionary to obtain an alignment result;
obtaining word level alignment time, word level time interval, segmented text starting time, segmented text ending time and text alignment time according to the alignment result;
obtaining the total word number of the text in the word segmentation text according to the word segmentation text;
obtaining the speech speed, the effective time ratio and the error word number according to the VAD effective speech duration, the text alignment time, the word level alignment time and the total word number of the text;
performing voice quality inspection according to the speed, the effective time ratio and the error word number to obtain qualified voice;
and segmenting the original voice data corresponding to the qualified voice according to the segmented text starting time and the segmented text ending time to obtain segmented voice corresponding to the segmented text, and taking the segmented text and the segmented voice as voice labeling results.
2. The method for annotating voice data according to claim 1, wherein the step of obtaining word-level alignment time, word-level time interval, segmented text start time, segmented text end time, and text alignment time according to the alignment result comprises:
obtaining word level alignment time and word level time interval according to the alignment result;
segmenting the participle text according to a preset word interval threshold and a word level time interval to obtain a segmented text;
obtaining the starting time and ending time of the segmented text according to the segmented text;
and obtaining the text alignment time according to the segmented text starting time and the segmented text ending time.
3. The method for labeling voice data according to claim 2, wherein the step of segmenting the segmented text according to a preset word interval threshold and a word-level time interval to obtain a segmented text comprises:
acquiring a preset word space threshold, wherein the preset word space threshold is determined according to the mute segment time before and after the effective voice and the voice acquisition pause time;
judging whether the word level time interval is smaller than the preset word interval threshold value or not;
if the word-level time interval is smaller than the preset word interval threshold, paragraph segmentation is not performed on adjacent words;
and if the word-level time interval is greater than or equal to the preset word interval threshold, performing paragraph segmentation on adjacent words.
4. The method of claim 1, wherein the step of deriving the speech rate, the effective time ratio and the number of error words from the VAD valid speech duration, the text alignment time, the word level alignment time and the total number of text words comprises:
obtaining a word-level average duration according to the word-level alignment time and the total word number of the text, wherein the formula for calculating the word-level average duration is as follows:

T_avg = (t_1 + t_2 + ... + t_N) / N

wherein T_avg represents the word-level average duration, N represents the total number of words of the text, and t_i represents the word-level alignment time of the i-th word, with 1 ≤ i ≤ N;
obtaining the speech rate according to the VAD effective voice duration and the total word number of the text, wherein the formula for calculating the speech rate is as follows:
v = N / T_vad

wherein v represents the speech rate, N represents the total number of words of the text, and T_vad represents the VAD valid speech duration;
obtaining an effective time ratio according to the VAD effective voice duration and the text alignment time, wherein the formula for calculating the effective time ratio is as follows:
R = T_vad / T_align

wherein R represents the effective time ratio, T_vad represents the VAD valid speech duration, and T_align represents the text alignment time;
obtaining the error word number according to the VAD effective voice duration, the word level alignment time and the word level average duration, wherein the formula for calculating the error word number is as follows:
n_err = |T_vad - (t_1 + t_2 + ... + t_N)| / T_avg

wherein n_err represents the error word count, T_vad represents the VAD valid speech duration, t_i represents the word-level alignment time of the i-th word with 1 ≤ i ≤ N, and T_avg represents the word-level average duration.
5. The method for annotating voice data according to claim 1, wherein the step of performing voice quality check according to the voice speed, the valid time ratio and the number of error words to obtain qualified voice comprises:
judging whether the speech rate is in the range of a preset speech rate threshold value;
if the speech rate is not in the range of the preset speech rate threshold value, the voice quality detection is unqualified;
if the speech rate is in the range of the preset speech rate threshold value, judging whether the effective time ratio is in the range of a preset time ratio;
if the effective time ratio is not in the range of the preset time ratio, the voice quality detection is unqualified;
if the effective time ratio is in the range of the preset time ratio, judging whether the error word number is in the range of the preset error word number;
if the error word number is not in the range of the preset error word number, the voice quality detection is unqualified;
and if the error word number is within the range of the preset error word number, the voice quality detection is qualified, and qualified voice is obtained.
6. The method according to claim 1, wherein the step of detecting the voice features according to the VAD model to obtain a VAD valid voice start time, a VAD valid voice end time and a VAD valid voice duration of the noise-reduced voice comprises:
step S7001: inputting the voice characteristics into a VAD model to obtain a voice prediction result of each frame;
step S7002: judging whether the voice prediction result of the continuous first preset frame number is effective voice;
step S7003: if the voice prediction result of the continuous first preset frame number is not valid voice, moving the first preset frame number backwards, and returning to the step S7002;
step S7004: if the voice prediction result of the continuous first preset frame number is effective voice, taking the time corresponding to the voice initial position of the continuous first preset frame number as VAD effective voice initial time;
step S7005: judging whether the voice prediction result of the continuous second preset frame number is noise;
step S7006: if the voice prediction result of the continuous second preset frame number is not noise, moving the second preset frame number backwards, and returning to the step S7005;
step S7007: if the voice prediction result of the continuous second preset frame number is noise, taking the time corresponding to the voice initial position of the continuous second preset frame number as VAD effective voice ending time;
step S7008: calculating the VAD effective voice duration according to the VAD effective voice starting time and the VAD effective voice ending time;
step S7009: judging whether the VAD valid voice time length is less than a preset voice minimum time length or not;
step S7010: if the VAD valid voice duration is smaller than the preset voice minimum duration, returning to the step S7002;
step S7011: and if the VAD effective voice duration is greater than or equal to the preset voice minimum duration, the VAD effective voice duration is the VAD effective voice duration.
7. The method according to any one of claims 1 to 6, wherein the filtering process comprises: voice signal-to-noise ratio detection, voice reverberation detection, voice amplitude clipping detection, voice frequency band loss detection, voice volume detection and microphone pop detection.
8. A system for annotating speech data, comprising:
the acquisition module is used for acquiring original voice data;
the first processing module is used for screening the original voice data to obtain screened voice;
the second processing module is used for matching the screened voice with a pre-stored reading text to obtain a proofreading voice and a proofreading text which correspond to each other;
the third processing module is used for performing word segmentation processing on the proofreading text to obtain a word segmentation text;
the fourth processing module is used for carrying out noise reduction processing on the proofreading voice to obtain noise-reduced voice;
the fifth processing module is used for extracting the characteristics of the noise-reduced voice to obtain voice characteristics;
a sixth processing module, configured to detect the voice feature according to a VAD model, so as to obtain VAD valid voice start time, VAD valid voice end time, and VAD valid voice duration of the noise-reduced voice;
the seventh processing module is used for performing voice forced alignment by adopting an acoustic model according to the word segmentation text, the voice characteristics and a pre-stored pronunciation dictionary to obtain an alignment result;
the eighth processing module is used for obtaining word level alignment time, word level time interval, segmented text starting time, segmented text ending time and text alignment time according to the alignment result;
the ninth processing module is used for obtaining the total word number of the text in the word segmentation text according to the word segmentation text;
a tenth processing module, configured to obtain a speech rate, an effective time ratio, and an error word count according to the VAD valid speech duration, the text alignment time, the word level alignment time, and the total number of text words;
the eleventh processing module is used for carrying out voice quality inspection according to the speech speed, the effective time ratio and the error word number to obtain qualified voice;
and the twelfth processing module is used for segmenting the original voice data corresponding to the qualified voice according to the segmented text starting time and the segmented text ending time to obtain segmented voice corresponding to the segmented text, and taking the segmented text and the segmented voice as voice labeling results.
9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the method of voice data annotation according to any one of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the method of annotating speech data according to any of claims 1 to 7.
CN202110242305.3A 2021-03-05 2021-03-05 Voice data labeling method, system, electronic equipment and storage medium Active CN112599152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110242305.3A CN112599152B (en) 2021-03-05 2021-03-05 Voice data labeling method, system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110242305.3A CN112599152B (en) 2021-03-05 2021-03-05 Voice data labeling method, system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112599152A true CN112599152A (en) 2021-04-02
CN112599152B CN112599152B (en) 2021-06-08

Family

ID=75210294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110242305.3A Active CN112599152B (en) 2021-03-05 2021-03-05 Voice data labeling method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112599152B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1870728A (en) * 2005-05-23 2006-11-29 北京大学 Method and system for automatic subtilting
CN103530282A (en) * 2013-10-23 2014-01-22 北京紫冬锐意语音科技有限公司 Corpus tagging method and equipment
CN105374350A (en) * 2015-09-29 2016-03-02 百度在线网络技术(北京)有限公司 Speech marking method and device
US10381022B1 (en) * 2015-12-23 2019-08-13 Google Llc Audio classifier
CN109166570A (en) * 2018-07-24 2019-01-08 百度在线网络技术(北京)有限公司 A kind of method, apparatus of phonetic segmentation, equipment and computer storage medium
CN109599095A (en) * 2018-11-21 2019-04-09 百度在线网络技术(北京)有限公司 A kind of mask method of voice data, device, equipment and computer storage medium
CN112037769A (en) * 2020-07-28 2020-12-04 出门问问信息科技有限公司 Training data generation method and device and computer readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113178188A (en) * 2021-04-26 2021-07-27 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN114067787A (en) * 2021-12-17 2022-02-18 广东讯飞启明科技发展有限公司 Voice speech rate self-adaptive recognition system
CN114267358A (en) * 2021-12-17 2022-04-01 北京百度网讯科技有限公司 Audio processing method, device, apparatus, storage medium, and program
CN114267358B (en) * 2021-12-17 2023-12-12 北京百度网讯科技有限公司 Audio processing method, device, equipment and storage medium
CN117113974A (en) * 2023-04-26 2023-11-24 荣耀终端有限公司 Text segmentation method, device, chip, electronic equipment and medium

Also Published As

Publication number Publication date
CN112599152B (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN112599152B (en) Voice data labeling method, system, electronic equipment and storage medium
CN108962227B (en) Voice starting point and end point detection method and device, computer equipment and storage medium
US9368116B2 (en) Speaker separation in diarization
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
WO2017084360A1 (en) Method and system for speech recognition
KR20170087390A (en) Method and device for voice wake-up
CN110970036B (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
CN111105785B (en) Text prosody boundary recognition method and device
CN110880329A (en) Audio identification method and equipment and storage medium
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
WO2022100691A1 (en) Audio recognition method and device
CN112992191B (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
CN112435653A (en) Voice recognition method and device and electronic equipment
WO2019119279A1 (en) Method and apparatus for emotion recognition from speech
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
JP5647455B2 (en) Apparatus, method, and program for detecting inspiratory sound contained in voice
KR101122590B1 (en) Apparatus and method for speech recognition by dividing speech data
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
CN111429919B (en) Crosstalk prevention method based on conference real recording system, electronic device and storage medium
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN112634908B (en) Voice recognition method, device, equipment and storage medium
CN112820281B (en) Voice recognition method, device and equipment
US20050246172A1 (en) Acoustic model training method and system
CN114203180A (en) Conference summary generation method and device, electronic equipment and storage medium
CN112509556B (en) Voice awakening method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Zhang Wang

Inventor after: Li Jichao

Inventor after: Li Xuan

Inventor after: Zheng Caisong

Inventor after: Li Qinglong

Inventor before: Zhang Wang

GR01 Patent grant
GR01 Patent grant