CN112530409A - Voice sample screening method and device based on geometry and computer equipment

Voice sample screening method and device based on geometry and computer equipment

Info

Publication number
CN112530409A
CN112530409A (application CN202011387398.0A; granted as CN112530409B)
Authority
CN
China
Prior art keywords
voice
speech
initial
sample
feature
Prior art date
Legal status
Granted
Application number
CN202011387398.0A
Other languages
Chinese (zh)
Other versions
CN112530409B (en)
Inventor
罗剑
王健宗
程宁
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011387398.0A (granted as CN112530409B)
Publication of CN112530409A
Priority to PCT/CN2021/083934 (published as WO2022116442A1)
Application granted
Publication of CN112530409B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 - Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08 - Speech classification or search
    • G10L 15/26 - Speech to text systems
    • G10L 25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

The invention discloses a geometry-based voice sample screening method and apparatus, a computer device and a storage medium, relating to artificial intelligence technology. The method comprises: obtaining an initial voice sample set and extracting the voice feature corresponding to each piece of initial voice sample data in the set to form a voice feature set; obtaining the Euclidean distances between the voice features in the voice feature set through a dynamic time warping algorithm and performing K-means clustering on them to obtain a clustering result; invoking a preset sample subset screening condition and collecting the clusters in the clustering result that satisfy the condition to form a target cluster set; and obtaining the label value corresponding to each voice feature in the target cluster set to obtain the current voice sample set corresponding to the target cluster set. The method automatically selects low-redundancy samples for training the speech recognition model, reduces the labeling cost of speech recognition tasks in a deep learning setting, and increases the training speed of the speech recognition model.

Description

Voice sample screening method and device based on geometry and computer equipment
Technical Field
The invention relates to the technical field of artificial-intelligence speech semantics, and in particular to a voice sample screening method and device based on geometry, a computer device, and a storage medium.
Background
In recent years, with the great success of deep neural networks (DNNs) in the field of signal processing, DNN-based speech recognition algorithms have become a research focus. However, training a DNN for speech recognition with supervised learning generally requires a large amount of labeled speech data. Although unlabeled speech data has become more readily available with the development and deployment of sensing devices, manually labeling it still consumes considerable labor cost.
Active learning techniques can be employed for labeling unlabeled speech data. Active learning is a branch of machine learning that allows the model to select the data it learns from. Its underlying hypothesis is that a machine learning algorithm can perform better with less training data if it is allowed to choose the data it needs by itself.
The most widely used active learning query strategy is called uncertainty sampling, in which the model selects the samples it is least certain about and submits them for labeling. This technique works well when only a small number of samples are selected. But with a deep neural network as the training model, a large amount of training data is needed, and as the number of selected labeled samples grows, the uncertain samples predicted by the model become redundant and overlapping, so similar samples are selected more and more easily. Such similar samples are of very limited help to model training.
Moreover, voice data differs from non-sequential data such as images: it has variable length, rich structural information and other characteristics, which makes it more difficult to process and select.
Disclosure of Invention
The embodiments of the invention provide a geometry-based speech sample screening method and apparatus, a computer device and a storage medium, aiming to solve the problems in the prior art that, when uncertainty sampling is applied to training a neural network for speech recognition, the uncertain samples predicted by the model are redundant and overlapping and such similar samples are of limited help to model training, and that the complex structure of speech makes it difficult for uncertainty sampling to select speech samples.
In a first aspect, an embodiment of the present invention provides a method for screening a speech sample based on geometry, including:
acquiring an initial voice sample set, and extracting voice characteristics corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice characteristic set; wherein the initial voice sample set comprises a plurality of pieces of initial voice sample data;
acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances among the voice features to obtain a clustering result;
calling a preset sample subset screening condition, and acquiring the cluster meeting the sample subset screening condition in the clustering result to form a target cluster set;
acquiring a label value corresponding to each voice feature in the target clustering set to obtain a current voice sample set corresponding to the target clustering set;
taking each voice feature in the current voice sample set as the input of a voice recognition model to be trained, and taking a label value corresponding to each voice feature as the output of the voice recognition model to be trained so as to train the voice recognition model to be trained, thereby obtaining a voice recognition model; the to-be-trained voice recognition model comprises a link time sequence classification submodel and an attention-based mechanism submodel; and
and if the current voice data to be recognized uploaded by the user side is detected, inputting the voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation, and obtaining and sending a current voice recognition result to the user side.
In a second aspect, an embodiment of the present invention provides a speech sample screening apparatus based on geometry, which includes:
the voice feature extraction unit is used for acquiring an initial voice sample set and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set comprises a plurality of pieces of initial voice sample data;
the voice feature clustering unit is used for acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances among the voice features to obtain a clustering result;
the clustering result screening unit is used for calling preset sample subset screening conditions to obtain clustering clusters which meet the sample subset screening conditions in the clustering results so as to form a target clustering cluster set;
a label value acquiring unit, configured to acquire a label value corresponding to each voice feature in the target cluster set, so as to obtain a current voice sample set corresponding to the target cluster set;
the speech recognition model training unit is used for taking each speech feature in the current speech sample set as the input of a speech recognition model to be trained, and taking a label value corresponding to each speech feature as the output of the speech recognition model to be trained so as to train the speech recognition model to be trained, so as to obtain a speech recognition model; the to-be-trained voice recognition model comprises a link time sequence classification submodel and an attention-based mechanism submodel; and
and the voice recognition result sending unit is used for inputting the voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation if the current voice data to be recognized uploaded by the user side is detected, and obtaining and sending the current voice recognition result to the user side.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the processor implements the geometry-based speech sample screening method according to the first aspect.
In a fourth aspect, the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the geometry-based speech sample screening method according to the first aspect.
The embodiment of the invention provides a method, a device, computer equipment and a storage medium for screening voice samples based on geometry, which comprises the steps of obtaining an initial voice sample set, and extracting voice characteristics corresponding to each initial voice sample data in the initial voice sample set to form a voice characteristic set; acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances among the voice features to obtain a clustering result; calling a preset sample subset screening condition, and acquiring the cluster meeting the sample subset screening condition in the clustering result to form a target cluster set; acquiring a label value corresponding to each voice feature in the target clustering set to obtain a current voice sample set corresponding to the target clustering set; taking each voice feature in the current voice sample set as the input of a voice recognition model to be trained, and taking a label value corresponding to each voice feature as the output of the voice recognition model to be trained so as to train the voice recognition model to be trained, thereby obtaining a voice recognition model; the to-be-trained speech recognition model comprises a link time sequence classification submodel and an attention-based mechanism submodel. The method realizes automatic selection of samples with low redundancy to train the voice recognition model, reduces the labeling cost of the voice recognition task under the deep learning background, and improves the training speed of the voice recognition model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a geometry-based speech sample screening method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for screening a speech sample based on geometry according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a geometry-based speech sample screening apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a speech sample screening method based on geometry according to an embodiment of the present invention; fig. 2 is a schematic flow chart of a geometry-based speech sample screening method according to an embodiment of the present invention, where the geometry-based speech sample screening method is applied in a server, and the method is executed by application software installed in the server.
As shown in fig. 2, the method includes steps S110 to S160.
S110, obtaining an initial voice sample set, and extracting voice characteristics corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice characteristic set; wherein the initial set of voice samples comprises a plurality of pieces of initial voice sample data.
In this embodiment, in order to train the speech recognition model in the server with only a small number of labeled samples, data preprocessing and feature extraction may first be performed on the initial voice sample set to obtain the voice feature corresponding to each piece of initial voice sample data and form the voice feature set. The data preprocessing comprises pre-emphasis, framing, windowing and similar operations, whose purpose is to eliminate the effects on speech signal quality of factors such as aliasing, higher-harmonic distortion and high-frequency artifacts caused by the human vocal organs and by the acquisition equipment, so that the resulting signal is as uniform and smooth as possible.
In one embodiment, step S110 includes:
calling a pre-stored sampling period to respectively sample each piece of initial voice sample data in the initial voice sample set to obtain a current discrete voice signal corresponding to each piece of initial voice sample data;
calling a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal corresponding to each piece of initial voice sample data respectively to obtain a current pre-emphasized voice signal corresponding to each piece of initial voice sample data;
calling a prestored Hamming window to respectively window the current pre-emphasis voice signal corresponding to each piece of initial voice sample data to obtain windowed voice data corresponding to each piece of initial voice sample data;
calling a frame shift and a frame length which are stored in advance to respectively frame the windowed voice data corresponding to each piece of initial voice sample data to obtain preprocessed voice data corresponding to each piece of initial voice sample data;
and respectively carrying out Mel frequency cepstrum coefficient extraction or filter bank extraction on the preprocessed voice data corresponding to each piece of initial voice sample data to obtain voice features corresponding to each piece of initial voice sample data so as to form a voice feature set.
In this embodiment, before the speech signal is processed digitally, the initial voice sample data (denoted s(t)) is sampled with a sampling period T and discretized into s(n).
Then, when a prestored first-order FIR high-pass digital filter is called, the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter, and the transfer function of the first-order FIR high-pass digital filter is as follows (1):
H(z) = 1 - a·z^(-1)  (1)
in specific implementation, the value of a is 0.98. For example, let x (n) be the sample value of the current discrete speech signal at time n, and y (n) be x (n) -ax (n-1) be the sample value corresponding to x (n) in the current pre-emphasized speech signal after pre-emphasis processing.
Then, the Hamming window function called is as in formula (2):
ω(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1  (2)
Windowing the current pre-emphasized speech signal with the Hamming window gives windowed speech data that can be expressed as q(n) = y(n)·ω(n).
Finally, the pre-stored frame shift and frame length are called to frame the windowed speech data: for example, if the time-domain signal corresponding to the windowed speech data is x(l), then the n-th frame of speech data in the preprocessed speech data after windowing and framing is xn(m), and xn(m) satisfies formula (3):
xn(m) = ω(n)·x(n+m), 0 ≤ m ≤ N-1  (3)
where n = 0, 1T, 2T, …, N is the frame length, T is the frame shift, and ω(n) is the Hamming window function.
Preprocessing the initial voice sample data in this way makes it suitable for subsequent speech parameter extraction, for example extraction of Mel-Frequency Cepstrum Coefficients (MFCC) or of Filter-Bank (FBank) features; after extraction, the voice feature corresponding to each piece of initial voice sample data is obtained to form the voice feature set.
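The following Python sketch illustrates this preprocessing and feature-extraction chain under the parameters stated above (a = 0.98, Hamming window, fixed frame length and frame shift). The function name, the 25 ms / 10 ms frame settings, and the use of librosa for MFCC extraction are assumptions made for illustration, not details taken from the patent.

```python
import numpy as np
import librosa  # used here only to compute the MFCC features

def extract_speech_feature(wav_path, sr=16000, a=0.98,
                           frame_len_ms=25, frame_shift_ms=10, n_mfcc=13):
    # Sampling/discretization: s(t) -> s(n) with sampling period 1/sr
    s, sr = librosa.load(wav_path, sr=sr)

    # Pre-emphasis with the first-order FIR high-pass filter: y(n) = x(n) - a*x(n-1)
    y = np.append(s[0], s[1:] - a * s[:-1])

    N = int(sr * frame_len_ms / 1000)    # frame length in samples
    T = int(sr * frame_shift_ms / 1000)  # frame shift in samples
    window = np.hamming(N)               # Hamming window ω(n), formula (2)

    # Windowing + framing as in formula (3); frames is built only to illustrate the step
    num_frames = 1 + max(0, (len(y) - N) // T)
    frames = np.stack([window * y[i * T: i * T + N] for i in range(num_frames)])

    # MFCC extraction (a filter-bank feature could be used instead); librosa performs
    # its own framing internally with the same frame length and frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=N, hop_length=T)
    return mfcc.T                        # shape: (number of frames, n_mfcc)
```

Running this over every piece of initial voice sample data yields the voice feature set used in step S120.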
S120, acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and carrying out K-means clustering according to the Euclidean distances among the voice features to obtain a clustering result.
In this embodiment, because the pieces of initial voice sample data in the initial voice sample set differ from one another, the difference between two pieces of initial voice sample data can be quantified by calculating the Euclidean distance between the voice features corresponding to the two pieces of data.
Any two pieces of initial voice sample data are usually of unequal length, which in speech processing appears as differences in speaking rate. Even the same person uttering the same sound at different times will not produce exactly the same duration, and different speakers pronounce the phonemes of the same word at different speeds: some drag the 'E' sound slightly longer, or make the 'o' slightly shorter. In such a complicated situation, the similarity between two pieces of initial voice sample data cannot be measured accurately with the conventional Euclidean distance. In that case, the Euclidean distances between the voice features in the voice feature set can be obtained through a dynamic time warping algorithm.
In an embodiment, the step S120 of obtaining the euclidean distance between the speech features in the speech feature set by using a dynamic time warping algorithm includes:
acquiring the ith voice feature and the jth voice feature in the voice feature set; the voice feature set comprises N voice features, the value ranges of i and j are [1, N ], and i and j are not equal;
judging whether the number of first voice sequence frames corresponding to the ith voice feature is equal to the number of second voice sequence frames corresponding to the jth voice feature;
if the number of first voice sequence frames corresponding to the ith voice feature is not equal to the number of second voice sequence frames corresponding to the jth voice feature, constructing a distance matrix D of n x m, and acquiring the minimum value of each matrix element in the distance matrix D as the Euclidean distance between the ith voice feature and the jth voice feature; wherein n is equal to the number of the first voice sequence frames, m is equal to the number of the second voice sequence frames, and D (x, y) in the distance matrix D represents the Euclidean distance between the x frame voice sequence in the ith voice characteristic and the y frame voice sequence in the jth voice characteristic.
In this embodiment, the way the Euclidean distance between any two voice features in the voice feature set is calculated is described using the i-th and j-th voice features as an example; the calculation is repeated until the Euclidean distances between the voice features in the voice feature set have all been obtained.
When calculating the Euclidean distance between any two voice features, it is first determined whether their numbers of voice sequence frames are equal (for example, whether the number of first voice sequence frames corresponding to the i-th voice feature equals the number of second voice sequence frames corresponding to the j-th voice feature). If they are not equal, an n×m distance matrix D is constructed, and the minimum value over the matrix elements of the distance matrix D is obtained as the Euclidean distance between the i-th voice feature and the j-th voice feature; here n equals the number of first voice sequence frames, m equals the number of second voice sequence frames, and D(x, y) in the distance matrix D represents the Euclidean distance between the x-th frame of the voice sequence in the i-th voice feature and the y-th frame of the voice sequence in the j-th voice feature.
For example, the matrix element D(x, y) represents the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature. A shortest path from D(0, 0) to D(n, m) is found, and the length of that path is taken as the distance between the i-th voice feature and the j-th voice feature; the path must be continuous and monotonic in time (no going back). This calculation uses the Dynamic Time Warping (DTW) algorithm.
In an embodiment, after the determining whether the number of the first speech sequence frames corresponding to the i-th speech feature is equal to the number of the second speech sequence frames corresponding to the j-th speech feature in step S120, the method further includes:
and if the number of the first voice sequence frames corresponding to the ith voice feature is equal to the number of the second voice sequence frames corresponding to the jth voice feature, calculating to obtain the Euclidean distance between the ith voice feature and the jth voice feature.
In this embodiment, when it is determined that the number of first speech sequence frames corresponding to the i-th speech feature is equal to the number of second speech sequence frames corresponding to the j-th speech feature, indicating that the time lengths between the two are the same, the euclidean distance between the two is calculated directly without referring to the process of constructing the distance matrix D.
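A minimal sketch of this distance computation is shown below. It covers both cases described above: equal frame counts compared frame by frame, unequal frame counts aligned with a dynamic time warping path over the n×m distance matrix. The names and the exact path recurrence details are illustrative assumptions.

```python
import numpy as np

def dtw_distance(f1: np.ndarray, f2: np.ndarray) -> float:
    """DTW-aligned cumulative Euclidean distance between two (frames x dims) sequences."""
    n, m = len(f1), len(f2)
    if n == m:
        # Equal numbers of voice sequence frames: compare frame by frame directly
        return float(np.sqrt(((f1 - f2) ** 2).sum(axis=1)).sum())

    # D(x, y): Euclidean distance between frame x of f1 and frame y of f2
    D = np.sqrt(((f1[:, None, :] - f2[None, :, :]) ** 2).sum(axis=-1))

    # Cumulative cost of the shortest continuous, monotone path from (0, 0) to (n, m)
    acc = np.full((n, m), np.inf)
    acc[0, 0] = D[0, 0]
    for x in range(n):
        for y in range(m):
            if x == 0 and y == 0:
                continue
            prev = min(acc[x - 1, y] if x > 0 else np.inf,
                       acc[x, y - 1] if y > 0 else np.inf,
                       acc[x - 1, y - 1] if x > 0 and y > 0 else np.inf)
            acc[x, y] = D[x, y] + prev
    return float(acc[-1, -1])
```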
In an embodiment, the performing K-means clustering according to the euclidean distance between the speech features in step S120 to obtain a clustering result includes:
selecting the voice features with the same number as the preset clustering clusters in the voice feature set, and taking the selected voice features as the initial clustering center of each cluster;
dividing the voice feature set according to Euclidean distances between the voice features in the voice feature set and initial clustering centers to obtain initial clustering results;
obtaining the adjusted clustering center of each cluster according to the initial clustering result;
and dividing the voice feature set again according to the Euclidean distances to the adjusted clustering centers, repeating until the clustering result has remained unchanged for more than a preset number of iterations, and obtaining the clusters corresponding to the preset number of clusters to form the clustering result.
In this embodiment, since the euclidean distance between the speech features may be calculated by using a dynamic time warping algorithm, the speech feature set may be clustered by using a K-means clustering method, and the specific process is as follows:
a) randomly selecting N2 voice features from a voice feature set comprising N1 voice features, and using the voice features as initial clustering centers of N2 clusters; the initial total number of the voice features in the voice feature set is N1, N2 voice features are arbitrarily selected from the N1 voice features (N2< N1, N2 is a preset number of cluster clusters, i.e., the number of desired clusters), and the initially selected N2 voice features are used as initial cluster centers.
b) calculating the Euclidean distance from each remaining voice feature to each of the N2 initial clustering centers, and assigning each remaining voice feature to the cluster whose center is closest, to obtain the initial clustering result; this divides the voice features into N2 clusters, each built around one of the initially selected cluster centers.
c) And according to the initial clustering result, re-calculating the clustering centers of the N2 clusters.
d) re-assigning all of the N1 voice features to clusters according to the new clustering centers;
e) repeating steps c) and d) until the clustering result no longer changes, and obtaining the clustering result corresponding to the preset number of clusters.
After the cluster classification is completed, the voice feature set can be quickly grouped to obtain a plurality of cluster clusters. And then, the server can select the clustering cluster meeting the conditions from the plurality of clustering clusters as a training sample and label the training sample.
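The sketch below ties steps a) to e) together on top of the dtw_distance() helper from the previous sketch. Because DTW distances do not define a true mean, the re-computed cluster center in step c) is taken here to be the cluster medoid; that choice, the stopping threshold, and the names are assumptions made for illustration.

```python
import random

def kmeans_dtw(features, n_clusters, max_same=3, max_iter=100):
    """features: list of (frames x dims) arrays; returns n_clusters lists of indices."""
    centers = random.sample(range(len(features)), n_clusters)   # a) initial centers
    prev_assign, same_count = None, 0

    for _ in range(max_iter):
        # b)/d) assign every voice feature to the nearest cluster center (DTW distance)
        assign = [min(range(n_clusters),
                      key=lambda c: dtw_distance(features[i], features[centers[c]]))
                  for i in range(len(features))]

        # c) recompute each cluster center as the cluster medoid
        for c in range(n_clusters):
            members = [i for i, a in enumerate(assign) if a == c]
            if members:
                centers[c] = min(members, key=lambda i: sum(
                    dtw_distance(features[i], features[j]) for j in members))

        # e) stop once the assignment has stayed unchanged often enough
        same_count = same_count + 1 if assign == prev_assign else 0
        if same_count >= max_same:
            break
        prev_assign = assign

    return [[i for i, a in enumerate(assign) if a == c] for c in range(n_clusters)]
```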
S130, calling a preset sample subset screening condition, and obtaining the cluster meeting the sample subset screening condition in the clustering result to form a target cluster set.
In this embodiment, the sample subset screening condition may be set so that the sample subset with the minimum sample redundancy among the candidate subsets is selected; the clusters that satisfy it form the target cluster set. Calculating the sample redundancy of a cluster means calculating its data repetition rate: for example, if the total number of data items in a sample subset is Y1 and the number of repeated items among them is Y2, the sample redundancy of that subset is Y2/Y1. Selecting a sample subset with lower redundancy significantly reduces the labeling cost of a speech recognition task in a deep learning setting.
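As an illustration, the screening condition can be applied to the clusters produced in step S120 as shown below. The patent does not fix how two items are judged to be repeated, so the rounded-feature comparison used here, like the function names, is an assumption.

```python
import numpy as np

def sample_redundancy(cluster, features, decimals=3):
    """Redundancy Y2/Y1: Y1 items in the subset, Y2 of which repeat an earlier item."""
    seen, repeats = set(), 0
    for i in cluster:
        key = np.round(features[i], decimals).tobytes()
        repeats += key in seen
        seen.add(key)
    return repeats / max(len(cluster), 1)

def select_target_clusters(clusters, features):
    # Keep the cluster(s) whose redundancy equals the minimum over all clusters
    redundancies = [sample_redundancy(c, features) for c in clusters]
    best = min(redundancies)
    return [c for c, r in zip(clusters, redundancies) if r == best]
```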
S140, obtaining a label value corresponding to each voice feature in the target cluster set so as to obtain a current voice sample set corresponding to the target cluster set.
In this embodiment, because the target cluster set has already been selected, only a small number of samples need to be labeled at this point, yielding the current voice sample set corresponding to the target cluster set. Using less labeled data noticeably increases the training speed of the speech recognition model and reduces the computational load on the speech processing system.
S150, taking each voice feature in the current voice sample set as the input of a voice recognition model to be trained, and taking a label value corresponding to each voice feature as the output of the voice recognition model to be trained so as to train the voice recognition model to be trained, thereby obtaining a voice recognition model; the to-be-trained speech recognition model comprises a link time sequence classification submodel and an attention-based mechanism submodel.
In this embodiment, in order to train a speech recognition model with higher recognition accuracy on the current speech sample set, a model that decodes with a combination of a CTC (Connectionist Temporal Classification) model and an Attention-based model may be adopted. CTC decoding recognizes speech by predicting an output for every frame; the algorithm rests on the assumption that the decoding of each frame is independent, so the links between preceding and following speech features are lost during decoding and a language model is relied on for correction. Attention decoding, in turn, does not follow the frame order of the input speech: each decoding unit generates the current result from the decoding result of the previous unit and the overall speech features, so the monotone temporal order of speech is ignored during decoding. A hybrid model can therefore be adopted to combine the advantages of both. Generally, the link timing classification (CTC) submodel is placed closer to the input for preliminary processing and the attention-based submodel closer to the output for subsequent processing. The network structure of the speech recognition model to be trained adopts structures such as LSTM/CNN/GRU, and the two decoders output the recognition result jointly.
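A minimal PyTorch sketch of such a hybrid structure is shown below, assuming an LSTM encoder shared by the two decoding branches. The layer sizes, the use of self-attention over the encoder states, and all names are illustrative assumptions; the patent only specifies that a link timing classification (CTC) submodel near the input and an attention-based submodel near the output jointly produce the recognition output.

```python
import torch
import torch.nn as nn

class HybridCtcAttentionASR(nn.Module):
    """Shared LSTM encoder with a frame-wise CTC head (near the input) and an
    attention-based head (near the output); both heads emit recognition outputs."""
    def __init__(self, feat_dim=13, hidden=256, vocab_size=5000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, vocab_size)
        self.attention = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                               batch_first=True)
        self.att_head = nn.Linear(2 * hidden, vocab_size)

    def forward(self, feats):                       # feats: (batch, frames, feat_dim)
        enc, _ = self.encoder(feats)                # (batch, frames, 2*hidden)
        ctc_logits = self.ctc_head(enc)             # frame-wise CTC outputs
        att_out, _ = self.attention(enc, enc, enc)  # attention over the whole utterance
        att_logits = self.att_head(att_out)
        return ctc_logits, att_logits

# Training sketch: the CTC branch would be trained with nn.CTCLoss against the label
# values of the current voice sample set, the attention branch with a sequence
# cross-entropy, and the two losses combined with an interpolation weight.
```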
In an embodiment, step S150 is followed by:
and uploading a first model parameter set corresponding to the link timing classification submodel in the voice recognition model and a second model parameter set corresponding to the attention mechanism submodel to the block chain network.
In this embodiment, the corresponding digest information is obtained from the first model parameter set and the second model parameter set; specifically, the digest information is obtained by hashing the first model parameter set and the second model parameter set, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency for the user. The user equipment may download the digest information from the blockchain to verify whether the first and second model parameter sets have been tampered with.
The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
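The sketch below illustrates the digest computation described above. Serializing the two parameter sets as sorted JSON before hashing is an assumption for illustration; the patent only specifies that the first and second model parameter sets are hashed (for example with SHA-256) and that the resulting digest information is uploaded to the blockchain network.

```python
import hashlib
import json

def model_parameter_digest(ctc_params: dict, attention_params: dict) -> str:
    # Deterministic serialization so the same parameters always hash to the same digest
    payload = json.dumps({"ctc": ctc_params, "attention": attention_params},
                         sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# The digest, not the raw parameters, is written to the chain; the user equipment can
# later recompute it locally and compare the two values to detect tampering.
```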
And S160, if the current voice data to be recognized uploaded by the user side is detected, inputting the voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation, and obtaining and sending a current voice recognition result to the user side.
In this embodiment, after the speech recognition model has been trained in the server, it can be put to use for speech recognition. That is, once the server receives the current voice data to be recognized uploaded by the user side, it inputs the corresponding voice features into the speech recognition model for computation, obtains the current speech recognition result and sends it to the user side, so that the recognition result can be fed back quickly.
In one embodiment, step S160 includes:
inputting the voice features corresponding to the current voice data to be recognized to a link time sequence classification submodel for operation to obtain a first recognition sequence;
and inputting the first recognition sequence into an attention-based mechanism sub-model for operation, and obtaining and sending a current voice recognition result to the user side.
In this embodiment, because the link timing classification submodel is placed closer to the input and the attention-based submodel closer to the output, the voice features of the current voice data to be recognized are first input into the link timing classification submodel to obtain a first recognition sequence, and the first recognition sequence is then input into the attention-based submodel to obtain the current speech recognition result. In this way the links between preceding and following voice features are fully considered during decoding, the monotone temporal order of speech is also taken into account, and the recognition result produced by the model is more accurate.
The method realizes automatic selection of samples with low redundancy to train the voice recognition model, reduces the labeling cost of the voice recognition task under the deep learning background, and improves the training speed of the voice recognition model.
The embodiment of the invention also provides a speech sample screening device based on the geometry, which is used for executing any embodiment of the speech sample screening method based on the geometry. Specifically, referring to fig. 3, fig. 3 is a schematic block diagram of a speech sample screening apparatus based on geometry according to an embodiment of the present invention. The geometry-based speech sample screening apparatus 100 may be configured in a server.
As shown in fig. 3, the geometry-based speech sample screening apparatus 100 includes: a speech feature extraction unit 110, a speech feature clustering unit 120, a clustering result screening unit 130, a labeling value acquisition unit 140, a speech recognition model training unit 150, and a speech recognition result transmission unit 160.
The voice feature extraction unit 110 is configured to obtain an initial voice sample set, and extract a voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial set of voice samples comprises a plurality of pieces of initial voice sample data.
In this embodiment, in order to train the speech recognition model in the server with only a small number of labeled samples, data preprocessing and feature extraction may first be performed on the initial voice sample set to obtain the voice feature corresponding to each piece of initial voice sample data and form the voice feature set. The data preprocessing comprises pre-emphasis, framing, windowing and similar operations, whose purpose is to eliminate the effects on speech signal quality of factors such as aliasing, higher-harmonic distortion and high-frequency artifacts caused by the human vocal organs and by the acquisition equipment, so that the resulting signal is as uniform and smooth as possible.
In one embodiment, the speech feature extraction unit 110 includes:
the discrete sampling unit is used for calling a pre-stored sampling period to respectively sample each piece of initial voice sample data in the initial voice sample set to obtain a current discrete voice signal corresponding to each piece of initial voice sample data;
the pre-emphasis unit is used for calling a pre-stored first-order FIR high-pass digital filter to respectively pre-emphasize the current discrete voice signal corresponding to each piece of initial voice sample data to obtain a current pre-emphasized voice signal corresponding to each piece of initial voice sample data;
the windowing unit is used for calling a pre-stored Hamming window to respectively window the current pre-emphasis voice signal corresponding to each piece of initial voice sample data to obtain windowed voice data corresponding to each piece of initial voice sample data;
the framing unit is used for calling the frame shift and the frame length which are stored in advance to respectively frame the windowed voice data corresponding to each piece of initial voice sample data to obtain preprocessed voice data corresponding to each piece of initial voice sample data;
and the feature extraction unit is used for respectively extracting the pre-processed voice data corresponding to each piece of initial voice sample data by using Mel frequency cepstrum coefficient or filter bank extraction to obtain the voice features corresponding to each piece of initial voice sample data so as to form a voice feature set.
In this embodiment, before the speech signal is processed digitally, the initial voice sample data (denoted s(t)) is sampled with a sampling period T and discretized into s(n).
Then, when the prestored first-order FIR high-pass digital filter is called, the first-order FIR high-pass digital filter is the first-order non-recursive high-pass digital filter, and the transfer function of the first-order FIR high-pass digital filter is as the above expression (1).
In a specific implementation, a takes the value 0.98. For example, let x(n) be the sample value of the current discrete speech signal at time n; then y(n) = x(n) - a·x(n-1) is the sample value corresponding to x(n) in the current pre-emphasized speech signal after pre-emphasis.
Then, the Hamming window function called is as in formula (2) above; windowing the current pre-emphasized speech signal with the Hamming window gives windowed speech data that can be expressed as q(n) = y(n)·ω(n).
Finally, when the pre-stored frame shift and frame length are called to frame the windowed speech data, for example, the time domain signal corresponding to the windowed speech data is x (l), the nth frame of speech data in the pre-processed speech data after the windowing and framing processing is xn (m), and xn (m) satisfies the formula (3).
The initial voice sample data is preprocessed, so that the initial voice sample data can be effectively used for subsequent voice parameter extraction, for example, a Mel Frequency Cepstrum Coefficient (namely Mel Frequency Cepstrum Coefficient) or a Filter Bank (namely Filter-Bank) is extracted, and after extraction, the voice characteristics corresponding to each piece of initial voice sample data can be obtained to form a voice characteristic set.
And the voice feature clustering unit 120 is configured to obtain the euclidean distances between the voice features in the voice feature set through a dynamic time warping algorithm, and perform K-means clustering according to the euclidean distances between the voice features to obtain a clustering result.
In this embodiment, since there is a difference between initial voice sample data in the initial voice sample set, in order to compare the difference between two initial voice sample data, quantization processing may be performed to calculate a euclidean distance between voice features corresponding to the two initial voice sample data.
Any two pieces of initial voice sample data are usually of unequal length, which in speech processing appears as differences in speaking rate. Even the same person uttering the same sound at different times will not produce exactly the same duration, and different speakers pronounce the phonemes of the same word at different speeds: some drag the 'E' sound slightly longer, or make the 'o' slightly shorter. In such a complicated situation, the similarity between two pieces of initial voice sample data cannot be measured accurately with the conventional Euclidean distance. In that case, the Euclidean distances between the voice features in the voice feature set can be obtained through a dynamic time warping algorithm.
In an embodiment, the speech feature clustering unit 120 includes:
the voice feature selection unit is used for acquiring the ith voice feature and the jth voice feature in the voice feature set; the voice feature set comprises N voice features, the value ranges of i and j are [1, N ], and i and j are not equal;
the voice sequence frame number comparison unit is used for judging whether the first voice sequence frame number corresponding to the ith voice characteristic is equal to the second voice sequence frame number corresponding to the jth voice characteristic or not;
a first calculating unit, configured to construct a distance matrix D of n × m if a first voice sequence frame number corresponding to the i-th voice feature is not equal to a second voice sequence frame number corresponding to the j-th voice feature, and obtain a minimum value in each matrix element in the distance matrix D to serve as a euclidean distance between the i-th voice feature and the j-th voice feature; wherein n is equal to the number of the first voice sequence frames, m is equal to the number of the second voice sequence frames, and D (x, y) in the distance matrix D represents the Euclidean distance between the x frame voice sequence in the ith voice characteristic and the y frame voice sequence in the jth voice characteristic.
In this embodiment, the euclidean distance calculation method between any two speech features in the speech feature set is described by taking the calculation of the ith speech feature and the jth speech feature in the speech feature set as an example, and the calculation process can be stopped until the euclidean distance between the speech features in the speech feature set is calculated.
When calculating the euclidean distance between any two speech features, it is first determined whether the number of speech sequence frames between the two speech features is equal (e.g., determining whether the number of first speech sequence frames corresponding to the i-th speech feature is equal to the number of second speech sequence frames corresponding to the j-th speech feature). If the number of the voice sequence frames between the first voice sequence frame and the second voice sequence frame is not equal, constructing a distance matrix D of n x m, and acquiring the minimum value of each matrix element in the distance matrix D as the Euclidean distance between the ith voice characteristic and the jth voice characteristic; wherein n is equal to the number of the first voice sequence frames, m is equal to the number of the second voice sequence frames, and D (x, y) in the distance matrix D represents the Euclidean distance between the x frame voice sequence in the ith voice characteristic and the y frame voice sequence in the jth voice characteristic.
For example, the matrix element D(x, y) represents the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature. A shortest path from D(0, 0) to D(n, m) is found, and the length of that path is taken as the distance between the i-th voice feature and the j-th voice feature; the path must be continuous and monotonic in time (no going back). This calculation uses the Dynamic Time Warping (DTW) algorithm.
In an embodiment, the speech feature clustering unit 120 further includes:
and the second calculation unit is used for calculating to obtain the Euclidean distance between the ith voice feature and the jth voice feature if the number of the first voice sequence frames corresponding to the ith voice feature is equal to the number of the second voice sequence frames corresponding to the jth voice feature.
In this embodiment, when it is determined that the number of first speech sequence frames corresponding to the i-th speech feature is equal to the number of second speech sequence frames corresponding to the j-th speech feature, indicating that the time lengths between the two are the same, the euclidean distance between the two is calculated directly without referring to the process of constructing the distance matrix D.
In an embodiment, the speech feature clustering unit 120 includes:
an initial clustering center obtaining unit, configured to select, in the voice feature set, voice features with the same number as that of preset clustering clusters, and use the selected voice features as an initial clustering center of each cluster;
the initial clustering unit is used for dividing the voice feature set according to Euclidean distances between the voice features in the voice feature set and the initial clustering centers to obtain an initial clustering result;
the cluster center adjusting unit is used for acquiring the adjusted cluster center of each cluster according to the initial cluster result;
and a clustering adjustment unit, configured to divide the voice feature set again according to the Euclidean distances to the adjusted clustering centers, repeating until the clustering result has remained unchanged for more than a preset number of iterations, and to obtain the clusters corresponding to the preset number of clusters to form the clustering result.
In this embodiment, since the euclidean distance between the speech features may be calculated by using a dynamic time warping algorithm, the speech feature set may be clustered by using a K-means clustering method. After the cluster classification is completed, the voice feature set can be quickly grouped to obtain a plurality of cluster clusters. And then, the server can select the clustering cluster meeting the conditions from the plurality of clustering clusters as a training sample and label the training sample.
And the clustering result screening unit 130 is configured to invoke preset sample subset screening conditions, and obtain a cluster meeting the sample subset screening conditions in the clustering results to form a target cluster set.
In this embodiment, the sample subset screening condition may be set so that the sample subset with the minimum sample redundancy among the candidate subsets is selected; the clusters that satisfy it form the target cluster set. Calculating the sample redundancy of a cluster means calculating its data repetition rate: for example, if the total number of data items in a sample subset is Y1 and the number of repeated items among them is Y2, the sample redundancy of that subset is Y2/Y1. Selecting a sample subset with lower redundancy significantly reduces the labeling cost of a speech recognition task in a deep learning setting.
A labeled value obtaining unit 140, configured to obtain a labeled value corresponding to each voice feature in the target cluster set, so as to obtain a current voice sample set corresponding to the target cluster set.
In this embodiment, because the target cluster set has already been selected, only a small number of samples need to be labeled at this point, yielding the current voice sample set corresponding to the target cluster set. Using less labeled data noticeably increases the training speed of the speech recognition model and reduces the computational load on the speech processing system.
The speech recognition model training unit 150 is configured to use each speech feature in the current speech sample set as an input of a speech recognition model to be trained, and use a label value corresponding to each speech feature as an output of the speech recognition model to be trained to train the speech recognition model to be trained, so as to obtain a speech recognition model; the to-be-trained speech recognition model comprises a link time sequence classification submodel and an attention-based mechanism submodel.
In this embodiment, in order to train a speech recognition model with higher recognition accuracy on the current speech sample set, a model that decodes with a combination of a CTC (Connectionist Temporal Classification) model and an Attention-based model may be adopted. CTC decoding recognizes speech by predicting an output for every frame; the algorithm rests on the assumption that the decoding of each frame is independent, so the links between preceding and following speech features are lost during decoding and a language model is relied on for correction. Attention decoding, in turn, does not follow the frame order of the input speech: each decoding unit generates the current result from the decoding result of the previous unit and the overall speech features, so the monotone temporal order of speech is ignored during decoding. A hybrid model can therefore be adopted to combine the advantages of both. Generally, the link timing classification (CTC) submodel is placed closer to the input for preliminary processing and the attention-based submodel closer to the output for subsequent processing. The network structure of the speech recognition model to be trained adopts structures such as LSTM/CNN/GRU, and the two decoders output the recognition result jointly.
In one embodiment, the geometry-based speech sample screening apparatus 100 further comprises:
and the data uplink unit is used for uploading a first model parameter set corresponding to the link sequence classification submodel in the voice recognition model and a second model parameter set corresponding to the attention mechanism submodel to the block chain network.
In this embodiment, the corresponding digest information is obtained from the first model parameter set and the second model parameter set; specifically, the digest information is obtained by hashing the first model parameter set and the second model parameter set, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency for the user. The user equipment may download the digest information from the blockchain to verify whether the first and second model parameter sets have been tampered with.
The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
And the voice recognition result sending unit 160 is configured to, if it is detected that the current to-be-recognized voice data uploaded by the user terminal is detected, input the voice features corresponding to the current to-be-recognized voice data into the voice recognition model for operation, and obtain and send a current voice recognition result to the user terminal.
In this embodiment, after the speech recognition model has been trained in the server, it can be put to use for speech recognition. That is, once the server receives the current voice data to be recognized uploaded by the user side, it inputs the corresponding voice features into the speech recognition model for computation, obtains the current speech recognition result and sends it to the user side, so that the recognition result can be fed back quickly.
In one embodiment, the voice recognition result transmitting unit 160 includes:
the first decoding unit is used for inputting the voice features corresponding to the current voice data to be recognized to the connectionist temporal classification submodel for operation to obtain a first recognition sequence;
and the second decoding unit is used for inputting the first recognition sequence to the attention-based submodel for operation, and obtaining and sending the current voice recognition result to the user side.
In this embodiment, since the connectionist temporal classification submodel is disposed closer to the input end and the attention-based submodel is disposed closer to the output end, the current speech data to be recognized is first input to the connectionist temporal classification submodel for operation to obtain a first recognition sequence, and the first recognition sequence is then input to the attention-based submodel for operation to obtain the current speech recognition result. In this way, the link between preceding and following speech features is fully considered during decoding, the monotonic temporal order of the speech is also taken into account, and the result obtained by the model is more accurate.
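A minimal two-stage decoding sketch is shown below (Python, assumed for illustration); the greedy CTC collapse and the stand-in refinement callable are assumptions about one possible realization, not the decoder specified by this embodiment.

```python
# Illustrative two-stage decode: a greedy CTC collapse produces the first
# recognition sequence, which a second (attention-side) step then refines.
import torch

def ctc_greedy_decode(ctc_logits: torch.Tensor, blank_id: int = 0) -> list:
    """Collapse repeated frame labels and drop blanks -> first recognition sequence."""
    best = ctc_logits.argmax(dim=-1).tolist()
    seq, prev = [], blank_id
    for label in best:
        if label != prev and label != blank_id:
            seq.append(label)
        prev = label
    return seq

def two_stage_decode(ctc_logits: torch.Tensor, attention_refine) -> list:
    first_sequence = ctc_greedy_decode(ctc_logits)   # stage 1, near the input end
    return attention_refine(first_sequence)          # stage 2, near the output end

# Usage with random frame scores and an identity stand-in for the attention step
logits = torch.randn(120, 31)                        # 120 frames, 30 labels + blank
result = two_stage_decode(logits, attention_refine=lambda seq: seq)
```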
The apparatus automatically selects lower-redundancy samples to train the speech recognition model, which reduces the labeling cost of speech recognition tasks in a deep learning setting and improves the training speed of the speech recognition model.
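To make the screening idea concrete, the sketch below (Python with NumPy, assumed for illustration) computes pairwise DTW distances between variable-length feature sequences and clusters the samples on those distances; the simple medoid-based loop is a stand-in for the K-means step described above, and all function names are placeholders.

```python
# Illustrative sample screening: DTW distances between feature sequences, then a
# medoid-style clustering over the distance matrix (a stand-in for K-means).
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (n, d) frames, b: (m, d) frames; returns the DTW-aligned distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            d = np.linalg.norm(a[x - 1] - b[y - 1])   # frame-wise Euclidean distance
            cost[x, y] = d + min(cost[x - 1, y], cost[x, y - 1], cost[x - 1, y - 1])
    return float(cost[n, m])

def cluster_by_dtw(features, k=2, n_iter=10, seed=0):
    """Cluster samples by pairwise DTW distance; returns a cluster label per sample."""
    rng = np.random.default_rng(seed)
    n = len(features)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(features[i], features[j])
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        assign = dist[:, medoids].argmin(axis=1)      # nearest center per sample
        for c in range(k):                            # move each center to its cluster medoid
            members = np.flatnonzero(assign == c)
            if len(members):
                medoids[c] = members[dist[np.ix_(members, members)].sum(axis=1).argmin()]
    return assign

# Usage: six random MFCC-like sequences of varying length, grouped into two clusters
feats = [np.random.randn(np.random.randint(20, 40), 13) for _ in range(6)]
labels = cluster_by_dtw(feats, k=2)
```

Clusters satisfying the preset sample subset screening condition would then be kept as the target cluster set, so that only their labeled samples feed the subsequent training step.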
The above-described geometry-based speech sample screening apparatus may be implemented in the form of a computer program which may be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 4, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, causes the processor 502 to perform a geometry-based speech sample screening method.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 may be caused to perform the geometry-based speech sample screening method.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with aspects of the present invention and is not intended to limit the computer device 500 to which aspects of the present invention may be applied, and that a particular computer device 500 may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the method for screening a speech sample based on geometry disclosed in the embodiment of the present invention.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 4 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 4, and are not described herein again.
It should be understood that, in the embodiment of the present invention, the Processor 502 may be a Central Processing Unit (CPU), and the Processor 502 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a non-volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the geometry-based speech sample screening method disclosed by the embodiments of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; to illustrate clearly the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a logical division, and in actual implementation there may be other ways of dividing them; units having the same function may be grouped into one unit, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A speech sample screening method based on geometry is characterized by comprising the following steps:
acquiring an initial voice sample set, and extracting voice characteristics corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice characteristic set; wherein the initial voice sample set comprises a plurality of pieces of initial voice sample data;
acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances among the voice features to obtain a clustering result;
calling a preset sample subset screening condition, and acquiring the cluster meeting the sample subset screening condition in the clustering result to form a target cluster set;
acquiring a label value corresponding to each voice feature in the target clustering set to obtain a current voice sample set corresponding to the target clustering set;
taking each voice feature in the current voice sample set as the input of a voice recognition model to be trained, and taking the label value corresponding to each voice feature as the output of the voice recognition model to be trained, so as to train the voice recognition model to be trained and thereby obtain a voice recognition model; wherein the voice recognition model to be trained comprises a connectionist temporal classification submodel and an attention-based submodel; and
and if the current voice data to be recognized uploaded by the user side is detected, inputting the voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation, and obtaining and sending a current voice recognition result to the user side.
2. The method of claim 1, wherein the acquiring an initial voice sample set and extracting the voice features corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set comprises:
calling a pre-stored sampling period to respectively sample each piece of initial voice sample data in the initial voice sample set to obtain a current discrete voice signal corresponding to each piece of initial voice sample data;
calling a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal corresponding to each piece of initial voice sample data respectively to obtain a current pre-emphasized voice signal corresponding to each piece of initial voice sample data;
calling a prestored Hamming window to respectively window the current pre-emphasis voice signal corresponding to each piece of initial voice sample data to obtain windowed voice data corresponding to each piece of initial voice sample data;
calling a frame shift and a frame length which are stored in advance to respectively frame the windowed voice data corresponding to each piece of initial voice sample data to obtain preprocessed voice data corresponding to each piece of initial voice sample data;
and respectively carrying out Mel frequency cepstrum coefficient extraction or filter bank extraction on the preprocessed voice data corresponding to each piece of initial voice sample data to obtain voice features corresponding to each piece of initial voice sample data so as to form a voice feature set.
3. The method of claim 1, wherein the obtaining Euclidean distance between the speech features in the speech feature set by a dynamic time warping algorithm comprises:
acquiring the ith voice feature and the jth voice feature in the voice feature set; the voice feature set comprises N voice features, the value ranges of i and j are [1, N ], and i and j are not equal;
judging whether the number of first voice sequence frames corresponding to the ith voice feature is equal to the number of second voice sequence frames corresponding to the jth voice feature;
if the number of first voice sequence frames corresponding to the ith voice feature is not equal to the number of second voice sequence frames corresponding to the jth voice feature, constructing a distance matrix D of n x m, and acquiring the minimum value of each matrix element in the distance matrix D as the Euclidean distance between the ith voice feature and the jth voice feature; wherein n is equal to the number of the first voice sequence frames, m is equal to the number of the second voice sequence frames, and D (x, y) in the distance matrix D represents the Euclidean distance between the x frame voice sequence in the ith voice characteristic and the y frame voice sequence in the jth voice characteristic.
4. The method of claim 3, wherein after determining whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, the method further comprises:
and if the number of the first voice sequence frames corresponding to the ith voice feature is equal to the number of the second voice sequence frames corresponding to the jth voice feature, calculating to obtain the Euclidean distance between the ith voice feature and the jth voice feature.
5. The method of claim 4, wherein the K-means clustering according to Euclidean distance between speech features to obtain a clustering result comprises:
selecting, from the voice feature set, a number of voice features equal to the preset number of clusters, and taking the selected voice features as the initial clustering center of each cluster;
dividing the voice feature set according to Euclidean distances between the voice features in the voice feature set and initial clustering centers to obtain initial clustering results;
obtaining the adjusted clustering center of each cluster according to the initial clustering result;
and according to the adjusted clustering centers, dividing the voice feature set again according to the Euclidean distances from the adjusted clustering centers until the clustering result remains unchanged for more than a preset number of times, so as to obtain clustering clusters corresponding to the preset number of clusters and form a clustering result.
6. The geometry-based speech sample screening method of claim 1, wherein the inputting the voice features corresponding to the current voice data to be recognized into the voice recognition model for operation to obtain and send a current voice recognition result to a user side comprises:
inputting the voice features corresponding to the current voice data to be recognized to the connectionist temporal classification submodel for operation to obtain a first recognition sequence;
and inputting the first recognition sequence into the attention-based submodel for operation, and obtaining and sending a current voice recognition result to the user side.
7. The geometry-based speech sample screening method of claim 1, further comprising:
and uploading a first model parameter set corresponding to the connectionist temporal classification submodel in the voice recognition model and a second model parameter set corresponding to the attention-based submodel to the blockchain network.
8. A geometry-based speech sample screening apparatus, comprising:
the voice feature extraction unit is used for acquiring an initial voice sample set and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set comprises a plurality of pieces of initial voice sample data;
the voice feature clustering unit is used for acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances among the voice features to obtain a clustering result;
the clustering result screening unit is used for calling preset sample subset screening conditions to obtain clustering clusters which meet the sample subset screening conditions in the clustering results so as to form a target clustering cluster set;
a label value acquiring unit, configured to acquire a label value corresponding to each voice feature in the target cluster set, so as to obtain a current voice sample set corresponding to the target cluster set;
the speech recognition model training unit is used for taking each voice feature in the current voice sample set as the input of a voice recognition model to be trained, and taking the label value corresponding to each voice feature as the output of the voice recognition model to be trained, so as to train the voice recognition model to be trained and obtain a voice recognition model; wherein the voice recognition model to be trained comprises a connectionist temporal classification submodel and an attention-based submodel; and
and the voice recognition result sending unit is used for inputting the voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation if the current voice data to be recognized uploaded by the user side is detected, and obtaining and sending the current voice recognition result to the user side.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the computer program implements the geometry-based speech sample screening method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the geometry-based speech sample screening method according to any one of claims 1 to 7.
CN202011387398.0A 2020-12-01 2020-12-01 Speech sample screening method and device based on geometry and computer equipment Active CN112530409B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011387398.0A CN112530409B (en) 2020-12-01 2020-12-01 Speech sample screening method and device based on geometry and computer equipment
PCT/CN2021/083934 WO2022116442A1 (en) 2020-12-01 2021-03-30 Speech sample screening method and apparatus based on geometry, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011387398.0A CN112530409B (en) 2020-12-01 2020-12-01 Speech sample screening method and device based on geometry and computer equipment

Publications (2)

Publication Number Publication Date
CN112530409A true CN112530409A (en) 2021-03-19
CN112530409B CN112530409B (en) 2024-01-23

Family

ID=74996045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011387398.0A Active CN112530409B (en) 2020-12-01 2020-12-01 Speech sample screening method and device based on geometry and computer equipment

Country Status (2)

Country Link
CN (1) CN112530409B (en)
WO (1) WO2022116442A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825169B (en) * 2023-08-31 2023-11-24 悦芯科技股份有限公司 Abnormal memory chip detection method based on test equipment
CN117334186B (en) * 2023-10-13 2024-04-30 北京智诚鹏展科技有限公司 Speech recognition method and NLP platform based on machine learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9691395B1 (en) * 2011-12-31 2017-06-27 Reality Analytics, Inc. System and method for taxonomically distinguishing unconstrained signal data segments
CN111813905B (en) * 2020-06-17 2024-05-10 平安科技(深圳)有限公司 Corpus generation method, corpus generation device, computer equipment and storage medium
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065028A (en) * 2018-06-11 2018-12-21 平安科技(深圳)有限公司 Speaker clustering method, device, computer equipment and storage medium
CN109657711A (en) * 2018-12-10 2019-04-19 广东浪潮大数据研究有限公司 A kind of image classification method, device, equipment and readable storage medium storing program for executing
CN110648671A (en) * 2019-08-21 2020-01-03 广州国音智能科技有限公司 Voiceprint model reconstruction method, terminal, device and readable storage medium
CN110929771A (en) * 2019-11-15 2020-03-27 北京达佳互联信息技术有限公司 Image sample classification method and device, electronic equipment and readable storage medium
CN111179914A (en) * 2019-12-04 2020-05-19 华南理工大学 Voice sample screening method based on improved dynamic time warping algorithm
CN110931043A (en) * 2019-12-06 2020-03-27 湖北文理学院 Integrated speech emotion recognition method, device, equipment and storage medium
CN111046947A (en) * 2019-12-10 2020-04-21 成都数联铭品科技有限公司 Training system and method of classifier and identification method of abnormal sample
CN111554270A (en) * 2020-04-29 2020-08-18 北京声智科技有限公司 Training sample screening method and electronic equipment
CN111950294A (en) * 2020-07-24 2020-11-17 北京奇保信安科技有限公司 Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment
CN111966798A (en) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 Intention identification method and device based on multi-round K-means algorithm and electronic equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022116442A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Speech sample screening method and apparatus based on geometry, and computer device and storage medium
CN113345424A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Voice feature extraction method, device, equipment and storage medium
CN113345424B (en) * 2021-05-31 2024-02-27 平安科技(深圳)有限公司 Voice feature extraction method, device, equipment and storage medium
CN115146716A (en) * 2022-06-22 2022-10-04 腾讯科技(深圳)有限公司 Labeling method, device, equipment, storage medium and program product
CN114863939A (en) * 2022-07-07 2022-08-05 四川大学 Panda attribute identification method and system based on sound

Also Published As

Publication number Publication date
CN112530409B (en) 2024-01-23
WO2022116442A1 (en) 2022-06-09

Similar Documents

Publication Publication Date Title
CN112530409A (en) Voice sample screening method and device based on geometry and computer equipment
CN110648658B (en) Method and device for generating voice recognition model and electronic equipment
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
EP3803846B1 (en) Autonomous generation of melody
CN110534099A (en) Voice wakes up processing method, device, storage medium and electronic equipment
CN107871496B (en) Speech recognition method and device
KR102625184B1 (en) Speech synthesis training to create unique speech sounds
CN112951203B (en) Speech synthesis method, device, electronic equipment and storage medium
CN109119069B (en) Specific crowd identification method, electronic device and computer readable storage medium
WO2023273628A1 (en) Video loop recognition method and apparatus, computer device, and storage medium
CN111710337A (en) Voice data processing method and device, computer readable medium and electronic equipment
CN112837669B (en) Speech synthesis method, device and server
CN112599127B (en) Voice instruction processing method, device, equipment and storage medium
CN110459242A (en) Change of voice detection method, terminal and computer readable storage medium
CN111653274B (en) Wake-up word recognition method, device and storage medium
CN113886643A (en) Digital human video generation method and device, electronic equipment and storage medium
CN109637527A (en) The semantic analytic method and system of conversation sentence
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN113823257B (en) Speech synthesizer construction method, speech synthesis method and device
CN110136697A (en) A kind of reading English exercise system based on multi-process thread parallel operation
WO2021228084A1 (en) Speech data recognition method, device, and medium
CN113781998B (en) Speech recognition method, device, equipment and medium based on dialect correction model
CN113378541B (en) Text punctuation prediction method, device, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant