CN112530409B - Speech sample screening method and device based on geometry and computer equipment - Google Patents

Speech sample screening method and device based on geometry and computer equipment

Info

Publication number
CN112530409B
CN112530409B (application CN202011387398.0A)
Authority
CN
China
Prior art keywords
voice
sample
initial
feature
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011387398.0A
Other languages
Chinese (zh)
Other versions
CN112530409A (en)
Inventor
罗剑
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011387398.0A priority Critical patent/CN112530409B/en
Publication of CN112530409A publication Critical patent/CN112530409A/en
Priority to PCT/CN2021/083934 priority patent/WO2022116442A1/en
Application granted granted Critical
Publication of CN112530409B publication Critical patent/CN112530409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques where the extracted parameters are the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a geometry-based voice sample screening method, apparatus, computer device and storage medium, relating to artificial intelligence technology. The method comprises: acquiring an initial voice sample set, and extracting the voice feature corresponding to each piece of initial voice sample data in the set to form a voice feature set; obtaining the Euclidean distances between the voice features in the voice feature set through a dynamic time warping algorithm and performing K-means clustering on them to obtain a clustering result; invoking a preset sample subset screening condition, and collecting the clusters in the clustering result that satisfy the condition to form a target cluster set; and obtaining the labeling value corresponding to each voice feature in the target cluster set to obtain the current voice sample set corresponding to the target cluster set. By automatically selecting samples with low redundancy to train the voice recognition model, the method reduces the labeling cost of voice recognition tasks under the deep learning background and improves the training speed of the voice recognition model.

Description

Speech sample screening method and device based on geometry and computer equipment
Technical Field
The invention relates to the technical field of artificial-intelligence voice semantics, and in particular to a geometry-based voice sample screening method and device, a computer device, and a storage medium.
Background
In recent years, with the great success of deep neural networks (Deep Neural Network, DNN) in the field of signal processing, DNN-based speech recognition algorithms have become a research hotspot. However, training a DNN for speech recognition by supervised learning generally requires a large amount of labeled voice data. Although unlabeled voice data has become easier to acquire with the development and popularization of sensing devices, manually labeling it still requires a significant amount of labor cost.
For labeling unlabeled speech data, active learning techniques can be employed. Active learning is a branch of machine learning that allows the model to select the data it learns from by itself. Its idea derives from the assumption that a machine learning algorithm can achieve better performance with less training data if it is allowed to choose the data from which it learns.
The most widely used active learning query strategy is called uncertainty sampling (Uncertainty Sampling), in which the samples whose predictions the model is least certain about are selected for labeling. This technique works well when the number of selected samples is small. However, when a deep neural network is used as the training model, a large amount of training data is needed, and as the number of selected labeling samples grows, the samples about which the model is uncertain become redundant and overlapping, so similar samples are more likely to be selected. The selection of these similar samples contributes very little to model training.
In addition, voice data differs from non-sequential data such as images: it is of variable length and contains rich structured information, which makes voice data harder to process and select.
Disclosure of Invention
The embodiments of the invention provide a geometry-based voice sample screening method, device, computer device and storage medium, aiming to solve the problems in the prior art that, when an uncertainty sampling technique is applied to training a neural network for voice recognition, the samples whose predictions are uncertain become redundant and overlapping, such similar samples provide limited help for model training, and the complex structure of voice makes the selection of voice samples difficult.
In a first aspect, an embodiment of the present invention provides a geometric-based speech sample screening method, including:
acquiring an initial voice sample set, and extracting voice characteristics corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice characteristic set; wherein the initial voice sample set comprises a plurality of pieces of initial voice sample data;
acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and carrying out K-means clustering according to the Euclidean distances among the voice features to obtain a clustering result;
Invoking a preset sample subset screening condition, and acquiring a cluster meeting the sample subset screening condition in the clustering result to form a target cluster set;
obtaining a labeling value corresponding to each voice feature in the target cluster set to obtain a current voice sample set corresponding to the target cluster set;
taking each voice feature in the current voice sample set as the input of a voice recognition model to be trained, taking a labeling value corresponding to each voice feature as the output of the voice recognition model to be trained so as to train the voice recognition model to be trained, and obtaining a voice recognition model; the speech recognition model to be trained comprises a link time sequence classification sub-model and an attention mechanism-based sub-model; and
if the current voice data to be recognized uploaded by the user side is detected, inputting the voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation, obtaining and sending the current voice recognition result to the user side.
In a second aspect, an embodiment of the present invention provides a geometry-based speech sample screening apparatus, including:
the voice feature extraction unit is used for obtaining an initial voice sample set and extracting voice features corresponding to each piece of initial voice sample data in the initial voice sample set so as to form a voice feature set; wherein the initial voice sample set comprises a plurality of pieces of initial voice sample data;
the voice feature clustering unit is used for acquiring the Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and carrying out K-means clustering according to the Euclidean distances among the voice features so as to obtain a clustering result;
the clustering result screening unit is used for calling preset sample subset screening conditions, and obtaining clustering clusters meeting the sample subset screening conditions in the clustering result to form a target clustering cluster set;
the labeling value acquisition unit is used for acquiring a labeling value corresponding to each voice feature in the target cluster set so as to obtain a current voice sample set corresponding to the target cluster set;
the voice recognition model training unit is used for taking each voice feature in the current voice sample set as the input of the voice recognition model to be trained, and taking the labeling value corresponding to each voice feature as the output of the voice recognition model to be trained so as to train the voice recognition model to be trained, so that a voice recognition model is obtained; the speech recognition model to be trained comprises a link time sequence classification sub-model and an attention mechanism-based sub-model; and
and the voice recognition result sending unit is used for inputting the voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation if the current voice data to be recognized uploaded by the user terminal is detected, so as to obtain and send the current voice recognition result to the user terminal.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the geometry-based speech sample screening method according to the first aspect.
In a fourth aspect, embodiments of the present invention further provide a computer readable storage medium, where the computer readable storage medium stores a computer program, which when executed by a processor, causes the processor to perform the geometry-based speech sample screening method according to the first aspect.
The embodiment of the invention provides a geometric-based voice sample screening method, a geometric-based voice sample screening device, computer equipment and a storage medium, which comprise the steps of obtaining an initial voice sample set, and extracting voice characteristics corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice characteristic set; acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and carrying out K-means clustering according to the Euclidean distances among the voice features to obtain a clustering result; invoking a preset sample subset screening condition, and acquiring a cluster meeting the sample subset screening condition in the clustering result to form a target cluster set; obtaining a labeling value corresponding to each voice feature in the target cluster set to obtain a current voice sample set corresponding to the target cluster set; taking each voice feature in the current voice sample set as the input of a voice recognition model to be trained, taking a labeling value corresponding to each voice feature as the output of the voice recognition model to be trained so as to train the voice recognition model to be trained, and obtaining a voice recognition model; the speech recognition model to be trained comprises a link time sequence classification sub-model and an attention mechanism-based sub-model. The method realizes the training of the voice recognition model by automatically selecting the sample with smaller redundancy, reduces the labeling cost of the voice recognition task under the deep learning background, and improves the training speed of the voice recognition model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario of a geometric-based voice sample screening method according to an embodiment of the present invention;
fig. 2 is a flow chart of a geometric-based voice sample screening method according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a geometry-based speech sample screening apparatus according to an embodiment of the present invention;
fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic diagram of an application scenario of a geometric-based voice sample screening method according to an embodiment of the present invention; fig. 2 is a flow chart of a geometric-based voice sample screening method according to an embodiment of the present invention, where the geometric-based voice sample screening method is applied to a server, and the method is executed by application software installed in the server.
As shown in fig. 2, the method includes steps S110 to S160.
S110, acquiring an initial voice sample set, and extracting voice characteristics corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice characteristic set; wherein the initial speech sample set includes a plurality of pieces of initial speech sample data.
In this embodiment, in order to train a voice recognition model in the server using fewer labeled samples, data preprocessing and feature extraction may first be performed on the initial voice sample set, obtaining the voice feature corresponding to each piece of initial voice sample data to form a voice feature set. The data preprocessing comprises pre-emphasis, framing, windowing and similar operations, whose purpose is to eliminate the effects on voice signal quality of aliasing, higher-harmonic distortion, high frequencies and other factors caused by the human vocal organs and by defects of the acquisition equipment, so that the obtained signal is as uniform and smooth as possible.
In one embodiment, step S110 includes:
invoking a pre-stored sampling period to sample each piece of initial voice sample data in the initial voice sample set respectively to obtain a current discrete voice signal corresponding to each piece of initial voice sample data;
Invoking a prestored first-order FIR high-pass digital filter to respectively pre-emphasis the current discrete voice signals corresponding to each piece of initial voice sample data to obtain current pre-emphasis voice signals corresponding to each piece of initial voice sample data;
invoking a pre-stored Hamming window to window the current pre-emphasis voice signal corresponding to each piece of initial voice sample data respectively to obtain windowed voice data corresponding to each piece of initial voice sample data;
invoking a pre-stored frame shift and frame length to respectively frame the windowed voice data corresponding to each piece of initial voice sample data to obtain pre-processed voice data corresponding to each piece of initial voice sample data;
and respectively extracting Mel frequency cepstrum coefficients or filter banks from the preprocessed voice data corresponding to each piece of initial voice sample data to obtain voice features corresponding to each piece of initial voice sample data so as to form a voice feature set.
In this embodiment, before digital processing of the voice signal, each piece of initial voice sample data (denoted s(t)) is first sampled with sampling period T and discretized into s(n).
Then, the pre-stored first-order FIR high-pass digital filter is invoked; it is a first-order non-recursive high-pass digital filter whose transfer function is given by formula (1):
H(z) = 1 - a·z^(-1)   (1)
In a specific implementation, a takes the value 0.98. For example, let the sampling value of the current discrete voice signal at time n be x(n); the corresponding sampling value in the current pre-emphasized voice signal after pre-emphasis is y(n) = x(n) - a·x(n-1).
Thereafter, the invoked Hamming window function is given by formula (2):
ω(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1   (2)
Windowing the current pre-emphasized voice signal with the Hamming window gives windowed voice data that can be expressed as q(n) = y(n)·ω(n).
Finally, when the pre-stored frame shift and frame length are invoked to frame the windowed voice data, let the time-domain signal corresponding to the windowed voice data be x(n); the n-th frame of voice data in the pre-processed voice data after windowing and framing is x_n(m), and x_n(m) satisfies formula (3):
x_n(m) = ω(n)·x(n + m), 0 ≤ m ≤ N-1   (3)
where n = 0, T, 2T, …, N is the frame length, T is the frame shift, and ω(n) is the Hamming window function.
Preprocessing the initial voice sample data in this way makes it suitable for subsequent voice parameter extraction, such as extracting Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC) or filter-bank (Filter-Bank) features; after extraction, the voice feature corresponding to each piece of initial voice sample data is obtained to form the voice feature set.
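As a concrete illustration of this preprocessing chain, the following sketch applies pre-emphasis, framing and Hamming windowing to an already-sampled signal with NumPy. The coefficient a = 0.98 follows the text above, while the frame length and frame shift (25 ms and 10 ms at a 16 kHz sampling rate) and the frame-then-window order are common defaults assumed here rather than values fixed by this description; the resulting frames can then be passed to any MFCC or filter-bank extractor.

```python
import numpy as np

def preprocess(signal: np.ndarray, a: float = 0.98,
               frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    """Pre-emphasis, framing and Hamming windowing of one already-sampled utterance s(n).

    frame_len / frame_shift correspond to 25 ms / 10 ms at 16 kHz; these are
    illustrative defaults, not values taken from the description above.
    """
    # Pre-emphasis through the first-order FIR high-pass filter H(z) = 1 - a*z^(-1):
    # y(n) = x(n) - a * x(n-1)
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # Split the pre-emphasized signal into overlapping frames (frame shift T, frame length N)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])

    # Apply a Hamming window to every frame: q(n) = y(n) * w(n)
    return frames * np.hamming(frame_len)

# Each windowed frame can then be fed to an MFCC or filter-bank extractor
# (for example librosa.feature.mfcc) to obtain the voice features used below.
```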
S120, acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and carrying out K-means clustering according to the Euclidean distances among the voice features to obtain a clustering result.
In this embodiment, since the pieces of initial voice sample data in the initial voice sample set differ from one another, the difference between two pieces of initial voice sample data can be quantified by calculating the Euclidean distance between their corresponding voice features.
In the field of speech processing, the lengths of any two pieces of initial voice sample data are in most cases unequal, and different people speak at different speeds. Even the same person cannot produce utterances of exactly the same duration at different times, and each person pronounces the different phonemes of the same word at different rates, dragging an "E" sound slightly longer, for example, or making an "o" slightly shorter. In such complex situations, the conventional Euclidean distance cannot accurately capture the similarity between two pieces of initial voice sample data. In this case, the Euclidean distance between the voice features in the voice feature set can be obtained through a dynamic time warping algorithm.
In one embodiment, the step S120 of obtaining the euclidean distance between each of the voice features in the voice feature set by using a dynamic time warping algorithm includes:
acquiring the i-th voice feature and the j-th voice feature in the voice feature set; the voice feature set comprises N voice features, i and j each range from 1 to N, and i and j are unequal;
judging whether the number of frames of the first voice sequence corresponding to the ith voice feature is equal to the number of frames of the second voice sequence corresponding to the jth voice feature;
if the number of frames of the first voice sequence corresponding to the i-th voice feature is not equal to the number of frames of the second voice sequence corresponding to the j-th voice feature, constructing an n×m distance matrix D, and obtaining the minimum value over the matrix elements of the distance matrix D as the Euclidean distance between the i-th voice feature and the j-th voice feature; where n equals the number of frames of the first voice sequence, m equals the number of frames of the second voice sequence, and d(x, y) in the distance matrix D represents the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature.
In this embodiment, the calculation of the i-th speech feature and the j-th speech feature in the speech feature set is taken as an example to describe a euclidean distance calculation method between any two speech features in the speech feature set, until the euclidean distance between each speech feature in the speech feature set is calculated, the above calculation process can be stopped.
When calculating the Euclidean distance between any two voice features, it is first determined whether their numbers of voice-sequence frames are equal (for example, whether the number of frames of the first voice sequence corresponding to the i-th voice feature equals the number of frames of the second voice sequence corresponding to the j-th voice feature). If the numbers of frames are not equal, an n×m distance matrix D needs to be constructed, and the minimum value over the matrix elements of the distance matrix D is obtained as the Euclidean distance between the i-th voice feature and the j-th voice feature; where n equals the number of frames of the first voice sequence, m equals the number of frames of the second voice sequence, and d(x, y) in the distance matrix D represents the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature.
For example, the matrix element d(x, y) represents the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature; a shortest path from d(0, 0) to d(n, m) is found, and the length of this path is taken as the distance between the i-th voice feature and the j-th voice feature. The path must satisfy continuity and temporal monotonicity (no backtracking). This calculation uses the dynamic time warping algorithm (Dynamic Time Warping, DTW for short).
In an embodiment, after determining whether the number of frames of the first voice sequence corresponding to the ith voice feature is equal to the number of frames of the second voice sequence corresponding to the jth voice feature in step S120, the method further includes:
and if the number of frames of the first voice sequence corresponding to the ith voice feature is equal to the number of frames of the second voice sequence corresponding to the jth voice feature, calculating to obtain the Euclidean distance between the ith voice feature and the jth voice feature.
In this embodiment, when it is determined that the number of frames of the first voice sequence corresponding to the i-th voice feature equals the number of frames of the second voice sequence corresponding to the j-th voice feature, the two features have the same duration, and the Euclidean distance between them is calculated directly, without constructing the distance matrix D.
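The shortest-path computation described above can be sketched as follows, assuming each voice feature is an array of shape (frames, dims); the per-frame distances d(x, y) and the continuity and monotonicity constraints mirror the description, and the indexing is 0-based.

```python
import numpy as np

def dtw_distance(feat_i: np.ndarray, feat_j: np.ndarray) -> float:
    """DTW distance between two voice features of shape (frames, dims).

    When both features have the same number of frames, a plain frame-wise
    Euclidean distance could be used instead, as described above.
    """
    n, m = len(feat_i), len(feat_j)
    # d[x, y]: Euclidean distance between frame x of feature i and frame y of feature j
    d = np.linalg.norm(feat_i[:, None, :] - feat_j[None, :, :], axis=-1)

    # Accumulated cost matrix: shortest path from (0, 0) to (n-1, m-1),
    # subject to continuity and temporal monotonicity (no backtracking).
    acc = np.full((n, m), np.inf)
    acc[0, 0] = d[0, 0]
    for x in range(n):
        for y in range(m):
            if x == 0 and y == 0:
                continue
            prev = min(acc[x - 1, y] if x > 0 else np.inf,
                       acc[x, y - 1] if y > 0 else np.inf,
                       acc[x - 1, y - 1] if x > 0 and y > 0 else np.inf)
            acc[x, y] = d[x, y] + prev
    return float(acc[-1, -1])
```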
In one embodiment, the step S120 of performing K-means clustering according to the euclidean distance between the speech features to obtain a clustering result includes:
selecting, from the voice feature set, a number of voice features equal to the preset number of clusters, and taking the selected voice features as the initial cluster center of each cluster;
dividing the voice feature set according to Euclidean distance between each voice feature in the voice feature set and each initial clustering center to obtain an initial clustering result;
Acquiring an adjusted clustering center of each cluster according to the initial clustering result;
and re-dividing the voice feature set according to the Euclidean distance between each voice feature and the adjusted cluster centers, until the clustering result remains unchanged for more than a preset number of iterations, thereby obtaining the clusters corresponding to the preset number of clusters to form the clustering result.
In this embodiment, since the euclidean distance between the voice features can be calculated by the dynamic time warping algorithm, the voice feature set can be clustered by the K-means clustering method at this time, and the specific procedure is as follows:
a) Randomly selecting N2 voice features from a voice feature set comprising N1 voice features, and taking the N2 voice features as an initial clustering center of N2 clusters; wherein the initial total number of voice features in the voice feature set is N1, N2 voice features are arbitrarily selected from the voice feature set (N2 < N1, N2 is a preset cluster number, i.e. the number of desired clusters), and the initially selected N2 voice features are taken as an initial cluster center.
b) Calculate the Euclidean distance from each of the remaining voice features to each of the N2 initial cluster centers, and assign each remaining voice feature to the cluster whose center is closest, obtaining an initial clustering result; that is, each remaining voice feature is grouped with its nearest initial cluster center. The voice features are thus divided into N2 clusters by the initially selected cluster centers, and each cluster has an initial cluster center.
c) Recalculate the cluster centers of the N2 clusters according to the initial clustering result.
d) Re-cluster all of the N1 voice features according to the new cluster centers.
e) Repeat steps c) and d) until the clustering result no longer changes, obtaining the clusters corresponding to the preset number of clusters.
After the clustering classification is completed, the voice feature sets can be quickly grouped, and a plurality of clustering clusters are obtained. And then, the server can select the cluster meeting the conditions from the plurality of clusters as a training sample and mark the training sample.
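Steps a) to e) can be sketched as follows over a precomputed DTW distance matrix. Because recomputing a mean center is not defined for variable-length voice features, the sketch updates each cluster center as the member closest to the rest of its cluster (a medoid-style choice); this update rule and the stopping count are assumptions made only for illustration.

```python
import numpy as np

def kmeans_over_distances(dist: np.ndarray, n_clusters: int,
                          max_stable: int = 3, seed: int = 0) -> np.ndarray:
    """Cluster N voice features given their pairwise DTW distance matrix `dist` (N x N).

    Returns an array of cluster labels. Cluster centers are updated as medoids,
    an assumption made here because a mean of variable-length features is undefined.
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    centers = rng.choice(n, size=n_clusters, replace=False)   # step a): initial centers
    labels, stable = None, 0

    while stable < max_stable:
        # Step b)/d): assign every feature to the closest center (minimum DTW distance)
        new_labels = np.argmin(dist[:, centers], axis=1)
        # Step c): recompute each center as the member minimizing total distance to its cluster
        for k in range(n_clusters):
            members = np.where(new_labels == k)[0]
            if len(members) > 0:
                centers[k] = members[np.argmin(dist[np.ix_(members, members)].sum(axis=1))]
        # Step e): stop once the assignment has stayed the same for max_stable rounds
        stable = stable + 1 if labels is not None and np.array_equal(new_labels, labels) else 0
        labels = new_labels
    return labels
```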
S130, calling a preset sample subset screening condition, and obtaining a cluster meeting the sample subset screening condition in the clustering result to form a target cluster set.
In this embodiment, the sample subset screening condition may be set such that the sample redundancy is the minimum among the sample subsets, so that target clusters can be screened out to form the target cluster set. When calculating the sample redundancy of a cluster, the degree of data repetition is computed: for example, if the total number of data items in a sample subset is Y1 and the number of repeated data items is Y2, the sample redundancy of that sample subset is Y2/Y1. Selecting sample subsets with lower redundancy significantly reduces the labeling cost of the voice recognition task in the deep learning setting.
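A minimal sketch of this screening condition is given below; the test used to decide whether two feature items count as repeated (rounded values colliding) is only an illustrative stand-in for the repetition measure described above.

```python
import numpy as np

def sample_redundancy(cluster_features: list) -> float:
    """Redundancy of one cluster: repeated items Y2 divided by total items Y1.

    Two features are treated as repeated when their rounded values collide;
    this duplicate test is an illustrative stand-in, not a rule fixed above.
    """
    y1 = len(cluster_features)
    seen, y2 = set(), 0
    for feat in cluster_features:
        key = np.round(np.asarray(feat), 3).tobytes()
        y2 += key in seen
        seen.add(key)
    return y2 / y1 if y1 else 0.0

def select_target_clusters(clusters: dict) -> list:
    """Keep the cluster id(s) whose redundancy is minimal among all clusters."""
    redundancy = {cid: sample_redundancy(feats) for cid, feats in clusters.items()}
    best = min(redundancy.values())
    return [cid for cid, r in redundancy.items() if r == best]
```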
S140, obtaining a labeling value corresponding to each voice feature in the target cluster set to obtain a current voice sample set corresponding to the target cluster set.
In this embodiment, since the target cluster set has been selected, only a small number of sample labels may be performed at this time, so as to obtain the current speech sample set corresponding to the target cluster set. Less labeling data is used, so that the training speed of the voice recognition model can be remarkably improved, and the calculation pressure of a voice processing system is reduced.
S150, taking each voice feature in the current voice sample set as the input of a voice recognition model to be trained, and taking a labeling value corresponding to each voice feature as the output of the voice recognition model to be trained so as to train the voice recognition model to be trained, so that a voice recognition model is obtained; the speech recognition model to be trained comprises a link time sequence classification sub-model and an attention mechanism-based sub-model.
In this embodiment, in order to train a voice recognition model with higher recognition accuracy on the current voice sample set, a model in which a CTC (Connectionist Temporal Classification, i.e. link timing classification) model and an Attention model (i.e. attention-mechanism-based model) are decoded jointly may be employed. CTC decoding recognizes voice by predicting an output for each frame; the algorithm is built on the assumption that the frames are decoded independently, so the decoding process lacks the association between preceding and following voice features and relies on correction by a language model. The Attention decoding process is independent of the frame order of the input voice: each decoding unit generates the current result from the decoding result of the previous unit and the overall voice features, so the monotonic temporal order of the voice is ignored during decoding. A hybrid model can therefore be adopted to combine the advantages of both. Typically, the link timing classification sub-model is arranged closer to the input end for preliminary processing, and the attention-mechanism-based sub-model is arranged closer to the output end for subsequent processing. The network structure of the voice recognition model to be trained adopts structures such as LSTM/CNN/GRU, and the two decoders output the recognition result together.
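The joint arrangement described above can be sketched with PyTorch as follows; the layer sizes, the reserved BOS index, the Transformer-style attention decoder and the mixing weight between the two objectives are illustrative assumptions rather than a structure fixed by this description.

```python
import torch
import torch.nn as nn

class HybridCTCAttention(nn.Module):
    """Sketch of the hybrid idea: a shared LSTM encoder, a CTC branch toward the
    input side and an attention-based decoder branch toward the output side."""

    def __init__(self, feat_dim=80, hidden=256, vocab=5000, lam=0.3):
        super().__init__()
        self.lam, self.bos = lam, vocab                  # reserve one extra id for BOS
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, vocab + 1)          # +1 for the CTC blank
        self.ctc_loss = nn.CTCLoss(blank=vocab, zero_infinity=True)
        self.embed = nn.Embedding(vocab + 1, 2 * hidden)          # token ids + BOS
        self.decoder = nn.TransformerDecoderLayer(d_model=2 * hidden, nhead=4,
                                                  batch_first=True)
        self.att_head = nn.Linear(2 * hidden, vocab)
        self.ce_loss = nn.CrossEntropyLoss()                      # padding mask omitted

    def forward(self, feats, feat_lens, tokens, token_lens):
        enc, _ = self.encoder(feats)                              # (B, T, 2*hidden)

        # CTC branch: frame-wise predictions, each frame decoded independently.
        log_probs = self.ctc_head(enc).log_softmax(-1).transpose(0, 1)   # (T, B, V+1)
        ctc = self.ctc_loss(log_probs, tokens, feat_lens, token_lens)

        # Attention branch: each step conditions on the previous tokens
        # (teacher forcing with a BOS prefix) and on the whole encoded utterance.
        bos = torch.full((tokens.size(0), 1), self.bos, device=tokens.device)
        dec_in = self.embed(torch.cat([bos, tokens[:, :-1]], dim=1))
        causal = torch.triu(torch.full((dec_in.size(1), dec_in.size(1)),
                                       float("-inf"), device=feats.device), diagonal=1)
        dec = self.decoder(dec_in, enc, tgt_mask=causal)
        att = self.ce_loss(self.att_head(dec).flatten(0, 1), tokens.flatten())

        return self.lam * ctc + (1 - self.lam) * att              # joint training objective
```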
In an embodiment, step S150 further includes:
and uploading a first model parameter set corresponding to the link time sequence classification sub-model and a second model parameter set corresponding to the attention mechanism-based sub-model in the voice recognition model to the blockchain network.
In this embodiment, the corresponding digest information is obtained from the first model parameter set and the second model parameter set; specifically, the digest information is obtained by hashing the first model parameter set and the second model parameter set, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency to users. A user device can download the digest information from the blockchain in order to verify whether the first model parameter set and the second model parameter set have been tampered with.
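A minimal sketch of producing such digest information is shown below; the serialization of the two parameter sets is an assumption, since only the hash algorithm (SHA-256) is specified here.

```python
import hashlib
import json

def parameter_digest(first_params: dict, second_params: dict) -> str:
    """Digest information for the two model parameter sets.

    The parameter sets are serialized deterministically and hashed with SHA-256;
    the JSON serialization is an illustrative assumption.
    """
    payload = json.dumps({"ctc_submodel": first_params,
                          "attention_submodel": second_params},
                         sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# The hex digest returned here is what would be written to the blockchain network;
# a client can recompute it from downloaded parameters to check for tampering.
print(parameter_digest({"lstm.weight": [0.1, 0.2]}, {"attention.weight": [0.3]}))
```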
The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
And S160, if the current voice data to be recognized uploaded by the user terminal is detected, inputting the voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation, obtaining and sending the current voice recognition result to the user terminal.
In this embodiment, after training of the speech recognition model is completed in the server, the method can be specifically applied to speech recognition. The server receives the current voice data to be recognized uploaded by the user side, inputs the voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation, obtains and sends the current voice recognition result to the user side, and the current voice recognition result can be fed back rapidly in this way.
In one embodiment, step S160 includes:
inputting the voice characteristics corresponding to the voice data to be recognized to a link time sequence classification sub-model for operation to obtain a first recognition sequence;
and inputting the first recognition sequence into a submodel based on an attention mechanism for operation, obtaining and sending a current voice recognition result to a user side.
In this embodiment, since the link timing classification sub-model is arranged closer to the input end and the attention-mechanism-based sub-model closer to the output end, the current voice data to be recognized is first input into the link timing classification sub-model to obtain a first recognition sequence, and the first recognition sequence is then input into the attention-mechanism-based sub-model to obtain the current voice recognition result. In this way, the association between preceding and following voice features in decoding is fully taken into account, the monotonic temporal order of the voice is also considered, and the recognition result produced by the model is more accurate.
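The two-stage data flow of this embodiment can be sketched as follows, reusing the sub-modules of the training sketch given earlier; the greedy CTC collapse and the way the first recognition sequence is fed to the attention decoder are simplifying assumptions for illustration, not the only possible decoding scheme.

```python
import torch

@torch.no_grad()
def recognize(model, feats):
    """Two-stage decoding sketch: the CTC branch yields a first recognition sequence,
    which the attention branch then refines together with the whole encoding."""
    enc, _ = model.encoder(feats.unsqueeze(0))                   # (1, T, 2*hidden)

    # Stage 1 - CTC: greedy per-frame argmax, then collapse repeats and blanks
    frame_ids = model.ctc_head(enc).argmax(-1).squeeze(0)        # (T,)
    blank = model.ctc_head.out_features - 1
    first_seq, prev = [], blank
    for idx in frame_ids.tolist():
        if idx != blank and idx != prev:
            first_seq.append(idx)
        prev = idx

    # Stage 2 - attention: feed the first sequence (with a BOS prefix) through the
    # decoder together with the encoded utterance and take its refined predictions
    dec_in = torch.tensor([[model.bos] + first_seq], device=feats.device)
    dec = model.decoder(model.embed(dec_in), enc)
    return model.att_head(dec).argmax(-1).squeeze(0).tolist()
```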
The method realizes the training of the voice recognition model by automatically selecting the sample with smaller redundancy, reduces the labeling cost of the voice recognition task under the deep learning background, and improves the training speed of the voice recognition model.
The embodiment of the invention also provides a geometry-based voice sample screening device, which is used for executing any embodiment of the method for screening the voice samples based on geometry. In particular, referring to fig. 3, fig. 3 is a schematic block diagram of a geometry-based speech sample screening apparatus according to an embodiment of the present invention. The geometry-based speech sample screening device 100 may be configured in a server.
As shown in fig. 3, the geometry-based voice sample screening apparatus 100 includes: the device comprises a voice feature extraction unit 110, a voice feature clustering unit 120, a clustering result screening unit 130, a labeling value acquisition unit 140, a voice recognition model training unit 150 and a voice recognition result sending unit 160.
A voice feature extraction unit 110, configured to obtain an initial voice sample set, and extract voice features corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial speech sample set includes a plurality of pieces of initial speech sample data.
In this embodiment, in order to train a voice recognition model in the server using fewer labeled samples, data preprocessing and feature extraction may first be performed on the initial voice sample set, obtaining the voice feature corresponding to each piece of initial voice sample data to form a voice feature set. The data preprocessing comprises pre-emphasis, framing, windowing and similar operations, whose purpose is to eliminate the effects on voice signal quality of aliasing, higher-harmonic distortion, high frequencies and other factors caused by the human vocal organs and by defects of the acquisition equipment, so that the obtained signal is as uniform and smooth as possible.
In one embodiment, the speech feature extraction unit 110 includes:
the discrete sampling unit is used for calling a prestored sampling period to sample each piece of initial voice sample data in the initial voice sample set respectively to obtain a current discrete voice signal corresponding to each piece of initial voice sample data;
the pre-emphasis unit is used for calling a pre-stored first-order FIR high-pass digital filter to pre-emphasis the current discrete voice signals corresponding to each piece of initial voice sample data respectively, so as to obtain the current pre-emphasis voice signals corresponding to each piece of initial voice sample data;
The windowing unit is used for calling a pre-stored Hamming window to window the current pre-emphasis voice signal corresponding to each piece of initial voice sample data respectively to obtain windowed voice data corresponding to each piece of initial voice sample data;
the framing unit is used for calling a pre-stored frame shift and frame length to respectively frame the windowed voice data corresponding to each piece of initial voice sample data to obtain the pre-processed voice data corresponding to each piece of initial voice sample data;
and the feature extraction unit is used for respectively extracting the Mel frequency cepstrum coefficient or the filter bank of the preprocessed voice data corresponding to each piece of initial voice sample data to obtain the voice feature corresponding to each piece of initial voice sample data so as to form a voice feature set.
In this embodiment, before digital processing of the voice signal, each piece of initial voice sample data (denoted s(t)) is first sampled with sampling period T and discretized into s(n).
Then, the pre-stored first-order FIR high-pass digital filter is invoked; it is a first-order non-recursive high-pass digital filter whose transfer function is expressed by formula (1).
In a specific implementation, a takes the value 0.98. For example, let the sampling value of the current discrete voice signal at time n be x(n); the corresponding sampling value in the current pre-emphasized voice signal after pre-emphasis is y(n) = x(n) - a·x(n-1).
Then, the invoked Hamming window function is given by formula (2); windowing the current pre-emphasized voice signal with the Hamming window gives windowed voice data that can be expressed as q(n) = y(n)·ω(n).
Finally, when the pre-stored frame shift and frame length are invoked to frame the windowed voice data, the n-th frame of voice data in the pre-processed voice data after windowing and framing is x_n(m), and x_n(m) satisfies formula (3).
Preprocessing the initial voice sample data in this way makes it suitable for subsequent voice parameter extraction, such as extracting Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC) or filter-bank (Filter-Bank) features; after extraction, the voice feature corresponding to each piece of initial voice sample data is obtained to form the voice feature set.
The voice feature clustering unit 120 is configured to obtain the euclidean distance between the voice features in the voice feature set through a dynamic time warping algorithm, and perform K-means clustering according to the euclidean distance between the voice features, so as to obtain a clustering result.
In this embodiment, since the pieces of initial voice sample data in the initial voice sample set differ from one another, the difference between two pieces of initial voice sample data can be quantified by calculating the Euclidean distance between their corresponding voice features.
In the field of speech processing, the lengths of any two pieces of initial voice sample data are in most cases unequal, and different people speak at different speeds. Even the same person cannot produce utterances of exactly the same duration at different times, and each person pronounces the different phonemes of the same word at different rates, dragging an "E" sound slightly longer, for example, or making an "o" slightly shorter. In such complex situations, the conventional Euclidean distance cannot accurately capture the similarity between two pieces of initial voice sample data. In this case, the Euclidean distance between the voice features in the voice feature set can be obtained through a dynamic time warping algorithm.
In an embodiment, the speech feature clustering unit 120 comprises:
the voice feature selection unit is used for acquiring the i-th voice feature and the j-th voice feature in the voice feature set; the voice feature set comprises N voice features, i and j each range from 1 to N, and i and j are unequal;
The voice sequence frame number comparison unit is used for judging whether the first voice sequence frame number corresponding to the ith voice feature is equal to the second voice sequence frame number corresponding to the jth voice feature;
the first calculation unit is used for constructing an n×m distance matrix D if the number of frames of the first voice sequence corresponding to the i-th voice feature is not equal to the number of frames of the second voice sequence corresponding to the j-th voice feature, and obtaining the minimum value over the matrix elements of the distance matrix D as the Euclidean distance between the i-th voice feature and the j-th voice feature; where n equals the number of frames of the first voice sequence, m equals the number of frames of the second voice sequence, and d(x, y) in the distance matrix D represents the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature.
In this embodiment, the calculation of the i-th speech feature and the j-th speech feature in the speech feature set is taken as an example to describe a euclidean distance calculation method between any two speech features in the speech feature set, until the euclidean distance between each speech feature in the speech feature set is calculated, the above calculation process can be stopped.
When calculating the Euclidean distance between any two voice features, it is first determined whether their numbers of voice-sequence frames are equal (for example, whether the number of frames of the first voice sequence corresponding to the i-th voice feature equals the number of frames of the second voice sequence corresponding to the j-th voice feature). If the numbers of frames are not equal, an n×m distance matrix D needs to be constructed, and the minimum value over the matrix elements of the distance matrix D is obtained as the Euclidean distance between the i-th voice feature and the j-th voice feature; where n equals the number of frames of the first voice sequence, m equals the number of frames of the second voice sequence, and d(x, y) in the distance matrix D represents the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature.
For example, the matrix element d(x, y) represents the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature; a shortest path from d(0, 0) to d(n, m) is found, and the length of this path is taken as the distance between the i-th voice feature and the j-th voice feature. The path must satisfy continuity and temporal monotonicity (no backtracking). This calculation uses the dynamic time warping algorithm (Dynamic Time Warping, DTW for short).
In an embodiment, the speech feature clustering unit 120 further comprises:
and the second calculating unit is used for calculating the Euclidean distance between the ith voice feature and the jth voice feature if the number of frames of the first voice sequence corresponding to the ith voice feature is equal to the number of frames of the second voice sequence corresponding to the jth voice feature.
In this embodiment, when it is determined that the number of frames of the first voice sequence corresponding to the i-th voice feature equals the number of frames of the second voice sequence corresponding to the j-th voice feature, the two features have the same duration, and the Euclidean distance between them is calculated directly, without constructing the distance matrix D.
In an embodiment, the speech feature clustering unit 120 comprises:
the initial cluster center acquisition unit is used for selecting, from the voice feature set, a number of voice features equal to the preset number of clusters, and taking the selected voice features as the initial cluster center of each cluster;
the initial clustering unit is used for dividing the voice feature set according to Euclidean distance between each voice feature in the voice feature set and each initial clustering center to obtain an initial clustering result;
the cluster center adjusting unit is used for acquiring an adjusted cluster center of each cluster according to the initial cluster result;
and the clustering adjustment unit is used for re-dividing the voice feature set according to the Euclidean distance between each voice feature and the adjusted cluster centers, until the clustering result remains unchanged for more than a preset number of iterations, obtaining the clusters corresponding to the preset number of clusters to form the clustering result.
In this embodiment, since the euclidean distance between the speech features can be calculated by the dynamic time warping algorithm, the speech feature set can be clustered by the K-means clustering method. After the clustering classification is completed, the voice feature sets can be quickly grouped, and a plurality of clustering clusters are obtained. And then, the server can select the cluster meeting the conditions from the plurality of clusters as a training sample and mark the training sample.
And the clustering result screening unit 130 is configured to call a preset sample subset screening condition, and acquire a cluster meeting the sample subset screening condition in the clustering result, so as to form a target cluster set.
In this embodiment, the sample subset screening condition may be set such that the sample redundancy is the minimum among the sample subsets, so that target clusters can be screened out to form the target cluster set. When calculating the sample redundancy of a cluster, the degree of data repetition is computed: for example, if the total number of data items in a sample subset is Y1 and the number of repeated data items is Y2, the sample redundancy of that sample subset is Y2/Y1. Selecting sample subsets with lower redundancy significantly reduces the labeling cost of the voice recognition task in the deep learning setting.
The labeling value obtaining unit 140 is configured to obtain a labeling value corresponding to each voice feature in the target cluster set, so as to obtain a current voice sample set corresponding to the target cluster set.
In this embodiment, since the target cluster set has been selected, only a small number of sample labels may be performed at this time, so as to obtain the current speech sample set corresponding to the target cluster set. Less labeling data is used, so that the training speed of the voice recognition model can be remarkably improved, and the calculation pressure of a voice processing system is reduced.
The speech recognition model training unit 150 is configured to take each speech feature in the current speech sample set as input of a speech recognition model to be trained, and take a labeling value corresponding to each speech feature as output of the speech recognition model to be trained so as to train the speech recognition model to be trained, so as to obtain a speech recognition model; the speech recognition model to be trained comprises a link time sequence classification sub-model and an attention mechanism-based sub-model.
In this embodiment, in order to train a voice recognition model with higher recognition accuracy on the current voice sample set, a model in which a CTC (Connectionist Temporal Classification, i.e. link timing classification) model and an Attention model (i.e. attention-mechanism-based model) are decoded jointly may be employed. CTC decoding recognizes voice by predicting an output for each frame; the algorithm is built on the assumption that the frames are decoded independently, so the decoding process lacks the association between preceding and following voice features and relies on correction by a language model. The Attention decoding process is independent of the frame order of the input voice: each decoding unit generates the current result from the decoding result of the previous unit and the overall voice features, so the monotonic temporal order of the voice is ignored during decoding. A hybrid model can therefore be adopted to combine the advantages of both. Typically, the link timing classification sub-model is arranged closer to the input end for preliminary processing, and the attention-mechanism-based sub-model is arranged closer to the output end for subsequent processing. The network structure of the voice recognition model to be trained adopts structures such as LSTM/CNN/GRU, and the two decoders output the recognition result together.
In one embodiment, the geometry-based speech sample screening apparatus 100 further comprises:
and the data uplink unit is used for uploading a first model parameter set corresponding to the link time sequence classification sub-model and a second model parameter set corresponding to the attention mechanism-based sub-model in the voice recognition model to the blockchain network.
In this embodiment, the corresponding digest information is obtained from the first model parameter set and the second model parameter set; specifically, the digest information is obtained by hashing the first model parameter set and the second model parameter set, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency to users. A user device can download the digest information from the blockchain in order to verify whether the first model parameter set and the second model parameter set have been tampered with.
The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
And the voice recognition result sending unit 160 is configured to, if the current voice data to be recognized uploaded by the user terminal is detected, input the voice feature corresponding to the current voice data to be recognized into the voice recognition model for operation, obtain and send the current voice recognition result to the user terminal.
In this embodiment, after training of the speech recognition model is completed on the server, the model can be applied to speech recognition. The server receives the current speech data to be recognized uploaded by the user terminal, inputs the speech features corresponding to that data into the speech recognition model for computation, and obtains and sends the current speech recognition result to the user terminal, so that the result can be fed back quickly.
In one embodiment, the voice recognition result transmitting unit 160 includes:
the first decoding unit is used for inputting the speech features corresponding to the current speech data to be recognized into the connectionist temporal classification sub-model for computation to obtain a first recognition sequence;
and the second decoding unit is used for inputting the first recognition sequence into the attention-mechanism-based sub-model for computation to obtain and send the current speech recognition result to the user terminal.
In this embodiment, since the connectionist temporal classification sub-model is placed closer to the input end and the attention-mechanism-based sub-model closer to the output end, the current speech data to be recognized is first input into the connectionist temporal classification sub-model for computation to obtain a first recognition sequence, and the first recognition sequence is then input into the attention-mechanism-based sub-model for computation to obtain the current speech recognition result. In this way the association between preceding and following speech features during decoding is fully considered, the monotonic temporal order of speech is also taken into account, and the result recognized by the model is more accurate.
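A minimal decoding sketch, reusing the assumed HybridCTCAttention example above (greedy CTC first pass, then the attention branch; beam search and length handling are omitted), might look like the following; all names are assumptions carried over from that sketch.

# Sketch of the two-pass flow: CTC greedy decoding produces the first recognition
# sequence; the attention branch then re-decodes conditioned on it and on the
# encoded utterance.
import torch

def ctc_greedy_decode(ctc_logits: torch.Tensor, blank: int = 0) -> list:
    """Collapse repeated frame labels and drop blanks: (T, V) logits -> token ids."""
    ids = ctc_logits.argmax(dim=-1).tolist()
    seq, prev = [], blank
    for i in ids:
        if i != blank and i != prev:
            seq.append(i)
        prev = i
    return seq

@torch.no_grad()
def recognize(model, feats: torch.Tensor) -> list:
    """feats: (1, T, feat_dim) features of the current speech data to be recognized."""
    enc, _ = model.encoder(feats)
    first_sequence = ctc_greedy_decode(model.ctc_head(enc)[0])          # first pass (CTC)
    dec, _ = model.decoder(model.embed(torch.tensor([first_sequence])))
    ctx, _ = model.attn(dec, enc, enc)                                  # second pass (attention)
    return model.out(ctx)[0].argmax(dim=-1).tolist()                    # current recognition result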
The device realizes the training of the speech recognition model by automatically selecting samples with lower redundancy, which reduces the labeling cost of the speech recognition task in a deep-learning setting and increases the training speed of the speech recognition model.
The above-described geometry-based speech sample screening apparatus may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 is a server, and the server may be a stand-alone server or a server cluster formed by a plurality of servers.
With reference to FIG. 4, the computer device 500 includes a processor 502, memory, and a network interface 505, connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a geometry-based speech sample screening method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of a computer program 5032 in the non-volatile storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a geometry-based speech sample screening method.
The network interface 505 is used for network communication, such as providing transmission of data information. It will be appreciated by those skilled in the art that the architecture shown in fig. 4 is merely a block diagram of part of the architecture relevant to the present solution and does not constitute a limitation on the computer device 500 to which the present solution is applied; a particular computer device 500 may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
The processor 502 is configured to execute a computer program 5032 stored in a memory to implement the geometry-based speech sample screening method disclosed in the embodiment of the present invention.
Those skilled in the art will appreciate that the embodiment of the computer device shown in fig. 4 does not limit the specific construction of the computer device; in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the computer device may include only a memory and a processor, in which case the structure and function of the memory and the processor are consistent with the embodiment shown in fig. 4 and are not described again.
It should be appreciated that in an embodiment of the invention the processor 502 may be a central processing unit (Central Processing Unit, CPU); the processor 502 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a non-volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program when executed by a processor implements the geometry-based speech sample screening method disclosed by the embodiments of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein. Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the units is merely a logical function division, there may be another division manner in actual implementation, or units having the same function may be integrated into one unit, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units may be stored in a storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the technical solution of the present invention is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (9)

1. A geometry-based speech sample screening method, comprising:
acquiring an initial voice sample set, and extracting voice characteristics corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice characteristic set; wherein the initial voice sample set comprises a plurality of pieces of initial voice sample data;
acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and carrying out K-means clustering according to the Euclidean distances among the voice features to obtain a clustering result;
invoking a preset sample subset screening condition, and acquiring a cluster meeting the sample subset screening condition in the clustering result to form a target cluster set;
Obtaining a labeling value corresponding to each voice feature in the target cluster set to obtain a current voice sample set corresponding to the target cluster set;
taking each voice feature in the current voice sample set as the input of a voice recognition model to be trained and the labeling value corresponding to each voice feature as the output of the voice recognition model to be trained, so as to train the voice recognition model to be trained and obtain a voice recognition model; the voice recognition model to be trained comprises a connectionist temporal classification sub-model and an attention-mechanism-based sub-model; and
if the current voice data to be recognized uploaded by the user side is detected, inputting the voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation, and obtaining and sending a current voice recognition result to the user side;
the obtaining an initial voice sample set, extracting voice features corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set includes: performing data preprocessing and feature extraction on an initial voice sample set to obtain voice features corresponding to each piece of initial voice sample data in the initial voice sample set so as to form a voice feature set; the data preprocessing comprises pre-emphasis, framing and windowing;
K-means clustering is carried out according to Euclidean distance among the voice features to obtain a clustering result, and the method comprises the following steps:
selecting, from the voice feature set, voice features equal in number to the preset number of clusters, and taking the selected voice features as the initial clustering center of each cluster;
dividing the voice feature set according to Euclidean distance between each voice feature in the voice feature set and each initial clustering center to obtain an initial clustering result;
acquiring an adjusted clustering center of each cluster according to the initial clustering result;
dividing the voice feature set according to the Euclidean distance between each voice feature and each adjusted clustering center, and repeating until the clustering result remains unchanged for more than a preset number of times, so as to obtain clusters equal in number to the preset number of clusters and form the clustering result;
the step of calling a preset sample subset screening condition, and obtaining the cluster meeting the sample subset screening condition in the clustering result to form a target cluster set, comprises the following steps: setting a sample subset screening condition that the sample redundancy is the minimum value of a plurality of sample subsets, and screening out target cluster clusters to form a target cluster set; the redundancy of the samples of a certain cluster is calculated, and the degree of data repetition is calculated.
2. The geometry-based voice sample screening method according to claim 1, wherein the obtaining an initial voice sample set and extracting voice features corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set comprises:
invoking a pre-stored sampling period to sample each piece of initial voice sample data in the initial voice sample set respectively to obtain a current discrete voice signal corresponding to each piece of initial voice sample data;
invoking a prestored first-order FIR high-pass digital filter to respectively pre-emphasis the current discrete voice signals corresponding to each piece of initial voice sample data to obtain current pre-emphasis voice signals corresponding to each piece of initial voice sample data;
invoking a pre-stored Hamming window to window the current pre-emphasis voice signal corresponding to each piece of initial voice sample data respectively to obtain windowed voice data corresponding to each piece of initial voice sample data;
invoking a pre-stored frame shift and frame length to respectively frame the windowed voice data corresponding to each piece of initial voice sample data to obtain pre-processed voice data corresponding to each piece of initial voice sample data;
And respectively extracting Mel frequency cepstrum coefficients or filter banks from the preprocessed voice data corresponding to each piece of initial voice sample data to obtain voice features corresponding to each piece of initial voice sample data so as to form a voice feature set.
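For illustration only: the 0.97 pre-emphasis coefficient, 25 ms window, 10 ms frame shift and the use of librosa below are assumptions of this sketch rather than values fixed by claim 2, and framing and Hamming windowing are delegated to librosa's internal STFT.

# Sketch of the preprocessing and feature-extraction chain of claim 2:
# sampling -> first-order FIR pre-emphasis -> framing/Hamming windowing -> MFCC.
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    signal, _ = librosa.load(wav_path, sr=sr)          # discrete speech signal at the sampling rate
    # First-order FIR high-pass pre-emphasis: y[t] = x[t] - 0.97 * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    mfcc = librosa.feature.mfcc(
        y=emphasized, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),        # 25 ms frame length
        hop_length=int(0.010 * sr),   # 10 ms frame shift
        window="hamming",             # Hamming windowing per frame
    )
    return mfcc.T                     # (n_frames, n_mfcc) voice feature matrix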
3. The geometry-based speech sample screening method according to claim 1, wherein the obtaining the Euclidean distance between each speech feature in the speech feature set by the dynamic time warping algorithm comprises:
acquiring the i-th voice feature and the j-th voice feature in the voice feature set; wherein the voice feature set comprises N voice features, the values of i and j range from 1 to N, and i and j are not equal;
judging whether the number of frames of the first voice sequence corresponding to the ith voice feature is equal to the number of frames of the second voice sequence corresponding to the jth voice feature;
if the number of frames of the first voice sequence corresponding to the i-th voice feature is not equal to the number of frames of the second voice sequence corresponding to the j-th voice feature, constructing an n×m distance matrix D, and acquiring the minimum value of the matrix elements in the distance matrix D to serve as the Euclidean distance between the i-th voice feature and the j-th voice feature; wherein n is equal to the number of frames of the first voice sequence, m is equal to the number of frames of the second voice sequence, and D(x, y) in the distance matrix D represents the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature.
4. The geometry-based speech sample screening method according to claim 3, wherein after determining whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, the method further comprises:
and if the number of frames of the first voice sequence corresponding to the ith voice feature is equal to the number of frames of the second voice sequence corresponding to the jth voice feature, calculating to obtain the Euclidean distance between the ith voice feature and the jth voice feature.
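As an illustrative reading of claims 3 and 4 only: the accumulated-cost recursion below is the standard dynamic time warping formulation and is an assumption about how the minimum over the distance matrix D is taken.

# Sketch: direct Euclidean distance when the frame counts are equal (claim 4), and a
# standard DTW accumulated-cost computation over the n x m distance matrix D otherwise.
import numpy as np

def dtw_distance(feat_i: np.ndarray, feat_j: np.ndarray) -> float:
    """feat_i: (n, d) frames of the i-th voice feature; feat_j: (m, d) frames of the j-th."""
    n, m = len(feat_i), len(feat_j)
    if n == m:                                            # equal frame counts: frame-wise distance
        return float(np.linalg.norm(feat_i - feat_j, axis=1).sum())
    # D(x, y): Euclidean distance between frame x of feature i and frame y of feature j
    D = np.linalg.norm(feat_i[:, None, :] - feat_j[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            # minimum accumulated cost over the three admissible warping steps
            acc[x, y] = D[x - 1, y - 1] + min(acc[x - 1, y], acc[x, y - 1], acc[x - 1, y - 1])
    return float(acc[n, m])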
5. The geometry-based voice sample screening method according to claim 1, wherein the inputting the voice features corresponding to the current voice data to be recognized into the voice recognition model for operation to obtain and send the current voice recognition result to the user terminal comprises:
inputting the voice features corresponding to the current voice data to be recognized into a connectionist temporal classification sub-model for operation to obtain a first recognition sequence;
and inputting the first recognition sequence into an attention-mechanism-based sub-model for operation to obtain and send a current voice recognition result to the user terminal.
6. The geometry-based speech sample screening method of claim 1, further comprising:
and uploading a first model parameter set corresponding to the connectionist temporal classification sub-model and a second model parameter set corresponding to the attention-mechanism-based sub-model in the voice recognition model to the blockchain network.
7. A geometry-based speech sample screening device, comprising:
the voice feature extraction unit is used for obtaining an initial voice sample set and extracting voice features corresponding to each piece of initial voice sample data in the initial voice sample set so as to form a voice feature set; wherein the initial voice sample set comprises a plurality of pieces of initial voice sample data;
the voice feature clustering unit is used for acquiring Euclidean distances among the voice features in the voice feature set through a dynamic time warping algorithm, and K-means clustering is carried out according to the Euclidean distances among the voice features so as to obtain a clustering result;
the clustering result screening unit is used for calling preset sample subset screening conditions, and obtaining clustering clusters meeting the sample subset screening conditions in the clustering result to form a target clustering cluster set;
the labeling value acquisition unit is used for acquiring a labeling value corresponding to each voice feature in the target cluster set so as to obtain a current voice sample set corresponding to the target cluster set;
The voice recognition model training unit is used for taking each voice feature in the current voice sample set as the input of the voice recognition model to be trained and the labeling value corresponding to each voice feature as the output of the voice recognition model to be trained, so as to train the voice recognition model to be trained and obtain a voice recognition model; the voice recognition model to be trained comprises a connectionist temporal classification sub-model and an attention-mechanism-based sub-model; and
the voice recognition result sending unit is used for inputting voice characteristics corresponding to the current voice data to be recognized into the voice recognition model for operation if the current voice data to be recognized uploaded by the user terminal is detected, so as to obtain and send the current voice recognition result to the user terminal;
the voice feature extraction unit is further used for carrying out data preprocessing and feature extraction on the initial voice sample set to obtain voice features corresponding to each piece of initial voice sample data in the initial voice sample set so as to form a voice feature set; the data preprocessing comprises pre-emphasis, framing and windowing;
the speech feature clustering unit includes:
the initial cluster center acquisition unit is used for selecting, from the voice feature set, voice features equal in number to the preset number of clusters, and taking the selected voice features as the initial clustering center of each cluster;
The initial clustering unit is used for dividing the voice feature set according to Euclidean distance between each voice feature in the voice feature set and each initial clustering center to obtain an initial clustering result;
the cluster center adjusting unit is used for acquiring an adjusted cluster center of each cluster according to the initial cluster result;
the clustering adjustment unit is used for dividing the voice feature set according to the Euclidean distance between each voice feature and each adjusted clustering center, and repeating until the clustering result remains unchanged for more than a preset number of times, so as to obtain clusters equal in number to the preset number of clusters and form the clustering result;
the clustering result screening unit is further used for setting the sample subset screening condition to be that the sample redundancy is the minimum among the sample subsets, and screening out the target clusters to form the target cluster set; wherein the sample redundancy of a cluster is obtained by calculating the degree of data repetition within that cluster.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the geometry-based speech sample screening method according to any one of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the geometry-based speech sample screening method according to any of claims 1 to 6.
CN202011387398.0A 2020-12-01 2020-12-01 Speech sample screening method and device based on geometry and computer equipment Active CN112530409B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011387398.0A CN112530409B (en) 2020-12-01 2020-12-01 Speech sample screening method and device based on geometry and computer equipment
PCT/CN2021/083934 WO2022116442A1 (en) 2020-12-01 2021-03-30 Speech sample screening method and apparatus based on geometry, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011387398.0A CN112530409B (en) 2020-12-01 2020-12-01 Speech sample screening method and device based on geometry and computer equipment

Publications (2)

Publication Number Publication Date
CN112530409A CN112530409A (en) 2021-03-19
CN112530409B true CN112530409B (en) 2024-01-23

Family

ID=74996045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011387398.0A Active CN112530409B (en) 2020-12-01 2020-12-01 Speech sample screening method and device based on geometry and computer equipment

Country Status (2)

Country Link
CN (1) CN112530409B (en)
WO (1) WO2022116442A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment
CN113345424B (en) * 2021-05-31 2024-02-27 平安科技(深圳)有限公司 Voice feature extraction method, device, equipment and storage medium
CN115146716A (en) * 2022-06-22 2022-10-04 腾讯科技(深圳)有限公司 Labeling method, device, equipment, storage medium and program product
CN114863939B (en) * 2022-07-07 2022-09-13 四川大学 Panda attribute identification method and system based on sound
CN116825169B (en) * 2023-08-31 2023-11-24 悦芯科技股份有限公司 Abnormal memory chip detection method based on test equipment
CN117334186B (en) * 2023-10-13 2024-04-30 北京智诚鹏展科技有限公司 Speech recognition method and NLP platform based on machine learning

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065028A (en) * 2018-06-11 2018-12-21 平安科技(深圳)有限公司 Speaker clustering method, device, computer equipment and storage medium
CN109657711A (en) * 2018-12-10 2019-04-19 广东浪潮大数据研究有限公司 A kind of image classification method, device, equipment and readable storage medium storing program for executing
CN110648671A (en) * 2019-08-21 2020-01-03 广州国音智能科技有限公司 Voiceprint model reconstruction method, terminal, device and readable storage medium
CN110929771A (en) * 2019-11-15 2020-03-27 北京达佳互联信息技术有限公司 Image sample classification method and device, electronic equipment and readable storage medium
CN110931043A (en) * 2019-12-06 2020-03-27 湖北文理学院 Integrated speech emotion recognition method, device, equipment and storage medium
CN111046947A (en) * 2019-12-10 2020-04-21 成都数联铭品科技有限公司 Training system and method of classifier and identification method of abnormal sample
CN111179914A (en) * 2019-12-04 2020-05-19 华南理工大学 Voice sample screening method based on improved dynamic time warping algorithm
CN111554270A (en) * 2020-04-29 2020-08-18 北京声智科技有限公司 Training sample screening method and electronic equipment
CN111950294A (en) * 2020-07-24 2020-11-17 北京奇保信安科技有限公司 Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment
CN111966798A (en) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 Intention identification method and device based on multi-round K-means algorithm and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9691395B1 (en) * 2011-12-31 2017-06-27 Reality Analytics, Inc. System and method for taxonomically distinguishing unconstrained signal data segments
CN111813905B (en) * 2020-06-17 2024-05-10 平安科技(深圳)有限公司 Corpus generation method, corpus generation device, computer equipment and storage medium
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065028A (en) * 2018-06-11 2018-12-21 平安科技(深圳)有限公司 Speaker clustering method, device, computer equipment and storage medium
CN109657711A (en) * 2018-12-10 2019-04-19 广东浪潮大数据研究有限公司 A kind of image classification method, device, equipment and readable storage medium storing program for executing
CN110648671A (en) * 2019-08-21 2020-01-03 广州国音智能科技有限公司 Voiceprint model reconstruction method, terminal, device and readable storage medium
CN110929771A (en) * 2019-11-15 2020-03-27 北京达佳互联信息技术有限公司 Image sample classification method and device, electronic equipment and readable storage medium
CN111179914A (en) * 2019-12-04 2020-05-19 华南理工大学 Voice sample screening method based on improved dynamic time warping algorithm
CN110931043A (en) * 2019-12-06 2020-03-27 湖北文理学院 Integrated speech emotion recognition method, device, equipment and storage medium
CN111046947A (en) * 2019-12-10 2020-04-21 成都数联铭品科技有限公司 Training system and method of classifier and identification method of abnormal sample
CN111554270A (en) * 2020-04-29 2020-08-18 北京声智科技有限公司 Training sample screening method and electronic equipment
CN111950294A (en) * 2020-07-24 2020-11-17 北京奇保信安科技有限公司 Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment
CN111966798A (en) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 Intention identification method and device based on multi-round K-means algorithm and electronic equipment

Also Published As

Publication number Publication date
WO2022116442A1 (en) 2022-06-09
CN112530409A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN112530409B (en) Speech sample screening method and device based on geometry and computer equipment
US10679643B2 (en) Automatic audio captioning
JP7242520B2 (en) visually aided speech processing
US20220068255A1 (en) Speech Recognition Using Unspoken Text and Speech Synthesis
CN110534099A (en) Voice wakes up processing method, device, storage medium and electronic equipment
CN107871496B (en) Speech recognition method and device
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN113035231B (en) Keyword detection method and device
US20230093746A1 (en) Video loop recognition
CN112837669B (en) Speech synthesis method, device and server
CN111192576A (en) Decoding method, speech recognition device and system
CN111653274B (en) Wake-up word recognition method, device and storage medium
US20030144837A1 (en) Collaboration of multiple automatic speech recognition (ASR) systems
CN112599127B (en) Voice instruction processing method, device, equipment and storage medium
CN109637527A (en) The semantic analytic method and system of conversation sentence
CN114360493A (en) Speech synthesis method, apparatus, medium, computer device and program product
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN113823257B (en) Speech synthesizer construction method, speech synthesis method and device
CN113113048B (en) Speech emotion recognition method and device, computer equipment and medium
CN117859173A (en) Speech recognition with speech synthesis based model adaptation
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
CN114203151A (en) Method, device and equipment for training speech synthesis model
CN114333772A (en) Speech recognition method, device, equipment, readable storage medium and product
CN116074574A (en) Video processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant