WO2022116442A1 - Speech sample screening method and apparatus based on geometry, and computer device and storage medium - Google Patents

Speech sample screening method and apparatus based on geometry, and computer device and storage medium Download PDF

Info

Publication number
WO2022116442A1
Authority
WO
WIPO (PCT)
Prior art keywords: voice, speech, feature, sample, initial
Prior art date
Application number
PCT/CN2021/083934
Other languages
French (fr)
Chinese (zh)
Inventor
罗剑
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022116442A1 publication Critical patent/WO2022116442A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques where the extracted parameters are the cepstrum

Definitions

  • the present application relates to the technical field of speech semantics of artificial intelligence, and in particular, to a method, device, computer equipment and storage medium for screening speech samples based on geometry.
  • DNN (Deep Neural Network)
  • Active learning is a branch of machine learning that allows the model to choose the data to learn on its own.
  • the idea of active learning comes from the assumption that a machine learning algorithm, if it can choose the data it wants to learn from, will perform better with less training data.
  • the most widely used active learning query strategy is called uncertainty sampling (Uncertainty Sampling), in which the model will select the most uncertain samples predicted by the model for labeling.
  • The inventors realized that this technique achieves good results when the number of selected samples is small. However, when a deep neural network is used as the training model, a large amount of training data is required, and as the number of selected labeled samples grows, the samples the model predicts as uncertain become redundant and overlapping, so similar samples are more likely to be selected. Selecting these similar samples is of limited help for model training.
  • Moreover, voice data differs from non-sequential data such as images: it has variable length and rich structural information, which makes voice data more difficult to process and select.
  • The embodiments of the present application provide a geometry-based voice sample screening method, apparatus, computer device, and storage medium, aiming to solve the problems in the prior art that, when uncertainty sampling is applied to training neural networks for speech recognition, the samples the model predicts as uncertain are redundant and overlapping, such similar samples are of limited help for model training, and the complex structure of speech makes it difficult for uncertainty sampling to select speech samples.
  • an embodiment of the present application provides a method for screening speech samples based on geometry, which includes:
  • Obtaining an initial voice sample set, and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set, to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
  • wherein the to-be-trained speech recognition model includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model;
  • if currently to-be-recognized voice data uploaded by a client is detected, the voice feature corresponding to the current to-be-recognized voice data is input into the speech recognition model for computation, and the current speech recognition result is obtained and sent to the client.
  • an embodiment of the present application provides a geometry-based voice sample screening device, which includes:
  • a voice feature extraction unit, configured to obtain an initial voice sample set, and extract the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set includes a plurality of pieces of initial voice sample data;
  • a voice feature clustering unit configured to obtain the Euclidean distance between each voice feature in the voice feature set through a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distance between each voice feature to obtain a clustering result;
  • a clustering result screening unit configured to invoke preset sample subset screening conditions, and obtain clusters in the clustering results that satisfy the sample subset screening conditions, so as to form a target cluster set;
  • a label value obtaining unit configured to obtain a label value corresponding to each voice feature in the target cluster set, so as to obtain a current voice sample set corresponding to the target cluster set;
  • a speech recognition model training unit, configured to use each speech feature in the current speech sample set as the input of the to-be-trained speech recognition model, and the label value corresponding to each speech feature as its output, so as to train the to-be-trained speech recognition model and obtain a speech recognition model; wherein the to-be-trained speech recognition model includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model; and
  • a speech recognition result sending unit, configured to, if currently to-be-recognized speech data uploaded by a user terminal is detected, input the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtain and send the current speech recognition result to the user terminal.
  • an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the following steps:
  • Obtaining an initial voice sample set, and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set, to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
  • wherein the to-be-trained speech recognition model includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model;
  • if currently to-be-recognized voice data uploaded by a client is detected, the voice feature corresponding to the current to-be-recognized voice data is input into the speech recognition model for computation, and the current speech recognition result is obtained and sent to the client.
  • embodiments of the present application further provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
  • Obtaining an initial voice sample set, and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set, to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
  • wherein the to-be-trained speech recognition model includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model;
  • if currently to-be-recognized voice data uploaded by a client is detected, the voice feature corresponding to the current to-be-recognized voice data is input into the speech recognition model for computation, and the current speech recognition result is obtained and sent to the client.
  • The embodiments of the present application provide a geometry-based method, apparatus, computer device and storage medium for selecting voice samples, which realize automatic selection of samples with low redundancy to train a speech recognition model, reduce the labeling cost of speech recognition tasks in the deep-learning context, and improve the training speed of the speech recognition model.
  • FIG. 1 is a schematic diagram of an application scenario of the geometry-based voice sample screening method provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a geometry-based voice sample screening method provided by an embodiment of the present application
  • FIG. 3 is a schematic block diagram of a geometry-based voice sample screening apparatus provided by an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • Referring to FIG. 1 and FIG. 2, FIG. 1 is a schematic diagram of an application scenario of the geometry-based voice sample screening method provided by an embodiment of this application, and FIG. 2 is a schematic flowchart of the method. The geometry-based voice sample screening method is applied in a server, and the method is executed by application software installed in the server.
  • the method includes steps S110-S160.
  • Specifically, data preprocessing and feature extraction may be performed on the initial speech sample set to obtain the voice feature corresponding to each piece of initial speech sample data, thereby forming a voice feature set.
  • Data preprocessing includes pre-emphasis, framing, windowing and other operations. These operations aim to eliminate the effects on speech quality of aliasing, higher-order harmonic distortion, high-frequency attenuation and other factors caused by the human vocal organs and by defects in the acquisition equipment, so that the resulting signal is as uniform and smooth as possible.
  • step S110 includes:
  • Mel-frequency cepstral coefficient (MFCC) or filter-bank feature extraction is performed on the preprocessed speech data corresponding to each piece of initial speech sample data, to obtain the speech feature corresponding to each piece of initial speech sample data, forming a speech feature set.
  • The initial speech sample data (denoted as s(t)) is first sampled with a sampling period T and thereby discretized into s(n).
  • The first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter, and its transfer function is of the standard pre-emphasis form shown in formula (1): H(z) = 1 - a·z⁻¹, where a is the pre-emphasis coefficient, typically close to 1.
  • Assuming the time-domain signal corresponding to the windowed speech data is x(l), and the n-th frame of speech data in the windowed and framed preprocessed speech data is xn(m), then xn(m) satisfies formula (3): xn(m) = ω(m)·x(n·T + m), 0 ≤ m ≤ N-1, where T is the frame shift, N is the frame length, and ω(n) is the Hamming window function, ω(n) = 0.54 - 0.46·cos(2πn/(N-1)).
  • By preprocessing the initial speech sample data in this way, it can be effectively used for subsequent acoustic parameter extraction, such as extracting Mel-frequency cepstral coefficients (MFCC) or filter-bank (FBank) features.
  • the speech feature corresponding to each piece of initial speech sample data can be obtained to form a speech feature set.
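For illustration only, the preprocessing and feature extraction pipeline just described can be sketched as follows in Python; the pre-emphasis coefficient, frame length and frame shift are assumed typical values rather than values specified by the embodiment, and MFCC extraction is delegated to librosa:

```python
import numpy as np
import librosa

def preprocess(signal, sample_rate, a=0.97, frame_ms=25, shift_ms=10):
    """Pre-emphasis, windowing and framing for one utterance.
    a, frame_ms and shift_ms are illustrative defaults, not values
    taken from the embodiment."""
    # Pre-emphasis: y(n) = s(n) - a*s(n-1), i.e. H(z) = 1 - a*z^-1 (formula (1))
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    N = int(sample_rate * frame_ms / 1000)   # frame length
    T = int(sample_rate * shift_ms / 1000)   # frame shift
    num_frames = 1 + (len(emphasized) - N) // T  # assumes len(signal) >= N
    w = np.hamming(N)  # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    # x_n(m) = w(m) * x(n*T + m), 0 <= m <= N-1 (formula (3))
    return np.stack([w * emphasized[n * T:n * T + N] for n in range(num_frames)])

def extract_feature(signal, sample_rate, n_mfcc=13):
    """MFCC feature as one possible speech feature; a filter-bank feature
    (librosa.feature.melspectrogram) would be the alternative."""
    return librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc).T
```

Each element of the resulting speech feature set is then a (frames x dimensions) array whose number of frames varies with utterance length, which is why the pairwise distances in step S120 are computed with dynamic time warping.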
  • To quantify the similarity between two pieces of initial speech sample data, the Euclidean distance between their corresponding speech features can be calculated.
  • In an embodiment, obtaining the Euclidean distance between the speech features in the speech feature set through the dynamic time warping algorithm in step S120 includes:
  • obtaining the i-th voice feature and the j-th voice feature in the voice feature set; wherein the voice feature set includes N voice features, the value ranges of i and j are both [1, N], and i and j are not equal;
  • if the number of frames of the first voice sequence corresponding to the i-th voice feature is not equal to the number of frames of the second voice sequence corresponding to the j-th voice feature, constructing a distance matrix D of size n*m and obtaining the minimum cumulative value over the matrix elements of D as the Euclidean distance between the i-th voice feature and the j-th voice feature; wherein n equals the number of frames of the first voice sequence, m equals the number of frames of the second voice sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature.
  • the method for calculating the Euclidean distance between any two voice features in the voice feature set is described by taking the calculation of the ith voice feature and the jth voice feature in the voice feature set as an example.
  • The above calculation is repeated over all pairs and stops once the Euclidean distance between every two speech features in the speech feature set has been obtained.
  • To calculate the Euclidean distance between any two speech features, it is first determined whether their numbers of frames are equal (for example, whether the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature). If the numbers of frames are not equal, a distance matrix D of size n*m is constructed, where n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and the matrix element d(x, y) denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature. The shortest path from d(0, 0) to d(n, m) is then found, subject to continuity and time monotonicity (no backtracking), and the length of this path is taken as the distance between the i-th speech feature and the j-th speech feature.
  • The above calculation process adopts the dynamic time warping (Dynamic Time Warping, DTW) algorithm.
  • In an embodiment, after judging whether the number of frames of the first voice sequence corresponding to the i-th voice feature is equal to the number of frames of the second voice sequence corresponding to the j-th voice feature, step S120 further includes: if the numbers of frames are equal, directly calculating the Euclidean distance between the i-th voice feature and the j-th voice feature.
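A minimal sketch of this distance computation, assuming each speech feature is a (frames x dimensions) NumPy array; the recurrence below is the standard DTW accumulation, and when the two sequences have equal length the frame-wise Euclidean distances are simply summed directly:

```python
import numpy as np

def dtw_distance(feat_i, feat_j):
    """DTW distance between two variable-length feature sequences.
    feat_i: (n, d) array; feat_j: (m, d) array."""
    n, m = len(feat_i), len(feat_j)
    if n == m:
        # Equal frame counts: direct frame-wise Euclidean distance.
        return float(np.linalg.norm(feat_i - feat_j, axis=1).sum())
    # D[x, y] holds the cheapest accumulated cost of a path ending at (x, y);
    # only forward moves are allowed (continuity, no backtracking).
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            d = np.linalg.norm(feat_i[x - 1] - feat_j[y - 1])  # d(x, y)
            D[x, y] = d + min(D[x - 1, y], D[x, y - 1], D[x - 1, y - 1])
    return float(D[n, m])
```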
  • In an embodiment, performing K-means clustering according to the Euclidean distance between the speech features in step S120 to obtain a clustering result includes:
  • selecting, in the voice feature set, the same number of voice features as the preset number of clusters, and using the selected voice features as the initial cluster center of each cluster;
  • dividing the voice feature set according to the Euclidean distance between each voice feature in the voice feature set and each initial cluster center, to obtain an initial clustering result;
  • obtaining the adjusted cluster center of each cluster according to the initial clustering result; and
  • re-dividing the voice feature set according to the Euclidean distance to the adjusted cluster centers, until the clustering result remains unchanged for more than a preset number of iterations, to obtain the clusters corresponding to the preset number of clusters, forming the clustering result.
  • the speech feature set can be clustered by the K-means clustering method at this time, and the specific process is as follows:
  • a) Arbitrarily select N2 voice features from the voice feature set containing N1 voice features and use them as the initial cluster centers of N2 clusters (N2 < N1, where N1 is the initial total number of voice features in the voice feature set and N2 is the preset number of clusters, i.e., the expected number of clusters); b) divide the voice feature set according to the Euclidean distance between each voice feature and each cluster center; c) adjust the cluster center of each cluster according to the division result; d) re-divide the voice feature set according to the adjusted cluster centers.
  • Step d) is repeated until the clustering result no longer changes, and the clustering result corresponding to the preset number of clusters is obtained.
  • the speech feature set can be quickly grouped to obtain multiple clusters.
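As an illustrative sketch only: because the speech features have variable length, cluster centers cannot be obtained by simple averaging, so the center update below keeps a medoid (the member with minimum total DTW distance to its cluster). This medoid choice is an assumption, since the embodiment does not detail how adjusted centers are computed; dtw_distance is the function from the sketch above.

```python
import random

def kmeans_dtw(features, n_clusters, max_iter=20, seed=0):
    """K-means-style clustering of variable-length speech features."""
    rng = random.Random(seed)
    centers = rng.sample(range(len(features)), n_clusters)   # step a)
    assignment = None
    for _ in range(max_iter):
        # steps b)/d): assign each feature to its nearest cluster center
        new_assignment = [
            min(range(n_clusters),
                key=lambda c, f=f: dtw_distance(f, features[centers[c]]))
            for f in features
        ]
        if new_assignment == assignment:   # clustering result unchanged
            break
        assignment = new_assignment
        # step c): move each center to the medoid of its cluster
        for c in range(n_clusters):
            members = [i for i, a in enumerate(assignment) if a == c]
            if members:
                centers[c] = min(members, key=lambda i: sum(
                    dtw_distance(features[i], features[j]) for j in members))
    return assignment, centers
```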
  • the server can select clusters that meet the conditions from multiple clusters as training samples and label them.
  • the sample subset screening condition may be set such that the sample redundancy is the minimum value among the multiple sample subsets, so that target clusters can be filtered out to form a target cluster set.
  • The sample redundancy of a cluster measures its degree of data repetition. For example, if the total number of data items in a certain sample subset is Y1 and the number of repeated data items is Y2, then the sample redundancy of this sample subset is Y2/Y1. Selecting a subset of samples with lower redundancy can significantly reduce the labeling cost of speech recognition tasks in the deep-learning context.
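A sketch of this redundancy measure; what counts as a repeated item is not spelled out in the embodiment, so the duplicate test is passed in as a caller-supplied predicate (for example, a DTW distance below some threshold):

```python
def sample_redundancy(cluster, is_duplicate):
    """Redundancy = Y2 / Y1: Y1 is the total number of samples in the
    cluster, Y2 the number that repeat an earlier sample according to
    the is_duplicate predicate."""
    y1 = len(cluster)
    y2 = sum(
        any(is_duplicate(cluster[i], cluster[j]) for j in range(i))
        for i in range(1, y1)
    )
    return y2 / y1 if y1 else 0.0

# Screening condition: e.g. keep the cluster(s) with minimum redundancy.
# target = min(clusters, key=lambda c: sample_redundancy(c, is_dup))
```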
  • Since the target cluster set has been selected, only a small number of samples need to be labeled at this stage; that is, the current speech sample set corresponding to the target cluster set is obtained. Using less labeled data can significantly improve the training speed of the speech recognition model and reduce the computational pressure on the speech processing system.
  • The speech recognition model to be trained includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model.
  • That is, a hybrid of a CTC (Connectionist Temporal Classification) model and an Attention model (i.e., a model based on an attention mechanism) can be used.
  • CTC decoding recognizes speech by predicting the output of each frame. The algorithm is implemented under the assumption that the decoding of each frame is independent, so it lacks the connection between preceding and following speech features during decoding and relies on a language model for correction.
  • the Attention decoding process has nothing to do with the frame order of the input speech.
  • Each decoding unit generates the current result through the decoding result of the previous unit and the overall speech characteristics.
  • Its decoding process, however, ignores the monotonic timing of speech, so a hybrid model can be used that takes advantage of both approaches.
  • The CTC sub-model is placed closer to the input for preliminary processing, and the attention-based sub-model is placed closer to the output for subsequent processing.
  • The network structure of the speech recognition model to be trained adopts structures such as LSTM, CNN, or GRU, and the two decoders jointly output the recognition result.
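The embodiment does not disclose concrete layer configurations, so the following PyTorch sketch is only one plausible arrangement of the hybrid design: a shared LSTM encoder, a CTC head on the input side, and an attention-based decoder toward the output. All sizes (hidden width, heads, vocabulary) and the use of decoder states as attention queries are assumptions:

```python
import torch
import torch.nn as nn

class HybridCTCAttention(nn.Module):
    """Sketch of the hybrid CTC/attention model; not the application's
    exact architecture."""

    def __init__(self, feat_dim=13, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, vocab)   # per-frame CTC logits
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab)        # attention decoder logits

    def forward(self, feats, dec_states):
        # feats: (B, T, feat_dim); dec_states: (B, U, 2*hidden) decoder queries
        enc, _ = self.encoder(feats)
        ctc_logits = self.ctc_head(enc)                # first pass, frame level
        ctx, _ = self.attn(dec_states, enc, enc)       # attend over all frames
        attn_logits = self.out(ctx)                    # second pass, label level
        return ctc_logits, attn_logits
```

In training, a common recipe for such hybrids is to combine nn.CTCLoss on the frame-level logits with cross-entropy on the attention decoder's outputs, so that both decoders jointly shape the shared encoder.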
  • In an embodiment, after step S150, the method further includes: uploading the first model parameter set corresponding to the CTC sub-model and the second model parameter set corresponding to the attention-based sub-model in the speech recognition model to a blockchain network.
  • the corresponding summary information is obtained based on the first model parameter set and the second model parameter set.
  • Specifically, the summary information is obtained by hashing the first model parameter set and the second model parameter set, for example, using the SHA-256 algorithm.
  • Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users.
  • the user equipment can download the summary information from the blockchain, so as to verify whether the first model parameter set and the second model parameter set have been tampered with.
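A sketch of producing such summary information, assuming the parameter sets are PyTorch tensors; the deterministic serialization chosen here is an assumption, as the embodiment only states that hashing (e.g. SHA-256) is applied:

```python
import hashlib
import json

def parameter_digest(named_params):
    """SHA-256 digest of a model parameter set.
    named_params: iterable of (name, tensor) pairs, e.g.
    model.named_parameters() for the CTC or attention sub-model."""
    payload = json.dumps(
        {name: p.detach().cpu().numpy().tolist() for name, p in named_params},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# first_digest = parameter_digest(ctc_submodel.named_parameters())
# second_digest = parameter_digest(attention_submodel.named_parameters())
```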
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • After the training of the speech recognition model is completed in the server, the model can be applied to speech recognition. That is, once the server detects the current voice data to be recognized uploaded by the client, it inputs the voice feature corresponding to the current voice data to be recognized into the speech recognition model for computation, and obtains and sends the current speech recognition result to the client. In this way, the current speech recognition result can be fed back quickly.
  • step S160 includes:
  • inputting the speech feature corresponding to the currently to-be-recognized speech data into the CTC sub-model for computation to obtain a first recognition sequence; and inputting the first recognition sequence into the attention-based sub-model for computation, to obtain and send the current speech recognition result to the user terminal.
  • Since the CTC sub-model is placed closer to the input end and the attention-based sub-model closer to the output end, the currently to-be-recognized speech data is first input into the CTC sub-model for computation to obtain a first recognition sequence, and the first recognition sequence is then input into the attention-based sub-model for computation to obtain the current speech recognition result.
  • In this way, the relationship between preceding and following speech features is fully considered during decoding, as is the monotonic timing of speech, so the results obtained by this model are more accurate.
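For illustration, the first recognition sequence could be derived from the CTC sub-model's per-frame outputs as below; greedy decoding with blank/repeat collapsing is an assumption, since the embodiment does not specify the decoding strategy:

```python
import torch

def ctc_first_pass(ctc_logits, blank=0):
    """Collapse per-frame CTC predictions into a first recognition
    sequence: argmax per frame, merge repeats, drop blanks.
    ctc_logits: (T, vocab) tensor for a single utterance."""
    ids = ctc_logits.argmax(dim=-1).tolist()
    seq, prev = [], blank
    for i in ids:
        if i != blank and i != prev:
            seq.append(i)
        prev = i
    return seq
```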
  • This method realizes the automatic selection of samples with less redundancy to train the speech recognition model, reduces the labeling cost of speech recognition tasks in the context of deep learning, and improves the training speed of the speech recognition model.
  • An embodiment of the present application further provides a geometry-based voice sample screening apparatus, which is used to perform any of the foregoing embodiments of the geometry-based voice sample screening method.
  • FIG. 3 is a schematic block diagram of the apparatus for screening speech samples based on geometry provided by an embodiment of the present application.
  • the geometry-based voice sample screening apparatus 100 may be configured in a server.
  • The geometry-based voice sample screening device 100 includes: a voice feature extraction unit 110, a voice feature clustering unit 120, a clustering result screening unit 130, a label value obtaining unit 140, a speech recognition model training unit 150, and a speech recognition result sending unit 160.
  • The voice feature extraction unit 110 is configured to obtain an initial voice sample set, and extract the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set includes a plurality of pieces of initial voice sample data.
  • Specifically, data preprocessing and feature extraction may be performed on the initial speech sample set to obtain the voice feature corresponding to each piece of initial speech sample data, thereby forming a voice feature set.
  • Data preprocessing includes pre-emphasis, framing, windowing and other operations. These operations aim to eliminate the effects on speech quality of aliasing, higher-order harmonic distortion, high-frequency attenuation and other factors caused by the human vocal organs and by defects in the acquisition equipment, so that the resulting signal is as uniform and smooth as possible.
  • the speech feature extraction unit 110 includes:
  • a discrete sampling unit configured to call a pre-stored sampling period to sample each piece of initial voice sample data in the initial voice sample set, to obtain a current discrete voice signal corresponding to each piece of initial voice sample data;
  • a pre-emphasis unit, used to call the pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal corresponding to each piece of initial voice sample data, to obtain the current pre-emphasized voice signal corresponding to each piece of initial voice sample data;
  • a windowing unit, used to call the pre-stored Hamming window to window the current pre-emphasized voice signal corresponding to each piece of initial voice sample data, to obtain the windowed voice data corresponding to each piece of initial voice sample data;
  • a framing unit, used to call the pre-stored frame shift and frame length to divide the windowed voice data corresponding to each piece of initial voice sample data into frames, to obtain the preprocessed voice data corresponding to each piece of initial voice sample data;
  • a feature extraction unit, used for performing Mel-frequency cepstral coefficient or filter-bank feature extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, to obtain the speech feature corresponding to each piece of initial speech sample data, forming a speech feature set.
  • The initial speech sample data (denoted as s(t)) is first sampled with a sampling period T and thereby discretized into s(n).
  • the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter, and its transfer function is as shown in the above formula (1).
  • Assuming the time-domain signal corresponding to the windowed speech data is x(l), and the n-th frame of speech data in the windowed and framed preprocessed speech data is xn(m), then xn(m) satisfies the above formula (3).
  • By preprocessing the initial speech sample data in this way, it can be effectively used for subsequent acoustic parameter extraction, such as extracting Mel-frequency cepstral coefficients (MFCC) or filter-bank (FBank) features.
  • the speech feature corresponding to each piece of initial speech sample data can be obtained to form a speech feature set.
  • the voice feature clustering unit 120 is used to obtain the Euclidean distance between each voice feature in the voice feature set through a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distance between each voice feature to obtain a clustering result .
  • To quantify the similarity between two pieces of initial speech sample data, the Euclidean distance between their corresponding speech features can be calculated.
  • the speech feature clustering unit 120 includes:
  • the voice feature selection unit is used to obtain the ith voice feature and the jth voice feature in the voice feature set; wherein, the voice feature set includes N voice features, and the value ranges of i and j are both [1, N], and i and j are not equal;
  • a voice sequence frame number comparison unit used for judging whether the first voice sequence frame number corresponding to the ith voice feature is equal to the second voice sequence frame number corresponding to the jth voice feature;
  • a first calculation unit, used for, if the number of frames of the first voice sequence corresponding to the i-th voice feature is not equal to the number of frames of the second voice sequence corresponding to the j-th voice feature, constructing a distance matrix D of size n*m and obtaining the minimum cumulative value over the matrix elements of D as the Euclidean distance between the i-th voice feature and the j-th voice feature; wherein n equals the number of frames of the first voice sequence, m equals the number of frames of the second voice sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature.
  • the method for calculating the Euclidean distance between any two voice features in the voice feature set is described by taking the calculation of the ith voice feature and the jth voice feature in the voice feature set as an example.
  • The above calculation is repeated over all pairs and stops once the Euclidean distance between every two speech features in the speech feature set has been obtained.
  • To calculate the Euclidean distance between any two speech features, it is first determined whether their numbers of frames are equal (for example, whether the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature). If the numbers of frames are not equal, a distance matrix D of size n*m is constructed, where n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and the matrix element d(x, y) denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature. The shortest path from d(0, 0) to d(n, m) is then found, subject to continuity and time monotonicity (no backtracking), and the length of this path is taken as the distance between the i-th speech feature and the j-th speech feature.
  • The above calculation process adopts the dynamic time warping (Dynamic Time Warping, DTW) algorithm.
  • the voice feature clustering unit 120 further includes:
  • a second calculation unit, configured to directly calculate the Euclidean distance between the i-th speech feature and the j-th speech feature if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature.
  • the speech feature clustering unit 120 includes:
  • the initial cluster center acquisition unit is used to select the same number of voice features as the preset number of clusters in the voice feature set, and use the selected voice feature as the initial cluster center of each cluster;
  • the initial clustering unit is used to divide the voice feature set according to the Euclidean distance between each voice feature in the voice feature set and each initial cluster center to obtain an initial clustering result
  • the cluster center adjustment unit is used to obtain the adjusted cluster center of each cluster according to the initial clustering result
  • a clustering adjustment unit, used to re-divide the voice feature set according to the Euclidean distance to the adjusted cluster centers, until the clustering result remains unchanged for more than a preset number of iterations, to obtain the clusters corresponding to the preset number of clusters, forming the clustering result.
  • the speech feature set can be clustered by the K-means clustering method at this time. After the cluster classification is completed, the speech feature set can be quickly grouped to obtain multiple clusters. After that, the server can select clusters that meet the conditions from multiple clusters as training samples and label them.
  • the clustering result screening unit 130 is configured to invoke preset sample subset screening conditions, and acquire clusters in the clustering results that satisfy the sample subset screening conditions, so as to form a target cluster set.
  • the sample subset screening condition may be set such that the sample redundancy is the minimum value among the multiple sample subsets, so that target clusters can be filtered out to form a target cluster set.
  • The sample redundancy of a cluster measures its degree of data repetition. For example, if the total number of data items in a certain sample subset is Y1 and the number of repeated data items is Y2, then the sample redundancy of this sample subset is Y2/Y1. Selecting a subset of samples with lower redundancy can significantly reduce the labeling cost of speech recognition tasks in the deep-learning context.
  • the label value obtaining unit 140 is configured to obtain label values corresponding to each speech feature in the target cluster set, so as to obtain a current speech sample set corresponding to the target cluster set.
  • Since the target cluster set has been selected, only a small number of samples need to be labeled at this stage; that is, the current speech sample set corresponding to the target cluster set is obtained. Using less labeled data can significantly improve the training speed of the speech recognition model and reduce the computational pressure on the speech processing system.
  • The speech recognition model training unit 150 is configured to use each speech feature in the current speech sample set as the input of the to-be-trained speech recognition model, and the label value corresponding to each speech feature as its output, so as to train the to-be-trained speech recognition model and obtain a speech recognition model; wherein the to-be-trained speech recognition model includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model.
  • That is, a hybrid of a CTC (Connectionist Temporal Classification) model and an Attention model (i.e., a model based on an attention mechanism) can be used.
  • CTC decoding recognizes speech by predicting the output of each frame. The algorithm is implemented under the assumption that the decoding of each frame is independent, so it lacks the connection between preceding and following speech features during decoding and relies on a language model for correction.
  • the Attention decoding process has nothing to do with the frame order of the input speech.
  • Each decoding unit generates the current result through the decoding result of the previous unit and the overall speech characteristics.
  • Its decoding process, however, ignores the monotonic timing of speech, so a hybrid model can be used that takes advantage of both approaches.
  • The CTC sub-model is placed closer to the input for preliminary processing, and the attention-based sub-model is placed closer to the output for subsequent processing.
  • The network structure of the speech recognition model to be trained adopts structures such as LSTM, CNN, or GRU, and the two decoders jointly output the recognition result.
  • the geometry-based voice sample screening apparatus 100 further includes:
  • a data uploading unit, used for uploading the first model parameter set corresponding to the CTC sub-model and the second model parameter set corresponding to the attention-based sub-model in the speech recognition model to a blockchain network.
  • the corresponding summary information is obtained based on the first model parameter set and the second model parameter set.
  • Specifically, the summary information is obtained by hashing the first model parameter set and the second model parameter set, for example, using the SHA-256 algorithm.
  • Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users.
  • the user equipment can download the summary information from the blockchain, so as to verify whether the first model parameter set and the second model parameter set have been tampered with.
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • The speech recognition result sending unit 160 is configured to, if the current speech data to be recognized uploaded by the user terminal is detected, input the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtain and send the current speech recognition result to the user terminal.
  • After the training of the speech recognition model is completed in the server, the model can be applied to speech recognition. That is, once the server detects the current voice data to be recognized uploaded by the client, it inputs the voice feature corresponding to the current voice data to be recognized into the speech recognition model for computation, and obtains and sends the current speech recognition result to the client. In this way, the current speech recognition result can be fed back quickly.
  • the speech recognition result sending unit 160 includes:
  • a first decoding unit, used for inputting the speech feature corresponding to the currently to-be-recognized speech data into the CTC sub-model for computation to obtain a first recognition sequence;
  • a second decoding unit, configured to input the first recognition sequence into the attention-based sub-model for computation, and to obtain and send the current speech recognition result to the user terminal.
  • Since the CTC sub-model is placed closer to the input end and the attention-based sub-model closer to the output end, the currently to-be-recognized speech data is first input into the CTC sub-model for computation to obtain a first recognition sequence, and the first recognition sequence is then input into the attention-based sub-model for computation to obtain the current speech recognition result.
  • In this way, the relationship between preceding and following speech features is fully considered during decoding, as is the monotonic timing of speech, so the results obtained by this model are more accurate.
  • the device realizes the automatic selection of samples with less redundancy to train the speech recognition model, reduces the labeling cost of speech recognition tasks in the context of deep learning, and improves the training speed of the speech recognition model.
  • the above-mentioned apparatus for screening speech samples based on geometry can be implemented in the form of a computer program, and the computer program can be executed on a computer device as shown in FIG. 4 .
  • FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • the computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 500 includes a processor 502 , a memory and a network interface 505 connected through a system bus 501 , wherein the memory may include a non-volatile storage medium 503 and an internal memory 504 .
  • the nonvolatile storage medium 503 can store an operating system 5031 and a computer program 5032 .
  • the computer program 5032 when executed, can cause the processor 502 to perform a geometry-based voice sample screening method.
  • the processor 502 is used to provide computing and control capabilities to support the operation of the entire computer device 500 .
  • The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when executed by the processor 502, the computer program 5032 can cause the processor 502 to perform the geometry-based voice sample screening method.
  • the network interface 505 is used for network communication, such as providing transmission of data information.
  • FIG. 4 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
  • the processor 502 is configured to run the computer program 5032 stored in the memory to implement the geometry-based voice sample screening method disclosed in the embodiment of the present application.
  • The embodiment of the computer device shown in FIG. 4 does not constitute a limitation on the specific structure of the computer device; in other embodiments, the computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are the same as those of the embodiment shown in FIG. 4 , and details are not repeated here.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • a computer-readable storage medium may be a non-volatile computer-readable storage medium, or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, where the computer program implements the geometry-based voice sample screening method disclosed in the embodiments of the present application when the computer program is executed by the processor.
  • The disclosed device, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • The division of the units is only a logical functional division; in actual implementation there may be other division manners. For example, units with the same function may be grouped into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium.
  • The technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, an optical disk, or other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech sample screening method and apparatus (100) based on geometry, and a computer device (500) and a storage medium, which relate to artificial intelligence technology. The method comprises: acquiring an initial speech sample set, and extracting a speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to constitute a speech feature set (S110); acquiring a Euclidean distance between speech features in the speech feature set by means of a dynamic time warping algorithm, so as to perform K-means clustering to obtain a clustering result (S120); calling a preset sample subset screening condition, and acquiring, from the clustering result, a cluster that meets the sample subset screening condition, so as to constitute a target cluster set (S130); and acquiring, from the target cluster set, an annotated value corresponding to each speech feature, so as to obtain a current speech sample set corresponding to the target cluster set (S140). Samples with a relatively small redundancy are automatically selected for the training of a speech recognition model, thereby reducing the annotation cost of a speech recognition task in a deep learning background, and improving the training speed of a speech recognition model.

Description

基于几何学的语音样本筛选方法、装置、计算机设备及存储介质Geometry-based voice sample screening method, device, computer equipment and storage medium
本申请要求于2020年12月1日提交中国专利局、申请号为202011387398.0,发明名称为“基于几何学的语音样本筛选方法、装置及计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed on December 1, 2020 with the application number 202011387398.0 and the invention titled "Geometry-based voice sample screening method, device and computer equipment", the entire content of which is approved by Reference is incorporated in this application.
技术领域technical field
本申请涉及人工智能的语音语义技术领域,尤其涉及一种基于几何学的语音样本筛选方法、装置、计算机设备及存储介质。The present application relates to the technical field of speech semantics of artificial intelligence, and in particular, to a method, device, computer equipment and storage medium for screening speech samples based on geometry.
背景技术Background technique
近年来,随着深度神经网络技术(Deep Neural Network,DNN)在信号处理领域的巨大成功,基于DNN的语音识别算法愈来成为研究热点,然而通过监督学习的方式训练语音识别的DNN通常需要大量的带有标注的语音数据。虽然随着感知设备的发展和推广,未标注的语音数据变得更加容易获取。但是给未标注的语音数据人工打上标注仍然需耗费大量的人力成本。In recent years, with the great success of Deep Neural Network (DNN) in the field of signal processing, DNN-based speech recognition algorithms have become a research hotspot. However, training DNNs for speech recognition through supervised learning usually requires a large number of Annotated speech data. Although with the development and promotion of perception devices, unlabeled speech data has become more accessible. However, manual labeling of unlabeled speech data still requires a lot of labor costs.
为了给未标注的语音数据打标注可采用主动学习技术,主动学习是机器学习的一个分支,它允许模型自行选择要学习的数据。主动学习的思想来源于一个假设∶即一个机器学习算法,如果能自行选择想要学习的数据,那么只用较少的训练数据,它将表现得更好。In order to label unlabeled speech data, active learning techniques can be used. Active learning is a branch of machine learning that allows the model to choose the data to learn on its own. The idea of active learning comes from the assumption that a machine learning algorithm, if it can choose the data it wants to learn from, will perform better with less training data.
最广泛使用的主动学习查询策略叫做不确定性采样(Uncertainty Sampling),在该项技术中模型将选择模型预测最不确定的样本进行标注。发明人意识到该技术在样本选择数量较小的情况下取得了良好的效果,但是在使用深度神经网络作为训练模型的背景下,模型需要大量的训练数据,随着选择的标注样本数量的增长,模型预测不确定的样本就会有冗余和重叠,更加容易选择到相似的样本。然而选择出这些相似的样本对模型训练的帮助是十分有限的。The most widely used active learning query strategy is called uncertainty sampling (Uncertainty Sampling), in which the model will select the most uncertain samples predicted by the model for labeling. The inventors realized that this technique achieves good results with a small number of selected samples, but in the context of using a deep neural network as a training model, the model requires a large amount of training data, and as the number of selected labeled samples grows , the model predicts uncertain samples will have redundancy and overlap, and it is easier to select similar samples. However, selecting these similar samples is of limited help in model training.
而且,语音数据不同于图片等非序列数据,语音数据具长度不定,结构化信息丰富等特点,对于语音数据的处理和选择难度会更大。Moreover, voice data is different from non-sequential data such as pictures. Voice data has the characteristics of variable length and rich structured information, which makes it more difficult to process and select voice data.
发明内容SUMMARY OF THE INVENTION
本申请实施例提供了一种基于几何学的语音样本筛选方法、装置、计算机设备及存储介质,旨在解决现有技术中通过不确定性采样技术应用于语音识别用途的神经网络的训练时,模型预测不确定的样本就会有冗余和重叠,这些相似的样本对模型训练的帮助有限,而且因语音结构复杂导致不确定性采样技术选择出语音样本的难度较大的的问题。The embodiments of the present application provide a geometry-based voice sample screening method, device, computer equipment, and storage medium, aiming to solve the problem of using the uncertainty sampling technology in the training of a neural network for voice recognition in the prior art. The uncertain samples predicted by the model will have redundancy and overlap. These similar samples are of limited help for model training. Moreover, due to the complex structure of speech, it is difficult for uncertain sampling techniques to select speech samples.
第一方面,本申请实施例提供了一种基于几何学的语音样本筛选方法,其包括:In a first aspect, an embodiment of the present application provides a method for screening speech samples based on geometry, which includes:
获取初始语音样本集,提取所述初始语音样本集中每一条初始语音样本数据对应的语音特征,以组成语音特征集;其中,所述初始语音样本集中包括多条初始语音样本数据;Obtaining an initial voice sample set, extracting the voice features corresponding to each piece of initial voice sample data in the initial voice sample set, to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
通过动态时间规整算法获取所述语音特征集中各语音特征之间的欧式距离,根据各语音特征之间的欧式距离进行K-means聚类,以得到聚类结果;Obtain the Euclidean distance between the voice features in the voice feature set by a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distance between the voice features to obtain a clustering result;
调用预设的样本子集筛选条件,获取所述聚类结果中满足所述样本子集筛选条件的聚类簇,以组成目标聚类簇集合;Calling the preset sample subset screening conditions, and obtaining the cluster clusters that satisfy the sample subset screening conditions in the clustering result, so as to form a target cluster cluster set;
获取所述目标聚类簇集合中每一语音特征对应的标注值,以得到与所述目标聚类簇集合对应的当前语音样本集;Obtain the label value corresponding to each voice feature in the target cluster set to obtain the current voice sample set corresponding to the target cluster set;
将所述当前语音样本集中每一语音特征作为待训练语音识别模型的输入,将每一语音特征对应的标注值作为待训练语音识别模型的输出以对待训练语音识别模型进行训练,得到语音识别模型;其中,所述待训练语音识别模型中包括链接时序分类子模型和基于注意力机制子模型;以及Taking each speech feature in the current speech sample set as the input of the speech recognition model to be trained, and using the corresponding label value of each speech feature as the output of the speech recognition model to be trained to train the speech recognition model to be trained to obtain a speech recognition model ; Wherein, the to-be-trained speech recognition model includes a link timing classification sub-model and an attention-based mechanism sub-model; and
若检测到用户端上传的当前待识别语音数据,将所述当前待识别语音数据对应的语音特征输入至所述语音识别模型进行运算,得到并向用户端发送当前语音识别结果。If the currently to-be-recognized voice data uploaded by the client is detected, the voice feature corresponding to the current to-be-recognized voice data is input into the voice recognition model for operation, and the current voice recognition result is obtained and sent to the client.
In a second aspect, an embodiment of the present application provides a geometry-based speech sample screening apparatus, which includes:
a speech feature extraction unit, configured to obtain an initial speech sample set and extract the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set, where the initial speech sample set includes multiple pieces of initial speech sample data;
a speech feature clustering unit, configured to obtain the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distances between the speech features, so as to obtain a clustering result;
a clustering result screening unit, configured to invoke the preset sample subset screening conditions and obtain, from the clustering result, the clusters that satisfy the sample subset screening conditions, so as to form a target cluster set;
a label value obtaining unit, configured to obtain the label value corresponding to each speech feature in the target cluster set, so as to obtain a current speech sample set corresponding to the target cluster set;
a speech recognition model training unit, configured to use each speech feature in the current speech sample set as an input of a speech recognition model to be trained and use the label value corresponding to each speech feature as an output of the speech recognition model to be trained, so as to train the speech recognition model to be trained and obtain a speech recognition model, where the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-mechanism-based sub-model; and
a speech recognition result sending unit, configured to: if current to-be-recognized speech data uploaded by a user terminal is detected, input the speech feature corresponding to the current to-be-recognized speech data into the speech recognition model for computation, and obtain and send a current speech recognition result to the user terminal.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the following steps when executing the computer program:
obtaining an initial speech sample set, and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set, where the initial speech sample set includes multiple pieces of initial speech sample data;
obtaining the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances between the speech features, so as to obtain a clustering result;
invoking the preset sample subset screening conditions, and obtaining, from the clustering result, the clusters that satisfy the sample subset screening conditions, so as to form a target cluster set;
obtaining the label value corresponding to each speech feature in the target cluster set, so as to obtain a current speech sample set corresponding to the target cluster set;
using each speech feature in the current speech sample set as an input of a speech recognition model to be trained, and using the label value corresponding to each speech feature as an output of the speech recognition model to be trained, so as to train the speech recognition model to be trained and obtain a speech recognition model, where the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-mechanism-based sub-model; and
if current to-be-recognized speech data uploaded by a user terminal is detected, inputting the speech feature corresponding to the current to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the user terminal.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
obtaining an initial speech sample set, and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set, where the initial speech sample set includes multiple pieces of initial speech sample data;
obtaining the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances between the speech features, so as to obtain a clustering result;
invoking the preset sample subset screening conditions, and obtaining, from the clustering result, the clusters that satisfy the sample subset screening conditions, so as to form a target cluster set;
obtaining the label value corresponding to each speech feature in the target cluster set, so as to obtain a current speech sample set corresponding to the target cluster set;
using each speech feature in the current speech sample set as an input of a speech recognition model to be trained, and using the label value corresponding to each speech feature as an output of the speech recognition model to be trained, so as to train the speech recognition model to be trained and obtain a speech recognition model, where the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-mechanism-based sub-model; and
if current to-be-recognized speech data uploaded by a user terminal is detected, inputting the speech feature corresponding to the current to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the user terminal.
The embodiments of the present application provide a geometry-based speech sample screening method, apparatus, computer device, and storage medium, which automatically select samples with low redundancy for training a speech recognition model, reduce the labeling cost of speech recognition tasks in the context of deep learning, and increase the training speed of the speech recognition model.
Description of Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the geometry-based speech sample screening method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the geometry-based speech sample screening method provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of the geometry-based speech sample screening apparatus provided by an embodiment of the present application;
FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

It should be understood that, when used in this specification and the appended claims, the terms "include" and "comprise" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or sets thereof.

It should also be understood that the terminology used in this specification of the present application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should further be understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the geometry-based speech sample screening method provided by an embodiment of the present application, and FIG. 2 is a schematic flowchart of the geometry-based speech sample screening method provided by an embodiment of the present application. The geometry-based speech sample screening method is applied in a server and is executed by application software installed in the server.

As shown in FIG. 2, the method includes steps S110 to S160.
S110: obtain an initial speech sample set, and extract the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set, where the initial speech sample set includes multiple pieces of initial speech sample data.

In this embodiment, to train the speech recognition model in the server with fewer labeled samples, data preprocessing and feature extraction may first be performed on the initial speech sample set to obtain the speech feature corresponding to each piece of initial speech sample data, so as to form the speech feature set. Data preprocessing includes operations such as pre-emphasis, framing, and windowing. The purpose of these operations is to eliminate the effects on speech signal quality of aliasing, higher-order harmonic distortion, high-frequency components, and other factors caused by defects in the human vocal organs themselves and in the acquisition equipment, making the obtained signal as uniform and smooth as possible.
In an embodiment, step S110 includes:
sampling each piece of initial speech sample data in the initial speech sample set with a pre-stored sampling period, so as to obtain the current discrete speech signal corresponding to each piece of initial speech sample data;
pre-emphasizing the current discrete speech signal corresponding to each piece of initial speech sample data with a pre-stored first-order FIR high-pass digital filter, so as to obtain the current pre-emphasized speech signal corresponding to each piece of initial speech sample data;
windowing the current pre-emphasized speech signal corresponding to each piece of initial speech sample data with a pre-stored Hamming window, so as to obtain the windowed speech data corresponding to each piece of initial speech sample data;
framing the windowed speech data corresponding to each piece of initial speech sample data with a pre-stored frame shift and frame length, so as to obtain the preprocessed speech data corresponding to each piece of initial speech sample data;
performing Mel-frequency cepstral coefficient extraction or filter-bank extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, so as to obtain the speech feature corresponding to each piece of initial speech sample data and form the speech feature set.
In this embodiment, before the speech signal is processed digitally, the initial speech sample data (denoted as s(t)) is first sampled with the sampling period T and discretized into s(n).

Then, when the pre-stored first-order FIR high-pass digital filter is invoked, the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter, and its transfer function is given by Equation (1):

H(z) = 1 − a·z^(−1)    (1)

In a specific implementation, a takes the value 0.98. For example, let the sample value of the current discrete speech signal at time n be x(n); the sample value corresponding to x(n) in the current pre-emphasized speech signal after pre-emphasis is y(n) = x(n) − a·x(n−1).
Afterwards, the invoked Hamming window function is given by Equation (2):

ω(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1    (2)

The current pre-emphasized speech signal is windowed with this Hamming window, and the resulting windowed speech data can be expressed as Q(n) = y(n)·ω(n).
Finally, when the pre-stored frame shift and frame length are invoked to frame the windowed speech data, let the time-domain signal corresponding to the windowed speech data be x(l); then the n-th frame of speech data in the preprocessed speech data after windowing and framing is x_n(m), and x_n(m) satisfies Equation (3):

x_n(m) = ω(m)·x(n+m), 0 ≤ m ≤ N−1    (3)

where n = 0, T, 2T, ..., N is the frame length, T is the frame shift, and ω(·) is the Hamming window function.

Preprocessing the initial speech sample data in this way prepares it for subsequent acoustic parameter extraction, for example extraction of Mel-frequency cepstral coefficients (MFCC) or of filter-bank (Fbank) features; after extraction, the speech feature corresponding to each piece of initial speech sample data is obtained, forming the speech feature set.
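As a concrete illustration, the following is a minimal Python sketch of the preprocessing chain described above (pre-emphasis with a = 0.98, Hamming windowing, and framing). The function name and the frame parameters are illustrative assumptions, and the window is applied per frame, which is the conventional realization of Equations (2) and (3):

```python
import numpy as np

def preprocess(signal, a=0.98, frame_len=400, frame_shift=160):
    """Pre-emphasis, Hamming windowing, and framing of a sampled signal.

    A minimal sketch of the preprocessing in step S110; `signal` is the
    discretized s(n), `a` the pre-emphasis coefficient of Equation (1).
    The signal is assumed to be at least one frame long.
    """
    # Pre-emphasis, per Equation (1): y(n) = x(n) - a*x(n-1)
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # Hamming window, Equation (2): w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = np.hamming(frame_len)

    # Framing, Equation (3): one frame starts every `frame_shift` samples
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames  # each row is one windowed frame
```

The rows returned by this sketch are what MFCC or filter-bank extraction would then operate on.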
S120: obtain the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distances between the speech features, so as to obtain a clustering result.

In this embodiment, since the pieces of initial speech sample data in the initial speech sample set differ from one another, the difference between two pieces of initial speech sample data can be quantified as the Euclidean distance between their corresponding speech features.

In most cases, however, the lengths of any two pieces of initial speech sample data are unequal; in the field of speech processing this shows up as different people speaking at different rates. Even when the same person utters the same sound at different times, the durations will not be exactly equal. Moreover, different speakers pronounce the phonemes of the same word at different speeds: some draw out the "E" sound slightly, others shorten the "o". In such a complex situation, the traditional Euclidean distance cannot accurately capture the similarity between two pieces of initial speech sample data. In this case, the Euclidean distance between the speech features in the speech feature set can be obtained by means of a dynamic time warping algorithm.
In an embodiment, obtaining the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm in step S120 includes:
obtaining the i-th speech feature and the j-th speech feature in the speech feature set, where the speech feature set includes N speech features, the value ranges of i and j are both [1, N], and i is not equal to j;
determining whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature;
if the number of frames of the first speech sequence corresponding to the i-th speech feature is not equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, constructing an n*m distance matrix D, and obtaining the minimum value over the matrix elements of the distance matrix D as the Euclidean distance between the i-th speech feature and the j-th speech feature, where n is equal to the number of frames of the first speech sequence, m is equal to the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
In this embodiment, the computation of the Euclidean distance between any two speech features in the speech feature set is illustrated by computing the distance between the i-th speech feature and the j-th speech feature; the computation is repeated until the Euclidean distances between all speech features in the speech feature set have been obtained.

When computing the Euclidean distance between any two speech features, it is first determined whether their numbers of speech sequence frames are equal (for example, whether the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature). If the numbers of frames are not equal, an n*m distance matrix D needs to be constructed, and the minimum value over the matrix elements of the distance matrix D is obtained as the Euclidean distance between the i-th speech feature and the j-th speech feature, where n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.

For example, the matrix element d(x, y) denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature. A shortest path from d(0, 0) to d(n, m) is found, and the length of that path is taken as the distance between the i-th speech feature and the j-th speech feature; the path must satisfy continuity and temporal monotonicity (no backtracking). The above computation uses the dynamic time warping algorithm (Dynamic Time Warping, abbreviated DTW).
In an embodiment, after the determining in step S120 whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, the method further includes:
if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, directly computing the Euclidean distance between the i-th speech feature and the j-th speech feature.

In this embodiment, when it is determined that the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature, the two have the same duration, and the Euclidean distance between them can be computed directly, without going through the process of constructing the distance matrix D.
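To make the two cases above concrete, the following Python sketch (the function name and the array layout are illustrative assumptions) computes the distance between two speech features, each an array of shape (frames, dims): a direct Euclidean distance when the frame counts are equal, and otherwise the cost of the cheapest continuous, temporally monotone path through the frame-distance matrix D, in the spirit of DTW:

```python
import numpy as np

def dtw_distance(feat_i, feat_j):
    """Distance between two speech features as described in step S120."""
    n, m = len(feat_i), len(feat_j)
    if n == m:
        # Equal frame counts: direct Euclidean distance, no matrix needed.
        return float(np.linalg.norm(feat_i - feat_j))

    # d(x, y): Euclidean distance between frame x of feat_i and frame y of feat_j.
    D = np.linalg.norm(feat_i[:, None, :] - feat_j[None, :, :], axis=-1)

    # Accumulate the cheapest continuous, monotone path from d(0, 0) to d(n-1, m-1).
    acc = np.full((n, m), np.inf)
    acc[0, 0] = D[0, 0]
    for x in range(n):
        for y in range(m):
            if x == 0 and y == 0:
                continue
            prev = min(
                acc[x - 1, y] if x > 0 else np.inf,                # advance in feat_i
                acc[x, y - 1] if y > 0 else np.inf,                # advance in feat_j
                acc[x - 1, y - 1] if x > 0 and y > 0 else np.inf,  # advance in both
            )
            acc[x, y] = D[x, y] + prev
    return float(acc[-1, -1])
```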
In an embodiment, performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result in step S120 includes:
selecting, from the speech feature set, a number of speech features equal to the preset number of clusters, and using the selected speech features as the initial cluster centers of the respective clusters;
partitioning the speech feature set according to the Euclidean distance between each speech feature in the speech feature set and each initial cluster center, so as to obtain an initial clustering result;
obtaining the adjusted cluster center of each cluster according to the initial clustering result;
partitioning the speech feature set according to the Euclidean distances to the adjusted cluster centers until the clustering result stays unchanged for more than a preset number of iterations, so as to obtain clusters corresponding to the preset number of clusters and form the clustering result.
In this embodiment, since the Euclidean distance between the speech features can be computed with the dynamic time warping algorithm, the speech feature set can be clustered with the K-means clustering method. The specific process is as follows:

a) Arbitrarily select N2 speech features from the speech feature set of N1 speech features as the initial cluster centers of N2 clusters. Here the initial total number of speech features in the speech feature set is N1, from which N2 speech features are arbitrarily selected (N2 < N1, N2 being the preset number of clusters, that is, the desired number of clusters), and the initially selected N2 speech features serve as the initial cluster centers.

b) Compute the Euclidean distance from each of the remaining speech features to the N2 initial cluster centers, and assign each remaining speech feature to the cluster with the smallest Euclidean distance, so as to obtain the initial clustering result. That is, each remaining speech feature selects the initial cluster center nearest to it and is grouped with that center; in this way the speech features are partitioned into N2 clusters around the initially selected centers, each cluster of data having one initial cluster center.

c) Recompute the cluster center of each of the N2 clusters according to the initial clustering result.

d) Re-cluster all elements of the N1 speech features according to the new cluster centers.

e) Repeat step d) until the clustering result no longer changes, so as to obtain the clustering result corresponding to the preset number of clusters.

After the cluster classification is completed, the speech feature set can be grouped quickly into multiple clusters. The server can then select, from the multiple clusters, the clusters that satisfy the conditions as training samples and label them.
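A possible Python sketch of this clustering loop follows, reusing the dtw_distance function sketched above. Because averaging variable-length speech features is not defined, this sketch keeps each cluster center as an actual member feature (a medoid-style re-centering), which is an assumption about how the recomputed centers of step c) would be realized:

```python
import random

def kmeans_with_dtw(features, n_clusters, max_stable=3, seed=0):
    """K-means-style clustering over DTW distances (step S120 sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(range(len(features)), n_clusters)  # indices of initial centers
    prev, stable = None, 0
    while stable <= max_stable:
        # Assign every feature to the nearest current center.
        labels = [
            min(range(n_clusters), key=lambda c: dtw_distance(f, features[centers[c]]))
            for f in features
        ]
        # Re-center: pick the member minimizing total distance to its cluster.
        for c in range(n_clusters):
            members = [k for k, l in enumerate(labels) if l == c]
            if members:
                centers[c] = min(
                    members,
                    key=lambda k: sum(dtw_distance(features[k], features[o]) for o in members),
                )
        # Count how many consecutive iterations the assignment has stayed unchanged.
        stable = stable + 1 if labels == prev else 0
        prev = labels
    return labels, centers
```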
S130: invoke the preset sample subset screening conditions, and obtain, from the clustering result, the clusters that satisfy the sample subset screening conditions, so as to form a target cluster set.

In this embodiment, the sample subset screening condition can be set such that the sample redundancy is the minimum among the multiple sample subsets; in this way the target clusters are screened out to form the target cluster set. When the sample redundancy of a cluster is computed, what is computed is the degree of data repetition: for example, if the total number of data items in a sample subset is Y1 and the total number of repeated data items among them is Y2, the sample redundancy of that sample subset is Y2/Y1. Selecting the sample subset with lower redundancy significantly reduces the labeling cost of speech recognition tasks in the context of deep learning.
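Under one reading of this definition (a hypothetical sketch; the document does not fix exactly how repeats are counted, and the items are assumed here to be hashable identifiers of samples), the redundancy and the screening could be computed as follows:

```python
def sample_redundancy(subset):
    """Redundancy Y2/Y1 of one sample subset (cluster), as in step S130.

    Y1 is the total number of items; Y2 here counts the items that are
    repeats of an earlier item (one reading of "repeated data items").
    """
    y1 = len(subset)
    y2 = y1 - len(set(subset))  # repeats beyond the first occurrence
    return y2 / y1 if y1 else 0.0

def select_target_clusters(clusters):
    """Keep the cluster(s) whose redundancy equals the minimum observed."""
    redundancies = [sample_redundancy(c) for c in clusters]
    best = min(redundancies)
    return [c for c, r in zip(clusters, redundancies) if r == best]
```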
S140: obtain the label value corresponding to each speech feature in the target cluster set, so as to obtain the current speech sample set corresponding to the target cluster set.

In this embodiment, since the target cluster set has already been selected, only a small number of samples need to be labeled at this point to obtain the current speech sample set corresponding to the target cluster set. Using less labeled data significantly increases the training speed of the speech recognition model and reduces the computational load on the speech processing system.
S150: use each speech feature in the current speech sample set as an input of the speech recognition model to be trained, and use the label value corresponding to each speech feature as an output of the speech recognition model to be trained, so as to train the speech recognition model to be trained and obtain the speech recognition model, where the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-mechanism-based sub-model.

In this embodiment, to train a speech recognition model with higher recognition accuracy based on the current speech sample set, a model in which a CTC (Connectionist Temporal Classification) model and an Attention model (that is, an attention-mechanism-based model) decode jointly can be adopted. CTC decoding recognizes speech by predicting the output of each frame; the algorithm is implemented on the assumption that the decoding of each frame is independent, so it lacks the connection between preceding and following speech features during decoding and relies on correction by a language model. The Attention decoding process, by contrast, is independent of the frame order of the input speech: each decoding unit generates the current result from the decoding result of the previous unit and the overall speech features, but the decoding process ignores the monotonic temporal order of speech. A hybrid model can therefore be adopted that combines the advantages of both. Generally, the connectionist temporal classification sub-model is placed closer to the input for preliminary processing, and the attention-mechanism-based sub-model is placed closer to the output for subsequent processing. The network structure of the speech recognition model to be trained adopts structures such as LSTM/CNN/GRU, and the two decoders jointly output the recognition result.
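One common way to realize joint training of two such decoders is to interpolate their losses. The following PyTorch sketch is an assumption in that spirit, not the document's prescribed objective; the weight lam, the padding convention, and the tensor shapes are all illustrative:

```python
import torch.nn as nn

class HybridCtcAttentionLoss(nn.Module):
    """Weighted joint objective for a hybrid CTC/attention model (sketch)."""

    def __init__(self, lam=0.3, blank=0, pad_id=-100):
        super().__init__()
        self.lam = lam  # interpolation weight between the two losses (assumed)
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.att = nn.CrossEntropyLoss(ignore_index=pad_id)

    def forward(self, ctc_log_probs, in_lens, att_logits, targets, tgt_lens):
        # ctc_log_probs: (time, batch, vocab) log-probabilities from the CTC head.
        # Entries of `targets` beyond tgt_lens are ignored by CTCLoss; clamping
        # keeps the pad_id placeholder from producing invalid label indices.
        ctc_loss = self.ctc(ctc_log_probs, targets.clamp(min=0), in_lens, tgt_lens)
        # att_logits: (batch, tgt_len, vocab) from the attention decoder;
        # CrossEntropyLoss expects (batch, vocab, tgt_len).
        att_loss = self.att(att_logits.transpose(1, 2), targets)
        return self.lam * ctc_loss + (1.0 - self.lam) * att_loss
```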
In an embodiment, after step S150 the method further includes:
uploading the first model parameter set corresponding to the connectionist temporal classification sub-model and the second model parameter set corresponding to the attention-mechanism-based sub-model in the speech recognition model to a blockchain network.

In this embodiment, the corresponding digest information is obtained based on the first model parameter set and the second model parameter set. Specifically, the digest information is obtained by hashing the first model parameter set and the second model parameter set, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency to users. A user device can download the digest information from the blockchain to verify whether the first model parameter set and the second model parameter set have been tampered with.

The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
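A minimal sketch of the digest computation described above, using Python's standard hashlib; the JSON serialization with sorted keys is an assumption made only so that the digest is deterministic:

```python
import hashlib
import json

def model_digest(first_params, second_params):
    """SHA-256 digest over the two parameter sets, for tamper checking."""
    payload = json.dumps(
        {"ctc": first_params, "attention": second_params},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# A user device can recompute this digest and compare it with the one
# downloaded from the blockchain to detect tampering.
```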
S160: if current to-be-recognized speech data uploaded by a user terminal is detected, input the speech feature corresponding to the current to-be-recognized speech data into the speech recognition model for computation, and obtain and send a current speech recognition result to the user terminal.

In this embodiment, once the training of the speech recognition model has been completed in the server, the model can be applied specifically to speech recognition. That is, as soon as the server receives the current to-be-recognized speech data uploaded by the user terminal, it inputs the speech feature corresponding to the current to-be-recognized speech data into the speech recognition model for computation, and obtains and sends the current speech recognition result to the user terminal. In this way the current speech recognition result can be fed back rapidly.
In an embodiment, step S160 includes:
inputting the speech feature corresponding to the current to-be-recognized speech data into the connectionist temporal classification sub-model for computation, so as to obtain a first recognition sequence;
inputting the first recognition sequence into the attention-mechanism-based sub-model for computation, so as to obtain and send the current speech recognition result to the user terminal.

In this embodiment, since the connectionist temporal classification sub-model is placed closer to the input and the attention-mechanism-based sub-model is placed closer to the output, the current to-be-recognized speech data is first input into the connectionist temporal classification sub-model to obtain the first recognition sequence, and the first recognition sequence is then input into the attention-mechanism-based sub-model to obtain the current speech recognition result. This fully takes into account both the connection between preceding and following speech features during decoding and the monotonic temporal order of speech, so the result recognized by this model is more accurate.
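The two-stage decoding can be summarized by the following sketch, where the sub-model call signatures are purely illustrative assumptions:

```python
def recognize(features, ctc_submodel, attention_submodel):
    """Two-stage decoding sketch for step S160."""
    first_sequence = ctc_submodel(features)      # preliminary, frame-wise decode
    result = attention_submodel(first_sequence)  # context-aware refinement
    return result  # current speech recognition result, sent back to the user terminal
```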
This method automatically selects samples with low redundancy to train the speech recognition model, reduces the labeling cost of speech recognition tasks in the context of deep learning, and increases the training speed of the speech recognition model.
An embodiment of the present application further provides a geometry-based speech sample screening apparatus, which is configured to perform any embodiment of the foregoing geometry-based speech sample screening method. Specifically, please refer to FIG. 3, which is a schematic block diagram of the geometry-based speech sample screening apparatus provided by an embodiment of the present application. The geometry-based speech sample screening apparatus 100 may be configured in a server.

As shown in FIG. 3, the geometry-based speech sample screening apparatus 100 includes: a speech feature extraction unit 110, a speech feature clustering unit 120, a clustering result screening unit 130, a label value obtaining unit 140, a speech recognition model training unit 150, and a speech recognition result sending unit 160.
The speech feature extraction unit 110 is configured to obtain an initial speech sample set and extract the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set, where the initial speech sample set includes multiple pieces of initial speech sample data.

In this embodiment, to train the speech recognition model in the server with fewer labeled samples, data preprocessing and feature extraction may first be performed on the initial speech sample set to obtain the speech feature corresponding to each piece of initial speech sample data, so as to form the speech feature set. Data preprocessing includes operations such as pre-emphasis, framing, and windowing. The purpose of these operations is to eliminate the effects on speech signal quality of aliasing, higher-order harmonic distortion, high-frequency components, and other factors caused by defects in the human vocal organs themselves and in the acquisition equipment, making the obtained signal as uniform and smooth as possible.
In an embodiment, the speech feature extraction unit 110 includes:
a discrete sampling unit, configured to sample each piece of initial speech sample data in the initial speech sample set with a pre-stored sampling period, so as to obtain the current discrete speech signal corresponding to each piece of initial speech sample data;
a pre-emphasis unit, configured to pre-emphasize the current discrete speech signal corresponding to each piece of initial speech sample data with a pre-stored first-order FIR high-pass digital filter, so as to obtain the current pre-emphasized speech signal corresponding to each piece of initial speech sample data;
a windowing unit, configured to window the current pre-emphasized speech signal corresponding to each piece of initial speech sample data with a pre-stored Hamming window, so as to obtain the windowed speech data corresponding to each piece of initial speech sample data;
a framing unit, configured to frame the windowed speech data corresponding to each piece of initial speech sample data with a pre-stored frame shift and frame length, so as to obtain the preprocessed speech data corresponding to each piece of initial speech sample data;
a feature extraction unit, configured to perform Mel-frequency cepstral coefficient extraction or filter-bank extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, so as to obtain the speech feature corresponding to each piece of initial speech sample data and form the speech feature set.
In this embodiment, before the speech signal is processed digitally, the initial speech sample data (denoted as s(t)) is first sampled with the sampling period T and discretized into s(n).

Then, when the pre-stored first-order FIR high-pass digital filter is invoked, the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter, and its transfer function is as given in Equation (1) above.

In a specific implementation, a takes the value 0.98. For example, let the sample value of the current discrete speech signal at time n be x(n); the sample value corresponding to x(n) in the current pre-emphasized speech signal after pre-emphasis is y(n) = x(n) − a·x(n−1).

Afterwards, the invoked Hamming window function is as given in Equation (2) above; the current pre-emphasized speech signal is windowed with the Hamming window, and the resulting windowed speech data can be expressed as Q(n) = y(n)·ω(n).

Finally, when the pre-stored frame shift and frame length are invoked to frame the windowed speech data, let the time-domain signal corresponding to the windowed speech data be x(l); then the n-th frame of speech data in the preprocessed speech data after windowing and framing is x_n(m), and x_n(m) satisfies Equation (3) above.

Preprocessing the initial speech sample data in this way prepares it for subsequent acoustic parameter extraction, for example extraction of Mel-frequency cepstral coefficients (MFCC) or of filter-bank (Fbank) features; after extraction, the speech feature corresponding to each piece of initial speech sample data is obtained, forming the speech feature set.
The speech feature clustering unit 120 is configured to obtain the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distances between the speech features, so as to obtain a clustering result.

In this embodiment, since the pieces of initial speech sample data in the initial speech sample set differ from one another, the difference between two pieces of initial speech sample data can be quantified as the Euclidean distance between their corresponding speech features.

In most cases, however, the lengths of any two pieces of initial speech sample data are unequal; in the field of speech processing this shows up as different people speaking at different rates. Even when the same person utters the same sound at different times, the durations will not be exactly equal. Moreover, different speakers pronounce the phonemes of the same word at different speeds: some draw out the "E" sound slightly, others shorten the "o". In such a complex situation, the traditional Euclidean distance cannot accurately capture the similarity between two pieces of initial speech sample data. In this case, the Euclidean distance between the speech features in the speech feature set can be obtained by means of a dynamic time warping algorithm.
In an embodiment, the speech feature clustering unit 120 includes:
a speech feature selection unit, configured to obtain the i-th speech feature and the j-th speech feature in the speech feature set, where the speech feature set includes N speech features, the value ranges of i and j are both [1, N], and i is not equal to j;
a speech sequence frame count comparison unit, configured to determine whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature;
a first computation unit, configured to: if the number of frames of the first speech sequence corresponding to the i-th speech feature is not equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, construct an n*m distance matrix D, and obtain the minimum value over the matrix elements of the distance matrix D as the Euclidean distance between the i-th speech feature and the j-th speech feature, where n is equal to the number of frames of the first speech sequence, m is equal to the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
In this embodiment, the computation of the Euclidean distance between any two speech features in the speech feature set is illustrated by computing the distance between the i-th speech feature and the j-th speech feature; the computation is repeated until the Euclidean distances between all speech features in the speech feature set have been obtained.

When computing the Euclidean distance between any two speech features, it is first determined whether their numbers of speech sequence frames are equal (for example, whether the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature). If the numbers of frames are not equal, an n*m distance matrix D needs to be constructed, and the minimum value over the matrix elements of the distance matrix D is obtained as the Euclidean distance between the i-th speech feature and the j-th speech feature, where n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.

For example, the matrix element d(x, y) denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature. A shortest path from d(0, 0) to d(n, m) is found, and the length of that path is taken as the distance between the i-th speech feature and the j-th speech feature; the path must satisfy continuity and temporal monotonicity (no backtracking). The above computation uses the dynamic time warping algorithm (Dynamic Time Warping, abbreviated DTW).
In an embodiment, the speech feature clustering unit 120 further includes:
a second computation unit, configured to: if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, directly compute the Euclidean distance between the i-th speech feature and the j-th speech feature.

In this embodiment, when it is determined that the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature, the two have the same duration, and the Euclidean distance between them can be computed directly, without going through the process of constructing the distance matrix D.
In an embodiment, the speech feature clustering unit 120 includes:
an initial cluster center obtaining unit, configured to select, from the speech feature set, a number of speech features equal to the preset number of clusters, and use the selected speech features as the initial cluster centers of the respective clusters;
an initial clustering unit, configured to partition the speech feature set according to the Euclidean distance between each speech feature in the speech feature set and each initial cluster center, so as to obtain an initial clustering result;
a cluster center adjustment unit, configured to obtain the adjusted cluster center of each cluster according to the initial clustering result;
a clustering adjustment unit, configured to partition the speech feature set according to the Euclidean distances to the adjusted cluster centers until the clustering result stays unchanged for more than a preset number of iterations, so as to obtain clusters corresponding to the preset number of clusters and form the clustering result.
In this embodiment, since the Euclidean distance between the speech features can be computed with the dynamic time warping algorithm, the speech feature set can be clustered with the K-means clustering method. After the cluster classification is completed, the speech feature set can be grouped quickly into multiple clusters. The server can then select, from the multiple clusters, the clusters that satisfy the conditions as training samples and label them.
The clustering result screening unit 130 is configured to invoke the preset sample subset screening conditions and obtain, from the clustering result, the clusters that satisfy the sample subset screening conditions, so as to form a target cluster set.

In this embodiment, the sample subset screening condition can be set such that the sample redundancy is the minimum among the multiple sample subsets; in this way the target clusters are screened out to form the target cluster set. When the sample redundancy of a cluster is computed, what is computed is the degree of data repetition: for example, if the total number of data items in a sample subset is Y1 and the total number of repeated data items among them is Y2, the sample redundancy of that sample subset is Y2/Y1. Selecting the sample subset with lower redundancy significantly reduces the labeling cost of speech recognition tasks in the context of deep learning.
The label value obtaining unit 140 is configured to obtain the label value corresponding to each speech feature in the target cluster set, so as to obtain the current speech sample set corresponding to the target cluster set.

In this embodiment, since the target cluster set has already been selected, only a small number of samples need to be labeled at this point to obtain the current speech sample set corresponding to the target cluster set. Using less labeled data significantly increases the training speed of the speech recognition model and reduces the computational load on the speech processing system.
The speech recognition model training unit 150 is configured to use each speech feature in the current speech sample set as an input of the speech recognition model to be trained, and use the label value corresponding to each speech feature as an output of the speech recognition model to be trained, so as to train the speech recognition model to be trained and obtain the speech recognition model, where the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-mechanism-based sub-model.

In this embodiment, to train a speech recognition model with higher recognition accuracy based on the current speech sample set, a model in which a CTC (Connectionist Temporal Classification) model and an Attention model (that is, an attention-mechanism-based model) decode jointly can be adopted. CTC decoding recognizes speech by predicting the output of each frame; the algorithm is implemented on the assumption that the decoding of each frame is independent, so it lacks the connection between preceding and following speech features during decoding and relies on correction by a language model. The Attention decoding process, by contrast, is independent of the frame order of the input speech: each decoding unit generates the current result from the decoding result of the previous unit and the overall speech features, but the decoding process ignores the monotonic temporal order of speech. A hybrid model can therefore be adopted that combines the advantages of both. Generally, the connectionist temporal classification sub-model is placed closer to the input for preliminary processing, and the attention-mechanism-based sub-model is placed closer to the output for subsequent processing. The network structure of the speech recognition model to be trained adopts structures such as LSTM/CNN/GRU, and the two decoders jointly output the recognition result.
在一实施例中,基于几何学的语音样本筛选装置100还包括:In one embodiment, the geometry-based voice sample screening apparatus 100 further includes:
数据上链单元,用于将语音识别模型中链接时序分类子模型对应的第一模型参数集和基于注意力机制子模型对应的第二模型参数集上传至区块链网络。The data uploading unit is used for uploading the first model parameter set corresponding to the link time series classification sub-model and the second model parameter set corresponding to the attention mechanism-based sub-model in the speech recognition model to the blockchain network.
在本实施例中,基于第一模型参数集和第二模型参数集得到对应的摘要信息,具体来说,摘要信息由第一模型参数集和第二模型参数集进行散列处理得到,比如利用sha256s算法处理得到。将摘要信息上传至区块链可保证其安全性和对用户的公正透明性。用户设备可以从区块链中下载得该摘要信息,以便查证第一模型参数集和第二模型参数集是否被篡改。In this embodiment, the corresponding summary information is obtained based on the first model parameter set and the second model parameter set. Specifically, the summary information is obtained by hashing the first model parameter set and the second model parameter set, for example, using The sha256s algorithm is processed. Uploading summary information to the blockchain ensures its security and fairness and transparency to users. The user equipment can download the summary information from the blockchain, so as to verify whether the first model parameter set and the second model parameter set have been tampered with.
The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked using cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can comprise an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
The speech recognition result sending unit 160 is configured to, if currently to-be-recognized speech data uploaded by a client is detected, input the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtain and send the current speech recognition result to the client.
In this embodiment, once the training of the speech recognition model is completed on the server, the model can be applied to speech recognition. That is, as soon as the server receives the currently to-be-recognized speech data uploaded by the client, it inputs the corresponding speech feature into the speech recognition model for computation, obtains the current speech recognition result, and sends it to the client, so that the result is fed back quickly.
In one embodiment, the speech recognition result sending unit 160 includes:
a first decoding unit, configured to input the speech feature corresponding to the currently to-be-recognized speech data into the CTC sub-model for computation to obtain a first recognition sequence;
a second decoding unit, configured to input the first recognition sequence into the attention-based sub-model for computation, and to obtain and send the current speech recognition result to the client.
In this embodiment, because the CTC sub-model is placed closer to the input and the attention-based sub-model closer to the output, the currently to-be-recognized speech data is first fed into the CTC sub-model to obtain the first recognition sequence, and the first recognition sequence is then fed into the attention-based sub-model to obtain the current speech recognition result. This fully accounts for both the dependencies between neighboring speech features during decoding and the monotonic temporal structure of speech, so the recognition result of this model is more accurate.
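A minimal sketch of this two-stage decoding order, reusing the hypothetical HybridCTCAttention model from the earlier sketch, is shown below. The greedy CTC best-path pass and the single attention pass are simplifying assumptions; a real decoder would typically use beam search.

```python
# Illustrative sketch only: CTC first, attention second, as described above.
import torch

@torch.no_grad()
def recognize(model, feats):
    """feats: (1, T, feat_dim) speech features of one utterance."""
    enc, _ = model.encoder(feats)
    # Stage 1: the CTC sub-model produces the first recognition sequence
    # (greedy best path; collapse repeated tokens and drop blanks).
    frame_ids = model.ctc_proj(enc).argmax(-1).squeeze(0)  # (T,)
    first_seq, prev = [], -1
    for t in frame_ids.tolist():
        if t != prev and t != 0:          # 0 is the CTC blank token
            first_seq.append(t)
        prev = t
    # Stage 2: the attention sub-model refines the first sequence into
    # the final recognition result, attending over the encoder output.
    dec_in = model.embed(torch.tensor([first_seq]))
    ctx, _ = model.attn_decoder(dec_in, enc, enc)
    return model.att_proj(ctx).argmax(-1).squeeze(0).tolist()
```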
The apparatus automatically selects samples with low redundancy to train the speech recognition model, reducing the labeling cost of speech recognition tasks in a deep learning setting and speeding up the training of the speech recognition model.
The above geometry-based speech sample screening apparatus can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 4.
Please refer to FIG. 4, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
Referring to FIG. 4, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 can cause the processor 502 to perform the geometry-based speech sample screening method.
The processor 502 provides computing and control capabilities that support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can perform the geometry-based speech sample screening method.
The network interface 505 is used for network communication, such as the transmission of data information. Those skilled in the art can understand that the structure shown in FIG. 4 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the geometry-based speech sample screening method disclosed in the embodiments of the present application.
Those skilled in the art can understand that the embodiment of the computer device shown in FIG. 4 does not limit the specific constitution of the computer device; in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 4 and are not repeated here.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of the present application provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the geometry-based speech sample screening method disclosed in the embodiments of the present application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two; to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed devices, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a division by logical function, and there may be other divisions in actual implementation; units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may also be electrical, mechanical, or other forms of connection.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A geometry-based speech sample screening method, comprising:
    obtaining an initial speech sample set, and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set, wherein the initial speech sample set includes multiple pieces of initial speech sample data;
    obtaining the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result;
    invoking a preset sample subset screening condition, and obtaining the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set;
    obtaining the label value corresponding to each speech feature in the target cluster set to obtain a current speech sample set corresponding to the target cluster set;
    using each speech feature in the current speech sample set as the input of a speech recognition model to be trained, and the label value corresponding to each speech feature as the output of the speech recognition model to be trained, to train the speech recognition model to be trained and obtain a speech recognition model, wherein the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-based sub-model; and
    if currently to-be-recognized speech data uploaded by a client is detected, inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the client.
  2. The geometry-based speech sample screening method according to claim 1, wherein the obtaining an initial speech sample set and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set comprises:
    invoking a pre-stored sampling period to sample each piece of initial speech sample data in the initial speech sample set, to obtain a current discrete speech signal corresponding to each piece of initial speech sample data;
    invoking a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete speech signal corresponding to each piece of initial speech sample data, to obtain a current pre-emphasized speech signal corresponding to each piece of initial speech sample data;
    invoking a pre-stored Hamming window to window the current pre-emphasized speech signal corresponding to each piece of initial speech sample data, to obtain windowed speech data corresponding to each piece of initial speech sample data;
    invoking a pre-stored frame shift and frame length to divide the windowed speech data corresponding to each piece of initial speech sample data into frames, to obtain preprocessed speech data corresponding to each piece of initial speech sample data; and
    performing Mel-frequency cepstral coefficient extraction or filter-bank extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, to obtain the speech feature corresponding to each piece of initial speech sample data and form the speech feature set.
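By way of illustration only (the sketch below is editorial and not part of the claim language), the preprocessing chain of claim 2 might be realized as follows in NumPy. The frame length and frame shift shown are assumed example values; the pre-emphasis coefficient a = 0.98 is the value given in claim 8.

```python
# Illustrative sketch only: sampling is assumed already done; this covers
# pre-emphasis, framing, and windowing of one discrete speech signal.
import numpy as np

def preprocess(signal: np.ndarray, a: float = 0.98,
               frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    """Pre-emphasize, frame, and window one discrete speech signal."""
    assert len(signal) >= frame_len, "signal shorter than one frame"
    # Pre-emphasis with the first-order FIR high-pass filter of claim 8:
    # y[n] = x[n] - a * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Split into overlapping frames using the stored frame shift and length.
    # (Framing and windowing are shown in the conventional frame-then-window
    # order; the claim words these two steps in the opposite order.)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to each frame; MFCC or filter-bank extraction
    # (FFT -> Mel filter bank -> log, optionally -> DCT) would follow.
    return frames * np.hamming(frame_len)
```

With a 16 kHz sampling rate, frame_len = 400 and frame_shift = 160 correspond to the common 25 ms window and 10 ms shift; these are assumptions, not requirements of the claim.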
  3. The geometry-based speech sample screening method according to claim 1, wherein the obtaining the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm comprises:
    obtaining the i-th speech feature and the j-th speech feature in the speech feature set, wherein the speech feature set includes N speech features, the value ranges of i and j are both [1, N], and i is not equal to j;
    judging whether the number of frames of a first speech sequence corresponding to the i-th speech feature is equal to the number of frames of a second speech sequence corresponding to the j-th speech feature; and
    if the number of frames of the first speech sequence corresponding to the i-th speech feature is not equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, constructing an n×m distance matrix D, and taking the minimum value among the elements of the distance matrix D as the Euclidean distance between the i-th speech feature and the j-th speech feature, wherein n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
  4. The geometry-based speech sample screening method according to claim 3, wherein after the judging whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, the method further comprises:
    if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, directly calculating the Euclidean distance between the i-th speech feature and the j-th speech feature.
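By way of illustration only (editorial, not claim language), claims 3 and 4 together define the following distance rule, sketched here in NumPy. Note that taking the minimum element of the frame-pair distance matrix D is the rule exactly as claimed, which is a simplification relative to the accumulated-path distance of classic dynamic time warping.

```python
# Illustrative sketch only: each speech feature is assumed to be a
# (frames, dims) array of per-frame feature vectors.
import numpy as np

def feature_distance(feat_i: np.ndarray, feat_j: np.ndarray) -> float:
    n, m = feat_i.shape[0], feat_j.shape[0]
    if n == m:
        # Equal frame counts (claim 4): direct Euclidean distance.
        return float(np.linalg.norm(feat_i - feat_j))
    # Unequal frame counts (claim 3): build the n*m distance matrix D,
    # where D[x, y] is the Euclidean distance between frame x of feature i
    # and frame y of feature j, then take its minimum element.
    D = np.linalg.norm(feat_i[:, None, :] - feat_j[None, :, :], axis=-1)
    return float(D.min())
```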
  5. The geometry-based speech sample screening method according to claim 4, wherein the performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result comprises:
    selecting, from the speech feature set, a number of speech features equal to a preset number of clusters, and using the selected speech features as the initial cluster center of each cluster;
    partitioning the speech feature set according to the Euclidean distance between each speech feature in the speech feature set and each initial cluster center, to obtain an initial clustering result;
    obtaining the adjusted cluster center of each cluster according to the initial clustering result; and
    partitioning the speech feature set according to the Euclidean distances to the adjusted cluster centers, until the clustering result remains unchanged more than a preset number of times, to obtain the clusters corresponding to the preset number of clusters and form the clustering result.
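By way of illustration only (editorial, not claim language), the loop of claim 5 might be sketched as below. Because the speech features can have unequal frame counts, the "adjusted cluster center" is realized here as a medoid, i.e. the member feature closest to all other members; this is an assumption, since the claim does not specify how centers are adjusted. The `dist` argument can be the `feature_distance` function from the previous sketch.

```python
# Illustrative sketch only: K-means-style clustering over a pairwise
# distance function, with medoid-style center adjustment.
import random

def kmeans(features, k, dist, max_stable=3):
    centers = random.sample(range(len(features)), k)  # initial cluster centers
    stable, assign = 0, None
    while stable <= max_stable:  # until the result stays the same more
        new_assign = [min(range(k),  # than the preset number of times
                          key=lambda c: dist(f, features[centers[c]]))
                      for f in features]
        stable = stable + 1 if new_assign == assign else 0
        assign = new_assign
        # Adjust each cluster center to the member closest to all others.
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            if members:
                centers[c] = min(members, key=lambda i: sum(
                    dist(features[i], features[j]) for j in members))
    return assign, centers
```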
  6. The geometry-based speech sample screening method according to claim 1, wherein the inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the client comprises:
    inputting the speech feature corresponding to the currently to-be-recognized speech data into the connectionist temporal classification sub-model for computation to obtain a first recognition sequence; and
    inputting the first recognition sequence into the attention-based sub-model for computation, and obtaining and sending the current speech recognition result to the client.
  7. The geometry-based speech sample screening method according to claim 1, further comprising:
    uploading a first model parameter set corresponding to the connectionist temporal classification sub-model and a second model parameter set corresponding to the attention-based sub-model in the speech recognition model to a blockchain network.
  8. The geometry-based speech sample screening method according to claim 2, wherein the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter whose transfer function is H(z) = 1 − az⁻¹, where a = 0.98.
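By way of illustration only (editorial, not claim language), the transfer function H(z) = 1 − az⁻¹ corresponds to the difference equation y[n] = x[n] − a·x[n−1], as the short sketch below makes concrete.

```python
# Illustrative sketch only: the pre-emphasis filter of claim 8.
import numpy as np

def pre_emphasize(x: np.ndarray, a: float = 0.98) -> np.ndarray:
    # y[n] = x[n] - a * x[n-1]; the first sample is passed through unchanged.
    return np.append(x[0], x[1:] - a * x[:-1])

# A constant (zero-frequency) signal is almost entirely suppressed,
# which is the high-pass behaviour the filter is chosen for.
print(pre_emphasize(np.array([1.0, 1.0, 1.0, 1.0])))  # [1.0, 0.02, 0.02, 0.02]
```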
  9. The geometry-based speech sample screening method according to claim 1, wherein the sample subset screening condition is that the sample redundancy is the minimum among multiple sample subsets; and
    the invoking a preset sample subset screening condition and obtaining the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set comprises:
    obtaining the sample redundancy corresponding to each cluster in the clustering result, and forming the target cluster set from the cluster whose sample redundancy is the minimum among the multiple clusters.
  10. A geometry-based speech sample screening apparatus, comprising:
    a speech feature extraction unit, configured to obtain an initial speech sample set and extract the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set, wherein the initial speech sample set includes multiple pieces of initial speech sample data;
    a speech feature clustering unit, configured to obtain the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result;
    a clustering result screening unit, configured to invoke a preset sample subset screening condition and obtain the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set;
    a label value obtaining unit, configured to obtain the label value corresponding to each speech feature in the target cluster set to obtain a current speech sample set corresponding to the target cluster set;
    a speech recognition model training unit, configured to use each speech feature in the current speech sample set as the input of a speech recognition model to be trained and the label value corresponding to each speech feature as its output, to train the model and obtain a speech recognition model, wherein the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-based sub-model; and
    a speech recognition result sending unit, configured to, if currently to-be-recognized speech data uploaded by a client is detected, input the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtain and send a current speech recognition result to the client.
  11. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
    obtaining an initial speech sample set, and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set, wherein the initial speech sample set includes multiple pieces of initial speech sample data;
    obtaining the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result;
    invoking a preset sample subset screening condition, and obtaining the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set;
    obtaining the label value corresponding to each speech feature in the target cluster set to obtain a current speech sample set corresponding to the target cluster set;
    using each speech feature in the current speech sample set as the input of a speech recognition model to be trained, and the label value corresponding to each speech feature as the output of the speech recognition model to be trained, to train the speech recognition model to be trained and obtain a speech recognition model, wherein the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-based sub-model; and
    if currently to-be-recognized speech data uploaded by a client is detected, inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the client.
  12. The computer device according to claim 11, wherein the obtaining an initial speech sample set and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set comprises:
    invoking a pre-stored sampling period to sample each piece of initial speech sample data in the initial speech sample set, to obtain a current discrete speech signal corresponding to each piece of initial speech sample data;
    invoking a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete speech signal corresponding to each piece of initial speech sample data, to obtain a current pre-emphasized speech signal corresponding to each piece of initial speech sample data;
    invoking a pre-stored Hamming window to window the current pre-emphasized speech signal corresponding to each piece of initial speech sample data, to obtain windowed speech data corresponding to each piece of initial speech sample data;
    invoking a pre-stored frame shift and frame length to divide the windowed speech data corresponding to each piece of initial speech sample data into frames, to obtain preprocessed speech data corresponding to each piece of initial speech sample data; and
    performing Mel-frequency cepstral coefficient extraction or filter-bank extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, to obtain the speech feature corresponding to each piece of initial speech sample data and form the speech feature set.
  13. The computer device according to claim 11, wherein the obtaining the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm comprises:
    obtaining the i-th speech feature and the j-th speech feature in the speech feature set, wherein the speech feature set includes N speech features, the value ranges of i and j are both [1, N], and i is not equal to j;
    judging whether the number of frames of a first speech sequence corresponding to the i-th speech feature is equal to the number of frames of a second speech sequence corresponding to the j-th speech feature; and
    if the number of frames of the first speech sequence corresponding to the i-th speech feature is not equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, constructing an n×m distance matrix D, and taking the minimum value among the elements of the distance matrix D as the Euclidean distance between the i-th speech feature and the j-th speech feature, wherein n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
  14. The computer device according to claim 13, wherein after the judging whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, the steps further comprise:
    if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, directly calculating the Euclidean distance between the i-th speech feature and the j-th speech feature.
  15. The computer device according to claim 14, wherein the performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result comprises:
    selecting, from the speech feature set, a number of speech features equal to a preset number of clusters, and using the selected speech features as the initial cluster center of each cluster;
    partitioning the speech feature set according to the Euclidean distance between each speech feature in the speech feature set and each initial cluster center, to obtain an initial clustering result;
    obtaining the adjusted cluster center of each cluster according to the initial clustering result; and
    partitioning the speech feature set according to the Euclidean distances to the adjusted cluster centers, until the clustering result remains unchanged more than a preset number of times, to obtain the clusters corresponding to the preset number of clusters and form the clustering result.
  16. The computer device according to claim 11, wherein the inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the client comprises:
    inputting the speech feature corresponding to the currently to-be-recognized speech data into the connectionist temporal classification sub-model for computation to obtain a first recognition sequence; and
    inputting the first recognition sequence into the attention-based sub-model for computation, and obtaining and sending the current speech recognition result to the client.
  17. The computer device according to claim 11, wherein the steps further comprise:
    uploading a first model parameter set corresponding to the connectionist temporal classification sub-model and a second model parameter set corresponding to the attention-based sub-model in the speech recognition model to a blockchain network.
  18. The computer device according to claim 12, wherein the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter whose transfer function is H(z) = 1 − az⁻¹, where a = 0.98.
  19. The computer device according to claim 11, wherein the sample subset screening condition is that the sample redundancy is the minimum among multiple sample subsets; and
    the invoking a preset sample subset screening condition and obtaining the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set comprises:
    obtaining the sample redundancy corresponding to each cluster in the clustering result, and forming the target cluster set from the cluster whose sample redundancy is the minimum among the multiple clusters.
  20. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the following operations:
    obtaining an initial speech sample set, and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set, wherein the initial speech sample set includes multiple pieces of initial speech sample data;
    obtaining the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result;
    invoking a preset sample subset screening condition, and obtaining the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set;
    obtaining the label value corresponding to each speech feature in the target cluster set to obtain a current speech sample set corresponding to the target cluster set;
    using each speech feature in the current speech sample set as the input of a speech recognition model to be trained, and the label value corresponding to each speech feature as the output of the speech recognition model to be trained, to train the speech recognition model to be trained and obtain a speech recognition model, wherein the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-based sub-model; and
    if currently to-be-recognized speech data uploaded by a client is detected, inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the client.
PCT/CN2021/083934 2020-12-01 2021-03-30 Speech sample screening method and apparatus based on geometry, and computer device and storage medium WO2022116442A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011387398.0 2020-12-01
CN202011387398.0A CN112530409B (en) 2020-12-01 2020-12-01 Speech sample screening method and device based on geometry and computer equipment

Publications (1)

Publication Number Publication Date
WO2022116442A1

Family

ID=74996045

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083934 WO2022116442A1 (en) 2020-12-01 2021-03-30 Speech sample screening method and apparatus based on geometry, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112530409B (en)
WO (1) WO2022116442A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825169A (en) * 2023-08-31 2023-09-29 悦芯科技股份有限公司 Abnormal memory chip detection method based on test equipment
CN117334186A (en) * 2023-10-13 2024-01-02 武汉赛思云科技有限公司 Speech recognition method and NLP platform based on machine learning

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment
CN113345424B (en) * 2021-05-31 2024-02-27 平安科技(深圳)有限公司 Voice feature extraction method, device, equipment and storage medium
CN115146716A (en) * 2022-06-22 2022-10-04 腾讯科技(深圳)有限公司 Labeling method, device, equipment, storage medium and program product
CN114863939B (en) * 2022-07-07 2022-09-13 四川大学 Panda attribute identification method and system based on sound

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110931043A (en) * 2019-12-06 2020-03-27 湖北文理学院 Integrated speech emotion recognition method, device, equipment and storage medium
US10699719B1 (en) * 2011-12-31 2020-06-30 Reality Analytics, Inc. System and method for taxonomically distinguishing unconstrained signal data segments
CN111813905A (en) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 Corpus generation method and device, computer equipment and storage medium
CN111966798A (en) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 Intention identification method and device based on multi-round K-means algorithm and electronic equipment
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065028B (en) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN109657711A (en) * 2018-12-10 2019-04-19 广东浪潮大数据研究有限公司 A kind of image classification method, device, equipment and readable storage medium storing program for executing
CN110648671A (en) * 2019-08-21 2020-01-03 广州国音智能科技有限公司 Voiceprint model reconstruction method, terminal, device and readable storage medium
CN110929771B (en) * 2019-11-15 2020-11-20 北京达佳互联信息技术有限公司 Image sample classification method and device, electronic equipment and readable storage medium
CN111179914B (en) * 2019-12-04 2022-12-16 华南理工大学 Voice sample screening method based on improved dynamic time warping algorithm
CN111046947B (en) * 2019-12-10 2023-06-30 成都数联铭品科技有限公司 Training system and method of classifier and recognition method of abnormal sample
CN111554270B (en) * 2020-04-29 2023-04-18 北京声智科技有限公司 Training sample screening method and electronic equipment
CN111950294A (en) * 2020-07-24 2020-11-17 北京奇保信安科技有限公司 Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10699719B1 (en) * 2011-12-31 2020-06-30 Reality Analytics, Inc. System and method for taxonomically distinguishing unconstrained signal data segments
CN110931043A (en) * 2019-12-06 2020-03-27 湖北文理学院 Integrated speech emotion recognition method, device, equipment and storage medium
CN111813905A (en) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 Corpus generation method and device, computer equipment and storage medium
CN111966798A (en) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 Intention identification method and device based on multi-round K-means algorithm and electronic equipment
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825169A (en) * 2023-08-31 2023-09-29 悦芯科技股份有限公司 Abnormal memory chip detection method based on test equipment
CN116825169B (en) * 2023-08-31 2023-11-24 悦芯科技股份有限公司 Abnormal memory chip detection method based on test equipment
CN117334186A (en) * 2023-10-13 2024-01-02 武汉赛思云科技有限公司 Speech recognition method and NLP platform based on machine learning
CN117334186B (en) * 2023-10-13 2024-04-30 北京智诚鹏展科技有限公司 Speech recognition method and NLP platform based on machine learning

Also Published As

Publication number Publication date
CN112530409B (en) 2024-01-23
CN112530409A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
WO2022116442A1 (en) Speech sample screening method and apparatus based on geometry, and computer device and storage medium
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
US10854193B2 (en) Methods, devices and computer-readable storage media for real-time speech recognition
WO2022121185A1 (en) Model training method and apparatus, dialect recognition method and apparatus, and server and storage medium
WO2022033327A1 (en) Video generation method and apparatus, generation model training method and apparatus, and medium and device
US11646032B2 (en) Systems and methods for audio processing
KR20230018534A (en) Speaker diarization using speaker embedding(s) and trained generative model
WO2022121257A1 (en) Model training method and apparatus, speech recognition method and apparatus, device, and storage medium
WO2021082420A1 (en) Voiceprint authentication method and device, medium and electronic device
CN110534099A (en) Voice wakes up processing method, device, storage medium and electronic equipment
WO2021114841A1 (en) User report generating method and terminal device
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
JP7230806B2 (en) Information processing device and information processing method
WO2022227190A1 (en) Speech synthesis method and apparatus, and electronic device and storage medium
US20230022004A1 (en) Dynamic vocabulary customization in automated voice systems
WO2020052069A1 (en) Method and apparatus for word segmentation
CN111081230A (en) Speech recognition method and apparatus
WO2022142115A1 (en) Adversarial learning-based speaker voice conversion method and related device
WO2020220824A1 (en) Voice recognition method and device
CN113314119A (en) Voice recognition intelligent household control method and device
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
CN113823257B (en) Speech synthesizer construction method, speech synthesis method and device
US10923113B1 (en) Speechlet recommendation based on updating a confidence value
WO2023035529A1 (en) Intent recognition-based information intelligent query method and apparatus, device and medium
WO2022057759A1 (en) Voice conversion method and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21899486

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21899486

Country of ref document: EP

Kind code of ref document: A1