WO2022116442A1 - Speech sample screening method and apparatus based on geometry, and computer device and storage medium - Google Patents

Speech sample screening method and apparatus based on geometry, and computer device and storage medium Download PDF

Info

Publication number
WO2022116442A1
Authority
WO
WIPO (PCT)
Prior art keywords: voice, speech, feature, sample, initial
Prior art date
Application number
PCT/CN2021/083934
Other languages
French (fr)
Chinese (zh)
Inventor
罗剑
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022116442A1 publication Critical patent/WO2022116442A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques where the extracted parameters are the cepstrum

Definitions

  • the present application relates to the technical field of speech semantics of artificial intelligence, and in particular, to a method, device, computer equipment and storage medium for screening speech samples based on geometry.
  • DNN (Deep Neural Network)
  • Active learning is a branch of machine learning that allows the model to choose the data to learn on its own.
  • the idea of active learning comes from the assumption that a machine learning algorithm, if it can choose the data it wants to learn from, will perform better with less training data.
  • the most widely used active learning query strategy is called uncertainty sampling (Uncertainty Sampling), in which the model will select the most uncertain samples predicted by the model for labeling.
  • The inventors realized that this technique achieves good results when the number of selected samples is small. However, when a deep neural network is used as the training model, a large amount of training data is required, and as the number of selected labeled samples grows, the samples the model predicts as uncertain become redundant and overlapping, so similar samples are more likely to be selected. Selecting these similar samples is of limited help for model training.
  • Moreover, voice data differs from non-sequential data such as images: it has variable length and rich structural information, which makes voice data more difficult to process and select.
  • The embodiments of the present application provide a geometry-based voice sample screening method, apparatus, computer device, and storage medium, aiming to solve the problems in the prior art that, when uncertainty sampling is applied to training neural networks for speech recognition, the samples the model predicts as uncertain are redundant and overlapping, such similar samples are of limited help for model training, and the complex structure of speech makes it difficult for uncertainty sampling to select speech samples.
  • an embodiment of the present application provides a method for screening speech samples based on geometry, which includes:
  • Obtaining an initial voice sample set, and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set, to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
  • wherein the to-be-trained speech recognition model includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model;
  • if currently to-be-recognized voice data uploaded by a client is detected, the voice feature corresponding to the current to-be-recognized voice data is input into the speech recognition model for computation, and the current speech recognition result is obtained and sent to the client.
  • an embodiment of the present application provides a geometry-based voice sample screening device, which includes:
  • a voice feature extraction unit, configured to obtain an initial voice sample set, and extract the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set includes a plurality of pieces of initial voice sample data;
  • a voice feature clustering unit configured to obtain the Euclidean distance between each voice feature in the voice feature set through a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distance between each voice feature to obtain a clustering result;
  • a clustering result screening unit configured to invoke preset sample subset screening conditions, and obtain clusters in the clustering results that satisfy the sample subset screening conditions, so as to form a target cluster set;
  • a label value obtaining unit configured to obtain a label value corresponding to each voice feature in the target cluster set, so as to obtain a current voice sample set corresponding to the target cluster set;
  • a speech recognition model training unit, configured to use each speech feature in the current speech sample set as the input of the to-be-trained speech recognition model, and the label value corresponding to each speech feature as its output, so as to train the to-be-trained speech recognition model and obtain a speech recognition model; wherein the to-be-trained speech recognition model includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model; and
  • a speech recognition result sending unit, configured to, if currently to-be-recognized speech data uploaded by a user terminal is detected, input the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtain and send the current speech recognition result to the user terminal.
  • an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the following steps:
  • Obtaining an initial voice sample set, and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set, to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
  • wherein the to-be-trained speech recognition model includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model;
  • if currently to-be-recognized voice data uploaded by a client is detected, the voice feature corresponding to the current to-be-recognized voice data is input into the speech recognition model for computation, and the current speech recognition result is obtained and sent to the client.
  • embodiments of the present application further provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
  • Obtaining an initial voice sample set, and extracting the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set, to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
  • wherein the to-be-trained speech recognition model includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model;
  • if currently to-be-recognized voice data uploaded by a client is detected, the voice feature corresponding to the current to-be-recognized voice data is input into the speech recognition model for computation, and the current speech recognition result is obtained and sent to the client.
  • The embodiments of the present application provide a geometry-based method, apparatus, computer device and storage medium for selecting voice samples, which realize automatic selection of samples with low redundancy to train a speech recognition model, reduce the labeling cost of speech recognition tasks in the deep-learning context, and improve the training speed of the speech recognition model.
  • FIG. 1 is a schematic diagram of an application scenario of the geometry-based voice sample screening method provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a geometry-based voice sample screening method provided by an embodiment of the present application
  • FIG. 3 is a schematic block diagram of a geometry-based voice sample screening apparatus provided by an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • Referring to FIG. 1 and FIG. 2, FIG. 1 is a schematic diagram of an application scenario of the geometry-based voice sample screening method provided by an embodiment of this application, and FIG. 2 is a schematic flowchart of the method. The geometry-based voice sample screening method is applied in a server, and the method is executed by application software installed in the server.
  • the method includes steps S110-S160.
  • Specifically, data preprocessing and feature extraction may be performed on the initial speech sample set to obtain the voice feature corresponding to each piece of initial speech sample data, thereby forming a voice feature set.
  • Data preprocessing includes pre-emphasis, framing, windowing and other operations. These operations aim to eliminate the effects on speech quality of aliasing, higher-order harmonic distortion, high-frequency attenuation and other factors caused by the human vocal organs and by defects in the acquisition equipment, so that the resulting signal is as uniform and smooth as possible.
  • step S110 includes:
  • Mel-frequency cepstral coefficient (MFCC) or filter-bank feature extraction is performed on the preprocessed speech data corresponding to each piece of initial speech sample data, to obtain the speech feature corresponding to each piece of initial speech sample data, forming a speech feature set.
  • The initial speech sample data (denoted as s(t)) is first sampled with a sampling period T and thereby discretized into s(n).
  • The first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter, and its transfer function is of the standard pre-emphasis form shown in formula (1): H(z) = 1 - a·z⁻¹, where a is the pre-emphasis coefficient, typically close to 1.
  • Assuming the time-domain signal corresponding to the windowed speech data is x(l), and the n-th frame of speech data in the windowed and framed preprocessed speech data is xn(m), then xn(m) satisfies formula (3): xn(m) = ω(m)·x(n·T + m), 0 ≤ m ≤ N-1, where T is the frame shift, N is the frame length, and ω(n) is the Hamming window function, ω(n) = 0.54 - 0.46·cos(2πn/(N-1)).
  • By preprocessing the initial speech sample data in this way, it can be effectively used for subsequent acoustic parameter extraction, such as extracting Mel-frequency cepstral coefficients (MFCC) or filter-bank (FBank) features.
  • the speech feature corresponding to each piece of initial speech sample data can be obtained to form a speech feature set.
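For illustration only, the preprocessing and feature extraction pipeline just described can be sketched as follows in Python; the pre-emphasis coefficient, frame length and frame shift are assumed typical values rather than values specified by the embodiment, and MFCC extraction is delegated to librosa:

```python
import numpy as np
import librosa

def preprocess(signal, sample_rate, a=0.97, frame_ms=25, shift_ms=10):
    """Pre-emphasis, windowing and framing for one utterance.
    a, frame_ms and shift_ms are illustrative defaults, not values
    taken from the embodiment."""
    # Pre-emphasis: y(n) = s(n) - a*s(n-1), i.e. H(z) = 1 - a*z^-1 (formula (1))
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    N = int(sample_rate * frame_ms / 1000)   # frame length
    T = int(sample_rate * shift_ms / 1000)   # frame shift
    num_frames = 1 + (len(emphasized) - N) // T  # assumes len(signal) >= N
    w = np.hamming(N)  # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    # x_n(m) = w(m) * x(n*T + m), 0 <= m <= N-1 (formula (3))
    return np.stack([w * emphasized[n * T:n * T + N] for n in range(num_frames)])

def extract_feature(signal, sample_rate, n_mfcc=13):
    """MFCC feature as one possible speech feature; a filter-bank feature
    (librosa.feature.melspectrogram) would be the alternative."""
    return librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc).T
```

Each element of the resulting speech feature set is then a (frames x dimensions) array whose number of frames varies with utterance length, which is why the pairwise distances in step S120 are computed with dynamic time warping.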
  • To quantify the similarity between two pieces of initial speech sample data, the Euclidean distance between their corresponding speech features can be calculated.
  • In an embodiment, obtaining the Euclidean distance between the speech features in the speech feature set through the dynamic time warping algorithm in step S120 includes:
  • obtaining the i-th voice feature and the j-th voice feature in the voice feature set; wherein the voice feature set includes N voice features, the value ranges of i and j are both [1, N], and i and j are not equal;
  • if the number of frames of the first voice sequence corresponding to the i-th voice feature is not equal to the number of frames of the second voice sequence corresponding to the j-th voice feature, constructing a distance matrix D of size n*m and obtaining the minimum cumulative value over the matrix elements of D as the Euclidean distance between the i-th voice feature and the j-th voice feature; wherein n equals the number of frames of the first voice sequence, m equals the number of frames of the second voice sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature.
  • the method for calculating the Euclidean distance between any two voice features in the voice feature set is described by taking the calculation of the ith voice feature and the jth voice feature in the voice feature set as an example.
  • The above calculation is repeated over all pairs and stops once the Euclidean distance between every two speech features in the speech feature set has been obtained.
  • To calculate the Euclidean distance between any two speech features, it is first determined whether their numbers of frames are equal (for example, whether the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature). If the numbers of frames are not equal, a distance matrix D of size n*m is constructed, where n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and the matrix element d(x, y) denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature. The shortest path from d(0, 0) to d(n, m) is then found, subject to continuity and time monotonicity (no backtracking), and the length of this path is taken as the distance between the i-th speech feature and the j-th speech feature.
  • The above calculation process adopts the dynamic time warping (Dynamic Time Warping, DTW) algorithm.
  • In an embodiment, after judging whether the number of frames of the first voice sequence corresponding to the i-th voice feature is equal to the number of frames of the second voice sequence corresponding to the j-th voice feature, step S120 further includes: if the numbers of frames are equal, directly calculating the Euclidean distance between the i-th voice feature and the j-th voice feature.
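A minimal sketch of this distance computation, assuming each speech feature is a (frames x dimensions) NumPy array; the recurrence below is the standard DTW accumulation, and when the two sequences have equal length the frame-wise Euclidean distances are simply summed directly:

```python
import numpy as np

def dtw_distance(feat_i, feat_j):
    """DTW distance between two variable-length feature sequences.
    feat_i: (n, d) array; feat_j: (m, d) array."""
    n, m = len(feat_i), len(feat_j)
    if n == m:
        # Equal frame counts: direct frame-wise Euclidean distance.
        return float(np.linalg.norm(feat_i - feat_j, axis=1).sum())
    # D[x, y] holds the cheapest accumulated cost of a path ending at (x, y);
    # only forward moves are allowed (continuity, no backtracking).
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            d = np.linalg.norm(feat_i[x - 1] - feat_j[y - 1])  # d(x, y)
            D[x, y] = d + min(D[x - 1, y], D[x, y - 1], D[x - 1, y - 1])
    return float(D[n, m])
```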
  • In an embodiment, performing K-means clustering according to the Euclidean distance between the speech features in step S120 to obtain a clustering result includes:
  • selecting, in the voice feature set, the same number of voice features as the preset number of clusters, and using the selected voice features as the initial cluster center of each cluster;
  • dividing the voice feature set according to the Euclidean distance between each voice feature in the voice feature set and each initial cluster center, to obtain an initial clustering result;
  • obtaining the adjusted cluster center of each cluster according to the initial clustering result; and
  • re-dividing the voice feature set according to the Euclidean distance to the adjusted cluster centers, until the clustering result remains unchanged for more than a preset number of iterations, to obtain the clusters corresponding to the preset number of clusters, forming the clustering result.
  • the speech feature set can be clustered by the K-means clustering method at this time, and the specific process is as follows:
  • a) Arbitrarily select N2 voice features from the voice feature set containing N1 voice features and use them as the initial cluster centers of N2 clusters (N2 < N1, where N1 is the initial total number of voice features in the voice feature set and N2 is the preset number of clusters, i.e., the expected number of clusters); b) divide the voice feature set according to the Euclidean distance between each voice feature and each cluster center; c) adjust the cluster center of each cluster according to the division result; d) re-divide the voice feature set according to the adjusted cluster centers.
  • Step d) is repeated until the clustering result no longer changes, and the clustering result corresponding to the preset number of clusters is obtained.
  • the speech feature set can be quickly grouped to obtain multiple clusters.
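As an illustrative sketch only: because the speech features have variable length, cluster centers cannot be obtained by simple averaging, so the center update below keeps a medoid (the member with minimum total DTW distance to its cluster). This medoid choice is an assumption, since the embodiment does not detail how adjusted centers are computed; dtw_distance is the function from the sketch above.

```python
import random

def kmeans_dtw(features, n_clusters, max_iter=20, seed=0):
    """K-means-style clustering of variable-length speech features."""
    rng = random.Random(seed)
    centers = rng.sample(range(len(features)), n_clusters)   # step a)
    assignment = None
    for _ in range(max_iter):
        # steps b)/d): assign each feature to its nearest cluster center
        new_assignment = [
            min(range(n_clusters),
                key=lambda c, f=f: dtw_distance(f, features[centers[c]]))
            for f in features
        ]
        if new_assignment == assignment:   # clustering result unchanged
            break
        assignment = new_assignment
        # step c): move each center to the medoid of its cluster
        for c in range(n_clusters):
            members = [i for i, a in enumerate(assignment) if a == c]
            if members:
                centers[c] = min(members, key=lambda i: sum(
                    dtw_distance(features[i], features[j]) for j in members))
    return assignment, centers
```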
  • the server can select clusters that meet the conditions from multiple clusters as training samples and label them.
  • the sample subset screening condition may be set such that the sample redundancy is the minimum value among the multiple sample subsets, so that target clusters can be filtered out to form a target cluster set.
  • The sample redundancy of a cluster measures its degree of data repetition. For example, if the total number of data items in a certain sample subset is Y1 and the number of repeated data items is Y2, then the sample redundancy of this sample subset is Y2/Y1. Selecting a subset of samples with lower redundancy can significantly reduce the labeling cost of speech recognition tasks in the deep-learning context.
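A sketch of this redundancy measure; what counts as a repeated item is not spelled out in the embodiment, so the duplicate test is passed in as a caller-supplied predicate (for example, a DTW distance below some threshold):

```python
def sample_redundancy(cluster, is_duplicate):
    """Redundancy = Y2 / Y1: Y1 is the total number of samples in the
    cluster, Y2 the number that repeat an earlier sample according to
    the is_duplicate predicate."""
    y1 = len(cluster)
    y2 = sum(
        any(is_duplicate(cluster[i], cluster[j]) for j in range(i))
        for i in range(1, y1)
    )
    return y2 / y1 if y1 else 0.0

# Screening condition: e.g. keep the cluster(s) with minimum redundancy.
# target = min(clusters, key=lambda c: sample_redundancy(c, is_dup))
```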
  • Since the target cluster set has been selected, only a small number of samples need to be labeled at this stage; that is, the current speech sample set corresponding to the target cluster set is obtained. Using less labeled data can significantly improve the training speed of the speech recognition model and reduce the computational pressure on the speech processing system.
  • The speech recognition model to be trained includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model.
  • That is, a hybrid of a CTC (Connectionist Temporal Classification) model and an Attention model (i.e., a model based on an attention mechanism) can be used.
  • CTC decoding recognizes speech by predicting the output of each frame. The algorithm is implemented under the assumption that the decoding of each frame is independent, so it lacks the connection between preceding and following speech features during decoding and relies on a language model for correction.
  • the Attention decoding process has nothing to do with the frame order of the input speech.
  • Each decoding unit generates the current result through the decoding result of the previous unit and the overall speech characteristics.
  • Its decoding process, however, ignores the monotonic timing of speech, so a hybrid model can be used that takes advantage of both approaches.
  • The CTC sub-model is placed closer to the input for preliminary processing, and the attention-based sub-model is placed closer to the output for subsequent processing.
  • The network structure of the speech recognition model to be trained adopts structures such as LSTM, CNN, or GRU, and the two decoders jointly output the recognition result.
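The embodiment does not disclose concrete layer configurations, so the following PyTorch sketch is only one plausible arrangement of the hybrid design: a shared LSTM encoder, a CTC head on the input side, and an attention-based decoder toward the output. All sizes (hidden width, heads, vocabulary) and the use of decoder states as attention queries are assumptions:

```python
import torch
import torch.nn as nn

class HybridCTCAttention(nn.Module):
    """Sketch of the hybrid CTC/attention model; not the application's
    exact architecture."""

    def __init__(self, feat_dim=13, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, vocab)   # per-frame CTC logits
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab)        # attention decoder logits

    def forward(self, feats, dec_states):
        # feats: (B, T, feat_dim); dec_states: (B, U, 2*hidden) decoder queries
        enc, _ = self.encoder(feats)
        ctc_logits = self.ctc_head(enc)                # first pass, frame level
        ctx, _ = self.attn(dec_states, enc, enc)       # attend over all frames
        attn_logits = self.out(ctx)                    # second pass, label level
        return ctc_logits, attn_logits
```

In training, a common recipe for such hybrids is to combine nn.CTCLoss on the frame-level logits with cross-entropy on the attention decoder's outputs, so that both decoders jointly shape the shared encoder.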
  • In an embodiment, after step S150, the method further includes: uploading the first model parameter set corresponding to the CTC sub-model and the second model parameter set corresponding to the attention-based sub-model in the speech recognition model to a blockchain network.
  • the corresponding summary information is obtained based on the first model parameter set and the second model parameter set.
  • Specifically, the summary information is obtained by hashing the first model parameter set and the second model parameter set, for example, using the SHA-256 algorithm.
  • Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users.
  • the user equipment can download the summary information from the blockchain, so as to verify whether the first model parameter set and the second model parameter set have been tampered with.
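A sketch of producing such summary information, assuming the parameter sets are PyTorch tensors; the deterministic serialization chosen here is an assumption, as the embodiment only states that hashing (e.g. SHA-256) is applied:

```python
import hashlib
import json

def parameter_digest(named_params):
    """SHA-256 digest of a model parameter set.
    named_params: iterable of (name, tensor) pairs, e.g.
    model.named_parameters() for the CTC or attention sub-model."""
    payload = json.dumps(
        {name: p.detach().cpu().numpy().tolist() for name, p in named_params},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# first_digest = parameter_digest(ctc_submodel.named_parameters())
# second_digest = parameter_digest(attention_submodel.named_parameters())
```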
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • After the training of the speech recognition model is completed in the server, the model can be applied to speech recognition. That is, once the server detects the current voice data to be recognized uploaded by the client, it inputs the voice feature corresponding to the current voice data to be recognized into the speech recognition model for computation, and obtains and sends the current speech recognition result to the client. In this way, the current speech recognition result can be fed back quickly.
  • step S160 includes:
  • inputting the speech feature corresponding to the currently to-be-recognized speech data into the CTC sub-model for computation to obtain a first recognition sequence; and inputting the first recognition sequence into the attention-based sub-model for computation, to obtain and send the current speech recognition result to the user terminal.
  • Since the CTC sub-model is placed closer to the input end and the attention-based sub-model closer to the output end, the currently to-be-recognized speech data is first input into the CTC sub-model for computation to obtain a first recognition sequence, and the first recognition sequence is then input into the attention-based sub-model for computation to obtain the current speech recognition result.
  • In this way, the relationship between preceding and following speech features is fully considered during decoding, as is the monotonic timing of speech, so the results obtained by this model are more accurate.
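For illustration, the first recognition sequence could be derived from the CTC sub-model's per-frame outputs as below; greedy decoding with blank/repeat collapsing is an assumption, since the embodiment does not specify the decoding strategy:

```python
import torch

def ctc_first_pass(ctc_logits, blank=0):
    """Collapse per-frame CTC predictions into a first recognition
    sequence: argmax per frame, merge repeats, drop blanks.
    ctc_logits: (T, vocab) tensor for a single utterance."""
    ids = ctc_logits.argmax(dim=-1).tolist()
    seq, prev = [], blank
    for i in ids:
        if i != blank and i != prev:
            seq.append(i)
        prev = i
    return seq
```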
  • This method realizes the automatic selection of samples with less redundancy to train the speech recognition model, reduces the labeling cost of speech recognition tasks in the context of deep learning, and improves the training speed of the speech recognition model.
  • An embodiment of the present application further provides a geometry-based voice sample screening apparatus, which is used to perform any of the foregoing embodiments of the geometry-based voice sample screening method.
  • FIG. 3 is a schematic block diagram of the apparatus for screening speech samples based on geometry provided by an embodiment of the present application.
  • the geometry-based voice sample screening apparatus 100 may be configured in a server.
  • The geometry-based voice sample screening device 100 includes: a voice feature extraction unit 110, a voice feature clustering unit 120, a clustering result screening unit 130, a label value obtaining unit 140, a speech recognition model training unit 150, and a speech recognition result sending unit 160.
  • The voice feature extraction unit 110 is configured to obtain an initial voice sample set, and extract the voice feature corresponding to each piece of initial voice sample data in the initial voice sample set to form a voice feature set; wherein the initial voice sample set includes a plurality of pieces of initial voice sample data.
  • Specifically, data preprocessing and feature extraction may be performed on the initial speech sample set to obtain the voice feature corresponding to each piece of initial speech sample data, thereby forming a voice feature set.
  • Data preprocessing includes pre-emphasis, framing, windowing and other operations. These operations aim to eliminate the effects on speech quality of aliasing, higher-order harmonic distortion, high-frequency attenuation and other factors caused by the human vocal organs and by defects in the acquisition equipment, so that the resulting signal is as uniform and smooth as possible.
  • the speech feature extraction unit 110 includes:
  • a discrete sampling unit configured to call a pre-stored sampling period to sample each piece of initial voice sample data in the initial voice sample set, to obtain a current discrete voice signal corresponding to each piece of initial voice sample data;
  • a pre-emphasis unit, used to call the pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal corresponding to each piece of initial voice sample data, to obtain the current pre-emphasized voice signal corresponding to each piece of initial voice sample data;
  • a windowing unit, used to call the pre-stored Hamming window to window the current pre-emphasized voice signal corresponding to each piece of initial voice sample data, to obtain the windowed voice data corresponding to each piece of initial voice sample data;
  • a framing unit, used to call the pre-stored frame shift and frame length to divide the windowed voice data corresponding to each piece of initial voice sample data into frames, to obtain the preprocessed voice data corresponding to each piece of initial voice sample data;
  • a feature extraction unit, used for performing Mel-frequency cepstral coefficient or filter-bank feature extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, to obtain the speech feature corresponding to each piece of initial speech sample data, forming a speech feature set.
  • The initial speech sample data (denoted as s(t)) is first sampled with a sampling period T and thereby discretized into s(n).
  • the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter, and its transfer function is as shown in the above formula (1).
  • Assuming the time-domain signal corresponding to the windowed speech data is x(l), and the n-th frame of speech data in the windowed and framed preprocessed speech data is xn(m), then xn(m) satisfies the above formula (3).
  • By preprocessing the initial speech sample data in this way, it can be effectively used for subsequent acoustic parameter extraction, such as extracting Mel-frequency cepstral coefficients (MFCC) or filter-bank (FBank) features.
  • the speech feature corresponding to each piece of initial speech sample data can be obtained to form a speech feature set.
  • the voice feature clustering unit 120 is used to obtain the Euclidean distance between each voice feature in the voice feature set through a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distance between each voice feature to obtain a clustering result .
  • To quantify the similarity between two pieces of initial speech sample data, the Euclidean distance between their corresponding speech features can be calculated.
  • the speech feature clustering unit 120 includes:
  • the voice feature selection unit is used to obtain the ith voice feature and the jth voice feature in the voice feature set; wherein, the voice feature set includes N voice features, and the value ranges of i and j are both [1, N], and i and j are not equal;
  • a voice sequence frame number comparison unit used for judging whether the first voice sequence frame number corresponding to the ith voice feature is equal to the second voice sequence frame number corresponding to the jth voice feature;
  • a first calculation unit, used for, if the number of frames of the first voice sequence corresponding to the i-th voice feature is not equal to the number of frames of the second voice sequence corresponding to the j-th voice feature, constructing a distance matrix D of size n*m and obtaining the minimum cumulative value over the matrix elements of D as the Euclidean distance between the i-th voice feature and the j-th voice feature; wherein n equals the number of frames of the first voice sequence, m equals the number of frames of the second voice sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th voice feature and the y-th frame of the j-th voice feature.
  • the method for calculating the Euclidean distance between any two voice features in the voice feature set is described by taking the calculation of the ith voice feature and the jth voice feature in the voice feature set as an example.
  • The above calculation is repeated over all pairs and stops once the Euclidean distance between every two speech features in the speech feature set has been obtained.
  • To calculate the Euclidean distance between any two speech features, it is first determined whether their numbers of frames are equal (for example, whether the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature). If the numbers of frames are not equal, a distance matrix D of size n*m is constructed, where n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and the matrix element d(x, y) denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature. The shortest path from d(0, 0) to d(n, m) is then found, subject to continuity and time monotonicity (no backtracking), and the length of this path is taken as the distance between the i-th speech feature and the j-th speech feature.
  • The above calculation process adopts the dynamic time warping (Dynamic Time Warping, DTW) algorithm.
  • the voice feature clustering unit 120 further includes:
  • a second calculation unit, configured to directly calculate the Euclidean distance between the i-th speech feature and the j-th speech feature if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature.
  • the speech feature clustering unit 120 includes:
  • the initial cluster center acquisition unit is used to select the same number of voice features as the preset number of clusters in the voice feature set, and use the selected voice feature as the initial cluster center of each cluster;
  • the initial clustering unit is used to divide the voice feature set according to the Euclidean distance between each voice feature in the voice feature set and each initial cluster center to obtain an initial clustering result
  • the cluster center adjustment unit is used to obtain the adjusted cluster center of each cluster according to the initial clustering result
  • a clustering adjustment unit, used to re-divide the voice feature set according to the Euclidean distance to the adjusted cluster centers, until the clustering result remains unchanged for more than a preset number of iterations, to obtain the clusters corresponding to the preset number of clusters, forming the clustering result.
  • the speech feature set can be clustered by the K-means clustering method at this time. After the cluster classification is completed, the speech feature set can be quickly grouped to obtain multiple clusters. After that, the server can select clusters that meet the conditions from multiple clusters as training samples and label them.
  • the clustering result screening unit 130 is configured to invoke preset sample subset screening conditions, and acquire clusters in the clustering results that satisfy the sample subset screening conditions, so as to form a target cluster set.
  • the sample subset screening condition may be set such that the sample redundancy is the minimum value among the multiple sample subsets, so that target clusters can be filtered out to form a target cluster set.
  • The sample redundancy of a cluster measures its degree of data repetition. For example, if the total number of data items in a certain sample subset is Y1 and the number of repeated data items is Y2, then the sample redundancy of this sample subset is Y2/Y1. Selecting a subset of samples with lower redundancy can significantly reduce the labeling cost of speech recognition tasks in the deep-learning context.
  • the label value obtaining unit 140 is configured to obtain label values corresponding to each speech feature in the target cluster set, so as to obtain a current speech sample set corresponding to the target cluster set.
  • Since the target cluster set has been selected, only a small number of samples need to be labeled at this stage; that is, the current speech sample set corresponding to the target cluster set is obtained. Using less labeled data can significantly improve the training speed of the speech recognition model and reduce the computational pressure on the speech processing system.
  • The speech recognition model training unit 150 is configured to use each speech feature in the current speech sample set as the input of the to-be-trained speech recognition model, and the label value corresponding to each speech feature as its output, so as to train the to-be-trained speech recognition model and obtain a speech recognition model; wherein the to-be-trained speech recognition model includes a connectionist temporal classification (CTC) sub-model and an attention-based sub-model.
  • That is, a hybrid of a CTC (Connectionist Temporal Classification) model and an Attention model (i.e., a model based on an attention mechanism) can be used.
  • CTC decoding recognizes speech by predicting the output of each frame. The algorithm is implemented under the assumption that the decoding of each frame is independent, so it lacks the connection between preceding and following speech features during decoding and relies on a language model for correction.
  • the Attention decoding process has nothing to do with the frame order of the input speech.
  • Each decoding unit generates the current result through the decoding result of the previous unit and the overall speech characteristics.
  • Its decoding process, however, ignores the monotonic timing of speech, so a hybrid model can be used that takes advantage of both approaches.
  • The CTC sub-model is placed closer to the input for preliminary processing, and the attention-based sub-model is placed closer to the output for subsequent processing.
  • The network structure of the speech recognition model to be trained adopts structures such as LSTM, CNN, or GRU, and the two decoders jointly output the recognition result.
  • the geometry-based voice sample screening apparatus 100 further includes:
  • a data uploading unit, used for uploading the first model parameter set corresponding to the CTC sub-model and the second model parameter set corresponding to the attention-based sub-model in the speech recognition model to a blockchain network.
  • the corresponding summary information is obtained based on the first model parameter set and the second model parameter set.
  • Specifically, the summary information is obtained by hashing the first model parameter set and the second model parameter set, for example, using the SHA-256 algorithm.
  • Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users.
  • the user equipment can download the summary information from the blockchain, so as to verify whether the first model parameter set and the second model parameter set have been tampered with.
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • The speech recognition result sending unit 160 is configured to, if the current speech data to be recognized uploaded by the user terminal is detected, input the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtain and send the current speech recognition result to the user terminal.
  • After the training of the speech recognition model is completed in the server, the model can be applied to speech recognition. That is, once the server detects the current voice data to be recognized uploaded by the client, it inputs the voice feature corresponding to the current voice data to be recognized into the speech recognition model for computation, and obtains and sends the current speech recognition result to the client. In this way, the current speech recognition result can be fed back quickly.
  • the speech recognition result sending unit 160 includes:
  • a first decoding unit, used for inputting the speech feature corresponding to the currently to-be-recognized speech data into the CTC sub-model for computation to obtain a first recognition sequence;
  • a second decoding unit, configured to input the first recognition sequence into the attention-based sub-model for computation, and to obtain and send the current speech recognition result to the user terminal.
  • Since the CTC sub-model is placed closer to the input end and the attention-based sub-model closer to the output end, the currently to-be-recognized speech data is first input into the CTC sub-model for computation to obtain a first recognition sequence, and the first recognition sequence is then input into the attention-based sub-model for computation to obtain the current speech recognition result.
  • In this way, the relationship between preceding and following speech features is fully considered during decoding, as is the monotonic timing of speech, so the results obtained by this model are more accurate.
  • the device realizes the automatic selection of samples with less redundancy to train the speech recognition model, reduces the labeling cost of speech recognition tasks in the context of deep learning, and improves the training speed of the speech recognition model.
  • the above-mentioned apparatus for screening speech samples based on geometry can be implemented in the form of a computer program, and the computer program can be executed on a computer device as shown in FIG. 4 .
  • FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • the computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 500 includes a processor 502 , a memory and a network interface 505 connected through a system bus 501 , wherein the memory may include a non-volatile storage medium 503 and an internal memory 504 .
  • the nonvolatile storage medium 503 can store an operating system 5031 and a computer program 5032 .
  • the computer program 5032 when executed, can cause the processor 502 to perform a geometry-based voice sample screening method.
  • the processor 502 is used to provide computing and control capabilities to support the operation of the entire computer device 500 .
  • The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when executed by the processor 502, the computer program 5032 can cause the processor 502 to perform the geometry-based voice sample screening method.
  • the network interface 505 is used for network communication, such as providing transmission of data information.
  • FIG. 4 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
  • the processor 502 is configured to run the computer program 5032 stored in the memory to implement the geometry-based voice sample screening method disclosed in the embodiment of the present application.
  • The embodiment of the computer device shown in FIG. 4 does not constitute a limitation on the specific structure of the computer device; in other embodiments, the computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are the same as those of the embodiment shown in FIG. 4 , and details are not repeated here.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • a computer-readable storage medium may be a non-volatile computer-readable storage medium, or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, where the computer program implements the geometry-based voice sample screening method disclosed in the embodiments of the present application when the computer program is executed by the processor.
  • The disclosed device, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • The division of the units is only a logical functional division; in actual implementation there may be other division manners. For example, units with the same function may be grouped into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium.
  • The technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, an optical disk, or other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech sample screening method and apparatus (100) based on geometry, and a computer device (500) and a storage medium, which relate to artificial intelligence technology. The method comprises: acquiring an initial speech sample set, and extracting a speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to constitute a speech feature set (S110); acquiring a Euclidean distance between speech features in the speech feature set by means of a dynamic time warping algorithm, so as to perform K-means clustering to obtain a clustering result (S120); calling a preset sample subset screening condition, and acquiring, from the clustering result, a cluster that meets the sample subset screening condition, so as to constitute a target cluster set (S130); and acquiring, from the target cluster set, an annotated value corresponding to each speech feature, so as to obtain a current speech sample set corresponding to the target cluster set (S140). Samples with a relatively small redundancy are automatically selected for the training of a speech recognition model, thereby reducing the annotation cost of a speech recognition task in a deep learning background, and improving the training speed of a speech recognition model.

Description

基于几何学的语音样本筛选方法、装置、计算机设备及存储介质Geometry-based voice sample screening method, device, computer equipment and storage medium
本申请要求于2020年12月1日提交中国专利局、申请号为202011387398.0,发明名称为“基于几何学的语音样本筛选方法、装置及计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed on December 1, 2020 with the application number 202011387398.0 and the invention titled "Geometry-based voice sample screening method, device and computer equipment", the entire content of which is approved by Reference is incorporated in this application.
技术领域technical field
本申请涉及人工智能的语音语义技术领域,尤其涉及一种基于几何学的语音样本筛选方法、装置、计算机设备及存储介质。The present application relates to the technical field of speech semantics of artificial intelligence, and in particular, to a method, device, computer equipment and storage medium for screening speech samples based on geometry.
背景技术Background technique
近年来,随着深度神经网络技术(Deep Neural Network,DNN)在信号处理领域的巨大成功,基于DNN的语音识别算法愈来成为研究热点,然而通过监督学习的方式训练语音识别的DNN通常需要大量的带有标注的语音数据。虽然随着感知设备的发展和推广,未标注的语音数据变得更加容易获取。但是给未标注的语音数据人工打上标注仍然需耗费大量的人力成本。In recent years, with the great success of Deep Neural Network (DNN) in the field of signal processing, DNN-based speech recognition algorithms have become a research hotspot. However, training DNNs for speech recognition through supervised learning usually requires a large number of Annotated speech data. Although with the development and promotion of perception devices, unlabeled speech data has become more accessible. However, manual labeling of unlabeled speech data still requires a lot of labor costs.
为了给未标注的语音数据打标注可采用主动学习技术,主动学习是机器学习的一个分支,它允许模型自行选择要学习的数据。主动学习的思想来源于一个假设∶即一个机器学习算法,如果能自行选择想要学习的数据,那么只用较少的训练数据,它将表现得更好。In order to label unlabeled speech data, active learning techniques can be used. Active learning is a branch of machine learning that allows the model to choose the data to learn on its own. The idea of active learning comes from the assumption that a machine learning algorithm, if it can choose the data it wants to learn from, will perform better with less training data.
最广泛使用的主动学习查询策略叫做不确定性采样(Uncertainty Sampling),在该项技术中模型将选择模型预测最不确定的样本进行标注。发明人意识到该技术在样本选择数量较小的情况下取得了良好的效果,但是在使用深度神经网络作为训练模型的背景下,模型需要大量的训练数据,随着选择的标注样本数量的增长,模型预测不确定的样本就会有冗余和重叠,更加容易选择到相似的样本。然而选择出这些相似的样本对模型训练的帮助是十分有限的。The most widely used active learning query strategy is called uncertainty sampling (Uncertainty Sampling), in which the model will select the most uncertain samples predicted by the model for labeling. The inventors realized that this technique achieves good results with a small number of selected samples, but in the context of using a deep neural network as a training model, the model requires a large amount of training data, and as the number of selected labeled samples grows , the model predicts uncertain samples will have redundancy and overlap, and it is easier to select similar samples. However, selecting these similar samples is of limited help in model training.
而且,语音数据不同于图片等非序列数据,语音数据具长度不定,结构化信息丰富等特点,对于语音数据的处理和选择难度会更大。Moreover, voice data is different from non-sequential data such as pictures. Voice data has the characteristics of variable length and rich structured information, which makes it more difficult to process and select voice data.
发明内容SUMMARY OF THE INVENTION
本申请实施例提供了一种基于几何学的语音样本筛选方法、装置、计算机设备及存储介质,旨在解决现有技术中通过不确定性采样技术应用于语音识别用途的神经网络的训练时,模型预测不确定的样本就会有冗余和重叠,这些相似的样本对模型训练的帮助有限,而且因语音结构复杂导致不确定性采样技术选择出语音样本的难度较大的的问题。The embodiments of the present application provide a geometry-based voice sample screening method, device, computer equipment, and storage medium, aiming to solve the problem of using the uncertainty sampling technology in the training of a neural network for voice recognition in the prior art. The uncertain samples predicted by the model will have redundancy and overlap. These similar samples are of limited help for model training. Moreover, due to the complex structure of speech, it is difficult for uncertain sampling techniques to select speech samples.
第一方面,本申请实施例提供了一种基于几何学的语音样本筛选方法,其包括:In a first aspect, an embodiment of the present application provides a method for screening speech samples based on geometry, which includes:
获取初始语音样本集,提取所述初始语音样本集中每一条初始语音样本数据对应的语音特征,以组成语音特征集;其中,所述初始语音样本集中包括多条初始语音样本数据;Obtaining an initial voice sample set, extracting the voice features corresponding to each piece of initial voice sample data in the initial voice sample set, to form a voice feature set; wherein the initial voice sample set includes multiple pieces of initial voice sample data;
通过动态时间规整算法获取所述语音特征集中各语音特征之间的欧式距离,根据各语音特征之间的欧式距离进行K-means聚类,以得到聚类结果;Obtain the Euclidean distance between the voice features in the voice feature set by a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distance between the voice features to obtain a clustering result;
调用预设的样本子集筛选条件,获取所述聚类结果中满足所述样本子集筛选条件的聚类簇,以组成目标聚类簇集合;Calling the preset sample subset screening conditions, and obtaining the cluster clusters that satisfy the sample subset screening conditions in the clustering result, so as to form a target cluster cluster set;
获取所述目标聚类簇集合中每一语音特征对应的标注值,以得到与所述目标聚类簇集合对应的当前语音样本集;Obtain the label value corresponding to each voice feature in the target cluster set to obtain the current voice sample set corresponding to the target cluster set;
将所述当前语音样本集中每一语音特征作为待训练语音识别模型的输入,将每一语音特征对应的标注值作为待训练语音识别模型的输出以对待训练语音识别模型进行训练,得到语音识别模型;其中,所述待训练语音识别模型中包括链接时序分类子模型和基于注意力机制子模型;以及Taking each speech feature in the current speech sample set as the input of the speech recognition model to be trained, and using the corresponding label value of each speech feature as the output of the speech recognition model to be trained to train the speech recognition model to be trained to obtain a speech recognition model ; Wherein, the to-be-trained speech recognition model includes a link timing classification sub-model and an attention-based mechanism sub-model; and
若检测到用户端上传的当前待识别语音数据,将所述当前待识别语音数据对应的语音特征输入至所述语音识别模型进行运算,得到并向用户端发送当前语音识别结果。If the currently to-be-recognized voice data uploaded by the client is detected, the voice feature corresponding to the current to-be-recognized voice data is input into the voice recognition model for operation, and the current voice recognition result is obtained and sent to the client.
In a second aspect, an embodiment of the present application provides a geometry-based speech sample screening apparatus, which includes:
a speech feature extraction unit, configured to obtain an initial speech sample set and extract the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set, where the initial speech sample set includes multiple pieces of initial speech sample data;
a speech feature clustering unit, configured to obtain the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distances between the speech features, so as to obtain a clustering result;
a clustering result screening unit, configured to invoke the preset sample subset screening conditions and obtain, from the clustering result, the clusters that satisfy the sample subset screening conditions, so as to form a target cluster set;
a label value obtaining unit, configured to obtain the label value corresponding to each speech feature in the target cluster set, so as to obtain a current speech sample set corresponding to the target cluster set;
a speech recognition model training unit, configured to use each speech feature in the current speech sample set as an input of a speech recognition model to be trained and use the label value corresponding to each speech feature as an output of the speech recognition model to be trained, so as to train the speech recognition model to be trained and obtain a speech recognition model, where the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-mechanism-based sub-model; and
a speech recognition result sending unit, configured to: if current to-be-recognized speech data uploaded by a user terminal is detected, input the speech feature corresponding to the current to-be-recognized speech data into the speech recognition model for computation, and obtain and send a current speech recognition result to the user terminal.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the following steps when executing the computer program:
obtaining an initial speech sample set, and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set, where the initial speech sample set includes multiple pieces of initial speech sample data;
obtaining the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances between the speech features, so as to obtain a clustering result;
invoking the preset sample subset screening conditions, and obtaining, from the clustering result, the clusters that satisfy the sample subset screening conditions, so as to form a target cluster set;
obtaining the label value corresponding to each speech feature in the target cluster set, so as to obtain a current speech sample set corresponding to the target cluster set;
using each speech feature in the current speech sample set as an input of a speech recognition model to be trained, and using the label value corresponding to each speech feature as an output of the speech recognition model to be trained, so as to train the speech recognition model to be trained and obtain a speech recognition model, where the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-mechanism-based sub-model; and
if current to-be-recognized speech data uploaded by a user terminal is detected, inputting the speech feature corresponding to the current to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the user terminal.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
obtaining an initial speech sample set, and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set, where the initial speech sample set includes multiple pieces of initial speech sample data;
obtaining the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances between the speech features, so as to obtain a clustering result;
invoking the preset sample subset screening conditions, and obtaining, from the clustering result, the clusters that satisfy the sample subset screening conditions, so as to form a target cluster set;
obtaining the label value corresponding to each speech feature in the target cluster set, so as to obtain a current speech sample set corresponding to the target cluster set;
using each speech feature in the current speech sample set as an input of a speech recognition model to be trained, and using the label value corresponding to each speech feature as an output of the speech recognition model to be trained, so as to train the speech recognition model to be trained and obtain a speech recognition model, where the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-mechanism-based sub-model; and
if current to-be-recognized speech data uploaded by a user terminal is detected, inputting the speech feature corresponding to the current to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the user terminal.
The embodiments of the present application provide a geometry-based speech sample screening method, apparatus, computer device, and storage medium, which automatically select samples with low redundancy for training a speech recognition model, reduce the labeling cost of speech recognition tasks in the context of deep learning, and increase the training speed of the speech recognition model.
Description of Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the geometry-based speech sample screening method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the geometry-based speech sample screening method provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of the geometry-based speech sample screening apparatus provided by an embodiment of the present application;
FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

It should be understood that, when used in this specification and the appended claims, the terms "include" and "comprise" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or sets thereof.

It should also be understood that the terminology used in this specification of the present application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should further be understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the geometry-based speech sample screening method provided by an embodiment of the present application, and FIG. 2 is a schematic flowchart of the geometry-based speech sample screening method provided by an embodiment of the present application. The geometry-based speech sample screening method is applied in a server and is executed by application software installed in the server.

As shown in FIG. 2, the method includes steps S110 to S160.
S110: obtain an initial speech sample set, and extract the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set, where the initial speech sample set includes multiple pieces of initial speech sample data.

In this embodiment, to train the speech recognition model in the server with fewer labeled samples, data preprocessing and feature extraction may first be performed on the initial speech sample set to obtain the speech feature corresponding to each piece of initial speech sample data, so as to form the speech feature set. Data preprocessing includes operations such as pre-emphasis, framing, and windowing. The purpose of these operations is to eliminate the effects on speech signal quality of aliasing, higher-order harmonic distortion, high-frequency components, and other factors caused by defects in the human vocal organs themselves and in the acquisition equipment, making the obtained signal as uniform and smooth as possible.
In an embodiment, step S110 includes:
sampling each piece of initial speech sample data in the initial speech sample set with a pre-stored sampling period, so as to obtain the current discrete speech signal corresponding to each piece of initial speech sample data;
pre-emphasizing the current discrete speech signal corresponding to each piece of initial speech sample data with a pre-stored first-order FIR high-pass digital filter, so as to obtain the current pre-emphasized speech signal corresponding to each piece of initial speech sample data;
windowing the current pre-emphasized speech signal corresponding to each piece of initial speech sample data with a pre-stored Hamming window, so as to obtain the windowed speech data corresponding to each piece of initial speech sample data;
framing the windowed speech data corresponding to each piece of initial speech sample data with a pre-stored frame shift and frame length, so as to obtain the preprocessed speech data corresponding to each piece of initial speech sample data;
performing Mel-frequency cepstral coefficient extraction or filter-bank extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, so as to obtain the speech feature corresponding to each piece of initial speech sample data and form the speech feature set.
In this embodiment, before the speech signal is processed digitally, the initial speech sample data (denoted as s(t)) is first sampled with the sampling period T and discretized into s(n).

Then, when the pre-stored first-order FIR high-pass digital filter is invoked, the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter, and its transfer function is given by Equation (1):

H(z) = 1 − a·z^(−1)    (1)

In a specific implementation, a takes the value 0.98. For example, let the sample value of the current discrete speech signal at time n be x(n); the sample value corresponding to x(n) in the current pre-emphasized speech signal after pre-emphasis is y(n) = x(n) − a·x(n−1).
Afterwards, the invoked Hamming window function is given by Equation (2):

ω(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1    (2)

The current pre-emphasized speech signal is windowed with this Hamming window, and the resulting windowed speech data can be expressed as Q(n) = y(n)·ω(n).
Finally, when the pre-stored frame shift and frame length are invoked to frame the windowed speech data, let the time-domain signal corresponding to the windowed speech data be x(l); then the n-th frame of speech data in the preprocessed speech data after windowing and framing is x_n(m), and x_n(m) satisfies Equation (3):

x_n(m) = ω(m)·x(n+m), 0 ≤ m ≤ N−1    (3)

where n = 0, T, 2T, ..., N is the frame length, T is the frame shift, and ω(·) is the Hamming window function.

Preprocessing the initial speech sample data in this way prepares it for subsequent acoustic parameter extraction, for example extraction of Mel-frequency cepstral coefficients (MFCC) or of filter-bank (Fbank) features; after extraction, the speech feature corresponding to each piece of initial speech sample data is obtained, forming the speech feature set.
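As a concrete illustration, the following is a minimal Python sketch of the preprocessing chain described above (pre-emphasis with a = 0.98, Hamming windowing, and framing). The function name and the frame parameters are illustrative assumptions, and the window is applied per frame, which is the conventional realization of Equations (2) and (3):

```python
import numpy as np

def preprocess(signal, a=0.98, frame_len=400, frame_shift=160):
    """Pre-emphasis, Hamming windowing, and framing of a sampled signal.

    A minimal sketch of the preprocessing in step S110; `signal` is the
    discretized s(n), `a` the pre-emphasis coefficient of Equation (1).
    The signal is assumed to be at least one frame long.
    """
    # Pre-emphasis, per Equation (1): y(n) = x(n) - a*x(n-1)
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # Hamming window, Equation (2): w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = np.hamming(frame_len)

    # Framing, Equation (3): one frame starts every `frame_shift` samples
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames  # each row is one windowed frame
```

The rows returned by this sketch are what MFCC or filter-bank extraction would then operate on.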
S120: obtain the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distances between the speech features, so as to obtain a clustering result.

In this embodiment, since the pieces of initial speech sample data in the initial speech sample set differ from one another, the difference between two pieces of initial speech sample data can be quantified as the Euclidean distance between their corresponding speech features.

In most cases, however, the lengths of any two pieces of initial speech sample data are unequal; in the field of speech processing this shows up as different people speaking at different rates. Even when the same person utters the same sound at different times, the durations will not be exactly equal. Moreover, different speakers pronounce the phonemes of the same word at different speeds: some draw out the "E" sound slightly, others shorten the "o". In such a complex situation, the traditional Euclidean distance cannot accurately capture the similarity between two pieces of initial speech sample data. In this case, the Euclidean distance between the speech features in the speech feature set can be obtained by means of a dynamic time warping algorithm.
In an embodiment, obtaining the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm in step S120 includes:
obtaining the i-th speech feature and the j-th speech feature in the speech feature set, where the speech feature set includes N speech features, the value ranges of i and j are both [1, N], and i is not equal to j;
determining whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature;
if the number of frames of the first speech sequence corresponding to the i-th speech feature is not equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, constructing an n*m distance matrix D, and obtaining the minimum value over the matrix elements of the distance matrix D as the Euclidean distance between the i-th speech feature and the j-th speech feature, where n is equal to the number of frames of the first speech sequence, m is equal to the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
In this embodiment, the computation of the Euclidean distance between any two speech features in the speech feature set is illustrated by computing the distance between the i-th speech feature and the j-th speech feature; the computation is repeated until the Euclidean distances between all speech features in the speech feature set have been obtained.

When computing the Euclidean distance between any two speech features, it is first determined whether their numbers of speech sequence frames are equal (for example, whether the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature). If the numbers of frames are not equal, an n*m distance matrix D needs to be constructed, and the minimum value over the matrix elements of the distance matrix D is obtained as the Euclidean distance between the i-th speech feature and the j-th speech feature, where n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.

For example, the matrix element d(x, y) denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature. A shortest path from d(0, 0) to d(n, m) is found, and the length of that path is taken as the distance between the i-th speech feature and the j-th speech feature; the path must satisfy continuity and temporal monotonicity (no backtracking). The above computation uses the dynamic time warping algorithm (Dynamic Time Warping, abbreviated DTW).
In an embodiment, after the determining in step S120 whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, the method further includes:
if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, directly computing the Euclidean distance between the i-th speech feature and the j-th speech feature.

In this embodiment, when it is determined that the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature, the two have the same duration, and the Euclidean distance between them can be computed directly, without going through the process of constructing the distance matrix D.
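To make the two cases above concrete, the following Python sketch (the function name and the array layout are illustrative assumptions) computes the distance between two speech features, each an array of shape (frames, dims): a direct Euclidean distance when the frame counts are equal, and otherwise the cost of the cheapest continuous, temporally monotone path through the frame-distance matrix D, in the spirit of DTW:

```python
import numpy as np

def dtw_distance(feat_i, feat_j):
    """Distance between two speech features as described in step S120."""
    n, m = len(feat_i), len(feat_j)
    if n == m:
        # Equal frame counts: direct Euclidean distance, no matrix needed.
        return float(np.linalg.norm(feat_i - feat_j))

    # d(x, y): Euclidean distance between frame x of feat_i and frame y of feat_j.
    D = np.linalg.norm(feat_i[:, None, :] - feat_j[None, :, :], axis=-1)

    # Accumulate the cheapest continuous, monotone path from d(0, 0) to d(n-1, m-1).
    acc = np.full((n, m), np.inf)
    acc[0, 0] = D[0, 0]
    for x in range(n):
        for y in range(m):
            if x == 0 and y == 0:
                continue
            prev = min(
                acc[x - 1, y] if x > 0 else np.inf,                # advance in feat_i
                acc[x, y - 1] if y > 0 else np.inf,                # advance in feat_j
                acc[x - 1, y - 1] if x > 0 and y > 0 else np.inf,  # advance in both
            )
            acc[x, y] = D[x, y] + prev
    return float(acc[-1, -1])
```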
In an embodiment, performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result in step S120 includes:
selecting, from the speech feature set, a number of speech features equal to the preset number of clusters, and using the selected speech features as the initial cluster centers of the respective clusters;
partitioning the speech feature set according to the Euclidean distance between each speech feature in the speech feature set and each initial cluster center, so as to obtain an initial clustering result;
obtaining the adjusted cluster center of each cluster according to the initial clustering result;
partitioning the speech feature set according to the Euclidean distances to the adjusted cluster centers until the clustering result stays unchanged for more than a preset number of iterations, so as to obtain clusters corresponding to the preset number of clusters and form the clustering result.
In this embodiment, since the Euclidean distance between the speech features can be computed with the dynamic time warping algorithm, the speech feature set can be clustered with the K-means clustering method. The specific process is as follows:

a) Arbitrarily select N2 speech features from the speech feature set of N1 speech features as the initial cluster centers of N2 clusters. Here the initial total number of speech features in the speech feature set is N1, from which N2 speech features are arbitrarily selected (N2 < N1, N2 being the preset number of clusters, that is, the desired number of clusters), and the initially selected N2 speech features serve as the initial cluster centers.

b) Compute the Euclidean distance from each of the remaining speech features to the N2 initial cluster centers, and assign each remaining speech feature to the cluster with the smallest Euclidean distance, so as to obtain the initial clustering result. That is, each remaining speech feature selects the initial cluster center nearest to it and is grouped with that center; in this way the speech features are partitioned into N2 clusters around the initially selected centers, each cluster of data having one initial cluster center.

c) Recompute the cluster center of each of the N2 clusters according to the initial clustering result.

d) Re-cluster all elements of the N1 speech features according to the new cluster centers.

e) Repeat step d) until the clustering result no longer changes, so as to obtain the clustering result corresponding to the preset number of clusters.

After the cluster classification is completed, the speech feature set can be grouped quickly into multiple clusters. The server can then select, from the multiple clusters, the clusters that satisfy the conditions as training samples and label them.
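A possible Python sketch of this clustering loop follows, reusing the dtw_distance function sketched above. Because averaging variable-length speech features is not defined, this sketch keeps each cluster center as an actual member feature (a medoid-style re-centering), which is an assumption about how the recomputed centers of step c) would be realized:

```python
import random

def kmeans_with_dtw(features, n_clusters, max_stable=3, seed=0):
    """K-means-style clustering over DTW distances (step S120 sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(range(len(features)), n_clusters)  # indices of initial centers
    prev, stable = None, 0
    while stable <= max_stable:
        # Assign every feature to the nearest current center.
        labels = [
            min(range(n_clusters), key=lambda c: dtw_distance(f, features[centers[c]]))
            for f in features
        ]
        # Re-center: pick the member minimizing total distance to its cluster.
        for c in range(n_clusters):
            members = [k for k, l in enumerate(labels) if l == c]
            if members:
                centers[c] = min(
                    members,
                    key=lambda k: sum(dtw_distance(features[k], features[o]) for o in members),
                )
        # Count how many consecutive iterations the assignment has stayed unchanged.
        stable = stable + 1 if labels == prev else 0
        prev = labels
    return labels, centers
```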
S130: invoke the preset sample subset screening conditions, and obtain, from the clustering result, the clusters that satisfy the sample subset screening conditions, so as to form a target cluster set.

In this embodiment, the sample subset screening condition can be set such that the sample redundancy is the minimum among the multiple sample subsets; in this way the target clusters are screened out to form the target cluster set. When the sample redundancy of a cluster is computed, what is computed is the degree of data repetition: for example, if the total number of data items in a sample subset is Y1 and the total number of repeated data items among them is Y2, the sample redundancy of that sample subset is Y2/Y1. Selecting the sample subset with lower redundancy significantly reduces the labeling cost of speech recognition tasks in the context of deep learning.
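Under one reading of this definition (a hypothetical sketch; the document does not fix exactly how repeats are counted, and the items are assumed here to be hashable identifiers of samples), the redundancy and the screening could be computed as follows:

```python
def sample_redundancy(subset):
    """Redundancy Y2/Y1 of one sample subset (cluster), as in step S130.

    Y1 is the total number of items; Y2 here counts the items that are
    repeats of an earlier item (one reading of "repeated data items").
    """
    y1 = len(subset)
    y2 = y1 - len(set(subset))  # repeats beyond the first occurrence
    return y2 / y1 if y1 else 0.0

def select_target_clusters(clusters):
    """Keep the cluster(s) whose redundancy equals the minimum observed."""
    redundancies = [sample_redundancy(c) for c in clusters]
    best = min(redundancies)
    return [c for c, r in zip(clusters, redundancies) if r == best]
```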
S140: obtain the label value corresponding to each speech feature in the target cluster set, so as to obtain the current speech sample set corresponding to the target cluster set.

In this embodiment, since the target cluster set has already been selected, only a small number of samples need to be labeled at this point to obtain the current speech sample set corresponding to the target cluster set. Using less labeled data significantly increases the training speed of the speech recognition model and reduces the computational load on the speech processing system.
S150: use each speech feature in the current speech sample set as an input of the speech recognition model to be trained, and use the label value corresponding to each speech feature as an output of the speech recognition model to be trained, so as to train the speech recognition model to be trained and obtain the speech recognition model, where the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-mechanism-based sub-model.

In this embodiment, to train a speech recognition model with higher recognition accuracy based on the current speech sample set, a model in which a CTC (Connectionist Temporal Classification) model and an Attention model (that is, an attention-mechanism-based model) decode jointly can be adopted. CTC decoding recognizes speech by predicting the output of each frame; the algorithm is implemented on the assumption that the decoding of each frame is independent, so it lacks the connection between preceding and following speech features during decoding and relies on correction by a language model. The Attention decoding process, by contrast, is independent of the frame order of the input speech: each decoding unit generates the current result from the decoding result of the previous unit and the overall speech features, but the decoding process ignores the monotonic temporal order of speech. A hybrid model can therefore be adopted that combines the advantages of both. Generally, the connectionist temporal classification sub-model is placed closer to the input for preliminary processing, and the attention-mechanism-based sub-model is placed closer to the output for subsequent processing. The network structure of the speech recognition model to be trained adopts structures such as LSTM/CNN/GRU, and the two decoders jointly output the recognition result.
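One common way to realize joint training of two such decoders is to interpolate their losses. The following PyTorch sketch is an assumption in that spirit, not the document's prescribed objective; the weight lam, the padding convention, and the tensor shapes are all illustrative:

```python
import torch.nn as nn

class HybridCtcAttentionLoss(nn.Module):
    """Weighted joint objective for a hybrid CTC/attention model (sketch)."""

    def __init__(self, lam=0.3, blank=0, pad_id=-100):
        super().__init__()
        self.lam = lam  # interpolation weight between the two losses (assumed)
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.att = nn.CrossEntropyLoss(ignore_index=pad_id)

    def forward(self, ctc_log_probs, in_lens, att_logits, targets, tgt_lens):
        # ctc_log_probs: (time, batch, vocab) log-probabilities from the CTC head.
        # Entries of `targets` beyond tgt_lens are ignored by CTCLoss; clamping
        # keeps the pad_id placeholder from producing invalid label indices.
        ctc_loss = self.ctc(ctc_log_probs, targets.clamp(min=0), in_lens, tgt_lens)
        # att_logits: (batch, tgt_len, vocab) from the attention decoder;
        # CrossEntropyLoss expects (batch, vocab, tgt_len).
        att_loss = self.att(att_logits.transpose(1, 2), targets)
        return self.lam * ctc_loss + (1.0 - self.lam) * att_loss
```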
In an embodiment, after step S150 the method further includes:
uploading the first model parameter set corresponding to the connectionist temporal classification sub-model and the second model parameter set corresponding to the attention-mechanism-based sub-model in the speech recognition model to a blockchain network.

In this embodiment, the corresponding digest information is obtained based on the first model parameter set and the second model parameter set. Specifically, the digest information is obtained by hashing the first model parameter set and the second model parameter set, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency to users. A user device can download the digest information from the blockchain to verify whether the first model parameter set and the second model parameter set have been tampered with.

The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
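A minimal sketch of the digest computation described above, using Python's standard hashlib; the JSON serialization with sorted keys is an assumption made only so that the digest is deterministic:

```python
import hashlib
import json

def model_digest(first_params, second_params):
    """SHA-256 digest over the two parameter sets, for tamper checking."""
    payload = json.dumps(
        {"ctc": first_params, "attention": second_params},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# A user device can recompute this digest and compare it with the one
# downloaded from the blockchain to detect tampering.
```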
S160: if current to-be-recognized speech data uploaded by a user terminal is detected, input the speech feature corresponding to the current to-be-recognized speech data into the speech recognition model for computation, and obtain and send a current speech recognition result to the user terminal.

In this embodiment, once the training of the speech recognition model has been completed in the server, the model can be applied specifically to speech recognition. That is, as soon as the server receives the current to-be-recognized speech data uploaded by the user terminal, it inputs the speech feature corresponding to the current to-be-recognized speech data into the speech recognition model for computation, and obtains and sends the current speech recognition result to the user terminal. In this way the current speech recognition result can be fed back rapidly.
In an embodiment, step S160 includes:
inputting the speech feature corresponding to the current to-be-recognized speech data into the connectionist temporal classification sub-model for computation, so as to obtain a first recognition sequence;
inputting the first recognition sequence into the attention-mechanism-based sub-model for computation, so as to obtain and send the current speech recognition result to the user terminal.

In this embodiment, since the connectionist temporal classification sub-model is placed closer to the input and the attention-mechanism-based sub-model is placed closer to the output, the current to-be-recognized speech data is first input into the connectionist temporal classification sub-model to obtain the first recognition sequence, and the first recognition sequence is then input into the attention-mechanism-based sub-model to obtain the current speech recognition result. This fully takes into account both the connection between preceding and following speech features during decoding and the monotonic temporal order of speech, so the result recognized by this model is more accurate.
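The two-stage decoding can be summarized by the following sketch, where the sub-model call signatures are purely illustrative assumptions:

```python
def recognize(features, ctc_submodel, attention_submodel):
    """Two-stage decoding sketch for step S160."""
    first_sequence = ctc_submodel(features)      # preliminary, frame-wise decode
    result = attention_submodel(first_sequence)  # context-aware refinement
    return result  # current speech recognition result, sent back to the user terminal
```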
This method automatically selects samples with low redundancy to train the speech recognition model, reduces the labeling cost of speech recognition tasks in the context of deep learning, and increases the training speed of the speech recognition model.
An embodiment of the present application further provides a geometry-based speech sample screening apparatus, which is configured to perform any embodiment of the foregoing geometry-based speech sample screening method. Specifically, please refer to FIG. 3, which is a schematic block diagram of the geometry-based speech sample screening apparatus provided by an embodiment of the present application. The geometry-based speech sample screening apparatus 100 may be configured in a server.

As shown in FIG. 3, the geometry-based speech sample screening apparatus 100 includes: a speech feature extraction unit 110, a speech feature clustering unit 120, a clustering result screening unit 130, a label value obtaining unit 140, a speech recognition model training unit 150, and a speech recognition result sending unit 160.
The speech feature extraction unit 110 is configured to obtain an initial speech sample set and extract the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set, so as to form a speech feature set, where the initial speech sample set includes multiple pieces of initial speech sample data.

In this embodiment, to train the speech recognition model in the server with fewer labeled samples, data preprocessing and feature extraction may first be performed on the initial speech sample set to obtain the speech feature corresponding to each piece of initial speech sample data, so as to form the speech feature set. Data preprocessing includes operations such as pre-emphasis, framing, and windowing. The purpose of these operations is to eliminate the effects on speech signal quality of aliasing, higher-order harmonic distortion, high-frequency components, and other factors caused by defects in the human vocal organs themselves and in the acquisition equipment, making the obtained signal as uniform and smooth as possible.
In an embodiment, the speech feature extraction unit 110 includes:
a discrete sampling unit, configured to sample each piece of initial speech sample data in the initial speech sample set with a pre-stored sampling period, so as to obtain the current discrete speech signal corresponding to each piece of initial speech sample data;
a pre-emphasis unit, configured to pre-emphasize the current discrete speech signal corresponding to each piece of initial speech sample data with a pre-stored first-order FIR high-pass digital filter, so as to obtain the current pre-emphasized speech signal corresponding to each piece of initial speech sample data;
a windowing unit, configured to window the current pre-emphasized speech signal corresponding to each piece of initial speech sample data with a pre-stored Hamming window, so as to obtain the windowed speech data corresponding to each piece of initial speech sample data;
a framing unit, configured to frame the windowed speech data corresponding to each piece of initial speech sample data with a pre-stored frame shift and frame length, so as to obtain the preprocessed speech data corresponding to each piece of initial speech sample data;
a feature extraction unit, configured to perform Mel-frequency cepstral coefficient extraction or filter-bank extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, so as to obtain the speech feature corresponding to each piece of initial speech sample data and form the speech feature set.
In this embodiment, before the speech signal is processed digitally, the initial speech sample data (denoted as s(t)) is first sampled with the sampling period T and discretized into s(n).

Then, when the pre-stored first-order FIR high-pass digital filter is invoked, the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter, and its transfer function is as given in Equation (1) above.

In a specific implementation, a takes the value 0.98. For example, let the sample value of the current discrete speech signal at time n be x(n); the sample value corresponding to x(n) in the current pre-emphasized speech signal after pre-emphasis is y(n) = x(n) − a·x(n−1).

Afterwards, the invoked Hamming window function is as given in Equation (2) above; the current pre-emphasized speech signal is windowed with the Hamming window, and the resulting windowed speech data can be expressed as Q(n) = y(n)·ω(n).

Finally, when the pre-stored frame shift and frame length are invoked to frame the windowed speech data, let the time-domain signal corresponding to the windowed speech data be x(l); then the n-th frame of speech data in the preprocessed speech data after windowing and framing is x_n(m), and x_n(m) satisfies Equation (3) above.

Preprocessing the initial speech sample data in this way prepares it for subsequent acoustic parameter extraction, for example extraction of Mel-frequency cepstral coefficients (MFCC) or of filter-bank (Fbank) features; after extraction, the speech feature corresponding to each piece of initial speech sample data is obtained, forming the speech feature set.
The speech feature clustering unit 120 is configured to obtain the Euclidean distance between the speech features in the speech feature set by means of a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distances between the speech features, so as to obtain a clustering result.

In this embodiment, since the pieces of initial speech sample data in the initial speech sample set differ from one another, the difference between two pieces of initial speech sample data can be quantified as the Euclidean distance between their corresponding speech features.

In most cases, however, the lengths of any two pieces of initial speech sample data are unequal; in the field of speech processing this shows up as different people speaking at different rates. Even when the same person utters the same sound at different times, the durations will not be exactly equal. Moreover, different speakers pronounce the phonemes of the same word at different speeds: some draw out the "E" sound slightly, others shorten the "o". In such a complex situation, the traditional Euclidean distance cannot accurately capture the similarity between two pieces of initial speech sample data. In this case, the Euclidean distance between the speech features in the speech feature set can be obtained by means of a dynamic time warping algorithm.
In an embodiment, the speech feature clustering unit 120 includes:
a speech feature selection unit, configured to obtain the i-th speech feature and the j-th speech feature in the speech feature set, where the speech feature set includes N speech features, the value ranges of i and j are both [1, N], and i is not equal to j;
a speech sequence frame count comparison unit, configured to determine whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature;
a first computation unit, configured to: if the number of frames of the first speech sequence corresponding to the i-th speech feature is not equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, construct an n*m distance matrix D, and obtain the minimum value over the matrix elements of the distance matrix D as the Euclidean distance between the i-th speech feature and the j-th speech feature, where n is equal to the number of frames of the first speech sequence, m is equal to the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
In this embodiment, the computation of the Euclidean distance between any two speech features in the speech feature set is illustrated by computing the distance between the i-th speech feature and the j-th speech feature; the computation is repeated until the Euclidean distances between all speech features in the speech feature set have been obtained.

When computing the Euclidean distance between any two speech features, it is first determined whether their numbers of speech sequence frames are equal (for example, whether the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature). If the numbers of frames are not equal, an n*m distance matrix D needs to be constructed, and the minimum value over the matrix elements of the distance matrix D is obtained as the Euclidean distance between the i-th speech feature and the j-th speech feature, where n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.

For example, the matrix element d(x, y) denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature. A shortest path from d(0, 0) to d(n, m) is found, and the length of that path is taken as the distance between the i-th speech feature and the j-th speech feature; the path must satisfy continuity and temporal monotonicity (no backtracking). The above computation uses the dynamic time warping algorithm (Dynamic Time Warping, abbreviated DTW).
In an embodiment, the speech feature clustering unit 120 further includes:
a second computation unit, configured to: if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, directly compute the Euclidean distance between the i-th speech feature and the j-th speech feature.

In this embodiment, when it is determined that the number of frames of the first speech sequence corresponding to the i-th speech feature equals the number of frames of the second speech sequence corresponding to the j-th speech feature, the two have the same duration, and the Euclidean distance between them can be computed directly, without going through the process of constructing the distance matrix D.
In an embodiment, the speech feature clustering unit 120 includes:
an initial cluster center obtaining unit, configured to select, from the speech feature set, a number of speech features equal to the preset number of clusters, and use the selected speech features as the initial cluster centers of the respective clusters;
an initial clustering unit, configured to partition the speech feature set according to the Euclidean distance between each speech feature in the speech feature set and each initial cluster center, so as to obtain an initial clustering result;
a cluster center adjustment unit, configured to obtain the adjusted cluster center of each cluster according to the initial clustering result;
a clustering adjustment unit, configured to partition the speech feature set according to the Euclidean distances to the adjusted cluster centers until the clustering result stays unchanged for more than a preset number of iterations, so as to obtain clusters corresponding to the preset number of clusters and form the clustering result.
In this embodiment, since the Euclidean distance between the speech features can be computed with the dynamic time warping algorithm, the speech feature set can be clustered with the K-means clustering method. After the cluster classification is completed, the speech feature set can be grouped quickly into multiple clusters. The server can then select, from the multiple clusters, the clusters that satisfy the conditions as training samples and label them.
The clustering result screening unit 130 is configured to invoke the preset sample subset screening conditions and obtain, from the clustering result, the clusters that satisfy the sample subset screening conditions, so as to form a target cluster set.

In this embodiment, the sample subset screening condition can be set such that the sample redundancy is the minimum among the multiple sample subsets; in this way the target clusters are screened out to form the target cluster set. When the sample redundancy of a cluster is computed, what is computed is the degree of data repetition: for example, if the total number of data items in a sample subset is Y1 and the total number of repeated data items among them is Y2, the sample redundancy of that sample subset is Y2/Y1. Selecting the sample subset with lower redundancy significantly reduces the labeling cost of speech recognition tasks in the context of deep learning.
The label value obtaining unit 140 is configured to obtain the label value corresponding to each speech feature in the target cluster set, so as to obtain the current speech sample set corresponding to the target cluster set.

In this embodiment, since the target cluster set has already been selected, only a small number of samples need to be labeled at this point to obtain the current speech sample set corresponding to the target cluster set. Using less labeled data significantly increases the training speed of the speech recognition model and reduces the computational load on the speech processing system.
The speech recognition model training unit 150 is configured to use each speech feature in the current speech sample set as an input of the speech recognition model to be trained, and use the label value corresponding to each speech feature as an output of the speech recognition model to be trained, so as to train the speech recognition model to be trained and obtain the speech recognition model, where the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-mechanism-based sub-model.

In this embodiment, to train a speech recognition model with higher recognition accuracy based on the current speech sample set, a model in which a CTC (Connectionist Temporal Classification) model and an Attention model (that is, an attention-mechanism-based model) decode jointly can be adopted. CTC decoding recognizes speech by predicting the output of each frame; the algorithm is implemented on the assumption that the decoding of each frame is independent, so it lacks the connection between preceding and following speech features during decoding and relies on correction by a language model. The Attention decoding process, by contrast, is independent of the frame order of the input speech: each decoding unit generates the current result from the decoding result of the previous unit and the overall speech features, but the decoding process ignores the monotonic temporal order of speech. A hybrid model can therefore be adopted that combines the advantages of both. Generally, the connectionist temporal classification sub-model is placed closer to the input for preliminary processing, and the attention-mechanism-based sub-model is placed closer to the output for subsequent processing. The network structure of the speech recognition model to be trained adopts structures such as LSTM/CNN/GRU, and the two decoders jointly output the recognition result.
在一实施例中,基于几何学的语音样本筛选装置100还包括:In one embodiment, the geometry-based voice sample screening apparatus 100 further includes:
数据上链单元,用于将语音识别模型中链接时序分类子模型对应的第一模型参数集和基于注意力机制子模型对应的第二模型参数集上传至区块链网络。The data uploading unit is used for uploading the first model parameter set corresponding to the link time series classification sub-model and the second model parameter set corresponding to the attention mechanism-based sub-model in the speech recognition model to the blockchain network.
在本实施例中,基于第一模型参数集和第二模型参数集得到对应的摘要信息,具体来说,摘要信息由第一模型参数集和第二模型参数集进行散列处理得到,比如利用sha256s算法处理得到。将摘要信息上传至区块链可保证其安全性和对用户的公正透明性。用户设备可以从区块链中下载得该摘要信息,以便查证第一模型参数集和第二模型参数集是否被篡改。In this embodiment, the corresponding summary information is obtained based on the first model parameter set and the second model parameter set. Specifically, the summary information is obtained by hashing the first model parameter set and the second model parameter set, for example, using The sha256s algorithm is processed. Uploading summary information to the blockchain ensures its security and fairness and transparency to users. The user equipment can download the summary information from the blockchain, so as to verify whether the first model parameter set and the second model parameter set have been tampered with.
The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked using cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can comprise an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
The speech recognition result sending unit 160 is configured to, if currently to-be-recognized speech data uploaded by a client is detected, input the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtain and send the current speech recognition result to the client.
In this embodiment, once the training of the speech recognition model is completed on the server, the model can be applied to speech recognition. That is, as soon as the server receives the currently to-be-recognized speech data uploaded by the client, it inputs the corresponding speech feature into the speech recognition model for computation, obtains the current speech recognition result, and sends it to the client, so that the result is fed back quickly.
In one embodiment, the speech recognition result sending unit 160 includes:
a first decoding unit, configured to input the speech feature corresponding to the currently to-be-recognized speech data into the CTC sub-model for computation to obtain a first recognition sequence;
a second decoding unit, configured to input the first recognition sequence into the attention-based sub-model for computation, and to obtain and send the current speech recognition result to the client.
In this embodiment, because the CTC sub-model is placed closer to the input and the attention-based sub-model closer to the output, the currently to-be-recognized speech data is first fed into the CTC sub-model to obtain the first recognition sequence, and the first recognition sequence is then fed into the attention-based sub-model to obtain the current speech recognition result. This fully accounts for both the dependencies between neighboring speech features during decoding and the monotonic temporal structure of speech, so the recognition result of this model is more accurate.
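A minimal sketch of this two-stage decoding order, reusing the hypothetical HybridCTCAttention model from the earlier sketch, is shown below. The greedy CTC best-path pass and the single attention pass are simplifying assumptions; a real decoder would typically use beam search.

```python
# Illustrative sketch only: CTC first, attention second, as described above.
import torch

@torch.no_grad()
def recognize(model, feats):
    """feats: (1, T, feat_dim) speech features of one utterance."""
    enc, _ = model.encoder(feats)
    # Stage 1: the CTC sub-model produces the first recognition sequence
    # (greedy best path; collapse repeated tokens and drop blanks).
    frame_ids = model.ctc_proj(enc).argmax(-1).squeeze(0)  # (T,)
    first_seq, prev = [], -1
    for t in frame_ids.tolist():
        if t != prev and t != 0:          # 0 is the CTC blank token
            first_seq.append(t)
        prev = t
    # Stage 2: the attention sub-model refines the first sequence into
    # the final recognition result, attending over the encoder output.
    dec_in = model.embed(torch.tensor([first_seq]))
    ctx, _ = model.attn_decoder(dec_in, enc, enc)
    return model.att_proj(ctx).argmax(-1).squeeze(0).tolist()
```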
The apparatus automatically selects samples with low redundancy to train the speech recognition model, reducing the labeling cost of speech recognition tasks in a deep learning setting and speeding up the training of the speech recognition model.
The above geometry-based speech sample screening apparatus can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 4.
Please refer to FIG. 4, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
Referring to FIG. 4, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 can cause the processor 502 to perform the geometry-based speech sample screening method.
The processor 502 provides computing and control capabilities that support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can perform the geometry-based speech sample screening method.
The network interface 505 is used for network communication, such as the transmission of data information. Those skilled in the art can understand that the structure shown in FIG. 4 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the geometry-based speech sample screening method disclosed in the embodiments of the present application.
Those skilled in the art can understand that the embodiment of the computer device shown in FIG. 4 does not limit the specific constitution of the computer device; in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 4 and are not repeated here.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of the present application provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the geometry-based speech sample screening method disclosed in the embodiments of the present application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two; to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed devices, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a division by logical function, and there may be other divisions in actual implementation; units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may also be electrical, mechanical, or other forms of connection.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A geometry-based speech sample screening method, comprising:
    obtaining an initial speech sample set, and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set, wherein the initial speech sample set includes multiple pieces of initial speech sample data;
    obtaining the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result;
    invoking a preset sample subset screening condition, and obtaining the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set;
    obtaining the label value corresponding to each speech feature in the target cluster set to obtain a current speech sample set corresponding to the target cluster set;
    using each speech feature in the current speech sample set as the input of a speech recognition model to be trained, and the label value corresponding to each speech feature as the output of the speech recognition model to be trained, to train the speech recognition model to be trained and obtain a speech recognition model, wherein the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-based sub-model; and
    if currently to-be-recognized speech data uploaded by a client is detected, inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the client.
  2. The geometry-based speech sample screening method according to claim 1, wherein the obtaining an initial speech sample set and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set comprises:
    invoking a pre-stored sampling period to sample each piece of initial speech sample data in the initial speech sample set, to obtain a current discrete speech signal corresponding to each piece of initial speech sample data;
    invoking a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete speech signal corresponding to each piece of initial speech sample data, to obtain a current pre-emphasized speech signal corresponding to each piece of initial speech sample data;
    invoking a pre-stored Hamming window to window the current pre-emphasized speech signal corresponding to each piece of initial speech sample data, to obtain windowed speech data corresponding to each piece of initial speech sample data;
    invoking a pre-stored frame shift and frame length to divide the windowed speech data corresponding to each piece of initial speech sample data into frames, to obtain preprocessed speech data corresponding to each piece of initial speech sample data; and
    performing Mel-frequency cepstral coefficient extraction or filter-bank extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, to obtain the speech feature corresponding to each piece of initial speech sample data and form the speech feature set.
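By way of illustration only (the sketch below is editorial and not part of the claim language), the preprocessing chain of claim 2 might be realized as follows in NumPy. The frame length and frame shift shown are assumed example values; the pre-emphasis coefficient a = 0.98 is the value given in claim 8.

```python
# Illustrative sketch only: sampling is assumed already done; this covers
# pre-emphasis, framing, and windowing of one discrete speech signal.
import numpy as np

def preprocess(signal: np.ndarray, a: float = 0.98,
               frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    """Pre-emphasize, frame, and window one discrete speech signal."""
    assert len(signal) >= frame_len, "signal shorter than one frame"
    # Pre-emphasis with the first-order FIR high-pass filter of claim 8:
    # y[n] = x[n] - a * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Split into overlapping frames using the stored frame shift and length.
    # (Framing and windowing are shown in the conventional frame-then-window
    # order; the claim words these two steps in the opposite order.)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to each frame; MFCC or filter-bank extraction
    # (FFT -> Mel filter bank -> log, optionally -> DCT) would follow.
    return frames * np.hamming(frame_len)
```

With a 16 kHz sampling rate, frame_len = 400 and frame_shift = 160 correspond to the common 25 ms window and 10 ms shift; these are assumptions, not requirements of the claim.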
  3. The geometry-based speech sample screening method according to claim 1, wherein the obtaining the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm comprises:
    obtaining the i-th speech feature and the j-th speech feature in the speech feature set, wherein the speech feature set includes N speech features, the value ranges of i and j are both [1, N], and i is not equal to j;
    judging whether the number of frames of a first speech sequence corresponding to the i-th speech feature is equal to the number of frames of a second speech sequence corresponding to the j-th speech feature; and
    if the number of frames of the first speech sequence corresponding to the i-th speech feature is not equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, constructing an n×m distance matrix D, and taking the minimum value among the elements of the distance matrix D as the Euclidean distance between the i-th speech feature and the j-th speech feature, wherein n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
  4. The geometry-based speech sample screening method according to claim 3, wherein after the judging whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, the method further comprises:
    if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, directly calculating the Euclidean distance between the i-th speech feature and the j-th speech feature.
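By way of illustration only (editorial, not claim language), claims 3 and 4 together define the following distance rule, sketched here in NumPy. Note that taking the minimum element of the frame-pair distance matrix D is the rule exactly as claimed, which is a simplification relative to the accumulated-path distance of classic dynamic time warping.

```python
# Illustrative sketch only: each speech feature is assumed to be a
# (frames, dims) array of per-frame feature vectors.
import numpy as np

def feature_distance(feat_i: np.ndarray, feat_j: np.ndarray) -> float:
    n, m = feat_i.shape[0], feat_j.shape[0]
    if n == m:
        # Equal frame counts (claim 4): direct Euclidean distance.
        return float(np.linalg.norm(feat_i - feat_j))
    # Unequal frame counts (claim 3): build the n*m distance matrix D,
    # where D[x, y] is the Euclidean distance between frame x of feature i
    # and frame y of feature j, then take its minimum element.
    D = np.linalg.norm(feat_i[:, None, :] - feat_j[None, :, :], axis=-1)
    return float(D.min())
```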
  5. The geometry-based speech sample screening method according to claim 4, wherein the performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result comprises:
    selecting, from the speech feature set, a number of speech features equal to a preset number of clusters, and using the selected speech features as the initial cluster center of each cluster;
    partitioning the speech feature set according to the Euclidean distance between each speech feature in the speech feature set and each initial cluster center, to obtain an initial clustering result;
    obtaining the adjusted cluster center of each cluster according to the initial clustering result; and
    partitioning the speech feature set according to the Euclidean distances to the adjusted cluster centers, until the clustering result remains unchanged more than a preset number of times, to obtain the clusters corresponding to the preset number of clusters and form the clustering result.
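By way of illustration only (editorial, not claim language), the loop of claim 5 might be sketched as below. Because the speech features can have unequal frame counts, the "adjusted cluster center" is realized here as a medoid, i.e. the member feature closest to all other members; this is an assumption, since the claim does not specify how centers are adjusted. The `dist` argument can be the `feature_distance` function from the previous sketch.

```python
# Illustrative sketch only: K-means-style clustering over a pairwise
# distance function, with medoid-style center adjustment.
import random

def kmeans(features, k, dist, max_stable=3):
    centers = random.sample(range(len(features)), k)  # initial cluster centers
    stable, assign = 0, None
    while stable <= max_stable:  # until the result stays the same more
        new_assign = [min(range(k),  # than the preset number of times
                          key=lambda c: dist(f, features[centers[c]]))
                      for f in features]
        stable = stable + 1 if new_assign == assign else 0
        assign = new_assign
        # Adjust each cluster center to the member closest to all others.
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            if members:
                centers[c] = min(members, key=lambda i: sum(
                    dist(features[i], features[j]) for j in members))
    return assign, centers
```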
  6. The geometry-based speech sample screening method according to claim 1, wherein the inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the client comprises:
    inputting the speech feature corresponding to the currently to-be-recognized speech data into the connectionist temporal classification sub-model for computation to obtain a first recognition sequence; and
    inputting the first recognition sequence into the attention-based sub-model for computation, and obtaining and sending the current speech recognition result to the client.
  7. The geometry-based speech sample screening method according to claim 1, further comprising:
    uploading a first model parameter set corresponding to the connectionist temporal classification sub-model and a second model parameter set corresponding to the attention-based sub-model in the speech recognition model to a blockchain network.
  8. The geometry-based speech sample screening method according to claim 2, wherein the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter whose transfer function is H(z) = 1 − az⁻¹, where a = 0.98.
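By way of illustration only (editorial, not claim language), the transfer function H(z) = 1 − az⁻¹ corresponds to the difference equation y[n] = x[n] − a·x[n−1], as the short sketch below makes concrete.

```python
# Illustrative sketch only: the pre-emphasis filter of claim 8.
import numpy as np

def pre_emphasize(x: np.ndarray, a: float = 0.98) -> np.ndarray:
    # y[n] = x[n] - a * x[n-1]; the first sample is passed through unchanged.
    return np.append(x[0], x[1:] - a * x[:-1])

# A constant (zero-frequency) signal is almost entirely suppressed,
# which is the high-pass behaviour the filter is chosen for.
print(pre_emphasize(np.array([1.0, 1.0, 1.0, 1.0])))  # [1.0, 0.02, 0.02, 0.02]
```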
  9. The geometry-based speech sample screening method according to claim 1, wherein the sample subset screening condition is that the sample redundancy is the minimum among multiple sample subsets; and
    the invoking a preset sample subset screening condition and obtaining the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set comprises:
    obtaining the sample redundancy corresponding to each cluster in the clustering result, and forming the target cluster set from the cluster whose sample redundancy is the minimum among the multiple clusters.
  10. A geometry-based speech sample screening apparatus, comprising:
    a speech feature extraction unit, configured to obtain an initial speech sample set and extract the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set, wherein the initial speech sample set includes multiple pieces of initial speech sample data;
    a speech feature clustering unit, configured to obtain the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm, and perform K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result;
    a clustering result screening unit, configured to invoke a preset sample subset screening condition and obtain the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set;
    a label value obtaining unit, configured to obtain the label value corresponding to each speech feature in the target cluster set to obtain a current speech sample set corresponding to the target cluster set;
    a speech recognition model training unit, configured to use each speech feature in the current speech sample set as the input of a speech recognition model to be trained and the label value corresponding to each speech feature as its output, to train the model and obtain a speech recognition model, wherein the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-based sub-model; and
    a speech recognition result sending unit, configured to, if currently to-be-recognized speech data uploaded by a client is detected, input the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtain and send a current speech recognition result to the client.
  11. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
    obtaining an initial speech sample set, and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set, wherein the initial speech sample set includes multiple pieces of initial speech sample data;
    obtaining the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result;
    invoking a preset sample subset screening condition, and obtaining the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set;
    obtaining the label value corresponding to each speech feature in the target cluster set to obtain a current speech sample set corresponding to the target cluster set;
    using each speech feature in the current speech sample set as the input of a speech recognition model to be trained, and the label value corresponding to each speech feature as the output of the speech recognition model to be trained, to train the speech recognition model to be trained and obtain a speech recognition model, wherein the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-based sub-model; and
    if currently to-be-recognized speech data uploaded by a client is detected, inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the client.
  12. The computer device according to claim 11, wherein the obtaining an initial speech sample set and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set comprises:
    invoking a pre-stored sampling period to sample each piece of initial speech sample data in the initial speech sample set, to obtain a current discrete speech signal corresponding to each piece of initial speech sample data;
    invoking a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete speech signal corresponding to each piece of initial speech sample data, to obtain a current pre-emphasized speech signal corresponding to each piece of initial speech sample data;
    invoking a pre-stored Hamming window to window the current pre-emphasized speech signal corresponding to each piece of initial speech sample data, to obtain windowed speech data corresponding to each piece of initial speech sample data;
    invoking a pre-stored frame shift and frame length to divide the windowed speech data corresponding to each piece of initial speech sample data into frames, to obtain preprocessed speech data corresponding to each piece of initial speech sample data; and
    performing Mel-frequency cepstral coefficient extraction or filter-bank extraction on the preprocessed speech data corresponding to each piece of initial speech sample data, to obtain the speech feature corresponding to each piece of initial speech sample data and form the speech feature set.
  13. The computer device according to claim 11, wherein the obtaining the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm comprises:
    obtaining the i-th speech feature and the j-th speech feature in the speech feature set, wherein the speech feature set includes N speech features, the value ranges of i and j are both [1, N], and i is not equal to j;
    judging whether the number of frames of a first speech sequence corresponding to the i-th speech feature is equal to the number of frames of a second speech sequence corresponding to the j-th speech feature; and
    if the number of frames of the first speech sequence corresponding to the i-th speech feature is not equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, constructing an n×m distance matrix D, and taking the minimum value among the elements of the distance matrix D as the Euclidean distance between the i-th speech feature and the j-th speech feature, wherein n equals the number of frames of the first speech sequence, m equals the number of frames of the second speech sequence, and d(x, y) in the distance matrix D denotes the Euclidean distance between the x-th frame of the i-th speech feature and the y-th frame of the j-th speech feature.
  14. The computer device according to claim 13, wherein after the judging whether the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, the steps further comprise:
    if the number of frames of the first speech sequence corresponding to the i-th speech feature is equal to the number of frames of the second speech sequence corresponding to the j-th speech feature, directly calculating the Euclidean distance between the i-th speech feature and the j-th speech feature.
  15. The computer device according to claim 14, wherein the performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result comprises:
    selecting, from the speech feature set, a number of speech features equal to a preset number of clusters, and using the selected speech features as the initial cluster center of each cluster;
    partitioning the speech feature set according to the Euclidean distance between each speech feature in the speech feature set and each initial cluster center, to obtain an initial clustering result;
    obtaining the adjusted cluster center of each cluster according to the initial clustering result; and
    partitioning the speech feature set according to the Euclidean distances to the adjusted cluster centers, until the clustering result remains unchanged more than a preset number of times, to obtain the clusters corresponding to the preset number of clusters and form the clustering result.
  16. The computer device according to claim 11, wherein the inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the client comprises:
    inputting the speech feature corresponding to the currently to-be-recognized speech data into the connectionist temporal classification sub-model for computation to obtain a first recognition sequence; and
    inputting the first recognition sequence into the attention-based sub-model for computation, and obtaining and sending the current speech recognition result to the client.
  17. The computer device according to claim 11, wherein the steps further comprise:
    uploading a first model parameter set corresponding to the connectionist temporal classification sub-model and a second model parameter set corresponding to the attention-based sub-model in the speech recognition model to a blockchain network.
  18. The computer device according to claim 12, wherein the first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter whose transfer function is H(z) = 1 − az⁻¹, where a = 0.98.
  19. The computer device according to claim 11, wherein the sample subset screening condition is that the sample redundancy is the minimum among multiple sample subsets; and
    the invoking a preset sample subset screening condition and obtaining the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set comprises:
    obtaining the sample redundancy corresponding to each cluster in the clustering result, and forming the target cluster set from the cluster whose sample redundancy is the minimum among the multiple clusters.
  20. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the following operations:
    obtaining an initial speech sample set, and extracting the speech feature corresponding to each piece of initial speech sample data in the initial speech sample set to form a speech feature set, wherein the initial speech sample set includes multiple pieces of initial speech sample data;
    obtaining the Euclidean distances between the speech features in the speech feature set by a dynamic time warping algorithm, and performing K-means clustering according to the Euclidean distances between the speech features to obtain a clustering result;
    invoking a preset sample subset screening condition, and obtaining the clusters in the clustering result that satisfy the sample subset screening condition to form a target cluster set;
    obtaining the label value corresponding to each speech feature in the target cluster set to obtain a current speech sample set corresponding to the target cluster set;
    using each speech feature in the current speech sample set as the input of a speech recognition model to be trained, and the label value corresponding to each speech feature as the output of the speech recognition model to be trained, to train the speech recognition model to be trained and obtain a speech recognition model, wherein the speech recognition model to be trained includes a connectionist temporal classification sub-model and an attention-based sub-model; and
    if currently to-be-recognized speech data uploaded by a client is detected, inputting the speech feature corresponding to the currently to-be-recognized speech data into the speech recognition model for computation, and obtaining and sending a current speech recognition result to the client.
PCT/CN2021/083934 2020-12-01 2021-03-30 Speech sample screening method and apparatus based on geometry, and computer device and storage medium WO2022116442A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011387398.0 2020-12-01
CN202011387398.0A CN112530409B (en) 2020-12-01 2020-12-01 Speech sample screening method and device based on geometry and computer equipment

Publications (1)

Publication Number Publication Date
WO2022116442A1

Family

ID=74996045

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083934 WO2022116442A1 (en) 2020-12-01 2021-03-30 Speech sample screening method and apparatus based on geometry, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112530409B (en)
WO (1) WO2022116442A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825169A (en) * 2023-08-31 2023-09-29 悦芯科技股份有限公司 Abnormal memory chip detection method based on test equipment
CN117334186A (en) * 2023-10-13 2024-01-02 武汉赛思云科技有限公司 Speech recognition method and NLP platform based on machine learning

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment
CN113345424B (en) * 2021-05-31 2024-02-27 平安科技(深圳)有限公司 Voice feature extraction method, device, equipment and storage medium
CN115146716A (en) * 2022-06-22 2022-10-04 腾讯科技(深圳)有限公司 Labeling method, device, equipment, storage medium and program product
CN114863939B (en) * 2022-07-07 2022-09-13 四川大学 Panda attribute identification method and system based on sound

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110931043A (en) * 2019-12-06 2020-03-27 湖北文理学院 Integrated speech emotion recognition method, device, equipment and storage medium
US10699719B1 (en) * 2011-12-31 2020-06-30 Reality Analytics, Inc. System and method for taxonomically distinguishing unconstrained signal data segments
CN111813905A (en) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 Corpus generation method and device, computer equipment and storage medium
CN111966798A (en) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 Intention identification method and device based on multi-round K-means algorithm and electronic equipment
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065028B (en) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN109657711A (en) * 2018-12-10 2019-04-19 广东浪潮大数据研究有限公司 A kind of image classification method, device, equipment and readable storage medium storing program for executing
CN110648671A (en) * 2019-08-21 2020-01-03 广州国音智能科技有限公司 Voiceprint model reconstruction method, terminal, device and readable storage medium
CN110929771B (en) * 2019-11-15 2020-11-20 北京达佳互联信息技术有限公司 Image sample classification method and device, electronic equipment and readable storage medium
CN111179914B (en) * 2019-12-04 2022-12-16 华南理工大学 Voice sample screening method based on improved dynamic time warping algorithm
CN111046947B (en) * 2019-12-10 2023-06-30 成都数联铭品科技有限公司 Training system and method of classifier and recognition method of abnormal sample
CN111554270B (en) * 2020-04-29 2023-04-18 北京声智科技有限公司 Training sample screening method and electronic equipment
CN111950294A (en) * 2020-07-24 2020-11-17 北京奇保信安科技有限公司 Intention identification method and device based on multi-parameter K-means algorithm and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10699719B1 (en) * 2011-12-31 2020-06-30 Reality Analytics, Inc. System and method for taxonomically distinguishing unconstrained signal data segments
CN110931043A (en) * 2019-12-06 2020-03-27 湖北文理学院 Integrated speech emotion recognition method, device, equipment and storage medium
CN111813905A (en) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 Corpus generation method and device, computer equipment and storage medium
CN111966798A (en) * 2020-07-24 2020-11-20 北京奇保信安科技有限公司 Intention identification method and device based on multi-round K-means algorithm and electronic equipment
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825169A (en) * 2023-08-31 2023-09-29 悦芯科技股份有限公司 Abnormal memory chip detection method based on test equipment
CN116825169B (en) * 2023-08-31 2023-11-24 悦芯科技股份有限公司 Abnormal memory chip detection method based on test equipment
CN117334186A (en) * 2023-10-13 2024-01-02 武汉赛思云科技有限公司 Speech recognition method and NLP platform based on machine learning
CN117334186B (en) * 2023-10-13 2024-04-30 北京智诚鹏展科技有限公司 Speech recognition method and NLP platform based on machine learning

Also Published As

Publication number Publication date
CN112530409B (en) 2024-01-23
CN112530409A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
WO2022116442A1 (en) Speech sample screening method and apparatus based on geometry, and computer device and storage medium
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
US10854193B2 (en) Methods, devices and computer-readable storage media for real-time speech recognition
WO2022121185A1 (en) Model training method and apparatus, dialect recognition method and apparatus, and server and storage medium
WO2022033327A1 (en) Video generation method and apparatus, generation model training method and apparatus, and medium and device
US11646032B2 (en) Systems and methods for audio processing
KR20230018534A (en) Speaker diarization using speaker embedding(s) and trained generative model
WO2022121257A1 (en) Model training method and apparatus, speech recognition method and apparatus, device, and storage medium
WO2021082420A1 (en) Voiceprint authentication method and device, medium and electronic device
CN110534099A (en) Voice wakes up processing method, device, storage medium and electronic equipment
WO2021114841A1 (en) User report generating method and terminal device
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
JP7230806B2 (en) Information processing device and information processing method
WO2022227190A1 (en) Speech synthesis method and apparatus, and electronic device and storage medium
US20230022004A1 (en) Dynamic vocabulary customization in automated voice systems
WO2020052069A1 (en) Method and apparatus for word segmentation
CN111081230A (en) Speech recognition method and apparatus
WO2022142115A1 (en) Adversarial learning-based speaker voice conversion method and related device
WO2020220824A1 (en) Voice recognition method and device
CN113314119A (en) Voice recognition intelligent household control method and device
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
CN113823257B (en) Speech synthesizer construction method, speech synthesis method and device
US10923113B1 (en) Speechlet recommendation based on updating a confidence value
WO2023035529A1 (en) Intent recognition-based information intelligent query method and apparatus, device and medium
WO2022057759A1 (en) Voice conversion method and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21899486

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21899486

Country of ref document: EP

Kind code of ref document: A1